A tentative question about a rather tricky program

2014-07-30

The data look like this, with three columns: title, authors (individual names separated by |), and number_authors.

Title      Authors                               Number_authors
Title 1    Name A | Name B                       2
Title 2    Name A | Name B | Name C              3
Title 3    Name A | Name C | Name E | Name Z     4
TITLE 4    NAME A                                1
TITLE 5    NAME F | NAME Z                       2
...
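There are about 20,000 observations, where:
1. title is unique;
2. authors is sorted internally, i.e. the names within each record are in alphabetical order;
some authors appear frequently, others only once;
3. number_authors ranges from 1 to 200.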

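Goal: can a program be designed to find the proportion of weak unique groups of authors (weak unique: at least two author names repeat)? For example, in the five records above, Title 1 and Title 2 repeat (they share A and B, which meets the at-least-two condition), and so do Title 2 and Title 3.
So the five records above can be regarded as produced by 4 weak unique groups.

Or could this be generalized to N?

I have thought hard about this without success. Thanks in advance for your valuable suggestions and time!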

All replies

spssone

2014-7-30 10:38:48

"The five records above can be regarded as produced by 4 weak unique groups..."
I don't follow this.

hqs811

2014-7-30 11:02:10

spssone wrote on 2014-7-30 10:38
"The five records above can be regarded as produced by 4 weak unique groups..."
I don't follow this.

Thanks for pointing that out. Unique group and weak unique group are defined below.
Definition (Unique Group): A number of groups form a unique group if all authors in these groups are identical.

Definition (Weak Unique Group): A number of groups form a weak unique group if at least two authors in these groups are identical.

So Titles 1, 2, and 3 form two weak unique groups... I'm not sure whether this explanation works.
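
One way to make the weak-unique definition concrete, and to cover the "generalize to N" question, is to count how many authors each pair of titles has in common. The sketch below is only an illustration under stated assumptions: the dataset names papers, long, and shared and the macro variable min_shared are made up here, N is read as the minimum number of shared authors, and each author is assumed to appear at most once per title.

```sas
/* Hypothetical names: papers, long, shared, min_shared. */
%let min_shared = 2;   /* assumed meaning of N: minimum shared authors */

data papers;
  infile datalines dlm=',' truncover;
  length title $10 authors $60;
  input title $ authors $;
  authors = upcase(authors);   /* make name comparison case-insensitive */
  datalines;
Title 1,Name A | Name B
Title 2,Name A | Name B | Name C
Title 3,Name A | Name C | Name E | Name Z
TITLE 4,NAME A
TITLE 5,NAME F | NAME Z
;

/* One row per (title, author). */
data long;
  set papers;
  length author $40;
  do k = 1 to countw(authors, '|');
    author = strip(scan(authors, k, '|'));
    output;
  end;
  keep title author;
run;

/* Self-join on author: each result row is a pair of titles
   with the number of authors the two titles have in common. */
proc sql;
  create table shared as
  select a.title as title_a,
         b.title as title_b,
         count(*) as n_shared
  from long as a, long as b
  where a.author = b.author and a.title < b.title
  group by a.title, b.title
  having count(*) >= &min_shared;
quit;
```

On the five sample records this should leave two rows, (Title 1, Title 2) and (Title 2, Title 3), each with n_shared = 2. With 20,000 records and very frequent authors the self-join can become large, so it may need to be restricted or run in batches.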

pobel

2014-7-30 12:53:26

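hqs811 wrote on 2014-7-30 11:02
Thanks for pointing that out. Unique group and weak unique group are defined below.
Definition(Unique Group): A number of groups ...

I really don't quite understand what data the original poster wants to end up with.
The code below may be the steps needed; it is for reference only: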

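data test;
  /* The & modifier allows single embedded blanks in a value, so the
     fields below must be separated by two or more blanks. */
  length Title $10 Authors $40;
  input Title $ & Authors $ & Number_authors;
  authors=upcase(authors);   /* compare names case-insensitively */
  cards;
Title 1  Name A | Name B                    2
Title 2  Name A | Name B | Name C           3
Title 3  Name A | Name C | Name E | Name Z  4
TITLE 4  NAME A                             1
TITLE 5  NAME F | NAME Z                    2
;

/* One row per author pair within each title. */
data test1;
  set test;
  length author1 author2 $40;
  if Number_authors=1 then do;
    /* single author: no pair exists, author2 stays blank */
    author1=strip(authors);
    output;
  end;
  else do i=1 to number_authors-1;
    author1=strip(scan(authors,i,"|"));
    do j=i+1 to number_authors;
      author2=strip(scan(authors,j,"|"));
      output;
    end;
  end;
  keep author1 author2 title;
run;

proc sort data=test1;
  by author1 author2;
run;

/* One row per author pair, listing every title the pair appears in. */
data weak_unique;
  set test1;
  by author1 author2;
  length titles $2000;   /* the default length of 200 truncates frequent pairs */
  retain titles n_titles;
  if first.author2 then do;
    titles=title;
    n_titles=1;
  end;
  else do;
    titles=catx(", ",titles,title);
    n_titles=n_titles+1;
  end;
  /* rows with n_titles>=2 are the pairs shared by several titles */
  if last.author2;
  drop title;
run;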
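
As a possible follow-up, the pair table test1 built above can be summarized into a group count. This is a sketch of one reading of the original example, not a confirmed definition: every author pair that occurs in at least two titles counts as one weak unique group, and every title that contains no such pair counts as a group of its own. The names repeated_pairs, covered, n_pairs, n_covered, n_titles, and n_groups are hypothetical.

```sas
proc sql noprint;
  /* Author pairs that occur in at least two different titles. */
  create table repeated_pairs as
  select author1, author2, count(distinct title) as n_titles
  from test1
  where author2 ne ''            /* skip single-author rows */
  group by author1, author2
  having count(distinct title) >= 2;

  /* Titles that contain at least one repeated pair. */
  create table covered as
  select distinct t.title
  from test1 as t, repeated_pairs as p
  where t.author1 = p.author1 and t.author2 = p.author2;

  select count(*) into :n_pairs from repeated_pairs;
  select count(*) into :n_covered from covered;
  select count(distinct title) into :n_titles from test1;
quit;

/* Assumed counting rule: repeated pairs + titles not covered by any pair. */
%let n_groups = %eval(&n_pairs + &n_titles - &n_covered);
%put Groups under this reading: &n_groups;
```

For the five sample records the repeated pairs are (NAME A, NAME B) and (NAME A, NAME C), three of the five titles are covered, and the rule gives 2 + 5 - 3 = 4, which matches the figure in the original question. A proportion can then be formed once the intended denominator is settled.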
