2. %* Grab the total number of indexed results, used to control how many pages are read;
data _null_;
infile baidu length=len lrecl=5000;
input _t1 $varying5000. len;
if substr(_t1,1,13)='<p id="page">';
total=compress(scan(scan(_t1,-3,">"),-2,"<"),"找到相关结果个约,");
call symput('total',total);
run;
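All this step does is pull the count out of "百度为您找到相关结果约 x,xxx,xxx个" (Baidu's "about x,xxx,xxx results found" line); the find() function can handle that directly.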
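A minimal sketch of the find() approach, not the original author's code. It assumes the count sits between the literal text 相关结果约 and the character 个 on a single line, and that the file encoding matches the SAS session encoding, so the byte positions returned by find() and length() line up. The fileref baidu and the macro variable total are taken from the code above; p1, p2 and start are names made up for this illustration.

```sas
data _null_;
  infile baidu length=len lrecl=5000;
  input _t1 $varying5000. len;
  /* locate the marker text that precedes the count (assumed layout) */
  p1 = find(_t1, '相关结果约');
  if p1 > 0;
  /* first byte after the marker */
  start = p1 + length('相关结果约');
  /* locate the closing character, searching from start onward */
  p2 = find(_t1, '个', start);
  if p2 > start;
  /* keep digits only, which drops the thousands separators */
  total = compress(substr(_t1, start, p2 - start), , 'kd');
  call symputx('total', total);
  stop;
run;
```

This avoids the nested scan() calls and no longer depends on the line starting with a particular tag.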
3."%else %let pnum=%eval(%eval(&i-1)*&basicn);"
I think this could be simplified a bit.
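One possible simplification, as a sketch: %eval evaluates the whole parenthesised integer expression in one pass, so the inner %eval is not needed. The values assigned to i and basicn below are sample values for illustration only, and the statement is shown in open code without the surrounding %else.

```sas
%let i=3;        /* sample page index, illustration only */
%let basicn=10;  /* sample results-per-page, illustration only */

/* a single %eval handles the subtraction and the multiplication */
%let pnum=%eval((&i-1)*&basicn);
%put pnum=&pnum;
```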
4. data _t&i.;
set _t(where=(substr(t1,1,19)="<table cellpadding=") rename=(_t1=t1));
set _t(where=(substr(t2,1,19)="</h3><font size=-1>") rename=(_t1=t2));
href=scan(substr(t1,index(t1,"href="),length(t1)-index(t1,"href=")),2,"""");
title=scan(substr(t1,index(t1,"target="),length(t1)-index(t1,"target=")),3,"""");
title=substr(substr(title,2,length(title)),1,length(title)-1-length("</a"));
format date yymmdd10.;
date=input(scan(scan(substr(t2,index(t2,"<span class="),length(t2)-index(t2,"<span class=")),1,"<"),-1," "),yymmdd10.);
order=_n_+%eval(%eval(&i-1)*&basicn);
call symput("_n",_n_);
run;
This whole block could be handled with a regular expression instead.
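A rough sketch of the regular-expression route using the PRX functions, not a drop-in replacement for the step above. The sample line in t1, the pattern, and the data set name _demo are all assumptions made up for illustration; the real Baidu markup would need its own pattern.

```sas
data _demo;
  length t1 $500 href $200 title $200;
  /* made-up sample line standing in for one result row */
  t1 = '<table cellpadding="0"><tr><td><h3 class="t"><a href="http://example.com/a" target="_blank">Sample title</a>';
  retain re;
  /* compile once: group 1 = href value, group 2 = link text */
  if _n_ = 1 then re = prxparse('/href="([^"]*)"[^>]*>(.*?)<\/a/');
  if prxmatch(re, t1) then do;
    href  = prxposn(re, 1, t1);
    title = prxposn(re, 2, t1);
  end;
  drop re;
run;
```

Both fields come out of a single match, which replaces the chained substr()/index()/scan() calls and the manual trimming of the "</a" suffix.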
5."%let dt_&k.=%sysfunc(putn(%eval((%sysfunc(inputn(&sysdate9,date9.)))-&k),yymmddn8.));"
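I couldn't take it any more after reading this one. Does it really need to be this complicated?
I'd suggest writing it like this: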
%let dt_&k.=%sysfunc(intnx(day,"&sysdate9"d,-&k.,end),yymmddn8.);
6. proc sql noprint;
create table _1drop as select URI,href from &dslib..&dsout._&&dt_&k. where URI not in (select distinct URI from &dslib..&dsout._&dt.) order by URI;
quit;
proc sql noprint;
create table _2new as select URI,href from &dslib..&dsout._&dt. where URI not in (select distinct URI from &dslib..&dsout._&&dt_&k.) order by URI;
quit;
I think you should use MERGE here; in my view SQL is still somewhat less efficient for this.
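A sketch of the MERGE version, under stated assumptions: old_snap and new_snap are hypothetical stand-ins for the earlier and current snapshot tables, filled with made-up rows so the example runs on its own. Both inputs must be sorted by URI before the merge.

```sas
/* hypothetical earlier snapshot */
data old_snap;
  length URI $40 href $100;
  input URI $ href $;
  datalines;
a http://example.com/a
b http://example.com/b
;

/* hypothetical current snapshot */
data new_snap;
  length URI $40 href $100;
  input URI $ href $;
  datalines;
b http://example.com/b
c http://example.com/c
;

proc sort data=old_snap; by URI; run;
proc sort data=new_snap; by URI; run;

/* one pass builds both outputs */
data _1drop _2new;
  merge old_snap(in=in_old) new_snap(in=in_new);
  by URI;
  if in_old and not in_new then output _1drop;      /* only in the earlier snapshot */
  else if in_new and not in_old then output _2new;  /* only in the current snapshot */
run;
```

With these sample rows, URI a goes to _1drop and URI c goes to _2new. Whether this actually beats the two SQL steps depends on table sizes and on whether the inputs are already sorted, so it is worth timing both.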
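Those are just a few small suggestions; some other parts also feel a bit odd to me.
To sum up, my view is:
1. Experts should clean up their code before posting it, so that it doesn't mislead SAS beginners or dampen their enthusiasm for learning SAS.
2. If you want to learn "scraping web pages with SAS", you should first learn some HTML, or at least know about www.w3.org.
3. To learn SAS well, I think you should first be reasonably good at C or C++ programming. Second is English: enough to roughly follow what authors abroad write, such as the two papers below on SAS FILENAME URL.
The second one, from SAS Global Forum 2012, is especially worth reading. The links are: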
one:http://www2.sas.com/proceedings/sugi30/100-30.pdf
two:http://support.sas.com/resources/papers/proceedings12/119-2012.pdf
And third is interest.