A few suggestions on "A Complete Program Showing How to Scrape Web Page Content with SAS"

2012-09-01

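I have been learning SAS for a few months and was very interested in the post "A Complete Program Showing How to Scrape Web Page Content with SAS", so I read through it. Some of it felt a bit off to me, so here are a few suggestions:
1. First, you mention that "pn and rn are Baidu query parameters and have nothing to do with SAS macro variables". In that case, shouldn't they be wrapped in %nrstr()? Otherwise the macro processor will try to resolve &pn and &rn inside a double-quoted URL.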
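
As a rough illustration of point 1, the sketch below masks the two query parameters with %nrstr() while still letting real macro variables resolve. The search term wd=SAS and the values assigned to pnum and basicn are assumptions made for this example, not taken from the original program.

```sas
/* Assumed values, for illustration only */
%let pnum   = 0;      /* offset of the first result on the page */
%let basicn = 100;    /* number of results per page             */

/* %nrstr() hides &pn and &rn from the macro processor,         */
/* while &pnum and &basicn are still resolved as usual.         */
/* The wd=SAS query term is a hypothetical example.             */
filename baidu url
    "http://www.baidu.com/s?wd=SAS%nrstr(&pn)=&pnum%nrstr(&rn)=&basicn";
```

Without the masking, SAS would normally write a warning such as "Apparent symbolic reference PN not resolved" to the log, and a macro variable that happened to be named pn or rn would silently change the URL.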

2. %* Get the total number of indexed results, used to control how many pages are read;
data _null_;
    infile baidu length=len lrecl=5000;
    input _t1 $varying5000. len;
    if substr(_t1,1,13)='<p id="page">';
    total=compress(scan(scan(_t1,-3,">"),-2,"<"),"找到相关结果个约,");
    call symput('total',total);
run;
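All this step does is extract the count from "百度为您找到相关结果约 x,xxx,xxx个" ("Baidu found about x,xxx,xxx results for you"), which the find() function can handle perfectly well.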
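
One possible find()-based version is sketched below. It assumes the fileref baidu is already assigned as in the original program and that the page contains the marker text exactly as quoted above; the variable names pos and rest are made up for this example.

```sas
data _null_;
    infile baidu length=len lrecl=5000;
    input _t1 $varying5000. len;
    length rest total $200;
    /* Locate the marker text (assumed to match the page exactly) */
    pos = find(_t1, '找到相关结果约');
    if pos > 0 then do;
        /* Text that follows the marker, up to the next HTML tag */
        rest  = scan(substr(_t1, pos + length('找到相关结果约')), 1, '<');
        /* Keep digits only, which drops the commas and the trailing 个 */
        total = compress(rest, , 'kd');
        call symputx('total', total);
        stop;   /* the count appears once, so stop reading */
    end;
run;
```

Using call symputx instead of call symput also strips leading and trailing blanks from the value stored in &total.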
3. "%else %let pnum=%eval(%eval(&i-1)*&basicn);"
I think this can be simplified: %eval accepts parentheses, so the nested %eval is unnecessary.
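
A minimal sketch of the simplification, with assumed values for i and basicn so that it can be run on its own:

```sas
/* Assumed values, for illustration only */
%let i      = 3;
%let basicn = 100;

/* A single %eval with parentheses replaces the nested call */
%let pnum = %eval((&i - 1) * &basicn);
%put NOTE: pnum=&pnum;
```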
4. data _t&i.;
    set _t(where=(substr(t1,1,19)="<table cellpadding=") rename=(_t1=t1));
    set _t(where=(substr(t2,1,19)="</h3><font size=-1>") rename=(_t1=t2));
    href=scan(substr(t1,index(t1,"href="),length(t1)-index(t1,"href=")),2,"""");
    title=scan(substr(t1,index(t1,"target="),length(t1)-index(t1,"target=")),3,"""");
    title=substr(substr(title,2,length(title)),1,length(title)-1-length("</a"));
    format date yymmdd10.;
    date=input(scan(scan(substr(t2,index(t2,"<span class="),length(t2)-index(t2,"<span class=")),1,"<"),-1," "),yymmdd10.);
    order=_n_+%eval(%eval(&i-1)*&basicn);
    call symput("_n",_n_);
run;
This whole block could be handled with regular expressions (the PRX functions in SAS).
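
To give an idea of what that might look like, here is a small self-contained sketch that pulls href and title out of one line with a single pattern. The sample HTML line and the pattern itself are assumptions for illustration; the real Baidu markup would need its own pattern.

```sas
data _null_;
    length t1 $300 href title $200;
    /* Hypothetical result line, not real Baidu output */
    t1 = '<table cellpadding="0"><tr><td><h3><a href="http://example.com/a.html" target="_blank">Sample <em>SAS</em> title</a>';
    /* Group 1: the href value; group 2: everything up to the closing </a> */
    re = prxparse('/href="([^"]*)"[^>]*>(.*?)<\/a>/i');
    if prxmatch(re, t1) then do;
        href  = prxposn(re, 1, t1);
        title = prxposn(re, 2, t1);
    end;
    put href= title=;
run;
```

In a DATA step that reads many lines, the pattern would usually be compiled once (for example when _n_ = 1, with the id retained) rather than on every iteration.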
5. "%let dt_&k.=%sysfunc(putn(%eval((%sysfunc(inputn(&sysdate9,date9.)))-&k),yymmddn8.));"
I could hardly get through this one. Surely it does not need to be this complicated?
I suggest writing it like this:
%let dt_&k.=%sysfunc(intnx(day,"&sysdate9"d,-&k.,end),yymmddn8.);
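
A possible variant is sketched below, with an assumed value for k. It uses today() rather than &sysdate9, because &sysdate9 holds the date on which the SAS session started, which may differ from the current date in a long-running session.

```sas
/* Assumed value, for illustration only */
%let k = 1;

/* Date k days before today, formatted as yyyymmdd */
%let dt_&k. = %sysfunc(intnx(day, %sysfunc(today()), -&k), yymmddn8.);
%put NOTE: dt_&k.=&&dt_&k;
```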
6. proc sql noprint;
    create table _1drop as
    select URI,href
    from &dslib..&dsout._&&dt_&k.
    where URI not in (select distinct URI from &dslib..&dsout._&dt.)
    order by URI;
quit;
proc sql noprint;
    create table _2new as
    select URI,href
    from &dslib..&dsout._&dt.
    where URI not in (select distinct URI from &dslib..&dsout._&&dt_&k.)
    order by URI;
quit;
I think you should use MERGE here; SQL is somewhat less efficient.
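
As a sketch of the MERGE idea, the example below builds both tables in a single DATA step. The datasets old and new are hypothetical stand-ins for the two daily snapshots, and the sketch assumes each URI appears at most once per dataset.

```sas
/* Hypothetical stand-ins for the earlier and the current snapshot */
data old;
    input URI :$20. href :$40.;
    datalines;
a http://example.com/a
b http://example.com/b
;
data new;
    input URI :$20. href :$40.;
    datalines;
b http://example.com/b
c http://example.com/c
;

/* MERGE with BY requires both inputs sorted by URI */
proc sort data=old; by URI; run;
proc sort data=new; by URI; run;

data _1drop _2new;
    merge old(in=in_old) new(in=in_new);
    by URI;
    /* Only in the earlier snapshot: dropped since then */
    if in_old and not in_new then output _1drop;
    /* Only in the current snapshot: newly added */
    else if in_new and not in_old then output _2new;
    keep URI href;
run;
```

Whether this is actually faster than the two SQL steps depends on the data size and on whether the inputs are already sorted, so it is worth timing both.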
Those are just a few small suggestions; some other parts also felt a bit off to me.
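In summary, I think:
         1. Experts should polish their code before posting it, so that it does not mislead SAS beginners or dampen their enthusiasm for learning SAS.
         2. To learn how to scrape web page content with SAS, you first need some knowledge of HTML; at the very least you should know about www.w3.org.
         3. To learn SAS well, I think you first need reasonable C or C++ programming skills. Second is English: you should be able to roughly follow what authors abroad write, such as the two papers below on SAS FILENAME URL. The second one, from SAS Global Forum 2012, is especially recommended. The links are:
         one: http://www2.sas.com/proceedings/sugi30/100-30.pdf
         two: http://support.sas.com/resources/papers/proceedings12/119-2012.pdf
         Third is interest.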