Web scraping with R - 经管之家

2020-02-04

Code:
library(xml2)
library(rvest)
library(stringr)
zsj <- data.frame()  # accumulates the results from all pages
for (i in 1:100) {
  # Note: i is appended directly to the end of the URL, i.e. onto the encoded search term
  web <- read_html(str_c("https://www.so.com/s?ie=utf-8&src=dlm&shb=1&hsid=7aff54c8af32e44e&ls=n08f4f9b899&q=%E6%99%BA%E6%85%A7%E8%AD%A6%E5%8A%A1%E6%96%B0%E9%97%BB", i), encoding = "UTF-8")
  link <- web %>% html_nodes("cite") %>% html_text()         # news links
  title <- web %>% html_nodes(".res_title") %>% html_text()  # news titles
  sj <- data.frame(link, title)
  zsj <- rbind(zsj, sj)
}
write.csv(zsj, file = "./zsj.csv")

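Could the experts here take a look at what is wrong with this code? The resulting CSV file contains only the two column names, link and title, and nothing else.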
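One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: a CSV holding only the column names is what you would get if both selectors matched zero nodes on every page, because data.frame() built from two zero-length vectors has columns but no rows. The snippet below counts matches per selector before any looping. The inline HTML and its class names are hypothetical stand-ins, not the real so.com markup.

```r
library(rvest)

# Hypothetical markup standing in for one fetched page (not the real so.com HTML).
page <- read_html('<html><body>
  <div class="item"><h3 class="headline">A</h3><cite>a.example</cite></div>
  <div class="item"><h3 class="headline">B</h3><cite>b.example</cite></div>
</body></html>')

# Count how many nodes each selector matches on this page.
selectors <- c(link = "cite", title = ".res_title")
counts <- sapply(selectors, function(css) length(html_nodes(page, css)))
print(counts)

# Two zero-length vectors give a data frame with column names but no rows,
# which write.csv() would save as a header line only.
empty <- data.frame(link = character(0), title = character(0))
print(nrow(empty))
```

Running the same count against a single real page (i = 1) would show whether the selectors match anything in the HTML that read_html() actually receives.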

In addition, when more fields are scraped per result, data.frame also runs into a problem:
for (i in 1:1500) {
  web <- read_html(str_c("https://www.so.com/s?ie=utf-8&src=dlm&shb=1&hsid=7aff54c8af32e44e&ls=n08f4f9b899&q=%E6%99%BA%E6%85%A7%E8%AD%A6%E5%8A%A1%E6%96%B0%E9%97%BB", i), encoding = "UTF-8")
  link <- web %>% html_nodes("cite") %>% html_text()            # news links
  title <- web %>% html_nodes(".res_title") %>% html_text()     # news titles
  time <- web %>% html_nodes(".gray") %>% html_text()           # publication time
  sites <- web %>% html_nodes(".res-linkinfo") %>% html_text()  # publishing site
  sj <- data.frame(link, title, time, sites)
  zsj <- rbind(zsj, sj)
}

Error in data.frame(link, title, time, sites) : arguments imply differing number of rows: 11, 0, 9, 10
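The error says the four selectors matched 11, 0, 9 and 10 nodes on that page, so the vectors cannot be lined up as columns. A common workaround, sketched here under assumptions, is to select each result container first and then pull one value per field from inside it with html_node(), which yields NA when a field is missing. The container selector ".res-list", the helper name parse_results and the inline HTML are all hypothetical and would need to be checked against the real page.

```r
library(rvest)

# Hypothetical markup: three results; the second has no title, the third no date.
page <- read_html('<html><body><ul>
  <li class="res-list"><h3 class="res_title">T1</h3><cite>a.example</cite><span class="gray">2020-01-01</span><p class="res-linkinfo">site A</p></li>
  <li class="res-list"><cite>b.example</cite><span class="gray">2020-01-02</span></li>
  <li class="res-list"><h3 class="res_title">T3</h3><cite>c.example</cite></li>
</ul></body></html>')

# Hypothetical helper: builds one row per result container.
parse_results <- function(page, container = ".res-list") {
  items <- html_nodes(page, container)
  # html_node() returns exactly one entry per item, so every column has
  # length(items) values; html_text() gives NA where the field is absent.
  field <- function(css) html_text(html_node(items, css))
  data.frame(
    link  = field("cite"),
    title = field(".res_title"),
    time  = field(".gray"),
    sites = field(".res-linkinfo"),
    stringsAsFactors = FALSE
  )
}

pages <- list(page, page)             # stand-in for pages fetched in a loop
rows  <- lapply(pages, parse_results) # one data frame per page, kept in a list
zsj   <- do.call(rbind, rows)         # bind once instead of growing zsj in the loop
print(zsj)
```

Collecting the per-page data frames in a list and binding them once also avoids repeatedly copying zsj with rbind() inside a 1500-iteration loop.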

The HTML node selectors were obtained with SelectorGadget.
