Python multithreaded crawler: a worked example

2016-02-19

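Python is a powerful scripting language, and we often use it to write crawlers. A simple crawler is easy enough to write, but how should a multithreaded web crawler be written in Python? Generally there are two ways to use threads. One is to write the function the thread should execute and pass it to a Thread object, which runs it. The other is to inherit directly from Thread, creating a new class that holds the code the thread executes.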
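
As a minimal sketch of the two modes just described, the snippet below runs the same trivial task once through a function passed to Thread and once through a Thread subclass. It is written for Python 3, and the names fetch, FetchThread, and run_both are made up for illustration rather than taken from the crawler.

```python
import threading

results = []                      # shared list written by both threads
results_lock = threading.Lock()   # guards access to results


def fetch(tag):
    # Mode 1: a plain function that is handed to Thread via target=
    with results_lock:
        results.append("function:" + tag)


class FetchThread(threading.Thread):
    # Mode 2: subclass Thread and put the work in run()
    def __init__(self, tag):
        super().__init__()
        self.tag = tag

    def run(self):
        with results_lock:
            results.append("class:" + self.tag)


def run_both():
    """Start one thread of each kind, wait for both, return sorted results."""
    results.clear()
    threads = [threading.Thread(target=fetch, args=("a",)), FetchThread("b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


print(run_both())
```

Both threads are started and joined in exactly the same way; the only difference is where the work lives.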

This multithreaded Python (http://www.maiziedu.com/group/article/10324/) web crawler uses multiple threads together with a lock to implement a breadth-first web crawl.
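
To illustrate what the lock is for, here is a small Python 3 sketch, not part of the crawler itself, in which several threads update one shared counter. The function name count_with_lock and its parameters are assumptions chosen for the example; the crawler guards its shared URL lists with the same acquire-then-release pattern.

```python
import threading


def count_with_lock(num_threads=8, increments=10000):
    """Increment a shared counter from several threads, holding a lock each time."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            # "with lock" acquires on entry and releases on exit,
            # so the read-modify-write below is never interleaved
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


print(count_with_lock(4, 1000))
```

With the lock held around every update, the final count equals num_threads times increments.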

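First, a brief outline of my approach:

For a web crawler that downloads pages in breadth-first order, the process is as follows:

1. Download the first page from the given entry URL.

2. Extract all new page addresses from the first page and put them in the download list.

3. Download all the new pages at the addresses in the download list.

4. From all the new pages, find the addresses that have not been downloaded yet and update the download list.

5. Repeat steps 3 and 4 until the updated download list is empty.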
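
The five steps can be sketched without any networking or threads. The Python 3 snippet below is only an illustration under stated assumptions: the function crawl_breadth_first, the fetch_links callback, and the toy SITE dictionary are hypothetical stand-ins for real page downloads and link extraction.

```python
def crawl_breadth_first(entry_url, fetch_links):
    """Visit pages level by level, starting from entry_url.

    fetch_links(url) is a hypothetical callback that returns the list of
    URLs found on the page at url. Returns the list of download lists,
    one per depth.
    """
    visited = {entry_url}
    queue = [entry_url]          # step 1: start from the entry URL
    levels = []
    while queue:                 # step 5: stop when the list is empty
        levels.append(queue)
        next_queue = []
        for url in queue:        # step 3: process every queued page
            for link in fetch_links(url):
                if link not in visited:   # step 4: keep only unseen URLs
                    visited.add(link)
                    next_queue.append(link)
        queue = next_queue       # steps 2 and 4: the updated download list
    return levels


# A toy "site": each key is a page, each value the links on that page.
SITE = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
levels = crawl_breadth_first("a", lambda url: SITE.get(url, []))
print(levels)
```

Each inner list is the download list for one depth, so page "d" is reached only after both "b" and "c" have been processed.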

The Python code is as follows:

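#!/usr/bin/env python
#coding=utf-8
import threading
import urllib
import re
import time

g_mutex=threading.Condition()  # used as a plain lock around the shared lists below
g_pages=[]      # downloaded page contents, from which all URLs are parsed
g_queueURL=[]   # URLs waiting to be crawled
g_existURL=[]   # URLs that have already been crawled
g_failedURL=[]  # URLs that failed to download
g_totalcount=0  # number of pages downloaded so far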

class Crawler:
    def __init__(self,crawlername,url,threadnum):
        self.crawlername=crawlername
        self.url=url
        self.threadnum=threadnum
        self.threadpool=[]
        self.logfile=open("log.txt",'w')

    def craw(self):
        global g_queueURL
        g_queueURL.append(self.url)  # seed the queue with the entry URL
        depth=0
        print self.crawlername+" starting..."
        while(len(g_queueURL)!=0):
            depth+=1
            print 'Searching depth ',depth,'...\n\n'
            self.logfile.write("URL:"+g_queueURL[0]+"........")
            self.downloadAll()
            self.updateQueueURL()
            content='\n>>>Depth '+str(depth)+':\n'
            self.logfile.write(content)
            i=0
            while i<len(g_queueURL):
                # log each queued URL (index i), not the whole list
                content=str(g_totalcount+i)+'->'+g_queueURL[i]+'\n'
                self.logfile.write(content)
                i+=1
        self.logfile.close()

    def downloadAll(self):
        global g_queueURL
        global g_totalcount
        i=0
        while i<len(g_queueURL):
            # start at most threadnum download threads per batch
            j=0
            while j<self.threadnum and i+j < len(g_queueURL):
                g_totalcount+=1
                threadresult=self.download(g_queueURL[i+j],str(g_totalcount)+'.html',j)
                if threadresult is not None:
                    print 'Thread started:',i+j,'--File number =',g_totalcount
                j+=1
            i+=j
            # wait up to 30 seconds for each thread in the batch
            for thread in self.threadpool:
                thread.join(30)
            self.threadpool=[]  # reset the instance's pool, not a local variable
        g_queueURL=[]

    def download(self,url,filename,tid):
        crawthread=CrawlerThread(url,filename,tid)
        self.threadpool.append(crawthread)
        crawthread.start()
        return crawthread  # lets downloadAll confirm the thread was started

    def updateQueueURL(self):
        global g_queueURL
        global g_existURL
        newUrlList=[]
        for content in g_pages:
            newUrlList+=self.getUrl(content)
        # keep only the URLs that have not been crawled yet
        g_queueURL=list(set(newUrlList)-set(g_existURL))

    def getUrl(self,content):
        # capture every double-quoted absolute http:// URL in the page
        reg=r'"(http://.+?)"'
        regob=re.compile(reg,re.DOTALL)
        urllist=regob.findall(content)
        return urllist

class CrawlerThread(threading.Thread):
    def __init__(self,url,filename,tid):
        threading.Thread.__init__(self)
        self.url=url
        self.filename=filename
        self.tid=tid

    def run(self):
        global g_mutex
        global g_pages
        global g_existURL
        global g_failedURL
        try:
            page=urllib.urlopen(self.url)
            html=page.read()
            # the with block closes the file even if the write fails
            with open(self.filename,'w') as fout:
                fout.write(html)
        except Exception as e:
            # record the failure under the lock so the URL is not queued again
            g_mutex.acquire()
            g_existURL.append(self.url)
            g_failedURL.append(self.url)
            g_mutex.release()
            print 'Failed downloading and saving',self.url
            print e
            return None
        # success: hand the page over for link extraction
        g_mutex.acquire()
        g_pages.append(html)
        g_existURL.append(self.url)
        g_mutex.release()

if __name__=="__main__":
    url=raw_input("Enter the entry URL:\n")
    threadnum=int(raw_input("Number of threads:"))
    crawlername="Little Crawler"
    crawler=Crawler(crawlername,url,threadnum)
    crawler.craw()
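
The link-extraction and queue-update steps can be tried in isolation. The Python 3 sketch below applies the same regular expression as getUrl to a made-up HTML string and then subtracts already-visited URLs the way updateQueueURL does. The sample HTML, the example.com URLs, and the helper names extract_links and next_queue are assumptions for illustration, and sorted() is used only to make the printed order predictable.

```python
import re

# Same pattern as getUrl: any double-quoted string starting with http://
LINK_RE = re.compile(r'"(http://.+?)"', re.DOTALL)


def extract_links(html):
    """Return every quoted absolute http:// URL found in html."""
    return LINK_RE.findall(html)


def next_queue(pages, visited):
    """Collect links from all pages and drop those already visited."""
    found = []
    for html in pages:
        found += extract_links(html)
    return sorted(set(found) - set(visited))


sample = ('<a href="http://example.com/a">A</a>'
          '<img src="http://example.com/b.png">'
          '<a href="/relative">R</a>')

print(extract_links(sample))
print(next_queue([sample], ["http://example.com/a"]))
```

Note that this pattern picks up any quoted http:// string, including image sources, and skips relative links and https:// addresses.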

The example above shows one way to implement a multithreaded crawler in Python. If you are interested, try it yourself.


All replies

First reply

Iris2126

2016-2-23 13:04:51

Thanks for sharing
