<div class="pinggu_markdown">
<div class="pinggu_markdown__html"><p>This is the most basic approach to scraping Lianjia listing data, and the easiest code to get started with.</p>
<pre class=" language-python"><code class="prism language-python"><span class="token comment"># Install requests and bs4 first (pip install requests beautifulsoup4); csv and re ship with Python</span>
<span class="token keyword">import</span> csv
<span class="token keyword">import</span> re
<span class="token keyword">import</span> requests
<span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup
<span class="token comment"># Lianjia's second-hand listing index shows at most 100 pages of 30 listings each, so this approach can collect at most 3,000 listings</span>
page <span class="token operator">=</span> <span class="token number">2</span>  <span class="token comment"># upper bound: range(1, page) below fetches pages 1 through page - 1</span>
headers <span class="token operator">=</span> <span class="token punctuation">{</span>
    <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'</span>
<span class="token punctuation">}</span>
<span class="token comment"># Open the CSV file and write the header row before inserting data; 'w' starts a fresh file so the header appears only once</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span>r<span class="token string">'c:\lianjia.csv'</span><span class="token punctuation">,</span> <span class="token string">'w'</span><span class="token punctuation">,</span> newline<span class="token operator">=</span><span class="token string">''</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8-sig'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    <span class="token comment"># csv.writer quotes fields that contain commas, which listing titles often do</span>
    writer <span class="token operator">=</span> csv<span class="token punctuation">.</span>writer<span class="token punctuation">(</span>f<span class="token punctuation">)</span>
    writer<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">'house_id'</span><span class="token punctuation">,</span> <span class="token string">'community'</span><span class="token punctuation">,</span> <span class="token string">'district'</span><span class="token punctuation">,</span> <span class="token string">'layout'</span><span class="token punctuation">,</span> <span class="token string">'area'</span><span class="token punctuation">,</span> <span class="token string">'orientation'</span><span class="token punctuation">,</span> <span class="token string">'decoration'</span><span class="token punctuation">,</span> <span class="token string">'floor'</span><span class="token punctuation">,</span> <span class="token string">'year'</span><span class="token punctuation">,</span> <span class="token string">'total_price'</span><span class="token punctuation">,</span> <span class="token string">'unit_price'</span><span class="token punctuation">,</span> <span class="token string">'title'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> page<span class="token punctuation">)</span><span class="token punctuation">:</span>
        url <span class="token operator">=</span> <span class="token string">'https://xm.lianjia.com/ershoufang/pg'</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span>
        <span class="token comment"># print(url)</span>
        html <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span><span class="token punctuation">.</span>text
        soup <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span> <span class="token string">'html.parser'</span><span class="token punctuation">)</span>
        <span class="token comment"># print(soup)</span>
        infos <span class="token operator">=</span> soup<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'ul'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'sellListContent'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'li'</span><span class="token punctuation">)</span>
        <span class="token comment"># print(infos)</span>
        <span class="token keyword">for</span> info <span class="token keyword">in</span> infos<span class="token punctuation">:</span>
            <span class="token comment"># Get the listing ID</span>
            house_id <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'title'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'data-housecode'</span><span class="token punctuation">)</span>
            <span class="token comment"># print(house_id)</span>
            <span class="token comment"># Get the listing title</span>
            name <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'title'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token comment"># The community name and its business district both sit under class=positionInfo, so collect both link texts into a list and pick each one out</span>
            weizhi <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'positionInfo'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span>
            data_list <span class="token operator">=</span> <span class="token punctuation">[</span>z<span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> z <span class="token keyword">in</span> weizhi<span class="token punctuation">]</span>
            xiaoqu <span class="token operator">=</span> data_list<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
            shangquan <span class="token operator">=</span> data_list<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
            <span class="token comment"># Next come the house details: split the text on '|' into a list, strip the padding, and index into it</span>
            houseinfo <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'houseInfo'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span>
            houseinfolist <span class="token operator">=</span> <span class="token punctuation">[</span>part<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">for</span> part <span class="token keyword">in</span> houseinfo<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">'|'</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
            roominfo <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
            <span class="token comment"># The area is a decimal number, so extract it with a regular expression</span>
            mianji <span class="token operator">=</span> re<span class="token punctuation">.</span>findall<span class="token punctuation">(</span>r<span class="token string">'\d+\.?\d*'</span><span class="token punctuation">,</span> houseinfolist<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>
            chaoxiang <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span>
            zhuangxiu <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span>
            louceng <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span>
            nian <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">5</span><span class="token punctuation">]</span>
            <span class="token comment"># louxing = houseinfolist[6]</span>
            <span class="token comment"># Next is the total price</span>
            totalprice <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'totalPrice'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'span'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
            <span class="token comment"># Next is the unit price per square meter: strip every non-digit character to keep the integer</span>
            unitprice <span class="token operator">=</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\D'</span><span class="token punctuation">,</span> <span class="token string">''</span><span class="token punctuation">,</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'unitPrice'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'span'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text<span class="token punctuation">)</span>
            <span class="token comment"># Finally, write the row to the CSV in the same column order as the header</span>
            writer<span class="token punctuation">.</span>writerow<span class="token punctuation">(</span><span class="token punctuation">[</span>house_id<span class="token punctuation">,</span> xiaoqu<span class="token punctuation">,</span> shangquan<span class="token punctuation">,</span> roominfo<span class="token punctuation">,</span> mianji<span class="token punctuation">,</span> chaoxiang<span class="token punctuation">,</span> zhuangxiu<span class="token punctuation">,</span> louceng<span class="token punctuation">,</span> nian<span class="token punctuation">,</span> totalprice<span class="token punctuation">,</span> unitprice<span class="token punctuation">,</span> name<span class="token punctuation">]</span><span class="token punctuation">)</span>
</code></pre>
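<p>The script reads the houseInfo fields by fixed position, so a listing that omits a field (the build year, for instance) would raise an IndexError. Below is a minimal sketch of a more forgiving parser, assuming the text keeps the same '|'-separated order that the script indexes. The function name parse_house_info, the dictionary keys, and the sample string are hypothetical and used only for illustration.</p>

```python
import re

# Hypothetical field names, listed in the order the script indexes houseinfolist[0..6]
FIELDS = ['layout', 'area', 'orientation', 'decoration', 'floor', 'year', 'building_type']


def parse_house_info(text):
    """Split a houseInfo string on '|' and return a dict of named fields.

    Missing trailing fields come back as empty strings instead of raising IndexError.
    """
    parts = [part.strip() for part in text.split('|')]
    # Pad short lists so every field name gets a value
    parts += [''] * (len(FIELDS) - len(parts))
    record = dict(zip(FIELDS, parts))
    # Keep only the numeric part of the area, e.g. '89.5' from '89.5平米'
    match = re.search(r'\d+(?:\.\d+)?', record['area'])
    record['area'] = match.group() if match else ''
    return record


# Made-up sample in the assumed format, not real listing data
sample = '3室2厅 | 89.5平米 | 南 北 | 精装 | 中楼层(共18层) | 2010年建 | 板楼'
print(parse_house_info(sample))
```

<p>Inside the loop, the returned dictionary could stand in for the individual houseinfolist[n] lookups, and a row with an empty field can be kept or skipped as needed.</p>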
</div>
</div>