2021-01-05
<!-- markdown css tag --><div class="pinggu_markdown">
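<div class="pinggu_markdown__html"><p>This is the most basic approach, and the easiest code to get started with, for scraping Lianjia listing data.</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># Install the requests and bs4 libraries first, then import them; re ships with Python and can be imported directly</span>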
<span class="token keyword">import</span> requests
<span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup
<span class="token keyword">import</span> re

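<span class="token comment"># Lianjia's basic second-hand listing pages show at most 100 pages of 30 listings each, so this approach can collect at most 3000 listings</span>
page <span class="token operator">=</span> <span class="token number">2</span> <span class="token comment"># Page bound: range(1, page) fetches pages 1 to page - 1, so 2 fetches only page 1</span>
<span class="token comment"># Open a CSV file and write the header row first, ready for the data rows</span>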
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span>r<span class="token string">'c:\lianjia.csv'</span><span class="token punctuation">,</span><span class="token string">'a'</span><span class="token punctuation">,</span>encoding<span class="token operator">=</span><span class="token string">'utf-8-sig'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'{},{},{},{},{},{},{},{},{},{},{},{},\n'</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span><span class="token string">'Listing ID'</span><span class="token punctuation">,</span><span class="token string">'Community'</span><span class="token punctuation">,</span><span class="token string">'Business district'</span><span class="token punctuation">,</span><span class="token string">'Layout'</span><span class="token punctuation">,</span><span class="token string">'Area'</span><span class="token punctuation">,</span><span class="token string">'Orientation'</span><span class="token punctuation">,</span><span class="token string">'Decoration'</span><span class="token punctuation">,</span><span class="token string">'Floor'</span><span class="token punctuation">,</span><span class="token string">'Year built'</span><span class="token punctuation">,</span><span class="token string">'Total price'</span><span class="token punctuation">,</span><span class="token string">'Unit price'</span><span class="token punctuation">,</span><span class="token string">'Title'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span>page<span class="token punctuation">)</span><span class="token punctuation">:</span>
        url <span class="token operator">=</span> <span class="token string">'https://xm.lianjia.com/ershoufang/pg'</span><span class="token operator">+</span><span class="token builtin">str</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span>
        <span class="token comment"># print(url)</span>
        headers <span class="token operator">=</span> <span class="token punctuation">{</span>
            <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.&#48; (Windows NT 1&#48;.&#48;; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) chrome/87.&#48;.428&#48;.88 Safari/537.36 Edg/87.&#48;.664.66'</span>
        <span class="token punctuation">}</span>

        html <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span>headers <span class="token operator">=</span> headers<span class="token punctuation">)</span><span class="token punctuation">.</span>text
        soup <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span><span class="token string">'html.parser'</span><span class="token punctuation">)</span>
        <span class="token comment"># print(soup)</span>
        infos <span class="token operator">=</span> soup<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'ul'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'sellListContent'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'li'</span><span class="token punctuation">)</span>
        <span class="token comment"># print(infos)</span>
        <span class="token keyword">for</span> info <span class="token keyword">in</span> infos<span class="token punctuation">:</span>
            <span class="token comment"># Get the listing ID</span>
            house_id <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'title'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'data-housecode'</span><span class="token punctuation">)</span>
            <span class="token comment"># print(house_id)</span>
            <span class="token comment"># Get the listing title</span>
            name <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'title'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token comment"># The community name and its business district both sit under the class=positionInfo div, so collect both names into a list first, then pick each one out by index</span>
            weizhi <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'positionInfo'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">)</span>
            data_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
            <span class="token keyword">for</span> z <span class="token keyword">in</span> weizhi<span class="token punctuation">:</span>
                data_list<span class="token punctuation">.</span>append<span class="token punctuation">(</span>z<span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
            xiaoqu <span class="token operator">=</span> data_list<span class="token punctuation">[</span><span class="token number">&#48;</span><span class="token punctuation">]</span>
            shangquan <span class="token operator">=</span> data_list<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
            <span class="token comment"># With the community and location in hand, read the listing details next; again split the text into a list first, then index into it</span>
            houseinfo <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'houseInfo'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_text<span class="token punctuation">(</span><span class="token punctuation">)</span>
            houseinfolist <span class="token operator">=</span> houseinfo<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">'|'</span><span class="token punctuation">)</span>
            roominfo <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">&#48;</span><span class="token punctuation">]</span>
            <span class="token comment"># The area is a floating-point number, so extract it with a regular expression</span>
            mianji <span class="token operator">=</span> re<span class="token punctuation">.</span>findall<span class="token punctuation">(</span>r<span class="token string">'-?\d+\.?\d*e?-?\d*?'</span><span class="token punctuation">,</span>houseinfolist<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">&#48;</span><span class="token punctuation">]</span>
            chaoxiang <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span>
            zhuangxiu <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span>
            louceng <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span>
            nian <span class="token operator">=</span> houseinfolist<span class="token punctuation">[</span><span class="token number">5</span><span class="token punctuation">]</span>
            <span class="token comment"># louxing = houseinfolist[6]</span>
            <span class="token comment"># Next, the total price</span>
            totalprice <span class="token operator">=</span> info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span><span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span><span class="token string">'totalPrice'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'span'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text
            <span class="token comment"># Next, the unit price per square metre; strip every non-digit with a regular expression to keep the integer</span>
            unitprice <span class="token operator">=</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">'\D'</span><span class="token punctuation">,</span><span class="token string">''</span><span class="token punctuation">,</span>info<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'div'</span><span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token string">'class'</span><span class="token punctuation">:</span> <span class="token string">'unitPrice'</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'span'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>text<span class="token punctuation">)</span>
            <span class="token comment"># Finally, write the row to the CSV</span>

            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'{},{},{},{},{},{},{},{},{},{},{},{},\n'</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>house_id<span class="token punctuation">,</span>xiaoqu<span class="token punctuation">,</span>shangquan<span class="token punctuation">,</span>roominfo<span class="token punctuation">,</span>mianji<span class="token punctuation">,</span>chaoxiang<span class="token punctuation">,</span>zhuangxiu<span class="token punctuation">,</span>louceng<span class="token punctuation">,</span>nian<span class="token punctuation">,</span>totalprice<span class="token punctuation">,</span>unitprice<span class="token punctuation">,</span>name<span class="token punctuation">)</span><span class="token punctuation">)</span>





</code></pre>
</div>
</div>
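The script above builds each CSV row with `str.format` and reads the `houseInfo` fields by fixed index. That works until a title contains a comma or a listing lacks a field such as the year. Below is a minimal sketch of one way to harden those two steps, not the original author's method. The helper names `parse_house_info` and `write_rows` are hypothetical, and the sample `houseInfo` string and listing ID are made up for illustration, so check them against the live page before relying on them.

```python
import csv
import io
import re


def parse_house_info(text):
    """Split a houseInfo string on '|' into named fields.

    Missing trailing fields come back as empty strings instead of
    raising IndexError.
    """
    parts = [p.strip() for p in text.split('|')]
    # Pad to six fields so a short listing does not break the indexing below.
    parts += [''] * (6 - len(parts))
    # Keep only the numeric part of the area, e.g. '89.5' from '89.5平米'.
    area_match = re.search(r'\d+(?:\.\d+)?', parts[1])
    return {
        'layout': parts[0],
        'area': area_match.group(0) if area_match else '',
        'orientation': parts[2],
        'decoration': parts[3],
        'floor': parts[4],
        'year': parts[5],
    }


def write_rows(f, rows):
    """Write rows with csv.writer, which quotes fields containing commas."""
    writer = csv.writer(f)
    writer.writerows(rows)


# Made-up sample text in the assumed 'a | b | c' layout of the houseInfo div.
sample = '3室2厅 | 89.5平米 | 南 北 | 精装 | 中楼层(共18层) | 2010年建'
info = parse_house_info(sample)

# Write to an in-memory buffer here; a real run would pass an open file instead.
buf = io.StringIO()
write_rows(buf, [
    ['Listing ID', 'Layout', 'Area', 'Title'],
    ['X001', info['layout'], info['area'], '满五唯一, 南北通透'],
])
print(buf.getvalue())
```

When writing to a real file with `csv.writer`, open it with `newline=''` so that no extra blank lines appear on Windows.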
All replies
2021-1-17 10:49:33
Is there a bug? Why does the post turn into HTML source code after it is published?