2022-08-09
<!-- markdown css tag --><div class="pinggu_markdown">
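<div class="pinggu_markdown__html"><h1 id="既然学了python,应该花5分钟学下自动化爬虫">Now That You've Learned Python, Take 5 Minutes to Learn Automation &amp; Web Scraping</h1>
<p>I just finished Al Sweigart's "Automate the Boring Stuff with Python". The <a href="https://automatetheboringstuff.com/">original</a> is available online, and if you prefer a physical book, the <a href="https://book.douban.com/subject/26836700/">Chinese edition</a> is listed on Douban. This post skips statistics, plotting, and machine learning and covers another use of Python that is more fun and more practical: automation. You can of course also use it as a web scraper.</p>
<p>The example below shows how to download YouTube videos automatically. With a few changes, the same approach also works for Bilibili videos; see this <a href="https://www.bilibili.com/read/cv15033695">reference</a>.</p>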
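<p>A quick introduction: <a href="https://www.clicknium.com/documents/quickstart">Clicknium</a> is a Python automation library released in 2022 that can automate web pages and Windows apps. It combines visual, point-and-click operations with code.<br>
<a href="https://pytube.io/en/latest/">pytube</a> is a Python library for downloading YouTube videos. All it needs is the video's link (provided, of course, that you can reach YouTube in the first place).</p>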
<p>Make sure you have:</p>
<ul>
<li>A Python 3.7+ environment</li>
<li>VS Code</li>
<li>A proxy or VPN that can reach YouTube</li>
</ul>
<h3 id="配置clicknium">Setting up Clicknium</h3>
<ol>
<li>Search for Clicknium in the VS Code extension marketplace and install it.</li>
</ol>
<p><img src="https://s1.328888.xyz/2&#48;22/&#48;8/1&#48;/4bSI7.png" alt="sss"></p>
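<p>Click the Clicknium icon on the right side of VS Code to open the Welcome page, then follow its buttons to install the Python module, install the Chrome extension, and register an account.</p>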
<p><img src="https://s1.328888.xyz/2&#48;22/&#48;8/1&#48;/4bInd.png" alt="hhh"></p>
<h3 id="code">Code</h3>
<p>Create a Python file in VS Code, for example <code>youtube.py</code>:</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
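<p>Press F5 to run it. With this one line, Clicknium opens Chrome and navigates to the YouTube home page.</p>
<h3 id="抓取">Capturing elements</h3>
<p>Suppose we want to download Taylor Swift's videos. We first have to open her videos page, which, much like the old joke about putting an elephant into a fridge, comes down to a few simple steps:</p>
<p>Steps:</p>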
<ol>
<li>Type "Taylor Swift" into the search box</li>
<li>Click the search button</li>
<li>Click through to Taylor Swift's channel page</li>
<li>Click through to her videos page</li>
</ol>
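<p>Four elements are involved in total: the search box, the search button, the artist name in the search results, and the tab that switches to the videos page.</p>
<p>Clicknium uses locators to identify UI elements and provides a Recorder to generate them.</p>
<p>Run the code above to open the YouTube page, then switch to VS Code and start the Recorder.</p>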
<p><img src="https://pic4.zhimg.com/8&#48;/v2-a6c&#48;44263d&#48;7d381f9347&#48;b8&#48;f48d72b_144&#48;w.jpg" alt=""></p>
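<p>Starting the Recorder</p>
<p>Click the small capture button on the Locator tab in VS Code (shown above) to start the Recorder. The button is easy to miss. If you do not see the LOCATORS tab, click the three dots at the top right and tick Locator. Move the mouse over the search bar and the input is highlighted automatically. Hold Ctrl and click to capture the search box. Capture the <strong>search button</strong> and the <strong>Taylor Swift link</strong> at the top right of the next screenshot in the same way.</p>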
<p><img src="https://pic4.zhimg.com/8&#48;/v2-&#48;de369&#48;728d&#48;63e86f12ebc9aa8b7c77_144&#48;w.jpg" alt=""></p>
<p>YouTube page</p>
<p><img src="https://pic3.zhimg.com/8&#48;/v2-a597af797527d812b8913fdc9f1199be_144&#48;w.jpg" alt=""></p>
<p>Recorder</p>
<p>Each capture generates a locator for the corresponding UI element in the Recorder, and you can rename it. Click Complete when you are done.</p>
<p>Then go back to VS Code:</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator
<span class="token keyword">from</span> clicknium<span class="token punctuation">.</span>common<span class="token punctuation">.</span>enums <span class="token keyword">import</span> <span class="token operator">*</span>

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urlArrary <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
        <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
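<p>The find_element function takes a locator and finds the matching UI element; the set_text method then writes "Taylor Swift" into the search box.</p>
<p>The next line finds the search button the same way and calls click to simulate a mouse click. TS is the locator for the Taylor Swift link in the search results.</p>
<p>Running the code above takes us to Taylor Swift's channel page. We use the same method to open the video list.</p>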
<p><img src="https://pic4.zhimg.com/8&#48;/v2-c9be1de93b826a65d997889a8892b38f_144&#48;w.jpg" alt=""></p>
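<p>Video list</p>
<p>From the list in the screenshot above we need the address of every video, which we can get through a locator. Capturing a separate locator for each video is impractical, so we use a very powerful Recorder feature: Similar elements. Click the button shown below, then capture with Ctrl+Click as before; Clicknium automatically recognizes elements of the same type and generates a single locator for them.</p>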
<p><img src="https://pic1.zhimg.com/8&#48;/v2-5169acab&#48;94c7e83e3e57ce9&#48;7fd275&#48;_144&#48;w.jpg" alt=""></p>
<pre class=" language-python"><code class="prism  language-python">    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
        <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>div_video<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>wait_appear<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>  <span class="token comment"># wait for the video list to finish loading</span>
    videoTitles <span class="token operator">=</span> tab<span class="token punctuation">.</span>find_elements<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>
</code></pre>
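<p>Because the video list loads asynchronously, we use wait_appear to wait for the locator to appear. The locator captured with Similar elements matches multiple videos, so we call find_elements, which returns a list of UiElement objects. Reading the href property of each element gives the video's relative path; prepending the YouTube address yields the full URL.</p>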
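<p>As a minimal sketch of the URL-building step, the standard library's <code>urljoin</code> can replace plain string concatenation. The helper name <code>to_absolute_url</code> and the video id <code>abc123</code> are made up for illustration; the advantage is that the result is still correct if an href happens to be absolute already.</p>

```python
from urllib.parse import urljoin

YOUTUBE_BASE = "https://www.youtube.com"  # same base URL the post uses


def to_absolute_url(href, base=YOUTUBE_BASE):
    """Return a full URL for an href that may be relative or already absolute."""
    # urljoin resolves relative paths against the base and returns
    # absolute URLs unchanged, so the base is never prepended twice.
    return urljoin(base, href)


# "abc123" is a placeholder video id, not a real one
print(to_absolute_url("/watch?v=abc123"))
print(to_absolute_url("https://www.youtube.com/watch?v=abc123"))
```

<p>Both calls print the same full URL, so the helper could be applied to whatever <code>get_property("href")</code> returns without checking its form first.</p>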
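<p>With the full URL we can download the video with pytube. pytube can pick streams of different resolutions depending on the parameters you pass; note that for high-quality streams the video and audio tracks are separate. See <a href="https://pytube.io/en/latest/user/streams.html">Working with Streams and StreamQuery</a> for details. Here we download the 1080p version; change SAVE_PATH to set the download directory. The same approach can be used to automate video uploads. If this post is well received, I will write a follow-up.</p>
<p>Here is the complete code:</p>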
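<p>Not every video offers 1080p, in which case a filter on that resolution finds nothing. The sketch below shows one possible fallback rule: take the preferred resolution if it is offered, otherwise the highest one below it. <code>pick_resolution</code> is a hypothetical helper, not part of pytube, and it assumes resolutions are strings such as <code>"720p"</code>, with <code>None</code> for audio-only streams.</p>

```python
def pick_resolution(available, preferred="1080p"):
    """Pick `preferred` if offered, else the highest resolution below it."""

    def height(res):
        # "1080p" -> 1080
        return int(res.rstrip("p"))

    # Skip None entries (assumed to be audio-only streams) and anything
    # taller than the preferred resolution.
    candidates = [r for r in available if r and height(preferred) >= height(r)]
    if not candidates:
        return None
    return max(candidates, key=height)


print(pick_resolution(["360p", "720p", "1080p"]))
print(pick_resolution(["360p", "720p", None]))
```

<p>With pytube, the list could presumably be built from the <code>resolution</code> attribute of each stream in <code>yt.streams</code>, and the chosen value passed to <code>filter(res=...)</code>.</p>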
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">from</span> pytube <span class="token keyword">import</span> YouTube
<span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator
<span class="token keyword">from</span> clicknium<span class="token punctuation">.</span>common<span class="token punctuation">.</span>enums <span class="token keyword">import</span> <span class="token operator">*</span>

<span class="token keyword">def</span> <span class="token function">downloadVideo</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    SAVE_PATH <span class="token operator">=</span> <span class="token string">"C:\\Users\\zhaoritian\\Downloads\\Youtube"</span>
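    <span class="token keyword">try</span><span class="token punctuation">:</span>
        yt <span class="token operator">=</span> YouTube<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
        <span class="token comment"># first() returns None when no 1080p stream is available</span>
        stream <span class="token operator">=</span> yt<span class="token punctuation">.</span>streams<span class="token punctuation">.</span><span class="token builtin">filter</span><span class="token punctuation">(</span>res<span class="token operator">=</span><span class="token string">"1080p"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>first<span class="token punctuation">(</span><span class="token punctuation">)</span>
        <span class="token keyword">if</span> stream <span class="token keyword">is</span> <span class="token boolean">None</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"No 1080p stream found:"</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
            <span class="token keyword">return</span>
        stream<span class="token punctuation">.</span>download<span class="token punctuation">(</span>output_path<span class="token operator">=</span>SAVE_PATH<span class="token punctuation">)</span>
    <span class="token keyword">except</span> Exception <span class="token keyword">as</span> e<span class="token punctuation">:</span>
        <span class="token comment"># report the actual error; not every failure is a connection problem</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Download failed:"</span><span class="token punctuation">,</span> e<span class="token punctuation">)</span>
        <span class="token keyword">return</span>

    <span class="token comment"># reached only when the download succeeded</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Task Completed!'</span><span class="token punctuation">)</span>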


<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urlArrary <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
        <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>div_video<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>wait_appear<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>
    videoTitles <span class="token operator">=</span> tab<span class="token punctuation">.</span>find_elements<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>
    <span class="token keyword">for</span> locat <span class="token keyword">in</span> videoTitles<span class="token punctuation">:</span>
        url <span class="token operator">=</span> <span class="token string">"https://www.youtube.com"</span> <span class="token operator">+</span> locat<span class="token punctuation">.</span>get_property<span class="token punctuation">(</span><span class="token string">"href"</span><span class="token punctuation">)</span>
        urlArrary<span class="token punctuation">.</span>append<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>

    <span class="token keyword">for</span> v <span class="token keyword">in</span> urlArrary<span class="token punctuation">:</span>
        downloadVideo<span class="token punctuation">(</span>v<span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
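<p>The final loop in <code>main</code> downloads every collected URL in turn. A possible variation, sketched below, skips duplicate URLs and records which downloads failed so they can be retried later. <code>download_all</code> is a hypothetical helper; it assumes the downloader passed in raises an exception on failure, unlike <code>downloadVideo</code> above, which catches exceptions itself.</p>

```python
def download_all(urls, download):
    """Call download(url) once per unique URL; return the URLs that failed."""
    failed = []
    # dict.fromkeys drops duplicates while keeping first-seen order
    for url in dict.fromkeys(urls):
        try:
            download(url)
        except Exception as exc:
            print(f"Failed: {url} ({exc})")
            failed.append(url)
    return failed


def fake_download(url):
    # Stand-in downloader for demonstration; fails on URLs containing "bad"
    if "bad" in url:
        raise RuntimeError("simulated failure")


print(download_all(["video-a", "bad-video", "video-a"], fake_download))
```

<p>Passing the downloader in as a parameter keeps the loop independent of pytube, which also makes it easy to try out with a stand-in function like the one shown.</p>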
<p>References:</p>
<p><a href="https://link.zhihu.com/?target=https%3A//www.clicknium.com/documents">Clicknium</a></p>
<p><a href="https://link.zhihu.com/?target=https%3A//pytube.io/en/latest/index.html">Pytube</a></p>
</div>
</div>
All replies
2022-8-9 20:29:12
I wrote this in Markdown... so why did it get posted as raw HTML?
2022-8-11 14:53:17
heartlocker posted on 2022-8-9 20:29
I wrote this in Markdown... so why did it get posted as raw HTML?
You can't edit with Markdown here; I have the same problem.
2022-8-11 15:16:25
Click Advanced mode; at the top right of the editor there is an "M!" icon for posting in Markdown.

2022-8-11 15:17:04
zhhony posted on 2022-8-11 14:53
You can't edit with Markdown here; I have the same problem.
When replying, there is also a Markdown posting icon; clicking it should work.
2022-8-11 15:21:24
<!-- markdown css tag --><div class="pinggu_markdown">
<div class="pinggu_markdown__html"><p>The problem persists even after clicking that; this very reply was posted from the Markdown editor in Advanced mode.</p>
</div>
</div>