<!-- markdown css tag --><div class="pinggu_markdown">
<div class="pinggu_markdown__html"><h1 id="既然学了python,应该花5分钟学下自动化爬虫">You've Learned Python, So Spend 5 Minutes on Automation &amp; Web Scraping</h1>
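<p>I recently finished Al Sweigart's "Automate the Boring Stuff with Python". The <a href="https://automatetheboringstuff.com/">original</a> is available online, and if you prefer a physical book, the <a href="https://book.douban.com/subject/26836700/">Chinese edition</a> is listed on Douban. This post skips statistics, plotting, and machine learning and covers another fun and practical use of Python: automation. You can also use it as a web scraper.</p>
<p>The example below shows how to download YouTube videos automatically. With a few changes it can also download Bilibili videos; see this <a href="https://www.bilibili.com/read/cv15033695">reference</a>.</p>
<p>A quick introduction: <a href="https://www.clicknium.com/documents/quickstart">Clicknium</a> is a Python automation library released in 2022. It automates web pages and Windows apps by combining visual recording with code.<br>
<a href="https://pytube.io/en/latest/">pytube</a> is a Python library for downloading YouTube videos. All it needs is the video's link (provided, of course, that YouTube is reachable from your network).</p>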
<p>Make sure you have:</p>
<ul>
<li>A Python 3.7+ environment</li>
<li>VS Code</li>
<li>A VPN or proxy, if YouTube is not directly reachable from your network</li>
</ul>
<h3 id="配置clicknium">Setting up Clicknium</h3>
<ol>
<li>Search for Clicknium in the VS Code extension marketplace and install it.</li>
</ol>
<p><img src="https://s1.328888.xyz/2022/08/10/4bSI7.png" alt="sss"></p>
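<p>Click the Clicknium icon in the VS Code sidebar to open the Welcome page, then follow the buttons there to install the Python module, install the Chrome extension, and register an account.</p>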
<p><img src="https://s1.328888.xyz/2022/08/10/4bInd.png" alt="hhh"></p>
<h3 id="code">Code</h3>
<p>Create a Python file in VS Code, for example <code>youtube.py</code>:</p>
<pre class=" language-python"><code class="prism language-python"><span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Open Chrome on the YouTube home page</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>Press F5 to run it. Clicknium opens Chrome automatically and navigates to the YouTube home page.</p>
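<h3 id="抓取">Capturing elements</h3>
<p>Suppose we want to download Taylor Swift's videos. First we have to open her videos page. Like the old joke about putting an elephant in a fridge, it breaks down into a few simple steps.</p>
<p>Steps:</p>
<ol>
<li>Type Taylor Swift into the search box</li>
<li>Click the search button</li>
<li>Click through to Taylor Swift's channel page</li>
<li>Open the Videos tab on the channel page</li>
</ol>
<p>Four UI elements are involved: the search box, the search button, the artist name in the search results, and the button that switches to the Videos tab.</p>
<p>Clicknium uses locators to identify UI elements and provides a Recorder to generate them.</p>
<p>Run the code above to open the YouTube page, then switch to VS Code and start the Recorder.</p>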
<p><img src="https://pic4.zhimg.com/80/v2-a6c044263d07d381f93470b80f48d72b_1440w.jpg" alt=""></p>
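<p>Starting the Recorder</p>
<p>Click the small capture button on the Locator tab in VS Code (shown above) to start the Recorder. The button is easy to miss. If you don't see the LOCATORS tab, click the three dots in the upper-right corner and check Locator. Move the mouse over the search bar and the input is highlighted automatically. Hold Ctrl and click to capture the search box. Capture the <strong>search button</strong> the same way, and then the <strong>Taylor Swift link</strong> in the upper-right corner of the screenshot below.</p>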
<p><img src="https://pic4.zhimg.com/80/v2-0de3690728d063e86f12ebc9aa8b7c77_1440w.jpg" alt=""></p>
<p>The YouTube page</p>
<p><img src="https://pic3.zhimg.com/80/v2-a597af797527d812b8913fdc9f1199be_1440w.jpg" alt=""></p>
<p>Recorder</p>
<p>Each capture adds a locator for that UI element to the Recorder, and you can rename it. Click Complete when you are done.</p>
<p>Then go back to VS Code:</p>
<pre class=" language-python"><code class="prism language-python"><span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator
<span class="token keyword">from</span> clicknium<span class="token punctuation">.</span>common<span class="token punctuation">.</span>enums <span class="token keyword">import</span> <span class="token operator">*</span>

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urlArray <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
    <span class="token comment"># Type the query into the search box</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
        <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
    <span class="token comment"># Click the search button, then the Taylor Swift link in the results</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
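<p><code>find_element</code> takes a locator and returns the matching UI element, and <code>set_text</code> then types "Taylor Swift" into the search box.</p>
<p>The next line locates the search button the same way and calls <code>click</code> to simulate a mouse click. <code>TS</code> is the locator for the Taylor Swift link in the search results.</p>
<p>Running the code above lands on Taylor Swift's channel page. We use the same approach to open the video list.</p>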
<p><img src="https://pic4.zhimg.com/80/v2-c9be1de93b826a65d997889a8892b38f_1440w.jpg" alt=""></p>
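<p>The video list</p>
<p>We need the address of every video in the list above, and we can get it through a locator. Capturing a separate locator for each video is impractical, so we use a powerful feature of the Clicknium Recorder: Similar elements. Click the button shown below, then use Ctrl+Click as before. Clicknium automatically recognizes elements of the same kind and generates a single locator for them.</p>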
<p><img src="https://pic1.zhimg.com/80/v2-5169acab094c7e83e3e57ce907fd2750_1440w.jpg" alt=""></p>
<pre class=" language-python"><code class="prism language-python">tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
    <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>div_video<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
tab<span class="token punctuation">.</span>wait_appear<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>  <span class="token comment"># wait for the video list to finish loading</span>
videoTitles <span class="token operator">=</span> tab<span class="token punctuation">.</span>find_elements<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>
</code></pre>
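<p>The video list loads asynchronously, so we call <code>wait_appear</code> to wait for the locator to show up. Because the locator was captured with Similar elements, it matches many videos, so we use <code>find_elements</code>, which returns a list of UiElement objects. Reading each element's <code>href</code> gives the video's relative path, and prepending the YouTube address yields the full URL.</p>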
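<p>The URL-building step can be sketched on its own. This is a minimal sketch, not part of the original script: the helper name <code>to_absolute_urls</code> and the video IDs are made up for illustration. It uses the standard library's <code>urljoin</code>, which gives the same result whether an <code>href</code> comes back as a relative path or as an absolute URL, and it also drops empty values and duplicates.</p>
<pre class=" language-python"><code class="prism language-python">from urllib.parse import urljoin

YOUTUBE_BASE = "https://www.youtube.com"


def to_absolute_urls(hrefs, base=YOUTUBE_BASE):
    """Resolve each href against base, skipping empty values and duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:
            continue  # some elements may carry no href at all
        url = urljoin(base, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical href values, mixing relative and absolute forms
sample = ["/watch?v=abc123", "https://www.youtube.com/watch?v=abc123", None, "/watch?v=xyz789"]
print(to_absolute_urls(sample))
# prints ['https://www.youtube.com/watch?v=abc123', 'https://www.youtube.com/watch?v=xyz789']
</code></pre>
<p>In the script, the list passed in would be the <code>href</code> values read from the elements that <code>find_elements</code> returns.</p>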
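<p>With the full URL we can download the video with pytube. pytube selects among the available resolutions according to the filters you pass. Note that for high-quality streams the video and audio tracks are delivered separately. See <a href="https://pytube.io/en/latest/user/streams.html">Working with Streams and StreamQuery</a> for details. Here we download the 1080p version; change <code>SAVE_PATH</code> to pick a different download folder. The same approach can also automate video uploads. If this post is well received, I'll write a follow-up.</p>
<p>Here is the complete code:</p>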
<pre class=" language-python"><code class="prism language-python"><span class="token keyword">from</span> pytube <span class="token keyword">import</span> YouTube
<span class="token keyword">from</span> clicknium <span class="token keyword">import</span> clicknium <span class="token keyword">as</span> cc<span class="token punctuation">,</span> locator
<span class="token keyword">from</span> clicknium<span class="token punctuation">.</span>common<span class="token punctuation">.</span>enums <span class="token keyword">import</span> <span class="token operator">*</span>

<span class="token keyword">def</span> <span class="token function">downloadVideo</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    SAVE_PATH <span class="token operator">=</span> <span class="token string">"C:\\Users\\zhaoritian\\Downloads\\Youtube"</span>
    <span class="token keyword">try</span><span class="token punctuation">:</span>
        yt <span class="token operator">=</span> YouTube<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
        <span class="token comment"># first() returns None when the video has no 1080p stream</span>
        stream <span class="token operator">=</span> yt<span class="token punctuation">.</span>streams<span class="token punctuation">.</span><span class="token builtin">filter</span><span class="token punctuation">(</span>res<span class="token operator">=</span><span class="token string">"1080p"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>first<span class="token punctuation">(</span><span class="token punctuation">)</span>
        <span class="token keyword">if</span> stream <span class="token keyword">is</span> <span class="token boolean">None</span><span class="token punctuation">:</span>
            <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"No 1080p stream for"</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
            <span class="token keyword">return</span>
        stream<span class="token punctuation">.</span>download<span class="token punctuation">(</span>output_path<span class="token operator">=</span>SAVE_PATH<span class="token punctuation">)</span>
    <span class="token keyword">except</span> Exception <span class="token keyword">as</span> e<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Download failed:"</span><span class="token punctuation">,</span> url<span class="token punctuation">,</span> e<span class="token punctuation">)</span>  <span class="token comment"># network or pytube error</span>
        <span class="token keyword">return</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'Task Completed!'</span><span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    urlArray <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    tab <span class="token operator">=</span> cc<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"https://www.youtube.com"</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>searchBar<span class="token punctuation">)</span><span class="token punctuation">.</span>set_text<span class="token punctuation">(</span>
        <span class="token string">"Taylor Swift"</span><span class="token punctuation">,</span> by<span class="token operator">=</span><span class="token string">'sendkey-after-click'</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>button_search_icon_legacy<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>TS<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>find_element<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>div_video<span class="token punctuation">)</span><span class="token punctuation">.</span>click<span class="token punctuation">(</span><span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>wait_appear<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>  <span class="token comment"># wait for the video list to load</span>
    videoTitles <span class="token operator">=</span> tab<span class="token punctuation">.</span>find_elements<span class="token punctuation">(</span>locator<span class="token punctuation">.</span>chrome<span class="token punctuation">.</span>youtube<span class="token punctuation">.</span>a_video_title<span class="token punctuation">)</span>
    <span class="token keyword">for</span> locat <span class="token keyword">in</span> videoTitles<span class="token punctuation">:</span>
        <span class="token comment"># Build the full URL from each video's relative href</span>
        url <span class="token operator">=</span> <span class="token string">"https://www.youtube.com"</span> <span class="token operator">+</span> locat<span class="token punctuation">.</span>get_property<span class="token punctuation">(</span><span class="token string">"href"</span><span class="token punctuation">)</span>
        urlArray<span class="token punctuation">.</span>append<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    tab<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> v <span class="token keyword">in</span> urlArray<span class="token punctuation">:</span>
        downloadVideo<span class="token punctuation">(</span>v<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    main<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
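<p>The script above asks for a 1080p stream directly. Because high-resolution streams keep video and audio separate, one possible refinement is to fall back to a combined stream when no 1080p stream is found. The sketch below is an assumption-laden illustration rather than the article's method: <code>pick_stream</code> is a hypothetical helper, and it relies only on the <code>filter</code>, <code>first</code>, and <code>get_highest_resolution</code> methods of pytube's StreamQuery.</p>
<pre class=" language-python"><code class="prism language-python">def pick_stream(streams, resolution="1080p"):
    """Return (stream, has_audio) chosen from a pytube StreamQuery.

    Hypothetical helper: prefer an adaptive mp4 stream at the requested
    resolution, otherwise fall back to the best progressive stream.
    """
    # Adaptive streams carry video and audio as separate tracks.
    video_only = streams.filter(
        adaptive=True, res=resolution, file_extension="mp4").first()
    if video_only is not None:
        return video_only, False
    # Progressive streams bundle audio and video, usually at lower resolution.
    return streams.get_highest_resolution(), True


# Usage with pytube (needs network access), shown as comments only:
#   from pytube import YouTube
#   stream, has_audio = pick_stream(YouTube(url).streams)
#   if stream is not None:
#       stream.download(output_path=SAVE_PATH)
</code></pre>
<p>When <code>has_audio</code> is false, the downloaded file contains video only; the audio track would have to be fetched separately and merged with an external tool.</p>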
<p>References:</p>
<p><a href="https://link.zhihu.com/?target=https%3A//www.clicknium.com/documents">Clicknium</a></p>
<p><a href="https://link.zhihu.com/?target=https%3A//pytube.io/en/latest/index.html">Pytube</a></p>
</div>
</div>