2021-08-23
<!-- markdown css tag --><div class="pinggu_markdown">
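<div class="pinggu_markdown__html"><h2 id="前言">Preface</h2>
<p>At readers' request, this post revisits our first article in more detail and explains how to construct the information-content measure for Management Discussion and Analysis (MD&amp;A) text. It covers the definition of MD&amp;A information content, the computation behind it, implementations in Python and Stata, and econometric extensions.</p>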
<h2 id="定义">Definition</h2>
<p>We follow the definition in 孟庆斌 et al. (中国工业经济 [China Industrial Economics], 2017):</p>
<blockquote>
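<p>On the one hand, all listed companies operate in the same macroeconomic environment, face the same risk factors, and share the same political and policy background; on the other hand, listed companies in the same industry face similar industrial policies, competitive conditions, and market characteristics. The MD&amp;A disclosure of each listed company is therefore inevitably similar, to some degree, to that of other listed companies in the same industry and of listed companies in other industries, and some companies may even directly borrow the wording of other companies' MD&amp;A. This paper defines information that repeats or resembles that of other companies, in the same industry or in other industries, as content without information content, and defines the information that differs as content that truly carries information, referred to as information content for short.</p>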
</blockquote>
<h2 id="计算原理">How the measure is computed</h2>
<p><code>Bag-of-words model</code>: suppose we have 10 texts. We segment each of the 10 texts into words, assign a number to every distinct term that appears after segmentation, and collect these terms into a vocabulary built from the content of the 10 texts.</p>
<p><code>Term-frequency vector</code>: next, we count how often each term occurs in each segmented text and, following the term numbering, build a term-frequency vector for every text. It looks like [0, 1, 4, 0, 3, …, 0, 5], which means that in text <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.65952em; vertical-align: 0em;"></span><span class="mord mathit">i</span></span></span></span></span>, term 1 occurs 0 times, term 2 occurs once, term 3 occurs 4 times, and so on.</p>
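<p>The two ideas above can be illustrated with a minimal sketch. The three toy documents and the helper name <code>count_vector</code> are made up for illustration and are not part of the original data or code; the terms are assumed to be segmented already.</p>

```python
from collections import Counter

# Toy, already-segmented documents (hypothetical data, not from the post).
docs = [
    ["market", "risk", "growth", "risk"],
    ["growth", "policy", "market"],
    ["policy", "policy", "risk"],
]

# Vocabulary: every distinct term, sorted so the numbering is reproducible.
vocabulary = sorted({term for doc in docs for term in doc})


def count_vector(doc):
    """Return the term-frequency vector of one document over the vocabulary."""
    counts = Counter(doc)  # a Counter returns 0 for terms it has not seen
    return [counts[term] for term in vocabulary]


vectors = [count_vector(doc) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

<p>Each position of a vector corresponds to one numbered term of the vocabulary, so all documents share the same vector length.</p>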
<p><strong>The approach of 孟庆斌 et al. (中国工业经济, 2017)</strong></p>
<p><code>Step 1</code>: build the vocabulary from the <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>M</mi><mi>D</mi><mi mathvariant="normal">&amp;</mi><mi>A</mi></mrow><annotation encoding="application/x-tex">MD\&amp;A</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.69444em; vertical-align: 0em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">M</span><span class="mord mathit" style="margin-right: 0.02778em;">D</span><span class="mord">&amp;</span><span class="mord mathit">A</span></span></span></span></span> text of all listed companies.</p>
<p><code>Step 2</code>: using this vocabulary, build the term-frequency vector of each company's <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>M</mi><mi>D</mi><mi mathvariant="normal">&amp;</mi><mi>A</mi></mrow><annotation encoding="application/x-tex">MD\&amp;A</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.69444em; vertical-align: 0em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">M</span><span class="mord mathit" style="margin-right: 0.02778em;">D</span><span class="mord">&amp;</span><span class="mord mathit">A</span></span></span></span></span> text.</p>
<p><code>Step 3</code>: normalize the term-frequency vector to obtain the firm-level normalized vector <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>i</mi><mi mathvariant="normal">,</mi><mi>t</mi></mrow></msub></mrow><annotation encoding="application/x-tex">Norm_{i,t}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.83333em; vertical-align: -0.15em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: 0.02778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.311664em;"><span class="" style="top: -2.55em; margin-left: 0em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight">i</span><span class="mord cjk_fallback mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.15em;"><span class=""></span></span></span></span></span></span></span></span></span></span>, that is, divide the term-frequency vector by the total number of words in the company's <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>M</mi><mi>D</mi><mi mathvariant="normal">&amp;</mi><mi>A</mi></mrow><annotation encoding="application/x-tex">MD\&amp;A</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.69444em; vertical-align: 0em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">M</span><span class="mord mathit" style="margin-right: 0.02778em;">D</span><span class="mord">&amp;</span><span class="mord mathit">A</span></span></span></span></span> text.</p>
<p><code>Step 4</code>: from the normalized term-frequency vectors, compute the industry normalized vector and the market normalized vector for company <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.65952em; vertical-align: 0em;"></span><span class="mord mathit">i</span></span></span></span></span>.</p>
<p>Industry normalized vector: the arithmetic mean of the normalized vectors of all other companies in the industry of company <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.65952em; vertical-align: 0em;"></span><span class="mord mathit">i</span></span></span></span></span>, excluding the company itself, is defined as the industry normalized vector <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>I</mi><mi mathvariant="normal">,</mi><mi>t</mi></mrow></msub></mrow><annotation encoding="application/x-tex">Norm_{I,t}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.83333em; vertical-align: -0.15em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: 0.02778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.328331em;"><span class="" style="top: -2.55em; margin-left: 0em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight" style="margin-right: 0.07847em;">I</span><span class="mord cjk_fallback mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.15em;"><span class=""></span></span></span></span></span></span></span></span></span></span>.</p>
<p>Market normalized vector: the arithmetic mean of the normalized vectors of all companies in industries other than that of company <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.65952em; vertical-align: 0em;"></span><span class="mord mathit">i</span></span></span></span></span> gives the market normalized vector <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>M</mi><mo separator="true">,</mo><mi>t</mi></mrow></msub></mrow><annotation encoding="application/x-tex">Norm_{M,t}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.969438em; vertical-align: -0.286108em;"></span><span class="mord mathit" style="margin-right: 0.10903em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: 0.02778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.328331em;"><span class="" style="top: -2.55em; margin-left: 0em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight" style="margin-right: 0.10903em;">M</span><span class="mpunct mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.286108em;"><span class=""></span></span></span></span></span></span></span></span></span></span>.</p>
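<p>Steps 3 and 4 can be sketched on toy data as follows. The firm labels, industry codes, counts, and helper names (<code>normalize</code>, <code>mean_vector</code>, <code>benchmark_vectors</code>) are all hypothetical and only illustrate the leave-one-out industry mean and the other-industries market mean described above.</p>

```python
# Hypothetical firm-level term counts over a 3-term vocabulary.
counts = {
    "A": [2, 1, 1],
    "B": [1, 1, 2],
    "C": [0, 3, 1],
    "D": [4, 0, 0],
}
# Hypothetical industry codes: A, B and C share an industry, D does not.
industry_of = {"A": "C26", "B": "C26", "C": "C26", "D": "K70"}


def normalize(vec):
    """Divide each count by the total number of counted words."""
    total = sum(vec)
    return [value / total for value in vec]


def mean_vector(vectors):
    """Element-wise arithmetic mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


norm = {firm: normalize(vec) for firm, vec in counts.items()}


def benchmark_vectors(firm):
    """Return (industry vector, market vector) for one firm."""
    # Same industry, excluding the firm itself.
    same = [norm[f] for f in norm
            if f != firm and industry_of[f] == industry_of[firm]]
    # All firms in every other industry.
    other = [norm[f] for f in norm if industry_of[f] != industry_of[firm]]
    return mean_vector(same), mean_vector(other)


industry_vec, market_vec = benchmark_vectors("A")
print(industry_vec)
print(market_vec)
```

<p>In this sketch a firm with no industry peers gets an empty industry vector, so such firms would have to be dropped or handled separately before the regression step.</p>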
<p><code>Step 5</code>: use the industry and market normalized vectors to decompose the firm-level normalized vector<br>
<span class="katex--display"><span class="katex-display"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>i</mi><mi mathvariant="normal">,</mi><mi>t</mi></mrow></msub><mo>=</mo><msub><mi>α</mi><mn>&#48;</mn></msub><mo>+</mo><msub><mi>α</mi><mn>1</mn></msub><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>I</mi><mi mathvariant="normal">,</mi><mi>t</mi></mrow></msub><mo>+</mo><msub><mi>α</mi><mn>2</mn></msub><mi>N</mi><mi>o</mi><mi>r</mi><msub><mi>m</mi><mrow><mi>M</mi><mo separator="true">,</mo><mi>t</mi></mrow></msub><mo>+</mo><msub><mi>μ</mi><mrow><mi>i</mi><mo separator="true">,</mo><mi>t</mi></mrow></msub></mrow><annotation encoding="application/x-tex">
Norm_{i,t}=\alpha_&#48;+\alpha_1Norm_{I,t}+\alpha_2Norm_{M,t}+\mu_{i,t}
</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: &#48;.83333em; vertical-align: -&#48;.15em;"></span><span class="mord mathit" style="margin-right: &#48;.1&#48;9&#48;3em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: &#48;.&#48;2778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.311664em;"><span class="" style="top: -2.55em; margin-left: &#48;em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight">i</span><span class="mord cjk_fallback mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.15em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: &#48;.277778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right: &#48;.277778em;"></span></span><span class="base"><span class="strut" style="height: &#48;.73333em; vertical-align: -&#48;.15em;"></span><span class="mord"><span class="mord mathit" style="margin-right: &#48;.&#48;&#48;37em;">α</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.3&#48;11&#48;8em;"><span class="" style="top: -2.55em; margin-left: -&#48;.&#48;&#48;37em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">&#48;</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.15em;"><span 
class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: &#48;.222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right: &#48;.222222em;"></span></span><span class="base"><span class="strut" style="height: &#48;.83333em; vertical-align: -&#48;.15em;"></span><span class="mord"><span class="mord mathit" style="margin-right: &#48;.&#48;&#48;37em;">α</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.3&#48;11&#48;8em;"><span class="" style="top: -2.55em; margin-left: -&#48;.&#48;&#48;37em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.15em;"><span class=""></span></span></span></span></span></span><span class="mord mathit" style="margin-right: &#48;.1&#48;9&#48;3em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: &#48;.&#48;2778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.328331em;"><span class="" style="top: -2.55em; margin-left: &#48;em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight" style="margin-right: &#48;.&#48;7847em;">I</span><span class="mord cjk_fallback mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.15em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: &#48;.222222em;"></span><span 
class="mbin">+</span><span class="mspace" style="margin-right: &#48;.222222em;"></span></span><span class="base"><span class="strut" style="height: &#48;.969438em; vertical-align: -&#48;.2861&#48;8em;"></span><span class="mord"><span class="mord mathit" style="margin-right: &#48;.&#48;&#48;37em;">α</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.3&#48;11&#48;8em;"><span class="" style="top: -2.55em; margin-left: -&#48;.&#48;&#48;37em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.15em;"><span class=""></span></span></span></span></span></span><span class="mord mathit" style="margin-right: &#48;.1&#48;9&#48;3em;">N</span><span class="mord mathit">o</span><span class="mord mathit" style="margin-right: &#48;.&#48;2778em;">r</span><span class="mord"><span class="mord mathit">m</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.328331em;"><span class="" style="top: -2.55em; margin-left: &#48;em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight" style="margin-right: &#48;.1&#48;9&#48;3em;">M</span><span class="mpunct mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.2861&#48;8em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: &#48;.222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right: &#48;.222222em;"></span></span><span class="base"><span 
class="strut" style="height: &#48;.716668em; vertical-align: -&#48;.2861&#48;8em;"></span><span class="mord"><span class="mord mathit">μ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: &#48;.311664em;"><span class="" style="top: -2.55em; margin-left: &#48;em; margin-right: &#48;.&#48;5em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight">i</span><span class="mpunct mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: &#48;.2861&#48;8em;"><span class=""></span></span></span></span></span></span></span></span></span></span></span><br>
where <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><msub><mi>α</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\alpha_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.58056em; vertical-align: -0.15em;"></span><span class="mord"><span class="mord mathit" style="margin-right: 0.0037em;">α</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.301108em;"><span class="" style="top: -2.55em; margin-left: -0.0037em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.15em;"><span class=""></span></span></span></span></span></span></span></span></span></span> captures the part of company i's MD&amp;A information that can be explained by other companies in the same industry, <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><msub><mi>α</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">\alpha_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.58056em; vertical-align: -0.15em;"></span><span class="mord"><span class="mord mathit" style="margin-right: 0.0037em;">α</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.301108em;"><span class="" style="top: -2.55em; margin-left: -0.0037em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.15em;"><span class=""></span></span></span></span></span></span></span></span></span></span> captures the part that can be explained by companies in other industries of the market, and the residual <span class="katex--inline"><span class="katex"><span class="katex-mathml"><math><semantics><mrow><msub><mi>μ</mi><mrow><mi>i</mi><mo separator="true">,</mo><mi>t</mi></mrow></msub></mrow><annotation encoding="application/x-tex">\mu_{i,t}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 0.716668em; vertical-align: -0.286108em;"></span><span class="mord"><span class="mord mathit">μ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.311664em;"><span class="" style="top: -2.55em; margin-left: 0em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathit mtight">i</span><span class="mpunct mtight">,</span><span class="mord mathit mtight">t</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.286108em;"><span class=""></span></span></span></span></span></span></span></span></span></span> is the part that neither industry nor market information can explain. Information content is defined as the sum of the absolute values of the residual vector across all of its dimensions.</p>
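<p>The regression step can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' code: the function name <code>information_content</code> and the toy vectors are hypothetical, and each vocabulary term is treated as one observation, so the regression is run separately for every firm-year.</p>

```python
import numpy as np


def information_content(norm_firm, norm_industry, norm_market):
    """Regress the firm vector on an intercept, the industry vector and the
    market vector by OLS; return (sum of absolute residuals, coefficients)."""
    y = np.asarray(norm_firm, dtype=float)
    X = np.column_stack([
        np.ones_like(y),                         # intercept, alpha_0
        np.asarray(norm_industry, dtype=float),  # regressor for alpha_1
        np.asarray(norm_market, dtype=float),    # regressor for alpha_2
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    residual = y - X @ coef
    return float(np.abs(residual).sum()), coef


# Toy normalized vectors over a 5-term vocabulary (hypothetical values).
industry = [0.1, 0.2, 0.3, 0.4, 0.0]
market = [0.3, 0.1, 0.2, 0.0, 0.4]
firm = [0.4, 0.1, 0.1, 0.2, 0.2]

info, coef = information_content(firm, industry, market)
print(info, coef)

# A firm whose vector is an exact mix of the two benchmarks carries
# (numerically) no firm-specific information under this definition.
mix = [0.5 * a + 0.5 * b for a, b in zip(industry, market)]
info_mix, _ = information_content(mix, industry, market)
print(info_mix)
```

<p>The larger the sum of absolute residuals, the less of the firm's wording is explained by its industry and market benchmarks.</p>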
<h2 id="代码实现">Code implementation</h2>
<h3 id="python代码实现">Python implementation</h3>
<p><code>Step 1</code>: load the data</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd
<span class="token comment"># sample period: 2007-2019</span>
mda <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_excel<span class="token punctuation">(</span><span class="token string">'管理层讨论与分析.xls'</span><span class="token punctuation">,</span> sheet_name <span class="token operator">=</span> <span class="token number">&#48;</span><span class="token punctuation">)</span>
<span class="token comment"># read the industry classification data</span>
industry <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_excel<span class="token punctuation">(</span><span class="token string">'证监会2&#48;12年版行业分类.xlsx'</span><span class="token punctuation">,</span>sheet_name <span class="token operator">=</span> <span class="token number">&#48;</span><span class="token punctuation">)</span>
<span class="token comment"># merge with the industry data</span>
data <span class="token operator">=</span> pd<span class="token punctuation">.</span>merge<span class="token punctuation">(</span>mda<span class="token punctuation">,</span> industry<span class="token punctuation">,</span>on<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">'股票代码'</span><span class="token punctuation">,</span><span class="token string">'会计年度'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> how <span class="token operator">=</span> <span class="token string">'inner'</span><span class="token punctuation">)</span>
</code></pre>
<p>This gives the following data:</p>
<pre class=" language-python"><code class="prism  language-python">data<span class="token punctuation">.</span>info<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token operator">&lt;</span><span class="token keyword">class</span> <span class="token string">'pandas.core.frame.DataFrame'</span><span class="token operator">&gt;</span>
Int64Index<span class="token punctuation">:</span> <span class="token number">33269</span> entries<span class="token punctuation">,</span> <span class="token number">&#48;</span> to <span class="token number">33268</span>
Data columns <span class="token punctuation">(</span>total <span class="token number">6</span> columns<span class="token punctuation">)</span><span class="token punctuation">:</span>
股票代码            <span class="token number">33269</span> non<span class="token operator">-</span>null int64
会计年度            <span class="token number">33269</span> non<span class="token operator">-</span>null int64
经营分析时间          <span class="token number">33269</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
经营讨论与分析内容       <span class="token number">33269</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
shortname       <span class="token number">33269</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
industrycode    <span class="token number">33269</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
dtypes<span class="token punctuation">:</span> int64<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">object</span><span class="token punctuation">(</span><span class="token number">4</span><span class="token punctuation">)</span>
memory usage<span class="token punctuation">:</span> <span class="token number">1.8</span><span class="token operator">+</span> MB
</code></pre>
<p><code>Step 2</code>: clean the data</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># rename columns</span>
data <span class="token operator">=</span> data<span class="token punctuation">.</span>rename<span class="token punctuation">(</span>columns<span class="token operator">=</span><span class="token punctuation">{</span><span class="token string">"shortname"</span><span class="token punctuation">:</span><span class="token string">"证券简称"</span><span class="token punctuation">,</span> <span class="token string">"industrycode"</span><span class="token punctuation">:</span><span class="token string">"行业代码"</span><span class="token punctuation">}</span><span class="token punctuation">)</span>
<span class="token comment"># keep only the columns needed for the analysis</span>
data <span class="token operator">=</span> data<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token string">"股票代码"</span><span class="token punctuation">,</span><span class="token string">"会计年度"</span><span class="token punctuation">,</span><span class="token string">"经营讨论与分析内容"</span><span class="token punctuation">,</span><span class="token string">"行业代码"</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
<span class="token comment"># drop the financial industry (industry codes containing "J")</span>
data <span class="token operator">=</span> data<span class="token punctuation">[</span><span class="token operator">~</span>data<span class="token punctuation">[</span><span class="token string">"行业代码"</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token builtin">str</span><span class="token punctuation">.</span>contains<span class="token punctuation">(</span><span class="token string">"J"</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
<span class="token comment"># process only the 2019 texts</span>
data <span class="token operator">=</span> data<span class="token punctuation">[</span>data<span class="token punctuation">[</span><span class="token string">"会计年度"</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token number">2&#48;19</span><span class="token punctuation">]</span>
<span class="token comment"># reset the index</span>
data <span class="token operator">=</span> data<span class="token punctuation">.</span>reset_index<span class="token punctuation">(</span>drop<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
</code></pre>
<p>This gives the following data:</p>
<pre class=" language-python"><code class="prism  language-python">data<span class="token punctuation">.</span>info<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token operator">&lt;</span><span class="token keyword">class</span> <span class="token string">'pandas.core.frame.DataFrame'</span><span class="token operator">&gt;</span>
RangeIndex<span class="token punctuation">:</span> <span class="token number">3568</span> entries<span class="token punctuation">,</span> <span class="token number">&#48;</span> to <span class="token number">3567</span>
Data columns <span class="token punctuation">(</span>total <span class="token number">4</span> columns<span class="token punctuation">)</span><span class="token punctuation">:</span>
股票代码         <span class="token number">3568</span> non<span class="token operator">-</span>null int64
会计年度         <span class="token number">3568</span> non<span class="token operator">-</span>null int64
经营讨论与分析内容    <span class="token number">3568</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
行业代码         <span class="token number">3568</span> non<span class="token operator">-</span>null <span class="token builtin">object</span>
dtypes<span class="token punctuation">:</span> int64<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">object</span><span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span>
memory usage<span class="token punctuation">:</span> <span class="token number">111.6</span><span class="token operator">+</span> KB
</code></pre>
<p><code>Step 3</code>: word segmentation</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">import</span> jieba
<span class="token keyword">import</span> re
<span class="token keyword">def</span> <span class="token function">get_cut_words</span><span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># load the stop-word list</span>
    stop_words <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'停用词.txt'</span><span class="token punctuation">,</span> encoding <span class="token operator">=</span> <span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
        lines <span class="token operator">=</span> f<span class="token punctuation">.</span>readlines<span class="token punctuation">(</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> line <span class="token keyword">in</span> lines<span class="token punctuation">:</span>
            stop_words<span class="token punctuation">.</span>append<span class="token punctuation">(</span>line<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token comment"># segment the text; drop stop words, single-character tokens, and tokens made only of Latin letters, digits, and dots</span>
    cutword <span class="token operator">=</span> <span class="token punctuation">[</span>w <span class="token keyword">for</span> w <span class="token keyword">in</span> jieba<span class="token punctuation">.</span>cut<span class="token punctuation">(</span>content<span class="token punctuation">)</span> <span class="token keyword">if</span> w <span class="token operator">not</span> <span class="token keyword">in</span> stop_words <span class="token operator">and</span> <span class="token builtin">len</span><span class="token punctuation">(</span>w<span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">1</span> <span class="token operator">and</span> <span class="token operator">not</span> re<span class="token punctuation">.</span>match<span class="token punctuation">(</span><span class="token string">'^[a-zA-Z0-9.]*$'</span><span class="token punctuation">,</span>w<span class="token punctuation">)</span><span class="token punctuation">]</span>
    strword <span class="token operator">=</span> <span class="token string">" "</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span>cutword<span class="token punctuation">)</span>
    <span class="token keyword">return</span> strword

data<span class="token punctuation">[</span><span class="token string">'strword'</span><span class="token punctuation">]</span> <span class="token operator">=</span> data<span class="token punctuation">[</span><span class="token string">'经营讨论与分析内容'</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token builtin">apply</span><span class="token punctuation">(</span>get_cut_words<span class="token punctuation">)</span>
</code></pre>
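<p>The token filter used in the segmentation step can be checked on its own, without jieba. The sketch below is illustrative only: the stop-word set, the sample tokens, and the helper name <code>keep_token</code> are hypothetical. It keeps the stop words in a <code>set</code> built once, which avoids re-reading the file and scanning a list for every document.</p>

```python
import re

# Tokens made only of Latin letters, digits and dots are discarded.
PURE_ASCII_TOKEN = re.compile(r'^[a-zA-Z0-9.]*$')

# Hypothetical stop words; in practice they would be read once from the
# stop-word file and stored in a set for fast membership tests.
stop_words = {"公司", "以及"}


def keep_token(token):
    """Keep multi-character tokens that are neither stop words nor
    made up purely of Latin letters, digits and dots."""
    if token in stop_words:
        return False
    if len(token) in (0, 1):
        return False
    return PURE_ASCII_TOKEN.match(token) is None


tokens = ["公司", "研发", "2019", "的", "ERP", "3.5", "新能源", "5G网络"]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

<p>Mixed tokens such as <code>5G网络</code> survive because the pattern only matches tokens that consist entirely of Latin letters, digits, and dots.</p>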
<p><code>Step 4</code>: build the bag-of-words (BoW) matrix</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>feature_extraction<span class="token punctuation">.</span>text <span class="token keyword">import</span> CountVectorizer
countvec <span class="token operator">=</span> CountVectorizer<span class="token punctuation">(</span>min_df <span class="token operator">=</span> <span class="token number">50</span><span class="token punctuation">,</span> max_df <span class="token operator">=</span> <span class="token number">1000</span><span class="token punctuation">)</span> <span class="token comment"># keep only terms that appear in at least 50 annual reports; drop terms that appear in more than 1000</span>

res <span class="token operator">=</span> countvec<span class="token punctuation">.</span>fit_transform<span class="token punctuation">(</span>data<span class="token punctuation">.</span>strword<span class="token punctuation">)</span> <span class="token comment"># sparse BOW matrix (rows: reports, columns: terms)</span>
</code></pre>
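<p>To make the effect of <code>min_df</code> and <code>max_df</code> concrete, here is a minimal sketch in plain Python of the same document-frequency filter on a toy corpus. The function name <code>document_frequency_filter</code>, the toy documents, and the thresholds are illustrative assumptions, not part of the original project; integer thresholds are treated as absolute document counts, as <code>CountVectorizer</code> does.</p>

```python
from collections import Counter


def document_frequency_filter(docs, min_df, max_df):
    """Return the terms whose document frequency lies in [min_df, max_df].

    docs is a list of space-separated token strings (like the strword column).
    Hypothetical helper for illustration only.
    """
    doc_freq = Counter()
    for doc in docs:
        # A term counts once per document, however often it occurs in it
        doc_freq.update(set(doc.split()))
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= max_df)


# Toy corpus: "market" is in every document, "product" in only one
toy_docs = [
    "market competition product",
    "market competition research",
    "market research innovation",
    "market export",
]
vocab = document_frequency_filter(toy_docs, min_df=2, max_df=3)
print(vocab)
```

<p>Terms that appear almost everywhere carry little firm-specific information, and very rare terms mostly add noise and columns, which is the usual motivation for trimming both tails.</p>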
<p><code>Step 5</code>: Normalize the term-frequency vectors</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># Normalize each report's term counts by the report's total word count</span>
<span class="token keyword">import</span> numpy <span class="token keyword">as</span> np

<span class="token keyword">def</span> <span class="token function">normalizer</span><span class="token punctuation">(</span>vec<span class="token punctuation">)</span><span class="token punctuation">:</span>
    denom <span class="token operator">=</span> np<span class="token punctuation">.</span><span class="token builtin">sum</span><span class="token punctuation">(</span>vec<span class="token punctuation">)</span>
    <span class="token keyword">return</span> <span class="token punctuation">[</span><span class="token punctuation">(</span>el <span class="token operator">/</span> denom<span class="token punctuation">)</span> <span class="token keyword">for</span> el <span class="token keyword">in</span> vec<span class="token punctuation">]</span>

<span class="token comment"># Densify the sparse BOW matrix res so that each row is one report's count vector</span>
doc_term_matrix <span class="token operator">=</span> res<span class="token punctuation">.</span>toarray<span class="token punctuation">(</span><span class="token punctuation">)</span>
doc_term_matrix_normalizer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
<span class="token keyword">for</span> vec <span class="token keyword">in</span> doc_term_matrix<span class="token punctuation">:</span>
    doc_term_matrix_normalizer<span class="token punctuation">.</span>append<span class="token punctuation">(</span>normalizer<span class="token punctuation">(</span>vec<span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>np<span class="token punctuation">.</span>array<span class="token punctuation">(</span>doc_term_matrix_normalizer<span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
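<p>The element-by-element loop can also be written as one vectorized NumPy operation. The sketch below is an illustration on toy counts, not the original project code; the name <code>normalize_rows</code> is hypothetical. It also guards against a report whose retained terms sum to zero, which can happen once rare and common terms have been filtered out and would otherwise cause a division by zero.</p>

```python
import numpy as np


def normalize_rows(counts):
    """Divide each row by its row sum; all-zero rows stay all-zero."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Replace zero totals with 1 so that empty rows are left as zeros
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals


# Three toy reports over a three-term vocabulary; the last one is empty
toy_counts = np.array([[0, 1, 4], [2, 2, 0], [0, 0, 0]])
shares = normalize_rows(toy_counts)
print(shares)
```

<p>Each non-empty row of <code>shares</code> sums to one, so reports of different lengths become comparable.</p>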
<p><code>Step 6</code>: Industry and market normalized vectors</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># data1 is assumed to hold one row per report: code, year, ind, then one normalized-frequency column per term</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>data1<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    df_firm <span class="token operator">=</span> data1<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span>i<span class="token punctuation">:</span>i<span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">]</span>
    df_firm <span class="token operator">=</span> df_firm<span class="token punctuation">.</span>melt<span class="token punctuation">(</span>id_vars<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">'code'</span><span class="token punctuation">,</span><span class="token string">'year'</span><span class="token punctuation">,</span><span class="token string">'ind'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>    <span class="token comment"># identifier columns to keep</span>
                            var_name<span class="token operator">=</span><span class="token string">"wordid"</span><span class="token punctuation">,</span>   <span class="token comment"># column that receives the former column labels (term ids)</span>
                            value_name<span class="token operator">=</span><span class="token string">"freq"</span><span class="token punctuation">)</span>   <span class="token comment"># column that receives the normalized frequencies</span>
   
    ind_matrix <span class="token operator">=</span> data1<span class="token punctuation">[</span><span class="token punctuation">(</span>data1<span class="token punctuation">[</span><span class="token string">"ind"</span><span class="token punctuation">]</span> <span class="token operator">==</span> data1<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"ind"</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&amp;</span> <span class="token punctuation">(</span>data1<span class="token punctuation">[</span><span class="token string">"code"</span><span class="token punctuation">]</span> <span class="token operator">!=</span> data1<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"code"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    ind_matrix <span class="token operator">=</span> ind_matrix<span class="token punctuation">.</span>drop<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"code"</span><span class="token punctuation">,</span><span class="token string">"year"</span><span class="token punctuation">,</span><span class="token string">"ind"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>axis<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span>
    normind <span class="token operator">=</span> ind_matrix<span class="token punctuation">.</span>mean<span class="token punctuation">(</span>axis <span class="token operator">=</span> <span class="token number">&#48;</span><span class="token punctuation">)</span>
   
    market_matrix <span class="token operator">=</span> data1<span class="token punctuation">[</span>data1<span class="token punctuation">[</span><span class="token string">"ind"</span><span class="token punctuation">]</span> <span class="token operator">!=</span> data1<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"ind"</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
    market_matrix <span class="token operator">=</span> market_matrix<span class="token punctuation">.</span>drop<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"code"</span><span class="token punctuation">,</span><span class="token string">"year"</span><span class="token punctuation">,</span><span class="token string">"ind"</span><span class="token punctuation">]</span><span class="token punctuation">,</span>axis<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span>
    normmarket <span class="token operator">=</span> market_matrix<span class="token punctuation">.</span>mean<span class="token punctuation">(</span>axis <span class="token operator">=</span> <span class="token number">&#48;</span><span class="token punctuation">)</span>

    df_firm<span class="token punctuation">[</span><span class="token string">"freq_ind"</span><span class="token punctuation">]</span> <span class="token operator">=</span> normind<span class="token punctuation">.</span>tolist<span class="token punctuation">(</span><span class="token punctuation">)</span>
    df_firm<span class="token punctuation">[</span><span class="token string">"freq_market_ind"</span><span class="token punctuation">]</span> <span class="token operator">=</span> normmarket<span class="token punctuation">.</span>tolist<span class="token punctuation">(</span><span class="token punctuation">)</span>
   
    <span class="token comment"># index=False omits the row index; .xlsx is used because pandas 2.0+ no longer writes legacy .xls files</span>
    df_firm<span class="token punctuation">.</span>to_excel<span class="token punctuation">(</span><span class="token string">"{}词条.xlsx"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>data1<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"code"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>
</code></pre>
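<p>The loop above re-filters the whole table for every firm. The same two benchmark vectors can be obtained from column sums, as in the NumPy sketch below. It assumes a single cross-section with one row per firm, so that excluding a firm's own <code>code</code> is the same as excluding its own row; the names <code>peer_vectors</code>, <code>shares</code>, and <code>industries</code> are hypothetical and the data are toy values.</p>

```python
import numpy as np


def peer_vectors(shares, industries):
    """Leave-one-out industry mean and other-industry (market) mean per firm.

    shares:     (n_firms, n_terms) normalized term-frequency matrix
    industries: length-n_firms sequence of industry labels
    Rows with no peers are returned as NaN.
    """
    shares = np.asarray(shares, dtype=float)
    industries = np.asarray(industries)
    n_firms = shares.shape[0]
    total_sum = shares.sum(axis=0)
    ind_vecs = np.full(shares.shape, np.nan)
    mkt_vecs = np.full(shares.shape, np.nan)
    for ind in np.unique(industries):
        members = industries == ind
        n_ind = int(members.sum())
        ind_sum = shares[members].sum(axis=0)
        if n_ind > 1:
            # Industry mean excluding the firm itself
            ind_vecs[members] = (ind_sum - shares[members]) / (n_ind - 1)
        if n_firms > n_ind:
            # Mean over all firms in other industries
            mkt_vecs[members] = (total_sum - ind_sum) / (n_firms - n_ind)
    return ind_vecs, mkt_vecs


toy_shares = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
toy_industries = ["A", "A", "A", "B"]
ind_vecs, mkt_vecs = peer_vectors(toy_shares, toy_industries)
print(ind_vecs[0], mkt_vecs[0])
```

<p>With several years of data, the same idea would have to be applied within each year; that grouping is left out here to keep the sketch short.</p>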
<h3 id="stata代码实现">Stata implementation</h3>
<p>The Python steps above produce, for each firm, a term-level table containing its normalized term-frequency vector, the industry normalized vector, and the market normalized vector, as shown in the figure below.</p>
<p><img src="https://files.mdnice.com/user/1&#48;654/a876ef38-588f-4d5a-a5a1-e17395ad644d.png" alt="Term-level information for a specific firm"></p>
<p>We then import each firm's table into Stata and regress the firm's normalized vector on the industry and market normalized vectors. The sum of the absolute residuals across all term dimensions is the firm's information content. Pooling the regression results of all firms gives the final data set, whose descriptive statistics are shown below.</p>
<p><img src="https://files.mdnice.com/user/1&#48;654/163b71ac-d274-47d8-9bcc-72b99&#48;da&#48;26&#48;.png" alt="Descriptive statistics of the regression results"></p>
<p>The descriptive statistics show that both the industry and the market normalized vectors obtain positive regression coefficients that are significant at the 1% level, with t-statistics of 18.260 and 3.971 respectively. This indicates that most of the text in listed firms' MD&amp;A overlaps heavily with market-wide text and, especially, with industry text. Our estimates of information content and standard information are broadly in line with those of Meng Qingbin (孟庆斌) et al. (China Industrial Economics, 2017); the remaining differences may come from the sample period and details of data processing.</p>
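<p>For readers who prefer to stay in Python, the per-firm regression can be sketched with NumPy least squares. This is an illustration under stated assumptions rather than the Stata code used for the results above: it assumes the regression includes an intercept, the function name <code>information_content</code> is hypothetical, and the vectors are toy values.</p>

```python
import numpy as np


def information_content(firm_vec, ind_vec, mkt_vec):
    """OLS of the firm vector on the industry and market vectors.

    Returns the coefficients (intercept, industry, market) and the sum of
    absolute residuals across term dimensions.
    """
    y = np.asarray(firm_vec, dtype=float)
    # Assumption: an intercept column is included in the regression
    X = np.column_stack([np.ones_like(y), ind_vec, mkt_vec])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(np.abs(resid).sum())


ind = np.array([0.1, 0.2, 0.3, 0.4])
mkt = np.array([0.4, 0.1, 0.3, 0.2])

# A firm that is an exact blend of its peers carries no residual information
blend = 0.5 * ind + 0.5 * mkt
_, info_blend = information_content(blend, ind, mkt)

# A firm with an idiosyncratic word distribution leaves a positive residual
distinct = np.array([0.7, 0.1, 0.1, 0.1])
_, info_distinct = information_content(distinct, ind, mkt)
print(info_blend, info_distinct)
```

<p>Each term is one observation in this regression, so a firm's residual sum grows with the share of its wording that neither benchmark explains.</p>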
<h2 id="计量拓展">Econometric extensions</h2>
<p>The measure of management-discussion information content described in this post has some shortcomings. We suggest two improvements:</p>
<ol>
<li>Separating a firm's normalized vector along only two dimensions, industry and market, is clearly incomplete. A firm's MD&amp;A text often overlaps heavily with the MD&amp;A text in its own annual reports from previous years, and investors gain no incremental information from that repeated text. We therefore suggest adding the firm's normalized MD&amp;A vectors from the past three years as explanatory variables, so that the text investors did not anticipate is isolated more accurately.</li>
<li>For a large corpus, the BOW matrix contains a very large share of zeros, which leads to the so-called "curse of dimensionality" and harms the estimates. We therefore suggest applying a dimension-reduction technique during the computation, such as a word2vec neural-network model, and running the corresponding regressions on the lower-dimensional vectors.</li>
</ol>
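<p>As a rough illustration of the second suggestion, the sketch below replaces a wide matrix with a few dense components using a truncated singular value decomposition. Note that this is a different and simpler technique than word2vec, chosen only because it fits in a few lines of NumPy; the function name <code>reduce_dimensions</code>, the random toy matrix, and the choice of three components are all assumptions.</p>

```python
import numpy as np


def reduce_dimensions(shares, k):
    """Project each row onto the top-k singular directions of the matrix."""
    shares = np.asarray(shares, dtype=float)
    U, s, Vt = np.linalg.svd(shares, full_matrices=False)
    # Row coordinates in the k-dimensional space spanned by the first k rows of Vt
    return U[:, :k] * s[:k]


# Toy data: 6 reports over a 20-term vocabulary, rows normalized to sum to one
rng = np.random.default_rng(0)
toy = rng.random((6, 20))
toy = toy / toy.sum(axis=1, keepdims=True)

reduced = reduce_dimensions(toy, k=3)
full = reduce_dimensions(toy, k=6)
print(reduced.shape)
```

<p>Keeping all components preserves the inner products between reports exactly, while a small <code>k</code> trades some of that information for far fewer columns in the regression.</p>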
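<p>Given the limits of our own knowledge and ability, the methods and views above may contain errors; corrections from readers are welcome.</p>
<p><strong>The full project code is available by replying with the keyword “管理层讨论信息含量” in the official account's backend</strong></p>
<blockquote>
<p>Official account: 财会程序猿的笔记 (ID: wylcfy2014)<br>
Occasional posts on: Python + Stata | text analysis + machine learning | finance + accounting</p>
</blockquote>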
</div>
</div>