2022-04-12
---
Title:
Using Variational Inference and MapReduce to Scale Topic Modeling
---
Authors:
Ke Zhai, Jordan Boyd-Graber, and Nima Asadi
---
Latest submission year:
2011
---
Classification:

Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Secondary category: Distributed, Parallel, and Cluster Computing
Category description: Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.

---
Abstract:
  Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for exploring document collections. Because of the increasing prevalence of large datasets, there is a need to improve the scalability of inference for LDA. In this paper, we propose a technique called MapReduce LDA (Mr. LDA) to accommodate very large corpus collections in the MapReduce framework. In contrast to other techniques to scale inference for LDA, which use Gibbs sampling, we use variational inference. Our solution efficiently distributes computation and is relatively simple to implement. More importantly, this variational implementation, unlike highly tuned and specialized implementations, is easily extensible. We demonstrate two extensions of the model possible with this scalable framework: informed priors to guide topic discovery and modeling topics from a multilingual corpus.
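The map/reduce split the abstract describes can be sketched in a few lines: each mapper runs the per-document variational update (the document's topic proportions γ and word-topic responsibilities φ) independently, and the reducer sums the emitted expected topic-word counts into new topic-word weights λ. This is a minimal single-machine simulation assuming the standard variational updates for LDA; all names (`mapper`, `reducer`, `lam`, `eta`) are illustrative, not from the paper's implementation.

```python
import math
from collections import defaultdict

def digamma(x):
    """Digamma via recurrence shift plus a standard asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + math.log(x) - 1.0 / (2.0 * x) - 1.0 / (12.0 * x * x)

def mapper(doc, lam, alpha, num_topics, iters=20):
    """Per-document variational update; runs independently on one worker.

    doc: dict mapping word -> count for a single document.
    lam: current topic-word weights, lam[k][w].
    Emits expected topic-word counts keyed by (topic, word)."""
    total = sum(doc.values())
    gamma = [alpha + total / num_topics] * num_topics
    stats = defaultdict(float)
    for _ in range(iters):
        stats = defaultdict(float)
        new_gamma = [alpha] * num_topics
        for w, n in doc.items():
            # phi_{wk} is proportional to lam[k][w] * exp(digamma(gamma_k))
            phi = [lam[k].get(w, 1e-10) * math.exp(digamma(gamma[k]))
                   for k in range(num_topics)]
            z = sum(phi)
            for k in range(num_topics):
                p = phi[k] / z
                new_gamma[k] += n * p
                stats[(k, w)] += n * p
        gamma = new_gamma
    return stats

def reducer(all_stats, eta=0.01):
    """Aggregation step: sum sufficient statistics from all mappers into
    new topic-word weights (eta plays the role of a symmetric prior)."""
    lam = defaultdict(lambda: defaultdict(lambda: eta))
    for stats in all_stats:
        for (k, w), v in stats.items():
            lam[k][w] += v
    return lam
```

Because each document's update touches only that document and the shared λ, the mappers are embarrassingly parallel; only the summation in the reducer requires communication, which is what makes the variational approach fit MapReduce naturally.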
---
PDF link:
https://arxiv.org/pdf/1107.3765