2015-03-15
Latent Dirichlet Allocation

David M. Blei                                                               blei@cs.berkeley.edu
Computer Science Division
University of California
Berkeley, CA 94720, USA

Andrew Y. Ng                                                                ang@cs.stanford.edu
Computer Science Department
Stanford University
Stanford, CA 94305, USA

Michael I. Jordan                                                           jordan@cs.berkeley.edu
Computer Science Division and Department of Statistics
University of California
Berkeley, CA 94720, USA


Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

1. Introduction

In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.  

Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora—a methodology successfully deployed in modern Internet search engines—reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a word in the entire corpus (generally on a log scale, and again suitably normalized). The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.
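To make the tf-idf reduction concrete, here is a minimal sketch using scikit-learn (not part of the paper; the toy documents are invented). Note that scikit-learn stores documents as rows, i.e. the transpose of the term-by-document matrix X described above.

```python
# Minimal tf-idf sketch: each document becomes a fixed-length vector of
# tf-idf weights over a chosen vocabulary of terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # shape: (n_documents, vocabulary_size)
print(vectorizer.get_feature_names_out())    # the vocabulary of "words" or "terms"
print(X.toarray())                           # tf-idf values, one row per document
```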

While the tf-idf reduction has some appealing features—notably in its basic identification of sets of words that are discriminative for documents in the collection—the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990). LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.
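As a rough illustration of the LSI idea (again with scikit-learn and invented documents, not the experiments of the paper), a truncated SVD of the tf-idf matrix projects every document onto a small number of latent dimensions:

```python
# Rough LSI sketch: truncated SVD of the tf-idf matrix keeps only the
# directions that capture most of the variance in the collection.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gene expression in dna sequences",
    "dna and gene regulation",
    "stock markets and trading volume",
    "markets react to trading news",
]

X = TfidfVectorizer().fit_transform(docs)

lsi = TruncatedSVD(n_components=2)       # keep the top-2 singular directions
X_reduced = lsi.fit_transform(X)         # each document as a 2-dimensional vector
print(X_reduced)
print(lsi.explained_variance_ratio_)     # variance captured by each direction
```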

To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative model of text, however, it is not clear why one should adopt the LSI methodology—one can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.

A significant step forward in this regard was made by Hofmann (1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI approach, which we describe in detail in Section 4.3, models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of “topics.” Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the “reduced description” associated with the document.
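The sampling story behind the aspect model can be made concrete with a toy sketch (the numbers below are invented for illustration, not a fitted model): for each word position, first pick a topic from the document's mixing proportions, then pick a word from that topic's distribution.

```python
# Toy pLSI/aspect-model sampling sketch: each word in a document is drawn
# from a single topic, but different words may come from different topics.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "market", "stock"]

p_topic_given_doc = np.array([0.7, 0.3])            # mixing proportions for one document
p_word_given_topic = np.array([
    [0.5, 0.5, 0.0, 0.0],                           # topic 0: "biology" words
    [0.0, 0.0, 0.5, 0.5],                           # topic 1: "finance" words
])

words = []
for _ in range(10):                                 # 10 word positions
    z = rng.choice(2, p=p_topic_given_doc)          # topic for this position
    w = rng.choice(4, p=p_word_given_topic[z])      # word drawn from that topic
    words.append(vocab[w])
print(words)
```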

While Hofmann’s work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set.

To see how to proceed beyond pLSI, let us consider the fundamental probabilistic assumptions underlying the class of dimensionality reduction methods that includes LSI and pLSI. All of these methods are based on the “bag-of-words” assumption—that the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods also assume that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected.

A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution—in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This line of thinking leads to the latent Dirichlet allocation (LDA) model that we present in the current paper.

It is important to emphasize that an assumption of exchangeability is not equivalent to an assumption that the random variables are independent and identically distributed. Rather, exchangeability essentially can be interpreted as meaning “conditionally independent and identically distributed,” where the conditioning is with respect to an underlying latent parameter of a probability distribution. Conditionally, the joint distribution of the random variables is simple and factored, while marginally over the latent parameter, the joint distribution can be quite complex. Thus, while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text modeling, and its principal justification is that it leads to methods that are computationally efficient, the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency counts or linear operations. We aim to demonstrate in the current paper that, by taking the de Finetti theorem seriously, we can capture significant intra-document statistical structure via the mixing distribution.
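Written out for a sequence of exchangeable words w_1, ..., w_N, the de Finetti representation takes the form below: conditionally on a latent parameter θ the words are i.i.d., and any dependence between them arises only from integrating θ out.

p(w_1, \dots, w_N) = \int \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) p(\theta) \, d\theta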

It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability, including various forms of partial exchangeability, and that representation theorems are available for these cases as well (Diaconis, 1988). Thus, while the work that we discuss in the current paper focuses on simple “bag-of-words” models, which lead to mixture distributions for single words (unigrams), our methods are also applicable to richer models that involve mixtures for larger structural units such as n-grams or paragraphs.

The paper is organized as follows. In Section 2 we introduce basic notation and terminology. The LDA model is presented in Section 3 and is compared to related latent variable models in Section 4. We discuss inference and parameter estimation for LDA in Section 5. An illustrative example of fitting LDA to data is provided in Section 6. Empirical results in text modeling, text classification and collaborative filtering are presented in Section 7. Finally, Section 8 presents our conclusions.

All replies
2015-3-16 00:34:25

EM and VBEM

On the Slow Convergence of EM and VBEM in Low-Noise Linear Models

We analyze convergence of the expectation maximization (EM) and variational Bayes EM (VBEM) schemes for parameter estimation in noisy linear models. The analysis shows that both schemes are inefficient in the low-noise limit. The linear model with additive noise includes as special cases independent component analysis, probabilistic principal component analysis, factor analysis, and Kalman filtering. Hence, the results are relevant for many practical applications.

1 Introduction

The expectation maximization (EM) algorithm introduced by Dempster, Laird, and Rubin (1977) is widely used for maximum likelihood estimation in hidden variable models. More recently, a generalization of the EM algorithm, the so-called variational Bayes EM algorithm (VBEM), has been introduced (see, e.g., Attias, 1999), which allows more accurate modeling of parameter uncertainty. EM convergence is known to slow dramatically when the signal-to-noise ratio is high, and a natural question is then: Will the more accurate modeling of parameter variance in VBEM assist the convergence? Here we analyze both schemes and show that they are subject to slow convergence in the low-noise limit.
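As a hedged toy illustration of the effect the abstract describes (my own scalar example, not one of the models analyzed in the paper): for a one-dimensional linear model x = a·s + noise with s ~ N(0,1) and the noise variance held fixed and known, EM needs many more iterations to estimate the loading a as the noise variance shrinks.

```python
# Toy EM for x = a*s + noise, s ~ N(0,1), noise ~ N(0, sigma2), sigma2 known.
# Only 'a' is estimated; the iteration count grows as sigma2 shrinks.
import numpy as np

rng = np.random.default_rng(0)

def em_iterations(sigma2, a_true=2.0, n=5000, tol=1e-6, max_iter=100_000):
    s = rng.standard_normal(n)
    x = a_true * s + np.sqrt(sigma2) * rng.standard_normal(n)
    a = 0.5                                     # deliberately poor initial guess
    for it in range(1, max_iter + 1):
        # E-step: posterior mean/variance of each latent s_i given x_i
        post_var = sigma2 / (a**2 + sigma2)
        post_mean = a * x / (a**2 + sigma2)
        # M-step: re-estimate the loading a
        a_new = np.sum(x * post_mean) / np.sum(post_mean**2 + post_var)
        if abs(a_new - a) < tol:
            return it
        a = a_new
    return max_iter

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    print(sigma2, em_iterations(sigma2))
```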


2015-3-16 01:07:44

High-Performance Computing and Parallel Programming: MPI Parallel Program Design

The book first covers the fundamentals of parallel program design, giving readers the basic knowledge needed to write parallel programs. It then introduces the core features of MPI, starting from simple examples to show the basic workflow and structure of an MPI program; this part is easy to follow for anyone with experience writing serial programs in C or FORTRAN. Next come the advanced features of MPI programming, which readers who have mastered basic MPI programming need in order to write concise, efficient MPI programs and to use the more advanced and complex MPI facilities. The final part covers MPI's latest development and extension, MPI-2, which consists mainly of three components: dynamic process management, remote memory access, and parallel file I/O.
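For readers of this Python board, here is a minimal MPI sketch via the mpi4py bindings (the book itself works in C/FORTRAN; this just shows the same basic pattern of ranks within a communicator). Run it with something like `mpiexec -n 4 python hello_mpi.py` (the filename is only an example).

```python
# Minimal MPI sketch using mpi4py: every process reports its rank, and the
# ranks are summed onto process 0 with a collective reduce.
from mpi4py import MPI

comm = MPI.COMM_WORLD           # default communicator containing all processes
rank = comm.Get_rank()          # this process's id within the communicator
size = comm.Get_size()          # total number of processes

# Each process contributes its rank; the sum arrives only at root (rank 0).
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print(f"{size} processes, sum of ranks = {total}")
```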

