Latent Dirichlet Allocation note

LDA model: method description

LDA (Latent Dirichlet Allocation) is a widely used topic modeling method, commonly applied to discover hidden topic structure in large collections of text. It is a generative model based on a probabilistic graphical model: each document is assumed to be composed of several topics, and each topic is composed of many words. By inferring the probability distributions over documents and topics, LDA recovers the relationship between them.

In LDA, the number of topics K must be specified first, and assumptions are then placed on the distributions linking documents and topics. LDA uses Dirichlet priors to model the relationships among documents, topics, and words. The Dirichlet distribution is a multivariate probability distribution over probability vectors, i.e., a distribution over discrete distributions. In LDA, each topic's word distribution is drawn from a Dirichlet prior, and each document is represented by a probability distribution over topics, which is likewise given a Dirichlet prior.

Training an LDA model can be divided into three stages: initialization, iteration, and inference. In the initialization stage, the number of topics K is fixed, and every word in every document is randomly assigned a topic; this initializes the relationship between documents and topics. In the iteration stage, the probability distributions linking topics and documents are updated repeatedly: for each word in each document, the probability of the word under the current topic assignments is computed, and the word's topic is then reassigned according to that probability. This procedure is usually implemented with Gibbs sampling: given the current topic assignments, a new topic for each word is drawn from its conditional distribution. Over many such iterations, the relationship between documents and topics is gradually refined.

In the inference stage, the relationships between documents and topics are obtained from the inferred probability distributions, again typically with Gibbs sampling. By sampling a topic for each word of a document, one obtains the probability that the word belongs to each topic, and from these distributions the document's relationship to the topics can be inferred.
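A minimal sketch of the collapsed Gibbs sampling procedure described above, assuming documents are given as lists of integer word ids and using symmetric Dirichlet hyperparameters alpha and beta (the function name and default values are made up for illustration):

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # topic counts per document
    nkw = np.zeros((K, V))          # word counts per topic
    nk = np.zeros(K)                # total words assigned to each topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initial topics

    for d, doc in enumerate(docs):          # count the initial assignments
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                  # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional probability of each topic given all other assignments
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # add the new assignment back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)  # doc-topic
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)      # topic-word
    return theta, phi
```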

LDA has a very wide range of applications, including text classification, sentiment analysis, and topic discovery. In text classification, for example, the inferred topics can help identify which category a document belongs to.

Latent Dirichlet Allocation: a detailed introduction

What is Latent Dirichlet Allocation? Latent Dirichlet Allocation (LDA) is a generative model used to organize text, or other kinds of discrete data, into a topic model.

It was first proposed by Blei et al. in 2003 and has become one of the standard methods for topic mining in natural language processing. Its basic idea is that each document is composed of several topics, and each topic is in turn composed of several words.

LDA is built on the Dirichlet distribution, a probability distribution used to model distributions over multiple discrete outcomes. In LDA, Dirichlet distributions model both the document-topic distribution and the topic-word distribution; by computing these distributions we can infer each document's topics and each topic's word distribution. Under the assumption that every document is a mixture of topics and every topic is composed of words, the goal of LDA is to find each document's topic distribution and each topic's word distribution, from which we can infer which topics a document contains and which words a topic contains.

The LDA workflow consists of the following steps:

1. Data preprocessing: the raw text is first cleaned by removing punctuation, stop words, and digits, and by tokenizing, so that the text is in a form that can be processed.

2. Building a bag-of-words model: a bag-of-words model records every distinct word in the documents together with its counts; this representation is the input for the subsequent topic modeling.

3. Setting the parameters and the number of topics: the LDA parameters are chosen, including the number of topics and the number of iterations. The number of topics is an important parameter that determines how finely the model resolves the latent topics in the documents.

4. Training the LDA model: the model is trained with Gibbs sampling or a similar method. By repeatedly sampling a topic for each word, the document-topic and topic-word distributions are obtained; after enough iterations the topic model stabilizes.

5. Topic analysis and visualization: by analyzing each document's topic distribution and each topic's word distribution, one can infer the topics of each document and what each topic means. Visualization tools such as word clouds and topic-distribution plots can be used to present the results. A minimal end-to-end sketch of these steps is given below.
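As an illustration of steps 1-5, here is a minimal sketch using the gensim library (the toy documents, stop-word list, and parameter values are made up; note that gensim's LdaModel trains with online variational Bayes rather than Gibbs sampling):

```python
from gensim import corpora, models

docs = [
    "the movie was shot with new camera technology",
    "the team won the football match yesterday",
    "the studio released a new film about science",
]
stopwords = {"the", "was", "with", "a", "about"}

# 1. preprocess: lowercase, tokenize, drop stop words
texts = [[w for w in d.lower().split() if w not in stopwords] for d in docs]

# 2. bag of words: vocabulary plus per-document word counts
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# 3-4. choose the number of topics and train
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

# 5. inspect the topic-word and document-topic distributions
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
print(lda.get_document_topics(corpus[0]))
```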

LatentDirichletAllocation learning_method

What is Latent Dirichlet Allocation (LDA)? LDA is an unsupervised machine learning model used to discover the topic structure of a collection of texts.

This note starts from the basic concepts of LDA, then introduces its learning method and applications, and finally discusses its strengths, weaknesses, and future directions.

Part 1: basic concepts. Latent Dirichlet Allocation (LDA) is a generative model that describes the topic structure of a text collection. It assumes that each document is made up of several topics and that each topic is made up of several words. By learning the relationship between documents and words, LDA automatically discovers the latent topics and apportions each text among them.

The core idea of the model is that each word in a document is generated at random from some topic. Concretely, for each word position in a document, LDA first draws a topic from the document's topic distribution and then draws a word from that topic's word distribution. This is called the generative process; inverting it through inference yields the parameters of the topic distribution and the word distributions.

Part 2: the learning method. Learning in LDA means inferring, from the words observed in the text collection, the topic distribution and word distributions most likely to have generated those texts. The learning procedure can be carried out with an EM-style algorithm. First, each document's topic assignments and each topic's word distribution are initialized; then the following two steps are iterated:

1. E-step (Expectation): given the current parameter estimates, compute the probability that each word in each document belongs to each topic. This probability is computed from the current topic and word distributions.

2. M-step (Maximization): using the probabilities computed in the E-step, update the parameters of the topic distribution and the word distributions.

Repeating these iterations gradually improves the parameter estimates and yields more accurate topic and word distributions. (In practice the exact posterior is intractable, so implementations rely on approximations such as variational EM or Gibbs sampling; a scikit-learn sketch follows.)
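The section title refers to scikit-learn's LatentDirichletAllocation estimator, whose learning_method parameter chooses between batch and online variational Bayes (scikit-learn does not use Gibbs sampling for LDA). A minimal sketch with made-up toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the film used new camera technology",
    "the team won the football match",
    "the studio released a new science film",
]

# bag-of-words counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# learning_method="online" updates on mini-batches;
# learning_method="batch" re-estimates from the full corpus each iteration
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                max_iter=20, random_state=0)
doc_topic = lda.fit_transform(X)      # document-topic proportions
topic_word = lda.components_          # unnormalized topic-word weights

vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_word):
    top = weights.argsort()[::-1][:5]
    print(k, [vocab[i] for i in top])
```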

Part 3: applications. LDA is widely used in text mining, information retrieval, and natural language processing. By automatically discovering the topic structure of a text collection, it helps in understanding the content of, and relationships within, large bodies of text. In text classification, for example, the topics assigned to documents can be used to classify them automatically.

The LDA model

LDA (topic model): algorithm and concepts. We first introduce the notion of a topic model (Topic Model).

What is a "topic"? Informally it is simply the central idea expressed by an article, a passage, or a sentence. From the point of view of a statistical model, however, a topic is characterized by a particular distribution over word frequencies, and an article, passage, or sentence is regarded as having been generated from a probabilistic model.

LDA can be used to identify the latent topic information in a large document collection or corpus. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, turning text into numerical information that is easy to model. LDA (Latent Dirichlet Allocation) is a generative topic model for documents, often described as a three-level Bayesian probability model with word, topic, and document layers. As a generative model, each document represents a probability distribution over topics, and each topic in turn represents a probability distribution over many words.

The generative flow is: a document is built from some topics, and each topic contributes (single) words. For each document in the corpus, LDA defines the following generative process:

1. For the document, draw a topic from its topic distribution.
2. Draw a word from the word distribution associated with the drawn topic.
3. Repeat the two steps above until every word of the document has been generated.

The distribution of topics z within a document d is called the topic distribution, and it is a multinomial distribution. The distribution of words w under a topic z is called the word distribution, and it too is a multinomial distribution.

Going deeper: understanding LDA can be broken into five steps: (1) one function: the gamma function; (2) four distributions: the binomial, multinomial, beta, and Dirichlet distributions; (3) one concept and one framework: conjugate priors and the Bayesian framework; (4) two models: pLSA and LDA; (5) one sampling method: Gibbs sampling. The original article works through these five steps so that the reader ends up with as clear and complete a picture of LDA as possible.

Common text mining tools in natural language processing (6)

Natural Language Processing (NLP) is an important branch of artificial intelligence devoted to enabling computers to understand, process, and generate natural language.

Text mining is an important application area of NLP: it uses technical means to extract valuable information from massive amounts of text, providing powerful support for decision making, business intelligence, and related fields. Text mining relies on various tools to analyze, extract from, and model text; this article introduces the text mining tools commonly used in NLP.

1. Word segmentation tools. Segmentation is the foundation of text mining: it splits a continuous text sequence into meaningful words or phrases. It is especially important for Chinese, because Chinese words are not separated by spaces the way English words are. Common Chinese segmentation tools include jieba and HanLP.

jieba is a Python-based Chinese segmentation tool that is simple to use and produces good segmentation results.
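A minimal sketch of segmentation with jieba, assuming the library is installed (the sample sentence is arbitrary):

```python
import jieba

text = "自然语言处理是人工智能的一个重要分支"
print(jieba.lcut(text))                  # exact mode: returns a list of words
print(jieba.lcut(text, cut_all=True))    # full mode: all possible word segments
```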

HanLP is a natural language processing toolkit developed by the HIT-iFLYTEK joint laboratory; in addition to segmentation it offers part-of-speech tagging, named entity recognition, and other functions, making it a feature-rich text processing tool.

2. Part-of-speech tagging tools. POS tagging labels each segmented word with its grammatical role in the sentence, such as noun, verb, or adjective. It is essential for understanding text semantics and for tasks such as information extraction. Common POS tagging tools include NLTK and Stanford NLP.

NLTK is a Python natural language processing toolkit that provides rich corpora and algorithm libraries, including POS tagging and syntactic parsing. Stanford NLP, developed at Stanford University, not only provides efficient POS tagging but also dependency parsing, semantic role labeling, and more, making it a powerful text processing toolkit.
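A minimal sketch of POS tagging with NLTK (the tokenizer and tagger models are downloaded on first use):

```python
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

tokens = nltk.word_tokenize("Topic models discover hidden structure in text.")
print(nltk.pos_tag(tokens))    # prints (token, tag) pairs such as ('models', 'NNS')
```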

3. Named entity recognition tools. Entity recognition extracts named entities (person names, place names, organization names, and so on) from text; it is essential for information extraction, knowledge graph construction, and similar tasks. Common entity recognition tools include LTP and spaCy. LTP, developed by the HIT Language Cloud laboratory, provides Chinese named entity recognition, dependency parsing, and other functions.
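A minimal sketch of named entity recognition with spaCy, assuming the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("David Blei introduced LDA at the University of California, Berkeley in 2003.")
for ent in doc.ents:
    print(ent.text, ent.label_)    # entity span and its type, e.g. PERSON, ORG, DATE
```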

Interpreting overall term frequency in LDA

(Practical guide) Contents: 1. Introduction to LDA; 2. Definition of overall term frequency; 3. What overall term frequency is used for; 4. How overall term frequency is computed; 5. Conclusion.

1. Introduction to LDA. LDA (Latent Dirichlet Allocation) is a topic model used to discover the latent topic structure in a collection of documents.

LDA assumes that documents are generated from several topics and that each topic is composed of words. Under this assumption, LDA can explain the word distribution within documents well and assign appropriate topic labels to them.

2. Definition. In LDA, the overall term frequency (OTF) of a word is the frequency with which the word occurs across the entire document collection. It can be used to gauge how important a word is in the documents and what role it plays in the LDA model.

3. What it is used for. Overall term frequency serves several purposes in LDA:

(1) Measuring word importance: the larger a word's OTF value, the more prominent the word is in the collection, and the more likely it is to appear in several topics.

(2) Influencing topic assignment: in LDA each document is assigned to one or more topics, and how well a document matches a topic depends on how similar the document's words are to the topic's words; OTF feeds into that comparison of words and topics and thus influences how documents are assigned.

(3) Selecting keywords: by computing OTF values one can pick out the words that occur most frequently in the collection, which may be highly discriminative for the documents' topics.

4. How it is computed. The calculation is simple: overall term frequency equals the number of times a word occurs across all documents divided by the total number of documents:

OTF(word) = count(word) / n

where count(word) is the number of occurrences of the word in all documents and n is the total number of documents.
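A minimal sketch of this calculation over a toy tokenized corpus, following the formula above (note that some visualization tools instead report the raw corpus-wide count as the overall term frequency):

```python
from collections import Counter

docs = [
    ["topic", "model", "text", "topic"],
    ["text", "mining", "topic"],
    ["word", "distribution", "model"],
]

n = len(docs)                                      # total number of documents
counts = Counter(w for doc in docs for w in doc)   # count(word) over all documents

otf = {w: c / n for w, c in counts.items()}        # OTF(word) = count(word) / n
print(otf["topic"])                                # 3 occurrences / 3 documents = 1.0
```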

Principles and applications of the LDA model

The Dirichlet distribution is the conjugate prior of the multinomial distribution:

p(θ | x) ∝ p(x | θ) p(θ)

Probability distributions: conjugate priors (Conjugate Prior). For a probability distribution (or density) p(x | θ), a prior p(θ) is called a conjugate prior of p(x | θ) if the posterior p(θ | x) has the same functional form as p(θ). Every member of the exponential family has a conjugate prior of the form

p(η | χ, ν) = f(χ, ν) g(η)^ν exp(ν η^T χ)

where g is the normalizing factor of the likelihood and χ, ν are the parameters of the prior.

Probability distributions: exchangeability and the de Finetti theorem. Random variables z_1, z_2, ..., z_n are said to be exchangeable if, for every permutation π,

p(z_1, z_2, ..., z_n) = p(z_π(1), z_π(2), ..., z_π(n)).

The general EM procedure. Given a joint distribution p(X, Z | θ), where X is the observed variable, Z is the latent variable, and θ are the parameters, the following procedure maximizes the likelihood p(X | θ). The E-step uses the lower bound

L(q, θ) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ) − Σ_Z p(Z | X, θ_old) ln p(Z | X, θ_old) = Q(θ, θ_old) + const,

and over the iterations the value of the likelihood increases monotonically.

Variational inference. Variational inference is a method for approximately computing a posterior distribution. In the E-step of EM we obtain the maximum of L(q, θ) by setting q(Z) = p(Z | X, θ_old); if this posterior is itself hard to compute, what can we do? We restrict the allowed range of q(Z) and solve approximately within that restricted family.

Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003): paper excerpt

Journal of Machine Learning Research3(2003)993-1022Submitted2/02;Published1/03Latent Dirichlet AllocationDavid M.Blei BLEI@ Computer Science DivisionUniversity of CaliforniaBerkeley,CA94720,USAAndrew Y.Ng ANG@ Computer Science DepartmentStanford UniversityStanford,CA94305,USAMichael I.Jordan JORDAN@ Computer Science Division and Department of StatisticsUniversity of CaliforniaBerkeley,CA94720,USAEditor:John LaffertyAbstractWe describe latent Dirichlet allocation(LDA),a generative probabilistic model for collections of discrete data such as text corpora.LDA is a three-level hierarchical Bayesian model,in which each item of a collection is modeled as afinite mixture over an underlying set of topics.Each topic is,in turn,modeled as an infinite mixture over an underlying set of topic probabilities.In the context of text modeling,the topic probabilities provide an explicit representation of a document.We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation.We report results in document modeling,text classification, and collaborativefiltering,comparing to a mixture of unigrams model and the probabilistic LSI model.1.IntroductionIn this paper we consider the problem of modeling text corpora and other collections of discrete data.The goal is tofind short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification,novelty detection,summarization,and similarity and relevance judgments.Significant progress has been made on this problem by researchers in thefield of informa-tion retrieval(IR)(Baeza-Yates and Ribeiro-Neto,1999).The basic methodology proposed by IR researchers for text corpora—a methodology successfully deployed in modern Internet search engines—reduces each document in the corpus to a vector of real numbers,each of which repre-sents ratios of counts.In the popular tf-idf scheme(Salton and McGill,1983),a basic vocabulary of“words”or“terms”is chosen,and,for each document in the corpus,a count is formed of the number of occurrences of each word.After suitable normalization,this term frequency count is compared to an inverse document frequency count,which measures the number of occurrences of aB LEI,N G,AND J ORDANword in the entire corpus(generally on a log scale,and again suitably normalized).The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus.Thus the tf-idf scheme reduces documents of arbitrary length tofixed-length lists of numbers.While the tf-idf reduction has some appealing features—notably in its basic identification of sets of words that are discriminative for documents in the collection—the approach also provides a rela-tively small amount of reduction in description length and reveals little in the way of inter-or intra-document statistical structure.To address these shortcomings,IR researchers have proposed several other dimensionality reduction techniques,most notably latent semantic indexing(LSI)(Deerwester et al.,1990).LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection.This approach can achieve significant compression in large collections.Furthermore,Deerwester et al.argue that the derived features of LSI,which are linear combinations of the original 
tf-idf features,can capture some aspects of basic linguistic notions such as synonymy and polysemy.To substantiate the claims regarding LSI,and to study its relative strengths and weaknesses,it is useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to recover aspects of the generative model from data(Papadimitriou et al.,1998).Given a generative model of text,however,it is not clear why one should adopt the LSI methodology—one can attempt to proceed more directly,fitting the model to data using maximum likelihood or Bayesian methods.A significant step forward in this regard was made by Hofmann(1999),who presented the probabilistic LSI(pLSI)model,also known as the aspect model,as an alternative to LSI.The pLSI approach,which we describe in detail in Section4.3,models each word in a document as a sample from a mixture model,where the mixture components are multinomial random variables that can be viewed as representations of“topics.”Thus each word is generated from a single topic,and different words in a document may be generated from different topics.Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on afixed set of topics.This distribution is the“reduced description”associated with the document.While Hofmann’s work is a useful step toward probabilistic modeling of text,it is incomplete in that it provides no probabilistic model at the level of documents.In pLSI,each document is represented as a list of numbers(the mixing proportions for topics),and there is no generative probabilistic model for these numbers.This leads to several problems:(1)the number of parame-ters in the model grows linearly with the size of the corpus,which leads to serious problems with overfitting,and(2)it is not clear how to assign probability to a document outside of the training set.To see how to proceed beyond pLSI,let us consider the fundamental probabilistic assumptions underlying the class of dimensionality reduction methods that includes LSI and pLSI.All of these methods are based on the“bag-of-words”assumption—that the order of words in a document can be neglected.In the language of probability theory,this is an assumption of exchangeability for the words in a document(Aldous,1985).Moreover,although less often stated formally,these methods also assume that documents are exchangeable;the specific ordering of the documents in a corpus can also be neglected.A classic representation theorem due to de Finetti(1990)establishes that any collection of ex-changeable random variables has a representation as a mixture distribution—in general an infinite mixture.Thus,if we wish to consider exchangeable representations for documents and words,we need to consider mixture models that capture the exchangeability of both words and documents.L ATENT D IRICHLET A LLOCATIONThis line of thinking leads to the latent Dirichlet allocation(LDA)model that we present in the current paper.It is important to emphasize that an assumption of exchangeability is not equivalent to an as-sumption that the random variables are independent and identically distributed.Rather,exchange-ability essentially can be interpreted as meaning“conditionally independent and identically dis-tributed,”where the conditioning is with respect to an underlying latent parameter of a probability distribution.Conditionally,the joint distribution of the random variables is simple and factored while marginally over the latent 
parameter,the joint distribution can be quite complex.Thus,while an assumption of exchangeability is clearly a major simplifying assumption in the domain of text modeling,and its principal justification is that it leads to methods that are computationally efficient, the exchangeability assumptions do not necessarily lead to methods that are restricted to simple frequency counts or linear operations.We aim to demonstrate in the current paper that,by taking the de Finetti theorem seriously,we can capture significant intra-document statistical structure via the mixing distribution.It is also worth noting that there are a large number of generalizations of the basic notion of exchangeability,including various forms of partial exchangeability,and that representation theo-rems are available for these cases as well(Diaconis,1988).Thus,while the work that we discuss in the current paper focuses on simple“bag-of-words”models,which lead to mixture distributions for single words(unigrams),our methods are also applicable to richer models that involve mixtures for larger structural units such as n-grams or paragraphs.The paper is organized as follows.In Section2we introduce basic notation and terminology. The LDA model is presented in Section3and is compared to related latent variable models in Section4.We discuss inference and parameter estimation for LDA in Section5.An illustrative example offitting LDA to data is provided in Section6.Empirical results in text modeling,text classification and collaborativefiltering are presented in Section7.Finally,Section8presents our conclusions.2.Notation and terminologyWe use the language of text collections throughout the paper,referring to entities such as“words,”“documents,”and“corpora.”This is useful in that it helps to guide intuition,particularly when we introduce latent variables which aim to capture abstract notions such as topics.It is important to note,however,that the LDA model is not necessarily tied to text,and has applications to other problems involving collections of data,including data from domains such as collaborativefiltering, content-based image retrieval and bioinformatics.Indeed,in Section7.3,we present experimental results in the collaborativefiltering domain.Formally,we define the following terms:•A word is the basic unit of discrete data,defined to be an item from a vocabulary indexed by {1,...,V}.We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero.Thus,using superscripts to denote components, the v th word in the vocabulary is represented by a V-vector w such that w v=1and w u=0for u=v.•A document is a sequence of N words denoted by w=(w1,w2,...,w N),where w n is the n th word in the sequence.•A corpus is a collection of M documents denoted by D={w1,w2,...,w M}.B LEI,N G,AND J ORDANWe wish tofind a probabilistic model of a corpus that not only assigns high probability to members of the corpus,but also assigns high probability to other“similar”documents.tent Dirichlet allocationLatent Dirichlet allocation(LDA)is a generative probabilistic model of a corpus.The basic idea is that documents are represented as random mixtures over latent topics,where each topic is charac-terized by a distribution over words.1LDA assumes the following generative process for each document w in a corpus D:1.Choose N∼Poisson(ξ).2.Chooseθ∼Dir(α).3.For each of the N words w n:(a)Choose a topic z n∼Multinomial(θ).(b)Choose a word w n from p(w n|z n,β),a multinomial probability conditioned on 
the topicz n.Several simplifying assumptions are made in this basic model,some of which we remove in subse-quent sections.First,the dimensionality k of the Dirichlet distribution(and thus the dimensionality of the topic variable z)is assumed known andfixed.Second,the word probabilities are parameter-ized by a k×V matrixβwhereβi j=p(w j=1|z i=1),which for now we treat as afixed quantity that is to be estimated.Finally,the Poisson assumption is not critical to anything that follows and more realistic document length distributions can be used as needed.Furthermore,note that N is independent of all the other data generating variables(θand z).It is thus an ancillary variable and we will generally ignore its randomness in the subsequent development.A k-dimensional Dirichlet random variableθcan take values in the(k−1)-simplex(a k-vector θlies in the(k−1)-simplex ifθi≥0,∑k i=1θi=1),and has the following probability density on this simplex:Γ ∑k i=1αip(θ|α)=1.We refer to the latent multinomial variables in the LDA model as topics,so as to exploit text-oriented intuitions,butwe make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words.L ATENT D IRICHLET ALLOCATIONFigure1:Graphical model representation of LDA.The boxes are“plates”representing replicates.The outer plate represents documents,while the inner plate represents the repeated choiceof topics and words within a document.where p(z n|θ)is simplyθi for the unique i such that z i n=1.Integrating overθand summing over z,we obtain the marginal distribution of a document:p(w|α,β)= p(θ|α) N∏n=1∑z n p(z n|θ)p(w n|z n,β) dθ.(3)Finally,taking the product of the marginal probabilities of single documents,we obtain the proba-bility of a corpus:p(D|α,β)=M∏d=1 p(θd|α)N d∏n=1∑z dn p(z dn|θd)p(w dn|z dn,β) dθd.The LDA model is represented as a probabilistic graphical model in Figure1.As thefigure makes clear,there are three levels to the LDA representation.The parametersαandβare corpus-level parameters,assumed to be sampled once in the process of generating a corpus.The variables θd are document-level variables,sampled once per document.Finally,the variables z dn and w dn are word-level variables and are sampled once for each word in each document.It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model.A classical clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus,a multinomial clustering variable is selected once for each document in the corpus, and a set of words are selected for the document conditional on the cluster variable.As with many clustering models,such a model restricts a document to being associated with a single topic.LDA, on the other hand,involves three levels,and notably the topic node is sampled repeatedly within the document.Under this model,documents can be associated with multiple topics.Structures similar to that shown in Figure1are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models(Gelman et al.,1995),or more precisely as con-ditionally independent hierarchical models(Kass and Steffey,1989).Such models are also often referred to as parametric empirical Bayes models,a term that refers not only to a particular model structure,but also to the methods used for estimating parameters in the model(Morris,1983).In-deed,as we discuss in Section5,we adopt the empirical Bayes approach to estimating parameters such asαandβin simple 
implementations of LDA,but we also consider fuller Bayesian approaches as well.B LEI,N G,AND J ORDAN3.1LDA and exchangeabilityAfinite set of random variables{z1,...,z N}is said to be exchangeable if the joint distribution is invariant to permutation.Ifπis a permutation of the integers from1to N:p(z1,...,z N)=p(zπ(1),...,zπ(N)).An infinite sequence of random variables is infinitely exchangeable if everyfinite subsequence is exchangeable.De Finetti’s representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed,conditioned on that parameter.In LDA,we assume that words are generated by topics(byfixed conditional distributions)and that those topics are infinitely exchangeable within a document.By de Finetti’s theorem,the prob-ability of a sequence of words and topics must therefore have the form:p(w,z)= p(θ) N∏n=1p(z n|θ)p(w n|z n) dθ,whereθis the random parameter of a multinomial over topics.We obtain the LDA distribution on documents in Eq.(3)by marginalizing out the topic variables and endowingθwith a Dirichlet distribution.3.2A continuous mixture of unigramsThe LDA model shown in Figure1is somewhat more elaborate than the two-level models often studied in the classical hierarchical Bayesian literature.By marginalizing over the hidden topic variable z,however,we can understand LDA as a two-level model.In particular,let us form the word distribution p(w|θ,β):p(w|z,β)p(z|θ).p(w|θ,β)=∑zNote that this is a random quantity since it depends onθ.L ATENT D IRICHLET A LLOCATIONFigure2:An example density on unigram distributions p(w|θ,β)under LDA for three words and four topics.The triangle embedded in the x-y plane is the2-D simplex representing allpossible multinomial distributions over three words.Each of the vertices of the trian-gle corresponds to a deterministic distribution that assigns probability one to one of thewords;the midpoint of an edge gives probability0.5to two of the words;and the centroidof the triangle is the uniform distribution over all three words.The four points markedwith an x are the locations of the multinomial distributions p(w|z)for each of the fourtopics,and the surface shown on top of the simplex is an example of a density over the(V−1)-simplex(multinomial distributions of words)given by LDA.We now define the following generative process for a document w:1.Chooseθ∼Dir(α).2.For each of the N words w n:(a)Choose a word w n from p(w n|θ,β).This process defines the marginal distribution of a document as a continuous mixture distribution:p(w|α,β)= p(θ|α) N∏n=1p(w n|θ,β) dθ,where p(w n|θ,β)are the mixture components and p(θ|α)are the mixture weights.Figure2illustrates this interpretation of LDA.It depicts the distribution on p(w|θ,β)which is induced from a particular instance of an LDA model.Note that this distribution on the(V−1)-simplex is attained with only k+kV parameters yet exhibits a very interesting multimodal structure.B LEI,N G,AND J ORDAN(c)pLSI/aspect modelFigure3:Graphical model representation of different models of discrete data.4.Relationship with other latent variable modelsIn this section we compare LDA to simpler latent variable models for text—the unigram model,a mixture of unigrams,and the pLSI model.Furthermore,we present a unified geometric interpreta-tion of these models which highlights their key differences and similarities.4.1Unigram modelUnder the 
unigram model,the words of every document are drawn independently from a singlemultinomial distribution:p(w)=N∏n=1p(w n).This is illustrated in the graphical model in Figure3a.4.2Mixture of unigramsIf we augment the unigram model with a discrete random topic variable z(Figure3b),we obtain a mixture of unigrams model(Nigam et al.,2000).Under this mixture model,each document is gen-erated byfirst choosing a topic z and then generating N words independently from the conditional multinomial p(w|z).The probability of a document is:p(w)=∑z p(z)N∏n=1p(w n|z).L ATENT D IRICHLET A LLOCATIONWhen estimated from a corpus,the word distributions can be viewed as representations of topics under the assumption that each document exhibits exactly one topic.As the empirical results in Section7illustrate,this assumption is often too limiting to effectively model a large collection of documents.In contrast,the LDA model allows documents to exhibit multiple topics to different degrees. This is achieved at a cost of just one additional parameter:there are k−1parameters associated with p(z)in the mixture of unigrams,versus the k parameters associated with p(θ|α)in LDA.4.3Probabilistic latent semantic indexingProbabilistic latent semantic indexing(pLSI)is another widely used document model(Hofmann, 1999).The pLSI model,illustrated in Figure3c,posits that a document label d and a word w n are conditionally independent given an unobserved topic z:p(w n|z)p(z|d).p(d,w n)=p(d)∑zThe pLSI model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic.In a sense,it does capture the possibility that a document may contain multiple topics since p(z|d)serves as the mixture weights of the topics for a particular document d.However,it is important to note that d is a dummy index into the list of documents in the training set.Thus,d is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures p(z|d)only for those documents on which it is trained.For this reason,pLSI is not a well-defined generative model of documents;there is no natural way to use it to assign probability to a previously unseen document.A further difficulty with pLSI,which also stems from the use of a distribution indexed by train-ing documents,is that the number of parameters which must be estimated grows linearly with the number of training documents.The parameters for a k-topic pLSI model are k multinomial distri-butions of size V and M mixtures over the k hidden topics.This gives kV+kM parameters and therefore linear growth in M.The linear growth in parameters suggests that the model is prone to overfitting and,empirically,overfitting is indeed a serious problem(see Section7.1).In prac-tice,a tempering heuristic is used to smooth the parameters of the model for acceptable predic-tive performance.It has been shown,however,that overfitting can occur even when tempering is used(Popescul et al.,2001).LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.As described in Section3,LDA is a well-defined generative model and generalizes easily to new documents.Furthermore,the k+kV parameters in a k-topic LDA model do not grow with the size of the training corpus.We will see in Section7.1that LDA does not suffer from the same overfitting issues as pLSI.4.4A 
geometric interpretationA good way of illustrating the differences between LDA and the other latent topic models is by considering the geometry of the latent space,and seeing how a document is represented in that geometry under each model.B LEI,N G,AND J ORDANFigure4:The topic simplex for three topics embedded in the word simplex for three words.The corners of the word simplex correspond to the three distributions where each word(re-spectively)has probability one.The three points of the topic simplex correspond to threedifferent distributions over words.The mixture of unigrams places each document at oneof the corners of the topic simplex.The pLSI model induces an empirical distribution onthe topic simplex denoted by x.LDA places a smooth distribution on the topic simplexdenoted by the contour lines.Figure5:(Left)Graphical model representation of LDA.(Right)Graphical model representation of the variational distribution used to approximate the posterior in LDA.All four of the models described above—unigram,mixture of unigrams,pLSI,and LDA—operate in the space of distributions over words.Each such distribution can be viewed as a point on the(V−1)-simplex,which we call the word simplex.The unigram modelfinds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.The latent variable models consider k points on the word simplex and form a sub-simplex based on those points,which we call the topic simplex. Note that any point on the topic simplex is also a point on the word simplex.The different latent variable models use the topic simplex in different ways to generate a document.•The mixture of unigrams model posits that for each document,one of the k points on the word simplex(that is,one of the corners of the topic simplex)is chosen randomly and all the words of the document are drawn from the distribution corresponding to that point.•The pLSI model posits that each word of a training document comes from a randomly chosen topic.The topics are themselves drawn from a document-specific distribution over topics,i.e.,a point on the topic simplex.There is one such distribution for each document;the set oftraining documents thus defines an empirical distribution on the topic simplex.•LDA posits that each word of both the observed and unseen documents is generated by a randomly chosen topic which is drawn from a distribution with a randomly chosen parameter.This parameter is sampled once per document from a smooth distribution on the topic simplex. 
These differences are highlighted in Figure4.5.Inference and Parameter EstimationWe have described the motivation behind LDA and illustrated its conceptual advantages over other latent topic models.In this section,we turn our attention to procedures for inference and parameter estimation under LDA.5.1InferenceThe key inferential problem that we need to solve in order to use LDA is that of computing the posterior distribution of the hidden variables given a document:p(θ,z|w,α,β)=p(θ,z,w|α,β)∏iΓ(αi) k∏i=1θαi−1i N∏n=1k∑i=1V∏j=1(θiβi j)w j n dθ,a function which is intractable due to the coupling betweenθandβin the summation over latent topics(Dickey,1983).Dickey shows that this function is an expectation under a particular extension to the Dirichlet distribution which can be represented with special hypergeometric functions.It has been used in a Bayesian context for censored discrete data to represent the posterior onθwhich,in that setting,is a random parameter(Dickey et al.,1987).Although the posterior distribution is intractable for exact inference,a wide variety of approxi-mate inference algorithms can be considered for LDA,including Laplace approximation,variational approximation,and Markov chain Monte Carlo(Jordan,1999).In this section we describe a simple convexity-based variational algorithm for inference in LDA,and discuss some of the alternatives in Section8.5.2Variational inferenceThe basic idea of convexity-based variational inference is to make use of Jensen’s inequality to ob-tain an adjustable lower bound on the log likelihood(Jordan et al.,1999).Essentially,one considers a family of lower bounds,indexed by a set of variational parameters.The variational parameters are chosen by an optimization procedure that attempts tofind the tightest possible lower bound.A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.Consider in particular the LDA model shown in Figure5(left).The problematic coupling betweenθandβarises due to the edges betweenθ,z,and w.By dropping these edges and the w nodes,and endow-ing the resulting simplified graphical model with free variational parameters,we obtain a family of distributions on the latent variables.This family is characterized by the following variationaldistribution:q(θ,z|γ,φ)=q(θ|γ)N∏n=1q(z n|φn),(4)where the Dirichlet parameterγand the multinomial parameters(φ1,...,φN)are the free variational parameters.Having specified a simplified family of probability distributions,the next step is to set up an optimization problem that determines the values of the variational parametersγandφ.As we show in Appendix A,the desideratum offinding a tight lower bound on the log likelihood translates directly into the following optimization problem:(γ∗,φ∗)=arg min(γ,φ)D(q(θ,z|γ,φ) p(θ,z|w,α,β)).(5)(1)initializeφ0ni:=1/k for all i and n(2)initializeγi:=αi+N/k for all i(3)repeat(4)for n=1to N(5)for i=1to k(6)φt+1ni :=βiwnexp(Ψ(γt i))(7)normalizeφt+1n to sum to1.(8)γt+1:=α+∑N n=1φt+1n(9)until convergenceFigure6:A variational inference algorithm for LDA.Thus the optimizing values of the variational parameters are found by minimizing the Kullback-Leibler(KL)divergence between the variational distribution and the true posterior p(θ,z|w,α,β). 
This minimization can be achieved via an iterativefixed-point method.In particular,we show in Appendix A.3that by computing the derivatives of the KL divergence and setting them equal to zero,we obtain the following pair of update equations:φni∝βiwnexp{E q[log(θi)|γ]}(6)γi=αi+∑N n=1φni.(7) As we show in Appendix A.1,the expectation in the multinomial update can be computed as follows:E q[log(θi)|γ]=Ψ(γi)−Ψ ∑k j=1γj ,(8) whereΨis thefirst derivative of the logΓfunction which is computable via Taylor approxima-tions(Abramowitz and Stegun,1970).Eqs.(6)and(7)have an appealing intuitive interpretation.The Dirichlet update is a poste-rior Dirichlet given expected observations taken under the variational distribution,E[z n|φn].The multinomial update is akin to using Bayes’theorem,p(z n|w n)∝p(w n|z n)p(z n),where p(z n)is approximated by the exponential of the expected value of its logarithm under the variational distri-bution.It is important to note that the variational distribution is actually a conditional distribution, varying as a function of w.This occurs because the optimization problem in Eq.(5)is conducted forfixed w,and thus yields optimizing parameters(γ∗,φ∗)that are a function of w.We can write the resulting variational distribution as q(θ,z|γ∗(w),φ∗(w)),where we have made the dependence on w explicit.Thus the variational distribution can be viewed as an approximation to the posterior distribution p(θ,z|w,α,β).In the language of text,the optimizing parameters(γ∗(w),φ∗(w))are document-specific.In particular,we view the Dirichlet parametersγ∗(w)as providing a representation of a document in the topic simplex.We summarize the variational inference procedure in Figure6,with appropriate starting points forγandφn.From the pseudocode it is clear that each iteration of variational inference for LDA requires O((N+1)k)operations.Empirically,wefind that the number of iterations required for a。
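The excerpt above ends with the variational inference algorithm of Figure 6. A minimal sketch of those per-document updates, assuming the topic-word matrix beta (k x V) and the Dirichlet parameter alpha are already given:

```python
import numpy as np
from scipy.special import digamma

def variational_inference(doc, alpha, beta, iters=100, tol=1e-6):
    """Per-document updates from Figure 6. doc: array of word ids;
    alpha: (k,) Dirichlet parameter; beta: (k, V) topic-word probabilities."""
    k = beta.shape[0]
    N = len(doc)
    phi = np.full((N, k), 1.0 / k)           # phi_ni := 1/k
    gamma = alpha + N / k                     # gamma_i := alpha_i + N/k
    for _ in range(iters):
        # phi_ni proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        new_gamma = alpha + phi.sum(axis=0)   # gamma := alpha + sum_n phi_n
        if np.max(np.abs(new_gamma - gamma)) < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, phi
```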

Latent Dirichlet Allocation note
By: Zhou Li (smzlkimi@)
Blog: Code & Doc: /p/lsa-lda/
July 31, 2009

1 Background: Bayesian statistics

Suppose there are two boxes, each holding 8 balls: box A contains 3 red and 5 white balls, box B contains 6 red and 2 white balls.

If we ask for the probability of drawing a red ball from box A, the answer is 3/8; if we ask for the probability of drawing a white ball from box B, it is 2/8. Such forward reasoning is easy. But if we are told that a red ball was drawn and asked for the probability that it came from box A, how do we compute that? Bayesian methods are exactly the tool for this kind of "inverse" probability.

Let P(X, Y) denote the joint probability of X and Y, so that P(X, Y) = P(Y|X) P(X). Since P(X, Y) = P(Y, X), we have P(Y|X) P(X) = P(X|Y) P(Y), and moving P(X) to the right-hand side gives Bayes' formula:

P(Y|X) = P(X|Y) P(Y) / P(X)

where P(Y|X) is called the posterior distribution, P(Y) the prior distribution, and P(X|Y) the likelihood. A detailed treatment of Bayesian problems can be found in Pattern Recognition and Machine Learning [1]; its first chapter explains Bayesian methods in detail.

Now consider a probability puzzle: a couple has two children, one of whom is known to be a boy; what is the probability that the other is also a boy? Let A = "the other child is also a boy" and B = "one of the children is known to be a boy". By Bayes' rule, P(A|B) = P(B|A) P(A) / P(B), where P(B|A) = 1, since "the other is also a boy" means both children are boys.

P(A) = 0.25, because the probability that two children are both boys is 0.25; P(B) = 0.75, because the probability that at least one of two children is a boy is 0.75. Therefore P(A|B) = 1 × 0.25 / 0.75 = 1/3.
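A brute-force check of this result (not in the original note), enumerating the four equally likely gender combinations:

```python
from itertools import product

families = list(product("BG", repeat=2))              # BB, BG, GB, GG, equally likely
at_least_one_boy = [f for f in families if "B" in f]
both_boys = [f for f in at_least_one_boy if f == ("B", "B")]
print(len(both_boys) / len(at_least_one_boy))         # 1/3
```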

1.1 Background: the Dirichlet distribution

Suppose we are playing a dice game against someone who is not entirely honest. Common sense says each face of the die should come up with probability 1/6, but the other player keeps rolling sixes, which makes us suspect that the die has been tampered with and that a six is more likely.

Since we are not sure exactly how likely a six is, we guess over several possibilities: with probability 50%, the six comes up with probability 2/7 and each other face with probability 1/7; with probability 25%, the six comes up with probability 3/8 and each other face with probability 1/8; and with probability 25%, every face comes up with probability 1/6, meaning the player did not cheat and was simply lucky. If we call the quantity we are guessing about X, then the most natural distribution for X is the Dirichlet distribution.

Let the random variable X follow a Dirichlet distribution, written Dir(α), i.e., X ~ Dir(α). Here α is a vector recording how many times each outcome has been observed. In the example above, the possible outcomes of the die are {1, 2, 3, 4, 5, 6}; if we observed each of the faces 1 through 5 five times and the face 6 ten times, then α = {5, 5, 5, 5, 5, 10}. X stands for the probability combinations above, such as {1/7, 1/7, 1/7, 1/7, 1/7, 2/7}, {1/8, 1/8, 1/8, 1/8, 1/8, 3/8}, or {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}, and P(X) is the probability of each such probability combination, in other words a probability of probabilities.

The density of the Dirichlet distribution is

p(x_1, ..., x_K | α_1, ..., α_K) = (1 / B(α)) ∏_{i=1}^{K} x_i^(α_i − 1),  with  B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i).

(The original note reproduced the figure from Wikipedia [2] that visualizes the Dirichlet distribution for K = 3.)

An important property of the Dirichlet distribution is that it is the conjugate distribution of the multinomial: if the prior is a Dirichlet distribution and the likelihood is multinomial, then the posterior is again a Dirichlet distribution. In LDA, the Dirichlet distribution describes the probability distribution at the document-topic level: a document is composed of several topics, and the Dirichlet distribution describes the distribution over that set of topics, as discussed in more detail below. The Dirichlet distribution is chosen precisely because its conjugacy greatly reduces the amount of computation.
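A small numerical sketch of this conjugacy using the dice example (the prior pseudo-counts and observed counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([5, 5, 5, 5, 5, 10], dtype=float)    # prior pseudo-counts for the six faces
theta = rng.dirichlet(alpha)                          # one draw: a probability vector over faces
print(theta, theta.sum())                             # the components sum to 1

# Conjugacy: after observing multinomial counts, the posterior is again Dirichlet
observed = np.array([2, 1, 3, 0, 1, 13], dtype=float) # counts from 20 new rolls
posterior = alpha + observed                          # Dir(alpha) prior + counts -> Dir(alpha + counts)
print(posterior, rng.dirichlet(posterior))
```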

1.2 Background: the Expectation-Maximization (EM) algorithm [3][4]

The EM algorithm is used to compute maximum likelihood estimates. It has two main settings: first, when the observed data are incomplete or some data are missing; second, when the likelihood function cannot be computed directly but can be expressed using latent variables. Parameter estimation in LDA belongs to the latter case.

In outline, EM first assigns random values to all parameters and then iterates two steps, called the E-step and the M-step. In the E-step, EM computes the expected likelihood; since all parameters have been assigned values, this expectation can be evaluated. In the M-step, EM re-estimates the parameter values according to the criterion of maximizing that expected likelihood. The iteration is repeated until convergence.

This note only sketches the reasoning behind EM; for a more detailed analysis see the EM material in Further Reading. Suppose the parameter estimate at some iteration is θ_n; our aim is to find θ_{n+1} such that P(X | θ_{n+1}) is as much larger than P(X | θ_n) as possible. Writing the log-likelihood ln P(X | θ) as L(θ), the goal is to make the following quantity as large as possible:

L(θ) − L(θ_n) = ln P(X | θ) − ln P(X | θ_n).    (1)

Now consider the latent variable Z, with P(X | θ) = Σ_Z P(X | Z, θ) P(Z | θ), so that (1) becomes

L(θ) − L(θ_n) = ln Σ_Z P(X | Z, θ) P(Z | θ) − ln P(X | θ_n).    (2)

Following the derivation of (2) in The Expectation Maximization Algorithm: A Short Tutorial [3], multiplying and dividing the terms inside the sum by P(Z | X, θ_n) and applying Jensen's inequality gives

L(θ) ≥ L(θ_n) + Σ_Z P(Z | X, θ_n) ln [ P(X | Z, θ) P(Z | θ) / ( P(Z | X, θ_n) P(X | θ_n) ) ].    (3)

Now define the right-hand side of (3) as

l(θ | θ_n) = L(θ_n) + Σ_Z P(Z | X, θ_n) ln [ P(X | Z, θ) P(Z | θ) / ( P(Z | X, θ_n) P(X | θ_n) ) ].    (4)

As mentioned earlier, our goal is to find θ that maximizes L(θ).

From (3) and (4) we can see that l(θ | θ_n) is a lower bound on L(θ), so our goal becomes finding and pushing up a lower bound that approaches L(θ):

E-step: compute the conditional expectation Q(θ | θ_n) = E_{Z | X, θ_n} [ ln P(X, Z | θ) ].
M-step: maximize this expectation, obtaining the new estimate θ_{n+1} = argmax_θ Q(θ | θ_n).
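As a concrete illustration of the two steps (not part of the original note), here is a tiny EM run for the classic two-coin problem: several sets of coin tosses, each set produced by one of two coins with unknown biases, where the identity of the coin is the latent variable:

```python
import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])     # heads observed in each set of tosses
n = 10                                # tosses per set
theta = np.array([0.6, 0.5])          # initial guesses for the two coin biases

for _ in range(50):
    # E-step: responsibility of each coin for each set, from the binomial likelihood
    like = np.stack([binom.pmf(heads, n, t) for t in theta])   # shape (2, 5)
    resp = like / like.sum(axis=0, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood update of each bias
    theta = (resp @ heads) / (resp.sum(axis=1) * n)

print(theta)    # the two estimated biases
```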

1.3 Background: variational inference [5]

Variational inference is a method for estimating a posterior distribution; it is used when the posterior cannot be computed directly. Section 1.8 of Variational Message Passing and its Applications [5] contains a very detailed derivation of the method; here I only sketch the idea.

When the posterior P(H | D) cannot be computed, we would like to approximate it with some distribution, call it Q(H), and we naturally want the difference between P and Q to be as small as possible. The difference is measured by the Kullback-Leibler (KL) divergence

KL(Q ∥ P) = Σ_H Q(H) ln [ Q(H) / P(H | D) ],

which can be transformed into

KL(Q ∥ P) = Σ_H Q(H) ln [ Q(H) / P(H, D) ] + ln P(D).

We can see that ln P(D) on the right does not depend on Q, so our job amounts to minimizing the expression to the left of the plus sign. Let L(Q) denote its negative,

L(Q) = Σ_H Q(H) ln [ P(H, D) / Q(H) ],

so that, rearranging (see the derivation in [5]),

ln P(D) = L(Q) + KL(Q ∥ P).

The term on the left is the log evidence, and L(Q) acts as a lower bound on it: making the KL divergence as small as possible is the same as bringing the two sides as close together as possible, so what we do is push the lower bound L(Q) up toward ln P(D).

Remember that we want to approximate the posterior with Q(H), so we need a Q that is tractable. The simplest approach is to assume that the factors of Q are independent:

Q(H) = ∏_i Q_i(H_i).

Under this assumption L(Q) can be optimized iteratively, factor by factor, until convergence. For the detailed procedure see Variational Message Passing and its Applications.

1.4 Background: Bayesian networks

The original LDA paper [6] contains a Bayesian network diagram. Only a little background on Bayesian networks is needed to read that figure, so the necessary points are listed here; for a deeper discussion see Chapter 8 of Pattern Recognition and Machine Learning [1].

具体过程参见Variational Message Passing and its Applications1.4 基础知识:Bayesian Network在LDA原始paper[6]中有幅贝叶斯网络图,想看懂这幅图只需要一点贝叶斯网络的基础知识就可以了,所以这里把需要理解的地方列出来,贝叶斯网络的深入讨论可以参考Pattern Recognition and Machine Learning[1]一书第8章。

先举一个例子:联合概率P(a,b,c)=P(c|a,b)P(b|a)P(a)可以表示为如下图箭头表示条件概率,圆圈表意一个随机变量。

这样我们就可以很容易的画出一个条件概率对于的贝叶斯网络。

对于更复杂的概率模型,比如由于有N个条件概率,当N很大时,在图中画出每一个随机变量显然不现实,这是就要把随机变量画到方框里:这就表示重复N个tn.在一个概率模型中,有些是我们观察到的随机变量,而有些是需要我们估计的随机变量,这两种变量有必要在图中区分开:如上图,被填充的圆圈表明该随机变量被观察到并已经设为了被观察到的值。

了解上面三个定理就能轻松的读懂LDA原始paper中的贝叶斯网络图了。

2 Latent Dirichlet Allocation: introduction

LDA is a method for modeling text, and it belongs to the class of generative models. A generative model is one that can randomly generate observable data; LDA can randomly generate a document composed of a number of topics. By modeling text in this way we can classify documents by topic, judge their similarity, and so on. In LSA, proposed in the 1990s, a latent semantic space for the text is obtained by reducing the dimensionality of the vector space; in LDA, the text is instead mapped into a topic space, that is, a document is taken to be a random composition of several topics, and the relationships between texts are obtained from that space. LDA rests on one prerequisite: bag of words, meaning that a document is treated as a collection of words, ignoring grammar and word order.

3 The generative model

LDA builds its model by working backwards from a collection of texts to a generative model, so before discussing how to fit the model we first need to understand how LDA's generative model produces a document. Suppose a corpus has three topics: sports, technology, and film. A document describing how a movie is made may involve both the technology topic and the film topic. The technology topic contains a set of words related to technology, each with a probability representing how likely that word is to appear in an article whose topic is technology; likewise, the film topic contains a set of film-related words, each with its own probability of occurrence. When generating a document about movie production, we first randomly choose a topic, with technology and film being the most likely choices, and then choose a word, with words related to the chosen topic being the most likely. That completes the choice of a single word; choosing N words in this way composes a document.

Concretely, a document is generated by the following steps:

1. Choose N, with N ~ Poisson(ξ); here N is the length of the document.
2. Choose θ, with θ ~ Dirichlet(α); here θ is a column vector giving the probability of each topic, and α is the parameter of the Dirichlet distribution.
3. For each of the N words:
a) choose a topic z_n, with z_n ~ Multinomial(θ); z_n is the topic currently selected;
b) choose a word w_n according to p(w_n | z_n; β), a multinomial distribution conditioned on z_n.

Here β is a K × V matrix with β_ij = P(w_j = 1 | z_i = 1); that is, β records the probability of generating each word under each topic.

Look at step 2: this is where LDA differs from pLSA. Suppose each document is composed of 3 topics; then θ gives the probability of each topic, for example {1/6, 2/6, 3/6}, so different documents have different θ, and θ can be used, among other things, to judge how similar documents are. A small sketch of the generative process is given below.
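A minimal numpy sketch of steps 1-3, with a made-up K × V matrix beta and Dirichlet parameter alpha:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8                                   # number of topics, vocabulary size
alpha = np.full(K, 0.5)                       # Dirichlet parameter for theta
beta = rng.dirichlet(np.ones(V), size=K)      # made-up K x V topic-word matrix
xi = 20                                       # mean document length

def generate_document():
    N = rng.poisson(xi)                       # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)              # 2. theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)            # 3a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])          # 3b. w_n ~ p(w | z_n; beta)
        words.append(w)
    return theta, words

theta, words = generate_document()
print(theta, words)
```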

LDA graphical model representation: nearly every article that discusses LDA includes the standard plate diagram, with α and β at the corpus level, θ drawn once per document, and z and w drawn once per word. The probability model it represents is

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β).

Computing the marginal probability from this expression gives the probability of a document,

p(w | α, β) = ∫ p(θ | α) [ ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ,

and taking the product over documents gives the probability of the corpus,

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) [ ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ] dθ_d,

where D denotes the corpus and M is the total number of documents in it.

4 Parameter estimation

From the discussion of the generative model we see that modeling the text amounts to estimating the two parameters α and β. They could in principle be estimated by maximum likelihood, but there is a problem: the likelihood function cannot be computed directly because of the coupling between θ and β. Recall the variational inference method introduced earlier: to approximate the posterior, one constructs a lower bound on the likelihood, and here that lower bound can be used for parameter estimation as well. For this reason the original LDA paper [6] uses variational inference to compute a lower bound on the likelihood and then maximizes it with respect to α and β.
