Document Topic Modeling

May 21st, 2010 by Kristopher Reese

Topic Modeling is a classic problem in Information Retrieval, and despite the extensive amount of research, the idea of clustering documents together under a broad range of topics is still relatively uncommon in Content Management Systems on the web.  With algorithms such as Latent Dirichlet Allocation, Latent Sematic Analysis, and Gamma-Poisson, one would expect to see more of these data clustering algorithms implemented for finding potentially related documents in a system and even inferring the relevance that other documents might have to a document being currently viewed by a user.

One of the most robust algorithm for Topic Modeling is the Latent Dirichlet Allocation (LDA).  It’s a statistical method which uses probabilistic principles to generate a list of topics for which a document can be associated with.  The LDA model makes several assumptions about distributions on several of the parameters for the algorithm.  These are [1]:

  1. Choose N \sim Poisson(\xi)
  2. Choose \theta \sim Dir(\alpha)
  3. For each N words w_{n}
    1. Choose a topic z_{n} \sim Multinomial(\theta)
    2. Choose a word w_{n} from p \left ( w_{n} | z_{n}, \beta \right ), a multinomial probability conditioned on the topic
      zn.
Latent Dirichlet Allocation Model

Mathematical Model for Latent Dirichlet Allocation

Knowing these assumptions, we can generate a mathematical model, shown in figure 1, using the hyperparameters, α and β, which we use to help to sample the distributions of θ and φ (which are both Dirichlet-Multinomial distributions), where θ is the distribution of topics in documents and φ is the distribution of topics for words.  Ultimately, these parameters will be marginalized and are unnecessary to use, except to help us understand the model.  By setting α and β arbitrarily, we simply make the assumption that the distributions θ and φ exist.

To calculate the distributions, we will use a Gibbs Sampling Algorithm to sample the z-topic for each of the words in the document.  The Gibbs Sampler will probabilistically be determined using the following equations:

P\left ( z_{i}=j \mid z_{-i}, w \right ) \propto P\left ( w_{i} \mid z_{i}=j, z_{-i}, w_{-i} \right )P\left(z_{i}=j \mid z_{-i} \right )
P\left ( z \mid \alpha, \beta, w \right ) = \frac{n_{-i,j}^{w_{i}}+\beta}{n_{-i,j}^{w_{i}}+W\beta} \frac{n_{-i,j}^{d_{i}}+\alpha}{n_{-i,j}^{d_{i}}+W\alpha}

where

n_{-i,j}^{w_{i}} is the number of times the word in w_{i} is assigned to topic j
n_{-i,j}^{d_{i}} is the number of times topic j is used in document d

Using this model can ultimately be looked at as a decomposition of a document-word matrix into both a topic-word and topic-document matrix.  It’s a powerful tool for information retrieval.  I am currently working (in the little bit of spare time that I have) an automated clustering algorithm using LDA for WordPress blogs.  When this is completed, it would automatically cluster related posts into their respective categories without the author needing to worry about posting the entry in a category.  This could later be expanded into a full blown data clustering and relation engine which would use some unsupervised bayesian principles in inferring the relevance of potential documents to current documents.

Leave a Reply