Figure 2 | BMC Bioinformatics

Figure 2

From: Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span


Graphical model representation of the LDA model (left) and the variational distribution used to approximate the posterior in LDA (right) [22]. LDA defines a distribution on a collection of documents in much the same manner that a profile hidden Markov model yields a distribution on a set of (biological) sequences [31]. The corpus depicted contains M documents, each of which is a sequence of N words. Open circles are parameters (α, β, γ, φ) or latent variables (θ, z). The shaded circle is the observed word variable (w), and boxes (plates) represent replicates. The Dirichlet parameter, α, and topic-word matrix, β, are corpus-level parameters sampled once in the process of generating a corpus. The topic proportions, θ, are a document-level variable sampled from the Dirichlet distribution p(θ|α) once per document. The topic, z, is a word-level variable sampled from θ once for each word in a document. Formally, a K-topic LDA specifies a two-level probabilistic process that generates a document as follows: (i) a K-dimensional vector, θ, is chosen from the distribution p(θ|α), and (ii) words are sampled repeatedly from the document-specific mixture distribution, p(w|θ). Exact inference and parameter estimation involve calculating the posterior distribution for a document, p(θ, z|w, α, β). This is intractable because the latent variables are coupled via the edge between θ and z. The posterior can be approximated by computing the variational Dirichlet parameter γ and the variational multinomial parameter φ for each word in the document. The subscripts m, n, and k on a parameter (β, γ, φ) or variable (θ, z, w) denote the mth document, nth word, and kth topic, respectively. Note that the Dirichlet variable α is a distinct component of the probability model and not merely an expression of uncertainty about a parameter. This differs from profile hidden Markov models, where a mixture of Dirichlet distributions is used as a prior for amino acid/nucleotide probability distributions.
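
The two-level generative process described in the caption can be illustrated with a minimal sketch. It assumes NumPy, a symmetric Dirichlet parameter, and a randomly drawn topic-word matrix; the function name generate_document, the vocabulary size of 8, and the document length of 20 are hypothetical choices made for illustration and do not come from the article.

```python
import numpy as np


def generate_document(alpha, beta, n_words, rng):
    """Sample one document from a K-topic LDA model (illustrative sketch).

    alpha : array of shape (K,), Dirichlet parameter (corpus level).
    beta  : array of shape (K, V), topic-word matrix; each row sums to 1.
    """
    n_topics, vocab_size = beta.shape
    # (i) Draw the document-level topic proportions theta ~ p(theta | alpha).
    theta = rng.dirichlet(alpha)
    # (ii) For each word, draw a topic z from theta ...
    z = rng.choice(n_topics, size=n_words, p=theta)
    # ... then draw the observed word w from that topic's row of beta.
    w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
    return theta, z, w


rng = np.random.default_rng(0)
alpha = np.full(3, 0.5)                    # K = 3 topics (assumed)
beta = rng.dirichlet(np.ones(8), size=3)   # V = 8 vocabulary words (assumed)
theta, z, w = generate_document(alpha, beta, n_words=20, rng=rng)
```

Here theta plays the role of the document-level variable, z the word-level topic assignments, and w the observed words (the shaded node). Inference runs in the opposite direction: only w is observed, and θ and z must be recovered, which is where the variational parameters γ and φ come in.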
