Figure 3 | BMC Bioinformatics

Figure 3

From: Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

Graphical model representation of a mixture of unigrams model with K latent topics (right) and a unigram model (left). The corpus depicted contains M documents, each of which is a sequence of N words. Open circles represent latent variables (z) or parameters (β, θ). Each shaded circle is an observed word variable (w). Boxes (plates) represent replicates. The subscripts m, n and k on a parameter (β, θ) or variable (z, w) denote the mth document, nth word and kth topic, respectively. A mixture of unigrams generates all the words in a given document from exactly one topic, z. This differs from the LDA model, where a single document can express multiple topics (Figure 2). Note that the naive Bayes model used to cluster transcript profiling data [41-43] has the same topology as the mixture of unigrams, but its observed variables are continuous-valued expression measurements rather than discrete words.
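
The generative difference between the two models in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that θ holds the topic proportions and that each β is a word distribution for one topic; the function names, the two-topic vocabulary and the probabilities are hypothetical values chosen only for illustration.

```python
import random


def sample_unigram_doc(beta, n_words, rng):
    """Unigram model: every word is drawn from one shared word distribution."""
    vocab = list(beta)
    weights = [beta[word] for word in vocab]
    return rng.choices(vocab, weights=weights, k=n_words)


def sample_mixture_doc(theta, betas, n_words, rng):
    """Mixture of unigrams: draw one topic z for the whole document,
    then draw all N words from that topic's word distribution."""
    z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    return z, sample_unigram_doc(betas[z], n_words, rng)


# Hypothetical parameters (assumptions, not taken from the article).
rng = random.Random(0)
theta = [0.5, 0.5]  # assumed topic proportions over K = 2 topics
betas = [
    {"aging": 0.6, "lifespan": 0.4},   # assumed word distribution, topic 0
    {"neuron": 0.7, "synapse": 0.3},   # assumed word distribution, topic 1
]

z, doc = sample_mixture_doc(theta, betas, 8, rng)
print(z, doc)
```

Because the two hypothetical topics have disjoint vocabularies, every sampled document contains words from only one of them, which mirrors the single-topic-per-document property described in the caption. An LDA-style sampler would instead draw a new topic for each word.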
