Figure 6 | BMC Bioinformatics

Figure 6

From: Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

Results for four illustrative latent topics specified by a 50-topic CGC LDA model estimated from a corpus of 5,225 documents with a vocabulary of 28,971 words. Each panel shows results for one topic. The y-axis of the graph is the topic-specific word probability (β_kv), and words are arranged along the x-axis according to this probability. Only the 500 topic annotation words are plotted, since the remaining words in the vocabulary have negligible probabilities. The words displayed explicitly are unigrams in the CGC vocabulary, including the names of C. elegans genes and GO terms. The position of a word along the x-axis represents its rank; the staggering of words along the y-axis is not significant and is intended only to improve legibility. The graph legend lists two types of automatically generated topic labels. CGC-based topic labels are the subset of the 50 × 500 topic annotation words that are unique to a topic and are words from the CGC vocabulary; these labels are ordered by decreasing β_kv. GO-based topic labels are the parent and grandparent GO terms of GO terms that are also topic annotation words. Only GO terms that occur four or more times are given, and they are listed in order of decreasing frequency (MF: molecular function; CC: cellular component; BP: biological process). A CGC-based label is unique to a topic, whereas a GO-based label can apply to one or more topics.
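
The CGC-based labelling rule described in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy β matrix, and the five-word vocabulary are hypothetical and chosen only for illustration, and the sketch assumes the fitted model is available as a topics × vocabulary array of word probabilities. It takes the top-ranked words of each topic as that topic's annotation words, then keeps those that appear in no other topic's annotation list, preserving the order of decreasing β_kv.

```python
from collections import Counter

import numpy as np


def topic_annotation_words(beta, n_top):
    """Return, per topic, the indices of the n_top most probable words,
    ordered by decreasing probability."""
    return np.argsort(-beta, axis=1, kind="stable")[:, :n_top]


def unique_topic_labels(beta, vocab, n_top):
    """Keep only annotation words that occur in exactly one topic's list.
    Within a topic, the order of decreasing probability is preserved."""
    top = topic_annotation_words(beta, n_top)
    counts = Counter(int(v) for row in top for v in row)
    return [[vocab[int(v)] for v in row if counts[int(v)] == 1] for row in top]


# Hypothetical toy data: 3 topics, 5 words (the model in the figure has
# 50 topics, 28,971 words, and 500 annotation words per topic).
vocab = ["daf-2", "lifespan", "neuron", "unc-6", "embryo"]
beta = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.05, 0.30, 0.40, 0.20, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.60],
])

labels = unique_topic_labels(beta, vocab, n_top=2)
for k, words in enumerate(labels):
    print(f"topic {k}: {words}")
```

In this toy example, "lifespan" is among the top two words of both the first and second topics, so it is dropped from both label lists. The same uniqueness filter would apply with n_top=500 on a 50-topic model.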
