Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization)
© Berry et al. 2010
Published: 7 October 2010
Searching the enormous amount of information available in biomedical literature to extract novel functional relationships among genes remains a challenge in the field of bioinformatics. While numerous (software) tools have been developed to extract and identify gene relationships from biological databases, few effectively deal with extracting new (or implied) gene relationships, a process which is useful in interpretation of discovery-oriented genome-wide experiments.
In this study, we develop a Web-based bioinformatics software environment called FAUN or Feature Annotation Using Nonnegative matrix factorization (NMF) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of NMF for processing gene sets are discussed. FAUN is tested on three manually constructed gene document collections. Its utility and performance as a knowledge discovery tool are demonstrated using a set of genes associated with Autism.
FAUN not only assists researchers to use biomedical literature efficiently, but also provides utilities for knowledge discovery. This Web-based software environment may be useful for the validation and analysis of functional associations in gene subsets identified by high-throughput experiments.
The MEDLINE 2010 literature database at NIH contains over 19 million records and is growing at an exponential rate. With such rapid growth of the biomedical literature and the breakdown of disciplinary boundaries, it can be overwhelming to manually track all new relevant discoveries, even on specialized topics. Moreover, the recent advances in genomic and proteomic technologies have added an abundance of genomic information into biomedical knowledge, which makes the situation even more complicated. One main difficulty in understanding high-throughput genomic data is to determine the functional relationships between genes.
By design, high throughput experimental approaches are expected to yield new discoveries. For example, gene expression profiling by DNA microarray technology can identify hundreds of genes whose expression is co-regulated with experimental treatments. The researcher is expected to reduce this list to functional pathways and mechanisms that can be further investigated experimentally. While some of the differentially expressed genes may be known to functionally interact, it is expected that many interactions are implied and weakly supported in the literature. Therefore, there is a growing need to develop new text-mining tools to assist researchers in discovering hidden or implicit functional information about genes directly from biomedical literature. Surveys of online tools for literature-based discovery in the life sciences are available [2, 3], and a more recent article by Roos et al.  describes the integration of Semantic Web technologies with text extraction and mining for hypothesis generation within a workflow environment.
Numerous data mining tools have been proposed for bioinformatics research (see reviews in [5–9]). One of the major steps in text mining is information retrieval (IR)  which consists of three basic types of models: set-theoretic (Boolean), probabilistic, and algebraic (vector space). Documents in each case are retrieved based on Boolean logic, probability of relevance to the query, and the degree of similarity to the query, respectively.
Some of the current software tools utilize functional gene annotations provided in public databases, such as Gene Ontology (GO) , Medical Subject Heading (MeSH) index , and KEGG . For example, GoPubMed , a thesaurus-driven system, classifies abstracts based on GO, HAPI  identifies gene relationships based on co-occurrence of MeSH index terms in representative MEDLINE citations, and EASE  identifies gene relationships using the gene function classifications in GO. These co-occurrence based methods can be highly error prone. Ontology definitions help provide insights into biological processes, molecular functions and cellular compartments of individual genes. However, they are often incomplete and lack information related to associated phenotypes . In addition, Kostoff et al.  found a significant amount of conceptual information present in MEDLINE abstracts missing from the manually indexed MeSH terms. Moreover, indexing in MEDLINE can be inconsistent because of assignments by different human-indexers .
Several alternative approaches that use MEDLINE-derived relationships to functionally group related genes have been reported. Alako et al. have developed CoPub Mapper, which identifies shared terms that co-occur with gene names in MEDLINE abstracts. PubGene, developed by Jenssen et al., constructs gene relationship networks based on co-occurrence of gene symbols in MEDLINE abstracts. Because of the inconsistency issues in gene symbol usage in MEDLINE, PubGene has low recall (ratio of relevant documents retrieved to the total number of relevant documents). It identifies only 50% of the known gene relationships on average. In addition to the official gene symbol, each gene typically has several names or aliases. In IR, these problems are referred to as synonymy (multiple words having the same meaning) and polysemy (words having multiple meanings). Several methods have been proposed to solve these ambiguity issues in gene or protein names [22–24].
The concept of literature based discovery was introduced by Don Swanson in 1986  and has since been developed and applied to many different areas of research [26, 27]. Many online literature-based discovery tools have been developed, some of which have resulted in documented discoveries . Chilibot , Textpresso , and PreBIND  are examples of tools that are specifically geared toward genomic and proteomic applications . Chilibot is a system with a special focus on the extraction of relationships between genes, proteins and other information. Textpresso is an information-retrieval tool for biological entities that was originally designed for WormBase and later applied to other model organisms. Finally, PreBIND provides utilities in the extraction of protein-protein interactions. One drawback of many of these existing tools is that they are not amenable to analysis of high throughput genomic experiments which result in hundreds or even thousands of genes that must be further analyzed.
We previously developed a text-mining tool called Semantic Gene Organizer (SGO), which implements Latent Semantic Indexing (LSI) to extract functional relationships among genes from MEDLINE abstracts. Homayouni et al. demonstrated that SGO extracted both explicit (direct) and implicit (indirect) gene relationships based on keyword queries, as well as gene-abstract queries, from the biomedical literature with better accuracy than term co-occurrence methods. The underlying SVD factorization technique decomposes the original term-by-document nonnegative matrix into a new set of factor matrices containing positive and negative values. These matrix factors can be used to represent both terms and documents in a low-dimensional subspace. Unfortunately, the LSI factors are non-intuitive and difficult to interpret because of their negative components. The main limitation of LSI is that while it is robust in identifying what genes are related, it has difficulty in answering why they are related.
To address this issue, a new method called nonnegative matrix factorization (NMF) has been developed for genomic applications. Unlike SVD, NMF produces decompositions that can be readily interpreted.
Lee and Seung  were among the first researchers to introduce the nonnegative matrix factorization (NMF) problem. They demonstrated the application of NMF in text mining and image analysis. NMF decomposes and preserves the nonnegativity of the original data matrix. The low-rank factors produced by NMF can be interpreted as parts of the data. Recently, NMF has been widely used in the bioinformatics field, including the analysis of gene expression data, sequence analysis, gene tree labeling, functional characterization of gene lists and text mining [35–41]. Chagoyen et al. have shown the usefulness of NMF methods in extracting the semantic features in biomedical literature . Pascual-Montano et al.  developed an analytical tool called bio-NMF for simultaneous clustering of genes and samples. It requires (on input) a data matrix (e.g., term-by-doc matrix) and outputs the corresponding matrix factors. Even though the tool is robust and flexible, its use by biologists might not be obvious. Therefore, an intuitive interface that allows the biologist to use literature-based NMF methods for determining functional relationships among genes is still needed.
In this study, we develop a Web-based bioinformatics software environment called Feature Annotation Using Nonnegative matrix factorization (FAUN) to facilitate both knowledge discovery and classification of functional relationships among genes. The ability to facilitate knowledge discovery makes FAUN very attractive to genomic scientists. Thus, one of the main design goals of FAUN is to be biologically user-friendly. Given a list of genes with gene identifiers such as gene IDs or gene names, FAUN constructs a gene-list-specific document collection from the biomedical literature. NMF can be used to exploit the nonnegativity of term-by-gene document data, and can extract interpretable features of text which might represent usage patterns of words that are common across vastly different gene documents. NMF methods are iterative in nature, so the problem involves computational issues such as proper initialization, rank estimation (i.e., subspace dimension), stopping criteria, and convergence. To address these issues, many variations of NMF with different parameter choices have been proposed [38, 42, 43]. While developing FAUN, we try to understand how the NMF model can be adapted or improved for gene datasets so that it not only yields good mathematical approximations, but also provides valuable biological information.
A simple demonstration of the FAUN bioinformatics software environment (using an anonymous login) is available to the public at https://grits.eecs.utk.edu/faun; accounts can be created upon request and an upload feature for creating NMF models of user-supplied gene lists is available.
A typical usage scenario for FAUN concept features is shown in Figure 1. Once a document collection is built, three NMF models are generated with NMF ranks k = 10, 15, and 20, for low, medium, and high resolutions by default. Even higher resolutions are certainly possible, and the screenshot in Figure 1 is taken from an NMF model with 30 features (only the first 5 features are shown). The user can then look through the top terms in each feature and supply (if possible) an appropriate label. For example, Feature 2 in Figure 1 could readily be labeled (or annotated) as a descriptor of insulin signaling.
If the user is interested in further exploring the Insulin Signaling feature, he/she can then click on the feature to show all the genes in the collection that FAUN suggested to be highly associated with the feature. A description of how FAUN identifies the genes associated with each feature is provided in the next section.
For each feature, genes that are highly associated with it can be extracted from the feature-by-gene (H) matrix. A gene can be described by more than one feature. The association strength (feature weight) between gene and feature is determined by the appropriate element in the matrix. Genes that share one or more of the same features might be functionally related to one another.
Genes are listed from left-to-right by strength of association with the selected feature. The log-entropy weight of the terms in each gene is color-coded for visual analysis, with more red for a higher weight. The number of genes to be displayed is set to 15 and can be changed using the display-genes drop-down menu. All genes, above the set feature weight threshold, with their terms and term weights can be downloaded in csv format for further analysis. There might not be a single optimal threshold value that works the best for every case. FAUN provides global and local gene filter options to let users try different thresholds. The gene filter option allows users to filter the genes associated with each feature globally, across all gene documents above a certain threshold, or locally, within each gene document above the 70th local percentile. The medium global gene filter setting (i.e., feature weight = 1.0) is the default.
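The global and local gene filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FAUN's own code: the function name `genes_for_feature`, the toy H matrix, and the reading of the local filter as "at or above the 70th percentile of that gene's own feature weights" are all hypothetical choices.

```python
import numpy as np

def genes_for_feature(H, feature, gene_names, mode="global",
                      threshold=1.0, percentile=70.0):
    """Return (gene, weight) pairs for one feature, strongest first.

    H is the k x n feature-by-gene matrix. In "global" mode a single
    threshold is applied to every gene; in "local" mode each gene is
    compared with a percentile of its own column of H (an assumed
    interpretation of the local filter).
    """
    H = np.asarray(H, dtype=float)
    weights = H[feature, :]
    if mode == "global":
        keep = weights >= threshold
    elif mode == "local":
        cutoffs = np.percentile(H, percentile, axis=0)  # one cutoff per gene
        keep = weights >= cutoffs
    else:
        raise ValueError("mode must be 'global' or 'local'")
    order = np.argsort(-weights)  # descending feature weight
    return [(gene_names[j], float(weights[j])) for j in order if keep[j]]

# Toy example: 3 features (rows) by 3 genes (columns).
H = [[2.0, 0.2, 1.1],
     [0.1, 0.9, 0.3],
     [0.0, 0.1, 1.5]]
genes = ["G1", "G2", "G3"]
global_hits = genes_for_feature(H, 0, genes)               # default threshold 1.0
local_hits = genes_for_feature(H, 1, genes, mode="local")  # 70th local percentile
```

In this toy case the local filter keeps G2 for the second feature even though its weight (0.9) is below the global default of 1.0, which is the kind of difference the two filter options are meant to expose.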
To see how the feature terms and/or gene symbols are used in the original gene document article, sentences using the terms and/or the gene symbols can be viewed. The sentences are ranked based on term frequency. The ranked sentences are displayed in the popup window by clicking on the gene symbol at the head of the column. This popup window also serves as the quick summary page for the gene and provides a link to the Entrez Gene page for more information about that gene.
At this point, the FAUN user might have some ideas about what kinds of features are present in the gene document collection, and some familiarity with the genes that are associated with certain features. Genes belonging to the same feature might suggest that they are functionally related based on the literature. Such hypotheses may well lead to new discoveries in gene characterization. Namely, genes represented by the same feature may function in the same pathway.
To explore even further why certain subsets of genes are related, and how strongly they are related, the user can click on the gene vs gene correlation link shown at the bottom of the screenshot in Figure 2.
The Pearson correlation matrix for all the genes is then generated. The correlation is color-coded for visual analysis, with more red for a stronger correlation. An example of the correlation matrix for all the genes in Feature 20 (Methylation) of the NatRev collection is shown in Figure 3. Users can view the correlation of any pair of genes with respect to any combination of features, with a minimum of 3 features selected. By default, the user-selected feature along with its left and right neighboring features are used to compute the Pearson correlation.
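As a rough sketch of this computation, the gene-by-gene Pearson correlation can be obtained from the rows of H that correspond to the selected features. The helper names `default_features` and `gene_correlation` are hypothetical, and the way the default three-feature window is clamped at the first and last feature is an assumption, since the text does not say how edge features are handled.

```python
import numpy as np

def default_features(selected, k):
    """Selected feature plus its left and right neighbours (clamped at the ends)."""
    start = min(max(selected - 1, 0), k - 3)
    return [start, start + 1, start + 2]

def gene_correlation(H, features):
    """Pearson correlation between genes over the chosen feature rows of H."""
    if len(features) < 3:
        raise ValueError("at least 3 features are required")
    sub = np.asarray(H, dtype=float)[list(features), :]  # features x genes
    # rowvar=False treats each column (gene) as a variable.
    return np.corrcoef(sub, rowvar=False)

# Toy example: 4 features by 3 genes.
H = [[1.0, 2.0, 0.5],
     [2.0, 4.0, 0.1],
     [3.0, 6.0, 0.9],
     [0.0, 1.0, 2.0]]
C = gene_correlation(H, default_features(0, 4))  # uses features 0, 1, 2
```

Note that a gene whose weights are constant over the selected features has zero variance, so its correlations come out as NaN; a real interface would need to decide how to display such cases.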
A new gene document added to a gene document collection can be analyzed (for the presence of annotated features) without having to update the NMF model for the collection. The FAUN classifier can accept a stream of new documents and determine their features based on the presence of terms in the previously annotated features. It is possible to automatically retrieve newly published articles and run the FAUN classifier to determine whether they are related to any of the features of interest in the studied gene collection without having to continually update the NMF model. Of course, periodic updating of the NMF model to reflect changes in the literature may be needed.
FAUN consists of a computational core and Web-based user interface. The computational core consists of programs that build the gene document collection, parse the collection, build an NMF model from the collection, and classify new documents based on the NMF model. These programs will be described in more detail in the following sections. The primary design goal of the user interface was to make the analysis of NMF accessible to biologists.
All genes in a given gene list are used to compile titles and abstracts in Entrez Gene . Currently, to avoid polysemy and synonymy issues, there are still human interventions in the document compilation process, such that abstracts are not specific to a particular gene name or alias. Titles and abstracts for a specific gene are concatenated to prepare a gene document.
The collection of gene documents is parsed into terms using the current C++ version of General Text Parser (GTP) . Terms in the document collection that are common and non-distinguishing are discarded using a stoplist (see ftp://ftp.cs.cornell.edu/pub/smart/English.stop for a sample stoplist). In addition, terms that occur less than twice locally in the gene document or globally in the entire document collection are ignored and not considered as dictionary terms. Hyphens and underscores are considered as valid characters. All other punctuation and capitalization are ignored.
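The parsing rules above can be approximated with a short Python sketch. This is not GTP itself: the function `parse_collection`, the regular expression, and the way the "fewer than twice" rule is applied (a per-document count threshold plus a collection-wide count threshold) are assumptions made for illustration.

```python
import re
from collections import Counter

# Letters and digits, with hyphens and underscores allowed inside a term.
TOKEN = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")

def parse_collection(docs, stoplist, min_local=2, min_global=2):
    """Map each gene document to its retained {term: frequency} counts."""
    counts = {}
    for gene, text in docs.items():
        tokens = [t for t in TOKEN.findall(text.lower()) if t not in stoplist]
        counts[gene] = Counter(tokens)
    # Collection-wide frequency of every term.
    global_counts = Counter()
    for c in counts.values():
        global_counts.update(c)
    dictionary = {t for t, n in global_counts.items() if n >= min_global}
    # Keep a term in a document only if it passes both thresholds.
    return {gene: {t: n for t, n in c.items()
                   if n >= min_local and t in dictionary}
            for gene, c in counts.items()}

# Hypothetical two-document collection.
docs = {
    "GENE_A": "Insulin signaling regulates insulin-like growth; insulin signaling.",
    "GENE_B": "The kinase binds the receptor. Kinase activity.",
}
parsed = parse_collection(docs, stoplist={"the"})
```

Here `insulin-like` survives tokenization as a single term because hyphens are valid characters, but it is then dropped because it occurs only once.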
A term-by-gene document matrix is then constructed where the entries of the matrix are the nonnegative weighted frequencies for each term. These term weights, computed using a log-entropy weighting scheme, are used to describe the relative importance of term i for the corresponding (gene) document j. That is, the term-by-gene document matrix is defined as $A = [w_{ij}]$, where $w_{ij} = l_{ij} \times g_i$. Under the standard log-entropy scheme, the local and global components are

$$l_{ij} = \log_2(1 + f_{ij}), \qquad g_i = 1 + \frac{\sum_{j} p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad p_{ij} = \frac{f_{ij}}{\sum_{j} f_{ij}},$$

where $f_{ij}$ is the frequency of term i in document j, $p_{ij}$ is the probability of the term i occurring in document j and n is the number of gene documents in the collection. This log-entropy weighting pair, which has performed well in several LSI-based retrieval experiments, decreases the effect of term spamming while giving distinguishing terms higher weight.
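A minimal NumPy sketch of log-entropy weighting is given below. It assumes the standard form of the scheme, with local weight log2(1 + f_ij), global weight 1 + (Σ_j p_ij log2 p_ij) / log2 n, and p_ij = f_ij / Σ_j f_ij; the function name `log_entropy_weights` and the toy frequency matrix are hypothetical.

```python
import numpy as np

def log_entropy_weights(F):
    """Return (A, g): the weighted term-by-document matrix and global weights.

    F is an m x n matrix of raw term frequencies (terms in rows,
    gene documents in columns); n must be at least 2.
    """
    F = np.asarray(F, dtype=float)
    n = F.shape[1]
    local = np.log2(1.0 + F)                       # l_ij
    row_sums = F.sum(axis=1, keepdims=True)
    P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)
    plogp = np.zeros_like(P)
    mask = P > 0                                   # treat 0 * log(0) as 0
    plogp[mask] = P[mask] * np.log2(P[mask])
    g = 1.0 + plogp.sum(axis=1) / np.log2(n)       # g_i
    return local * g[:, None], g

# Term 0 is spread evenly over both documents; term 1 occurs in only one.
F = [[2, 2],
     [4, 0]]
A, g = log_entropy_weights(F)
```

The evenly spread term receives a global weight of 0 and the concentrated term a weight of 1, which illustrates how the scheme favors distinguishing terms.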
To summarize, a document collection can be expressed as an m × n matrix A, where m is the number of terms in the dictionary and n is the number of documents in the collection. Once the nonnegative matrix A has been created, nonnegative matrix factorization (NMF) is performed.
NMF is a matrix factorization algorithm to best approximate the matrix A by finding reduced-rank nonnegative factors W and H such that A ≈ WH. The sparse matrix W is commonly referred to as the feature matrix containing feature (column) vectors representing certain usage patterns of prominent weighted terms, while H is referred to as the coefficient matrix since its columns describe how each document spans each feature and to what degree.
The cost function $f(W, H) = \frac{1}{2}\|A - WH\|_F^2$, half of the squared Frobenius norm of the residual error, equals 0 if and only if A = WH. The minimization of f(W, H) can be challenging due to the existence of local minima, owing to the fact that f(W, H) is non-convex in both W and H. As noted before, due to its iterative nature, the NMF algorithm may not necessarily converge to a unique solution on every run. For a particular NMF solution W and H, the pair $WD$ and $D^{-1}H$ is also a solution for any nonnegative invertible matrix D whose inverse is also nonnegative. The NMF solution therefore depends on the initial conditions for W and H. To address this issue, we use the Nonnegative Double Singular Value Decomposition (NNDSVD) initialization strategy proposed by Boutsidis and Gallopoulos. This NNDSVD algorithm does not rely upon randomization and is based on approximations in the positive components of the truncated SVD factors of the original data matrix. Essentially, this provides NMF a fixed starting point, and the iteration to generate W and H will converge to the same minimum. As noted by Chagoyen et al., having multiple NMF solutions does not necessarily mean that any of the solutions must be erroneous.
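For readers who want to see the idea behind NNDSVD, the following is a compact sketch of the basic procedure as it is commonly described: the leading singular triplet is used directly (in absolute value), and for each later triplet the dominant nonnegative part of the singular vectors is kept. It uses a dense SVD for simplicity, whereas a sparse truncated solver would be more appropriate for real term-by-gene document matrices. The function name `nndsvd` and the test matrices are hypothetical, and this is not the FAUN implementation.

```python
import numpy as np

def nndsvd(A, k):
    """Deterministic nonnegative starting factors W (m x k) and H (k x n).

    Requires k <= min(m, n).
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    W = np.zeros((m, k))
    H = np.zeros((k, n))
    # Leading singular vectors of a nonnegative matrix can be taken nonnegative.
    W[:, 0] = np.sqrt(S[0]) * np.abs(U[:, 0])
    H[0, :] = np.sqrt(S[0]) * np.abs(Vt[0, :])
    for j in range(1, k):
        x, y = U[:, j], Vt[j, :]
        xp, xn = np.maximum(x, 0), np.maximum(-x, 0)  # positive / negative parts
        yp, yn = np.maximum(y, 0), np.maximum(-y, 0)
        pos = np.linalg.norm(xp) * np.linalg.norm(yp)
        neg = np.linalg.norm(xn) * np.linalg.norm(yn)
        u, v, sigma = (xp, yp, pos) if pos >= neg else (xn, yn, neg)
        if sigma == 0:
            continue  # degenerate triplet: leave this column and row at zero
        scale = np.sqrt(S[j] * sigma)
        W[:, j] = scale * u / np.linalg.norm(u)
        H[j, :] = scale * v / np.linalg.norm(v)
    return W, H

rng = np.random.default_rng(0)
A = rng.random((6, 5))
W0, H0 = nndsvd(A, 3)
W0_again, H0_again = nndsvd(A, 3)  # same input, same starting point
```

Because no random numbers are involved in `nndsvd` itself, repeated calls on the same matrix return identical factors, which is the property that makes the subsequent NMF iteration reproducible.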
To avoid division by zero, the small constant $10^{-9}$ is added to the denominator of each update rule above. In each iteration, both W and H are updated, which generally gives faster convergence than updating each matrix factor independently of the other. The computational complexity of the multiplicative update algorithm is easily shown to be O(kmn) floating-point operations per iteration.
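One iteration of the multiplicative update scheme can be sketched as follows. The sketch assumes the standard Lee and Seung rules for the Frobenius-norm cost, H ← H ∘ (WᵀA) / (WᵀWH + ε) followed by W ← W ∘ (AHᵀ) / (WHHᵀ + ε), with ε = 10⁻⁹; the function names and the random test data are hypothetical.

```python
import numpy as np

def multiplicative_update(A, W, H, eps=1e-9):
    """One iteration: update H, then W using the new H."""
    H = H * (W.T @ A) / (W.T @ W @ H + eps)
    W = W * (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def frobenius_cost(A, W, H):
    """Half of the squared Frobenius norm of the residual A - WH."""
    return 0.5 * np.linalg.norm(A - W @ H, "fro") ** 2

rng = np.random.default_rng(0)
A = rng.random((8, 6))                    # m = 8 terms, n = 6 documents
W, H = rng.random((8, 3)), rng.random((3, 6))  # rank k = 3
cost_before = frobenius_cost(A, W, H)
for _ in range(50):
    W, H = multiplicative_update(A, W, H)
cost_after = frobenius_cost(A, W, H)
```

The dominant products `W.T @ A` and `A @ H.T` each cost about kmn multiply-adds, which is where the O(kmn) operation count per iteration comes from when k is small relative to m and n. Because the updates only multiply by nonnegative ratios, nonnegative starting factors stay nonnegative.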
Additional constraints can be incorporated by adding penalty terms to the cost function, that is, by minimizing $\frac{1}{2}\|A - WH\|_F^2 + \alpha J_1(W) + \beta J_2(H)$, where α and β are relatively small regularization (control) parameters and $J_1(W)$ and $J_2(H)$ are functions defining additional constraints (e.g., smoothness or sparsity) on W and H, respectively. As explained in [42, 50], the rationale for enforcing smoothness or sparsity constraints on the W factor is to potentially improve the interpretability of its feature (column) vectors. Applying such constraints to the columns of the H (coefficient) matrix factor can control the span (or use) of features to explain documents in the collection.
The FAUN classifier accepts a new document to be classified, the entropy weights of terms in Equation (2) used in the NMF model, the term-by-feature matrix factor (W), stop words, and thresholds for entropy weight and term frequency.
The module then computes the weight for each feature based on the weight of its terms whose entropy is larger than the entropy threshold and frequency is larger than the term frequency threshold. It then outputs the features sorted by weight from the highest to the lowest. The process of mapping features to gene classes will be described below.
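The scoring step can be illustrated with a small sketch. The text does not give the exact formula, so the scoring rule used here is an assumption: each qualifying term contributes its log-entropy weight in the new document, log2(1 + f) × g_i, multiplied by its row of W. The function name `classify_document`, the vocabulary, and all numeric values are hypothetical.

```python
import math
import re
from collections import Counter

import numpy as np

TOKEN = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")

def classify_document(text, vocab, g, W, stoplist,
                      min_entropy=0.0, min_freq=0):
    """Return (feature index, score) pairs sorted from highest to lowest score.

    vocab maps each dictionary term to its row in W; g holds the global
    entropy weights from the existing NMF model.
    """
    tokens = TOKEN.findall(text.lower())
    counts = Counter(t for t in tokens if t in vocab and t not in stoplist)
    scores = np.zeros(W.shape[1])
    for term, f in counts.items():
        i = vocab[term]
        # Only terms above both thresholds contribute to the feature scores.
        if f > min_freq and g[i] > min_entropy:
            scores += math.log2(1 + f) * g[i] * W[i, :]
    order = np.argsort(-scores)
    return [(int(j), float(scores[j])) for j in order]

# Hypothetical model with 3 dictionary terms and 2 features.
vocab = {"insulin": 0, "kinase": 1, "methylation": 2}
g = np.array([0.8, 0.5, 0.9])
W = np.array([[1.0, 0.0],
              [0.2, 0.3],
              [0.0, 2.0]])
ranked = classify_document("Insulin receptor kinase; insulin signaling.",
                           vocab, g, W, stoplist=set())
```

Since the new document is scored against the existing W factor and entropy weights, no refactorization of the collection is needed, which matches the intent of classifying a stream of new documents without rebuilding the model.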
Preliminary testing indicated that the classifier accuracy was around 80%. The test was conducted based on the first dataset (described below) that contained 50 genes. NMF models were first built with ranks of 10, 20, 30 and 40 using 40 genes randomly selected from the 50-gene dataset. The classifier was then trained using the matrix in newly built NMF models. The accuracy was tested using the remaining 20% of the gene documents.
The FAUN classifier described above classifies genes based on annotated features in the NMF models. The process of annotating the features is typically done manually with the FAUN interface while exploring the gene dataset. Features in the NMF model can be annotated manually by the domain expert using dominant feature terms. To automate the process for the other two datasets, features are annotated or mapped to gene classes using the FAUN annotation script. In order to assign classification categories (classes) to the genes, the script requires the matrix from the NMF model, the (known) classification categories, the NMF rank, and a feature weight threshold.
Table: List of categories for each dataset used to evaluate FAUN classification performance. GC is the gene count per category.
Dataset 1 (50TG)
  Cancer & Development
  Alzheimer & Development
Dataset 2 (BGM)
  Biocarta: Caspase cascade in apoptosis
  Biocarta: Sonic hedgehog pathway
  Biocarta: Adhesion and diapedesis of lymphocytes
  GO: Biological process: telomere maintenance
  GO: Cellular constituent: cornified cell envelope
  GO: Molecular function: DNA helicase
  MeSH: Disease: retinitis pigmentosa
  MeSH: Disease: retinitis pancreatitis
  MeSH: Disease: nephroblastoma (Wilms' tumor)
Dataset 3 (NatRev)
  Mammary Gland Development
Proper initialization of the NMF factors W and H is an important consideration for reproducibility (i.e., uniqueness of the factorization). The NNDSVD (see Methods for details) is one such approach for generating a robust and consistent factorization. NNDSVD basically starts with the truncated SVD of the (sparse) term-by-gene document matrix A. Although NNDSVD produces a static starting point, different methods (see [43, 50]) can be applied to remove zeros from the initial approximation and thereby prevent them from being fixed throughout the (multiplicative) update process.
The values of factorization ranks considered for the three datasets were 5, 10, 15, 20, 25, 30, 40, and 50. We restricted the maximum number of iterations to 1000 and 2000, and stopped iterating if the consecutive iterates of W and H (generated by the multiplicative update algorithm defined in Equation (4)) were closer than $\tau_W = 0.01$ and $\tau_H = 0.001$, respectively, in Frobenius norm. That is, $\|W_{old} - W_{new}\|_F < \tau_W$ and $\|H_{old} - H_{new}\|_F < \tau_H$. The effect of constraints on smoothing and sparsity for the W and H iterates has been studied and we refer the reader to [43, 50] for more details on these effects.
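The stopping rule can be written as a small driver loop around any update step. In the sketch below, `iterate_until_converged` and `mu_step` are hypothetical names, the update inside `mu_step` assumes the standard multiplicative rules with a 10⁻⁹ safeguard, and both tolerances must be met before the loop stops, which is one reading of the criterion in the text.

```python
import numpy as np

def iterate_until_converged(step, W, H, tau_W=0.01, tau_H=0.001, max_iter=1000):
    """Apply step(W, H) until both factors change by less than their tolerance.

    Returns the final W, H and the number of iterations performed.
    """
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        W_new, H_new = step(W, H)
        dW = np.linalg.norm(W_new - W, "fro")
        dH = np.linalg.norm(H_new - H, "fro")
        W, H = W_new, H_new
        if dW < tau_W and dH < tau_H:
            break
    return W, H, n_iter

# Usage with an assumed multiplicative update on random nonnegative data.
rng = np.random.default_rng(0)
A = rng.random((8, 6))

def mu_step(W, H, eps=1e-9):
    H = H * (W.T @ A) / (W.T @ W @ H + eps)
    W = W * (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

W, H, n_iter = iterate_until_converged(mu_step, rng.random((8, 3)),
                                       rng.random((3, 6)))
```

Separating the driver from the update step makes it easy to compare the 1000 and 2000 iteration caps, or different tolerances, without touching the update code.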
The total gene recall per category was also investigated. For each feature, the corresponding maximum row entry of H (max H) is found, and all genes in the feature with H values below max H × $f_\tau$, where $f_\tau$ is a chosen feature threshold, are skipped. For each gene, all features (above $f_\tau$) associated with this gene are taken and then categories are assigned to the gene based on feature labeling.
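The thresholding and labeling procedure can be sketched as follows. The function name `assign_categories`, the category labels, and the toy H matrix are hypothetical, and the threshold value 0.5 is chosen only to make the example easy to follow.

```python
import numpy as np

def assign_categories(H, feature_labels, f_tau):
    """Return, for each gene (column of H), the set of categories assigned.

    A gene is kept for feature i only if H[i, gene] >= max(H[i, :]) * f_tau.
    Features without a label (None) assign nothing.
    """
    H = np.asarray(H, dtype=float)
    k, n = H.shape
    assigned = [set() for _ in range(n)]
    for i in range(k):
        row_max = H[i].max()
        if row_max == 0 or feature_labels[i] is None:
            continue  # empty or unlabeled feature
        cutoff = row_max * f_tau
        for j in range(n):
            if H[i, j] >= cutoff:
                assigned[j].add(feature_labels[i])
    return assigned

# Toy example: 2 labeled features by 3 genes.
H = [[4.0, 1.0, 3.0],
     [0.5, 2.0, 0.2]]
labels = ["Cancer", "Alzheimer"]
assigned = assign_categories(H, labels, f_tau=0.5)
```

A gene can end up with several categories when it is strongly associated with more than one labeled feature, which is why the accuracy measure must account for multiple assignments.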
The classification accuracy is evaluated in such a way that if the correct class is not among the classes assigned, the correctness is defined to be 0. If the correct class is among the classes assigned, the correctness is defined to be 1/(number of classes assigned to the gene). The total correctness is the sum of correctness assigned to every gene expressed as a percentage (0-100%). Using a feature weight threshold $f_\tau = 1.0$, gene recall ranges of 78%-100%, 71.6%-97.1%, and 42.7%-80.9% for the 50TG, BGM, and NatRev datasets, respectively, have been reported.
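The correctness measure is simple enough to state directly in code. The sketch below assumes that the percentage is obtained by dividing the summed correctness by the number of genes; the function name `total_correctness` and the class labels are hypothetical.

```python
def total_correctness(assigned, truth):
    """Percentage score: 1/(number of classes assigned) per correctly covered gene."""
    if not truth:
        raise ValueError("at least one gene is required")
    total = 0.0
    for classes, correct in zip(assigned, truth):
        if correct in classes:
            total += 1.0 / len(classes)  # penalize extra class assignments
        # otherwise the gene contributes 0
    return 100.0 * total / len(truth)

# Four genes: one exact hit, one hit shared with a second class, two misses.
assigned = [{"A"}, {"A", "B"}, {"B"}, set()]
truth = ["A", "A", "A", "B"]
score = total_correctness(assigned, truth)  # (1 + 0.5 + 0 + 0) / 4 = 37.5%
```

The 1/(number of classes) term means that assigning every class to every gene does not inflate the score, since each extra assignment dilutes the credit for the correct one.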
Low classification accuracy equates to the misclassification of a human-curated category in the dataset. However, this misclassification does not necessarily imply that FAUN cannot be used to infer new (previously unknown) functional properties. A few examples of such discovery are mentioned in the next section.
One of the most important capabilities of FAUN is to discover novel gene relationships from the biomedical literature, leading to generation of experimental hypotheses by the end user (biologist). This is essentially accomplished by clustering genes according to word usage patterns from the literature. By altering parameters in FAUN, a user can control the granularity by which genes and terms/features are associated. In this section we evaluate the effect of two parameters (rank-k and H-matrix threshold) on the discovery process using the NatRev dataset that contains 26 Autism associated genes from a total of 110 genes in the dataset.
It is important to point out that NMF clusters genes together even if they do not share every top weighted term for the feature. For instance, the autism feature for the k = 20 model included DISC1, SHANK3 and GATA3 although the term autism did not appear in the abstracts used to build the NMF model (Figure 8, red highlighted genes and terms). Indeed, the abstracts used in our collection were limited to 2006 and earlier and the discovery of SHANK3 and DISC1 as autism genes occurred only after 2007 [58, 59]. This association is due to the overlap of other terms that are highly weighted in this feature, demonstrating the utility of NMF for discovering new gene associations based on word pattern usage.
Taken together, we presented two possible strategies here that can be used within FAUN to explore relationships between genes and to make predictions that can later be tested experimentally. Related genes may be identified by lowering the rank-k and using general terms to cluster genes together. Alternatively, related genes can be identified by lowering the H-matrix threshold on a higher rank-k model, which uses more specific terms to cluster genes together. Each strategy has its own merit and would likely produce different results with different datasets. It is important to point out that both strategies produced reasonably high precision and recall.
Table: Computational cost of the NMF multiplicative-update algorithm (columns: NMF rank k; number of iterations; CPU time in seconds; millions of operations per iteration).
In this study, we have developed a software environment called FAUN which implements nonnegative matrix factorization to extract gene associations from the biomedical literature. The tool was evaluated using three different gene sets as ground truth. Given a list of genes, FAUN allows researchers to not only hypothesize why genes might be related but also classify them functionally with promising accuracy. FAUN not only assists researchers to use biomedical literature efficiently, but also provides utilities for knowledge discovery which is particularly important for interpretation of discovery-oriented genomic data.
Project home page: https://grits.eecs.utk.edu/faun
Operating system: Linux 2.6.18-128.1.10.el5
Programming language(s): PHP 5.1.6 (cli), C++ (gcc version 4.1.2)
Source code restrictions: License needed
This work is supported by NIH-subcontract (HD052472) involving the University of Tennessee, University of Memphis, Oak Ridge National Laboratory, and the University of British Columbia.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 6, 2010: Proceedings of the Seventh Annual MCBIOS Conference. Bioinformatics: Systems, Biology, Informatics and Computation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.