Integrating phenotype and gene expression data for predicting gene function
© Malone et al. 2009
Published: 8 October 2009
Skip to main content
© Malone et al. 2009
Published: 8 October 2009
This paper presents a framework for integrating disparate data sets to predict gene function. The algorithm constructs a graph, called an integrated similarity graph, by computing similarities based upon both gene expression and textual phenotype data. This integrated graph is then used to make predictions about whether individual genes should be assigned a particular annotation from the Gene Ontology.
A combined graph was generated from publicly-available gene expression data and phenotypic information from Saccharomyces cerevisiae. This graph was used to assign annotations to genes, as were graphs constructed from gene expression data and textual phenotype information alone. While the F-measure appeared similar for all three methods, annotations based upon the integrated similarity graph exhibited a better overall precision than gene expression or phenotype information alone can generate. The integrated approach was also able to assign almost as many annotations as the gene expression method alone, and generated significantly more total and correct assignments than the phenotype information could provide.
These results suggest that augmenting standard gene expression data sets with publicly-available textual phenotype data can help generate more precise functional annotation predictions while mitigating the weaknesses of a standard textual phenotype approach.
With the advent the "omics technologies," researchers are faced with the problem of analyzing high throughput datasets. The Gene Ontology (GO) was initiated to provide a controlled vocabulary for describing the cellular location, biological process, and molecular function of gene products and to thus enable extraction of biological meaning from these large datasets . The terms in the GO are organized in a directed acyclic graph where directed edges represent relationships among terms. The primary relationships between terms in the GO are "part_of" and "is_a". Assignment of a GO term to a gene product is called annotation. GO annotation has become a "gold standard" in describing and the function of gene products and in supporting computational methods for analyzing high throughput datasets.
Assigning GO terms to gene products has now become a major bottleneck in the analysis of large datasets and has prompted the development of many computational approaches. The Gene Ontology Annotation (GOA) project  employs a pipeline which incorporates both manually curated and electronic approaches to annotate UniProtKB entries with GO terms. The manual assignment of annotations relies on curators searching through literature for evidence that a protein has a particular function. While this process can be slow and expensive, the results are typically very accurate and detailed. The electronic aspect of the pipeline incorporates results from a variety of sources including Swiss-Prot keywords, cross references to InterPro, and orthology mapping from a source species to a target species. Electronic annotation is particularly useful for the assignment of GO terms to the proteins of non-model organisms which likely would not receive manual annotations. Many other computational annotation pipelines for assignment of Gene Ontology terms have been developed. For example, DAVID  agglomerates data from many sources, both manually curated and computationally populated, into a single database. CLUGO  utilized homology search combined with clustering to assign terms to new sequences. Text mining is also frequently used to computationally predict gene functions with the goal of automating the manual process of annotating gene products from the literature. For example, Daraselia et al.  automatically extract functional annotations for mammalian proteins from Medline texts by building regular expression to find relationships between GO terms and proteins. Groth et al. [6–8] use text mining to associate phenotypes with genes by clustering term frequency-inverse document frequency (tf-idf) arrays. Functional predictions are inferred for all genes in a cluster for a particular annotation when at least half of the genes in the cluster had that annotation. In addition to sequence and text data, gene expression data is also often used in predicting functional annotations. For example, Virtual Gene Ontology (VIRGO)  constructs functional linkage networks (FLNs) in which nodes in a graph represent genes and edges indicate the Pearson correlation between the expression arrays of genes. Functional annotations are propagated across the network by treating the network as a discrete Hopfield network .
We present a new algorithm that combines text mining of phenotypic data with inference based on gene expression patterns to predict whether a particular gene should receive a particular GO annotation based on its similarity to other genes known to have the annotation. We demonstrate the utility of our approach with the well-annotated yeast genome where current annotations are considered the "true annotation." The algorithm will be most useful, however, for annotating gene products of less well studied organisms without large research communities.
Our algorithm first computes the similarity of all genes under consideration based on two types of data: phenotype extracted using text mining and gene expression profiles. A complete graph is then constructed where each vertex corresponds to a gene and the weights on edges represent the similarity of a pair of genes. Assignment of a GO annotation is determined for each gene based on the similarities to other genes with this annotation.
In order for the algorithm to predict functions associated with unlabeled genes, it must have existing labels to use as a training set. This algorithm uses current GO annotations as labels . The notation annotation(a, g) indicates that gene g has annotation a.
A similarity graph is used to integrate multiple data sources to predict whether a gene should receive a particular GO annotation. Similarity functions form the basis of the prediction algorithm. A similarity function takes as input a representation of two genes and returns a value between -1 and 1 reflecting the similarity between the two genes, where -1 represents high dissimilarity and 1 indicates high similarity. More specifically, a similarity function is defined for each data set. Thus, integrating n data sets requires n similarity functions. The functions need not be distinct. So, f: G × G → [0, 1], where f is a similarity function and G is the set of genes.
where v i and v j are arrays representing gene expression profiles for genes g i and g j , respectively.
where v i and v j are tf-idf arrays associated with genes g i and g j , respectively.
We consider this a lower bound and will not assign the annotation to gene g if it has a total similarity to other genes with annotation a lower than this threshold.
Because the prediction algorithm given in Figure 1 only tests for a single gene and a single annotation at a time, it implicitly uses a jackknifing, or leave-one-out, approach for prediction . In this approach, all of the genes except the one in question are used to make predictions about whether that gene should receive the annotation. A nice property of jackknifing is the small amount of bias it induces when considering the generalization of models . Metrics assessing the quality of the predictions can be computed by comparing the annotations predicted for gene g when the gene is "left out" with the annotations already assigned to that gene.
As mentioned previously, the algorithm uses a cutoff to distinguish between positive and negative predictions. In our experiments, we use a series of cutoffs. For example, with a cutoff value of 0.6, we would test if the interpolated similarity (interpolated_sim in the pseudocode in Figure 1) for a particular gene and annotation is greater than 0.6. If it is, then the gene is predicted to have the annotation. In all cases, annotations with fewer than five genes known to have the annotation are disregarded. For comparison, we have predicted annotations using a combination of gene expression and textual phenotype data, using gene expression data alone, and using phenotype data alone. All results referring to predictions made using the graph constructed from the combination of the data sets and taking the sum of the similarities will be referred to as results from the "integrated sum data set," and those from the graph constructed from the combined data sets and taking the maximum of the similarities will be referred to as results from the "integrated max data set." Results referring to predictions from the gene expression graph will be referred to as results from the "gene expression data set," and those from the phenotype data will be referred to as results from the "phenotype data set."
Yeast cell cycle-time point 0 min 2001-10-30_O.rfm Yeast W303 cells
Yeast cell cycle-time point 5 min 2001-11-09_0005.rfm Yeast W303 cells
Yeast cell cycle-time point 10 min 2001-11-09_0010.rfm Yeast W303 cells
Yeast cell cycle-time point 15 min 2001-11-09_0015.rfm Yeast W303 cells
Yeast cell cycle-time point 20 min 2001-11-09_0020.rfm Yeast W303 cells
Yeast cell cycle-time point 25 min 2001-11-09_0025.rfm Yeast W303 cells
Yeast cell cycle-time point 30 min 2001-11-09_0030.rfm Yeast W303 cells
Yeast cell cycle-time point 35 min 2001-11-09_0035.rfm Yeast W303 cells
Yeast cell cycle-time point 40 min 2001-11-09_0040.rfm Yeast W303 cells
Yeast cell cycle-time point 45 min 2001-11-09_0045.rfm Yeast W303 cells
Yeast cell cycle-time point 50 min 2001-11-09_0050.rfm Yeast W303 cells
Yeast cell cycle-time point 55 min 2001-11-09_0055.rfm Yeast W303 cells
Yeast cell cycle-time point 60 min 2001-11-09_0060.rfm Yeast W303 cells
Yeast cell cycle-time point 65 min 2001-11-21_0065.rfm Yeast W303 cells
Yeast cell cycle-time point 70 min 2001-11-21_0070.rfm Yeast W303 cells
Yeast cell cycle-time point 75 min 2001-11-28_0075.rfm Yeast W303 cells
Yeast cell cycle-time point 80 min 2001-11-28_0080.rfm Yeast W303 cells
Yeast cell cycle-time point 85 min 2001-11-29_0085.rfm Yeast W303 cells
Yeast cell cycle-time point 90 min 2001-11-29_0090.rfm Yeast W303 cells
Yeast cell cycle-time point 95 min 2001-11-29_0095.rfm Yeast W303 cells
Yeast cell cycle-time point 100 min 2001-11-29_0100.rfm Yeast W303 cells
Yeast cell cycle-time point 105 min 2001-12-06_0105.rfm Yeast W303 cells
Yeast cell cycle-time point 110 min 2001-11-29_0110.rfm Yeast W303 cells
Yeast cell cycle-time point 115 min 2001-11-29_0115.rfm Yeast W303 cells
Yeast cell cycle-time point 120 min 2001-11-29_0120.rfm Yeast W303 cells
Yeast cell cycle-time point 0 min 2001-05-03_0000.rfm
Yeast cell cycle-time point 10 min 2001-05-03_0010.rfm
Yeast cell cycle-time point 20 min 2001-05-03_0020.rfm
Yeast cell cycle-time point 30 min 2001-05-03_0030.rfm
Yeast cell cycle-time point 40 min 2001-04-11_0040.rfm
Yeast cell cycle-time point 50 min 2001-04-11_0050.rfm
Yeast cell cycle-time point 60 min 2001-04-11_0060.rfm
Yeast cell cycle-time point 70 min 2001-04-11_0070.rfm
Yeast cell cycle-time point 80 min 2001-04-11_0080.rfm
Yeast cell cycle-time point 90 min 2001-04-11_0090.rfm
Yeast cell cycle-time point 100 min 2001-04-11_0100.rfm
Yeast cell cycle-time point 110 min 2001-04-11_0110.rfm
Yeast cell cycle-time point 120 min 2001-04-11_0120.rfm
As previously mentioned, the identifiers for the gene expression data do not exactly correlate to single genes. Affymetrix provides a bridge which maps between expression identifiers and Entrez gene symbols . Not all expression identifiers mapped to a gene symbol, and others mapped to more than one gene symbol. Only expression identifiers which mapped to a single gene symbol were retained. All other expression data was discarded. A total of 6251 expression identifiers were present in 39 expression runs. After mapping identifiers to Entrez gene symbols, 3169 entries remained. Therefore, each of the 3169 genes had an associated 39-dimensional array of expression values.
The PhenomicDB http://www.phenomicdb.de/ incorporates data from many different data sources about a wide variety of organisms, including human, yeast, mouse, and many others . The database provides a large number of searching options, including searching by Entrez gene symbols. For each of the gene symbols identified with gene expression values, PhenomicDB was consulted for phenotypes associated with that gene symbol in yeast. The data was downloaded on November 23, 2008.
In general, PhenomicDB contains multiple phenotypes for each gene symbol. Each phenotype is a textual description. To form a single document for each gene symbol, all of the phenotypes are simply concatenated. However, this plain text representation of knowledge does not easily lend itself to learning approaches.
The document associated with each symbol was transformed into a tf-idf array. The doc2mat utility from the CLUTO package  applies a stop word list and the Porter stemming algorithm to produce a term frequency description of each document . A stop word list is used to remove common, uninformative words, such as articles and prepositions, from the documents. The stemming algorithm is used to remove prefixes and suffixes from words. The term frequency and inverse document frequency values for each term are multiplied to produce a tf-idf array for each document. A total of 6541 distinct terms were discovered after pruning and stop words were applied. Hence, each of the tf-idf arrays had 6541 dimensions. Each dimension in the array corresponds to one unique term. The value of each dimension is a fraction in which the numerator is the number of times the term corresponding to that dimension occurs in the document and the denominator is the total number of documents in which the term appears. Because the numerator cannot be less than 0 and the denominator cannot be less than 1, the resulting values are always nonnegative.
Our algorithm utilizes GO terms as labels. Fortunately, the file provided by Affymetrix which provides the mapping between expression identifiers and gene symbols also includes all GO terms associated with each gene symbol . A total of 3,466 distinct GO annotations were identified in the Affymetrix file. A total of 39,680 annotation assignments were defined between the GO annotations and the 3,169 genes.
When considering the correctness of predictions, two different approaches were used. In the first case, only exact annotation matches are considered correct. For example, if predicting that gene g has annotation a, the prediction is considered a true positive only if g is labelled exactly with a. Otherwise, the prediction is a false positive. These are referred to as "exact" predictions. However, the Gene Ontology enforces the "true path rule" stating that "the pathway from a child term all the way up to its top-level parent(s) must always be true" . This means that if annotation a is predicted for gene g and the gene has been previously assigned a GO term that is a child of a, the assignment of a to the gene g is also correct. Therefore, we use an alternate method of computing the number of correct predictions where, if predicting that gene g has annotation a, the prediction is considered a true positive if g is labelled exactly with a or with any child term of a. The second case is referred to as "generalized" predictions.
The results show that, as expected, the generalized scoring method yields higher precision, recall and F-measure values than the exact method. One can argue that the exact scoring method is unnecessarily strict and somewhat arbitrary because it requires the automated method to learn GO terms at exactly the same level as those assigned by expert annotators. In general, automated procedures tend to assign GO terms at higher levels than can be obtained by expert biocurators reading the literature. The "true path rule" of the Gene Ontology guarantees that the annotations scored as correct by the generalized scoring method are truly correct. The weakness of this scoring method is that more general terms are less informative than more specific terms. Although the precision values obtained using generalized scoring are substantially higher than those obtained with exact scoring, precision is still quite low. It should be noted that some of the GO term assignments scored as incorrect, may indeed be correct. Although yeast is one of the best annotated model organisms, annotation of yeast gene products is not complete and new annotations are constantly being added. In some cases the automated algorithm may have "learned" a more specific term than is currently assigned. Another factor contributing to the low precision is the type of gene expression data used. Because all of the experiments concern cell cycle, many of the genes do not have informative expression profiles. Including other types of gene expression data could help alleviate this problem and increase precision. The higher precision scored obtained by the integrated approach indicates that this approach allows one to take advantage of the large number of assignments that can be made based on gene expression data while at the same time gaining the precision afforded by the phenotype data.
In summary, the integrated approach results in nearly as many annotation predictions as the gene expression data, as indicated in Figure 3, but still maintains much of the precision of the phenotype data set, as shown in Figure 5.
The integrated methods do produce biologically relevant predictions which are not made by the individual data sets. For example, the Saccaromyces Genome Database http://yeastgenome.org/ indicates that the gene PDR11 is a "multidrug transporter involved in multiple drug resistance." While it is annotated with GO:0015918 (sterol transport) and GO:0042626 (ATPase activity, coupled to transmembrane movement of substances), it is not explicitly annotated with any functions related to multidrug transport. The MAX integrated data set predicts that it should be annotated with "multidrug transport," GO: 0006855. The gene expression data set alone is not able to make this prediction. As another example, the MAX integrated data set predicts that SPT21 should be annotated with GO: 0006348 "chromatin silencing at telomere." The Saccaromyces Genome Database description of SPT21 states that the gene is involved in telomere maintenance; however, it is not annotated with any GO molecular functions. This prediction is not made when using only the phenotype data set. These examples demonstrate that not only can the prediction algorithm make novel predictions consistent with biological knowledge, but also that integrating the data types can result in predicted annotations that either individual data set alone would fail to identify.
This paper presents an algorithm that incorporates both gene expression data and textual phenotype data to predict the function of genes. This graph-based approach generates a complete graph weighted with gene-gene similarities. It then makes predictions based on the weights connecting the nodes. The results indicate that integrating the gene expression with the textual phenotypes produces more precise annotations than predictions based upon either type of data alone.
The integrated approach outperformed the gene expression-only graph in the precision metric; it also tended to outperform the textual phenotype graph in the recall metric. Furthermore, the integrated similarity graph produced many more correct annotation assignments than the phenotype graph alone. We believe that this integrated approach can augment the usefulness of standard gene expression data by facilitating annotation predictions with increased precision and an increased F-measure deeper within the GO hierarchy.
Future work could focus on development of better methods to integrate the data sets. For example, rather than equally weighting the gene expression and textual data, methods could be developed for assigning different weights to different data types when determining the edge weights. A less naïve integration method could be used to map the correlation and cosine values to more meaningful numbers, such as p-values.
This research was funded in part by grant DEFG3606G086025 from the Department of Energy to the Sustainable Energy Research Center at Mississippi State University and NSF EPSCoR grant EPS-0556308. We thank Dr. Bindu Nanduri for help in assessing the biological validity of predictions.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.