Predicting gene ontology from a global meta-analysis of 1-color microarray experiments
© Dozmorov et al; licensee BioMed Central Ltd. 2011
Published: 18 October 2011
Skip to main content
© Dozmorov et al; licensee BioMed Central Ltd. 2011
Published: 18 October 2011
Global meta-analysis (GMA) of microarray data to identify genes with highly similar co-expression profiles is emerging as an accurate method to predict gene function and phenotype, even in the absence of published data on the gene(s) being analyzed. With a third of human genes still uncharacterized, this approach is a promising way to direct experiments and rapidly understand the biological roles of genes. To predict function for genes of interest, GMA relies on a guilt-by-association approach to identify sets of genes with known functions that are consistently co-expressed with it across different experimental conditions, suggesting coordinated regulation for a specific biological purpose. Our goal here is to define how sample, dataset size and ranking parameters affect prediction performance.
13,000 human 1-color microarrays were downloaded from GEO for GMA analysis. Prediction performance was benchmarked by calculating the distance within the Gene Ontology (GO) tree between predicted function and annotated function for sets of 100 randomly selected genes. We find the number of new predicted functions rises as more datasets are added, but begins to saturate at a sample size of approximately 2,000 experiments. For the gene set used to predict function, we find precision to be higher with smaller set sizes, yet with correspondingly poor recall and, as set size is increased, recall and F-measure also tend to increase but at the cost of precision.
Of the 20,813 genes expressed in 50 or more experiments, at least one predicted GO category was found for 72.5% of them. Of the 5,720 genes without GO annotation, 4,189 had at least one predicted ontology using top 40 co-expressed genes for prediction analysis. For the remaining 1,531 genes without GO predictions or annotations, ~17% (257 genes) had sufficient co-expression data yet no statistically significantly overrepresented ontologies, suggesting their regulation may be more complex.
As the availability and abundance of microarray data has grown across the major microarray data repositories, such as Gene Expression Omnibus (GEO) , ArrayExpress , and the Stanford Microarray Database  to encompass hundreds of thousands of experiments now, there is an increased interest in methods of mining this data. One approach has been to conduct what might be termed a global meta-analysis, which differs from traditional meta-analysis of experimental datasets that are normally undertaken to increase sample size by using highly similar experimental conditions and tissue types. The motivation behind a meta-analysis of heterogeneous data is to focus on gene-gene transcriptional patterns rather than experiment-experiment patterns.
In any given microarray experiment, there are likely many different processes going on at once when comparing experiment versus control. For example, even when one condition, such as oxidative stress, is induced, cells experience changes in multiple processes such as enzymatic activity, chromatin structure, apoptosis-related signaling, antioxidant production, etc. Identifying gene pairs that are consistently differentially expressed with each other across many different conditions permits an analysis of which of the many genes involved in these different processes are specific to each other outside of individual experimental conditions (e.g., apoptosis can be induced in response to many different stimuli).
The generation of high-throughput data provides the opportunity to examine patterns across datasets to identify correlations . Analysis of co-expression networks to try to identify regulatory patterns and modularity in co-expression dates back to early studies in yeast [5, 6] and soon thereafter in higher organisms . One of the motivations in identifying these patterns is that they can be used to predict gene function [8, 9], as well as potential roles for non-coding elements . Different methods of identifying gene-gene correlations are used, of which Pearson’s has been widely used[11–14], but also other patterns such as dividing patterns into parallel/anti-parallel  and Boolean quadrants  have been useful. It’s becoming clear that methods to study gene-gene co-expression patterns across unrelated experiments can tell us about the underlying genetic regulation, which has broad implications. For example, by establishing what could be considered “normal” gene-gene regulation one can than try to detect abnormal or disease-related disturbances . In particular, with approximately 34% of human genes still without known function [15, 18], it is important to develop methods to accurately predict function. The situation in human is not unique – the fraction of still uncharacterized genes ranges from ~38% in mouse  to ~17% in perhaps the best-studied eukaryotic organism of all, yeast .
Our initial study used 3,600 human 2-color arrays, which yielded a 34% direct hit rate (i.e., the predicted GO category was the same as the known GO category) . On the surface, this may appear to be a low accuracy, but because GO annotation lags the literature and our knowledge of genes is biased towards a select few [15, 18], it is difficult to estimate the false-positive rate in this type of analysis. That is, it is not clear how many predicted functions are simply not known. For example, we have experimentally examined some of the predicted functions made from this 2-color analysis and have found them to be generally accurate (manuscripts in preparation) and have published the results of one of these studies . In this initial study, gene expression was categorized into genes that were expressed in parallel (i.e., tended to be up-regulated and down-regulated together), anti-parallel (i.e., when one was up-regulated, the other tended to be down-regulated and vice versa) and not differentially expressed. Here, we first want to establish whether or not similar parameters can be used for 1-color arrays which do not display differential expression like their 2-color counterparts, but merely expression or non-expression, an important technical distinction . Second, we want to know how the set size of genes used for inference affects precision and recall (e.g., in Figure 1, 3 genes are shown as being used to infer function). Finally, we want to know what the relative contribution of the ranked genes is to the process of inference – that is, do the top genes contribute to more accurate predicted functions.
From NCBI’s Gene Expression Omnibus (GEO) database , GEO Dataset (GDS) files were obtained. Analysis was restricted to datasets that came that contained the following headers: dataset_sample_organism=“homo sapiens”, dataset_type=”nucleotide” or “gene expression”, dataset_channel_count=”single” or “1”, and dataset_value_type=”count”. This ensures only raw data from one-channel human microarray samples were processed. All dataset annotation, gene annotation, GO annotation data were stored in an MS Access database and queued from custom programs written in Visual Basic 2010.
Gene Ontology annotations were downloaded on 3/14/2011 from NCBI . Only human (Taxonomy ID 9606) gene to GO mapping was used in the current study. An enrichment of GOs within a set of co-expressed genes was calculated using chi-square test. The Gene Ontology tree file in Open Biomedical Ontology (OBO) format v1.2 was downloaded from Gene Ontology  on 3/14/2011.
Datasets with mean or median values of 0 were excluded, as well as those with mean to median ratio <= 1. Data for each experiment were sorted and distribution of expression values around maximum was examined. No more than 0.1% of maximum expression showed abnormally high expression values, due to either technical of software errors. These values were selected, a minimum (flooring) value among then wad identified, and all these values were set to this flooring value. As such, abnormally high expression values were brought to a reasonably high expression level. Each experimental dataset was then adjusted to fit within 0 to 10,000 range to make data ranges comparable across datasets. Distributions of the data fit within 0-10,000 range were quantile-normalized , which makes them equal and amenable for defining of the noise threshold used in the subsequent analysis. A total of 2,325 GDS files were downloaded, out of which 587 contained raw gene expression data from human single-channel microarrays, while others contained data for other organisms or from two-color microarrays.
Probe mapping was done by mapping gene names and accession numbers downloaded from NCBI  to unique Entrez ID identifiers. Within each experiment for multiple probes mapping to the same gene, the maximum expression value was used under the assumption that they may reflect different exons and the most intense signal is likely to come from constitutive exon. Probe identifiers were mapped to Entrez IDs, totaling 20,813 genes. All data were assembled in a matrix with columns representing EntrezIDs and rows containing expression values for a given experiment.
Parameters used for prediction analysis and their properties
Information it gives
Total = Frequency of gene pair co-expression
Total number of times a gene pair is expressed, excluding missing values
Some genes are expressed more frequently than others
MIM = Mutual Information Measure
Specificity of co-expression
When Total is small, MIM can be artificially high
R2 = Pearson’s Correlation Coefficient
Correlation between gene pair expression levels
Will detect global, but not conditional, co-regulation. Also, non-expression is far more common than expression, biasing R2 (e.g., 2 genes never expressed will show perfect correlation)
P = Purity
When co-expression “behavior” is described in terms of discrete categories, purity reflects a relative breakdown of behavioral observations
Information can be lost when discretizing a continuous variable
Equation 1: SCORE = log2MIMpara * Purity2 * Total * R2
To test how the number of microarray experiments and the number of co-expressed genes used for predictions influence prediction power for a set of randomly selected 100 genes having from 1 to 20 officially annotated ontologies. To determine how well the “guilt by association” is working for a given query gene, we first rank a set of co-expressed genes that we believe should be most representative of the function of the query gene. Then, these genes are analyzed for what GO categories they have in common by GO enrichment analysis. The enriched categories are the predicted GO categories, which are then compared to the known (annotated) GO categories to see how far they are on the GO tree from the known categories. The minimum distance to the nearest GO category is used in cases where more than one GO category exists for a gene (which is common). True positive hits (TP) were defined as the number of predicted GO categories correctly inferred for a query gene (defined as a distance of zero between the predicted and known GO category). False positives (FP) were predicted GO categories with a distance > 0. False negatives (FN) were defined as the total number of GO categories assigned to a target gene that were not predicted.
Recall, precision, and F-measure were calculated as follows:
Equation 2: Recall = TP/(TP+FN)
Equation 3: Precision = TP/(TP+FP)
Equation 4: F-measure = 2*Precision*Recall/(Precision+Recall)
To explore how predicted ontologies for a given gene within an acyclic GO graph correspond to known ontologies we calculated the shortest distance between predicted and annotated ontologies for each gene, as was done in our previous work . Traversing the GO tree to identify the shortest path between GO categories using “is_a” relationships was conducted using Dijkstra’s algorithm, which is part of the QuickGraph package v3.0 (http://quickgraph.codeplex.com/). Venn diagram was build using Venny tool .
Out of 20,813 genes, 5,720 (27.5%) did not have GO annotations. We examined several different ways of predicting functions of these unannotated genes from annotated gene-gene co-expression sets. There are several different parameters that can be used for selection of co-expression sets, either individually or in combination (Table 1).
Besides calculating overall F-measure we investigated recall and precision in each of the three separate GO namespaces. Using top 40 co-expressed genes predictions for GO annotations in “cellular component” namespace performed best (F-measure=0.16), followed by predictions in “molecular function” namespace (F-measure=0.11). Although the “biological process” namespace had the largest number of predicted GO annotations, general performance was not as high (F-measure=0.07).
Two-color microarrays contain information about co-repression of gene pairs, but in 1-color arrays the measurement focuses on expression rather than differential expression and information is only obtained on co-induction of gene pairs. Non-expressed gene pairs in 1-color arrays are normally uninformative since most genes are not expressed at any given point in time. The ability to use a large set of heterogeneous microarray datasets as a means of studying gene-gene co-expression and predicting function has been demonstrated by us [15, 25] and others [6, 7, 24] previously, but much of the details underlying why and how the approach works has not been explored.
Our metrics for selection of co-expressed genes (Equation 1) have been validated in the previous work , however, for one-color microarray data addition of Pearson’s correlation coefficient increased the performance of predictions. The main limitation of our work is incomplete GO annotations, which led to relatively low F-measure. Also, other metrics for finding similarities among GO annotations  can be used to improve statistical calculations. We expect the quality of predictions to increase as new GO annotations will be added.
We observed the F-measure of predictions begin to taper off around 2,000 experiments analyzed (Figure 6). While rarely expressed transcripts will likely benefit from more experiments analyzed, this finding is important because it tells us that, for most genes, there is more than enough experimental data to make predictions. It is also important because, as more data is accumulated on non-coding RNA (ncRNA) expression, it suggests approximately how many experiments we will need to begin correlating ncRNA expression with gene function. Currently, among the GSE files deposited in GEO there were 370 annotated as “non-coding RNA profiling by array” at the time of this writing. Interestingly, precision remained constant independently of the number of experiments used (Figure 6), suggesting overall that only a small amount of co-expression data is sufficient to garner accurate predictions for genes that are included within these experiments, but that more data is required to produce predictions for transcripts that are more rarely expressed or are included within few platforms, resulting in a more gradual increase in recall.
Another important observation is that precision drops quickly once genes outside the top ranked group are chosen. Although not unexpected, it is useful to know that most of the informative gene pairs are within a relatively small group or module. This suggests that it is important to algorithmically identify and characterize these modules to maximize our ability to infer associations from these co-expressed gene sets.
Interestingly, we observed an overall trend that genes tend to co-express with other genes from the same subfamilies. For example, among top 20 genes best co-expressed with LILRB1 (leukocyte immunoglobulin-like receptor, subfamily B member 1) there were LILRB3, LILRA6, LILRA3 and others alike, such as PILRA (paired immunoglobulin-like type 2 receptor alpha). This not only further strengthens “guilt by association” principle but also explains why some poorly annotated genes did not have any predictions. A poorly annotated gene tends to be co-expressed with other poorly annotated genes. As such, the top co-expressed genes without ontologies would be discarded (because they cannot be used to infer GO category), while annotated genes further down the list already don’t have enough precision for functional prediction (see Figure 3).
Precision and recall differed for different GO namespaces. Seemingly counter intuitively, “cellular component” has the highest precision/recall rates, and “biological process” has the lowest. An explanation lies in the total number of annotations in a given category. “Cellular component” has the lowest number of annotations and subcategories, is easier to establish experimentally than either molecular function or biological process, and thus the rate of true positives would be higher. “Biological process” on the other hand has largest number of annotations and, other parameters equal, the rate of direct hits in relation to the total number of annotations would be lower. “Molecular function” had intermediate number of annotations and intermediate precision/recall rates.
A global analysis of gene-gene co-expression behavior is a promising means of predicting gene function, particularly for the third of human genes that are still uncharacterized and for those that are only sparsely characterized. Our examination of how much data is needed to effectively conduct these analyses and how different parameters affect the precision and recall of inference will help enable this approach.
We would like to thank the NIH for their support of this work via NIH grants #5P20RR020143 and #R21ES019248.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 10, 2011: Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S10.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.