Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks
© Wolfe et al. 2005
Received: 22 June 2005
Accepted: 14 September 2005
Published: 14 September 2005
Skip to main content
© Wolfe et al. 2005
Received: 22 June 2005
Accepted: 14 September 2005
Published: 14 September 2005
Biological processes are carried out by coordinated modules of interacting molecules. As clustering methods demonstrate that genes with similar expression display increased likelihood of being associated with a common functional module, networks of coexpressed genes provide one framework for assigning gene function. This has informed the guilt-by-association (GBA) heuristic, widely invoked in functional genomics. Yet although the idea of GBA is accepted, the breadth of GBA applicability is uncertain.
We developed methods to systematically explore the breadth of GBA across a large and varied corpus of expression data to answer the following question: To what extent is the GBA heuristic broadly applicable to the transcriptome and conversely how broadly is GBA captured by a priori knowledge represented in the Gene Ontology (GO)? Our study provides an investigation of the functional organization of five coexpression networks using data from three mammalian organisms. Our method calculates a probabilistic score between each gene and each Gene Ontology category that reflects coexpression enrichment of a GO module. For each GO category we use Receiver Operating Curves to assess whether these probabilistic scores reflect GBA. This methodology applied to five different coexpression networks demonstrates that the signature of guilt-by-association is ubiquitous and reproducible and that the GBA heuristic is broadly applicable across the population of nine hundred Gene Ontology categories. We also demonstrate the existence of highly reproducible patterns of coexpression between some pairs of GO categories.
We conclude that GBA has universal value and that transcriptional control may be more modular than previously realized. Our analyses also suggest that methodologies combining coexpression measurements across multiple genes in a biologically-defined module can aid in characterizing gene function or in characterizing whether pairs of functions operate together.
From the very start of the high-throughput microarray expression revolution it was understood [1, 2] that guilt-by-association was a powerful heuristic to both explain why genes might have correlated expression in a set of experiments and infer what might be the function of a gene coexpressed with genes of better known function. As gene expression data have increased in numbers and quality, a variety of investigations have been leveraged from this GBA heuristic. Analyses of gene coexpression [3–7] have demonstrated that clusters with similar overall expression are often enriched for genes with similar functions, consistent with the hypothesis of modularly-behaving gene programs, where sets of genes are activated in concert to carry out functions.
GBA has also been exploited highly successfully by investigators who have used a priori determined modules or gene sets and assess if these sets have statistically significant overrepresentation in the genes changed in groups of arrays [8–15]. By exploiting the insight that subtle but coordinated changes in expression can be detected by combining measurements across multiple members of a functional module, these focused studies have successfully found specific modules that are important in diabetes , aging , and cancer [10, 11, 14, 15], or assigned functions to previously uncharacterized genes in yeast [8, 9]. These approaches essentially integrate two frameworks of viewing gene function , one framework reflected in module sets that are derived from prior biological knowledge and another framework from the characteristics of gene expression data.
These studies reflect two bidirectional uses of GBA: either using coexpression to define the members of functionally related sets or using sets to define function of coexpressed genes. That is, the first uses prior gene expression data and the second uses prior biological knowledge. We extend these approaches, taking the a priori framework of knowledge available in Gene Ontology (GO)  to systematically explore the breadth of GBA across a large and varied corpus of expression data to answer the following questions. 1) To what extent is the GBA heuristic broadly applicable to the transcriptome and GO? 2) In the GBA heuristic, how well does coexpression inform function and vice versa? 3) Which GO heuristics are the most interrelated as measured by a GBA metric?
The testbed for evaluating the extent and organization of GBA were five coexpression networks, constructed using 8341 microarrays representing a variety of tissue types and conditions. For each network we determine whether coordinated coexpression can be detected across multiple genes of each GO-defined module. Our approach is better suited than clustering to systematically examine GBA because it allows for pleiotropy: it does not assign genes to a single function or a single cluster but rather calculates a probabilistic score between each gene and each GO category. This approach better captures complex interrelationships , such as genes that code for proteins with multiple functions . We discover that there is a ubiquitous signature of functional association in all of the coexpression networks in that the genes in a module often demonstrate higher-than-expected numbers of coexpressed genes belonging to that same module.
To further illustrate the breadth of GBA, we present the extent of which coexpression implicates members of three sets of genes that are usually thought of as belonging to a very specific biological context: skeletal development, neuropeptide receptor activity, and feeding behavior. We show that these Gene Ontology categories, as well as hundreds of other categories, are associated with coordinated expression patterns across the variety of tissue types and conditions in our data.
We constructed five different coexpression networks (four single-species networks and one unified multi-species network), which are graphs where genes are nodes and the edges are represented by values reflecting the significance of coexpression between a pair of genes. We selected mammalian organisms for which extensive and diverse microarray data were available on four Affymetrix platforms in the Gene Expression Omnibus (GEO): Homo sapiens (HG-U95A and HG-U133A), Mus musculus (MG-U74A), and Rattus norvegicus (RG-U34A). Orthologs between these organisms were obtained from HomoloGene and from this information, 6624 "metagenes" (hereafter referred to as genes) were defined consisting of sets of orthologous genes across at least two different organisms on the chosen microarray platforms. The multi-species network integrates the 8341 microarrays from all four Affymetrix platforms into a unified coexpression network, using order statistics  to assign coexpression P-values (P c ) between all possible pairs of genes. Previous work  suggested that by using the signal of evolutionary conservation in a multi-species coexpression network the effect of noise is reduced and the significance of functionally important gene pairs is enhanced, although this approach is only valid when homologous genes share functionality. The four single-species coexpression networks were calculated from Pearson correlation coefficients between genes, in each case using only data from one of the four Affymetrix platforms.
With a network of coexpression relations computed between pairs of genes, and networks of coexpression enrichment relations computed between all pairs of GO categories and genes, we evaluated how reliably coexpression enrichment P e -values for a GO category identify genes annotated with that function. Each GO category contains a set of specific P e -values to score relations to all genes. Taking one GO category at a time, we calculated the true and false positive rates for identifying genes annotated with that GO category at threshold P e -values throughout its range. We plotted these true and false positive rates on Receiver Operating Characteristic (ROC) curves  for each GO category. If there were a threshold P e -value below which all genes are annotated with the correct function, and above which no genes are annotated with the correct function, then the area under such an ideal ROC curve would be 1. An area of 0 would mean that identifying annotated genes using P e performs perfectly incorrectly, and an area of 0.5 indicates no overall identification efficiency using P e (performance equivalent to random chance). Thus the area under an ROC curve for each GO category is a metric for GBA, scoring how well coexpression enrichment P e -values perform as a "self-diagnostic" for the genes annotated to a category.
The results were tested by taking the multi-species coexpression network and applying the same analysis with randomized GO sets. The population of self-diagnostic ROC areas for the randomized GO sets is centered at 0.5 (Figure 3d). The case of a randomized network, with P c -values permuted between gene pairs, also yields a distribution that is centered at 0.5 (Figure 3e). Thus the upward shift of the true distributions is unlikely to occur by chance.
We tested whether the ROC areas were correlated with other factors (see additional file 2: supplementary methods), but found that the correlations were not strong, ranging between +/-0.2. We tested whether the type of evidence used to construct a GO set, given in the GO evidence codes, has any relation to the ROC areas; whether there was any correlation between the expression levels in a GO category and the ROC areas; whether there was a correlation between the ROC areas and the average number of GO annotations for the genes in each set.
To examine which GO categories are the most interrelated, we test whether coexpression enrichment for one GO set can be used to assign genes to a different GO category ("cross diagnostics"). These analyses score how well different GO modules tend to be coexpressed together, such as whether coexpression enrichment for the mitochondrion module is a characteristic of the oxidative phosphorylation module (the multi-species cross-diagnostic ROC area is 0.94 for this case). In one sense, these scores indicate the strength of coexpression links in a network where the graph nodes are GO categories, rather than genes. However, a complication is that pairs of gene sets may significantly overlap in their annotated genes. Therefore, for the multi-species network we present the systematics between pairs of GO categories that are together on the same graph, where GO relationships are defined and provide additional context for interpreting the results. Gene Ontology organizes biological processes, molecular functions, and cellular components separately on three directed acyclic graphs. A parent GO category has a set of more specific children (from those GO categories just one step below a parent on a graph) and more specific descendents (from all GO categories in the entire subgraph below a parent). To test the results, we apply the same analysis with randomized GO sets that are constructed in a manner that mimics the GO mappings.
However, the distribution (Figure 4c) does display longer tails than for randomized GO sets (Figure 4d), indicating how there is a nonrandom tendency for some of these modules to either be highly coexpressed together (high areas) or not highly coexpressed together (low areas). (Note that increasing the scale in Figure 4d does not reveal any additional detail in the tails of the distribution.) In addition, some subgraphs of GO show uniformly high cross diagnostics, such as the subgraphs under immune response and cell cycle (Figure 4e-f), where there is a signal that modules from the different sub-categories are often coexpressed together in the types of tissues in our analysis.
Of the 812,702 possible cross diagnostic GO pairings, only a small percentage are related by coexpression (e.g., 5% have ROC areas greater than 0.7). As shown in the above analyses, at least some of the positive relationships are consistent with the known biology reflected the Gene Ontology hierarchy.
The observed scores appear to reflect the behavior of the transcriptome rather than being dominated by the mix of samples in each of the networks or the choice of microarray platform. Though the GEO data in our study originated from many laboratories with inhomogeneous protocols, our analyses demonstrate how the extent of GBA for each GO module and the interrelatedness between GO module pairs nonetheless have high reproducibility in the networks. The reproducibility between the entire multi-species and single-species networks is lower than the reproducibility of ROC areas (see additional file 2: supplementary methods), demonstrating how a functionally-based analysis enhances the similarity of the signals between different networks. Our results are consistent with a recent study of expression variability across different platforms and laboratories  that found highest reproducibility when the analysis was based on biological themes defined by GO.
Our study provides an investigation of the functional organization of five coexpression networks using data from three mammalian organisms. This method integrates information from two different frameworks of viewing gene function , one framework essentially from the manual and subjective curation of evidence in the literature into the Gene Ontology hierarchy and another framework from a probabilistic analysis of expression datasets. Across all five networks, we find a signature that coexpression enrichment predicts coannotation across GO categories, and thus the guilt-by-association heuristic is broadly applicable. Although for gene pairs within a specified GO set the coexpression value may only be weak, by combining coexpression measurements across multiple genes in the module, there is a systematic and reproducible signature of functional association. Because the genes in a particular module demonstrate higher-than-expected numbers of coexpressed genes belonging to that same module, the values for gene set coexpression enrichment tend to be predictive of gene function.
It was unexpected that a simple test based on coexpression would have value in assigning genes to so many different types of GO categories. While some GO annotations may themselves have been defined on the basis of expression, there are also many GO annotations that did not necessarily employ expression results, such as the annotations in the cellular component domain, where the population of ROC areas still displays better-than-random ability to correctly identify the genes annotated to GO categories. That some GO categories score better than others likely reflects the characteristics of underlying biological behavior, as the scores of GO categories are reproducible across all of the coexpression networks. This study demonstrates how using coexpression enrichment to assign a probabilistic score between genes and functions can add information about gene function. We note that an analogous data mining approach to ours was previously applied by Lamb et al.  to discover that C/EBPβ was a mechanism of cyclin D1 action, using a single module gene set of cyclin D1 target genes. Our more comprehensive study of 902 GO module gene sets suggests this type of approach should also be successful for other biological systems. Our results are in agreement with a recent study  that used a support vector machines method on mouse coexpression data and found that genes in many GO biological process categories could be identified as being in those categories. Our results disagree with low degree of GBA found by Clare and King , who clustered yeast microarray data and found the clusters did not in generally agree with functional annotation classes. One explanation for this disagreement may be that the use of clustering by Clare and King  does not reveal the more subtle signal of GBA that we discover using gene set coexpression enrichment. Another difference may be that our larger and more comprehensive dataset (8341 Affymetrix mammalian microarrays) is better suited to identify GBA.
Our strategy demonstrates that the functions of a cell operate on an exquisitely coordinated level and that the modular character of cell biology  is evident across the biologically variable microarray data in our analysis. Within the large scope of the considered GEO samples and GO categories, we find that the guilt-by-association identification of gene function on the basis of expression has universal value. This result provides optimism that high-throughput measurements of gene expression and community-based gene annotation efforts will continue to demonstrate synergy in the collective investigations of cellular physiology and understanding of human diseases.
Genes from one organism were associated with their orthologous counterparts in other organisms using HomoloGene (downloaded on June 22, 2004). 6624 "metagenes "were defined as sets of orthologs across at least two organisms with available microarray probes, using cases where no more than one gene was found for each organism. Microarray probes for orthologs were assigned into metagene probe groups. Because some genes have multiple probes on an array, for each of the 6624 metagenes, we considered all combinations of probes across the microarray platforms (see additional file 2: supplemental methods).
Microarray data consisted of 8341 arrays from 4 different platforms downloaded from NCBI Gene Expression Omnibus. 2179 arrays were Affymetrix HG-U95A (version 2), 2438 arrays were Affymetrix HG-U133A, 2216 arrays were Affymetrix MG-U74A (version 2), and 1508 arrays were Affymetrix RG-U34A. We normalized expression data on each array by converting values to rank percentile. For the multi-species network, for each probe group we computed Pearson correlation coefficients between other probes on a platform and then ranked these other probes according to their correlations. For each distinct pair of metagene probe groups, a probabilistic method based on order statistics was used to evaluate the probability of observing the ranks by chance (see additional file 2: supplemental methods). This generates coexpression P-values (P c (m i , m j )) between pairs of metagenes. A unique P c -value between metagene pairs is selected based on lowest P c -value obtained from all of the analyzed probe groups, with the philosophy that when coexpression is present, more significant P c -values will be associated with more accurate probes. Single-species coexpression networks for each of the four different platforms were calculated from Pearson correlation coefficients between gene pairs, limited to those genes also analyzed in the multi-species network and selecting the highest correlation coefficient obtained for the cases where multiple probes are available for gene pairs.
For each network, gene sets were compiled for 902 GO categories with at least 20 genes in the multi-species network. The graph relationships were obtained from the Gene Ontology MySQL database, downloaded on September 24, 2004. The annotations of genes to GO categories were taken from LocusLink, downloaded on September 27, 2004. Gene Ontology organizes biological processes, molecular functions, and cellular components separately on three directed acyclic graphs, with more general parent categories having subgraphs of more specific descendent categories. The GO true path rule is that annotation to a category implies annotation to all parents and gene products are conventionally annotated just to the most specific levels of the ontology. We associate a gene to a GO set if it is annotated with that GO category in human, mouse, or rat or if it is annotated with a descendent of that GO category. See additional file 2: supplemental methods, for the construction of the randomized GO sets.
For each gene m i , all other linked genes are ranked by the most significant value obtained for coexpression. We use the hypergeometric distribution to calculate a coexpression enrichment P-value (P e (m i , g j )) for whether GO set g j was significantly overrepresented in the top 250 genes with most significant coexpression values to m i (Figure 1). Similar results were obtained for cases where the number of top ranked genes selected was different (decreased to 100 or increased to 500) or where an enrichment score was based on a normalized Kolmogorov-Smirnov statistic .
A self-diagnostic ROC curve tests whether the P e (m i , g j )-values for GO set g j can distinguish genes associated to g j . An ROC curve is constructed for a range of closely spaced P e (m i , g j )-value cutoffs. At a given cutoff, the true-positive rate (sensitivity) is calculated as the number of genes associated to GO set g j with P e (m i , g j )-values below the cutoff divided by the total number associated to g j ; the false-positive rate (1-specificity) is calculated as the number of genes not associated to a GO set g j with P e (m i , g j )-values below the cutoff divided by the total number not associated to a GO set. The area is estimated by trapezoidal integration and 95% confidence intervals are also calculated . Cross-diagnostic ROC areas are calculated as above, except we test whether the P e (m i , g j )-values for g j can distinguish genes associated to a different GO set g k .
This work as supported by grants from the the NIH National Center for Biomedical Computing (U54 LM008748), National Library of Medicine (2TA5 LM07092-11 and 5T15 LM07092), National Institute of Diabetes and Digestive and Kidney Diseases (K12 DK63696 and R01 DK62948), the Harvard-MIT Division of Health Sciences and Technology, and the Lawson Wilkins Pediatric Endocrine Society.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.