An efficient algorithm to explore liquid association on a genome-wide scale
© Gunderson and Ho; licensee BioMed Central Ltd. 2014
Received: 3 July 2014
Accepted: 30 October 2014
Published: 28 November 2014
The growing wealth of public available gene expression data has made the systemic studies of how genes interact in a cell become more feasible. Liquid association (LA) describes the extent to which coexpression of two genes may vary based on the expression level of a third gene (the controller gene). However, genome-wide application has been difficult and resource-intensive. We propose a new screening algorithm for more efficient processing of LA estimation on a genome-wide scale and apply its use to a Saccharomyces cerevisiae data set.
On a test subset of the data, the fast screening algorithm achieved >99.8% agreement with the exhaustive search of LA values, while reduced run time by 81-93 %. Using a well-known yeast cell-cycle data set with 6,178 genes, we identified triplet combinations with significantly large LA values. In an exploratory gene set enrichment analysis, the top terms for the controller genes in these triplets with large LA values are involved in some of the most fundamental processes in yeast such as energy regulation, transportation, and sporulation.
In summary, in this paper we propose a novel, efficient algorithm to explore LA on a genome-wide scale and identified triplets of interest in cell cycle pathways using the proposed method in a yeast data set. A software package named fastLiquidAssociation for implementing the algorithm is available through http://www.bioconductor.org.
Large-scale gene expression data provide snapshots of transcription activity at a genome-wide scale. There is a growing wealth of gene expression data available in public databases (such as the Gene Expression Omnibus) and as well as the capability for easily generating additional data using high-throughput technologies.
Many methods for the statistical analysis of gene expression data exist . Initially data analyses for differential expression focuse on a single gene at a time -. These one-gene-at-a-time analyses separate data into groups depending on the phenotypic status and perform gene-by-gene analysis. However recently the focus has shifted to higher order coexpression patterns (i.e. correlations of the expression levels of two or more genes) with the belief that they may reflect more fully the complex interactions between genes -.
One type of multi-dimensional differential expression analysis is called liquid association. Liquid association (LA) describes the extent to which coexpression of two genes (X 1,X 2) may vary based on the expression level of a third gene (X 3), with the third gene being viewed as a controller gene that can represent the pathway status or the cellular state . Liquid association has been demonstrated to be useful in identifying disease candidate genes for multiple sclerosis and performing dimension reduction for candidate genes in survival studies ,. Li's work  applied LA in two distinct ways. The first fixed a controller gene (i.e. the gene in the X 3 position) or a small subset of controller genes and searched for pairs of genes (X 1,X 2) that showed significant liquid association while the second method reversed this process, specifying one or both of the pair of genes (X 1,X 2) and searching for a controller gene (X 3) that regulates their correlation ,,.
Software is available to assist in the calculation of individual liquid association triplets as in Li's work, and one study has performed brute-force exhaustive searches for liquid association ; however neither of these approaches are efficient for genome-wide use. Computational analyses for LA on a genome-wide scale have proven more intractable due to the issue of dimensionality, with the number of possible combinations increasing exponentially in a situation where the number of samples is already greatly exceeded by the number of genes potentially of interest. For example, in a typical microarray with 6,000 genes, there are more than 1.079-1011 all possible triplet combinations need to be examined in a exhaustive search. In other words, assuming each LA calculation took one one-thousandth of a second, the full calculation of all possible values when performed in sequence would still take approximately 3.4 years. Obviously a different approach is needed. Thus in this paper, we develop a fast-screening algorithm with an R software package available for applying liquid association in a genome-wide scale search and implement it in a yeast data set.
We used the yeast dataset described in . Yeast is a model organism for studying complex gene interdependencies due to its short generation time, ease of culture, and that yeast's fundamental biological processes are conserved among all eukaryotes, which allows us to apply the increased understanding obtained to other organisms . The raw data set is publically available at the Yeast Cell Cycle Analysis Project website and was also available in . The data set contains the gene expression measures for 6,178 yeast genes under 73 normal growth conditions and was intended to represent a comprehensive catalog for transcripts that vary periodically within the cell cycle .
Methods for estimating liquid association
Li  used E(X 1X 2|X 3) to measure the co-expression of X 1 and X 2, and ultimately results in an estimation of L A(X 1,X 2|X 3)=E(X 1X 2X 3), with the standard error obtainable by bootstrap . Ho et al.  noted that Li's measure does not account for instances where the conditional means and variances of X 1 and X 2 may depend on X 3 and proposed a new measure named modified liquid association (MLA). Compared to Li's original measure, MLA is able to consider more intricate co-dependencies among these variables and was proven to be more robust for data analysis applications . Hence in the following analysis, we applied MLA to assess the magnitude of liquid association.
To estimate MLA, both a robust direct estimate and a trivariate conditional normal model (CNM) framework were proposed in . For instances where the CNM does not fit the data well, the more robust direct estimate with bootstrapping standard error can be used.
where ρ median is the correlation when X 3 is in the middle tertile. Triplet combinations that exhibit large ρ diff value are likely to manifest large liquid association. (2) ρ diff can be computed much more quickly through matrix algebra than MLA estimation.
After the first screening step, triplet combinations with a large |ρ diff | value were retained for further model fitting and estimation. As illustrated in Figure 1, during the second step of the algorithm, the magnitude of liquid association is estimated through the CNM if the model fits the triplet data well. Two versions for estimating MLA using the CNM are available, a full and simple version of the model, depending on which model fits the triplet data better. In the case when the CNM model does not adequately describe the data, the robust estimation can be used instead. More detail about the CNM and robust estimation procedure are described in . Gene set enrichment analysis using Gene Ontology  were performed for the top triplet combinations identified in the yeast dataset .
By narrowing the triplets with |ρ diff |>0.5, we reduced the number of the triplet combinations needed to be examined from 7,719,000 to 918,688 triplets (11.9% of all triplet combinations) in the 250 gene subset analysis. In the middle plot of Figure 2, only two out of 7,719,000 triplets (rank 80 and 2680 among all triplets) have |MLA|>0.4 (≈ 50% of maximum MLA value) and are missed by ρ diff >0.5 screening criteria. Because of discretizing X 3, there is a small reduction of resolution for measuring MLA using ρ diff in these two cases. However, the reduction in run time was substantial due to a much smaller number of triplets needed to be examined after the screening. Compared to the exhaustive analysis, the relative run time required for completion using the fastLA algorithm was 19.1% when using |ρ diff |=0.3 as the cut-off threshold and 6.51% when using |ρ diff |=0.5 (run times 2876 seconds and 979 seconds respectively vs. 15046 seconds using the exhaustive search). Processing was performed on servers at the Minnesota Supercomputing Institute on two-socket, quad-core 2.8 GHz Intel Xeon X5560 Nehalem EP processors with 22 GB of RAM.
Top 10 triplets by p-value
X 1/ X 2
X 2/ X 1
In saccharomyces cerevisiae, there are 171 genes with transcription factor specificities that show DNA binding ability and have at least 1 identified motif according to the yeast transcription factor compendium . In the top 342 triplet combination with p value <10-8, 10 (5.8%) of the 171 genes were reported as the controller gene (X 3) in the list. These 10 genes are provided in Additional file 7.
Results of GO analysis
Top 15 GO terms for X 3 analysis using biological processes ontology
Carbohydrate derivative biosynthetic process
rRNA metabolic process
Regulation of glycogen biosynthetic process
Signal transduction involved in filamentous growth
ATP synthesis coupled proton transport
Regulation of polysaccharide biosynthetic process
Sulfur compound transport
Regulation of ion transport
GPI anchor biosynthetic process
Maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
Endonucleolytic cleavage of tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
Top 15 GO terms for full triplet analysis using biological processes ontology
DNA replication-dependent nucleosome assembly
Serine family amino acid catabolic process
Copper ion import
Purine ribonucleoside triphosphate metabolic process
Cellular response to osmotic stress
Nitrogen catabolite regulation of transcription from RNA polymerase II promoter
GTP catabolic process
Regulation of translational initiation
Regulation of vacuole fusion, non-autophagic
Fructose metabolic process
Given that the Spellmen et al. experiments created nutrient-depleted conditions for growth, it is biologically feasible to see that for the controller position (X 3), many top terms are involved energy regulation such as carbohydrate derivative, glycogen, and polysaccharide biosynthesis described in Table 2. Glycogen in yeast is formed during periods where carbon, nitrogen, phosphorus or sulfur is limited . In addition, several top terms are related to transportation of cellular molecules such as vitamin, sulfur compound, and ion. Furthermore, GPI-anchor protein biosynthesis could be related to cell wall formation for sporulation during nutrient-depleted environment.
The results from the analyses using the full triplet set trend toward functions of energy regulation, and transport of molecules as shown in Table 3. Glycolysis is related to the utilization of glucose. In addition, regulation of lipid, carbohydrate, hexose, purine ribonucleotide could also be involved in energy regulation process. The findings presented by GO analysis could suggest feasible biological hypotheses; however, liquid association measure describes `association” between gene triplet, but it does not necessary confers “causation.” Further functional experiments will be needed to validate the top triplets identified with large MLA values.
In the data analysis, we set |ρ diff |=0.5. There are a few considerations about setting the threshold value for |ρ diff |. The maximum value is theoretically 2 (as ρ diff =ρ high -ρ low and ). For general use, too high a value for |ρ diff | risks missing those triplets whose MLA values are not fully reflected by the more simplistic correlation, while too low a value approaches testing all possible triplets and forfeits any increase in testing efficiency. We set the default value at 0.5 (25% of the realizable correlation difference) as we found >99.98% of the triplets with a large MLA were captured by setting |ρ diff |=0.5 in the validation subset. If we increase the threshold for the |ρ diff | cutoff, we could further decrease time without substantial loss in sensitivity. Of the top 10,000 triplets, only 128 would have been missed using a cutoff of |ρ diff |=1.0. However, this would have substantially decreased the number of triplets that needed to be checked for MLA estimates, which in turn would have helped decrease memory usage and overall processing time.
In the algorithm, ρ diff is calculated based on the difference between a “high” versus “low” subset of the data for each gene in the controller position. Initially the median (after removal of any data with a missing value in the X 3 position) was used as the demarcation between the high and low subsets. However, we found that the central points diluted the ρ diff estimate and decreased sensitivity. The algorithm was respecified to split the data into three parts based on the X 3 values, with high being the top third and low being the bottom third in our analysis. By using the upper and lower tertiles for Z expression, the values of ρ diff increase in triplets with large liquid association and hence increase the sensitivity to identify triplets with large MLA values. Based on data obtained in the verification process, we used this specification of the algorithm in this analysis. Furthermore, the splitting of X 3 can be easily modified for other analysis; however, in practice, we suggest to have between 15 to 30 samples as recommended by Ho et al.  samples in each bin to achieve stable estimates of ρ.
During the course of parameter estimations using the CNM models, we identified a subset of triplets where the CNM does not fit the data well. In total, of the 2.8605-107 triplets that were tested using the full model, 23,830 triplets can not be adequately described by the CNM full model.
A concern that has been raised in regards to using Hypergeometric-based tests is the problem of defining the gene universe. When a larger gene universe is used, it in general will tend to (assuming all other variables remain the same) have the effect of making the p-value seem more significant . Given the genome-wide scope and nature of our testing (in that a priori, we had no way of distinguishing which genes might be found to be “interesting” and thus all genes were equally likely to be selected), it was decided that all analyzed genes would be included in the gene universe for analysis and the results interpreted conservatively. While the data used from Spellman et al. were obtained from cDNA arrays and thus more likely to have prior rationale of biological plausibility for probe inclusion, for commercial chips performing some non-specific filtering prior to analysis may help reduce the size of the gene universe and manage to avoid the issue.
We proposed the fastLA algorithm for exploring liquid association in a genome-wide scale. Some modifications of the fast liquid association algorithm could be: (1) For binary traits, ρ diff can be used as the liquid association measure. Our algorithm can be easily adapted to the binary case, (2) Use a rank-based correlation statistic. Using non-parametric correlation would make the model more robust to outliers and potential violations of the assumption that the variables are bivariately normally distributed; however, rank-based correlation statistic could be less statistically powerful comparing to the Pearson correlation.
On the basis of the results of this study, it appears that ρ diff would be an appropriate screening metric for MLA in use for exploratory genome-wide searches and that both metrics are suitable for identifying triplets of interest. Given the high correlation observed between ρ diff and MLA and the increased speed of calculation of ρ diff due to its matrix manipulation to perform the estimate, this would significantly reduce both processing time and memory requirements. While there remain reservations that ρ diff may not be suitable for a comprehensive identification of triplets of significant p-values, nevertheless it is a fast and efficient screening tool to identify potentially significant gene triplets using liquid association.
YYH designed the algorithm. TG performed the statistical analysis and drafted the manuscript. Both authors read and approved the final manuscript.
The authors are thankful for the resources from the University of Minnesota Supercomputing Institute. The authors are thankful for the helpful discussion with Dr. Jeffrey Leek. Yen-Yi Ho is partially supported by grants 2P30CA077598, P50CA101955, UL1TR000114 and U54-MD008620.
- Slonim D, Yanai I: Getting started in gene expression microarray analysis . PLoS Comput Biol. 2009, 5 (10): 1000543-10.1371/journal.pcbi.1000543. doi:10.1371/journal.pcbi.1000543,View ArticleGoogle Scholar
- Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments . Statistica Sinica. 2002, 12 (1): 111-140.Google Scholar
- Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments . Bioinformatics. 2002, 18 (4): 546-554. 10.1093/bioinformatics/18.4.546. doi:10.1093/bioinformatics/18.4.546,View ArticlePubMedGoogle Scholar
- Hahne F, Huber W, Gentleman R: Bioconductor Case Studies. Use R! . 2008, Springer, New YorkView ArticleGoogle Scholar
- Arevalillo JM, Navarro H: A new method for identifying bivariate differential expression in high dimensional microarray data using quadratic discriminant analysis . BMC Bioinformatics. 2011, 12 (Suppl 12): 6-10.1186/1471-2105-12-S12-S6. doi:10.1186/1471-2105-12-S12-S6,View ArticleGoogle Scholar
- Dettling M, Gabrielson E, Parmigiani G: Searching for differentially expressed gene combinations . Genome Biol. 2005, 6 (10): 88-10.1186/gb-2005-6-10-r88. doi:10.1186/gb-2005-6-10-r88,View ArticleGoogle Scholar
- Li K-C: Genome-wide coexpression dynamics: theory and application . Proc Natl Acad Sci. 2002, 99 (26): 16875-16880. 10.1073/pnas.252466999.View ArticlePubMed CentralPubMedGoogle Scholar
- Li K-C, Liu C-T, Sun W, Yuan S, Yu T: A system for enhancing genome-wide coexpression dynamics study . Proc Natl Acad Sci. 2004, 101 (44): 15561-15566. 10.1073/pnas.0402962101. doi:10.1073/pnas.0402962101,View ArticlePubMed CentralPubMedGoogle Scholar
- Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential gene-gene co-expression patterns . Bioinformatics. 2004, 20 (17): 3146-3155. 10.1093/bioinformatics/bth379. doi:10.1093/bioinformatics/bth379,View ArticlePubMedGoogle Scholar
- Hu R, Qiu X, Glazko G, Klebanov L, Yakovlev A: Detecting intergene correlation changes in microarray analysis: a new approach to gene selection . BMC Bioinformatics. 2009, 10 (1): 20-10.1186/1471-2105-10-20. doi:10.1186/1471-2105-10-20,View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang J, Ji Y, Zhang L: Extracting three-way gene interactions from microarray data . Bioinformatics. 2007, 23 (21): 2903-2909. 10.1093/bioinformatics/btm482.View ArticlePubMedGoogle Scholar
- Li K-C, Palotie A, Yuan S, Bronnikov D, Chen D, Wei X, Choi O-W, Saarela J, Peltonen L: Finding disease candidate genes by liquid association . Genome Biol. 2007, 8 (10): R205-10.1186/gb-2007-8-10-r205.View ArticlePubMed CentralPubMedGoogle Scholar
- Wu T, Sun W, Yuan S, Chen C-H, Li K-C: A method for analyzing censored survival phenotype with gene expression data . BMC Bioinformatics. 2008, 9 (1): 417-10.1186/1471-2105-9-417.View ArticlePubMed CentralPubMedGoogle Scholar
- Lin P-Y: Genome-wide coexpression dynamics in lung adenocarcinoma. Master's thesis, Emory University, 2011. , [http://pid.emory.edu/ark:/25593/9225b]
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization . Mol Biol Cell. 1998, 9 (12): 3273-3297. 10.1091/mbc.9.12.3273.View ArticlePubMed CentralPubMedGoogle Scholar
- Botstein D, Chervitz SA, Cherry JM: Yeast as a model genetic organism . Science. 1997, 227: 1259-1280. 10.1126/science.277.5330.1259.View ArticleGoogle Scholar
- Ho Y-Y, Parmigiani G, Louis TA, Cope LM: Modeling liquid association . Biometrics. 2011, 67 (1): 133-141. 10.1111/j.1541-0420.2010.01440.x.View ArticlePubMedGoogle Scholar
- Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association . Bioinformatics. 2007, 23 (2): 257-258. 10.1093/bioinformatics/btl567.View ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing . J Roy Stat Soc B. 1995, 57 (1): 289-300.Google Scholar
- de Boer CG, Hughes TR: Yetfasco: a database of evaluated yeast transcription factor sequence specificities . Nucleic Acids Res. 2012, 40 (Database issue): 169-179. 10.1093/nar/gkr993. doi:10.1093/nar/gkr993,View ArticleGoogle Scholar
- Wilson WA, Roach PJ, Montero M, Baroja-Fernández E, Muñoz FJ, Eydallin G, Viale AM, Pozueta-Romero J: Regulation of glycogen metabolism in yeast and bacteria . FEMS Microbiol Rev. 2010, 34 (6): 952-985.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.