Vector analysis as a fast and easy method to compare gene expression responses between different experimental backgrounds
© Breitling et al; licensee BioMed Central Ltd. 2005
Received: 04 April 2005
Accepted: 19 July 2005
Published: 19 July 2005
Gene expression studies increasingly compare expression responses between different experimental backgrounds (genetic, physiological, or phylogenetic). By focusing on dynamic responses rather than a direct comparison of static expression levels, this type of study allows a finer dissection of primary and secondary regulatory effects in the various backgrounds. Usually, results of such experiments are presented in the form of Venn diagrams, which are intuitive and visually appealing, but lack a statistical foundation.
Here we introduce Vector Analysis (VA) as a simple, yet principled, approach to comparing expression responses in different experimental backgrounds. VA enables the automatic assignment of genes to response prototypes and provides statistical significance estimates to eliminate spurious response patterns. The application of VA to a real dataset, comparing nutrient starvation responses in wild type and mutant Arabidopsis plants, reveals that consistent patterns of expression behavior are present in the data and are reliably detected by the algorithm.
Vector analysis is a flexible, easy-to-use technique to compare gene expression patterns in different experimental backgrounds. It compares favorably with the classical Venn diagram approach and can be implemented manually using spreadsheets, such as Excel, or automatically by using the supplied software.
Large-scale gene expression measurements by microarray technology are used to compare mRNA levels in different experimental or biological conditions [1]. However, in an increasing number of cases, it is more relevant to compare differences in expression responses than static expression levels. Perhaps the most common situation involves the comparison between a wild type and a mutant organism. Here, the mRNA profile in any condition will differ between the two genetic backgrounds, but these differences will be a complex combination of the primary effect of the mutation and secondary effects of various kinds. For example, the mutant may show growth defects, disease reactions, or compensating adjustments in its physiology. All of these make a direct comparison between the expression profiles problematic. In contrast, comparing how organisms of each genetic background respond to a common relevant stimulus can reveal regulatory mechanisms that are lost or gained by the mutation as well as shared or 'disregulated' responses. Of course, the same approach is useful for other studies comparing gene expression in distinct types of background, e.g. between cell lines, tissues, or even organisms. In each case, comparing dynamic responses can provide more biological insight than a static direct comparison of expression profiles.
Despite the importance of comparing expression responses in diverse backgrounds, accessible statistical techniques for this common analytical task are sorely lacking. Usually, genes that are differentially expressed in either background are first identified independently and then compared in the form of Venn diagrams that depict the overlap between the two sets of genes (see [2–5] for examples, and [6, 7] for a mathematical introduction to Venn diagrams). This approach is very attractive because of its simplicity and immediate visualization. It is implemented in many commercial microarray analysis packages (e.g. Genespring) and has also been used as an alternative to clustering techniques to identify similarities between experimental results (Venn mapping [8]) and to visualize general relationships among the functional annotations associated with lists of differentially expressed genes [9]. Venn diagrams, however, have a number of limitations, most importantly the arbitrariness of the initial definition of changed genes. In particular, the content of the intersection of the two gene sets ("shared responses") depends critically on the selection threshold used in the initial definition of differentially expressed genes. Another disadvantage is that differential responses in the two backgrounds are not further characterized, e.g. it is not obvious whether the difference of a gene's response between the two backgrounds is due to a "regulated/non-regulated" or an "up-regulated/down-regulated" effect. More sophisticated statistical techniques have been used to approach this issue (e.g. ANOVA [10], Principal Component Analysis [11], Singular Value Decomposition [12], Linear Factor Models [13], or Integrative Correlation Analysis [14]). Each of these successfully addresses certain aspects of the problem, by reducing the dimensionality of the data or identifying consistent patterns of behavior across conditions. However, they all lack the intuitive appeal and simplicity of Venn diagram visualization. Here we present a simple alternative to Venn diagrams that is based on similar concepts but provides more flexibility and an added degree of objectivity of the results.
Results and discussion
If there are replicate experiments, as is generally the case in microarray studies, we calculate the representative "average" vector $v_{\mathrm{REP}}$ by (1) determining the individual vectors $v_i$, where the vector $v_i$ represents the comparison of the $i$-th pair of experiments (if there are $N$ replicates in background A and $M$ replicates in background B, there will be $n = N \times M$ pairs); (2) calculating the average length of these vectors, $l = \frac{1}{n}\sum_{i=1}^{n} |v_i|$, where $|v_i|$ denotes the length of the vector $v_i$; (3) calculating the sum of the unit vectors pointing in the same direction as the individual pairwise vectors, $v_{\mathrm{SUM}} = \sum_{i=1}^{n} v_i / |v_i|$; and finally (4) determining the representative vector by combining the length ($l$) and direction ($v_{\mathrm{SUM}}$) information, $v_{\mathrm{REP}} = l \cdot v_{\mathrm{SUM}} / |v_{\mathrm{SUM}}|$.
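The four steps above can be sketched in a few lines of Python. This is only an illustration, not the supplied Perl script: it assumes that each pairwise vector consists of the log-fold change of one replicate in background A and one replicate in background B, and the function name `representative_vector` is hypothetical. Pairs with zero length have no direction and are skipped when summing unit vectors, which is also an assumption.

```python
import math
from itertools import product


def representative_vector(resp_a, resp_b):
    """Return (l, v_sum, v_rep) for one gene.

    resp_a, resp_b: log-fold changes of the gene in the N replicates of
    background A and the M replicates of background B (assumed layout).
    """
    # (1) one 2-D vector per pair of replicates: n = N x M vectors
    vectors = list(product(resp_a, resp_b))
    lengths = [math.hypot(x, y) for x, y in vectors]

    # (2) average length of the pairwise vectors
    l = sum(lengths) / len(vectors)

    # (3) sum of unit vectors; zero-length vectors are skipped (assumption)
    sx = sum(x / r for (x, _), r in zip(vectors, lengths) if r > 0)
    sy = sum(y / r for (_, y), r in zip(vectors, lengths) if r > 0)

    # (4) representative vector: direction of v_sum, scaled to length l
    norm = math.hypot(sx, sy)
    if norm == 0:
        v_rep = (0.0, 0.0)  # no consistent direction at all
    else:
        v_rep = (l * sx / norm, l * sy / norm)
    return l, (sx, sy), v_rep


# A gene that responds only in background A, with two replicates each:
print(representative_vector([1.0, 1.0], [0.0, 0.0]))
# prints (1.0, (4.0, 0.0), (1.0, 0.0))
```

Note that the length of the sum vector can be at most n, and reaches n only when all pairwise vectors point in exactly the same direction.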
The length of the vector ($l$) indicates the average strength of the response and can be used to filter out genes that show little response in either background. The direction of the vector indicates which prototypical behavior comes closest to the behavior of this particular gene. To assign a gene to a response prototype, one calculates the angle $\alpha$ between the representative vector and each possible prototype vector as $\cos\alpha = v_{\mathrm{REP}} \cdot v_{\mathrm{prototype}} / (|v_{\mathrm{REP}}|\,|v_{\mathrm{prototype}}|)$, $0 \le \alpha \le 180^\circ$, where $v_{\mathrm{REP}} \cdot v_{\mathrm{prototype}}$ is the scalar product of the two vectors and $|v_{\mathrm{REP}}| \ne 0$. The gene is then assigned to the prototype closest to it (minimal $\alpha$).
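A minimal sketch of this assignment step follows. It assumes that the prototypes are encoded as sign vectors, with +1 for up-regulated, -1 for down-regulated and 0 for unchanged, and that the first coordinate refers to background A and the second to background B. This encoding and the names `PROTOTYPES`, `angle_deg` and `closest_prototype` are illustrative choices, not taken from the supplied software.

```python
import math
from itertools import product

# The eight non-zero sign vectors in two dimensions (assumed encoding).
PROTOTYPES = [p for p in product((-1, 0, 1), repeat=2) if any(p)]


def angle_deg(v, w):
    """Angle between two non-zero 2-D vectors, in degrees (0 to 180)."""
    dot = v[0] * w[0] + v[1] * w[1]
    cos_a = dot / (math.hypot(*v) * math.hypot(*w))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def closest_prototype(v_rep):
    """Return the prototype with the smallest angle to v_rep."""
    if v_rep[0] == 0 and v_rep[1] == 0:
        raise ValueError("the zero vector has no direction")
    return min(PROTOTYPES, key=lambda p: angle_deg(v_rep, p))


print(closest_prototype((1.0, 0.1)))   # prints (1, 0): up in A only
print(closest_prototype((-0.5, 0.6)))  # prints (-1, 1): down in A, up in B
```

With eight prototypes spaced 45° apart, each prototype covers a sector of ±22.5°. A vector lying exactly on a sector boundary is assigned to whichever of the two prototypes comes first in the list.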
It is clear that the vector approach generalizes to multi-dimensional cases, i.e. to comparisons between more than two backgrounds. However, the number of possible prototype behaviors increases rapidly, as $N = 3^k - 1$, where $k$ is the number of dimensions.
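The growth in the number of prototypes can be checked by enumeration. The sketch below assumes that each prototype is a sign vector with one entry per background (up, down or unchanged) and that the all-unchanged vector is excluded; the function name `prototypes` is hypothetical.

```python
from itertools import product


def prototypes(k):
    """All non-zero sign vectors for k backgrounds (assumed encoding)."""
    return [p for p in product((-1, 0, 1), repeat=k) if any(p)]


for k in (2, 3, 4):
    print(k, len(prototypes(k)))
# prints:
# 2 8
# 3 26
# 4 80
```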
The null distribution of the sum vector length can be estimated by randomly sampling from the measured expression values and calculating the sum vector lengths for these random data, which should not show consistent behavior. This is done by randomly reassigning the original expression values within each replicate to other genes. All consistency between replicates and, thus, between experimental backgrounds should then be lost, and the resulting $|v_{\mathrm{SUM}}|$ values will be those expected if no consistency is present. This null distribution can be used to attach a p-value to the assignment of a gene to a behavior prototype (consistency p-value). The value calculated in this way is a non-parametric estimate of the real p-value, and it will vary slightly between runs of the method unless the same random sampling is used each time.
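The permutation procedure can be sketched as follows. This is not the supplied Perl script. It assumes the data are stored as one list of replicate values per gene for each background, pools the null values from all permutation rounds, and takes the p-value as the plain fraction of null values at least as large as the observed sum vector length. The function names, the `n_perm` default and the `seed` parameter are all illustrative.

```python
import math
import random
from bisect import bisect_left
from itertools import product


def sum_vector_length(resp_a, resp_b):
    """Length of the sum of unit vectors over all replicate pairs."""
    sx = sy = 0.0
    for x, y in product(resp_a, resp_b):
        r = math.hypot(x, y)
        if r > 0:  # zero vectors have no direction (skipped, assumption)
            sx += x / r
            sy += y / r
    return math.hypot(sx, sy)


def shuffle_columns(rows, rng):
    """Permute each replicate (column) independently across genes (rows)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def consistency_p_values(data_a, data_b, n_perm=100, seed=0):
    """Estimate one consistency p-value per gene by permutation."""
    rng = random.Random(seed)  # a fixed seed makes runs reproducible
    null = []
    for _ in range(n_perm):
        perm_a = shuffle_columns(data_a, rng)
        perm_b = shuffle_columns(data_b, rng)
        null.extend(sum_vector_length(a, b) for a, b in zip(perm_a, perm_b))
    null.sort()

    p_values = []
    for a, b in zip(data_a, data_b):
        observed = sum_vector_length(a, b)
        # number of null values greater than or equal to the observed one
        n_ge = len(null) - bisect_left(null, observed)
        p_values.append(n_ge / len(null))
    return p_values
```

Because the estimate is a plain fraction, its resolution is limited by the number of null values (genes times `n_perm`), and a gene can receive an estimate of exactly zero. Passing the same `seed` reproduces the same p-values, which corresponds to reusing the same random sampling.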
Additional file 6 shows the results of vector analysis applied to a simulated dataset, where the response type of each gene is known [see Additional file 6]. Three replicates for each experimental background were created by drawing random expression values from normal distributions with variance 1 and a mean of 0, -2, and 2 for unchanged, down-regulated and up-regulated genes, respectively. In this small illustrative example, 87.5% of regulated genes are assigned the correct response type. The remaining genes are assigned one of the neighboring types. Genes that are unchanged in both conditions are also assigned to the closest response prototype, but none of these achieves a significant consistency p-value. Of course, in a real-world application unchanged genes would usually be filtered before applying vector analysis, because otherwise they will be assigned arbitrary angular and location values that add noise to the results. If VA is applied to genes that are not changed at all, it will always assign these genes to "incorrect" response classes, and even when the consistency p-value of VA is used, some of these genes will reach significance simply due to multiple testing. Therefore, VA is usually applied only to genes that are significantly changed in at least one experimental background, based on any of the standard methods for the detection of differentially expressed genes. However, the filtering does not have to be very strict and the results of VA may still yield interesting trends for borderline cases, as shown in the example below.
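A comparable simulated dataset can be generated with a short sketch such as the one below. The means (0, -2, 2), the variance of 1 and the three replicates per background follow the description above; the number of genes per response type (`genes_per_type`), the seed and the function name `simulate` are assumptions, since the size of the original simulated dataset is not stated here.

```python
import random
from itertools import product

# Mean log-fold change for each response type (from the text above).
MEAN = {-1: -2.0, 0: 0.0, 1: 2.0}


def simulate(genes_per_type=5, replicates=3, seed=0):
    """Return a list of (true_type, resp_a, resp_b) tuples.

    true_type is a pair such as (1, 0): up-regulated in background A,
    unchanged in background B (assumed encoding).
    """
    rng = random.Random(seed)
    dataset = []
    for type_a, type_b in product((-1, 0, 1), repeat=2):
        for _ in range(genes_per_type):
            # variance 1 corresponds to a standard deviation of 1.0
            resp_a = [rng.gauss(MEAN[type_a], 1.0) for _ in range(replicates)]
            resp_b = [rng.gauss(MEAN[type_b], 1.0) for _ in range(replicates)]
            dataset.append(((type_a, type_b), resp_a, resp_b))
    return dataset


data = simulate()
print(len(data))  # prints 45: nine response types with five genes each
```

Because the true response type of every gene is stored alongside its values, the fraction of correctly assigned genes can be computed directly once each gene has been assigned to a prototype.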
Number of genes showing the various types of prototypic behavior in two genetic backgrounds of Arabidopsis plants as identified by vector analysis.
Mutant specific up
Mutant specific down
WT specific up
WT specific down
WT and Mutant up
WT and Mutant down
Mutant up, WT down
Mutant down, WT up
One of the properties of this dataset is that very few genes show a strong expression response in any background. Only one out of 1000 genes has an l-value larger than 1 (roughly corresponding to a two-fold expression change), and only 35 genes have l-values larger than 0.5. Thus, a Venn analysis based on significantly changed genes would be all but impossible. The vector analysis, in contrast, identifies 32 genes with consistency p-values smaller than 0.01 (expected 10) and 258 genes with p-values smaller than 0.1 (expected 100). It thus reveals the presence of consistent response patterns even among genes with very slight absolute expression changes.
Using the two parameters of the method (vector length = overall response intensity, and p-value = response pattern consistency) allows the flexible dissection of the observed expression in the two experimental backgrounds. At the same time it is possible to assign the most likely response pattern even to genes that show little absolute expression change.
In contrast to Venn diagrams, which can only be used to compare genes that are reliably identified as responsive, vector analysis assigns all genes to behavioral categories. These categories are not fixed, but can be adjusted as appropriate for any experiment by simply changing the boundaries of the sectors. In addition, genes can be sorted by their angular distance from any reference gene (or reference behavior) to generate lists ranked by closeness to a particular response pattern.
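Sorting by angular distance can be sketched as follows, assuming each gene is represented by a two-dimensional representative vector stored in a dictionary. The gene identifiers, the example values and the function names are purely illustrative.

```python
import math


def angular_distance_deg(v, ref):
    """Unsigned angle between two 2-D vectors, in degrees (0 to 180)."""
    dot = v[0] * ref[0] + v[1] * ref[1]
    cross = v[0] * ref[1] - v[1] * ref[0]
    return math.degrees(math.atan2(abs(cross), dot))


def rank_by_pattern(gene_vectors, ref):
    """Gene identifiers sorted by closeness to the reference direction."""
    return sorted(gene_vectors,
                  key=lambda g: angular_distance_deg(gene_vectors[g], ref))


# Hypothetical representative vectors (background A, background B).
genes = {"g1": (1.0, 0.9), "g2": (1.0, -1.0), "g3": (0.1, 1.0)}

# Rank by closeness to "up in both backgrounds".
print(rank_by_pattern(genes, (1, 1)))  # prints ['g1', 'g3', 'g2']
```

The reference can be a prototype, as here, or the representative vector of any gene of interest.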
Vector analysis provides a flexible, easy-to-use, and intuitive approach to the comparison of gene expression patterns in different experimental backgrounds. While it does not supply the detailed statistical insights available from classical statistical approaches such as ANOVA, it excels in simplicity and straightforward interpretation. In this respect vector analysis compares favorably with the Venn diagram technique, which is currently in widespread use for this common task but lacks the flexibility of vector analysis, in particular for noisy data.
For small datasets with few replicates, vector analysis is straightforward enough to be carried out manually, e.g. in Excel or OpenOffice spreadsheets. It uses only the most basic vector algebra. The Excel file in the supplementary material [see Additional file 1] demonstrates how $l$, $v_{\mathrm{SUM}}$, and $v_{\mathrm{REP}}$ are calculated and used to automatically assign genes to the various response prototypes. A second sheet in the same file is used to randomly permute the experimental measurements by sorting them along a vector of random numbers, so that within each replicate (column) the original expression values are randomly assigned to new genes and all consistencies between columns are lost. The vector lengths calculated from these random data are then used in a third sheet to estimate the p-values associated with the observed response patterns (for details of the procedure [see Additional file 2]). For larger numbers of replicates, the manual procedure becomes quite tedious and a Perl script [see Additional file 3] is provided that performs vector analysis and p-value estimation automatically, taking a tab-delimited text file of log-fold changes in all replicates [see Additional file 4] as its input. The obtained results [see Additional file 5] can then be sorted, filtered and explored in various ways to dissect the details of comparative expression behavior.
This work was supported by BBSRC grants 17/G17989 and 17/P17237 to AA.
1. Lockhart DJ, Winzeler EA: Genomics, gene expression and DNA arrays. Nature 2000, 405: 827–836. doi:10.1038/35015701
2. Oono Y, Seki M, Nanjo T, Narusaka M, Fujita M, Satoh R, Satou M, Sakurai T, Ishida J, Akiyama K, Iida K, Maruyama K, Satoh S, Yamaguchi-Shinozaki K, Shinozaki K: Monitoring expression profiles of Arabidopsis gene expression during rehydration process after dehydration using ca 7000 full-length cDNA microarray. Plant J 2003, 34: 868–887. doi:10.1046/j.1365-313X.2003.01774.x
3. Mariadason JM, Corner GA, Augenlicht LH: Genetic reprogramming in pathways of colonic cell maturation induced by short chain fatty acids: comparison with trichostatin A, sulindac, and curcumin and implications for chemoprevention of colon cancer. Cancer Res 2000, 60: 4561–4572.
4. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods 2005, 2: 345–350. doi:10.1038/nmeth756
5. Ramalho-Santos M, Yoon S, Matsuzaki Y, Mulligan RC, Melton DA: "Stemness": transcriptional profiling of embryonic and adult stem cells. Science 2002, 298: 597–600. doi:10.1126/science.1072530
6. Grünbaum B: Venn Diagrams II. Geombinatorics 1992, II: 25–32.
7. Grünbaum B: Venn Diagrams I. Geombinatorics 1992, I: 5–12.
8. Smid M, Dorssers LC, Jenster G: Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics 2003, 19: 2065–2071. doi:10.1093/bioinformatics/btg282
9. Kestler HA, Muller A, Gress TM, Buchholz M: Generalized Venn diagrams: a new method of visualizing complex genetic set relations. Bioinformatics 2005, 21: 1592–1595. doi:10.1093/bioinformatics/bti169
10. Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2: RESEARCH0042. doi:10.1186/gb-2001-2-10-research0042
11. Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000, 455–466.
12. Alter O, Brown PO, Botstein D: Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci U S A 2003, 100: 3351–3356. doi:10.1073/pnas.0530258100
13. Girolami M, Breitling R: Biologically valid linear factor models of gene expression. Bioinformatics 2004, 20: 3021–3033. doi:10.1093/bioinformatics/bth354
14. Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E: A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res 2004, 10: 2922–2927.
15. Feng S, Ma L, Wang X, Xie D, Dinesh-Kumar SP, Wei N, Deng XW: The COP9 signalosome interacts physically with SCF COI1 and modulates jasmonate responses. Plant Cell 2003, 15: 1083–1094. doi:10.1105/tpc.010207
16. Armengaud P, Breitling R, Amtmann A: The potassium-dependent transcriptome of Arabidopsis reveals a prominent role of jasmonic acid in nutrient signaling. Plant Physiol 2004, 136: 2556–2576. doi:10.1104/pp.104.046482
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.