A novel method for cross-species gene expression analysis
© Kristiansson et al.; licensee BioMed Central Ltd. 2013
Received: 25 June 2012
Accepted: 13 February 2013
Published: 27 February 2013
Skip to main content
© Kristiansson et al.; licensee BioMed Central Ltd. 2013
Received: 25 June 2012
Accepted: 13 February 2013
Published: 27 February 2013
Analysis of gene expression from different species is a powerful way to identify evolutionarily conserved transcriptional responses. However, due to evolutionary events such as gene duplication, there is no one-to-one correspondence between genes from different species which makes comparison of their expression profiles complex.
In this paper we describe a new method for cross-species meta-analysis of gene expression. The method takes the homology structure between compared species into account and can therefore compare expression data from genes with any number of orthologs and paralogs. A simulation study shows that the proposed method results in a substantial increase in statistical power compared to previously suggested procedures. As a proof of concept, we analyzed microarray data from heat stress experiments performed in eight species and identified several well-known evolutionarily conserved transcriptional responses. The method was also applied to gene expression profiles from five studies of estrogen exposed fish and both known and potentially novel responses were identified.
The method described in this paper will further increase the potential and reliability of meta-analysis of gene expression profiles from evolutionarily distant species. The method has been implemented in R and is freely available at http://bioinformatics.math.chalmers.se/Xspecies/.
Gene expression microarray and RNA-seq provide fast and cost-efficient measurement of mRNA abundance for thousands of genes simultaneously. The amount of gene expression data generated by these techniques is constantly increasing and public repositories such as Gene Expression Omnibus and ArrayExpress contains today a large body of information from a wide range of species and experimental conditions [1, 2]. Large-scale gene expression assays are however plagued with high variability which complicates data interpretation. The abundance of mRNA is stochastic by nature, both on a cellular and multicellular level [3, 4], and there are often large variability between gene expression patterns from different organisms . In addition, technical parameters such as tissue heterogeneity, probe affinities and batch effects may introduce substantial levels of noise [6–8]. Gene expression data is therefore non-trivial to analyze and to put into a biological context.
One way to increase the potential of large-scale gene expression analysis is to combine information between different species. If a biological process is evolutionarily conserved between two species, it is also likely that the transcriptional responses associated with that process share similarities. Indeed, cross-species meta-analysis of gene expression profiles has previously been used to address many questions in biology and medicine. For example, gene expression analysis performed in model species such as mouse and rat are commonly used to study human diseases  including cancer [10, 11], Alzheimer’s disease , diabetes  and hypertension . Comparative analysis of gene expression profiles in human and mouse embryonic stem cells has been used to identify similarities and differences associated with the developmental biology in these species . Cross-species meta-analysis has also proven useful in biogeronotology where evolutionarily conserved age-related gene expression responses have been identified based on data from several species, including the fruit fly Drosophila melanogaster and the worm Caenorhabditis elegans[16, 17]. Another example is ecotoxicology, where changes of molecular biomarkers are used to detect toxic effects and to monitor populations and ecosystem health . Such biomarkers should be as general as possible and thus responsive in a wide range of species. Meta-analysis of gene expression profiles from multiple species therefore provides a powerful tool for identification and evaluation of biomarkers [19, 20].
Cross-species meta-analysis is however not straight-forward. Different species have different genomes and thus also essential differences in their transcriptomes. The evolutionary process of the eukaryotic genome includes events such as duplication and recombination, which creates complex relations between genes . There is no guarantee that genes from different species with a shared common ancestry (orthologs) have a one-to-one correspondence since gene duplications after speciation may have resulted in one or more additional gene copies (in-paralogs). For species with a relatively short evolutionary distance, such as human and mouse, the number of in-paralogs is low (5.9% of all homologs according to Homologene release 65). The numbers are however higher for species with larger evolutionary distance. For example, 9.6% of all human homologs in Drosophila melanogaster have at least one in-paralog and the corresponding numbers for Saccharomyces cerevisiae and Arabidopsis thaliana are 13.2% and 51% respectively (Homologene release 65). The function of paralogous genes tends to diverge over time and have in general a high gene expression diversity compared to single-copy genes [22–27]. Hence, information from all genes, including both orthologs and paralogs, is vital for cross-species analysis of gene expression profiles.
Several methods have previously been suggested for cross-species analysis of gene expression profiles. Fisher’s combined probability test, which transforms p-values from any number of tests into one single p-value, has been a popular method for comparing multiple gene expression experiments [28–31]. Another approach, which was developed by Stuart et al., was used to compare gene expression of homologs (identified using reciprocal best BLAST hits) over a wide range of experimental conditions . Le et al. developed a computationally efficient procedure that compares the distance between ranks of genes from pairs of species . The method was then applied to a large set of microarrays from man and mouse. Another method called mDEDS was developed by Campain and Yang and uses several different statistical measures to perform cross-species comparison of gene expression profiles . Other methods includes LOLA  and L2L  which are both online tools for comparisons of ranking lists of differentially expressed genes from microarrays studies, including lists from different species. However, all these methods assume a one-to-one correspondence between genes from different species. This assumption may be acceptable when comparing relatively closely related species such as mouse and man, but it makes these procedures inapplicable when comparing more distantly related species.
Lu and co-authors have previously developed methods for analysis of gene expression between different species that takes many-to-many relations into account [36–38]. By using Markov random fields and belief propagation, they were able to identify cell cycling genes in human and yeast . The methods were also used to analyze genes which shared expression profiles in human and mice infected by various pathogens . However, the topology of the Markov random fields depends on the experimental design which makes them hard to adapt to many forms of gene expression experiments. They also make explicit assumption of the distribution of the gene expression, either in the form of an extreme value distribution  or a Gaussian distribution . This makes them unsuitable for many heterogeneous datasets with observations from multiple measurement platforms, such as gene expression microarrays and RNA-seq. To enable cross-species meta-analysis of existing and future gene expression data, novel flexible methods that can handle many-to-many relationships between genes are needed [30, 39].
In this paper we describe a new statistical method for meta-analysis of gene expression profiles from different species. The method was derived to take all orthologous and co-orthologous genes into account. Similar to Fisher’s method, the proposed method uses gene-specific p-values, which makes it applicable to many forms of measurement platforms including microarrays and sequencing based techniques such as RNA-seq. A simulation study showed that the proposed method resulted in a substantial gain of statistical power for identification of differentially expressed genes. As a proof of concept, we used the method to identify evolutionarily conserved regulation of stress responsive genes in eight species subjected to heat stress. We also applied the method to gene expression data from aquatic vertebrates exposed to estrogens to demonstrate its applicability within ecotoxicology.
Assume that a number of large-scale gene expression experiments have been performed in a set of species investigating an evolutionarily conserved transcriptional response. Assume further that each experiment has been analyzed individually resulting in a p-value for each measured gene describing the significance of the differential expression (e.g. between two treatments). We will also assume that there is a fixed and known evolutionary structure describing all groups of orthologous and co-orthologous genes present in the species of interest. Such homology groups are readily available from multiple sources, such as Homologene , OrthMCL-DB  and InParanoid  or can alternatively be inferred de novo by tools such as OrthoMCL .
The method proposed in this paper operates on the gene-specific p-values generated from each experiment. For each homology group and species, the method summarizes all in-paralogs into one single value by selecting the minimum (most significant) p-value. A weighted score is then calculated by summing the negative logarithms of the minimum p-values from each gene expression experiment. A combined p-value for each homology group is finally derived by comparing the observed score to the null distribution which has a known, but non-trivial, analytic form. Finally, a Benjamini-Hochberg false discovery rate (FDR) is calculated to control for the multiple testing of several homology groups (typically ∼10,000 homology groups are tested).
The weights used to combine the different experiments are based on the evolutionary structure. Under the assumption of no differential expression, genes with many in-paralogs are more likely to result in a lower minimum p-value than genes with few or no in-paralogs. The weights therefore decrease with the number of in-paralogs to generate an unbiased score. The weights also contain an arbitrary component, which can be used to weigh individual experiments up or down. For example, the arbitrary weights can be used to prevent bias if multiple experiments are performed in the same species.
Full mathematical details, including the derivation of the weights and the analytical null distribution, can be found in Methods. An R-implementation of the methods if freely available at http://bioinformatics.math.chalmers.se/Xspecies/.
The proposed method: the most significant p-value of the in-paralogs in each species is combined across species.
The combination method: the expression data from in-paralogs are treated as independent biological replicates from the same gene .
The average method: expression data from in-paralogs are combined into one single observation by taking the average value of the raw expression data .
The random method: only expression data from one in-paralog is used (randomly selected). All other values are discarded .
For the combined, average and random method the cross-species p-value is calculated by Fisher’s combined probability test.
The methods were also evaluated using simulations in more diverse settings. When a second in-paralog was differentially expressed in the same direction, i.e. the same effect added to two genes, the performance of the combined and average method increased (Additional file 1: Figure A1). However, when an effect in the opposite direction was added to a second in-paralog (half of the effect subtracted), the power of the average method decreased substantially. At an effect of 6, the power of the average method was reduced from 0.68 to 0.28 while the power for the proposed method decreased from 0.82 to 0.71 (Figure 1 and Additional file 1: Figure A2). When the normal distribution was replaced by a t-distribution with five degrees of freedom, the power decreased equally for all methods (Additional file 1: Figure A3). A similar result was seen when errors were introduced in the homology structure by randomly replacing orthologous genes with non-orthologous genes from the same species (Additional file 1: Figure A4 and A5).
A summary of the experiments used in the meta-analysis of heat stress
A summary of the experiments used in the meta-analysis of estrogen-exposed fish
E2, injected, 10 mg/kg
Pers. com. TD Williams, 
E2, water, 50 ng/L
Pers. com. TD Williams, 
EE2, water, 10 ng/L
GEO:GSE7220, GEO: 
E2, dietary, 5ppm
EE2, water, 10 ng/L
Furthermore, several significant homology groups contained genes that were not identified as estrogen responsive by any of the individual studies, e.g. fatty acid desaturase 2 (group 582, FDR= 1.5×10−7), sodium/potassium-transporting ATPase subunit alpha-1 (group 61, FDR= 7.8×10−6) and translocon-associated proteins delta and gamma (groups 561 and 1423, FDR= 2.9×10−8 and 3.9×10−9 respectively). These genes have all previously been shown to be estrogen responsive in mammals [68–70]. In addition, the translocon-associated protein subunit delta has been shown to be differentially expressed on protein level in Danio rerio exposed to estrogen .
Meta-analysis of gene expression profiles is hampered by the lack of a one-to-one correspondence between orthologous genes from different species. Evolutionary events, such as gene duplications, have resulted in paralogous genes which makes traditional approaches for meta-analysis inapplicable. We therefore developed a new statistical method for meta-analysis of gene expression profiles between experiments performed in evolutionarily distant species. The method takes advantage of the homology structure between the species of interest and can therefore take any number of orthologous and co-orthologous genes into account. The method is general in the sense that it operates on p-values from individual gene expression experiments and is therefore independent of the type of the raw gene expression data. This makes the method applicable to any gene expression measurement platform, including DNA microarrays and quantitative PCR as well as techniques based on sequencing such as RNA-seq. Using p-values also makes it possible to include results from already analyzed experiments where the raw data is not publicly available or missing.
The proposed method can be seen as an extension of Fisher’s combined probability test , which is widely used statistical method for meta-analysis. In fact, when no in-paralogous genes are present in any of the species, the proposed method and Fisher’s method are equivalent. Similarly to the Fisher’s combined probability test, the proposed method is dependent on the validity of the statistical models used to analyze the individual experiments. The combined cross-species p-values are calculated from an analytical distribution derived based on the assumption of gene-specific p-values that are independently and uniformly distributed under the null hypothesis. An alternative approach, which is less dependent on the model assumptions, is to use permutations . For many experimental designs, the null-distribution can be estimated by randomly permuting the labels of the samples in each experiment. However, permutation-based estimation of the null-distribution requires a relatively large number of biological replicates in order to generate a sufficiently large number of permutations. The heat stress data analyzed in this study had, for example, too few observations for estimation of the null-distribution using permutations.
Cross-species meta-analysis of gene expression is dependent of the evolutionary relationship between the orthologous and co-orthologous genes present in the species of interest. Identification of homologous genes in evolutionarily distant species is however complex and can result in false predictions . Such errors will either group non-related genes in the same homology group or, vice versa, scatter homologous genes between different homology groups. Since the proposed method assumes that the evolutionary structure is known and correct, such errors will affect the results negatively. Improved and more accurate algorithms for predicting homologous genes will thus further increase the potential of cross-species meta-analysis of gene expression. On the other hand, the conserved expression profiles generated by the proposed method can be used to correct false predictions of homology. In the heat stress analysis Homologene group 111895 (HSP70-homologs, Homologene release 65) was found to be highly significant in all species except for D. melanogaster. Interestingly, a closer examination of that homology group showed that the HSP70 functional domain was missing from the D. melanogaster gene and which suggests that it may indeed not be a true homolog.
The statistical power of the proposed method and three previously suggested methods for combining multiple observations in microarray analysis was evaluated using simulations. The proposed method was the only solution that was explicitly developed to handle in-paralogous genes and its power was, not surprisingly, considerably higher (Figure 1). The resulting false discovery rate was also lower (Figure 2). When multiple in-paralogs from the same homology group had a similar transcriptional pattern the difference in performance between the methods was reduced. However, when then multiple in-paralogs showed a divergent transcriptional pattern, the difference in performance increased in favor of the proposed method. This reflects the underlying assumptions, where the proposed method assumes that only one of the in-paralogs in homology group is differentially expressed while the others are non-responsive. The combination and average methods does, on the other hand, assume that all in-paralogs are affected by the treatment. It should also be noted that conditions used in the simulations are idealized and the results should therefore be interpreted as such. Real gene expression data does not follow a Gaussian distribution and has a complex correlation structure, both between genes and samples [6, 73–75]. The simulation study shows, however, that the loss in statistical power of detecting differentially expressed genes in cross-species meta-analysis may be substantial if in-paralogs are not properly incorporated in the analysis.
The proposed method was used to compare the gene expression response to heat stress based on microarray data from eight eukaryotes. The analysis identified several well-known mechanisms involved in the transcriptional response to heat. Most pronounced was the up-regulation of molecular chaperons and 10 of the 15 most significant homology groups corresponded to heat stress proteins from four of the five major chaperon families (Figure 3). Functional enrichment of gene ontology terms revealed additional biological processes associated with the cellular response to heat. The number of significant homology groups was also shown to increase with the number of included species. These results show that the proposed model generated biologically relevant results by combining gene expression profiles from evolutionarily distant species. Analysis of evolutionarily conserved gene expression changes under heat stress has previously been suggested as an efficient approach to further understand the underlying biological processes . It is therefore plausible that a more in-depth analysis of our result from the cross-species meta-analysis may result in more insights and novel findings within this area.
Inter-species extrapolations is a cornerstone of ecotoxicological risk assessment since only a tiny fraction of the species present in the environment can be studied in the laboratory . Comparisons of inter-species gene expression profiles provide an attractive way to identify evolutionarily conserved modes of action and novel biomarkers of exposure or effect. We therefore used the proposed method to find common transcriptional responses in four different fish species. The analysis revealed several known and well-established responses of estrogen, some which have been associated with adverse physiological effects. The method also identified differentially regulated genes that were not classified as estrogen responsive by the individual experiments. This shows that the method can be used to identify evolutionarily conserved transcriptional responses to toxicants in ecologically relevant species and it demonstrates the potential of cross-species meta-analysis within ecotoxicology.
Cross-species analysis of gene expression is dependent on the similarities in the transcriptional responses of the studied species. However, evolutionarily distant species have fundamental differences in their physiology which makes it hard, or even impossible, to perform experiments under identical conditions. Even though the associated biological processes are evolutionarily conserved the differences in experimental design and execution can introduce substantial variability in the transcriptional responses. In the cross-species analysis of heat stress we included data from eight species that were treated with different degrees of heat stress during different time spans. There were also differences in the designs of the estrogen exposures, e.g. exposure concentrations, times and routes. Our results show, however, that for both these examples of cross-species analysis, the experiments were similar enough to generate biological relevant results. It is, on the other hand, hard to estimate what evolutionarily conserved transcriptional responses that are not identified due to differences in the experimental designs.
Cross-species analysis of gene expression is complicated by the non-trivial relationships between genes from different species. The new statistical method proposed in this study takes the evolutionary structure into account and can therefore compare transcriptional profiles from species with any number of orthologous and co-orthologous genes. The performance of the proposed method, compared to other existing solutions, was therefore considerably higher when in-paralogous genes are present. As a proof-of-concept, the method was used to identify evolutionarily conserved transcriptional responses in microarray data from heat stress experiment performed in eight diverse species. The applicability of the method within ecotoxicology was also demonstrated by the identification of known and novel responses in fish exposed to estrogens. An implementation of the method for the statistical language R is available for free at http://bioinformatics.math.chalmers.se/Xspecies/.
It follows that any pair of genes g ijk and in homology group i are in-paralogs if j = j ′ and orthologs or co-orthologs if j ≠ j ′.
where K ij is a constant and w j are arbitrary experiment-specific weights summing to 1.
S i is thus a weighted sum of independent exponentially distributed random variables with intensity 1. The weights contains two parts, an experimental specific weight w j and 1/(kExp[Y i j ]). The latter compensates for the number of paralogs in order to avoid bias from large homology groups. The weights w j are arbitrarily and can be set to weigh individual experiments up and down. This is for example useful when multiple experiments are performed in a single organism (see Estrogen exposure below for an example). However, more sophisticated weighting strategies are also possible, such as weights based on the evolutionary distance between the included species (e.g. evolutionary distinctiveness score ).
The hypothesis in 1 can now be tested and a corresponding p-value calculated by comparing the observed value S i with the null distribution of S i .
Simulations were performed on homology groups from Homologene for the species Saccharomyces cerevisiae (4932), Schizosaccharomyces pombe (4896), Arabidopsis thaliana (3702), Oryza sativa (4530), Drosophila melanogaster (7227), Danio rerio (7955) Mus musculus (10090), Homo sapiens (9606) (NCBI Taxonomy IDs are given in parenthesis). Each gene was assumed to be measured in two different groups, one control and one treated, with three independent observations from each. Data was simulated from a Gaussian distribution with mean value 0 and variance 1 and p-value calculated using a two-population t-test assuming equal variance. For differentially expressed orthologous groups (10%, randomly selected) an effect ranging from 0 to 10 was added to the treated group (e.g. changing the expected value from 0 to the effect). For groups and species with in-paralogous genes the effect was added to one single in-paralog (randomly selected). The weights w ij in S i were set to be uniform. For the combined method all observations from in-paralogs treated as independent replicated observations for one single gene (homology group). For the average method, an average was taken over all observations from in-paralogs generating one single observation for each observation. For the random method one of the in-paralogs was randomly selected and other discarded. For these three methods the cross-species p-value was calculated by Fisher’s combined probability test . The false discovery rate for homology group i was estimated by calculating the proportion of false positives among the i most significant groups.
Intensity data from Affymetrix type of microarrays was pre-processed using RMA  while intensity data from two-channel microarrays was normalized using global loess . The quality of each microarray was assessed by inspecting scatter and MA plots of probe-wise intensity before and after normalization. For all include experiments, differentially expressed genes were identified using the moderated t-statistic  implemented in the LIMMA R-package. Cross-species analysis using was performed using the proposed method where up- and down-regulated genes were tested separately using one-sided tests. The most significant p-value was then selected. The cross-species p-values were finally corrected for multiple testing using Benjamini-Hochbergs false discovery rate.
Gene expression data from eight experiments investigating the effects of heat stress in eight species were fetched from Gene Express Omnibus and ArrayExpress (Table 1). Homologene release 65 was used to describe the evolutionary relationship between the genes from the different species. The arbitrary component of the weights was set to be uniform over the eight experiments. The homology groups were populated with Gene Ontology terms based on species-specific annotations retrieved from the GO Consortium FTP (ftp://ftp.geneontology.org/pub/go/gene-associations/). Only terms with an experimental evidence code (i.e. EXP, IDA, IPI, IMP, IGI and IEP) were considered. Functional enrichment was inferred using the topGO R package .
The five gene expression experiments included in the analysis are summarized in Table 2. Gene expression data was retrieved from the Gene Expression Omnibus, ArrayExpress or through direct contact with the authors. Homology groups were inferred from the corresponding EST and transcript sequences using OrthoMCL  with an inflation index of 1.5 (all other parameters had default values). To avoid bias from the multiple experiments performed in Oncorhynchus mykiss the arbitrary weight component was set to 0.25, 0.25, 0.25, 0.125 and 0.125 (following the order in Table 2).
This research was supported by the Life Science Area of Advance at Chalmers University of Technology, Sweden, the Swedish Research Council (VR), the Foundation for Strategic Environmental Research (MISTRA) and the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (FORMAS) and Swedish Society for Medical Research (SSMF). We also acknowledge Timothy D Williams for providing gene expression data. Support from the Gothenburg Bioinformatics Network (GOTBIN) is also gratefully acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.