Phylogenetic reconstruction of ancestral character states for gene expression and mRNA splicing data
© Rossnes et al; licensee BioMed Central Ltd. 2005
Received: 23 February 2005
Accepted: 27 May 2005
Published: 27 May 2005
As genomes evolve after speciation, gene content, coding sequence, gene expression, and splicing all diverge with time from those of ancestors and close relatives. A general minimum evolution method for analyzing continuous characters in a phylogenetic framework is presented. It allows ancestral character states to be reconstructed and along-branch evolution to be measured.
A software package for reconstruction of continuous character traits, such as relative gene expression levels or alternative splice site usage, is presented and is available for download at http://www.rossnes.org/phyrex. The program was applied to a primate gene expression dataset to detect transcription factor binding sites that have undergone substitution and may therefore have driven lineage-specific differences in gene expression.
Systematic analysis of lineage-specific evolution is becoming the cornerstone of comparative genomics. New methods, like phyrex, extend the capabilities of comparative genomics by tracing the evolution of additional biomolecular processes.
Following speciation, many possible molecular events can drive the divergence of species. Three of the most important mechanisms are changes in the coding sequence of proteins that alter protein function, changes in regulatory regions that affect gene expression, and changes in regulatory regions that affect mRNA splicing.
The evolution of protein-coding sequences has been studied systematically in The Adaptive Evolution Database (TAED), where such sequences were grouped into gene families. Within these gene families, the ratio of nonsynonymous to synonymous nucleotide substitution rates (Ka/Ks) was used to detect an excess of nonsynonymous substitution, with positive selection as a proxy for potential functional change. All cases of positive selection were then mapped from the gene trees to the species tree.
No systematic approach has been taken to examine relative gene expression or mRNA splicing in the same way, partly because both appropriate methods and datasets are lacking. One approach to studying the evolution of gene expression is to examine the substitution rate in promoters and look for lineages with excess substitution, analogous to Ka/Ks for protein-coding sequences. This can then be correlated with relative expression levels. An alternative approach is to reconstruct ancestral gene expression states and to examine lineages that show a significant change. This has recently been implemented using a maximum likelihood approach for gene expression data.
Using the principle of minimum evolution, a fast, general method has been developed that explicitly reconstructs the ancestral state of continuous character traits, such as gene expression and mRNA splicing. The speed of this method will enable application to large datasets with many species and readily enables a subsequent mapping of data from gene expression trees to species trees.
Another obstacle to extending TAED-like approaches is the lack of applicable datasets. For mRNA splicing, comparisons of quantitative expressed sequence tag (EST) data and genomic sequence data are used to evaluate relative splicing levels, but existing cross-species comparisons include very long branches. For gene expression, several datasets now exist, including closely related species or isolates of yeast and primates. While these datasets are preliminary, they are a starting point for testing methods. Here, we present our minimum evolution method, which is available as free software to download at http://www.rossnes.org/phyrex, and test its performance on the cross-species primate dataset of Enard et al.
Gene Expression and Sequence Data
Gene expression data was collected from Enard et al. and contains samples from brain and liver of human, chimpanzee and orangutan. Sequence data was collected from Ensembl and consists of the sequence 200 bp upstream of the gene transcription start site of the genes in the gene expression dataset.
The reference species tree was taken from Arnason et al., an accepted phylogeny in the field.
Reconstruction of Ancestral Gene Expression States
The reconstruction of continuous characters was done using a minimum evolution approach. A range of values was obtained by running up and down a phylogeny and determining intervals consistent with minimum total evolution over a tree. Once the values converged on final intervals, the mid-point of the range was selected.
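The interval procedure described above can be illustrated with a minimal Python sketch. It is not the phyrex implementation, whose exact rules are not given here; it assumes a rooted binary tree and uses a Farris-style interval pass for linear (absolute-change) parsimony. The `Node` class, the function names, and the expression values are all hypothetical. On the downward pass each internal node receives the overlap of its children's intervals or, if they do not overlap, the gap between them. On the upward pass the root takes the mid-point of its interval and every other node takes the value in its own interval closest to its parent, which yields one reconstruction of minimum total change.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    value: Optional[float] = None              # observed value (tips only)
    children: List["Node"] = field(default_factory=list)
    interval: Optional[Tuple[float, float]] = None
    state: Optional[float] = None              # reconstructed value


def down_pass(node):
    """Tips-to-root pass: assign each node a preliminary interval."""
    if not node.children:
        node.interval = (node.value, node.value)
        return node.interval
    left, right = (down_pass(child) for child in node.children)
    lo = max(left[0], right[0])
    hi = min(left[1], right[1])
    # Overlapping child intervals give their intersection;
    # disjoint ones give the gap that separates them.
    node.interval = (lo, hi) if lo <= hi else (hi, lo)
    return node.interval


def up_pass(node, parent_state=None):
    """Root-to-tips pass: pick one value per node."""
    lo, hi = node.interval
    if parent_state is None:
        node.state = (lo + hi) / 2.0           # mid-point at the root
    else:
        node.state = min(max(parent_state, lo), hi)  # closest to parent
    for child in node.children:
        up_pass(child, node.state)


def branch_changes(node, changes=None):
    """Signed change along each branch, keyed by the child node's name."""
    if changes is None:
        changes = {}
    for child in node.children:
        changes[child.name] = child.state - node.state
        branch_changes(child, changes)
    return changes


# Hypothetical relative expression values for a single gene.
human = Node("human", 8.0)
chimp = Node("chimp", 6.0)
orangutan = Node("orangutan", 5.0)
hc = Node("human_chimp_ancestor", children=[human, chimp])
root = Node("root", children=[hc, orangutan])

down_pass(root)
up_pass(root)
changes = branch_changes(root)
```

With these made-up values, the human-chimpanzee ancestor is reconstructed at 6.0, so the whole change of 2.0 is assigned to the human branch and none to the chimpanzee branch. Clamping to the parent is a simplification: the text describes taking the mid-point of each converged interval, which can give a different but equally short reconstruction.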
Ancestral sequence reconstruction
Along branch analysis
TESS (Transcription Element Search Software) was used to search the TRANSFAC database. TESS takes a candidate sequence as input and searches TRANSFAC for transcription factor binding sites that can be locally aligned with regions of the input sequence. The output from TESS consists of a list of transcription factor binding sites that match the input sequence, together with the position and length of each site. These lists were manually checked for each input sequence and correlated with the substitution information calculated by PAML. Promoters with more than 5% pairwise substitution between human and chimpanzee were discarded. If a substitution occurred within a transcription factor binding site along any branch, it was annotated. Distributions were generated for the amount of along-branch change in gene expression and, ultimately, for the number of TESS-predicted transcription factor binding sites that were mutated along any branch. This number was normalized by the total amount of substitution to generate an enrichment value.
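The filtering and normalization steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the sequences are taken to be already aligned and of equal length, binding sites are given as hypothetical `(start, length)` pairs in alignment coordinates, and the enrichment value is taken to be the fraction of along-branch substitutions that fall inside a predicted site, since the exact normalization is not spelled out. All function names and sequences are invented for the example.

```python
def pairwise_divergence(seq_a, seq_b):
    """Fraction of ungapped aligned columns at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def substitution_positions(ancestor, descendant):
    """Alignment positions that changed along one branch (gaps ignored)."""
    return [
        i
        for i, (a, d) in enumerate(zip(ancestor, descendant))
        if a != d and a != "-" and d != "-"
    ]


def site_enrichment(ancestor, descendant, sites):
    """Share of along-branch substitutions that hit a binding site.

    `sites` is a list of (start, length) pairs, 0-based.
    Returns None when the branch carries no substitutions.
    """
    subs = substitution_positions(ancestor, descendant)
    if not subs:
        return None
    in_site = [
        i for i in subs
        if any(start <= i < start + length for start, length in sites)
    ]
    return len(in_site) / len(subs)


# Hypothetical toy sequences; real promoters here would be 200 bp long.
ancestor = "ACGTACGTAC"
human = "ACGAACGTAT"
chimp = "ACGTACGTAC"
sites = [(2, 4)]  # one made-up binding site covering positions 2 to 5

# The 5% cut-off from the text, applied to the human-chimpanzee pair.
keep = pairwise_divergence(human, chimp) <= 0.05
enrichment = site_enrichment(ancestor, human, sites)
```

In this toy case the promoter would be discarded, because two of ten positions differ between human and chimpanzee, while one of the two human-branch substitutions lies inside the made-up site.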
Results and Discussion
A software package is presented that uses a minimum evolution algorithm to reconstruct ancestral states of continuous character data, such as relative gene expression or alternative splicing levels, and to apportion the amount of change to each branch of a phylogenetic tree. This software package is available for download at http://www.rossnes.org/phyrex.
Enard et al. present an analysis of gene expression in a set of genes in brain and liver from human and chimpanzee, with orangutan as an outgroup. Using this dataset, we reconstructed ancestral gene expression values at the last common ancestor of human and chimpanzee. The promoter sequences (200 bp upstream of the gene start site) for these genes from human and chimpanzee, with mouse as an outgroup, were downloaded from Ensembl and aligned, and the sequence of the last common ancestor of human and chimpanzee was reconstructed using BASEML from the PAML package, as described in the methods section.
While enhancers can regulate gene expression over long distances and can be critical to changes in gene expression, many important regulators of transcription are located in the 200 bp immediately upstream of the gene start site. While our knowledge of enhancer function does not permit a fully systematic analysis, analysis of promoter regions can be used to identify a non-exhaustive set of candidates.
The distribution of gene expression changes across branches is shown in Figures 5 and 6 for the human and chimpanzee lineages, respectively. The strong central peak was expected, given the conservative properties of the method. The asymmetry of the distributions was not expected and may reflect problems with the original dataset. If chimpanzee genes are hybridized to human sequences and then normalized to correct for substitution rates, this type of bias may be expected. However, despite the unexpected shape of the distributions, there is still signal in the data, reflected in the significant enrichment values obtained.
The average number of substitutions that occurred in transcription factor binding sites in genes at the tails and center of the distributions are shown. When normalized by the total substitution rate in these promoters, the enrichment of transcription factor binding site substitution detection is shown.
0.64 ± 0.25
0.42 ± 0.01
0.92 ± 0.40
0.48 ± 0.24
0.59 ± 0.15
0.90 ± 0.28
The supplementary materials (http://www.rossnes.org/phyrex/supl.html) list the genes that have been implicated by this analysis, including the prospective transcription factor binding sites that have undergone substitution. The dataset of genes is too small to pick out a significant gene function signal from the upregulated and downregulated genes along each lineage. Along the human lineage, fewer substitutions were predicted to destroy transcription factor binding sites in the upregulated genes than in the control, while the data for the other lineages did not differ from the control. Because it is not clear which destroyed binding sites are normally occupied by transcriptional activators, it is difficult to interpret the biological significance of this result. The predicted binding sites may be candidates for playing an important role in the lineage-specific divergence of human and chimpanzee, and they warrant further testing for their activity in regulating expression from the respective promoters. However, little experimental data is currently available to validate the study beyond the statistical validation seen in the enrichment values. Nevertheless, evolutionary approaches that consider along-branch change, as opposed to pairwise comparison of extant sequences (as in Figure 1), do hold promise in pinpointing substitutions that cause the divergence of gene expression during species diversification.
Altogether, a method and accompanying software are made available for analysis of gene expression and alternative splicing shifts in a phylogenetic context and for detecting substitutions responsible for driving such shifts. Given some of the approximations made (enhancers ignored, minimum evolution rather than maximum likelihood, an asymmetrical starting dataset), the method performs surprisingly well. It is a valuable starting point for this type of analysis and is open to future improvement. Ultimately, it will be valuable in comparative genomics to compare lineage-specific changes in gene content and in coding sequences with changes in gene expression and alternative splicing to get a fuller picture of evolution.
Availability and requirements
Project name: Phyrex
Project home page: http://www.rossnes.org/phyrex
Operating systems: Linux
Programming language: Java
Other requirements: Java 1.4.2
We are grateful to FUGE, the Norwegian Functional Genomics Platform, for providing funding and to the Informatics Institute at the University of Bergen for providing support.
- Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA: The Adaptive Evolution Database (TAED): A phylogeny based tool for comparative genomics. Nucleic Acids Research 2005, 33: D495-D497. 10.1093/nar/gki090
- Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, Wirkner U, Ansorge W, Paabo S: A neutral model of transcriptome evolution. PLOS Biology 2004, 2(5):e132. 10.1371/journal.pbio.0020132
- Gu X: Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics 2004, 167: 531–542. 10.1534/genetics.167.1.531
- Fitch WM: Toward defining the course of evolution: Minimal change for a specific tree topology. Syst Zool 1971, 19: 99–113.
- Modrek B, Lee CJ: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nature Genetics 2003, 34: 177–180. 10.1038/ng1159
- Townsend JP, Cavalieri D, Hartl DL: Population genetic variation in genome-wide gene expression. Molecular Biology and Evolution 2003, 20: 955–963. 10.1093/molbev/msg106
- Enard W, Khaitovich P, Klose J, Zollner S, Heissig F, Giavalisco P, Nieselt-Struwe K, Muchmore E, Varki A, Ravid R, Doxiadis GM, Bontrop RE, Paabo S: Intra- and interspecific variation in primate gene expression patterns. Science 2002, 296: 340–343. 10.1126/science.1068996
- Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T, Clamp M: An overview of Ensembl. Genome Research 2004, 14: 925–928. 10.1101/gr.1860604
- Arnason U, Xu X, Gullberg A, Graur D: The "Phoca standard": an external molecular reference for calibrating recent evolutionary divergences. Journal of Molecular Evolution 1996, 43: 41–45.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research 2003, 31: 3497–3500. 10.1093/nar/gkg500
- Yang Z: PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 1997, 13: 555–556.
- Taatjes DJ, Marr MT, Tjian R: Regulatory diversity among metazoan co-activator complexes. Nature Reviews Molecular and Cellular Biology 2004, 5: 403–410. 10.1038/nrm1369
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.