Inferring transcript phylogenies
© Christinat and Moret; licensee BioMed Central Ltd. 2012
Published: 11 June 2012
Alternative splicing, an unknown mechanism 20 years ago, is now recognized as a major mechanism for proteome and transcriptome diversity, particularly in mammals--some researchers conjecture that up to 90% of human genes are alternatively spliced. Despite much research on exon and intron evolution, little is known about the evolution of transcripts.
In this paper, we present a model of transcript evolution and an associated algorithm to reconstruct transcript phylogenies. The evolution of the gene structure--exons and introns--is used as basis for the reconstruction of transcript phylogenies. We apply our model and reconstruction algorithm on two well-studied genes, MAG and PAX6, obtaining results consistent with current knowledge and thereby providing evidence that a phylogenetic analysis of transcripts is feasible and likely to be informative.
Alternative splicing is a mechanism to produce different proteins from the same gene--the end of the paradigm "one gene, one protein." In many genomes, several, or even most, genes are split into pieces called exons, separated by regions called introns, and a splicing mechanism takes the transcribed string of exons and introns, removes the introns, and splices the exons to form a single continuous string that is then translated into a protein. In alternative splicing, a mechanism underestimated until 1990, the splicing produces variants in which some of the exons can be omitted (and occasionally even some of the introns retained), thereby causing different proteins to be produced. Alternative splicing exists to some degree in most eukaryotes, but is most frequent in the more complex lineages. Thus it is present, but limited, in plants and fungi, but quite common in vertebrates--some researchers have conjectured that up to 90% of the human genes are alternatively spliced [1–3]. Alternative splicing is now recognized as a major mechanism for proteome and transcriptome diversity [4, 5].
The implications of this shift in paradigm are significant. The basic model of transcriptome evolution--DNA modification at the gene level and alternative transcription start sites--is incomplete: any modification that affects the splicing mechanism has to be considered. However, while evolution of DNA at the gene level has been the subject of intense scrutiny for decades, very little is known regarding the changes in the splicing products of alternative splicing. Thus there is a need to define a model of evolution for transcripts, not at the nucleotide level, but at the splicing level--which exons (and introns) are included, which excluded?
A better understanding of the relationships among different transcripts would benefit annotation transfer. Different proteins from one gene may have different functions, may be localized to certain tissues, or may be present at different developmental stages. Knowledge of their evolution would help in assessing the function of their homologues. Transcript phylogenies would also contribute to next-gen sequencing methods, especially RNA-seq. For instance, the "DREAM6 Alternative Splicing Challenge" asks to reconstruct alternatively spliced mRNA transcripts from short mRNA-seq data without a reference genome, but using the transcriptomes of other organisms . A transcript phylogeny would help in assessing the support value of a predicted transcript.
In this paper, we propose a model of transcript evolution and an associated algorithm to reconstruct transcript phylogenies.
Many studies have been published on the rate of exon insertion and deletion and on the statistics of different types of splicing, but few researchers so far have studied the evolution of transcripts . Harr and Turner showed that most transcripts among Mus subspecies were novel . Nurtdinov et al. compared the human and mouse transcriptomes and concluded that half of the genes give rise to species-specific isoforms and only three quarters of all isoforms are present in their orthologous genes . Splicing is also affected by non-DNA events. Modification of the chromatin structure can yield changes in the expression of a given transcript and may even create a new transcript or silence an existing one . Finally, a few groups studied the correlation between gene duplication and alternative splicing [9–11].
These studies indicate that alternative splicing is a fast-evolving mechanism and hint that most of the transcripts may be little more than evolutionary noise. These studies also indicate that groups of species share a significant number of transcripts, whose relationship can only be delineated with a more complete model.
A model of transcript evolution
In the description of alternative splicing, the simplest concepts are those of constitutive exons, which are part of every transcript, and of cassette exons, which may or may not be present in any given transcript. In general, any exon that is not constitutive is called alternative. There exists other types of splicing mechanisms, of which alternative 3'- or 5'-sites and intron retention are the most frequently cited [1, 3, 5, 12, 13]. Note that the definition of a constitutive exon requires that all transcripts for a given gene be known. If, however, alternative splicing is closer to a biased random sampling from the space of all possible isoforms (so that every possible splice form is produced at some or other time), then there may be no such thing as a constitutive exon. As the debate on this issue continues and as our aim is to provide a model against which to test various hypotheses regarding transcript evolution, we develop a model in which we consider the existence of constitutive exons as a given.
We thus model a transcript as a subset of the gene exons. We model alternative 3'- or 5'-sites as constitutive exons with two or more internal states--each state encoding for one particular configuration. Finally, we assimilate intron retention to cassette exons. We model transcript evolution as a two-level process. The gene structure, viewed in terms of its collection of exons and introns, constitutes one level, while the collection of transcripts obtained from that structure constitutes the other level. Modification of the gene structure affects the transcriptome, but modification of the transcriptome does not affect the gene structure. Peng and Li  showed that the status of an exon, constitutive or alternative, is conserved through tandem exon duplication, a finding that hints at a model of evolution where the status of an exon is encoded at the gene level. Consequently we have three possible exon states in our model of gene evolution: absent, constitutive, or alternative. We assume that all transitions--birth, death, and mutation between constitutive and alternative--are equally likely.
Evolutionary events on the transcriptome
No transcript had this exon and none will have it.
Some transcripts had this exon and none will have it.
All transcripts had this exon and none will have it
No transcript had this exon and some will have it.
Some transcripts had this exon and some will have it.
All transcripts had this exon and some will have it.
No transcript had this exon and all will have it.
Some transcripts had this exon and all will have it.
All transcripts had this exon and all will have it.
Finally we assume that a transcript can undergo a lethal mutation or be subject to regulation and disappear at any time. In a manner similar to gene duplication, new transcripts may also be created at any time during evolution.
Our model yields a forest of transcript trees, which represents the evolution from ancestral transcripts to observed transcripts. Each transcript tree is a subtree of the gene tree, since all transcripts arise from that gene family and, if they evolve, must evolve on the same tree. If a new transcript arises from an existing one, the new transcript will be considered as the root of a new transcript tree. Our basic model uses a fixed cost for the creation of new transcripts. Of course, the basic model does not assume that transcripts are created ab initio; rather, it postulates a hidden relationship with an unknown ancestor.
New transcripts arise from existing ones and thus are the result of evolutionary changes that may legitimately correspond to different costs. We use a fixed cost for simplicity and also because it leads to a very efficient pruning of the search space. We have designed and implemented an extended model in which the creation of a new transcript is dynamically assigned a cost that corresponds to its evolution from its closest ancestor (a maximum parsimony approach). However, the dynamic cost computation prevents good pruning and makes the problem intractable for medium-sized instances.
Our algorithm starts by reconstructing the exon structure of the ancestral genes, then looks for the most parsimonious forest of transcript trees. For the ancestral gene reconstruction algorithm, we used a maximum parsimony approach, using Dollo's parsimony--that is, an exon cannot be created twice [15, 16].
The algorithm was tested on two well studied gene families to assess the correctness of the model on biological data. Further testing was done on simulated data to test the algorithm itself.
Results on the MAG gene
The Myelin-Associated Glycoprotein (MAG) is a neuronal transmembrane glycoprotein that acts both as a ligand for an axonal receptor and as a receptor for an axonal signal . It has been extensively studied and due to its short length and limited alternative splicing, it makes a perfect candidate for testing our algorithm.
Results with the extended model
Many solutions of minimum cost may exist. Consequently our algorithm can return several solutions. In order to sort the best solutions, we computed the total number of events including exon gain and loss at the gene level. This process acts as a second filter. However, the number of solutions is highly informative. For instance, Figure 6-B shows only one of the 32 solutions whereas only 2 solutions were found for c D = c B = 1. This indicates that the phylogenies for c B = c D > 1 are far from being certain and should thus be considered with extreme care. Moreover, only constitutive exons are shared between fishes and mammals. Consequently, the edges linking a fish transcript to a mammal transcript will highly depend on the ancestral gene reconstruction.
Results on the PAX6 gene
The PAX6 gene is part of the well-studied paired box gene family (PAX), which encodes transcription factors for many developmental processes and is subject to heavy alternative splicing [24–26]--41 transcripts were found in a gene in the pigeon retina . The canonical isoform is characterized by an N-terminal paired domain followed by a linker, a paired-type homeodomain, and a (P/S/T)-rich C-terminal domain, yielding a 422-amino-acid protein (437 in zebrafish). The gene undergoes alternative splicing and the best-known alternative isoform, +5a, differs from the canonical isoform by the inclusion of exon 5a. This 14-amino-acid insertion in the paired domain disrupts the DNA-binding ability of the N-terminal domain and enhances the binding of the C-terminal domain, thus creating a new set of interactions . As can be seen in Figure 3-B, gene duplication occurred in the fish species leading to two PAX6 genes in the zebrafish and the fugu fish.
Transcripts and orthologous exons
Mammal transcripts were obtained from the Human-transcriptome Database for Alternative Splicing (H-DBAS)  and fish transcripts from the Ensembl database . In the H-DBAS database, we considered only transcripts that were present in both the cDNA and mRNA databases, except in the case of R. norvegicus, where only the mRNA database was available. Similarly, with the Ensembl database, we used only transcripts that had CDS or UTR support. The gene tree was obtained from the Ensembl database and genes are thus named accordingly. The canonical and the +5a transcripts were identified through their protein product and the literature.
The literature on the PAX6 gene in the fugu fish is very sparse--we could find only one article, by Miles et al. , but that article does not corroborate the information in the Ensembl database. Thus we used the Ensembl data, as it is more recent, but we have no ground truth regarding the canonical or alternative isoforms.
Results on simulated data
In order to test the performance of the algorithm, we designed a simple scheme to generate transcripts with a given tree structure. Starting with n T ancestral transcripts at the root and n E random exons, each gene exon can either be born or die independently along the tree. The same evolutionary process applies to transcripts with exon gain or loss depending on the current set of gene exons.
We also tested the algorithm for scalability, since the running time grows faster than exponential in the worst case. Running the algorithm on the MAG gene takes a few seconds while running it on the PAX6 gene takes a few days. However our focus in this paper is to validate the concept of transcript phylogenies and show that reconstruction, within some limits, is possible. Past this step, we are confident that a heuristic can be designed to handle large problems with reasonable accuracy.
The input of the algorithm is a gene tree with a set of leaf transcripts and orthologous exons. (Paralogous exons are considered as unrelated.)
The algorithm begins by reconstructing the state of the ancestral genes' exons--absent, alternative, or constitutive--using Sankoff's algorithm for the small parsimony problem . Without any further knowledge on exon evolution, we assumed that every transition has equal cost. A constraint is added to the algorithm to ensure that the result is consistent with Dollo parsimony--that an exon cannot be created more than once.
Transcript phylogenies are then reconstructed using a two-step algorithm. For each topology, a lower bound is computed. If the lower bound is higher than the minimum encountered so far, then the topology is discarded, since there could not exist a solution with this topology with a lower score than the current optimum. Otherwise the best solution for this topology is computed. We call this last step the leaf assignment step; it is the only part of the algorithm that makes use of the previously reconstructed evolution of the gene's exons.
Now, in order to prune the search space efficiently, we need to establish quickly a rather good solution. Since the algorithm explores the topology space in a deterministic, breadth-first search manner, it could, in the worst case, move from worst to best topology, improving the score at each step, and thus unable throughout the procedure to prune any part of the solution space. To make such a behavior extremely unlikely on any data, we establish initial solutions by randomly sampling the search space for each number of trees before the exploration of the search space starts and retaining the configuration with the lowest score as an upper bound.
Our model makes no distinction between an event of zero cost and no event at all. Yet we would like to see only solutions that have the lowest number of events, so our algorithm uses (a version of) the number of events as a secondary criterion to rank the optimal solutions. For each tree in a given solution, we sum over all leaves the exons that are present in at least two leaves, and then divide this value by the number of exons that are present in at least one leaf and by the number of leaves. The result is an index between 0 (all exons are unique to their leaves) and 1 (all leaves have the same exons).
Topologies are generated with increasing numbers of trees. For each topology with t trees (t-topology), any edge removal yields a new topology with t+1 trees. However, that process alone does not suffice to generate all (t + 1)-topologies. Therefore, once that procedure has been applied to all t-topologies (and duplicates have been removed), we use branch-swapping to generate the remaining (t + 1)-topologies. A branch swap disconnects edge (n1, p1) from tree t1 and edge (n2, p2) from tree t2 and creates two new edges: (n1, p2) and (n2, p1). Here n1 and n2, and also p1 and p2, represent transcripts from the same ancestral gene. The algorithm again checks for duplicates, as it searches for all branch swaps on the set of (t + 1)-topologies until no new topology can be generated.
Scoring solutions and topologies
where N tree is the number of trees, N death the total number of transcript losses, and S F the sum of the maximum parsimony scores of each tree. S F is the only quantity that depends on the evolution of the gene's exons. c B and c D are parameters that control the cost of transcript birth and death.
A lower bound for topologies can be computed by considering the first part of the scoring function, c B · N tree + c D · N death . This value does not depend on the transcripts, but only on the topology. However, a better lower bound can be computed by adding a lower bound on the S F score. For each tree, we compute the best leaf assignment as if all transcripts were available, corresponding to a topology with a single nontrivial tree. (In the real leaf assignment procedure, of course, trees compete for transcripts.) The sum of these values is a true lower bound.
Leaf assignment procedure
Given a topology, leaf assignment remains challenging: given N genes and k transcripts per gene, a topology can lead to k! N- 1 possible leaf assignments. To tackle this problem, we combine a bottom-up dynamic programming algorithm with Sankoff's algorithm for the small parsimony problem.
We define a state as an ordered list of transcripts for a given gene. Each transcript t in a state has pointers to its left and right children, l(t) and r(t)--if any. The ordering of the transcripts is the same in two states of the same gene, but the pointers change to reflect phylogenetic relationships. The number of transcripts of ancestral genes (inner nodes) is defined by the topology.
possible states. The product of the first two terms of Equation 2 is the number of possible assignments for the k LR transcripts that have two children, while the last two terms compute the number of assignments for the transcripts that have only child. Since the first part selected k LR elements, there remains only n - k LR elements to choose from.
Where roots (s A , s parent ) contains all transcripts of s A that have no ancestor in s parent . (If s parent is null then it is the set of transcripts in s A .)
where l and r are the left and right children of transcript t, c(a, b) is the cost of evolving from a to b and E is the set of all exon states. In our case we have c(a, b) = 0 for a = b and c(a, b) = 1 otherwise. However the cost function must be slightly modified to account for exon evolution at the gene level. If a constitutive exon was gained or an exon was lost (at the gene level), then we set the cost of the change to zero. Additionally, if exon i is absent in the gene, then for all transcripts in the gene we have t ix = ∞,∀x > 0. Note that the left and right children of t depend on the choice of s L and s R . A similar equation can be derived in case of single-child transcripts. minMP(t) is then the sum over all exons of min u t iu .
The values in Equation (3) are assigned during the postorder traversal; once the score of every state at the root of the gene tree has been computed, the minimum score is retained. Backtracking from the root to the leaves will then produce all optimal transcript phylogenies.
An extended model
The extended model sets a dynamic cost for transcript birth, but retains a constant cost for transcript death. Given a topology, the best leaf assignment is computed and backtracking is used to reconstruct the ancestral transcripts' sequences. Creation of new transcripts is assigned a cost that corresponds to its evolution from its closest ancestor. The birth cost is added only once the leaf assignment procedure has terminated and thus has no influence on the transcript assignment, except in case of multiple solutions, where only those that have a minimum birth cost will be selected.
Developing a good lower bound on the birth cost remains a challenge. This cost can vary between zero and the number of exons, so that simply using the lowest possible value would produce very loose bounds and thus be of no help in the search. (On simulated data and our two test genes, the search procedure using a zero value as a bound on the birth cost always had to look at every topology.)
In this study we addressed the lack of evolutionary model for alternative splicing by presenting a two-level model of transcript evolution and an algorithm to reconstruct transcript phylogenies. Our work opens the door for the study of transcript evolution, as it provides tools for testing evolutionary hypotheses.
We presented two models. The basic model uses a fixed cost for the creation of new transcripts--an unrealistic assumption, but one that greatly decreases the computational cost. The extended model assigns a cost dynamically by finding the closest (least cost) ancestor; however, the dynamic nature of the cost defeats our pruning strategy and the problem became intractable for medium-sized instances.
Results on a well-studied gene, MAG, showed that the extended model yielded results similar to those obtained with the basic model. Good clustering of known isoforms was achieved with the basic model for both gene families (MAG and PAX6) we studied.
Future work involves a faster version of the algorithm, and eventually approximation methods to enable us to use the extended model on large problems.
We would like to thank Prof. Marc Robinson-Réchavi and Dr. Eyal Privman for fruitful discussions that helped us to make our model more biologically plausible.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 9, 2012: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2011: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S9.
- Keren H, Lev-Maor G, Ast G: Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet. 2010, 11 (5): 345-55. 10.1038/nrg2776.View ArticlePubMedGoogle Scholar
- Artamonova II, Gelfand MS: Comparative genomics and evolution of alternative splicing: the pessimists' science. Chem Rev. 2007, 107 (8): 3407-30. 10.1021/cr068304c.View ArticlePubMedGoogle Scholar
- Kim E, Magen A, Ast G: Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. 2007, 35: 125-31. 10.1093/nar/gkm529.PubMed CentralView ArticlePubMedGoogle Scholar
- Harrison PM, Kumar A, Lang N, Snyder M, Gerstein M: A question of size : the eukaryotic proteome and the problems in defining it. Nucleic Acids Res. 2002, 30 (5): 1083-1090. 10.1093/nar/30.5.1083.PubMed CentralView ArticlePubMedGoogle Scholar
- Modrek B, Lee C: A genomic view of alternative splicing. Nat Genet. 2002, 30: 13-9. 10.1038/ng0102-13.View ArticlePubMedGoogle Scholar
- DREAM6 Alternative Splicing Challenge. 2011, [http://www.the-dream-project.org/challenges/dream6-alternative-splicing-challenge]
- Harr B, Turner LM: Genome-wide analysis of alternative splicing evolution among Mus subspecies. Mol Ecol. 2010, 19 (Suppl 1): 228-39.View ArticlePubMedGoogle Scholar
- Nurtdinov RN: Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet. 2003, 12 (11): 1313-1320. 10.1093/hmg/ddg137.View ArticlePubMedGoogle Scholar
- Roux J, Robinson-Rechavi M: Age-dependent gain of alternative splice forms and biased duplication explain the relation between splicing and duplication. Genome Res. 2011, 21 (3): 357-10.1101/gr.113803.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Su Z, Wang J, Yu J, Huang X, Gu X: Evolution of alternative splicing after gene duplication. Genome Res. 2006, 16 (2): 182-9.PubMed CentralView ArticlePubMedGoogle Scholar
- Kopelman NM, Lancet D, Yanai I: Alternative splicing and gene duplication are inversely correlated evolutionary mechanisms. Nat Genet. 2005, 37 (6): 588-9. 10.1038/ng1575.View ArticlePubMedGoogle Scholar
- Lu J: Alternative splicing in teleost fish genomes: same-species and cross-species analysis and comparisons. Mol Genet Genomics. 2010, 283: 531-539. 10.1007/s00438-010-0538-3.View ArticlePubMedGoogle Scholar
- Matlin AJ, Clark F, Smith CWJ: Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005, 6 (5): 386-98. 10.1038/nrm1645.View ArticlePubMedGoogle Scholar
- Peng T, Li Y: Tandem exon duplication tends to propagate rather than to create de novo alternative splicing. Biochem Biophys Res Commun. 2009, 383 (2): 163-6. 10.1016/j.bbrc.2009.03.162.View ArticlePubMedGoogle Scholar
- Dollo L: Les lois de l'évolution. Bull Soc Belge Geol Pal Hydr. 1893, 7: 164-166.Google Scholar
- Alekseyenko AV, Lee CJ, Suchard Ma: Wagner and Dollo: a stochastic duet by composing two parsimonious solos. Syst Biol. 2008, 57 (5): 772-84. 10.1080/10635150802434394.View ArticlePubMedGoogle Scholar
- Quarles RH: Myelin-associated glycoprotein (MAG): past, present and beyond. J Neurochem. 2007, 100 (6): 1431-48.PubMedGoogle Scholar
- Lai C: Two forms of 1B236/myelin-associated glycoprotein, a cell adhesion molecule for postnatal neural development, are produced by alternative splicing. Proc Natl Acad Sci USA. 1987, 84 (12): 4337-41. 10.1073/pnas.84.12.4337.PubMed CentralView ArticlePubMedGoogle Scholar
- Lehmann F, Gäthje H, Kelm Sr, Dietz F: Evolution of sialic acid-binding proteins: molecular cloning and expression of fish siglec-4. Glycobiology. 2004, 14 (11): 959-68. 10.1093/glycob/cwh120.View ArticlePubMedGoogle Scholar
- Fujita N: Developmentally regulated alternative splicing of brain myelin-associated glycoprotein mRNA is lacking in the quaking mouse. FEBS Lett. 1988, 232 (2): 323-7. 10.1016/0014-5793(88)80762-2.View ArticlePubMedGoogle Scholar
- Georgiou J, Tropak MB, Roder JC: Myelin-associated glycoprotein gene. Myelin Biology and Disorders. 2004, 1: 421-467.Google Scholar
- Flicek P: Ensembl's 10th year. Nucleic Acids Res. 2010, 38 (Database issue): D557-62.PubMed CentralView ArticlePubMedGoogle Scholar
- Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010, 5 (6): e11147-10.1371/journal.pone.0011147.PubMed CentralView ArticlePubMedGoogle Scholar
- Holland LZ, Short S: Alternative splicing in development and function of chordate endocrine systems: a focus on pax genes. Integr Comp Biol. 2010, 50: 22-34. 10.1093/icb/icq048.View ArticlePubMedGoogle Scholar
- Short S, Holland LZ: The evolution of alternative splicing in the Pax family: the view from the Basal chordate amphioxus. J Mol Evol. 2008, 66 (6): 605-20. 10.1007/s00239-008-9113-5.View ArticlePubMedGoogle Scholar
- Nornes S: Zebrafish contains two pax6 genes involved in eye development. Mech Dev. 1998, 77 (2): 185-96. 10.1016/S0925-4773(98)00156-7.View ArticlePubMedGoogle Scholar
- Bandah D: A complex expression pattern of Pax6 in the pigeon retina. Invest Ophthalmol Vis Sci. 2007, 48 (6): 2503-9. 10.1167/iovs.06-1014.View ArticlePubMedGoogle Scholar
- Epstein JA: Two independent and interactive DNA-binding subdomains of the Pax6 paired domain are regulated by alternative splicing. Genes Dev. 1994, 8 (17): 2022-2034. 10.1101/gad.8.17.2022.View ArticlePubMedGoogle Scholar
- Takeda J: H-DBAS: human-transcriptome database for alternative splicing: update 2010. Nucleic Acids Res. 2010, 38 (Database issue): D86-90.PubMed CentralView ArticlePubMedGoogle Scholar
- Miles C: Complete sequencing of the Fugu WAGR region from WT1 to PAX6: dramatic compaction and conservation of synteny with human chromosome 11p13. Proc Natl Acad Sci USA. 1998, 95 (22): 13068-13072. 10.1073/pnas.95.22.13068.PubMed CentralView ArticlePubMedGoogle Scholar
- Sankoff D: Minimal mutation trees of sequences. SIAM J Appl Math. 1975, 28: 35-42. 10.1137/0128004.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.