HyLiTE: accurate and flexible analysis of gene expression in hybrid and allopolyploid species
© Duchemin et al.; licensee BioMed Central. 2015
Received: 27 June 2014
Accepted: 16 December 2014
Published: 16 January 2015
Forming a new species through the merger of two or more divergent parent species is increasingly seen as a key phenomenon in the evolution of many biological systems. However, little is known about how expression of parental gene copies (homeologs) responds following genome merger. High throughput RNA sequencing now makes this analysis technically feasible, but tools to determine homeolog expression are still in their infancy.
Here we present HyLiTE – a single-step analysis to obtain tables of homeolog expression in a hybrid or allopolyploid and its parent species directly from raw mRNA sequence files. By implementing on-the-fly detection of diagnostic parental polymorphisms, HyLiTE can perform SNP calling and read classification simultaneously, thus allowing HyLiTE to be run as parallelized code. HyLiTE accommodates any number of parent species, multiple data sources (including genomic DNA reads to improve SNP detection), and implements a statistical framework optimized for genes with low to moderate expression.
HyLiTE is a flexible and easy-to-use program designed for bench biologists to explore patterns of gene expression following genome merger. HyLiTE offers practical advantages over manual methods and existing programs, has been designed to accommodate a wide range of genome merger systems, can identify SNPs that arose following genome merger, and offers accurate performance on non-model organisms.
Keywords: Hybrid; Allopolyploid; Homeolog; RNA-seq; Read assignment
While evolution is usually a gradual process, the creation of a new species through the merger of different parent species occurs near instantaneously. Although increasingly recognized as an important process in the evolution of many biological systems [2-5], how different gene copies (homeologs) are expressed following genome merger remains a major outstanding question [6,7]. Most studies have been restricted to observing just a few genes, thus limiting the ability to study interactions between competing gene regulation systems. High throughput mRNA sequencing now permits whole-genome screening of hybrid and allopolyploid gene expression [9,10]. However, identifying the parental origin of mRNA reads remains challenging, especially for researchers without advanced bioinformatics skills.
To fill this gap, we have developed HyLiTE – Hybrid Lineage Transcriptome Explorer – to produce tables of homeolog expression data from raw mRNA read files in a single step. HyLiTE automatically i) maps reads to a reference genome, ii) masks gene regions with low read coverage, iii) identifies polymorphisms that are diagnostic of parental lineages, iv) classifies reads to parental types, and v) produces detailed summary reports of gene expression in both the hybrid or allopolyploid and its parent species. The final product – tables of homeolog read counts – can be used immediately for downstream analyses (such as determining differential expression between biological conditions, and between the new species and its parents).
Accommodating any number of parent species (for instance, three-parent allopolyploids such as modern hexaploid wheat).
The ability to study systems with either haploid or diploid parents, thus allowing hybrids or allopolyploids with different homeolog and allelic copies.
Using gene references from any species closely related to the study system (hybrid and allopolyploid species often lack good genome resources).
Accommodating any number of biological replicates (and boosting SNP identification by combining information across replicates).
Identifying new polymorphisms that have arisen within the hybrid or allopolyploid (especially important in species derived from older merger events).
Improving SNP calling by using (optional) genomic DNA information in addition to high throughput mRNA sequences.
Providing statistical validation of SNP calls and automatically masking ‘polymorphisms’ with low statistical support.
An experimental feature that identifies putative chimeric genes (i.e., genes in which the homeologs have recombined within the hybrid or allopolyploid), but see Additional file 1 for details on current limits of accuracy.
The standard HyLiTE analysis, which will be adequate for most users, comprises a single, short command line. However, advanced users have complete flexibility to override individual steps. For instance, by default, Bowtie2 is used for read mapping, but HyLiTE can be run with any mapping software that returns the standard SAM mapping format.
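To illustrate what "the standard SAM mapping format" provides to a downstream tool, the following minimal Python sketch parses the mandatory tab-separated fields of a SAM alignment line. It is not taken from HyLiTE; the names `SamRecord` and `parse_sam_line` are hypothetical, and only a subset of the eleven mandatory SAM fields is kept.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    """Subset of the mandatory SAM fields (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference (here: gene) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence

    @property
    def is_mapped(self) -> bool:
        # FLAG bit 0x4 marks an unmapped read in the SAM specification.
        return not (self.flag & 0x4)


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )
```

Because any mapper that writes these fields yields the same record structure, the choice of mapping software does not affect the steps that follow.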
Because HyLiTE analyzes each gene independently, the software has low RAM requirements and runtime is linear with the number of genes under study. This independence between genes also allows HyLiTE to be parallelized via optional executables (see project website for details; http://hylite.sourceforge.net). HyLiTE regularly autosaves the run state, and analyses can therefore be stopped and re-started from the last checkpoint. Extensive documentation about the algorithms implemented in HyLiTE, software validation and benchmarking against alternative pipelines is provided in Additional file 1.
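The consequences of per-gene independence can be sketched in a few lines of Python. This is an illustration only, not HyLiTE's code: `analyze_gene` is a hypothetical placeholder that simply counts reads, the checkpoint is a JSON file chosen for convenience, and a thread pool stands in for whatever parallel executables a real pipeline would use (a process pool would be the usual choice for CPU-bound work).

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def analyze_gene(reads):
    """Hypothetical per-gene analysis; here it only counts the reads."""
    return len(reads)


def run_with_checkpoint(reads_by_gene, checkpoint_path, workers=4):
    """Analyze genes independently, saving finished results to a checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        # Restart: reload results for genes finished in an earlier run.
        with open(checkpoint_path) as handle:
            done = json.load(handle)

    pending = [gene for gene in reads_by_gene if gene not in done]

    # Genes share no state, so they can be dispatched to workers in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(analyze_gene, (reads_by_gene[g] for g in pending))
        for gene, result in zip(pending, results):
            done[gene] = result
            # Autosave after every gene so an interrupted run loses little work.
            with open(checkpoint_path, "w") as handle:
                json.dump(done, handle)
    return done
```

Calling `run_with_checkpoint` a second time with the same checkpoint file skips genes that were already analyzed, which mirrors the stop-and-restart behavior described above.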
Results and discussion
The main output of HyLiTE comprises a list of read counts for each homeolog in each biological replicate. Using presence and absence of diagnostic parental SNPs, reads are classified as i) derived from a given parent, ii) consistent with two or more parents (i.e., lacking diagnostic SNPs), or iii) unknown (i.e., masked due to low read coverage). The last two classes are equally uninformative for determining homeolog expression, but can distinguish whether improvements may be possible with additional sequence data (the ‘unknown’ category) or whether part of the gene is simply uninformative for ancestry (no diagnostic parental SNPs identified). Finally, each read is marked with an additional flag if one or more new SNPs are detected within the hybrid or allopolyploid.
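The classification logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not HyLiTE's implementation: the function name `classify_read`, the dictionary-based data layout, and the extra "conflicting" label (for reads whose diagnostic SNPs point to different parents) are all hypothetical.

```python
def classify_read(read_bases, diagnostic_snps, masked_positions, parents):
    """Assign a read to a parent using diagnostic SNPs.

    read_bases:       {position: base} for the positions the read covers
    diagnostic_snps:  {position: {parent: allele}} at diagnostic sites
    masked_positions: set of positions masked for low coverage
    parents:          iterable of parent names
    """
    candidates = set(parents)
    touches_masked = False

    for pos, base in read_bases.items():
        if pos in masked_positions:
            touches_masked = True   # no call can be trusted at this position
            continue
        alleles = diagnostic_snps.get(pos)
        if alleles is None:
            continue                # not a diagnostic site
        matching = {p for p in parents if alleles.get(p) == base}
        if matching:                # base matches at least one parent
            candidates &= matching

    if len(candidates) == 1:
        return next(iter(candidates))
    if not candidates:
        return "conflicting"        # assumption of this sketch, see lead-in
    # More than one parent remains possible for this read.
    return "unknown" if touches_masked else "ambiguous"
```

Tallying the returned labels per gene and per replicate (for example with `collections.Counter`) would give a table of the kind described above. With three parents, a read carrying an allele shared by two of them stays "ambiguous", since it is consistent with two or more parents.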
A major point of difference between HyLiTE and alternative approaches (e.g., PolyCat) is its robust statistical assessment of SNP calls and automatic masking of ‘polymorphisms’ with low statistical support. Because high throughput sequencing technologies have a substantial error rate, sequencing errors can easily be confused with genuine polymorphisms in genes with low expression (and hence, low read coverage). The probability that a polymorphism at any given nucleotide position is a SNP rather than an error is given by a binomial distribution conditioned on the coverage level. Nucleotides whose coverage is too low for a polymorphism to reach statistical support are masked, but because coverage varies widely across even a single gene, typically only small, uninformative regions of any given gene are masked. This ‘dynamic masking’ substantially improves the accuracy with which reads are assigned to homeologs for genes with low to moderate expression. Detection of expression levels can be improved further by including genomic DNA reads, which make SNP calling more accurate (see Additional file 1 for details).
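A binomial test of this kind can be sketched in Python as below. The exact model used by HyLiTE is documented in Additional file 1; the version here is only an assumed, simplified form in which variant reads at a position are treated as independent sequencing errors occurring at a fixed per-base rate. The function names, the error rate and the significance level `alpha` are all illustrative choices, not values from the paper.

```python
from math import comb


def error_tail_probability(coverage, variant_reads, error_rate):
    """P(at least `variant_reads` of `coverage` reads differ by error alone)."""
    return sum(
        comb(coverage, i) * error_rate**i * (1 - error_rate)**(coverage - i)
        for i in range(variant_reads, coverage + 1)
    )


def is_supported_snp(coverage, variant_reads, error_rate, alpha):
    """Accept a polymorphism only if errors alone are an unlikely explanation."""
    return error_tail_probability(coverage, variant_reads, error_rate) < alpha


def min_coverage_for_call(error_rate, alpha):
    """Smallest coverage at which a site where every read carries the variant
    can reach significance (assumption: a fixed difference from the reference)."""
    if not (0 < error_rate < 1) or alpha <= 0:
        raise ValueError("need 0 < error_rate < 1 and alpha > 0")
    coverage = 1
    while error_tail_probability(coverage, coverage, error_rate) >= alpha:
        coverage += 1
    return coverage


def mask_low_coverage(coverage_per_position, min_coverage):
    """Return the set of positions (0-based indices) whose coverage is too low."""
    return {
        pos for pos, cov in enumerate(coverage_per_position)
        if cov < min_coverage
    }
```

Because the mask is computed per position rather than per gene, a gene with uneven coverage loses only its poorly covered stretches, which is the sense in which the masking is dynamic.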
Plants. To show application to a plant system, we also analyzed gene expression in a natural cotton allotetraploid, Gossypium hirsutum, together with diploid representatives of the A (G. arboreum) and D (G. raimondii) genomes (∼3% divergence). Assignment accuracy was tested by classifying known reads from the two diploid species. HyLiTE assigned reads to homeologs with a very low error rate (1.6%; see Additional file 1 for details). It also identified 46,206 new SNPs specific to G. hirsutum.
Animals. Finally, we analyzed gene expression in a synthetic allotetraploid fish derived from diploid goldfish (Carassius auratus) and diploid common carp (Cyprinus carpio) (∼6% divergence) (NCBI BioProject accession number: PRJNA82763). The very small number of reads available per gene (an average of only 15) caused HyLiTE to reject most SNP calls and therefore classify the majority of reads as parentally uninformative. However, the reads for which sufficient information was available to assign parental ancestry showed a very low error rate (0.22%).
The formation of a new species from the merger of two or more different parent species is important in the evolutionary history of many eukaryotic lineages. Hybrid and allopolyploid species carry multiple copies of each gene (homeologs), and while homeolog expression levels can be determined from high throughput RNA sequence data, assigning reads is extremely challenging. Here, we have developed HyLiTE to automate the process of moving from raw mRNA sequence files to tables of homeolog expression in a hybrid or allopolyploid and its parent species. This single-step analysis is specifically designed for ease-of-use, particularly for non-computational scientists. HyLiTE therefore allows gene expression patterns to be explored on a whole-genome scale even for species with very complex patterns of genome merger.
Availability and requirements
Project name: HyLiTE
Project home page: http://hylite.sourceforge.net
Operating systems: Linux, OS X, Windows
Programming language: Python
Other requirements: None
License: GNU GPL v. 3.0
Any restrictions to use by non academics: None
Research support was provided to MPC by the Royal Society of New Zealand via a Rutherford Fellowship (RDF-10-MAU-001) and by the BioProtection Research Center, a New Zealand Center of Research Excellence (CoRE), via a Principal Investigator award. These funding bodies played no role in study design; collection, analysis or interpretation of data; writing of the manuscript; or the decision to submit this manuscript for publication.
- Wendel JF. Genome evolution in polyploids. Plant Mol Biol. 2000; 42:225–249.
- Dehal P, Boore JL. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005; 3:1700–1708.
- Sémon M, Wolfe KH. Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis. Proc Natl Acad Sci USA. 2008; 105:8333–8338.
- Soltis PS, Soltis DE. The role of hybridization in plant speciation. Ann Rev Plant Biol. 2009; 60:561–588.
- Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, Soltis DE, Clifton SW, Schlarbaum SE, Schuster SC, Ma H, Leebens-Mack J, dePamphilis CW. Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473:97–100.
- Doyle JJ, Flagel LE, Paterson AH, Rapp RA, Soltis DE, Soltis PS, Wendel JF. Evolutionary genetics of genome merger and doubling in plants. Annu Rev Genet. 2008; 42:443–461.
- Sémon M, Wolfe KH. Consequences of genome duplication. Curr Opin Genet Dev. 2007; 17:505–512.
- Adams KL, Wendel JF. Novel patterns of gene expression in polyploid plants. Trends Genet. 2005; 21:539–543.
- Cox MP, Dong T, Shen G, Dalvi Y, Scott DB, Ganley ARD. An interspecific fungal hybrid reveals cross-kingdom rules for allopolyploid gene expression patterns. PLoS Genet. 2014; 10:e1004180.
- Yoo M-J, Szadkowski E, Wendel JF. Homoeolog expression bias and expression level dominance in allopolyploid cotton. Heredity. 2013; 110:171–180.
- Buggs RJA, Renny-Byfield S, Chester M, Jordon-Thaden IE, Viccini LF, Chamala S, Leitch AR, Schnable PS, Barbazuk WB, Soltis PS, Soltis DE. Next-generation sequencing and genome evolution in allopolyploids. Am J Bot. 2012; 99:372–382.
- Brenchley R, Spannagl M, Pfeifer M, Barker GLA, D’Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D, Kay S, Waite D, Trick M, Bancroft I, Gu Y, Huo N, Luo M-C, Sehgal S, Gill B, Kianian S, Anderson O, Kersey P, Dvorak J, McCombie WR, Hall A, Mayer KFX, Edwards KJ, Bevan MW, Hall N. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature. 2012; 491:705–710.
- Gaeta RT, Pires JC. Homoeologous recombination in allopolyploids: The polyploid ratchet. New Phytologist. 2010; 186:18–28.
- Page JT, Gingle AR, Udall JA. PolyCat: A resource for genome categorization of sequencing reads from allopolyploid organisms. G3. 2013; 3:517–525.
- Moon CD, Craven KD, Leuchtmann A, Clement SL, Schardl CL. Prevalence of interspecific hybrids amongst asexual fungal endophytes of grasses. Mol Ecol. 2004; 13:1455–1467.
- Schardl CL, Leuchtmann A, Tsai HF, Collett MA, Watt DM, Scott DB. Origin of a fungal symbiont of perennial ryegrass by interspecific hybridization of a mutualist with the ryegrass choke pathogen, Epichloë typhina. Genetics. 1994;1307–1317.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.