A composite genome approach to identify phylogenetically informative data from next-generation sequencing

Background Improvements in sequencing technology now allow easy acquisition of large datasets; however, analyzing these data for phylogenetics can be challenging. We have developed a novel method to rapidly obtain homologous genomic data for phylogenetics directly from next-generation sequencing reads without the use of a reference genome. This software, called SISRS, avoids the time consuming steps of de novo whole genome assembly, multiple genome alignment, and annotation. Results For simulations SISRS is able to identify large numbers of loci containing variable sites with phylogenetic signal. For genomic data from apes, SISRS identified thousands of variable sites, from which we produced an accurate phylogeny. Finally, we used SISRS to identify phylogenetic markers that we used to estimate the phylogeny of placental mammals. We recovered eight phylogenies that resolved the basal relationships among mammals using datasets with different levels of missing data. The three alternate resolutions of the basal relationships are consistent with the major hypotheses for the relationships among mammals, all of which have been supported previously by different molecular datasets. Conclusions SISRS has the potential to transform phylogenetic research. This method eliminates the need for expensive marker development in many studies by using whole genome shotgun sequence data directly. SISRS is open source and freely available at https://github.com/rachelss/SISRS/releases. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0632-y) contains supplementary material, which is available to authorized users.


Introduction
Until recently, phylogenetic studies relied on tens of loci (at most) from the genome to determine evolutionary relationships (e.g. Giribet et al., 2001;Harpke et al., 2013). However, these datasets often had insufficient information to provide strong support for all the relationships of interest (e.g. Stanley et al., 2011). This lack of resolution in phylogenies can be caused by many biological phenomena: (1) rapid radiations at the base of a clade producing few synapomorphies, (2) sites in the genome that have incurred multiple substitutions resulting in similarities between species that are not indicative of evolutionary history (i.e. homoplasy), or (3) regions of the genome with discordant histories (i.e. differential lineage sorting) (Patel et al., 2013;Rokas and Carroll, 2006;Kingman, 1982;Maddison, 1997;Philippe et al., 2011).
Recent changes in sequencing technology have enabled phylogenetic studies to use larger datasets in an attempt to resolve previously undetermined evolutionary relationships (Burleigh et al., 2011;Cohen and Chor, 2012;Crawford et al., 2012;Delsuc et al., 2005;Faircloth et al., 2012;McCormack et al., 2012McCormack et al., , 2013Yoder et al., 2013). Theoretically, larger datasets will contain enough data such that sufficient information 1 is available to resolve every part of the tree (Gee, 2003). However, more data do not necessarily provide better phylogenetic resolution (O'Neill et al., 2013;Philippe et al., 2011;Rokas and Carroll, 2006). Additional data can increase noise as much as signal, leading to minimal additional support for relationships of interest Rokas and Carroll, 2006). Furthermore, proper analysis of multi-gene datasets requires complex models of the substitution process, and separate models for individual loci; however, computational power is currently insufficient for such analyses, resulting in the application of overly simple substitution models to the entire dataset (Burleigh et al., 2011;Yoder et al., 2013;Heath et al., 2012;Roure and Philippe, 2011;Lanfear et al., 2012;Decker et al., 2009). This analytical process decreases the accuracy of the estimated phylogeny.
The key to identifying phylogenetically informative data lies in finding loci that are variable among species, but have signal that is not overwhelmed by homoplasy . There are currently two major approaches to identify such data. First, existing datasets are screened for variation in the taxa of interest (e.g. O'Neill et al., 2013;Senn et al., 2013;Steele et al., 2008). Second, regions that are conserved across taxa are identified from whole-genome alignments; both the conserved elements and regions adjacent to them may contain phylogenetic information (e.g. Crawford et al., 2012;Faircloth et al., 2012;McCormack et al., 2012;Lemmon et al., 2012). However, the drawback of these approaches is that new phylogenetic markers must be developed for each research study. The data developed for one study is not necessarily useful for additional studies, requiring repeated efforts to obtain information at different taxonomic levels.
Sequencing whole genomes has been suggested as a way to circumvent this discovery process. However, the approaches taken to-date have required aligning data to a reference genome, or aligning whole genomes, to extract informative loci. Thus, such methods have primarily focused on closely related species for which alignments are possible (e.g. Yoder et al., 2013;Fan et al., 2013).
Here, we demonstrate a new method to rapidly provide informative markers for phylogenetic studies directly from shotgun sequencing of whole genomes. We call this approach SISRS (pronounced "scissors"), for Site Identification from Short Read Sequences. SISRS requires neither a reference genome nor a priori knowledge of potentially informative loci. This approach circumvents the difficulties in identifying homologous sites from whole-genome alignments when rearrangement events have occurred because the conserved regions are not required to share identifiable synteny across taxa. SISRS also takes advantage of the raw data to (1) avoid erroneous genotype calling and (2) identify sites that are fixed within taxa.
We demonstrate that SISRS provides high quality phylogenetic datasets across a range of simulated and empirical data. The data output by SISRS for simulated shotgun reads was congruent with the starting phylogeny at all depths in the tree. These data allowed us to estimate the true phylogeny even with large evolutionary distances among taxa and very small genomes. Second, using previously sequenced shotgun data for seven primate taxa, we were able to estimate the known phylogeny rapidly and accurately. We provide a software implementation of SISRS so that phylogenies can be produced from next-generation sequencing reads in a matter of days.

Overview
First, SISRS assembles a "composite genome" from shotgun sequencing reads for all taxa. Second, the composite genome is used as a reference to align the sequencing data for each sample. Third, the nucleotide for each tip taxon for each position is identified using a strict consensus (i.e. sites that are variable are called as unknown). Finally, SISRS selects variable sites that have a limited amount of missing information. In this way, it identifies sites across the entire genome that are phylogenetically informative, while reducing noise. This pipeline is outlined succinctly in Figure 1.

Composite reference genome
The composite genome contains composite copies of sequences that are conserved across some or all taxa, as well as copies of sequences that are unique to a single taxon. This reference provides a set of sequences against which all data from all taxa will be mapped. The conserved regions are most likely to provide useful phylogenetic markers; the composite nature of these regions (i.e. containing a mixture of variation from across species) makes it more likely that reads from different taxa will align to them compared to a typical approach of using a single-species reference genome. The use of a composite genome as a reference rather than the genome of one species also reduces the amount of data required for each taxon. The current software implementation of SISRS uses Velvet (Zerbino and Birney, 2008) to construct the composite genome. Velvet uses de Bruijn graphs, which allow it to output single copies of shared and unique regions of the composite genome regardless of gene order.
The large amount of data required to determine genotypes accurately is not necessary to determine highly conserved genomic regions. Thus, we subsampled the data to reduce the memory required for the composite genome assembly and the number of unique regions included. The total number of paired reads sampled for the composite assembly is one-twentieth of the user-specified genome size, and an equal number of reads are sampled from each taxon. Assuming 200bp per read-pair, this approach results in an average of 10x coverage for conserved regions. In the software implementation of SISRS, reads for each species are sampled with a custom python script using a reservoir sampling approach in order to use the minimum amount of memory, while randomly sampling reads (Vitter, 1985).

Site calling for each position in the reference
SISRS does not call the genotype of any site that was variable within a taxon due to the reduced information in such sites. Variable sites are evolving more rapidly and are more likely to provide noise rather than information. Because DNA has limited possible variation, a site that is variable both within and among taxa is more likely to exhibit homoplasy. Variation cannot be distinguished from some forms of error when using multiple samples from a single taxon, and genotypes can also be identified 3 incorrectly due to sequencing error or copy number variable regions aligning together. Avoiding paralogous data can significantly reduce non-phylogenetic signal . Thus, removing all variable sites is an overly conservative but necessary approach to maximizing phylogenetic information.
The current software implementation of SISRS implements the mapping process using the software Bowtie 2 (Langmead and Salzberg, 2012) with parameters set to allow some mismatch in alignment and to accept alignments where only part of the read mapped to the reference (i.e. local alignment). Finally, we limited our dataset to sites that (1) have information for most or all taxa (as specified by the user), and (2) are variable among taxa. Sites with too much missing data provide limited information, while sites that are invariable are not phylogenetically informative.

Simulations to test methodology
To determine how well our approach identified phylogenetically informative sites, we simulated 168 datasets of next-generation sequencing reads with different levels of sequencing coverage on multiple phylogenies. We used four phylogenies to simulate genomes: one two-taxon tree and three eight-taxon trees. The eight-taxon trees were pectinate with (1) equal internode branch lengths, (2) increasing internode branch lengths from root to tip, and (3) decreasing internode branch lengths from root to tip ( Figure 2). We produced a total of 24 trees with increasing levels of divergence among species by multiplying each branch length by values from 0.01 to 0.06. We simulated genomes of one million nucleotides on each of these trees using the Jukes-Cantor model with Dawg 2.0 (Cartwright, 2005;Jukes and Cantor, 1969). 99 bp paired-end reads (the maximum allowed by the simulation software) were simulated using the software ART (version BananaPancakes-04-02-2013) with an error model that was derived from an empirical dataset produced using an Illumina HiSeq 2000 (Huang et al., 2012). Each simulation had either 1, 2, 4, 8, 10, 20, and 50x coverage. For each data set we recorded (1) the total number of variable sites simulated, (2) the total number of variable sites output by SISRS, (3) the number of these sites that could be mapped, and (4) the number of mapped variable sites that were true variants.
We expected some variable sites not to be recovered by SISRS. We counted the number of these false negatives due to (1) insufficient data in the simulations (defined as fewer than three reads for that site), (2) lack of coverage by the composite genome (for sites with sufficient data), (3) additional insufficient data due to lack of mapped reads, and (4) additional sites that were filtered out due to variation. Additionally we counted the number of potential and actual false positive variable sites. Potential false positives were due to sites that would be called as variable (due to error) based on the simulated reads. Actual false positives were (1) the subset of these sites that were identified as variable by SISRS, or (2) additional sites due to erroneous read mapping.
To determine the value of the data output by SISRS we counted the number (and percent) of sites that were concordant with the true tree as a function of branch scale, coverage, depth, and tree topology.

Empirical Data
We further tested our approach using data from apes, including human (Homo sapiens), chimpanzee (Pan troglodytes), bonobo (P. paniscus), gorilla (Gorilla gorilla and G. beringei ), and orangutan (Pongo pygmaeus and P. abelii ). The crab-eating macaque (Macaca fascicularis) and rhesus macaque (M. mulatta) were used to root the tree. These primates were chosen to test the efficiency and effectiveness of this method on empirical data due to their well-established phylogeny (Perelman et al., 2011).
Raw Illumina paired-end sequence data was obtained from the European Nucleotide Archive (Suppl. Table 1). To limit the size of the dataset being analyzed, we aligned the data to the human genome (build 37) using Bowtie 2 as in SISRS. We extracted only the data that aligned to human chromosome 21. These reads were then placed in FASTQ files as new paired-end datasets, as would be generated directly from a sequencing run. Potentially informative sites were obtained using SISRS as for simulations. The genome size specified for the composite genome subsampling procedure was 48 million, approximately the size of human chromosome 21. For comparison, we conducted an identical analysis using human chromosome 21, rather than the composite genome, as the reference.
We treated the data output by SISRS as a single concatenated locus (Yoder et al., 2013), and analyzed the data in a maximum likelihood (ML) framework with 1000 bootstraps implemented in RAxML-HPC2 version 7.3.2 on CIPRES (Stamatakis, 2006;Miller et al., 2010). A GTRCAT model was used for the bootstrapping phase, and GTRGAMMA for the final tree inference. The final tree was rooted using macaques as an outgroup.

Ape tree is recovered
We identified 365,405 variable sites that contained observations from every species (i.e. no missing data). The analysis was conducted using 40 cores on a FreeBSD 10.0 server, and the total time to produce this dataset from the raw reads was 56 hours. The maximum amount of memory required during the composite genome assembly was 300Gb. An ML estimate of the phylogeny with 1000 bootstraps was fully concordant with the known phylogeny of apes with all nodes supported at 100% (Figure 6). Of the variable sites, 360,091 were biallelic, and of those, 351,378 (96%) were concordant with the tree. The maximum number of sites supporting an alternate relationship was 1646 (0.5%); these sites supported human and gorilla as sister taxa, which is consistent with the similarity of the human and gorilla genomes (Scally et al., 2012) (Suppl. Table 2).
For comparison, we identified 651,938 variable sites using human chromosome 21 as a reference. By aligning the composite genome to the human reference genome, we determined that 266,898 sites (73% of the SISRS data) were found in both datasets. The phylogeny estimated from these data was similar, although with less support for the relationship between the bonobo / chimp / human lineage and gorillas (Suppl. Figure 2).

Discussion
The approach introduced here has the potential to transform phylogenetic research. SISRS eliminates the need for marker development in many studies by using whole genome shotgun sequence data directly. Alternatively, SISRS can be used to identify variable sites that can be used as markers across large numbers of taxa. SISRS also promotes the reuse of data. Shotgun genomic sequences available in public databases can be used directly for phylogenetic analyses, as we have done in this study. Sequencing performed with the goal of identifying phylogenetic data using SISRS can be made available for subsequent use in other studies, including phylogenetics at any taxonomic level, or any other study utilizing genomic data.
By using shotgun sequence data, SISRS has two distinct advantages over previously developed methods in phylogenomics. First, the raw data can be grouped into any taxonomic level of interest. SISRS will only output data that are fixed within a taxon, regardless of whether that taxon is a species or phylum. This process removes data that are evolving too rapidly to be useful for determining relationships at the level of interest. Any data for which there are substitutions within a taxon's history are likely evolving too rapidly to be of use in estimating deeper relationships. Low-coverage data for multiple species within a clade can also be combined to identify markers that provide information at the taxonomic level required. The necessity of using loci informative at appropriate time scales has been advocated previously (Townsend, 2007;Townsend et al., 2008;Townsend and López-Giráldez, 2010;Townsend and Leuenberger, 2011). However, previous work required selecting loci a priori rather than on the basis of variation within and among taxa as with SISRS.
Second, error in next-generation sequence data does not affect subsequent analyses. In contrast, the sequence of a locus determined from next-generation sequencing data can contain error due to low coverage and unequal numbers of each allele. If this error 6 results in mistaking a heterozygote for a homozygote, a site will be included in the analysis that is potentially homoplasious and misleading. Our approach removes these sites.
Studies using SISRS are straightforward, quick, and inexpensive. Reusing available next-generation sequencing reads can substantially reduce costs. For studies assessing deep phylogenetic relationships, costs can be minimized by sequencing at low coverage for multiple species because it is not necessary to accurately determine the sequence for each species at each site.

Data produced by SISRS
For simulations, SISRS was able to recover large numbers of variable sites, unless branch lengths were unreasonably long or coverage was low. Based on these results, coverage should average 5-10x for optimal marker identification; however, low coverage sequencing will also identify useful data. Most genomes are much larger than the one million bases in our simulations; far more sites will be identified for larger genomes, making the use of low coverage data feasible. Although site identification was challenging for trees with long branches (i.e. large evolutionary distances per number of substitutions per site between taxa), this result is less problematic for empirical data. In real genomes, unlike our simulations, sites evolve at different rates; thus, there will always be some sites for which the branch length (in substitutions per site) between taxa is very small. These sites will be identified by SISRS.
As expected, deeper nodes were more difficult to recover, likely because the synapomorphies between the two clades may be overwritten by new substitutions. However, the sites that were identified are informative about these relationships. Overall, the concordance of the data with the simulation trees demonstrates that the SISRS approach produces extensive phylogenetically informative data for deep and shallow evolutionary time scales.

Advantages of a composite genome
Generating a composite genome has multiple advantages over aligning data to a reference genome to identify potential phylogenetically informative sites. First, an assembled reference genome similar to the taxa of interest is not always available. Second, assembling a reference genome requires high coverage data from at least one species and is time consuming; assembling the whole genome is necessary because it is impossible to determine a priori regions of the genome that may be phylogenetically informative. In contrast, SISRS does not require high levels of coverage or a time consuming assembly. When taxa are highly diverged, data may align poorly to a single reference genome. In contrast, a composite genome contains data from all taxa. Given low coverage for each taxon, unique regions will be limited in the final assembly, while maintaining an optimal assembly for conserved regions. Within each conserved region, the composite genome contains sites with the most common base, making it more likely that data from all taxa will align to this region, resulting in the maximum data for each species 7 for potentially informative sites.

Future Directions
The software implementation of SISRS is under active development. Future versions will output more variable sites more rapidly as a result of improved assembly of the composite genome and improved genotype calling.

Acknowledgments
This work was supported by ASU startup funds to RAC, a National Science Foundation Doctoral Dissertation Improvement Grant [grant number BCS-1232582 to KH and ACS], and a National Institutes of Health Grant [grant number R01-GM101352-01A1 to R. Zufall, R. Azevedo, and R. Cartwright]. Some data were provided by the Wellcome Trust Sanger Institute prior to publication; they and can be obtained from the European Nucleotide Archive. Members of the Cartwright lab provided feedback on this project and manuscript. : SISRS produced substantial amounts of informative data for phylogenies of different shapes and evolutionary distances. The number of true variable sites identified from simulated data is shown for each of four trees (symbols) (shown in Figure 2) for increasing numbers of substitutions between species (panels), and increasing levels of coverage (x axis). Symbols are for two taxon trees; • for equal branch length trees; for trees with short deep branches; for trees with long deep branches. : SISRS identified fewer informative sites for deeper nodes in the tree; however, in most cases the number of sites was sufficient to resolve the tree. Results are separated by branch length (panels), coverage (x axis), and tree depth (symbols). The number of sites supporting the node A+B are denoted as •; denotes sites supporting A+B+C; denotes A+B+C+D; denotes A+B+C+D+E; * denotes A+B+C+D+E+F; • denotes A+B+C+D+E+F+G. Only results for the equal-branch-length tree are shown; results for the three eight-taxon trees were similar (Suppl. Fig. 1). Fewer sites were recovered for the tree with short deep branches, compared to the equal-branch-length tree, while more sites were recovered for the tree with long branches.
Variable Sites Found Insufficient Aligned Data Not Covered by Composite Genome Insufficient Simulated Data 1 2 4 8 10 20 50 1 2 4 8 10 20 50 1 2 4 8 10 20 50 1 2 4 8 10 20 50 1 2 4 8 10 20 50 1 2 4 8 10 20 Figure 1: SISRS identified fewer informative sites for deeper nodes in the tree; however, in most cases the number of sites was sufficient to resolve the tree. Results are separated by branch length (panels), coverage (x axis), and tree depth (symbols). The number of sites supporting the node A+B are denoted as •; denotes sites supporting A+B+C; denotes A+B+C+D; denotes A+B+C+D+E; * denotes A+B+C+D+E+F; • denotes A+B+C+D+E+F+G. Top: results for the equal-branch-length tree. Middle: results for the tree with short deep branches. Bottom: results for the tree with long deep branches.  Table 1: Accession numbers for ape data downloaded from the European Nucleotide Archive. *Human data are from the 1000 genomes project (1000Genomes Project Consortium, 2010