The aTRAM package is downloadable from GitHub, (www.github.com/juliema/aTRAM) and is written as a Perl package that links together widely-cited programs in a novel way. These programs include BLAST [9]; two alternate de novo assemblers, Velvet [10] and Trinity [11]; and two multiple sequence aligners, MUSCLE [12] and MAFFT [13], aTRAM was designed so that new assembly and alignment software can be added as they become available. aTRAM has two components. The first component constructs an aTRAM formatted BLAST database from the original paired-end FASTQ or FASTA file and is performed once per sample. The second component is the search for a locus of interest, using an iterative approach aTRAM queries all or a fraction of the constructed short read database for the locus of interest and performs a de novo assembly. The package also includes post-processing scripts for validation of the results and pipeline scripts that automate multiple gene alignments across a number of short-read datasets.
Database creation
The aTRAM software creates a database from a paired-end FASTA or FASTQ sequence file using a MapReduce strategy [14]. Because sequence names are unrelated to the genomic content of the reads, the MapReduce strategy speeds up subsequent searches by a hashing function to distribute the reads across many partitions, or shards. The sizes of the shards are approximately equal and each should contain a random sample of reads from the original run. Searches can now be done more efficiently across a smaller subset of the whole dataset. For each shard, a BLAST database is constructed, corresponding to one end of each paired-end read in the shard. The mate of each read is placed in an easily searchable file (Figure 1A). This sharding process allows aTRAM to be parallelizable, because each shard can be searched independently on its own process. Furthermore, because each shard contains a random sample of the full short read dataset, any number of shards can be searched in an aTRAM run, allowing the user to vary the coverage depth used in the assembly and reducing computational time if genomic coverage is high.
Gene assembly
The aTRAM pipeline iteratively searches the formatted database to assemble the gene of interest. Each shard is searched independently and the results are combined for de novo assembly. The target sequence is provided as either a DNA or amino acid FASTA file. For the first iteration, aTRAM uses BLAST to search each shard for reads that are similar to the target sequence. The top hits and their mates are retrieved, combined across all shards, and used as input to a de novo assembler (Velvet or Trinity). The possibility exists for other de novo assemblers to be written into aTRAM as plugins in the future [15-19]. The resulting contigs are then compared to the original target sequence using BLAST, and the most similar ones used as target sequences in the next iteration (Figure 1B). Because the subsequent iterations are using target sequences that were assembled directly from the short read database, further iterations will involve short reads that are not just similar but identical to the contig being assembled. The program stops when the total number of iterations determined by the user have been completed, or if the resulting contigs from any iteration matches exactly the contigs from the previous iteration. Alternatively, an autocomplete flag can be set to end the search if one of the contigs has sequence matching both the beginning and the end of the query sequence, suggesting that the contig includes the entire target. As mentioned, aTRAM can be adjusted to use a fraction of the available shards: the fraction should be calculated based on expected coverage of the target locus in the sequencing run, for example, a short read dataset that contains 5x coverage of the nuclear genome may contain 200x coverage of the ribosomal DNA and 800x coverage of the chloroplast genome [20]. Although 20-50x coverage may be optimal for many genes, it has been suggested that this method can work with coverage as low as 5x [8].
Pipelines
The aTRAM package also includes ready-made pipelines for running aTRAM on multiple samples for many target sequences. AssemblyPipeline runs aTRAM on multiple target sequences for multiple samples; it is ideal for quickly producing a list of putatively orthologous genes from different species. AlignmentPipeline produces a set of aligned homologous sequences for a set of genes and a set of samples, allowing straightforward production of multiple gene alignments for gene tree analyses.
Performance
aTRAM Compared to other methods
To compare the performance of aTRAM to genome assembly based methods and verify similar results a dataset of 1,534 single copy orthologs from Pediculus schaeffi, the chimpanzee louse, was chosen. These genes were first assembled using a reference-based approach against the body louse genome P. humanus in Johnson et al. [21]. Their study used one lane of an Illumina HiSeq2000 run, which resulted in 36 GB of data and over ~100X coverage of the genome (NCBI; SAMN02438447). The authors used CLC Genomics Workbench (CLCbio) to map paired-end reads to the reference genome and verified orthology using a reciprocal best-BLAST test (http://dx.doi.org/10.5061/dryad.9fk1s). The same set of genes was assembled using aTRAM and a completely de novo approach to compare the three sequence retrieval methods.
An aTRAM database was created from the same P. schaeffi paired-end library from Johnson et al. [21] taking a total of 2.37 hours to format the 36 GB library. The 1,534 reference P. humanus proteins were used as the target sequences for aTRAM assembly. Because the expected coverage of the genome for the complete Illumina run was over 100x, aTRAM was run using only 25% of the available shards, providing an estimated genomic coverage of ~25x. The program was set to run for five iterations using the autocomplete option. This run was performed on the Institute for Genomic Biology's Biocluster at the University of Illinois, which uses two Intel Xeon E5530 2.4GHz quad-core processors per node with 24 GB RAM per node. Both aTRAM steps were run on one node with four processors.
Finally, the same P. schaeffi paired-end library was used to create a completely de novo assembly. The raw reads were trimmed for nucleotide bias at the 5′ end and for low-quality bases at the 3′ end using the FASTX toolkit and error-corrected using Quake [22] with c = 2.83 for 19-mers. Paired-end reads were assembled in SOAPdenovo v1.05 [19] using K = 49, which is roughly half of the read lengths, and the optional GapCloser v1.10 algorithm with a minimum overlap = 31. Finally, the 1,534 genes were identified by creating a BLAST-formatted database of the de novo assembled contigs and using the P. humanus transcripts as targets for a BLAST search. The top hits were selected as the de novo contigs.
The aTRAM contigs, top hits from the de novo assembly, and the reference-based assembly sequences were each aligned against the original Pediculus humanus reference DNA sequences using MAFFT [13] with the included post-processing PercentCoverage script. Uncorrected p-distances (proportion of sites with differing nucleotides, not corrected with a model of molecular evolution) were calculated using a custom Perl script used originally in Johnson et al. [21] (Available on Github: juliema/publications/). Orthology was verified for the aTRAM and de novo contigs with the same reciprocal best-BLAST test that was previously used for the reference-based assemblies in Johnson et al. [21]. Because each method used BLAST for assembly the resultant contigs were then reciprocally compared to the entire Pediculus humanus protein coding genome, and if the original query sequence was the top hit, the assembled gene was considered to be orthologous to the query gene. Finally, the outputs from aTRAM, the reference-based assembly, and de novo assembly were aligned to each other and uncorrected p-distances calculated to determine if the three methods produced the same sequence for each gene.
aTRAM and Divergent Taxa
Samples of six species of lice were sequenced on an Illumina sequencer combining two species in a lane (NCBI: SAMN03360966 – SAMN03360971). Four species were sucking lice from the suborder Anoplura and thought to range from 25 – 75 million years divergent from the reference sequence P. humanus [23]. The other two species were chewing lice from the suborder Ischnocera and thought to be ~ 110 million years divergent from the reference species [24]. Johnson et al. [8] had previously identified a set of 1,107 genes as single copy orthologs protein coding genes across nine insect genomes, including lice, using OrthoDB [25]. The amino acid sequences from P. humanus for these 1,107 genes were used as query sequences in aTRAM for each of the six louse species. Each aTRAM contig was compared to the entire P. humanus protein-coding genome using the reciprocal best-BLAST test for orthology. The orthologous contigs were then aligned back to the P. humanus genome and uncorrected p-distances were calculated. To determine if a DNA query would also assemble genes across the divergent datasets, we ran 10 genes using the DNA from the reference P. humanus, and only those from the congener Pediculus schaeffi assembled. This suggests that aTRAM can be limited by the success of the initial BLAST search (Additional file 1), and as taxa become more divergent, amino acid sequences are the more optimal target sequence.