ARYANA: Aligning Reads by Yet Another Approach
- Milad Gholami†1,
- Aryan Arbabi†2,
- Ali Sharifi-Zarchi3, 4,
- Hamidreza Chitsaz5 and
- Mehdi Sadeghi6
© Gholami et al.; licensee BioMed Central Ltd. 2014
Published: 10 September 2014
Although there are many different algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. Strong interest in fast alignment is best reflected in the $1,000,000 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of next-generation sequencing long reads requires fast overlap-layout-consensus algorithms, which depend on fast and accurate alignment.
We introduce ARYANA, a fast gapped read aligner built on the BWA indexing infrastructure with a completely new alignment engine that makes it significantly faster than three other aligners (Bowtie2, BWA and SeqAlto) with comparable generality and accuracy. Instead of time-consuming backtracking procedures for handling mismatches, ARYANA uses the seed-and-extend algorithmic framework and gains efficiency by integrating novel algorithmic techniques, including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As the read length increases, ARYANA's advantage in speed and alignment rate becomes more evident. This matches the trend towards longer reads as sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using the ARYANA engine.
ARYANA with complete source code can be obtained from http://github.com/aryana-aligner
Every living cell carries a book of life consisting of several thousand to billions of characters with answers to many vital questions. Human efforts to decipher that book have gained increasing momentum since 1953, when the double helical structure of DNA was discovered. Twenty years later, W. Gilbert and A. Maxam read the first 24-character word of the book [1], while F. Sanger and his colleagues were developing another sequencing method based on the application of labeled dideoxynucleotide triphosphates that act as chain-terminators in a PCR reaction [2, 3].
About three decades after the first DNA sequencing, the dream of reading the human book of life was realized by completion of the human genome project [4–6]. The International Human Genome Sequencing Consortium used a laborious hierarchical process to divide the genome into smaller covering tiles, while the Celera Genomics firm replaced that with computational sequence-assembly software applied to data generated from the blindly shredded (shotgun) whole genome [7, 8]. The automated Sanger method was the gold standard for about two decades, as the first generation of DNA sequencing, until increasing demand for fast and inexpensive methods to produce high volumes of error-free genomic information led to the emergence of new technologies, the so-called Next-Generation Sequencing (NGS).
A paradigm shift in both the experimental techniques and the computational methods occurred due to the transition to NGS technologies and the availability of finished reference genomes, such as the human genome, for more than 2000 prokaryotes, eukaryotes and Archaea. Long, accurate, expensive Sanger mate-paired reads (~400 to 750 bp), which were mostly used for de novo sequencing and assembly, are now replaced by several-fold more (ultra-)short, erroneous, but inexpensive NGS reads. There is significant ongoing effort on the de novo assembly of NGS data in combination with additional information such as long reads and optical maps in order to uncover the whole genomes of different organisms. However, the vast majority of NGS data generated today in transcriptomics, epigenomics, and variation studies belong to organisms with identified whole genomes, and are mapped to the existing reference genomes using short or long read aligners. The emergence of the 1000 Genomes Project, which catalogues human genome variants through population resequencing, is good representative evidence of this paradigm shift.
In the new paradigm, aligning reads to a reference sequence lies at the core of numerous applications, including detection and annotation of single nucleotide polymorphisms (SNPs) [14–17], structural and copy number variations (CNVs) [18, 19], detection and alignment of transcript variants and splicing [20–22], and browsing and visualization [23–26]. A wide range of software is available to process NGS data, from lightweight tools working on a small desktop to more sophisticated resources designed for clouds [27–29].
Although there are many different algorithms and software tools for aligning NGS reads [30–41], of which BWA [42, 43] and Bowtie [44, 45] have been used extensively in many studies, mainly due to their low memory footprint and fast and highly accurate results, fast gapped sequence search is still far from solved. Good evidence of this is the $1,000,000 prize of the Innocentive competition entitled "Identify Organisms from a Stream of DNA Sequences" on aligning a collection of NGS reads, generated by diverse platforms including Illumina, Roche 454, Ion Torrent, and Pacific Biosciences, to a given database of reference genomes.
Here we introduce our seed-and-extend aligner called ARYANA, which is a fast and general-purpose solution with on-par accuracy and small memory usage. We compare ARYANA with other aligners: Bowtie2, BWA-SW, and SeqAlto. ARYANA is multiple times faster than all of these aligners with comparable generality and accuracy. This advantage in performance grows as the read length increases, which matches the fact that read lengths are increasing as NGS technologies evolve.
Every read is individually aligned by ARYANA, which enables using it in distributed computing frameworks by partitioning the input read data set, in addition to the multithreaded parallel infrastructure embedded in ARYANA that permits complete CPU usage when running on a multi-core machine.
Alignment of a single read consists of two main phases:
In the first phase of the algorithm, ARYANA extracts a set of seeds from the read sequence that satisfies certain conditions. These conditions and the approach for extracting the seeds are explained in the sections "searching for the exact matches of a seed" and "seed extraction". For each exact match of these seeds in the reference genome, ARYANA grants a score to a corresponding genomic region. The genomic regions are represented by partitions of the reference genome called tags. The scores provide a preliminary criterion for ranking the tags based on the similarity of their associated genomic regions to the read. Details of how the tags are defined and handled, and of the scoring system, are explained in the sections "tags and scoring" and "accessing and updating tag information".
In the second phase, we focus on the tags that received the highest scores during the first phase and consider them as candidates for the final alignment. The read is aligned more precisely to each of these candidate regions by a differential-position dynamic programming algorithm to find the region with the best alignment. More details of the second phase are available in the section "precise alignment to the candidate segments".
Searching for the exact matches of a seed
ARYANA uses the Burrows-Wheeler Aligner (BWA) implementation of the Burrows-Wheeler transform (BWT) and Ferragina-Manzini index (FM-index) [43, 47] to search for exact matches of a seed. To ensure the reverse DNA strand is also being considered, the reverse complement of the reference genome is attached to the end of the forward genome, and index tables are constructed for the double sized reference.
We define two search procedures that work by using this data structure:
forward exact search: The search is performed in several iterations, starting from the rightmost letter of the seed and extending the suffix one letter per iteration to the left. At each iteration we have access (with O(1) time complexity) to the list of the exact matches of the current suffix.
backward exact search: Since the index tables are built by concatenating the reference genome with its reverse complement, we can search in the opposite direction, from the left of the seed to the right, by performing a forward exact search on the reverse complement of the seed. Although the matches found by this search are reverse complements of the original seed, we can still find out how far we can continue matching and extending the prefix of the original seed (which corresponds to the suffix of the reverse complement of the seed).
By using this data structure we can find the list of BWT indices for all matches of a suffix in O(k), where k is the length of the suffix. We should note that finding the genomic positions from BWT indices can be done relatively fast. Furthermore, our experiments showed that ARYANA consumes about 5.1 GB of memory when aligning reads to a human genome, an amount that even today's typical personal computers can provide.
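The two searches can be illustrated with a toy FM-index. The following is a minimal sketch for illustration only and is not the BWA implementation that ARYANA relies on: the names `build_fm_index`, `suffix_intervals` and `exact_positions` are hypothetical, the index is built naively (quadratic memory, suitable only for tiny inputs), and matches that span the junction between the forward genome and its reverse complement are not filtered out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_fm_index(text):
    """Naive FM-index of text + '$' (toy inputs only)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # suffix array
    bwt = [text[i - 1] for i in sa]                         # text[-1] is '$'
    alphabet = sorted(set(text))
    # C[c]: number of characters in text that are smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return sa, C, occ

def suffix_intervals(index, seed):
    """Extend the suffix of seed one letter to the left per step.

    Yields (suffix_length, lo, hi); sa[lo:hi] lists the matches of that suffix.
    Stops as soon as the current suffix no longer occurs in the text.
    """
    sa, C, occ = index
    lo, hi = 0, len(sa)
    for length, c in enumerate(reversed(seed), start=1):
        if c not in C:
            return
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return
        yield length, lo, hi

def exact_positions(index, seed):
    """Text positions of all exact matches of the whole seed."""
    sa = index[0]
    last = None
    for last in suffix_intervals(index, seed):
        pass
    if last is None or last[0] < len(seed):
        return []
    return sorted(sa[last[1]:last[2]])

genome = "GATTACA"
reference = genome + reverse_complement(genome)   # double-sized reference
index = build_fm_index(reference)
```

Searching the reverse complement of a seed in the same index is what makes extension in the opposite direction possible: in this toy example, `exact_positions(index, "TTAC")` reports a hit in the forward half, and `exact_positions(index, reverse_complement("TTAC"))` reports the mirrored hit in the reverse-complement half.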
Which seeds to extract?
- Each seed has at least k base pairs.
- No two seeds overlap by more than k base pairs.
- Each seed has at least one exact match in the reference genome.
- The seeds are maximal, i.e. if any seed is extended, the set no longer remains valid.
The value of k is decided dynamically by ARYANA, being 16 for reads shorter than 50 bp and increased for longer reads.
There are three main reasons for having these conditions. Firstly, we force the seeds to have some minimum size and to be maximal in order to avoid seeds that have too many matches in the reference. Such seeds generally do not help to distinguish the correct region from its rivals. Secondly, a seed may fail to match the correct region due to some error or variant, yet match another region. In this case we do not want to lose all other seeds that overlap with this seed. This is why we allow overlaps of less than k base pairs. Thirdly, limiting the size of the overlaps reduces the total number of seeds and their lengths, which improves the speed.
How to extract these seeds?
Algorithm 1 gives pseudocode for the above procedure. MATCH LEFT TO RIGHT(seq, s, max) is based on the backward exact search introduced in the section "searching for the exact matches of a seed". It matches at most max letters of seq to the reference, starting at s and moving from left to right, and returns the length of the matched string. MATCH RIGHT TO LEFT(seq, s, max), which is based on the forward exact search introduced in the same section, does the same except that it moves from right to left. It returns the BWT indices for the beginning and end of the matched region and the length of the matched string. BWTPOSITION(index) returns the reference position of index, where index is a BWT index. GRANT SCORE(pos, s) adds s points to the score of the tag associated with the position pos. The scoring system and the tags are explained in the section "tags and scoring".
Algorithm 1 extracting seeds
function MAXIMALLY SEED(seq, k)
    right ← LENGTH(seq)
    while right ≥ k do
        matched ← MATCH LEFT TO RIGHT(seq, right − k + 1, k)
        if matched < k then
            right ← right − k + matched
        else
            begin, end, matched ← MATCH RIGHT TO LEFT(seq, right, INF)
            for index from begin to end do
                pos ← BWTPOSITION(index)
                GRANT SCORE(pos − (right − matched + 1), matched)
            right ← right − matched + k − 1
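To make the control flow of Algorithm 1 concrete, here is a small Python sketch under simplifying assumptions. It replaces the BWT-based matching with plain substring tests on the reference string (far slower, but with the same results on toy data), uses 0-based positions with an exclusive right end, and collects the seeds in a list instead of granting scores. The names `match_left_to_right`, `match_right_to_left` and `extract_seeds` are hypothetical, and a window that fails to match is assumed to be skipped without producing a seed.

```python
def match_left_to_right(ref, seq, start, max_len):
    """Length of the longest prefix of seq[start:start + max_len] found in ref."""
    length = 0
    while (length < max_len and start + length < len(seq)
           and seq[start:start + length + 1] in ref):
        length += 1
    return length

def match_right_to_left(ref, seq, end):
    """Longest suffix of seq[:end] found in ref: (reference positions, length)."""
    length = 0
    while length < end and seq[end - length - 1:end] in ref:
        length += 1
    if length == 0:
        return [], 0
    sub = seq[end - length:end]
    positions = [i for i in range(len(ref) - length + 1) if ref.startswith(sub, i)]
    return positions, length

def extract_seeds(ref, seq, k):
    """Return seeds as (start in read, length, reference positions)."""
    seeds = []
    right = len(seq)                      # exclusive right end of the window
    while right >= k:
        matched = match_left_to_right(ref, seq, right - k, k)
        if matched < k:
            # The window breaks after `matched` letters, so no seed of
            # length >= k can end to the right of that point.
            right = right - k + matched
        else:
            positions, matched = match_right_to_left(ref, seq, right)
            seeds.append((right - matched, matched, positions))
            # The next window overlaps the seed just found by k - 1 letters.
            right = right - matched + k - 1
    return seeds

ref = "GATTACAGGCTTAACC"
read = "TTACATGCTTAA"                     # ref[2:14] with one mismatch (G -> T)
seeds = extract_seeds(ref, read, 4)
```

In this example the mismatch splits the read into two seeds, and subtracting each seed's offset in the read from its match position gives the same estimated read start (position 2) for both.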
Tags and scoring
Each exact match of a seed increases the score associated with exactly one of the tags, namely the one that contains the start position of the whole read. To find this tag we compute the relative start position of the read and update the score of the tag containing this position (Figure 3). More precisely, if there are n letters before the seed in the read, and the match starts at the m-th letter of the genome, we estimate m − n to be roughly the start position of the read if it were aligned to the genome accordingly. This way, consecutive seeds of the same read will produce similar estimated read start positions if their exact match locations are consecutive.
The actual start position of the read might be slightly different from the estimated value due to possible indels. Likewise, the start positions estimated for consecutive matches of different seeds of the same read might differ slightly; however, the estimated start positions fall into one or at most two adjacent tags if the total size of indels inside the read is less than L.
For each exact match of the seed, the tag containing the estimated start position of the read is granted a score equal to the length of that seed. As a result, the final score of each tag is the sum of the lengths of the seeds that correspond to this tag. For very short seeds or repeat elements, where there might be many exact matches for the same seed sequence, only the first P matches are granted scores; the default value of P is 50, and it can be changed through command-line parameters.
Because the tags do not overlap, the total score for one match of the read may be divided between two adjacent tags. This happens in extreme cases where the read's start or end position is near the boundary of a tag and, at the same time, there is an indel inside the read; however, this is not a significant problem as we consider several candidate tags for the second phase.
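The scoring rule can be sketched in a few lines of Python. This is an illustration under assumptions rather than ARYANA's code: the function name `score_tags` and the tuple layout of a seed are hypothetical, and the tag number is taken to be the estimated start position divided by the tag length L (integer division), which assumes tags are numbered consecutively from the start of the genome.

```python
from collections import defaultdict

def score_tags(seeds, tag_length, max_matches=50):
    """Sum seed lengths per tag.

    seeds: iterable of (n, seed_length, match_positions), where n is the
    number of read letters before the seed and match_positions lists the
    genome positions m of its exact matches.
    """
    scores = defaultdict(int)
    for n, seed_length, match_positions in seeds:
        # Only the first P (= max_matches) matches of a seed are scored.
        for m in match_positions[:max_matches]:
            estimated_start = m - n
            scores[estimated_start // tag_length] += seed_length
    return dict(scores)

# Two seeds agree on read start 102; the second seed also has a repeat hit.
example = [(6, 6, [108]), (0, 5, [102, 340])]
tag_scores = score_tags(example, tag_length=100)
```

Here tag 1 collects the lengths of both seeds, while the repeat hit contributes a smaller score to tag 3, so tag 1 would be ranked first.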
Accessing and updating tag information
There are a total of G/L tags, where G and L are the lengths of the genome and of a tag, respectively. A simple approach is to store the tag scores in an array of size G/L, which might not seem a problem at first glance. However, it takes a long time to reset the whole array for each read, and the storage space would also be considerable if multiple threads are aligning reads simultaneously. To address this challenge, ARYANA keeps track of only those tags that have a non-zero score, in a hash table with open-addressing collision management that provides fast access to the records. Upon granting a score to a tag, the tag number is first looked up in the hash table; if it is found, its score is updated; otherwise, a new record is created and inserted into the hash table.
While the hash table is considerably smaller than the total number of tags, clearing it for every new read would still take considerable time. To avoid this, each hash record also stores the ID of the read for which its score was granted. While looking up a tag, all hash records belonging to previous reads are ignored and the corresponding cells of the hash table are treated as if they were empty. Hence, there is no need to reset the hash table for a new read, which has a great impact on the efficiency of ARYANA. Furthermore, to avoid scanning the hash table when selecting the top-scoring tags, a dynamic list keeps track of the t top-scoring tags, where t is 10 by default. The list is updated, if necessary, after each update of the hash table.
In addition to the tag number as the hash key and the tag score, for each tag we store the seed information for all of the seeds that have resulted in updating its score. This information includes the seed length, its position in the read sequence and the genomic position of its match in the reference.
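A reset-free table of this kind can be sketched as follows. This is a minimal illustration and not ARYANA's data structure: the class name `TagTable`, the linear probing scheme and the fixed capacity are assumptions, read IDs are assumed to be distinct non-negative integers, the table is assumed to be larger than the number of tags touched by one read, and the per-seed information is omitted.

```python
class TagTable:
    """Open-addressing score table that is never cleared between reads."""

    def __init__(self, capacity=1024, top_t=10):
        self.capacity = capacity
        self.keys = [0] * capacity
        self.scores = [0] * capacity
        self.stamp = [-1] * capacity      # ID of the read that last wrote a slot
        self.read_id = None
        self.top_t = top_t
        self.top = []                     # current top-scoring tag numbers

    def start_read(self, read_id):
        # O(1): slots stamped with an older read ID simply become stale.
        self.read_id = read_id
        self.top = []

    def _slot(self, tag):
        i = hash(tag) % self.capacity
        # Linear probing; a slot written for another read counts as empty.
        while self.stamp[i] == self.read_id and self.keys[i] != tag:
            i = (i + 1) % self.capacity
        return i

    def score(self, tag):
        i = self._slot(tag)
        return self.scores[i] if self.stamp[i] == self.read_id else 0

    def grant(self, tag, points):
        i = self._slot(tag)
        if self.stamp[i] != self.read_id:          # stale or unused slot
            self.keys[i], self.scores[i], self.stamp[i] = tag, 0, self.read_id
        self.scores[i] += points
        # Keep the short list of best tags up to date after every update.
        if tag not in self.top:
            self.top.append(tag)
        self.top.sort(key=self.score, reverse=True)
        del self.top[self.top_t:]

table = TagTable(capacity=8, top_t=2)
table.start_read(0)
for tag, points in [(5, 20), (13, 30), (5, 15), (7, 10)]:
    table.grant(tag, points)
best_for_read_0 = list(table.top)
table.start_read(1)                                # no clearing needed
```

Stamping each slot with the read ID trades one extra integer per record for the ability to start a new read in constant time, which is the effect described above.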
Precise alignment to the candidate segments
The t top-scoring tags are selected for the second phase of the algorithm, which performs a precise dynamic programming alignment of the given read to each of the regions associated with the candidate tags and finds the best overall alignment consisting of matches, mismatches and indels (the so-called CIGAR sequence in SAM files). The region associated with each tag is extended by e nucleotides (20 bp by default) on both sides to ensure that the potential alignment region of the read is completely covered by the extended segment.
Above, ref and read are the reference and the read, respectively; I[true] = 0 and I[false] = 1; and the dimensions are bounded by 0 < i < length(gap) and |offset| < d, where length(gap) is the size of the gap in the read and d is the largest difference between the sizes of any two corresponding subsequences on the best alignment path. This algorithm has a running time of O(dn) and is faster than the regular Needleman-Wunsch algorithm for a limited d.
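The O(dn) bound comes from filling only a diagonal band of the dynamic programming matrix. The sketch below is a generic banded edit-distance computation written for illustration; it is not the recurrence used in ARYANA, which is not reproduced in this text. The function name `banded_edit_distance`, the unit costs for mismatches and gaps, and the band condition |i − j| ≤ d are assumptions.

```python
def banded_edit_distance(ref, read, d):
    """Edit distance over alignments whose offset |i - j| never exceeds d.

    Returns None when the length difference alone already exceeds d.
    Only O(d) cells per row are filled, giving O(d * n) time overall.
    """
    n, m = len(read), len(ref)
    if abs(n - m) > d:
        return None
    inf = float("inf")
    # Row 0: aligning an empty read prefix against ref[:j] costs j gaps.
    prev = {j: j for j in range(0, min(m, d) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - d), min(m, i + d) + 1):
            best = inf
            if j > 0 and (j - 1) in prev:
                # Diagonal step: cost 0 for a match, 1 for a mismatch.
                best = prev[j - 1] + (0 if ref[j - 1] == read[i - 1] else 1)
            if j in prev:
                best = min(best, prev[j] + 1)      # extra letter in the read
            if (j - 1) in cur:
                best = min(best, cur[j - 1] + 1)   # extra letter in the reference
            cur[j] = best
        prev = cur
    return prev[m]
```

With a band at least as wide as the true edit distance, the result equals the unrestricted edit distance, for example `banded_edit_distance("ACGTACGT", "ACGTCGT", 2)` counts the single deleted letter.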
Aligning paired-end reads
For paired-end data, ARYANA aligns each read separately and obtains two groups of matches, each containing the t best matches of one read to the reference (the default value of t is 10). It then looks for a pair of matches, one from each group, that meets the requirements given for the paired alignment, including the maximum and minimum distances between the reads and their relative orientation. If several pairs qualify, they are ranked by the total score of the two matches and the best pair is reported.
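The pair selection step can be sketched as a search over the two match groups. The code below is illustrative only: the name `best_pair`, the representation of a match as (position, strand, score), and the orientation rule (the two reads must map to opposite strands) are assumptions, since the exact requirements are supplied as parameters to the aligner.

```python
from itertools import product

def best_pair(matches1, matches2, min_dist, max_dist):
    """Pick the highest-scoring compatible pair, or None if there is none.

    Each match is (position, strand, score) with strand '+' or '-'.
    """
    best = None
    for (p1, s1, sc1), (p2, s2, sc2) in product(matches1, matches2):
        if s1 == s2:
            continue                      # assumed rule: opposite strands
        if not min_dist <= abs(p2 - p1) <= max_dist:
            continue                      # distance outside the allowed range
        total = sc1 + sc2
        if best is None or total > best[0]:
            best = (total, (p1, s1, sc1), (p2, s2, sc2))
    return best

group1 = [(1000, "+", 90), (50000, "+", 95)]
group2 = [(1300, "-", 80), (90000, "-", 99)]
pair = best_pair(group1, group2, min_dist=100, max_dist=500)
```

With t = 10 matches per read, at most 100 combinations are examined, so the exhaustive search is inexpensive. Note that the best pair need not contain the best individual match of either read, as in the example above.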
Results and discussion
We compared ARYANA with three other NGS aligners in terms of speed and accuracy. We selected BWA and Bowtie2 as the two most widely cited aligners, and also SeqAlto, a more recent aligner that outperforms many other recent aligners. All aligners were tested with their default parameters, and were run with multiple threads in some experiments.
The experiments were performed on a platform with 48 AMD Opteron Processor 6174 CPUs each having 12 cores with clock speed of 2.2 GHz. The hg19 human genome assembly was used as the reference for all test cases. We used DWGSIM (https://github.com/nh13/DWGSIM/wiki) to simulate data sets similar to real reads produced by Illumina NGS platforms.
Time (s), recall (%), and precision (%) for aligning reads with different error rates.
Time (s), recall (%), and precision (%) for aligning reads with different read lengths.
To see how ARYANA works on real data, we compared the aligners on two datasets, SRR946843 (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR946843) and SRR003161 (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR003161). The SRR946843 dataset was generated by the very recent Ion Torrent PGM technology with an average read length of 172 bp, and the SRR003161 dataset was produced by Roche 454 with an average read length of 572 bp as a part of the 1000 Genomes Project.
Several factors contribute to the performance of ARYANA. The algorithm we use to extract seeds collects a smaller set of seeds than the classic approach of using fixed seeds, thus reducing the total time spent on matching them to the reference genome without significantly losing precision and recall. Furthermore, our algorithm extracts these seeds much faster than naive approaches that extract the same seeds. The main reason is that in many cases our algorithm already knows that a seed will fail to match the reference, based on the information it gained while matching the previous seeds. Additionally, the data structure we use (the hash table) to manage the information about the possible genomic positions of the read (tags) updates and accesses this genome-wide information fast enough to add no noticeable time overhead, while consuming a negligible amount of memory. Finally, the blocks already matched during the first phase and our approach to the dynamic programming algorithm generally decrease the time spent on the second phase.
Overall, our results on both simulated and experimental data are evidence of the efficient and accurate algorithmic architecture used in ARYANA. We have developed ARYANA such that it is convenient to use the same architecture in the development of mission-specific aligners for analysing other types of biological data.
The authors would like to thank Prof. Hans Schoeler and Marcos J. Arauzo-Bravo at the Max Planck Institute for Molecular Biomedicine, Shervin Daneshpajouh at the Computer Engineering Department, Sharif University of Technology, and Dr. Hossein Baharvand at Royan Institute for valuable contributions to the work. We also thank Markus Bradter and Majid Ashtiani for their helpful contribution to the computing environment. The computer clusters of the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran and the Max Planck Institute for Molecular Biomedicine, Muenster, Germany were used for the development and testing of the software.
Funding for this research was partially provided by the National Science Foundation through grant number DBI-1262565 to H.Ch.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
- Gilbert W, Maxam A: The nucleotide sequence of the lac operator. Proceedings of the National Academy of Sciences of the United States of America. 1973, 70 (12): 3581-3584.
- Sanger F, Coulson AR: A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology. 1975, 94 (3): 441-448.
- Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America. 1977, 74 (12): 5463-5467.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, et al: The Sequence of the Human Genome. Science. 2001, 291 (5507): 1304-1351.
- Venter JC: A part of the human genome sequence. Science. 2003, 299 (5610): 1183-1184.
- Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science. 2000, 287 (5461): 2196-2204.
- Denisov G, Walenz B, Halpern AL, Miller J, Axelrod N, Levy S, Sutton G: Consensus generation and variant detection by Celera Assembler. Bioinformatics. 2008, 24 (8): 1035-1040.
- Metzker ML: Sequencing technologies -- the next generation. Nature Reviews Genetics. 2009, 11 (1): 31-46.
- Schuster SC: Next-generation sequencing transforms today's biology. Nature Chemical Biology. 2007, 5 (1): 16-18.
- Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, et al: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013, 2 (1): 10.
- Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology. 2012, 30 (7): 693-700.
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA, Altshuler DM, Durbin RM, et al: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65.
- Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J: SNP detection for massively parallel whole-genome resequencing. Genome Research. 2009, 19 (6): 1124-1132.
- Pico AR, Smirnov IV, Chang JS, Yeh RF, Wiemels JL, Wiencke JK, Tihan T, Conklin BR, Wrensch M: SNPLogic: an interactive single nucleotide polymorphism selection, annotation, and prioritization system. Nucleic Acids Research. 2009, 37 (Database): 803-809.
- Souaiaia T, Frazier Z, Chen T: ComB: SNP calling and mapping analysis for color and nucleotide space platforms. Journal of Computational Biology. 2011, 18 (6): 795-807.
- Simola DF, Kim J: Sniper: improved SNP discovery by multiply mapping deep sequenced reads. Genome Biology. 2011, 12 (6): 55.
- Ge D, Ruzzo EK, Shianna KV, He M, Pelak K, Heinzen EL, Need AC, Cirulli ET, Maia JM, Dickson SP, Zhu M, Singh A, Allen AS, Goldstein DB: SVA: software for annotating and visualizing sequenced human genomes. Bioinformatics. 2011, 27 (14): 1998-2000.
- Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, Chen K, Harris CC, Ding L, Holmfeldt L, Payne-Turner D, Fan X, Wei L, Zhao D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J: CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods. 2011, 8 (8): 652-654.
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010, 26 (7): 873-881.
- De Bona F, Ossowski S, Schneeberger K, Ratsch G: Optimal spliced alignments of short sequence reads. Bioinformatics. 2008, 24 (16): 174-180.
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4): 36.
- Hou H, Zhao F, Zhou L, Zhu E, Teng H, Li X, Bao Q, Wu J, Sun Z: MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation. Nucleic Acids Research. 2010, 38 (Web Server): 732-736.
- Abeel T, Van Parys T, Saeys Y, Galagan J, Van De Peer Y: GenomeView: a next-generation genome browser. Nucleic Acids Research. 2012, 40 (2): 12-12.
- Milne I, Bayer M, Cardie L, Shaw P, Stephen G, Wright F, Marshall D: Tablet-next generation sequence assembly visualization. Bioinformatics. 2010, 26 (3): 401-402.
- Toedling J, Ciaudo C, Voinnet O, Heard E, Barillot E: girafe - an R/Bioconductor package for functional exploration of aligned next-generation sequencing reads. Bioinformatics. 2010, 26 (22): 2902-2903.
- Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009, 25 (11): 1363-1369.
- Kim D, Yoon J, Kong J, Hong S, Lee U: Cloud-scale SNP detection from RNA-Seq data. The 3rd International Conference on Data Mining and Intelligent Information Technology Applications (ICMiA). 2011, 321-323.
- Doddavula SK, Rani M, Sarkar S, Vachhani HR, Jain A, Kaushik M, Ghosh A: Implementation of a scalable next generation sequencing business cloud platform - An experience report. Proceedings of the 4th IEEE International Conference on Cloud Computing (CLOUD). 2011, 598-605.
- Mu JC, Jiang H, Kiani A, Mohiyuddin M, Bani Asadi N, Wong WH: Fast and accurate read alignment for resequencing. Bioinformatics. 2012, 28 (18): 2366-2373.
- Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC: mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Methods. 2010, 7 (8): 576-577.
- Coarfa C, Yu F, Miller CA, Chen Z, Harris RA, Milosavljevic A: Pash 3.0: A versatile software package for read mapping and integrative analysis of genomic and epigenomic variation using massively parallel DNA sequencing. BMC Bioinformatics. 2010, 11 (1): 572.
- Li Y, Terrell A, Patel JM: WHAM: a high-throughput sequence alignment method. Proceedings of the international conference on Management of data. 2011, 445-456.
- Zaharia M, Bolosky WJ, Curtis K, Fox A, Patterson D, Shenker S, Stoica I, Karp RM, Sittler T: Faster and more accurate sequence alignment with SNAP. arXiv preprint arXiv:1111.5572. 2011.
- Chen Y, Schmidt B, Maskell DL: A hybrid short read mapping accelerator. BMC Bioinformatics. 2013, 14 (67).
- Liu CM, Wong T, Wu E, Luo R, Yiu SM, Li Y, Wang B, Yu C, Chu X, Zhao K, Li R, Lam TW: SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics. 2012, 28 (6): 878-879.
- Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol. 2009, 5 (5): 1000386.
- Liu Y, Schmidt B: Long read alignment based on maximal exact match seeds. Bioinformatics. 2012, 28 (18): 318-324.
- Lunter G, Goodson M: Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011, 21 (6): 936-939.
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013, 29 (1): 15-21.
- Chaisson MJ, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012, 13 (238).
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): 25.
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9 (4): 357-359.
- Innocentive-Challenge: Identify Organisms from a Stream of DNA Sequences 2013. [http://www.innocentive.com/ar/challenge/index/9933138]
- Ferragina P, Manzini G: Opportunistic data structures with applications. Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 2000, IEEE Computer Society, Washington, DC, USA, 390-398.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.