Magic-BLAST, an accurate DNA and RNA-seq aligner for long and short reads

Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. It uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.


INTRODUCTION
given number of bases following the applied operation. The first operation whose requirement is met is applied to the alignment and the algorithm proceeds to the next position on both sequences. A single substitution is reported if no requirement is satisfied. The list of alignment operations and their associated conditions used in Magic-BLAST is presented in Table 1.
Figure1 shows an example alignment extension. First, there are two matches and the algorithm moves to the right by two positions on both sequences. When a mismatch (T-G) is encountered the algorithm tries successively each alignment operation and checks its requirements. The first operation, a substitution which requires nine matching bases following the mismatch, fails. The second operation, an insertion which requires ten consecutive matches, succeeds and is applied to the alignment. In the last step there is a match (G-G).
We use the X-drop algorithm (6) to stop the extension. At each position, we record the running alignment score. The algorithm stops at the end of a sequence or when the current score is smaller than the best score found so far by more than the most penalized gapped alignment operation (threebase gap in Table 1). The algorithm then backtracks to the position with the best score.
Because most reads align to a reference with few or no mismatches, this method is faster and more memory efficient than the dynamic programming-based extension procedure used in other BLAST programs. Moreover, this approach facilitates collection of traceback information at little additional cost. This method can be tuned to a given sequencing technology for an expected rate of mismatches or gaps simply by adapting Table 1. For example, in Roche 454 or PacBio, where insertions and deletions are more frequent than substitutions, one could switch to a modified table.
We compute an alignment score using the following system: 1 for each matching pair of bases, -4 for a base substitution, zero for gap opening (either a read or reference sequence), and -4 for each base alignment has a score of at least 50, we search successively for the minor splice sites or their complement: GC-AG or CT-GC, AT-AC or GT-AT, then for any other non-canonical site. The first site found is accepted. The alignment score threshold of 50 was chosen because minor and noncanonical splice sites are rare, but pairs of di-nucleotides are frequently found in the genome. As a result, for reads shorter than 100 bases, Magic-BLAST conservatively only calls GT-AG introns.
Magic-BLAST also attempts to produce spliced alignments if a read has several local alignments separated by one to ten unaligned bases. First, we look for a splice signal within four bases of the end of the left alignment and, if found, we fix the beginning of the candidate intron. Second, we search for the corresponding end of intron signal at offsets that ensure a continuous alignment on the read, allowing at most one additional insertion or deletion. If this fails, the procedure is repeated with the end of the intron fixed and a search for the signal indicating the start of the intron. When the candidate splice signals are found, the alignments are trimmed or extended to the splice signals. The benefit of this method is that it correctly identifies introns even in the presence of a substitution, insertion, or deletion close to the intron boundaries. Because this procedure is very sensitive and can produce many spurious alignments, Magic-BLAST only allows the GT-AG signal in this situation.
The spliced alignment is scored with the same scoring system as the local alignment. There is no reward or penalty for splice sites and no preference is given to continuous versus spliced alignments.
When mapping RNA to the genome, Magic-BLAST does not use an annotation file or a two-pass method. We recommend instead to map in parallel on the genome and on an annotated transcriptome, then use the universal scoring system of Magic-BLAST to select the best alignment, be it genomic or transcriptomic, for each fragment.
transcript sequences so that they exactly match the genome (see supplementary section 2.1).
iRefSeq mRNAs range in length from 147 bases to 109,224 bases. This perfect data set forms a useful benchmark for RNA-seq aligners, because it seems simple to align, and each mismatch in the alignment indicates an imperfect mapping. Furthermore, the coordinates of the 210,509 distinct introns in iRefSeq are known.
The Baruzzo benchmark set of simulated RNA-seq reads, presented in (4), was also used. This set has some qualities that make it appealing for our analysis. The authors document their procedure well, they produce 10 million paired-end 100+100 Illumina-like reads at three vastly different error rates, nominally from 6.1 to 55 mismatches per kb, and they produce data for human and Plasmodium falciparum, a protozoan causing malaria in human (we refer to the latter sets as 'malaria'). Baruzzo (9)). Here, we will refer to each of these RNA-seq sets by its technology: PacBio, Roche and Illumina. Figure 2 presents a histogram of read lengths for the iRefSeq and these three experimental sets. To refine the conclusions, we later added two large PacBio runs (testes SRR5189667 and brain SRR5189652 from the GENCODE project (17), with non-native sequences edited using Illumina runs), two Illumina runs with longer read pairs (250+250 bases SRR5438850 from metastatic melanoma; 300+300 bases SRR5437876 from MCF7 cells, also studied in (1)). We finally truncated the Baruzzo human T1 sample to create a shorter 50+50 bases run and subsampled to 1% the malaria T1 sample to obtain, by keeping only every hundredth sequence, a coverage depth similar to that in human.
We examined the performance of several programs aligning RNA to the genome in the absence of a transcriptome (Human genome GRCh38 and P. falciparum genome provided in (4)). Magic-BLAST was compared to programs from 2009 to 2015: HISAT2 (9), STAR (10,11), STAR long (10,1,17) and TopHat2 (12,13)  can download all the data, realign the sequences, and reproduce supplementary tables and sections 2 to 7, which support our entire analysis.
We first measure how well the different aligners identify introns, then we examine the properties of the alignments.

Intron discovery
To test how well the aligners discover introns, the splice sites were extracted from the BAM output using the 'N' operation in the CIGAR string (15). For the iRefSeq and Baruzzo benchmark sets, the true position of each intron is known. The experimental sets do not come with such a "Ground Truth", but a proxy for true and false positives are the introns annotated and not-annotated in iRefSeq. Of course, this is not strictly correct as some unannotated introns are no doubt real and just have not been discovered or annotated on the genome. On the other hand, it seems likely that all (or almost all) the annotated introns are real. This strategy allows a comparison of the results of all programs on all datasets and the measurement of precision, recall and F-score for intron discovery.
Magic-BLAST, HISAT2 in relaxed or standard mode, and STAR long are able to align very long reads on the genome and find introns. TopHat2 and the standard STAR failed to produce any results for very long reads, although TopHat marginally worked for Roche 454.
We first use a ROC curve approach (18) to precisely judge the quality of intron discovery, true versus false, as a function of minimal read coverage (Figure 3 and S3). We group the introns by read coverage in up to 100 bins. For each bin, the number of true positive (or annotated) introns are plotted on the Y-axis while the number of false positives (or unannotated) introns are on the X-axis.
The resulting curves have up to 100 points and give us visual insight into how the different programs behave when the support, given by the number of reads mapped to each junction, increases. The truth for the benchmark sets is vertical and dark blue. The best curve, of course, would have all the true positives before any false positives, meaning the steeper the slope, and ultimately the higher to the left the curve is, the better. In fifteen cases, Magic-BLAST (red) is to the left and above all other curves and qualifies as the best intron finder in all conditions tested, for reads from 100 bases to 100 kilobases, with any level of mismatches, from perfect match to 8.6% mismatch. This observation applies to benchmark as well as all seven real data sets tested from PacBio, Roche or Illumina.
This observation can be summarized by quantifying intron precision, recall, and F-score for the annotates introns and genes from all tissues, and typically a sample derived from a single tissue and sequenced deeply will express 70 to 75% of all annotated genes and introns, this explains the recall around 72% in this Illumina set. For the shallow data sets from Roche adult Brain and PacBio colon carcinoma cells, precision seems good (up to 92% in Magic-BLAST) because these tissues were used intensely in RefSeq annotation, and because at low coverage, one sees mainly the highly expressed genes, which are the best annotated. Yet the low depth of sequencing explains why Roche finds only 31% of the annotated iRefSeq introns, and the very small PacBio run finds less than 6%.
The ROC curves make it apparent that the ability of the aligners to discover introns, with a good balance of true to false positives, changes as the read coverage for introns decreases. We use this insight to calculate a coverage dependent best F-score for the aligners (Supplementary sections 3 and 4). At the best F-score, Magic-BLAST has the highest F-score in 15 experiments, plus the 12 duplicate benchmarks. The only exception is the human T1 truncated at 50+50, where HISAT2 takes the lead. Magic-BLAST also reaches its best score at the lowest coverage of all aligners in almost all cases. The other aligners achieve optimal scores at much higher coverage than Magic-BLAST for the deep experimental Illumina and Baruzzo malaria sets. Magic-BLAST is more conservative than the other programs, and even the introns supported by a single read appear reasonably trustworthy.
Another notable feature in the ROC curves for the Baruzzo benchmark is that STAR produces the largest number of false positives in every case, followed by HISAT2 ( Figure 3 Figure 7g presents intron discovery F-scores for the Baruzzo set. Magic-BLAST has the best Fscores for all the sets, but really excels for the T3 human set and all the malaria sets.

Alignment quality
Various metrics can be applied to characterize the ability of aligners to accurately map the reads. For shows that Magic-BLAST has the fewest complete alignments, but the situation is reversed for the longer Illumina reads 250+250 and 300+300 ( Figure S2.2) where Magic-BLAST aligns 50% more bases than STAR and twice as many as HISAT2.
We now detail the results for the iRefSeq benchmark, then the six Baruzzo   Another functionally important case is the proper identification of the first and last exons, which reveal the location of promoters and 3' ends regulatory regions. In Magic-BLAST, five alignments overlap the truth but extend outside of the annotated gene, creating a new or incorrect first or last exon, but this problem happens much more frequently in HISAT2 (191 and 79 times) and in STAR long (58 times).
In a gene reconstruction project, this type of defect is hard to fix and may lead to incorrectly intertwined neighboring genes.
The iRefSeq mRNAs should match the genome exactly, and the number of mismatches shows how well each aligner performs ( Figure 6d). The truth (blue) has no mismatch. Magic-BLAST has 771 mismatches, STAR long and HISAT2 relaxed 38018 and 66358 mismatches respectively, and STAR long aligns many fewer bases. On iRefSeq, Magic-BLAST alignments are superior to both HISAT2 and STAR long by all criteria.
We then assessed the accuracy of the alignments and measured the mapping statistics using 100+100 base Illumina-like simulated RNA-seq data from the Baruzzo benchmark (4) (Figure 7). The first two columns (7a and 7b) use the built-in truth to measure the precision and recall of the mapping: an alignment is counted as true if it overlaps the benchmark position, even if it is partial; otherwise it is counted as false (Supplementary section 6). In case of multiply aligned reads (Figure 7e), each alignment contributes. The human T1 set should be the easiest, and STAR, HISAT and Magic-BLAST do well, in this order: their precision is above 95% and recall above 99%. TopHat2 is noticeably lower.
For the human T2 set, STAR 2-pass does the best, and HISAT2 and TopHat2 show some degradation. Magic-BLAST maintains a strong performance at the T3 level, STAR shows a significant degradation of the recall, and HISAT2 and TopHat2 perform very poorly. The percentage of reads and bases aligned tells a similar story (Figure 7c, 7d), with results degrading from T1 to T2 to T3, especially for HISAT2 and TopHat probably because the rate of mismatches in the benchmark, 60 per kb in Human T3 (Figure 7f), exceeds the thresholds of these programs.
The Plasmodium (malaria) benchmark sets should be more challenging owing to the AT-rich strong bias in the composition of the genome; also, the coverage is 100-fold deeper. The results are similar to the human benchmark; the main difference is that STAR is very slow and inefficient on malaria and produces huge numbers of false positive introns ( Figure S3) and Magic-BLAST now has the best alignment precision and recall for T1, T2 and T3 ( Figure 7).
Finally, we examine the ability of the programs to align experimental RNA-seq reads generated with different technologies. We find that, by all criteria, Magic-BLAST is the best mapper for all long reads, from Illumina 250+250 or 300+300 to Roche and PacBio (raw or edited) as well as for iRefSeq. The quality of the aligners is synthetized in Figure 8, which plots the number of matches against the number of mismatches summed over all 'primary' alignments (maximum one alignment per read). The mismatch rate depends on the sample and sequencing quality, its true level is unknown, but mismapping by the aligner will tend to increase the apparent rate. The most efficient aligner will align the largest number of bases -hence be most to the right -yet limit its mismatch rate to a minimumhence be higher than other aligners with a similar number of matching bases. We indicate next to the best aligner and in the same color the number of mismatches per kilobase, which is our best estimate of the true level of mismatches in the sample (   (malaria T1 T2 and Illumina) and in truncated 50+50 bases T1 pairs.
Supplementary sections 6 and 7 provide more details on all these aspects. Figure 9 presents the CPU times and peak memory requirements for the alignments. A priori, the comparison seems straightforward, however, in presenting a fair assessment, the quality of the results matters. The performance. Magic-BLAST is not the best tool to align 50+50 paired end reads ( Figure S7.1), where

Run times
TopHat or the ten times faster but less efficient STAR would be better choices. HISAT2 was memory efficient and did rather well on short reads with few mismatches.
Intron discovery posed different challenges for the aligners. We looked at ROC curves, using the provided results for the benchmark sets and the annotations on the human genome as a guide for the experimental sets. As discussed, the representation of introns in the RefSeq annotation may be uneven. Certain tissue types may be underrepresented, while highly expressed genes from other tissues are certain to be included. This remark explains the need to measure the performance of an aligner on real data, with all the messiness of biology, but also on benchmark data with known results.
It is clear from the ROC curves that our cautious approach to intron discovery pays dividends: Magic-BLAST is the best intron finder in all datasets, with far fewer false positives produced for a given number of true positives. The intron discovery F-score tells a similar story. Magic-BLAST has the best results in all datasets and its results are trustworthy, even at very low coverage. For the T1 or T2 human Illumina type reads with few mismatches, the difference in ROC curves between Magic-BLAST and HISAT2 was relatively small, but Magic-BLAST excelled for more distant matches or the compositionally biased malaria sets. For the T3 sets, Magic-BLAST had much better intron-finding Fscores than the other programs, consistent with its read mapping F-scores. For the iRefSeq, PacBio, Roche and long Illumina sequences, Magic-BLAST again produced the best introns results.
We also found that the intron discovery ROC curves for STAR 2-pass were worse than STAR 1-pass for deep sets such as the malaria sets or the Illumina 101+101, even though STAR 2-pass is expected to improve upon STAR 1-pass. For the Illumina run (zoom in Figure S3.1), the 1-pass curve lies just below Magic-BLAST, but the 2-pass curve is strongly shifted towards the unannotated/false positive. HISAT2, which also uses a 2-pass technique, is twice further than STAR 2-pass towards the noise. One could argue that most introns discovered by HISAT2 or STAR 2-pass in Illumina are real and missing from the annotation. However, our design contains seven controls where the true curve is vertical: in iRefSeq and in the Baruzzo sets, especially malaria, there is no doubt that the highly covered unannotated introns of STAR and HISAT2 are false positives, despite their high coverage.
Both STAR 2-pass and HISAT2 perform better for the shallow T1 human set than the deep malaria T1. The optimal F-scores ( Figure S4.1), calculated for the different aligners and experiments, are consistent with the ROC-curves in this regard. We have shown by subsampling that in both aligners, the second pass reinforces the false discoveries of the greedy first pass, and this tendency is also visible in all PacBio runs and Roche. Sadly, this type of noise cannot be erased, and if used in a gene reconstruction project, as was done in (17), those well supported but false positive introns will generate alternative splice variants that do not exist in the biological sample and will durably pollute reference gene models.
In figure 9, we presented the run times for the various aligners and their relative performance on each dataset. For the longer sequences (first seven rows), from Illumina 250+250 to iRefSeq, Magic-BLAST was consistently in the best category. In some cases, Magic-BLAST could be 3-7 times slower than the next best aligner (STAR long or HISAT2 relaxed), but the quality of the results, both in terms of the alignments and intron finding, made it the best choice. For shorter sequences, such as Illumina 101+101 or the Baruzzo 100 base T1 and T2 benchmark sets, most aligners performed relatively well. Most were faster than Magic-BLAST and therefore a reasonable choice for short reads similar to the target. The other aligners did not perform as well for the more distant T3 Baruzzo sets and STAR slowed down dramatically for the T3 malaria case, becoming 10 times slower than Magic-BLAST.
Magic-BLAST was the only aligner to produce good results for a wide variety of sequence lengths, from 100 bases to 100 kilobases, compositions, or error rates without changes to the command-line whereas the standard options of other programs are often suboptimal (4). We also field-tested Magic-BLAST in several hackathons, allowing us to identify problems, hear user suggestions, and improve usability.
It is finally instructive to examine how the standard BLASTN algorithm handles spliced alignments.
We search an mRNA (u00001.1) against the human genome reference assembly (GRCh38.p12) with BLASTN and find two problems.

BLAST toolkit integration
Magic-BLAST takes full advantage of the BLAST Toolkit. It uses a BLAST database for reference sequences that compresses sequences 4-to-1, saving disk space and memory. The same database can also hold reference sequence metadata such as taxonomy, length, identifiers, and titles. The sequences and the metadata can be retrieved with the blastdbcmd executable (see https://www.ncbi.nlm.nih.gov/books/NBK279690/). Additionally, the databases can also hold usersupplied masking information for the reference sequences that can be selectively enabled. Similar to the BLAST+ programs, one can use the -seqid_list option to map sequences only to selected reference sequences in a larger BLAST database. There is a caveat though: not mapping the reads competitively on the entire genome will gather all the reads coming from the selected target, but also some from its cognate genes, homologous sequences and eventual pseudogenes.

Conclusion and next steps
We presented Magic-BLAST, a new tool for mapping next generation RNA-seq runs with read length between 100 nucleotides and 100 kilobases against a genome or a transcriptome. Its performance was compared with that of similar popular programs: HISAT2, STAR, and TopHat2. Magic-BLAST was the best intron finder on all tested sets, real or simulated, and the best aligner for all long reads, including Illumina 250+250 or 300+300, Roche 454, PacBio and full-length mRNA sequences.
Magic-BLAST integrates very well with other NCBI tools and services and is convenient to use since it recognizes NCBI accessions for SRA runs, mRNA or genomic sequences, and uses BLAST databases.
We are exploring ways to improve Magic-BLAST, such as exporting lists of introns and support, improving adapter detection, shortening the required exon length, and identifying repeats. We are also working closely with users to address their needs.         Percentage of reads aligned uniquely. A higher rate of unique alignments is desirable, but the true number of ambiguous reads is unknown. f) The number of mismatches per kilobase aligned. The true rate (first line of each dataset) was measured using the official benchmark mapping. A rate below the truth, correlated with lower alignment recall, characterizes an aligner which cannot deal with high levels of mismatches. g) the intron discovery F-score, measured for all introns (coverage at least 1).
Magic-BLAST reaches the highest F-score for intron discovery in all datasets. yet keeps a controlled mismatch rates. In Illumina 101+101, STAR long has slightly more matching bases but with a much higher mismatch rate. Notice that the PacBio brain and testes runs had been edited by the submitters using Illumina runs, which explains their low mismatch rate.