A random-permutations-based approach to fast read alignment

Background Read alignment is a computational bottleneck in some sequencing projects. Most of the existing software packages for read alignment are based on two algorithmic approaches: prefix-trees and hash-tables. We propose a new approach to read alignment using random permutations of strings. Results We present a prototype implementation and experiments performed with simulated and real reads of human DNA. Our experiments indicate that this permutations-based prototype is several times faster than comparable programs for fast read alignment and that it aligns more reads correctly. Conclusions This approach may lead to improved speed, sensitivity, and accuracy in read alignment. The algorithm can also be used for specialized alignment applications and it can be extended to other related problems, such as assembly. More information: http://alignment.commons.yale.edu


Background
The exponential growth in high-throughput sequencing exceeds the pace of speed increase in computer hardware. Therefore, advancements in software and algorithms for read analysis have been required in order to analyze the tremendous amount of data obtained from sequencing.
Software packages released in recent years use these approaches very efficiently. When the reference is not very large and not very repetitive, when the number of reads is not large, and when it is possible to "mask" large parts of the reference, existing algorithms and tools provide a computationally inexpensive solution. However, as the throughput continues to grow and new applications emerge, a new approach to read alignment may be useful for many applications.
In this paper, we introduce a random-permutationsbased approach to read alignment. The approach is conceptually related to the use of random projections in randomized nearest neighbors algorithms (e.g. [12]). An outline for a random-permutations-based algorithm for string searches has been presented by Charikar [13]. We formulate the read alignment problem as a special nearest neighbors search problem and propose a practical search algorithm based on the random permutations approach.
The applicability of the algorithm is demonstrated by comparing an implementation of the algorithm to existing fast read alignment programs.

Problem definition
In this subsection we formulate the problem as a "nearest neighbors search" problem.
In the study of the genome, the sequences of nucleotides that form DNA molecules are represented as strings composed of the characters A, C, G and T. We investigate the following scenario: we are given a long reference string (the reference DNA) and a large number of short strings, called "reads." For each of the reads, we would like to find the most similar substring of the reference.
We assume that all our reads are strings of the same length. This assumption often holds in practice, and the approach can be extended to non-uniform lengths. We denote the length of the reads by M. A typical value can be M ≈ 100.
We denote the long reference string, which represents the entire reference genome, by W. In the human genome, the length of this string is in the order of N ≈ 3 × 10 9 .
Instead of considering W as one long string, we examine its contiguous substrings of length M. There are N -M + 1 such substrings, each of them starts at a different location in W. We denote each of these substrings by its location in W, so X i is the substring that begins at the ith character of W .
We can now phrase our alignment problem as follows: given a read Y , find the most "similar" string X i in the reference library. The measure for similarity is based on the number of mismatches, or the "Hamming distance" between strings; the smaller the distance, the higher the similarity.
This type of search problems (with any definition of distance) is known as "nearest neighbors search." In this discussion, we describe an algorithm for finding a single, unique, "true nearest neighbor". We assume that no two reference strings are identical. We also assume that there is a unique "true nearest neighbor" for every read, so no other reference string has the same Hamming distance to the read as the "true nearest neighbor." These assumptions simplify the definitions and the analysis, but the approach is applicable when these assumptions do not hold.

Extended frameworks
The principles discussed in this limited framework and under these assumptions can be extended. For more general search problems, we consider a two-step framework for read alignment: a search step and a refinement step. In the first step, we use a search algorithm to recover candidate alignments and apply a coarse filter to obtain a "small" list of final candidates. In the second step, a more refined method is used to process the list of candidates. This step may include scores (such as the Hamming distance) and thresholds, but may also include cross-referencing of the information recovered from different reads as well as additional searches. This framework is appropriate for permutations-based-algorithms which automatically enumerate many possible candidates.
The prototype presented in the results section implements a version of the algorithm which preforms the first approximate search step and returns a small number of candidates, rather than a single "best" match.
In most of this paper, we restrict our attention to mismatch-type variations and errors. Although considering mismatches is sufficient for some applications, there are other important variations: insertions and deletions ("indels") of characters that change the relative positions of other characters. The implementation in the results section demonstrates one of the extensions for fast and accurate alignment in the presence of indels. A comprehensive discussion of the extensions for indels is beyond the scope of this paper.

Observations
In the description of the algorithm, we discuss arrays of strings which we sort lexicographically, like words in a dictionary. In particular, we discuss lexicographically sorted libraries containing versions of the strings in the reference library. In this subsection, we describe several properties of lexicographically sorted libraries and properties of strings comparison.
Definition 1 If a string is present in the lexicographically sorted array, we define its "lexicographical position" in the array as its position in the array. If a string is not present in the lexicographically sorted array, we define its "lexicographical position" in the array as the position where it would have been inserted if we had added it to the array.
Observation 1 There are search algorithms, such as "binary search" [14], that allow us to find the lexicographical position of reads in sorted libraries of reference strings in O(log(N)) strings comparison operations. Furthermore, when the reference library contains a perfect match for the read, these search algorithms find the perfect match.
This operation is very similar to looking up a word in a dictionary.
Observation 2 Suppose that all the mismatches in some read with respect to its true nearest neighbor are known to occur "late enough" in the read, so that the lexicographical position of the read in the sorted array is within K positions from the position of the true nearest neighbor. Then we can find the true nearest neighbor in O(log(N) + K) strings comparison operations.
This can be done by first finding the lexicographical position of the read, and then considering the "neighborhood" of~2K strings around it. This operation is analogous to finding the correct page in a dictionary and then examining all the words on that page.
Observation 3 If the same permutation is applied to two strings, the Hamming distance between the permuted strings is equal to the Hamming distance between the original strings.
A permutation of a string reorders the characters in the string. Therefore, the same mismatches still occur in the permuted versions, only the positions where they occur are changed by the permutations.

An informal description of the algorithm
In our search problem, we have some library of reference strings and a read. Suppose that our read Y and its true nearest neighbor X i have p mismatches. Based on observation 3, if we apply the same permutation π (j) to our read and all the reference strings, the Hamming distance between the permuted version of our read and each permuted reference string is the same as the distance between the original read and the corresponding original reference string. In particular, the number of mismatches between the permuted read Y (j) and the permuted version of the true nearest neighbor X (j) i is still p, and the permuted version of the true nearest neighbors is the true nearest neighbor of the permuted read. If we are "lucky" enough, we happen to move the p mismatches to positions that are "far enough" from the beginning of the string. Based on observation 2, if the positions are "far enough," the search for the lexicographical position of the read leads us to the "neighborhood" of the "true nearest neighbor." Formal definitions of "neighborhoods" are presented below.
We do not know in advance which characters to "push" to the end of the string and we cannot expect to always be "lucky" enough to permute the correct characters away from the beginning. Instead, for each read that we receive, we repeat the procedure described above several times, using a different random permutation each time. Consequently, we have a high probability of finding the true nearest neighbor in at least one of the repetitions.
This procedure uses sorted arrays of permuted strings to define and search for "neighborhoods." Different versions of the algorithm use other data structures, such as prefix-trees of permuted strings.
To illustrate what permutations do, we generated a random "reference genome" of length N = 20000, and built a library of all substrings of length 15. In this example, we consider the read Y = CTtGCCAAAGCCATG, which should be mapped to the location 10000, where X 10000 = CTCGCCAAAGCCATG.
We attempt to look for a match to Y in the sorted library. Since the mismatch occurred "early" in the read, our search takes us to a distant position in the sorted array and we do not find X 10000 there. The correct neighborhood of X 10000 is presented in table 1.
If we permute Y using the code π (1) : (11,2,7,13,9,15,4,5,1,12,10,14,3,8,6), the mismatch in position 3 is permuted to the 13th position: Y (1) = π (1) (Y) = CTAA AGGCCCGTtAC. When we look for Y (1) , we find the permuted version of X 10000 , which is X (1) If we use the same permutation and the mismatch occurs in a different position, we may not find X 10000 . In fact, if the mismatch occurs in position 11, it becomes a mismatch in the first character of the permuted string. Therefore, we use several randomly chosen permutations to reduce the probability of failure. When we use long strings and random permutations, the probability of error drops rapidly as the number of iterations grows.

A more formal description of the algorithm
We now describe the algorithm more formally. First, we describe an indexing procedure (part 1). Then we describe the search for candidate neighbors (part 2). Finally, we describe an approach to filtering the proposed neighbors and finding the best one (part 3).
For each permutation π (j) : Use π (j) to permute all the original reference strings. Build a sorted array Ar (j) of the permuted reference strings. Store permutation π (j) and index Ar (j) for use in part 2.
End For.
End For.

Part 3: Filter
For each read Y : Calculate the Hamming distance between X i and Y . Keep track of the candidate string most similar to the read. End For. Report the most similar string as the alignment of Y . End For.

"Neighborhoods" of strings
In this subsection we give one of the possible definitions for a "neighborhood." We also define the terms "prefix neighborhood" and "resolution length," which we use in the analysis.
We define a neighborhood size K >0, which is an order of magnitude of the number of strings that we compare to the read in each iteration.
Suppose that we are looking for the string Y in a sorted array of reference strings.
Definition 2 If k is the lexicographical position of the string Y in a sorted array of strings, then the "neighborhood" of Y is defined as the list of strings in positions k -K to k + K in the sorted array.
Definition 3 The "prefix neighborhood of length l" of the string Y is the list of all strings that have the same llong prefix as the string Y .
Definition 4 Given a string X, we define the "resolution length" (L) as the smallest value such that the L-long prefix of X is the prefix of no more than K strings in the library.

Analysis
In this subsection, we discuss the probability of obtaining the "true nearest neighbor" for a read. We denote the number of mismatches between the read and the true nearest neighbor by p.
We assume a constant value of "resolution length" for the true nearest neighbor across the different permutations used. We denote it by L. Different reads may have different true nearest neighbors with different values of L. This assumption can be relaxed in more detailed analyses of the algorithm. We assume that p <(M -L).
There are M! possible permutations for a string of M characters. The permutations that we use are chosen randomly from these M! possibilities with equal probabilities.
We begin by considering a single permutation and a single iteration of part 2 of the algorithm. By definition, if there are no mismatches in the first L characters of the permuted string, then the true nearest neighbor is in the "neighborhood" and it is added to the list of candidates in this iteration. Since the "neighborhood" examined in part 2 of the algorithm is larger than the "prefix neighborhood of length L," some additional reference strings are added to the list of candidates. There are The probability for being "unlucky" in any one experiment is at most 1 − PrGoodPerm(p, L, M). The permutations are chosen at random, and they are independent. Therefore, the probability that at least one of the lists contains the true nearest neighbor is: In part 3 of the algorithm, we check all the candidates directly, so if the true nearest neighbor is in the list, we are guaranteed to report it as the the best match for the read. So, PrSuccess(p, L, M, J) is also the probability of reporting the correct search result.
Each read has its own true nearest neighbor, therefore the value of L and the number of mismatches (p) varies between reads. Given a distribution of L for the different reads and their "true nearest neighbors," a distribution of the number of mismatches and criteria for the desired probability of success in different cases, we set appropriate values of K and J. Our experiments suggest that in practice, low values of K and J, which allow fast computation, can produce good alignment results in a wide range of scenarios.

Complexity
The indexing of the reference in part 1 of the algorithm is a simple sorting operation which requires O(Nlog(N)) strings comparison operations for each of the indexes.
Based on observations 1 and 2, the number of strings comparison operations required by parts 2 and 3 of the algorithm is O(J(log(N) + K)) per read.

Filtration and reporting multiple possible alignments
Since the algorithm evaluates multiple candidates in part 3, some degree of multiple alignments analysis is a byproduct of the algorithm and the algorithm can be extended to report multiple possible alignments. This property allows us to extend the algorithm to perform the fast search required in the first step of the extended alignment framework.
An improved filtration component, the "hit-count" filter, can be used to generate a small list of candidates ("coarse filtration") and also to accelerate the algorithm. A version of the algorithm that uses "hit-counts" stores the number of times each candidate appeared in the searches in part 2 of the algorithm (the "hit-count" for that candidate). In part 3 of this version, the algorithm evaluates and reports only candidates that appeared in several searches in different indexes ("received multiple hits").

Memory considerations and practical indexes
Large reference genomes may require multiple large indexes. It is enough to store the original reference string and the permutation rules, and it is not necessary to store all the permuted strings explicitly in the sorted arrays. It is also not necessary to store all the indexes in the RAM at the same time; one can load an index, perform one iteration of part 2 of the algorithm for a batch of reads, and then load another index.
Furthermore, each single index can be used almost as if it were multiple indexes with different permutations. To achieve this, we use a sliding window to take contiguous substrings of the reads. We permute each of these substrings and search for it in the sorted array of permuted reference strings.

Basic alignment
We implemented a version of the algorithm in C (with no SIMD/SSE). Our permutations-based prototype implementation was used in the same three modes in all the experiments.
For the comparison presented here, we chose some of the popular programs which preform the fastest alignment to a human reference genome. We used Bowtie [3] as the main benchmark for performance evaluation because it is one of the fastest aligners [11]. We also compared the performance to BWA [2] and the more recent Bowtie2 [4].
The purpose of the comparison is to demonstrate the applicability of the algorithm to a large-scale problem like aligning to a full human genome. This is not meant to be a complete comparison to all programs in all scenarios.
All the real reads were obtained from The 1000 Genomes Project [15]. All the simulated reads were produced using wgsim [16]. The human genome GRCh37 [17] (obtained from The 1000 Genomes Project) was used as the reference. Some large regions were masked with "N"s in the original reference, but other repetitive regions were not masked.
The comparison was performed on a cluster node with (2) E5620 CPUs and 48 GB RAM. Similar experiments of alignment to the full human genome, using low-cost ($500-600) desktops with 16 GB and 32 GB RAM, produced similar results. All the programs were used in single thread mode. Bowtie requires about 2.2 GB of RAM, Bowtie2 requires about 3.1 GB and BWA requires about 2.3 GB. The permutations-based prototype implementation requires about 15 GB of RAM for the full human genome (Using more memory would allow to index the reverse complement, doubling the speed. There is a 8 GB version of the program for computers with smaller RAM. Smaller references require smaller indexes).
In Figure 1 we compare the best alignments obtained by Bowtie and the permutations-based prototype. In certain settings, Bowtie found alignments with fewer mismatches for some reads. For example, "Bowtie -v 3" found alignments with up to 3 mismatches for about 0.1% more reads (not visible in the figure). The permutations-based prototype found more alignments for reads with a large number of mismatches than all the modes which we tested in this experiment (most modes are not shown in the figure), and found more alignments with a low number of mismatches than some modes of Bowtie (the default "-n" modes).
In table 3 we compare the search times of Bowtie, Bow-tie2, BWA and the permutations-based prototype in alignment of real reads. The different programs have different criteria for reporting, and the permutations-based prototype usually generates more possible alignments, so the number of alignments would be misleading and it is not reported in the tables.
We also simulated reads to measure how many of the reads are aligned to the "correct original location in the genome." In some cases, there may be several equally probable alignments and in some cases the "best" alignment with the fewest mismatches may be different than the "correct location." Ideally, the alignment program should report both the "best" alignment and other possibilities that could be the "correct" alignment. The results of this experiment are presented in Figure 2 and table 4. In most cases, the permutations-based prototype was faster than Bowtie, Bowtie2 and BWA and produced more correct alignments to the correct "original location" than all other programs.

Alignment of paired-end reads, in the presence of indels
One common variation in the alignment scenario we described is the use of a different type of distance: "edit distance." Strings may be close in "edit distance" even in the presence of indels, although the indels are likely to make the strings far apart in Hamming distance.
Another common variation in the scenario is the availability of "paired-end reads": reads are presented in pairs, in which both reads are known to have originated from nearby areas in the genome. In this case, we are required to find nearest neighbors for both strings, Figure 1 Single-end alignment of real reads: best match. The percent of reads for which an alignment with up to n mismatches was found. Additional alignments are ignored. Dataset: 105 reads from ERR009392_1. subject to some constraint on the distance between the locations in the reference genome. A modified version of our prototype performs pairedend alignment in the presence of indels. We first align each of the reads in the pair separately. Then, for each "reasonable" candidate found for one of the reads, we restrict our attention to the area of the reference genome around that candidate, and attempt to align substrings of the other read to that area.
Since Bowtie does not allow indels, we used Bowtie2 [4] and BWA [2] as benchmarks. All the programs were used in single thread mode. This version of the prototype requires about 25 GB of RAM for the alignment of paired-end reads to a human genome. Bowtie requires about 2.9 GB of RAM, Bowtie2 requires about 3.2 GB and BWA requires about 2.3 GB.
In table 5 we compare the search times of Bowtie, Bowtie2, BWA and the permutations-based prototype implementation in paired-end alignment of real reads.
In Figure 3 and table 6 we present the results of aligning simulated pairs of reads with indels. The permutations-based prototype was faster than the other programs and usually produced more correct alignments when there were few indels. The permutations-based prototype was also able to align reads with higher indel rates almost as well as the best performing benchmark program, but significantly faster.

Conclusions
An algorithm has been constructed for the fast alignment of DNA reads to a reference genome. The algorithm handles mismatches by design, and it has been demonstrated that it can be extended to allow some inserts and deletions in some cases of practical interest.
The algorithm has been implemented and compared to existing programs. Our experiments indicate that the algorithm can produce alignments comparable to those generated by existing fast alignment algorithms, often aligning more reads with a significant speed increase. Future implementations of the algorithm are expected to be faster and more efficient, sensitive and accurate.
The permutations-based prototype implementation requires 15 GB of RAM for the alignment of reads to a human genome (25 GB for paired-end alignment). Some existing programs require significantly less memory. However, the amount of memory required by this implementation is available in low cost computers, and other versions of the algorithm utilize memory more efficiently.
Our current implementation of the algorithm is not a complete software package and does not replace existing  Each dataset contained 10 6 reads. Mutation rate: 0.1%, indel ratio: 15%. We report the search time (for BWA: overall run time) and the percent of the reads which were aligned to the correct position in the genome. In some cases and in some settings, the programs may report several possible alignments for some reads. When needed, additional filtering can be added to aligners in order to eliminate some of the results, as appropriate for specific applications. In the experiment, the programs reported <4 alignments/read (in average, in sensitive modes), with 0-1 alignments for the majority of reads. In this table, when the program produces multiple possible alignments, it is enough that one of the reported alignments corresponds to the correct location in order to consider the alignment correct.   software packages. This prototype implementation demonstrates how the proposed algorithm can be used to enhance existing software packages and to build new software packages. The scope of this discussion is limited to the basic problem of fast alignment to large genomes. Separate work on this class of algorithms indicates that the algorithms can also be used for very fast alignment of long 454 and Ion Torrent reads which may have many indels. Other work indicates that these algorithms can be used for other applications, such as assembly. Additional preliminary results and technical reports are available at http://alignment.commons.yale.edu.

Competing interests
The author developed patent-pending methods for processing in sequencing.