Basic definitions
Let w be a string of length n of symbols over an alphabet Σ. w[i] denotes the ith symbol of w and w[i…j] the substring of w from position i to j, 1 ≤ i ≤ j ≤ n. w[1…i] is the prefix of w ending at position i and w[j…n] is the suffix of w starting at position j. A substring of w is proper if it is different from w. A substring of w is internal if it is neither a prefix nor a suffix of w.
A read r is a string over the alphabet {A, C, G, T}, which is assumed to be ordered by the alphabetical order < such that A < C < G < T. ⊴ denotes the lexicographic order of all substrings of the reads induced by the alphabetical order <. Let n be the length of r. The reverse complement of r, denoted by r̄, is the sequence obtained by reversing r and replacing each base a by its Watson-Crick complement ā.
Computing suffix- and prefix-free read sets
The first step of our approach for assembling a collection of reads is to eliminate reads that are prefixes or suffixes of other reads. Here we describe a method to recognize these reads. Consider an ordered set ℛ = {r1, …, rm} of reads, possibly of variable length, in which some reads may occur more than once (so ℛ is indeed a multiset). We assume that, for all i, 1 ≤ i ≤ m, the ith read ri in ℛ is virtually padded by a sentinel symbol $i at the right end and that the alphabetical order < is extended such that A < C < G < T < $1 < $2 < ⋯ < $m.
We define a binary relation ≺ on ℛ such that ri ≺ rj if and only if i < j. That is, ≺ reflects the order of the reads in the input. ℛ is prefix-free if for all reads r in ℛ there is no r′ in ℛ, r′ ≠ r, such that r is a prefix of r′. ℛ is suffix-free if for all r in ℛ there is no read r′ in ℛ, r′ ≠ r, such that r is a suffix of r′.
To obtain a prefix- and suffix-free set of reads we lexicographically sort all reads using a modified radixsort for strings, as described in [14]. In this algorithm, the strings to be sorted are first inserted into buckets according to their first character. Each bucket is then sorted recursively according to the next character of all reads in the bucket. A bucket always consists of reads which have a common prefix. Once a bucket is smaller than some constant, the remaining suffixes of the reads in the bucket are sorted by insertion sort [15].
During the sorting process, the length of the longest common prefix (lcp) of two lexicographically consecutive reads is calculated as a byproduct. For two lexicographically consecutive reads r and r′ with an lcp of length ℓ = |r|, we can conclude that r is a prefix of r′. If ℓ < |r′|, then r is a proper prefix of r′ and we mark r. If ℓ = |r′|, then r and r′ are identical and we mark the read which is larger according to the binary relation ≺ .
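The following C sketch illustrates this marking step. It assumes the reads are already lexicographically sorted, that len, lcp and order hold the read lengths, the lcp values from the sort, and the input positions (the relation ≺), respectively; all names are illustrative, not taken from our implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mark reads that are prefixes of, or identical to, other reads.
 * Assumes reads are in lexicographic order; lcp[i] (for i >= 1) is the
 * length of the longest common prefix of reads i-1 and i, a byproduct of
 * the sort; order[i] is the read's position in the input. */
void mark_prefix_reads(const size_t *len, const size_t *lcp,
                       const size_t *order, size_t m, bool *marked)
{
  for (size_t i = 1; i < m; i++) {
    if (lcp[i] == len[i - 1]) {          /* read i-1 is a prefix of read i */
      if (lcp[i] < len[i])
        marked[i - 1] = true;            /* proper prefix: mark shorter read */
      else                               /* identical reads: mark the one */
        marked[order[i - 1] > order[i] ? i - 1 : i] = true; /* later in input */
    }
  }
}
```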
To handle reverse complements and to mark reads which are suffixes of other reads, one simply applies this method to the multiset ℛ ∪ ℛ̄, where ℛ̄ = {r̄i | 1 ≤ i ≤ m}. As ℛ ∪ ℛ̄ includes the reverse complements of the reads, the method also marks reads which are suffixes of other reads. This is due to the observation that if read r is a suffix of read r′, then r̄ is a prefix of r̄′.
In a final step of the algorithm one eliminates all reads from ℛ which have been marked. The remaining unmarked reads from ℛ are processed further. The algorithm to compute a suffix- and prefix-free set of reads runs in O(m · λmax/ω) time, where λmax is the maximum length of a read and ω is the machine’s word size. As we consider λmax to be a constant (which does not imply that the reads are all of the same length), the algorithm runs in O(m) time.
Computing suffix-prefix matches
Suppose that ℛ is a suffix- and prefix-free set of m reads. Let ℓmin > 0 be the minimum length parameter. The set of suffix-prefix matches (SPMs, for short) is the smallest set of triples 〈r, r′, ℓ〉 with r, r′ ∈ ℛ for which strings u, v, w exist such that r = uv, r′ = vw, and |v| = ℓ ≥ ℓmin. ℓ is the length of a suffix-prefix match 〈r, r′, ℓ〉. The suffix-prefix matching problem is to find all suffix-prefix matches. As reads of length smaller than ℓmin cannot, by definition, contribute any SPM, we can ignore them and thus assume that ℛ only contains reads of length at least ℓmin.
The method to solve the suffix-prefix matching problem presented here consists of two main algorithms. The first algorithm identifies and lexicographically sorts all SPM-relevant suffixes, i.e. a subset of all suffixes of all reads from which one can compute all suffix-prefix matches. The second algorithm enumerates these matches given the sorted list of all SPM-relevant suffixes.
Consider a suffix-prefix match 〈r, r′, ℓ〉. By definition, the suffix of length ℓ of r exactly matches the prefix of length ℓ of r′. Obviously, the suffix of r involved in the match starts at some position j, 2 ≤ j ≤ |r| − ℓmin + 1, in r. This implies that r must be of length at least ℓmin + 1. The suffix cannot start at the first position in r, as otherwise r would be a prefix of some other read, contradicting our assumption that ℛ is prefix-free.
To enumerate the set of all suffix-prefix matches of length at least ℓmin, we preprocess all reads and determine all proper suffixes of the reads which may be involved in a suffix-prefix match. More precisely, for all reads r we determine all matching candidates, i.e. all proper suffixes s of r such that the length of s is at least ℓmin and there is a read r′ such that s and r′ have a common prefix of length at least k, where k is a user-defined parameter satisfying k ≤ min{ℓmin, ω/2}. There are two reasons for imposing this constraint on k: First, we want to represent a string of length k over an alphabet of size 4 in one machine word of ω bits, thus k ≤ ω/2. Second, the suffixes of the reads from which we take the prefixes of length k have minimum length ℓmin, thus we choose k ≤ ℓmin.
The set of all matching candidates and all reads forms the set of all (ℓmin, k)-SPM-relevant suffixes. For simplicity’s sake, we use the notion SPM-relevant suffixes if ℓmin and k are clear from the context. While all SPMs can be constructed from the SPM-relevant suffixes, not all SPM-relevant suffixes lead to an SPM.
An efficient algorithm for identifying and sorting all SPM-relevant suffixes
The first two phases of our algorithm follow a strategy that is borrowed from the counting sort algorithm [15]. Like counting sort, our algorithm has a counting phase and an insertion phase. However, in our problem, the elements (i.e. SPM-relevant suffixes) to be sorted are only determined during the algorithm. Moreover, the number of keys (i.e. initial k-mers) whose occurrences are counted is on the order of the number of elements to be sorted. Therefore, in a space efficient solution, it is not trivial to access a counter given a key. We have developed a time and space efficient method to access the counter for a key, exploiting the fact that counting and inserting the SPM-relevant suffixes does not have to be done immediately. Instead, the items to be counted/inserted are first buffered and then sorted. A linear scan then performs the counting or inserting step.
In contrast to counting sort, our algorithm uses an extra sorting step to obtain the final order of elements pre-sorted in the insertion phase. Under the assumption that the maximum read length is a constant (which does not imply that the reads are all of the same length), our algorithm runs in O(n) time and space, where n is the total length of all reads. To the best of our knowledge a method employing a similar strategy has not yet been developed for the suffix-prefix matching problem.
We first give a description of our algorithm using string notation. In a separate section, we explain how to efficiently implement the algorithm. In the following, we only consider the reads in the forward direction. However, it is not difficult to extend our method to also incorporate the reverse complements of the reads and we comment on this issue at the end of the methods section.
The initial k-mer of some sequence s is the prefix of s of length k. To determine the matching candidates efficiently, we first enumerate the initial k-mers of all reads and store them in a table of size m. This can be done in O(m) time. The notion table size always refers to the number of entries in the table. The next step lexicographically sorts the k-mers in the table in ascending order. This string sorting problem can be transformed into an integer sorting problem (see Implementation) which can be solved by radixsort [15] in O(m) time and O(m) extra working space.
In the next step, a linear scan of the sorted k-mers removes duplicates from the table and counts how many times each k-mer occurs in the table. This scan requires O(m) time. Let d ≤ m be the number of different k-mers. These can be stored in a table K of size d.
The counts for the elements in K require another table C of size d. In addition to the duplicate removal and counting, the linear scan of the sorted k-mers constructs two sets P and Q, the size of which depends on two user-defined parameters k′ ≤ k and k″ ≤ k. P is the set of all initial k′-mers of the reads. Q is the set of all k″-mers r[k − k″ + 1…k] for some r ∈ ℛ. We assume that elements can be added to P and Q in constant time and that membership in these sets can be decided in constant time. Thus the linear scan constructs P and Q in O(m) time. As P is a subset of a set of size 4^k′, P can be stored in 4^k′ bits. Q requires 4^k″ bits.
Up until now, only the initial k-mers of the reads were considered, resulting in a sorted table K of d non-redundant keys (i.e. initial k-mers of reads), a table C of size d for counting k-mers and two sets P and Q. By construction, each count in C is at least 1 and the sum of the counts is m. The next task is to enumerate, for all reads r, the suffixes of r at all positions j, 2 ≤ j ≤ |r| − ℓmin + 1. r has |r| − ℓmin such suffixes. For each such suffix s (which by construction is of length ≥ ℓmin), one extracts two strings v = s[1…k′] and w = s[k − k″ + 1…k]. If v does not occur in P, then v is not a prefix of any read in ℛ and thus s is not a matching candidate and can be discarded. If w does not occur in Q, then s[1…k] ≠ r′[1…k] for all reads r′ ∈ ℛ and thus s is not a matching candidate and can be discarded. Thus P and Q serve as filters to efficiently detect suffixes which can be discarded. For read r the suffixes s and the corresponding strings v and w can be enumerated in O(|r| − ℓmin) time. Checking membership in P and in Q requires constant time. Therefore, each read r is processed in O(|r| − ℓmin) time. Thus the enumeration and checking requires O(Σr∈ℛ(|r| − ℓmin)) ⊆ O(n) time altogether.
The next task is to process a suffix, say s, which has passed the P-filter and the Q-filter. That is, v = s[1…k′] ∈ P and w = s[k − k″ + 1…k] ∈ Q holds. One now has to check if u = s[1…k] occurs in K to verify if s is a matching candidate. If the latter is true, the appropriate counter needs to be incremented. Hence this is the counting phase of our algorithm. The simplest way to check the occurrence of u in K is to perform a binary search, taking u as the key. However, this would require O(log₂ d) time for each k-mer passing the filters. This is too slow. Using a hash table turned out to be too slow as well and would require too much extra space, which we do not want to afford.
We propose an efficient method that works as follows: Store each k-mer s[1…k] passing the P- and the Q-filter in a buffer B of fixed size b := γ·d for some constant γ > 1. Once B is full or all k-mers have been added to B, sort the elements in B in ascending lexicographic order. Then perform a binary search in K, but only for the first element in B, say x. As B is sorted, x is the smallest element. The binary search for x in K finds the smallest element in K greater than or equal to x using O(log₂ d) time. If such an element occurs in K, say at index e, then simultaneously scan B beginning with the first index and K beginning at index e. For any element in B that is equal to an element in K, say at index i in K, increment the counter in C at the same index i.
This simultaneous linear scan of B and (a part of) K takes O(b + d) time and finds all k-mers from B occurring in K. Once the scan and the associated increments are done, the buffer is emptied for the next round. Suppose that there are in total b∗ k-mers that have passed B. Then there are ⌈b∗/b⌉ rounds filling the buffer. Each round is associated with a sorting step, a binary search and a linear scan. Sorting requires O(b) time using radixsort. This gives a running time of O(⌈b∗/b⌉ · (b + d + log₂ d)) = O(b∗). As b∗ ≤ n, the running time is linear in the total length of the reads.
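The following C sketch illustrates one such counting round under the stated assumptions (K sorted and duplicate-free, C the counter table, B a filled buffer of k-mer codes); qsort stands in for the radixsort used in our implementation, and all names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *x, const void *y)
{
  uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
  return (a > b) - (a < b);
}

/* One counting round: sort the buffer, locate its smallest element in K
 * by binary search, then merge-scan B and K, incrementing counters. */
void count_round(const uint64_t *K, uint32_t *C, size_t d,
                 uint64_t *B, size_t b)
{
  qsort(B, b, sizeof *B, cmp_u64);       /* the paper uses radixsort here */
  size_t lo = 0, hi = d;                 /* smallest K[e] >= B[0] */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (K[mid] < B[0]) lo = mid + 1; else hi = mid;
  }
  for (size_t i = 0, e = lo; i < b && e < d; ) {
    if (B[i] < K[e])      i++;           /* k-mer absent from K: discard */
    else if (B[i] > K[e]) e++;
    else { C[e]++; i++; }                /* match: increment its counter */
  }
}
```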
Once all reads have been processed, for any initial k-mer u of any read, the following holds: If u is the ith initial k-mer in K, then C[i] is the number of SPM-relevant suffixes of which u is a prefix. To prepare for the insertion phase, compute the partial sums of C in an array π of size d + 1, such that π[0] = C[0], π[i] = π[i − 1] + C[i] for all i, 1 ≤ i ≤ d − 1, and π[d] = π[d − 1]. π[d] is the number of all SPM-relevant suffixes. One creates a table S of size g := π[d] to hold pairs of read numbers and read offsets. As in the counting phase, enumerate all suffixes of reads of length at least ℓmin passing the P- and the Q-filter. Suppose that s is such a suffix of read number p and with read offset q. Let u be the initial k-mer of s. Then we store (p, q, u) in a buffer B′ of fixed size b/2. We choose this buffer size, as the elements in B′ require twice as much space as the elements in B. As in the counting phase, sort the buffer in lexicographic order of the k-mers it stores, and then process the buffer elements using the k-mer, say u, as a key to determine if u matches some element in K, say at index i. Then insert (p, q) at index π[i] − 1 in S and decrement π[i].
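A minimal C sketch of the partial sum computation and of the back-to-front insertion described above might look as follows; the types and names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t readnum, offset; } SuffixRef;

/* pi has d+1 entries as described in the text */
void partial_sums(const uint32_t *C, uint64_t *pi, size_t d)
{
  pi[0] = C[0];
  for (size_t i = 1; i < d; i++) pi[i] = pi[i - 1] + C[i];
  pi[d] = pi[d - 1];          /* total number of SPM-relevant suffixes */
}

/* insert suffix (p,q) whose initial k-mer is the i-th element of K */
void insert_suffix(uint64_t *pi, SuffixRef *S, size_t i,
                   uint32_t p, uint32_t q)
{
  S[--pi[i]] = (SuffixRef){ p, q };   /* fill the i-th bucket back to front */
}
```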
After all b∗ elements have passed the buffer and have been processed, S holds all SPM-relevant suffixes (represented by read numbers and read offsets) in lexicographic order of their initial k-mers. Let u be the ith k-mer in K. Then all SPM-relevant suffixes with common prefix u are stored in S from index π[i] to π[i + 1] − 1. Thus S can uniquely be divided into buckets of SPM-relevant suffixes with the same initial k-mer. Each such bucket can be sorted independently from all other buckets. Moreover, each SPM-relevant suffix not occurring in the ith bucket has an initial k-mer different from u and thus cannot have a common prefix of length ≥ ℓmin with the suffixes in the ith bucket. As a consequence, all suffix-prefix matches are derivable from pairs of SPM-relevant suffixes occurring in the same bucket. Thus, the suffix-prefix matching problem can be divided into d subproblems, each consisting of the computation of suffix-prefix matches from a bucket of SPM-relevant suffixes. This problem is considered later.
To sort the ith bucket one extracts the remaining suffixes relevant for sorting the bucket and stores them in a table. This strategy minimizes the number of slow random accesses to the reads. Consider the ith bucket and let (p, q) be one of the suffixes in the bucket, referring to the suffix of read rp at read offset q. Then extract the suffix of rp starting at position q + k. As the maximum read length is considered to be constant, the total size of the remaining suffixes to be stored is O(β), where β is the size of the bucket. The remaining suffixes can be sorted using radixsort in O(β) time. An additional linear time scan over the sorted suffixes of the bucket delivers a table L of size β − 1, such that Lj is the length of the longest common prefix of the suffixes S[π[i] + j − 1] and S[π[i] + j] for all j, 1 ≤ j ≤ β − 1.
Sorting all remaining suffixes and computing the lcp-table L thus requires O(βmax) space and O(g) time, where βmax is the maximum size of a bucket and g is the total number of SPM-relevant suffixes. The bucket of sorted SPM-relevant suffixes and the corresponding table L are processed by Algorithm 2, described after the following implementation section, and by Algorithm 3, described in Additional file 1, Section 7.
All in all, our algorithm runs in O(m + n + g) = O(n) time and O(m + 4^k′ + 4^k″ + βmax + g + n) space. As we choose k″ ≤ k′ ∈ O(log₂ n) and m, g, and βmax are all smaller than n, the space requirement is O(n). Thus the algorithm for identifying and sorting all (ℓmin, k)-SPM-relevant suffixes is optimal.
Implementation
We will now describe how to efficiently implement the algorithm described above. An essential technique used in our algorithm is the use of integer codes for k-mers. These are widely used in sequence processing. As we have three different mer-sizes (k, k′, and k″) and dependencies between the corresponding integer codes, we briefly describe the technique here. In our problem, a k-mer always refers to a sequence of which it is a prefix. Therefore, we introduce integer codes for strings of length ≥ k: For all strings s of length at least k define the integer code ϕk(s) = ϕ(s[1])·4^(k−1) + ϕ(s[2])·4^(k−2) + ⋯ + ϕ(s[k])·4^0, where ϕ is the mapping [A → 0, C → 1, G → 2, T → 3] uniquely assigning numbers from 0 to 3 to the bases in the alphabetical order of the bases. Note that only the first k symbols of s determine ϕk(s), which is an integer in the range [0…4^k − 1]. For all strings s and s′ of length at least k, s ⊴ s′ implies ϕk(s) ≤ ϕk(s′), where ⊴ denotes the lexicographic order of strings and ≤ denotes the order of integers.
Besides ϕk, we use the encodings ϕk′ and ϕ″k″ for some k′, k″ ≤ k. ϕk′ encodes the prefix of s of length k′ and is defined in analogy to ϕk (replacing k by k′). ϕ″k″ encodes the suffix of s[1…k] of length k″, i.e. ϕ″k″(s) = ϕk″(s[k − k″ + 1…k]). ϕk′(s) and ϕ″k″(s) can be computed from ϕk(s) according to the following equations:

ϕk′(s) = ⌊ϕk(s) / 4^(k − k′)⌋ (1)

ϕ″k″(s) = ϕk(s) mod 4^k″ (2)
We implement k-mers by their integer codes. Each integer code can be computed in constant time by extracting the appropriate sequence of consecutive bit pairs from a 2-bit-per-base encoding of the read set. In our implementation, we use the representation and the appropriate access functions from the GtEncseq software library [16]. As k ≤ ω/2, we can store each integer code in an integer of the machine’s word size. We sort the m integer codes for the initial k-mers using quicksort, adapting the code from [17]. Our implementation works without recursion and uses an extra stack of size O(log₂ m) to sort m integers. This small additional space requirement is the main reason to choose quicksort instead of radixsort, which is usually more than twice as fast, but requires O(m) extra working space, which we do not want to afford.
The sets P and Q are implemented by bit vectors vP and vQ of 4^k′ and 4^k″ bits, respectively. Bit vP[q] is set if and only if q = ϕk′(r) for some r ∈ ℛ. Bit vQ[q] is set if and only if q = ϕ″k″(r) for some read r ∈ ℛ. To obtain the bit indices, one computes ϕk′(s) and ϕ″k″(s) from ϕk(s) using Equations (1) and (2). Equation (1) can be implemented by a bitwise right shift by 2(k − k′) bits. Equation (2) can be implemented by a bitwise and operation with the integer 2^(2k″) − 1. Thus, given the integer code for s, both ϕk′(s) and ϕ″k″(s) can be computed in constant time. Therefore, the sets P and Q can be constructed in O(m) time and each access takes constant time.
When determining the k-mer codes in the counting phase and in the insertion phase, we sweep a window of width k over the sequence reads and compute the integer code for the sequence in the current window in constant time.
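The following C sketch illustrates the sliding window and the filter lookups under the assumption k < 32, so that a code fits into a 64-bit word; the function names are illustrative and do not correspond to the GtEncseq API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Slide the window one base to the right: append the new 2-bit base and
 * drop the leftmost one (assumes k < 32). */
static inline uint64_t next_code(uint64_t code, unsigned base2bit, unsigned k)
{
  return ((code << 2) | base2bit) & ((UINT64_C(1) << (2 * k)) - 1);
}

static inline uint64_t code_kprime(uint64_t code, unsigned k, unsigned kprime)
{
  return code >> (2 * (k - kprime));                   /* Equation (1) */
}

static inline uint64_t code_k2prime(uint64_t code, unsigned k2prime)
{
  return code & ((UINT64_C(1) << (2 * k2prime)) - 1);  /* Equation (2) */
}

static inline bool bit_is_set(const uint64_t *v, uint64_t q)
{
  return (v[q >> 6] >> (q & 63)) & 1;
}

/* a suffix with initial k-mer code `code` passes both filters if: */
static inline bool passes_filters(const uint64_t *vP, const uint64_t *vQ,
                                  uint64_t code, unsigned k,
                                  unsigned kprime, unsigned k2prime)
{
  return bit_is_set(vP, code_kprime(code, k, kprime)) &&
         bit_is_set(vQ, code_k2prime(code, k2prime));
}
```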
We implement the counts by a byte array of size d and store counts larger than 255 in an additional hash table. Additional file 1, Section 1 gives the details.
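As an illustration of the idea (not of the exact layout described in Additional file 1, Section 1), a two-level counter might be sketched in C as follows; the toy overflow table and all names are assumptions.

```c
#include <stdint.h>

#define OVF_SLOTS 1024          /* toy capacity, assumed sufficient here */
static uint32_t ovf_key[OVF_SLOTS];
static uint32_t ovf_cnt[OVF_SLOTS];
static uint8_t  ovf_used[OVF_SLOTS];

/* toy open-addressing hash table for the rare counts above 255 */
static void overflow_increment(uint32_t key)
{
  uint32_t h = (key * UINT32_C(2654435761)) % OVF_SLOTS;
  while (ovf_used[h] && ovf_key[h] != key)
    h = (h + 1) % OVF_SLOTS;    /* linear probing */
  ovf_used[h] = 1;
  ovf_key[h] = key;
  ovf_cnt[h]++;
}

void increment_count(uint8_t *count8, uint32_t i)
{
  if (count8[i] < UINT8_MAX)
    count8[i]++;                /* common case: one byte per counter */
  else
    overflow_increment(i);      /* rare case: continue in the hash table */
}
```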
The partial sums in table π are bounded by g, the number of SPM-relevant suffixes. For large read sets, g can be larger than 2^32 − 1. However, as the partial sums are monotonically increasing, one can implement π by a 32 bit integer table PS of size d + 1, such that PS[i] = π[i] mod 2^32 for any i, 0 ≤ i ≤ d, and an additional integer table of size ⌊g/2^32⌋ marking the boundaries at which the partial sums cross a multiple of 2^32. Details are given in Additional file 1, Section 2.
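One possible way to realize this idea is sketched below in C: PS keeps the low 32 bits, and a sorted table of carry positions supplies the high bits. This is only an illustration; the actual scheme is described in Additional file 1, Section 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Recover the full partial sum pi[i] from the 32-bit table PS and a sorted
 * list of carry indices: carry[c] is the smallest table index at which the
 * true partial sum has crossed (c+1) * 2^32. Names are illustrative. */
uint64_t partial_sum(const uint32_t *PS, const uint32_t *carry,
                     size_t ncarry, size_t i)
{
  size_t lo = 0, hi = ncarry;   /* count carries at or before index i */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if ((size_t)carry[mid] <= i) lo = mid + 1; else hi = mid;
  }
  return ((uint64_t)lo << 32) | PS[i];
}
```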
For the insertion phase we need a representation of the read set (2n bits), table K (2kd bits), sets P and Q (4^k′ and 4^k″ bits, respectively), table π (32(d + 1) bits) and table S of size g. As S holds pairs of read numbers and read offsets, each entry in S is stored compactly in σ := ⌈log₂ m⌉ + ⌈log₂ λmax⌉ bits. This would give an integer array of gσ bits if we stored S completely. But we do not, as we employ a partitioning strategy, explained next.
Although the data structures representing tables S, K, P and π are of different sizes, their access follows the same scheme: Suppose that i is the smallest index such that π[i] ≥ g/2. Roughly half of the suffixes to be inserted into S are placed in buckets of lower order (with index ≤ i) and the other half are placed in buckets of higher order (with index > i). The buckets of lower order are associated with the k-mers in K up to index i. Hence, for these, one needs tables K and PS only up to index i. Let s be some suffix of length ≥ ℓmin such that ϕk(s) ≤ K[i]. To apply the P-filter to s, one checks vP at index ϕk′(s), which is in the first half of vector vP. This strategy, dividing tables S, K, P and π into q = 2 parts of roughly the same size, can be generalized to q > 2 parts. Each part is defined by a lower and an upper integer code and by corresponding lower and upper boundaries referring to sections of the four mentioned tables. Partitioning S means to only allocate the maximum space for holding all buckets belonging to a single part.
The four tables that can be split over q parts require h(g, k, d, k′, σ) = 2kd + 4^k′ + 32(d + 1) + gσ bits. Hence, in the insertion phase, our method requires 2n + 4^k″ + h(g, k, d, k′, σ)/q bits, where 2n + 4^k″ bits are for the representation of the reads and the set Q (which cannot be split). As gσ dominates all other terms, h(g, k, d, k′, σ) is much larger than 2n + 4^k″, so that the space gain of our partitioning strategy is obvious. As the space required for the insertion phase for any number of parts can be precalculated, one can choose a memory limit and calculate the minimal number of parts such that the limit is not exceeded. In particular, choosing the space peak of the counting phase as a memory limit for the insertion phase allows for balancing the space requirements of both phases. More details on the partitioning technique are given in Additional file 1, Section 3.
An obvious disadvantage of the partitioning strategy (with, say, q parts) is the requirement of q scans over the read set. However, the sequential scan over the read set is very fast in practice and only makes up a small part of the running time of the insertion phase.
The expected size of a bucket to be sorted after the insertion phase is smaller than the average read length. The maximum bucket size (determining the space requirement for this phase) is 1 to 2 orders of magnitude smaller than d. As we can store ω/2 bases in one integer of ω bits, the remaining suffixes (which form the sort keys) can be stored in βmax(⌈2(λmax − k)/ω⌉ + 2) integers, where βmax is the maximum size of a bucket and λmax is the maximum length of a read. The additional constant 2 accounts for the length of the remaining suffix and for the read number and the read offset. The sort keys are thus sequences of integers of different length which have to be compared up to the longest prefix of the strings they encode. We use quicksort in which ω/2 bases are compared using a single integer comparison. As a side effect of the comparison of the suffixes, we obtain the longest common prefix of two compared suffixes in constant extra time, and store this in a table L of the size of the bucket. The suffixes in the bucket and the table L are passed to Algorithm 2, described next, and to Algorithm 3 (Additional file 1, Section 7).
An efficient algorithm for computing suffix-prefix matches from buckets of sorted SPM-relevant suffixes
The input to the algorithm described next is a bucket of sorted SPM-relevant suffixes, with the corresponding table L, as computed by the algorithm of the previous subsection. Consider the ith bucket in S and let Hj = S[π[i] + j] be the jth suffix in this bucket for all j, 0 ≤ j ≤ β − 1, where β = π[i + 1] − π[i] is the size of the bucket. By construction, we have Hj−1 ⊴ Hj, Lj ≥ k, and Lj is the length of the longest common prefix of Hj−1 and Hj for all j, 1 ≤ j ≤ β − 1.
Note that the bucket-wise computation does not deliver the lcp-values of pairs of SPM-relevant suffixes on the boundary of the buckets. That is, for all i > 0, the length of the longest common prefix of S[π[i] − 1] and S[π[i]] is not computed, because S[π[i] − 1] is the last suffix of the (i − 1)th bucket and S[π[i]] is the first suffix of the ith bucket. However, as the two suffixes belong to different buckets, their longest common prefix is shorter than k (and thus shorter than ℓmin) and therefore not of interest for the suffix-prefix matching problem.
The suffixes occurring in a bucket will be processed in nested intervals, called lcp-intervals, a notion introduced for enhanced suffix arrays by [18]. We generalize this notion to the tables H and L as follows: An interval [e.f], 0 ≤ e < f ≤ β − 1, is an lcp-interval of lcp-value ℓ if the following holds:
-
e = 0 or Le < ℓ (3)
-
Lj ≥ ℓ for all j, e + 1 ≤ j ≤ f (4)
-
Lj = ℓ for at least one j, e + 1 ≤ j ≤ f (5)
-
f = β − 1 or Lf+1 < ℓ (6)
We will also use the notation ℓ − [e.f] for an lcp-interval [e.f] of lcp-value ℓ. If ℓ − [e.f] is an lcp-interval such that w = He[1…ℓ] is the longest common prefix of the suffixes He, He+1, …, Hf, then [e.f] is called the w-interval.
An lcp-interval ℓ′ − [e′.f′] is said to be embedded in an lcp-interval ℓ − [e.f] if it is a proper subinterval of ℓ − [e.f] (i.e., e ≤ e′ < f′ ≤ f) and ℓ′ > ℓ. The lcp-interval ℓ − [e.f] is then said to be enclosing [e′.f′]. If [e.f] encloses [e′.f′] and there is no interval embedded in [e.f] that also encloses [e′.f′], then [e′.f′] is called a child interval of [e.f] and [e.f] is the parent interval of [e′.f′]. We distinguish lcp-intervals from singleton intervals [e′] for any e′, 0 ≤ e′ ≤ β − 1. [e′] represents He′. The parent interval of [e′] is the smallest lcp-interval [e.f] with e ≤ e′ ≤ f.
This parent–child relationship of lcp-intervals with other lcp-intervals and singleton intervals constitutes a virtual tree which we call the lcp-interval tree for H and L. The root of this tree is the lcp-interval 0 − [0.(β − 1)]. The implicit edges to lcp-intervals are called branch-edges. The implicit edges to singleton-intervals are called leaf-edges. Additional file 1, Section 10 gives a comprehensive example illustrating these notions.
Abouelhoda et al. ([18], Algorithm 4.4) present a linear time algorithm to compute the implicit branch-edges of the lcp-interval tree in bottom-up order. When applied to a bucket of sorted suffixes, the algorithm performs a linear scan of tables H and L. In the eth iteration, 0 ≤ e ≤ β − 2, it accesses the values Le+1 and He. We have non-trivially extended the algorithm to additionally deliver leaf edges. The pseudocode, with some additions in the lines marked as new, is given in Algorithm 1 (Figure 1). We use the following notation and operations:
-
A stack stores triples (ℓ, e, f) representing an lcp-interval ℓ − [e.f]. To access the elements of such a triple, say sv, we use the notation sv.lcp (for the lcp-value ℓ), sv.lb (for the left boundary e) and sv.rb (for the right boundary f).
-
stack.push(e) pushes an element e onto the stack.
-
stack.pop pops an element from the stack and returns it.
-
stack.top returns a reference to the topmost element of the stack.
-
⊥ stands for an undefined value.
-
process_leafedge(firstedge, itv, (p, q)) processes an edge from the lcp-interval itv to the singleton interval representing the suffix rp[q…|rp|]. firstedge is true if and only if the edge is the first processed edge outgoing from itv.
-
process_branchedge(firstedge, itv, itv′) processes an edge from the lcp-interval itv to the lcp-interval itv′. The value itv′.rb is defined. firstedge is true if and only if the edge is the first edge outgoing from itv.
-
process_lcpinterval(itv) processes the lcp-interval itv. The value itv.rb is defined.
Depending on the application, we use different functions process_leafedge, process_branchedge, and process_lcpinterval.
Additional file 1, Section 4, explains why Algorithm 1 also delivers the leaf edges of the lcp-interval tree in the correct bottom-up order.
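To make the traversal concrete, here is a self-contained C sketch in the spirit of Algorithm 1 (whose actual pseudocode is given in Figure 1). It assumes the lcp-table L of a single bucket, identifies each singleton interval by its index in the bucket, and uses a fixed-size stack; all names and the callback signatures are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned long lcp, lb, rb; bool first; } Itv;

enum { MAXDEPTH = 4096 };   /* assumed bound on the nesting depth */

/* Bottom-up traversal delivering leaf edges, branch edges and
 * lcp-intervals; L[j], 1 <= j <= beta-1, is the lcp of suffixes j-1 and j. */
void bottomup(const unsigned long *L, unsigned long beta,
              void (*leafedge)(bool firstedge, const Itv *itv,
                               unsigned long leaf),
              void (*branchedge)(bool firstedge, const Itv *itv,
                                 const Itv *child),
              void (*lcpinterval)(const Itv *itv))
{
  Itv stack[MAXDEPTH], pending;
  bool have_pending = false;
  size_t top = 0;
  stack[0] = (Itv){ 0, 0, 0, true };            /* root 0-[0..beta-1] */

  for (unsigned long idx = 1; idx <= beta; idx++) {
    /* lcp between leaves idx-1 and idx; 0 flushes the stack at the end */
    unsigned long lcpvalue = (idx < beta) ? L[idx] : 0;
    unsigned long lb = idx - 1;
    if (lcpvalue <= stack[top].lcp) {           /* leaf idx-1 hangs off top */
      leafedge(stack[top].first, &stack[top], idx - 1);
      stack[top].first = false;
    }
    while (lcpvalue < stack[top].lcp) {         /* intervals end at idx-1 */
      stack[top].rb = idx - 1;
      Itv itv = stack[top--];
      lcpinterval(&itv);
      lb = itv.lb;
      if (lcpvalue <= stack[top].lcp) {         /* parent already on stack */
        branchedge(stack[top].first, &stack[top], &itv);
        stack[top].first = false;
      } else {                                  /* parent not pushed yet */
        pending = itv;
        have_pending = true;
      }
    }
    if (lcpvalue > stack[top].lcp) {            /* open a new interval */
      stack[++top] = (Itv){ lcpvalue, lb, 0, true };
      if (have_pending) {                       /* popped child re-attaches */
        branchedge(true, &stack[top], &pending);
        have_pending = false;
      } else {                                  /* leaf idx-1 is first child */
        leafedge(true, &stack[top], idx - 1);
      }
      stack[top].first = false;
    }
  }
  stack[0].rb = beta - 1;                       /* finally process the root */
  lcpinterval(&stack[0]);
}
```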
Consider a path in the lcp-interval tree from the root to a singleton interval [e′] representing He′ = rp[q…|rp|]. Let ℓ − [e.f] be an lcp-interval on this path, and consider the edge on this path outgoing from ℓ − [e.f]. If the edge goes to an lcp-interval of, say, lcp-value ℓ′, then the edge is implicitly labeled by the non-empty sequence rp[q + ℓ…q + ℓ′ − 1]. Suppose the edge goes to a singleton interval: then the edge is implicitly labeled by the non-empty sequence rp[q + ℓ…|rp| − 1]$p. If q + ℓ = |rp|, then rp[q + ℓ…|rp| − 1] is the empty string, which implies that the edge to the singleton interval is labeled by the sentinel $p. Such an edge is a terminal edge for rp. If the read offset q is 0, we call [e′] a whole-read interval for rp, and the path in the lcp-interval tree from the root to [e′] a whole-read path for rp.
Consider a suffix-prefix match 〈rp, rj, ℓ〉 such that the suffix w of rp of length ℓ has a prefix u of length k. Recall that u is the common prefix of all suffixes in the ith bucket. Due to the implicit padding of reads at their end, the symbol following w as a suffix of rp is $p. By definition, w is also a prefix of rj and the symbol in rj following this occurrence of w is different from $p. Thus, there is a w-interval [e.f] in the lcp-interval tree for H and L. [e.f] is on the path from the root-interval to the whole-read leaf interval for rj. Moreover, there is a terminal edge for rp outgoing from [e.f]. Vice versa, an lcp-interval of lcp-value ℓ on the path to the whole-read leaf interval for rj and with a terminal edge for rp identifies the suffix-prefix match 〈rp, rj, ℓ〉. This observation about suffix-prefix matches is exploited in Algorithm 2 (Figure 2), which performs a bottom-up traversal of the lcp-interval tree for H and L, collecting whole-read leaves and terminal edges for lcp-intervals of lcp-value at least ℓmin. More precisely, whenever a whole-read leaf for rp, 1 ≤ p ≤ m, is found (line 9), p is appended to the list W. With each lcp-interval itv on the stack used in the bottom-up traversal, an integer itv.firstinW is associated. The elements in W[itv.firstinW…|W|] are exactly the read numbers of whole-read leaves collected for itv. The value of itv.firstinW is set whenever the first edge outgoing from itv is detected: If the first edge outgoing from itv is a leaf-edge, no previous whole-read leaf for itv has been processed; thus |W| + 1 is the first index in list W where the whole-read leaf information (if any) for itv will be stored (see line 8). If the first edge is a branch-edge to lcp-interval itv′, then the corresponding subset of W for itv′ must be inherited by itv. Technically, this is achieved by inheriting the firstinW-attribute from itv′ to itv, see line 18 of Algorithm 2.
Whenever a terminal edge for read rp outgoing from an lcp-interval itv is processed (line 11), p is added to the list T. The first symbol of the label of the terminal edge is $p. Suppose there is a branch-edge outgoing from itv to some lcp-interval itv′. Then the first symbol, say a, of the implicit label of this edge must occur more than once. Thus it cannot be a sentinel, as the sentinels are considered pairwise different in the lexicographic ordering of the suffixes. Hence the first symbol a is either A, C, G or T. As these symbols are, with respect to the lexicographic order, smaller than the sentinels, the branch-edge from itv to itv′ appears before the terminal edge from itv. Hence the terminal edges outgoing from itv′ have already been processed in line 25, and so we only need a single list T for the entire algorithm.
As soon as all edges outgoing from itv have been processed, we have collected the terminal edges in T and the whole-read leaves in W. If itv.lcp reaches the minimum length ℓmin, Algorithm 2 computes the cartesian product of T with the appropriate subset of W and processes the corresponding suffix-prefix matches of length itv.lcp in line 25. At this point suffix-prefix matches may be output or post-processed to check for additional constraints, such as transitivity. Once the cartesian product has been computed, the elements from T are no longer needed and T is emptied (line 26). Note that the algorithm empties W once an lcp-interval of lcp-value smaller than ℓmin is processed. After this event, there will only be terminal edges from v-intervals such that the longest common prefix of v and the reads in W is shorter than ℓmin. Therefore there will be no suffix-prefix match of the form 〈_, w, ℓ〉 such that ℓ ≥ ℓmin and w is a read represented in W. So the list can safely be emptied.
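The following C sketch shows one possible shape of the three callbacks of Algorithm 2 for a single bucket; the wiring to the traversal, the detection of whole-read leaves and terminal edges, and all names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* W collects read numbers of whole-read leaves, T read numbers of terminal
 * edges; firstinW is the per-interval attribute from the text (0-based). */
typedef struct {
  unsigned long *W, *T;
  size_t nW, nT;
  unsigned long lmin;
} SpmState;

void spm_leafedge(SpmState *st, size_t *firstinW, bool firstedge,
                  bool wholeread, bool terminal, unsigned long readnum)
{
  if (firstedge)
    *firstinW = st->nW;           /* line 8: W-slice of itv starts here */
  if (wholeread)
    st->W[st->nW++] = readnum;    /* line 9: collect whole-read leaf */
  if (terminal)
    st->T[st->nT++] = readnum;    /* line 11: collect terminal edge */
}

void spm_branchedge(size_t *firstinW, bool firstedge, size_t child_firstinW)
{
  if (firstedge)
    *firstinW = child_firstinW;   /* line 18: inherit W-slice of the child */
}

void spm_lcpinterval(SpmState *st, unsigned long lcp, size_t firstinW)
{
  if (lcp >= st->lmin) {
    /* line 25: cartesian product of T and the W-slice of this interval */
    for (size_t t = 0; t < st->nT; t++)
      for (size_t w = firstinW; w < st->nW; w++)
        printf("SPM <r%lu, r%lu, %lu>\n", st->T[t], st->W[w], lcp);
    st->nT = 0;                   /* line 26: empty T */
  } else {
    st->nW = 0;                   /* lcp below lmin: W can safely go */
    st->nT = 0;
  }
}
```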
The lcp-interval tree for H and L contains β leaf-edges. As all lcp-intervals have at least two children, there are at most β − 1 branch-edges and β lcp-intervals. As each of the three functions specified in Algorithm 2 is called once for every corresponding item, the number of function calls is at most 3β − 1. Recall that Algorithm 2 is applied to each bucket and that the total size of all buckets is g. Hence there are at most 3g − d calls to the three functions. process_leafedge and process_branchedge run in constant time. The running time of process_lcpinterval is determined by the number of SPMs processed. Assuming that the processing takes constant time per SPM, the overall running time of Algorithm 2 for all buckets is O(g + z), where z is the number of processed SPMs.
Handling reverse complements of reads
Reads may originate from both strands of a DNA molecule. For this reason, suffix-prefix matches shall also be computed between reads and reverse complements of other reads. Handling the reverse complements of all reads is conceptually easy to integrate into our approach: One just has to process ℛ ∪ ℛ̄ instead of ℛ.
The three steps which involve scanning the reads are extended to process both strands of all reads. This does not require doubling the size of the read set representation, as all information for the reverse complemented reads can efficiently be extracted from the forward reads. Additional file 1, Section 5, shows how to compute the integer codes for the reversed reads from the integer codes of the forward reads in constant time.
The scan of the reverse complemented reads has a negligible impact on the runtime. Of course, the sizes of the tables S, K and PS roughly double when reverse complements are additionally considered.
When computing suffix-prefix matches, some minor modifications are necessary: Applying Algorithm 2 to ℛ ∪ ℛ̄ finds all SPMs, but among them some redundant ones, which we want to omit: an SPM 〈r, r′, ℓ〉 and its reverse complement 〈r̄′, r̄, ℓ〉 describe the same overlap. An SPM is non-redundant if and only if it satisfies one of three conditions on the orientation and order of the reads involved. For any SPM, these conditions can easily be checked in constant time, see Algorithm 3 (Additional file 1, Section 7).
Recognition of transitive and irreducible suffix-prefix matches
For the construction of the string graph, we do not need transitive SPMs. An SPM 〈r, t, ℓ″〉 is transitive if and only if there are two SPMs 〈r, s, ℓ〉 and 〈s, t, ℓ′〉 such that ℓ + ℓ′ = |s| + ℓ″ (if the suffix of r of length ℓ matches a prefix of s and the suffix of s of length ℓ′ matches a prefix of t, then r and t overlap in exactly ℓ″ = ℓ + ℓ′ − |s| symbols). Figure 3 shows a concrete example of a transitive SPM. An SPM which is not transitive is irreducible.
The following theorem characterizes a transitive SPM by the existence of a read and a single irreducible SPM satisfying a length constraint and a match constraint, see Figure 4 for an illustration.
Theorem 1. Let 〈r, t, ℓ″〉 be an SPM. Then 〈r, t, ℓ″〉 is transitive if and only if there is a read s ∈ ℛ and an irreducible SPM 〈s, t, ℓ′〉 such that ℓ′ > ℓ″, |r| − ℓ″ ≥ |s| − ℓ′ and s[1…|s| − ℓ′] = r[|r| − ℓ″ − (|s| − ℓ′) + 1…|r| − ℓ″].
The proof of Theorem 1 can be found in Additional file 1, Section 6.
If the SPM 〈r, t, ℓ″〉 is transitive and 〈s, t, ℓ′〉 is the SPM satisfying the conditions of Theorem 1, then we say that 〈r, t, ℓ″〉 is derived from 〈s, t, ℓ′〉.
Theorem 1 suggests a way to decide the transitivity of an SPM 〈r, t, ℓ″〉: Check if there is an irreducible SPM 〈s, t, ℓ′〉 from which it is derived. The check involves the comparison of up to |s| − ℓ′ symbols to verify if s[1…|s| − ℓ′] is a suffix of r[1…|r| − ℓ″]. As there may be several irreducible SPMs from which 〈r, t, ℓ″〉 may be derived, it is necessary to store the corresponding left contexts: For any sequence s and any ℓ′, 1 ≤ ℓ′ < |s|, the left context LC(s, ℓ′) of s is the non-empty string s[1…|s| − ℓ′].
Due to the bottom-up nature of the traversal in Algorithm 2, the SPMs involving the different prefixes of a given read are enumerated in order of match length, from the longest to the shortest one. Thus, Algorithm 2 delivers the irreducible SPM 〈s, t, ℓ′〉 from which 〈r, t, ℓ″〉 is possibly derived before 〈r, t, ℓ″〉 itself, because ℓ′ > ℓ″.
From Theorem 1 one can conclude that the first SPM, say 〈s, t, ℓ′〉, found on a whole-read path for t is always irreducible. Hence, one stores LC(s, ℓ′). An SPM 〈r, t, ℓ″〉 detected later while traversing the same whole-read path for t is classified as transitive if and only if LC(s, ℓ′) is a suffix of LC(r, ℓ″) (see Figure 5 for an illustration). If 〈r, t, ℓ″〉 is transitive it is discarded. Otherwise, LC(r, ℓ″) must be stored as well to check the transitivity of the SPMs found later for the same whole-read path. So each SPM is either classified as transitive, or irreducible, in which case a left context is stored. To implement this method, we use a dictionary D of left contexts, with an operation LCsearch(D, s), which returns true if there is some t ∈ D such that t is a suffix of s. Otherwise, it adds s to D and returns false. Such a dictionary can, for example, be implemented by a trie [19] storing the left contexts in reverse order. In our implementation we use a blind-trie [20]. In Additional file 1, Section 7 we present a modification of Algorithm 2 to output non-redundant irreducible SPMs only.
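As an illustration of the dictionary interface, the following C sketch implements LCsearch with a plain trie over the reversed left contexts; the blind trie [20] used in our implementation compresses unary paths but offers the same operation. All names are illustrative.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
  struct TrieNode *child[4];    /* one child per base A, C, G, T */
  bool is_end;                  /* a stored left context ends here */
} TrieNode;

static int base_index(char c)
{
  switch (c) { case 'A': return 0; case 'C': return 1;
               case 'G': return 2; default:  return 3; }
}

/* LCsearch(D, s): returns true if some t in D is a suffix of s;
 * otherwise adds s to D and returns false. */
bool lcsearch(TrieNode *root, const char *s)
{
  TrieNode *node = root;
  for (size_t i = strlen(s); i > 0; i--) {   /* walk s from right to left */
    if (node->is_end)
      return true;                           /* a stored context is a suffix */
    int b = base_index(s[i - 1]);
    if (node->child[b] == NULL)
      node->child[b] = calloc(1, sizeof(TrieNode));
    node = node->child[b];
  }
  bool found = node->is_end;                 /* s itself already stored? */
  node->is_end = true;                       /* add s to D */
  return found;
}
```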
Recognition of internally contained reads
At the beginning of the methods section we have shown how to detect reads which are prefixes or suffixes of other reads. When constructing the string graph we also have to discard internally contained reads, i.e. reads which are contained in other reads without being a suffix or a prefix. More precisely, a read r ∈ ℛ is internally contained if a read r′ ∈ ℛ exists such that r′ = urw for some non-empty strings u and w. In Additional file 1, Section 8, we show how to efficiently detect internally contained reads.
Construction of the assembly string graph
Consider a read set ℛ which is suffix- and prefix-free. The assembly string graph [8] is a graph of the relationships between the reads, constructed from the set of all non-redundant irreducible SPMs from ℛ ∪ ℛ̄, restricted to reads which are not internally contained.
For each read r ∈ ℛ the graph contains two vertices denoted by r.B and r.E, representing the two extremities of the read. B stands for begin, E stands for end.
For each non-redundant irreducible SPM 〈r, r′, ℓ〉 satisfying ℓ ≥ ℓmin, the graph contains two directed edges, defined as follows:
-
1. if r, r′ ∈ ℛ, there are two edges: r.E → r′.E and r′.B → r.B;
-
2. if r ∈ ℛ and r′ = s̄ for some s ∈ ℛ, there are two edges: r.E → s.B and s.E → r.B;
-
3. if r = s̄ for some s ∈ ℛ and r′ ∈ ℛ, there are two edges: s.B → r′.E and r′.B → s.E.
In our implementation of the string graph, vertices are represented by integers from 0 to 2m − 1. To construct the graph from the list of non-redundant irreducible SPMs, we first calculate the outdegree of each vertex. From the counts we calculate partial sums. In a second scan over the list of SPMs, we insert the edges in an array of size 2ρ, where ρ is the number of non-redundant irreducible SPMs. This strategy allows us to allocate exactly the necessary space for the edges and to access the first edge outgoing from a vertex in constant time. The array of edges is stored compactly using 2ρ(⌈log₂ 2m⌉ + ⌈log₂ λmax⌉) bits, where λmax is the maximum length of a read. ⌈log₂ 2m⌉ bits are used for the destination of an edge (the source of the edge is clear from the array index where the edge is stored). ⌈log₂ λmax⌉ bits are used for the length of the edge label.
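A minimal C sketch of this two-scan construction (without the compact bit-packing) might look as follows; the types and names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t src, dst, len; } EdgeTriple;
typedef struct { uint32_t dst, len; } Edge;

/* Build the adjacency array: first scan counts outdegrees, partial sums
 * yield the start index per vertex, second scan places the edges.
 * After the call, first[v]..first[v+1]-1 are the edges out of vertex v. */
Edge *build_edge_array(const EdgeTriple *e, size_t nedges,
                       size_t nvertices, uint64_t *first)
{
  for (size_t v = 0; v <= nvertices; v++) first[v] = 0;
  for (size_t i = 0; i < nedges; i++) first[e[i].src + 1]++;
  for (size_t v = 1; v <= nvertices; v++) first[v] += first[v - 1];

  Edge *edges = malloc(nedges * sizeof *edges);
  uint64_t *next = malloc(nvertices * sizeof *next);
  for (size_t v = 0; v < nvertices; v++) next[v] = first[v];
  for (size_t i = 0; i < nedges; i++)
    edges[next[e[i].src]++] = (Edge){ e[i].dst, e[i].len };
  free(next);
  return edges;
}
```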
To output the contigs, we first write references (such as read numbers and edge lengths) to a temporary file. Once this is completed, the memory for the string graph is deallocated, and the read sequences are mapped into memory. Finally, the sequences of the contigs are derived from the references and the contigs are output.
To verify the correctness of our string graph implementation and to allow comparison with other tools, we have implemented the graph cleaning algorithms described in [9] as an experimental feature. More sophisticated techniques, such as the network flow approach described in [8], are left for future work, as the main focus of this paper lies in the efficient computation of the irreducible SPMs and the construction of the string graph.