GapFiller is a local assembler based on a seed-and-extend schema [13]. Seed-and-extend assemblers repeatedly pick up a seed (it can be either a read or a previously assembled contig) and extend it using other reads. This procedure is realised by computing and analysing all--or almost all--the overlaps between seed's tips and the remaining available reads. The reads used for an extension are those with the highest alignment score. It is clear that the seed-and-extend assemblers' computation bottleneck is their capability to quickly cope with all the alignment scores to be determined.
GapFiller begins by storing all useful reads in a memory efficient data structure that allows to readily compute overlaps between the contig under construction and the remaining available reads. In a second phase each seed read (possibly belonging to a new set of paired reads) is selected one after the other and used to start an extension phase. Such phase halts when a stop condition is reached. Depending on the stop condition, the contig produced is labelled as trusted or not trusted (i.e., positive or negative).
Definitions
Let Σ be an alphabet and Σ* be the set of the words from Σ. For every S ∈ Σ∗ we will denote with |S| the number of characters of S and with S[p, . . ., p + l - 1] the sub-sequence of S starting in p ∈ {0, . . ., |S| - 1} and of length l ∈ {0, . . ., |S| - p}. We will refer to S[p, . . ., p + l - 1] as prefix if p = 0, suffix if p + l = |S|, and as the p- th character of S if l = 1, and we will simply write S[p].
In order to quickly identify overlaps between the contig under construction and the reads' tips, we use an approach closely related to the one presented in [20] based on an Hamming-aware hash function. The idea is that, by representing a string of length l as a base-|Σ| number, one can often replace expensive char-by-char comparison by fast integer (or bit-string) comparison. However, for practical values of l, the integers to be compared would not fit in a memory word. For this reason, as in the classical Karp-Rabin exact string matching algorithm [21], we can work with numbers modulo q considering equality modulo q only as an indication (necessary condition) that pairs of strings may be the same (i.e., operating with the strings' fingerprints). Policriti et al. in [22] proposed an extension of the approach by Karp and Rabin, introducing a technique to deal with mismatches, based on the idea of replacing simple fingerprints comparison with a more articulated test. In particular they noticed that, by choosing q to be a Mersenne (prime, when possible) number (i.e., q = 2w - 1, for some w ∈ ℕ), to check whether two strings align against each other at a small Hamming distance can be implemented in average linear time.
Given a string S ∈ Σ* and its base-|Σ| numerical representation s ∈ ℕ, let us define the hash function as
(1)
where q is a (prime) number of the form q = 2w- 1, for some w ∈ ℕ. The value f
H
(S) is called the fingerprint of the sequence in S ∈ Σ* coded with s.
In our context, the use of f
H
significantly reduces the size of the set employed in the search of the overlapping reads. Every read r, as well as its reverse-complement, is indexed by the fingerprint of a substring of length b, starting at a fixed position x in r (see also Figure 1). Formally, given a set of reads , a sequence , a maximum allowed Hamming distance k, the set of the witnesses (the Hamming sphere of radius k around , see [22] for more details), a fixed value b for the length of the substring on which the fingerprint is computed in r, and two positions x and y, the following set:
(2)
contains at least all the reads such that the hamming distance between r[x, . . ., x + b - 1] and is not greater than k. False positives can be present but, as showed in [22], their amount is limited. On this ground the search for reads overlapping can be restricted to those belonging to , for some x, y ∈ ℤ.
As far as GapFiller is concerned, we set k = 0 as default, meaning that we search for exact b-length substrings in the reads (i.e., , for some x and y). As a consequence, better quality output will be obtained if we select a position x in r such that the average base quality is expected to be the highest possible. This point will be further discussed in the section specifically addressing data structures' design and implementation.
Dataset preparation
In order to avoid the generation of wrong contigs, it is of utmost importance to use only correct reads over the entire extension phase. Several tools are available to perform error correction on Illumina data using the so-called "read spectrum" (consider QUAKE [23], Hammer [24], and Allpaths [25] just to mention the most recent ones). Other tools discard reads or try to improve their reliability using quality information (rNA [20] and QSRA [16]).
Our approach, when we are given raw data, is to first trim (and possibly filter) the reads on the ground of quality information using a specific rNA option (refer to [20] for details), and to subsequently correct them with an error correction tool like QUAKE [23].
Another important way to assess a dataset's global quality is to plot the reads' k-mers distribution. This can be easily done using Jellyfish [26]. If the genome has been sequenced tens of times, then two peaks are expected: one in correspondence of the expected coverage and one in correspondence of coverage one. k-mers composing this second peak are likely to be sequencing errors. As a rule of thumb, a low number of k-mers occurring only once suggests that the dataset has a good global quality.
Contig extension
In the contig extension phase, each read is selected in a loop and used as seed in order to create a new contig. Once a seed read is selected, the suffix-prefix overlaps with other reads are computed and, if a sufficiently high level of global similarity is reached, they are clustered in a consensus string, which is subsequently used to perform further extensions. The procedure continues while some overlapping reads exist and the consensus string is highly representative of the clustered reads. If either one of the previous two conditions is not met, the extension phase stops, the current sequence is returned in output, and the loop continues.
Before the extension phase some parameters are set: the minimum overlap length L and the maximum shift Δ: an overlap between the current contig's suffix and the read's prefix is considered only if the overlap length l belongs to the interval [L, L + Δ].
GapFiller builds a cluster every time a contig is to be extended with the overlapping reads. In particular, GapFiller uses only those reads aligning against the contig's suffix with at most δ mismatches (where δ = δ(l) is a function of the overlap length l) and requires at least m reads in order to compute a consensus string. Notice that b ≤ L ≤ l holds, hence suffix-prefix overlaps might occur with more than k = 0 mismatches (see section Definitions).
Let be the set of the input reads for GapFiller and be a seed read. At step i = 0 the current sequence is initialized with the seed S0 := r0. Denoting by S
i
the current contig at the generic i-th step of the algorithm, the procedure to build Si+1is described below:
Step1 Reads are selected according to their similarity with the current contig S
i
(see Figure 2a). At this point, every read overlapping S
i
for l ∈ [L, L + Δ] characters with at most δ mismatches is selected.
Step2 The reads are clustered and a consensus string is computed. Every character of the consensus string is assigned a flag indicating how it is representative of the reads from which it is built. More precisely, for every position j, GapFiller selects the most occurring character in the considered reads, and the majority consensus string C is computed (see Figure 2b). Depending on two parameters T1 and T2 such that T1 <T2, we say that a position j is non-represented, low-represented, or high-represented if the representation rate of the corresponding character in C is lower than T1, lower than T2, or higher than T2, respectively.
Step3 The reads used to build the consensus C are filtered and trimmed, depending on the presence of low-represented and non-represented positions, respectively. The idea is that on low-represented positions we need a minimum percentage of reads matching the consensus string, and that on non-represented positions the extension is considered to be unsafe. Reads differing from C in correspondence of low-represented positions are discarded and the remaining ones are also trimmed if a non-represented position occurs (see Figure 2c).
Step4 A new consensus string C
new
is computed, considering only the reads obtained at Step 3, and possibly the current contig is extended (see Figure 2d). The extension is done only if the number of reads is at least m and the consensus C
new
exceeds S
i
's right end: in this case, a new contig Si+1is built and the procedure restarts. Otherwise the algorithm stops and the contig S
i
is returned.
The adopted strategy is aimed at either avoiding errors and overcoming the problems arising when GapFiller attempts to cluster reads that are different from each other. In the last part of this section we will discuss in more detail how the algorithm works. The reader who is not interested in the technical formalism might skip this part and move directly to the Subsection Stop criteria.
Step 1. Overlapping reads selection
Let us denote with the set of the putative overlapping reads with respect to the l-suffix of S
i
, selected by their fingerprint values (see (2), with y = |S| - l + x, for some values of x ∈ {0, . . ., l - b}). For every fixed value of l, the set of the reads overlapping the l-suffix of S
i
with at most δ mismatches is defined as
(3)
where d
H
: Σl × Σl → ℝ+ is the Hamming distance. The set of all the overlapping reads will be denoted by
(4)
Given a read we define its starting and ending positions as
(5)
I(r) and F(r) represent the position of the read r with respect to the current contig S
i
, therefore we set I(S
i
) = 0. For instance, in the case depicted in Figure 2, we have and , I(r1) = 10 and F(r1) = 20.
Step 2. Reads clustering and consensus string computation
The subsequent phase consists of the computation of the consensus string obtained from the set of reads (see (4)). Notice that, in order to compute reliable extensions, we require the number of reads to be at least m, a parameter that may depend on the dataset used. If there exists no l such that the l-suffix of S
i
is covered by at least m reads of , then the procedure stops. Otherwise, the starting and ending positions of the consensus string C with respect to S
i
can be computed, thanks to (5). In practice, we let the consensus string start from the leftmost reads, i.e., those covering the longest suffix of S
i
(see, for instance, the read r2 in Figure 2) and end at the rightmost position in which the number of reads is at least m. More precisely, the starting and ending positions of C are defined as
respectively. If F(C) > |S
i
|-1 the procedure continues, otherwise it stops as S
i
cannot be further extended. Looking at Figure 1 we have I(C) = 9 and F(C) = 21 and the procedure continues since F(C) >F (Si+1) = 17.
The consensus string C is then computed by selecting the most represented character at every position. For every X ∈ Σ and for every j = I(C), . . ., F(C) we define the number of occurrences of the character X in position j with respect to S
i
as
The consensus string C is defined, for every j = I(C), . . ., F (C), by setting C[j - I(C)] equal to the highest occurring character, i.e., the X ∈ Σ with the highest number of occurrences in position j
Loosely speaking, the character selected on a particular position of the consensus string is the most occurring character in the reads on that position; hence σ(C[j - I(C)], j) is the number of occurrences of character C[j - I(C)] on position j.
Step 3. Consensus-based reads selection
As above mentioned, in order to check, on the one hand, whether a read r is highly representative of the consensus C and, on the other hand, if the extension is "safe", it is important to introduce the notion of non-represented, low-represented, and high-represented characters in the consensus string. We simply define the representation rate of the position j as
(6)
Hence we fix two threshold values T1 and T2 such that 0.25 ≤ T1 <T2 < 1 (notice that π (j) ∈ [0.25, 1] as |Σ| = 4) and we distinguish three types of positions in the consensus string:
The idea is to discard those reads that "differ from C” and to cut them out, as there is not sufficiently high evidence that GapFiller is extending correctly. In practice, we do not consider a read r if it does not match the consensus string on a low-represented position, i.e., r[j -I(r)] ≠ C[j - I(C)], for some j such that π (j) ≤ T2. Clearly, this applies to non-represented positions as well. Then, we trim every read overlapping any non-represented position of C. More precisely, if j
not
is the first non-represented position occurring in r (i.e., π (j
not
) ≤ T1), we consider r[0, . . ., j
not
- I(r) - 1] instead of r.
After unsafe reads are discarded and the remaining ones are trimmed, a new set of reads, that can be denoted by , is finally obtained (see Figure 2c). Every read in is both matching the consensus string C on each low-represented position and not covering any non-represented one. Using this mechanism we take into account only the most representative reads and do not extend the contig with a consensus character when its representation rate is too low.
Step 4. Final consensus string computation and contig update
After previous step, the new set of overlapping reads is obtained. A new consensus string C
new
can be computed as C was before. If F(C
new
) > |S
i
| - 1 the extension is performed, the current contig is updated
and the (i + 2)-th extension phase restarts from Si+1.
Stop criteria
The algorithm described in the previous section may potentially extend a contig for an arbitrarily large number of times, without checking any "global" properties of the current sequence. With our method the extension phase halts if at least one of the following conditions is met: (i) the available overlapping reads for the consensus C are less than m; (ii) the available overlapping reads for the new consensus C
new
are less than m; (iii) contig's length exceeds the maximum length; (iv) the seed-mate has been found.
Let S
i
be the contig obtained at the i-th step, starting from the seed read r0. Criterion (i) applies when the consensus string C does not exceed the current contig. This means that there are no more than m - 1 overlapping reads, or that they are too short. In such a case, the contig produced is labelled as NO_MORE_EXTENSION.
Criterion (ii) applies when the consensus may have been produced as consequence of the presence of reads belonging to different genomic locations. More precisely, this situation is likely to appear when the consensus extension is "trying" to exit from a repeat. In this case, either too many reads are discarded (due to the presence of low-represented positions) or a significant trimming of them has been performed (as some non-represented positions occur far before the end of the consensus). In such a situation, the extension is halted and the contig is labelled as REPEAT_FOUND.
Criterion (iii) is satisfied as |Si+1| >Lmax, where Lmax is fixed at the beginning of the algorithm and is usually set to the maximum insert size, plus a tolerance value. In such a situation, we could have been able to continue the extension but, however, we could not find the seed-mate. This suggests that the contig produced may be wrong or, at least, that it contains a high number of unreliable bases. When the maximum allowed length is exceeded, the computation is halted and the contig, labelled as LENGTH_EXCEED, is returned.
Criterion (iv) is used to stop the extension as the mate of the seed r0 is found. At the generic i-th step, every is checked to see whether the following condition is satisfied
(7)
where M is the maximum number of mismatches allowed between and S
i
. Inequality (7) is satisfied if and only if the mate is found in S
i
at position p with no more than M mismatches. This control is performed on-the-fly and hence the positions already checked at the i-th step will not be re-checked. The mate-check criterion is used as a guarantee of correctness of the whole contig. This is in contrast to previous criteria, which are used to detect and prevent errors introduced in the extension phase. From this point of view, criteria (i) and (ii) can be seen as strictly local, since no information collected during previous steps is used. In this last case the contig returned is labelled as MATE_FOUND.
Data structures
In this section we will take a closer look at the data structures designed for our algorithm and at their implementation. GapFiller's core is the module working during the extension phase. At this point, we assume that the set has already been trimmed and possibly filtered.
The basic idea is to pre-compute as much as possible of the useful information on the reads, in order to speed up the computation of the overlaps needed to perform the extension phase. Suppose that GapFiller is working at the (i+1)-th step of an extension, with i ≥ 0, and let S
i
be the current contig. When constructing the consensus string C (see Figure 2a) we are always interested in obtaining overlaps between suffixes of S
i
and prefixes of reads belonging to .
In order to compute overlaps, GapFiller employs a hashing schema based on the one implemented in rNA [20]; in particular, a data structure similar to the one proposed in [22] is built. A simplified schema of GapFiller's data structure is presented in Figure 3. The basic idea behind GapFiller is the possibility to obtain in a fast and efficient way the set of reads whose prefixes overlap a suffix of the partial contig under construction. Therefore we used the rNA hash function to find reads that are likely to overlap a suffix of S
i
; those reads are subsequently checked to see if they actually overlap S
i
or not.
Obviously, all the data must be stored in the main memory, thus requiring a careful data structures' engineering. It is clear that, since overlaps between reads and the current contig can take place on both strands, reads must be stored together with their reverse complement.
With the goal to save as much memory as possible, reads are represented as arrays of integers, so that a base needs 2 bits instead of 8 (A→00, C→01, G→10, T→11). The data structure used to compute overlaps and to construct contigs is built from the reads. Three arrays are used to represent in a compact way the reads stored in ℛ and to compute overlaps among them:
-
1.
HASHcounter: it is an array of pointers to HASHvalues. In position i it stores the first position in HASHvalues such that a read r or its reverse complement has a prefix whose fingerprint is i.
-
2.
HASHvalues: each array entry stores the read's location in the array Reads together with a boolean value indicating whether the fingerprint has been computed from the original read or from its reverse complement. For this reason the size of HASHvalues is twice the number of reads in ;
-
3.
Reads: this array stores the reads and other useful informations, like paired read location, paired read order (first or second in a pair), and read status (used, not used, etc.).
The overall memory requirement for GapFiller depends on the size of HASHcounter and on the number of reads. As for rNA, a reasonable value for q is 230 -1. Such a number guarantees a reduction of the number of false positives (i.e., reads reported to align with the contig suffix, even though they do not overlap with it). As far as the number of reads is concerned, we can limit q, without loss of generality, to 231: with state-of-the-art Illumina technology, such a number of reads represents approximately a 70× coverage of the human genome. An Illumina read of length 100 bp requires two memory locations in HASHvalues of 4 bytes each (31 bits to access array Reads and one bit to store the overlap orientation) and one entry in Reads of 9 bytes (7 bytes to store the read's numerical representation, one to store the mate position in Reads, and one more byte to store several useful informations about read status). In total the amount of memory required is bytes.
The reads' fingerprint is computed on a precise substring of length b (see (2)). As pointed out in section Definitions, the fingerprint of should be computed on the position x such that the (expected) average base quality is as high as possible and the substring r[x, . . ., x + b - 1] falls into the contigs' suffix, independently on the overlap length l. For these two reasons, having the Illumina error-profile in mind, we choose x = 0 if r is considered on its original strand, x = L - b if r has been reverse-complemented (see Figure 4).
In order to compute the overlaps between the current contig S
i
and the reads, one has to compute the fingerprints of the substrings of length b starting from y, for every y ∈ {|S
i
| - L - Δ, . . ., |S
i
| - L} if original-stranded reads are searched, and for every y ∈ {|S
i
| - Δ - b, . . ., |S
i
| - b} if reverse-complemented ones are to be extracted. Let us indicate with s
y
the fingerprint computed from S
i
[y, . . ., y + b - 1] (see Figure 4). GapFiller uses this number to retrieve reads whose l-length prefix (l = |S
i
| - y for original-stranded reads, l = |S
i
| - y + L - b for reverse-complemented ones) is likely to match a substring of S
i
close to the sequence's end. In particular GapFiller accesses all HASHvalues positions between HASHcounter[s
y
] and HASHcounter[s
y
+ 1] and, subsequently, accesses Reads to identify the set of candidate overlapping sequences (in Figure 3 GapFiller scans all positions between k and r - 1 of HASHvalues). Finally, the set is used to compute , the set of real overlapping reads. This is done by checking all candidate reads singularly. Due to the fact that only a limited number of mismatches is allowed in this phase and that the employed hash function guarantees a low false positive rate, this step is extremely fast.