As shown in Figure 1, HECTOR consists of three stages: a k-hopo encoding stage, a homopolymer spectrum construction stage and an error correction stage. First, HECTOR encodes each read as a string of homopolymer pairs, where each homopolymer pair is encoded as a byte. Second, all of the encoded reads are used to construct the homopolymer spectrum, for which HECTOR counts the number of occurrences of all non-unique k-hopos using a combination of a Bloom filter [24] and a hash table. The homopolymer spectrum is defined as the set of all k-hopos in the dataset; k-hopos whose multiplicity exceeds a certain coverage cut-off are considered trusted, and all others untrusted. A k-hopo is a substring of k homopolymers, in which each homopolymer is encoded as a single byte. HECTOR automatically estimates the coverage cut-off from the coverage histogram of all non-unique k-hopos. Finally, in the error correction stage, HECTOR applies three techniques, namely two-sided conservative correction, one-sided aggressive correction and voting-based refinement. In addition, HECTOR uses multi-threading to benefit from the compute power of common multi-CPU systems. HECTOR is written in C++ and uses a combination of Pthreads and OpenMP for parallelization.
k-hopo definition and encoding
HECTOR represents a homopolymer as a pair (R, X), where R is the homopolymer length and X ∈ Σ = {A, C, G, T, N}, and encodes a read as a string of homopolymer pairs. For example, the read S = AACCCCCGGGT is encoded as (2, A) (5, C) (3, G) (1, T), and the 3-hopos of S are (2, A) (5, C) (3, G) and (5, C) (3, G) (1, T). HECTOR represents each homopolymer pair as a single byte with the value 128 × L + |Σ| × R + B − 4, where B is the encoded nucleotide (A = 0, C = 1, G = 2, T = 3, N = 4), L is the case flag (1 if the nucleotide is lowercase, 0 otherwise) and |Σ| = 5. Because of the one-byte encoding, the value of R is restricted to the range 1 to 25.
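To illustrate the encoding, the following C++ sketch (our own illustration rather than HECTOR's source; the function names encodePair and encodeRead are hypothetical) converts a read into byte-encoded homopolymer pairs using the formula above.

    #include <algorithm>
    #include <cctype>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Map a nucleotide to its code: A = 0, C = 1, G = 2, T = 3, N = 4.
    static int baseCode(char c) {
        switch (std::toupper(static_cast<unsigned char>(c))) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 4; // N or any other character
        }
    }

    // Encode one homopolymer pair (R, X) as a byte: 128*L + 5*R + B - 4,
    // where L = 1 for lowercase bases and R is capped at 25 to fit in a byte.
    static std::uint8_t encodePair(int runLength, char base) {
        const int L = std::islower(static_cast<unsigned char>(base)) ? 1 : 0;
        const int R = std::min(runLength, 25);
        const int B = baseCode(base);
        return static_cast<std::uint8_t>(128 * L + 5 * R + B - 4);
    }

    // Encode a read as a string of byte-encoded homopolymer pairs.
    std::vector<std::uint8_t> encodeRead(const std::string& read) {
        std::vector<std::uint8_t> encoded;
        for (std::size_t i = 0; i < read.size();) {
            std::size_t j = i;
            while (j < read.size() &&
                   std::toupper(static_cast<unsigned char>(read[j])) ==
                       std::toupper(static_cast<unsigned char>(read[i])))
                ++j;
            encoded.push_back(encodePair(static_cast<int>(j - i), read[i]));
            i = j;
        }
        return encoded;
    }

    int main() {
        // S = AACCCCCGGGT -> (2, A) (5, C) (3, G) (1, T) -> bytes 6 22 13 4
        for (std::uint8_t b : encodeRead("AACCCCCGGGT"))
            std::cout << static_cast<int>(b) << ' ';
        std::cout << '\n';
    }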
Homopolymer spectrum construction
HECTOR encodes all reads as strings of homopolymer pairs. The homopolymer spectrum is constructed from all of the encoded reads where the counting of k-hopos is performed using Bloom filters and hash tables and is parallelized using multi-threading. In our implementation, the default value of k is 21. This value is obtained empirically and is adapted from Musket [21].
The first stage filters out as many unique k-hopos as possible using a Bloom filter and stores all non-unique k-hopos in a hash table. However, some unique k-hopos are still likely to end up in the hash table because of the false positive probability of the Bloom filter. In this stage, the masters fetch reads in parallel from the input file, with accesses to the file made mutually exclusive by locks, and then distribute all the k-hopos in the reads to the slaves. Hashing is used to assign each k-hopo to its destination slave in order to ensure a good load balance. Once the destination slave is determined, the canonical k-hopo is transferred to the slave through the corresponding communication channel. The canonical k-hopo is the smaller numerical representation of the k-hopo and its reverse complement. Each slave holds a local Bloom filter and a local hash table to capture all non-unique k-hopos and to filter out most unique k-hopos. When a k-hopo arrives, the slave performs a membership lookup for it in the local Bloom filter. If the queried k-hopo already exists in the Bloom filter, the slave inserts it into the local hash table, because it is likely to have more than one occurrence; otherwise, it is inserted into the Bloom filter. At the end of this stage, all non-unique k-hopos are stored in the local hash tables of the slaves, and the number of unique k-hopos that still remain in the hash tables depends on the false positive rate of the local Bloom filters. No synchronization between the slave threads is required in this stage because the local Bloom filters and hash tables are independent, which greatly benefits efficiency.
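The per-slave filtering logic can be sketched as follows. This is a simplification, not HECTOR's implementation: a std::string canonical k-hopo and std::unordered_map stand in for the byte-encoded k-hopos and HECTOR's own hash table, a toy two-hash Bloom filter replaces the real one, and the master-to-slave communication channels are omitted.

    #include <bitset>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>

    // Toy Bloom filter over canonical k-hopos (two hash functions, fixed size).
    class BloomFilter {
    public:
        bool contains(const std::string& khopo) const {
            return bits_[h1(khopo)] && bits_[h2(khopo)];
        }
        void insert(const std::string& khopo) {
            bits_[h1(khopo)] = true;
            bits_[h2(khopo)] = true;
        }
    private:
        static constexpr std::size_t kSize = 1 << 20;
        std::size_t h1(const std::string& s) const {
            return std::hash<std::string>{}(s) % kSize;
        }
        std::size_t h2(const std::string& s) const {
            return (std::hash<std::string>{}(s) * 0x9e3779b97f4a7c15ULL) % kSize;
        }
        std::bitset<kSize> bits_;
    };

    // First-pass logic of one slave: a k-hopo already recorded in the Bloom
    // filter is assumed to be non-unique and is inserted into the hash table;
    // otherwise it is only recorded in the Bloom filter.
    void firstPass(const std::string& canonicalKhopo,
                   BloomFilter& bloom,
                   std::unordered_map<std::string, std::uint32_t>& table) {
        if (bloom.contains(canonicalKhopo)) {
            table.emplace(canonicalKhopo, 0); // multiplicity is counted in the second pass
        } else {
            bloom.insert(canonicalKhopo);
        }
    }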
The second stage computes the multiplicity of each k-hopo in the hash table in order to identify the unique k-hopos that are still stored there. In this stage, the masters follow the same procedure as in the previous stage, and all slaves listen to their corresponding communication channels and wait for the arrival of k-hopos. When a k-hopo arrives, the slave queries the local hash table for it and increments its multiplicity if it exists. At the end of this stage, each slave holds the multiplicity of every k-hopo stored in its hash table, which is used to determine whether a k-hopo is unique.
The third stage removes all unique k-hopos from the hash tables, leaving all non-unique ones in place. In this stage, each slave deletes all unique k-hopos from its hash table while the masters are idle. At the end of this stage, each slave holds a partition of the set of all non-unique k-hopos in the input reads.
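A minimal sketch of the second and third stages on one slave's local table, under the same simplifying assumptions as the first-pass sketch above:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Second stage (sketch): increment the multiplicity of a k-hopo only if it
    // is already present in the local hash table, i.e. it passed the Bloom filter.
    void secondPass(const std::string& canonicalKhopo,
                    std::unordered_map<std::string, std::uint32_t>& table) {
        auto it = table.find(canonicalKhopo);
        if (it != table.end()) ++it->second;
    }

    // Third stage (sketch): remove the k-hopos that turned out to be unique
    // after all, leaving only the non-unique ones in the local partition.
    void thirdPass(std::unordered_map<std::string, std::uint32_t>& table) {
        for (auto it = table.begin(); it != table.end();) {
            if (it->second <= 1) it = table.erase(it);
            else ++it;
        }
    }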
The last stage determines the coverage cut-off for the homopolymer spectrum from the k-hopo coverage histogram. A k-hopo coverage histogram shows a mixture of two distributions: one for the coverage of likely correct k-hopos and the other for spurious k-hopos. The k-hopos to the right of the valley are assumed to be true k-hopos (trusted) and those on the opposite side to be spurious (untrusted). A homopolymer spectrum differs from a k-mer spectrum in that each k-hopo counted at a given multiplicity on the x-axis corresponds to a range of actual sequence lengths. A comparison between the homopolymer spectrum and the k-mer spectrum of the SRR000868, SRR000870 and SRR639330 reads is shown in Figure 2. In Illumina datasets, where the reads have equal length, the distribution of the k-mer spectrum is normally bimodal. However, owing to the irregular read lengths of 454 datasets, the distribution of the k-mer spectrum tends to be unimodal, whereas the distribution of the homopolymer spectrum is normally bimodal. Therefore, by performing the spectrum construction in homopolymer space, the resulting homopolymer spectrum is analogous to the k-mer spectrum in Musket. As seen in Figure 2a and b, the homopolymer spectra of the SRR000868 and SRR000870 datasets are similar because they were sequenced in the same experiment and have similar coverage (11.7× and 11×, respectively). Figure 2c shows that the SRR639330 dataset has a different homopolymer spectrum because it has higher coverage (69.8×) and was sequenced in a different experiment. In general, however, all the homopolymer spectra are bimodal. HECTOR adopts the coverage cut-off selection method of Musket by choosing the multiplicity corresponding to the smallest density around the valley as the coverage cut-off. As shown in Figure 2, the cut-off values for the SRR000868, SRR000870 and SRR639330 datasets are 3, 3 and 6, respectively; these are the multiplicities with the smallest density around the valley in the homopolymer spectrum of each dataset, with densities of 91,923, 100,408 and 38,081, respectively. In addition, HECTOR provides a parameter that allows users to specify the cut-off.
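The cut-off selection can be sketched as follows. This is a simplified valley search over an already-computed histogram; HECTOR's exact smoothing and valley-detection procedure may differ, and estimateCutoff is a hypothetical name.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Given a k-hopo coverage histogram (hist[m] = number of distinct k-hopos
    // with multiplicity m), pick the multiplicity with the smallest density
    // around the valley between the spurious peak and the peak of correct k-hopos.
    std::size_t estimateCutoff(const std::vector<std::uint64_t>& hist) {
        if (hist.size() < 2) return 1;
        // Locate the first local minimum after the initial (error) peak.
        std::size_t valley = 1;
        for (std::size_t m = 2; m + 1 < hist.size(); ++m) {
            if (hist[m] <= hist[m - 1] && hist[m] <= hist[m + 1]) {
                valley = m;
                break;
            }
        }
        // Choose the multiplicity with the smallest density in a small window
        // around the valley.
        std::size_t best = valley;
        const std::size_t lo = (valley > 2) ? valley - 2 : 1;
        const std::size_t hi = std::min(valley + 2, hist.size() - 1);
        for (std::size_t m = lo; m <= hi; ++m)
            if (hist[m] < hist[best]) best = m;
        return best;
    }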
Error correction
Figure 3 shows how sequencing errors, i.e. insertions, deletions and substitutions, are represented in the context of k-hopos, given a k-hopo S_k. Because k-hopos are represented in the (R, X) format described in the previous section, a homopolymer-length error manifests as a difference in R. Therefore, whatever the type of error, only a single homopolymer pair is actually changed, which is why the identification of errors in our method can be very efficient. We also ignore error corrections in the first or the last homopolymer of a read. This is because the actual polynucleotide in the reference genome can be much longer than the homopolymer at the beginning or end of a read, so it is impossible to tell whether the read should have been longer or shorter by a single base or two.
The examples in Figure 3 are not the only conceivable scenarios of substitution and run-length errors; there are others that could change k-hopos more drastically. We argue in Additional file 1 why these scenarios generally do not occur in 454 data.
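For illustration only, the byte encoding given earlier can be inverted and an observed homopolymer pair compared with an alternative to classify the kind of difference. The names decodePair and classify are hypothetical and this is our own sketch, not part of HECTOR's interface.

    #include <cstdint>
    #include <utility>

    // Decode a byte back to (R, X): the inverse of 128*L + 5*R + B - 4.
    // The case flag L is dropped here for simplicity.
    std::pair<int, char> decodePair(std::uint8_t byte) {
        const int v = (byte & 0x7F) + 4; // strip the case flag, undo the "- 4"
        const int R = v / 5;
        const int B = v % 5;
        static const char bases[] = {'A', 'C', 'G', 'T', 'N'};
        return {R, bases[B]};
    }

    // Classify the difference between an observed homopolymer pair and an
    // alternative: a run-length (indel) error changes only R, while a
    // substitution of an isolated base changes X with R staying 1.
    enum class Diff { None, RunLength, Substitution, Other };

    Diff classify(std::uint8_t observed, std::uint8_t alternative) {
        const auto [r1, x1] = decodePair(observed);
        const auto [r2, x2] = decodePair(alternative);
        if (r1 == r2 && x1 == x2) return Diff::None;
        if (x1 == x2)             return Diff::RunLength;
        if (r1 == 1 && r2 == 1)   return Diff::Substitution;
        return Diff::Other;
    }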
The error correction phase of HECTOR consists of three stages: two-sided conservative correction, one-sided aggressive correction and voting-based refinement.
Two-sided conservative correction
In the two-sided correction we assume that there is at most one error in any k-hopo of a read; this assumption is later relaxed in the one-sided aggressive correction. The two-sided correction starts with the classification of trusted and untrusted bases in a read. A base is considered trusted if it is covered by at least one trusted k-hopo; otherwise, it is considered untrusted and treated as a potential homopolymer-length error. A sequencing error occurring at position i of a read of l bases causes up to min{k, i, l − i} erroneous k-hopos. HECTOR evaluates only the leftmost and the rightmost k-hopos that cover position i, instead of all possible k k-hopos covering that position, which significantly improves speed and avoids high computational overhead. For each base, a correction is made only if a single alternative is found that makes both the leftmost and the rightmost k-hopos trusted; if more than one such alternative is found, the base is kept unchanged because of the ambiguity. For a substitution error, the alternative is a homopolymer of length 1 with a different nucleotide. For an indel error, the alternative allows the deletion or insertion of up to 3 bases of the same nucleotide. For each read, the two-sided correction is executed for a fixed number of iterations or until no further base change is made.
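A sketch of the two-sided check for a single untrusted position is given below. The spectrum lookup and the set of candidate alternatives are abstracted behind caller-supplied arguments, so this illustrates the selection logic under our own simplifying assumptions rather than HECTOR's implementation; twoSidedCorrect and the other names are hypothetical.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    using Khopo = std::vector<std::uint8_t>; // a window of k byte-encoded homopolymers

    // Two-sided check for one untrusted position (assumes k <= read.size() and
    // pos < read.size()): an alternative is accepted only if it makes both the
    // leftmost and the rightmost k-hopo covering the position trusted, and only
    // if it is the unique such alternative (otherwise the position is ambiguous).
    // isTrusted abstracts the homopolymer-spectrum lookup; the candidate
    // alternatives (length changes of up to +/-3, or a length-1 substitution)
    // are supplied by the caller in this sketch.
    std::optional<std::uint8_t> twoSidedCorrect(
        const std::vector<std::uint8_t>& read, std::size_t pos, std::size_t k,
        const std::vector<std::uint8_t>& alternatives,
        const std::function<bool(const Khopo&)>& isTrusted) {

        const auto khopoWith = [&](std::size_t start, std::uint8_t alt) {
            Khopo w(read.begin() + start, read.begin() + start + k);
            w[pos - start] = alt;
            return w;
        };

        const std::size_t leftmost  = (pos + 1 >= k) ? pos + 1 - k : 0;
        const std::size_t rightmost = std::min(pos, read.size() - k);

        std::optional<std::uint8_t> accepted;
        for (std::uint8_t alt : alternatives) {
            if (isTrusted(khopoWith(leftmost, alt)) &&
                isTrusted(khopoWith(rightmost, alt))) {
                if (accepted) return std::nullopt; // more than one alternative: ambiguous
                accepted = alt;
            }
        }
        return accepted; // empty if no alternative makes both k-hopos trusted
    }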
One-sided aggressive correction
Since the two-sided conservative correction assumes that there is at most one error in a single k-hopo, it is inadequate when more than one error occurs in a single k-hopo. Therefore, we use a one-sided correction to aggressively correct errors in these cases. The core idea is similar to the one described in [21], but we look for an insertion or deletion alternative instead of a substitution alternative. Given a read H, let H_i denote the base at position i of H and H_{i,k} denote the k-hopo starting at position i. If H_{i,k} is trusted but H_{i+1,k} is untrusted, the homopolymer H_{i+k} is likely to be a sequencing error. Our one-sided correction aims to find an alternative of H_{i+k} that yields a trusted H_{i+1,k}. Unlike the two-sided correction, if more than one alternative is found, this correction selects the alternative that gives the resulting trusted k-hopo H_{i+1,k} the largest multiplicity.
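The selection of the alternative with the largest multiplicity can be sketched as follows, with the hash-table lookup and the candidate alternatives again abstracted behind caller-supplied arguments; oneSidedCorrect is a hypothetical name.

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    // One-sided correction step (sketch, assumes i + 1 + k <= read.size()):
    // H_{i,k} is trusted but H_{i+1,k} is not, so the homopolymer at position
    // i + k is suspected to be erroneous. Among the alternatives that make
    // H_{i+1,k} trusted, the one whose resulting k-hopo has the largest
    // multiplicity is chosen. The multiplicity lookup (returning 0 for
    // untrusted k-hopos) and the candidate alternatives are supplied by the caller.
    std::optional<std::uint8_t> oneSidedCorrect(
        const std::vector<std::uint8_t>& read, std::size_t i, std::size_t k,
        const std::vector<std::uint8_t>& alternatives,
        const std::function<std::uint32_t(const std::vector<std::uint8_t>&)>& multiplicity) {

        std::optional<std::uint8_t> best;
        std::uint32_t bestCount = 0;
        for (std::uint8_t alt : alternatives) {
            // Build H_{i+1,k} with the homopolymer at position i + k replaced by alt.
            std::vector<std::uint8_t> candidate(read.begin() + i + 1,
                                                read.begin() + i + 1 + k);
            candidate[k - 1] = alt;
            const std::uint32_t count = multiplicity(candidate);
            if (count > bestCount) {
                bestCount = count;
                best = alt;
            }
        }
        return best; // empty if no alternative yields a trusted H_{i+1,k}
    }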
The one-sided correction begins by locating the trusted regions of a read. A trusted region is a region in which every base is trusted and therefore has a minimum length of k. For each trusted region, error corrections are conducted in both orientations along the read. For each orientation, the error correction process continues until it either reaches a neighbouring trusted region or fails to correct the current base.
This correction approach is effective but has the drawback that it relies strictly on the correctness of the k-hopo H_{i,k}. If H_{i,k} does contain sequencing errors but is deemed trusted, the one-sided correction can cause cumulative incorrect corrections, mutating a series of correct bases into incorrect ones. Look-ahead validation and voting-based refinement are used to reduce this adverse effect. The look-ahead validation evaluates the trustworthiness of a predefined maximal number of neighbouring k-hopos that cover the base position at which a sequencing error is likely to have occurred. If all evaluated k-hopos are trusted for a certain alternative at that position, this alternative is retained as a potential correction. The voting-based refinement is described below. There is also a constraint on the number of corrections allowed in any k-hopo of a read, with a default value of 4. During correction, we track the number of corrections that have been made in any k-hopo; if the number of corrections in a k-hopo exceeds this constraint, all corrections made in that k-hopo are discarded.
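A simplified version of the look-ahead validation might look like the following sketch, where the maximal number of neighbouring k-hopos to evaluate and the trust check are parameters and the window arithmetic mirrors the two-sided sketch above; lookAheadValidate is a hypothetical name.

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Look-ahead validation (sketch): before accepting an alternative at
    // position pos, check that up to maxLookAhead neighbouring k-hopos covering
    // that position are all trusted once the alternative is applied. isTrusted
    // abstracts the homopolymer-spectrum lookup.
    bool lookAheadValidate(const std::vector<std::uint8_t>& read, std::size_t pos,
                           std::size_t k, std::uint8_t alt, std::size_t maxLookAhead,
                           const std::function<bool(const std::vector<std::uint8_t>&)>& isTrusted) {
        std::vector<std::uint8_t> patched(read);
        patched[pos] = alt;
        std::size_t checked = 0;
        // Walk over the k-hopos that cover pos, up to the look-ahead limit.
        std::size_t start = (pos + 1 >= k) ? pos + 1 - k : 0;
        for (; start <= pos && start + k <= patched.size() && checked < maxLookAhead;
             ++start, ++checked) {
            const std::vector<std::uint8_t> window(patched.begin() + start,
                                                   patched.begin() + start + k);
            if (!isTrusted(window)) return false; // the alternative does not hold up
        }
        return checked > 0;
    }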
Voting-based refinement
The voting-based refinement is similar to the voting algorithm originally used in DecGPU [16]. The voting algorithm attempts to find the correct base by substituting all possible bases at each position of the k-hopo and checking the solidity of the resulting k-hopos. Although this approach does not correct as many errors as other correctors, it introduces the fewest new errors; its objective here, however, is precisely to reduce the number of new errors introduced by the one-sided aggressive correction.
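The vote collection for one position can be sketched as follows. As in the sketches above, the alternatives and the solidity (trust) check are abstracted, the majority decision over the returned counts is left to the caller, and collectVotes is a hypothetical name.

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Vote collection for one position (sketch): for every k-hopo window
    // covering the position, try each alternative and, if the resulting k-hopo
    // is trusted (solid), cast a vote for that alternative.
    std::vector<std::uint32_t> collectVotes(
        const std::vector<std::uint8_t>& read, std::size_t pos, std::size_t k,
        const std::vector<std::uint8_t>& alternatives,
        const std::function<bool(const std::vector<std::uint8_t>&)>& isTrusted) {

        std::vector<std::uint32_t> votes(alternatives.size(), 0);
        std::size_t start = (pos + 1 >= k) ? pos + 1 - k : 0;
        for (; start <= pos && start + k <= read.size(); ++start) {
            std::vector<std::uint8_t> window(read.begin() + start,
                                             read.begin() + start + k);
            for (std::size_t a = 0; a < alternatives.size(); ++a) {
                window[pos - start] = alternatives[a];
                if (isTrusted(window)) ++votes[a]; // this alternative makes the window solid
            }
        }
        return votes;
    }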
Parallelization strategy
Our parallelization strategy uses the master–slave model, a typical parallelization paradigm in which masters are dedicated to task distribution and slaves work on individual tasks. In many applications, the master–slave model is implemented with a single master and multiple slaves. In HECTOR, multiple masters are used to avoid the bottleneck caused by task distribution as the number of slaves grows. In general, the model is configured with more slaves than masters, using a master-to-slave ratio of 1:3. This type of multiple-master, multiple-slave parallelization has also been used in Musket [21]. A hybrid combination of the Pthreads and OpenMP parallel programming models is used to implement the master–slave model.
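The distribution scheme can be sketched with standard C++ threads standing in for the Pthreads/OpenMP hybrid used by HECTOR. Hashing an item selects its destination slave, so each slave owns a disjoint partition and the slaves never need to synchronize with one another; the Channel, master and slave names are hypothetical and the sketch omits file reading and the actual k-hopo processing.

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // One communication channel between the masters and a single slave.
    struct Channel {
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::string> q;
        bool done = false;
    };

    // A master distributes items to the slaves by hashing, so that each slave
    // owns a disjoint partition and never synchronizes with the other slaves.
    void master(std::vector<std::string> items, std::vector<Channel>& channels) {
        for (auto& item : items) {
            Channel& ch = channels[std::hash<std::string>{}(item) % channels.size()];
            {
                std::lock_guard<std::mutex> lock(ch.m);
                ch.q.push(std::move(item));
            }
            ch.cv.notify_one();
        }
    }

    // A slave waits on its channel and processes incoming items until the
    // masters are finished and the channel is drained.
    void slave(Channel& ch, std::function<void(const std::string&)> work) {
        for (;;) {
            std::unique_lock<std::mutex> lock(ch.m);
            ch.cv.wait(lock, [&] { return !ch.q.empty() || ch.done; });
            if (ch.q.empty()) return; // done and drained
            std::string item = std::move(ch.q.front());
            ch.q.pop();
            lock.unlock();
            work(item); // e.g. query/insert the k-hopo in the local data structures
        }
    }

    int main() {
        const std::size_t numSlaves = 3;
        std::vector<Channel> channels(numSlaves);
        std::mutex printMutex;

        std::vector<std::thread> slaves;
        for (auto& ch : channels)
            slaves.emplace_back(slave, std::ref(ch), [&](const std::string& s) {
                std::lock_guard<std::mutex> lock(printMutex);
                std::cout << s << " handled\n";
            });

        std::thread m(master,
                      std::vector<std::string>{"AAC", "CCG", "GGT", "TTA"},
                      std::ref(channels));
        m.join();

        // Signal completion and wait for the slaves to drain their channels.
        for (auto& ch : channels) {
            { std::lock_guard<std::mutex> lock(ch.m); ch.done = true; }
            ch.cv.notify_all();
        }
        for (auto& t : slaves) t.join();
    }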