### Generalized *k*-base encoding

Given an *k*-base encoded DNA sequence *c* = *c*
_{1}, ..., *c*
_{
n
}, it is the goal of the proposed algorithm to minimize the edit distance between c and some regular DNA sequence *y* = *y*
_{1}, ..., *y*
_{
m
} given a set of valid edit operators Σ. The DNA alphabet is assumed to be Λ = {*A, C, G, T*} and the valid edit operators include a color substitution, base substitution, a deletion, and an insertion. Similar to Homer et al. (2009), the color substitution operator is required to be applied before applying the insertion, deletion, or base substitution operators. Each operator is assigned a score, with ∏(*C*
_{1}, *C*
_{2}) and Δ(*B*
_{1}, *B*
_{2}) corresponding to the color substitution scoring and base substitution scoring functions respectively. To model insertions and deletions, affine gap penalties are used whereby a score of *ρ* is applied for the first insertion (or deletion), with ϵ applied for any consecutive insertion (or deletion) that extends the insertion (or deletion). It is assumed that for any bases *B*
_{1} ≠ *B*
_{2} and for any colors *C*
_{1} ≠ *C*
_{2} that 0 ≤ Δ(*B*
_{1}, *B*
_{1}), Δ(*B*
_{1}, *B*
_{2}) < 0, 0 ≤ ∏(*C*
_{1}, *C*
_{1}), ∏(*C*
_{1}, *C*
_{2}) < 0, *ρ* < ϵ < 0. These scoring assumptions penalize edits that change the encoded or decoded DNA sequence relative to the reference, thereby ensuring their similarity.

To illustrate the encoding and decoding method used by this technology, let *x* = *x*
_{1}, ..., *x*
_{
n
} be a DNA sequence. To encode a DNA sequence, the function Φ^{
k
}(*B*
_{1}, ..., *B*
_{
k
}) is defined to return a color *C*
_{
k
} using the bases *B*
_{1}, ..., *B*
_{
k
}, where B_{i-1} occurs before B_{
i
} in the sequence. For example, to encode the DNA sequence *x* = *x*
_{1}, ..., *x*
_{
n
}, first a known start adaptor *p* = *p*
_{1}, ..., *p*
_{k-1} ∈ Λ^{k-1} is assumed. Next, the function Φ^{
k
} is iteratively applied to the concatenation of *p* and *x*. *c* In this case, *c*
_{1} = Φ^{
k
}(*p*
_{1}, ..., *p*
_{k-1}, *x*
_{1}), *c*
_{2} = Φ^{
k
} (*p*
_{2}, ..., *p*
_{k-1}, *x*
_{1}, *x*
_{2}), ..., *c*
_{
n
} = Φ^{
k
}(*x*
_{n-k+1}, *x*
_{
n
}). The adaptor sequence p is known in practice and is used in the physical chemistry of the sequencer (for *k* = 2), not the DNA sequence in question [5–7].

The encoding function Φ^{
k
} (*B*
_{1}, ..., *B*
_{
k
}) transforms each base *B*
_{
i
} into an integer representation (*i.e*. *A* = 0, *C* = 1, *G* = 2, *T* = 3), sums the integer values, and returns the result modulo four. Let *δ* return the integer representation of a base as described above, then Φ^{
k
}(*B*
_{1}, ..., *B*
_{
k
}) =
*mod* |Λ|. Modulo four is chosen since four colors are used in current technologies. The properties of the modulo-four-specific encoding are discussed after how a *k*-base encoded sequence is decoded.

To decode an encoded sequence, the function Γ^{
k
}(*B*
_{1}, ..., *B*
_{k-1}, *C*) is defined to return the decoded base *B*
_{
k
} using the encoded color *C* and the previous bases *B*
_{1}, ...*B*
_{k-1}. To compute Γ^{
k
}(*B*
_{1}, ..., *B*
_{k-1}, *C*), *B* must be solved for in the equation
, which is easily solved. For example, to decode the encoded sequence *c* = *c*
_{1}, ..., *c*
_{
n
} with a known start adaptor *p* = *p*
_{1}, ...*p*
_{k-1} ∈ Λ^{k-1}, Γ^{
k
} is iteratively used. The decoded sequence will be *x*
_{1} = Γ^{
k
}(*p*
_{1}, ..., *p*
_{k-1}, *c*
_{1}), *x*
_{2} = Γ^{
k
}(*p*
_{1}, ..., *p*
_{k-2}, *x*
_{1}, *c*
_{2}), ..., *x*
_{
n
} = Γ^{
k
}(*x*
_{n-k+1}, *x*
_{n-1}, *c*
_{
n
}). Without the start adaptor *p*, there would be |Λ|^{k-1} possible decodings of the encoded sequence.

This encoding function has two useful properties. First, if one base in x is changed to obtain a new DNA sequence

*x'*, then the new

*k* colors that encode the changed base in

*x'* will differ from the corresponding

*k* colors in

*x*. For example, if

*k* = 5 then changing one base in

*x* to obtain new DNA sequence

*x'* will cause there to be 5 color differences between the encoded version of x and the encoded version of

*x'*. A second useful property is if one color in the encoded version of

*x* is changed to obtain a new encoded version, say

*c'*, then every base in the decoded version of

*c'* that occurs after the changed color will be different from the corresponding bases in

*x*. The first property defines the signature of base substitutions in the encoded sequence, which becomes pronounced as

*k* increases. The second property tells us that an encoding error will modify all bases after the encoding error. Intuitively, one can simplify by observing that for any base substitution there exists a set of

*k* consecutive errors that can be applied to achieve the same base substitution, and therefore if variants are being searched for (base substitutions in particular) then constraints should be placed on the scoring system to prefer calling base substitutions rather than color substitutions when comparing to a reference. An example of such a constraint is given that removes the above ambiguity. Suppose there exists a sub-sequence of the encoded read

, such that they all encode a base

. Next, consider the reference base

and the

*k* "colors" that encode

*B*
_{
i
}:

*C*
_{
i
}, ...,

*C*
_{i+k-1}. Let the

*i*th DNA base be

such that

encode

*B*
_{
i
}. The following constraint is made to prefer a base change and

*k* color matches to a base match and

*k* consecutive color mismatches:

In this case, it is assumed that
and *B*
_{
i
} ≠
(∀*i* ≤ *j* ≤ *i* + *k* - 1). Numerous other constraints based on real-world requirements are possible but not explored here.

### The Algorithm

Suppose that a color sequence

*c* =

*c*
_{1}, ...,

*c*
_{
n
} with a known adaptor

*p* ∈ Λ

^{k-1} is to be aligned to a reference sequence

*y* =

*y*
_{1}, ...,

*y*
_{
m
}. To search over all possible base substitution, base insertion, base deletions, and color substitutions, define a recursive formula that is the repeated calculation in the dynamic programming algorithm.

Intuitively, Equation 2 is filling in an *n* by *m* matrix, with each cell in the matrix containing 3 × Λ^{k-1} sub-cells. It is interesting to observe for *k* > 2 when computing
and
that 2 considers only previous sub-cells that are consistent with the current sub-cell. In other words, the first *k* - 2 bases of *σ* (the current sub-cell) must correspond to the last *k* - 2 bases of *ϕ* (the previous sub-cell). In this formula, the *h* sub-cells represent bases present in *y* but not in *x*, while *v* sub-cells represent bases present in *x* but not in *y*. The *s* sub-cells represents a base *x*
_{
i
} aligning to a base *y*
_{
i
} in the reference sequence *y*.

An alignment that begins or ends with a deletion is ignored, since a sequence must span the break-point for the deletion to be observable (with respect to *x*). This is a valid assumption when *x* is the observed sequence, and *y* is a fixed reference. An insertion followed by a deletion (or vice versa) is ignored since this is rare for short DNA sequences, although to consider such an event would require minimal changes to the above formula.

If the color match scores are the same (∀

*i* ≠

*j*, ∏(

*c*
_{
i
},

*c*
_{
i
}) = ∏(

*c*
_{
j
},

*c*
_{
j
})) and all color mismatch scores are the same (∀

*i* ≠

*j, k*, ≠

*l*, ∏(

*c*
_{
i
},

*c*
_{
j
}) = ∏(

*c*
_{
k
},

*c*
_{
l
}), then Equation

2 can be simplified. The recursive rule for the

term becomes:

This modification forces any color substitution to be at the beginning or end of any inserted bases in *x* and can reduce the complexity of the algorithm dramatically. The intuition behind this simplification stems from not having any reference bases to which to compare the inserted bases. This forces the maximum score path through the insertion to have no color errors

Various initializations are possible, and the alignment of the entire encoded DNA sequence *x* to some subsequence of *y* is presented here. Therefore, the initialization becomes for
,
if *σ* = Γ^{
k
}(*p*, *c*
_{
i
}) and
otherwise, and for
if *σ* = Γ^{
k
}(*ϕ*, *c*
_{
i
}), so that the local alignment spans the entire encoded sequence and insertions are allowed at the beginning of any alignment. Notice that if there were any color errors within the beginning an insertion, they are aligned such that they occur at the end of the insertion.
for *j* ≥ 0 is initialized so that the alignment does not begin with a deletion. The remaining initializations are:
for *j* ≥ 0 *σ* = *p* and *σ* ∈ Λ^{k-1}, and
if *σ* = *p*,
otherwise, for *j* ≥ 0 and *σ* ∈ Λ^{k-1}. These initializations enforce that the starting adaptor is *p*. To find the optimal local alignment, the cells
and
are searched over for a cell with maximum score, again ignoring the case where the alignment ends with a deletion. Backtracking is used to recover maximum scoring alignment.

This algorithm is in fact finding the shortest path on a graph with the nodes defined by the sub-cells of the matrix, and the edges weighted and defined by the recursive rules. To analyze the time complexity, it is observed that given the *k*-base encoding scheme for each sub-cell of type *h*, *v*, and *s* there are |Λ|^{k-1} sub-cells. For each
sub-cell it is required to calculate the maximum over two values. For each
and
sub-cell it is required to calculate the maximum over 3 × |Λ| values. Therefore for each cell, various maximum must be computed over 2 × |Λ|^{k-1} + |Λ|^{k-1} × 3 × |Λ| + |Λ|^{k-1} × 3 × |Λ| = 2 × |Λ|^{k-1}(1 + 3 × |Λ|). In practice, |Λ| = 4 and therefore various maxima must be computed over 26 × 4^{k-1} values. From this analysis, it is clear that the running time of this algorithm is *O*(*nm*|Λ|^{
k
}), which unfortunately scales exponentially with respect to the length of the encoding *k*.

### Simulations

Simulations were performed to assess the power and performance of *k*-base encoding, for *k* = 1...5. Sets of 10, 000 test sequences were randomly sampled from E. Coli (DH10B, NC_010473, CP000948). All sequences in a given set had a fixed read length (25, 50, 75), a fixed error-rate (0, 0.01, ..., 0.2), and a fixed number of SNPs (0, 1, 2). For the case of the 1-base encoding (*k* = 1), the standard Smith-Waterman algorithm was used, where errors were modeled as base changes, requiring that SNPs and base errors do not co-occur. For the case of *k*-base encoding with *k* > 1 errors were modeled as color substitutions (encoding errors). Similar to Homer et al. (2009), an alignment is defined to be accurate or correct if the returned alignment has the same score (or likelihood when the scores represent log- likelihoods) as the true alignment, which is known by the nature of these simulations.

To allow for insertions and deletions, the original sequence is used (before applying errors and variants) with an additional 10 bp before and after as the reference or target DNA sequence. In accordance with Equation 1, ϵ = - 50, *ρ* = - 175, ∏(*C*
_{1}, *C*
_{2}) = -125 (*C*
_{1} ≠ *C*
_{2}), ∏(*C*
_{1}, *C*
_{1}) = 0, Δ(*B*
_{1}, *B*
_{2}) = - 150(*B*
_{1} ≠ *B*
_{2}) and Δ(*B*
_{1}, *B*
_{1}) = 50. Due to these initializations, the optimization in Equation 3 is able to be performed. To model real-world error-rates, the simulated error-rates are learned from a run of an ABI SOLiD sequencer (50 base pairs), utilizing the aligned reads to calculate the 2-base encoding error, which is inherently dependent on the decoding algorithm used Homer et al. (2009). The error-rate was not uniform by sequencing position, therefore producing a color-error-rate for each position in the 50 color sequence reads. The observed error rate for each sequencing position was: 0.014, 0.005, 0.006, 0.007, 0.006, 0.006, 0.006, 0.008, 0.008, 0.008, 0.007, 0.006, 0.009, 0.009, 0.009, 0.009, 0.008, 0.015, 0.015, 0.012, 0.012, 0.011, 0.021, 0.021, 0.018, 0.019, 0.014, 0.037, 0.033, 0.031, 0.029, 0.022, 0.055, 0.052, 0.051, 0.043, 0.036, 0.087, 0.084, 0.076, 0.071, 0.060, 0.125, 0.118, 0.118, 0.108, 0.092, 0.179, 0.175, 0.184. To evaluate various scoring schemes of 5-base encodings, simulations of only 1, 000 test sequences were used due to running time limitations. For these evaluations a dual quad-core Intel Xeon E5420 machine at 2.5 GHz, with 32 GB of RAM and 2TB of RAID 0 disk space, was used, although the actual hardware requirements of the algorithm itself beyond CPU power are negligible relative to any modern computer.

### Scoring constraints 5-base encoding

Various scoring schemes were evaluated for 5-base encoding. For notational convenience, for all colors

*C*
_{1} ≠

*C*
_{2} and bases

*B*
_{1} ≠

*B*
_{2} let

*CE* = ∏(

*C*
_{1},

*C*
_{2}),

*CM* = ∏(

*C*
_{1},

*C*
_{1}),

*BE* = Δ(

*B*
_{1},

*B*
_{2}), and

*BM* = Δ(

*B*
_{1},

*B*
_{1}). Consider the scoring scenarios that satisfy one of the following constraints:

- 1.
5*CE + BM > 5CM + BE* (-*25*)

- 2.
5*CE + BM < 5CM + BE and 4CE + CM + BM > CE + 4CM + BE* (-*50*)

- 3.
4*CE + CM + BM < CE + 4CM + BE* and *3CE+2CM+BM > 2CE+3CM+BE* (-*75* and -*150*)

- 4.
3*CE +2CM +BM < 2CE +3CM +BE* and *2CE+3CM+BM > 3CE+2CM+BE* (*200*)

- 5.
2*CE +3CM +BM < 3CE +2CM +BE* and *1CE + 4CM + BM > 4CE + CM + BE*

- 6.
*CE + 4CM + BM < 4CE + CM + BE*

Intuitively, these scenarios try to decide if a given set of color errors should be preferred if they can be explained by a SNP and possibly other color errors. For example, the first scenario always prefers calling color errors over anything that can be explained by a SNP. The second scenario will prefer to explain the encoding with a SNP if it results in no color errors, but does not prefer to explain the encoding with a SNP if it is accompanied by any color errors. In the extreme, the last scenario would prefer to explain all color errors as a combination of a SNP and possibly color errors.

Nevertheless, given the assumptions that *BM* ≥ 0, *CM* ≥ 0, *BE* < 0, and *CE* < 0, it is observed that scenarios 5 - 6 are not possible. For example, constraint is equivalent to the following constraint: *CM + BM < CE + BE* and *4CE + BM > 3CE + BE*. Intuitively, our assumptions prefer to penalize color errors and nucleotide variants, rewarding color matches and nucleotide matches, thereby making the left-side of the above constraint without a solution. If scenarios 5-6 are truly desired, instead of considering color errors, color matches, base errors, and base matches separately, a joint function could be considered conditional on the various combinations of events over the *k* base and color window. This would also allow for the incorporation of any specific experimental bias for the combinations of events, or even specific bases.

In the above constraints color error scores are given that satisfy the constraints given the previously defined base match, base substitution, and color match scores. The score -150 is also included, which was previously used, to illustrate that there is flexibility even within these constraints to tune the scoring scheme.