# Local alignment of two-base encoded DNA sequence

Nils Homer^{1, 2} (corresponding author), Barry Merriman^{2} and Stanley F Nelson^{2}

**10**:175

https://doi.org/10.1186/1471-2105-10-175

© Homer et al; licensee BioMed Central Ltd. 2009

**Received:** 01 February 2009

**Accepted:** 09 June 2009

**Published:** 09 June 2009

## Abstract

### Background

DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.

### Results

We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions.

### Conclusion

The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and it facilitates genome re-sequencing efforts based on this form of sequence data.

## Background

DNA sequence comparison is a common problem in biology. In this problem, we wish to measure the similarity of two sequences of DNA. Hamming distance [1] can be used to quantify similarity but forces the two sequences to be of the same length. More generally, the idea of a weighted edit distance can be applied, which allows for base changes, insertions and deletions [2], with weights chosen to reflect their likelihood of occurrence. Given some set of operators that can modify a sequence, we wish to find the set of edit operators that transforms one sequence into a (sub)sequence of the other by maximizing a similarity score. This problem can be solved by a dynamic programming algorithm, which was first described in 1970 [3]. This led to the Smith-Waterman algorithm [4], which has been a critical component of local sequence alignment. Affine gap penalties were subsequently introduced, whereby in practice the average per-base penalty decreases, but the overall penalty increases, as the gap grows longer [5]. This algorithm has a known O(*nm*) running time and O(min(*n*, *m*)) space requirements, for both finding a maximum similarity score and finding a transformation that achieves the maximum similarity score, where *n* and *m* are the lengths of the two sequences to be compared [3–9]. The resulting algorithm has become the standard for DNA sequence comparison [3, 4, 10, 11].

In a typical re-sequencing experiment using next-generation sequencing technology, millions of short sequence "reads", 20–100 bases in length, must be aligned to a large reference genome, such as the human genome. This demands an initial search space reduction step [12–14, 18–20] (Homer N, Merriman B, Nelson SF: BFAST: the BLAT-like Fast Accurate Search Tool for Large-Scale Genome Resequencing, submitted) prior to performing the more expensive optimal local alignment. This first step typically involves some form of indexed look-up or hashing of the full genome or reads, so that a small number of candidate alignment locations are quickly obtained for each read, in a way that is tolerant of the read containing errors or real variants relative to the reference. The optimal local alignments are then used to select which of these candidates is the true location, as well as to identify the differences from the reference sequence at that location. In the case of color space data, the look-up phase can be performed entirely in color space, using the color-space encoded form of the reference genome to find candidate locations for each color space read. The optimal alignment algorithm described here would then be used as the finishing step, which simultaneously decodes, identifies color (measurement) errors, and optimally aligns resulting DNA sequence to a short candidate segment of the reference sequence, typically 100–1000 bases in length (to allow for insertions and deletions in the read).

## Results

### Power of two-base encoding

### Performance of two-base encoding

We performed simulations to evaluate the performance of the current algorithm compared to the local alignment without two-base encoding (see Methods for details). We found that for length 25 and 50 color space sequences our algorithm was 36 and 28 times slower, respectively, than the standard Dynamic Programming algorithm applied to base space sequence. Although the algorithmic complexity as a function of read length and reference length is not increased, the absolute number of operations does increase (see Methods), and thus we observe a decrease in speed compared to sequences without the two-base encoding. This performance decrease is particularly relevant given that an experimentalist may be required to choose between sequencing technologies that do not use the two-base encoding scheme and sequencing technologies that do. Two-base encoding has potentially powerful error correction modes and at the time of this publication is able to generate substantially more data than direct sequencing approaches. Thus, the two-base encoding strategy, while preferable in some scenarios for base error correction and better alignment performance, does impose a need for increased computational capacity, largely due to the complexity of the local sequence alignment.

## Discussion

Although the power of this algorithm enables accurate alignment of color space sequences with greater error, it is also computationally an order of magnitude more expensive than the standard dynamic programming algorithm applied in sequence space. To partially mitigate this, the performance can be optimized without changing the results by employing some simple search space reduction and greedy search techniques, as follows: first, decode the encoded sequence by the standard deterministic rules and perform an exact string matching search. If an exact match is found, then the algorithm stops. Upon unsuccessful return, we find a lower bound for the optimal similarity for the proposed algorithm by first performing our two-base encoded alignment but without allowing insertion or deletion edits, which substantially reduces the computational cost. Using this lower bound, we then reduce the search space of our full algorithm by omitting the paths where the search parameters that permit detection of insertions or deletions would result in a score below the established lower bound. In this manner, the empirical running time of the algorithm can be improved by approximately 20% (data not shown) while still obtaining the true optimal alignment.
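The second step above (scoring a two-base encoded read without insertions or deletions) can be sketched as a small dynamic program over the decoded base at each read position. This is a minimal illustration, not the authors' implementation. It assumes the XOR color rule with the labels A=0, C=1, G=2, T=3, a target segment of exactly the read length, and the scoring values listed in the Simulations section; the function names are hypothetical.

```python
BASES = "ACGT"
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit labels for the bases


def phi(b1, b2):
    """Color joining two adjacent bases (assumed XOR rule)."""
    return CODE[b1] ^ CODE[b2]


def gapless_colorspace_score(colors, start_base, target,
                             base_match=50, base_mismatch=-150,
                             color_match=0, color_mismatch=-125):
    """Best score and decoded read for a gapless alignment.

    `colors` is a string of digits 0-3 and `target` is a DNA string of
    the same length. Each step chooses the decoded base, paying a color
    substitution score if the implied color differs from the observed
    color and a base substitution score against the target base.
    """
    if len(colors) != len(target):
        raise ValueError("this gapless sketch expects equal lengths")
    # best[b] = (score, decoded prefix) over paths ending in base b
    best = {start_base: (0, "")}
    for c, y in zip(colors, target):
        observed = int(c)
        nxt = {}
        for prev, (score, prefix) in best.items():
            for b in BASES:
                s = score
                s += color_match if phi(prev, b) == observed else color_mismatch
                s += base_match if b == y else base_mismatch
                if b not in nxt or s > nxt[b][0]:
                    nxt[b] = (s, prefix + b)
        best = nxt
    return max(best.values())


# One color error: the second color is corrected, all seven bases match.
print(gapless_colorspace_score("2030311", "A", "GATTACA"))  # (225, 'GATTACA')
# Two adjacent color differences are explained as one base substitution.
print(gapless_colorspace_score("2212311", "A", "GATTACA"))  # (150, 'GACTACA')
```

With these scores, the second call prefers a single base substitution (150) over two adjacent color substitutions (100), which is the preference discussed for the scoring constraint in Methods. The returned score is only a lower bound for the full alignment because insertions and deletions are not considered.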

We note that the general strategy of two-base encoding in color space is possible to apply in more complex formats for error correction. For instance, three or more bases may be encoded by four or more colors. This would further increase the power of discriminating between encoding errors and base substitutions, albeit at a substantial added cost in local alignment performance. In practice these alternate encodings could further reduce false-positives detections when the goal is to find biological variants with next-generation sequencing technology with relatively high measurement error rates. This may be an advantageous strategy, for example, to increase read lengths by accepting noisier color space reads that are correctable after alignment. The current algorithm can be extended to accommodate these generalizations, and in future work we will investigate the detailed performance properties of such hypothetical encodings.

The present algorithm can be readily extended to include support for the case where sequence data is missing or unavailable, in either the given color-encoded sequence or in the target base space sequence. We introduce a fifth color code to represent an unknown color in the encoded sequence, and a fifth base code (traditionally "N") to represent an unknown base in the decoded or target sequence. To incorporate an unknown encoding color we modify the color substitution function Π to include a score for this fifth unknown color and any other color. To incorporate an unknown base in the target, we modify the base substitution function Δ to include a score for the unknown base and any other base. A simple modification to the initialization step in the algorithm is also required if the start base *p* is not known. Although we do not rely on quality values for each color read, it is possible to incorporate into the current alignment algorithm quality values that represent the certainty of color calling, similar to sequence calling with Phred scores [23–26], by weighting the color substitution function Π.

Finally, Figures 2, 3, 4, and 5 demonstrate the power to correctly align two-base encoded sequences in the presence of a large number of color errors. Depending on the distribution of sequences with a given number of errors, two-base encoding and this algorithm may make it feasible to accept higher error sequences generated by next-generation sequencing technology, improving both throughput and cost-effectiveness. Additionally, we place a constraint on our scoring functions, making a conscious choice to prefer a base substitution to two adjacent color substitutions that would cause that base to match the reference. This is by no means the only constraint available, but serves to help define the trade-off in power to detect errors over biological variants. In these practically important but ambiguous cases, a decision must be made over which scenario to prefer, and in practice this ambiguity can be overcome by using coverage where multiple sequences observe the same event.

## Conclusion

DNA sequence alignment algorithms have been thoroughly studied in molecular biology, resulting in well-developed Dynamic Programming algorithms that optimize an edit distance to find optimal alignments between two sequences. However, there is a resurgence of interest in sequence alignment due to large scale re-sequencing efforts made possible by massively parallel sequencing technology. The classical algorithm remains an ideal approach for local alignment of such short-read sequence data, but some sequencing technologies produce reads in encoded form, which must be decoded to obtain standard DNA sequence. We extend the previous class of Dynamic Programming algorithms to allow for errors in the encoding, as well as the usual base substitutions, insertions and deletions. Our algorithm remains O(*nm*) time, where *n* and *m* are the length of the encoded and target sequence respectively. We show in practice that performance is decreased due to the added complexity of considering encoding errors, although this can be somewhat mitigated by standard search optimization. This performance decrease must be kept in mind when comparing the overall computational cost of analyzing various next-generation sequencing technologies. Using this new algorithm, local sequence alignment as well as error detection and correction are performed in a reliable and systematic manner, enabling the direct comparison of encoded DNA sequence reads to a candidate reference DNA sequence. This new algorithm should facilitate the use of two-base encoded data for large-scale re-sequencing projects.

## Methods

### The Problem

Given an encoded sequence *c* = *c*_{1},..., *c*_{n}, we wish to maximize the similarity between *c* and some regular DNA sequence *y* = *y*_{1},..., *y*_{m}, with the valid edit operators Σ. In this case the alphabet is {*A, C, G, T*} corresponding to the bases in DNA, and the encoded alphabet is {*0, 1, 2, 3*}. We assume the encoded sequence is composed of a two-base encoding, referred to as colors, and we assume a known start base *p*, which is known in practice [16, 17, 27]. The valid edit operators are:

1. A base substitution, which substitutes one base for another in the encoded sequence after decoding.
2. An insertion, which inserts a base into the encoded sequence after decoding.
3. A deletion, which deletes a base from the encoded sequence after decoding.
4. A color substitution, which substitutes one encoded color for another.

We define a function Δ(*B*_{1}, *B*_{2}) that returns the base substitution score for substituting base *B*_{2} for base *B*_{1}. The score *ρ* is applied for the first insertion or deletion operator used. Any insertion or deletion operator that is applied so that the insertion or deletion is extended has a score *ε*. Therefore, for a length *g* > 0 base insertion or deletion, the cost of the entire insertion or deletion is *ρ* + *ε*(*g* - 1), with an average per-base cost of (*ρ* + *ε*(*g* - 1))/*g*. In practice, this affine gap penalty is useful to penalize the start of an insertion or deletion more heavily than extending the insertion or deletion. The function Π(*C*_{1}, *C*_{2}) returns the color substitution score for substituting color *C*_{2} for color *C*_{1}. The base and color substitution functions are both symmetric, and are defined even if *B*_{1} = *B*_{2} for Δ, or *C*_{1} = *C*_{2} for Π.

To decode an encoded sequence, we define the function Γ(*B*, *C*) that returns the decoded base using the encoded color *C* and the previous base *B* (see Figure 6). For example, to decode the encoded sequence *c* = *c*_{1},..., *c*_{n} with a known start base *p*, we iteratively use Γ. The decoded sequence will be *x*_{1} = Γ(*p*, *c*_{1}), *x*_{2} = Γ(*x*_{1}, *c*_{2}),..., *x*_{n} = Γ(*x*_{n-1}, *c*_{n}). To encode a sequence, we define the function Φ(*B*_{1}, *B*_{2}) that returns a color using the bases *B*_{1} and *B*_{2}, where *B*_{1} occurs before *B*_{2} in the sequence (see Figure 1). For example, to encode DNA sequence *x* = *x*_{1},..., *x*_{n}, we assume a known start base *p* and iteratively use Φ to encode *x*. Here we have *c*_{1} = Φ(*p*, *x*_{1}), *c*_{2} = Φ(*x*_{1}, *x*_{2}),..., *c*_{n} = Φ(*x*_{n-1}, *x*_{n}). This encoding function is analogous to the Klein Four Group under addition or the X-OR function when the colors and DNA are represented as binary numbers [14, 15, 17]. The function Φ is used to encode the base sequence whereas the function Γ is used to decode the color sequence.

To represent the transformation of *x* into *y*, we pair bases in *x* with bases in *y* and include dashes to indicate that an insertion or deletion occurred. If *x*_{i} and *y*_{j} are matched, then we pair *x*_{i} and *y*_{j} and draw the pair (*x*_{i}, *y*_{j}). A deletion of a base in *x* relative to *y* is represented using a dash (-) and the base *y*_{j}, and is drawn as the pair (-, *y*_{j}). An insertion into *x* relative to *y* is represented using a dash and the base *x*_{i}, and is drawn as the pair (*x*_{i}, -). For example, for *x* = *GATTACA* and *y* = *GATACA*, a valid alignment may pair *GATTACA* with *GAT-ACA*. In this example, we apply three base substitution operators, one insertion operator, and then three base substitution operators. The base substitution operators do not change the bases in this example, but are defined for completeness when *x*_{i} = *y*_{j}. In this manner, we describe an alignment using the base substitution, insertion and deletion operators.

To model encoding errors, we assume a two-base encoding scheme; therefore, the encoding can be visualized by placing the colors in between the bases, assuming the starting base is an *A*. For the reference sequence *y*, we place colors of the encoded version of *y* in between the bases of *y*. Let *c'* be the encoded DNA sequence resulting from applying all color substitution operators to *c*. Below we place the colors of the encoded sequence *c'* between the bases of the decoded version of *c'*. Finally we place the original encoded sequence *c* below *c'*. Given an encoded sequence *c* = *2030311* and target DNA sequence *y* = *GATACA*, a valid alignment changes the second color using a color substitution, where the second color encodes for the first and second base. The placement of the color (in *y*) within the insertion (relative to *c*) is arbitrary since it is compared to the composition of the colors within the insertion in *c*, as will be seen later. Without the color substitution, every base decoded from the second color onward would change, and the alignment with it, illustrating the necessity to model encoding errors.
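The decoding and encoding functions Γ and Φ described above can be illustrated with a short sketch. It assumes the XOR rule mentioned in the text with the labels A=0, C=1, G=2, T=3 (the figures defining Γ and Φ are not reproduced here, so this labelling is an assumption), and the helper names `gamma`, `phi`, `decode` and `encode` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit labels for the bases
BASE = {value: base for base, value in CODE.items()}


def gamma(prev_base, color):
    """Decode one color: previous base plus color gives the next base."""
    return BASE[CODE[prev_base] ^ color]


def phi(b1, b2):
    """Encode one color from two adjacent bases, b1 before b2."""
    return CODE[b1] ^ CODE[b2]


def decode(colors, start_base):
    """Apply gamma iteratively: x_1 = gamma(p, c_1), x_2 = gamma(x_1, c_2), ..."""
    bases, prev = [], start_base
    for ch in colors:
        prev = gamma(prev, int(ch))
        bases.append(prev)
    return "".join(bases)


def encode(bases, start_base):
    """Apply phi iteratively: c_1 = phi(p, x_1), c_2 = phi(x_1, x_2), ..."""
    colors, prev = [], start_base
    for b in bases:
        colors.append(str(phi(prev, b)))
        prev = b
    return "".join(colors)


print(decode("2030311", "A"))  # GGCCGTG  (observed colors, uncorrected)
print(decode("2230311", "A"))  # GATTACA  (second color substituted)
print(encode("GATACA", "A"))   # 223311   (colors of the target sequence)
```

Under these assumptions, the uncorrected read decodes to a sequence that differs from *GATTACA* at every position after the first, while substituting only the second color recovers *GATTACA*, which is why a single color error cannot be treated as a single base error.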

We seek the set of edit operators that transforms *x* into *y* by maximizing the similarity score, thus maximizing sequence similarity. In practice, *x* is an observed encoded sequence, and *y* is a decoded target or reference sequence. We prefer to penalize applications of the edit operators where base substitutions or color substitutions occur. Therefore, for all *B*_{1} ≠ *B*_{2} and *C*_{1} ≠ *C*_{2}, we assume that Δ(*B*_{1}, *B*_{2}) ≤ 0, 0 ≤ Δ(*B*_{1}, *B*_{1}), *ε* ≤ 0, *ρ* ≤ 0, Π(*C*_{1}, *C*_{2}) ≤ 0 and 0 ≤ Π(*C*_{1}, *C*_{1}). Furthermore, to avoid always placing an insertion, we must have for any *C*_{1} that *ε* + Π(*C*_{1}, *C*_{1}) ≤ 0 and *ρ* + Π(*C*_{1}, *C*_{1}) ≤ 0.

A subtle but important point is that two adjacent color substitutions in the encoded sequence are in some cases equivalent to a base substitution in between the two colors, because changing the base between two colors changes both colors while leaving the flanking bases unchanged. In practice we make the assumption that for any bases *B*_{1}, *B*_{2}, *B*_{2}', *B*_{3} with *B*_{2} ≠ *B*_{2}', and for any colors *C*_{2}, *C*_{2}', *C*_{3}, *C*_{3}' with *C*_{2} ≠ *C*_{2}' and *C*_{3} ≠ *C*_{3}' such that Γ(*B*_{1}, *C*_{2}) = *B*_{2}, Γ(*B*_{2}, *C*_{3}) = *B*_{3}, Γ(*B*_{1}, *C*_{2}') = *B*_{2}' and Γ(*B*_{2}', *C*_{3}') = *B*_{3}, the scoring functions satisfy Equation 1.

This will ensure that two adjacent color substitutions (*C*_{2}' for *C*_{2} and *C*_{3}' for *C*_{3} above) that are compatible with a base substitution (*B*_{2}' for *B*_{2}) will not be preferred over the compatible base substitution. Considering more complex alignments, for example whether to prefer two adjacent color substitutions or an adjacent color substitution and a base substitution, can help fine-tune the power to detect color errors as well as base substitutions by adding additional constraints on the scoring functions.

### The Algorithm

Intuitively, we are filling in an *n* by *m* matrix, with each cell containing 12 sub-cells: an *h*, a *v* and an *s* sub-cell for each base *σ* ∈ {*A*, *C*, *G*, *T*}. The *h* sub-cells correspond to bases that are present in *y* but deleted in *x*, the *v* sub-cells correspond to bases inserted into *x* but absent in *y*, and each *s* sub-cell represents a base *x*_{i} (decoded as base *σ*) aligning to a base *y*_{j} of the reference sequence *y*. All possible color substitutions are considered by transitioning from an *h*, *v*, or *s* sub-cell to an *s* sub-cell.

We first observe that base substitutions and color substitutions occur in tandem. This is because given the previous base *x*_{i-1}, the subsequent base *x*_{i} uniquely determines the joining color *c*_{i} (or equivalently the joining color *c*_{i} uniquely determines the subsequent base *x*_{i}). Additionally, we assume that color substitutions do not occur directly before a base that has been deleted. In the deletion case, we have one color that spans the entire deletion. Due to base substitutions and color substitutions occurring in tandem, we must consider a color substitution while considering a base substitution, which occurs at the end of the deletion. For insertions, if the color substitution scores are equal, meaning the same score is given for all color matches and color mismatches respectively, we need only consider *σ* = Γ(*φ*, *c*_{i}) in the v-term. This reduces the number of terms over which we compute the maxima from eight terms to two terms. The simplification results from the absence of bases against which to compare the inserted base(s), as well as the observation that placing the color substitution at the end of the insertion will result in the same score as placing the color substitution anywhere else in the insertion, including the beginning of the insertion. Since base substitutions are to be penalized, as was previously assumed, we assume that the inserted bases, and therefore the colors encoding the inserted bases, are correct. Thus, when beginning or extending an insertion, we ignore the color substitution score, and consider the insertion of the base *x*_{i} = Γ(*x*_{i-1}, *c*_{i}). Finally, we ignore the case where an insertion (or deletion) is directly followed by a deletion (or insertion), since for current technologies the sequences being compared are very short, making this scenario (switching) biologically very unlikely. Nevertheless, including this case requires minimal modification to Equation 2.

What is left to describe is how to initialize the *h*, *v* and *s* sub-cells in row 0 (*i* = 0, *j* ≥ 0) and in column 0 (*i* > 0, *j* = 0) for *σ* ∈ {*A*, *C*, *G*, *T*}. In our specific application, we wish to align the entire encoded sequence *c* to the target sequence *y*. Therefore, for *i* > 0 we initialize the *h* and *s* sub-cells in column 0 to -∞; the *v* sub-cell for *i* = 1 takes a finite score if *σ* = Γ(*p*, *c*_{1}) and -∞ otherwise, and for *i* > 1 the *v* sub-cell takes a finite score if *σ* = Γ(*φ*, *c*_{i}) and -∞ otherwise, so that the local alignment spans the entire encoded sequence as well as allowing for an insertion at the beginning of the alignment. We initialize the *h* sub-cells in row 0 to -∞ for *j* ≥ 0 so that the alignment does not begin with a deletion. We observe that deletions are detected on the basis that a read spans the deletion breakpoint. This is reflected in our scoring system, where we assume that a deletion has negative score, and therefore the alignment resulting from removal of a deletion at the beginning or end of the alignment has a score greater than or equal to the original alignment. We thus remove from consideration any instances of a sequence starting or ending with a deletion. We initialize the *v* sub-cells in row 0 to -∞ for *j* ≥ 0 and *σ* ∈ {*A*, *C*, *G*, *T*}. If *σ* = *p* then we set the *s* sub-cell in row 0 to 0, and to -∞ otherwise, for *j* ≥ 0 and *σ* ∈ {*A*, *C*, *G*, *T*}. This initialization enforces that the starting base is *p*. Other initializations can find the optimal subsequence of *x* that aligns to *y*, among other applications [10, 11]. To find the optimal local alignment we search over the *v* and *s* sub-cells in the last row (*i* = *n*) for a cell with maximum score, again ignoring the case where the alignment ends with a deletion, and backtrack to recover a maximum scoring alignment.

From Equation 2, and for each *i* and *j*, we must calculate maxima over 88 different values, which can be reduced to 64 values if the color match and color mismatch scores respectively are the same. In contrast, the Dynamic Programming solution with affine gap penalties to compare sequences with no encoding requires the calculation of a maximum over 7 different values [10, 11]. Although the running time of this algorithm is O(*nm*), where *n* is the length of the encoded sequence and *m* is the length of the target sequence, the running time is nonetheless greater in practice than that of the algorithm without encoding (see Results).
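The storage implied by the description above (three kinds of sub-cell, each indexed by a base, for every pair *i*, *j*) can be sketched as follows. This is only an illustration of the layout and of the row 0 initialization as read from the text. The recursion of Equation 2 and the column 0 initialization are deliberately left out, and the names `new_layer` and `init_tables` are hypothetical.

```python
import math

BASES = "ACGT"
NEG_INF = -math.inf


def new_layer(n, m):
    """One score per (i, j, base), all starting at minus infinity."""
    return [[dict.fromkeys(BASES, NEG_INF) for _ in range(m + 1)]
            for _ in range(n + 1)]


def init_tables(n, m, start_base):
    """Allocate the h, v and s layers and initialize row 0.

    Three layers times four bases gives the 12 sub-cells per (i, j).
    In row 0 only the s sub-cell for the known start base is set to 0,
    so an alignment may begin at any target position but must begin
    with the start base; h and v stay at minus infinity.
    """
    h, v, s = new_layer(n, m), new_layer(n, m), new_layer(n, m)
    for j in range(m + 1):
        s[0][j][start_base] = 0
    return h, v, s


h, v, s = init_tables(n=25, m=75, start_base="A")
sub_cells = len(h[0][0]) + len(v[0][0]) + len(s[0][0])
print(sub_cells)  # 12
```

The sizes in the usage lines follow the simulations, where the target is three times the read length. Because each of the (*n* + 1)(*m* + 1) cells holds a constant number of sub-cells, the asymptotic cost remains O(*nm*) even though the constant factor grows.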

### Simulations

To evaluate the power of the algorithm, we created sets of 100,000 test sequences randomly sampled from the Human genome (build 36), and gave each a known number of errors, base substitutions, insertions and deletions. For encoded sequences, we model errors as color substitutions (encoding errors) and for decoded sequences we model errors as base substitutions. It is possible for a class of alignments to have equal likelihood, and therefore we define an alignment to be correct if the alignment returned has a score equal to that of the true alignment. To evaluate the performance of the algorithm, we created 1,000,000 artificial sequences from the Human genome (build 36) with no edits applied. In both cases, we evaluated sequences of length 25 and 50, reflecting a range of possible and currently available sequences generated with color space encoding. The target DNA reference sequence had length three times the length of the encoded sequence to allow for potential insertions and deletions to be placed correctly. For the simulations, in accordance with Equation 1, we set *ρ* = -175, *ε* = -50, Π(*C*_{1}, *C*_{2}) = -125 (*C*_{1} ≠ *C*_{2}), Π(*C*_{1}, *C*_{1}) = 0, Δ(*B*_{1}, *B*_{2}) = -150 (*B*_{1} ≠ *B*_{2}), and Δ(*B*_{1}, *B*_{1}) = 50. Since the color match and color mismatch scores respectively are the same, we are able to make the simplification to the v-term in Equation 2 as described above. For these evaluations, we used a dual quad-core Intel Xeon E5420 machine at 2.5 GHz, with 32 GB of RAM and 2 TB of RAID 0 disk space, although the actual hardware requirements of the algorithm itself are negligible relative to any modern computer. The implementation for all the simulations performed can be found in BFAST at http://genome.ucla.edu/bfast, which was configured using the --enable-unoptimized-sw argument (Homer N, Merriman B, Nelson SF: BFAST: the BLAT-like Fast Accurate Search Tool for Large-Scale Genome Resequencing, submitted).
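As a quick check of the scoring values listed above, the following sketch computes the affine gap cost *ρ* + *ε*(*g* - 1) and tests one possible reading of the scoring constraint: that explaining a read by two adjacent color substitutions never scores higher than explaining it by the compatible single base substitution. The exact form of Equation 1 may differ from this reading. The XOR color rule with labels A=0, C=1, G=2, T=3 and all function names are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit labels for the bases
RHO, EPS = -175, -50                     # gap open and gap extension scores


def delta(b1, b2):
    """Base substitution score from the simulations."""
    return 50 if b1 == b2 else -150


def pi(c1, c2):
    """Color substitution score from the simulations."""
    return 0 if c1 == c2 else -125


def gap_cost(g):
    """Total score of an insertion or deletion of length g > 0."""
    return RHO + EPS * (g - 1)


def prefers_base_substitution():
    """True if no pair of adjacent color substitutions beats the
    compatible base substitution, over all B1, B2, B2', B3."""
    for b1, b2, b2p, b3 in product(BASES, repeat=4):
        if b2 == b2p:
            continue
        c2, c3 = CODE[b1] ^ CODE[b2], CODE[b2] ^ CODE[b3]
        c2p, c3p = CODE[b1] ^ CODE[b2p], CODE[b2p] ^ CODE[b3]
        # Reading 1: both colors were misread, the middle base matches.
        two_colors = pi(c2, c2p) + pi(c3, c3p) + delta(b2, b2)
        # Reading 2: both colors are correct, the middle base differs.
        one_base = pi(c2p, c2p) + pi(c3p, c3p) + delta(b2, b2p)
        if two_colors > one_base:
            return False
    return True


print(gap_cost(1), gap_cost(3))     # -175 -275
print(prefers_base_substitution())  # True
```

With these values the two-color reading scores -200 and the base substitution reading scores -150, so the base substitution is preferred, and a gap of length 3 costs -275 in total, or about -92 per base, compared with -175 for a single-base gap.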

## Declarations

### Acknowledgements

This research was partially supported by University of California Systemwide Biotechnology Research and Education Program GREAT Training Grant 2007–10 (to NH), the NIH Neuroscience Microarray Consortium (U24NS052108), and a grant from the NIMH (R01 MH071852).

We would also like to thank members of the Nelson Lab: Zugen Chen, Hane Lee, Bret Harry, Jordan Mendler, Brian O'Connor for input and computational infrastructure support.

## Authors’ Affiliations

## References

1. Hamming R: Error Detecting and Error Correcting Codes. *Bell System Technical Journal* 1950, 29: 147–160.
2. Levenshtein VI: Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. *Soviet Physics Doklady* 1966, 10: 706–710.
3. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. *J Mol Biol* 1970, 48: 443–453. doi:10.1016/0022-2836(70)90057-4
4. Smith TF, Waterman MS: Identification of common molecular subsequences. *J Mol Biol* 1981, 147: 195–197. doi:10.1016/0022-2836(81)90087-5
5. Gotoh O: An improved algorithm for matching biological sequences. *J Mol Biol* 1982, 162: 705–708. doi:10.1016/0022-2836(82)90398-9
6. Hirschberg DS: A linear space algorithm for computing maximal common subsequences. *Commun ACM* 1975, 18: 341–343. doi:10.1145/360825.360861
7. Huang X, Miller W: A time-efficient linear-space local similarity algorithm. *Adv Appl Math* 1991, 12: 337–357. doi:10.1016/0196-8858(91)90017-D
8. Myers EW, Miller W: Optimal alignments in linear space. *Comput Appl Biosci* 1988, 4: 11–17.
9. Powell DR, Allison L, Dix TI: A versatile divide and conquer technique for optimal string alignment. *Inf Process Lett* 1999, 70: 127–139. doi:10.1016/S0020-0190(99)00053-8
10. Ewens W, Grant G: *Statistical Methods in Bioinformatics.* New York: Springer; 2002.
11. Jones N, Pevzner P: *An Introduction to Bioinformatics Algorithms (Computational Molecular Biology).* Cambridge MA: The MIT Press; 2004.
12. Kent WJ: BLAT–the BLAST-like alignment tool. *Genome Res* 2002, 12: 656–664.
13. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: Accurate Mapping of Short Color-space Reads. *PLoS Comput Biol* 2009, 5: e1000386. doi:10.1371/journal.pcbi.1000386
14. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. *Genome Res* 2008, 18: 1851–1858. doi:10.1101/gr.078212.108
15. Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. *Bioinformatics* 2002, 18: 440–445. doi:10.1093/bioinformatics/18.3.440
16. Applied Biosystems Incorporated: Principles of Di-Base Sequencing and the Advantages of Color Space Analysis in the SOLiD System. [http://marketing.appliedbiosystems.com/images/Product_Microsites/Solid_Knowledge_MS/pdf/SOLiD_Dibase_Sequencing_and_Color_Space_Analysis.pdf]
17. Applied Biosystems Incorporated: A Theoretical Understanding of 2 Base Color Codes and Its Application to Annotation, Error Detection, and Error Correction. [http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_058265.pdf]
18. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. *Nucleic Acids Res* 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
19. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. *Bioinformatics* 2008, 24: 713–714. doi:10.1093/bioinformatics/btn025
20. Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. *Genome Res* 2001, 11: 1725–1729. doi:10.1101/gr.194201
21. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, *et al*.: The diploid genome sequence of an individual human. *PLoS Biol* 2007, 5: e254. doi:10.1371/journal.pbio.0050254
22. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. *Nucleic Acids Res* 2001, 29: 308–311. doi:10.1093/nar/29.1.308
23. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. *Genome Res* 1998, 8: 186–194.
24. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. *Genome Res* 1998, 8: 175–185.
25. Izmailov A, Goloubentzev D, Jin C, Sunay S, Wisco V, Yager TD: A general approach to the analysis of errors and failure modes in the base-calling function in automated fluorescent DNA sequencing. *Electrophoresis* 2002, 23: 2720–2728. doi:10.1002/1522-2683(200208)23:16<2720::AID-ELPS2720>3.0.CO;2-Z
26. Izmailov A, Yager TD, Zaleski H, Darash S: Improvement of base-calling in multilane automated DNA sequencing by use of electrophoretic calibration standards, data linearization, and trace alignment. *Electrophoresis* 2001, 22: 1906–1914. doi:10.1002/1522-2683(200106)22:10<1906::AID-ELPS1906>3.0.CO;2-5
27. Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP, *et al*.: Rapid whole-genome mutational profiling using next-generation sequencing technologies. *Genome Res* 2008, 18: 1638–1642. doi:10.1101/gr.077776.108

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.