DNA Molecule Classification Using Feature Primitives
 Raja Tanveer Iqbal^{1},
 Matthew Landry^{2} and
 Stephen Winters-Hilt^{2, 3}
https://doi.org/10.1186/1471-2105-7-S2-S15
© Iqbal et al.; licensee BioMed Central Ltd. 2006
Published: 26 September 2006
Abstract
Background
We present a novel strategy for classification of DNA molecules using measurements from an alpha-Hemolysin channel detector. The proposed approach provides excellent classification performance for five different DNA hairpins that differ in only one basepair. For multiclass DNA classification problems, practitioners usually adopt approaches based on decision trees of binary classifiers. Finding the best tree topology requires exploring all possible topologies and is computationally prohibitive. We propose a computational framework based on feature primitives that eliminates the need for a decision tree of binary classifiers. In the first phase, we generate a pool of weak features from nanopore blockade current measurements using HMM analysis, principal component analysis, and various wavelet filters. In the next phase, feature selection is performed using AdaBoost, which yields an ensemble of weak learners of various types built from the feature primitives.
Results and Conclusion
We show that our technique, despite its inherent simplicity, provides performance comparable to recent multiclass DNA molecule classification results. Unlike the approach presented by Winters-Hilt et al., where weaker data are dropped to obtain better classification, the proposed approach achieves comparable classification accuracy without any rejection of weak data. A weakness of this approach, on the other hand, is the very "hands-on" tuning and feature selection required to obtain good generalization. Simply put, this method obtains a more informed set of features and provides better results for that reason. The strength of this approach appears to be its ability to identify strong features, an area where further results are actively being sought.
Background
Nanopore Detectors: Experimental Setup
The nine-basepair hairpin molecules examined share an eight-basepair hairpin core sequence, to which one of the four permutations of Watson-Crick basepairs that may exist at the blunt-end terminus is attached, i.e. 5'-GC-3', 5'-CG-3', 5'-TA-3', and 5'-AT-3'. These are denoted by 9GC, 9CG, 9TA, and 9AT. The sequence of the 9CG hairpin is 5'-CTTCGAACG TTTT CGTTCGAAG-3', where the basepairing stem regions flank the central TTTT loop. An eight-basepair DNA hairpin with a 5'-GC-3' terminus was also tested. This control molecule is denoted by 8GC. The DNA oligonucleotides were synthesized using an ABI 392 Synthesizer, purified by PAGE, and stored at -70°C in TE buffer. The prediction that each hairpin would adopt one basepaired structure was tested and confirmed using the DNA mfold server [7].
AdaBoost: An Overview
AdaBoost [9–11] is an iterative scheme for obtaining a weighted ensemble of weak learners. The basic idea is that rules of thumb can be combined into an ensemble whose joint decision rule performs well on the training set. Successive component classifiers are trained on the subset of the training data that is most informative. AdaBoost learns a sequence of weak classifiers and then boosts them, via a linear combination, into a single strong classifier. The input to the algorithm is a training set {(x_1, y_1), ..., (x_N, y_N)}, where y_i ∈ Y = {-1, +1} is the correct label of instance x_i ∈ X and N is the number of training examples in the data set. A weak learning algorithm is repeatedly called in a series of rounds t = 1, ..., T with different weight distributions D_t on the training data; the weight associated with training example i at round t is denoted D_t(i). The sampling weights are initially set equal, i.e. a uniform sampling distribution is assumed. In the t-th iteration, a classifier is learned from the training examples and a classifier with error ε_t < 0.5 is selected. In each iteration the weights of misclassified examples are increased, so that these examples receive more attention in subsequent iterations. AdaBoost is outlined in Algorithm 1 below. It is interesting to note that α_t measures the importance assigned to the hypothesis h_t, and it grows as the training error ε_t shrinks. The final classification decision H on a test point x is a weighted majority vote of the weak hypotheses.
Algorithm 1. The AdaBoost algorithm
Input: S = {(x_1, y_1), ..., (x_N, y_N)} where x_i ∈ X and y_i ∈ Y = {-1, +1}
Initialization: D_1(i) = 1/N, for all i = 1, ..., N
for t = 1, ..., T:
 1. Train weak learners with respect to the weighted sample set {S, D_t} and obtain hypothesis h_t: X → Y.
 2. Obtain the error rate ε_t of h_t over the distribution D_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i).
 3. Set α_t = (1/2) ln((1 − ε_t)/ε_t).
 4. Update the weights: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i))/Z_t, where Z_t is the normalizing factor chosen so that D_{t+1} is a distribution.
 5. Break if ε_t = 0 or ε_t ≥ 1/2.
end
Output: H(x) = sign(Σ^{T}_{t = 1} α_t h_t(x))
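Algorithm 1 can be sketched directly in NumPy. The decision-stump weak learner, the exhaustive threshold search, and the small `eps` guard against a perfect stump are illustrative assumptions here, not the paper's implementation:

```python
import numpy as np

def adaboost(X, y, T=50):
    """AdaBoost (Algorithm 1) with decision-stump weak learners.
    X: (N, d) feature matrix; y: labels in {-1, +1}."""
    N, d = X.shape
    D = np.full(N, 1.0 / N)              # D_1(i) = 1/N
    ensemble = []                        # (alpha_t, feature, threshold, polarity)
    for t in range(T):
        best, best_err = None, np.inf
        # Exhaustively pick the stump with the lowest weighted error eps_t.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] <= thr, 1, -1)
                    err = D[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thr, pol)
        if best_err >= 0.5:              # weak learner no better than chance: stop
            break
        eps = max(best_err, 1e-10)       # guard against log(0) for a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        j, thr, pol = best
        pred = pol * np.where(X[:, j] <= thr, 1, -1)
        D *= np.exp(-alpha * y * pred)   # raise weights of misclassified examples
        D /= D.sum()                     # normalize by Z_t so D_{t+1} is a distribution
        ensemble.append((alpha, j, thr, pol))
        if best_err == 0:                # training error already zero: stop
            break
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    score = np.zeros(X.shape[0])
    for alpha, j, thr, pol in ensemble:
        score += alpha * pol * np.where(X[:, j] <= thr, 1, -1)
    return np.sign(score)
```

The quadratic stump search is fine for the feature-pool sizes discussed here; a production implementation would sort each feature column once instead.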
DNA Molecule Classification Using Boosted Naive Bayes
Given n classes and an input x, naive Bayes assigns to x the class label ω_i of the class i for which the posterior probability
p(ω_i | x) = p(x | ω_i) p(ω_i) / Σ^{n}_{j = 1} p(x | ω_j) p(ω_j)
is maximum. The prior probability p(ω_i) represents the fraction of examples in the dataset that belong to class ω_i, and n is the total number of possible class labels. The class-conditional probability p(x | ω_i) is computed under the assumption that the features in the dataset are independent, so that
p(x | ω_i) = ∏^{m}_{j = 1} p(x_j | ω_i),
where x_j is the j-th of the m features of x.
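The text does not specify the form of the per-feature densities p(x_j | ω_i); a minimal NumPy sketch assuming Gaussian class-conditionals (the function names and the variance floor are illustrative choices, not from the paper):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors p(w_i) and per-feature Gaussian parameters per class,
    under the naive (feature-independence) assumption."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # prior: fraction of examples in class c
                     Xc.mean(axis=0),           # per-feature mean
                     Xc.var(axis=0) + 1e-9)     # per-feature variance, floored for stability
    return params

def posterior(params, x):
    """p(w_i | x) = p(x | w_i) p(w_i) / sum_j p(x | w_j) p(w_j),
    with p(x | w_i) = prod_j p(x_j | w_i), computed in log space."""
    log_post = {}
    for c, (prior, mu, var) in params.items():
        log_like = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_post[c] = np.log(prior) + log_like
    # subtract the max before exponentiating to avoid underflow, then normalize
    m = max(log_post.values())
    unnorm = {c: np.exp(v - m) for c, v in log_post.items()}
    Z = sum(unnorm.values())
    return {c: v / Z for c, v in unnorm.items()}
```

Classification then simply returns the class with the largest posterior.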
Results of one-against-rest approach on principal components obtained from the HMM projections.

Class 1 | Class 2 (rest)     | Sensitivity | Specificity
8GC     | 9AT, 9CG, 9GC, 9TA | 0.9549      | 0.9758
9AT     | 8GC, 9CG, 9GC, 9TA | 0.9295      | 0.9161
9CG     | 8GC, 9AT, 9GC, 9TA | 0.8143      | 0.9434
9GC     | 8GC, 9AT, 9CG, 9TA | 0.8156      | 0.9452
9TA     | 8GC, 9AT, 9CG, 9GC | 0.8501      | 0.9902
Results using all-pairs approach on principal components obtained from the HMM projections (row vs. column; entries are percent).

    | 9AT                    | 9CG                    | 9GC                    | 9TA
8GC | Sens 97.30, Spec 98.00 | Sens 97.30, Spec 98.25 | Sens 98.85, Spec 97.95 | Sens 97.15, Spec 98.15
9AT | x                      | Sens 96.50, Spec 98.50 | Sens 99.25, Spec 98.75 | Sens 96.40, Spec 94.30
9CG | x                      | x                      | Sens 98.20, Spec 93.80 | Sens 96.40, Spec 94.30
9GC | x                      | x                      | x                      | Sens 95.70, Spec 95.15
In order to obtain a single classifier for all five molecules, a decision tree structure is used, where each node is a binary classifier that splits the input into two groups. This process is repeated until a single class label for the input has been found. As discussed in earlier sections, this approach is computationally expensive, since choosing the right topology for the decision tree requires empirically evaluating all possible topologies (for the datasets examined in [6], however, linear trees were found to be optimal with rejection of weak data). In the following section we discuss a framework that eliminates the need for a decision tree structure in multiclass classification.
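As a point of contrast with the tree of binary classifiers, pairwise results such as those in the all-pairs table can also be combined without any tree topology, by a simple majority vote over the pairwise winners. A hypothetical sketch (the `classifiers` mapping and the voting scheme are assumptions for illustration, not the paper's method):

```python
def all_pairs_predict(classifiers, classes, x):
    """Combine trained pairwise binary classifiers by majority vote,
    avoiding any decision-tree topology.
    classifiers[(a, b)] is a function returning +1 for class a, -1 for class b."""
    votes = {c: 0 for c in classes}
    for (a, b), clf in classifiers.items():
        votes[a if clf(x) > 0 else b] += 1   # winner of the pair gets one vote
    return max(votes, key=votes.get)         # class with the most pairwise wins
```

With k classes this needs k(k-1)/2 binary classifiers, but no search over tree topologies.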
DNA Molecule Classification Using Boosting Over Stumps
Results and Discussion
We applied several rounds of AdaBoost on data sets consisting of the following feature sets:

Data set I: HMM projections.

Data set II: Data set I augmented with the first 50 principal components obtained from the HMM projections, together with approximation and detail coefficients obtained using a Haar filter.

Data set III: Data set II augmented with approximation and detail coefficients obtained using second- and tenth-order Daubechies wavelet filters.

Data set IV: Data set III augmented with approximation and detail coefficients obtained using second- and tenth-order Symlets wavelet filters.
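The Haar approximation and detail coefficients used in Data set II can be computed with a single-level transform. A NumPy-only sketch, with function names chosen for illustration (the Daubechies and Symlet coefficients of Data sets III and IV would be obtained analogously with a wavelet library such as PyWavelets):

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar wavelet transform: pairwise averages
    (approximation) and differences (detail), scaled by 1/sqrt(2)."""
    s = np.asarray(signal, dtype=float)
    even, odd = s[0::2], s[1::2]
    cA = (even + odd) / np.sqrt(2)   # approximation coefficients
    cD = (even - odd) / np.sqrt(2)   # detail coefficients
    return cA, cD

def haar_features(signal):
    """Concatenate approximation and detail coefficients into one
    feature vector, in the spirit of the Data set II enhancement."""
    cA, cD = haar_dwt(signal)
    return np.concatenate([cA, cD])
```

For an even-length blockade-current trace of length n this yields an n-dimensional feature vector (n/2 approximation plus n/2 detail coefficients).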
References
1. Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer DW: Microsecond time-scale discrimination among polycytidylic acid, polyadenylic acid and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999, 77(6):3227–3233.
2. Kasianowicz J, Brandin E, Deamer DW: Characterization of individual polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences 1996, 93(24):13770–13773. doi:10.1073/pnas.93.24.13770
3. Meller A, Nivon L, Brandin E, Golovchenko J, Branton D: Rapid nanopore discrimination between single polynucleotide molecules. Proceedings of the National Academy of Sciences 2000, 97(3):1079–1084. doi:10.1073/pnas.97.3.1079
4. Meller A, Nivon L, Branton D: Voltage-driven DNA translocations through a nanopore. Physical Review Letters 2001, 86(15):3435–3438. doi:10.1103/PhysRevLett.86.3435
5. Vercoutere W, Winters-Hilt S, Olsen H, Deamer D, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nature Biotechnology 2001, 19(3):248–252. doi:10.1038/85696
6. Winters-Hilt S, Vercoutere W, DeGuzman VS, Deamer D, Akeson M, Haussler D: Highly accurate classification of Watson-Crick basepairs on termini of single DNA molecules. Biophysical Journal 2003, 84:967–976.
7. SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences 1998, 95(4):1460–1465. doi:10.1073/pnas.95.4.1460
8. Song L, Hobaugh M, Shustak C, Cheley S, Bayley H, Gouaux JE: Structure of staphylococcal alpha-hemolysin, a heptameric transmembrane pore. Science 1996, 274:1859–1866. doi:10.1126/science.274.5294.1859
9. Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997, 55:119–139. doi:10.1006/jcss.1997.1504
10. Freund Y, Schapire RE, Bartlett P, Lee WS: Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 1998, 26:1651–1686. doi:10.1214/aos/1024691352
11. Schapire RE, Singer Y: Improved boosting using confidence-rated predictions. Machine Learning 1999, 37(3):297–336. doi:10.1023/A:1007614523901
12. Duda R, Hart P, Stork D: Pattern Classification. Second edition. John Wiley and Sons; 2001.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.