Volume 7 Supplement 2
Third Annual MCBIOS Conference. Bioinformatics: A Calculated Discovery
Support Vector Machine Implementations for Classification & Clustering
 Stephen Winters-Hilt^{1, 2},
 Anil Yelundur^{1},
 Charlie McChesney^{1} and
 Matthew Landry^{1}
DOI: 10.1186/1471-2105-7-S2-S4
© Winters-Hilt et al; licensee BioMed Central Ltd. 2006
Published: 26 September 2006
Abstract
Background
We describe Support Vector Machine (SVM) applications to classification and clustering of channel current data. SVMs are variational-calculus-based methods constrained to have structural risk minimization (SRM), i.e., they provide noise-tolerant solutions for pattern recognition. The SVM approach encapsulates a significant amount of model-fitting information in the choice of its kernel. In work thus far, novel information-theoretic kernels have been successfully employed for notably better performance over standard kernels. Currently there are two approaches for implementing multiclass SVMs. One, called external multiclass, arranges several binary classifiers as a decision tree such that they perform a single-class decision-making function, with each leaf corresponding to a unique class. The second approach, internal multiclass, involves solving a single optimization problem corresponding to the entire data set (with multiple hyperplanes).
Results
Each SVM approach encapsulates a significant amount of model-fitting information in its choice of kernel. In work thus far, novel information-theoretic kernels were successfully employed for notably better performance over standard kernels. Two SVM approaches to multiclass discrimination are described: (1) internal multiclass (with a single optimization), and (2) external multiclass (using an optimized decision tree). We describe benefits of the internal-SVM approach, along with further refinements to the internal-multiclass SVM algorithms that offer significant improvement in training time without sacrificing accuracy. In situations where the data is not clearly separable, making for poor discrimination, signal clustering is used to provide robust and useful information – to this end, novel SVM-based clustering methods are also described. As with classification, there are internal and external SVM clustering algorithms, both of which are briefly described.
Background
Support Vector Machine
SVMs use variational methods in their construction and encapsulate a significant amount of discriminatory information in their choice of kernel. In reference [3], information-theoretic kernels provided notably better performance than standard kernels. Feature extraction was designed to arrive at probability vectors (i.e., discrete probability distributions) on a predefined, and complete, space of possibilities. (The different blockade levels, and their frequencies, the emission probabilities, and the transition probabilities, for example.) This turns out to be a very general formulation, wherein feature extraction makes use of signal decomposition into a complete set of separable states that can be interpreted or represented as a probability vector. A probability-vector formulation also provides a straightforward hand-off to the SVM classifiers, since all feature vectors have the same length with such an approach. What this means for the SVM, however, is that geometric notions of distance are no longer the best measure for comparing feature vectors. For probability vectors (i.e., discrete distributions), the best measures of similarity are the various information-theoretic divergences: Kullback-Leibler, Renyi, etc. By symmetrizing over the arguments of those divergences, a rich source of kernels is obtained that works well with the types of probabilistic data obtained.
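As an illustration of such a construction, a symmetrized Kullback-Leibler divergence between two probability vectors can be exponentiated to give a similarity value. The sketch below is a minimal Python illustration, not the exact kernel of [3]; the smoothing constant `eps` and width `gamma` are assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p||q) between discrete distributions.
    eps guards against zero entries (an assumption, not from the paper)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetrized_kl_kernel(p, q, gamma=1.0):
    """Exponentiate the symmetrized divergence to obtain a kernel value."""
    d = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return float(np.exp(-gamma * d))
```

Identical distributions give a kernel value of 1, and the value decays toward 0 as the distributions diverge.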
The SVM discriminators are trained by solving their KKT relations using the Sequential Minimal Optimization (SMO) procedure [4]. A chunking [5, 6] variant of SMO is also employed to manage the large training task at each SVM node. The multiclass SVM training generally involves thousands of blockade signatures for each signal class. The data cleaning needed on the training data is accomplished by an extra SVM training round.
Binary Support Vector Machines
Binary Support Vector Machines (SVMs) are based on a decision-hyperplane heuristic that incorporates structural risk minimization by attempting to impose a training-instance void, or "margin," around the decision hyperplane [1].
Feature vectors are denoted by x_{ik}, where index i labels the M feature vectors (1 ≤ i ≤ M) and index k labels the N feature-vector components (1 ≤ k ≤ N). For the binary SVM, labeling of training data is done using a label variable y_{i} = ±1 (with sign according to whether the training instance was from the positive or negative class). For hyperplane separability, elements of the training set must satisfy the following conditions: w_{β}x_{iβ} − b ≥ +1 for i such that y_{i} = +1, and w_{β}x_{iβ} − b ≤ −1 for y_{i} = −1, for some values of the coefficients w_{1}, ..., w_{N}, and b (using the convention of implied sum on repeated Greek indices). This can be written more concisely as: y_{i}(w_{β}x_{iβ} − b) − 1 ≥ 0. Data points that satisfy the equality in the above are known as "support vectors" (or "active constraints").
Once training is complete, discrimination is based solely on position relative to the discriminating hyperplane: w_{β}x_{iβ} − b = 0. The boundary hyperplanes on the two classes of data are separated by a distance 2/‖w‖, known as the "margin," where ‖w‖^{2} = w_{β}w_{β}. By increasing the margin between the separated data as much as possible the optimal separating hyperplane is obtained. In the usual SVM formulation, the goal to maximize ‖w‖^{−1} is restated as the goal to minimize ‖w‖^{2}. The Lagrangian variational formulation then selects an optimum defined at a saddle point of L(w,b;α) = (w_{β}w_{β})/2 − α_{γ}y_{γ}(w_{β}x_{γβ} − b) + α_{0}, where α_{0} = Σ_{γ}α_{γ}, α_{γ} ≥ 0 (1 ≤ γ ≤ M). The saddle point is obtained by minimizing with respect to {w_{1}, ..., w_{N}, b} and maximizing with respect to {α_{1}, ..., α_{M}}. If y_{i}(w_{β}x_{iβ} − b) − 1 ≥ 0, then maximization on α_{i} is achieved for α_{i} = 0. If y_{i}(w_{β}x_{iβ} − b) − 1 = 0, then there is no constraint on α_{i}. If y_{i}(w_{β}x_{iβ} − b) − 1 < 0, there is a constraint violation, and α_{i} → ∞. If absolute separability is possible, the last case will eventually be eliminated for all α_{i}; otherwise it is natural to limit the size of α_{i} by some constant upper bound, i.e., max(α_{i}) = C, for all i. This is equivalent to another set of inequality constraints with α_{i} ≤ C. Introducing sets of Lagrange multipliers, ξ_{γ} and μ_{γ} (1 ≤ γ ≤ M), to achieve this, the Lagrangian becomes:
L(w,b;α,ξ,μ) = (w_{β}w_{β})/2 − α_{γ}[y_{γ}(w_{β}x_{γβ} − b) + ξ_{γ}] + α_{0} + ξ_{0}C − μ_{γ}ξ_{γ}, where ξ_{0} = Σ_{γ}ξ_{γ}, α_{0} = Σ_{γ}α_{γ}, and α_{γ} ≥ 0 and ξ_{γ} ≥ 0 (1 ≤ γ ≤ M).
At the variational minimum on the {w_{1}, ..., w_{N}, b} variables, w_{β} = α_{γ}y_{γ}x_{γβ}, and the Lagrangian simplifies to: L(α) = α_{0} − (α_{δ}y_{δ}x_{δβ})(α_{γ}y_{γ}x_{γβ})/2, with 0 ≤ α_{γ} ≤ C (1 ≤ γ ≤ M) and α_{γ}y_{γ} = 0, where only the variations that maximize in terms of the α_{γ} remain (known as the Wolfe transformation). In this form the computational task can be greatly simplified. By introducing an expression for the discriminating hyperplane: f_{i} = w_{β}x_{iβ} − b = α_{γ}y_{γ}x_{γβ}x_{iβ} − b, the variational solution for L(α) reduces to the following set of relations (known as the Karush-Kuhn-Tucker, or KKT, relations): (i) α_{i} = 0 ⇔ y_{i}f_{i} ≥ 1, (ii) 0 < α_{i} < C ⇔ y_{i}f_{i} = 1, and (iii) α_{i} = C ⇔ y_{i}f_{i} ≤ 1. When the KKT relations are satisfied for all of the α_{γ} (with α_{γ}y_{γ} = 0 maintained) the solution is achieved. (The constraint α_{γ}y_{γ} = 0 is satisfied for the initial choice of multipliers by setting the α's associated with the positive training instances to 1/N^{(+)} and the α's associated with the negatives to 1/N^{(−)}, where N^{(+)} is the number of positives and N^{(−)} is the number of negatives.) Once the Wolfe transformation is performed it is apparent that the training data (support vectors in particular, KKT class (ii) above) enter into the Lagrangian solely via the inner product x_{iβ}x_{jβ}. Likewise, the discriminator f_{i}, and the KKT relations, are also dependent on the data solely via the x_{iβ}x_{jβ} inner product.
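The three KKT relations translate directly into a numerical test for violators, which serves as a stopping criterion when training. A minimal sketch, assuming a tolerance parameter `tol` that the text does not specify:

```python
import numpy as np

def kkt_violations(alpha, y, f, C, tol=1e-3):
    """Indices violating the KKT relations: (i) alpha=0 -> y*f >= 1,
    (ii) 0 < alpha < C -> y*f = 1, (iii) alpha=C -> y*f <= 1."""
    yf = y * f
    viol = []
    for i, a in enumerate(alpha):
        if a < tol:                       # case (i)
            if yf[i] < 1.0 - tol:
                viol.append(i)
        elif a > C - tol:                 # case (iii)
            if yf[i] > 1.0 + tol:
                viol.append(i)
        elif abs(yf[i] - 1.0) > tol:      # case (ii)
            viol.append(i)
    return viol
```

When the returned list is empty (and α_{γ}y_{γ} = 0 holds), training has converged.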
Generalizations of the SVM formulation to data-dependent inner products other than x_{iβ}x_{jβ} are possible and are usually formulated in terms of the family of symmetric positive-definite functions (reproducing kernels) satisfying Mercer's conditions [1].
Binary SVM Discriminator Implementation
The SVM discriminators are trained by solving their KKT relations using the Sequential Minimal Optimization (SMO) procedure of [4]. The method described here follows the description of [4] and begins by selecting a pair of Lagrange multipliers, {α_{1},α_{2}}, where at least one of the multipliers has a violation of its associated KKT relations (for simplicity it is assumed in what follows that the multipliers selected are those associated with the first and second feature vectors: {x_{1},x_{2}}). The SMO procedure then "freezes" variations in all but the two selected Lagrange multipliers, permitting much of the computation to be circumvented by use of analytical reductions:
L(α_{1},α_{2};α_{β'≥3}) = α_{1} + α_{2} − (α_{1}^{2}K_{11} + α_{2}^{2}K_{22} + 2α_{1}α_{2}y_{1}y_{2}K_{12})/2 − α_{1}y_{1}v_{1} − α_{2}y_{2}v_{2} + α_{β'}U_{β'} − (α_{β'}α_{γ'}y_{β'}y_{γ'}K_{β'γ'})/2,
with β',γ' ≥ 3, and where K_{ij} ≡ K(x_{i}, x_{j}), and v_{i} ≡ α_{β'}y_{β'}K_{iβ'} with β' ≥ 3. Due to the constraint α_{β}y_{β} = 0, we have the relation: α_{1} + sα_{2} = γ, where γ ≡ −y_{1}α_{β'}y_{β'} with β' ≥ 3 and s ≡ y_{1}y_{2}. Substituting the constraint to eliminate references to α_{1}, and performing the variation on α_{2}: ∂L(α_{2};α_{β'≥3})/∂α_{2} = (1 − s) + ηα_{2} + sγ(K_{11} − K_{12}) + sy_{1}v_{1} − y_{2}v_{2}, where η ≡ (2K_{12} − K_{11} − K_{22}). Since v_{i} can be rewritten as v_{i} = w_{β}x_{iβ} − α_{1}y_{1}K_{i1} − α_{2}y_{2}K_{i2}, the variational maximum ∂L(α_{2};α_{β'≥3})/∂α_{2} = 0 leads to the following update rule:
α_{2}^{new} = α_{2}^{old} − y_{2}((w_{β}x_{1β} − y_{1}) − (w_{β}x_{2β} − y_{2}))/η.
Once α_{2}^{new} is obtained, the constraint α_{2}^{new} ≤ C must be re-verified in conjunction with the α_{β}y_{β} = 0 constraint. If the L(α_{2};α_{β'≥3}) maximization leads to an α_{2}^{new} that grows too large, the new α_{2} must be "clipped" to the maximum value satisfying the constraints. For example, if y_{1} ≠ y_{2}, then increases in α_{2} are matched by increases in α_{1}. So, depending on whether α_{2} or α_{1} is nearer its maximum of C, we have max(α_{2}) = argmin{α_{2} + (C − α_{2}); α_{2} + (C − α_{1})}. Similar arguments provide the following boundary conditions: (i) if s = −1, max(α_{2}) = argmin{C; C + α_{2} − α_{1}}, and min(α_{2}) = argmax{0; α_{2} − α_{1}}, and (ii) if s = +1, max(α_{2}) = argmin{C; α_{2} + α_{1}}, and min(α_{2}) = argmax{0; α_{2} + α_{1} − C}. In terms of the new α_{2}^{new, clipped}, clipped as indicated above if necessary, the new α_{1} becomes:
α_{1}^{new} = α_{1}^{old} + s(α_{2}^{old} − α_{2}^{new, clipped}),
where s ≡ y_{1}y_{2} as before. After the new α_{1} and α_{2} values are obtained there still remains the task of obtaining the new b value. If the new α_{1} is not "clipped" then the update must satisfy the non-boundary KKT relation: y_{1}f(x_{1}) = 1, i.e., f^{new}(x_{1}) − y_{1} = 0. By relating f^{new} to f^{old} the following update on b is obtained:
b^{new1} = b + (f^{old}(x_{1}) − y_{1}) + y_{1}(α_{1}^{new} − α_{1}^{old})K_{11} + y_{2}(α_{2}^{new, clipped} − α_{2}^{old})K_{12}.
If α_{1} is clipped but α_{2} is not, the above argument holds for the α_{2} multiplier and the new b is:
b^{new2} = b + (f^{old}(x_{2}) − y_{2}) + y_{2}(α_{2}^{new} − α_{2}^{old})K_{22} + y_{1}(α_{1}^{new, clipped} − α_{1}^{old})K_{12}.
If both α_{1} and α_{2} values are clipped then any of the b values between b^{new1} and b^{new2} is acceptable, and following the SMO convention, the new b is chosen to be:
b^{new} = (b^{new1} + b^{new2})/2.
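The steps above (the α_{2} update, clipping, the α_{1} update, and the bias update) can be collected into a single pair-update routine. The following is a sketch of one SMO step, not the authors' implementation; variable names and the degenerate-η guard are assumptions. The f(x) = w.x − b sign convention from above is used throughout:

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, b, K, C, eps=1e-12):
    """One SMO step on the multiplier pair (alpha_i, alpha_j); K is the
    full kernel (Gram) matrix."""
    f = K.dot(alpha * y) - b                    # f_k = sum_g alpha_g y_g K_gk - b
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    s = y[i] * y[j]
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]
    if eta > -eps:                              # degenerate direction: skip (guard added)
        return alpha, b
    aj = alpha[j] - y[j] * (E_i - E_j) / eta    # unconstrained optimum
    # clip to the feasible segment set by 0 <= alpha <= C and alpha_1 + s*alpha_2 = gamma
    if s < 0:
        lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    aj = min(max(aj, lo), hi)
    ai = alpha[i] + s * (alpha[j] - aj)
    # bias updates from the non-boundary KKT relation on each multiplier
    b1 = b + E_i + y[i] * (ai - alpha[i]) * K[i, i] + y[j] * (aj - alpha[j]) * K[i, j]
    b2 = b + E_j + y[j] * (aj - alpha[j]) * K[j, j] + y[i] * (ai - alpha[i]) * K[i, j]
    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = ai, aj
    if 0.0 < ai < C:
        b = b1
    elif 0.0 < aj < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)                     # SMO convention when both are clipped
    return new_alpha, b
```

Repeatedly applying this step to pairs containing KKT violators drives the multipliers to the solution.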
Multiclass SVM Methods
The SVM binary discriminator offers high performance and is very robust in the presence of noise. This allows a variety of reductionist multiclass approaches, where each reduction is a binary classification (for classifying cards by suit, for example, first classify as red or black, then as heart or diamond if red and spade or club if black). The SVM Decision Tree is one such approach, and a collection of them (an SVM Decision Forest) can be used to avoid problems with throughput biasing. Alternatively, the variational formalism can be modified to perform a multi-hyperplane optimization for a direct multiclass solution [7–9], and that is what is described next.
SVM-Internal Multiclass
In the formulation in [7], there are 'k' classes and hence 'k' linear decision functions – a description of their approach is given here. For a given input 'x', the output vector corresponds to the outputs of these decision functions. The class of the largest element of the output vector gives the class of 'x'.
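This argmax decision rule, and the misclassification slack defined next, can be sketched as follows; `W` and `b` are hypothetical names for the stacked w_m vectors and b_m offsets:

```python
import numpy as np

def multiclass_decision(x, W, b):
    """Evaluate f_m(x) = w_m.x + b_m for every class m; predict the argmax."""
    scores = W.dot(x) + b
    return int(np.argmax(scores)), scores

def misclassification_slack(scores, y_i):
    """zeta_i = max_m{ f_m(x_i) + 1 - delta_i^m } - f_{y_i}(x_i);
    zero exactly when class y_i wins by a margin of at least 1."""
    delta = np.zeros(len(scores))
    delta[y_i] = 1.0
    return float(np.max(scores + 1.0 - delta) - scores[y_i])
```

A correctly classified point with margin at least 1 contributes zero slack; a misclassified point contributes slack that grows with the size of the error.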
Each decision function is given by: f_{m}(x) = w_{m}.x + b_{m} for all m = (1, 2, ..., k). If y_{i} is the class of the input x_{i}, then for each input data point the misclassification error is defined as: max_{m}{f_{m}(x_{i}) + 1 − δ_{i}^{m}} − f_{yi}(x_{i}), where δ_{i}^{m} is 1 if m = y_{i} and 0 if m ≠ y_{i}. We add the slack variable ζ_{i}, where ζ_{i} ≥ 0 for all i, set equal to the misclassification error: max_{m}{f_{m}(x_{i}) + 1 − δ_{i}^{m}} − f_{yi}(x_{i}) = ζ_{i}, hence f_{yi}(x_{i}) − f_{m}(x_{i}) + δ_{i}^{m} ≥ 1 − ζ_{i} for all i, m. To minimize this classification error and maximize the distance between the hyperplanes (structural risk minimization) we have the following formulation:
Minimize: ∑_{i}ζ_{i} + β(1/2)∑_{m}w_{m}^{T}w_{m} + (1/2)∑_{m}b_{m}^{2},
where β > 0 is defined as a regularization constant.
Constraint: w_{yi}.x_{i} + b_{yi} − w_{m}.x_{i} − b_{m} − 1 + ζ_{i} + δ_{i}^{m} ≥ 0 for all i, m
Note: the term (1/2)∑_{m}b_{m}^{2} is added for decoupling, 1/β = C, and m = y_{i} in the above constraint is consistent with ζ_{i} ≥ 0. The Lagrangian is:
L(w,b,ζ) = ∑_{i}ζ_{i} + β(1/2)∑_{m}w_{m}^{T}w_{m} + (1/2)∑_{m}b_{m}^{2} − ∑_{i}∑_{m}α_{i}^{m}(w_{yi}.x_{i} + b_{yi} − w_{m}.x_{i} − b_{m} − 1 + ζ_{i} + δ_{i}^{m})
where all α_{i}^{m} are positive Lagrange multipliers. Now taking partial derivatives of the Lagrangian and equating them to zero (saddle-point solution): ∂L/∂ζ_{i} = 1 − ∑_{m}α_{i}^{m} = 0, which implies ∑_{m}α_{i}^{m} = 1 for all i. ∂L/∂b_{m} = b_{m} + ∑_{i}α_{i}^{m} − ∑_{i}δ_{i}^{m} = 0 for all m, hence b_{m} = ∑_{i}(δ_{i}^{m} − α_{i}^{m}). Similarly: ∂L/∂w_{m} = βw_{m} + ∑_{i}α_{i}^{m}x_{i} − ∑_{i}δ_{i}^{m}x_{i} = 0 for all m, hence w_{m} = (1/β)[∑_{i}(δ_{i}^{m} − α_{i}^{m})x_{i}]. Substituting these back into the Lagrangian and simplifying yields the dual formulation:
Maximize: −1/2∑_{i,j}∑_{m}(δ_{i}^{m} − α_{i}^{m})(δ_{j}^{m} − α_{j}^{m})(K_{ij} + β) − β∑_{i,m}δ_{i}^{m}α_{i}^{m}
Constraint: 0 ≤ α_{i}^{m}, ∑_{m}α_{i}^{m} = 1, i = 1...l; m = 1...k
where K_{ij} = x_{i}.x_{j}, which admits the usual kernel generalization. In vector notation:
Maximize: −1/2∑_{i,j}(Δ_{yi} − A_{i}).(Δ_{yj} − A_{j})(K_{ij} + β) − β∑_{i}Δ_{yi}.A_{i}
Constraint: 0 ≤ A_{i}, A_{i}.1 = 1, i = 1...l
Let τ_{i} = Δ_{yi} − A_{i}. Hence, after ignoring the constant, we maximize: −1/2∑_{i,j}τ_{i}.τ_{j}(K_{ij} + β) + β∑_{i}Δ_{yi}.τ_{i}, subject to: τ_{i} ≤ Δ_{yi}, τ_{i}.1 = 0, i = 1...l. The dual is solved (i.e., the optimum values of all the τ's are determined) using the decomposition method. Written as a minimization:
Minimize: 1/2∑_{i,j,m}τ_{i}^{m}τ_{j}^{m}(K_{ij} + β) − β∑_{i,m}δ_{i}^{m}τ_{i}^{m}
Constraint: τ_{i} ≤ Δ_{yi}, τ_{i}.1 = 0, i = 1...l
The Lagrangian of the dual is:
L = 1/2∑_{i,j,m}τ_{i}^{m}τ_{j}^{m}(K_{ij} + β) − β∑_{i,m}δ_{i}^{m}τ_{i}^{m} − ∑_{i,m}u_{i}^{m}(δ_{i}^{m} − τ_{i}^{m}) − ∑_{i}v_{i}∑_{m}τ_{i}^{m}
Subject to u_{i}^{m} ≥ 0
We take the gradient of the Lagrangian with respect to τ_{i}^{m}:
∇_{τ_{i}^{m}}[L] = ∑_{j}τ_{j}^{m}(K_{ij} + β) − βδ_{i}^{m} + u_{i}^{m} − v_{i} = 0
Introducing f_{i}^{m} = ∑_{j}τ_{j}^{m}(K_{ij} + β) − βδ_{i}^{m}, the gradient condition becomes f_{i}^{m} + u_{i}^{m} − v_{i} = 0. By the KKT conditions we get two more equations:
u_{i}^{m}(δ_{i}^{m}  τ_{i}^{m}) = 0 and u_{i}^{m} ≥ 0
Case I: if δ_{i}^{m} = τ_{i}^{m}, then u_{i}^{m} ≥ 0, hence f_{i}^{m} ≤ v_{i}. Case II: if τ_{i}^{m} < δ_{i}^{m}, then u_{i}^{m} = 0, hence f_{i}^{m} = v_{i}. Note: there is at least one 'm' for each i such that τ_{i}^{m} < δ_{i}^{m} is satisfied.
Therefore combining Case I & II, we get:
max_{m}{f_{i}^{m}} ≤ v_{i} ≤ min_{m: τ_{i}^{m} < δ_{i}^{m}}{f_{i}^{m}},
or max_{m}{f_{i}^{m}} ≤ min_{m: τ_{i}^{m} < δ_{i}^{m}}{f_{i}^{m}},
or max_{m}{f_{i}^{m}} − min_{m: τ_{i}^{m} < δ_{i}^{m}}{f_{i}^{m}} ≤ ε.
Note: τ_{i}^{m} < δ_{i}^{m} implies that α_{i}^{m} > 0. Since ∑_{m}α_{i}^{m} = 1 for any i, each α_{i}^{m} can be treated as the probability that data point i belongs to class m. Hence we define the KKT violators as those i for which:
max_{m}{f_{i}^{m}} − min_{m: τ_{i}^{m} < δ_{i}^{m}}{f_{i}^{m}} > ε.
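This violator condition is simple to state in code. The sketch below assumes per-point arrays of f_{i}^{m}, τ_{i}^{m}, and δ_{i}^{m} (names hypothetical) and a tolerance ε:

```python
import numpy as np

def is_kkt_violator(f_i, tau_i, delta_i, eps=1e-3):
    """True when max_m f_i^m exceeds the minimum of f_i^m over the
    admissible set {m : tau_i^m < delta_i^m} by more than eps."""
    admissible = tau_i < delta_i          # equivalent to alpha_i^m > 0
    return float(np.max(f_i) - np.min(f_i[admissible])) > eps
```

The admissible set is never empty, since at least one 'm' with τ_{i}^{m} < δ_{i}^{m} exists for each i.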
Decomposition Method to Solve the Dual
Using the method in [7] to solve the dual, maximize:
Q(τ) = −1/2∑_{i,j}τ_{i}.τ_{j}(K_{ij} + β) + β∑_{i}Δ_{yi}.τ_{i}
Subject to: τ_{i} ≤ Δ_{yi}, τ_{i}.1 = 0, i = 1...l
Expanding in terms of a single vector τ_{p}:
Q_{p}(τ_{p}) = −1/2A_{p}(τ_{p}.τ_{p}) + B_{p}.τ_{p} + C_{p}
Where:
A_{p} = K_{pp} + β
B_{p} = βΔ_{yp} − ∑_{i≠p}τ_{i}(K_{ip} + β)
C_{p} = −1/2∑_{i,j≠p}τ_{i}.τ_{j}(K_{ij} + β) + β∑_{i≠p}Δ_{yi}.τ_{i}
Therefore, ignoring the constant term C_{p}, we have to minimize:
Q_{p}(τ_{p}) = 1/2A_{p}(τ_{p}.τ_{p}) − B_{p}.τ_{p}
Subject to: τ_{p} ≤ Δ_{yp} and τ_{p}.1 = 0
The above equation can also be written as:
Q_{p}(τ_{p}) = 1/2A_{p}(τ_{p} − B_{p}/A_{p}).(τ_{p} − B_{p}/A_{p}) − B_{p}.B_{p}/2A_{p}
Substitute v = (τ_{p} − B_{p}/A_{p}) and D = (Δ_{yp} − B_{p}/A_{p}) in the above equation. Hence, after ignoring the constant term B_{p}.B_{p}/2A_{p} and the multiplicative factor A_{p}, we have to minimize:
Q(v) = 1/2v.v = 1/2‖v‖^{2}
Subject to: v ≤ D and v.1 = D.1 − 1
The Lagrangian is given by:
L(v) = 1/2‖v‖^{2} − ∑_{m}ρ_{m}(D_{m} − v_{m}) − σ[∑_{m}(v_{m} − D_{m}) + 1]
Subject to: ρ_{m} ≥ 0
Hence ∂L/∂v_{m} = v_{m} + ρ_{m} − σ = 0. By the KKT conditions we have: ρ_{m}(D_{m} − v_{m}) = 0 and ρ_{m} ≥ 0, and also v_{m} ≤ D_{m}. Combining these inequalities, we have: v_{m} = min{D_{m}, σ}, so that ∑_{m}v_{m} = ∑_{m}min{D_{m}, σ} = ∑_{m}D_{m} − 1. This equation uniquely defines the σ that satisfies it, and that σ is the optimal solution of the quadratic optimization problem (refer to [7] for a formal proof).
Solve for σ: we have min{D_{m}, σ} + max{D_{m}, σ} = D_{m} + σ, hence ∑_{m}[D_{m} + σ − max{D_{m}, σ}] = ∑_{m}D_{m} − 1, or σ = (1/k)[∑_{m}max{D_{m}, σ} − 1]. We therefore find σ iteratively, via σ_{l+1} = (1/k)[∑_{m}max{D_{m}, σ_{l}} − 1], stopping when (σ_{l} − σ_{l+1})/σ_{l} ≤ tolerance. The initial value is set to σ_{1} = (1/k)[∑_{m}D_{m} − 1].
Update rule for τ: once we have σ, τ_{new}^{m} = v_{m} + B_{p}^{m}/(K_{pp} + β), or:
τ_{new}^{m} = v_{m} − f_{p}^{m}/(K_{pp} + β) + τ_{old}^{m}
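The fixed-point iteration for σ can be sketched as follows. This is an illustrative implementation under the stated stopping rule; the absolute-scale floor max(1, |σ|) in the convergence test is an added safeguard, not from the text:

```python
import numpy as np

def solve_sigma(D, tol=1e-8, max_iter=1000):
    """Iterate sigma_{l+1} = (1/k)[sum_m max(D_m, sigma_l) - 1] until the
    relative change is below tol; the fixed point satisfies
    sum_m min(D_m, sigma) = sum_m D_m - 1."""
    D = np.asarray(D, dtype=float)
    k = len(D)
    sigma = (D.sum() - 1.0) / k           # initial value sigma_1
    for _ in range(max_iter):
        new = (np.maximum(D, sigma).sum() - 1.0) / k
        if abs(new - sigma) <= tol * max(1.0, abs(sigma)):
            return new
        sigma = new
    return sigma
```

The returned σ then feeds the τ update rule above through v_{m} = min{D_{m}, σ}.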
SVM-Internal Clustering
Let {x_{i}} be a data set of N points in R^{d}. Using a nonlinear transformation φ, we transform x to a high-dimensional space (the kernel space) and look for the smallest enclosing sphere of radius R. Hence we have: ‖φ(x_{j}) − a‖^{2} ≤ R^{2} for all j = 1, ..., N, where a is the center of the sphere. Soft constraints are incorporated by adding slack variables ζ_{j}:
‖φ(x_{j}) − a‖^{2} ≤ R^{2} + ζ_{j} for all j = 1, ..., N
Subject to: ζ_{j} ≥ 0
We introduce the Lagrangian as:
L = R^{2} − ∑_{j}β_{j}(R^{2} + ζ_{j} − ‖φ(x_{j}) − a‖^{2}) − ∑_{j}ζ_{j}μ_{j} + C∑_{j}ζ_{j}
Subject to: β_{j} ≥ 0, μ_{j} ≥ 0,
where C is the cost for outliers and hence C∑_{j}ζ_{j} is a penalty term. Setting to zero the derivatives of L with respect to R, a, and ζ, we have: ∑_{j}β_{j} = 1; a = ∑_{j}β_{j}φ(x_{j}); and β_{j} = C − μ_{j}.
Substituting the above equations into the Lagrangian, we have the dual formalism as:
W = 1 − ∑_{i,j}β_{i}β_{j}K_{ij}, where 0 ≤ β_{i} ≤ C and K_{ij} = exp(−‖x_{i} − x_{j}‖^{2}/2σ^{2})
Subject to: ∑_{i}β_{i} = 1
By the KKT conditions we have: ζ_{j}μ_{j} = 0 and β_{j}(R^{2} + ζ_{j} − ‖φ(x_{j}) − a‖^{2}) = 0.
In the kernel space, if a data point x_{j} has ζ_{j} > 0, then β_{j} = C and the point lies outside of the sphere, i.e., R^{2} < ‖φ(x_{j}) − a‖^{2}; such a point is called a bounded support vector (BSV). Similarly, if ζ_{j} = 0 and 0 < β_{j} < C, then the point lies on the surface of the sphere, i.e., R^{2} = ‖φ(x_{j}) − a‖^{2}; such a point is a support vector (SV). If ζ_{j} = 0 and β_{j} = 0, then R^{2} > ‖φ(x_{j}) − a‖^{2} and the point is enclosed within the sphere.
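The BSV/SV/interior trichotomy can be checked numerically. The sketch below assumes a Gaussian kernel (so K(x, x) = 1) and hypothetical helper names; it labels points by comparing ‖φ(x) − a‖² with R², whereas during training BSVs are equivalently identified by β_{j} = C:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma2):
    """K_ij = exp(-||x_i - y_j||^2 / (2 sigma^2)); note K(x, x) = 1."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kernel_radius2(x, X, beta, sigma2):
    """||phi(x) - a||^2 with a = sum_j beta_j phi(x_j)."""
    kx = gaussian_kernel(x[None, :], X, sigma2)[0]
    KXX = gaussian_kernel(X, X, sigma2)
    return float(1.0 - 2.0 * beta.dot(kx) + beta.dot(KXX).dot(beta))

def classify_point(x, X, beta, R2, sigma2, tol=1e-6):
    """Label a point by comparing ||phi(x) - a||^2 with R^2."""
    d2 = kernel_radius2(x, X, beta, sigma2)
    if d2 > R2 + tol:
        return "BSV"     # outside the sphere
    if d2 < R2 - tol:
        return "inside"  # strictly enclosed
    return "SV"          # on the sphere surface
```

In practice R² is evaluated at any support vector (a point with 0 < β_{j} < C) and then used to label the rest of the data.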
Nanopore Detector-based Channel Current Cheminformatics
Information measures
The fundamental information measures are Shannon entropy, mutual information, and relative entropy (also known as the Kullback-Leibler divergence or distance). Shannon entropy, σ = −Σ_{x}p(x)log(p(x)), is a measure of the information in distribution p(x). Mutual information, μ = Σ_{x}Σ_{y}p(x,y)log(p(x,y)/p(x)p(y)), is a measure of the information one random variable has about another. Relative entropy (the Kullback-Leibler distance), ρ = Σ_{x}p(x)log(p(x)/q(x)), is a measure of distance between two probability distributions. Mutual information is a special case of relative entropy, between a joint probability (two-component in the simplest form) and the product of the component probabilities.
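These three measures can be sketched in a few lines; as is conventional, zero-probability terms are dropped (0·log 0 = 0), a choice made here for the illustration:

```python
import numpy as np

def shannon_entropy(p):
    """sigma = -sum_x p(x) log p(x); 0 log 0 is taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def relative_entropy(p, q):
    """rho = sum_x p(x) log(p(x)/q(x)), the Kullback-Leibler distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float((p[m] * np.log(p[m] / q[m])).sum())

def mutual_information(pxy):
    """mu = sum_xy p(x,y) log(p(x,y)/(p(x)p(y))): the relative entropy
    between the joint distribution and the product of its marginals."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log((pxy / (px * py))[m])).sum())
```

For an independent joint distribution the mutual information is zero, and for a perfectly correlated binary joint it equals log 2, consistent with its interpretation as a special case of relative entropy.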
Khinchin derivation of Shannon entropy
In his now-famous 1948 paper, Claude Shannon [21] provided a quantitative measure for entropy in connection with communication theory. The Shannon entropy measure was later put on a more formal footing by A. I. Khinchin in an article where he proves that, given certain reasonable assumptions, the Shannon entropy is unique [22]. A statement of the theorem is as follows:
Khinchin Uniqueness Theorem
 (1)
For given n and for Σ_{k}p_{k} = 1, the function H(p_{1}, ..., p_{n}) takes its largest value for p_{k} = 1/n (k = 1, 2, ..., n). This is equivalent to Laplace's principle of insufficient reason, which says that if you know nothing, assume the uniform distribution (this also agrees with the Occam's Razor assumption of minimum structure).
 (2)
H(a,b) = H(a) + H_{a}(b), where H_{a}(b) = −Σ_{a}p(a)Σ_{b}p(b|a)log(p(b|a)) is the conditional entropy. This is consistent with H(a,b) = H(a) + H(b) for a and b independent, with modifications involving conditional probability when they are not independent.
 (3)
H(p_{ 1 },p_{ 2 },...,p_{ n },0) = H(p_{ 1 },p_{ 2 },...,p_{ n }). This reductive relationship, or something like it, is implicitly assumed when describing any system in "isolation."
Relative Entropy Uniqueness
This falls out of a geometric formalism on families of distributions: the Information Geometry formalism described by S. Amari [23–25]. Together with Laplace's principle of insufficient reason on the choice of "reference" distribution in the relative entropy expression, this will reduce to Shannon entropy, and thus uniqueness on Shannon entropy from a geometric context. The parallel with geometry is the Euclidean distance for "flat" geometry (simplest assumption of structure), vs. the "distance" between distributions as described by the KullbackLeibler divergence.
The Success of Distributions of Nature suggests Generalization from Geometric Feature-Space Kernels to Distribution Feature-Space Kernels
Using the Shannon entropy measure it is possible to derive the classic probability distributions of statistical physics by maximizing the Shannon measure subject to appropriate linear moment constraints. Constrained variational optimizations involving the Shannon entropy measure can thus provide a unified framework with which to describe all, or most, of statistical mechanics. The distributions derivable within the maximum entropy formalism include the Maxwell-Boltzmann, Bose-Einstein, Fermi-Dirac, and Intermediate distributions. The maximum entropy method for defining statistical mechanical systems has been extensively studied in [26].
Both statistical estimation and maximum entropy estimation are concerned with drawing inferences from partial information. The maximum entropy approach estimates a probability density function when only a few moments are known (where there are an infinite number of higher moments). The statistical approach estimates the density function when only one random sample is available out of an infinity of possible samples. The maximum entropy estimation may be significantly more robust (against overfitting, for example) in that it has an Occam's Razor argument that "cuts both ways" – use all of the information given and avoid using any information not given. This means that out of all of the probability distributions consistent with the set of constraints, choose the one that has maximum uncertainty, i.e., maximum entropy [27].
At the same time that Jaynes was doing his work (essentially an optimization principle based on Shannon entropy), Solomon Kullback was exploring optimizations involving a notion of probabilistic distance known as the Kullback-Leibler distance, referred to above as the relative entropy [28]. The resulting minimum relative entropy (MRE) formalism reduces to the maximum entropy formalism of Jaynes when the reference distribution is uniform. The information distance that Kullback and Leibler defined was an oriented measure of "distance" between two probability distributions. The MRE formalism can be understood as an extension of Laplace's principle of insufficient reason (i.e., if nothing is known, assume the uniform distribution) in a manner like that employed by Khinchin in his uniqueness proof, but now incorporating constraints.
In their book Entropy Optimization Principles with Applications [27], Kapur and Kesavan argue for a generalized entropy optimization approach to the description of distributions. They believe every probability distribution, theoretical or observed, is an entropy optimization distribution, i.e., it can be obtained by maximizing an appropriate entropy measure, or by minimizing a relative entropy measure with respect to an appropriate a priori distribution. The primary objective in such a modeling procedure is to represent the problem as a simple combination of probabilistic entities that have a simple set of moment constraints. Generalized measures of distributional distance can also be explored along the lines of generalized measures of geometric distance. In physics, not every geometric distance is of interest, however, since the special theory of relativity tells us that spacetime is locally flat (Lorentzian, which is Euclidean on spatial slices), with metric generalization the Riemannian metrics. Likewise, perhaps not all distributional distance measures are created equal either. What the formalism of Information Geometry [23–25] reveals, among other things, is that relative entropy is uniquely structureless (like flat geometry) and is perturbatively stable, i.e., has a welldefined Taylor expansion at short divergence range, just like the locally Euclidean metrics at short distance range.
Results
SVM Kernel/Algorithm Variants
The SVM algorithm variants being explored are only briefly mentioned here. In the standard Platt SMO algorithm, η = 2*K12 − K11 − K22, and speed-up variations are described that avoid calculation of this value entirely. A middle ground is sought with the following definition: "η = 2*K12 − 2; If (η >= 0) { η = 1;}" (labeled WH SMO in Fig. 3; underflow handling and other details also differ slightly in the implementation shown).
SVM-Internal Speedup via Differentiating BSVs and SVs
For the BSV/SVtracking speedup, the KKT violators are redefined as:
For all m ≠ y_{i} we have:
α_{i}^{m}{f_{yi} − f_{m} − 1 + ζ_{i}} ≥ 0
Subject to: 1 ≥ α_{i}^{m} ≥ 0; ∑_{m}α_{i}^{m} = 1; ζ_{i} ≥ 0 for all i, m
Where f_{m} = (1/β)[w_{m}.x_{i} + b_{m}] for all m
Case I:
If α_{i}^{m} = 0 for m s.t. f_{m} = f_{m}^{max},
this implies α_{i}^{yi} > 0 and hence ζ_{i} = 0,
hence f_{yi} − f_{m}^{max} − 1 ≥ 0.
Case II:
If 1 > α_{i}^{m} > 0 for m s.t. f_{m} = f_{m}^{max} and α_{i}^{yi} > α_{i}^{m},
this implies ζ_{i} = 0,
hence f_{yi} − f_{m}^{max} − 1 = 0.
Case III:
If 1 ≥ α_{i}^{m} > 0 for m s.t. f_{m} = f_{m}^{max} and α_{i}^{yi} ≤ α_{i}^{m},
this implies ζ_{i} > 0,
hence f_{yi} − f_{m}^{max} − 1 + ζ_{i} = 0,
or f_{yi} − f_{m}^{max} − 1 < 0.
Data Rejection Tuning with SVM-Internal vs SVM-External Classifiers
Marginal Drop with SVM-Internal
The table shows the results of dropping data that falls in the margin. For any data point x_{i}, let max_{m}{f_{m}(x_{i})} = f_{yi}, and let f_{m'} = max_{m ≠ yi}{f_{m}(x_{i})}; then we define the margin as (f_{yi} − f_{m'}), and data point x_{i} is dropped if (f_{yi} − f_{m'}) ≤ Confidence Parameter. Using the margin-drop approach there is even less tuning, and there is improved throughput (approximately 75% for all species).
Kernel | 8GC | 9AT | 9CG | 9GC | 9TA
Gaussian | P: 1268, TP: 1087, SN+SP: 1.76 / P: 1087, TP: 1087, SN+SP: 2 / Drop = 9.42 | P: 1178, TP: 934, SN+SP: 1.57 / P: 934, TP: 934, SN+SP: 2 / Drop = 22.17 | P: 1166, TP: 904, SN+SP: 1.53 / P: 904, TP: 904, SN+SP: 2 / Drop = 24.67 | P: 1172, TP: 897, SN+SP: 1.51 / P: 897, TP: 897, SN+SP: 2 / Drop = 25.25 | P: 1216, TP: 1027, SN+SP: 1.70 / P: 1027, TP: 1027, SN+SP: 2 / Drop = 14.42
AbsDiff | P: 1407, TP: 1134, SN+SP: 1.75 / P: 1134, TP: 1134, SN+SP: 2 / Drop = 5.5 | P: 1151, TP: 928, SN+SP: 1.58 / P: 928, TP: 928, SN+SP: 2 / Drop = 22.67 | P: 1177, TP: 906, SN+SP: 1.53 / P: 906, TP: 906, SN+SP: 2 / Drop = 24.5 | P: 1050, TP: 870, SN+SP: 1.55 / P: 870, TP: 870, SN+SP: 2 / Drop = 27.5 | P: 1215, TP: 1040, SN+SP: 1.72 / P: 1040, TP: 1040, SN+SP: 2 / Drop = 13.33
Entropic | P: 1165, TP: 1038, SN+SP: 1.75 / P: 1038, TP: 1038, SN+SP: 2 / Drop = 13.5 | P: 1480, TP: 995, SN+SP: 1.50 / P: 991, TP: 991, SN+SP: 2 / Drop = 17.42 | P: 1348, TP: 922, SN+SP: 1.45 / P: 920, TP: 920, SN+SP: 2 / Drop = 23.33 | P: 960, TP: 804, SN+SP: 1.50 / P: 803, TP: 803, SN+SP: 2 / Drop = 33.08 | P: 1047, TP: 970, SN+SP: 1.73 / P: 970, TP: 970, SN+SP: 2 / Drop = 19.17
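The margin-drop rule itself is a one-line comparison; a minimal sketch (names hypothetical), rejecting a point when the gap between its top two decision values is at most the confidence parameter:

```python
import numpy as np

def margin_drop(scores, confidence):
    """Return (predicted_class, rejected): the point is dropped when the
    gap between the top two decision values, f_yi - f_m', is at most
    the confidence parameter."""
    order = np.argsort(scores)[::-1]              # classes by decision value
    gap = float(scores[order[0]] - scores[order[1]])
    return int(order[0]), gap <= confidence
```

Raising the confidence parameter rejects more marginal signals, trading throughput for accuracy on the retained data.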
SVM-Internal Clustering
The SVM-Internal approach to clustering was originally defined in [29]. Data points are mapped by means of a kernel to a high-dimensional feature space where we search for the minimal enclosing sphere. In what follows, Keerthi's method is used to solve the dual (see Methods for further details).
The minimal enclosing sphere, when mapped back into the data space, can separate into several components; each enclosing a separate cluster of points. The width of the kernel (say Gaussian) controls the scale at which the data is probed while the soft margin constant helps to handle outliers and overlapping clusters. The structure of a dataset is explored by varying these two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries.
We have used the algorithm defined in [29] to identify the clusters, with methods adapted from [30, 31] for their handling. If the number of data points is n, then n(n−1)/2 comparisons are required. We have made modifications to the algorithm that eliminate comparisons having no impact on the cluster connectivity; hence the number of comparisons required will be less than n(n−1)/2.
The table shows clustering predictions when working with 400 Samples (200 each of 9GC & 9CG) with a Gaussian Kernel with Width = 50 (σ^{2} = 0.01).
C Value | Number of SV | Percent of Outliers | Number of Clusters | Number of Comparisons
0.25 | 91 | 0 | 10 | 39005
0.025 | 87 | 1.25 | 5 | 37020
0.0125 | 44 | 13.75 | 4 | 29202
0.01 | 29 | 21.75 | 2 | 24145
The table shows clustering predictions when working with 1200 Samples (600 each of 9GC & 9CG) with a Gaussian Kernel with Width = 50 (σ^{2} = 0.01).
C Value | Number of SV | Percent of Outliers | Number of Clusters | Number of Comparisons
0.00833 | 106 | 5.8 | 4 | 10873
0.00417 | 37 | 18.25 | 2 | 232021
0.00333 | 31 | 23.8 | 2 | 203278
0.00278 | 23 | 29.08 | 2 | 177533
SVM-External Clustering
Machine Learning and Cheminformatics Tools are Accessible via a Website
Discussion
Adaptive Feature Extraction/Discrimination
Adaptive feature extraction and discrimination, in the context of SVMs, can be accomplished by small-batch reprocessing using the learned support vectors together with the new information to be learned. The benefit is that the easily deployed properties of SVMs can be retained while at the same time co-opting some of the online adaptive characteristics familiar from online learning with neural nets. This is also compatible with the chunking processing that is already implemented. A situation where such adaptation might prove necessary in nanopore signal analysis is if the instrumentation were found to have measurable, but steady, drift (at a new level of sensitivity, for example). At the forefront of online adaptation, where the discrimination and feature extraction optimizations are inextricably mixed, further progress may derive benefit from the Information-Geometrical methods of S. Amari [23–25].
Robust SVM performance in the presence of noise
In a parallel data run to that indicated in Fig. 2a, with 150-component feature vectors, feature vectors with the full set of 2600 components were extracted (i.e., no compression was employed on the transition probabilities). SVM performance on the same train/test data splits, but with 2600-component feature vectors instead of 150-component feature vectors, offered similar performance after drop optimization. This demonstrates a significant robustness to what the SVM can "learn" in the presence of noise (some of the 2600 components carry richer information, but even more are noise contributors).
AdaBoost Feature Selection
If SVM performance on the full HMM parameter set (the features extracted for each blockade signal) is equivalent after rejecting weak data, then there is the possibility of significant improvement through selection of the most informative parameters. An AdaBoost method is being used to select HMM parameters by representing each feature vector component as an independent Naïve Bayes classifier (trained on the data given); these classifiers then comprise the pool of experts in the AdaBoost algorithm [32–34]. The experts to which AdaBoost assigns the heaviest weighting will then be the components selected in the new, AdaBoost-derived, feature vector compression.
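As an illustration of this scheme (a sketch under stated assumptions, not the authors' code): each feature-vector component is treated as a one-dimensional Gaussian naive-Bayes "expert", and AdaBoost's accumulated expert weights rank the components. Equal class priors and Gaussian class-conditionals are simplifying assumptions here.

```python
import numpy as np

class GaussianStump:
    """One-dimensional naive-Bayes expert on a single feature component.
    Assumes equal class priors and Gaussian class-conditionals."""
    def __init__(self, idx):
        self.idx = idx
    def fit(self, X, y, w):
        f = X[:, self.idx]
        self.stats = {}
        for c in (-1, 1):
            m = y == c
            mu = np.average(f[m], weights=w[m])
            var = np.average((f[m] - mu) ** 2, weights=w[m]) + 1e-9
            self.stats[c] = (mu, var)
    def predict(self, X):
        f = X[:, self.idx]
        def loglik(c):
            mu, var = self.stats[c]
            return -0.5 * np.log(var) - (f - mu) ** 2 / (2 * var)
        return np.where(loglik(1) > loglik(-1), 1, -1)

def adaboost_select(X, y, n_rounds=10):
    """Return per-component AdaBoost weights; large weight = informative."""
    n, d = X.shape
    w = np.ones(n) / n
    alpha = np.zeros(d)
    for _ in range(n_rounds):
        best, best_err, best_pred = None, 0.5, None
        for j in range(d):            # pick the expert with lowest weighted error
            s = GaussianStump(j)
            s.fit(X, y, w)
            pred = s.predict(X)
            err = w[pred != y].sum()
            if err < best_err:
                best, best_err, best_pred = j, err, pred
        if best is None:
            break
        a = 0.5 * np.log((1 - best_err) / max(best_err, 1e-12))
        alpha[best] += a              # accumulate the expert's weighting
        w *= np.exp(-a * y * best_pred)
        w /= w.sum()                  # re-normalize example weights
    return alpha
```

Components with the largest accumulated `alpha` would be the ones retained in the compressed feature vector.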
Conclusion

The External Multiclass SVM gave the best results with the Sentropic kernel, while the Internal Multiclass SVM gave the best results with the AbsDiff kernel.

The Internal Multiclass approach overcomes the need to search for the best-performing tree out of many possibilities. This is a significant advantage, especially when the number of classes is large.

Using a margin to define the drop zone for the internal multiclass approach produced far better results, i.e., fewer data were dropped to achieve 100% accuracy.

An additional benefit of using the margin is that tuning the drop zone to achieve 100% accuracy becomes trivial.

External and Internal SVM Clustering Methods were also examined. The results show that our SVM-based clustering implementations can separate data into proper clusters without any prior knowledge of the elements' classification. This can be a powerful resource for insight into data linkages (topology).
Methods
The Feature Extraction used to obtain the Feature Vectors for SVM analysis
Signal Preprocessing Details
The Nanopore Detector is operated such that a stream of 100 ms samplings is obtained (throughput was approximately one sampling per 300 ms in [3]). Each 100 ms signal acquired by the time-domain FSA consists of a sequence of 5000 sub-blockade levels (with the 20 μs analog-to-digital sampling). Signal preprocessing is then used for adaptive low-pass filtering. For the data sets examined, the preprocessing is expected to permit compression of the sample sequence from 5000 to 625 samples (later HMM processing then only requires construction of a dynamic programming table with 625 columns). The signal preprocessing makes use of an offline wavelet stationarity analysis (Offline Wavelet Stationarity Analysis, Fig. 2b; see also [35]).
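Since 5000/625 = 8, the compression is an 8-fold reduction. A minimal stand-in for the wavelet-informed adaptive low-pass filter (which this sketch does not attempt to reproduce) is plain 8-sample block averaging:

```python
import numpy as np

def compress_signal(samples, factor=8):
    """Block-average compression: 5000 sub-blockade samples -> 625.
    A simple stand-in for the paper's wavelet-informed adaptive
    low-pass filter; the averaging scheme is our own illustration."""
    samples = np.asarray(samples, dtype=float)
    n = (len(samples) // factor) * factor   # drop any ragged tail
    return samples[:n].reshape(-1, factor).mean(axis=1)
```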
HMMs and Supervised Feature Extraction Details
With completion of preprocessing, an HMM [36] is used to remove noise from the acquired signals, and to extract features from them (Feature Extraction Stage, Fig. 2b). The HMM is, initially, implemented with fifty states, corresponding to current blockades in 1% increments ranging from 20% residual current to 69% residual current. The HMM states, numbered 0 to 49, correspond to the 50 different current blockade levels in the sequences that are processed. The state emission parameters of the HMM are initially set so that state j, 0 ≤ j ≤ 49, corresponding to level L = j+20, can emit all possible levels, with the probability distribution over emitted levels set to a discretized Gaussian with mean L and unit variance. All transitions between states are possible, and initially are equally likely. Each blockade signature is denoised by 5 rounds of Expectation-Maximization (EM) training on the parameters of the HMM. After the EM iterations, 150 parameters are extracted from the HMM. The 150 feature vector components are extracted from the 50 parameterized emission probabilities, a 50-element compressed representation of the 50^{2} transition probabilities, and a posteriori information from the Viterbi path solution, which is, essentially, a denoised histogram of the blockade sub-level occupation probabilities (further details in [3]). This information elucidates the blockade levels (states) characteristic of a given molecule, and the occupation probabilities for those levels, but doesn't directly provide kinetic information. An HMM-with-Duration has recently been introduced to better capture the latter information, but such feature vectors are not used in the studies shown here, so this approach isn't discussed further.
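The initial HMM parameterization described above can be sketched as follows (an illustration only; the function name and array layout are our own):

```python
import numpy as np

def init_hmm(n_states=50, level_offset=20):
    """Initialize the 50-state blockade HMM described in the text:
    state j (level L = j + 20) emits levels with a discretized
    unit-variance Gaussian centered on L; all transitions start
    out equally likely."""
    levels = np.arange(n_states) + level_offset            # 20% .. 69% residual current
    # Emission matrix: rows = states, columns = observed levels
    diff = levels[None, :] - levels[:, None]
    emis = np.exp(-0.5 * diff ** 2)                        # discretized Gaussian, sigma = 1
    emis /= emis.sum(axis=1, keepdims=True)                # normalize each state's distribution
    trans = np.full((n_states, n_states), 1.0 / n_states)  # uniform transitions
    return emis, trans
```

EM training (5 rounds, per the text) would then re-estimate `emis` and `trans` for each blockade signature before the 150 features are read off.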
Solving the Dual (Based on Keerthi's SMO [37])
The dual formalism is: 1 − ∑_{i,j}β_{i}β_{j}K_{ij}, where 0 ≤ β_{i} ≤ C; K_{ij} = exp(−||x_{i} − x_{j}||^{2}/2σ^{2}); and ∑_{i}β_{i} = 1. For any data point 'x_{k}', the distance of its image in kernel space from the center of the sphere is given by: R^{2}(x_{k}) = 1 − 2∑_{i}β_{i}K_{ik} + ∑_{i,j}β_{i}β_{j}K_{ij}. The radius of the sphere is R = {R(x_{k}) | x_{k} is a Support Vector}; hence data points that are Support Vectors lie on cluster boundaries. Outliers are points that lie outside the sphere and therefore do not belong to any cluster, i.e., they are Bounded Support Vectors. All other points are enclosed by the sphere and lie inside their respective clusters. The KKT violators are given as: (i) 0 < β_{i} < C and R(x_{i}) ≠ R; (ii) β_{i} = 0 and R(x_{i}) > R; and (iii) β_{i} = C and R(x_{i}) < R.
The Wolfe dual is: f(β) = Min_{β} {∑_{i,j}β_{i}β_{j}K_{ij} − 1}. In the SMO decomposition, in each iteration we select β_{i} & β_{j} and change them such that f(β) reduces. All other β's are kept constant for that iteration. Let us denote β_{1} & β_{2} as being modified in the current iteration. Also β_{1} + β_{2} = (1 − ∑_{i ≥ 3}β_{i}) = s, a constant. Let ∑_{i ≥ 3}β_{i}K_{ik} = C_{k}; then we obtain the SMO form: f(β_{1},β_{2}) = β^{2}_{1} + β^{2}_{2} + ∑_{i,j ≥ 3}β_{i}β_{j}K_{ij} + 2β_{1}β_{2}K_{12} + 2β_{1}C_{1} + 2β_{2}C_{2}. Eliminating β_{1}: f(β_{2}) = (s − β_{2})^{2} + β^{2}_{2} + ∑_{i,j ≥ 3}β_{i}β_{j}K_{ij} + 2(s − β_{2})β_{2}K_{12} + 2(s − β_{2})C_{1} + 2β_{2}C_{2}. To minimize f(β_{2}), we take the first derivative w.r.t. β_{2} and equate it to zero, thus f'(β_{2}) = 0 = 2β_{2}(1 − K_{12}) − s(1 − K_{12}) − (C_{1} − C_{2}), and we get the update rule: β_{2}^{new} = [(C_{1} − C_{2})/2(1 − K_{12})] + s/2. We also have an expression for "C_{1} − C_{2}" from: R^{2}(x_{1}) − R^{2}(x_{2}) = 2(β_{2} − β_{1})(1 − K_{12}) − 2(C_{1} − C_{2}), thus C_{1} − C_{2} = [R^{2}(x_{2}) − R^{2}(x_{1})]/2 + (β_{2} − β_{1})(1 − K_{12}); substituting, we have:
β_{1}^{new} = β_{1}^{old} − [R^{2}(x_{2}) − R^{2}(x_{1})]/[4(1 − K_{12})], and correspondingly β_{2}^{new} = β_{2}^{old} + [R^{2}(x_{2}) − R^{2}(x_{1})]/[4(1 − K_{12})]
Keerthi Algorithm
Compute 'C': if the percentage of outliers is n and the number of data points is N, then C = 100/(N·n)
Initialize β: set m = int(1/C) − 1 randomly chosen indices to 'C'
Set two further randomly chosen indices to values less than 'C' such that ∑_{i}β_{i} = 1
Compute R^{2}(x_{i}) for all 'i' based on the current value of β.
Divide data into three sets: Set I if 0 < β_{i} < C; Set II if β_{i} = 0; and Set III if β_{i} = C.
Compute R^{2}_low = Max{R^{2}(x_{i}) | 0 ≤ β_{i} < C} and R^{2}_up = Min{R^{2}(x_{i}) | 0 < β_{i} ≤ C}.
In every iteration execute the following two paths alternately until there are no KKT violators:
1. Loop through all examples (call Examine Example subroutine)
Keep count of number of KKT Violators.
2. Loop through examples belonging only to Set I (call Examine Example subroutine) until R^{2}_low − R^{2}_up < 2*tol.
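The initialization steps above can be sketched as follows (illustrative Python; the fixed random seed and the even split of the leftover probability mass between the two final indices are our own choices):

```python
import numpy as np

def init_beta(N, pct_outliers):
    """Initialize beta per the steps above: C = 100/(N * pct_outliers),
    m = int(1/C) - 1 indices set to C, and two further indices set to
    values below C so that sum(beta) = 1. Requires m + 2 <= N."""
    C = 100.0 / (N * pct_outliers)
    rng = np.random.default_rng(0)            # fixed seed: illustrative only
    beta = np.zeros(N)
    m = int(1.0 / C) - 1
    idx = rng.choice(N, size=m + 2, replace=False)
    beta[idx[:m]] = C
    rem = 1.0 - m * C                         # leftover mass, in [C, 2C)
    beta[idx[m]] = rem / 2.0                  # two indices share the remainder,
    beta[idx[m + 1]] = rem / 2.0              # so each stays strictly below C
    return beta, C
```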
Examine Example Subroutine
a. Check for KKT Violation. An example is a KKT violator if:
Set II and R^{2}(x_{i}) > R^{2}_up; choose R^{2}_up for joint optimization
Set III and R^{2}(x_{i}) < R^{2}_low; choose R^{2}_low for joint optimization
Set I and R^{2}(x_{i}) > R^{2}_up + 2*tol OR R^{2}(x_{i}) < R^{2}_low − 2*tol; choose R^{2}_low or R^{2}_up for joint optimization depending on which gives a worse KKT violator
b. Call the Joint Optimization subroutine
 a.
Compute η = 4(1 − K_{12}) where K_{12} is the kernel evaluation of the pair chosen in Examine Example
 b.
Compute D = [R^{2}(x_{2}) − R^{2}(x_{1})]/η
 c.
Compute Min{(C − β_{2}), β_{1}} = L1
 d.
Compute Min{(C − β_{1}), β_{2}} = L2
 e.
If D > 0, then D = Min{D, L1}; otherwise D = Max{D, −L2} (so that both β's remain in [0, C])
 f.
Update β_{2} as: β_{2} = β_{2} + D
 g.
Update β_{1} as: β_{1} = β_{1} − D
 h.
Recompute R^{2}(x_{i}) for all 'i' based on the changes in β_{1} & β_{2}
 i.
Recompute R^{2}_low & R^{2}_up based on elements in Set I, R^{2}(x_{1}) & R^{2}(x_{2})
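Steps a–i amount to the following joint-optimization update (a sketch; the clipping of D to −L2 when D < 0, which keeps both β's in [0, C], and the variable names are our own):

```python
import numpy as np

def joint_optimize(i1, i2, beta, K, C):
    """One SMO-style joint optimization of (beta_1, beta_2), following
    steps a-i: D = [R^2(x2) - R^2(x1)] / eta, clipped so that both
    betas stay within the box [0, C]; beta sums stay constant."""
    k = K @ beta                       # k[i] = sum_j beta_j K_ij
    W = beta @ k                       # sum_ij beta_i beta_j K_ij
    R2 = 1.0 - 2.0 * k + W             # R^2(x_i) for every i
    eta = 4.0 * (1.0 - K[i1, i2])      # step a
    D = (R2[i2] - R2[i1]) / eta        # step b
    L1 = min(C - beta[i2], beta[i1])   # step c: room for beta_2 to grow
    L2 = min(C - beta[i1], beta[i2])   # step d: room for beta_2 to shrink
    D = min(D, L1) if D > 0 else max(D, -L2)   # step e (and the D < 0 clip)
    beta[i2] += D                      # step f
    beta[i1] -= D                      # step g
    return beta                        # steps h-i: caller recomputes R^2 bounds
```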
The SVMExternal Clustering Method
 1.
Start with a set of data vectors (obtained through running raw data through tFSA/HMM feature extraction in Fig. 2b).
 2.
Randomly label each vector in the set as positive or negative.
 3.
Run the SVM on the randomly labeled data set until convergence is obtained (random relabeling is needed if prior random label scheme does not allow for convergence).
 4.
After initial convergence is obtained for the randomly labeled data set, relabel the misclassified data vectors that have confidence factor values greater than some threshold.
 5.
Rerun the SVM on the newly relabeled data set.
 6.
Continue relabeling and rerunning SVM until no vectors in the data set are misclassified (or there is no improvement).
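A minimal sketch of this loop, assuming scikit-learn's SVC as the binary SVM (the authors used their own SVM implementation; the RBF kernel, confidence threshold, and iteration cap here are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def svm_external_cluster(X, threshold=0.5, max_iter=20, seed=0):
    # Steps 1-6 above: random labels, train an SVM, flip confidently
    # misclassified points, and retrain until the labeling is stable
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=len(X))          # step 2: random labeling
    for _ in range(max_iter):
        if np.unique(y).size < 2:                 # SVC needs both classes present
            break
        clf = SVC(kernel="rbf").fit(X, y)         # steps 3/5: (re)train the SVM
        margin = clf.decision_function(X)         # signed confidence per point
        wrong = (np.sign(margin) != y) & (np.abs(margin) > threshold)
        if not wrong.any():                       # step 6: no confident mistakes
            break
        y[wrong] = -y[wrong]                      # step 4: relabel them
    return y
```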
Declarations
Acknowledgements
SWH would like to thank MA and Prof. David Deamer at UCSC for strong collaborative support post-Katrina. Funding was provided by grants from the National Institutes of Health, the National Science Foundation, the Louisiana Board of Regents, and NASA.
Authors’ Affiliations
References
 1. Vapnik VN: The Nature of Statistical Learning Theory. 2nd edition. Springer-Verlag, New York; 1998.
 2. Burges CJC: A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2:121–167.
 3. Winters-Hilt S, Vercoutere W, DeGuzman VS, Deamer DW, Akeson M, Haussler D: Highly Accurate Classification of Watson-Crick Basepairs on Termini of Single DNA Molecules. Biophys J 2003, 84:967–976.
 4. Platt JC: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning. Ch. 12. Edited by Scholkopf B, Burges CJC, Smola AJ. MIT Press, Cambridge, USA; 1998.
 5. Osuna E, Freund R, Girosi F: An improved training algorithm for support vector machines. In Neural Networks for Signal Processing VII. Edited by Principe J, Gile L, Morgan N, Wilson E. IEEE, New York; 1997:276–285.
 6. Joachims T: Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning. Ch. 11. Edited by Scholkopf B, Burges CJC, Smola AJ. MIT Press, Cambridge, USA; 1998.
 7. Crammer K, Singer Y: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research 2001, 2:265–292.
 8. Hsu CW, Lin CJ: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks 2002, 13:415–425.
 9. Lee Y, Lin Y, Wahba G: Multicategory Support Vector Machines. Technical Report 1043, Department of Statistics, University of Wisconsin, Madison, WI; 2001. [http://citeseer.ist.psu.edu/lee01multicategory.html]
 10. Bezrukov SM, Vodyanoy I, Parsegian VA: Counting polymers moving through a single ion channel. Nature 1994, 370(6457):279–281.
 11. Kasianowicz JJ, Brandin E, Branton D, Deamer DW: Characterization of Individual Polynucleotide Molecules Using a Membrane Channel. Proc Natl Acad Sci USA 1996, 93(24):13770–13773.
 12. Akeson M, Branton D, Kasianowicz JJ, Brandin E, Deamer DW: Microsecond Time-Scale Discrimination Among Polycytidylic Acid, Polyadenylic Acid, and Polyuridylic Acid as Homopolymers or as Segments Within Single RNA Molecules. Biophys J 1999, 77(6):3227–3233.
 13. Bezrukov SM: Ion Channels as Molecular Coulter Counters to Probe Metabolite Transport. J Membr Biol 2000, 174:1–13.
 14. Meller A, Nivon L, Brandin E, Golovchenko J, Branton D: Rapid nanopore discrimination between single polynucleotide molecules. Proc Natl Acad Sci USA 2000, 97(3):1079–1084.
 15. Meller A, Nivon L, Branton D: Voltage-driven DNA translocations through a nanopore. Phys Rev Lett 2001, 86(15):3435–3438.
 16. Vercoutere W, Winters-Hilt S, Olsen H, Deamer DW, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol 2001, 19(3):248–252.
 17. Winters-Hilt S: Highly Accurate Real-Time Classification of Channel-Captured DNA Termini. Third International Conference on Unsolved Problems of Noise and Fluctuations in Physics, Biology, and High Technology 2003, 355–368.
 18. Vercoutere W, Winters-Hilt S, DeGuzman VS, Deamer D, Ridino S, Rogers JT, Olsen HE, Marziali A, Akeson M: Discrimination Among Individual Watson-Crick Base-Pairs at the Termini of Single DNA Hairpin Molecules. Nucl Acids Res 2003, 31:1311–1318.
 19. Winters-Hilt S: Nanopore detection using channel current cheminformatics. SPIE Second International Symposium on Fluctuations and Noise, 25–28 May 2004.
 20. Winters-Hilt S, Akeson M: Nanopore cheminformatics. DNA Cell Biol 2004, 23(10):675–683.
 21. Shannon CE: A mathematical theory of communication. Bell Sys Tech Journal 1948, 27:379–423, 623–656.
 22. Khinchine AI: Mathematical foundations of information theory. Dover; 1957.
 23. Amari S: Dualistic Geometry of the Manifold of Higher-Order Neurons. Neural Networks 1991, 4(4):443–451.
 24. Amari S: Information Geometry of the EM and em Algorithms for Neural Networks. Neural Networks 1995, 8(9):1379–1408.
 25. Amari S, Nagaoka H: Methods of Information Geometry. Translations of Mathematical Monographs 2000, 191.
 26. Jaynes E: Paradoxes of Probability Theory. 1997. Internet-accessible book preprint: http://omega.albany.edu:8008/JaynesBook.html
 27. Kapur JN, Kesavan HK: Entropy optimization principles with applications. Academic Press; 1992.
 28. Kullback S: Information Theory and Statistics. Dover; 1968.
 29. Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support Vector Clustering. Journal of Machine Learning Research 2001, 2:125–137.
 30. Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the Support of a High-Dimensional Distribution. Neural Comp 2001, 13:1443–1471.
 31. Yang J, Estivill-Castro V, Chalup SK: Support Vector Clustering Through Proximity Graph Modeling. Proceedings, 9th International Conference on Neural Information Processing (ICONIP'02) 2002, 898–903.
 32. Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997, 55:119–139.
 33. Freund Y, Schapire RE, Bartlett P, Lee WS: Boosting the margin: a new explanation for the effectiveness of voting methods. Proc 14th International Conference on Machine Learning 1998.
 34. Schapire RE, Singer Y: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 1999, 37(3):297–336.
 35. Diserbo M, Masson P, Gourmelon P, Caterini R: Utility of the wavelet transform to analyze the stationarity of single ionic channel recordings. J Neurosci Methods 2000, 99(1–2):137–141.
 36. Durbin R: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK & New York; 1998.
 37. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK: Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 2001, 13:637–649.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.