### Channel current signal analysis & pattern recognition

A Channel Current Spike Detector algorithm was developed in [8] to characterize the brief, very strong blockade "spike" behavior observed for molecules that occasionally break in the region exposed to the limiting aperture's strong electrophoretic force. (In [6–11], where nine base-pair hairpins were studied, the spike events were attributed to a fray/extension event on the terminal base-pair.) Together, the HMM-EM formulation, the FSAs, and the Spike Detector provide a robust method for analysis of channel current data. The spike detector software is designed to count "anomalous" spikes, i.e., spike noise not attributable to the Gaussian fluctuations about the mean of the dominant blockade level. Spike count plots are generated to show increasing counts as cut-off thresholds are relaxed (to the point where eventually any downward deflection is counted as a spike). The plots are automatically generated and automatically fit with extrapolations of their linear phases (exponential phases occur when cut-offs begin to probe the noise band of a blockade state, typically the Gaussian noise "tails"). The extrapolations provide an estimate of the "true" anomalous spike count (see figure in Additional file 6).
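The threshold-sweep counting described above can be sketched as follows. This is a minimal illustration, not the Spike Detector software of [8]; the function name, toy trace, and cutoff values are assumptions for demonstration:

```python
import numpy as np

def spike_counts(trace, cutoffs):
    """Count distinct downward excursions below each cutoff threshold.

    A 'spike' is a maximal run of consecutive samples below the cutoff,
    so each excursion is counted once regardless of its duration.
    """
    counts = []
    for c in cutoffs:
        below = trace < c
        # rising edges of the below-cutoff mask = starts of excursions
        n = np.count_nonzero(below[1:] & ~below[:-1]) + int(below[0])
        counts.append(n)
    return np.array(counts)

rng = np.random.default_rng(0)
# toy blockade trace: Gaussian noise about a 60% residual-current mean,
# plus three deep "anomalous" spikes
trace = rng.normal(60.0, 1.0, 5000)
trace[[1000, 2500, 4000]] = 40.0

# counts stay flat at 3 for cutoffs well below the noise band, and only
# grow once the cutoffs begin to probe the Gaussian noise "tails"
print(spike_counts(trace, [45.0, 50.0, 52.0]))
```

The linear-phase extrapolation in the text corresponds to fitting the flat/linear region of this count-versus-cutoff curve, before the exponential phase that begins once the cutoffs enter the noise band.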

The signal processing architecture (Fig. 2) is designed to rapidly extract useful information from noisy blockade signals using feature extraction protocols, wavelet analysis, Hidden Markov Models (HMMs), and Support Vector Machines (SVMs). For blockade signal acquisition and simple, time-domain feature extraction, a Finite State Automaton (FSA) approach is used [19] that is based on tuning a variety of threshold parameters. A generic HMM can be used to characterize current blockades by identifying a sequence of sub-blockades as a sequence of state emissions [6–9, 11]. The parameters of the generic HMM can then be estimated using a method called Expectation/Maximization, or 'EM' [40], to effect de-noising. The HMM method with EM, denoted HMM/EM, is used in what follows (further background on these methods can be found in [6–11]). Classification of the feature vectors obtained by the HMM for each individual blockade event is then done using SVMs, an approach that automatically provides a confidence measure on each classification.

The Nanopore Detector is operated such that a stream of 100 ms signal samples is obtained. Each 100 ms signal acquired by the time-domain FSA consists of a sequence of 5000 sub-blockade levels (with the 20 μs analog-to-digital sampling). Signal preprocessing is then used for adaptive low-pass filtering. For the data sets examined, the preprocessing is expected to permit compression of the sample sequence from 5000 to 625 samples (the later HMM processing then only requires construction of a dynamic programming table with 625 columns). The signal preprocessing makes use of an off-line wavelet stationarity analysis (Off-line Wavelet Stationarity Analysis, Fig. 2) to determine the amount of sample compression (effective low-pass filtering) that can be sustained while still retaining good structure resolution.
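The 8:1 compression (5000 → 625 samples) can be illustrated with simple boxcar averaging; this is a stand-in sketch for the wavelet-guided low-pass step, with the factor of 8 taken from the sample counts quoted above:

```python
import numpy as np

def compress(signal, factor=8):
    """Boxcar-average adjacent samples: an effective low-pass filter
    reducing a 5000-sample (100 ms at 20 us) trace to 625 samples."""
    n = (len(signal) // factor) * factor      # drop any ragged tail
    return signal[:n].reshape(-1, factor).mean(axis=1)

sig = np.arange(5000, dtype=float)            # stand-in for one 100 ms trace
out = compress(sig)
print(len(out))  # 625
```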

With completion of preprocessing, an HMM is used to remove noise from the acquired signals and to extract features from them (Feature Extraction Stage, Fig. 2). The HMM is initially implemented with fifty states, corresponding to current blockades in 1% increments ranging from 20% residual current to 69% residual current. The HMM states, numbered 0 to 49, correspond to the 50 different current blockade levels in the sequences that are processed. The state emission parameters of the HMM are initially set so that state j, 0 ≤ j ≤ 49, corresponding to level L = j + 20, can emit all possible levels, with the probability distribution over emitted levels set to a discretized Gaussian with mean L and unit variance. All transitions between states are possible, and initially are equally likely. Each blockade signature is de-noised by 5 rounds of Expectation-Maximization (EM) training on the parameters of the HMM. After the EM iterations, 150 parameters are extracted from the HMM. The 150 feature-vector components obtained from the 50-state HMM-EM/Viterbi implementation in [6–11] are: the 50 dwell percentages in the different blockade levels (from the Viterbi trace-back states), the 50 variances of the emission probability distributions associated with the different states, and the 50 merged transition probabilities from the primary and secondary blockade occupation levels (fits to two-state dominant modulatory blockade signals). Plots of the class-averaged 150-component feature vector profiles are shown in Figs. 7, 11, 12, 15, 16, 18, and 19.
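The initialization described above (fifty states, discretized unit-variance Gaussian emissions centered on L = j + 20, uniform transitions) can be sketched directly; this covers only the setup stage, not the EM training or Viterbi trace-back:

```python
import numpy as np

def init_blockade_hmm(n_states=50, offset=20):
    """Initialize the 50-state blockade HMM: state j emits levels with a
    discretized unit-variance Gaussian centered on level L = j + offset;
    all state-to-state transitions start out equally likely."""
    levels = np.arange(offset, offset + n_states)     # 20% .. 69% residual
    emissions = np.empty((n_states, n_states))
    for j in range(n_states):
        g = np.exp(-0.5 * (levels - (j + offset)) ** 2)
        emissions[j] = g / g.sum()      # normalize the discretized Gaussian
    transitions = np.full((n_states, n_states), 1.0 / n_states)
    return emissions, transitions

emissions, transitions = init_blockade_hmm()
print(emissions.shape, transitions.shape)  # (50, 50) (50, 50)
```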

The HMM-with-Duration implementation, described in [1, 43, 44], has been tested in terms of its performance at parsing synthetic blockade signals. The synthetic data used in [1, 43, 44] was designed to have two levels, with the lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). The results clearly demonstrate the superior performance of the HMM-with-Duration over the simpler HMM-without-Duration formulation. The HMM-with-Duration is critical for accurate kinetic feature extraction, and the results in [1, 43, 44] suggest that this problem can be elegantly solved by pairing HMM-with-Duration stabilization with the EVA-projection method, which together afford a robust means of kinetic feature extraction.
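Synthetic two-level test data of the kind used in [1, 43, 44] can be generated along these lines; the level values and mean dwell time below are illustrative assumptions, not the published settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def two_level_signal(n_samples, levels=(30.0, 60.0), mean_dwell=50):
    """Alternate between two blockade levels, with the dwell time in each
    level drawn from a governing distribution (Poisson here; Gaussian
    dwell distributions were also considered in the cited tests)."""
    samples, state = [], 0
    while len(samples) < n_samples:
        dwell = max(1, rng.poisson(mean_dwell))
        samples.extend([levels[state]] * dwell)
        state = 1 - state                 # toggle to the other level
    return np.array(samples[:n_samples])

sig = two_level_signal(2000)
print(sorted(set(sig)))  # [30.0, 60.0]
```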

In the implementation of the HMM-with-Duration in [43], the transition probability for state 's' to remain in state 's' (an "ss" transition) can be computed as: Prob(ss | s_{length} = L) = Prob(s_{length} ≥ L + 1)/Prob(s_{length} ≥ L). The transition probabilities out of state 's' involve some subtleties, as shown in the following, where the states are exon (e), intron (i), and junk (j). In this case, the transition probabilities governing the transitions (jj) -> (je), (ee) -> (ej), (ee) -> (ei), and (ii) -> (ie) are computed as: Prob(ei | e_{length} = L) = Prob(e_{length} = L)/Prob(e_{length} ≥ L) × 40/(40 + 60) and Prob(ej | e_{length} = L) = Prob(e_{length} = L)/Prob(e_{length} ≥ L) × 60/(40 + 60), where the total number of observed (ej) transitions is 60 and the total number of observed (ei) transitions is 40. Further details and recent results are described in [1].
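These duration-dependent transition probabilities can be computed from an empirical dwell-length distribution along the following lines; the toy distribution is an assumption, while the 40/60 exit counts are the illustrative numbers from the text:

```python
import numpy as np

def tail(pmf):
    """tail[k] = Prob(length >= k), from a dwell-length pmf indexed by k."""
    return np.cumsum(pmf[::-1])[::-1]

def self_transition(pmf, L):
    """Prob(ss | s_length = L) = Prob(len >= L + 1) / Prob(len >= L)."""
    t = tail(pmf)
    return t[L + 1] / t[L]

def exit_transition(pmf, L, n_target, n_total):
    """Prob of exiting at length L toward one successor, weighted by the
    observed transition counts (e.g. 40 (ei) out of 40 + 60 in the text)."""
    t = tail(pmf)
    return pmf[L] / t[L] * (n_target / n_total)

# toy dwell-length pmf over lengths 0..9
pmf = np.array([0.0, 0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.03, 0.02])
p_stay = self_transition(pmf, 1)           # 0.7 for this pmf
p_ei = exit_transition(pmf, 1, 40, 100)    # 0.3 * 0.4 = 0.12
p_ej = exit_transition(pmf, 1, 60, 100)    # 0.3 * 0.6 = 0.18
print(p_stay + p_ei + p_ej)                # outgoing probabilities sum to 1
```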

The conventional HMM method is based on a stationary set of emission and transition probabilities. Emission broadening, via amplification of the emission state variances (EVA), is a filtering heuristic that leads to a level projection strongly preserving the transition times between major levels (see [43] for further details). This approach does not require the user to define the number of levels (classes), a major advantage over existing tools that require the user to determine the levels and perform a state projection, and it allows kinetic features to be extracted with a "simple" FSA (Finite State Automaton) that requires minimal tuning. One important application of the HMM-with-Duration method used in [43] is kinetic feature extraction from EVA-projected channel current data (the HMM-with-Duration is shown to offer a critical stabilizing capability in an example in [43]). The EVA-projected/HMMwDur processing offers a hands-off (minimal tuning) method for extracting the mean dwell times for the various blockade states (the core kinetic information).
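Once a signal has been EVA-projected onto a few major levels, the core kinetic information is just the mean dwell time per level, which a minimal pass can extract; the piecewise-constant toy sequence below stands in for a projected trace:

```python
import numpy as np
from itertools import groupby

def mean_dwell_times(levels):
    """Mean dwell time per blockade level from an EVA-projected
    (piecewise-constant) level sequence."""
    dwells = {}
    for level, run in groupby(levels):
        dwells.setdefault(level, []).append(sum(1 for _ in run))
    return {lv: float(np.mean(d)) for lv, d in dwells.items()}

# toy projected sequence: level 30 dwells of 10 and 30 samples,
# level 60 dwells of 20 and 40 samples
seq = [30] * 10 + [60] * 20 + [30] * 30 + [60] * 40
print(mean_dwell_times(seq))  # {30: 20.0, 60: 30.0}
```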

Binary Support Vector Machines (SVMs) are based on a decision-hyperplane heuristic that incorporates structural risk minimization by attempting to impose a training-instance void, or "margin," around the decision hyperplane [45, 46]. Feature vectors are denoted by x_{ik}, where index i labels the M feature vectors (1 ≤ i ≤ M) and index k labels the N feature vector components (1 ≤ k ≤ N). For the binary SVM, labeling of training data is done using the label variable y_{i} = ±1 (with sign according to whether the training instance is from the positive or negative class). For hyperplane separability, elements of the training set must satisfy the following conditions: w_{β}x_{iβ} - b ≥ +1 for i such that y_{i} = +1, and w_{β}x_{iβ} - b ≤ -1 for y_{i} = -1, for some values of the coefficients w_{1},...,w_{N} and b (using the convention of implied sum on repeated Greek indices). This can be written more concisely as: y_{i}(w_{β}x_{iβ} - b) - 1 ≥ 0. Data points that satisfy the equality in the above are known as "support vectors" (or "active constraints").
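The separability conditions can be checked directly for a candidate hyperplane (w, b); the toy data and hyperplane below are assumptions for illustration:

```python
import numpy as np

def margin_slack(X, y, w, b):
    """y_i (w . x_i - b) - 1 for each training instance: nonnegative
    everywhere means (w, b) separates the data with unit margin, and
    zeros mark the support vectors (active constraints)."""
    return y * (X @ w - b) - 1.0

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.0]), 0.0

slack = margin_slack(X, y, w, b)
print(slack)  # zeros at points 0 and 2: these are the support vectors
```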

Once training is complete, discrimination is based solely on position relative to the discriminating hyperplane: w_{β}x_{iβ} - b = 0. The boundary hyperplanes on the two classes of data are separated by a distance 2/w, known as the "margin," where w^{2} = w_{β}w_{β}. The optimal separating hyperplane is obtained by increasing the margin between the separated data as much as possible. In the usual SVM formulation, the goal to maximize w^{-1} is restated as the goal to minimize w^{2}. The Lagrangian variational formulation then selects an optimum defined at a saddle point of L(w, b; α) = (w_{β}w_{β})/2 - α_{γ}y_{γ}(w_{β}x_{γβ} - b) + α_{0}, where α_{0} = Σ_{γ}α_{γ} and α_{γ} ≥ 0 (1 ≤ γ ≤ M). The saddle point is obtained by minimizing with respect to {w_{1},...,w_{N}, b} and maximizing with respect to {α_{1},...,α_{M}}. If y_{i}(w_{β}x_{iβ} - b) - 1 ≥ 0, then maximization on α_{i} is achieved for α_{i} = 0. If y_{i}(w_{β}x_{iβ} - b) - 1 = 0, then there is no constraint on α_{i}. If y_{i}(w_{β}x_{iβ} - b) - 1 < 0, there is a constraint violation, and α_{i} → ∞. If absolute separability is possible the last case will eventually be eliminated for all α_{i}; otherwise it is natural to limit the size of α_{i} by some constant upper bound, i.e., max(α_{i}) = C, for all i. This is equivalent to another set of inequality constraints, α_{i} ≤ C. Introducing slack variables ξ_{γ} and associated Lagrange multipliers μ_{γ} (1 ≤ γ ≤ M) to achieve this, the Lagrangian becomes: L(w, b; α, ξ, μ) = (w_{β}w_{β})/2 - α_{γ}[y_{γ}(w_{β}x_{γβ} - b) + ξ_{γ}] + α_{0} + ξ_{0}C - μ_{γ}ξ_{γ}, where ξ_{0} = Σ_{γ}ξ_{γ}, α_{0} = Σ_{γ}α_{γ}, and α_{γ} ≥ 0 and ξ_{γ} ≥ 0 (1 ≤ γ ≤ M).

At the variational minimum on the {w_{1},...,w_{N}, b} variables, w_{β} = α_{γ}y_{γ}x_{γβ}, and the Lagrangian simplifies to: L(α) = α_{0} - (α_{δ}y_{δ}x_{δβ}α_{γ}y_{γ}x_{γβ})/2, with 0 ≤ α_{γ} ≤ C (1 ≤ γ ≤ M) and α_{γ}y_{γ} = 0, where only the variations that maximize in terms of the α_{γ} remain (known as the Wolfe Transformation). In this form the computational task can be greatly simplified. By introducing an expression for the discriminating hyperplane: f_{i} = w_{β}x_{iβ} - b = α_{γ}y_{γ}x_{γβ}x_{iβ} - b, the variational solution for L(α) reduces to the following set of relations (known as the Karush-Kuhn-Tucker, or KKT, relations): (i) α_{i} = 0 ⇔ y_{i}f_{i} ≥ 1, (ii) 0 < α_{i} < C ⇔ y_{i}f_{i} = 1, and (iii) α_{i} = C ⇔ y_{i}f_{i} ≤ 1. When the KKT relations are satisfied for all of the α_{γ} (with α_{γ}y_{γ} = 0 maintained) the solution is achieved. (The constraint α_{γ}y_{γ} = 0 is satisfied for the initial choice of multipliers by setting the α's associated with the positive training instances to 1/N^{(+)} and the α's associated with the negatives to 1/N^{(-)}, where N^{(+)} is the number of positives and N^{(-)} is the number of negatives.) Once the Wolfe transformation is performed it is apparent that the training data (support vectors in particular, KKT class (ii) above) enter into the Lagrangian solely via the inner product x_{iβ}x_{jβ}. Likewise, the discriminator f_{i}, and KKT relations, are also dependent on the data solely via the x_{iβ}x_{jβ} inner product.

Generalization of the SVM formulation to data-dependent inner products other than x_{iβ}x_{jβ} is possible, and is usually formulated in terms of the family of symmetric positive-definite functions (reproducing kernels) satisfying Mercer's conditions [45, 46].

The SVM kernels that are used are based on "regularized" distances or divergences like those used in [8, 47], where regularization is achieved by exponentiating the negative of a squared distance measure (d^{2}(x, y)) or a symmetrized divergence measure (D(x, y)): the former when using a geometric heuristic for comparison of feature vectors, the latter when using a distributional heuristic. For the Gaussian Kernel: d^{2}(x, y) = Σ_{k}(x_{k} - y_{k})^{2}; for the Absdiff Kernel: d^{2}(x, y) = (Σ_{k}|x_{k} - y_{k}|)^{1/2}; and for the Symmetrized Relative Entropy Kernel: D(x, y) = D(x||y) + D(y||x), where D(x||y) is the standard relative entropy.
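The three kernels can be written out as follows. The 1/(2σ²) scaling inside the exponent is an assumed convention (the cited work may normalize differently), and the Absdiff form follows the d² expression given above:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-d^2 / 2 sigma^2), d^2 = sum_k (x_k - y_k)^2."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def absdiff_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-d^2 / 2 sigma^2), d^2 = (sum_k |x_k - y_k|)^(1/2)."""
    return np.exp(-np.sqrt(np.sum(np.abs(x - y))) / (2.0 * sigma ** 2))

def sym_entropy_kernel(p, q, sigma=1.0):
    """K(p, q) = exp(-D / 2 sigma^2), D = D(p||q) + D(q||p), for strictly
    positive probability vectors p and q."""
    D = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return np.exp(-D / (2.0 * sigma ** 2))

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
print(gaussian_kernel(p, p), sym_entropy_kernel(p, p))  # both 1.0 at zero distance
print(0.0 < sym_entropy_kernel(p, q) < 1.0)             # True for p != q
```

All three are "regularized" in the sense of the text: a larger distance or divergence maps to a kernel value closer to zero, and identical arguments map to one.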