- Proceedings
- Open Access

# A novel, fast, HMM-with-Duration implementation – for application with a new, pattern recognition informed, nanopore detector

- Stephen Winters-Hilt
^{1, 2}Email author and - Carl Baribault
^{1}

**8 (Suppl 7)**:S19

https://doi.org/10.1186/1471-2105-8-S7-S19

© Winters-Hilt and Baribault; licensee BioMed Central Ltd. 2007

**Published:**01 November 2007

## Abstract

### Background

Hidden Markov Models (HMMs) provide an excellent means for structure identification and feature extraction on stochastic sequential data. An HMM-with-Duration (HMMwD) is an HMM that can also exactly model the hidden-label length (recurrence) distributions – while the regular HMM will impose a best-fit geometric distribution in its modeling/representation.

### Results

A Novel, Fast, HMM-with-Duration (HMMwD) Implementation is presented, and experimental results are shown that demonstrate its performance on two-state synthetic data designed to model Nanopore Detector Data. The HMMwD experimental results are compared to (i) the ideal model and to (ii) the conventional HMM. Its accuracy is clearly an improvement over the standard HMM, and matches that of the ideal solution in many cases where the standard HMM does not. Computationally, the new HMMwD has all the speed advantages of the conventional (simpler) HMM implementation. In preliminary work shown here, HMM feature extraction is then used to establish the first pattern recognition-informed (PRI) sampling control of a Nanopore Detector Device (on a "live" data-stream).

### Conclusion

The improved accuracy of the new HMMwD implementation, at the same order of computational cost as the standard HMM, is an important augmentation for applications in gene structure identification and channel current analysis, especially PRI sampling control, for example, where speed is essential. The PRI experiment was designed to inherit the high accuracy of the well characterized and distinctive blockades of the DNA hairpin molecules used as controls (or blockade "test-probes"). For this test set, the accuracy inherited is 99.9%.

## Keywords

- Support Vector Machine
- Hide Markov Model
- Support Vector Machine Classification
- Finite State Automaton
- Splitting Factor

## Introduction

The conventional HMM is first order and fixed (scalar) in the transition probabilities for remaining in a given state, which leads to a geometric length distribution for remaining in that state [5] – i.e., conventional HMMs automatically impose geometric length distributions on their same-state regions, such as exon or intron lengths or blockade level durations. An HMM-with-Duration (HMMwD) is an HMM where true, or a much more complete, knowledge of the length distributions on same-state regions is incorporated into the model [6]. Here we describe a novel HMMwD where the non-geometric length distribution information is incorporated via dwell-time dependent transition probabilities (for transitions of state to self) [4]. New experimental results are shown, and compared to an exact HMMwD (described in [7]).

Part of the novelty of the new "cellular" HMMwD that is proposed is that it is defined at the cell-level in its dynamic programming table construction, much like the conventional HMM, with one column's computation only dependent on information held in the prior column (in an overall table computation involving a single pass through the table). Our HMMwD can be defined for either the Viterbi or the Forward/Backward algorithms (see Methods). This is convenient because we have a method for distributed HMM processing based on such table computations (paper in preparation), that takes advantage of the basic, underlying Markov assumption to do distributed processing with simplicity of the "chunk" distributed processing that we use for the SVM training [8]. That method is shown to work very effectively for the Viterbi algorithm (similar distribution methods for the Forward/Backward distribution algorithm are also discussed there). The end-result of all of this is that very sophisticated feature extraction tools can be brought to bear for real-time pattern recognition informed operation.

Real-time control of a nanopore detector, based on live, streaming measurements and sufficiently fast pattern recognition identification of any blockading ("captured") analyte, holds great promise for single molecule experiments. Real-time *sampling* control of a nanopore detector, alone, can boost nanopore detector sampling productivity by orders of magnitude, depending on the mix of desirable signal classes vs. undesirable in the data being analyzed. An example of such an experiment is the focus of the proof-of-principle experiment performed here. If there is a 1 to 100 ratio of desirable to undesirable, for example, then one obtains desirable signal sampling only about 1% of the time with a passive sampling system. With pattern recognition informed sampling this can potentially be changed to desirable signal sampling almost 99% of the time. In a real-time setting the challenge is to perform the HMM feature extraction sufficiently quickly (whereas the SVM is trained off-line, so operates very quickly on-line). In this work we show that this can be accomplished with pattern recognition used to identify DNA molecules within the first few hundred milliseconds of their blockade of the detector channel.

We establish a real-time experimental setup for Patter Recognition Informed (PRI) sampling control via integration of LabWindows automation capabilities with our "in-house" Channel Current Cheminformatics (CCC) methods (see Methods). Data acquired with LabWindows is passed to a network of CCC software clients, on a streaming real time basis, for analysis and classification. The classification results are then quickly returned to the LabWindows automation software for experimental feedback control. Further details of a real-time test of PRI sampling is described in the Results. The classification inherits the 99.9% accuracy of the non real-time implementation (established in prior work [1]) as nothing has changed in regards to the features extracted and the classifier used.

To make these Channel Current Cheminformatics and Machine Learning tools available to fellow researchers, we are developing web-accessible machine-learning tools. Using hidden Markov model (HMM) processing, and finite state automata (FSAs), we are able to extract robust features and obtain very accurate support vector machine (SVM) classification results. The Machine Learning web-interfaces described here are for both machine learning experts and non-experts, particularly biologists and biophysicists/biochemists. For non-experts, default values are specified on the key parameters.

Web-accessible tools for HMM-based feature extraction and SVM classification are accessible at http://logos.cs.uno.edu/~nano/. Examples of the Web interfaces are shown in the Results. The web tools can help in identification of blockade levels, the level transition and lifetime characteristics, and the fast blockade "spike" characteristics. The SVM classification is of general use for any kind of classification problem, and a number of novel kernels and novel implementations are employed. SVM-based clustering is also implemented in a novel way to yield a non-parametric clustering approach, which is used to cluster signals into multiple classes (particularly important for complex multi-orientation data-sampling situations such as with an antibody).

## Background

### Nanopore detector

Single biomolecules, and the ends of biopolymers such as DNA, have been examined in solution with nanometer-scale precision using nanopore blockade detection [1–3, 9–11]. In early studies [2], it was found that complete base-pair dissociations of dsDNA to ssDNA, "melting", could be observed for sufficiently short DNA hairpins. In later work [1–3, 9], the nanopore detector attained Angstrom resolution and was used to "read" the ends of dsDNA molecules, and was operated as a chemical biosensor. In [9–13], the nanopore detector was used to observe the conformational kinetics of the end regions of individual DNA hairpins (see Fig. 1, Lower Panel).

The α-hemolysin (α-HL) channel, a protein heptamer formed by seven identical 33 kD protein molecules secreted by *Staphylococcus aureus*, is used as the channel in the nanopore device due to its stable conformation (in the strongly favored heptamer formation, which has minimal gating) and its overall geometry (see Fig. 1, Lower Panel). DNA and RNA interaction with the α-hemolysin channel during translocation is non-negligible (but not too strong either, i.e., it is not such that the molecule "gets stuck"). Although dsDNA is too large to translocate, about ten base-pairs at one end can still be drawn into the large *cis*-side vestibule. This permits very sensitive experiments since the ends of "captured" dsDNA molecules can be observed for extensive periods of time to resolve features, allowing highly accurate classification of the captured end of dsDNA molecules [1–3, 9–13]. This is a very brief and limited synopsis of the Nanopore Detector background relevant to this paper. For other references on Nanopore Detectors see the review of Nanopore Detectors presented in [14]; early work involving alpha-Hemolysin Nanopore Detectors can be found in [1–3, 9–11, 15–25]; rapidly growing research endeavors on Nanopore Detectors based on solid-state and other synthetic platforms can be found in [26–36].

### Cheminformatics

### EVA projection

### Preliminary HMM with Duration and EVA for channel current signal analysis

One important application of the HMM-with-duration method used in [4] includes kinetic feature extraction from EVA projected channel current data (the HMM-with-Duration is shown to offer a critical stabilizing capability in an example in [4]). The EVA-projected/HMMwD processing offers a hands-off (minimal tuning) method for extracting the mean dwell times for various blockade states (the core kinetic information).

The HMM-with-Duration implementation, described in [4], has been tested in terms of its performance at parsing synthetic blockade signals (see Fig. 5). The synthetic data ranges over an exhaustive set of possibilities for thorough testing of the HMM-with-Duration. The synthetic data used in [4] was designed to have two levels with lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). The HMM here was performed with 0 EM iterations.

### SVM classification used in PRI sampling selection

*multiclass*SVMs. One arranges several

*binary*classifiers as a decision tree such that they perform a multi-class decision-making function (SVM-external classification – an example of this is the architecture used here, see Fig. 2). The second approach involves solving a single optimization problem corresponding to the entire data set (with multiple hyperplanes), with multi-class discriminator optimization performed internally. The SVM-internal approach, when it is stable and properly generalizable (an active area of research), is preferred (see Results for interface), since a tuning over Decision tree topologies and weightings is avoided [40]. The on-line discriminatory speed of a trained SVM is simply that of evaluating an inner product, so it's operational constraint on the PRI feedback endeavor is negligible compared to that of the HMM feature extraction stage. For this reason, there is little discussion of SVMs in this paper, even though SVMs comprise much of the complexity of the HMM/SVM PRI feedback system.

### Generalized HMMs

The work by Johnson in [41] is an excellent source of references on generalized HMMs (but entirely focused on speech applications). The approaches to HMM duration described in [41] are broken into three categories: 1) Hidden semi-Markov models (HSMMs), 2) variable transition HMMs (VTHMMs), and 3) standard HMMs with more states, etc. The current work thus would fall into the VTHMM category. In [41] they assert that VTHMM methods are essentially variations on Ferguson's [42] explicit duration HMM (EDHMM), placed in the HSMM category, yet acknowledge the differences in computational complexity. In [43], Levinson gives estimates of transitions and emissions for state durations modeled by gamma distributions, but no explicit method provided or used to decode actual signals. In [44], Ramesh et al estimate the duration dependent state transitions, A_{ij}(d), in general rather than assume, as in the current work, that A_{ij}(d) = C_{ij} for i ≠ j, but do not offer any performance analysis per se of the Viterbi decoding algorithm. Also, [44] provides insight into the work by Ferguson [41] in that, the expansion state HMM (ESHMM) in [41] uses multiple expansion states in order to finitely approximate the duration of a given state of the original system. The current work uses only one model state for each known physical state, and uses time dependent self-transition probabilities to capture state duration information.

In [45], Mitchell et al model duration explicitly using expansion substates, rather than an explicitly time dependent self-transition probability as in the current work. It also manages computational complexity via parallel platform architecture. The paper listed as [46], is the only prior work that is found to model the self-transitions using the cumulative duration in the form A_{ii}(d_{i}) = 1 - P(d_{i}), and this was only in a much more restricted sense than that described here. The authors there considered only a two-state system for which the "splitting" in probabilities upon exiting one state is trivial. In [46], Vaseghi computes the duration dependent state transitions directly from the Viterbi output, thus avoiding the costly forward-backward computation and this is similar the methods employed here for general multi-state HMMs. This is useful in situations where the exit transitions are not already known and must be estimated from the data. Otherwise, like most of the references included here, though it can apply specifically to the current work, it involves algorithms that are more generic and computationally complex and not the clear, extremely fast, and simple implementation described here.

As in most other works cited here, Sin et al. [47] involves more generic computations and hence more computational complexity, though in the examples provided avoids the complexity of multiple exit transitions from any given state by restricting the actual computations performed to only left-to-right type models, where there is only one exit transition per state. (Future work might be to use the framework in [47] to confirm the choice of splitting factor for multiple exit transitions, see Discussion.) Park et al [48] also performs analysis using only left-to-right models, where the splitting factor for multiple exit transitions is not an issue. Though emissions for a given state in cheminformatic data are not expected to be time-dependent – other than noise component, future work would be to use the framework in [48] to confirm stationary emissions. Finally, in Yoma et al. [49] the analysis is restricted again to the typical left-to-right models of speech recognition, where considerations of splitting factor for exit transitions are trivial.

## Results

### HMM with Duration experimental tests

*n*states, where

*n*is the number of dwell bins in the quantization of the dwell-time distribution. The computations that are needed scale quadratically in

*n*. So, for

*n*= 1000, the explicit HMMwD can take 1,000,000-times longer to compute than the HMMwD described here (or the conventional HMM). Performance of the new HMMwD is given in comparison to the conventional HMM is also shown in Figure 7 and in the figures in Supplemental Files 1–3. Specifically, the figures in Additional File 1–3 show the Average and Standard Deviation of Viterbi Decoding Accuracy over 10 different trials (instances) of 10 k-length synthetic 2-level signal data, where both levels have identical Poisson duration but the separation between the 2 levels varies. From top to bottom, the Viterbi response improves as the number of steps increases in the decoding HMM's approximation of the 1 k-step generating HMM's Poisson durations. From left to right in each plot, the Viterbi response improves as the separation of the 2 levels (emission means) increases. In the figure in Additional File 1 are shown the results with only 1 step used in the Viterbi decoding HMM's approximation of the 1 k-step generating HMM's Poisson durations. In the figure in Additional File 2 are shown the results with 10 steps used in the Viterbi decoding HMM's approximation of the 1 k-step generating HMM's Poisson durations. In the figure in Additional File 3 are shown the results with 100 steps used in the Viterbi decoding HMM's approximation of the 1 k-step generating HMM's Poisson durations. The simplicity of the binary signal case actually provides a challenging case for differentiating the performance of the HMMwD methods from the conventional HMM, so the preliminary results shown here show promise for the overall validity and utility of the method.

For all row entries the average dwell time of both the upper and lower signal levels increases proportionately with bin-count.

Bin # | 2-state Geometric HMM | 2n-state Explicit HMMwD | 2-state Implicit HMMwD |
---|---|---|---|

10 | 743876 | 832514 | 756765 |

100 | 951916 | 960272 | 953151 |

1000 | 995388 | 995408 | 995381 |

As with Table 1, for all row entries the average dwell time of both the upper and lower signal levels increases proportionately with bin-count.

Bin # | 2-state Geometric HMM | 2n-state Explicit HMMwD | 2-state Implicit HMMwD |
---|---|---|---|

10 | 720050 | 795723 | 741579 |

100 | 947062 | 957753 | 948505 |

1000 | 994978 | 995013 | 995005 |

### Pattern recognition informed feedback via LabWindows automation

A blockade signal's stationary statistics, or phases thereof, reveals information about the kinetics of the biopolymer resulting from interactions with surroundings, or from undergoing conformational changes. LabVIEW Automation software is used to help manage the feedback linkage between patch-clamp amplifier measurements and in-house cheminformatics software. This has been used to demonstrate molecular identification in the first 100 ms of capture, with return of classification information to the control of the amplifier – for voltage-controlled sample ejection if desired. Screen-captures of the interfaces are shown in the figures in Additional Files 5–7. Additional File 5 shows the Acquisition Server interface and LabWindows C development environment. Additional File 6 shows a real time 9at vs 9gc classification (with 9at identification indicated by the led being on). Additional File 7 shows a real time 9gc identification.

The LabWindows Server initiates the distributed CCC computations by sending data to a cluster of Linux Clients via a TCP/IP channel. The Linux clients run the expensive HMM analysis as distributed processes (similarly for off-line SVM training). The sample classification is then used by the Server to provide feedback to the nanopore apparatus to increase the effective sampling time on the molecules of interest (meant to boost nanopore detector productivity by magnitudes, as described in the Background).

### Channel current cheminformatics and Machine-Learning web interfaces

The tFSA/HMM-based channel current cheminformatics web-interfaces are as follows: (1), As the tFSA runs, signal regions are identified for processing by the HMM. (2), A 50-state generic-HMM (typically) is used to obtain feature vectors, to be used by a SVM, consisting of three components: (i) Blockade Histogram: blockade level occupation probabilities (stat's on Viterbi path); (ii) The Emission Variances; and (iii) Compression of Transition probabilities to weighted sum of transition profiles from two dominant levels (the UL and LL states) (one of the interfaces for this process is shown in the figure in Additional File 8). And (3), Kinetic features are then extracted (often with EVA-projected data from step 2). An SVM interface and all other software described is available via the webpage: http://logos.cs.uno.edu/~nano/. As with the external and internal multiclass SVM discriminator implementations, the strong performance of the binary SVM enables SVM-External as well as SVM-Internal approaches to clustering [40].

## Methods and discussion

### Conventional HMM

- 1.
Baum-Welch Iteration of recursively defined forward/backward probabilities (symbols x, states π)

f

_{ l }(*i*+ 1) = P(x_{1}...x_{i+1}, π_{i+1}=*l*) =*e*_{ l }(x_{i+1})π_{ k }f_{ k }(*i*)*a*_{ kl }b

_{ k }(*i*) = P(x_{i+1}...x_{ L }| π_{ i }=*k*) = π_{ l }*a*_{ kl }*e*_{ l }(x_{i+1})b_{ l }(*i*+ 1) - 2.
Viterbi Path Determination

v

_{ l }(*i*) =*e*_{ l }(x_{ i })max_{ k }(v_{ k }(*i*- 1)*a*_{ kl })ptr

_{ i }(*l*) = argmax_{ k }(v_{ k }(*i*- 1)*a*_{ kl })

In the HMMwD described here, each of the stationary transition probabilities a_{
kl
}are replaced by a dwell-time dependent update factor. The forward/backward probabilities used in the standard HMM-EM algorithm occur when evaluating p(Z_{0...L-1}) by breaking the sequence probability p(Z_{0...L-1}) into two pieces via use of a single hidden variable treated as a Bayesian parameter: p(Z_{0...L-1}) = Σ_{k}p(Z_{0...i}, s_{i} = k)p(Z_{i+1...L-1}, s_{i} = k) = Σ_{k}f_{ki}b_{ki}, where f_{ki} = p(Z_{0...i}, s_{i} = k) and b_{ki} = p(Z_{i+1...L-1}, s_{i} = k). Given stationarity, the state transition probabilities and the state probabilities at the ith observation satisfy the trivial relation p_{qi} = Σ_{k}a_{kq}p_{k(i-1)}, where p_{qi} = p(S_{i} = q), and p_{q0} = p(S = q), and the latter probabilities are the state priors. The trivial recursion relation that is implied can be thought of as an operator equation, with operation the product by a_{kq} followed by summation (contraction) on the k index. The operator equation can be rewritten using an implied summation convention on repeated Greek-font indices (Einstein summation convention): p_{q} = a_{βq}p_{β}. Transition-probabilities in a similar operator role, but now taking into consideration local sequence information via the emission probabilities, are found in recursively defined expressions for the forward variables, f_{ki} = e_{ki}(a_{βk}f_{β(i-1)}), and backward variables, b_{ki} = a_{kβ}e_{β(i+1)}b_{β(i+1)}. The recursive definitions on forward and backward variables permit efficient computation of observed sequence probabilities using dynamic programming tables. It is at this critical juncture that side information must mesh well with the states (column components in the table), i.e., in a manner like the emission or transition probabilities. Length information, for example, can be incorporated via length-distribution-biased transition probabilities (introduced in [4]), and that is what is experimentally validated done here.

### EVA projection

Using a standard implementation of a HMM with emissions probabilities parameterized by Gaussian distributions: emission_probabilities[i][k] = exp(-(k - i)*(k - i)/(2*variance)), where "i" and "k" are each a state where 0 <= i, k <= 49 in a 50 state system. To perform EVA, the variance is simply multiplied by a factor that essentially widens the gaussian distribution parameterized to best fit the emissions, and the equation simply becomes exp(-(k-i)*(k-i)/(2*variance*eva_factor)). The choice of this amplification factor is important. If too large of a factor is used, then the power signal will be altered to the point where the state transition information will be invalid. But for a sizable range of this parameter, HMM with EVA will remove the noise from the power signal while *strictly* maintaining the state transitions. In practice, any multiplicative factor between 2 and 10 works well.

After EVA-projection, a simple FSA can easily extract level duration information. Each level is identified by a simple threshold of blockade readings, typically one or two percent of baseline. Then, the HMM with EVA processed data is swept through with a small window to eliminate any remaining noise or spike artifacts that may distort actual level duration statistics. It is important to note that there must be a sufficient amount of data to support the level duration statistics. A small sample will simply not be representative of the true kinetic information. Further, it is important that the FSA is tuned properly in order to properly identify levels.

### HMM with Duration via cumulant transition probability

_{length}= L) = Prob(e

_{length}≥ L + 1)/Prob(e

_{length}≥ L). The transition probabilities out of state 'e' can have some subtleties, as shown in the following where the states are exon (e), intron (i), and junk (j). In this case, the transition probabilities governing the following transitions, (jj) -> (je), (ee) -> (ej), (ee) -> (ei), (ii) -> (ie) are computed as: Prob(ei | e

_{length}= L) = Prob(e

_{length}= L)/Prob(e

_{length}≥ L) × 40/(40 + 60) and Prob(ej | e

_{length}= L) = Prob(e

_{length}= L)/Prob(e

_{length}≥ L) × 60/(40 + 60), where the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40. The pseudocode to track the critical length information, on a cellular basis in the dynamic programming table, goes as follows:

- 1.
Maintain separate counters for the junk, exon and intron regions.

- 2.
The counters are updated as follows:

- a.
The exon counter is set to 2 for a (je) - (ee) transition

- b.
The exon counter gets incremented by 1 for every (ee) - (ee) transition

Prob(e_{length} ≥ L + 1) is computed as: Prob(e_{length} ≥ L + 1) = 1 - ∑_{i=1..L}Prob(e_{length} = i). Hence we generate a list such that for each index k > 0, the value 1 - ∑_{i=1..k}Prob(e_{length} = i) is stored.

Simplifying from three state model, {e, i, j}, to a two-state model, {e, i}, for a moment: after n occurrences of state 'e', the 2 cases of update factor to handle are:

P(e_{n+1}|ee..e_{n}) = P(length e ≥ n+1)/P(length e ≥ n)

P(i|ee..e_{n}) = P(length e = n)/P(length e ≥ n)

Similarly for n occurrences of state 'i', and there are no probability splitting ambiguities upon exiting state 'e' as there is only one state to exit to in the two-state system (and there are no differences in the Viterbi and Forward/Backward transition probability alterations).

Consider, as an example, a simple extension of our two-state notation to cover N + 1 states: {e, i^{1},...,i^{N}}. Suppose we are interested in the probability of an 'i' after seeing a length 4 segment of e-states:

P(i|eeee) = 1 - P(e|eeee) = 1 - P(L_{e} = 5)/P(L_{e} = 4)

There are two types of transition rule that appear to result, one for Viterbi, with its **maximum** operation on paths, and one for Forward/Backward, with its **sum** operation. For the two-state case, N = 1, and these update rules involving "splitting" factors all become the same (see Results):

#### Viterbi update

- (1)
Difference splitting: p(i

^{k}|eeee) = [1 - P(L_{e}= 5)/P(L_{e}= 4)] * [1 + P(i^{k}|e) - Avg_{z}P(Z|e)] - (2)
Ratio splitting: p(i

^{k}|eeee) = [1 - P(L_{e}= 5)/P(L_{e}= 4)] * P(i^{k}|e)/[Avg_{z}P(Z|e)]

In this situation we are not maintaining a sum rule on probabilities, here we are viewing each path through the table in a manner consistent with the maxprob evaluation.

#### Forward/Backward update

- (1)
Difference splitting: p(i

^{k}|eeee) = [1 - P(L_{e}= 5)/P(L_{e}= 4)] * [1 + P(i^{k}|e) - Avg_{z}P(Z|e)]/N - (2)
Ratio splitting: p(i

^{k}|eeee) = [1 - P(L_{e}= 5)/P(L_{e}= 4)] * P(i^{k}|e)/[N*Avg_{z}P(Z|e)]

The equation above with the factor [1 + P(i^{k}|e) - Avg_{z}P(Z|e)]/N provides a suitable "splitting factor", as i^{k} and 'e' probabilities sum to one, remain positive, and have other nice properties. The splitting factor is not unique, however, as case (2) makes clear.

It is important to note that we appear to have some freedom on splittings of probabilities upon exiting a state (when we are using the length distribution cumulants to describe transition probabilities, etc.). This is merely an associated effect of that length distribution incorporation – now upon exiting that length distribution our main factor is P(i^{k}|some prior length of e), that factor is blind to the appropriate splittings amongst the "not e" states, and we must incorporate another factor to deal with the probability splitting – in the case of forward/backward, this is chosen to obey a prob sum to 1 on all cases, on Viterbi this must maintain each path's probability with proper weighting with respect to the others (consistent with the max-path operation).

### Real-time processing hardware/software setup

- 1.
Acquire data from the DAQ buffer at 50 KHz sample rate.

- 2.
Update the server GUI screen with the acquired data – though a 10× data decimation was required in order to avoid irrevocable delays in reading from the DAQ buffer

- 3.
Perform tFSA logic to screen for signals resulting from molecular capture events at the nanopore channel

- 4.
Send capture-signals as long as 100 ms in duration at a rate 10 per second to a waiting 1.5 GHz processor, Linux-based TCP/IP client for HMM-based feature extraction

- 5.
Receive extracted HMM features from the TCP/IP client

- 6.
Compute the classification of the HMM features with tolerance via a binary SVM previously trained on 9gc and 9at bphp signals.

- 7.
According to the user's preset preference, issue a control signal to the DAQ resulting in ejection of the undesirable molecule so determined from the nanopore channel site.

## Declarations

### Acknowledgements

Federal funding was provided by NIH K-22 (PI, 5K22LM008794), NIH NNBM R-21 (co-PI), and NIH R-01 (sub-award). State funding was provided from a LaBOR Enhancement (PI), a LaBOR Research Competitiveness Subcontract (PI), and a LaBOR/NASA LaSPACE Grant (PI). Thanks to Eric Morales and Iftekhar Amin for gathering data. Thanks to Matthew Landry for establishing the feature vector test sets from data from the nanopore detector. Thanks to A. Murat Eren for work updating the web-interfaces. Funding also provided by New Orleans Children's Hospital and the University of New Orleans Computer Science Department.

This article has been published as part of *BMC Bioinformatics* Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S7.

## Authors’ Affiliations

## References

- Winters-Hilt S, Vercoutere W, DeGuzman VS, Deamer DW, Akeson M, Haussler D: Highly Accurate Classification of Watson-Crick Basepairs on Termini of Single DNA Molecules. Biophys J. 2003, 84: 967-976.PubMed CentralView ArticlePubMedGoogle Scholar
- Vercoutere W, Winters-Hilt S, Olsen H, Deamer DW, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol. 2001, 19 (3): 248-252.View ArticlePubMedGoogle Scholar
- Vercoutere W, Winters-Hilt S, DeGuzman VS, Deamer D, Ridino S, Rogers JT, Olsen HE, Marziali A, Akeson M: Discrimination Among Individual Watson-Crick Base-Pairs at the Termini of Single DNA Hairpin Molecules. Nucl Acids Res. 2003, 31: 1311-1318.PubMed CentralView ArticlePubMedGoogle Scholar
- Winters-Hilt S: Hidden Markov Model Variants and their Application. BMC Bioinformatics. 7 (Suppl 2): S14-2006, Sept. 26Google Scholar
- Durbin R: Biological sequence analysis: probalistic models of proteins and nucleic acids. 1998, Cambridge, UK New York: Cambridge University Press, xi: 356-View ArticleGoogle Scholar
- Koski T: Hidden Markov Models for Bioinformatics. 2001, Kluwer Academic PublishersView ArticleGoogle Scholar
- Churbanov A, Baribault C, Winters-Hilt S: Duration learning for nanopore ionic flow blockade analysis. BMC Bioinformatics. 2007, 8 (Suppl 7): S14-PubMed CentralView ArticlePubMedGoogle Scholar
- Osuna E, Freund R, Girosi F: An improved training algorithm for support vector machines. Neural Networks for Signal Processing VII. Edited by: Principe J, Gile L, Morgan N, Wilson E. 1997, IEEE, New York, 276-85.Google Scholar
- Winters-Hilt S: Nanopore detection using channel current cheminformatics. SPIE Second International Symposium on Fluctuations and Noise, 25–28 May. 2004Google Scholar
- Winters-Hilt S, Akeson M: Nanopore cheminformatics. DNA and Cell Biology. 2004Google Scholar
- Winters-Hilt S: Highly Accurate Real-Time Classification of Channel-Captured DNA Termini. Third International Conference on Unsolved Problems of Noise and Fluctuations in Physics, Biology, and High Technology. 2003, 355-368.Google Scholar
- Winters-Hilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini. BMC Bioinformatics. 7 (Suppl 2): S22-2006, Sept. 26Google Scholar
- Winters-Hilt S, Davis A, Amin I, Morales E: Nanopore current transduction analysis of protein binding to non-terminal and terminal DNA regions: analysis of transcription factor binding, retroviral DNA terminus dynamics, and retroviral integrase-DNA binding. BMC Bioinformatics. 2007, 8 (Suppl 7): S10-PubMed CentralView ArticlePubMedGoogle Scholar
- Ken Healy K: Nanopore-Based DNA Analysis. PhD Thesis. 2006, The National University of Ireland – University College CorkGoogle Scholar
- Bezrukov SM, Vodyanoy I, Parsegian VA: Counting polymers moving through a single ion channel. Nature. 1994, 370 (6457): 279-281.View ArticlePubMedGoogle Scholar
- Bezrukov SM: Ion Channels as Molecular Coulter Counters to Probe Metabolite Transport. J Membr Biol. 2000, 174: 1-13.View ArticlePubMedGoogle Scholar
- Kasianowicz JJ, Brandin E, Branton D, Deamer DW: Characterization of Individual Polynucleotide Molecules Using a Membrane Channel. Proc Natl Acad Sci USA. 1996, 93 (24): 13770-73.PubMed CentralView ArticlePubMedGoogle Scholar
- Akeson M, Branton D, Kasianowicz JJ, Brandin E, Deamer DW: Microsecond time-scale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophys J. 1999, 77 (6): 3227-3233.PubMed CentralView ArticlePubMedGoogle Scholar
- Meller A, Nivon L, Brandin E, Golovchenko J, Branton D: Rapid nanopore discrimination between single polynucleotide molecules. Proc Natl Acad Sci USA. 2000, 97 (3): 1079-1084.PubMed CentralView ArticlePubMedGoogle Scholar
- Meller A, Branton D: Single molecule measurements of DNA transport through a nanopore. Electrophoresis. 2002, 23 (16): 2583-2591.View ArticlePubMedGoogle Scholar
- Henrickson SE, Misakian M, Robertson B, Kasianowicz JJ: Driven DNA transport into an asymmetric nanometer-scale pore. Phys Rev Lett. 2000, 85 (14): 3057-3060.View ArticlePubMedGoogle Scholar
- Meller A, Nivon L, Branton D: Voltage-driven DNA translocations through a nanopore. Phys Rev Lett. 2001, 86 (15): 3435-3438.View ArticlePubMedGoogle Scholar
- Howorka S, Cheley S, Bayley H: Sequence-specific detection of individual DNA strands using engineered nanopores. Nat Biotechnol. 2001, 19 (7): 636-639.View ArticlePubMedGoogle Scholar
- Deamer DW, Akeson M: Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol. 2000, 18: 147-151.View ArticlePubMedGoogle Scholar
- Branton D, Meller A: Using nanopores to discriminate between single molecules of DNA. Structure and Dynamics of Confined Polymers. Edited by: Kasianowicz JJ. 2002, Kluwer Acad Pub, 177-185.View ArticleGoogle Scholar
- Heins E, Albertorio F, Yang T, Siwy Z, Baker L, Cheley S, Bayley HP, Cremer PS, Martin CR: Development of a supported lipid bilayer on porous polymeric support. presented at the 49th Biophysical Society Meeting, Long Beach, CA, USA. 2005Google Scholar
- Li J, Stein D, M C, Branton D, Aziz M, Golovchenko J: Ion beam sculpting on the nanoscale. Nature. 2001, 412: 166-169.View ArticlePubMedGoogle Scholar
- Stein D, Li J, Golovchenko JA: Ion-beam sculpting time scales. Phys Rev Lett. 2002, 89 (27): 276106-View ArticlePubMedGoogle Scholar
- Stein DM, McMullan CJ, Li J, Golovchenko JA: Feedback-controlled ion beam sculpting apparatus. Rev Sci Instrum. 2004, 75 (4): 900-905.View ArticleGoogle Scholar
- Mitsui T, Stein D, Kim Y-R, Hoogerheide D, Golovchenko JA: Nanoscale volcanoes: accretion of matter at ionsculpted nanopores. Phys Rev Lett. 2006, 96 (3): 036102-View ArticlePubMedGoogle Scholar
- Chen P, Mitsui T, Farmer DB, Golovchenko J, Gordon RG, Branton D: Atomic layer deposition to fine-tune the surface properties and diameters of fabricated nanopores. Nano Lett. 2004, 4 (7): 1333-1337.PubMed CentralView ArticlePubMedGoogle Scholar
- Storm AJ, Chen JH, Ling XS, Zandbergen HW, Dekker C: Fabrication of solid-state nanopores with single-nanometre precision. Nat Mater. 2003, 2 (8): 537-540.View ArticlePubMedGoogle Scholar
- Heng J, Dimitrov V, Grinkova Y, Ho C, Kim T, Muller D, Sligar S, Sorsch T, Twesten R, Timp R, Timp G: The detection of DNA using a silicon nanopore. Proc IEDM. 32.2.1-32.2.4. 8–10 Dec 2003Google Scholar
- Siwy Z, Dobrev D, Neumann R, Trautmann C, Voss K: Electro-responsive asymmetric nanopores in polyimide with stable ion-current signal. Appl Phys A. 2003, 76: 781-785.View ArticleGoogle Scholar
- Mara A, Siwy Z, Trautmann C, Wan J, Kamme F: An asymmetric polymer nanopore for single molecule detection. Nano Lett. 2004, 4 (3): 497-501.View ArticleGoogle Scholar
- Apel PY, Korchev YE, Siwy Z, Spohr R, Yoshida M: Diode-like single-ion track membrane prepared by electrostopping. Nucl Instrum Methods Phys Res, Sect B. 2001, 184: 337-346.View ArticleGoogle Scholar
- Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. 1989, MIT-Press, Cambridge, USAGoogle Scholar
- Vapnik VN: The Nature of Statistical Learning Theory. 1998, Springer-Verlag, New York, 2Google Scholar
- Burges CJC: A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov. 1998, 2: 121-67.View ArticleGoogle Scholar
- Winters-Hilt S, Yelundur A, McChesney C, Landry M: Support Vector Machine Implementations for Classification & Clustering. BMC Bioinformatics. 7 (Suppl 2): S4-2006, Sept. 26Google Scholar
- Johnson M: Capacity and Complexity of HMM Duration Modeling Techniques. IEEE Signal Processing Letters. 2005, 12 (5): 407-410.View ArticleGoogle Scholar
- Ferguson J: Variable duration models for speech. Proc Symp App Hidden Markov Models Text Speech. 1980Google Scholar
- Levinson S: Continuously variable duration hidden Markov models for speech analysis. Proc Int Conf Acoust, Speech, Signal Process. 1986, 1241-1244.Google Scholar
- Ramesh P, Wilpon J: Modeling state durations in hidden Markov models for automatic speech recognition. Proc Int Conf Acoust, Speech, Signal Process. 1992, 381-384.Google Scholar
- Mitchell C, Harper M, Jamieson L, Helzerman R: A Parallel Implementation of a Hidden Markov Model with Duration Modeling for Speech Recognition. Dig Sig Proc. 1995, 5: 43-57.View ArticleGoogle Scholar
- Vaseghi S: State duration modeling in hidden Markov models. Signal Process. 1995, 41: 31-41.View ArticleGoogle Scholar
- Sin B, Kim J: Nonstationary hidden Markov model. Signal Process. 1995, 46: 31-46.View ArticleGoogle Scholar
- YK Park, Un CK, Kwon OW: Modeling acoustic transitions in speech by modified hidden Markov models with state duration and state duration-dependent observation probabilities. IEEE Trans Speech Audio Process. 1996, 4 (5): 389-392.View ArticleGoogle Scholar
- Yoma N, McInnes F, Jack M, Stump S, Ling L: On including temporal constraints in viterbi alignment for speech recognition in noise. IEEE Trans Speech Audio Process. 2001, 9 (2): 179-182.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.