Volume 8 Supplement 7
Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine
Duration learning for analysis of nanopore ionic current blockades
 Alexander Churbanov^{1}Email author,
 Carl Baribault^{2} and
 Stephen WintersHilt^{1, 2}
DOI: 10.1186/147121058S7S14
© Churbanov et al; licensee BioMed Central Ltd. 2007
Published: 01 November 2007
Abstract
Background
Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single molecule properties, with potential implications for DNA sequencing. The alphaHemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern. Typically, recorded current blockade signals have several levels of blockade, with various durations, all obeying a fixed statistical profile for a given molecule. Hidden Markov Model (HMM) based duration learning experiments on artificial twolevel Gaussian blockade signals helped us to identify proper modeling framework. We then apply our framework to the real multilevel DNA hairpin blockade signal.
Results
The identified upper level blockade state is observed with durations that are geometrically distributed (consistent with an a physical decay process for remaining in any given state). We show that mixture of convolution chains of geometrically distributed states is better for presenting multimodal longtailed duration phenomena. Based on learned HMM profiles we are able to classify 9 basepair DNA hairpins with accuracy up to 99.5% on signals from sameday experiments.
Conclusion
We have demonstrated several implementations for de novo estimation of duration distribution probability density function with HMM framework and applied our model topology to the real data. The proposed design could be handy in molecular analysis based on nanopore current blockade signal.
Background
In its quest for survival the bacterium Staphylococcus aureus secretes αhemolysin monomers that bind to the outer membrane of susceptible cells, where seven such units can oligomerize to form a waterfilled transmembrane channel [1–4]. The channel can cause death to the target cell by rapidly discharging vital molecules (such as ATP) and disturbing the membrane potential.
Suspended in lipid bilayer [see Additional File 1] the αhemolysin channel becomes a sensor when large molecules interact with the nanopore and modulate uniform ionic flow through the channel. Driven by transmembrane potential, single stranded DNA or RNA molecules translocate through the nanopore [5, 6], while more complex hairpins either unzip and translocate [7, 8] or toggle in the channel's vestibule [8, 9] [see Additional File 1]. The durations of ionic flow blockade events in these experiments are important signatures of interacting nucleic acid fragments composition [7, 10] or in certain cases characterize the molecular length [11].
where ϕ is a parse of the sequence consisting of a series of states q_{ i }and state durations d_{ i }, 0 ≤ i ≤ n, with each state q_{ i }emitting subsequence S_{ i }of length d_{ i }, so that the concatenation of all S_{0}S_{1} ... S_{ n }produces the complete output sequence S. P_{ e }(S_{ i }q_{ i }, d_{ i }) denotes the probability that state q_{ i }emits subsequence S_{ i }of duration d_{ i }. P_{ t }(q_{ i }q_{i1}) is GHMM transition probability from state q_{i1}to state q_{ i }and P_{ d }(d_{ i }q_{ i }) is the probability that state q_{ i }has duration d_{ i }. The primary objective, expressed in (1), is to combine probability returned by content probabilistic model (such as HMM) with duration probability for optimal parse. The GHMM implementation, as well as HMMwithDuration approach mentioned in [15], require explicit assignment of duration histogram to run Viterbi decoding.
When we try to classify single DNA base pair by nanopore ionic flow blockade signal processing [16], we frequently have to deal with a sequence of blockades resulting from complex molecular interactions with unknown states. For this reason, we are interested in de novo learning of emission content and duration distributions corresponding to these stationary blockade states. In this study we research several approaches to the problem of duration and content sensor learning in the context of nanopore ionic flow blockades analysis.
Results and discussion
Explicit duration model learning experiment
The objective of this experiment was to evaluate the ability of a randomly initialized explicit Duration HMM (DHMM), as described [see Section The explicit duration HMM implementation], to learn the original duration phenomena present in artificial data. We use the Expectation Maximization (EM) training procedure, as discussed [see Appendix D], to iteratively reinforce the network structure to match the test data set topology. For the model training purposes we have generated 120 sequences of 1,000 emissions each with the maximum state durations of 30 according to protocol discussed [see Section Running original explicit DHMM in generative mode].
where True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are among the classified data points.
Convolution of states learning experiment
In this experiment we construct aggregate model states as convolution chain of three geometric distributions. The convolution chain for identical geometric distributions can be represented as a bellshaped Negative Binomial discrete PDF, as discussed [see Section Geometric duration distribution and convolution of geometric states].
Performance of Viterbi decoding depending on blockade maximum duration
Test set decoding performance for various aggregate state sizes. Here we show percentage of states recovered correct in Viterbi decoding for various methods.
Max. state duration  Explicit DHMM  Geometric duration HMM 

6  81.04%  72.70% 
10  83.36%  74.43% 
24  88.80%  82.18% 
30  90.06%  87.94% 
The geometric distribution model runs faster, but decoding performance is always inferior compared to the explicit DHMM. The geometric HMM appears to be simple and crude approximation to the duration signal. Explicit DHMM runs D times slower than simple geometric model, but produces superior results to any other types of implementations, given the exact duration histogram. The higher the maximum state duration D and the longer the average phenomena duration is, the better decoding quality we can obtain for all the models.
Learning durations on the real blockade signal
Here we analyze the ionic flow blockade signal resulting from the nine base pair "upper level toggler" hairpin DNA molecule CGTGGAACG TTTTCGTTCCACG, generated according to protocol described in [9] (signal was filtered at 50 kHz bandwidth using an analog low pass Bessel filter with the 20 μs analogtodigital sampling). Due to its unique sequence in the stem region and its interactions with the channel's vestibule it produces a rich signal (the upper level toggle) [17].
From the physical perspective the hairpin molecule can undergo different modes of capture blockade, such as Intermediate Level (IL), Upper Level (UL), Lover Level (LL) conductance states and spikes (S) [18]. When a 9 bp DNA hairpin initially enters the pore, the loop is perched in the vestibule mouth and the stem terminus binds to amino acid residues near the limiting aperture. This results in the IL conductance level. When the terminal basepair desorbs from the pore wall, the stem and loop may realign, resulting in a substantial current increase to UL. Interconversion between the IL and UL states may occur numerous times with UL possibly switching to the LL state. This LL state corresponds to binding of the stem terminus to amino acids near the limiting aperture but in a different manner from IL. From the LL bound state, the duplex terminus may fray, resulting in extension and capture of one strand in the pore constriction resulting into short term S state. The allowed transition events between these levels IL ⇔ UL ⇔ LL ⇔ S can happen at any time during the analysis procedure.
System performance on the 9CG versus 9TA data
Conclusion
Although running slowly, the explicit DHMM design has many advantages over other duration representation methods for HMMs, such as using unmodified Viterbi decoding algorithm and possibility for exact representation of any duration phenomena. Original explicit DHMM produced the best results in all artificial test categories. However, learning of such topology can quickly turn into a grim experience, since too many parameters need to be learned with noisy data.
The geometric duration distribution HMM is simple, but is not well suited to complex duration data analysis [see Section Performance of Viterbi decoding depending on blockade maximum duration]. Convolution of geometric states, especially a mixture of such aggregate states, is a much more robust and powerful method of interpolating noisy multimodal duration phenomena encountered in ionic flow blockade time series analysis. The software used to conduct experiments in this report is freely available at http://logos.cs.uno.edu/~achurban.
Methods
The explicit duration HMM implementation
Our model uses standard Viterbi decoding algorithm [see Appendix D], which we implemented in linear memory using linked list of back pointers in addition to implementation of ForwardBackward algorithm [21] for EM learning [see Appendix D] with memory use proportional to the number of states. The maximum state duration D has to be imposed on each duration histogram in this model which might seem as a limitation in case of longtailed distribution. This deficiency could easily be resolved by adding the geometrically distributed states to explicit DHMM, that are capable of modeling simple infinite long tailed durations [see Section Geometric duration distribution and convolution of geometric states], and use explicit part of the model to catch only the initial complex duration phenomena.
In this study we use two aggregate groups of states with corresponding discrete duration PDF obtained by discretizing to continuous PDFs [see Additional File 3], denoted as first and second states. The thermal noise of ionic flow at certain blockade level is approximated extremely well by the Gaussian PDF emission from HMM hidden states [see Appendix C]. The aggregate states are formed by lossless chains of transitions between hidden states, where we sacrifice the probability score only to enter the chain. We use Gaussian emissions $\mathcal{N}$(45, 20) and $\mathcal{N}$(50, 20) in the first and second aggregate states, correspondingly. Initial probabilities correspond to 50% chance to begin decoding in the first aggregate state and 50% for the second aggregate state.
Running original explicit DHMM in generative mode
We run original explicit DHMM [see Section The explicit duration HMM implementation] running in generative mode to get the test set of 1,000,000 sample points of artificial nanopore blockade signal [see Additional File 4]. In order to generate the test set we simply traverse the HMM graph in stochastic fashion according to transition probabilities assigned to edges, where each transition culminates in emission from PDF assigned to a state [see Appendix C]. Along with the emissions we record the known emitting hidden states for performance testing and parameter estimates of geometrically distributed HMM [see Section Geometric duration distribution and convolution of geometric states]. We use the test set to evaluate performance of various HMM implementations and learning techniques [see Sections Explicit duration model learning experiment, Convolution of states learning experiment and Performance of Viterbi decoding depending on blockade maximum duration].
Geometric duration distribution and convolution of geometric states
where N is the number of discrete duration samples k_{1},..., k_{ N }. The topology of the two state model with duration distributed according to geometric law [see Additional File 5].
Authors contributions
AC conceptualized the use of explicit duration HMM and convolution of geometric duration states. AC has implemented the system prototype, learned the models and drafted the manuscript. CB helped with implementing HMMwithDuration, conducting performance tests on artificial and real nanopore blockade signal. SWH helped with writing up the manuscript and provided many valuable suggestions throughout the study. All authors read and approved the final manuscript.
Appendices
Appendix A – Convolution of geometric distributions
The Negative binomial moment generating function is a product of n geometric distribution moment generating functions, which corresponds to convolution [19] of n identical geometric distributions with parameter p [see Additional File 6]. Distinct bellshaped plot of Negative binomial distribution PDF with parameters p = 0.99 and n = 1,..., 5 presented [see Additional File 7].
Appendix B – Learning the mixture models
where α_{ i }is mixing proportions.
The objective of learning is to maximize likelihood function $p(\mathcal{O}\Theta )={\displaystyle {\prod}_{i=1}^{N}p({o}_{i}\Theta )=\mathcal{L}(\Theta \mathcal{O})}$, i.e. we wish to find locally optimal set of parameters $\Theta *=\underset{\Theta}{argmax}\mathcal{L}(\Theta \mathcal{O})$ by using Expectation Maximization (EM) iterative procedure given the set of data points $\mathcal{O}$.
 (a)
Mixture proportions ${\widehat{\alpha}}_{k}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )}$,
 (b)
Mean ${\widehat{\mu}}_{k}=\frac{{\displaystyle {\sum}_{i=1}^{N}{o}_{i}p({\Theta}_{k}{o}_{i},\Theta )}}{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )}}$,
 (c)
Variance ${\widehat{\sigma}}_{k}^{2}=\frac{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )({o}_{i}{\widehat{\mu}}_{k})({o}_{i}{\widehat{\mu}}_{k})}}{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\mathrm{\Theta )}}}$.
Appendix C – Definition of Hidden Markov Model
The Hidden Markov Model (HMM) is a widely accepted stochastic modelling tool [23] used in various domains, such as speech recognition [24] and bioinformatics [25]. HMM is a stochastic finite state machine where each transition between hidden states is culminated by a symbol emission. The HMM could be represented as a directed graph with N states where each state could emit either discrete character or continuous value drawn from PDF. In order to describe HMM we need the following parameters

Set of states, we label individual states as S = {S_{1}, S_{2},..., S_{ N }}, and denote the state visited at time t as q_{ t },

Set of PDFs from where emission is drawn, in our case we use Normal distributions $B=\{\mathcal{N}({\mu}_{1},{\sigma}_{1}^{2}),\mathrm{...},\mathcal{N}({\mu}_{N},{\sigma}_{N}^{2})\}$,

The statetransmission probability matrix A = {a_{ ij }}, where a_{ ij }= p(q_{t+1}= jq_{ t }= i),

The initial state distribution vector ∏ = {π_{1},..., π_{ N }}.
Appendix D – HMM forwardbackward algorithm and Viterbi decoding
Here we adopt notation from [13] and report final HMM parameters update rules for EM learning algorithm rigorously derived in [22].
Viterbi algorithm for finding optimal parse
 1.
Initially δ_{1}(i) = π_{ i }$\mathcal{N}$(o_{1}Θ_{ i }), ψ_{1}(i) = 0 for 1 ≤ i ≤ N,
 2.for t = 2,..., T and 1 ≤ j ≤ N,${\delta}_{t}(j)=\underset{1\le i\le N}{\mathrm{max}}[{\delta}_{t1}(i){a}_{ij}]\mathcal{N}({o}_{t}{\Theta}_{j}),{\psi}_{t}(j)=\underset{1\le i\le N}{\mathrm{arg}\mathrm{max}}[{\delta}_{t1}(i){a}_{ij}]$
 3.
Finally ${q}_{T}^{\ast}=\underset{1\le i\le N}{\mathrm{arg}\mathrm{max}}[{\delta}_{T}(i)]$, trace back ${q}_{t}^{\ast}={\psi}_{t+1}({q}_{t+1}^{\ast})$ for t = T  1, T  2,..., 1 with optimal decoding ${Q}^{\ast}=\{{q}_{1}^{\ast},{q}_{2}^{\ast},\mathrm{...},{q}_{T}^{\ast}\}$.
HMM expectation step
We need to find expected probabilities of being at a certain state at a certain moment of time with forwardbackward procedure.
 1.
Initially α_{1}(i) = π_{ i }$\mathcal{N}$(o_{1}Θ_{ i }) for 1 ≤ i ≤ N,
 2.for t = 2, 3,..., T and 1 ≤ j ≤ N,${\alpha}_{t}(j)=\left[{\displaystyle {\sum}_{i=1}^{N}{\alpha}_{t1}(i){a}_{ij}}\right]\mathcal{N}({o}_{t}{\Theta}_{j})$
 3.
Finally $p(\mathcal{O}\lambda )={\displaystyle {\sum}_{i=1}^{N}{\alpha}_{T}(i)}$ is the sequence likelihood according to model.
 1.
Initially β_{ T }(i) = 1 for 1 ≤ i ≤ N,
 2.for t = T  1, T  2,..., 1 and 1 ≤ i ≤ N,${\beta}_{t}(i)={\displaystyle {\sum}_{j=1}^{N}{a}_{ij}}\mathcal{N}({o}_{t+1}{\Theta}_{j}){\beta}_{t+1}(j)$
 3.
Finally $p(\mathcal{O}\lambda )={\displaystyle {\sum}_{i=1}^{N}{\pi}_{i}\mathcal{N}({o}_{1}{\Theta}_{i}){\beta}_{1}(i)}$.
HMM maximization step
 (a)
Initial state probabilities estimate ${\widehat{\pi}}_{i}$ = γ_{1}(i) for 1 ≤ i ≤ N,
 (b)
Statetransition probabilities estimate ${\widehat{a}}_{ij}=\frac{{\displaystyle {\sum}_{t=1}^{T1}{\xi}_{t}(i,j)}}{{\displaystyle {\sum}_{t=1}^{T1}{\gamma}_{t}(i)}}$ for 1 ≤ i, j ≤ N,
 (c)
Gaussian output probabilities estimate $\begin{array}{cc}{\widehat{\mu}}_{j}=\frac{{\displaystyle {\sum}_{t=1}^{T}{o}_{t}{\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}},& {\widehat{\sigma}}_{j}^{2}=\frac{{\displaystyle {\sum}_{t=1}^{T}({o}_{t}{\widehat{\mu}}_{j})({o}_{t}{\widehat{\mu}}_{j}){\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}}\end{array}$ for 1 ≤ j ≤ N.
Declarations
Acknowledgements
Federal funding was provided by an NIH K22 (SWH PI, 5K22LM008794), an NIH NNBM R21 (SWH coPI), and LA Board of Regents Enhancement, RCS, and LaSPACE grants (SWH PI). Funding also provided by New Orleans Childrens Hospital and the University of New Orleans Computer Science Department. The authors are grateful for many constructive suggestions made by anonymous reviewers.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/8?issue=S7.
Authors’ Affiliations
References
 Song L, Hobaugh M, Shustak C, Cheley S, Bayley H, Gouaux J: Structure of Staphylococcal alphaHemolysin, a Heptameric Transmembrane Pore. Science 1996,274(5294):1859–1865. 10.1126/science.274.5294.1859View ArticlePubMedGoogle Scholar
 Bhakdi S, TranumJensen J: Alphatoxin of Staphylococcus aureus. Microbiol Rev 1991,55(4):733–751.PubMed CentralPubMedGoogle Scholar
 Walker B, Bayley H: Key Residues for Membrane Binding, Oligomerization, and Pore Forming Activity of Staphylococcal α Hemolysin Identified by Cysteine Scanning Mutagenesis and Targeted Chemical Modification. Journal of Biological Chemistry 1995,270(39):23065–23071. 10.1074/jbc.270.39.23065View ArticlePubMedGoogle Scholar
 Gouaux J, Braha O, Hobaugh M, Song L, Cheley S, Shustak C, Bayley H: Subunit Stoichiometry of Staphylococcal alphaHemolysin in Crystals and on Membranes: A Heptameric Transmembrane Pore. PNAS 1994, 91: 12828–12831. 10.1073/pnas.91.26.12828PubMed CentralView ArticlePubMedGoogle Scholar
 Kasianowicz J, Brandindagger E, Brantondagger D, Deamer D: Characterization of individual polynucleotide molecules using a membrane channel. PNAS 1996, 93: 13770–13773. 10.1073/pnas.93.24.13770PubMed CentralView ArticlePubMedGoogle Scholar
 Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer D: Microsecond timescale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999, 77: 3227–3233.PubMed CentralView ArticlePubMedGoogle Scholar
 Mathé J, Visram H, Viasnoff V, Rabin Y, Meller A: Nanopore Unzipping of Individual DNA Hairpin Molecules. Biophysical Journal 2004, 87: 3205–3212. 10.1529/biophysj.104.047274PubMed CentralView ArticlePubMedGoogle Scholar
 Vercoutere W, WintersHilt S, Olsen H, Deamer D, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at singlenucleotide resolution using an ion channel. Nature Biotechnology 2001, 19: 248–252. 10.1038/85696View ArticlePubMedGoogle Scholar
 Vercoutere W, WintersHilt S, DeGuzman V, Deamer D, Ridino S, Rodgers J, Olsen H, Marziali A, Akeson M: Discrimination among individual WatsonCrick base pairs at the termini of single DNA hairpin molecules. Nucleic Acids Research 2003,31(4):1311–1318. 10.1093/nar/gkg218PubMed CentralView ArticlePubMedGoogle Scholar
 Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer D: Microsecond timescale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999,77(6):3227–3233.PubMed CentralView ArticlePubMedGoogle Scholar
 Kasianowicz J, Brandin E, Branton D, Deamer D: Characterization of individual polynucleotide molecules using a membrane channel. PNAS 1996,93(24):13770–13773. 10.1073/pnas.93.24.13770PubMed CentralView ArticlePubMedGoogle Scholar
 Mitchell C, Helzerman R, Jamieson L, Harper M: A Parallel Implementation of a Hidden Markov Model with Duration Modeling for Speech Recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. [http://citeseer.ist.psu.edu/mitchell95parallel.html]View ArticleGoogle Scholar
 Rabiner L: A tutorial on Hidden Markov Models and Selected Applications in Speach Recognition. Proceedings of IEEE 1989, 77: 257–286. 10.1109/5.18626View ArticleGoogle Scholar
 Majoros W, Pertea M, Delcher A, Salzberg S: Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 2005.,6(16):
 WintersHilt S: Hidden Markov Model Variants and their Application. BMC Bioinformatics 2006,7(Suppl 2):S14. 10.1186/147121057S2S14PubMed CentralView ArticlePubMedGoogle Scholar
 WintersHilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini. BMC Bioinformatics 2006,7(Suppl 2):S22. 10.1186/147121057S2S22PubMed CentralView ArticlePubMedGoogle Scholar
 DeGuzman V, Lee C, Deamer D, Vercoutere W: Sequencedependent gating of an ion channel by DNA hairpin molecules. Nucleic Acids Research 2006,34(22):6425–6437. 10.1093/nar/gkl754PubMed CentralView ArticlePubMedGoogle Scholar
 WintersHilt S, Vercoutere W, DeGuzman V, Deamer D, Akeson M, Haussler D: Highly Accurate Classification of WatsonCrick Basepairs on Termini of Single DNA Molecules. Biophysical Journal 2003, 84: 967–976.PubMed CentralView ArticlePubMedGoogle Scholar
 Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Volume chap 3. Cambridge University press; 1998:69.View ArticleGoogle Scholar
 Mitchell C, Helzerman R, Jamieson L, Harper M: A Parallel Implementation of a Hidden Markov Model with Duration Modeling for Speech Recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. [http://citeseer.ist.psu.edu/mitchell95parallel.html]View ArticleGoogle Scholar
 Miklós I, Meyer I: A linear memory algorithm for BaumWelch training. BMC Bioinformatics 2005, 6: 231. 10.1186/147121056231PubMed CentralView ArticlePubMedGoogle Scholar
 Bilmes J: A Gentle Tutorial of the EM algorithm and its application to parameter Estimation for Gaussian mixture and Hidden Markov Models. In Tech Rep TR97–021. International Computer Science Institute; 1998.Google Scholar
 Bilmes J: What HMMs can do. In Tech rep. University of Washington, Seattle; 2002.Google Scholar
 Rabiner L, Juang BH: Fundamentals of speech recognition. Printice Hall; 1993.Google Scholar
 Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University press; 1998.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.