 Proceedings
 Open access
 Published:
Duration learning for analysis of nanopore ionic current blockades
BMC Bioinformatics volume 8, Article number: S14 (2007)
Abstract
Background
Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single molecule properties, with potential implications for DNA sequencing. The alphaHemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern. Typically, recorded current blockade signals have several levels of blockade, with various durations, all obeying a fixed statistical profile for a given molecule. Hidden Markov Model (HMM) based duration learning experiments on artificial twolevel Gaussian blockade signals helped us to identify proper modeling framework. We then apply our framework to the real multilevel DNA hairpin blockade signal.
Results
The identified upper level blockade state is observed with durations that are geometrically distributed (consistent with an a physical decay process for remaining in any given state). We show that mixture of convolution chains of geometrically distributed states is better for presenting multimodal longtailed duration phenomena. Based on learned HMM profiles we are able to classify 9 basepair DNA hairpins with accuracy up to 99.5% on signals from sameday experiments.
Conclusion
We have demonstrated several implementations for de novo estimation of duration distribution probability density function with HMM framework and applied our model topology to the real data. The proposed design could be handy in molecular analysis based on nanopore current blockade signal.
Background
In its quest for survival the bacterium Staphylococcus aureus secretes αhemolysin monomers that bind to the outer membrane of susceptible cells, where seven such units can oligomerize to form a waterfilled transmembrane channel [1–4]. The channel can cause death to the target cell by rapidly discharging vital molecules (such as ATP) and disturbing the membrane potential.
Suspended in lipid bilayer [see Additional File 1] the αhemolysin channel becomes a sensor when large molecules interact with the nanopore and modulate uniform ionic flow through the channel. Driven by transmembrane potential, single stranded DNA or RNA molecules translocate through the nanopore [5, 6], while more complex hairpins either unzip and translocate [7, 8] or toggle in the channel's vestibule [8, 9] [see Additional File 1]. The durations of ionic flow blockade events in these experiments are important signatures of interacting nucleic acid fragments composition [7, 10] or in certain cases characterize the molecular length [11].
Two distinct approaches of duration modelling have been proposed for HMM framework by speech recognition community, based on explicit duration modelling, which is normally implemented with histograms or parametric distributions, and implicit modeling based on set of geometrically distributed selfrecurring nodes [12]. The most common way of implementing explicit duration model is Generalized Hidden Markov Model (GHMM), where each state can emit more than one symbol at a time [13]. Following [14], the optimal GHMM parse could be expressed by the following equation
where ϕ is a parse of the sequence consisting of a series of states q_{ i }and state durations d_{ i }, 0 ≤ i ≤ n, with each state q_{ i }emitting subsequence S_{ i }of length d_{ i }, so that the concatenation of all S_{0}S_{1} ... S_{ n }produces the complete output sequence S. P_{ e }(S_{ i }q_{ i }, d_{ i }) denotes the probability that state q_{ i }emits subsequence S_{ i }of duration d_{ i }. P_{ t }(q_{ i }q_{i1}) is GHMM transition probability from state q_{i1}to state q_{ i }and P_{ d }(d_{ i }q_{ i }) is the probability that state q_{ i }has duration d_{ i }. The primary objective, expressed in (1), is to combine probability returned by content probabilistic model (such as HMM) with duration probability for optimal parse. The GHMM implementation, as well as HMMwithDuration approach mentioned in [15], require explicit assignment of duration histogram to run Viterbi decoding.
When we try to classify single DNA base pair by nanopore ionic flow blockade signal processing [16], we frequently have to deal with a sequence of blockades resulting from complex molecular interactions with unknown states. For this reason, we are interested in de novo learning of emission content and duration distributions corresponding to these stationary blockade states. In this study we research several approaches to the problem of duration and content sensor learning in the context of nanopore ionic flow blockades analysis.
Results and discussion
Explicit duration model learning experiment
The objective of this experiment was to evaluate the ability of a randomly initialized explicit Duration HMM (DHMM), as described [see Section The explicit duration HMM implementation], to learn the original duration phenomena present in artificial data. We use the Expectation Maximization (EM) training procedure, as discussed [see Appendix D], to iteratively reinforce the network structure to match the test data set topology. For the model training purposes we have generated 120 sequences of 1,000 emissions each with the maximum state durations of 30 according to protocol discussed [see Section Running original explicit DHMM in generative mode].
First, we learned randomly initialized geometric model, as described [see Section Geometric duration distribution and convolution of geometric states], for 200 iterations to reliably recover the two major Gaussian emitting components and roughly estimate the average duration for two states. An accuracy of 85.75% has been achieved by Viterbi decoding [see Appendix D] on the learned geometric duration model for the test set, which constitutes 95.72% performance of the original explicit DHMM run on the same test set. Here and further we identify accuracy as the ratio of correctly decoded emissions to the entire number of emissions in the given time series
where True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are among the classified data points.
We use the recovered Gaussian emissions and initial probabilities to initialize the explicit DHMM, and we use learned average state duration as the expected prior for the explicit duration histogram initialization. We freeze the emission Gaussian Probability Density Functions (PDFs) so that they don't change. Upon convergence of the explicit DHMM to a local likelihood maximum we record 88.98% Viterbi decoding accuracy on the test set, which is 99.33% of the original explicit DHMM performance, i.e. we were able to recover the duration phenomena with performance almost identical to the original explicit DHMM. Figure 1 shows histograms obtained for the state durations. Although their shape approximates the original duration PDF pretty well, the recovered histograms experience substantial abrupt and unwanted variations. Another shortcomings of this duration learning strategy is that it is extremely slow and requires simultaneous estimation of large number of parameters. Therefore, in the next section we present an approach based on a convolution chain of geometric duration states.
Convolution of states learning experiment
In this experiment we construct aggregate model states as convolution chain of three geometric distributions. The convolution chain for identical geometric distributions can be represented as a bellshaped Negative Binomial discrete PDF, as discussed [see Section Geometric duration distribution and convolution of geometric states].
The resulting convolution model trained with EM algorithm [see Appendix D] on an artificial nanopore signal with maximum state duration of 30 of 120 sequences 1,000 emissions each, generated according to protocol discussed [see Section Running original explicit DHMM in generative mode], is shown in Figure 2. In this learning experiment we use known emissions \mathcal{N}(45, 20) and \mathcal{N}(50, 20), that do not change in the process of learning, and initialize the prior probabilities and transitions with the expected state durations. Interestingly, direct use of the learned convolution model in Viterbi decoding produces reconstruction fidelity substantially inferior to the simple geometricallydistributed model. The convolution chain has full power only for forwardbackward procedure [13] for likelihood estimation [see Appendix D] and does not work for representing duration phenomena in case of Viterbi decoding. Therefore, we use the histograms shown in Figure 2 to initialize explicit DHMM transitions. Such hybrid explicit DHMM achieves on Viterbi decoding accuracy of 90.20%, which is 99.69% performance of original explicit DHMM on the same test set. Therefore, by learning chains of convolving geometricallydistributed components we achieve similar or better performance as compared to direct learning of explicit DHMM, in much shorter time period. The experiment clearly demonstrates the ability of a convolution chain to learn the complex duration phenomena in the data, outperforming the simple geometric duration model. Convolving states cannot generate or model duration phenomena shorter than the chain length (three in our case), therefore the number of convolving states should be used conservatively.
Performance of Viterbi decoding depending on blockade maximum duration
We run Viterbi decoding for the original models presented [see Sections The explicit duration HMM implementation and Geometric duration distribution and convolution of geometric states] with results shown in Table 1. In case of the geometric model we used transitions assigned according to simple maximum likelihood formula [see Section Geometric duration distribution and convolution of geometric states] estimated on the test set emissions of the original explicit DHMM.
The geometric distribution model runs faster, but decoding performance is always inferior compared to the explicit DHMM. The geometric HMM appears to be simple and crude approximation to the duration signal. Explicit DHMM runs D times slower than simple geometric model, but produces superior results to any other types of implementations, given the exact duration histogram. The higher the maximum state duration D and the longer the average phenomena duration is, the better decoding quality we can obtain for all the models.
Learning durations on the real blockade signal
Here we analyze the ionic flow blockade signal resulting from the nine base pair "upper level toggler" hairpin DNA molecule CGTGGAACG TTTTCGTTCCACG, generated according to protocol described in [9] (signal was filtered at 50 kHz bandwidth using an analog low pass Bessel filter with the 20 μs analogtodigital sampling). Due to its unique sequence in the stem region and its interactions with the channel's vestibule it produces a rich signal (the upper level toggle) [17].
From the physical perspective the hairpin molecule can undergo different modes of capture blockade, such as Intermediate Level (IL), Upper Level (UL), Lover Level (LL) conductance states and spikes (S) [18]. When a 9 bp DNA hairpin initially enters the pore, the loop is perched in the vestibule mouth and the stem terminus binds to amino acid residues near the limiting aperture. This results in the IL conductance level. When the terminal basepair desorbs from the pore wall, the stem and loop may realign, resulting in a substantial current increase to UL. Interconversion between the IL and UL states may occur numerous times with UL possibly switching to the LL state. This LL state corresponds to binding of the stem terminus to amino acids near the limiting aperture but in a different manner from IL. From the LL bound state, the duplex terminus may fray, resulting in extension and capture of one strand in the pore constriction resulting into short term S state. The allowed transition events between these levels IL ⇔ UL ⇔ LL ⇔ S can happen at any time during the analysis procedure.
For the purposes of de novo emission levels detection we have learned the mixture of six Gaussian (MoG) components [see Appendix B] on the raw ionic flow blockade signals. EM learning step converged to the following mixture of six Gaussian components
as could be seen in Figure 3, where \mathcal{N}(xμ, σ^{2}) is normal PDF.
We took four primary components from the recovered MoG (2) and applied them as emissions to the convolution model [see Section Geometric duration distribution and convolution of geometric states] for four aggregate states. We learn the model on the ionic flow blockade signal of size 173,000 samples with the recovered topology shown in Figure 4. The graph of transition nodes connecting the learned aggregate states appears to be sparse with nonzero transitions [see Additional File 2]. This analysis shows that not all transitions between molecular interaction states are allowed. Interestingly enough, the second state has transitions to other three states. According to the interaction physical model discussed above the molecule should bounce back and fourth between the deeper blockade levels, thus components \mathcal{N}(52.26, 1.18) and \mathcal{N}(38.82, 1.68) dominate. The recovered geometric distribution of the blockade events (a classic physical decay phenomenon), indicate that upper level toggler molecule has constant statedependent probability to dissociate from one interaction state and transit to another physically feasible conformation.
Aggregate states 1, 3 and 4 converged to pure geometric distributions with no apparent bellshaped duration phenomena, as could be seen in Figure 4. However, as could be seen in Figures 5(a) and 5(d), the longtailed distribution does not fall nicely in the framework of geometric duration. The geometric durations learned on these longtailed distributions does not approximate well neither the initially sharp peak nor the long tail in duration histograms and therefore should really be treated as multimodal distribution approximated by mixture of geometric components. On the other hand, the histograms shown in Figures 5(b) and 5(c) fall perfectly into framework of geometric PDF.
We improve the histograms for multimodal longtailed distributions by training the mixture of two convolution chains. Resulting convolution mixture generative histograms present the original phenomena much better as could be seen in Figure 6. Upon introduction of convolution mixture the model logarithmic likelihood [see Appendix B], given training sequence, has increased from 420068.73 to 418636.5 for the fully trained topologies which indicated better model fit to data. Mixing more than two components per aggregate state did not provide apparent improvements and unnecessarily complicated the model.
System performance on the 9CG versus 9TA data
We took 3,460 ms ionic flow blockade signals for 9GC and 9TA molecules, recorded according to protocol described in [9], and automatically learned the convolution topology according to the strategy [see Section Learning durations on the real blockade signal]. The remaining sequences, generated the same day, were used as a test set. We have split the test set sequences into chunks of 100 ms each to investigate shortterm classification performance, resulting into 13,753 test fragments for 9TA signal and 15,652 9GC test fragments as could be seen in Figure 7. We run both 9GC and 9TA learned HMM profiles on the test sequences and classify them according to maximum likelihood. We achieve classification accuracy of 99.56% on the 9GC test set and 97.87% on the 9TA test set.
Conclusion
Although running slowly, the explicit DHMM design has many advantages over other duration representation methods for HMMs, such as using unmodified Viterbi decoding algorithm and possibility for exact representation of any duration phenomena. Original explicit DHMM produced the best results in all artificial test categories. However, learning of such topology can quickly turn into a grim experience, since too many parameters need to be learned with noisy data.
The geometric duration distribution HMM is simple, but is not well suited to complex duration data analysis [see Section Performance of Viterbi decoding depending on blockade maximum duration]. Convolution of geometric states, especially a mixture of such aggregate states, is a much more robust and powerful method of interpolating noisy multimodal duration phenomena encountered in ionic flow blockade time series analysis. The software used to conduct experiments in this report is freely available at http://logos.cs.uno.edu/~achurban.
Methods
The explicit duration HMM implementation
In Figure 8 we show the explicit DHMM topology we use to combine duration with content sensors, which we refer to as the original explicit DHMM throughout the manuscript. This model follows the topology discussed in [19] for exact duration implementation and is similar in computational complexity to a more common explicit duration modeling of GHMM [13, 14, 20]. Our implementation takes advantage of intuitive duration presentation, instead of using disjoint parametric distributions or histograms for duration modeling that complicate decoding algorithm well beyond standard Viterbi procedure.
Our model uses standard Viterbi decoding algorithm [see Appendix D], which we implemented in linear memory using linked list of back pointers in addition to implementation of ForwardBackward algorithm [21] for EM learning [see Appendix D] with memory use proportional to the number of states. The maximum state duration D has to be imposed on each duration histogram in this model which might seem as a limitation in case of longtailed distribution. This deficiency could easily be resolved by adding the geometrically distributed states to explicit DHMM, that are capable of modeling simple infinite long tailed durations [see Section Geometric duration distribution and convolution of geometric states], and use explicit part of the model to catch only the initial complex duration phenomena.
In this study we use two aggregate groups of states with corresponding discrete duration PDF obtained by discretizing to continuous PDFs [see Additional File 3], denoted as first and second states. The thermal noise of ionic flow at certain blockade level is approximated extremely well by the Gaussian PDF emission from HMM hidden states [see Appendix C]. The aggregate states are formed by lossless chains of transitions between hidden states, where we sacrifice the probability score only to enter the chain. We use Gaussian emissions \mathcal{N}(45, 20) and \mathcal{N}(50, 20) in the first and second aggregate states, correspondingly. Initial probabilities correspond to 50% chance to begin decoding in the first aggregate state and 50% for the second aggregate state.
Running original explicit DHMM in generative mode
We run original explicit DHMM [see Section The explicit duration HMM implementation] running in generative mode to get the test set of 1,000,000 sample points of artificial nanopore blockade signal [see Additional File 4]. In order to generate the test set we simply traverse the HMM graph in stochastic fashion according to transition probabilities assigned to edges, where each transition culminates in emission from PDF assigned to a state [see Appendix C]. Along with the emissions we record the known emitting hidden states for performance testing and parameter estimates of geometrically distributed HMM [see Section Geometric duration distribution and convolution of geometric states]. We use the test set to evaluate performance of various HMM implementations and learning techniques [see Sections Explicit duration model learning experiment, Convolution of states learning experiment and Performance of Viterbi decoding depending on blockade maximum duration].
Geometric duration distribution and convolution of geometric states
The geometric duration distribution is implemented as a selfrecurring hidden state in the HMM framework and there are many merits of such duration modeling. The geometric duration distribution is modeled by only one state, which results in very compact probability tables for forwardbackward and Viterbi decoding algorithms. Random variable x is distributed according to geometric law p_{ x }(k) = p(1  p)^{k  1}where k = 1, 2, 3... and 1  p is the probability to stay in the same state. Parameter p fully characterizes this distribution and could be easily estimated by maximum likelihood, which is calculated as following
where N is the number of discrete duration samples k_{1},..., k_{ N }. The topology of the two state model with duration distributed according to geometric law [see Additional File 5].
The chain of consecutive identical geometrically distributed states could represent bellshaped Negative binomial duration distributions [19], as discussed [see Appendix A]. In the case of nonidentical geometrically distributed connected states the PDF remains bellshaped since the number of possible paths through the model \left(\begin{array}{c}k1\\ n1\end{array}\right) increases as the number of trials k grows, but the total sum of probabilities attributed to all these paths through n geometric components decreases. The mixture of aggregate states distributed according to Negative binomial law, as shown in Figure 9, can interpolate duration distribution even better, especially in case of multimodal distributions. A nice attribute of the duration representation, with geometrically distributed states, is that we are able to interpolate the noisy duration histogram, common for ionic flow time series, with much smoother discrete distribution.
Authors contributions
AC conceptualized the use of explicit duration HMM and convolution of geometric duration states. AC has implemented the system prototype, learned the models and drafted the manuscript. CB helped with implementing HMMwithDuration, conducting performance tests on artificial and real nanopore blockade signal. SWH helped with writing up the manuscript and provided many valuable suggestions throughout the study. All authors read and approved the final manuscript.
Appendices
Appendix A – Convolution of geometric distributions
In statistics, the probability distribution of the sum of several independent random variables is the convolution of their individual distributions. Suppose random variable x is distributed according to geometric law p_{ x }(k) = p q^{k1}where k = 1, 2, 3... is the number of trials to exit the state and q = 1  p is the probability to stay in the same state. The moment generating function for geometric distribution is
If random variable x is distributed according to Negative binomial, i.e. x ~ NegBin(n, p), then the moment generating function is written as
The Negative binomial moment generating function is a product of n geometric distribution moment generating functions, which corresponds to convolution [19] of n identical geometric distributions with parameter p [see Additional File 6]. Distinct bellshaped plot of Negative binomial distribution PDF with parameters p = 0.99 and n = 1,..., 5 presented [see Additional File 7].
Appendix B – Learning the mixture models
The onedimensional MoG model [22] of M components is a degenerate case of HMM
where α_{ i }is mixing proportions.
The objective of learning is to maximize likelihood function p(\mathcal{O}\Theta )={\displaystyle {\prod}_{i=1}^{N}p({o}_{i}\Theta )=\mathcal{L}(\Theta \mathcal{O})}, i.e. we wish to find locally optimal set of parameters \Theta *=\underset{\Theta}{argmax}\mathcal{L}(\Theta \mathcal{O}) by using Expectation Maximization (EM) iterative procedure given the set of data points \mathcal{O}.
Expectation step in mixture fitting algorithm is made through computing responsibility matrix of the components given data points
We use Bayesian rule to find posterior probability (responsibility) of a mixture component with parameters Θ_{ i }for data point o_{ j }
Expectation step is followed my maximization step where we reestimate parameters

(a)
Mixture proportions {\widehat{\alpha}}_{k}=\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )},

(b)
Mean {\widehat{\mu}}_{k}=\frac{{\displaystyle {\sum}_{i=1}^{N}{o}_{i}p({\Theta}_{k}{o}_{i},\Theta )}}{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )}},

(c)
Variance {\widehat{\sigma}}_{k}^{2}=\frac{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\Theta )({o}_{i}{\widehat{\mu}}_{k})({o}_{i}{\widehat{\mu}}_{k})}}{{\displaystyle {\sum}_{i=1}^{N}p({\Theta}_{k}{o}_{i},\mathrm{\Theta )}}}.
Appendix C – Definition of Hidden Markov Model
The Hidden Markov Model (HMM) is a widely accepted stochastic modelling tool [23] used in various domains, such as speech recognition [24] and bioinformatics [25]. HMM is a stochastic finite state machine where each transition between hidden states is culminated by a symbol emission. The HMM could be represented as a directed graph with N states where each state could emit either discrete character or continuous value drawn from PDF. In order to describe HMM we need the following parameters

Set of states, we label individual states as S = {S_{1}, S_{2},..., S_{ N }}, and denote the state visited at time t as q_{ t },

Set of PDFs from where emission is drawn, in our case we use Normal distributions B=\{\mathcal{N}({\mu}_{1},{\sigma}_{1}^{2}),\mathrm{...},\mathcal{N}({\mu}_{N},{\sigma}_{N}^{2})\},

The statetransmission probability matrix A = {a_{ ij }}, where a_{ ij }= p(q_{t+1}= jq_{ t }= i),

The initial state distribution vector ∏ = {π_{1},..., π_{ N }}.
Set of parameters λ = (∏, A, B) completely specifies HMM. A simple example of HMM with two states where emissions are drawn from normal distributions \mathcal{N}(45, 20) and \mathcal{N}(50, 20) is shown in Figure 10.
Appendix D – HMM forwardbackward algorithm and Viterbi decoding
Here we adopt notation from [13] and report final HMM parameters update rules for EM learning algorithm rigorously derived in [22].
Viterbi algorithm for finding optimal parse
The Viterbi algorithm is a dynamic programming algorithm that runs on HMM for finding the most likely sequence of hidden states, called the Viterbi path, that result in an observed sequence.

1.
Initially δ_{1}(i) = π_{ i }\mathcal{N}(o_{1}Θ_{ i }), ψ_{1}(i) = 0 for 1 ≤ i ≤ N,

2.
{\delta}_{t}(j)=\underset{1\le i\le N}{\mathrm{max}}[{\delta}_{t1}(i){a}_{ij}]\mathcal{N}({o}_{t}{\Theta}_{j}),{\psi}_{t}(j)=\underset{1\le i\le N}{\mathrm{arg}\mathrm{max}}[{\delta}_{t1}(i){a}_{ij}]
for t = 2,..., T and 1 ≤ j ≤ N,

3.
Finally {q}_{T}^{\ast}=\underset{1\le i\le N}{\mathrm{arg}\mathrm{max}}[{\delta}_{T}(i)], trace back {q}_{t}^{\ast}={\psi}_{t+1}({q}_{t+1}^{\ast}) for t = T  1, T  2,..., 1 with optimal decoding {Q}^{\ast}=\{{q}_{1}^{\ast},{q}_{2}^{\ast},\mathrm{...},{q}_{T}^{\ast}\}.
HMM expectation step
We need to find expected probabilities of being at a certain state at a certain moment of time with forwardbackward procedure.
Forward procedure By definition α_{ t }(i) = p(o_{1}, o_{2},..., o_{ t }, q_{ t }= S_{ i }λ) is calculated the following way

1.
Initially α_{1}(i) = π_{ i }\mathcal{N}(o_{1}Θ_{ i }) for 1 ≤ i ≤ N,

2.
{\alpha}_{t}(j)=\left[{\displaystyle {\sum}_{i=1}^{N}{\alpha}_{t1}(i){a}_{ij}}\right]\mathcal{N}({o}_{t}{\Theta}_{j})
for t = 2, 3,..., T and 1 ≤ j ≤ N,

3.
Finally p(\mathcal{O}\lambda )={\displaystyle {\sum}_{i=1}^{N}{\alpha}_{T}(i)} is the sequence likelihood according to model.
Backward procedure By definition β_{ t }(i) = p(o_{t+1}, o_{t+2},..., o_{ T }, q_{ t }= S_{ i }λ) is calculated the following way

1.
Initially β_{ T }(i) = 1 for 1 ≤ i ≤ N,

2.
{\beta}_{t}(i)={\displaystyle {\sum}_{j=1}^{N}{a}_{ij}}\mathcal{N}({o}_{t+1}{\Theta}_{j}){\beta}_{t+1}(j)
for t = T  1, T  2,..., 1 and 1 ≤ i ≤ N,

3.
Finally p(\mathcal{O}\lambda )={\displaystyle {\sum}_{i=1}^{N}{\pi}_{i}\mathcal{N}({o}_{1}{\Theta}_{i}){\beta}_{1}(i)}.
By definition ξ_{ t }(i, j) is the probability of being in state i at time t, and state j at time t + 1, given the model and the observation sequence
By definition γ_{ t }(i) as the probability of being in state i at time t, given the observation sequence and the model
HMM maximization step
We update HMM parameters according to their expected utilization

(a)
Initial state probabilities estimate {\widehat{\pi}}_{i} = γ_{1}(i) for 1 ≤ i ≤ N,

(b)
Statetransition probabilities estimate {\widehat{a}}_{ij}=\frac{{\displaystyle {\sum}_{t=1}^{T1}{\xi}_{t}(i,j)}}{{\displaystyle {\sum}_{t=1}^{T1}{\gamma}_{t}(i)}} for 1 ≤ i, j ≤ N,

(c)
Gaussian output probabilities estimate \begin{array}{cc}{\widehat{\mu}}_{j}=\frac{{\displaystyle {\sum}_{t=1}^{T}{o}_{t}{\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}},& {\widehat{\sigma}}_{j}^{2}=\frac{{\displaystyle {\sum}_{t=1}^{T}({o}_{t}{\widehat{\mu}}_{j})({o}_{t}{\widehat{\mu}}_{j}){\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}}\end{array} for 1 ≤ j ≤ N.
References
Song L, Hobaugh M, Shustak C, Cheley S, Bayley H, Gouaux J: Structure of Staphylococcal alphaHemolysin, a Heptameric Transmembrane Pore. Science 1996,274(5294):1859–1865. 10.1126/science.274.5294.1859
Bhakdi S, TranumJensen J: Alphatoxin of Staphylococcus aureus. Microbiol Rev 1991,55(4):733–751.
Walker B, Bayley H: Key Residues for Membrane Binding, Oligomerization, and Pore Forming Activity of Staphylococcal α Hemolysin Identified by Cysteine Scanning Mutagenesis and Targeted Chemical Modification. Journal of Biological Chemistry 1995,270(39):23065–23071. 10.1074/jbc.270.39.23065
Gouaux J, Braha O, Hobaugh M, Song L, Cheley S, Shustak C, Bayley H: Subunit Stoichiometry of Staphylococcal alphaHemolysin in Crystals and on Membranes: A Heptameric Transmembrane Pore. PNAS 1994, 91: 12828–12831. 10.1073/pnas.91.26.12828
Kasianowicz J, Brandindagger E, Brantondagger D, Deamer D: Characterization of individual polynucleotide molecules using a membrane channel. PNAS 1996, 93: 13770–13773. 10.1073/pnas.93.24.13770
Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer D: Microsecond timescale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999, 77: 3227–3233.
Mathé J, Visram H, Viasnoff V, Rabin Y, Meller A: Nanopore Unzipping of Individual DNA Hairpin Molecules. Biophysical Journal 2004, 87: 3205–3212. 10.1529/biophysj.104.047274
Vercoutere W, WintersHilt S, Olsen H, Deamer D, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at singlenucleotide resolution using an ion channel. Nature Biotechnology 2001, 19: 248–252. 10.1038/85696
Vercoutere W, WintersHilt S, DeGuzman V, Deamer D, Ridino S, Rodgers J, Olsen H, Marziali A, Akeson M: Discrimination among individual WatsonCrick base pairs at the termini of single DNA hairpin molecules. Nucleic Acids Research 2003,31(4):1311–1318. 10.1093/nar/gkg218
Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer D: Microsecond timescale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999,77(6):3227–3233.
Kasianowicz J, Brandin E, Branton D, Deamer D: Characterization of individual polynucleotide molecules using a membrane channel. PNAS 1996,93(24):13770–13773. 10.1073/pnas.93.24.13770
Mitchell C, Helzerman R, Jamieson L, Harper M: A Parallel Implementation of a Hidden Markov Model with Duration Modeling for Speech Recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. [http://citeseer.ist.psu.edu/mitchell95parallel.html]
Rabiner L: A tutorial on Hidden Markov Models and Selected Applications in Speach Recognition. Proceedings of IEEE 1989, 77: 257–286. 10.1109/5.18626
Majoros W, Pertea M, Delcher A, Salzberg S: Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 2005.,6(16):
WintersHilt S: Hidden Markov Model Variants and their Application. BMC Bioinformatics 2006,7(Suppl 2):S14. 10.1186/147121057S2S14
WintersHilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini. BMC Bioinformatics 2006,7(Suppl 2):S22. 10.1186/147121057S2S22
DeGuzman V, Lee C, Deamer D, Vercoutere W: Sequencedependent gating of an ion channel by DNA hairpin molecules. Nucleic Acids Research 2006,34(22):6425–6437. 10.1093/nar/gkl754
WintersHilt S, Vercoutere W, DeGuzman V, Deamer D, Akeson M, Haussler D: Highly Accurate Classification of WatsonCrick Basepairs on Termini of Single DNA Molecules. Biophysical Journal 2003, 84: 967–976.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Volume chap 3. Cambridge University press; 1998:69.
Mitchell C, Helzerman R, Jamieson L, Harper M: A Parallel Implementation of a Hidden Markov Model with Duration Modeling for Speech Recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. [http://citeseer.ist.psu.edu/mitchell95parallel.html]
Miklós I, Meyer I: A linear memory algorithm for BaumWelch training. BMC Bioinformatics 2005, 6: 231. 10.1186/147121056231
Bilmes J: A Gentle Tutorial of the EM algorithm and its application to parameter Estimation for Gaussian mixture and Hidden Markov Models. In Tech Rep TR97–021. International Computer Science Institute; 1998.
Bilmes J: What HMMs can do. In Tech rep. University of Washington, Seattle; 2002.
Rabiner L, Juang BH: Fundamentals of speech recognition. Printice Hall; 1993.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University press; 1998.
Acknowledgements
Federal funding was provided by an NIH K22 (SWH PI, 5K22LM008794), an NIH NNBM R21 (SWH coPI), and LA Board of Regents Enhancement, RCS, and LaSPACE grants (SWH PI). Funding also provided by New Orleans Childrens Hospital and the University of New Orleans Computer Science Department. The authors are grateful for many constructive suggestions made by anonymous reviewers.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 7, 2007: Proceedings of the Fourth Annual MCBIOS Conference. Computational Frontiers in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/8?issue=S7.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Electronic supplementary material
12859_2007_1945_MOESM3_ESM.eps
Additional file 3: Artificial duration distributions represented as continuous PDFs of Beta mixtures. By discretizing these densities we can get duration histograms for any size of aggregate states used in our experiments. Here we use the following PDFs for the first state Mix_{1}(x) = 0.1874 × Beta(x3.0315, 3.0097) + 0.8126 × Beta(x3.9944, 9.4049) and Mix_{2}(x) = 0.1583 × Beta(x3.0446, 2.6063) + 0.8417 × Beta(x8.0777, 2.8867) for the second state. (EPS 648 KB)
12859_2007_1945_MOESM4_ESM.eps
Additional file 4: Gaussian PDFs and corresponding emissions for DHMM model [see Section The explicit duration HMM implementation] running in generative mode. Here the maximum duration of a state is 480 μs with 20 μs sampling rate. (EPS 649 KB)
12859_2007_1945_MOESM5_ESM.eps
Additional file 5: The HMM with geometric duration distribution corresponding to the maximum state duration of 6. Discrete duration distribution histograms are put next to each state. (EPS 43 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Churbanov, A., Baribault, C. & WintersHilt, S. Duration learning for analysis of nanopore ionic current blockades. BMC Bioinformatics 8 (Suppl 7), S14 (2007). https://doi.org/10.1186/147121058S7S14
Published:
DOI: https://doi.org/10.1186/147121058S7S14