 Proceedings
 Open Access
Clustering ionic flow blockade toggles with a Mixture of HMMs
 Alexander Churbanov^{1}Email author and
 Stephen WintersHilt^{1, 2}
https://doi.org/10.1186/147121059S9S13
© Churbanov and WintersHilt; licensee BioMed Central Ltd. 2008
 Published: 12 August 2008
Abstract
Background
Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single molecule properties with potential implications for DNA sequencing. The αHemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern with readily distinguishable modes of toggling. Effective processing of such signals requires developing machine learning methods capable of learning the various blockade modes for classification and knowledge discovery purposes. Here we propose a method aimed to improve our stochastic analysis capabilities to better understand the discriminatory capabilities of the observed the nanopore channel interactions with analyte.
Results
We tailored our memorysparse distributed implementation of a Mixture of Hidden Markov Models (MHMMs) to the problem of channel current blockade clustering and associated analyte classification. By using probabilistic fully connected HMM profiles as mixture components we were able to cluster the various 9 basepair hairpin channel blockades. We obtained very high Maximum a Posteriori (MAP) classification with a mixture of 12 different channel blockade profiles, each with 4 levels, a configuration that can be computed with sufficient speed for realtime experimental feedback. MAP classification performance depends on several factors such as the number of mixture components, the number of levels in each profile, and the duration of a channel blockade event. We distribute BaumWelch Expectation Maximization (EM) algorithms running on our model in two ways. A distributed implementation of the MHMM data processing accelerates data clustering efforts. The second, simultanteous, strategy uses an EM checkpointing algorithm to lower the memory use and efficiently distribute the bulk of EM processing in processing large data sequences (such as for the progressive sums used in the HMM parameter estimates).
Conclusion
The proposed distributed MHMM method has many appealing properties, such as precise classification of analyte in realtime scenarios, and the ability to incorporate new domain knowledge into a flexible, easily distributable, architecture. The distributed HMM provides a feature extraction that is equivalent to that of the sequential HMM with a speedup factor approximately equal to the number of independent CPUs operating on the data. The MHMM topology learns clusters existing within data samples via distributed HMM EM learning. A Java implementation of the MHMM algorithm is available at http://logos.cs.uno.edu/~achurban.
Keywords
 Hide Markov Model
 Expectation Maximization
 Channel Blockade
 Hide Markov Model Profile
 Hide Markov Model Parameter
Background
The bacterium Staphylococcus aureus secretes αhemolysin monomers that bind to the outer membrane of susceptible cells. Seven monomers can oligomerize to form a very stable waterfilled transmembrane channel [1]. The channel can cause death to the target cell by rapidly discharging vital molecules (such as ATP) and disturbing the membrane potential.
In an interview [In Focus, January 2002], one of the pioneers in the development of nanopore technology, Dr. Mark Akeson, states that getting a machine to learn base pair or nucleotide signatures and report the results automatically will be a key feature of a nanopore sequencing instrument. Here we propose a new method of unsupervised learning of ionic flow blockades with Mixture of Hidden Markov Models (MHMM) profiles that has a number of attractive attributes, at the expense of restricting learning to a smaller state space. For genome sequencing, the problem reduces to identifying the classes {A, C, G, T}, i.e., there are only four classes to discriminate. Thus, for some important problems the nonscaling constraint is not an issue and this approach may offer the best performance.
The Maximum a Posteriori (MAP) molecular classification with our model opens the possibility for making distributed decisions in real time. The EM algorithms running on our model are computationally expensive procedures, thus, an important method in this work involves computational speedup efforts via distributed processing implementations.
Results
The accuracy improvement is consistent with the accuracy of the previously reported classification process [6] as shown in Figure 8(a) (except for the 9GC molecule). The failure to discern 9GC from 9AT in the approach described here, and not in prior efforts [5, 6], may simply be the result of better blockadelevel resolution 'finestructure' with the prior model.
The better resolution between 9GC and 9AT channel blockades obtained with the 50state single HMM (used in [5, 6]) may simply be due to the fixed 1pA resolution (the state quantization binsize) providing a critical resolving capability between very similar blockade signals. If true, a hybrid solution may be to directly incorporate finestructure into the 4state multiple HMM processing model that is used here, by adding finestructure states at 1pA distances on either side of the 4 states identified by EM. Efforts along these lines are ongoing (see Discussion).
The MHMM analysis framework first has been implemented in a concurrent fashion on a quadcore Sun Ultra 40 M2 machine with speedup factor 3.66 as compared to a conventional implementation, and then distributed to the five machines of the same type with Java RMI with additional speedup of 4.02, which translates to the total speedup of 3.66 × 4.02 = 14.71.
Methods
The ionic flow blockade records were obtained from the previous studies [6]. Two axon binary files (each containing 500 blockade samples of 300 ms) have been used to learn the probabilistic profiles for each hairpin molecule. The first 100 ms of each channel blockade is the basis of the first test set. Four other axon binary files, with uninterrupted recordings (nonsweep data), for each hairpin molecule and recorded on the same day, are then used for testing. The test set was formed by equiprobable sampling of 500 labeled blockade samples from the pool of test files.
Another test set was constructed from the above data files to measure accuracy of consecutive sameanalyte toggle sample classification. In this instance we take all the available blockade signal coming from the test files of a certain molecule (not just the first 100 ms) and use multiple sample draws from the same signal blockade (i.e., consecutive 100 ms segments). With the 100 ms signal samples drawn from the same blockade event, we perform MAP scoring followed by majority vote (with random resolution of ties). (Note: the data rejection employed in [6] could be made roughly equivalent to the signal resampling approach described here by simply collecting consecutive 100 ms samples, as done here, and having classification on a given blockade once the signal isn't rejected, the only difference with the classification postprocessing being that in this effort majority vote is employed instead.) Accuracy is calculated as the number of correct classifications matching the known molecule type to the total number of classification events. As in [6], the ionic flow in each record has been normalized to the open channel current 120 pA prior to learning and testing.
For our distributed MHMM system implementation we have used cluster of five workstations Sun Ultra 40 M2, each equipped with two AMD DualCore Opteron processors (2220SE 2.8 GHz), connected through gigabit Ethernet switch.
HMM definition and EM learning
The following parameters describe the conventional HMM implementation according to [13]:

A set of states S = {S_{1},..., S_{ N }} with q_{ t }being the state visited at time t,

A set of PDFs B = {b_{1}(o),..., b_{ N }(o)}, describing the emission probabilities b_{ j }(o_{ t }) = p(o_{ t }q_{ t }= S_{ j }) for 1 ≤ j ≤ N, where o_{ t }is the observation at timepoint t from the sequence of observations O = {o_{1},..., o_{ T }},

The statetransition probability matrix A = {a_{i,j}} for 1 ≤ i, j ≤ N, where a_{i, j}= p(q_{t+1}= S_{ j }q_{ t }= S_{ i }),

The initial state distribution vector ∏ = {π_{1},..., π_{ N }}.
Forward and backward procedures.
Forward procedure  Backward procedure 

α_{ t }(i) ≡ p(o_{1},..., o_{ t }q_{ t }= S_{ i }, λ)  β_{ t }(i) ≡ p(o_{t+1},..., o_{ T }q_{ t }= S_{ i }, λ) 
• Initially α_{1}(i) = π_{ i }b_{ i }(o_{1}) for 1 ≤ i ≤ N,  • Initially β_{T}(i) = 1 for 1 ≤ i ≤ N, 
• ${\alpha}_{t}(j)=\left[{\displaystyle {\sum}_{i=1}^{N}{\alpha}_{t1}(i){a}_{i,j}}\right]{b}_{j}({o}_{t})$ for t = 2, 3,..., T and 1 ≤ j ≤ N,  • ${\beta}_{t}(i)={\displaystyle {\sum}_{j=1}^{N}{\alpha}_{i,j}{b}_{j}({o}_{t+1}){\beta}_{t+1}(j)}$ for t = T  1,...,1 and 1 ≤ i ≤ N, 
• Finally $p(O\lambda )={\displaystyle {\sum}_{i=1}^{N}{\alpha}_{T}(i)}$ is the sequence likehood.  • Finally $p(O\lambda )={\displaystyle {\sum}_{i=1}^{N}{\pi}_{i}{b}_{i}({o}_{1}){\beta}_{1}(i)}$. 
Maximization step in HMM learning.
Initial probability estimate  Transition probability estimate  Emission parameters estimate 

${\widehat{\pi}}_{i}$ = γ_{1}(i), for 1 ≤ i ≤ N.  ${\widehat{a}}_{i,j}=\frac{{\displaystyle {\sum}_{t=1}^{T1}{\xi}_{t}(i,j)}}{{\displaystyle {\sum}_{t=1}^{T1}{\gamma}_{t}(i)}}$, for 1 ≤ i, j ≤ N.  Gaussian emission ${\widehat{b}}_{j}(o)\to \mu =\frac{{\displaystyle {\sum}_{t=1}^{T}{o}_{t}{\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}}$, ${\widehat{b}}_{j}(o)\to {\sigma}^{2}=\frac{{\displaystyle {\sum}_{t=1}^{T}{({o}_{t}{\widehat{\mu}}_{j})}^{2}{\gamma}_{t}(j)}}{{\displaystyle {\sum}_{t=1}^{T}{\gamma}_{t}(j)}}$, for 1 ≤ j ≤ N. 
EM learning of HMM mixture
The objective of mixture learning is to maximize the likelihood function $p(\mathcal{O}\Theta )={\Pi}_{i=1}^{N}p({o}_{i}\Theta )=\mathcal{L}(\Theta \mathcal{O})$, i.e. we wish to find the locally optimal set of parameters ${\Theta}^{\ast}=argmax\mathcal{L}(\Theta \mathcal{O})$ by using the Expectation Maximization (EM) iterative procedure and the set of data points $\mathcal{O}$.
The Expectation step is followed by the maximization step where we reestimate parameters.

Mixture proportions${\widehat{\alpha}}_{m}=\frac{1}{K}{\displaystyle \sum _{k=1}^{K}p({\lambda}_{m}{O}_{k},\lambda )},$(5)

Initial probabilities${\widehat{\Pi}}_{m}=\frac{{\displaystyle {\sum}_{k=1}^{K}{\widehat{\Pi}}_{m}^{k}p({\lambda}_{m}{O}_{k},\lambda )}}{{\displaystyle {\sum}_{k=1}^{K}p({\lambda}_{m}{O}_{k},\lambda )}},$(6)
where ${\widehat{\Pi}}_{m}^{k}$ is an estimate of initial probabilities for the component m given sequence O_{ k },

Transitions${\widehat{A}}_{m}=\frac{{\displaystyle {\sum}_{i=1}^{K}{\widehat{A}}_{m}^{k}p({\lambda}_{m}{O}_{k},\lambda )}}{{\displaystyle {\sum}_{k=1}^{K}p({\lambda}_{m}{O}_{k},\lambda )}},$(7)
where ${\widehat{A}}_{m}^{k}$ is an estimate of transition probabilities for the component m given sequence O_{ k },

Emissions${\widehat{B}}_{m}=\frac{{\displaystyle {\sum}_{k=1}^{K}{\widehat{B}}_{m}^{k}p({\lambda}_{m}{O}_{k},\lambda )}}{{\displaystyle {\sum}_{k=1}^{K}p({\lambda}_{m}{O}_{k},\lambda )}},$(8)
where ${\widehat{B}}_{m}^{k}$ is an estimate of emission parameters for the component m given sequence O_{ k }.
Distributed EM implementation
As discussed in [15], the computational gain of a parallel implementation can greatly depend on model topology. In the speech recognition community researchers are able to use a highly parallel HMM architectures for phoneme and dictionary word recognition. Typically, when a large number of Processing Elements (PEs) is used, the utilization of each element drops due to communication overheads. Therefore, the communication overhead in any parallel architecture must be strictly managed, ideally reduced to a constellation of PEs with shared memory [15]. In recent work [16] we describe the performance of the following HMM EM algorithms (where we studied the last on the list):

Conventional EM due to Leonard E. Baum and Lloyd R. [17] takes O(T N) memory and O(2T N Q_{ max }+ T (Q + E)) time, where T is the length of the observed sequence, N is the number of HMM states, Q_{ max }is the maximum HMM node outdegree, E is the number of free emission parameters, Q is the number of free transition parameters.

Checkpointing EM [18–20] takes O($\sqrt{T}$N) memory and O(3T N Q_{ max }+ T (Q + E)) time,

Linear memory EM [16, 21] takes only O(N(Q + E D)) memory and O(T NQ_{ max }(Q + E D)) time.
Similar improvements are also described for the HMM Viterbi implementation in linear memory [16]. In actual usage with the comparatively small durations generally examined, the checkpointing algorithm was found to be the most memory efficient.
Distributed checkpointing algorithm for learning from large data samples
 1.
Client machine splits data sequence O into subsequences O_{1},..., O_{ t },..., ${O}_{\sqrt{T}}$ each of size $\sqrt{T}$ and distributes them across the servers along with λ,
 2.
Find Forward and Backward $\sqrt{T}$ checkpoints in sequential manner at the corresponding servers where emission matrices for O_{ t }were calculated and stored,
 3.
Reconstruct dynamic programming tables of size N$\sqrt{T}$ at the servers according to locally stored checkpoints to make local parameter estimate ${\widehat{\lambda}}_{t}=({\widehat{\Pi}}_{t},{\widehat{A}}_{t},{\widehat{B}}_{t})$,
 4.
After calculating local parameter estimate, communicate ${\widehat{\lambda}}_{t}$ back to the client machine and calculate $\widehat{\lambda}=(\widehat{\Pi},\widehat{A},\widehat{B})$,
 5.
Redistribute newly found $\widehat{\lambda}$ among the server machines for another EM round.
Distributed MHMM parameter estimate
 1.
For each parameter λ_{1},..., λ_{ m },..., λ_{ M }and sequence O_{1},..., O_{ k },..., O_{ K }calculate likelihood p(O_{ k } λ_{ m }) on the server nodes and communicate them back to the client,
 2.
Client finds responsibilities for each mixture component and a sequence according to formula (4),
 3.
Estimated mixture proportions ${\widehat{\alpha}}_{1},\mathrm{...},{\widehat{\alpha}}_{m},\mathrm{...},{\alpha}_{M}$ are found on a client node according to (5),
 4.
The server nodes find ${\widehat{\lambda}}_{m}^{k}$ estimates for parameter λ_{ m }and sequence O_{ k }and send them back to the client,
 5.
On the client node these newly computed parameters are weighted according to responsibilities (6), (7), (8),
 6.
Newly found HMM parameters ${\widehat{\lambda}}_{1},\mathrm{...},{\widehat{\lambda}}_{M}$ are disbursed back to the server nodes for the next round of EM training.
Discussion and conclusion
There are several advantages in our approach:

Classification is highly accurate with no data dropped from consideration,

Model parameters may have intuitive physical interpretation (but not in this study),

The MHMM implementation is distributed, such that:

Learning can take a larger number of samples (for improved accuracy),

Enables realtime analyte classification, currently takes only 0.411 sec to classify 100 ms sample,

Checkpointing algorithm keeps the memory profile low both on server and client sides without compromising the running time [16].
The need for using a mixture model beyond a simple HMM comes from the observation that generally no more than half of hairpin blockades come from the same mode of hairpin molecule interacting with nanopore (the modes correspond to principal components in the channel blockade stationary statistics profile). Other mode contributions require different probabilistic profiles for classification which naturally leads to a mixture analysis problem. The method shown in Figure 3 doesn't introduce such modes at the HMMprocessing stage, relying instead on the strengths of the SVM classifier directly.
Increasing EMlearning model complexity beyond 4 levels and 12 mixture components increases the loglikelihood of fully trained model, but does not lead to better prediction accuracy as shown in Figure 6. The likelihood increase is caused by the model overfitting the data. Overfitting with HMMprofile models, however, isn't found to be as detrimental to the generalization performance as with other learning methods – the main penalty is that the learning and classification times increase dramatically, as we need to estimate progressively increasing number of parameters.
Since we did not computationally exhaust all the possible parameter settings (number of components, number of levels and sample duration), we provide a rationale for the parameter choice we believe is optimal. With preliminary experiments learning on 9CG toggle samples with MHMM of 15 toggle clusters we have consistently exhausted the number of components, many of them converging to the same simple blockade as shown in figure 4(a) at the top right. This observation prompted us to use no more than 12 components in the channel blockade signalmode mixture model.
The number of four blockade levels corresponds to the physical model of DNA hairpin interacting with nanopore [5]. From the physical perspective the hairpin molecule can undergo different modes of capture blockade, such as Intermediate Level (IL), Upper Level (UL), Lover Level (LL) conductance states and spikes (S) [6]. When a 9 bp DNA hairpin initially enters the pore, the loop is perched in the vestibule mouth and the stem terminus binds to amino acid residues near the limiting aperture. This results in the IL conductance level. When the terminal basepair desorbs from the pore wall, the stem and loop may realign, resulting in a substantial current increase to UL. Interconversion between the IL and UL states may occur numerous times with UL possibly switching to the LL state. This LL state corresponds to binding of the stem terminus to amino acids near the limiting aperture but in a different manner from IL. From the LL bound state, the duplex terminus may fray, resulting in extension and capture of one strand in the pore constriction resulting into short term S state. The allowed transition events between the levels IL ⇔ UL ⇔ LL ⇔ S to happen at any time during the analysis procedure. The spikes model, as described in [16], could possibly be used to increase prediction accuracy. However, with the scenario discussed in this manuscript use of such additions did not lead to higher performance since the primary blockade modes shown in Figures 4(a) and 4(b) are void of spikes.
A demo program implementing distributed MHMM analysis framework is available free of charge on our web site http://logos.cs.uno.edu/~achurban.
Declarations
Acknowledgements
This research was partly funded from an NIH K22 award (K22LM008794), an NIH R21 award (R21GM073617), and an NIH program grant subcontract (R01HG003703).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S9
Authors’ Affiliations
References
 Gouaux J, Braha O, Hobaugh M, Song L, Cheley S, Shustak C, Bayley H: Subunit stoichiometry of staphylococcal α hemolysin in crystals and on membranes: a heptameric transmembrane pore. PNAS 1994, 91: 12828–12831. 10.1073/pnas.91.26.12828PubMed CentralView ArticlePubMedGoogle Scholar
 Akeson M, Branton D, Kasianowicz J, Brandin E, Deamer D: Microsecond timescale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophysical Journal 1999,77(6):3227–3233.PubMed CentralView ArticlePubMedGoogle Scholar
 Butler T, Gundlach J, Trolly M: Ionic current blockades from DNA and RNA molecules in the α hemolysin nanopore. Biophysical Journal 2007,93(9):3229–3240. 10.1529/biophysj.107.107003PubMed CentralView ArticlePubMedGoogle Scholar
 Butler T, Gundlach J, Troll M: Determination of RNA orientation during translocation through a biological nanopore. Biophysical Journal 2006, 90: 190–199. 10.1529/biophysj.105.068957PubMed CentralView ArticlePubMedGoogle Scholar
 Vercoutere W, WintersHilt S, Olsen H, Deamer D, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at singlenucleotide resolution using an ion channel. Nature Biotechnology 2001, 19: 248–252. 10.1038/85696View ArticlePubMedGoogle Scholar
 WintersHilt S, Vercoutere W, DeGuzman V, Deamer D, Akeson M, Haussler D: Highly accurate classification of WatsonCrick basepairs on termini of single DNA molecules. Biophysical Journal 2003, 84: 967–976.PubMed CentralView ArticlePubMedGoogle Scholar
 Iqbal R, Landry M, WintersHilt S: DNA molecule classification using feature primitives. BMC Bioinformatics 2006,7(Suppl 2):S15. 10.1186/147121057S2S15PubMed CentralView ArticlePubMedGoogle Scholar
 Juang B, Rabiner L: A probabilistic distance measure for hidden Markov models. AT&T technical journal 1985,64(2):391–408.View ArticleGoogle Scholar
 Krogh A, Brown M, Mian I, Sjölander K, Haussler D: Hidden Markov models in computational biology: applications to protein modelling. Tech Rep UCSCCRL93–32, UCSC 1993.Google Scholar
 Smyth P: Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems. Volume 9. Edited by: Mozer M, Jordan M, Petsche T. The MIT Press; 1997:648.Google Scholar
 Churbanov A, Rogozin I, Deogun J, Ali H: Method of predicting splice sites based on signal interactions. Biology Direct 2006.,1(10):Google Scholar
 Churbanov A, Baribault C, WintersHilt S: Duration learning for analysis of nanopore ionic current blockades. BMC Bioinformatics 2007,8(Suppl 7):S14. 10.1186/147121058S7S14PubMed CentralView ArticlePubMedGoogle Scholar
 Rabiner L: A tutorial on hidden Markov models and selected applications in speach recognition. Proceedings of IEEE 1989, 77: 257–286. 10.1109/5.18626View ArticleGoogle Scholar
 Bilmes J: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. In Tech Rep TR97–021. International Computer Science Institute; 1998.Google Scholar
 Mitchell C, Helzerman R, Jamieson L, Harper M: A parallel implementation of a hidden Markov model with duration modeling for speech recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. [http://citeseer.ist.psu.edu/mitchell95parallel.html]View ArticleGoogle Scholar
 Churbanov A, WintersHilt S: Implementing EM and Viterbi algorithms for hidden Markov model in linear memory. BMC Bioinformatics 2008, 9: 224. 10.1186/147121059224PubMed CentralView ArticlePubMedGoogle Scholar
 Baum L, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Statist 1970, 41: 164–171. 10.1214/aoms/1177697196View ArticleGoogle Scholar
 Grice J, Hughey R, Speck D: Reduced space sequence alignment. CABIOS 1997, 13: 45–53.PubMedGoogle Scholar
 Tarnas C, Hughey R: Reduced space hidden Markov model training. Bioinformatics 1998,14(5):401–406. 10.1093/bioinformatics/14.5.401View ArticlePubMedGoogle Scholar
 Wheeler R, Hughey R: Optimizing reducedspace sequence analysis. Bioinformatics 2000,16(12):1082–1090. 10.1093/bioinformatics/16.12.1082View ArticlePubMedGoogle Scholar
 Miklós I, Meyer I: A linear memory algorithm for BaumWelch training. BMC Bioinformatics 2005, 6: 231. 10.1186/147121056231PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.