Clustering ionic flow blockade toggles with a Mixture of HMMs

Background Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single molecule properties with potential implications for DNA sequencing. The α-Hemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern with readily distinguishable modes of toggling. Effective processing of such signals requires developing machine learning methods capable of learning the various blockade modes for classification and knowledge discovery purposes. Here we propose a method aimed to improve our stochastic analysis capabilities to better understand the discriminatory capabilities of the observed the nanopore channel interactions with analyte. Results We tailored our memory-sparse distributed implementation of a Mixture of Hidden Markov Models (MHMMs) to the problem of channel current blockade clustering and associated analyte classification. By using probabilistic fully connected HMM profiles as mixture components we were able to cluster the various 9 base-pair hairpin channel blockades. We obtained very high Maximum a Posteriori (MAP) classification with a mixture of 12 different channel blockade profiles, each with 4 levels, a configuration that can be computed with sufficient speed for real-time experimental feedback. MAP classification performance depends on several factors such as the number of mixture components, the number of levels in each profile, and the duration of a channel blockade event. We distribute Baum-Welch Expectation Maximization (EM) algorithms running on our model in two ways. A distributed implementation of the MHMM data processing accelerates data clustering efforts. The second, simultanteous, strategy uses an EM checkpointing algorithm to lower the memory use and efficiently distribute the bulk of EM processing in processing large data sequences (such as for the progressive sums used in the HMM parameter estimates). Conclusion The proposed distributed MHMM method has many appealing properties, such as precise classification of analyte in real-time scenarios, and the ability to incorporate new domain knowledge into a flexible, easily distributable, architecture. The distributed HMM provides a feature extraction that is equivalent to that of the sequential HMM with a speedup factor approximately equal to the number of independent CPUs operating on the data. The MHMM topology learns clusters existing within data samples via distributed HMM EM learning. A Java implementation of the MHMM algorithm is available at .


Background
The bacterium Staphylococcus aureus secretes α-hemolysin monomers that bind to the outer membrane of susceptible cells. Seven monomers can oligomerize to form a very stable water-filled transmembrane channel [1]. The channel can cause death to the target cell by rapidly discharging vital molecules (such as ATP) and disturbing the membrane potential.
Suspended in lipid bilayer, as shown in Figure 1, the αhemolysin channel can be used as a sensor (nanoporedetector) when large molecules interact with the channel environment under an applied potential (where the open channel has 120 picoAmperes of ion flow under normal conditions). When a 9 bp DNA hairpin enters the pore, the loop is caught at the vestibule mouth, leaving the stem terminus perched to readily bind to the amino acid residues near the limiting aperture, resulting in a consistent toggle for thousands of milliseconds as shown in Figure 2.
Many approaches to characterizing of nucleic acid analyte -channel interactions use 2-D scatter plot analysis [2,3]. A recently proposed method of discriminating translocating RNA polynucleotide orientation [4] uses a combination of six sigmoid phenomenological functional forms to approximate possible blockades. A hybrid method of automated analyte classification was used in [5,6] that discriminates among 8GC, 9GC, 9CG, 9TA and 9AT molecules by first obtaining features extracted with Expecta-tion Maximization (EM) learning on a single 50-state fully connected Hidden Markov Model (HMM). They then construct a feature vector based on the HMM parameters and pass that to a Support Vector Machine (SVM) for classification (with the binary decision tree shown in Figure  3). Although the process shown in Fig. 3 is scalable, and has high classification accuracy, it can also involve high data rejection rates (good for performing solution assays). This motivates effort to have a less scalable, but lower data-rejection rate (such as what is needed during genomic sequencing). Later study of the data examined in [5,6], with PCA reduction on states followed by a simple, uninformed, AdaBoost classification (not SVM, see [7]), led to similar improvement on zero-rejection accuracy, and thus similar improvements (reductions) on the datarejection needed for high-accuracy classification [7]. That approach, however, didn't begin with the stronger (but non-scalable in class-number) feature extraction method described here. This is the first test of what is expected to be a highly accurate feature extraction method (better than those employed previously), where the critical limitation in general use, however, is in its scalability in number of classes to discriminate.
In an interview [In Focus, January 2002], one of the pioneers in the development of nanopore technology, Dr. Mark Akeson, states that getting a machine to learn base pair or nucleotide signatures and report the results automatically will be a key feature of a nanopore sequencing α-hemolysin nanopore with captured hairpin α-hemolysin nanopore with captured hairpin.
instrument. Here we propose a new method of unsupervised learning of ionic flow blockades with Mixture of Hidden Markov Models (MHMM) profiles that has a number of attractive attributes, at the expense of restricting learning to a smaller state space. For genome sequencing, the problem reduces to identifying the classes {A, C, G, T}, i.e., there are only four classes to discriminate. Thus, for some important problems the non-scaling constraint is not an issue and this approach may offer the best performance.
The Maximum a Posteriori (MAP) molecular classification with our model opens the possibility for making distributed decisions in real time. The EM algorithms running on our model are computationally expensive procedures, thus, an important method in this work involves computational speed-up efforts via distributed processing implementations.

Results
We have learned blockade signal clusters for five different types of molecules: two such profile mixtures, learned in 50 iterations, are shown in Figure 4. The classification accuracy is shown in Figure 5, where we used 10-fold resampling of 500 labeled toggle sample subsets from our test set [see Section Methods] (the 10-fold resampling is needed to perform majority-vote classification stabilization). The resampling offers a similar stabilization on classifications, and at similar computational expense, to what is done via data-rejection in [5,6]. Accuracy here is defined as where True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are among the classified data samples. We have systematically investigated how the model complexity affects accuracy as shown in Figure 6, where average accuracy does not improve for the model of more than 12 components and more than 4 blockade levels, although some individual molecules take advantage of increased model complexity as their classification becomes more accurate. We have also investigated the blockade signal duration needed for proper classification, as shown in Figure 7, and for the data-sets examined found that samples with more that 100 ms duration yield little in either average classification accuracy or classification time. We tried using ionic flow blockade samples of 200 ms in the MHMM training, for example, with no apparent improvement to classification accuracy over the 100 ms duration samples. This behavior was not observed with the non-MAP, large-state (but scalable), approach used in [5,6], where greater observation times led to improved classification (although there is agreement that there was diminishing returns on learning sets for signal durations greater then 100 ms, and, especially, if greater than 500 ms).

Accuracy TP TN TP FP TN FN
Upper Level Toggler (ULT) with profile example Figure 2 Upper Level Toggler (ULT) with profile example. The accuracy of consecutive same-analyte toggle samples classification is shown in Figure 8(b), where we reach 100% performance within 14 classifications, except for the 9GC molecule, which underperformed when compared with [5,6]. The difficulty with 9GC classification accuracy convergence could be explained by substantial confusion with 9AT toggles, which reaches ~17% at first classification round and reluctantly reduces to ~3% after 21 classification rounds.
The accuracy improvement is consistent with the accuracy of the previously reported classification process [6] as shown in Figure 8(a) (except for the 9GC molecule). The failure to discern 9GC from 9AT in the approach described here, and not in prior efforts [5,6], may simply be the result of better blockade-level resolution 'fine-structure' with the prior model.
The better resolution between 9GC and 9AT channel blockades obtained with the 50-state single HMM (used in [5,6]) may simply be due to the fixed 1pA resolution (the state quantization bin-size) providing a critical resolving capability between very similar blockade signals. If true, a hybrid solution may be to directly incorporate fine-structure into the 4-state multiple HMM processing model that is used here, by adding fine-structure states at 1pA distances on either side of the 4 states identified by EM. Efforts along these lines are ongoing (see Discussion).
The MHMM analysis framework first has been implemented in a concurrent fashion on a quad-core Sun Ultra 40 M2 machine with speedup factor 3.66 as compared to a conventional implementation, and then distributed to the five machines of the same type with Java RMI with additional speedup of 4.02, which translates to the total speedup of 3.66 × 4.02 = 14.71.
Existing classification process with HMM feature extraction followed by SVM binary tree decision Figure 3 Existing classification process with HMM feature extraction followed by SVM binary tree decision.

Methods
In our approach we used unsupervised distributed learning of nanopore ionic flow blockade toggles with an MHMM. MHMMs have a long record of successful implementations that started in speech recognition [8] and later were used for clustering protein families [9], sequences [10] and in the search for splicing enhancers [11]. We use the HMM profile shown in Figures 2 and 9(a) to model the channel blockade process using MHMM components as shown in Figure 9(b). Justification for using such profiles is provided in [12], where we have found the dura-tion of ionic flow blockade levels to be distributed with a simple geometric distribution. The noise at a fixed-level blockade level is typically found to be Gaussian, consistent with the overall thermal and shot noise background for the transient-binding fixed-flow-geometry environments formed by channel and blockading elements.
MAP classification accuracy with 10-fold resampling on a split-sample data (with 4 levels and 15 components)     The ionic flow blockade records were obtained from the previous studies [6]. Two axon binary files (each containing 500 blockade samples of 300 ms) have been used to learn the probabilistic profiles for each hairpin molecule. The first 100 ms of each channel blockade is the basis of the first test set. Four other axon binary files, with uninterrupted recordings (non-sweep data), for each hairpin molecule and recorded on the same day, are then used for testing. The test set was formed by equiprobable sampling of 500 labeled blockade samples from the pool of test files.
Another test set was constructed from the above data files to measure accuracy of consecutive same-analyte toggle sample classification. In this instance we take all the available blockade signal coming from the test files of a certain molecule (not just the first 100 ms) and use multiple sample draws from the same signal blockade (i.e., consecutive Proposed and existing process classification accuracy   HMM profile and mixture of profiles (a) HMM profile example.
(b) Mixture of HMM profiles. 100 ms segments). With the 100 ms signal samples drawn from the same blockade event, we perform MAP scoring followed by majority vote (with random resolution of ties). (Note: the data rejection employed in [6] could be made roughly equivalent to the signal resampling approach described here by simply collecting consecutive 100 ms samples, as done here, and having classification on a given blockade once the signal isn't rejected, the only difference with the classification post-processing being that in this effort majority vote is employed instead.) Accuracy is calculated as the number of correct classifications matching the known molecule type to the total number of classification events. As in [6], the ionic flow in each record has been normalized to the open channel current 120 pA prior to learning and testing.
For our distributed MHMM system implementation we have used cluster of five workstations Sun Ultra 40 M2, each equipped with two AMD Dual-Core Opteron processors (2220SE 2.8 GHz), connected through gigabit Ethernet switch.

HMM definition and EM learning
The following parameters describe the conventional HMM implementation according to [13]: for the EM learning algorithm rigorously derived in [14]. When training the HMM using the Baum-Welch algorithm (an Expectation Maximization procedure), first we need to find the expected probabilities of being at a certain state at a certain time-point using the forward-backward procedure as shown in Table 1.
Let us define ξ t (i, j) as the probability of being in state i at time t, and state j at time t + 1, given the model and the observation sequence and γ t (i) as the probability of being in state i at time t, given the observation sequence and the model The HMM maximization step using these probabilities is shown in Table 2.

EM learning of HMM mixture
The objective of mixture learning is to maximize the likelihood function , i.e. we wish to find the locally optimal set of parameters by using the Expectation Maximization (EM) iterative procedure and the set of data points .
The Expectation step in the mixture fitting algorithm is done by computing the responsibility matrix of the components given the data points:

Distributed EM implementation
As discussed in [15], the computational gain of a parallel implementation can greatly depend on model topology.
In the speech recognition community researchers are able to use a highly parallel HMM architectures for phoneme and dictionary word recognition. Typically, when a large number of Processing Elements (PEs) is used, the utilization of each element drops due to communication overheads. Therefore, the communication overhead in any parallel architecture must be strictly managed, ideally reduced to a constellation of PEs with shared memory [15]. In recent work [16] we describe the performance of the following HMM EM algorithms (where we studied the last on the list): • Conventional EM due to Leonard E. Baum and Lloyd R. [17] takes O(T N) memory and O(2T N Q max + T (Q + E)) time, where T is the length of the observed sequence, N is the number of HMM states, Q max is the maximum HMM node out-degree, E is the number of free emission parameters, Q is the number of free transition parameters.
• Checkpointing EM [18][19][20] takes O( N) memory and O(3T N Q max + T (Q + E)) time, • Linear memory EM [16,21] takes only O(N(Q + E D)) memory and O(T NQ max (Q + E D)) time. Similar improvements are also described for the HMM Viterbi implementation in linear memory [16]. In actual usage with the comparatively small durations generally examined, the checkpointing algorithm was found to be the most memory efficient.

Distributed checkpointing algorithm for learning from large data samples
The distributed checkpointing EM algorithm is shown in Figure 10. Here are the steps in our distributed checkpointing algorithm implementation:

Discussion and conclusion
There are several advantages in our approach: • Classification is highly accurate with no data dropped from consideration, • Model parameters may have intuitive physical interpretation (but not in this study), • The MHMM implementation is distributed, such that: -Learning can take a larger number of samples (for improved accuracy), -Enables real-time analyte classification, currently takes only 0.411 sec to classify 100 ms sample, -Checkpointing algorithm keeps the memory profile low both on server and client sides without compromising the running time [16].
The need for using a mixture model beyond a simple HMM comes from the observation that generally no more than half of hairpin blockades come from the same mode of hairpin molecule interacting with nanopore (the modes correspond to principal components in the channel blockade stationary statistics profile). Other mode contributions require different probabilistic profiles for classification which naturally leads to a mixture analysis problem. The method shown in Figure 3 doesn't introduce such modes at the HMM-processing stage, relying instead on the strengths of the SVM classifier directly.
Increasing EM-learning model complexity beyond 4 levels and 12 mixture components increases the log-likelihood of fully trained model, but does not lead to better prediction accuracy as shown in Figure 6. The likelihood increase is caused by the model overfitting the data. Overfitting with HMM-profile models, however, isn't found to be as detrimental to the generalization performance as with other learning methods -the main penalty is that the learning and classification times increase dramatically, as we need to estimate progressively increasing number of parameters.
Since we did not computationally exhaust all the possible parameter settings (number of components, number of levels and sample duration), we provide a rationale for the parameter choice we believe is optimal. With preliminary experiments learning on 9CG toggle samples with MHMM of 15 toggle clusters we have consistently exhausted the number of components, many of them con-     figure  4(a) at the top right. This observation prompted us to use no more than 12 components in the channel blockade signal-mode mixture model.
The number of four blockade levels corresponds to the physical model of DNA hairpin interacting with nanopore [5]. From the physical perspective the hairpin molecule can undergo different modes of capture blockade, such as Intermediate Level (IL), Upper Level (UL), Lover Level (LL) conductance states and spikes (S) [6]. When a 9 bp DNA hairpin initially enters the pore, the loop is perched in the vestibule mouth and the stem terminus binds to amino acid residues near the limiting aperture. This results in the IL conductance level. When the terminal basepair desorbs from the pore wall, the stem and loop may realign, resulting in a substantial current increase to UL. Interconversion between the IL and UL states may occur numerous times with UL possibly switching to the LL state. This LL state corresponds to binding of the stem terminus to amino acids near the limiting aperture but in a different manner from IL. From the LL bound state, the duplex terminus may fray, resulting in extension and capture of one strand in the pore constriction resulting into short term S state. The allowed transition events between the levels IL ⇔ UL ⇔ LL ⇔ S to happen at any time during the analysis procedure. The spikes model, as described in [16], could possibly be used to increase prediction accuracy. However, with the scenario discussed in this manuscript use of such additions did not lead to higher performance since the primary blockade modes shown in Figures 4(a) and 4(b) are void of spikes.
A demo program implementing distributed MHMM analysis framework is available free of charge on our web site http://logos.cs.uno.edu/~achurban.