 Methodology article
 Open access
 Published:
An enhanced partial order curve comparison algorithm and its application to analyzing protein folding trajectories
BMC Bioinformatics volume 9, Article number: 344 (2008)
Abstract
Background
Understanding how proteins fold is essential to our quest in discovering how life works at the molecular level. Current computation power enables researchers to produce a huge amount of folding simulation data. Hence there is a pressing need to be able to interpret and identify novel folding features from them.
Results
In this paper, we model each folding trajectory as a multidimensional curve. We then develop an effective multiple curve comparison (MCC) algorithm, called the enhanced partial order (EPO) algorithm, to extract features from a set of diverse folding trajectories, including both successful and unsuccessful simulation runs. The EPO algorithm addresses several new challenges presented by comparing high dimensional curves coming from folding trajectories. A detailed case study on miniprotein Trpcage [1] demonstrates that our algorithm can detect similarities at rather low level, and extract biologically meaningful folding events.
Conclusion
The EPO algorithm is general and applicable to a wide range of applications. We demonstrate its generality and effectiveness by applying it to aligning multiple protein structures with low similarities. For user's convenience, we provide a web server for the algorithm at http://db.cse.ohiostate.edu/EPO.
Background
Proteins are the main agents in cells. From a chemical point of view, a protein molecule is a linear sequence of amino acids. This linear sequence, under appropriate physicochemical conditions, folds into a unique native structure rapidly. Understanding folding process is of paramount importance, especially since the outcome of it, namely the three dimensional protein structure, to a large extent decides the functionality of the molecule. Hence a lot of research has been devoted to investigating the kinetics of protein folding. In particular, modern (parallel) computation power makes it possible to perform largescale folding simulations. As a result, interpreting the huge amount of simulation data obtained becomes a crucial issue.
Given the highly stochastic nature of the protein motion, the study of protein fold usually relies on an ensemble of folding simulations including both successful and unsuccessful runs, which are trajectories that do or do not include a sequence of conformations leading to a near native conformation. Given such a diverse data set, scientists wish to answer questions such as what causes the folding process falling into different results, and what common properties are shared by the successful runs, but not the unsuccessful ones? To this end, it is highly desirable to be able to compare multiple folding trajectories and extract useful information from them.
In this paper, we model each protein folding trajectory as a multidimensional curve, and then present a novel multiplecurve comparison (MCC) algorithm to identify critical information from a set of trajectory curves in an automatic manner. In particular, we focus on the geometry of protein chain conformations throughout the folding process, and convert each conformation into a high dimensional point. The goal is to extract lists of ordered events common to successful runs but not to unsuccessful ones, such as discovering that a conformation B is always formed after A and followed by a conformation C before reaching a successful folding conformation. (Conformations A, B, and C may not be consecutive.) To this end, we develop an effective new multiple curves comparison algorithm called the enhanced partial order (EPO) algorithm, to capture similarities and dissimilarities between a set of input folding trajectories. The EPO algorithm is developed over the concept of POA (partial order alignment) [2, 3], but is greatly improved and extended in several aspects, especially in its sensitivity in detecting low level of similarity and its capability of handling high dimensional curves. Applying it to the folding trajectories of a miniprotein Trpcage [1] shows that our algorithm is able to automatically detect important critical folding events which are observed earlier [4] by biological methods. Our EPO algorithm is general, and we demonstrate its generality and effectiveness by also applying it to aligning multiple protein structures with low similarities.
Related work
Previously, folding simulations analysis is performed mainly for testing various protein folding models [5–7], such as the folding pathway model and the funnel model; and/or for studying energetic aspects of folding kinetics [8–11]. The geometric shapes of the conformations involved in folding trajectories have not been widely explored [4, 12, 13], despite their important role in folding. A particularly interesting work in this direction is by Ota et al. [4], where they investigated the folding trajectories of a miniprotein Trpcage using phylogenic tree combined with expert knowledge. However in general, an automatic tool to facilitate the folding simulations analysis at large scales is still missing. This paper provides an important step towards this goal by modeling folding trajectories as curves and using a new multiple curve comparison (MCC) algorithm to detect critical folding events.
The closest relative of our MCC problem in computational biology is the multiple structure alignment (MSTA) problem, which aims at aligning a family of protein structures, each modeled as a three dimensional polygonal curve to represent its backbone.
MSTA is a very hard problem. In fact, even the pairwise comparison problem of aligning two structures A and B is believed to be NPhard since one has to optimize simultaneously both the correspondence between A and B and the relative transformation of one structure with respect to the other. Numerous heuristicbased algorithms have been developed in practice for this fundamental problem [14–20]. If we have a set of k > 0 structures, then even the problem of aligning them optimally without considering transformations becomes intractable – it takes Ω(n^{k}) time using the standard dynamic programming algorithm, where n is the size of each protein involved.
In practice, progressive methods are widely used to attack the MSTA problem [21]. For example, given a set of structures, many approaches start with a seed structure and then progressively align the remaining structures onto it one by one [22–28]. A consensus or core structure is typically built throughout, to maintain the common substructures among the proteins that are already aligned. At each round, usually only pairwise structure comparison is performed to align the current consensus with a new structure.
The above progressive MSTA framework is a greedy approach. Its performance depends on the underlying pairwise comparison methods used, the order of structures that are progressively aligned, as well as the consensus structure maintained. Various heuristics have been exploited to find a good order for the progressive alignments. Note that this order can also be guided by a tree instead of a linear sequence, which removes the need of choosing a seed structure. The progressive procedure may also be iterated several times to locally refine the multiple structure alignments.
Our results
There are two main differences between the MCC problem we are interested in and the traditional MSTA problem. In the case of protein structures, it is usually explicitly or implicitly assumed that the (majority of the) input proteins belong to one family (How to classify a set of input structures into different families is a related problem, and many such classifications exist [17, 29, 30]), or at least share some relations. As such, one can expect that some consensus of the family should exist. However in our case, the set of curves are from a set of simulations including both successful and unsuccessful runs, and we wish to classify this diverse set of curves, and capture common features within as well as across its subfamilies. Secondly and more importantly, the level of similarity existing in these folding trajectories is usually much lower than that in a family of related proteins. Hence we aim at an algorithm with high sensitivity, which is able to detect smallscaled partial similarity and handle multidimensional curves (trajectories) as well.
In this paper, we propose and develop a sensitive MCC algorithm, called the EPO (enhanced partial order) algorithm, to compare a set of diverse high dimensional curves. Our algorithm follows a similar framework as the POA algorithm [3, 28] to encode the similarities of aligned curves in a partial order graph, instead of in a linear structure used by many traditional MSTA algorithms. This has the advantage that other than similarities among all curves, similarities among a subset of input curves can also be encoded in this graph. See Figure 1 for an example, where nodes in both graphs represent a group of aligned points from input curves.
For the more important problem of sensitivity, we observe that being a greedy approach, the progressive MSTA framework tends to be inherently insensitive to low level of similarities – if one early local decision is wrong, it may completely miss a smallscaled partial similarity. To improve this aspect of the performance of the progressive framework, we first propose a novel twolevel scoring function to measure similarity, which, together with a clustering idea, greatly enhances the quality of the local pairwise alignment produced at each round. We then develop an effective merging step to postprocess the obtained alignments. This step helps to reassemble vertices (high dimensional points) from input curves that should be matched together, but were scattered in several subclusters in the alignments due to some earlier nonoptimal decisions. Both techniques are general and can be used to improve the performance of many existing MSTA algorithms. Experimental results show that our MCC algorithm is highly sensitive and able to classify input curves. We also demonstrate the power of our tool in mining critical events from protein folding trajectories using a detailed case study of a miniprotein Trpcage.
Although our EPO algorithm is developed with the goal of comparing folding trajectories, the algorithm is general and can be applied to other domains as well, such as protein structures or pedestrian trajectories extracted from surveillance videos [31]. In particular, in this paper, we demonstrate this generality by showing that the EPO algorithm can be used to improve the results of existing multiple protein structure alignment algorithms, especially when input proteins share low structural similarity.
Results
Experimental Results on trajectory data
In this section, we report a systematic performance study on a biological dataset that contains 200 molecular dynamics simulations. The experiments achieve the following goals: First, we show that the quality of the alignments produced by our EPO algorithm is significantly better than that of the original POA algorithm. Second, we demonstrate the effectiveness of our algorithm by applying it to real protein simulation data and obtaining biologically meaningful results that are consistent with previous discoveries [4].
Background of dataset
Our input dataset includes 200 simulated folding trajectories for a particular protein called Trpcage. The dataset is provided by the Ota's Lab [4]. The folding simulations were performed at 325 K by using the AMBER99 force field with a small modification and the generalized Born implicit solvent model. Trpcage (see Figure 2) is a miniprotein consisting of 20 amino acids. It has been widely used for folding study because of its short, simple sequence and its quick folding kinetics. Following the definition from [32], a successful folding event has to satisfy the following two criteria:

The RMSD for a conformation from the native NMR structure [1] is less than 2 Å.

A subsequence of such nearnative conformations holds for at least 200 ps.
In [4], 58 successful folding trajectories reaching successful folding events are identified, and each trajectory includes 101 successive conformations sampled at 20 ps interval. Furthermore, there are two crucial observations in [4] that we will examine in the our experiment. First, before moving to the native conformation, a "ring" substructure (see Figure 2) has to be formed. Second, the distinction of native and pseudonative confirmations heavily relies on sidechain position of "ring" substructure. Ota et al. [4] obtained the above results by aligning each pair of trajectories first and then applying a neighbor joining method to group similar trajectories together. However this semiautomatic approach requires dedicated expert knowledge. The following experiments applied on the same dataset show that our EPO algorithm can automatically detect the above folding events with little prior knowledge.
Experimental setting
In order to be consistent with the results from [4], we select all 58 successful folding events, and call it SuccData. We also randomly select 58 unsuccessful folding trajectories, each containing 101 conformations, and collect them in a set called FailData. The union of successful and unsuccessful data is referred to as the MixData. There are two parameters used in our experiments: ε, and τ. They are set as ε = 1 Å and τ = 40 in all our experiments unless specified otherwise. Their meaning will be explained in Section 3 Method.
Investigation on entire protein structure
In the first set of experiments, we convert each conformation to a high dimensional point (i.e, a 20 × 20 = 400 dimensional point), based on the distance matrix between all of the alphacarbon atoms. Figure 3 compares the quality of the alignments of the SuccData by performing the POA algorithm, our EPO algorithm without the merging procedure (EPONoMerge), and the EPO algorithm. It shows the number of aligned nodes (yaxis) versus the size of aligned nodes (xaxis). Note that EPONoMerge is essentially POA with a clustering preprocessing and the new twolevel scoring function.
The similarity level between these trajectories is low (i.e, the number of aligned nodes with large size is small). It is clear from this histogram that our EPO algorithm significantly outperforms the other two by producing more aligned nodes with large sizes. The comparison between EPO and EPONoMerge demonstrates the effectiveness of our merging procedure, and that EPONoMerge is better than POA shows that the twolevel scoring function as well as the clustering preprocessing greatly enhances the performance. We have also performed experiments which show that compared to the POA algorithm, EPONoMerge is much less sensitive to the order of curves aligned. Comparisons of the three algorithms over the MixData produces a similar result, and majority of points aligned to heavy nodes (i.e, o ≥ 40) are from successful runs (the results are not shown in this report).
We also observe that most of the heavily aligned nodes are close to the end of the trajectories for the SuccData. In fact, many aligned points have conformation IDs around and greater than 90, which is indeed the time that the folding starts to get stabilized. More specifically, consider the set of aligned nodes of size greater than 40 for the SuccData. Among all points aligned to these nodes, 67.2% has a conformation ID greater than 90, and 24.4% has an ID between 80 and 90. This implies that our algorithm has the potential to detect the stabilization of successful folding events in an automatic manner.
This also implies that using the entire protein structure may be too coarse to detect critical folding events, as they are usually induced by small key motifs. In what follows, we map only a substructure of the input protein into a multidimensional point and provide more detailed analysis of this folding data.
Investigation on substructures
It is usually believed that certain critical motifs play important roles which stabilize the whole structure in the folding process [6, 7]. We wish to have a tool that can identify such critical motifs (substructures) automatically. We define a candidate motif to be two subchains of Trpcage, each of length 4. These two pieces induce a subwindow in the distance map of each conformation of the protein. We further require that the number of contacts in this subwindow w.r.t. the distance map of the native structure is at least 4, where a contact corresponds to two alphacarbon atoms within distance 6 Å. We collect a set of candidate motifs based on these criteria.
Now for a candidate motif, for each of its conformation of along the trajectory, we consider the distance matrix between its alphacarbon atoms as before, and convert the folding trajectory of this motif into a curve in the 4 × 4 = 16 dimensional space. In order to be more discriminative, we also introduce a sidechain weighting factor α, ranging from 0 to 1, to include the side chain information when comparing two high dimensional points: α = 0 means that sidechain information is completely ignored (Roughly speaking, for every conformation, we record for each residue also the relative position of the centroid of its sidechain with respect to its alphacarbon atom. This provides another high dimensional point that we call a sidechain point. The distance between two conformations will combine the distance between their sidechain points by the sidechain weighting factor). We perform our EPO algorithm on both the SuccData and the MixData, and there are two motifs that especially stand out, which we describe below.
Alphahelix substructure
The first one corresponds to an alphahelix substructure. In Figure 2, five successive amino acids (No.2–7) form an alphahelix structure which is a simple, selfcontained secondary structure (SSE) [1]. From the results returned by our EPO algorithm, we note that this alphahelix is formed rather early consistently in both successful and unsuccessful runs. Once formed, it remains stable. This is consistent with the common conception that due to its chemical property, alphahelix is a stable secondary structure, and can be formed quickly. Hence the formation of alphahelix cannot be used to differentiate successful runs from unsuccessful ones.
Ringsubstructure
The second motif corresponds to the neck of a ring structure. In particular, it consists of the subchains of No. 2 – 5 and No. 16 – 19 amino acids. The following results demonstrate that EPO can automatically not only find but also track the formation of such fingerprint substructures (critical motif).
First, we observe from Table 1 that when applying the EPO algorithm to the MixData (with the sidechain weight factor α = 0.9), signifificant alignments involve mainly trajectories from SuccData. For example, the last row of Table 1 shows that among the 62 points (from 62 trajectories) aligned to a particular node, 58 are from SuccData, with the remaining 4 from FailData. Hence this motif is potentially critical to the success of the folding of Trpcage. It also suggests that we can automatically classify the MixData into SuccData and FailData with few false positives based on this ringneck motif, while previously, the classification in the input data was obtained by a few expert defined rules.
Second, when the sidechain weighting factor α = 0.9, it turns out that 49.6% of significant aligned nodes are formed before the conformation ID 85 (compared to results from Section). For example, there are two aligned nodes from the successful runs, where 80% of points (i.e. trajectories) aligned to them has a conformation ID between 75 – 85. This implies that the complete formation of this ringneck usually immediately precedes the stabilization of the folding structure (which is roughly at conformation ID 90 for successful trajectories).
If reducing the sidechain weighting factor α to 0.5, naturally, we found more aligned nodes. In particular, other than the cluster with conformations of IDs around 80, we observe more significant clusters involving conformations with IDs from 50–70. By comparing the conformations of the ringneck motif in these clusters with those in the aligned nodes around 80, we found that the backbone structures are rather similar, but the sidechains are of different orientations. In other words, the shape of the ringneck motif is first stabilized by the backbone structure, and then the sidechains gradually move into right position. There are a few trajectories where the sidechains eventually move to the mirror image of their correct positions, and lead to pseudonative conformations which can be detected when considering the sidechains.
Figure 4 displays several groups of critical events identified by the EPO algorithm (corresponding to the aligned nodes as shown in Table 1). In particular, 4(a) includes two closely occurred events during the early stage of the folding procedure (one of the conformations is selected from the aligned node 1, and the other one is from alignednode 2). At this time, the sequence has started to fold and one can observe the helical structure, but the ring is not yet formed. 4(b) presents two conformations representing aligned nodes 3 and 4, respectively. We observe that the ring has started to form at this point, but is still not stable. 4(c) shows several conformations (one each from aligned nodes 5 to 15) occurred in order in most successful runs. At this stage, the ring is adjusted and stabilized. The adjustment mainly happens around the turn area and for side chains. 4(d) shows the final successful structure.
The above results are consistent with the results from [4], where such a ringshaped substructure was discovered semiautomatically by pairwise structure comparisons together with expert knowledge.
Timing of EPO
The above experiments are implemented on a Windows XP machine with 1.5 GHz CPU and 512 MB Memory. Table 2 compared the running time of the three methods: EPO, EPO without merging, and the POA algorithm (reimplemented by us). The EPO algorithm is faster than the traditional POA algorithm. This is because EPO preprocesses all points from input curves by clustering them into groups and only creates a new aligned node if it has a good potential to be heavily aligned. Thus it is not surprising that the EPO algorithm takes less time in aligning Faildata than SuccData, while the POA algorithm takes longer time (as it creates a lot of aligned nodes, leading to large partial order graphs). This interesting property implies that the EPO algorithm is effective at aligning curves with low level of similarities.
Experimental Results on protein structure data
Although the EPO algorithm was primarily developed for multiple high dimensional trajectories alignment, it can be used to improve multiple protein structures alignment as well. In particular, assume that we are given an alignment of a set of protein structures, that is, a transformation for each protein involved, as well as the correspondences between their residues. We can apply our EPO algorithm to further improve the correspondences based on the given transformations. We now use the EPO algorithm to refine the correspondences returned by four current popular multiple structure alignment algorithms: CEME [23], Multiprot [33], MAMMOTH [34] and POSA [28]. Table 3 shows our testing data set, which are the common sets used in previous multiple protein structure comparison algorithms. Set 1 contains several Calciumbinding proteins that have rather similar structures. Set 2 and set 3 contain a more diverse set of calmodulinlike proteins and timbarrels proteins, respectively.
Table 4 shows the results of applying EPO to the alignments of Calciumbinding proteins (set 1) as returned by other MSTA algorithms. "NCORE" refers to the number of residues aligned from each protein and "RMSD" refers to average rootmean square deviation of the aligned core. Note that as the six proteins in the data set have very similar structures, most algorithms can already align them rather well. Nevertheless, EPO still marginally improve the quantity and quality of alignment except for a slight increase in the RMSD when applied to the alignment returned by MAMMOTH. We test the EPO results by different ordering of input proteins. The results are consistent, which suggests that our EPO algorithm is robust.
Table 5 and 6 show the results of applying the EPO refinement to two structurallydiverse sets of proteins whose structural cores are harder to detect. There are notable improvements in terms of both quantity and quality.
In summary, the above results are similar to the case of comparing multiple folding trajectories. In particular, the twostage scoring function exploited by our EPO algorithm is effective and alleviates the ordering issue in progressive MSTA. The merging stage helps to produce better correspondences between input structures and makes the algorithm robust. Hence the EPO algorithm can be used as a postprocessing tool for current MSTA softwares to refine and optimize their alignment results.
Discussion and conclusion
In this paper we proposed and developed EPO, an effective multiple curve comparison method. Our primary goal is to analyze protein folding trajectories, although the algorithm is general, and can be applied to compare multiple protein structures as well. Our new method greatly improved the performance of the POA algorithm by using a clustering preprocessing, a more discriminative twolevel scoring function, as well as a novel merging postprocessing procedure. It can detect low level of similarity among input curves. We demonstrated the effectiveness of our method by applying it to a set of simulated folding trajectories of the miniprotein Trpcage. We also show the generality of our algorithm by using it to postprocess the results returned by several current multiple protein structure alignment (MTSA) algorithms.
Currently, we have only experimented the EPO algorithm with a miniprotein (Trpcage). One immediate question is to understand the scalability of the EPO algorithm for larger proteins or longer trajectories. In particular, a larger protein means a curve of higher dimensions. Our EPO algorithm seems to scale linearly with the number of participated trajectories, from current experiments. Furthermore, in practice, it is likely that we only perform the algorithm on small motifs. For longer trajectories, it seems that our algorithm scales in a quadratic manner. However, further experiments are necessary to investigate the scalability issue.
There are some previous work that analyze protein folding trajectories by collecting various statistics on measures such as the contact number (i.e, the number of native contacts) of each conformation along a trajectory and the URMS distance between a conformation and the native structure [13]. One way to view this is that a trajectory is mapped into a timeseries data representing the evolution of, say, the number of native contacts, which can be considered as a onedimensional curve. In this regard, we can use our EPO algorithm to analyze a collection of such curves induced by one measure. In general, there may be multiple measures, geometric or physiochemical, that a user may wish to inspect. Hence it is highly desirable to build a framework for analyzing folding trajectories that can incorporate these multiple measures, and that also enables the addition of new properties easily. This is one important future direction for us.
Finally, at this stage, the EPO algorithm can only be used to find the best correspondences between (multiple) curves (protein structures) for fixed transformations. That is why current experiments in Section 5 only use the EPO to refine the MSTA results by other algorithms (so that there are good initial transformations). We are currently developing an independent MSTA software to align multiple protein structures based on the EPO algorithm (i.e, solving both the alignment and the superimposition problems). We also remark that compared to other multiple curve alignment algorithms, our algorithm is specifically effective at capturing low level of similarities. As such, another important future direction is to extract structural motifs from a family of proteins that may not share high global similarity.
Methods
In this section, we describe our EPO algorithm for comparing a set of possibly high dimensional general curves. If we are given a set of protein folding data, we first convert each folding trajectory to a high dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing different states of this protein at different time steps during the simulation of its folding process. We represent each conformation using the distance map between its alphacarbon atoms so that it is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is a n × n matrix M where M [i] [j] equals the distance between the i th and j th alphacarbon atoms along the protein backbone. This matrix can then be considered as a point in the n^{2} dimensions. This way, we map each trajectory of m conformations to a curve in {\mathbb{R}}^{{n}^{2}} with m vertices. We remark that one can also encode the sidechain information into the high dimensional curves, or map the trajectory of a substructure into a high dimensional curve. We will use such more refined high dimensional curves in most of our experiments as well. In the remaining part of this paper, we use the terms trajectories and curves interchangeably.
Notations and algorithm overview
Before we formally define the MCC problem, we introduce some necessary notations. Given a set of elements V = {v_{1},..., v_{ l }}, a relation \prec over V is transitive if v_{ i }\precv_{ j }and v_{ j }\precv_{ k }imply that v_{ i }\precv_{ k }. In this paper, we also refer to v_{ i }\precv_{ j }as a partial order constraint. A partial order graph (POG) G = (V, E) is a directed acyclic graph with V = {v_{1},..., v_{ l }}, where v_{ i }\precv_{ j }if there is an edge (v_{ i }, v_{ j }). Note that by the transitivity of this relation, two nodes may have a partial order constraint even when there is no edge between them in G. Let R be the set of partial order constraints induced by G. We say that a permutation Π(V) of V is a partial order list w.r.t. G if for any v_{ i }\precv_{ j }∈ R, we have that vi appears before v_{ j }in the permutation Π(V). In other words, the linear order in Π(V) is a total order satisfying all partial order constraints induced from G. See Figure 5 for an example.
Let \mathcal{T} = {T_{1},..., T_{ N }} be a set of N trajectories in ℝ^{d}, where each trajectory T_{ i }is an ordered sequence of n points {p}_{1}^{i},\mathrm{...},{p}_{n}^{i}. (For simplicity, we assume without loss of generality that all T_{ i }s have the same length n.) The goal of the MCC algorithm is to find aligned subsequences from \mathcal{T}.
More formally, an aligned node o is a collection of vertices from T_{ i }s, with at most one point from each T_{ i }. Given a 3tuple (\mathcal{T}, τ, ε), where τ and ε are input thresholds, an alignment of \mathcal{T} is a POG G with the corresponding set of partial order constraints R and a partial order list of aligned nodes \mathcal{O} = {o_{1},..., o_{ L }} such that the following three criteria are satisfied:
C1. o_{ k } ≥ τ, for any k ∈ [1, L];
C2. for any {p}_{j}^{i},{p}_{{j}^{\prime}}^{{i}^{\prime}}\in {o}_{k}, \left\right{p}_{j}^{i}{p}_{{j}^{\prime}}^{{i}^{\prime}}\left\right\le \epsilon;
C3. if {p}_{j}^{i}\in {o}_{{k}_{1}} and {p}_{{j}^{\prime}}^{{i}^{\prime}}\in {o}_{{k}_{2}} with {o}_{{k}_{1}}\prec {o}_{{k}_{2}}, then j <j'.
(C1) indicates that the number of vertices of input curves aligned to each aligned node o_{ k }is greater than a size threshold τ . (C2) means that these aligned points are tightly clustered together (i.e, the diameter of them is bounded by a distance threshold ε). (C3) enforces that points in different aligned nodes still maintain their partial order along their respective trajectory, which means that o_{ k }s are inherited and thus consistent to the points in each trajectory. Our goal is to maximize L, the size of such an alignment \mathcal{O}. See Figure 6(b) for an example of an alignment graph.
Algorithm overview
At a high level, the EPO algorithm has two stages (see Figure 6): (S1) initial POG construction stage and (S2) merging stage. The first stage generates an initial alignment for \mathcal{T}, encoded in a POG G. The procedure has the same framework as the POA algorithm, but its performance, especially when the similarity is low, is significantly improved, via the use of a clustering preprocessing step and a new twolevel scoring function. In the second stage, we develop a novel and effective procedure to merge nodes from G to output a better final alignment G*. Below, we describe each stage in detail.
Initial POG construction
Standard dynamic programming (DP) [35, 36] is an effective method for pairwise comparison between sequences. It produces an optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequences alignment progressively based on this DP pairwise comparison method. Roughly speaking, in the i th round of the algorithm, the alignment of the first i – 1 sequences is represented in a consensus sequence. The algorithm then updates this consensus by aligning it with the i th sequence S_{ i }using the standard DP algorithm. Information from S_{ i }that is not aligned to the consensus sequence isessentially lost. See Figure 1(a).
The partial order alignment (POA) algorithm [3] alleviates this problem by encoding the consensus in a POG instead of a linear sequence (see Figure 1(b)). That is, the alignment of S_{1},..., S_{i1}is encoded in a partial order graph G_{ i }, which is then updated to G_{i+1}by aligning it with S_{ i }. Due to the partial order in a POG, the alignment between G_{ i }and S_{ i }can still be achieved by a DP algorithm. The POA algorithm reduces the influence of the order of the sequences aligned, and is able to capture alignments between a subset of sequences. More details of the POA algorithm and its variants can be found in [2, 3].
In our case, each trajectory is mapped to an ordered sequence of points (i.e, a polygonal curve), and a similar algorithm can be applied to our trajectory data: Instead of the usual 1D sequences, we now have dD sequences, where d is the dimension of each point. Note that since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves. The first stage of our EPO algorithm constructs a POG G with respect to the input set of trajectories \mathcal{T} using a modified POA algorithm. Below we explain the main differences between them.
A clustering preprocessing stage
The first problem with the standard POA algorithm is that the size of the POG graph maintained expands quickly when the level of similarity is low. For example, suppose we are updating the current POG G_{ i }to G_{i+1}by aligning it with a new curve T_{ i }. If a point p ∈ T_{ i }cannot be aligned to any node in G_{ i }, then it will create a new node in G_{i+1}, as this node may potentially be aligned later with the remaining curves. Consequently, if the similarity is sparse, many new nodes are created without really producing densely aligned nodes later and the size of the POG graph increases rapidly. This induces high computational complexity.
To address this problem, our algorithm first preprocesses all points from the input curves \mathcal{T} by clustering them into groups [37], the diameter of which is smaller than a user defined threshold, which is fixed as the distance threshold ε in our experiments. According to the clustering result, we only keep those points that belong to a cluster holding more than τ curves' points in it. For example, if the threshold is τ = 3 (i.e, we require each aligned node aligns points from at least 3 curves), we will prune out the points in such clusters that cover less than 3 curves. Meanwhile we collect cluster centers in C = {c_{1},..., c_{ r }}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes.
If, in the process of aligning T_{ i }with G_{ i }, a point p ∈ T_{ i }is not aligned to any node in G_{ i }, then we insert a new node in G_{i+1}only if p is within ε away from some canonical center from C – if p is far from all the canonical cluster centers, then there is little chance that p can form significant alignment with points from later curves, as that would have implied that p should belong to a dense cluster. We remark that this set of canonical cluster centers are not only used for shrinking the size of POG, but also used as a predictor of the new twostage scoring function that we will introduce shortly. There is also further advantage of pruning out unpromising points in the second merging stage of the EPO algorithm.
Scoring function
The choice of the scoring function when aligning G_{ i }= (V_{ i }, E_{ i }) with T_{ i }, is in general a crucial aspect of an alignment algorithm. A good scoring function will align as many points as possible globally. Given a point p ∈ T_{ i }and a node o ∈ G_{ i }, let δ(o, p) be the similarity between p and o, the definition of which will be described shortly. The score of aligning p with o is usually defined as:
where q is the parent of the point p along T_{ i }, and o' ranges over all immediate predecessors of o in the POG G_{ i }. It is easy to verify that such scores can be computed by a dynamic programming procedure due to the inherent order existing in both the trajectory and the POG.
A common way to define δ(o, p), the similarity between o and p, is as follows. Assume that each node o is associated with a node center ω (o) to represent all the points aligned to this node. Then
An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o.
Natural choices for the node center ω(o) of o include using an earlier computed canonical cluster center, or the center of the minimum enclosing ball of points already aligned to this node (or some weighted variants of it), which is a dynamic point relying on alignment order. The advantage of the former is that canonical cluster centers tend to spread apart, which helps to increase coverage of aligned nodes. Furthermore, the canonical cluster centers serve as good candidates for node centers as we already know that there are many points around them. The disadvantage is that it does not consider the distribution of points already aligned to this node. See Figure 7, where without considering the distribution of points aligned to o_{ a }and o_{ b }, the new point p will be aligned to o_{ b }even though o_{ a }is a better choice. Using the center of the minimum enclosing ball alleviates this problem. However, the influence regions of nodes produced this way tend to overlap much more than using the canonical cluster centers and the position of these centers also depend heavily on the order of curves aligned. We combine the advantages of both approaches into the following twolevel scoring function for measuring the similarity δ(o, p).
Specifically, for a node o, let q be the first point aligned to this node. This means that at the time we were examining q, q cannot be aligned to any existing node in the POG. Let c_{ k }∈ C be the nearest canonical cluster center of q – recall that the node o was created because q – c_{ k } ≤ ε. We add c_{ k }as a point aligned to this node, and at any time, the center of the minimum enclosing ball of currently aligned points, including c_{ k }, will be used as the node center ω(o). Now let
be the diameter of points currently aligned to o. We define that:
In other words, the new scoring function prefers centering points to be around previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives higher similarity score for points that are more tightly grouped together with those already aligned at current node, addressing the problem shown in Figure 7. Our experimental tests have shown that this two level scoring function significantly outperforms the ones using either only the canonical centers or only the centers of minimal enclosing balls. We remark that it is possible to use variants of the above twolevel scoring function, such as making it continuous (instead of being a step function). We choose the current form for its simplicity. Furthermore, experiments show that there is only marginal difference if we use the continuous version.
Merging stage
In the first stage, we have applied a progressive method to align each trajectory onto an alignment graph one by one. In the i th iteration, a point from T_{ i }is either aligned to the best matched node in the current POG G_{ i }, or a new node is created containing this point and the corresponding canonical cluster center, or discarded. After processing all of the N trajectories in order, we return a POG G = G_{ N }. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G by using a novel merging process.
Given the greedy nature of the POA algorithm, the alignment obtained in G is not optimal and depends on the alignment order. Furthermore, since the influence regions of different alignednodes may overlap, no matter how we improve the scoring function, sometimes it is simply ambiguous to decide locally where to align a new incoming point, and a wrong decision may have grave consequence later.
For example, see Figure 8, where the set of points P (enclosed in the dotted circle) should have been aligned to one node. However, suppose the nodes o_{ a }and o_{ b }already exist before any point in P is inserted.
Then as points from P come in, it is rather likely that they are distributed evenly into both o_{ a }and o_{ b }. This problem becomes much more severe in higher dimensions, where P can be distributed to several nodes whose centers are wellseparated around P, but whose influence regions still cover some points from P (the number of such regions grows exponentially w.r.t. the dimension d). Hence instead of being captured in one heavily aligned node, P is broken into many nodes with small size. Our experimental tests confirm that this is happening rather commonly in both standard and our modified POA algorithms.
To address this problem, we propose a novel postprocessing on G. The goal is to merge qualified points from neighboring lessaligned nodes to augment more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:
(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;
(I2) The partial order constraints induced by the POG graph are always consistent with the order of points along each trajectory.
The second criterion means that at any time in the POG graph G', if p ∈ o_{1}, q ∈ o_{2}, p, q ∈ T_{ i }and p precedes q along the trajectory T_{ i }, then either o_{1} \preco_{2}, or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of \mathcal{T} with respect to the same thresholds.
As an example, see Figure 6, where the point {p}_{2}^{1} (i.e, the second point of the trajectory T_{1}) in the node b in (b) is moved to the node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)), in order to maintain the invariants (I1) and (I2). When all points aligned to a specific node o are merged (thus moved) to other nodes (i.e, o becomes empty), we delete o, and its successors in the POG will then become the successors of its parent.
A high level pseudocode of the merging process is shown in Figure 9. It augments better aligned nodes from the current POG G by processing first the nodes with larger size. We perform this procedure a few times till there is no significant increase in the quality of the resulting alignment. In practice, to speed up the algorithm, we merge neighbors to a node o only if its size is greater than some threshold (fixed at half of the size threshold, i.e, τ/2, in our experiments), as otherwise, there is low probability that o will become a heavy node later.
References
Neidigh J, Fesinmeyer R, Andersen N: PDB ID:1L2Y Miniproteins Trp the light fantastic. Nat Struct Biol 2002, 9(6):425–430.
Grasso C, Lee C: Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 2004, 20(10):1546–1556.
Lee C, Grasso C, Sharlow M: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452–464.
Ota M, Ikeguchi M, Kidera A: Phylogeny of proteinfolding trajectories reveals a unique pathway to native structure. PNAS 2004, 101(51):17658–17663.
Borreguero JM, Ding F, Buldyrev SV, Stanley HE, Dokholyan NV: Multiple Folding Pathways of the SH3 Domain. ArXiv Physics eprints 2003., 87:
Levinthal C: Are there pathways for protein folding? J Chim Phys 1968, 65: 44–45.
Wolynes P, Onuchic J, Thirumalai D: Navigating the folding routes. Science 1995, 267: 1619–1920.
Abkevich VI, Gutin AM, Shakhnovich EI: Specific nucleus as the trasition state for protein folding: evidence from the lattice model. Biochemistry 1994, 33: 10026–10036.
Chiti F, Taddei N, White PM, Bucciantini M, Magherini F, Stefani M, Dobson CM: Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nat Struct Biol 1999, 6(11):1005–1009.
Dokholyan NV, Buldyrev SV, Stanley HE, Shakhnovich EI: Molecular dynamics studies of folding of a proteinlike model. Fold Des 1998, 3(6):577–587.
Lockless SW, Ranganathan R: Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families. Science 1999, 286(5438):295–299.
Du R, Pande VS, Grosberg AY, Tanaka T, Shakhnovich E: On the role of conformational geometry in protein folding. Journal of Chemical Physics 1999, 111: 10375–10380.
Kedem K, Chew L, Elber R: UnitVector RMS (URMS) as a Tool to Analyze Molecular Dynamics Trajectories. Proteins 1999, 37(4):554–564.
Gerstein M, Levitt M: Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Science 1998, 7: 445–456.
Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385.
Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233: 123–138.
Holm L, Sander C: Dali/FSSP classification of threedimensional protein folds. Nucleic Acids Res 1997, 25: 231–234.
Krissinel E, Henrick K: Secondarystructure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 12 Pt 1):2256–2268.
Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of optimal path. Protein Engineering 1998, 11(9):739–747.
Taylor W, Orengo C: SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996, 266: 617–35.
Sutcliffe MJ, Haneef I, Carney D, Blundell TL: Knowledge based moddelling of homologous proteins, part I: threedimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering 1987, 1(5):377–384.
Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family (Extended Abstract).2002. [http://citeseer.ist.psu.edu/596999.html]
Guda C, Lu S, Scheeff ED, Bourne PE, Shindyalov LN: CEMC: a multiple protein structure alignment server. Nucleic Acids Research 2004., 32: "W100–3".
Lupyan D, LeoMacias A, Ortiz AR: A new progressiveiterative algorithm for multiple structure alignment. Bioinformatics 2005, 21(15):3255–3263.
Ochagavía ME, Wodak S: Progressive Combinatorial Algorithm for Multiple Structural Alignments:Application to Distantly Related Proteins. Proteins 2004, 55: 436–454.
Orengo CA: CORATopological fingerprints for protein structural families. Protein Science 1999, 8: 699–715.
Sandelin E: Extracting multiple structural alignments from pairwise alignments:a comparison of a rigorous and heuristic approach. Bioinformatics 2005, 21(7):1002–1009.
Ye Y, Godzik A: Multiple flexible structure alignment using partial order graphs. Bioinformatics 2005, 21: 2362–2369.
Murzin A, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATHA Hierarchic Classification of Protein Domain Structures. Structure 1997, 5(8):1093–1108.
Caspi Y, Irani M: SpatioTemporal Alignment. Proc IEEE Transactions On Pattern Analysis and Machine Intelligence 2002, 1409–1424.
Koike R, Kinoshita K, Kidera A: Ring and Zipper formation is the key to understanding the structural variety in all β proteins. FEBS Letters 2003, 533: 9–13.
Shatsky M, Nussinov R, Wolfson HJ: MultiProt – A Multiple Protein Structural Alignment Algorithm. WABI '02: Proceedings of the Second International Workshop on Algorithms in Bioinformatics 2002, 235–250.
Lupyan D, LeoMacias A, Ortiz ARR: A new progressiveiterative algorithm for multiple structure alignment. Bioinformatics 2005, 3255–3263.
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453.
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197.
Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. ACM Comput Surv 1999, 31(3):264–323.
Acknowledgements
The work was supported in part by the Department of Energy (DOE) under grant DEFG0206ER25735 and in part by National Science Foundation under grant IIS0546713, CCF0747082, DBI0750891.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
HS and YW carried out the algorithm development, as well as drafted the manuscript. HS also implemented the algorithm and carried out all experiments. HF and YW conceived of the study, provided critical input in design and coordinated various aspects of the project. MO provided the experimental data. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Sun, H., Ferhatosmanoglu, H., Ota, M. et al. An enhanced partial order curve comparison algorithm and its application to analyzing protein folding trajectories. BMC Bioinformatics 9, 344 (2008). https://doi.org/10.1186/147121059344
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059344