 Methodology article
 Open Access
An enhanced partial order curve comparison algorithm and its application to analyzing protein folding trajectories
 Hong Sun^{1}Email author,
 Hakan Ferhatosmanoglu^{1},
 Motonori Ota^{2} and
 Yusu Wang^{1}
https://doi.org/10.1186/147121059344
© Sun et al; licensee BioMed Central Ltd. 2008
 Received: 14 April 2008
 Accepted: 18 August 2008
 Published: 18 August 2008
Abstract
Background
Understanding how proteins fold is essential to our quest in discovering how life works at the molecular level. Current computation power enables researchers to produce a huge amount of folding simulation data. Hence there is a pressing need to be able to interpret and identify novel folding features from them.
Results
In this paper, we model each folding trajectory as a multidimensional curve. We then develop an effective multiple curve comparison (MCC) algorithm, called the enhanced partial order (EPO) algorithm, to extract features from a set of diverse folding trajectories, including both successful and unsuccessful simulation runs. The EPO algorithm addresses several new challenges presented by comparing high dimensional curves coming from folding trajectories. A detailed case study on miniprotein Trpcage [1] demonstrates that our algorithm can detect similarities at rather low level, and extract biologically meaningful folding events.
Conclusion
The EPO algorithm is general and applicable to a wide range of applications. We demonstrate its generality and effectiveness by applying it to aligning multiple protein structures with low similarities. For user's convenience, we provide a web server for the algorithm at http://db.cse.ohiostate.edu/EPO.
Keywords
 Partial Order
 Candidate Motif
 Input Curve
 Folding Simulation
 Multiple Structure Alignment
Background
Proteins are the main agents in cells. From a chemical point of view, a protein molecule is a linear sequence of amino acids. This linear sequence, under appropriate physicochemical conditions, folds into a unique native structure rapidly. Understanding folding process is of paramount importance, especially since the outcome of it, namely the three dimensional protein structure, to a large extent decides the functionality of the molecule. Hence a lot of research has been devoted to investigating the kinetics of protein folding. In particular, modern (parallel) computation power makes it possible to perform largescale folding simulations. As a result, interpreting the huge amount of simulation data obtained becomes a crucial issue.
Given the highly stochastic nature of the protein motion, the study of protein fold usually relies on an ensemble of folding simulations including both successful and unsuccessful runs, which are trajectories that do or do not include a sequence of conformations leading to a near native conformation. Given such a diverse data set, scientists wish to answer questions such as what causes the folding process falling into different results, and what common properties are shared by the successful runs, but not the unsuccessful ones? To this end, it is highly desirable to be able to compare multiple folding trajectories and extract useful information from them.
In this paper, we model each protein folding trajectory as a multidimensional curve, and then present a novel multiplecurve comparison (MCC) algorithm to identify critical information from a set of trajectory curves in an automatic manner. In particular, we focus on the geometry of protein chain conformations throughout the folding process, and convert each conformation into a high dimensional point. The goal is to extract lists of ordered events common to successful runs but not to unsuccessful ones, such as discovering that a conformation B is always formed after A and followed by a conformation C before reaching a successful folding conformation. (Conformations A, B, and C may not be consecutive.) To this end, we develop an effective new multiple curves comparison algorithm called the enhanced partial order (EPO) algorithm, to capture similarities and dissimilarities between a set of input folding trajectories. The EPO algorithm is developed over the concept of POA (partial order alignment) [2, 3], but is greatly improved and extended in several aspects, especially in its sensitivity in detecting low level of similarity and its capability of handling high dimensional curves. Applying it to the folding trajectories of a miniprotein Trpcage [1] shows that our algorithm is able to automatically detect important critical folding events which are observed earlier [4] by biological methods. Our EPO algorithm is general, and we demonstrate its generality and effectiveness by also applying it to aligning multiple protein structures with low similarities.
Related work
Previously, folding simulations analysis is performed mainly for testing various protein folding models [5–7], such as the folding pathway model and the funnel model; and/or for studying energetic aspects of folding kinetics [8–11]. The geometric shapes of the conformations involved in folding trajectories have not been widely explored [4, 12, 13], despite their important role in folding. A particularly interesting work in this direction is by Ota et al. [4], where they investigated the folding trajectories of a miniprotein Trpcage using phylogenic tree combined with expert knowledge. However in general, an automatic tool to facilitate the folding simulations analysis at large scales is still missing. This paper provides an important step towards this goal by modeling folding trajectories as curves and using a new multiple curve comparison (MCC) algorithm to detect critical folding events.
The closest relative of our MCC problem in computational biology is the multiple structure alignment (MSTA) problem, which aims at aligning a family of protein structures, each modeled as a three dimensional polygonal curve to represent its backbone.
MSTA is a very hard problem. In fact, even the pairwise comparison problem of aligning two structures A and B is believed to be NPhard since one has to optimize simultaneously both the correspondence between A and B and the relative transformation of one structure with respect to the other. Numerous heuristicbased algorithms have been developed in practice for this fundamental problem [14–20]. If we have a set of k > 0 structures, then even the problem of aligning them optimally without considering transformations becomes intractable – it takes Ω(n^{ k }) time using the standard dynamic programming algorithm, where n is the size of each protein involved.
In practice, progressive methods are widely used to attack the MSTA problem [21]. For example, given a set of structures, many approaches start with a seed structure and then progressively align the remaining structures onto it one by one [22–28]. A consensus or core structure is typically built throughout, to maintain the common substructures among the proteins that are already aligned. At each round, usually only pairwise structure comparison is performed to align the current consensus with a new structure.
The above progressive MSTA framework is a greedy approach. Its performance depends on the underlying pairwise comparison methods used, the order of structures that are progressively aligned, as well as the consensus structure maintained. Various heuristics have been exploited to find a good order for the progressive alignments. Note that this order can also be guided by a tree instead of a linear sequence, which removes the need of choosing a seed structure. The progressive procedure may also be iterated several times to locally refine the multiple structure alignments.
Our results
There are two main differences between the MCC problem we are interested in and the traditional MSTA problem. In the case of protein structures, it is usually explicitly or implicitly assumed that the (majority of the) input proteins belong to one family (How to classify a set of input structures into different families is a related problem, and many such classifications exist [17, 29, 30]), or at least share some relations. As such, one can expect that some consensus of the family should exist. However in our case, the set of curves are from a set of simulations including both successful and unsuccessful runs, and we wish to classify this diverse set of curves, and capture common features within as well as across its subfamilies. Secondly and more importantly, the level of similarity existing in these folding trajectories is usually much lower than that in a family of related proteins. Hence we aim at an algorithm with high sensitivity, which is able to detect smallscaled partial similarity and handle multidimensional curves (trajectories) as well.
For the more important problem of sensitivity, we observe that being a greedy approach, the progressive MSTA framework tends to be inherently insensitive to low level of similarities – if one early local decision is wrong, it may completely miss a smallscaled partial similarity. To improve this aspect of the performance of the progressive framework, we first propose a novel twolevel scoring function to measure similarity, which, together with a clustering idea, greatly enhances the quality of the local pairwise alignment produced at each round. We then develop an effective merging step to postprocess the obtained alignments. This step helps to reassemble vertices (high dimensional points) from input curves that should be matched together, but were scattered in several subclusters in the alignments due to some earlier nonoptimal decisions. Both techniques are general and can be used to improve the performance of many existing MSTA algorithms. Experimental results show that our MCC algorithm is highly sensitive and able to classify input curves. We also demonstrate the power of our tool in mining critical events from protein folding trajectories using a detailed case study of a miniprotein Trpcage.
Although our EPO algorithm is developed with the goal of comparing folding trajectories, the algorithm is general and can be applied to other domains as well, such as protein structures or pedestrian trajectories extracted from surveillance videos [31]. In particular, in this paper, we demonstrate this generality by showing that the EPO algorithm can be used to improve the results of existing multiple protein structure alignment algorithms, especially when input proteins share low structural similarity.
Results
Experimental Results on trajectory data
In this section, we report a systematic performance study on a biological dataset that contains 200 molecular dynamics simulations. The experiments achieve the following goals: First, we show that the quality of the alignments produced by our EPO algorithm is significantly better than that of the original POA algorithm. Second, we demonstrate the effectiveness of our algorithm by applying it to real protein simulation data and obtaining biologically meaningful results that are consistent with previous discoveries [4].
Background of dataset

The RMSD for a conformation from the native NMR structure [1] is less than 2 Å.

A subsequence of such nearnative conformations holds for at least 200 ps.
In [4], 58 successful folding trajectories reaching successful folding events are identified, and each trajectory includes 101 successive conformations sampled at 20 ps interval. Furthermore, there are two crucial observations in [4] that we will examine in the our experiment. First, before moving to the native conformation, a "ring" substructure (see Figure 2) has to be formed. Second, the distinction of native and pseudonative confirmations heavily relies on sidechain position of "ring" substructure. Ota et al. [4] obtained the above results by aligning each pair of trajectories first and then applying a neighbor joining method to group similar trajectories together. However this semiautomatic approach requires dedicated expert knowledge. The following experiments applied on the same dataset show that our EPO algorithm can automatically detect the above folding events with little prior knowledge.
Experimental setting
In order to be consistent with the results from [4], we select all 58 successful folding events, and call it SuccData. We also randomly select 58 unsuccessful folding trajectories, each containing 101 conformations, and collect them in a set called FailData. The union of successful and unsuccessful data is referred to as the MixData. There are two parameters used in our experiments: ε, and τ. They are set as ε = 1 Å and τ = 40 in all our experiments unless specified otherwise. Their meaning will be explained in Section 3 Method.
Investigation on entire protein structure
The similarity level between these trajectories is low (i.e, the number of aligned nodes with large size is small). It is clear from this histogram that our EPO algorithm significantly outperforms the other two by producing more aligned nodes with large sizes. The comparison between EPO and EPONoMerge demonstrates the effectiveness of our merging procedure, and that EPONoMerge is better than POA shows that the twolevel scoring function as well as the clustering preprocessing greatly enhances the performance. We have also performed experiments which show that compared to the POA algorithm, EPONoMerge is much less sensitive to the order of curves aligned. Comparisons of the three algorithms over the MixData produces a similar result, and majority of points aligned to heavy nodes (i.e, o ≥ 40) are from successful runs (the results are not shown in this report).
We also observe that most of the heavily aligned nodes are close to the end of the trajectories for the SuccData. In fact, many aligned points have conformation IDs around and greater than 90, which is indeed the time that the folding starts to get stabilized. More specifically, consider the set of aligned nodes of size greater than 40 for the SuccData. Among all points aligned to these nodes, 67.2% has a conformation ID greater than 90, and 24.4% has an ID between 80 and 90. This implies that our algorithm has the potential to detect the stabilization of successful folding events in an automatic manner.
This also implies that using the entire protein structure may be too coarse to detect critical folding events, as they are usually induced by small key motifs. In what follows, we map only a substructure of the input protein into a multidimensional point and provide more detailed analysis of this folding data.
Investigation on substructures
It is usually believed that certain critical motifs play important roles which stabilize the whole structure in the folding process [6, 7]. We wish to have a tool that can identify such critical motifs (substructures) automatically. We define a candidate motif to be two subchains of Trpcage, each of length 4. These two pieces induce a subwindow in the distance map of each conformation of the protein. We further require that the number of contacts in this subwindow w.r.t. the distance map of the native structure is at least 4, where a contact corresponds to two alphacarbon atoms within distance 6 Å. We collect a set of candidate motifs based on these criteria.
Now for a candidate motif, for each of its conformation of along the trajectory, we consider the distance matrix between its alphacarbon atoms as before, and convert the folding trajectory of this motif into a curve in the 4 × 4 = 16 dimensional space. In order to be more discriminative, we also introduce a sidechain weighting factor α, ranging from 0 to 1, to include the side chain information when comparing two high dimensional points: α = 0 means that sidechain information is completely ignored (Roughly speaking, for every conformation, we record for each residue also the relative position of the centroid of its sidechain with respect to its alphacarbon atom. This provides another high dimensional point that we call a sidechain point. The distance between two conformations will combine the distance between their sidechain points by the sidechain weighting factor). We perform our EPO algorithm on both the SuccData and the MixData, and there are two motifs that especially stand out, which we describe below.
Alphahelix substructure
The first one corresponds to an alphahelix substructure. In Figure 2, five successive amino acids (No.2–7) form an alphahelix structure which is a simple, selfcontained secondary structure (SSE) [1]. From the results returned by our EPO algorithm, we note that this alphahelix is formed rather early consistently in both successful and unsuccessful runs. Once formed, it remains stable. This is consistent with the common conception that due to its chemical property, alphahelix is a stable secondary structure, and can be formed quickly. Hence the formation of alphahelix cannot be used to differentiate successful runs from unsuccessful ones.
Ringsubstructure
The second motif corresponds to the neck of a ring structure. In particular, it consists of the subchains of No. 2 – 5 and No. 16 – 19 amino acids. The following results demonstrate that EPO can automatically not only find but also track the formation of such fingerprint substructures (critical motif).
EPO on ring structure(MixData).
Classification  

Aligned Node ID  o_{ i } ≥ τ  SuccData  FailData  D(o) Å 
1  49  27  22  1.852 
2  45  28  17  1.798 
3  41  29  12  2.189 
4  40  31  9  1.447 
5  48  31  16  1.761 
6  40  32  8  1.322 
7  47  34  13  1.133 
8  42  35  7  1.923 
9  44  36  8  0.873 
10  49  42  7  1.428 
11  54  48  6  1.020 
12  59  50  9  1.294 
13  60  51  9  0.932 
14  56  52  4  1.255 
15  62  56  6  1.782 
16  62  58  4  1.503 
Second, when the sidechain weighting factor α = 0.9, it turns out that 49.6% of significant aligned nodes are formed before the conformation ID 85 (compared to results from Section). For example, there are two aligned nodes from the successful runs, where 80% of points (i.e. trajectories) aligned to them has a conformation ID between 75 – 85. This implies that the complete formation of this ringneck usually immediately precedes the stabilization of the folding structure (which is roughly at conformation ID 90 for successful trajectories).
If reducing the sidechain weighting factor α to 0.5, naturally, we found more aligned nodes. In particular, other than the cluster with conformations of IDs around 80, we observe more significant clusters involving conformations with IDs from 50–70. By comparing the conformations of the ringneck motif in these clusters with those in the aligned nodes around 80, we found that the backbone structures are rather similar, but the sidechains are of different orientations. In other words, the shape of the ringneck motif is first stabilized by the backbone structure, and then the sidechains gradually move into right position. There are a few trajectories where the sidechains eventually move to the mirror image of their correct positions, and lead to pseudonative conformations which can be detected when considering the sidechains.
The above results are consistent with the results from [4], where such a ringshaped substructure was discovered semiautomatically by pairwise structure comparisons together with expert knowledge.
Timing of EPO
Comparing processingtime.
MixData  SuccData  Faildata  

unit = Min.  
EPO  ≈ 30  ≈ 14  ≈ 7 
EPO w/o merging  ≈ 26  ≈ 12  ≈ 6 
POA  ≈ 49  ≈ 23  ≈ 33 
Experimental Results on protein structure data
Protein structural data set. "Average Size" is the average number of residues in each protein.
Data Set ID  Number of Structures  Average Size  PDB Codes 

Set 1  
Calciumbinding  6  140  4cpv2scpA2sas1top1scmB3icb 
Set 2  
Calmodulinlike  3  161  1jfjA1ncx2sas 
Set 3  
Timbarrels  7  391  1btc1pii1tml4enl5rubA6xia7timA 
Results of refinement on calciumbinding proteins(Set 1) across 4 algorithms.
Before EPO refinement  After EPO refinement  

NCORE  RMSD  NCORE  RMSD  
CEMC  57  3.02 Å  63  2.80 Å 
Multiprot  48  1.66 Å  52  1.56 Å 
MAMMOTH  14  1.01 Å  15  1.09 Å 
POSA  56  2.82 Å  59  2.78 Å 
Results of refinement on calmodulinlike proteins(set 2) across 4 algorithms.
Before EPO refinement  After EPO refinement  

NCORE  RMSD  NCORE  RMSD  
CEMC  62  5.85 Å  77  4.77 Å 
Multiprot  56  1.93 Å  60  1.90 Å 
MAMMOTH  15  0.94 Å  17  0.85 Å 
POSA  65  2.67 Å  68  2.34 Å 
Results of refinement on timbarrels proteins(Set 3) across 2 algorithms(CEMC and MAMMOTH failed to return any aligned core).
Before EPO refinement  After EPO refinement  

NCORE  RMSD(Å)  NCORE  RMSD(Å)  
Multiprot  24  1.82  29  1.83 
POSA  40  3.6  50  3.22 
In summary, the above results are similar to the case of comparing multiple folding trajectories. In particular, the twostage scoring function exploited by our EPO algorithm is effective and alleviates the ordering issue in progressive MSTA. The merging stage helps to produce better correspondences between input structures and makes the algorithm robust. Hence the EPO algorithm can be used as a postprocessing tool for current MSTA softwares to refine and optimize their alignment results.
Discussion and conclusion
In this paper we proposed and developed EPO, an effective multiple curve comparison method. Our primary goal is to analyze protein folding trajectories, although the algorithm is general, and can be applied to compare multiple protein structures as well. Our new method greatly improved the performance of the POA algorithm by using a clustering preprocessing, a more discriminative twolevel scoring function, as well as a novel merging postprocessing procedure. It can detect low level of similarity among input curves. We demonstrated the effectiveness of our method by applying it to a set of simulated folding trajectories of the miniprotein Trpcage. We also show the generality of our algorithm by using it to postprocess the results returned by several current multiple protein structure alignment (MTSA) algorithms.
Currently, we have only experimented the EPO algorithm with a miniprotein (Trpcage). One immediate question is to understand the scalability of the EPO algorithm for larger proteins or longer trajectories. In particular, a larger protein means a curve of higher dimensions. Our EPO algorithm seems to scale linearly with the number of participated trajectories, from current experiments. Furthermore, in practice, it is likely that we only perform the algorithm on small motifs. For longer trajectories, it seems that our algorithm scales in a quadratic manner. However, further experiments are necessary to investigate the scalability issue.
There are some previous work that analyze protein folding trajectories by collecting various statistics on measures such as the contact number (i.e, the number of native contacts) of each conformation along a trajectory and the URMS distance between a conformation and the native structure [13]. One way to view this is that a trajectory is mapped into a timeseries data representing the evolution of, say, the number of native contacts, which can be considered as a onedimensional curve. In this regard, we can use our EPO algorithm to analyze a collection of such curves induced by one measure. In general, there may be multiple measures, geometric or physiochemical, that a user may wish to inspect. Hence it is highly desirable to build a framework for analyzing folding trajectories that can incorporate these multiple measures, and that also enables the addition of new properties easily. This is one important future direction for us.
Finally, at this stage, the EPO algorithm can only be used to find the best correspondences between (multiple) curves (protein structures) for fixed transformations. That is why current experiments in Section 5 only use the EPO to refine the MSTA results by other algorithms (so that there are good initial transformations). We are currently developing an independent MSTA software to align multiple protein structures based on the EPO algorithm (i.e, solving both the alignment and the superimposition problems). We also remark that compared to other multiple curve alignment algorithms, our algorithm is specifically effective at capturing low level of similarities. As such, another important future direction is to extract structural motifs from a family of proteins that may not share high global similarity.
Methods
In this section, we describe our EPO algorithm for comparing a set of possibly high dimensional general curves. If we are given a set of protein folding data, we first convert each folding trajectory to a high dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing different states of this protein at different time steps during the simulation of its folding process. We represent each conformation using the distance map between its alphacarbon atoms so that it is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is a n × n matrix M where M [i] [j] equals the distance between the i th and j th alphacarbon atoms along the protein backbone. This matrix can then be considered as a point in the n^{2} dimensions. This way, we map each trajectory of m conformations to a curve in ${\mathbb{R}}^{{n}^{2}}$ with m vertices. We remark that one can also encode the sidechain information into the high dimensional curves, or map the trajectory of a substructure into a high dimensional curve. We will use such more refined high dimensional curves in most of our experiments as well. In the remaining part of this paper, we use the terms trajectories and curves interchangeably.
Notations and algorithm overview
Let $\mathcal{T}$ = {T_{1},..., T_{ N }} be a set of N trajectories in ℝ^{ d }, where each trajectory T_{ i }is an ordered sequence of n points ${p}_{1}^{i},\mathrm{...},{p}_{n}^{i}$. (For simplicity, we assume without loss of generality that all T_{ i }s have the same length n.) The goal of the MCC algorithm is to find aligned subsequences from $\mathcal{T}$.
More formally, an aligned node o is a collection of vertices from T_{ i }s, with at most one point from each T_{ i }. Given a 3tuple ($\mathcal{T}$, τ, ε), where τ and ε are input thresholds, an alignment of $\mathcal{T}$ is a POG G with the corresponding set of partial order constraints R and a partial order list of aligned nodes $\mathcal{O}$ = {o_{1},..., o_{ L }} such that the following three criteria are satisfied:
C1. o_{ k } ≥ τ, for any k ∈ [1, L];
C2. for any ${p}_{j}^{i},{p}_{{j}^{\prime}}^{{i}^{\prime}}\in {o}_{k}$, $\left\right{p}_{j}^{i}{p}_{{j}^{\prime}}^{{i}^{\prime}}\left\right\le \epsilon $;
C3. if ${p}_{j}^{i}\in {o}_{{k}_{1}}$ and ${p}_{{j}^{\prime}}^{{i}^{\prime}}\in {o}_{{k}_{2}}$ with ${o}_{{k}_{1}}\prec {o}_{{k}_{2}}$, then j <j'.
Algorithm overview
At a high level, the EPO algorithm has two stages (see Figure 6): (S1) initial POG construction stage and (S2) merging stage. The first stage generates an initial alignment for $\mathcal{T}$, encoded in a POG G. The procedure has the same framework as the POA algorithm, but its performance, especially when the similarity is low, is significantly improved, via the use of a clustering preprocessing step and a new twolevel scoring function. In the second stage, we develop a novel and effective procedure to merge nodes from G to output a better final alignment G*. Below, we describe each stage in detail.
Initial POG construction
Standard dynamic programming (DP) [35, 36] is an effective method for pairwise comparison between sequences. It produces an optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequences alignment progressively based on this DP pairwise comparison method. Roughly speaking, in the i th round of the algorithm, the alignment of the first i – 1 sequences is represented in a consensus sequence. The algorithm then updates this consensus by aligning it with the i th sequence S_{ i }using the standard DP algorithm. Information from S_{ i }that is not aligned to the consensus sequence isessentially lost. See Figure 1(a).
The partial order alignment (POA) algorithm [3] alleviates this problem by encoding the consensus in a POG instead of a linear sequence (see Figure 1(b)). That is, the alignment of S_{1},..., S_{i1}is encoded in a partial order graph G_{ i }, which is then updated to G_{i+1}by aligning it with S_{ i }. Due to the partial order in a POG, the alignment between G_{ i }and S_{ i }can still be achieved by a DP algorithm. The POA algorithm reduces the influence of the order of the sequences aligned, and is able to capture alignments between a subset of sequences. More details of the POA algorithm and its variants can be found in [2, 3].
In our case, each trajectory is mapped to an ordered sequence of points (i.e, a polygonal curve), and a similar algorithm can be applied to our trajectory data: Instead of the usual 1D sequences, we now have dD sequences, where d is the dimension of each point. Note that since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves. The first stage of our EPO algorithm constructs a POG G with respect to the input set of trajectories $\mathcal{T}$ using a modified POA algorithm. Below we explain the main differences between them.
A clustering preprocessing stage
The first problem with the standard POA algorithm is that the size of the POG graph maintained expands quickly when the level of similarity is low. For example, suppose we are updating the current POG G_{ i }to G_{i+1}by aligning it with a new curve T_{ i }. If a point p ∈ T_{ i }cannot be aligned to any node in G_{ i }, then it will create a new node in G_{i+1}, as this node may potentially be aligned later with the remaining curves. Consequently, if the similarity is sparse, many new nodes are created without really producing densely aligned nodes later and the size of the POG graph increases rapidly. This induces high computational complexity.
To address this problem, our algorithm first preprocesses all points from the input curves $\mathcal{T}$ by clustering them into groups [37], the diameter of which is smaller than a user defined threshold, which is fixed as the distance threshold ε in our experiments. According to the clustering result, we only keep those points that belong to a cluster holding more than τ curves' points in it. For example, if the threshold is τ = 3 (i.e, we require each aligned node aligns points from at least 3 curves), we will prune out the points in such clusters that cover less than 3 curves. Meanwhile we collect cluster centers in C = {c_{1},..., c_{ r }}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes.
If, in the process of aligning T_{ i }with G_{ i }, a point p ∈ T_{ i }is not aligned to any node in G_{ i }, then we insert a new node in G_{i+1}only if p is within ε away from some canonical center from C – if p is far from all the canonical cluster centers, then there is little chance that p can form significant alignment with points from later curves, as that would have implied that p should belong to a dense cluster. We remark that this set of canonical cluster centers are not only used for shrinking the size of POG, but also used as a predictor of the new twostage scoring function that we will introduce shortly. There is also further advantage of pruning out unpromising points in the second merging stage of the EPO algorithm.
Scoring function
where q is the parent of the point p along T_{ i }, and o' ranges over all immediate predecessors of o in the POG G_{ i }. It is easy to verify that such scores can be computed by a dynamic programming procedure due to the inherent order existing in both the trajectory and the POG.
An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o.
In other words, the new scoring function prefers centering points to be around previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives higher similarity score for points that are more tightly grouped together with those already aligned at current node, addressing the problem shown in Figure 7. Our experimental tests have shown that this two level scoring function significantly outperforms the ones using either only the canonical centers or only the centers of minimal enclosing balls. We remark that it is possible to use variants of the above twolevel scoring function, such as making it continuous (instead of being a step function). We choose the current form for its simplicity. Furthermore, experiments show that there is only marginal difference if we use the continuous version.
Merging stage
In the first stage, we have applied a progressive method to align each trajectory onto an alignment graph one by one. In the i th iteration, a point from T_{ i }is either aligned to the best matched node in the current POG G_{ i }, or a new node is created containing this point and the corresponding canonical cluster center, or discarded. After processing all of the N trajectories in order, we return a POG G = G_{ N }. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G by using a novel merging process.
Given the greedy nature of the POA algorithm, the alignment obtained in G is not optimal and depends on the alignment order. Furthermore, since the influence regions of different alignednodes may overlap, no matter how we improve the scoring function, sometimes it is simply ambiguous to decide locally where to align a new incoming point, and a wrong decision may have grave consequence later.
Then as points from P come in, it is rather likely that they are distributed evenly into both o_{ a }and o_{ b }. This problem becomes much more severe in higher dimensions, where P can be distributed to several nodes whose centers are wellseparated around P, but whose influence regions still cover some points from P (the number of such regions grows exponentially w.r.t. the dimension d). Hence instead of being captured in one heavily aligned node, P is broken into many nodes with small size. Our experimental tests confirm that this is happening rather commonly in both standard and our modified POA algorithms.
To address this problem, we propose a novel postprocessing on G. The goal is to merge qualified points from neighboring lessaligned nodes to augment more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:
(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;
(I2) The partial order constraints induced by the POG graph are always consistent with the order of points along each trajectory.
The second criterion means that at any time in the POG graph G', if p ∈ o_{1}, q ∈ o_{2}, p, q ∈ T_{ i }and p precedes q along the trajectory T_{ i }, then either o_{1} $\prec $o_{2}, or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of $\mathcal{T}$ with respect to the same thresholds.
As an example, see Figure 6, where the point ${p}_{2}^{1}$ (i.e, the second point of the trajectory T_{1}) in the node b in (b) is moved to the node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)), in order to maintain the invariants (I1) and (I2). When all points aligned to a specific node o are merged (thus moved) to other nodes (i.e, o becomes empty), we delete o, and its successors in the POG will then become the successors of its parent.
Declarations
Acknowledgements
The work was supported in part by the Department of Energy (DOE) under grant DEFG0206ER25735 and in part by National Science Foundation under grant IIS0546713, CCF0747082, DBI0750891.
Authors’ Affiliations
References
 Neidigh J, Fesinmeyer R, Andersen N: PDB ID:1L2Y Miniproteins Trp the light fantastic. Nat Struct Biol 2002, 9(6):425–430.View ArticlePubMedGoogle Scholar
 Grasso C, Lee C: Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 2004, 20(10):1546–1556.View ArticlePubMedGoogle Scholar
 Lee C, Grasso C, Sharlow M: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452–464.View ArticlePubMedGoogle Scholar
 Ota M, Ikeguchi M, Kidera A: Phylogeny of proteinfolding trajectories reveals a unique pathway to native structure. PNAS 2004, 101(51):17658–17663.PubMed CentralView ArticlePubMedGoogle Scholar
 Borreguero JM, Ding F, Buldyrev SV, Stanley HE, Dokholyan NV: Multiple Folding Pathways of the SH3 Domain. ArXiv Physics eprints 2003., 87:Google Scholar
 Levinthal C: Are there pathways for protein folding? J Chim Phys 1968, 65: 44–45.Google Scholar
 Wolynes P, Onuchic J, Thirumalai D: Navigating the folding routes. Science 1995, 267: 1619–1920.View ArticlePubMedGoogle Scholar
 Abkevich VI, Gutin AM, Shakhnovich EI: Specific nucleus as the trasition state for protein folding: evidence from the lattice model. Biochemistry 1994, 33: 10026–10036.View ArticlePubMedGoogle Scholar
 Chiti F, Taddei N, White PM, Bucciantini M, Magherini F, Stefani M, Dobson CM: Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nat Struct Biol 1999, 6(11):1005–1009.View ArticlePubMedGoogle Scholar
 Dokholyan NV, Buldyrev SV, Stanley HE, Shakhnovich EI: Molecular dynamics studies of folding of a proteinlike model. Fold Des 1998, 3(6):577–587.View ArticlePubMedGoogle Scholar
 Lockless SW, Ranganathan R: Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families. Science 1999, 286(5438):295–299.View ArticlePubMedGoogle Scholar
 Du R, Pande VS, Grosberg AY, Tanaka T, Shakhnovich E: On the role of conformational geometry in protein folding. Journal of Chemical Physics 1999, 111: 10375–10380.View ArticleGoogle Scholar
 Kedem K, Chew L, Elber R: UnitVector RMS (URMS) as a Tool to Analyze Molecular Dynamics Trajectories. Proteins 1999, 37(4):554–564.View ArticlePubMedGoogle Scholar
 Gerstein M, Levitt M: Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Science 1998, 7: 445–456.PubMed CentralView ArticlePubMedGoogle Scholar
 Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385.View ArticlePubMedGoogle Scholar
 Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233: 123–138.View ArticlePubMedGoogle Scholar
 Holm L, Sander C: Dali/FSSP classification of threedimensional protein folds. Nucleic Acids Res 1997, 25: 231–234.PubMed CentralView ArticlePubMedGoogle Scholar
 Krissinel E, Henrick K: Secondarystructure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 12 Pt 1):2256–2268.View ArticlePubMedGoogle Scholar
 Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of optimal path. Protein Engineering 1998, 11(9):739–747.View ArticlePubMedGoogle Scholar
 Taylor W, Orengo C: SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996, 266: 617–35.View ArticlePubMedGoogle Scholar
 Sutcliffe MJ, Haneef I, Carney D, Blundell TL: Knowledge based moddelling of homologous proteins, part I: threedimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering 1987, 1(5):377–384.View ArticlePubMedGoogle Scholar
 Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family (Extended Abstract).2002. [http://citeseer.ist.psu.edu/596999.html]Google Scholar
 Guda C, Lu S, Scheeff ED, Bourne PE, Shindyalov LN: CEMC: a multiple protein structure alignment server. Nucleic Acids Research 2004., 32: "W100–3".Google Scholar
 Lupyan D, LeoMacias A, Ortiz AR: A new progressiveiterative algorithm for multiple structure alignment. Bioinformatics 2005, 21(15):3255–3263.View ArticlePubMedGoogle Scholar
 Ochagavía ME, Wodak S: Progressive Combinatorial Algorithm for Multiple Structural Alignments:Application to Distantly Related Proteins. Proteins 2004, 55: 436–454.View ArticlePubMedGoogle Scholar
 Orengo CA: CORATopological fingerprints for protein structural families. Protein Science 1999, 8: 699–715.PubMed CentralView ArticlePubMedGoogle Scholar
 Sandelin E: Extracting multiple structural alignments from pairwise alignments:a comparison of a rigorous and heuristic approach. Bioinformatics 2005, 21(7):1002–1009.PubMed CentralView ArticlePubMedGoogle Scholar
 Ye Y, Godzik A: Multiple flexible structure alignment using partial order graphs. Bioinformatics 2005, 21: 2362–2369.View ArticlePubMedGoogle Scholar
 Murzin A, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.PubMedGoogle Scholar
 Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATHA Hierarchic Classification of Protein Domain Structures. Structure 1997, 5(8):1093–1108.View ArticlePubMedGoogle Scholar
 Caspi Y, Irani M: SpatioTemporal Alignment. Proc IEEE Transactions On Pattern Analysis and Machine Intelligence 2002, 1409–1424.Google Scholar
 Koike R, Kinoshita K, Kidera A: Ring and Zipper formation is the key to understanding the structural variety in all β proteins. FEBS Letters 2003, 533: 9–13.View ArticlePubMedGoogle Scholar
 Shatsky M, Nussinov R, Wolfson HJ: MultiProt – A Multiple Protein Structural Alignment Algorithm. WABI '02: Proceedings of the Second International Workshop on Algorithms in Bioinformatics 2002, 235–250.View ArticleGoogle Scholar
 Lupyan D, LeoMacias A, Ortiz ARR: A new progressiveiterative algorithm for multiple structure alignment. Bioinformatics 2005, 3255–3263.Google Scholar
 Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453.View ArticlePubMedGoogle Scholar
 Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197.View ArticlePubMedGoogle Scholar
 Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. ACM Comput Surv 1999, 31(3):264–323.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.