In this section, we describe our EPO algorithm for comparing a set of possibly high-dimensional general curves. Given a set of protein folding data, we first convert each folding trajectory to a high-dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing the states of the protein at different time steps during the simulation of its folding process. We represent each conformation by the distance map between its alpha-carbon atoms, so that the representation is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is an n × n matrix M where M[i][j] equals the distance between the i-th and j-th alpha-carbon atoms along the protein backbone. This matrix can then be regarded as a point in n²-dimensional space. In this way, we map each trajectory of m conformations to a curve with m vertices in n²-dimensional space. We remark that one can also encode side-chain information into the high-dimensional curves, or map the trajectory of a substructure to a high-dimensional curve. We use such refined high-dimensional curves in most of our experiments. In the remainder of this paper, we use the terms trajectory and curve interchangeably.
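As an illustration, this conversion can be sketched as follows in Python, assuming each conformation is supplied as an n × 3 array of alpha-carbon coordinates (the function names are ours and purely illustrative):

```python
import numpy as np

def conformation_to_distance_map(ca_coords):
    """Map one conformation (an n x 3 array of alpha-carbon coordinates)
    to its n x n distance map, which is invariant under rigid motions."""
    ca_coords = np.asarray(ca_coords)
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]   # shape (n, n, 3)
    return np.linalg.norm(diff, axis=-1)                   # shape (n, n)

def trajectory_to_curve(conformations):
    """Map a folding trajectory (m conformations) to a curve with m vertices:
    each vertex is the flattened distance map, i.e. a point in n^2 dimensions."""
    return np.stack([conformation_to_distance_map(c).ravel()
                     for c in conformations])               # shape (m, n^2)
```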
Notations and algorithm overview
Before we formally define the MCC problem, we introduce some necessary notations. Given a set of elements V = {v_1, ..., v_l}, a relation ≺ over V is transitive if v_i ≺ v_j and v_j ≺ v_k imply that v_i ≺ v_k. In this paper, we also refer to v_i ≺ v_j as a partial order constraint. A partial order graph (POG) G = (V, E) is a directed acyclic graph with V = {v_1, ..., v_l}, where v_i ≺ v_j if there is an edge (v_i, v_j). Note that, by the transitivity of this relation, two nodes may have a partial order constraint even when there is no edge between them in G. Let R be the set of partial order constraints induced by G. We say that a permutation Π(V) of V is a partial order list w.r.t. G if for any v_i ≺ v_j ∈ R, v_i appears before v_j in the permutation Π(V). In other words, the linear order in Π(V) is a total order satisfying all partial order constraints induced from G. See Figure 5 for an example.
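For illustration, the sketch below computes one partial order list of a POG by a standard topological sort; the adjacency-list representation and the names are illustrative.

```python
from collections import defaultdict, deque

def partial_order_list(num_nodes, edges):
    """Return one permutation of the nodes 0..num_nodes-1 that respects every
    partial order constraint induced by the DAG (a topological sort)."""
    succ = defaultdict(list)
    indeg = [0] * num_nodes
    for u, v in edges:              # edge (u, v) encodes the constraint u before v
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(num_nodes) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("graph is not acyclic")
    return order

# Example: with edges (0,1), (0,2), (1,3), (2,3), both [0,1,2,3] and
# [0,2,1,3] are valid partial order lists; either satisfies all constraints.
```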
Let 𝒯 = {T_1, ..., T_N} be a set of N trajectories in ℝ^d, where each trajectory T_i is an ordered sequence of n points p_{i,1}, ..., p_{i,n}. (For simplicity, we assume without loss of generality that all T_i's have the same length n.) The goal of the MCC algorithm is to find aligned sub-sequences from 𝒯.
More formally, an aligned node o is a collection of vertices from the T_i's, with at most one point from each T_i. Given a 3-tuple (𝒯, τ, ε), where τ and ε are input thresholds, an alignment of 𝒯 is a POG G with the corresponding set of partial order constraints R and a partial order list of aligned nodes 𝒪 = {o_1, ..., o_L} such that the following three criteria are satisfied:
C1. |o_k| ≥ τ, for any k ∈ [1, L];
C2. for any p, q ∈ o_k, ||p − q|| ≤ ε;
C3. if p_{i,j} ∈ o_k and p_{i,j'} ∈ o_{k'} with o_k ≺ o_{k'}, then j < j'.
(C1) indicates that the number of vertices of input curves aligned to each aligned node o_k is at least the size threshold τ. (C2) means that these aligned points are tightly clustered together (i.e., their diameter is bounded by the distance threshold ε). (C3) enforces that points in different aligned nodes still maintain their partial order along their respective trajectories; that is, the partial order among the o_k's is consistent with the order of the points along each trajectory. Our goal is to maximize L, the size of such an alignment 𝒪. See Figure 6(b) for an example of an alignment graph.
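For illustration, the following sketch checks whether a candidate alignment satisfies (C1)–(C3); the data layout (each aligned node stored as a map from trajectory index to the position and coordinates of its point, and the constraint set assumed to be transitively closed) is an assumption made for this sketch only.

```python
import itertools
import numpy as np

def satisfies_criteria(nodes, constraints, tau, eps):
    """Check (C1)-(C3) for a candidate alignment.

    nodes       : list of aligned nodes; node k is a dict
                  {trajectory index i: (position j along T_i, point as np.array)}
    constraints : iterable of pairs (k, k2) meaning node k precedes node k2
                  in the POG (assumed transitively closed for this check)
    """
    for node in nodes:
        # C1: each aligned node covers at least tau trajectories.
        if len(node) < tau:
            return False
        # C2: the points aligned to one node have diameter at most eps.
        pts = [p for _, p in node.values()]
        for p, q in itertools.combinations(pts, 2):
            if np.linalg.norm(p - q) > eps:
                return False
    # C3: the partial order of nodes agrees with the order along each trajectory.
    for k, k2 in constraints:
        for i in set(nodes[k]) & set(nodes[k2]):
            if nodes[k][i][0] >= nodes[k2][i][0]:
                return False
    return True
```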
Algorithm overview
At a high level, the EPO algorithm has two stages (see Figure 6): (S1) an initial POG construction stage and (S2) a merging stage. The first stage generates an initial alignment for 𝒯, encoded in a POG G. The procedure follows the same framework as the POA algorithm, but its performance, especially when the similarity among input curves is low, is significantly improved via a clustering preprocessing step and a new two-level scoring function. In the second stage, we develop a novel and effective procedure to merge nodes of G to output a better final alignment G*. Below, we describe each stage in detail.
Initial POG construction
Standard dynamic programming (DP) [35, 36] is an effective method for pairwise comparison of sequences. It produces an optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequence alignment progressively based on this pairwise DP comparison. Roughly speaking, in the i-th round of the algorithm, the alignment of the first i − 1 sequences is represented as a consensus sequence. The algorithm then updates this consensus by aligning it with the i-th sequence S_i using the standard DP algorithm. Information from S_i that is not aligned to the consensus sequence is essentially lost. See Figure 1(a).
The partial order alignment (POA) algorithm [3] alleviates this problem by encoding the consensus in a POG instead of a linear sequence (see Figure 1(b)). That is, the alignment of S_1, ..., S_{i−1} is encoded in a partial order graph G_i, which is then updated to G_{i+1} by aligning it with S_i. Due to the partial order in a POG, the alignment between G_i and S_i can still be computed by a DP algorithm. The POA algorithm reduces the influence of the order in which sequences are aligned, and is able to capture alignments between a subset of the sequences. More details on the POA algorithm and its variants can be found in [2, 3].
In our case, each trajectory is mapped to an ordered sequence of points (i.e., a polygonal curve), and a similar algorithm can be applied to our trajectory data: instead of the usual 1D sequences, we now have d-dimensional sequences, where d is the dimension of each point. Note that since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves. The first stage of our EPO algorithm constructs a POG G with respect to the input set of trajectories using a modified POA algorithm. Below we explain the main differences between our modification and the standard POA algorithm.
A clustering preprocessing stage
The first problem with the standard POA algorithm is that the size of the maintained POG expands quickly when the level of similarity is low. For example, suppose we are updating the current POG G_i to G_{i+1} by aligning it with a new curve T_i. If a point p ∈ T_i cannot be aligned to any node in G_i, then it creates a new node in G_{i+1}, as this node may potentially be aligned later with points from the remaining curves. Consequently, if the similarity is sparse, many new nodes are created without producing densely aligned nodes later, and the size of the POG increases rapidly. This induces high computational cost.
To address this problem, our algorithm first preprocesses all points from the input curves by clustering them into groups [37] whose diameter is smaller than a user-defined threshold, fixed as the distance threshold ε in our experiments. Based on the clustering result, we keep only those points that belong to a cluster containing points from at least τ curves. For example, if the threshold is τ = 3 (i.e., we require each aligned node to align points from at least 3 curves), then we prune out the points in clusters that cover fewer than 3 curves. Meanwhile, we collect the cluster centers in C = {c_1, ..., c_r}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes.
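This preprocessing step can be sketched as follows, assuming a diameter-bounded clustering (e.g., the method cited as [37]) has already assigned a cluster label to every point; representing each canonical center by the cluster centroid is an illustrative simplification.

```python
import numpy as np

def preprocess(points, curve_ids, cluster_labels, tau):
    """Keep only points whose cluster covers at least tau distinct curves,
    and collect one canonical center per surviving cluster.

    points         : (M, d) array of all vertices from all input curves
    curve_ids      : length-M array; curve_ids[k] = index of the curve point k comes from
    cluster_labels : length-M array produced by a diameter-bounded clustering
                     (diameter <= eps)
    """
    points = np.asarray(points)
    curve_ids = np.asarray(curve_ids)
    cluster_labels = np.asarray(cluster_labels)

    keep = np.zeros(len(points), dtype=bool)
    canonical_centers = []
    for label in np.unique(cluster_labels):
        members = np.where(cluster_labels == label)[0]
        if len(np.unique(curve_ids[members])) >= tau:   # cluster spans >= tau curves
            keep[members] = True
            canonical_centers.append(points[members].mean(axis=0))
    return keep, np.array(canonical_centers)
```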
If, in the process of aligning T_i with G_i, a point p ∈ T_i is not aligned to any node in G_i, then we insert a new node into G_{i+1} only if p is within ε of some canonical center in C. If p is far from all canonical cluster centers, there is little chance that p can form a significant alignment with points from later curves, as that would have implied that p belongs to a dense cluster. We remark that this set of canonical cluster centers is used not only to shrink the size of the POG, but also as a predictor in the new two-level scoring function that we introduce shortly. Pruning out unpromising points also benefits the second, merging stage of the EPO algorithm.
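The insertion rule can be sketched as follows; the helper below is illustrative and simply returns the nearest canonical center when p lies within ε of it, and nothing otherwise (in which case p is discarded).

```python
import numpy as np

def try_insert_new_node(p, canonical_centers, eps):
    """Decide whether an unaligned point p should open a new POG node.
    Returns the nearest canonical center if p is within eps of it, else None
    (so p is discarded rather than inflating the POG)."""
    if len(canonical_centers) == 0:
        return None
    dists = np.linalg.norm(canonical_centers - p, axis=1)
    k = int(np.argmin(dists))
    return canonical_centers[k] if dists[k] <= eps else None
```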
Scoring function
The choice of the scoring function used when aligning G_i = (V_i, E_i) with T_i is, in general, a crucial aspect of an alignment algorithm. A good scoring function will align as many points as possible globally. Given a point p ∈ T_i and a node o ∈ G_i, let δ(o, p) be the similarity between p and o, the definition of which is described shortly. The score of aligning p with o is usually defined as:

score(o, p) = δ(o, p) + max_{o'} score(o', q),

where q is the parent of the point p along T_i, and o' ranges over all immediate predecessors of o in the POG G_i. It is easy to verify that such scores can be computed by a dynamic programming procedure, due to the inherent order in both the trajectory and the POG.
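A simplified, match-only sketch of this recursion is shown below; it assumes the POG nodes are given in topological order, uses a zero base case when p has no aligned predecessor, and omits gap handling and traceback.

```python
NEG_INF = float("-inf")

def align_curve_to_pog(points, topo_nodes, predecessors, delta):
    """score[o][j]: best score of an alignment ending with point j of the
    curve matched to POG node o.  Match-only sketch of the recursion
    score(o, p) = delta(o, p) + max over predecessors o' of score(o', q),
    where q is the parent of p along the curve."""
    n = len(points)
    score = {o: [NEG_INF] * n for o in topo_nodes}
    for o in topo_nodes:                    # topological order of the POG
        for j, p in enumerate(points):      # order along the trajectory
            d = delta(o, p)
            if d == NEG_INF:
                continue                    # p cannot be aligned to o at all
            best_prev = 0.0                 # p may start a new aligned path
            for o_prev in predecessors[o]:
                if j > 0 and score[o_prev][j - 1] > best_prev:
                    best_prev = score[o_prev][j - 1]
            score[o][j] = d + best_prev
    return score
```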
A common way to define δ(o, p), the similarity between o and p, is as follows. Assume that each node o is associated with a node center ω(o) that represents all the points aligned to this node. Then

δ(o, p) = 1 if ||p − ω(o)|| ≤ ε, and δ(o, p) = −∞ otherwise.     (1)

An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o.
Natural choices for the node center ω(o) of o include an earlier computed canonical cluster center, or the center of the minimum enclosing ball of the points already aligned to this node (or some weighted variant of it), which is a dynamic point that depends on the alignment order. The advantage of the former is that canonical cluster centers tend to be spread apart, which helps increase the coverage of aligned nodes. Furthermore, the canonical cluster centers serve as good candidates for node centers because we already know that there are many points around them. The disadvantage is that this choice does not consider the distribution of the points already aligned to the node. See Figure 7, where, without considering the distribution of points aligned to o_a and o_b, the new point p is aligned to o_b even though o_a is a better choice. Using the center of the minimum enclosing ball alleviates this problem. However, the influence regions of nodes produced this way tend to overlap much more than when using the canonical cluster centers, and the positions of these centers also depend heavily on the order in which the curves are aligned. We combine the advantages of both approaches into the following two-level scoring function for measuring the similarity δ(o, p).
Specifically, for a node o, let q be the first point aligned to this node. This means that at the time we examined q, it could not be aligned to any existing node in the POG. Let c_k ∈ C be the nearest canonical cluster center of q; recall that the node o was created because ||q − c_k|| ≤ ε. We add c_k as a point aligned to this node, and at any time, the center of the minimum enclosing ball of the currently aligned points, including c_k, is used as the node center ω(o). Now let D(o, p) be the diameter of the points currently aligned to o together with p. We define:

δ(o, p) = 2 if ||p − ω(o)|| ≤ ε and D(o, p) ≤ ε; δ(o, p) = 1 if ||p − ω(o)|| ≤ ε and D(o, p) > ε; and δ(o, p) = −∞ otherwise.     (2)
In other words, the new scoring function prefers node centers that stay close to previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives a higher similarity score to points that are more tightly grouped together with those already aligned at the current node, addressing the problem shown in Figure 7. Our experimental tests show that this two-level scoring function significantly outperforms the ones using only the canonical centers or only the centers of minimum enclosing balls. We remark that it is possible to use variants of the above two-level scoring function, such as making it continuous (instead of a step function). We choose the current form for its simplicity. Furthermore, experiments show only a marginal difference if we use the continuous version.
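The sketch below is one plausible instantiation of this two-level scoring; the particular score values, and the use of a centroid in place of the exact minimum-enclosing-ball center, are simplifications for illustration only.

```python
import numpy as np

NEG_INF = float("-inf")

class AlignedNode:
    """POG node: stores its aligned points plus the canonical cluster center
    that seeded it; the node center approximates the minimum-enclosing-ball
    center (here simply the centroid, for brevity)."""
    def __init__(self, canonical_center):
        self.points = [np.asarray(canonical_center)]

    def center(self):
        return np.mean(self.points, axis=0)

    def diameter_with(self, p):
        """Diameter of the currently aligned points together with p."""
        pts = self.points + [np.asarray(p)]
        return max(np.linalg.norm(a - b)
                   for i, a in enumerate(pts) for b in pts[i + 1:])

def delta(node, p, eps, tight_score=2.0, loose_score=1.0):
    """Two-level similarity: reject points outside the influence region,
    and prefer points that keep the node tightly clustered."""
    if np.linalg.norm(p - node.center()) > eps:
        return NEG_INF
    return tight_score if node.diameter_with(p) <= eps else loose_score
```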
Merging stage
In the first stage, we applied a progressive method to align the trajectories onto an alignment graph one by one. In the i-th iteration, a point from T_i is either aligned to the best-matched node in the current POG G_i, used to create a new node containing this point and the corresponding canonical cluster center, or discarded. After processing all N trajectories in order, we return the POG G = G_N. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G using a novel merging process.
Given the greedy nature of the POA algorithm, the alignment obtained in G is not optimal and depends on the alignment order. Furthermore, since the influence regions of different aligned nodes may overlap, no matter how we improve the scoring function, it is sometimes simply ambiguous to decide locally where to align a new incoming point, and a wrong decision may have grave consequences later.
For example, see Figure 8, where the set of points P (enclosed in the dotted circle) should have been aligned to one node. However, suppose the nodes o_a and o_b already exist before any point of P is inserted. Then, as points from P come in, it is rather likely that they are distributed evenly between o_a and o_b. This problem becomes much more severe in higher dimensions, where P can be distributed among several nodes whose centers are well separated around P but whose influence regions still cover some points of P (the number of such regions grows exponentially with the dimension d). Hence, instead of being captured in one heavily aligned node, P is broken into many nodes of small size. Our experimental tests confirm that this happens rather commonly in both the standard and our modified POA algorithms.
To address this problem, we propose a novel postprocessing step on G. The goal is to merge qualified points from neighboring, less-aligned nodes into more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:
(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;
(I2) The partial order constraints induced by the POG are always consistent with the order of the points along each trajectory.
The second invariant means that, at any time in the POG G', if p ∈ o_1, q ∈ o_2, p, q ∈ T_i, and p precedes q along the trajectory T_i, then either o_1 ≺ o_2 or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of 𝒯 with respect to the same thresholds.
As an example, see Figure 6, where the second point of the trajectory T_1, contained in node b in (b), is moved to node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)) in order to maintain the invariants (I1) and (I2). When all points aligned to a specific node o are merged (thus moved) into other nodes (i.e., o becomes empty), we delete o, and its successors in the POG then become the successors of its parent.
A high-level pseudocode of the merging process is shown in Figure 9. It augments better-aligned nodes of the current POG G by processing the nodes with larger size first. We perform this procedure a few times until there is no significant increase in the quality of the resulting alignment. In practice, to speed up the algorithm, we merge neighbors into a node o only if its size is greater than some threshold (fixed at half of the size threshold, i.e., τ/2, in our experiments), as otherwise there is low probability that o will become a heavy node later.
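The merging pass can be sketched as follows, in the spirit of the pseudocode in Figure 9; the POG interface (neighbor candidates, the order-consistency test for (I2), and the point-moving routine) is abstracted into assumed helpers.

```python
def merge_pass(pog, eps, tau):
    """One pass of the merging postprocessing over the POG.

    pog is assumed to expose:
      pog.nodes()                        -> iterable of node objects with .size
      pog.neighbor_points(node)          -> candidate (point, source_node) pairs
      pog.diameter_if_added(node, point) -> diameter of node plus point
      pog.order_consistent_if_moved(point, src, dst) -> bool (invariant I2)
      pog.move_point(point, src, dst)    -> move point, update edges, and delete
                                            src if it becomes empty
    """
    # Heavier nodes are processed first; skip nodes unlikely to become heavy.
    for node in sorted(pog.nodes(), key=lambda o: o.size, reverse=True):
        if node.size <= tau / 2:
            continue
        for point, src in list(pog.neighbor_points(node)):
            if (pog.diameter_if_added(node, point) <= eps              # invariant I1
                    and pog.order_consistent_if_moved(point, src, node)):  # invariant I2
                pog.move_point(point, src, node)
```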