We address the problem of improving residue-wise score-based predictors for protein interface residues as a node labeling problem for undirected graphs, using the model class of conditional random fields (CRFs). Lafferty et al. [30] were the first to apply CRFs to the problem of labeling sequence data. Li et al. [28] used linear-chain CRFs to address interaction site prediction. Linear-chain CRFs have the advantage that the Viterbi algorithm, well known from decoding HMMs, can be used to efficiently infer the most likely label sequence. The CRF-based models above make the assumption that the label of one residue is conditionally independent of the labels of all other residues given the labels of the two residues adjacent in the protein sequence. To the best of our knowledge, we are the first to employ a graphical model that takes the spatial neighborhood of residues located on the protein surface into account.
This section is subdivided into three parts. We first explain how we model protein surfaces by pairwise CRFs. Then we introduce our new inference method. Finally, we elucidate our training method.
Using conditional random fields to model protein surfaces
For every protein under study that has n surface residues, a pair of random vectors (X, Y) is considered. The vector X is the observation that represents the knowledge about this protein that is utilized in the prediction, e.g. the 3D structure of the target protein and a multiple sequence alignment of the target with its homologs.
The vector Y is a random sequence of length n over the alphabet {I, N} that labels the index set {1, 2, …, n}, which in turn is called the set of positions (of the surface residues). The label I represents interface residues, whereas the label N represents non-interface residues. {I, N}^n is the set of all label sequences of length n over {I, N}. We will also call them assignments, as the term 'label sequence' may lead to confusion when applied below to subsets of {1, 2, …, n} that are not contiguous sequences.
Let G = (V, E) be the neighborhood graph, where V = {1, 2, …, n} is the set of positions and E is the set of edges, which typically results from an atom-distance-based neighborhood definition for positions. We assume for convenience of notation that G has no isolated nodes; cases with isolated nodes can trivially be reduced to cases without them. Let C(G) be the set of G's cliques, which we refer to as node cliques. For a node clique c and an assignment y we denote by y_c the restriction of y to the positions belonging to c. For c = {i} and c = {i, j} we write y_i and y_{i,j} rather than y_{{i}} and y_{{i,j}}.
The preceding notation is also used in the slightly more general case of partial label assignments to arbitrarily chosen subsets W of the set of positions V. Formally, y_W denotes (y_i)_{i ∈ W}. Given two partial assignments y_W and y_{W′} that are identical on W ∩ W′, the union y_W ∪ y_{W′} is well-defined.
The conditional distribution of our pCRF (X, Y) with respect to the neighborhood graph G is defined as follows:

    P(Y = y | X = x) = (1 / Z(x)) · exp( Σ_{c ∈ C(G)} Φ_c(y_c, x) ),     (1)

where x and y are arbitrarily chosen instances of the random observation X and the random label sequence Y, respectively, Φ_c(y_c, x) ∈ ℝ is the feature of the CRF located at the node clique c (again Φ_i and Φ_{i,j} simplify the notation for Φ_{{i}} and Φ_{{i,j}}), and Z(x) is the observation-specific normalization factor defined by

    Z(x) = Σ_{y′ ∈ {I,N}^n} exp( Σ_{c ∈ C(G)} Φ_c(y′_c, x) ).     (2)

Let us call s(y, x) := Σ_{c ∈ C(G)} Φ_c(y_c, x) the score of the label sequence y given the observation x.
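To make these definitions concrete, the following minimal Python sketch (with a hypothetical data layout: per-node and per-edge feature tables) computes the score of a labeling and, by brute-force enumeration, the conditional probability of Equation 1; the explicit normalization of Equation 2 is of course only feasible for very small graphs.

    import math
    from itertools import product

    def score(y, node_feat, edge_feat, edges):
        # Score of a full labeling y for a pairwise CRF: the exponent in Equation 1,
        # i.e. the sum of all node features and all edge features.
        s = sum(node_feat[i][y[i]] for i in range(len(y)))
        s += sum(edge_feat[(i, j)][(y[i], y[j])] for (i, j) in edges)
        return s

    def conditional_probability(y, node_feat, edge_feat, edges):
        # P(Y = y | X = x) with the normalization factor Z(x) of Equation 2
        # computed by explicit enumeration of all 2^n labelings.
        n = len(y)
        z = sum(math.exp(score(yp, node_feat, edge_feat, edges))
                for yp in product('IN', repeat=n))
        return math.exp(score(y, node_feat, edge_feat, edges)) / z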
A CRF is called a pairwise CRF (pCRF) if Φ_c ≡ 0 for all node cliques c of size larger than two. The remaining features Φ_i and Φ_{i,j} are referred to as node features and edge features, respectively. Thus, every position i is represented by the pair (Φ_i(N, x), Φ_i(I, x)) and every edge {i, j} by the quadruple (Φ_{i,j}(N, N, x), Φ_{i,j}(I, N, x), Φ_{i,j}(N, I, x), Φ_{i,j}(I, I, x)).
Following [30], we assume moreover that each node feature and each edge feature is a sum of weighted base features. More precisely, for every position i and every edge {i, j} we assume representations

    Φ_i(y_i, x) = Σ_k α_k · f_k(y_i, x, i)   and   Φ_{i,j}(y_{i,j}, x) = Σ_l β_l · g_l(y_{i,j}, x, i, j),

where y ∈ {I,N}^n and x is an observation. The two real vectors

    α = (α_1, α_2, …)   and   β = (β_1, β_2, …)     (3)

need to be calculated in a training phase.
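As a small illustration of this linear parameterization (hypothetical function names; the base features f_k and g_l are the binned indicators introduced below):

    def node_feature(i, y_i, x, alpha, base_node_feats):
        # Phi_i(y_i, x) as a weighted sum of base features, weights alpha.
        return sum(a * f(y_i, x, i) for a, f in zip(alpha, base_node_feats))

    def edge_feature(i, j, y_i, y_j, x, beta, base_edge_feats):
        # Phi_{i,j}(y_i, y_j, x) as a weighted sum of base features, weights beta.
        return sum(b * g(y_i, y_j, x, i, j) for b, g in zip(beta, base_edge_feats))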
In the most general sense, protein characteristics are real-valued evaluations of positions and of pairs of adjacent positions (edges of the neighborhood graph), respectively, that are correlated with our position labeling problem. We use a standard step-function technique to obtain base features from protein characteristics, rather than taking the raw values of the characteristics. To make our paper self-contained, let us briefly describe this technique.
A protein characteristic depends on the observation and on either a node or an edge. Each protein characteristic, such as the relative solvent-accessible surface area of a residue, is transformed into several binary features by binning, i.e. we distinguish only a few cases rather than the whole range of the characteristic. Assuming the common case of a real-valued characteristic, the bins form a partition of the reals into intervals. This discretization allows us to approximate any shape of dependency of the labels on the characteristics, rather than assuming a fixed shape such as linear or logarithmic.
From protein characteristics for positions to node features. We subdivide the range of a characteristic C into, say, γ intervals, where γ is at least two. Let s_1 < s_2 < … < s_{γ−1} be the corresponding interval boundaries. It is reasonable to take s_ι as the ι/γ-quantile of the empirical distribution of C for non-interface residues, where C(i, x) ∈ (s_0, s_γ]. Then we define for each position i the following 2γ base features associated with the position characteristic C:

    f_{C,y,ι}(y_i, x, i) = 1 if y_i = y and C(i, x) ∈ (s_ι, s_{ι+1}], and 0 otherwise,     (4)

where y = N, I and ι = 0, 1, …, γ−1, and s_0 := −∞, s_γ := ∞.
From protein characteristics for edges to edge features. Let D be the characteristic. Analogously to the previous case, we obtain for each edge {i, j} the following 4γ base features associated with D, where y, y′ ∈ {N, I} and ι = 0, 1, …, γ−1:

    g_{D,y,y′,ι}(y_{i,j}, x, i, j) = 1 if y_i = y, y_j = y′ and D({i, j}, x) ∈ (s_ι, s_{ι+1}], and 0 otherwise.     (5)
In both cases we set γ=5.
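A compact sketch of this binning step for a node characteristic (hypothetical helper names; the boundaries are the ι/γ-quantiles described above):

    import numpy as np

    def node_base_features(label, c_value, boundaries):
        # Binary base features in the spirit of Equation 4: one indicator per
        # (label, interval) pair; exactly one of the 2*gamma indicators equals 1.
        cuts = [-np.inf] + list(boundaries) + [np.inf]
        return {(y, k): float(y == label and cuts[k] < c_value <= cuts[k + 1])
                for y in ('N', 'I') for k in range(len(cuts) - 1)}

    # Interval boundaries: empirical iota/gamma-quantiles of the characteristic C
    # over non-interface residues, here with gamma = 5.
    # boundaries = np.quantile(c_values_noninterface, [0.2, 0.4, 0.6, 0.8])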
Devising a generalized Viterbi algorithm for pCRFs
The problem of finding a most probable label sequence y∗ given an observation x is NP-hard for general pCRFs [31]. In this subsection we present a heuristic that approximately solves this problem.
To this end, we first devise an algorithm, which we call the generalized Viterbi algorithm. It computes an optimal label sequence, i.e. one maximizing the posterior probability of y given x. Unfortunately, its run time is unacceptable in too many cases. That is why we transform it in a second step into a feasible, time-bounded approximation algorithm.
The generalized Viterbi algorithm
Let G = (V, E) be the neighborhood graph underlying the protein under study. For any assignment (label sequence) y and any subset W of V, let y_W denote the partial assignment of y with respect to W. (This is in line with the notation y_c, c a node clique, introduced earlier in this study.)
If W_1, W_2, …, W_k are pairwise disjoint position sets, the assignment for W_1 ∪ W_2 ∪ ⋯ ∪ W_k canonically resulting from assignments y_{W_1}, y_{W_2}, …, y_{W_k} is denoted by y_{W_1} ∪ y_{W_2} ∪ ⋯ ∪ y_{W_k}. For W ⊆ V, the score is defined by

    s(y_W, x) = Σ_{i ∈ W} Φ_i(y_i, x) + Σ_{{i,j} ∈ E: i,j ∈ W} Φ_{i,j}(y_{i,j}, x).

Then the problem of determining a most probable label sequence y∗ given an observation x can be reformulated as

    y∗ = argmax_{y ∈ {I,N}^n} s(y, x).

This is the case because P(Y = y | X = x) is a monotonically increasing function of the score s(y, x), so it suffices to consider the score.
To put this into practice, we devised an algorithm we call the generalized Viterbi algorithm. On the one hand, it is analogous to the classical Viterbi algorithm; on the other hand, there is a major difference: in our case there is no canonical order in which the positions of V are traversed. We first explain our algorithm for an arbitrary order and afterwards show how to calculate a fairly effective one. In what follows, we assume that the positions not yet touched are held in a dynamic queue. Those positions that have already left the queue form the history set H.
Assume that the subgraph of G induced by H has connected components H_1, …, H_m. For μ = 1, 2, …, m, let B_μ be the so-called boundary component associated with H_μ, consisting of all positions of H_μ that are adjacent to at least one position outside H. The complement H_μ ∖ B_μ is the interior of the μ-th history component. See Figure 1 for an example.
For assignments b_μ of the boundary components B_μ, the Viterbi variables are defined as

    ϑ_μ(b_μ) := max { s(y_{H_μ}, x) : y_{H_μ} an assignment of H_μ with (y_{H_μ})_{B_μ} = b_μ }.     (6)

The Viterbi variables can be represented as a set of tables, one table of size 2^{|B_μ|} for each boundary component B_μ. In the case where a boundary component is empty the table reduces to a single number.
At any stage, the algorithm stores the connected components H_1, …, H_m of the current history set H, the corresponding boundary components B_1, …, B_m, and the Viterbi variable values ϑ_μ(b_μ), where b_μ ranges over all possible assignments of the corresponding boundary component. In addition, we store for every assignment of the boundary a maximizing interior assignment. This assignment is the argmax of (6), but it is determined with the dynamic programming recursions defined below. Let us call these data the current state of the algorithm. It mainly consists of record sets indexed by the boundary labelings.
At the very beginning the queue contains all positions, and the history set H and the corresponding boundary components are empty. As long as the position queue is not empty, the top element v is extracted and the state is updated as follows.
When adjoining v to the history set H, there are two cases to distinguish. Either position v is not adjacent to any position of any old boundary component (see Figure 2), or adjoining position v to H results in adding it to some connected component of the old history set or even in merging two or more of them (see Figure 3).
In the first case we simply take over all the old connected components, boundary sets and Viterbi variables. Moreover, we create a new history component H_{m+1} := {v} with boundary component B_{m+1} := {v} and initialize its Viterbi variables by ϑ_{m+1}(y_v) := Φ_v(y_v, x) for y_v ∈ {I, N}.
In the second case position v is adjacent to some boundary components, say B_{m′}, B_{m′+1}, …, B_m (after renumbering the components appropriately). Then the old history components and the current position v are merged together:

    Ĥ := H_{m′} ∪ H_{m′+1} ∪ ⋯ ∪ H_m ∪ {v}.

The other history set components and the corresponding Viterbi variables are not affected.
For μ = m′, m′+1, …, m, let R_μ be the set of all positions out of B_μ that are no longer boundary nodes after v has been adjoined to the history set. The nodes in R_μ are removed from the boundary after the iteration. Let B_μ ∖ R_μ be the complement of R_μ in B_μ. By inspecting the edges incident to the current position v, all these sets can be computed in linear time.
The new boundary set B̂ is then either (B_{m′} ∖ R_{m′}) ∪ ⋯ ∪ (B_m ∖ R_m) or (B_{m′} ∖ R_{m′}) ∪ ⋯ ∪ (B_m ∖ R_m) ∪ {v}, where it can be checked in linear time whether or not v is a new boundary position.
We are now in a position to calculate the new Viterbi variables ϑ̂(b̂), where b̂ ranges over all assignments of the new boundary set B̂. Let R := R_{m′} ∪ ⋯ ∪ R_m be the set of removed boundary positions.
If v ∈ B̂, then

    ϑ̂(b̂) = max_{y_R} [ Φ_v(b̂_v, x) + Σ_{u ∈ H: {u,v} ∈ E} Φ_{u,v}((b̂ ∪ y_R)_u, b̂_v, x) + Σ_{μ=m′}^{m} ϑ_μ((b̂ ∪ y_R)_{B_μ}) ],

where y_R ranges over all assignments of R. Here, any assignment of a node set is assumed to implicitly define assignments for any subset thereof. Figure 4 illustrates this case of the recursion step. If, however, v ∉ B̂, then the label of v is maximized over as well:

    ϑ̂(b̂) = max_{y_v ∈ {I,N}, y_R} [ Φ_v(y_v, x) + Σ_{u ∈ H: {u,v} ∈ E} Φ_{u,v}((b̂ ∪ y_R)_u, y_v, x) + Σ_{μ=m′}^{m} ϑ_μ((b̂ ∪ y_R)_{B_μ}) ].
Finally, for every boundary assignment b̂ an interior labeling at which the maximum is attained is stored. The algorithm terminates after the last node v from the queue has been processed. In the typical case, where the graph is connected, at termination the history set equals V, the boundary is empty, the single remaining Viterbi variable equals the maximal score, and the stored interior labeling is a most probable label sequence y∗.
The running time of the algorithm is exponential in b but linear in n, where b is the size of the largest boundary set arising and n is the number of surface residues. We call this algorithm the generalized Viterbi algorithm because, for a graph that is a linear chain 1−2−3−⋯−n of nodes and the node order 1, 2, …, n, the Viterbi variables we define coincide with those of the standard Viterbi algorithm for HMMs. In the case of a graph that is a tree, this algorithm specializes to the Fitch algorithm, or to an argmax version of Felsenstein's pruning algorithm, when a leaf-to-root node order is chosen after rooting the tree at an arbitrary node. In both special cases the boundary sets always have size at most 1. The tree example also motivates the use of several history sets at the same time: using a single history set only, one would not be able to achieve a linear running time on trees.
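The following Python sketch implements the same frontier dynamic programming in a simplified form (hypothetical data layout; it keeps a single boundary table instead of one table per connected history component, so it is exact but forgoes the per-component speed-up discussed above):

    def generalized_viterbi(nodes, edges, node_feat, edge_feat, order):
        # Exact max-score labeling by dynamic programming over boundary assignments.
        # node_feat[v][y] and edge_feat[(u, v)][(yu, yv)] hold the pCRF features.
        adj = {v: set() for v in nodes}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        processed = set()
        # Table: boundary assignment (sorted tuple of (node, label) pairs) ->
        #        (best score so far, one maximizing assignment of all processed nodes).
        table = {(): (0.0, {})}
        for v in order:
            done = processed | {v}
            new_table = {}
            for key, (sc, assign) in table.items():
                boundary = dict(key)
                for yv in ('I', 'N'):
                    s = sc + node_feat[v][yv]
                    for u in adj[v] & processed:  # processed neighbours lie on the boundary
                        e = (u, v) if (u, v) in edge_feat else (v, u)
                        pair = (boundary[u], yv) if e == (u, v) else (yv, boundary[u])
                        s += edge_feat[e][pair]
                    new_assign = dict(assign)
                    new_assign[v] = yv
                    # New boundary: processed nodes that still have unprocessed neighbours.
                    new_key = tuple(sorted((w, new_assign[w]) for w in done if adj[w] - done))
                    if new_key not in new_table or s > new_table[new_key][0]:
                        new_table[new_key] = (s, new_assign)
            table = new_table
            processed = done
        return max(table.values(), key=lambda t: t[0])  # (max score, maximizing labeling)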
A heuristic based on the generalized Viterbi algorithm
First, it is vital for our generalized Viterbi algorithm to keep the size of the boundary sets small; a good position order is therefore of great importance. The algorithm starts by choosing a vertex of minimal degree. When determining the next positions to be dequeued, the algorithm selects a boundary node for which the number of incident edges leading to nodes not belonging to any current history set is minimal; these neighboring nodes are then dequeued next, in an arbitrarily chosen order.
Second, the space demand is reduced by restricting the number of boundary labelings admitted. Starting from the admitted labelings of the current history set, the percentage of the reachable boundary labelings of the successor history that will be discarded is determined, and the corresponding percentile is estimated. To this end, a sufficiently large sample of possible labelings of the new boundary set is drawn, their Viterbi variables are computed, and the corresponding sample percentile is taken. Finally, only those boundary labelings of the new history set are retained whose Viterbi variables exceed this percentile.
That way we compute near-optimal solutions, good enough for our purposes, within feasible computation time.
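A sketch of this percentile-based pruning applied to a table of boundary labelings (simplified: the sample is drawn from the already computed entries rather than from the reachable labelings of the successor history):

    import random

    def prune_boundary_table(table, discard_fraction, sample_size=1000):
        # Keep only boundary labelings whose Viterbi variable (score) exceeds an
        # estimated percentile; discard_fraction in [0, 1) is the share to drop.
        scores = [entry[0] for entry in table.values()]
        if not scores:
            return table
        sample = sorted(random.sample(scores, min(sample_size, len(scores))))
        cutoff = sample[int(discard_fraction * (len(sample) - 1))]
        return {key: entry for key, entry in table.items() if entry[0] > cutoff}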
Piecewise training for pCRFs
Let (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)) be the independent identically distributed training sample. For every μ = 1, 2, …, m, let V^(μ) and E^(μ) be the set of positions and the set of edges of the neighborhood graph associated with x^(μ), let n_μ be the number of positions of the μ-th training example, and let Y^(μ) := {I, N}^{n_μ} be the set of all possible label sequences of this graph.
This data set is unbalanced, as there are many more non-interface positions than interface positions. As is customary for other machine learning approaches such as support vector machines and artificial neural networks [28], we manipulated the ratio of positive and negative example positions for training in order to obtain reasonable results.
We amplified the influence of the positive examples, rather than selecting various sets of training data by deleting negative examples as done in [28].
Let ν_I, ν_N, ν_II and ν_NN be the number of interface positions, the number of non-interface positions, the number of interface-interface edges, and the number of non-interface-non-interface edges in the training data, respectively. In terms of these counts we define two amplifier functions, one for the positions i and one for the edges {i, j} of the m neighborhood graphs resulting from the training data.
To uniformly govern the influence of the amplifiers, we introduce an amplifier control parameter η_3 ∈ [0, 1].
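The precise amplifier definitions follow from the counts above; purely as an illustration of the idea of amplifying the positive examples (an assumption on our part, not necessarily the exact form used here), amplifiers moderated by η_3 could look as follows:

    def node_amplifier(label, nu_I, nu_N, eta3):
        # Illustrative assumption: up-weight interface positions by the class
        # ratio, moderated by the control parameter eta3 in [0, 1].
        return (nu_N / nu_I) ** eta3 if label == 'I' else 1.0

    def edge_amplifier(label_pair, nu_II, nu_NN, eta3):
        # Illustrative assumption: up-weight interface-interface edges analogously.
        return (nu_NN / nu_II) ** eta3 if label_pair == ('I', 'I') else 1.0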
We set up our log-likelihood objective function as

    L(α, β) = Σ_{μ=1}^{m} ( s_{η_3}(y^(μ), x^(μ)) − log Z(x^(μ), η_3) ),

where s_{η_3} denotes the score computed with the amplified features, and where ideally, for each μ = 1, 2, …, m,

    Z(x^(μ), η_3) = Σ_{y ∈ Y^(μ)} exp( s_{η_3}(y, x^(μ)) )

is the training-instance-specific normalization factor.
Unfortunately, maximizing this objective function is in general algorithmically intractable. Taking a cue from Sutton et al. [35], who introduced what they called piecewise training, we deal with this problem by disentangling the labels of nodes and edges. For μ = 1, 2, …, m, a non-coherent labeling of the neighborhood graph is any mapping that assigns to every position v a label y_v ∈ {I, N} and to every edge e a pair of labels y_e ∈ {I, N}^2.
We then replace Z(x^(μ), η_3) by the analogous sum taken over all non-coherent labelings as normalization factor. Since the node and edge labels are then decoupled, this sum factorizes into a product of local, per-node and per-edge terms, which makes the optimization problem computationally feasible.
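A minimal sketch of the resulting factorized normalizer (same hypothetical feature tables as in the sketches above):

    import math

    def log_piecewise_normalizer(node_feat, edge_feat, edges, n):
        # Log of the piecewise normalization factor: summing over non-coherent
        # labelings decouples nodes and edges, giving a product of local sums.
        logz = sum(math.log(sum(math.exp(node_feat[i][y]) for y in 'IN'))
                   for i in range(n))
        logz += sum(math.log(sum(math.exp(edge_feat[e][(y1, y2)])
                                 for y1 in 'IN' for y2 in 'IN'))
                    for e in edges)
        return logz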
The L-BFGS method [38] is used to solve it. That way we obtain the coefficient vectors α and β (see Equation 3), which depend on the amplifier control parameter η_3 ∈ [0, 1].
To mitigate the negative consequences of the disentanglement, we use a correction factor δ ≥ 1. For any characteristic D and ι = 0, 1, …, γ−1, the weights of the base edge features g_{D,N,I,ι} and g_{D,I,N,ι} (see Equation 5) are all multiplied by δ. Thus a change in classification along an edge is additionally penalized. The correction factor δ is best set between 1.15 and 1.25.
For our implementation of the training, we used the Java CRF package from Sunita Sarawagi at http://crf.sourceforge.net/.