CRF-based models of protein surfaces improve protein-protein interaction site predictions

Dong, Zhijie; Wang, Keyu; Linh Dang, Truong Khanh; Gültas, Mehmet; Welter, Marlon; Wierschin, Torsten; Stanke, Mario; Waack, Stephan

doi:10.1186/1471-2105-15-277

Methodology article
Open access
Published: 13 August 2014

CRF-based models of protein surfaces improve protein-protein interaction site predictions

Zhijie Dong¹,
Keyu Wang¹,
Truong Khanh Linh Dang¹,
Mehmet Gültas²,
Marlon Welter¹,
Torsten Wierschin³,
Mario Stanke³ &
…
Stephan Waack¹

BMC Bioinformatics volume 15, Article number: 277 (2014) Cite this article

3614 Accesses
18 Citations
1 Altmetric
Metrics details

Abstract

Background

The identification of protein-protein interaction sites is a computationally challenging task and important for understanding the biology of protein complexes. There is a rich literature in this field. A broad class of approaches assign to each candidate residue a real-valued score that measures how likely it is that the residue belongs to the interface. The prediction is obtained by thresholding this score.

Some probabilistic models classify the residues on the basis of the posterior probabilities. In this paper, we introduce pairwise conditional random fields (pCRFs) in which edges are not restricted to the backbone as in the case of linear-chain CRFs utilized by Li et al. (2007). In fact, any 3D-neighborhood relation can be modeled. On grounds of a generalized Viterbi inference algorithm and a piecewise training process for pCRFs, we demonstrate how to utilize pCRFs to enhance a given residue-wise score-based protein-protein interface predictor on the surface of the protein under study. The features of the pCRF are solely based on the interface predictions scores of the predictor the performance of which shall be improved.

Results

We performed three sets of experiments with synthetic scores assigned to the surface residues of proteins taken from the data set PlaneDimers compiled by Zellner et al. (2011), from the list published by Keskin et al. (2004) and from the very recent data set due to Cukuroglu et al. (2014). That way we demonstrated that our pCRF-based enhancer is effective given the interface residue score distribution and the non-interface residue score are unimodal.

Moreover, the pCRF-based enhancer is also successfully applicable, if the distributions are only unimodal over a certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction of the PresCont server devised by Zellner et al. (2011) on PlaneDimers.

Conclusions

Our results strongly suggest that pCRFs form a methodological framework to improve residue-wise score-based protein-protein interface predictors given the scores are appropriately distributed. A prototypical implementation of our method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html.

Background

Protein-protein interactions are constitutive of almost every biological process. The ability to identify the residues that form the interaction sites of these complexes is necessary to understand them. In particular, it is the basis for new therapeutic approaches to treat diseases [1, 2].

A great deal of work has been done on developing in-silico prediction methods. As already observed by Zhou et al.[3], these methods can be subdivided with respect to the kind of mathematical foundation invoked and with respect to the features or characteristics of the protein used.

Residue-wise score-based prediction methods

Let x_r be the data relevant for a residue r in a given protein chain. These methods then employ a function f(x_r,λ), where λ are some coefficients which have been learned through the training. The value of f(x_r,λ) then determines, whether r is rated as an interface or not. The linear regression method [4, 5], the scoring function method [6–11], the neural network method [12–17], and the support vector machine method [18–25] are of this kind.

Probabilistic methods

Let X be the data relevant for a protein chain, where these data are assumed to stem from a random source thus obeying a random distribution. X, which alternatively is called the observation, typically includes the structure. The label sequence of the residues Y that classifies each individual residue either as interface or as non-interface is assumed to be random, too. Typically, probabilistic methods use the conditional probability distribution (Y | X) to determine a classification y* of the residues of maximal posterior probability (y* | x). Naive Bayesian methods [26], Bayesian network methods [27], hidden Markov models (HMMs) [26], and linear-chain Conditional Random Fields taking the backbone as underlying graphical structure [28] fall in this category. Using posterior decoding on the basis of the forward-backward algorithm, both HMMs and CRFs are residue-wise score-based prediction methods, where the binary decision is made by thresholding the posterior probabilities of classifying the residues as interface.

Notations

We use Latin uppercase letters when referring to random aspects of the objects denoted by them. In contrast, lowercase letters denote arbitrarily chosen but fixed objects. In this context boldface letters indicate vectors, the corresponding non-boldface letters their coefficients.

The vast majority of methods use the 3D structure of the target protein chain in form of a PDB file as input [4–13, 15, 17–21, 23–25]. However, a few methods are not requiring a 3D structure and rather use sequences only [14, 16, 22]. We here consider the problem with a given 3D structure of the target protein chain. Sequence-based input may include a multiple sequence alignment of related proteins from which, for example, sequence conservation can be inferred. When the 3D structure of an unbound binding partner is also available, protein-protein docking methods can be applied. This has also been exploited to provide feedback from docking to the more specific problem of interface prediction [29]. We here consider the case where the binding partner’s 3D structure is not given. Nor requires the presented method the sequence of the binding partner. Albeit, we tested on homodimers only as we here rather focus on our new method rather than on features or types of proteins. The protein features used for interface prediction in the literature are reviewed in the Methods section as far as we make use of them in this article.

Most of the current studies for predicting interaction sites of proteins that use a probabilistic method are restricted by treating the residues of the proteins as independent vertices. Li et al. have taken the backbone neighborhood into account thus modeling the protein as a sequence [28] using what can be called a line CRF or linear-chain CRF. The features they define on the label pair of two backbone neighbors have the effect of smoothing the predicted labels along the protein sequence. Decisive is, however, that they were the first who used conditional random fields (CRFs) for interface prediction. CRFs in turn have come into use for solving sequence labeling problems due to Lafferty et al.[30]. See [31] for an overview. From the mathematical point of view they take advantage of the fact that they model the conditional probability (Y | X) rather than the joint probability (Y, X). Recently there has been an explosion of interest in conditional random fields (CRFs) with successful applications. It has been shown that CRFs have the abilities for solving sequence labeling problems like part-of-speech tagging (POST) [32] and natural language processing [33]. Furthermore in the web extraction problem, in which the web-sites are modeled as two dimensional grid graphs, CRFs perform well [34]. One of their outstanding benefits over many other statistical models is that a CRF can easily describe the dependencies of observations.

As proteins are folded into three dimensional structures, spatial relationships create dependencies between residues. For example, we find on the test data described below that the correlation coefficient between spatial neighbors that are not also sequence neighbors (distance ≤3.5 Å) is 0.45. This is only slightly lower than the correlation coefficient between residues that are sequence neighbors (0.49). As there are more than three times as many spatial pairs of neighbors than sequence neighbors at this threshold it is reasonable from a modeling standpoint to use a model that respects all dependencies induced by spatial proximity, not only the dependencies induced by proximity along the backbone.

There are many papers using spatial neighborhood information of residues to predict-protein interaction sites (see e.g. [2, 13, 21, 28]). However, the spatial information of proteins was only integrated into the feature functions, but not represented in the model. For probabilistic models, the difference between the two ways to integrate spacial information is that in previous models the label of the i-th residue Y_i is conditional independent from the labels of other residues given data X and – in the case of linear CRFs or HMMs – given the labels of Y_i−1 and Y_i+1. Even when neighborhood information is only used for spatial smoothing of the labels, the intuitive advantage over, say, an SVM classifier that uses spatial neighborhood in the features but classifies each residue independently, is that not-patch-like candidate labelings are explicitly punished. In contrast, such an independent classifier-approach may have a tendency to predict individual interface residues ‘sprinkled’ around the protein surface [28].

For this reason, a general CRF seems to be more suitable for the task. However, inference for general CRFs is intractable. In this paper, pairwise conditional random fields (pCRFs) are utilized. Specializing general CRFs, only node cliques and edge cliques are taken into consideration in pCRFs. A pCRF retains most spatial information of proteins, can be specified with the same number of parameter as a line CRF and approximate inference remains feasible with the generalization of the Viterbi algorithm introduced here. Taking pattern from piecewise training methods [35], we disentangled the labels of nodes and edges to train the model.

In order to take advantage of a residue-wise score-based predictor, we model the protein surface by means of a pCRF, where the observation is solely a sequence of surface residue scores between 0 and 1 output by the predictor. We then utilize a generalized Viterbi algorithm and piecewise training. The resulting tool tries to enhance the predictor chosen on the surface of the protein under study. It is the aim of this paper to demonstrate effectiveness of this approach provided that the interface residue scores and the non-interface residue scores are appropriately distributed.

Methods

We address the problem of improving residue-wise score-based predictors for protein interface residues as a node labeling problem for undirected graphs using the model class of conditional random fields (CRFs). Lafferty et al.[30] were the first who applied CRFs to the problem of labeling sequence data. Li et al.[28] used line CRFs to address the interaction site prediction. They have the advantage that the Viterbi algorithm well-known from decoding HMMs can be used to efficiently infer the most likely labeling sequence. Very useful and illustrative presentations on CRFs are given in [31, 32, 36, 37]. Above CRF-based models make the assumption that the label of one residue is conditionally independent of the labels of all other residues given the labels of the two adjacent residues in the protein sequence. To the best of our knowledge, we are the first to employ a graphical model that takes the spatial neighborhood of residues located on the protein surface into account.

This section is subdivided into three parts. We first explain how we model protein surfaces by pairwise CRFs. Then we introduce our new inference method. Finally, we elucidate our training method.

Using conditional random fields to model protein surfaces

For every protein under study that has n surface residues, a pair of random vectors (X, Y) is considered. The vector X is the observation that represents the knowledge about this protein that is utilized in the prediction, e.g. the 3D structure of the target protein and a multiple sequence alignment together with homologs.

The vector Y is a random sequence of length n over the alphabet {I,N} that labels the index set {1,2,…,n}, which in turn is called the set of positions (of the surface residues). The label I represents interface residues, whereas the label N represents non-interface residues. {I,N}ⁿ is the set of all label sequences of length n over {I,N}. We will also call them assignments as the term ‘label sequence’ may lead to confusion when applied below to subsets of {1,2,…,n} that are not contiguous sequences.

Let $G = (V, E)$ be the neighborhood graph, where $V = {1, 2, \dots, n}$ is the set of positions, is the set of edges that typically results from an atom-distance-based neighborhood definition for positions. We assume for convenience in notation that has no isolated nodes. Cases with isolated nodes could trivially be reduced to cases without isolated nodes. Let be the set of ’s cliques, which we refer to as node cliques. For a node clique $c \in C$ and an assignment y we denote by y_c the restriction of y to the positions belonging to the node clique c. For c={i} and c={i,j} we write y_i and $(y_{i}, y_{j})$ rather than y_{i} and y_{i,j}.

The preceding notation is also used in the slightly more general case of partial label assignments to arbitrarily chosen subsets of the set of positions . Formally, let $y_{S}$ denote ${(i, y_{i}) | i \in S, y_{i} \in {I, N}}$ . Given two partial assignments $y_{S_{1}}$ and $y_{S_{2}}$ are identical on $S_{1} \cap S_{2}$ , the union $y_{S_{1}} \cup y_{S_{2}}$ is well-defined.

The conditional distribution function of our pCRF (X, Y) with respect to the neighborhood graph $G = (V, E)$ is defined as follows:

\begin{align} ℙ (y | x) & = \frac{1}{Z (x)} exp (\sum_{i \in V} Φ_{i} (y_{i}, x) + \sum_{\{i, j\} \in E} Φ_{i, j} (y_{i}, y_{j}, x) \\ + \sum_{c \in C ∖ (V \cup E)} Φ_{c} (y_{c}, x)), \end{align}

(1)

where x and y are arbitrarily chosen instances of the random observation X and the random label sequence Y, respectively, Φ_c(y_c,x)∈∙ ( $c \in C$ ) is the feature of the CRF located at the node clique c (again Φ_i and Φ_i,j simplify notation for Φ_{i} and Φ_{i,j}), and Z(x) is the observation-specific normalization factor defined by

\begin{align} Z (x) & : = \sum_{y \in {\{I, N\}}^{n}} exp (\sum_{i \in V} Φ_{i} (y_{i}, x) + \sum_{\{i, j\} \in E} Φ_{i, j} (y_{i}, y_{j}, x) \\ + \sum_{c \in C ∖ (V \cup E)} Φ_{c} (y_{c}, x)) . \end{align}

(2)

Let us call $ln (Z (x) ℙ (y ∣ x))$ the score of the label sequence y given the observation x.

A CRF is called a pairwise CRF (pCRF) if Φ_c≡0, for all node cliques c larger than two. The remaining features Φ_i and Φ_i,j are referred to as node features and edge features, respectively. Thus, every position $i \in V$ and every edge $(i, j) \in E$ is represented by the pair (Φ_i(N,x),Φ_i(I,x)) and by the quadruplet (Φ_{i,j} (N,N,x),Φ_{i,j}(I,N,x),Φ_{i,j}(N,I,x),Φ_{i,j}(I,I,x)).

Following [30], we assume moreover that each node feature and each edge feature is a sum of weighted base features. More precisely, for every position $i \in V$ and every edge ${i, j} \in E$ we assume representations

\begin{array}{l} Φ_{i} (y_{i}, x) & = \sum_{k = 1}^{K_{1}} α_{k} ψ_{k} (i, y_{i}, x) \\ Φ_{i, j} (y_{i}, y_{j}, x) & = \sum_{k = 1}^{K_{2}} β_{k} ϕ_{k} (i, j, y_{i}, y_{j}, x), \end{array}

where y∈{I,N}ⁿ and x is an observation. The two real vectors

α : = (α_{1}, α_{2}, \dots, α_{K_{1}}) β : = (β_{1}, β_{2}, \dots, β_{K_{2}})

(3)

need to be calculated in a training phase.

In the most general sense, protein characteristics are real-valued evaluations of positions and pairs of adjacent positions (edges of the neighborhood graph), respectively, that are correlated with our position labeling problem. We use a standard step function technique to obtain base features from protein characteristics, rather than taking the raw values of the characteristics. To make our paper self-contained, let us describe this technique for short.

A protein characteristic depends on the observation and either a node or an edge. Each protein characteristic, such as e.g. the relative solvent-accessible surface area of a residue, is transformed into several binary features by binning, i.e. we distinguish only a few different cases rather than the whole range of the characteristic. Assuming the common case of real-valued characteristics, the bins are a partition of the reals into intervals. The use of this discretization allows to approximate any shape of dependency of the labels on the characteristics, rather than assuming a fixed shape such as linear or logarithmic.

From protein characteristics for positions to node features. We subdivide the range of the characteristics C into say γ intervals, where γ is at least two. Let s₁<s₂<…<s_γ−1 be the corresponding interval boundaries. It is reasonable to take s_ι as the ι/γ-quantile of the empirical distribution of C for non-interface residues, where C(i,x)∈(s₀,s_γ]. Then we define for each position $i \in V$ the following 2γ base features associated with the position characteristics C.

ϕ_{y, ι}^{(C)} (i, y_{i}, x) : = \{\begin{array}{l} 1 & if y_{i} = y and C (i, x) \in (s_{ι}, s_{ι + 1}]; \\ 0 & otherwise; \end{array}

(4)

where y=N,I, and ι=0,1,…,γ−1 and s₀:=−∞,s_γ:=∞.

From protein characteristics for edges to edge features. Let D be the characteristics. Analogous to the previous case, we then obtain for each edge ${i, j} \in E$ the following 4γ base features associated with D, where y,y^′∈{N,I} and ι=0,1,…,γ−1.

ϕ_{y, y^{'}, ι}^{(D)} (i, j, y_{i}, y_{j}, x) : = \{\begin{array}{l} 1 & if y_{i} = y, y_{j} = y^{'}, and D (i, j, x) \in (s_{ι}, s_{ι + 1}]; \\ 0 & otherwise. \end{array}

(5)

In both cases we set γ=5.

Devising a generalized Viterbi algorithm for pCRFs

The problem of finding a most probable label sequence y^∗ given an observation x is NP-hard for general pCRFs [31]. In this subsection we present a heuristic that approximately solves this problem.

To this end, we first devise an algorithm, which we call generalized Viterbi algorithm. It computes an optimal label sequence, where the posterior probability of y^∗ given x is maximized. Unfortunately, its run-time is in too many cases not acceptable. That is why we transform it in a second step into a feasible, time-bounded approximation algorithm.

The generalized Viterbi algorithm

Let $G = (V, E)$ be the neighborhood graph underlying the protein under study. For any assignment (label sequence) y and any subset $V^{'}$ of , let $y_{V^{'}}$ denote the partial assignment of y with respect to $V^{'}$ . (This is in line with the notation y_c (c a position clique) introduced earlier in this study).

If $V_{1}, V_{2}, \dots, V_{r}$ are pairwise disjoint position sets, the assignment for $V_{1} \cup V_{2} \cup \dots \cup V_{r}$ canonically resulting from assignments $y_{V_{1}}, y_{V_{2}}, \dots, y_{V_{r}}$ is denoted by $y_{V_{1}} \cup y_{V_{2}} \cup \dots \cup y_{V_{r}}$ . For $V^{'} \subset V$ , the score $s V^{'} (y_{V^{'}} ∣ x)$ is defined by

\begin{align} s_{V^{'}} (y_{V^{'}} ∣ x) : = \sum_{i \in V^{'}} Φ_{i} (y_{i}, x) + \sum_{\begin{matrix} i, j \in V^{'} \\ {i, j} \in E \end{matrix}} Φ_{i, j} (y_{i}, y_{j}, x) . \end{align}

Then the problem of determining a most probable label sequence y^∗ given an observation x can be reformulated as

\begin{align} y^{*} = \underset{y}{argmax} s_{V} (y ∣ x) . \end{align}

This is the case, because it suffices to consider the score.

To put this into practice, we devised an algorithm we call generalized Viterbi. On the one hand, it is analogous to the classical Viterbi algorithm. On the other hand, there is a major difference. In our case there is no canonical order in which the positions of are traversed. Having explained our algorithm for any order, we show how to calculate a fairly effective one. In what follows, we assume that the positions not yet touched are held in a dynamic queue. Those positions having already left the queue form the history set $ℋ \subseteq V$ .

Assume that the subgraph of induced by has connected components $ℋ_{1}, ℋ_{2}$ , …, $ℋ_{m}$ . For μ=1,2,…,m, let $ℬ_{μ} \subseteq ℋ_{μ}$ be the so-called boundary component associated with $ℋ_{μ}$ defined by $ℬ_{μ} : = \{i \in ℋ_{μ} |\exists j \notin ℋ, {i, j} \in E\}$ . The complement $ℋ_{μ} ∖ ℬ_{μ}$ is the interior of the μ-th history component. See Figure 1 for an example.

For assignments $y_{ℬ_{1}}, y_{ℬ_{2}}, \dots, y_{ℬ_{m}}$ of the boundary components $ℬ_{1}, ℬ_{2}, \dots, ℬ_{m}$ , the Viterbi variables ${vit}_{ℋ_{1}} (y_{ℬ_{1}}), {vit}_{ℋ_{2}} (y_{ℬ_{2}}), \dots, {vit}_{ℋ_{m}} (y_{ℬ_{m}})$ are defined as

\begin{array}{l} {vit}_{ℋ_{μ}} (y_{ℬ_{μ}}) & : = max_{y_{ℋ_{μ} ∖ ℬ_{μ}}} s_{ℋ_{μ}} (y_{ℋ_{μ} ∖ ℬ_{μ}} \cup y_{ℬ_{μ}} ∣ x) \\ \times (μ = 1, 2, \dots, m) . \end{array}

(6)

The Viterbi variables can be represented as a set of tables, one table of size $2^{| ℬ_{μ} |}$ for each boundary component $ℬ_{μ}$ . In the case where a boundary component is empty the table reduces to a single number.

At any stage, the algorithm stores the connected components $ℋ_{1}, ℋ_{2}$ , …, $ℋ_{m}$ of the current history set , the corresponding boundary components $ℬ_{1}, ℬ_{2}$ , …, $ℬ_{m}$ , and Viterbi variable values ${vit}_{ℋ_{1}} (y_{ℬ_{1}}), {vit}_{ℋ_{2}} (y_{ℬ_{2}}), \dots, {vit}_{ℋ_{m}} (y_{ℬ_{m}}),$ where $y_{ℬ_{1}}, y_{ℬ_{2}}, \dots, y_{ℬ_{m}}$ range over all possible assignments of corresponding boundary component. We store for every assignment on the boundary, a maximizing interior assignment. This assignment is the argmax of (6) but is determined with the dynamic programming recursions defined below. Let us call these data the current state of the algorithm. It mainly consists of record sets indexed by the boundary labelings.

At the very beginning the queue contains all positions, the history set and the corresponding boundary component are empty. As long as the position queue is not empty, the top element v is extracted and the state is updated as follows.

Adjoining v to the history set , there are two cases to distinguish. Either position v is not adjacent to any other position of any old boundary component (see Figure 2) or adjoining position v to results in adding it to some connected component of the old history set or even merging together two or more of them (see Figure 3).

In the first case we simply have to take over all the old connected components, boundary sets and Viterbi variables. Moreover, we perform the instructions

\begin{array}{l} ℋ_{m + 1} & \leftarrow ℬ_{m + 1} \leftarrow {v}, {vit}_{ℋ_{m + 1}} (N) \leftarrow s_{ℋ_{m + 1}} (N ∣ x), \\ {vit}_{ℋ_{m + 1}} (I) \leftarrow s_{ℋ_{m + 1}} (I ∣ x) . \end{array}

In the second case position v is adjacent to some boundary components, say $ℬ_{m^{'}}, ℬ_{m^{'} + 1}, \dots, ℬ_{m}$ . Then the old history components $ℋ_{m^{'}}, ℋ_{m^{'} + 1}, \dots, ℋ_{m}$ and the current position v are merged together:

ℋ_{tmp} \leftarrow ℋ_{m^{'}} \cup ℋ_{m^{'} + 1} \cup \dots \cup ℋ_{m} \cup {v} .

The other history set components and corresponding Viterbi variables are not affected.

For μ=m^′,m^′+1,…,m, let $R_{μ} \subseteq ℬ_{μ}$ be the set of all positions out of $ℬ_{μ}$ that are no longer boundary nodes after having adjoined v to the history set. The nodes in $R_{μ}$ are removed from the boundary $ℬ_{μ}$ after the iteration. Let ${\tilde{ℬ}}_{μ}$ be the complement of $R_{μ}$ in $ℬ_{μ}$ . By inspecting the edges incident to the current position v, all these sets can be computed in linear time.

The new boundary set $ℬ_{tmp}$ is then either ${\tilde{ℬ}}_{m^{'}} \cup {\tilde{ℬ}}_{m^{'} + 1} \cup \dots \cup {\tilde{ℬ}}_{m}$ or ${\tilde{ℬ}}_{m^{'}} \cup {\tilde{ℬ}}_{m^{'} + 1} \cup \dots \cup {\tilde{ℬ}}_{m} \cup {v}$ , where it can be checked in linear time whether or not v is a new boundary position.

We are now in a position to calculate the new Viterbi variables ${vit}_{ℋ_{tmp}} (y_{ℬ_{tmp}})$ , where $y_{ℬ_{tmp}}$ ranges over all assignments of the new boundary set $ℬ_{tmp}$ .

If $v \notin ℬ_{tmp}$ then

\begin{array}{l} {vit}_{ℋ_{tmp}} (y_{ℬ_{tmp}}) & \leftarrow max_{y_{v}} \{Φ_{v} (y_{v}, x) + \sum_{μ = m^{'}}^{m} max_{y_{R_{μ}}} \\ \times ({vit}_{ℋ_{μ}} y_{{\tilde{ℬ}}_{μ}} \cup y_{R_{μ}} + \sum_{\begin{matrix} w \in ℬ_{μ} \\ {v, w} \in E \end{matrix}} Φ_{v, w} (y_{v}, y_{w}, x))\} . \end{array}

Here, any assignment of a node set is assumed to implicitly define assignments for any subset thereof. Figure 4 illustrates this case of the recursion step. If, however, $v \in ℬ_{tmp}$ , then

\begin{array}{l} {vit}_{ℋ_{tmp}} (y_{ℬ_{tmp}}) \leftarrow Φ_{v} (y_{v}, x) & + \sum_{\begin{matrix} w \in {\tilde{ℬ}}_{μ^{'}} \cup \dots \cup {\tilde{ℬ}}_{μ} \\ (v, w) \in E \end{matrix}} Φ_{v, w} (y_{v}, y_{w}, x) \\ + \sum_{μ = m^{'}}^{m} \{max_{y_{R_{μ}}} ({vit}_{ℋ_{μ}} (y_{{\tilde{ℬ}}_{μ}} \cup y_{R_{μ}}) \\ + \sum_{\begin{matrix} w \in R_{μ} \\ (v, w) \in E \end{matrix}} Φ_{v, w} (y_{v}, y_{w}, x))\} . \end{array}

Finally, the interior labeling is stored, where the maximum is attained. The algorithm terminates after the last node v from has been processed. In the typical case, where the graph is connected, at termination $m = 1, ℋ_{1} = V, ℬ_{1} = \emptyset$ .

The running time of the algorithm is $O (n 2^{b})$ , where b is the size of the largest boundary set and n is the number of surface residues. We call this algorithm generalized Viterbi algorithm as for the case of a graph that is a linear chain 1−2−3−⋯−n of nodes using the node order 1,2,…,n the Viterbi variables we define are the same as in the standard Viterbi algorithm for HMMs. In the case of a graph that is a tree, this algorithm specializes to the Fitch algorithm or an argmax-version of Felsenstein’s pruning algorithm when a leaf-to-root node order is chosen after rooting the tree at an arbitrary node. In both special cases the boundary sets always have size at most 1. The tree example also motivate the use of several history sets at the same time: using a single history set only, one would not be able to achieve a linear running time on trees.

A heuristic based on the generalized Viterbi algorithm

First, it is vital for our generalized Viterbi algorithm to keep the size of the boundary sets small. A good position order is here of great importance. The algorithm starts by choosing a vertex of minimal degree. When determining the next position to be dequeued, the algorithm selects a boundary node such that the number of incident edges leading to nodes not belonging to any current history set is minimal. In an arbitrarily chosen order these nodes are dequeued next.

Second, the space demand is reduced by restricting the number of boundary labelings admitted. Starting from the available labelings of the current history set, the percentage of the reachable boundary labelings of the successor history that will be discarded is calculated. Then the corresponding percentile is estimated. To this end, a sufficiently large sample of possible labelings of the new boundary set is drawn, the Viterbi variables are computed, and the corresponding sample percentile is taken. Finally, only those boundary labelings of the new history set are retained whose Viterbi variables exceed this percentile.

That way we compute near-optimal solutions good enough for our purposes within feasible computation time.

Piecewise training for pCRFs

Let

D : = ((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}))

be the independent identically distributed training sample. For every μ=1,2,…,m, let $V_{μ}$ and $E_{μ}$ be the set of positions and edges in the neighborhood graph associated with x_μ, let $n_{μ} = | V_{μ} |$ be the number of positions of the μ-th training example and let ${I, N}^{n_{μ}}$ be the set of all possible label sequences of this graph.

This data set is unbalanced as there are many more non-interface positions as interface positions. As customary for other machine learning approaches such as support vector machines and artificial neural networks [28], we here manipulated the ratio of positive and negative example positions for training in order to obtain reasonable results.

We have amplified the influence of the positive examples, rather than selecting various sets of training data by deleting negative ones as done in [28].

Let ν_I, ν_N, ν_II and ν_NN be the number of interface positions, the number of non-interface positions, the number of interface-interface edges, and the number of non-interface-non-interface edges in , respectively. Then we define the following two amplifier functions for all positions i and for all edges {i,j} of the m neighborhood graphs resulting from the training data .

\begin{array}{l} η_{1} (i) : & = \{\begin{matrix} \frac{ν_{N}}{ν_{I}} - 1 if y_{i} = I; \\ 0 if y_{i} = N . \end{matrix} \\ η_{2} (i, j) & : = \{\begin{matrix} \frac{ν_{NN}}{ν_{II}} - 1 if y_{i} = y_{j} = I; \\ \frac{ν_{N}}{ν_{I}} - 1 if y_{i} \neq y_{j}; \\ 0 if y_{i} = y_{j} = N . \end{matrix} \end{array}

To uniformly govern the influence of the amplifiers, we introduce an amplifier control parameter η₃∈ [ 0,1].

We set up our two log-likelihood objective function by

\begin{array}{l} ℓ (λ^{(s)}, η_{3}) : & = \sum_{μ = 1}^{m} \sum_{i \in V_{μ}} (1 + η_{3} η_{1} (i)) \sum_{k = 1}^{K_{1}} α_{k} ψ_{k} (i, y_{i}^{(μ)}, x^{(μ)}) \\ + \sum_{μ = 1}^{m} \sum_{{i, j} \in E_{μ}} (1 + η_{3} η_{2} (i, j)) \sum_{k = 1}^{K_{2}} β_{k} ϕ_{k} \\ \times (i, j, y_{i}^{(μ)}, y_{j}^{(μ)}, x^{(μ)}) - \sum_{μ = 1}^{m} ln Z (x^{(μ)}, η_{3}), \end{array}

where ideally for each μ=1,2,…,m

\begin{array}{l} Z (x^{(μ)}, η_{3}) & : = \sum_{y \in {I, N}^{n_{μ}}} exp (\sum_{i \in V_{μ}} (1 + η_{3} η_{1} (i)) \sum_{k = 1}^{K_{1}} α_{k} ψ_{k} (i, y_{i}, x^{(μ)}) \\ + \sum_{{i, j} \in E_{μ}} (1 + η_{3} η_{2} (i, j)) \sum_{k = 1}^{K_{2}} β_{k} ϕ_{k} (i, j, y_{i}, y_{j}, x^{(μ)})) \end{array}

is the training-instance-specific normalization factor.

Unfortunately, maximizing this objective function in general is algorithmically intractable. Taking pattern from Sutton et al.[35] who introduced what they called piecewise training, we deal with this problem by disentangling the labels of nodes and edges. For μ=1,2,…,m, a non-coherent labeling $y \in {I, N}^{V_{μ} \times E_{μ}}$ of the neighborhood graph $(V_{μ}, E_{μ})$ is any mapping that assigns to every position $v \in V_{μ}$ and every edge $e \in E_{μ}$ a label y_v∈{I,N} and a pair of labels y_e∈{I,N}², respectively.

We then replace Z(x^(μ),η₃) by

\begin{array}{l} \tilde{Z} (x^{(μ)}, η_{3}) & : = \sum_{y \in {I, N}^{V_{μ} \times E_{μ}}} exp (\sum_{v \in V_{μ}} (1 + η_{3} η_{1} (v)) \sum_{k = 1}^{K_{1}} α_{k} ψ_{k} \\ \times (v, y_{v}, x^{(μ)}) + \sum_{e \in E_{μ}} (1 + η_{3} η_{2} (e)) \sum_{k = 1}^{K_{2}} β_{k} ϕ_{k} \\ \times (e, y_{e}, x^{(μ)})) \end{array}

as normalization factor. This makes the optimization problem computationally feasible.

The L-BFGS method [38] is used to solve it. That way we obtain the coefficient vectors α and β (see Equations 3), which depend on the amplifier control parameter η₃∈ [ 0,1].

To mitigate the negative consequences of disentanglement, we use a correction factor δ≥1. For any characteristics D and ι=0,1,…,γ−1, the weights of the bases edge features $ϕ_{I, I, ι}^{(D)}$ and $ϕ_{N, N, ι}^{(D)}$ (see Equation 5) are all multiplied by δ. Thus a change in classification along an edge is additionally penalized. The correction factor δ is set best between 1.15 and 1.25.

For our implementation of the training, we used the Java CRF package from Sunita Sarawagi at http://crf.sourceforge.net/.

Results and discussion

In this section we demonstrate effectiveness of our pCRF-based protein surface model to enhance residue-wise score-based predictions of protein-protein interfaces. For the sake of ensuring reliability of the methods we used three data sets. The first one is PlaneDimers due to Zellner et al.[25], the second one is the list of 1276 two-chain-proteins published by Keskin et al.[39], which was used by Liet al.[28] to test their linear-chain CRF. Third, we used a non-redundant data set containing 22604 unique interface structures very recently compiled by Cukuroglu et al. and published in [40].

The data set PlaneDimers is less known than the data due to Keskin et al.. It consists of redundancy-free homodimers with flat protein-protein interfaces. Zellner et al.[25] developed an SVM, called PresCont, that assigns to each residue on the protein surface a score between 0 and 1, which we refer to as PresCont score in the sequel. The larger the score, the more likely the residue belongs to the interface. Zellner et al. made the prediction by thresholding the score. The PresCont server and the data list PlaneDimers are publicly available (see http://www-bioinf.uni-regensburg.de/).

In the first subsection we describe two sets of experiments performed with synthetic data, one on PlaneDimers[25], the other one on the list published by Keskin et al.[39]. In both cases we independently assign to each surface position a random score drawn according to two different parametrized sequences of β-distributions Beta(α_I(ς)β_I(ς)) and Beta(α_N(ς)β_N(ς)), one for the interface sites determined by the reference labeling, the other one for the non-interface positions. The parametrized values α_I(ς), α_N(ς), β_I(ς) and β_N(ς) determining the two sequences of distributions are chosen such that the following conditions are satisfied. The mean values e_I>e_N are the average PresCont scores on interface sites and non-interface sites of all chains from PlaneDimers. The variances $σ_{I}^{2}$ and $σ_{N}^{2}$ are equal to $σ_{I, 0}^{2} ς$ and $σ_{N . 0}^{2} ς$ , where $σ_{I, 0}^{2}$ and $σ_{N . 0}^{2}$ are the corresponding variances of the PresCont score, and ς∈{0.8,0.9,1.0,1.1,1.2} models the precision of the synthetic score. The deciding feature of all these distributions is that they are unimodal. The result of the subsection is that enhancement works for unimodal score distributions.

The second subsection is about a synthetic data experiment on a new data set due to Cukuroglu [40]. Here we follow the line of the first subsection except for the fact that we restrict ourselves to signal precision ς=1.0.

In the third subsection we study the PresCont scores for two-chain protein complexes from the data set PlaneDimers. According to Figure 5, the PresCont score for non-interface residues is far from being unimodal. However, if one restrict oneself to the part above a threshold in the neighborhood of 0.5 and larger, one may ask whether enhancement restricted to that domain will works. The subsection answers this question in the affirmative. Having chosen a threshold as described above, one can improve the classification with respect to this threshold as follows. Take over the prediction for scores below the threshold and reclassify the residues the scores of which are above by means of the pCRF-based enhancer.

In general, observations x could encompass a PDB file, which in particular determines the 3D-structure of the protein, together with an MSA that models evolutionary aspects. In our case an observation solely consists of the PresCont score sequence or of the sequence of synthetic scores for the surface residues. Formally, every observation x is equal to a vector (ζ₁,ζ₂,…,ζ_n)∈ [ 0,1]ⁿ.

There are several neighborhood notions for residues, surface/core definitions and interface determinations in the literature. When studying the data set PlaneDimers, we follow [25]. In the case of the list due to Keskin et al.[39], the definitions according to [28] are used. Finally, when studying complexes taken from the data set published in [40], we take the following definitions. The RASA value of a surface residue is at least 15% (see [28]). Two residues are defined as contacting if the distance between any two of their atoms is less than the sum of the corresponding van der Waals radii plus 0.5 Å (see [40]).

Anyway, according to Keskin et al.[39] we define the distance of two residues on one and the same chain as the distance of their major carbon atoms. We then say that one residue is nearby another residue, if they are at distance below 6 Å. (Note that usually residues adjacent on backbone are at distance of less than or equal to 3.5 Å). This definition in turn is the basis of the neighborhood graph $G = (V, E)$ underlying the pCRF. Two surface positions are joined together by an undirected edge if and only if the corresponding residues are nearby ones.

Our pCRF-based enhancer utilizes one position characteristic and two edge characteristics on the basis of the standard step function method explained in the Methods section. If x=(ζ₁,ζ₂,…,ζ_n)∈ [ 0,1]ⁿ is the observation associated with the protein under study, and if $G = (V, E)$ is the neighborhood graph, then for every position $i \in V$ and every edge $(i, j) \in E$ we set

C (i, x) : = ζ_{i} D_{1} (i, j, x) : = max \{ζ_{i}, ζ_{j}\} D_{2} (i, j, x) : = |ζ_{i} - ζ_{j}| .

To enhance predictions obtained by thresholding, solely information coming from the residue neighborhood relations on the surface is additionally used.

In order to be able to calculate the performance measure of area under the ROC curve (AUC) for our pCRF-based enhancer on synthetic scores, we proceed as follows. For each edge ${i, j} \in E$ , we replace the local feature value Φ_i,j(I,I,x) by κ Φ_i,j(I,I,x), where κ∈(0,∞).

We enhance residue-wise score-based predictors only on the protein surface. In our synthetic data experiments there is no predictor available for core residues. For proteins taken from the data list published by Keskin et al.[39] it happens that interface sites belong to the core. That is why we use what we call Surface AUC Ratio Γ of the enhancer as our performance measure for our synthetic data experiments.

Γ : = \frac{AUC referred to the protein surface of the enhancer}{AUC referred to the protein surface of the residue-wise score-based threshold predictor} .

If Γ is greater than 1, the enhancement was successful. The larger Γ, the greater success.

To estimate performance measures, we applied 5-fold cross-validation experiments.

A fully built-out pCRF-based tool box for modeling protein surfaces needs to comprise all the standard algorithms as e.g. forward-backward techniques, marginalization and posterior decoding known for HMMs and linear-chain CRFs. To begin with, in the fourth subsection we explain how to put a variant form of the forward algorithm and posterior decoding for pCRFs into practice.

Simulating unimodal scores of various precisions

We estimated means e_I and e_N and variances $σ_{0, I}^{2}$ and $σ_{0, N}^{2}$ of the PresCont score on interface sites and noninterface positions of PlaneDimers, respectively, as follows.

\begin{matrix} ê_{I} & = 0.61488 ê_{N} = 0.40590 \\ {\hat{σ}}_{0, I}^{2} & = 0.03991 {\hat{σ}}_{0, N}^{2} = 0.04006 \end{matrix}

(7)

We randomly chose 120 instances under the uniform distribution from the data set published by Keskin et al.[39] to perform our experiments. Let us refer to this set as KL-subset in the sequel. (It is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html).

Zellner et al.[25] used the following determinations. A residue is defined to be part of the protein surface, if its relative solvent-accessible surface area is at least 5%[17]. A surface residue is said to constitute an inter-facial contact, if there exists at least one atom of this residue which has a van-der-Waals-sphere at a distance of at most 0.5 Å from the van-der-Waals sphere to any atom from a partner chain residue [39].

Based on [3, 12, 15, 20, 41], Li et al.[28] assume an inter-facial contact of a residue on a chain is assumed to be there, if any heavy atom of this residue is at distance of at most 5 Å from any heavy atom from a partner chain. The relative solvent-accessible surface area of surface residue is at least 15%.

We independently assigned to each interface surface residue of the two data sets a random score between zero and one according to the β-distribution Beta(α_I(ς)β_I(ς)), and to every non-interface surface residue a score according to Beta(α_N(ς)β_N(ς)), where the score precision ς satisfies

ς \in {0.8, 0.9, 1.0, 1.1, 1.2},

(8)

and the parameters α_I(ς),β_I(ς),α_N(ς),β_N(ς) were chosen such that

\begin{array}{l} ê_{I} & = \frac{α_{I} (ς)}{α_{I} (ς) + β_{I} (ς)} \\ {\hat{σ}}_{0, I}^{2} ς & = \frac{α_{I} (ς) β_{I} (ς)}{{(α_{I} (ς) + β_{I} (ς))}^{2} (α_{I} (ς) + β_{I} (ς) + 1)} \end{array}

(9)

\begin{array}{l} ê_{N} & = \frac{α_{N} (ς)}{α_{N} (ς) + β_{N} (ς)} \\ {\hat{σ}}_{0, N}^{2} ς & = \frac{α_{N} (ς) β_{N} (ς)}{{(α_{N} (ς) + β_{N} (ς))}^{2} (α_{N} (ς) + β_{I} (ς) + 1)} \end{array}

(10)

The Surface AUC Ratios of the enhancer compared with the threshold predictor on PlaneDimers and the KL-subset are displayed in Table 1. There is an improvement of 8.4%−9.3% on PlaneDimers and of 3.2%−5.0% on the KL-subset.

Table 1 Classification results on PlaneDimers and the KL-subset, where the β -distributions according to which the synthetic scores were drawn are defined by Equations 7, 8, 9 and 10

Full size table

Moreover, we compared individual classification results obtained by thresholding the scores with pCRF-based enhanced predictions. Because of the fact that the specificity of the threshold predictor can be easily changed by manipulating the threshold, we proceeded as follows. For every score precision, the pCRF-based enhancer has a well-defined specificity referred to the surface residues. We then chose the threshold such that the specificity of the threshold predictor is close to that of the enhancer. The results are shown in Table 2. The sensitivity is increased by 53%−67% on the data set PlaneDimers and by 14%−22% on the KL-subset.

Table 2 Comparing the enhancer with the threshold classifier of approximately equal specificity on synthetic scores assigned to surface residues of protein complexes taken from the data set PlaneDimers and the KL-subset

Full size table

Table 1 and Table 2 justify the following conclusion. Enhancing the threshold prediction by our pCRF works provided that the distributions of the interface scores as well as the non-interface scores are unimodal. The enhancement for the data set PlaneDimers is larger than for the KL-subset. This might be caused by the plain interface geometry of the complexes taken from PlaneDimers.

Utilizing the new data set due to Cukuroglu [40]

As in the case of the KL-subset, we randomly chose 60 dimers. We refer to the resulting list as CGNK-subset. Having assigned synthetic scores according to Equations 7, 9 and 10, where ς=1.0, we compared individual classification results obtained by thresholding the scores with pCRF-based enhanced predictions in exactly the same way as we did for the KL-subset. The results are shown in Table 3. The sensitivity is increased by 22%.

Table 3 Comparing the enhancer with the threshold classifier of approximately equal specificity on synthetic scores assigned to surface residues of protein complexes taken from the CGNK-subset

Full size table

A main finding of Cukuroglu [40] relevant to protein-protein interface prediction is, that the average interface RASA value is greater than 40%. Since our method is designed to improve performance of a given residue-wise predictor, using this result is not in the scope of this paper. However, a CRF-based predictor integrating features for cliques of size greater than 2 is not beyond the range of current algorithmic capabilities. In such a model a feature set that discretizes the mean RASA value of cliques is promising.

Enhancing the PresCont server prediction on PlaneDimers

For the sake of completeness, we shortly review the residue characteristics used by PresCont.

Relative solvent-accessible surface area

For any residue a, the solvent-accessible surface area a s a(a) can be computed by e.g. the software library BALL [42]. Most of the classifiers known from the literature utilize this characteristic (see [43]). For PresCont the relative solvent-accessible surface area according to

rasa (a) : = \frac{asa (a)}{{asa}_{max} (a)}

(11)

is taken into operation, where a s a_max(a) is the maximally possible accessible surface area of residue a[44].

Hydrophobicity

Many interfaces possess a hydrophobic core surrounded by a ring of polar residues [45, 46]. In order to reduce noise, in [25] the contribution of hydrophobic patches rather then the influence of individual residues is utilized.

Residue conservation

Measures of this type utilized in [25] are the Shannon entropy and the relative Shannon entropy of empirical residue distributions in MSA columns. As an alternative, empirical expectations of BLOSUM-based similarities are taken for them.

Scores of local neighborhoods

They are evaluated by means of log-odd ratios of neighboring residue pair frequencies in interfaces as opposed to residue pair frequencies on complementary protein surface areas. The resulting scores are averaged both over the neighborhood of the positions under study and the rows of the MSA associated with the protein.

On the basis of Figure 5 we enhanced PresCont for thresholds θ∈ [ 0.500,0.625]. The decisive factor for this choice is that the PresCont score distributions for interface sites as well as non-interface positions above θ are “sufficiently close to” unimodal distributions. For every such θ, we set all scores less than or equal to θ to zero and then left the classification of all surface residues to the pCRF modified as follows. The residues of score zero are not taken into account when it comes to discretizing the protein characteristics (see Equations 4 and 5). Let us call this enhancing above θ.

To evaluate improvements we proceeded as when compiling Table 2. For every threshold θ under consideration another threshold θ^′ was chosen such that thresholding at θ^′ has the same specificity as enhancing above θ. The results are displayed in Table 4 and visualized for an individual protein by Figure 6. According to Table 4 the increase in sensitivity ranges from 4% to 7%. The true-positive predictions on the surface of the protein with PDB-Entry 1QM4 are compared in Figure 6, where again the specificity of the two classifiers is the same.

Table 4 Enhancing above various thresholds on PlaneDimers , where PresCont ’s threshold was chosen such that the specificity approximately equals that of enhancing

Full size table

Discussing posterior decoding

As in the case of linear-chain CRFs, the generalized Viterbi algorithm can be transformed into a variant form of the forward algorithm. It might be the case that the following additional problem arises.

Let v₁,v₂,…,v_n be the ordering in which the positions of are traversed by the algorithm, and let $\hat{I}$ denote the set of position indices i<n such that v_i is not an element of the boundary $ℬ^{(i)}$ of the history set $ℋ^{(i)}$ at stage i. If $\hat{I}$ is not empty, we encounter an obstacle when it comes to sampling label sequences. For $i \in \hat{I}$ , position v_i is not labeled in the course of the sampling procedure. That is why we augment the neighborhood graph so that those positions no longer exist, all predictions remain unchanged, and the order of magnitude of the running time is not increased. To this end, we complement the ordering v₁,v₂,…,v_n as follows. For every $i \in \hat{I}$ , we insert a new node ${\hat{v}}_{i}$ between v_i and v_i+1. Having extended the neighborhood graph by these nodes not being associated with residue positions of the protein under study and by new edges ${v_{i}, {\hat{v}}_{i}} (i \in \hat{I})$ , where for $i \in \hat{I}$ and $y_{0} . y_{1}, y_{2} \in {N, I} Φ_{{\hat{v}}_{i}} (y_{0}, x) = Φ_{{v_{i}, {\hat{v}}_{i}}} (y_{1}, y_{2}, x) = 0,$ the above mentioned obstacle is eliminated without any influence on the prediction and the order of magnitude of the running time.

Proceeding now in a way analogous to the classical case, in every formula that is a building block of the generalized Viterbi algorithm the following two steps of replacement need to be performed.

First, for every position $i \in V$ , every edge $(i, j) \in E$ , every label y₀∈{I,N}, and every label pair (y₁,y₂)∈{I,N}², we replace Φ_i(y₀,x) with exp(Φ_i(y₀,x)), and Φ_{i,j}(y₁,y₂,x) with exp(Φ_{i,j}(y₁,y₂,x)).

Second, we replace sums with products and then maxima with sums.

Thus we obtain as analogues of the Viterbi variables ${vit}_{ℋ_{μ}} (y_{ℬ_{μ}})$ defined by Equation 6 what we call component forward variables ${cf}_{ℋ_{μ}} (y_{ℬ_{μ}})$ .

If $ℋ_{1}^{(i)}, ℋ_{2}^{(i)}, \dots, ℋ_{m_{i}}^{(i)}$ and $ℬ_{1}^{(i)}, ℬ_{2}^{(i)}, \dots, ℬ_{m_{i}}^{(i)}$ are the connected components of the history set $ℋ^{(i)}$ and the corresponding boundary set $ℬ^{(i)}$ at stage i∈{1,2,…,n}, respectively, then the forward variable at stage i with respect to a boundary assignment $y_{ℬ^{(i)}}$ is defined as

f_{i} (y_{ℬ^{(i)}}) : = \prod_{j = 1}^{m_{i}} {cf}_{ℋ_{j}^{(i)}} (y_{ℬ_{j}^{(i)}}) .

For any assignment $y_{ℬ^{(i)}}$ (i>1), the forward variable $f_{i} (y_{ℬ^{(i)}})$ is a nontrivial linear combination of forward variables $f_{i - 1} (y_{ℬ^{(i - 1)}})$ , where $y_{ℬ^{(i - 1)}}$ ranges over some assignments of the boundary set $ℬ^{(i - 1)}$ at stage i−1. Analogous to the linear-chain case, a random backward walk through a state graph, with all possible assignments $y_{ℬ^{(i)}}$ (i=n,n−1,…,1) being the set of nodes, results in a random labeling of the positions, where each labeling is drawn with its posterior probability.

This sampling technique allows the efficient calculation of posterior probabilities at nodes and edges in a straightforward manner.

Conclusions

Residue-wise score-based threshold predictors of protein-protein interaction sites assign to each residue of the protein under study a score. The classification is then made by thresholding the score. In case of using probabilistic data models, the parameters of the threshold predictor have been learned on a training data set in advance.

We have demonstrated that such threshold predictors can be improved by pCRF-based enhancers given the shape of the interface surface score distribution and the non-interface surface score distribution with respect to the training set resemble the shape of unimodal distributions. Besides the surface residue scores, only the spatial neighborhood structure between the surface residues of the protein under study is taken into account. Thus, the improvement can be attributed to our model. In addition to the precision of the scores, the amount of improvement depends on the 3D-complexity of the interfaces to be predicted. To this end, three sets of experiments with synthetic surface residue scores for protein complexes randomly chosen from the data set PlaneDimers compiled by Zellner et al.[25] and from the lists published by Keskin et al.[39] and Cukuroglu et al.[40].

The enhancement is structurally based on the following model property of pCRFs in contrast to residue-wise predictors. Though the scores of near-by residues may be correlated, labeling a position as interface or non-interface by thresholding the score does not influence the classification of its neighbors. When using pCRFs, this is the case.

The pCRF-based enhancer is also applicable, if the score distributions are only unimodal over a certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction of the PresCont server devised by Zellner et al. on PlaneDimers[25].

The prediction is made on grounds of a generalized Viterbi inference heuristic. As for training, we developed a piecewise training procedure for pCRFs, where the enhancer needs to be trained on data originating from the same source as the training data of the threshold predictor to be improved.

A prototypical implementation of our pCRF-based method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html.

References

Sowa ME, He W, Slep KC, Kercher MA, Lichtarge O, Wensel TG: Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nat Struct Biol. 2001, 8: 234-237.
Article PubMed CAS Google Scholar
Zhou HX: Improving the understanding of human genetic diseases through predictions of protein structures and protein-protein interaction sites. Curr Med Chem. 2004, 11: 539-549.
Article PubMed CAS Google Scholar
Zhou HX, Qin S: Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics. 2007, 23 (17): 2203-2209.
Article PubMed CAS Google Scholar
Li JJ, Huang DS, Wang B, Chen P: Identifying protein-protein interfacial residues in heterocomplexes using residue conservation scores. Int J Biol Macromol. 2006, 38 (3–5): 241-247.
Article PubMed CAS Google Scholar
Kufareva I, Budagyan L, Raush E, Totrov M, Abagyan R: PIER: protein interface recognition for structural proteomics. Proteins. 2007, 67 (2): 400-417.
Article PubMed CAS Google Scholar
Burgoyne NJ, Jackson RM: Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces. Bioinformatics. 2006, 22 (11): 1335-1342.
Article PubMed CAS Google Scholar
de Vries SJ, van Dijk AD, Bonvin AM: WHISCY: what information does surface conservation yield? Application to data driven docking. Proteins. 2006, 63 (3): 479-489.
Article PubMed CAS Google Scholar
Hoskins J, Lovell S, Blundell TL: An algorithm for predicting protein-protein interaction sites: abnormally exposed amino acid residues and secondary structure elements. Protein Sci. 2006, 15 (5): 1017-1029.
Article PubMed Central PubMed CAS Google Scholar
Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res. 2005, 33 (Web-Server-Issue): 299-302.
Article Google Scholar
Liang SL, Zhang C, Liu S, Zhou Y: Protein binding site prediction using an empirical scoring function. Nucleic Acids Res. 2006, 34 (13): 3698-3707.
Article PubMed Central PubMed CAS Google Scholar
Murakami Y, Jones S: SHARP²: protein-protein interaction predictions using patch analysis. Bioinformatics. 2006, 22 (14): 1794-1795.
Article PubMed CAS Google Scholar
Zhou HX, Shan Y: Prediction of protein interaction sites from sequence profile and residue neighbor list. Protein Struct Funct Genet. 2001, 44: 336-243.
Article CAS Google Scholar
Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein-protein interaction sites in heterocomplexes with neural networks. Eur J Biochem. 2002, 269:
Google Scholar
Ofran Y, Rost B: Predicted protein-protein interaction sites from local sequence information. FEBS Lett. 2003, 544: 236-239.
Article PubMed CAS Google Scholar
Chen H, Zhou HX: Prediction of interface residues in protein-protein complexes by a consensus neural network: test against NMR data. Protein Struct Funct Genet. 2005, 61: 21-35.
Article CAS Google Scholar
Ofran Y, Rost B: ISIS: interaction sites identified from sequence. Bioinformatics. 2007, 23 (2): 13-16.
Article Google Scholar
Porollo A, Meller J: Prediction-based fingerprints of protein-protein interactions. Protein Struct Funct Genet. 2007, 66: 630-645.
Article CAS Google Scholar
Bordner A, Abagyan R: Statistical analysis and prediction of protein-protein interfaces. Protein Struct Funct Genet. 2005, 60: 353-366.
Article CAS Google Scholar
Bradford J, Westhead D: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics. 2005, 21 (8): 1487-1494.
Article PubMed CAS Google Scholar
Chung JL, Wang W, Bourne PE: Exploiting sequence and structure homologs to identify protein-protein binding sites. Proteins. 2006, 62: 630-640.
Article PubMed CAS Google Scholar
Koike A, Takagi T: Prediction of protein-protein interaction sites using support vector machines. Protein Eng Design Selec. 2004, 17 (2): 165-173.
Article CAS Google Scholar
Res I, Mihalek I, Lichtarge O: An evolution-based classifier for prediction of protein interfaces without using protein structures. Bioinformatics. 2005, 21 (10): 2496-2501.
Article PubMed CAS Google Scholar
Wang B, Wong HS, Huang DS: Inferring protein-protein interaction sites using residue conservation and evolutionary information. Protein Pept Lett. 2006, 13 (10): 999-1005.
Article PubMed CAS Google Scholar
Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. 2006, 580 (2): 380-384.
Article PubMed CAS Google Scholar
Zellner H, Staudigel M, Trenner M, Bittkowski M, Wolowski V, Icking M, Merkl R: PresCont: Predicting Protein-Protein Interfaces Utilizing Four Residue Properties. Proteins: Struct Funct Bioinformatics. 2011, 80 (1): 154-168.
Article Google Scholar
Neuvirth H, Raz R, Schreiber G: ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol. 2004, 338 (1): 181-199.
Article PubMed CAS Google Scholar
Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR: Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol. 2006, 362 (2): 365-386.
Article PubMed CAS Google Scholar
Li MH, Lin L, Wang XL, Liu T: Protein-protein interaction site prediction based on conditional random fields. Bioinformatics. 2007, 23 (5): 597-604.
Article PubMed CAS Google Scholar
Hwang H, Vreven T, Weng Z: Binding interface prediction by combining protein-protein docking results. Proteins: Struct Funct Bioinformatics. 2013, [http://dx.doi.org/10.1002/prot.24354]
Google Scholar
Lafferty JD, McCallum A, Pereira FCN: Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning. Edited by: Brodley CE. 2001, San Francisco, CA, USA: Danyluk AP. Morgan Kaufmann Publishers Inc., 282-289. [http://dl.acm.org/citation.cfm?id=645530.655813]
Google Scholar
Sutton C, McCallum A: Introduction to Statistical Relational Learning. 2006, Cambridge, Massachusetts, USA: MIT Press, chap. An Introduction to Conditional Random Fields for Relational Learning
Google Scholar
McCallum A, Li W: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL ‘03. 2003, Stroudsburg: Association for Computational Linguistics, 188-191. [http://dx.doi.org/10.3115/1119176.1119206]
Chapter Google Scholar
Dietterich TG, Ashenfelter A, Bulatov Y: Training conditional random fields via gradient tree boosting. Proceedings of the Twenty-first International Conference on Machine Learning, Volume 69 of ACM International Conference Proceeding Series. Edited by: Brodley CE. 2004, New York, NY, USA: ACM, 28-28. [http://doi.acm.org/10.1145/1015330.1015428]
Google Scholar
Zhu J, Nie Z, Wen JR, Zhang B, Ma WY: 2D Conditional, Random Fields for Web information extraction. Proceedings of the 22Nd International Conference on Machine Learning, Volume 119 of ACM International Conference Proceeding Series. Edited by: Raedt LD, Wrobel S. 2005, New York, NY, USA: ACM, 1044-1051. [http://doi.acm.org/10.1145/1102351.1102483]
Google Scholar
Sutton C, McCallum A: Piecewise training of undirected models. UAI ’05, Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 26-29. 2005, AUAI Press, 568-575.
Google Scholar
McCallum A, Rohanimanesh K, Sutton C: Dynamic conditional random fields for jointly labeling multiple sequences. NIPS-2003 Workshop on Syntax, Semantics and Statistics. 2003
Google Scholar
Sha F, Pereira F: Shallow parsing with conditional random fields. 2003, [http://citeseer.ist.psu.edu/article/sha03shallow.html]
Chapter Google Scholar
Liu DC, Nocedal J: On the limited memory BFGS method for large scale optimization. Math Program. 1989, 45: 503-528.
Article Google Scholar
Keskin O, Tsai CJ, Wolfson H, Nussinov R: A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Sci. 2004, 13: 1043-1055.
Article PubMed Central PubMed CAS Google Scholar
Cukuroglu E, Gursoy A, Nussinov R, Keskin O: Non-redundant unique interface structures as templates for modeling protein interactions. PLoS ONE. 2014, 9: e86738-
Article PubMed Central PubMed Google Scholar
Rost B, Sander C: Conservation and prediction of solvent accessibility in protein families. Proteins: Struct Funct Bioinformatics. 1994, 20 (3): 216-226. [http://dx.doi.org/10.1002/prot.340200303]
Article CAS Google Scholar
Hildebrandt A, Dehof AK, Rurainski A, Bertsch A, Schumann M, Toussaint NC, Moll A, Stöckel D, Nickels S, Mueller SC, Lenhof HP, Kohlbacher O: BALL - biochemical algorithms library 1.3. BMC Bioinformatics. 2010, 11: 531-
Article PubMed Central PubMed Google Scholar
Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics. 2010, 11: 174-
Article PubMed Central PubMed Google Scholar
Miller S, Janin J, Lesk AM, Chothia1 C: Interior and surface of monomeric proteins. J Mol Biol. 1987, 196 (3): 641-656.
Article PubMed CAS Google Scholar
Larsen TA, Olson AJ, Goodsell DS: Morphology of protein–protein interfaces. Structure. 1998, 6 (4): 421-427.
Article PubMed CAS Google Scholar
Bouvier B, Grünberg R, Nilges M, Cazals F: Shelling the Voronoi interface of protein–protein complexes reveals patterns of residue conservation, dynamics, and composition. Proteins: Struct Funct Bioinformatics. 2009, 76 (3): 677-692.
Article CAS Google Scholar

Download references

Acknowledgements

Zhijie Dong acknowledges financial support by the Deutsche Forschungsgemeinschaft (Research Training Group 1023 “Identification in Mathematical Models: Synergy of Stochastic and Numerical Methods”). Moreover, we thank Rainer Merkl and Hermann Zellner from Regensburg, who supported us in gaining a deeper understanding of the PresCont service. Special thank goes to Moritz Manecke for a first implementation of the generalized Viterbi algorithm. Finally, we acknowledge with thanks the valuable suggestions and comments of the unknown referees.

Author information

Authors and Affiliations

Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, 37077, Göttingen, Germany
Zhijie Dong, Keyu Wang, Truong Khanh Linh Dang, Marlon Welter & Stephan Waack
Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
Mehmet Gültas
Institut für Mathematik und Informatik, Walther-Rathenau-Str. 47, 17487, Greifswald, Germany
Torsten Wierschin & Mario Stanke

Authors

Zhijie Dong
View author publications
You can also search for this author in PubMed Google Scholar
Keyu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Truong Khanh Linh Dang
View author publications
You can also search for this author in PubMed Google Scholar
Mehmet Gültas
View author publications
You can also search for this author in PubMed Google Scholar
Marlon Welter
View author publications
You can also search for this author in PubMed Google Scholar
Torsten Wierschin
View author publications
You can also search for this author in PubMed Google Scholar
Mario Stanke
View author publications
You can also search for this author in PubMed Google Scholar
Stephan Waack
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Stephan Waack.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The model was designed by ZD and KW, who also adapted piecewise training to pCRFs and performed the analysis of the first submission. The tool was designed and implemented by ZD and KW, who were supported by MW and TW. The algorithms were devised by MG, MS, MW and SW. The generalized Viterbi algorithm, however, was conceived by MS. The Web server was set up by MG. The analysis necessary for the revised version was performed by TKLD, who was supported by MW. The manuscript was drafted by ZD and KW, written by MS and SW and revised by TW. SW conceived the study. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Dong, Z., Wang, K., Linh Dang, T.K. et al. CRF-based models of protein surfaces improve protein-protein interaction site predictions. BMC Bioinformatics 15, 277 (2014). https://doi.org/10.1186/1471-2105-15-277

Download citation

Received: 11 October 2013
Accepted: 01 August 2014
Published: 13 August 2014
DOI: https://doi.org/10.1186/1471-2105-15-277

CRF-based models of protein surfaces improve protein-protein interaction site predictions

Abstract

Background

Results

Conclusions

Background

Residue-wise score-based prediction methods

Probabilistic methods

Notations

Methods

Using conditional random fields to model protein surfaces

Devising a generalized Viterbi algorithm for pCRFs

The generalized Viterbi algorithm

A heuristic based on the generalized Viterbi algorithm

Piecewise training for pCRFs

Results and discussion

Simulating unimodal scores of various precisions

Utilizing the new data set due to Cukuroglu [40]

Enhancing the PresCont server prediction on PlaneDimers

Relative solvent-accessible surface area

Hydrophobicity

Residue conservation

Scores of local neighborhoods

Discussing posterior decoding

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us