CRF-based models of protein surfaces improve protein-protein interaction site predictions

Background The identification of protein-protein interaction sites is a computationally challenging task and important for understanding the biology of protein complexes. There is a rich literature in this field. A broad class of approaches assign to each candidate residue a real-valued score that measures how likely it is that the residue belongs to the interface. The prediction is obtained by thresholding this score. Some probabilistic models classify the residues on the basis of the posterior probabilities. In this paper, we introduce pairwise conditional random fields (pCRFs) in which edges are not restricted to the backbone as in the case of linear-chain CRFs utilized by Li et al. (2007). In fact, any 3D-neighborhood relation can be modeled. On grounds of a generalized Viterbi inference algorithm and a piecewise training process for pCRFs, we demonstrate how to utilize pCRFs to enhance a given residue-wise score-based protein-protein interface predictor on the surface of the protein under study. The features of the pCRF are solely based on the interface predictions scores of the predictor the performance of which shall be improved. Results We performed three sets of experiments with synthetic scores assigned to the surface residues of proteins taken from the data set PlaneDimers compiled by Zellner et al. (2011), from the list published by Keskin et al. (2004) and from the very recent data set due to Cukuroglu et al. (2014). That way we demonstrated that our pCRF-based enhancer is effective given the interface residue score distribution and the non-interface residue score are unimodal. Moreover, the pCRF-based enhancer is also successfully applicable, if the distributions are only unimodal over a certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction of the PresCont server devised by Zellner et al. (2011) on PlaneDimers. Conclusions Our results strongly suggest that pCRFs form a methodological framework to improve residue-wise score-based protein-protein interface predictors given the scores are appropriately distributed. A prototypical implementation of our method is accessible at http://ppicrf.informatik.uni-goettingen.de/index.html.


Background
Protein-protein interactions are constitutive of almost every biological process. The ability to identify the residues that form the interaction sites of these complexes is necessary to understand them. In particular, it is the basis for new therapeutic approaches to treat diseases [1,2].
A great deal of work has been done on developing insilico prediction methods. As already observed by Zhou *Correspondence: waack@informatik.uni-goettingen.de 1 Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, 37077 Göttingen, Germany Full list of author information is available at the end of the article et al. [3], these methods can be subdivided with respect to the kind of mathematical foundation invoked and with respect to the features or characteristics of the protein used.

Probabilistic methods
Let X be the data relevant for a protein chain, where these data are assumed to stem from a random source thus obeying a random distribution. X, which alternatively is called the observation, typically includes the structure. The label sequence of the residues Y that classifies each individual residue either as interface or as non-interface is assumed to be random, too. Typically, probabilistic methods use the conditional probability distribution P (Y | X) to determine a classification y* of the residues of maximal posterior probability P (y* | x). Naive Bayesian methods [26], Bayesian network methods [27], hidden Markov models (HMMs) [26], and linear-chain Conditional Random Fields taking the backbone as underlying graphical structure [28] fall in this category. Using posterior decoding on the basis of the forward-backward algorithm, both HMMs and CRFs are residue-wise score-based prediction methods, where the binary decision is made by thresholding the posterior probabilities of classifying the residues as interface.

Notations
We use Latin uppercase letters when referring to random aspects of the objects denoted by them. In contrast, lowercase letters denote arbitrarily chosen but fixed objects. In this context boldface letters indicate vectors, the corresponding non-boldface letters their coefficients.
The vast majority of methods use the 3D structure of the target protein chain in form of a PDB file as input [4][5][6][7][8][9][10][11][12][13]15,[17][18][19][20][21][23][24][25]. However, a few methods are not requiring a 3D structure and rather use sequences only [14,16,22]. We here consider the problem with a given 3D structure of the target protein chain. Sequencebased input may include a multiple sequence alignment of related proteins from which, for example, sequence conservation can be inferred. When the 3D structure of an unbound binding partner is also available, proteinprotein docking methods can be applied. This has also been exploited to provide feedback from docking to the more specific problem of interface prediction [29]. We here consider the case where the binding partner's 3D structure is not given. Nor requires the presented method the sequence of the binding partner. Albeit, we tested on homodimers only as we here rather focus on our new method rather than on features or types of proteins. The protein features used for interface prediction in the literature are reviewed in the Methods section as far as we make use of them in this article.
Most of the current studies for predicting interaction sites of proteins that use a probabilistic method are restricted by treating the residues of the proteins as independent vertices. Li et al. have taken the backbone neighborhood into account thus modeling the protein as a sequence [28] using what can be called a line CRF or linear-chain CRF. The features they define on the label pair of two backbone neighbors have the effect of smoothing the predicted labels along the protein sequence. Decisive is, however, that they were the first who used conditional random fields (CRFs) for interface prediction. CRFs in turn have come into use for solving sequence labeling problems due to Lafferty et al. [30]. See [31] for an overview. From the mathematical point of view they take advantage of the fact that they model the conditional probability P (Y | X) rather than the joint probability P (Y, X). Recently there has been an explosion of interest in conditional random fields (CRFs) with successful applications. It has been shown that CRFs have the abilities for solving sequence labeling problems like part-of-speech tagging (POST) [32] and natural language processing [33]. Furthermore in the web extraction problem, in which the web-sites are modeled as two dimensional grid graphs, CRFs perform well [34]. One of their outstanding benefits over many other statistical models is that a CRF can easily describe the dependencies of observations.
As proteins are folded into three dimensional structures, spatial relationships create dependencies between residues. For example, we find on the test data described below that the correlation coefficient between spatial neighbors that are not also sequence neighbors (distance ≤ 3.5 Å) is 0.45. This is only slightly lower than the correlation coefficient between residues that are sequence neighbors (0.49). As there are more than three times as many spatial pairs of neighbors than sequence neighbors at this threshold it is reasonable from a modeling standpoint to use a model that respects all dependencies induced by spatial proximity, not only the dependencies induced by proximity along the backbone.
There are many papers using spatial neighborhood information of residues to predict-protein interaction sites (see e.g. [2,13,21,28]). However, the spatial information of proteins was only integrated into the feature functions, but not represented in the model. For probabilistic models, the difference between the two ways to integrate spacial information is that in previous models the label of the i-th residue Y i is conditional independent from the labels of other residues given data X and -in the case of linear CRFs or HMMs -given the labels of Y i−1 and Y i+1 . Even when neighborhood information is only used for spatial smoothing of the labels, the intuitive advantage over, say, an SVM classifier that uses spatial neighborhood in the features but classifies each residue independently, is that not-patch-like candidate labelings are explicitly punished. In contrast, such an independent classifier-approach may have a tendency to predict individual interface residues 'sprinkled' around the protein surface [28].
For this reason, a general CRF seems to be more suitable for the task. However, inference for general CRFs http://www.biomedcentral.com/1471-2105/15/277 is intractable. In this paper, pairwise conditional random fields (pCRFs) are utilized. Specializing general CRFs, only node cliques and edge cliques are taken into consideration in pCRFs. A pCRF retains most spatial information of proteins, can be specified with the same number of parameter as a line CRF and approximate inference remains feasible with the generalization of the Viterbi algorithm introduced here. Taking pattern from piecewise training methods [35], we disentangled the labels of nodes and edges to train the model.
In order to take advantage of a residue-wise scorebased predictor, we model the protein surface by means of a pCRF, where the observation is solely a sequence of surface residue scores between 0 and 1 output by the predictor. We then utilize a generalized Viterbi algorithm and piecewise training. The resulting tool tries to enhance the predictor chosen on the surface of the protein under study. It is the aim of this paper to demonstrate effectiveness of this approach provided that the interface residue scores and the non-interface residue scores are appropriately distributed.

Methods
We address the problem of improving residue-wise scorebased predictors for protein interface residues as a node labeling problem for undirected graphs using the model class of conditional random fields (CRFs). Lafferty et al. [30] were the first who applied CRFs to the problem of labeling sequence data. Li et al. [28] used line CRFs to address the interaction site prediction. They have the advantage that the Viterbi algorithm well-known from decoding HMMs can be used to efficiently infer the most likely labeling sequence. Very useful and illustrative presentations on CRFs are given in [31,32,36,37]. Above CRFbased models make the assumption that the label of one residue is conditionally independent of the labels of all other residues given the labels of the two adjacent residues in the protein sequence. To the best of our knowledge, we are the first to employ a graphical model that takes the spatial neighborhood of residues located on the protein surface into account.
This section is subdivided into three parts. We first explain how we model protein surfaces by pairwise CRFs. Then we introduce our new inference method. Finally, we elucidate our training method.

Using conditional random fields to model protein surfaces
For every protein under study that has n surface residues, a pair of random vectors (X, Y) is considered. The vector X is the observation that represents the knowledge about this protein that is utilized in the prediction, e.g. the 3D structure of the target protein and a multiple sequence alignment together with homologs.
The vector Y is a random sequence of length n over the alphabet {I, N} that labels the index set {1, 2, . . . , n}, which in turn is called the set of positions (of the surface residues). The label I represents interface residues, whereas the label N represents non-interface residues. {I, N} n is the set of all label sequences of length n over {I, N}. We will also call them assignments as the term 'label sequence' may lead to confusion when applied below to subsets of {1, 2, . . . , n} that are not contiguous sequences.
Let G = (V, E) be the neighborhood graph, where V = {1, 2, . . . , n} is the set of positions, E is the set of edges that typically results from an atom-distance-based neighborhood definition for positions. We assume for convenience in notation that G has no isolated nodes. Cases with isolated nodes could trivially be reduced to cases without isolated nodes. Let C be the set of G's cliques, which we refer to as node cliques. For a node clique c ∈ C and an assignment y we denote by y c the restriction of y to the positions belonging to the node clique c. For c = {i} and c = {i, j} we write y i and y i , y j rather than y {i} and y {i,j} .
The preceding notation is also used in the slightly more general case of partial label assignments to arbitrarily chosen subsets S of the set of positions V. Formally, let y S denote (i, y i ) | i ∈ S, y i ∈ {I, N} . Given two partial assignments y S 1 and y S 2 are identical on S 1 ∩S 2 , the union y S 1 ∪ y S 2 is well-defined.
The conditional distribution function of our pCRF (X, Y) with respect to the neighborhood graph G = (V, E) is defined as follows: where x and y are arbitrarily chosen instances of the random observation X and the random label sequence Y, respectively, c (y c , x) ∈ • (c ∈ C) is the feature of the CRF located at the node clique c (again i and i,j simplify notation for {i} and {i,j} ), and Z(x) is the observation-specific normalization factor defined by Let us call ln(Z(x)P(y | x)) the score of the label sequence y given the observation x.
A CRF is called a pairwise CRF (pCRF) if c ≡ 0, for all node cliques c larger than two. The remaining features i and i,j are referred to as node features http://www.biomedcentral.com/1471-2105/15/277 and edge features, respectively. Thus, every position i ∈ V and every edge (i, j) ∈ E is represented by the pair ( i (N, x), i (I, x)) and by the quadruplet Following [30], we assume moreover that each node feature and each edge feature is a sum of weighted base features. More precisely, for every position i ∈ V and every edge {i, j} ∈ E we assume representations where y ∈ {I, N} n and x is an observation. The two real vectors need to be calculated in a training phase.
In the most general sense, protein characteristics are real-valued evaluations of positions and pairs of adjacent positions (edges of the neighborhood graph), respectively, that are correlated with our position labeling problem. We use a standard step function technique to obtain base features from protein characteristics, rather than taking the raw values of the characteristics. To make our paper self-contained, let us describe this technique for short.
A protein characteristic depends on the observation and either a node or an edge. Each protein characteristic, such as e.g. the relative solvent-accessible surface area of a residue, is transformed into several binary features by binning, i.e. we distinguish only a few different cases rather than the whole range of the characteristic. Assuming the common case of real-valued characteristics, the bins are a partition of the reals into intervals. The use of this discretization allows to approximate any shape of dependency of the labels on the characteristics, rather than assuming a fixed shape such as linear or logarithmic.
From protein characteristics for positions to node features. We subdivide the range of the characteristics C into say γ intervals, where γ is at least two. Let s 1 < s 2 < . . . < s γ −1 be the corresponding interval boundaries. It is reasonable to take s ι as the ι/γ -quantile of the empirical distribution of C for non-interface residues, where C(i, x) ∈ (s 0 , s γ ]. Then we define for each position i ∈ V the following 2γ base features associated with the position characteristics C.
In both cases we set γ = 5.

Devising a generalized Viterbi algorithm for pCRFs
The problem of finding a most probable label sequence y * given an observation x is NP-hard for general pCRFs [31]. In this subsection we present a heuristic that approximately solves this problem.
To this end, we first devise an algorithm, which we call generalized Viterbi algorithm. It computes an optimal label sequence, where the posterior probability of y * given x is maximized. Unfortunately, its run-time is in too many cases not acceptable. That is why we transform it in a second step into a feasible, time-bounded approximation algorithm.

The generalized Viterbi algorithm
Let G = (V, E) be the neighborhood graph underlying the protein under study. For any assignment (label sequence) y and any subset V of V, let y V denote the partial assignment of y with respect to V . (This is in line with the notation y c (c a position clique) introduced earlier in this study).
If V 1 , V 2 , . . . , V r are pairwise disjoint position sets, the assignment for V 1 ∪V 2 ∪. . .∪V r canonically resulting from assignments Then the problem of determining a most probable label sequence y * given an observation x can be reformulated as This is the case, because it suffices to consider the score.
To put this into practice, we devised an algorithm we call generalized Viterbi. On the one hand, it is analogous to the classical Viterbi algorithm. On the other hand, there is a major difference. In our case there is no canonical order in which the positions of G are traversed. Having explained our algorithm for any order, we show how to http://www.biomedcentral.com/1471-2105/15/277 calculate a fairly effective one. In what follows, we assume that the positions not yet touched are held in a dynamic queue. Those positions having already left the queue form the history set H ⊆ V.
Assume that the subgraph of G induced by H has connected components H 1 , Figure 1 for an example.
For assignments The Viterbi variables can be represented as a set of tables, one . . , y B m range over all possible assignments of corresponding boundary component. We store for every assignment on the boundary, a maximizing interior assignment. This assignment is the argmax of (6) but is determined with the dynamic programming recursions defined below. Let us call these data the current state of the algorithm. It mainly consists of record sets indexed by the boundary labelings. At the very beginning the queue contains all positions, the history set H and the corresponding boundary component B are empty. As long as the position queue is not empty, the top element v is extracted and the state is updated as follows.
Adjoining v to the history set H, there are two cases to distinguish. Either position v is not adjacent to any other position of any old boundary component (see Figure 2) or adjoining position v to H results in adding it to some connected component of the old history set or even merging together two or more of them (see Figure 3).
In the first case we simply have to take over all the old connected components, boundary sets and Viterbi variables. Moreover, we perform the instructions In the second case position v is adjacent to some boundary components, say B m , B m +1 , . . . , B m . Then the old history components H m , H m +1 , . . . , H m and the current position v are merged together: The other history set components and corresponding Viterbi variables are not affected.
For μ = m , m + 1, . . . , m, let R μ ⊆ B μ be the set of all positions out of B μ that are no longer boundary nodes after having adjoined v to the history set. The nodes in R μ are removed from the boundary B μ after the iteration. LetB μ be the complement of R μ in B μ . By inspecting the edges incident to the current position v, all these sets can be computed in linear time.
The new boundary set B tmp is then eitherB m ∪B m +1 ∪ . . . ∪B m orB m ∪B m +1 ∪ . . . ∪B m ∪ {v}, where it can be checked in linear time whether or not v is a new boundary position.
We are now in a position to calculate the new Viterbi variables vit H tmp (y B tmp ), where y B tmp ranges over all assignments of the new boundary set B tmp . If Here, any assignment of a node set is assumed to implicitly define assignments for any subset thereof. illustrates this case of the recursion step. If, however, v ∈ B tmp , then Finally, the interior labeling is stored, where the maximum is attained. The algorithm terminates after the last node v from V has been processed. In the typical case, where the graph is connected, at termination m = 1, The running time of the algorithm is O(n2 b ), where b is the size of the largest boundary set and n is the number of surface residues. We call this algorithm generalized Viterbi algorithm as for the case of a graph that is a linear chain 1 − 2 − 3 − · · · − n of nodes using the node order 1, 2, . . . , n the Viterbi variables we define are the same as in the standard Viterbi algorithm for HMMs. In the case of a graph that is a tree, this algorithm specializes to the Fitch algorithm or an argmax-version of Felsenstein's pruning algorithm when a leaf-to-root node order is chosen after rooting the tree at an arbitrary node. In both special cases the boundary sets always have size at most 1. The tree example also motivate the use of several history sets at the same time: using a single history set only, one would not be able to achieve a linear running time on trees.

A heuristic based on the generalized Viterbi algorithm
First, it is vital for our generalized Viterbi algorithm to keep the size of the boundary sets small. A good position order is here of great importance. The algorithm starts by choosing a vertex of minimal degree. When determining the next position to be dequeued, the algorithm selects a boundary node such that the number of incident edges leading to nodes not belonging to any current history set is minimal. In an arbitrarily chosen order these nodes are dequeued next.
Second, the space demand is reduced by restricting the number of boundary labelings admitted. Starting from the available labelings of the current history set, the percentage of the reachable boundary labelings of the successor history that will be discarded is calculated. Then the corresponding percentile is estimated. To this end, a sufficiently large sample of possible labelings of the new boundary set is drawn, the Viterbi variables are computed, and the corresponding sample percentile is taken. Finally, only those boundary labelings of the new history set are retained whose Viterbi variables exceed this percentile.
That way we compute near-optimal solutions good enough for our purposes within feasible computation time.

Piecewise training for pCRFs
Let D := x (1) , y (1) , x (2) , y (2) , . . . , x (m) , y (m) be the independent identically distributed training sample. For every μ = 1, 2, . . . , m, let V μ and E μ be the set of positions and edges in the neighborhood graph associated with x μ , let n μ = |V μ | be the number of positions of the μ-th training example and let {I, N} n μ be the set of all possible label sequences of this graph.
This data set is unbalanced as there are many more noninterface positions as interface positions. As customary for other machine learning approaches such as support vector machines and artificial neural networks [28], we here manipulated the ratio of positive and negative example positions for training in order to obtain reasonable results.
We have amplified the influence of the positive examples, rather than selecting various sets of training data by deleting negative ones as done in [28].
Let ν I , ν N , ν II and ν NN be the number of interface positions, the number of non-interface positions, the number of interface-interface edges, and the number of noninterface-non-interface edges in D, respectively. Then we define the following two amplifier functions for all positions i and for all edges {i, j} of the m neighborhood graphs resulting from the training data D.
To uniformly govern the influence of the amplifiers, we introduce an amplifier control parameter η 3 ∈ [0, 1].
We set up our two log-likelihood objective function by where ideally for each μ = 1, 2, . . . , m is the training-instance-specific normalization factor. Unfortunately, maximizing this objective function in general is algorithmically intractable. Taking pattern from Sutton et al. [35] who introduced what they called piecewise training, we deal with this problem by disentangling the labels of nodes and edges. For μ = 1, 2, . . . , m, a noncoherent labeling y ∈ {I, N} V μ ×E μ of the neighborhood graph V μ , E μ is any mapping that assigns to every position v ∈ V μ and every edge e ∈ E μ a label y v ∈ {I, N} and a pair of labels y e ∈ {I, N} 2 , respectively. http://www.biomedcentral.com/1471-2105/15/277 We then replace Z x (μ) , η 3 bỹ as normalization factor. This makes the optimization problem computationally feasible.
The L-BFGS method [38] is used to solve it. That way we obtain the coefficient vectors α and β (see Equations 3), which depend on the amplifier control parameter η 3 ∈ [0, 1].
To mitigate the negative consequences of disentanglement, we use a correction factor δ ≥ 1. For any characteristics D and ι = 0, 1, . . . , γ − 1, the weights of the bases edge features φ (D) I,I,ι and φ (D) N,N,ι (see Equation 5) are all multiplied by δ. Thus a change in classification along an edge is additionally penalized. The correction factor δ is set best between 1.15 and 1.25.
For our implementation of the training, we used the Java CRF package from Sunita Sarawagi at http://crf. sourceforge.net/.

Results and discussion
In this section we demonstrate effectiveness of our pCRFbased protein surface model to enhance residue-wise score-based predictions of protein-protein interfaces. For the sake of ensuring reliability of the methods we used three data sets. The first one is PlaneDimers due to Zellner et al. [25], the second one is the list of 1276 two-chainproteins published by Keskin et al. [39], which was used by Liet al. [28] to test their linear-chain CRF. Third, we used a non-redundant data set containing 22604 unique interface structures very recently compiled by Cukuroglu et al. and published in [40].
The data set PlaneDimers is less known than the data due to Keskin et al.. It consists of redundancy-free homodimers with flat protein-protein interfaces. Zellner et al. [25] developed an SVM, called PresCont, that assigns to each residue on the protein surface a score between 0 and 1, which we refer to as PresCont score in the sequel. The larger the score, the more likely the residue belongs to the interface. Zellner et al. made the prediction by thresholding the score. The PresCont server and the data list PlaneDimers are publicly available (see http://wwwbioinf.uni-regensburg.de/).
In the first subsection we describe two sets of experiments performed with synthetic data, one on PlaneDimers [25], the other one on the list published by Keskin et al. [39]. In both cases we independently assign to each surface position a random score drawn according to two different parametrized sequences of β-distributions Beta(α I (ς)β I (ς)) and Beta(α N (ς)β N (ς)), one for the interface sites determined by the reference labeling, the other one for the non-interface positions.
The parametrized values α I (ς), α N (ς), β I (ς) and β N (ς) determining the two sequences of distributions are chosen such that the following conditions are satisfied. The mean values e I > e N are the average PresCont scores on interface sites and non-interface sites of all chains from PlaneDimers. The variances σ 2 I and σ 2 N are equal to σ 2 I,0 ς and σ 2 N.0 ς, where σ 2 I,0 and σ 2 N.0 are the corresponding variances of the PresCont score, and ς ∈ {0.8, 0.9, 1.0, 1.1, 1.2} models the precision of the synthetic score. The deciding feature of all these distributions is that they are unimodal. The result of the subsection is that enhancement works for unimodal score distributions.
The second subsection is about a synthetic data experiment on a new data set due to Cukuroglu [40]. Here we follow the line of the first subsection except for the fact that we restrict ourselves to signal precision ς = 1.0.
In the third subsection we study the PresCont scores for two-chain protein complexes from the data set PlaneDimers. According to Figure 5, the PresCont score for noninterface residues is far from being unimodal. However, if one restrict oneself to the part above a threshold in the neighborhood of 0.5 and larger, one may ask whether enhancement restricted to that domain will works. The subsection answers this question in the affirmative. Having chosen a threshold as described above, one can improve the classification with respect to this threshold as follows. Take over the prediction for scores below the threshold and reclassify the residues the scores of which are above by means of the pCRF-based enhancer.
In general, observations x could encompass a PDB file, which in particular determines the 3D-structure of the protein, together with an MSA that models evolutionary Figure 5 The distribution of the PresCont score for complexes from the data set PlaneDimers. http://www.biomedcentral.com/1471-2105/15/277 aspects. In our case an observation solely consists of the PresCont score sequence or of the sequence of synthetic scores for the surface residues. Formally, every observation x is equal to a vector (ζ 1 , ζ 2 , . . . , ζ n ) ∈ [0, 1] n .
There are several neighborhood notions for residues, surface/core definitions and interface determinations in the literature. When studying the data set PlaneDimers, we follow [25]. In the case of the list due to Keskin et al. [39], the definitions according to [28] are used. Finally, when studying complexes taken from the data set published in [40], we take the following definitions. The RASA value of a surface residue is at least 15% (see [28]). Two residues are defined as contacting if the distance between any two of their atoms is less than the sum of the corresponding van der Waals radii plus 0.5 Å (see [40]).
Anyway, according to Keskin et al. [39] we define the distance of two residues on one and the same chain as the distance of their major carbon atoms. We then say that one residue is nearby another residue, if they are at distance below 6 Å. (Note that usually residues adjacent on backbone are at distance of less than or equal to 3.5 Å). This definition in turn is the basis of the neighborhood graph G = (V, E) underlying the pCRF. Two surface positions are joined together by an undirected edge if and only if the corresponding residues are nearby ones.
Our pCRF-based enhancer utilizes one position characteristic and two edge characteristics on the basis of the standard step function method explained in the Methods section. If x = (ζ 1 , ζ 2 , . . . , ζ n ) ∈ [0, 1] n is the observation associated with the protein under study, and if G = (V, E) is the neighborhood graph, then for every position i ∈ V and every edge (i, j) ∈ E we set To enhance predictions obtained by thresholding, solely information coming from the residue neighborhood relations on the surface is additionally used.
In order to be able to calculate the performance measure of area under the ROC curve (AUC) for our pCRF-based enhancer on synthetic scores, we proceed as follows. For each edge {i, j} ∈ E, we replace the local feature value i,j (I, I, x) by κ i,j (I, I, x), where κ ∈ (0, ∞). We enhance residue-wise score-based predictors only on the protein surface. In our synthetic data experiments there is no predictor available for core residues. For proteins taken from the data list published by Keskin et al. [39] it happens that interface sites belong to the core. That is why we use what we call Surface AUC Ratio of the enhancer as our performance measure for our synthetic data experiments. := AUC referred to the protein surface of the enhancer AUC referred to the protein surface of the residue-wise score-based threshold predictor .
If is greater than 1, the enhancement was successful. The larger , the greater success.
To estimate performance measures, we applied 5-fold cross-validation experiments.
A fully built-out pCRF-based tool box for modeling protein surfaces needs to comprise all the standard algorithms as e.g. forward-backward techniques, marginalization and posterior decoding known for HMMs and linear-chain CRFs. To begin with, in the fourth subsection we explain how to put a variant form of the forward algorithm and posterior decoding for pCRFs into practice.

Simulating unimodal scores of various precisions
We estimated means e I and e N and variances σ 2 0,I and σ 2 0,N of the PresCont score on interface sites and noninterface positions of PlaneDimers, respectively, as follows.
e I = 0.61488ê N = 0.40590 We randomly chose 120 instances under the uniform distribution from the data set published by Keskin et al. [39] to perform our experiments. Let us refer to this set as KL-subset in the sequel. (It is accessible at http://ppicrf. informatik.uni-goettingen.de/index.html).
Zellner et al. [25] used the following determinations. A residue is defined to be part of the protein surface, if its relative solvent-accessible surface area is at least 5% [17]. A surface residue is said to constitute an inter-facial contact, if there exists at least one atom of this residue which has a van-der-Waals-sphere at a distance of at most 0.5 Å from the van-der-Waals sphere to any atom from a partner chain residue [39].
Based on [3,12,15,20,41], Li et al. [28] assume an interfacial contact of a residue on a chain is assumed to be there, if any heavy atom of this residue is at distance of at most 5 Å from any heavy atom from a partner chain. The relative solvent-accessible surface area of surface residue is at least 15%.
We independently assigned to each interface surface residue of the two data sets a random score between zero and one according to the β-distribution Beta(α I (ς)β I (ς)), and to every non-interface surface residue a score according to Beta(α N (ς)β N (ς)), where the score precision ς satisfies ς ∈ {0.8, 0.9, 1.0, 1.1, 1.2}, ( 8 )  and the parameters α I (ς), β I (ς), α N (ς), β N (ς) were chosen such that The Surface AUC Ratios of the enhancer compared with the threshold predictor on PlaneDimers and the KLsubset are displayed in Table 1. There is an improvement of 8.4% − 9.3% on PlaneDimers and of 3.2% − 5.0% on the KL-subset.
Moreover, we compared individual classification results obtained by thresholding the scores with pCRF-based enhanced predictions. Because of the fact that the specificity of the threshold predictor can be easily changed by manipulating the threshold, we proceeded as follows. For every score precision, the pCRF-based enhancer has a well-defined specificity referred to the surface residues. We then chose the threshold such that the specificity of the threshold predictor is close to that of the enhancer. The results are shown in Table 2. The sensitivity is increased by 53% − 67% on the data set PlaneDimers and by 14% − 22% on the KL-subset. Table 1 and Table 2 justify the following conclusion. Enhancing the threshold prediction by our pCRF works provided that the distributions of the interface scores as well as the non-interface scores are unimodal. The enhancement for the data set PlaneDimers is larger than for the KL-subset. This might be caused by the plain interface geometry of the complexes taken from PlaneDimers.
Utilizing the new data set due to Cukuroglu [40] As in the case of the KL-subset, we randomly chose 60 dimers. We refer to the resulting list as CGNK-subset. Having assigned synthetic scores according to Equations 7,9 and 10, where ς = 1.0, we compared individual classification results obtained by thresholding the scores with pCRF-based enhanced predictions in exactly the same way as we did for the KL-subset. The results are shown in Table 3. The sensitivity is increased by 22%.
A main finding of Cukuroglu [40] relevant to proteinprotein interface prediction is, that the average interface RASA value is greater than 40%. Since our method is designed to improve performance of a given residue-wise predictor, using this result is not in the scope of this paper. However, a CRF-based predictor integrating features for cliques of size greater than 2 is not beyond the range of current algorithmic capabilities. In such a model a feature set that discretizes the mean RASA value of cliques is promising.

Enhancing the PresCont server prediction on PlaneDimers
For the sake of completeness, we shortly review the residue characteristics used by PresCont.

Relative solvent-accessible surface area
For any residue a, the solvent-accessible surface area asa(a) can be computed by e.g. the software library BALL [42]. Most of the classifiers known from the literature utilize this characteristic (see [43]). For PresCont the relative solvent-accessible surface area according to rasa(a) := asa(a) asa max (a) (11) is taken into operation, where asa max (a) is the maximally possible accessible surface area of residue a [44].

Hydrophobicity
Many interfaces possess a hydrophobic core surrounded by a ring of polar residues [45,46]. In order to reduce noise, in [25] the contribution of hydrophobic patches rather then the influence of individual residues is utilized.

Residue conservation
Measures of this type utilized in [25] are the Shannon entropy and the relative Shannon entropy of empirical residue distributions in MSA columns. As an alternative, empirical expectations of BLOSUM-based similarities are taken for them.

Scores of local neighborhoods
They are evaluated by means of log-odd ratios of neighboring residue pair frequencies in interfaces as opposed to residue pair frequencies on complementary protein surface areas. The resulting scores are averaged both over the neighborhood of the positions under study and the rows of the MSA associated with the protein.
On the basis of Figure 5 we enhanced PresCont for thresholds θ ∈ [0.500, 0.625]. The decisive factor for this choice is that the PresCont score distributions for interface sites as well as non-interface positions above θ are "sufficiently close to" unimodal distributions. For every such θ, we set all scores less than or equal to θ to zero and then left the classification of all surface residues to the pCRF modified as follows. The residues of score zero are not taken into account when it comes to discretizing the protein characteristics (see Equations 4 and 5). Let us call this enhancing above θ. To evaluate improvements we proceeded as when compiling Table 2. For every threshold θ under consideration another threshold θ was chosen such that thresholding at θ has the same specificity as enhancing above θ. The results are displayed in Table 4 and visualized for an individual protein by Figure 6. According to Table 4 the increase in sensitivity ranges from 4% to 7%. The truepositive predictions on the surface of the protein with PDB-Entry 1QM4 are compared in Figure 6, where again the specificity of the two classifiers is the same.

Discussing posterior decoding
As in the case of linear-chain CRFs, the generalized Viterbi algorithm can be transformed into a variant form of the forward algorithm. It might be the case that the following additional problem arises.
Let v 1 , v 2 , . . . , v n be the ordering in which the positions of G are traversed by the algorithm, and let I denote the set of position indices i < n such that v i is not an element of the boundary B (i) of the history set H (i) at stage i. If I is not empty, we encounter an obstacle when it comes to sampling label sequences. For i ∈ I, position v i is not labeled in the course of the sampling procedure. That is why we augment the neighborhood graph G so that those positions no longer exist, all predictions remain unchanged, and the order of magnitude of the running time is not increased. To this end, we complement the http://www.biomedcentral.com/1471-2105/15/277 ordering v 1 , v 2 , . . . , v n as follows. For every i ∈ I, we insert a new node v i between v i and v i+1 . Having extended the neighborhood graph by these nodes not being associated with residue positions of the protein under study and by new edges {v i , v i } (i ∈ I), where for i ∈ I and y 0 .y 1 , y 2 ∈ {N, I} v i (y 0 , x) = {v i , v i } (y 1 , y 2 , x) = 0, the above mentioned obstacle is eliminated without any influence on the prediction and the order of magnitude of the running time.
Proceeding now in a way analogous to the classical case, in every formula that is a building block of the generalized Viterbi algorithm the following two steps of replacement need to be performed.
Second, we replace sums with products and then maxima with sums.
Thus we obtain as analogues of the Viterbi variables vit H μ (y B μ ) defined by Equation 6 what we call component forward variables cf H μ (y B μ ). For any assignment y B (i) (i > 1), the forward variable f i (y B (i) ) is a nontrivial linear combination of forward variables f i−1 (y B (i−1) ), where y B (i−1) ranges over some assignments of the boundary set B (i−1) at stage i − 1. Analogous to the linear-chain case, a random backward walk through a state graph, with all possible assignments y B (i) (i = n, n− 1, . . . , 1) being the set of nodes, results in a random labeling of the positions, where each labeling is drawn with its posterior probability.
This sampling technique allows the efficient calculation of posterior probabilities at nodes and edges in a straightforward manner.

Conclusions
Residue-wise score-based threshold predictors of proteinprotein interaction sites assign to each residue of the protein under study a score. The classification is then made by thresholding the score. In case of using probabilistic data models, the parameters of the threshold predictor have been learned on a training data set in advance.
We have demonstrated that such threshold predictors can be improved by pCRF-based enhancers given the shape of the interface surface score distribution and the non-interface surface score distribution with respect to the training set resemble the shape of unimodal distributions. Besides the surface residue scores, only the spatial neighborhood structure between the surface residues of the protein under study is taken into account. Thus, the improvement can be attributed to our model. In addition to the precision of the scores, the amount of improvement depends on the 3D-complexity of the interfaces to be predicted. To this end, three sets of experiments with synthetic surface residue scores for protein complexes randomly chosen from the data set PlaneDimers compiled by Zellner et al. [25] and from the lists published by Keskin et al. [39] and Cukuroglu et al. [40].
The enhancement is structurally based on the following model property of pCRFs in contrast to residue-wise predictors. Though the scores of near-by residues may be correlated, labeling a position as interface or non-interface by thresholding the score does not influence the classification of its neighbors. When using pCRFs, this is the case.
The pCRF-based enhancer is also applicable, if the score distributions are only unimodal over a certain sub-domain. The improvement is then restricted to that domain. Thus we were able to improve the prediction of the PresCont server devised by Zellner et al. on PlaneDimers [25].
The prediction is made on grounds of a generalized Viterbi inference heuristic. As for training, we developed a piecewise training procedure for pCRFs, where the enhancer needs to be trained on data originating from the same source as the training data of the threshold predictor to be improved.