Dataset
The benchmark dataset used in this study is the original PSICOV test set introduced by Jones et al. [16]. We used the same alignments without modification as have been made available to ensure comparability. However, for validation we selected 146 out of 150 single domain monomeric proteins because the latest PSICOV version V2.1b3 was unable to provide contact predictions on the remaining four test cases, when used with default parameters. This is due to the insufficient number of effective sequences within the four alignments.
The second test set consists of 37 proteins of the CASP11 experiment (see Additional file 1). We selected only those proteins where the latest version of PSICOV successfully provided predictions, to make a fair comparison. The training set introduced by Jones et al. [22] was used to build the RF meta-classifier.
Coevolution analysis methods
The residue-residue contact prediction metrics applied in this study are MI [10, 11], MIp [12], OMES [25], BvN [13], DCA [15] and PSICOV [16]. The resulting coevolution between pairs of amino-acids using MI, MIp and OMES were calculated using Evol module of Prody [26]. BvN results were generated using Perl scripts and C++ source code kindly provided by the authors [13]. PSICOV results were calculated using the code available online [16]. DCA results were obtained using the fast and free software version FreeContact introduced by Kaján et al. [27]. Methodological details for the different methods may be found in the original studies.
COUSCOus
Pre-processing
In our approach, we generate a sample covariance matrix S from the input MSA. The MSAs are composed of n orthologous protein sequences where each sequence represents a row. Each protein sequence is made of m amino acids as a result of which we have L columns per alignment row. The size of the covariance matrix S is 21L×21L. This is because we compute the marginal single site frequencies f(A
i
) and f(B
j
) of observed amino acid types (20 natural occurring amino acids and a gap) in columns i and j and their corresponding pair site frequencies f(A
i
B
j
):
$$ S_{ij}^{ab} = f\left(A_{i}B_{j}\right)-f(A_{i})f(B_{j}) $$
(1)
Interestingly the precision matrix Θ which is the inverse of the covariance matrix S will contain the partial correlations of all pairs of variables taking into consideration the effects of all other variables. Hence, the non-zero entries of Θ will provide the extent of direct coupling between any two pairs of amino acids at sites i and j.
Yet, due to the fact that we are generating a covariance matrix S out of MSAs representing homologous protein sequences where not all amino acids are present at each site of the MSA, it is certain that S is singular and not directly invertible. Several approaches have been proposed to approximate the precision matrix in such cases. The most powerful and widely used technique is the sparse inverse covariance estimation using the graphical lasso (GLasso) [28].
GLasso
We briefly summarise the basic motivation and algorithm. Consider matrix X=[X
1,…,X
p
] where X
i
is a random vector of length n with covariance matrix Σ and precision matrix Θ={θ
ij
}1≤i,j≤p
. Further, let S denote the empirical covariance matrix obtained from the data. The estimation of the precision matrix Θ is challenging when it is sparse. Interestingly, this task is closely related to selection of graphical models.
Let G=(V,E) be a graph representing conditional independence relations between components of X. G is composed of a set of vertices V with p components {X
1,…,X
p
} and an edge set E of ordered pairs (i,j), with (i,j)∈E, if an edge between X
i
and X
j
exists. The edge between X
i
and X
j
is excluded from the edge set E if and only if X
i
and X
j
are independent given all other components {X
k
,k≠i,j}. Assuming that the raw data X is multivariate gaussian (X∼N(μ,Σ)), the conditional independence between X
i
and X
j
given all other components is equivalent to zero in the precision matrix (θ
ij
=0) as shown in [29]. Hence, for gaussian distributions recovering the structure of graph G is equivalent to the estimation of the support of the precision matrix.
The precision matrix Θ can then be estimated using a L
1 penalised log-likelihood approach. The GLasso algorithm, introduced by Friedman et al. [28], efficiently computes the solution by:
$$ \hat{\Theta}_{\text{GLasso}} := \arg\min_{\Theta \succ 0}\left\{\text{tr}(S\Theta) - \log\det(\Theta) + \lambda\|\Theta\|_{1}\right\}, $$
(2)
with tr as trace, ∥Θ∥1 as the sum of the absolute values of the elements in Θ and λ as a positive tuning parameter to control the sparsity.
Empirical Bayes covariance estimator
The most natural estimator of the covariance Σ is the sample covariance matrix S. The estimator is optimal in the classical settings with large number of samples and fixed low dimensions (n>p). However, it performs poorly in high-dimensional settings (n<<p), see Johnstone [30]. The GLasso approach operates very well in this context, but the computational time required to reach convergence can be large in some cases such as for protein families with low number of sequences. As an alternative to the natural estimator S, several shrinkage estimators have been proposed in the literature [31, 32]. They take a weighted average of the sample covariance matrix S, with a suitable chosen target diagonal matrix. Jones et al. applied a smoothed covariance estimator that shrinks the matrix towards the shrinkage target \(F=diag(\bar {S},\bar {S},\ldots,\bar {S})\) [16]. In this work, we applied the empirical Bayes estimator proposed by Haff [24]:
$$ \hat{\Sigma}=S+\frac{p-1}{n\ tr(S)}I_{p}, $$
(3)
where I
p
represents the identity matrix of order p. In a Bayesian framework, it has been proven by Haff that this estimator is the best estimator of the form a(S+u
t(u)C), with 0<a<1/(n−1) and u=1/tr(S
−1
C). Here t(·) is non-increasing and C an arbitrary positive definite matrix. It dominates estimators of the form aS by a substantial amount. More precisely, it has been proven that under the L
2 loss, the uniform reduction in the risk function is at least \(100\frac {p+1}{n+p}\)%. In this study, we performed the shrinkage until the adjusted covariance matrix \(\hat {\Sigma }\) is no longer singular, i.e. is a positive-definite matrix. The adjusted covariance matrix \(\hat {\Sigma }\) was finally applied in the GLasso algorithm to obtain the sparse precision matrix, that contains the degree of direct coupling between any pair of amino acids.
APC correction
The coevolution pair list is generated identically to the PSICOV final processing. For each MSA column pair i and j we compute the L
1-norm out of the corresponding 20×20 submatrix in Θ (only contributions of the 20 amino acid types are considered):
$$ C_{ij} = \sum_{ab}\mid \Theta_{ij}^{ab}\mid $$
(4)
Furthermore, we adjust the coupling score by the average product correction (APC), previously applied for MI by Dunn et al. [12] to reduce entropic and phylogenetic bias:
$$ {COUSCOus}_{ij} = C_{ij} - \frac{\bar{C}_{(i-)}\bar{C}_{(-j)}}{\bar{C}} $$
(5)
with \(\bar {C}_{(i-)}\) as the mean precision norm of column i and all other columns, \(\bar {C}_{(-j)}\) as the corresponding for column j and \(\bar {C}\) as the mean precision norm of all coupling scores.
Random forest meta-classifier
As previously indicated, new hybrid methods that combine coevolution detecting tools with other sources of information such as protein physiochemical features outperforms single methods like PSICOV. However, we are convinced that improvements of single residue-residue contact detecting methods can boost new emerging hybrid techniques. We designed a RF meta-classifier that includes several contact prediction methodologies along with a small number of sequence-derived physiochemical features.
In particular, we built a RF classifier using the training set alignments from the MetaPSICOV study [22]. In total, we used 336 protein alignments where PSICOV was able to successfully provide contact predictions. The RF was trained using the following features:
-
Contact detecting methodologies MI, MIp, BvN, PSICOV or COUSCOus, FreeContact and CCMpred
-
Secondary structure and solvent exposure probabilities derived from PSIPRED [33]
-
Shannon entropy using R [34] package bio3d [35]
-
Hydrophobicity using R package Interpol [36]
-
Amino acid physiochemical properties
The RF meta-classifier was trained using 500 trees with ten features and a max-depth of eight. We performed five-fold cross-validation while training the classifier and optimised the area under the curve (AUC) metric for performance.
Evaluation metrics and distance descriptions
The problem of predicting protein residue-residue contacts is well-known to be an extremely difficult one as on average only 3% of all possible residue pairs in known protein structures are identified to be real contacts. In the latest CASP11 challenge [37], this problem was tackled by dividing the contact prediction task into two categories. First, evaluation of predicted contacts using quality metrics like Accuracy and X
d
[38] on reduced lists (RL). Second, evaluation of predicted contacts using quality metrics like Matthew’s correlation coefficient (MCC) [39] and area under the precision-recall (A
U
C
pr
) for full lists (FL). The RL are usually defined by considering the top L/n predicted contacts where L is the length of the evaluation target or protein sequence and n is a small integer (e.g. 1,5 or 10). RL metric accuracy is calculated as \(\frac {TP}{TP+FP}\) where TP defines a correctly predicted contact and FP an incorrectly predicted contact. The second RL metric X
d
represents the difference between the distance distributions of the predicted contacts and all pairs distance distributions in the 3D target structure. It is defined as \(X_{d} = \sum \limits _{i=1}^{15} \frac {Pip-Pia}{di*15}\), with Pia and Pip are the percentages of pairs included in the i
th bin for the whole target and predicted contacts respectively. Additional details can be found in [38]. The FL metrics used in this study are A
U
C
pr
as it is a robust metric for unbalanced classes and the Matthew’s correlation coefficient [39] to evaluate all residue pairs for contact prediction.