To predict sites of structural or functional importance, we combine the known
-metric of normalized mutual information
 with a novel metric
to enhance the influence of dissimilar compensatory mutations when measuring covariation of two sites. We discuss how we devised
in Methods section.
To learn the frequency of compensatory mutations, we took
-significant site pairs as training data. We did that for reasons of computation time regardless of the fact that these data are biased. To deal with this bias, one could carry through the training in an iterative process, with our training being the first iteration. For i > 0, in the (i + 1)-th iteration of this modified training, a doubly stochastic matrix
is calculated based on
-significant site pairs. This is done until the training data are stable.
According to Birkhoff’s Theorem
, every doubly stochastic matrix is a convex combination of permutation matrices. Moreover, from the Hardy-Littlewood-Pólya majorization theorem
 follows that transforming the probability mass function by a doubly stochastic matrix increases entropy. Consequently, by linearly transforming the empirical amino acid pair distribution of a site pair by D(1) before calculating the
-value, we penalized those site pairs whose original distribution does not match the frequency pattern of formal dissimilar compensatory mutations in the training data described in the Methods section.
The challenge was to separate the signal caused by structural and functional constraints from the background. To address this issue, we studied only metrics μ that satisfy the following condition. The larger the μ(k,l)-value, the larger the probability that the two sites k and l have co-evolved. Our critical assumptions were: i) the μ(k,l)-values follow three different distributions, one for the signal, one for the noise, and one for pairs of completely unrelated sites; ii) there is an MSA-dependent threshold below which the metric μ does not fall with overwhelming probability, when it is applied to the site pairs of functional or structural importance to which μ is sensitive; iii) there is an MSA-dependent threshold significantly smaller then the one in (ii) such that with overwhelming probability there are no μ(k,l)-values of pairs (k,l) of unrelated sites exceeding it.
In order to near-completely eliminate the noise, we filtered both our training and input data. We calculated the significant pairs such that the preassigned false discovery rate was guaranteed by generalizing the Storey-Tibshirani procedure devised for multiple testing problems
Our method to eliminate noise is orthogonal to the technique developed in
. Therein, for every pair of sites the so-called average product correction (APC) is calculated as an explicit noise measure, by which the mutual information is then decreased. Furthermore, it generalizes the way Merkl and Zwick
 as well as Gao et al.
 cope with noise. According to our judgment, taking only the top 75 high-scoring pairs or the top 25 pairs into account as done in
[16, 17], respectively, is too conservative.
We based our noise separation technique on rather weak distribution assumptions that are standard practice in multiple hypothesis testing, instead of explicitly model the noise in terms of a metric. We applied the connectivity degree technique due to Merkl and Zwick
 to significant site pairs with respect to our metrics. The cut-off for the connectivity degree was set to the 90-th percentile. That way we defined significant sites. Finally, a site was defined to be CMF-significant, if it was μ-significant, where μ is either
Why did we set the cut-off value for the connectivity degree to the 90-th percentile? Going through all possiblen-th percentiles for n = 80,81,…,99, the Matthews correlation coefficient (MCC) of a joint prediction for human EGFR and GCK proteins is maximal if n = 90.
It is plausible that the number of functionally or structurally important sites does not only depend on the length of the protein. Therefore, the 90-th percentile cut-off should be replaced by an MSA-dependent threshold in future studies.
Our results for human EGFR and GCK proteins suggest that the large majority of significant compensatory mutation sites found by CMF are in agreement with previous experimental studies regarding the functions and stability of these proteins. 15 and 16 CMF-significant sites in human EGRF and GCK proteins, respectively, are verified as disease associated nsSNP positions (see Figures
2) where most amino acid substitutions in protein sequences damage structural stability of proteins
[36, 37, 45]. Moreover, we have observed that in both proteins some of CMF-significant nsSNP positions are nearby allosteric sites, binding sites or catalytic sites each of which are considered to be functionally important
[46, 47] (see Figures
4). Disease associated mutations at these nearby positions are likely to affect protein function
Despite the large number of CMF-significant sites demonstrated to be structurally or functionally important for both of the proteins, 9 and 15 significant sites in human EGFR and GCK proteins, respectively, are not included in essential sites. However, we hypothesize that most of the novel significant sites may play a critical role in both proteins notwithstanding the absence of previous experimental data. Therefore, further progress from the molecular and structural biology end is required not only to assess the importance of these sites, but also for a future perspective on a deeper understanding of protein structure.
Because we have also used the
-metric, we compared our tool with H2r presented in
 rather than with those methods developed in
. This way, we studied the impact of applying the Storey-Tibshirani procedure in combination with the effect of using the 90-th percentile cut-off for the connectivity degree. We have applied H2r without adding pseudo counts to the human EGFR and GCK protein. For EGFR, the 14 sites T725, A755, N756, A767, Q791, V802, N816, V819, K846, V876, M881, K913, D916, and E931 are identified as significant. Out of these significant sites, ten of these residue sites T725, A755, N756, A767, Q791,K846, V876, M881, K913, and D916 are essential sites. On the other hand, for GCK, H2r identified the 15 residue positions L25, R36, R63, M107, C213, V226, G261, D262, G264, L266, D267, E268, T405, K414, and H416 as significant. Twelve of these sites, namely R36, R63, M107, C213, V226, G261,D262, G264, L266, D267, K414, and H416, are essential sites. However, when using the H2r Web service (
http://www-bioinf.uni-regensburg.de/) to analyze EGFR and GCK proteins, sensitivity is decreased, while precision is increased. By this service only eight sites for EGFR and nine sites for GCK were found to be significant. Moreover, only five and eight of them are verified as functionally or structurally important for EGFR and GCK proteins, respectively. This difference stems from the fact that the H2r Web service tightens the filtering of the columns. In addition to this, statistically evaluating H2r for EGFR and GCK proteins, we observed a sensitivity of 5.4%, specificity of 96.7%, precision of 75.9%, and a Matthews correlation coefficient value of 0.047. On the other hand, the CMF reaches precision of 79.1%, and a Matthews correlation coefficiant value of 0.133. For sensitivity and specificity of the CMF refer to the last row of Table
The results of this comparison show that a vast majority of functionally or structurally important residue positions cannot be detected without using our novel MSA specific model and both metrics (