MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials

Background Protein model quality assessment (QA) is an essential step in protein structure prediction. QA methods can predict the qualities of protein models and identify good models from decoys. Clustering-based methods need a certain number of models as input; however, if a pool of models is not available, methods that only need a single model as input are indispensable. Results We developed MASS, a QA method that predicts the global qualities of individual protein models using random forests and various novel energy functions. We designed six novel energy functions or statistical potentials that capture the structural characteristics of a protein model and can also be used in other protein-related bioinformatics research. The MASS potentials demonstrated higher importance than the energy functions of RWplus, GOAP, DFIRE, and Rosetta when the scores they generate are used as machine learning features. MASS outperforms almost all of the four top-performing CASP11 single-model methods for global quality assessment in terms of all four evaluation criteria officially used by CASP, which measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS has also achieved performance comparable with the leading QA methods in CASP12 and CASP13. Conclusions MASS and the source code for all MASS potentials are publicly available at http://dna.cs.miami.edu/MASS/.


Background
The quality assessment (QA) of protein models plays an important role in protein tertiary structure prediction and model refinement [1]. Since it was introduced into the Critical Assessment of techniques for protein Structure Prediction (CASP) as an independent category in 2006, various methods have been developed for predicting the qualities of protein models [2][3][4][5][6][7][8]. Computational quality assessment tools can be categorized into three types: single-model methods, clustering-based methods, and quasi-single methods. Compared with clustering-based methods, which require a pool of protein models as input, single-model methods only need an individual protein model as input [8] and are indispensable when only a few models are available in the model pool. Quasi-single methods can be thought of as a hybrid of single-model and clustering-based methods.
Most single-model methods are developed based on machine learning algorithms, such as support vector machines (SVM) [9,10], random forests [11], and deep learning algorithms [12][13][14][15]. Single-model methods have used various features for training the machine learning models, such as energy functions [7,11] and the consistency between predicted and assigned secondary structures [8]. Liu et al. [8] developed a deep learning architecture based on stacked denoising autoencoders (SdA) to predict residue-specific qualities of individual models. Cao et al. developed DeepQA [5], in which energy functions, physico-chemical characteristics, and structural information were used as features, and deep belief networks were used as the machine learning algorithm. ProQ3 [7] used Rosetta energy terms as input features and SVM as the machine learning algorithm and outperformed its previous version, ProQ2 [10].
In this study, we present a single-model method named MASS for predicting global qualities of individual protein models. We designed and re-implemented ten protein potentials and proved that they indicate different structural or energetic characteristics of a protein model. The random forests algorithm is used as the machine learning algorithm; the values from ten potentials along with six other types of features are used to predict the global qualities of individual models. We evaluated MASS along with other QA methods in CASP11, CASP12, and CASP13 and found that MASS outperforms most of the methods in CASP11 and is comparable with the leading methods in CASP12 and CASP13.

Training data and features
The training data were collected from previous CASP experiments: 85 targets from CASP9 and 67 from CASP10. The objective values are GDT-TS scores obtained by superimposing protein models onto their native structures using LGA [16]. For each protein target, there are about 300 models. Considering the small differences between the five models from the same group, we randomly selected 150 models per target for generating machine learning features.

Comparison between predicted and assigned secondary structures and solvent accessibilities
The predicted secondary structures and relative solvent accessibilities were obtained by executing SCRATCH [23]. The secondary structures and relative solvent accessibilities of a protein model were assigned by STRIDE [24]. The Q3, SOV'99, and SOV_refine scores [17,25] were used to assess the similarity between the predicted and assigned protein secondary structures. The other three features indicate the percentage of identical values between the predicted and assigned relative solvent accessibilities (buried, exposed, or both) at a 25% exposure threshold.
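As a minimal illustration of these consistency features, the Q3 score is simply the fraction of residues whose predicted three-state secondary structure matches the assigned one (the SOV scores involve segment overlaps and are omitted here); the function name is illustrative:

```python
def q3_score(predicted: str, assigned: str) -> float:
    """Fraction of residues whose predicted 3-state secondary structure
    (H/E/C) agrees with the state assigned from the 3D model."""
    if len(predicted) != len(assigned) or not assigned:
        raise ValueError("sequences must be non-empty and equal-length")
    matches = sum(p == a for p, a in zip(predicted, assigned))
    return matches / len(assigned)
```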

Pseudo amino acid composition
Pseudo amino acid composition (PseAA) [21] was used to indicate amino acid composition.

Radius of gyration
Radius of gyration has been widely used as an indicator of protein structure compactness [26]. When computing the radius of gyration in this study, we considered only N, Cα, and C atoms and assumed that all atoms of interest have equal masses.
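Under the equal-mass assumption above, the radius of gyration reduces to the root-mean-square distance of the selected backbone atoms from their centroid; a minimal sketch:

```python
import math

def radius_of_gyration(coords):
    """Equal-mass radius of gyration over backbone (N, CA, C) coordinates:
    sqrt(mean squared distance of each atom from the centroid)."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    msd = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
              for p in coords) / n
    return math.sqrt(msd)
```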

Residue-residue contact information
Two Cα atoms are defined as in contact if their sequence separation is ≥ 6 and their Euclidean distance in 3D space is less than 8 Å. The first feature is the average sequence separation between atoms that are in contact. The second feature is the average value of the distances between in-contact Cα-Cα atoms weighted by their sequence separations.
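The two contact features can be sketched as follows; the exact weighting scheme used in MASS is not spelled out, so the separation-weighted mean distance below is an assumption:

```python
import math

def contact_features(ca_coords, min_sep=6, max_dist=8.0):
    """Two global contact features from Calpha coordinates:
    (1) mean sequence separation over all in-contact pairs,
    (2) mean contact distance weighted by sequence separation
        (assumed weighting; the paper does not give the formula)."""
    seps = []
    weight_sum = 0.0
    weighted_dist_sum = 0.0
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_sep, n):  # enforce sequence separation >= 6
            d = math.dist(ca_coords[i], ca_coords[j])
            if d < max_dist:             # enforce distance < 8 Angstroms
                sep = j - i
                seps.append(sep)
                weight_sum += sep
                weighted_dist_sum += sep * d
    if not seps:
        return 0.0, 0.0
    return sum(seps) / len(seps), weighted_dist_sum / weight_sum
```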

MASS potentials
We designed six protein statistical potentials from scratch: pseudo-bond angle potential (PAP), accessible surface potential at the atomic level (ASPA), sequence separation-dependent potential (SSDP), contact-dependent potential (CDP), relative solvent accessibility potential (RSAP), and volume-dependent potential (VDP). We re-designed (made minor modifications to the existing design) the torsion angle potential (TAP) previously defined in QMEAN [27]. We re-implemented (the potentials were previously defined by others; we implemented them in Perl) three previously defined protein potentials: centrosymmetric burial potential (CSP), accessible surface potential at the residue level (ASPR), and distance-dependent potential (DDP).
We used both Cα and Cβ atoms to represent a residue in five potentials: ASPR, CDP, CSP, DDP, and SSDP. Therefore, we used in total 15 potentials for a given protein model (referred to as MASS potentials hereafter). The protein dataset we used for extracting reference states is TOP8000 [28], which contains about 8000 high-resolution (< 2.0 Å), quality-filtered, experimentally determined protein structures (chains) at a 70% PDB homology level. The dataset was previously used to update the torsional distributions in MolProbity [28] and was used here to extract the distributions of the other reference states.
The reference state information consists of pseudo-bond angles, torsion angles, centrosymmetric burial, accessible surface at the residue level and at the atomic level, residue distance, sequence separation, residue-residue contact, relative solvent accessibility, and atom volume. The general formula [29,30] we used to calculate the potential of an atom (or residue) or of paired atoms (or paired residues) is

E = RT ln(1 + Mσ) − RT ln(1 + Mσ · f_observed / f_reference),

where σ is a weight parameter set to 1/50 [31] and RT was set to 0.582 kcal/mol [30]. For the potentials discussed below, we used newly-designed ways of calculating M, f_observed, and f_reference.
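The general formula above can be sketched directly in code. This assumes the Sippl-style sparse-data form with σ = 1/50 and RT = 0.582 kcal/mol given in the text; the original publication should be consulted for the exact form:

```python
import math

RT = 0.582       # kcal/mol, as stated in the text
SIGMA = 1 / 50   # weight parameter sigma [31]

def potential_energy(m, f_observed, f_reference):
    """Sparse-data statistical potential (assumed Sippl-style form):
    E = RT*ln(1 + M*sigma) - RT*ln(1 + M*sigma * f_obs / f_ref).
    m is the observation count for the residue/atom type; f_observed and
    f_reference are the type-specific and pooled state frequencies."""
    return (RT * math.log(1 + m * SIGMA)
            - RT * math.log(1 + m * SIGMA * f_observed / f_reference))
```

With no observations (M = 0) the potential falls back to zero, and a state observed more often than the reference expects (f_observed > f_reference) is rewarded with a negative energy.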

Pseudo-bond angle potential
We defined pseudo-bond angles as the angles formed by three consecutive N, Cα, or C atoms in the backbone. The 180° range of pseudo-bond angles is evenly split into n = 6 classes. The M value in the pseudo-bond angle potential (PAP) for a specific residue is defined as

M_R = Σ_{ss, c_N, c_Cα, c_C} f(R, ss, c_N, c_Cα, c_C),

where R denotes the residue type, ss is the secondary structure state of the residue, c_N is the class of the pseudo-bond angle formed by N atoms, c_Cα is the class of the pseudo-bond angle formed by Cα atoms, c_C is the class of the pseudo-bond angle formed by C atoms, and f is a function that returns the number of occurrences of a specific combination of R, ss, c_N, c_Cα, and c_C. Therefore, for a specific residue type R, its M is the total number of observations over all secondary structure states and pseudo-bond angle classes in our reference state information.
For the ith residue with residue type R_i, suppose it has a specific combination of states (ss, c_N, c_Cα, c_C); we then define f_i(ss, c_N, c_Cα, c_C) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types). We define f_reference as

f_reference = Σ_R f_R(ss, c_N, c_Cα, c_C) / Σ_R M_R,

where the denominator is the sum of all observations over all residue types in our reference state information, and the numerator is the sum of the observations for this specific state combination over all residue types. f_observed is defined as

f_observed = f_i(ss, c_N, c_Cα, c_C) / M_{R_i}.

f_observed is very similar to f_reference, but only the residue type R_i is considered when calculating it.
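The count-ratio definitions above can be illustrated on toy counts. The table layout below (a dict keyed by a residue type and a state tuple) is an illustrative assumption, not the implementation used by MASS:

```python
def pap_terms(counts, res_type, state):
    """Given occurrence counts keyed by (residue_type, state), where state
    is the tuple (ss, c_N, c_Ca, c_C), return (M, f_reference, f_observed)
    for one residue, following the count-ratio definitions in the text.
    A sketch on toy counts, not the trained reference tables."""
    total = sum(counts.values())                                  # all observations, all types
    m = sum(c for (r, s), c in counts.items() if r == res_type)   # M for this residue type
    state_total = sum(c for (r, s), c in counts.items() if s == state)
    f_ref = state_total / total          # pooled over all residue types
    f_obs = counts.get((res_type, state), 0) / m                  # this type only
    return m, f_ref, f_obs
```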

Torsion angle potential
We refined the definition of the torsion angle potential (TAP) previously defined in QMEAN [27]. Three adjacent residues in a protein chain contribute six backbone dihedral angles (Φ and Ψ for each residue). For each of the six dihedral angles, we first evenly split the 360° range into n = 9 classes. The existing definition of TAP includes two types of combinations of the six dihedral angles: the Φ triple and the Ψ triple of the three residues. We will use (a_1, a_2, a_3) hereafter to indicate these two categories of combinations. For a given class combination (a_1, a_2, a_3), we further created two additional approaches of defining the state combinations. The first approach defines five classes c_5: if a_1 = a_2 and a_1 = a_3, we label it as c_51; if a_1 ≠ a_2, a_2 ≠ a_3, and a_1 ≠ a_3, we label it as c_52; if a_1 = a_2 and a_1 ≠ a_3, we label it as c_53; if a_1 = a_3 and a_1 ≠ a_2, we label it as c_54; and if a_2 = a_3 and a_1 ≠ a_2, we label it as c_55. The second approach defines four classes c_4: if a_2 = a_1 and a_2 = a_3, we label it as c_41; if a_2 ≠ a_1 and a_2 ≠ a_3, we label it as c_42; if a_2 = a_1 and a_2 ≠ a_3, we label it as c_43; and if a_2 = a_3 and a_2 ≠ a_1, we label it as c_44. Therefore, given the six dihedral angles of three adjacent residues, we classify each of the two triples by its c_5 and c_4 classes. The refined definition of M for the torsion angle potential for a specific residue is

M_R = Σ_{ss, Φc_5, Ψc_5, Φc_4, Ψc_4} f(R, ss, Φc_5, Ψc_5, Φc_4, Ψc_4),

where Φc_5 denotes the c_5 class of the Φ triple, and similarly for Ψc_5, Φc_4, and Ψc_4. For the ith residue with residue type R_i, suppose it has a specific state combination (ss, Φc_5, Ψc_5, Φc_4, Ψc_4); we then define f_i(ss, Φc_5, Ψc_5, Φc_4, Ψc_4) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).
We define f_reference and f_observed as

f_reference = Σ_R f_R(ss, Φc_5, Ψc_5, Φc_4, Ψc_4) / Σ_R M_R,
f_observed = f_i(ss, Φc_5, Ψc_5, Φc_4, Ψc_4) / M_{R_i}.

The definitions of M_R, f_reference, and f_observed are very similar to the ones we defined for the pseudo-bond angle potential, but here we use a two-layer class-assignment system (first a_1, a_2, and a_3, then c_4 and c_5) for torsion angles.
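The two extra class-assignment schemes are straightforward case analyses over a class triple; this sketch assumes the classes are comparable labels (e.g., integers 0 to 8):

```python
def c5_class(a1, a2, a3):
    """Five-way classification of a torsion-angle class triple,
    following the case analysis in the text."""
    if a1 == a2 and a1 == a3:
        return "c51"
    if a1 != a2 and a2 != a3 and a1 != a3:
        return "c52"
    if a1 == a2 and a1 != a3:
        return "c53"
    if a1 == a3 and a1 != a2:
        return "c54"
    return "c55"  # remaining case: a2 == a3 and a1 != a2

def c4_class(a1, a2, a3):
    """Four-way classification centred on the middle class a2."""
    if a2 == a1 and a2 == a3:
        return "c41"
    if a2 != a1 and a2 != a3:
        return "c42"
    if a2 == a1:
        return "c43"  # a2 == a1 and a2 != a3
    return "c44"      # a2 == a3 and a2 != a1
```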

Centrosymmetric burial potential
We re-implemented the centrosymmetric burial potential (CSP) [32] with two major alterations. The first is that we integrated a static radius of gyration (Rg) for protein models and native structures. The Rg used in this work was derived from a simple function of the number of residues N: Rg = 0.395 × N^0.6 + 7.257, obtained from a regression analysis between a dataset of about 1000 globular proteins from the PDB and their corresponding sequence lengths [33]. The second alteration is that we added secondary structure classes (i.e., H, E, and C), in the same way as in the former two potentials. The range [0, 3 × Rg] is evenly divided into 30 bins or classes c_30. For each atom (Cα or Cβ), we calculated the distance between the atom and the center of mass of the protein and determined which class c_30 this distance belongs to. The M, f_reference, and f_observed for the CSP for each atom (Cα or Cβ) are defined as

M_R = Σ_{ss, c_30} f(R, ss, c_30),
f_reference = Σ_R f_R(ss, c_30) / Σ_R M_R,
f_observed = f_i(ss, c_30) / M_{R_i}.

For the ith residue (atom Cα or Cβ) with residue type R_i, suppose it has a specific state combination (ss, c_30); we define f_i(ss, c_30) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).
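A sketch of the Rg regression and the 30-bin distance classification follows; clamping distances beyond 3 × Rg into the last bin is an assumption, since the text does not say how out-of-range distances are handled:

```python
def expected_rg(n_residues):
    """Regression-based radius of gyration used in the re-implemented CSP:
    Rg = 0.395 * N^0.6 + 7.257 (formula from the text)."""
    return 0.395 * n_residues ** 0.6 + 7.257

def csp_bin(distance_to_center, rg, n_bins=30):
    """Bin a Calpha/Cbeta-to-centre-of-mass distance into one of 30 equal
    bins over [0, 3*Rg]; distances beyond 3*Rg fall in the last bin
    (assumed edge handling)."""
    width = 3 * rg / n_bins
    return min(int(distance_to_center / width), n_bins - 1)
```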

Accessible surface potential at the residue level
We re-implemented the accessible surface potential at the residue level (ASPR) [34]. The accessible surface of a given residue is calculated as the total number of residues located within an 11 Å radius sphere centered on that residue [34]. The accessible surface for a given residue is classified into 25 classes c_25 (the range [0, 50] is evenly divided into 25 bins). The M, f_reference, and f_observed for the ASPR for each residue (represented by the Cα or Cβ atom) are defined as

M_R = Σ_{ss, c_25} f(R, ss, c_25),
f_reference = Σ_R f_R(ss, c_25) / Σ_R M_R,
f_observed = f_i(ss, c_25) / M_{R_i}.

For the ith residue (represented by the Cα or Cβ atom) with residue type R_i, suppose it has a specific state combination (ss, c_25); we define f_i(ss, c_25) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).
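A sketch of the residue-level burial count and its 25-class binning, using Cα coordinates; whether the residue counts itself and how the boundary is handled are assumptions:

```python
import math

def aspr_class(residue_index, ca_coords, radius=11.0, n_bins=25, max_count=50):
    """Accessible-surface class for one residue: count the other residues
    whose Calpha lies within an 11-Angstrom sphere of this residue's
    Calpha, then bin the count into 25 equal classes over [0, 50]
    (2 residues per class); counts above 50 fall in the last class."""
    center = ca_coords[residue_index]
    count = sum(
        1 for j, c in enumerate(ca_coords)
        if j != residue_index and math.dist(center, c) <= radius
    )
    width = max_count / n_bins
    return min(int(count / width), n_bins - 1)
```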

Accessible surface potential at the atomic level
We designed the accessible surface potential at the atomic level (ASPA) based on the classification of all heavy atoms into 40 atom types c_40 [35]. For each heavy atom, we calculated its accessible surface as for ASPR but using an 8 Å radius sphere. The accessible surface for a given heavy atom is classified into 30 classes c_30 (the range [50, 200] is evenly divided into 30 bins).
The M, f_reference, and f_observed for the ASPA for each heavy atom are defined as

M_R = Σ_{c_30} f(R, c_30),
f_reference = Σ_R f_R(c_30) / Σ_R M_R,
f_observed = f_i(c_30) / M_{R_i}.

For the ith heavy atom with atom class R_i, suppose it has a specific state c_30; we define f_i(c_30) as the number of occurrences of that state for heavy-atom type R_i (one occurrence number is generated per heavy-atom type, with in total 40 occurrence numbers for the 40 heavy-atom types).

Distance-dependent potential
We re-implemented the same distance-dependent potential (DDP) described in [29,30]; therefore, no detailed description is given here. We evenly divided the distance range [5, 25] into 40 classes and only considered residue pairs at least three residues apart in sequence. When calculating DDP, we used the Cα or Cβ atom to represent a residue.

Sequence separation-dependent potential
We designed the sequence separation-dependent potential (SSDP) based on the definition of DDP. SSDP is very similar to DDP, but we evenly divided the sequence separation range [0, 300] into 60 classes and only considered residue pairs with distances equal to or less than 8 Å. For calculating SSDP, we used the Cα or Cβ atom to represent a residue.

Contact-dependent potential
We designed the contact-dependent potential (CDP). Two residues, with sequence separation equal to or larger than 6, are considered to be in contact if their Euclidean distance in 3D space is less than 9 Å. The M, f_reference, and f_observed for the CDP for any two in-contact residues (each represented by its Cα or Cβ atom, with the partner belonging to one of the 20 residue-type classes c_j) are defined as

M_R = Σ_{ss, c_j} f(R, ss, c_j),
f_reference = Σ_R f_R(ss, c_j) / Σ_R M_R,
f_observed = f_i(ss, c_j) / M_{R_i}.

For the ith and jth residues (represented by Cα or Cβ atoms) with residue types R_i and c_j, suppose they have a specific class combination (ss, c_j); we define f_i(ss, c_j) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).

Relative solvent accessibility potential
We designed the relative solvent accessibility potential (RSAP) from scratch. The relative solvent accessibility is assigned by STRIDE [24] and evenly divided into 10 classes c_10 over the range [0, 1]. The M, f_reference, and f_observed for the RSAP for each residue (represented by the Cα or Cβ atom) are defined as

M_R = Σ_{ss, c_10} f(R, ss, c_10),
f_reference = Σ_R f_R(ss, c_10) / Σ_R M_R,
f_observed = f_i(ss, c_10) / M_{R_i}.

For the ith residue (represented by the Cα or Cβ atom) with residue type R_i, suppose it has a specific state combination (ss, c_10); we define f_i(ss, c_10) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).

Volume-dependent potential
We designed a new potential: the volume-dependent potential (VDP). For each Cα atom, we calculated its volume as described in [36]. Given a volume value, we classified it into 10 classes c_10 (the range [10, 30] was evenly divided).
The M, f_reference, and f_observed for the VDP for each residue (represented by the Cα atom) are defined as

M_R = Σ_{ss, c_10} f(R, ss, c_10),
f_reference = Σ_R f_R(ss, c_10) / Σ_R M_R,
f_observed = f_i(ss, c_10) / M_{R_i}.

For the ith residue (represented by the Cα atom) with residue type R_i, suppose it has a specific class combination (ss, c_10); we define f_i(ss, c_10) as the number of occurrences of that combination of states for residue type R_i (one occurrence number is generated per residue type, with in total 20 occurrence numbers for the 20 amino acid types).

Rosetta energy functions
We used the Rosetta Energy Function 2015 (REF15) to generate 19 energy scores for each residue [22]. The Rosetta energy of a protein model is the sum of all residues' energy scores; the twentieth energy score provided by REF15 is the sum of the 19 energy scores.

Optimal parameters
Given a protein model, we first calculated the ten potentials at the atom or residue level and obtained the potential/energy scores of the protein model by summing up all residues' energy scores (∑E). To determine the optimal parameters (e.g., range boundaries and class numbers) in each of the ten potentials, we selected 730 single-domain models in CASP9 by randomly selecting 10 models from each of the 73 template-based modelling (TBM) single-domain targets. We tested various parameter configurations; the Pearson and Spearman correlations between the GDT-TS scores of the 730 models and their corresponding energy scores were used to determine the final parameters (see Additional file 1: Table S1).

Random forests
The random forests algorithm [37], which has been widely used in computational biology [38, 39], was used as the machine learning algorithm in this study. We used five-fold cross-validation to determine the optimal parameters (i.e., ntree and mtry) [40]. The numbers of trees (ntree) we tested ranged from 500 to 5000 with an interval of 500. The mtry values we tested ranged from 10 to 34 with an interval of one. The optimal parameters we obtained were 2500 and 24, respectively.
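The parameter search can be sketched as below. This uses scikit-learn's RandomForestRegressor as a stand-in (ntree maps to n_estimators, mtry to max_features), with tiny toy data and a reduced grid so it runs quickly; the real search used the full grids stated above:

```python
# Hedged sketch of the five-fold cross-validated grid search; the data,
# grid values, and estimator choice here are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((60, 34))           # 60 toy models x 34 toy features
y = X[:, :5].mean(axis=1)          # synthetic stand-in for GDT-TS

grid = {
    "n_estimators": [50, 100],     # the paper scanned 500..5000 step 500
    "max_features": [5, 10],       # the paper scanned mtry 10..34 step 1
}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=5)
search.fit(X, y)
best = search.best_params_         # analogous to the reported (2500, 24)
```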

Results
Similar to how CASP officially evaluates QA methods that predict global qualities [1] of protein models, we assessed our method, together with four methods that participated in CASP11, seven in CASP12, and 16 in CASP13, by four criteria measuring the abilities to assign relative scores, identify the best model from decoys, assign absolute scores, and discriminate good models from bad models. The corresponding measures are (1) the weighted mean of the Pearson product-moment correlation coefficient (wmPMCC), (2) the average loss (Ave loss), (3) the average GDT-TS deviation (Ave ΔGDT), and (4) the Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC). The wmPMCC was used to evaluate the QA methods' ability to predict relative model accuracy. In each stage, the Pearson correlation coefficient r between predicted and real GDT-TS scores was calculated for each target, and each correlation r was transformed into an additive quantity using Fisher's z-transformation

z = (1/2) ln((1 + r) / (1 − r)),

where z is approximately normally distributed. We then calculated the arithmetic mean of the z values, denoted z̄. The final wmPMCC r was obtained using the inverse transformation

r = (e^{2z̄} − 1) / (e^{2z̄} + 1).
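The z-transform averaging can be sketched as below; note that atanh and tanh are exactly Fisher's transform and its inverse. The weighting is assumed uniform when no per-target weights are given, since the text does not specify how CASP weights targets:

```python
import math

def wm_pmcc(per_target_r, weights=None):
    """Weighted mean of per-target Pearson correlations via Fisher's
    z-transform: average z = atanh(r), then back-transform with tanh.
    Equal weights are assumed when none are supplied."""
    if weights is None:
        weights = [1.0] * len(per_target_r)
    z_mean = (sum(w * math.atanh(r) for w, r in zip(weights, per_target_r))
              / sum(weights))
    return math.tanh(z_mean)
```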
The average loss (Ave loss) was designed to assess the quality of identifying the best model from a pool of models of each target. The loss value is the absolute difference between the true GDT-TS score of the best model and the true GDT-TS score of the model predicted to be best; smaller loss values correspond to a better ability to identify best models.
Compared with wmPMCC, the average GDT-TS deviation (Ave ΔGDT) was used to evaluate the QA methods' ability to assign absolute model accuracy. For each model of a target, the GDT-TS deviation is the absolute difference between the real GDT-TS score and the predicted global-quality score.
To evaluate the ability to distinguish between good and bad models, we computed the Matthews correlation coefficient (MCC). For a protein model, if its true GDT-TS score is ≥ 50 (out of 100) and a QA method assigns a score ≥ 50, we counted it as a true positive (TP). The MCC score was calculated as

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TN stands for true negatives, FP for false positives, and FN for false negatives. We also performed receiver operating characteristic (ROC) analysis [41]. The area under the ROC curve (AUC) indicates the ability of binary classification of a model's quality: if the real GDT-TS is ≥ 50, the model is considered good, otherwise poor.
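The MCC computation and the GDT-TS ≥ 50 labelling rule can be sketched directly; returning 0 when the denominator vanishes is a common convention and an assumption here:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts;
    returns 0.0 when any marginal is empty (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def label_model(true_gdt, predicted_score, threshold=50.0):
    """Binary good/bad labels at the GDT-TS = 50 cutoff used in the text:
    (actual label, predicted label)."""
    return true_gdt >= threshold, predicted_score >= threshold
```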
We first showed that the ten MASS potentials are significantly different from each other. We calculated the Pearson correlations of every two potentials for the 730 models (see Additional file 1: Table S2, upper triangle) and the statistical significance of the differences at the 95% confidence level using paired t-tests (see Additional file 1: Table S2, lower triangle). We also calculated the Spearman correlations of every two potentials for the 730 models (see Additional file 1: Table S3, upper triangle) and the statistical significance of the differences at the 95% confidence level using paired Wilcoxon signed-rank tests (see Additional file 1: Table S3, lower triangle). From Additional file 1: Tables S2 and S3, we can conclude that the ten MASS potentials we designed and re-implemented are statistically significantly different from each other.
In CASP11, our method is the only one that achieves a wmPMCC > 0.7 in stage 1 and > 0.4 in stage 2, indicating that MASS can accurately predict real GDT-TS scores. Figure 1 shows two example predictions of MASS on two CASP targets.
For CASP12, as shown in Additional file 1: Table S5, ProQ3 and DeepQA outperform the others in stage 1, and ProQ3 and SVMQA outperform the others in stage 2 (see Table 2). MASS outperforms SVMQA in terms of Ave ΔGDT in both stages and in terms of MCC and ROC in stage 1. Moreover, we report the significance of the differences between any two methods in Additional file 1: Table S6 for CASP11 and Additional file 1: Table S7 for CASP12, obtained by the Fisher z-transformation and t-tests. These show that the predictions of MASS are significantly different from the predictions of QAcon, ProQ3, and SVMQA.

We also evaluated our method using 57 targets in the CASP13 experiment along with 16 methods participating in CASP13, including the ModFOLD7 series [15], FaeNNz, ProQ4 [14], the MESHI series, the VoroMQA series [44], MULTICOM-NOVEL, Bhattacharya-SingQ, Bhattacharya-Server, PLU-AngularQA [45], and PLU-TopQA (methods having missing models or targets were excluded). The results are shown in Table 3 for stage 2 and Additional file 1: Table S8 for stage 1. In stage 1, the ModFOLD7 series [15] perform better than the others according to the five evaluation metrics. MASS achieves a slightly lower ROC (0.94) compared with 0.99 from ModFOLD7. In stage 2, the ModFOLD7 series still outperform the other methods.
Notice that the pseudo amino acid composition is the same for all models of a target. In section 1 of Additional file 1, we provide a discussion showing that although this feature cannot distinguish the models within a target, it can affect the scores given to all the models of a target.
Figure 2 shows the contribution of each of the 70 machine learning features, which provides useful information for future research in this field. All 70 features play a positive role in the machine learning task, with one of the solvent accessibility features, some of the PseAA features, and the twentieth energy score from Rosetta contributing more than the rest.
The running time analysis of MASS is given in Additional file 1. Finally, to assess the contributions of three sets of energy features (the existing energy functions RWplus, GOAP, and DFIRE; our novel MASS potentials; and the Rosetta energy functions), we individually occluded each set by setting the corresponding features to zero and then executed the same MASS model to obtain new predictions on the 75 targets of CASP11 stage 2. We compared the evaluation results with and without occlusion (Additional file 1: Table S9). The MASS potentials demonstrated higher importance than the three energy functions (RWplus, GOAP, and DFIRE) and the Rosetta energy functions.

Discussion
MASS is a random-forests-based approach for estimating the quality of individual protein models. It uses various features extracted from protein sequences and models, which can be classified into seven sets: (1) consistency between predicted and assigned secondary structures and solvent accessibilities; (2) three energy functions (RWplus, GOAP, and DFIRE); (3) PseAA coding of the protein sequence; (4) radius of gyration of the protein model; (5) residue-residue contact information; (6) 15 MASS potentials; and (7) 20 Rosetta energy scores. We evaluated MASS along with other QA methods in CASP11, CASP12, and CASP13. MASS outperforms most of the methods in CASP11 and is comparable with the leading methods in CASP12 and CASP13.
We defined and re-implemented ten protein potentials based on various protein properties: sequence separation, residue distance, residue-residue contact, atom volume, torsion angles, pseudo-bond angles, accessible surface, relative solvent accessibility, and centrosymmetric burial. We have shown that these ten potentials play a key role in the good performance of MASS. The MASS potentials can also be used as machine learning features in other areas of protein science, such as protein structure prediction and protein function prediction.
Currently, MASS does not support residue-specific (local) quality assessment, which can be used in refining protein models. However, most of the features we used in this work are residue-specific, which can be directly used as local or residue-specific features for developing residue-specific quality assessment methods. As our future work, we plan to integrate MASS potentials with deep learning methods to estimate residue-specific protein model qualities.

Conclusions
In this study, we designed and implemented ten potentials using different reference state information, including pseudo-bond angles, torsion angles, centrosymmetric burial, accessible surface at the residue and atomic levels, residue distance, sequence separation, residue-residue contact, relative solvent accessibility, and atom volume. We showed that the ten potentials are statistically significantly different from each other. The MASS potentials demonstrated higher importance than the three energy functions (RWplus, GOAP, and DFIRE) and the Rosetta energy functions when used as machine learning features.
We also present MASS, which uses seven types of features and random forests to predict the global qualities of individual protein models. To evaluate MASS and the related tools, we used four official CASP evaluation criteria that measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS outperforms almost all of the four leading CASP11 single-model methods for global quality assessment and is comparable with most of the leading methods in the CASP12 and CASP13 experiments.