 Research
 Open Access
 Published:
MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials
BMC Bioinformatics volume 21, Article number: 246 (2020)
Abstract
Background
Protein model quality assessment (QA) is an essential procedure in protein structure prediction. QA methods can predict the qualities of protein models and identify good models from decoys. Clusteringbased methods need a certain number of models as input. However, if a pool of models are not available, methods that only need a single model as input are indispensable.
Results
We developed MASS, a QA method to predict the global qualities of individual protein models using random forests and various novel energy functions. We designed six novel energy functions or statistical potentials that can capture the structural characteristics of a protein model, which can also be used in other proteinrelated bioinformatics research. MASS potentials demonstrated higher importance than the energy functions of RWplus, GOAP, DFIRE and Rosetta when the scores they generated are used as machine learning features. MASS outperforms almost all of the four CASP11 topperforming singlemodel methods for global quality assessment in terms of all of the four evaluation criteria officially used by CASP, which measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS has also achieved comparable performances with the leading QA methods in CASP12 and CASP13.
Conclusions
MASS and the source code for all MASS potentials are publicly available at http://dna.cs.miami.edu/MASS/.
Background
The quality assessment (QA) of protein models plays an important role in protein tertiary structure prediction and model refinement [1]. Since it was introduced into the critical assessment of techniques for protein structure (CASP) as an independent category in 2006, various methods have been developed for predicting the qualities of protein models [2,3,4,5,6,7,8]. Computational quality assessment tools can be categorized into three types: singlemodel methods, clusteringbased methods, and quasisingle methods. Compared with clusteringbased methods that require a pool of protein models as input, singlemodel methods only need an individual protein model as input [8] and are indispensable when there are only a few models available in the model pool. Quasisingle methods can be thought of as a hybrid of singlemodel and clusteringbased methods.
Most of the singlemodel methods are developed based on machine learning algorithms, such as support vector machine (SVM) [9, 10], random forests [11], and deep learning algorithms [12,13,14,15]. Singlemodel methods have used various features for training the machine learning models, such as energy functions [7, 11] and the consistency between predicted and assigned secondary structures [8]. Liu et al. [8] developed a deep learning architecture based on stacked denoising encoders (SdA) to predict residuespecific qualities of individual models. Cao et al. developed DeepQA [5], in which energy functions, physiochemical characteristics, and structural information were used as features, and deep belief networks were used as the machine learning algorithm. ProQ3 [7] used Rosetta energy terms as input features and SVM as machine learning algorithm and outperformed its previous version ProQ2 [10].
In this study, we present a singlemodel method named MASS for predicting global qualities of individual protein models. We designed and reimplemented ten protein potentials and proved that they indicate different structural or energetic characteristics of a protein model. The random forests algorithm is used as the machine learning algorithm; the values from ten potentials along with six other types of features are used to predict the global qualities of individual models. We evaluated MASS along with other QA methods in CASP11, CASP12, and CASP13 and found that MASS outperforms most of the methods in CASP11 and is comparable with the leading methods in CASP12 and CASP13.
Methods
Training data and features
The training data were collected from previous CASP experiments: 85 targets from CASP9 and 67 from CASP10. The objective values are GDTTS scores obtained from superimposing protein models with their native structures using LGA [16]. For each protein target, there are about 300 models. Considering the small differences between the five models from the same group, we randomly selected 150 models on each target for generating machine learning features.
We used 70 features in seven categories: (1) the agreement of predicted and assigned secondary structures, specifically, the Q3, SOV’99, SOV_refine scores [17] and three solvent accessibility scores (six features); (2) existing statistical potential energy of protein models, including RWplus [18], GOAP [19], and DRIRE [20] (three features); (3) pseudo amino acid composition of the amino acid sequences [21] (23 features); (4) radius of gyration of the models (one feature); (5) residueresidue contact information (two features); (6) newlydesigned and newlyimplemented protein statistical potentials (15 features); and (7) Rosetta energy functions [22] (20 features).
Comparison between predicted and assigned secondary structures and solvent accessibilities
The predicted secondary structures and relative solvent accessibilities were obtained by executing SCRATCH [23]. The secondary structures and relative solvent accessibilities of a protein model were assigned by STRIDE [24]. The Q3, SOV’99, and SOV_refine scores [17, 25] were used to assess the similarity between the predicted and assigned protein secondary structures. The other three features indicate the percentage of identical values between the predicted and assigned relative solvent accessibilities including buried, exposed, and both buried and exposed at 25% exposure threshold.
Statistical potential energy
We used three statistical potential energy scores, including the ones generated by RWplus [18], GOAP [19], and DFIRE [20].
Pseudo amino acid composition
Pseudo amino acid composition (PseAA) [21] was used to indicate amino acid composition.
Radius of gyration
Radius of gyration has been widely used as an indicator of protein structure compactness [26]. When we used radius of gyration in this study, we only considered N, Cα, and C atoms, and assumed that all atoms of interest have equal masses.
Residueresidue contact information
The incontact relationship between the CαCα atoms is defined as the sequence separation ≥ 6 and Euclidean distance in 3D space less than 8 Å. The first feature is the average sequence separation between atoms that are incontact. The second feature is the average value of the distances between incontact CαCα atoms weighted by their sequence separations.
MASS potentials
We designed six protein statistical potentials from scratch including pseudobond angle potential (PAP), accessible surface potential at the atomic level (ASPA), sequence separationdependent potential (SSDP), contactdependent potential (CDP), relative solvent accessibility potential (RSAP), and volumedependent potential (VDP). We redesigned (made minor modifications on the existing designs) the torsion angle potential (TAP) previously defined in QMEAN [27]. We reimplemented (the potentials were previously defined by others; we implemented them in PERL) three previously defined protein potentials: centrosymmetric burial potential (CSP), accessible surface potential at the residue level (ASPR), and distancedependent potential (DDP).
We used both Cα and Cβ atoms to represent a residue in five potentials: ASPR, CDP, CSP, DDP, and SSDP. Therefore, we used in total 15 potentials for a given protein model (will be referred to as MASS potentials hereafter). The protein dataset we used for extracting reference states is TOP8000 [28], which contains about 8000 highresolution (< 2.0 Å) and qualityfiltered experimentallydetermined protein structures (chains) with 70% PDB homology level. The dataset was previously used to update the torsional distributions in MolProbity [28] and was used here to extract the distributions of other reference states.
The reference state information consists of pseudobond angles, torsion angles, centrosymmetric burial, accessible surface at the residue level and at the atomic level, residue distance, sequence separation, residueresidue contact, relative solvent accessibility, and atom volume. The general formula [29, 30] we used to calculate the potentials of an atom (a residue) or paired atoms (paired residues) is:
, where σ is a weight parameter and was set to 1/50 [31] and RT was set to 0.582 kcal/mole [30]. For the potentials discussed below, we used newlydesigned ways of calculating M, f_{observed}, and f_{reference}.
Pseudobond angle potential
We defined pseudobond angles as the angles formed by three consecutive N, Cα, or C atoms in the backbone. The 180° degree of pseudobond angles is evenly split into n = 6 classes. The M value in the pseudobond angle potential (PAP) for a specific residue is defined as:
, where R denotes the residue type, ss is the secondary structure state of the residue, c_{N} is the class of pseudobond angle formed by N atoms, \( {c}_{C_{\alpha }} \) is the class of pseudobond angle formed by Cα atoms, c_{C} is the class of pseudobond angle formed by C atoms, and f is a function that returns the number of occurrences of a specific combination of R, ss, c_{N}, c_{Cα}, and c_{C}. Therefore, for a specific residue type R its M is the number of observations for different secondary structure states and pseudobond angle classes based on our reference state information.
For the ith residue with residue type R_{i}, suppose it has a specific combination of states \( \left( ss,{c}_N,{c}_{C_{\alpha }},{c}_C\right) \), then we define \( {f}_i\left( ss,{c}_N,{c}_{C_{\alpha }},{c}_C\right) \) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types). We define f_{reference} as:
, where the denominator is the sum of all observations or occurrences for all residue types based on our reference state information, and the numerator is the sum of the observations for this specific state combination for all residue types. The f_{observed} is defined as:
. f_{observed} is very similar to f_{reference}, but when calculating the former one, we only considered the residue type R_{i}.
Torsion angle potential
We refined the definition of torsion angle potential (TAP) previously defined in QMEAN [27]. For three adjacent residues in a protein chain, six dihedral angles (Φ_{i − 1}, Ψ_{i − 1}, Φ_{i}, Ψ_{i}, Φ_{i + 1}, Ψ_{i + 1}) are defined. For each of the six dihedral angles, we first evenly split the 360° degree into n = 9 classes. The existing definition of TAP includes two types of combinations of the six dihedral angels: (Φ_{i1}, Φ_{i}, Φ_{i+1}) and (Ψ_{i1}, Ψ_{i}, Ψ_{i+1}). We will use (a_{1}, a_{2}, a_{3}) hereafter to indicate these two categories of combinations. For a given specific class set or combination (a_{1}, a_{2}, a_{3}), we further created another two approaches of defining the state combinations. The first approach defines five classes c_{5}: if a_{1} = a_{2} and a_{1} = a_{3}, we label it as c_{51}; if a_{1} ≠ a_{2}, a_{2} ≠ a_{3}, and a_{1} ≠ a_{3}, we label it as c_{52}; if a_{1} = a_{2} and a_{1} ≠ a_{3}, we label it as c_{53}; if a_{1} = a_{3} and a_{1} ≠ a_{2}, we label it as c_{54}; and if a_{2} = a_{3} and a_{1} ≠ a_{2}, we label it as c_{55}. The second apporach defines four classes c_{4}: if a_{2} = a_{1} and a_{2} = a_{3}, we label it as c_{41}; if a_{2} ≠ a_{1} and a_{2} ≠ a_{3}, we label it as c_{42}; if a_{2} = a_{1} and a_{2} ≠ a_{3}, we label it as c_{43}; and if a_{2} = a_{3} and a_{2} ≠ a_{1}, we label it as c_{44}. Therefore, given a set of six dihedral angles (Φ_{i − 1}, Ψ_{i − 1}, Φ_{i}, Ψ_{i}, Φ_{i + 1}, Ψ_{i + 1}), we first classify each of them into n = 9 classes c(Φ_{i − 1}, Ψ_{i − 1}, Φ_{i}, Ψ_{i}, Φ_{i + 1}, Ψ_{i + 1}), and then classify them based on c_{5} and c_{4} if we set (a_{1}, a_{2}, a_{3}) = c(Φ_{i − 1}, Φ_{i}, Φ_{i + 1}) or (a_{1}, a_{2}, a_{3}) = c(Ψ_{i − 1}, Ψ_{i}, Ψ_{i + 1}). The refined definition of M for the torsion angle potential for a specific residue is:
, where Φc_{5} denotes the category of Φ, that is, c(Φ_{i − 1}, Φ_{i}, Φ_{i + 1}) and c_{5} class definition, similarly for Ψc_{5}, Φc_{4}, Ψc_{4}.
For the ith residue with residue type R_{i}, suppose it has a specific state combination (ss, Φc_{5}, Ψc_{5}, Φc_{4}, Ψc_{4} ), we then define f_{i}(ss, Φc_{5}, Ψc_{5}, Φc_{4}, Ψc_{4}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types). We define f_{reference} and f_{observed} as:
The definitions of M_{R}, f_{referene}, and f_{observed} are very similar to the ones we defined in Pseudobond angle potential, but here we use a twolayer class assignment system (first use a_{1}, a_{2}, and a_{3} and then use c_{4} and c_{5}) for torsion angles.
Centrosymmetric burial potential
We reimplemented the centrosymmetric burial potential (CSP) [32] with two major alternations. One of them is that we integrated the static radius of gyration (Rg) for protein models and native structures. The Rg we used in this work was derived from a simple function of the number of residues (N): Rg = 0.395 × N^{0.6} + 7.257, resulting from regression analysis between a dataset of about 1000 globular proteins from the PDB database and their corresponding sequence lengths [33]. The second alternation is that we added secondary structure classes (i.e., H, E, and C), in the same way as in the former two potentials. The range [0, 3 × Rg] is evenly divided into 30 bins or classes c_{30}. For each atom (Cα or Cβ), we calculated the distance between the atom and center of mass of the current protein and determine which c_{30} this distance belongs to. The M, f_{reference}, and f_{observed} for the CSP for each atom (Cα or Cβ) are defined as:
For the ith residue (atom Cα or Cβ) with residue type R_{i}, suppose it has a specific state combination of (ss, c_{30}), we define f_{i}(ss, c_{30}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types).
Accessible surface potential at the residue level
We reimplemented the accessible surface potential at the residue level (ASPR) [34]. The accessible surface of any given residue is calculated as the total number of residues locating within a 11 Å radius sphere centered on the given residue [34]. The accessible surface for a given residue is classified into 25 classes c_{25} (the range [0, 50] is evenly divided into 25 bins). The M, f_{reference}, and f_{observed} for the ASPR for each residue (represented by Cα or Cβ atom) are defined as following:
For the ith residue (represented by Cα or Cβ atom) with residue type R_{i}, suppose it has a specific state combination (ss, c_{25}), we define f_{i}(ss, c_{25}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types).
Accessible surface potential at the atomic level
We designed accessible surface potential at the atomic level (ASPA) based on the classification of all heavy atoms into 40 atom types c_{40} [35]. For each given heavy atom, we calculated its accessible surface as we did for ASPR but used an 8 Å radius sphere. The accessible surface for a given heavy atom is classified into 30 classes c_{30} (the range [50,200] is evenly divided into 30 bins). The M, f_{reference}, and f_{observed} for the ASPA for each heavy atom are defined as:
For the ith heavy atom with atom class R_{i}, suppose it has a specific state c_{30} , we define f_{i}(c_{30}) as the number of occurrences of that state for heavy atom type R_{i} (one occurrence number is generated for one type of heavy atoms, with in total 40 occurrence numbers generated for 40 heavy atom types).
Distancedependent potential
We reimplemented the same distancedependent potential (DDP) as described in [29, 30]. Therefore, no detailed description is shown here. We evenly divided the distance range [5, 25] into 40 classes and only considered any two residues with at least three residues away. When calculating DDP, we used Cα or Cβ atom to represent a residue.
Sequence separationdependent potential
We designed sequence separationdependent potential (SSDP) based on the definition of DDP. SSDP is very similar to DDP, but we evenly divided the sequence separation range [0,300] into 60 classes and only considered two residues with distances equal to or less than 8 Å. For calculating SSDP, we used Cα or Cβ atom to represent a residue.
Contactdependent potential
We designed contactdependent potential (CDP). Two residues, with sequence separation equal to or larger than 6, are considered to be incontact if their Euclidean distance is less than 9 Å in the 3D space. Therefore, the M, f_{reference}, and f_{observed} for the CDP for any two residues (each represented by their Cα or Cβ atom and each belonging to 20 residuetype classes c_{j}) being incontact are defined as:
For the ith and jth residues (represented by Cα or Cβ atoms) with residue type R_{i} and c_{j}, suppose they have a specific class combination (ss, c_{j}), we define f_{i}(ss, c_{j}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types).
Relative solvent accessibility potential
We designed relative solvent accessibility potential (RSAP) from scratch. The relative solvent accessibility is assigned by STRIDE [24], and evenly divided into 10 classes c_{10} from range [0, 1]. The M, f_{reference}, and f_{observed} for the RSAP for each residue (represented by the Cα or Cβ atom) are defined as:
For the ith residue (represented by the Cα or Cβ atom) with residue type R_{i}, suppose it has a specific state combination (ss, c_{10}), we define f_{i}(ss, c_{10}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types).
Volumedependent potential
We designed a new potential: volumedependent potential (VDP). For each Cα atom, we calculated its volume as described in [36]. Given a volume value, we classified it into 10 classes c_{10} (range [10, 30] was evenly divided). The M, f_{reference}, and f_{observed} for the VDP for each residue (represented by Cα atom) are defined as:
For the ith residue (represented by the Cα atom) with residue type R_{i}, suppose it has a specific class combination (ss, c_{10}), we define f_{i}(ss, c_{10}) as the number of occurrences of that combination of states for residue type R_{i} (one occurrence number is generated for a residue type, with in total 20 occurrence numbers generated for 20 amino acid types).
Rosetta energy functions
We used Rosetta Energy Function 2015 (REF15) to generate 19 energy scores for each residue [22]. The Rosetta energy of a protein model is the sum of all residues’ energy scores; and the twentieth energy score provided by REF15 is the sum of all 19 energy scores.
Optimal parameters
Given a protein model, we first calculated the ten potentials at the atom or residue level and obtained the potentials/energy scores of this protein model by summing up all residues’ energy scores (∑E). To determine the optimal parameters (e.g., range boundaries and class number) in each of the ten potentials, we selected 730 singledomain models in CASP9 by randomly selecting 10 models from each of the 73 templatebasedmodelling (TBM) singledomain targets. We tested various configurations of parameters. The Pearson’s and Spearman correlations between GDTTS of the 730 models and their corresponding energy scores were used to determine the final parameters (see Additional file 1: Table S1).
Random forests
The random forests algorithm [37] was used as the machine learning algorithm in this study, which has been widely used in the field of computational biology [38, 39]. We used fivefold crossvalidation to determine the optimal parameters (i.e., ntree, mtry) [40]. The number of trees (i.e., ntree) we tested were from 500 to 5000 with an interval of 500. The mtry values we tested were from 10 to 34 with an interval of one. The optimal parameters we obtained were 2500 and 24, respectively.
Results
Similar to how CASP officially evaluates QA methods that predict global qualities [1] of protein models, we assessed our method, together with four methods participated in CASP11, seven in CASP12, and 16 in CASP13, by four criteria measuring the abilities to assign relative scores, identify the best model from decoys, assign absolute scores, and discriminate good models from bad models. The corresponding measures are (1) the weighted mean of Pearson’s product moment correlation coefficient (wmPMCC), (2) the average loss (Ave loss), (3) the average GDTTS deviations (Ave ΔGDT), and (4) the Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC). The weighted mean of Pearson’s product moment correlation coefficient (wmPMCC) was used to evaluate the QA methods’ ability to predict relative model accuracy. In each stage, the Pearson’s correlation coefficients r between predicted and real GDTTS scores for each target were calculated, and then the correlation r was transformed into an additive quantity using:
, where z is the normally distributed variable. We then calculated the arithmetic mean score of z values, denoted as \( \overline{z} \). The final wmPMCC \( \overline{r} \) was obtained using the following inverse equation.
The average loss (Ave loss) was designed to assess the quality of identifying the best model from a pool of models of each target. The loss value is the absolute value between the native GDTTS scores of the best model and the predicted best model, which means that smaller loss values correspond to better ability to identify best models.
Compared with wmPMCC, the average GDTTS deviation (Ave Δ GDT) was used to evaluate the QA methods’ ability to assign absolute model accuracy. For each model in a target, the GDTTS deviation is the absolute value between real GDTTS score and predicted globalquality score.
To evaluate the ability of distinguishing between good and bad models, we computed the Matthews correlation coefficient (MCC). For a protein model, if its true GDTTS score is ≥50 (out of 100) and a QA method assigns a score ≥ 50, we counted it as true positive (TP). The MCC score was calculated as:
, where TN stands for true negatives, FP for false positives, and FN for false negatives. We also performed the receiver operating characteristic (ROC) analysis [41]. ROC curves (AUC) indicate the ability of binary classification of the model's quality; if real GDTTS ≥50, the model quality is considered good, otherwise poor.
We first proved that the ten MASS potentials are significantly different from each other. We calculated the Pearson’s correlations of every two potentials for 730 models (see Additional file 1: Table S2 upper triangular) and the statistical significance of the differences at the 95% confidence level using paired ttests (see Additional file 1: Table S2 lower triangular). We also calculated the Spearman correlations of every two potentials for 730 models (see Additional file 1: Table S3 upper triangular) and statistical significance of the differences at the 95% confidence level using paired Wilcoxon SignedRank tests (see Additional file 1: Table S3 lower triangular). From Additional file 1: Tables S2 and S3, we can conclude that the ten MASS potentials we designed and reimplemented are statistically and significantly different from each other.
We blindly tested MASS on 75 CASP11 targets in two stages (sel20 and best150), and compared its performance with four leading methods in CASP11: ProQ2 [3], ProQ2refine [10], Qprob/ MULTICOM_NOVEL [4], and QAcon/MULTICOMCLUSTER [42]. We also blindly tested MASS on 72 CASP12 targets and compared it with leading methods in CASP12 including ProQ3 [7], SVMQA [6], VoroMQA [43], DeepQA/MULTICOMCLUSTER [5], Myproteinme, QASproGP, and QMEAN [27]. As shown in Additional file 1: Table S4 for CASP11, our method (MASS) outperforms the other four in every aspect in stage 1, for stage 2, see Table 1. Our method is the only method that achieves > 0.7 wmPMCC in stage 1 and > 0.4 in stage 2, indicating that MASS can accurately predict the real GDTTS scores. Figure 1 shows two example predictions of MASS on two CASP targets.
In terms of CASP12, as shown in Additional file 1: Table S5, ProQ3 and DeepQA outperform the others in stage 1, and ProQ3 and SVMQA outperform the others in stage 2, see Table 2. MASS outperforms SVMQA in terms of Ave ΔGDT in both stages and in terms of MCC and ROC in stage 1. Moreover, we reported the significance of differences between any two methods in Additional file 1: Table S6 for CASP 11 and Additional file 1: Table S7 for CASP 12 by Fisher ZTransformation and ttest. It shows that the predictions of MASS are significantly different to the predictions of QAcon, ProQ3, and SVMQA.
We also evaluated our method using 57 targets in CASP13 experiment along with 16 methods participating in CASP13 including ModFOLD7 series [15], FaeNNz, ProQ4 [14], MESHI series, VoroMQA series [44], MULTICOMNOVEL, BhattacharyaSingQ, BhattacharyaServer, PLUAngularQA [45], and PLUTopQA (methods having missing models or targets were excluded). The results are shown in Table 3 for stage 2 and Additional file 1: Table S8 for stage 1. In stage 1, ModFOLD7 series [15] perform better than the others according to the five evaluation metrics. MASS achieves a slightly lower ROC (i.e., 0.94) compared with 0.99 from ModFOLD7. In stage 2, ModFOLD7 series still outperform the other methods.
Notice that the pseudo amino acid composition for all models of a target are the same. In section 1 of the Additional file 1, we provided a discussion showing that although this feature cannot distinguish the models within a target, it can affect the scores given to all the models of a target.
We provided the contribution of each of the 70 machine learning features in Fig. 2, which provides useful information for future research in this field. All of the 70 features play a positive role in the machine learning task with one of the solvent accessibility features, some of the PseAA features, and the twentieth energy scores from Rosetta contributing more than the rest.
The running time analysis of MASS is shown in the Additional file 1. Finally, to assess the values of the three energy sets including the three energy functions (RWplus, GOAP, and DFIRE), our novel MASS potentials, Rosetta energy functions, we individually occluded each of the three energy sets by setting the corresponding features to zero and then executed the same MASS model to obtain new predictions on 75 targets in CASP11 stage 2. We compared the evaluation results with/without occlusion and the results were shown in Additional file 1: Table S9. MASS potentials demonstrated higher importance than the three energy functions (RWplus, GOAP, and DFIRE) and Rosetta energy functions.
Discussions
MASS is a randomforestsbased approach for estimating the quality of individual protein models. It uses various features extracted from protein sequences and models. The features can be classified into seven sets: (1) consistency between predicted and assigned secondary structures and solvent accessibilities; (2) three energy functions (RWplus, GOAP, and DRIRE); (3) PseAA coding of protein sequence; (4) radius of gyration of the protein model; (5) residueresidue contact information; (6) 15 MASS potentials; and (7) 20 Rosetta energy functions. We evaluated MASS along with other QA methods in CASP11, CASP12, and CASP13. MASS outperforms most of the methods in CASP11 and is comparable with the leading methods in CASP12 and CASP13.
We defined and reimplemented 10 protein potentials using various protein properties including sequenceseparationdependent, distancedependent, contactdependent, volumedependent, torsion angle, pseudobond angle, accessible surface, relative solvent accessibility, and centrosymmetric burial. We have proved that these 10 protein potentials play a key role in the good performance of MASS. The 10 MASS potentials can be used as machine learning features for other studies in the field of protein science, such as protein structure prediction and protein function prediction.
Currently, MASS does not support residuespecific (local) quality assessment, which can be used in refining protein models. However, most of the features we used in this work are residuespecific, which can be directly used as local or residuespecific features for developing residuespecific quality assessment methods. As our future work, we plan to integrate MASS potentials with deep learning methods to estimate residuespecific protein model qualities.
Conclusions
In this study, we designed and implemented ten potentials using different reference state information including pseudobond angles, torsion angles, centrosymmetric burial, accessible surface at the residue level and at the atomic level, residue distance, sequence separation, residueresidue contact, relative solvent accessibility, and atom volume. We proved that the ten potentials were statistically significant different to each other. MASS potentials demonstrated higher importance than the three energy functions (RWplus, GOAP, and DFIRE) and Rosetta energy functions when used as machine learning features.
We also present MASS, which uses seven types of features and random forests to predict global qualities of individual protein models. To evaluate MASS and the related tools, we used four CASPofficialevaluation criteria that measured the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS outperforms almost all of the four CASP11 leading singlemodel methods for global quality assessment. MASS is comparable with most of the leading methods in CASP12 and CASP13 experiments.
Availability of data and materials
MASS is publicly available at http://dna.cs.miami.edu/MASS/.
Abbreviations
 CASP:

Critical assessment of techniques for protein structure prediction
 QA:

Quality assessment
References
 1.
Kryshtafovych A, Barbato A, Monastyrskyy B, Fidelis K, Schwede T, Tramontano A. Methods of model accuracy estimation can help selecting the best models from decoy sets: assessment of model accuracy estimations in CASP11. Proteins: Struct Funct Bioinf. 2016;84(S1):349–369.
 2.
Wang Z, Eickholt J, Cheng J. APOLLO: a quality assessment service for single and multiple protein models. Bioinformatics. 2011;27(12):1715–6.
 3.
Ray A, Lindahl E, Wallner B. Improved model quality assessment using ProQ2. BMC Bioinformatics. 2012;13(1):224.
 4.
Cao R, Cheng J. Protein singlemodel quality assessment by featurebased probability density functions. Sci Rep. 2016;6:23990.
 5.
Cao R, Bhattacharya D, Hou J, Cheng J. DeepQA: improving the estimation of single protein model quality with deep belief networks. BMC Bioinformatics. 2016;17(1):495.
 6.
Manavalan B, Lee J. SVMQA: support–vectormachinebased protein singlemodel quality assessment. Bioinformatics. 2017;33(16):2496–503.
 7.
Uziela K, Shu N, Wallner B, Elofsson A. ProQ3: Improved model quality assessments using Rosetta energy terms. Sci Rep. 2016;6(1):33509.
 8.
Liu T, Wang Y, Eickholt J, Wang Z. Benchmarking deep networks for predicting residuespecific quality of individual protein models in CASP11. Sci Rep. 2016;6:19301.
 9.
Cao R, Wang Z, Wang Y, Cheng J. SMOQ: a tool for predicting the absolute residuespecific quality of a single protein model with support vector machines. BMC Bioinformatics. 2014;15(1):120.
 10.
Uziela K, Wallner B. ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics. 2016;32(9):1411–3.
 11.
Manavalan B, Lee J, Lee J. Random forestbased protein model quality assessment (RFMQA) using structural features and potential energy terms; 2014.
 12.
Hou J, Cao R, Cheng J. Deep convolutional neural networks for predicting the quality of single protein structural models. bioRxiv. 2019:590620.
 13.
Uziela K, Menéndez Hurtado D, Shu N, Wallner B, Elofsson A. ProQ3D: improved model quality assessments using deep learning. Bioinformatics. 2017;33(10):1578–80.
 14.
Hurtado DM, Uziela K, Elofsson A: Deep transfer learning in the assessment of the quality of protein models. arXiv preprint arXiv:180406281 2018.
 15.
McGuffin LJ, Adiyaman R, Maghrabi AH, Shuid AN, Brackenridge DA, Nealon JO, Philomina LS. IntFOLD: an integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res. 2019;47:W408–13.
 16.
Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31(13):3370–4.
 17.
Liu T, Wang Z. SOV_refine: a further refined definition of segment overlap score and its significance for protein structure similarity. Source Code Biol Med. 2018;13(1):1.
 18.
Zhang J, Zhang Y. A novel sidechain orientation dependent potential derived from randomwalk reference state for protein fold selection and structure prediction. PLoS One. 2010;5(10):e15386.
 19.
Zhou H, Skolnick J. GOAP: a generalized orientationdependent, allatom statistical potential for protein structure prediction. Biophys J. 2011;101(8):2043–52.
 20.
Zhou H, Zhou Y. Distancescaled, finite idealgas reference state improves structurederived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11(11):2714–26.
 21.
Chou KC. Prediction of protein cellular attributes using pseudoamino acid composition. Proteins. 2001;43(3):246–55.
 22.
Alford RF, LeaverFay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, Shapovalov MV, Renfrew PD, Mulligan VK, Kappel K. The Rosetta allatom energy function for macromolecular modeling and design. J Chem Theory Comput. 2017;13(6):3031–48.
 23.
Magnan CN, Baldi P. SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics. 2014;30(18):2592–7.
 24.
Frishman D, Argos P. Knowledgebased protein secondary structure assignment. Proteins Struct Funct Genet. 1995;23(4):566–79.
 25.
Zemla A, Venclovas Č, Fidelis K, Rost B. A modified definition of Sov, a segmentbased measure for protein secondary structure prediction assessment. Proteins. 1999;34(2):220–3.
 26.
Lobanov MY, Bogatyreva N, Galzitskaya O. Radius of gyration as an indicator of protein structure compactness. Mol Biol. 2008;42(4):623–8.
 27.
Benkert P, Tosatto SC, Schomburg D. QMEAN: a comprehensive scoring function for model quality assessment. Proteins. 2008;71(1):261–77.
 28.
Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: allatom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr. 2010;66(1):12–21.
 29.
Sippl MJ. Boltzmann's principle, knowledgebased mean fields and protein folding. An approach to the computational determination of protein structures. J Comput Aid Mol Des. 1993;7(4):473–501.
 30.
Melo F, Sánchez R, Sali A. Statistical potentials for fold assessment. Protein Sci. 2002;11(2):430–48.
 31.
Sippl MJ. Calculation of conformational ensembles from potentials of mena force: an approach to the knowledgebased prediction of local structures in globular proteins. J Mol Biol. 1990;213(4):859–83.
 32.
DeBolt SE, Skolnick J. Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: atomic burial position and pairwise nonbonded interactions. Protein Eng. 1996;9(8):637–55.
 33.
Jayaram B, Bhushan K, Shenoy SR, Narang P, Bose S, Agrawal P, Sahu D, Pandey V. Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins. Nucleic Acids Res. 2006;34(21):6195–204.
 34.
Melo F, Feytmans E. Assessing protein structures with a nonlocal atomic interaction energy. J Mol Biol. 1998;277(5):1141–52.
 35.
Melo F, Feytmans E. Novel knowledgebased mean force potential at atomic level. J Mol Biol. 1997;267(1):207–22.
 36.
Gerstein M, Tsai J, Levitt M. The volume of atoms on the protein surface: calculated from simulation, using Voronoi polyhedra. J Mol Biol. 1995;249(5):955–66.
 37.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
 38.
Manavalan B, Shin TH, Kim MO, Lee G. AIPpred: sequencebased prediction of antiinflammatory peptides using random forest. Front Pharmacol. 2018;9:276.
 39.
Manavalan B, Subramaniyam S, Shin TH, Kim MO, Lee G. Machinelearningbased prediction of cellpenetrating peptides and their uptake efficiency with improved accuracy. J Proteome Res. 2018;17(8):2715–26.
 40.
Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
 41.
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M. pROC: an opensource package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12(1):77.
 42.
Cao R, Adhikari B, Bhattacharya D, Sun M, Hou J, Cheng J. QAcon: single model quality assessment using protein structural and contact information with machine learning techniques. Bioinformatics. 2017;33(4):586–588.
 43.
Olechnovič K, Venclovas Č. VoroMQA: assessment of protein structure quality using interatomic contact areas. Proteins. 2017;85(6):1131–45.
 44.
Olechnovič K, Venclovas Č. VoroMQA web server for assessing threedimensional structures of proteins and protein complexes. Nucleic Acids Res. 2019;47:W437–42.
 45.
Conover M, Staples M, Si D, Sun M, Cao R. AngularQA: protein model quality assessment with LSTM networks. Comput Math Biophys. 2019;7(1):1–9.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 4, 2020: Proceedings of the 16th Annual MCBIOS Conference. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement4
Funding
The research was supported by National Institutes of Health R15GM120650 and startup funding from the University of Miami to ZW. Publication costs are funded by the National Institutes of Health R15GM120650 to ZW.
Author information
Affiliations
Contributions
TL developed the systems and potentials, benchmarked the performance, and wrote the first draft of the manuscript. ZW advised the research and edited the manuscript. Both authors have read and approved the final manuscript.
Authors’ information
Not applicable.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Supplementary Information and data. This document provides more details regarding pseudo amino acid composition, MASS potential comparision, and evaluation results for different QA methods in stage 1.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liu, T., Wang, Z. MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials. BMC Bioinformatics 21, 246 (2020). https://doi.org/10.1186/s1285902033833
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902033833
Keywords
 Protein model quality assessment
 Singlemodel QA
 Random forests
 Protein energy potentials