Ranking near-native candidate protein structures via random forest classification

Abstract

Background

In ab initio protein-structure prediction, a large set of structural decoys is often generated, and the best five or three candidates must be selected from them. The central structures of the clusters with the largest numbers of neighbors are frequently regarded as the near-native protein structures with the lowest free energy; however, limitations in clustering methods and three-dimensional structural-distance assessments make it difficult to identify the exact order of the best five or three near-native candidate structures.

Results

To address this issue, we propose a method that re-ranks the candidate structures via random forest classification using intra- and inter-cluster features derived from the clustering results. Comparative analysis indicated that our method identified the order of the candidate structures better than the current methods SPICKER, Calibur, and Durandal. Across 43 proteins, the first model selected by our method was closer to the native structure than that of SPICKER in 12 cases, the same in 27, and worse in four; closer than that of Calibur in 21 cases, the same in eight, and worse in 14; and closer than that of Durandal in six cases, the same in 35, and worse in two.

Conclusions

In this study, we presented a method based on random forest classification that transforms the problem of re-ranking candidate structures into a binary classification problem. Our results indicate that the method re-ranks candidates effectively and outperforms the three clustering methods with which it was compared.

Background

Proteins are basic elements involved in biological functions. Recent advances in computational methods and algorithmic efficiency have enabled prediction of the three-dimensional (3D) structures of proteins from their sequences, which represents an increasingly important method for exploring their roles, networks, functions, and potential as drug targets. Whether in comparative modeling, protein threading, or ab initio modeling, detecting the lowest-free-energy model (best model) from decoys by clustering represents an important step in protein-structure prediction [1]. In these methods, decoys are clustered, and the centroid structure of each cluster is reported as a final predicted structure. In popular protein-structure-prediction systems, including I-TASSER [2], MODELLER [3], and Rosetta [4], clusters are created iteratively. One criterion for clustering involves choosing decoys with more neighbors over decoys with fewer neighbors. The cluster centers are then ranked according to cluster size, on the assumption that the centers of larger clusters are closer to the best near-native models.

Zhang and Skolnick [5] developed SPICKER, which uses a simple and effective strategy to identify near-native conformations via cluster analysis. With this strategy, the best of the top five identified folds has a root-mean-square deviation (RMSD) from the native structure that ranks in the top 1.4% of all decoys. For 78% of the proteins, the difference between the RMSD of the model from the native structure and that of the absolutely best individual decoy from the native structure is < 1 Å. Li and Ng [6] proposed Calibur, which uses three strategies to enhance performance and remains stable regardless of increases in the number of decoys, and Berenger et al. [7] proposed Durandal, a fast method effective for large-scale decoy sets. ClusCo [8] was developed to compare high-throughput protein models using different similarity measures and supports parallel execution on CPUs and GPUs. Li et al. [9] proposed an efficient clustering method allowing rapid estimation of cluster centroids and efficient pruning of rotation spaces. Although these methods improved the detection of optimal near-native models and accelerated the clustering process, their accuracy remains limited: because of inaccuracies in evaluating the lowest free energy and in 3D distance metrics, the center of the largest cluster is not always the model closest to the native structure. These state-of-the-art methods successfully identify the best five or three candidate structures from the decoys, but they sometimes fail to order those candidates correctly. The accuracies of SPICKER, Calibur, and Durandal in predicting the first model are 60, 44, and 49%, respectively, with 17, 31, and 27 incorrectly ranked models among the candidates, respectively. If the candidate structures could be re-ranked in 100% correct order, the average RMSD of the first model would improve by 11.9, 16.3, and 15.9% for SPICKER, Calibur, and Durandal, respectively.

To address this issue, we propose an algorithm based on random forest classification to re-rank candidate structures detected by clustering. The algorithm casts the re-ranking of candidate structures as a binary classification problem, taking as features the length of the protein, the PSSM (position-specific scoring matrix), the size of each cluster associated with the protein, the average RMSD and average TM_SCORE [10] between each center model and the other four center models, and the average RMSD and average TM_SCORE between each center model and all other models in its cluster. The RMSD between each model and the corresponding native structure is used to derive the label. Our results show that, across 43 proteins, the first model chosen by the algorithm was closer to the native structure than that of SPICKER in 12 cases and worse in four, closer than that of Calibur in 21 cases and worse in 14, and closer than that of Durandal in six cases and worse in two.

Method

Cluster methods for detecting candidate near-native structures

Protein-structure clustering is an important step in predicting protein 3D structure, function, and interactions. Structure-prediction methodologies involving clustering must identify the candidate structures with the highest degree of similarity to the native structure from a large number of decoy structures, generated by free modeling or template-based modeling, that are provided as input to the clustering algorithm. The following three methods are current methods for detecting near-native models.

SPICKER

The method developed by Zhang and Skolnick [5] generates clusters in a single-step process using a set of shrinking scales, dynamically adjusting the conformational-similarity threshold between candidate pairs during each iteration. It was benchmarked on a set of 1489 non-homologous proteins representing all protein structures in the PDB ≤ 200 residues. Subsequently, a fast algorithm for population-based protein structural model analysis was proposed [1], which introduced two new distance metrics for describing the differences and similarities among models. Compared with existing methods whose calculation times are quadratic in the number of models, Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection.

Calibur

The method developed by Li and Ng [6] clusters decoys using proximate decoy organization, preliminary screening via lower and upper bounds, and outlier filtering. This method scales well with respect to increases in the number of decoys and automatically discovers a suitable threshold distance for clustering based on the decoys used as input. Several algorithms for this discovery are implemented in Calibur, with the fastest used by default.

Durandal

The method developed by Berenger et al. [7] works on large decoy sets and is consistently faster than other methods at exact clustering. In some cases, Durandal also outperforms approximate methods, which is attributed to its use of the triangle inequality to accelerate exact clustering without compromising the distance function.
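The pruning idea behind the triangle inequality can be illustrated with a minimal Python sketch. It assumes the structural distance behaves as a metric, so that the distances of two decoys to a shared reference decoy bound the distance between them. The function names and the cutoff are hypothetical and are not part of Durandal's actual interface.

```python
def distance_bounds(d_a_ref, d_b_ref):
    """Bounds on d(a, b) implied by the triangle inequality.

    d_a_ref and d_b_ref are the (already computed) distances of decoys
    a and b to a common reference decoy.
    """
    lower = abs(d_a_ref - d_b_ref)
    upper = d_a_ref + d_b_ref
    return lower, upper


def neighbor_status(d_a_ref, d_b_ref, cutoff):
    """Decide, where possible, whether a and b are neighbors without computing d(a, b)."""
    lower, upper = distance_bounds(d_a_ref, d_b_ref)
    if upper <= cutoff:
        return "neighbor"       # d(a, b) cannot exceed the cutoff
    if lower > cutoff:
        return "not_neighbor"   # d(a, b) must exceed the cutoff
    return "unknown"            # bounds are inconclusive; compute d(a, b) exactly
```

Only the pairs reported as "unknown" need an exact distance calculation, which is how such bounds can speed up exact clustering without changing its result.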

Although these three clustering methods can detect near-native models, the limitations of clustering methods and three-dimensional structure-distance evaluation make it difficult to determine the exact order of the candidate structures. Therefore, we chose to use random forest classification to re-rank the near-native models obtained by the three clustering algorithms.

Inter-cluster and intra-cluster features

Feature selection is one of the key issues in any machine learning method, and the complexity of biological evolutionary processes increases its difficulty [11, 12]. The re-ranking task is closely related to the protein and to the cluster information, so we divided the seven features employed by the method into three categories: protein features, intra-cluster features (information within each cluster), and inter-cluster features (relationships between clusters). Protein features, which relate directly to the protein itself, include 1) the length of the protein sequence and 2) the position-specific scoring matrix (PSSM), which is a way of encoding amino acids. The PSSM is a matrix with N rows, where N is the number of amino acids in the protein, and M columns, where M is the number of amino acid types. We converted this matrix into a vector of length 1 × (MAXN × M) and concatenated it with the other six features to form a vector of length 6 + MAXN × M. If N is greater than MAXN, only MAXN rows are used. Intra-cluster features include the following: 3) the size of the cluster, which is the number of elements in the cluster; 4) the average RMSD between the cluster center and the remaining models in the cluster, which represents intra-cluster similarity; and 5) the average TM_SCORE between the cluster center and the remaining models in the cluster, which also represents intra-cluster similarity. Inter-cluster features include the following: 6) the average RMSD between the current center model and the other four center models, which represents inter-cluster similarity; and 7) the average TM_SCORE between the current center model and the other four center models, which also represents inter-cluster similarity.
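The construction of the fixed-length feature vector can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function name is hypothetical, keeping the first MAXN rows is one reading of the truncation rule, and zero-padding of proteins shorter than MAXN is an assumption, since the text does not state how that case is handled.

```python
def build_feature_vector(scalar_features, pssm, max_n):
    """Concatenate six scalar features with a PSSM flattened to max_n * M values.

    scalar_features: protein length, cluster size, and the four average
                     RMSD / TM_SCORE values (six numbers in total)
    pssm:            list of N rows, each with M scores
    max_n:           fixed number of PSSM rows (MAXN in the text)
    """
    if len(scalar_features) != 6:
        raise ValueError("expected six scalar features")
    m = len(pssm[0])
    # Truncate when N > MAXN (keeping the first rows is an assumption).
    rows = [list(row) for row in pssm[:max_n]]
    # ASSUMPTION: zero-pad when N < MAXN; the article does not specify this case.
    rows += [[0.0] * m for _ in range(max_n - len(rows))]
    flat = [value for row in rows for value in row]
    return list(scalar_features) + flat
```

The resulting vector always has length 6 + MAXN × M, so candidates from proteins of different lengths can be passed to the same classifier.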

The schematic of the method

Random forest classification employs a combination of the bagging algorithm and the random subspace algorithm [13, 14], with the decision tree as the foundation of the method [15, 16]. Classification accuracy is improved by combining multiple decision trees: h1(x), h2(x), …, hnTree(x) [17, 18]. Once the random forest classifier is obtained (Fig. 1), samples of unknown category can be classified. The original data T = (xi1, xi2, xi3, …, xi6 + MAXN × M, yi), i ∈ [1, N] (the index i denotes the ith sample in the original data, and each x denotes one feature used by the random forest) contain N samples, each with 6 + MAXN × M features. Y = yi, i ∈ [1, N] is the category label, which is derived from the RMSD between each decoy and the native protein structure. yi takes c ≥ 2 values, which represent c classes. The method uses four different random forests to identify the first, second, third, fourth, and fifth models; the fifth model is the candidate that remains after the first four have been selected. Each random forest performs a binary classification in which “1” represents the candidate with the minimum RMSD from the native protein and “0” represents the remaining candidates. We built these four random forests sequentially. After each random forest was completed, we selected the candidate labeled “1” as the best near-native model and removed it from the candidates. The remaining candidates were then used as the input for the next random forest. This process was repeated until all candidates had been selected. The process is shown in Fig. 1.
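The sequential selection loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each trained random forest is represented by a hypothetical scoring callable that returns a score for label "1", and the candidate with the highest score is taken as the one labeled "1" in that round. The article does not specify how ties or multiple "1" predictions are resolved.

```python
def rerank(candidates, scorers):
    """Order candidates by repeatedly picking the one most likely to be label '1'.

    candidates: list of (name, feature_vector) pairs
    scorers:    one callable per round (stand-ins for the trained random
                forests), each mapping a feature vector to a score for label '1'
    """
    remaining = list(candidates)
    ordered = []
    for score in scorers:
        if len(remaining) <= 1:
            break
        # ASSUMPTION: the highest-scoring candidate is treated as label '1'.
        best = max(remaining, key=lambda cand: score(cand[1]))
        ordered.append(best[0])
        remaining.remove(best)  # the selected model is removed before the next round
    # Whatever is left after the last classifier closes the ranking.
    ordered.extend(name for name, _ in remaining)
    return ordered
```

With five candidates and four scorers, the fifth model is simply the one left over after four rounds, which is why four binary classifiers suffice to order five models.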

Fig. 1 Schematic of the re-ranking method via random forest classification

Algorithm

The first step involves clustering the decoys with each method to obtain K clusters [19, 20], ranking the clusters by the number of models they contain, and extracting the top five or three cluster centers as candidate models [21]; these are divided into a training set and a test set.
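The ranking of clusters by size can be sketched as follows. The data layout (a dictionary from a cluster-center identifier to its member list) and the function name are assumptions made for illustration; the clustering programs themselves report their results in their own formats.

```python
def top_cluster_centers(clusters, k=5):
    """Return the centers of the k largest clusters, largest first.

    clusters: dict mapping a center identifier to the list of its members
    k:        number of candidates to keep (five or three in the text)
    """
    ranked = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
    return [center for center, _members in ranked[:k]]
```

For example, `top_cluster_centers({"c1": [1, 2], "c2": [1, 2, 3, 4], "c3": [1]}, k=2)` returns the centers of the two most populated clusters.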

From the training set T1, N sub-datasets are randomly sampled, where N is the number of trees in the forest and is set to 100. For each tree, every continuous attribute is discretized by dichotomy (binary splitting), and the best split node is selected from the 6 + MAXN × M features using information entropy [22]. The feature with the best value is selected as the split feature [23], with Eq. (1) showing the calculation. A decision tree is complete once no further division of the features is possible, and its result is obtained according to the voting criterion. Once all N trees have been constructed, the random forest is complete.

$$ Entropy(T)=-{\sum}_{i=1}^{c}{P}_i{\log}_2{P}_i $$
(1)

According to Eq. (1), the smaller the information entropy, the higher the purity of the data. Pi represents the proportion of category i samples relative to the total number of samples. The training set T1 is therefore divided into n parts, where n equals the number of attribute values of the feature chosen by the information entropy criterion.
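As a worked illustration of Eq. (1), the following Python sketch computes the entropy of a set of labels and the information gain of a binary split. A node containing a single class has entropy 0, and a balanced two-class node has entropy 1 bit. The function names are illustrative, and the gain formula is the standard one rather than something spelled out in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels, as in Eq. (1)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy reduction obtained by splitting `labels` into `left` and `right`."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted
```

A split that separates the two classes perfectly, such as `[0, 0]` and `[1, 1]` from `[0, 0, 1, 1]`, removes all of the entropy, so it would be preferred over any split that leaves mixed nodes.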

Finally, the test set is used to obtain the sorted results [24].

The end conditions of the random forest algorithm are as follows: a decision tree reaches the maximum depth, the impurity of an end node reaches the threshold, the number of samples at an end node reaches the set value, or the features are fully used. The random forest algorithm is shown in Table 1.

Table 1 Improved detection of near-native structures via random forest classification

Evaluation indices

To evaluate the performance of the re-ranking method, RMSD and TM_SCORE are used to measure the distance of models from the native structure.

RMSD

As a commonly used measure of the difference between protein structures, RMSD describes the variation between two models. It is the square root of the mean squared distance between corresponding atoms of the two structures, so smaller values indicate more similar structures. RMSD is generally used to compare the accuracy of different models relative to the same reference structure [25,26,27]. RMSD is calculated according to Eq. (2):

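$$ RMSD=\sqrt{\frac{1}{N}{\sum}_{k=1}^N\left[{\left({x}_{ik}-{x}_{jk}\right)}^2+{\left({y}_{ik}-{y}_{jk}\right)}^2+{\left({z}_{ik}-{z}_{jk}\right)}^2\right]} $$
(2)

where N is the number of corresponding atoms in the two structures i and j, and (xik, yik, zik) and (xjk, yjk, zjk) are the coordinates of the kth atom in structures i and j, respectively, after superposition.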
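A minimal Python sketch of the RMSD calculation is given below. It assumes the two structures have already been superposed and that their atoms are listed in corresponding order; the optimal superposition step itself is not shown, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    ASSUMPTION: the structures are already superposed and atom k of
    coords_a corresponds to atom k of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

For two atoms displaced by 2 Å each, for instance `[(0, 0, 0), (1, 0, 0)]` against `[(0, 0, 2), (1, 2, 0)]`, the summed squared distance is 8, the mean is 4, and the RMSD is 2 Å.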

TM_SCORE

TM_SCORE measures the structural similarity between two protein models. The index emphasizes global fold similarity and is less sensitive to local structural variations than RMSD, and the TM_SCORE of random structure pairs is generally independent of sequence length. TM_SCORE values lie in the interval (0, 1], where 1 represents a perfect match between two structures. According to calculations of TM_SCORE using structures from the Protein Data Bank, a score < 0.17 corresponds to randomly selected unrelated proteins, whereas a score > 0.5 indicates highly similar folds [28]. TM_SCORE is calculated according to Eq. (3):

$$ TM\_Score=\operatorname{Max}\left[\frac{1}{L_n}{\sum}_{i=1}^{L_a}\frac{1}{1+{\left(\frac{d_i}{d_0}\right)}^2}\right] $$
(3)

where Ln is the sequence length of the native structure, La is the number of residues aligned to the template structure, di is the distance between the ith pair of aligned residues, d0 is a scale used to normalize the match difference, and Max indicates the maximum value after optimal spatial superposition.
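The bracketed term of Eq. (3) can be sketched in Python for a single, fixed superposition. The search over superpositions denoted by Max is deliberately omitted, so this is not a full TM_SCORE implementation. The helper for d0 uses the length-dependent form commonly given in the TM-score literature, with a floor of 0.5 Å; neither the formula nor the floor is stated in this article, so both are assumptions here.

```python
def d0_from_length(l_native):
    """Length-dependent distance scale d0 (in Angstrom).

    ASSUMPTION: uses the commonly cited form 1.24 * (L - 15)^(1/3) - 1.8
    with a lower bound of 0.5; the article does not give this formula.
    """
    if l_native <= 15:
        return 0.5
    return max(0.5, 1.24 * (l_native - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed_superposition(distances, l_native, d0):
    """Value of the bracketed term in Eq. (3) for one given superposition.

    distances: d_i for each aligned residue pair (length L_a)
    l_native:  L_n, the length of the native structure
    """
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_native
```

When every aligned pair coincides and all native residues are aligned, each term equals 1 and the score is 1; when every pair is exactly d0 apart, each term equals 0.5.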

Results

Datasets

Four datasets are employed in the experiments: I-TASSER Decoy Set-I, the QUARK Decoy Set, the CASP10 dataset, and the CASP11 dataset, all generated by I-TASSER and QUARK (https://zhanglab.ccmb.med.umich.edu/decoys/). These datasets are widely used to evaluate protein decoy clustering [29]. We used I-TASSER Decoy Set-I as the test dataset and the other three datasets as training sets. Table 2 provides an overview of the four datasets.

Table 2 Datasets

The I-TASSER Decoy Set-I contains a complete set of atomic structure decoys for 56 non-homologous proteins. Thirteen proteins whose decoys could not be clustered into more than five clusters were removed, and the remaining 43 proteins were used. The backbone structures were modeled ab initio by I-TASSER, and side-chain atoms were added using Pulchra (http://www.pirx.com/pulchra/index.shtml).

The QUARK Decoy Set contains 145 non-homologous proteins. The backbone structures were modeled ab initio by QUARK, and the all-atom models of the best candidates were generated by ModRefiner (https://zhanglab.ccmb.med.umich.edu/ModRefiner/).

The CASP10 dataset consists of I-TASSER and QUARK decoys for CASP10 targets that the I-TASSER server predicted as belonging to a single domain. The dataset contains 54 proteins with experimental structures resolved before the CASP10 meeting. There is a gap between the submitted model and the best model among the decoys; therefore, choosing the best model relative to the experimental structure is extremely challenging.

The CASP11 dataset includes decoys generated by I-TASSER and QUARK for CASP11 targets that the I-TASSER server predicted as belonging to a single domain. Multi-domain targets were ignored to avoid the possibility that ambiguity in domain splitting might render the decoys meaningless. These decoys were used during CASP11.

Comparison of the three clustering methods with random forest classification

We evaluated the ability of the method to identify near-native structures relative to that of the three clustering methods. Models are predicted in rank order, so a false prediction of the first model leads to inaccuracies in the subsequently predicted models and results in poor rankings. The comparative analysis therefore removes each model once it has been ranked and ranks the remaining models in subsequent rounds of processing.

Comparison of the first model

Because the RMSD between decoy models and the native structure is used to label the data for the random forest classifier, we assigned label “1” to the model with the lowest RMSD and label “0” to the remaining models, establishing a two-category set (0,1) for ranking. Because four-fifths of the models are labeled “0” and only one-fifth are labeled “1”, the training set is imbalanced. We used over-sampling to increase the amount of data in the “1” class and thereby reduce the imbalance. The 43 protein sets were then submitted to the classifier, and the model predicted as “1” was taken as the first model. Comparing RMSD values between the first model predicted by the random forest classifier and those predicted using the three different clustering methods indicated that our method outperformed the others (Table 3).
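The over-sampling step can be sketched as follows. The article does not say which over-sampling scheme was used, so this example assumes the simplest one, random duplication of minority-class samples until the two classes are the same size. The function name, the fixed seed, and the balancing target are illustrative choices.

```python
import random


def oversample_minority(samples, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes are equal in size.

    ASSUMPTION: random duplication; the article only states that
    over-sampling was applied to the "1" class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority samples to duplicate")
    majority_count = len(labels) - len(minority_idx)
    n_extra = max(0, majority_count - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

For one protein with five candidates (one labeled “1” and four labeled “0”), three copies of the “1” sample are added, giving four samples in each class.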

Table 3 RMSD comparison of the first model of 43 proteins

The random forest classifier ranked the candidate structures with higher accuracy according to average RMSD. Twelve of the models predicted by the random forest classifier were closer to the native structure than those predicted by SPICKER, 27 were the same, and four were inferior. The average RMSD decreased by 8.40%, from 5.36 to 4.91, after re-ranking by the random forest classifier. Twenty-one of the models predicted by the random forest classifier were closer to the native structure than those predicted by Calibur, eight were the same, and 14 were inferior. Finally, six of the models predicted by the random forest classifier were closer to the native structure than those predicted by Durandal, 35 were the same, and two were inferior. These data indicated that the random forest classifier ordered the candidate structures most similar to the native structure more accurately than the three other methods.
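The two summary statistics used in this comparison, the closer/same/inferior counts and the relative decrease in average RMSD, can be reproduced with a short Python sketch. The function names and the tolerance used to decide that two RMSD values are the same are assumptions for illustration.

```python
def compare_models(rmsd_new, rmsd_old, tol=1e-9):
    """Count targets where the new first model is closer, the same, or inferior.

    rmsd_new, rmsd_old: per-target RMSD of the first model to the native
    structure for the re-ranked and the original method, in the same order.
    """
    closer = sum(1 for n, o in zip(rmsd_new, rmsd_old) if n < o - tol)
    inferior = sum(1 for n, o in zip(rmsd_new, rmsd_old) if n > o + tol)
    same = len(rmsd_new) - closer - inferior
    return closer, same, inferior


def percent_decrease(old_mean, new_mean):
    """Relative decrease of the average RMSD, in percent."""
    return 100.0 * (old_mean - new_mean) / old_mean
```

With the averages reported for SPICKER, `percent_decrease(5.36, 4.91)` evaluates to about 8.40, which matches the figure quoted in the text.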

Comparison of the second model

After removal of the first model from the dataset, we followed the same algorithmic procedure to establish the optimal RMSD values between decoy models and the native structure, resulting in another two-category set (0,1). In this case, three-fourths of the models are labeled “0” and one-fourth are labeled “1”, so we again used over-sampling to overcome the imbalance of the training set. Comparing RMSD values between the second model predicted by the random forest classifier and those predicted using the three different clustering methods indicated that our method outperformed the others (Fig. 2).

Fig. 2 Comparison of RMSD of the second model in the absence of the first model

The random forest classifier generated predictions with higher accuracy according to average RMSD. Fifteen of the models predicted by the random forest classifier were closer to the native structure than those predicted by SPICKER, 22 were the same, and six were worse, resulting in a 21% increase in accuracy. Eleven of the models predicted by the random forest classifier were closer to the native structure than those predicted by Calibur, 19 were the same, and 13 were worse, resulting in a 4% increase in accuracy. Sixteen of the models predicted by the random forest classifier were closer to the native structure than those predicted by Durandal, 19 were the same, and eight were worse, resulting in an 18% increase in accuracy. These data indicated that the random forest classifier predicted the models most similar to the native structure more accurately than the three other methods.

Comparison of the third model and the fourth model

Because Calibur and Durandal usually predict only three near-native candidate structures, whereas SPICKER can predict five, the comparisons of the third and fourth models were performed only against SPICKER. Comparing RMSD values between the third and fourth models predicted by the random forest classifier and those predicted by SPICKER indicated that our method outperformed SPICKER (Fig. 3). For the third model (Fig. 3a), the random forest classifier ordered predictions with higher accuracy according to average RMSD: sixteen of the models predicted by the random forest classifier were closer to the native structure than those predicted by SPICKER, 17 were the same, and ten were worse, resulting in a 14% increase in accuracy. For the fourth model (Fig. 3b), eleven of the models predicted by the random forest classifier were closer to the native structure than those predicted by SPICKER, 27 were the same, and five were worse, resulting in a 14% increase in accuracy. These data indicated that the random forest classifier predicted the models most similar to the native structure more accurately than SPICKER.

Fig. 3 Comparison of the RMSD of the third model and the fourth model. a. Comparison of the RMSD of the third model. b. Comparison of the RMSD of the fourth model

Comparison of the numbers of correct predictions

Fig. 4 shows that the random forest classifier predicted the models most similar to the native structure more accurately than the three clustering methods. After re-ordering by RF_SPICKER, 35 (81.39%) of 43 first models were correctly identified, whereas SPICKER correctly identified only 26 (60.46%). For the second, third, and fourth models, RF_SPICKER correctly predicted 4, 5, and 6 more targets than SPICKER, respectively. Although Calibur and Durandal usually predict only three near-native candidate structures, RF_Calibur and RF_Durandal successfully predicted 1 and 5 more targets than Calibur and Durandal, respectively, on the first model, and 1 and 8 more targets, respectively, on the second model.

Fig. 4 Comparison of the numbers of correct predictions

Discussion

1dcj is a small protein encoded by the yhhP gene of Escherichia coli. Its high-precision NMR (nuclear magnetic resonance) structure was determined by Katoh and colleagues in 2000 [30,31,32]. Previous research has implicated 1dcj in the cell division process, although the precise biological function of this protein has not yet been identified. The serum glycoprotein C5a (1kjs), which is derived from the proteolytic cleavage of complement protein C5, has been implicated in the pathogenesis of a number of inflammatory and allergic conditions [16, 33]. Its three-dimensional structure was determined by two-dimensional NMR. Computational structures are very useful for understanding protein function and evolution.

Visual structural comparisons of the native structures with the SPICKER, Calibur, and Durandal models are shown in Fig. 5a and b. The native structure is in green, the first models detected by SPICKER, Calibur, and Durandal are in yellow, and the re-ranked models predicted via random forest classification are in red. In the visual comparison for 1dcj, both the SPICKER model (1dcj, RMSD 11.66) and the RF_SPICKER model (1dcj, RMSD 10.45) successfully built the two helices marked by the purple circles, but the helices of the RF_SPICKER model are closer to the native structure. The native structure of 1dcj has three beta-strand motifs. Although prediction of the three-dimensional structure of beta-strands is commonly regarded as a difficult task, the random forest classifier successfully chose the RF_Calibur model (1dcj, RMSD 11.66), which has one beta-strand, as the first model, whereas Calibur chose a model (1dcj, RMSD 12.18) without any beta-strand. The main difference between the Durandal model (1dcj, RMSD 11.95) and the RF_Durandal model (1dcj, RMSD 9.96) is the location of the first helix region. For the protein 1kjs, the SPICKER model (1kjs, RMSD 8.67) completely failed to build the right-side short helix, whereas the RF_SPICKER model (1kjs, RMSD 5.88) has this short helix, and only the direction of the helix is not exactly consistent with the native helix. In the Calibur and Durandal comparisons, the RF_Calibur model (1kjs, RMSD 5.89) and the RF_Durandal model (1kjs, RMSD 5.92) successfully built the short helix and aligned well with the native model, unlike the Calibur model (1kjs, RMSD 8.44) and the Durandal model (1kjs, RMSD 8.74).

Fig. 5 Visual comparison of the random forest classifier and current prediction methods on 1dcjA and 1kjs_

Conclusion

This study presented a method that re-orders the candidate near-native structures by random forest classification after clustering methods have identified the five or three candidate structures. The method employs four binary classifiers to detect the first, second, third, fourth, and fifth models using protein features, inter-cluster features, and intra-cluster features. To evaluate the performance of the method, four widely used datasets (I-TASSER Decoy Set-I, the QUARK Decoy Set, the CASP10 dataset, and the CASP11 dataset) were employed. Compared with three leading methods, the method decreased the average RMSD of the first model by 8.40%, from 5.36 to 4.91, for SPICKER; by 9.76%, from 5.53 to 4.99, for Calibur; and by 3.91%, from 5.36 to 5.15, for Durandal.

Availability of data and materials

The extracted data supporting the conclusions of this article are included within the article. The datasets can be accessed at https://zhanglab.ccmb.med.umich.edu/decoys/

Abbreviations

3D:

Three-dimensional

CASP:

Critical Assessment of protein Structure Prediction

NMR:

Nuclear magnetic resonance

PSSM:

Position-specific scoring matrix

RF_Calibur:

RMSD of model predicted by the random forest classification from Calibur results

RF_Durandal:

RMSD of model predicted by the random forest classification from Durandal results

RF_SPICKER:

RMSD of model predicted by the random forest classification from SPICKER results

RMSD:

Root-mean-square deviation

TM_SCORE:

Template modeling score

References

  1. 1.

    Zhang J, Xu D. Fast algorithm for population-based protein structural model analysis. PROTEOMICS. 2013;13(2):221–9.

  2. 2.

    Simoncini D, Berenger F, Shrestha R, et al. A probabilistic fragment-based protein structure prediction algorithm. PLoS One. 2012;7(7):e38799.

  3. 3.

    Marti-Renom MA, Stuart A, Fiser A, et al. Comparative protein structure modeling of genes and genomes [J]. Annu Rev Biophys Biomol Struct. 2000;29:291–325.

  4. 4.

    Huang PS, Boyken SE, Baker D. The coming of age of de novo protein design. Nature. 2016;537(7620):320–7.

  5. 5.

    Zhang Y, Skolnick J. SPICKER: a clustering approach to identify near-native protein folds. J Comput Chem. 2004;25(6):865–71.

  6. 6.

    Li SC, Ng YK. Calibur: a tool for clustering large numbers of protein decoys. BMC Bioinformatics. 2010;11(1):25–0.

  7. 7.

    Berenger F. Zhou, et al. entropy-accelerated exact clustering of protein decoys. Bioinformatics. 2011;27(7):939–45.

  8. 8.

    Huang X, Lu Q, Qian P. Evaluation of protein structure prediction clustering algorithm. Comput Eng. 2011;37(1):24–7.

  9. 9.

    Li SC, Bu D, Li M. Clustering 100,000 protein structure decoys in minutes. IEEE/ACM Transac Comput Biol Bioinformatics. 2012;9(3):765–73.

  10. 10.

    Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score [J]. Nucleic Acids Res. 2005;33(7):2302–9.

  11. 11.

    Liu H, Mo Y, Wang J, et al. A new feature selection method based on clustering[C], Eighth International Conference on Fuzzy Systems & Knowledge Discovery. Shanghai: IEEE; 2011.

  12. 12.

    Huang DS, Yu HJ. Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids. IEEE/ACM Transac Comput Biol Bioinformatics. 2013;10(2):457–67.

  13. 13.

    Siddiqui FU, Mat Isa NA. Enhanced moving K-means (EMKM) algorithm for image segmentation [J]. IEEE Trans Consum Electron. 2011;57(2):833–41.

  14. 14.

    Liu B, Weng F, et al. iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach. Bioinformatics. 2018;34(22):3835–42.

  15. 15.

    Hoffman R, Lazaridis D. The limits of compulsion: demographic influences on voter turnout in Australian state elections. Aust J Polit Sci. 2013;48(1):28–43.

  16. 16.

    Huang DS, Zhang L, et al. Prediction of protein-protein interactions based on protein-protein correlation using least squares regression. Curr Protein Pept Sci. 2014;15(6):553–60.

  17. 17.

    Liu Q, Lu J, Chen S. Design and analysis of traffic incident detection method based on random forest. J Southeast Univ (English Edition). 2014;1:88–95.

  18. 18.

    Liu B, Yang F, et al. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2018;34(1):33–40. https://doi.org/10.1093/bioinformatics/btx579.

  19. 19.

    Dang J, Jia R, Luo X, et al. Research on wear properties assessment of tubular turbine guide bearing based on H-K clustering-logistic regression model. Shuili Xuebao/J Hydraulic Eng. 2017;48(2):226–33.

  20. 20.

    Yuan L, Zhu L, et al. Nonconvex penalty based low-rank representation and sparse regression for eQTL mapping. IEEE/ACM Transac Comput Biol Bioinformatics. 2017;14(5):1154–64.

  21. 21.

    Jamroz M, Kolinski A. ClusCo: clustering and comparison of protein models. Bmc Bioinformatics. 2013;14(1):62.

  22. 22.

    Wang A, Wan G, Cheng Z, et al. An incremental extremely random forest classifier for online learning and tracking[C]. IEEE International Conference on Image Processing. Hong Kong: IEEE; 2010.

  23. 23.

    del Río S. López, Victoria, Benítez, José Manuel, et al. on the use of MapReduce for imbalanced big data using random forest. Inform Sci Int J. 2014;285(C):112–37.

  24.

    Pudlo P, Marin JM, Estoup A, et al. Reliable ABC model choice via random forests. Bioinformatics. 2015;32(6):859–66.

  25.

    Wu H, Li H, Min J, et al. Identify high-quality protein structural models by enhanced K-means. Biomed Res Int. 2017;2017(18):1–9.

  26.

    Zhu L, Zhang HB, et al. Direct AUC optimization of regulatory motifs. Bioinformatics. 2017;33(14):i243–51.

  27.

    You ZH, Lei YK, et al. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics. 2010;26(21):2744–51.

  28.

    Yu H, Zhang C, Wang G. A tree-based incremental overlapping clustering method using the three-way decision theory. Knowl-Based Syst. 2016;91:189–203.

  29.

    Yang J, Zhang W, He B, et al. Template-based protein structure prediction in CASP11 and retrospect of I-TASSER in the last decade. Proteins Struct Funct Bioinformatics. 2016;84(S1):233–46.

  30.

    Katoh E, Hatta T, Shindo H, et al. High precision NMR structure of YhhP, a novel Escherichia coli protein implicated in cell division. J Mol Biol. 2000;304(2):0–229.

  31.

    Chuai GH, Ma H, Yan JF, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):80.

  32.

    Zhu L, Deng SP, et al. Identifying spurious interactions in the protein-protein interaction networks using local similarity preserving embedding. IEEE/ACM Transac Comput Biol Bioinformatics. 2017;14(2):345–52.

  33.

    Zhang X, Boyar W, Toth MJ, et al. Structural definition of the C5a C terminus by two-dimensional nuclear magnetic resonance spectroscopy. Proteins Struct Funct Bioinformatics. 2015;28(2):261–7.

Acknowledgments

The authors thank the anonymous reviewers, whose suggestions helped improve the manuscript.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 25, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-25.

Funding

This work was supported in part by the National Natural Science Foundation of China (Nos. 61772357, 61902272, 61672371, 61876217, 61902271, 61750110519) and the Suzhou Science and Technology Project (SYG201704, SNG201610, SZS201609). The publication costs of this article were funded by the above grants.

Author information

Contributions

HW proposed the original idea. HH and WL designed the framework and the experiments. HW, HH, WL and QF performed the experiments and the primary data analysis. HW and HH wrote the manuscript. YD, JQ and HL revised the code and the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Weizhong Lu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Wu, H., Huang, H., Lu, W. et al. Ranking near-native candidate protein structures via random forest classification. BMC Bioinformatics 20, 683 (2019). https://doi.org/10.1186/s12859-019-3257-8

Keywords

  • Protein structural prediction
  • Random forest
  • SPICKER