Decoy selection for protein structure prediction via extreme gradient boosting and ranking

Background Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme imbalance between positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection, despite issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. Results We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting mostly of low-quality decoys. Conclusions ML-Select is a useful method for decoy selection. This work suggests further research into more effective ways to adopt machine learning frameworks to achieve robust performance for decoy selection in template-free protein structure prediction.


Background
Protein molecules play a vital role in controlling the biological activities of a cell. Numerous wet-laboratory efforts aim to determine biologically-active/native tertiary structures as a route to decoding protein function [1]. Technological advances have now made it possible to generate hundreds of thousands of tertiary structures for a given amino-acid sequence, known as decoys, in a few CPU hours [2]. The multiplicity of decoys necessitates recognizing high-quality, near-native decoys among the hundreds of thousands of decoys in an ensemble. Identifying these near-native decoys is a challenging problem in computational structural biology, and is known as decoy selection.
Template-free methods, which generate low-energy tertiary structures in the absence of structural templates from homologous sequences, have now become prominent. The most popular ones include Rosetta [3] and Quark [4]. To compute the low-energy structures, these methods employ stochastic optimization to find local minima of a selected energy/scoring function. It is well known that biasing the search by energy often does not lead to tertiary structures that are close to the native. Therefore, identifying near-natives from a large ensemble of decoys remains an open problem [5].
Consequently, other decoy selection strategies gained momentum due to the weak role of energy in recognizing near-native conformations, which is reflected in Critical Assessment of protein Structure Prediction (CASP) [5] series of community wide experiments. Clustering-based methods dominate the model quality assessment (MQA) performed in CASP. Clustering-based decoy selection methods work on the notion that decoys are randomly distributed around the native structure which a consensus method ought to reveal. The clustering-based decoy selection performs better when the ensemble consists of mostly good quality decoys. However, if the sampling of decoys in the decoy generation stage is sparse, resulting in many dissimilar decoys in an ensemble, consensus methods fail to recognize exceptionally good decoys [6]. Moreover, the time complexity incurred in clustering a large decoy ensemble creates another bottleneck.
To address the above challenges in decoy selection, we propose an alternative approach that takes advantage of both consensus methods and a machine learning technique. As described in [7], the protein energy landscape reveals important statistical information regarding conformational organization and pathways. In this paper, we leverage the quantitative knowledge garnered from the energy landscape of a protein molecule in a machine learning framework to address the challenges in decoy selection. Supervised machine learning methods are gaining prominence in computational biology applications. These methods generate predictive models that learn subtle patterns from the data without making any prior assumptions [8]. One of the biggest challenges for these predictive models is to succeed even when the dataset is extremely imbalanced. Data imbalance is a common problem in computational biology and bioinformatics [9]. For instance, one of the benchmark proteins in our experiments contains only 0.005% of positive instances (near-natives) among 58,491 decoys. Even in such a sparse decoy set, the proposed method successfully identifies the near-natives. Our method works as follows: first, the method extracts local structures from the energy landscape probed through a template-free protein structure prediction method; next, a machine learning-based decoy selection method uses these local structures to select groups of good quality decoys. The method outperforms the state-of-the-art decoy selection strategies in [10].

Related work
The diverse collection of decoy selection strategies can be categorized into single-model, multi-model, quasi-single, and machine learning (ML) methods. Single-model methods predict quality on a per-decoy basis [11]; they are physics-based and/or knowledge-based. Physics-based methods employ atomic interactions such as electrostatics, van der Waals interactions, and hydrogen bonding [12][13][14], whereas knowledge-based scoring functions employ statistical analysis of known native structures [15][16][17]. Of the two, knowledge-based methods are known to be more successful in predicting high quality decoys [18,19].
Cluster-based methods work on the premise that the decoys are randomly distributed around the 'true' answer [20,21], which is not entirely valid due to the inherent bias associated with the template-free protein structure prediction methods used to generate the decoys. Apart from the huge time-complexity incurred by clustering a large decoy ensemble, the cluster-based methods often fail to identify good quality decoys (near-natives) for hard targets, which are more sparsely sampled [6]. Despite the bottlenecks, cluster-based decoy selection strategies have been the most popular methods in the decoy selection literature. Quasi-single models combine the single-model and consensus methods. First, some high quality reference structures are selected, then the remaining decoys in the ensemble are compared with the reference structures [22]. These methods are shown to perform better [5,23,24].
Recent investigations employ machine learning (ML) methods for decoy selection [25][26][27]. For instance, work in [28] uses a Support Vector Machine (SVM) together with the statistical scoring function GOAP [29] to distinguish native decoys from non-native ones. ML-based decoy selection methods are mostly single-model methods. These methods leverage structural features of proteins to assess decoy quality. Work in [30] employs non-negative matrix factorization (NMF) for selecting the best cluster of decoys and the best decoy in the decoy set, which can be further extended to large scale using the distributed implementations [31] of NMF.
Deep learning has also become a popular approach to address ML problems in bioinformatics [32]. Along with a variety of applications, such as DNA sequencing [33], enzyme function prediction [34], de-novo prediction of membrane proteins [35], protein contact map prediction [36], and protein secondary structure prediction [37], deep learning has been successfully utilized for protein decoy selection as well. For instance, a deep belief network-based protein quality estimation (decoy selection) model DeepQA outperforms SVM-based methods and achieves state-of-the-art performance on the CASP dataset [38]. Convolutional neural network-based models have also observed success in protein decoy selection [38][39][40].
In this paper, we prefer to investigate shallow models, which, unlike deep architectures, do not place such high demands on the size of the training dataset in relation to the number of parameters. As our ability to expediently generate or obtain structure data grows, deep learning will surely provide an interesting way forward that we plan to pursue in tandem with strategies to reduce the dimensionality of the loss function.
In this paper, we apply an ML technique to a multi-model method that exploits local structures extracted from an energy landscape [41]. The proposed ML-based multi-model method offers promising results in terms of higher true positives and lower false positives.

Methods
First, we elaborate on the concept of energy landscape that forms the basis of our decoy selection method.

Energy landscapes to basins
The energy landscape is an instance of a more general fitness landscape that comprises a set of points X, a neighborhood N(X) defined on X, a distance metric on X, and a fitness function f : X → ℝ≥0 that assigns a fitness to every point in X. Moreover, the points in X secure neighbors via the neighborhood function. In the context of decoy selection, the points x ∈ X represent decoy structures, and the fitness function often designates an energy function. Effectively, the energy landscape of decoy structures characterizes the mapping of structures to their internal energy and provides important quantitative information about the structure space.
A protein energy landscape features an ensemble of structural states near or far from the native state and an extensive collection of intermediate states that shape the multi-modal and multi-dimensional nature of the landscape [41]. The concept of a basin is connected to a local/focal minimum. A focal minimum in a landscape is surrounded by a basin of attraction, which is the set of points on the landscape from which steepest descent/ascent converges to that focal optimum. Barriers separate basins and regulate transitions of a system between different structural states corresponding to basins in the landscape.
Under the energy landscape treatment, the biologically-active/native state(s) can be determined by identifying the corresponding basins, which requires one to extract the underlying organization of decoys in the landscape. One approach to achieve this objective is to embed the decoys in a connectivity data structure and utilize energies to identify basins. Consider a set Ω of decoys. The decoys can be embedded in a nearest-neighbor graph (nn-graph) G = (V, E) [42]. The vertex set V is populated with the decoys, and the edge set E is populated by inferring the neighborhood structure of the landscape. The distance between two structures is measured via root-mean-squared deviation (RMSD) after each structure is superimposed over a reference structure (arbitrarily chosen to be the first in the ensemble); the superimposition minimizes differences due to rigid-body motions. Each vertex u ∈ V is connected to the vertices v ∈ V for which d(u, v) ≤ ε, where ε is a user-defined parameter. If the landscape has been sampled sparsely and in a non-uniform way, a small value of ε may produce a disconnected graph. One way to prevent this is to increase ε while controlling the density of the resulting nn-graph via the number of nearest neighbors of u.
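The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SBL implementation: the function names (build_nn_graph, is_connected, connect_by_growing_eps), the step size, and the neighbor cap max_neighbors are hypothetical, and the pairwise distance matrix is assumed to be precomputed.

```python
def build_nn_graph(dist, eps, max_neighbors=None):
    """Build an undirected threshold graph as adjacency sets.

    dist: symmetric matrix (list of lists) of pairwise distances, e.g. RMSDs.
    eps: distance threshold; u and v are linked when dist[u][v] <= eps.
    max_neighbors: optional cap on how many nearest neighbors each vertex
        proposes (hypothetical knob for controlling graph density).
    """
    n = len(dist)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        # Candidate neighbors within eps, nearest first.
        near = sorted((dist[u][v], v) for v in range(n)
                      if v != u and dist[u][v] <= eps)
        if max_neighbors is not None:
            near = near[:max_neighbors]
        for _, v in near:
            adj[u].add(v)
            adj[v].add(u)  # keep the graph undirected
    return adj


def is_connected(adj):
    """Depth-first search check that every vertex is reachable."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def connect_by_growing_eps(dist, eps, step=0.5, max_neighbors=None):
    """Increase eps until the graph is connected (or eps covers all pairs)."""
    max_d = max(max(row) for row in dist)
    while True:
        adj = build_nn_graph(dist, eps, max_neighbors)
        if is_connected(adj) or eps >= max_d:
            return eps, adj
        eps += step
```

The loop stops once eps reaches the largest pairwise distance, so a tight neighbor cap cannot cause it to run forever; in that case the returned graph may still be disconnected.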
The local minima of the landscape can be detected by analyzing the nn-graph. A vertex u ∈ V is a local minimum if f(u) ≤ f(v) for all v ∈ N(u), where N(u) denotes the neighborhood of u. The remaining vertices are then assigned to basins as follows. Each vertex u is associated with a negative gradient, estimated by selecting the edge (u, v) that maximizes the ratio [f(u) − f(v)] / d(u, v). From each vertex u that is not a local minimum, the negative gradient is followed (via the edge that maximizes the above ratio) until a local minimum is reached. Vertices that reach the same local minimum are assigned to the basin associated with that minimum.
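A minimal sketch of this basin assignment follows, assuming that the gradient estimate is the energy drop per unit distance, (f(u) − f(v)) / d(u, v), and that all distances between linked vertices are strictly positive. The name assign_basins and the data layout (adjacency sets, a distance matrix, an energy list) are illustrative choices, not the SBL interface.

```python
def assign_basins(adj, dist, energy):
    """Map every vertex to the local minimum reached by steepest descent.

    adj: dict vertex -> set of neighboring vertices (the nn-graph).
    dist: matrix of pairwise distances, indexed by vertex.
    energy: sequence of energies f(u), indexed by vertex.
    """
    def descent_step(u):
        # Neighbor with the steepest energy drop per unit distance;
        # None means no neighbor has lower energy, so u is a local minimum.
        best, best_slope = None, 0.0
        for v in adj[u]:
            slope = (energy[u] - energy[v]) / dist[u][v]
            if slope > best_slope:
                best, best_slope = v, slope
        return best

    basin_of = {}
    for u in adj:
        path = [u]
        # Walk downhill until a vertex with a known basin is reached.
        while path[-1] not in basin_of:
            nxt = descent_step(path[-1])
            if nxt is None:
                basin_of[path[-1]] = path[-1]  # a local minimum roots its own basin
                break
            path.append(nxt)
        root = basin_of[path[-1]]
        for w in path:
            basin_of[w] = root
    return basin_of
```

Because every step strictly lowers the energy, the walk cannot cycle, and caching the result for each visited vertex keeps the total work close to linear in the number of edges.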

Basin selection via basin ranking
The basins extracted from the energy landscape can be useful in decoy selection. Work in [10] shows that simple, ranking-based basin selection strategies outperform a standard clustering-based decoy selection method in terms of purity (the percentage of true positives in a selected basin, which penalizes the basin by the extent of false positives found in it). Basins can be ranked by a combination of basin characteristics. For instance, basins can be ranked by size alone (S) or by a combination of size and the energy of the focal minimum of the basin (S+E). The size of a basin is the number of decoys that belong to it. Alternatively, size and energy can be used as conflicting objectives in a multi-objective, Pareto-based selection strategy. In multi-objective optimization, solution A dominates solution B if A is better than or equal to B for all optimization objectives and strictly better than B for at least one objective. In the context of basins, the Pareto Rank (PR) of basin A is the number of basins that dominate A, and the Pareto Count (PC) of basin A is the number of basins that A dominates. Specifically, basins can be ranked by their PR, or by PR and PC (PR+PC). Empirical studies conducted in [10] demonstrate the superiority of the Pareto-based basin selection strategies over cluster-based, size-based, and energy-based decoy selection methods.
Despite good performance, ranking-based decoy selection strategies are unable to perform consistently well over all test cases regardless of their difficulty levels. Neither S+E nor PR+PC provides fair performance (fewer false positives and more true positives in the selected clusters/basins) over all or most of the test cases. One would prefer a decoy selection method that provides reasonably good performance for all or most of the test cases regardless of difficulty level or heterogeneity in structural characteristics. This is the premise of the work presented in this paper.
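The Pareto Rank and Pareto Count definitions can be made concrete with a short sketch. It assumes that a larger basin size and a lower focal energy are the two "better" directions, which the text implies but does not state explicitly; the function names and the tie-breaking rule in rank_basins (PR ascending, then PC descending) are illustrative assumptions.

```python
def dominates(a, b):
    """Return True if basin a dominates basin b.

    Each basin is a (size, focal_energy) pair. Assumption: larger size
    and lower energy are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_rank_and_count(basins):
    """PR[i] = number of basins dominating i; PC[i] = number i dominates."""
    pr = [sum(dominates(other, b) for other in basins) for b in basins]
    pc = [sum(dominates(b, other) for other in basins) for b in basins]
    return pr, pc


def rank_basins(basins):
    """Order basin indices by PR (ascending), breaking ties by PC (descending)."""
    pr, pc = pareto_rank_and_count(basins)
    return sorted(range(len(basins)), key=lambda i: (pr[i], -pc[i]))
```

The quadratic all-pairs comparison is adequate for a sketch; it would need a smarter sweep only if the number of basins became very large.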

Decoy selection via ML and ranking
Shortcomings of ranking-based basin selection strategies necessitate a new basin selection strategy. On that premise, we present a novel basin-based decoy selection method, referred to as ML-Select, that employs machine learning techniques. The method operates in two phases: the first phase captures n pure basins; while the second phase purifies the selected n basins and offers top k purified basins as output. Both the phases involve fitting a regression model and a selection approach (ranking) based on the regression results. To generalize across all possible difficulty levels of proteins, we randomly select two proteins per difficulty level (easy, medium, hard) to train the models. Therefore, the performance of our models is independent of a test case and difficulty levels. We now describe the two phases of ML-Select in further detail.

Phase 1
In this phase, ML-Select predicts the purity of basins and ranks them based on the predicted values. We use two kinds of attributes: Pareto and graph-based attributes as features to build the regression model. The Pareto-based features are PR and PC, computed from treating basin size and focal energy as two conflicting optimization objectives [10]. We assign the ranks to each basin that are calculated based on the PR and PC values associated with the given basin. Specifically, each basin is assigned two ranks based on their PR and PC values, which serve as two different features.
The graph-based feature, the number of connected components, characterizes a spatial attribute of the graphical representation of basins. The basins extracted from the nn-graph (of all the decoys in the dataset) using the Structural Bioinformatics Library (SBL) [42] are essentially bags of decoys. Estimating the spatial structure of the decoys in a specific basin is hard. Therefore, we consider the number of connected components as one of the features for ML-Select.
In order to recover the relative spatial organization of the decoys comprising a basin, we construct m different nearest-neighbor graphs using the decoys populating m different basins. We use pdist + 1Å as the distance threshold to create the nearest-neighbor graphs, where pdist refers to the average pairwise distance between the decoys of a basin. Depending on the distances between the decoys in a basin, the corresponding graph may consist of one or more connected components, which signify the structural attribute of a basin. Figure 1 shows an example graphical representation of the components in a basin. We rank the basins based on the predicted purity and pass the top n basins to the second phase for further purification.
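As an illustration of the graph-based feature, the sketch below counts connected components of a single basin when decoys are linked at a threshold of the average pairwise distance plus an offset (1Å in the text). The function name count_components and the precomputed distance matrix are assumptions made for this example.

```python
def count_components(dist, offset=1.0):
    """Count connected components among the decoys of one basin.

    dist: symmetric matrix of pairwise distances between the basin's decoys.
    offset: amount added to the average pairwise distance to obtain the
        linking threshold (1.0 mirrors the "pdist + 1 Angstrom" rule).
    """
    n = len(dist)
    if n < 2:
        return n  # zero decoys -> 0 components, one decoy -> 1 component
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    threshold = sum(pairs) / len(pairs) + offset

    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            for v in range(n):
                if v not in seen and dist[u][v] <= threshold:
                    seen.add(v)
                    stack.append(v)
    return components
```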

Phase 2
In the second phase, we predict the root-mean-squared deviation (RMSD) of a decoy from the true native. The training set of this phase uses the same proteins as in the first phase. However, the features in the second phase are different from those of the previous phase. We use twenty features, of which three are knowledge-based potentials and the remaining are energy scores from the Rosetta suite [43]. The three knowledge-based features are RW, RWplus [44], and dDFIRE [45]. RW is a distance-dependent atomic potential and RWplus is a side-chain orientation-dependent potential; the third feature is dDFIRE, which improves the DFIRE statistical potential by adding an orientation dependency. The remaining 17 features are energy terms in the REF2015 scoring function [46] in the Rosetta suite of scoring functions. The 17 Rosetta REF2015 energy terms are the Lennard-Jones attractive and repulsive terms that capture interactions between atoms in different residues, the Lazaridis-Karplus solvation energy, the intra-residue Lazaridis-Karplus solvation energy term, the asymmetric solvation energy term, the Lennard-Jones repulsive term that captures interactions between atoms in the same residue, the Coulombic electrostatic potential with a distance-dependent dielectric, the proline ring closure energy and energy of the psi angle of the preceding residue, the backbone-backbone hydrogen-bonding energy terms between atoms close and distant in the primary sequence, the sidechain-backbone and sidechain-sidechain hydrogen-bonding energy terms, the Ramachandran preferences term, the (backbone) omega dihedral term, the probability of an amino acid given torsion values for the phi and psi backbone angles, the internal energy of sidechain rotamers term (as derived from Dunbrack's statistics), and a special torsional potential term to keep the tyrosine hydroxyl in the plane of the aromatic ring.
The top n pure basins from the first phase are treated as test cases. That is, we build n regression models for the n basins that are passed to the second phase from the first phase. Each of these basins is further purified as follows. In a given basin from phase 1, if the predicted RMSD of a decoy exceeds a pre-defined threshold (dist_thresh, explained later in the implementation details), we remove that decoy from the test case basin. Effectively, the decoys that are further away from the true native are removed from the selected basins. As a result, the purity of the selected basin improves. We rank the basins based on the resulting purity after the non-native decoy elimination and offer the top k basins as the result at the end of the second phase. The purification process in this phase poses a threat of eliminating good decoys (ones near the native). We mitigate this effect with a shift in the pre-defined distance threshold, dist_thresh ± τ, where τ ∈ {10%, 20%, 25%} of the pre-defined threshold. The effect of the threshold variation on purity is discussed later in the results.
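The purification step of the second phase can be sketched as follows. The sketch assumes that a decoy is kept when its predicted RMSD is at most the (possibly shifted) cutoff, that tau is a signed fraction such as +0.2 or -0.1, and that purity is computed against known RMSDs to the native; the names purify_basin and select_top_k are hypothetical.

```python
def purify_basin(decoys, predicted_rmsd, dist_thresh, tau=0.0):
    """Drop decoys whose predicted RMSD exceeds dist_thresh * (1 + tau).

    tau is a signed fraction, e.g. +0.2 loosens and -0.1 tightens the cutoff.
    """
    cutoff = dist_thresh * (1.0 + tau)
    return [d for d in decoys if predicted_rmsd[d] <= cutoff]


def select_top_k(basins, predicted_rmsd, true_rmsd, dist_thresh, k, tau=0.0):
    """Purify each basin, then return the k basins with highest purity.

    basins: dict basin name -> list of decoy ids.
    true_rmsd: known RMSD of each decoy to the native (evaluation setting).
    """
    purified = {name: purify_basin(members, predicted_rmsd, dist_thresh, tau)
                for name, members in basins.items()}

    def purity(members):
        if not members:
            return 0.0
        hits = sum(true_rmsd[d] <= dist_thresh for d in members)
        return hits / len(members)

    ranked = sorted(purified, key=lambda name: purity(purified[name]),
                    reverse=True)
    return [(name, purified[name]) for name in ranked[:k]]
```

A positive tau trades some purity for a lower risk of discarding genuine near-natives whose RMSD was over-predicted.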

Evaluation metrics
We evaluate the performance of our approach using two metrics: percentage of true positives (n) and purity (p). At a given distance threshold dist_thresh (explained in the implementation details), n is the ratio of the number of true near-natives in the selected basins B_{1−x}, where x ∈ {1, 2, 3}, to the total number of true near-natives in the decoy ensemble. This metric resembles the sensitivity (recall or true positive rate) measure. However, even a significantly high n becomes less useful if the number of false positives in the selected basin is high, because a random draw from the selected basin then has a lower probability of yielding a true near-native. The metric p compensates for this scenario by penalizing a large basin (or a group of selected basins) containing a large number of true and false positives to the extent of the false positive population present in that basin. p is computed as the ratio of the number of true positives to the size of a basin (or a group of basins). Therefore, a basin with a large number of false positives results in a low purity regardless of the number of true positives in that basin. In essence, the purity metric resembles the precision of our method. We discuss the performance of ML-Select and four other competing methods in terms of the purity metric due to its balanced treatment of false and true positives. We select metrics that focus on true and false positives rather than on true and false negatives because we are mainly concerned with increasing the probability of selecting a true positive from the selected basins in a random draw, which can be achieved by minimizing the false positives and maximizing the true positives.
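Both metrics reduce to simple set arithmetic. The following sketch assumes that the top x basins are merged before evaluation and that a decoy counts as near-native when its RMSD to the native is at most dist_thresh; the function name evaluate_selection is illustrative.

```python
def evaluate_selection(selected_basins, true_rmsd, dist_thresh):
    """Return (n, p) as percentages for the merged top basins.

    selected_basins: list of lists of decoy ids (the top x basins).
    true_rmsd: dict decoy id -> RMSD to the known native, for the whole ensemble.
    """
    merged = set().union(*selected_basins)
    near_native = {d for d, r in true_rmsd.items() if r <= dist_thresh}
    true_pos = len(merged & near_native)
    # n: share of all near-natives that were captured (recall-like).
    n = 100.0 * true_pos / len(near_native) if near_native else 0.0
    # p: share of the selection that is near-native (precision-like).
    p = 100.0 * true_pos / len(merged) if merged else 0.0
    return n, p
```

For example, a selection of four decoys that captures two of the three near-natives in an ensemble has n ≈ 66.7% and p = 50%.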

Implementation details
We use a distance threshold of 1Å for creating the nn-graph of a decoy ensemble via SBL [42]. Since the Rosetta decoy generation protocol may produce sparse samples, a low threshold may result in a disconnected graph. To address this problem, we increase the initial threshold until the graph is connected. The minimum distance from a decoy in an ensemble to the true native is referred to as min_dist. For a protein with a known native structure, all decoys under the threshold dist_thresh are deemed near-natives. As there are three different categories of test cases, we set the dist_thresh parameter to determine the near-natives on a per-case basis. More specifically, dist_thresh is set to 2Å for the easy cases (min_dist < 1Å). For the medium cases (1Å ≤ min_dist < 2Å), dist_thresh is either 2.5Å or 3Å. For the hard cases (3Å < min_dist), we increase dist_thresh until one of the methods accumulates a non-zero number of near-natives in the top selected basins. Moreover, if a test case belongs to a particular category based on min_dist, but very few near-natives can be found according to that min_dist, we move that test case to the next difficulty level.
We use a boosting-based ensemble learning approach, XGBoost [47], to build the regression models. We use a linear regression model via XGBoost in both phase 1 and phase 2. XGBoost is a fast, scalable method that follows the principle of gradient boosting. It controls over-fitting well by producing a more regularized model formalization [48]. We calculate the knowledge-based features as follows. We calculate the RW potentials with calRW and calRWplus, executable programs from the Zhang lab [49]. The dDFIRE potential is calculated using the dDFIRE program [50]. We use 15 rounds of boosting to build our regression model. For training the regression models, we choose the top q pure basins and randomly draw q further basins (2q basins in total) from the rest of the training data.
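The selection of 2q training basins can be illustrated with the standard library alone. This is a sketch under assumptions: purity_by_basin is a hypothetical mapping from basin name to its known purity in the training data, and the seed parameter is added only to make the random draw reproducible.

```python
import random


def sample_training_basins(purity_by_basin, q, seed=None):
    """Pick the q purest basins plus q random basins from the remainder."""
    rng = random.Random(seed)
    ranked = sorted(purity_by_basin, key=purity_by_basin.get, reverse=True)
    top, rest = ranked[:q], ranked[q:]
    # Draw without replacement; fewer than q are drawn if the rest is small.
    drawn = rng.sample(rest, min(q, len(rest)))
    return top + drawn
```

Mixing the purest basins with random ones exposes the regressor to both ends of the purity range, which matters when pure basins are rare.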
We use 2 easy, 2 medium, and 2 hard proteins for training the models. For testing, we use an easy, a medium, or a hard protein that has not been used in the training dataset. To test/evaluate on a protein, we use another protein to take its place for training. Eventually, all the 18 proteins are tested and there is no overlap between the training and testing data.
To address the randomness in the training phase, we run the models on the test data 50 times and report the average p and n. We use q = 10 in this experiment. Construction of the nn-graph by SBL takes from 1 to 2 hours depending on the lengths (number of amino acids) of the proteins and the size of the decoy ensembles. Construction of the regression models takes about a minute. Once the model has been built, testing it on a new dataset with 50 runs takes about 12 seconds. Basin-Size and Basin-Size+Energy take about 20 seconds to test a new dataset. The runtimes for Pareto-Rank and Pareto-Rank+Count are 65 and 96 seconds, respectively.

Results
We experimented with eighteen proteins of different lengths and folds. These proteins constitute a benchmark dataset often used by decoy generation algorithms [51][52][53][54][55][56]. We used the Rosetta template-free (decoy generation) protocol to generate around 51,000 to 68,000 decoys per target. Table 1 presents all eighteen proteins arranged into three categories (easy, medium, and hard). The difficulty level has been determined using the minimum distance (min_dist) between the generated decoys and a known native conformation of the corresponding protein. The size of the decoy ensemble |Ω| for each target is shown in column 6. Figure 2 provides a visual comparison of the methods with respect to the quality of the selected decoys in the top three basins. We present three representative cases from the easy, medium, and hard categories. Each plot shows the decoys as two-dimensional dots, where the x-axis tracks the lRMSD of each decoy and the y-axis tracks the Rosetta REF2015 (all-atom) energy, measured in Rosetta Energy Units (REUs). Decoys in each basin are colored in maroon, gold, and navy to distinguish between the top three basins. The protein with known native structure under PDB id 1dtja, shown in the first column in Fig. 2, presents an easy case. ML-Select, shown in the top row, captures the best quality decoys (near-natives, with low lRMSD from the native) in the top three basins (p: 99.6%). All the decoys in the top three basins are within 2Å of the known native. On the other hand, the top three basins selected by the four other strategies contain decoys with larger lRMSD, which lowers the purity (to as low as 60%). For instance, Pareto-Rank captures very few decoys in the top three basins. Moreover, some of these decoys are more than 4Å away from the native.

Visualizing top basins
Although ML-Select obtains smaller basins than the existing strategies for the medium case, 1c8ca, the quality of the selected decoys is better, which results in higher purity (100%, 99%, and 89.1% for B_1, B_{1−2}, and B_{1−3}, respectively). In contrast, the larger basins selected by Basin-Size, PR, and PR+PC suffer from low purity due to the presence of numerous non-near-natives (minimum 4.9% and maximum 52.7%). Basin-Size+Energy performs fairly well in this scenario (p: 94.4% for B_{1−2}). However, its purity diminishes as more basins are added to the selection (56.2% for B_{1−3}). Evidently, a random draw is more likely to yield a near-native from the top basin (or group of basins) if ML-Select is employed to perform the selection. We also compare against the clustering-based method MUFOLD-CL [57]. Since larger clusters are considered to have tighter distributions and are typically used for near-native model selection in practice [57], we select the three largest clusters resulting from MUFOLD-CL as the top three clusters for comparison. As shown in Fig. 3, the top three clusters resulting from MUFOLD-CL are much larger; they contain near-natives as well as many non-natives. The presence of many non-natives lowers purity. For instance, for the easy protein 1dtja, despite containing 57.3% of the near-natives in the top cluster, purity is only 3%. Table 2 compares ML-Select with the four basin-based decoy selection strategies proposed in [10] on the easy, medium, and hard test cases. The comparison focuses on the p metric over B_{1−x} groups of decoys, where x varies from 1 to 3. The results with respect to the n metric and the size (s) of each B_{1−x} are also shown. Empirical evaluation conducted in [10] shows that the four existing selection methods outperform a clustering-based decoy selection strategy. Figure 4 compares the five selection strategies in terms of the p metric. The x-axis shows the test cases while the y-axis tracks the purity (p) achieved by each method.
The bold font indicates the best result among all the experimental methods.

Quantitative comparison of decoy selection strategies
The purity of the top basin is comparable across all five selection strategies (except for PR, which performs much worse than the others) for the easy cases (1dtdb, 1wapa, 1hz6a, tig, and 1dtja). However, the purity diminishes as more basins are added to the selection for the four existing selection strategies (Size, Size+Energy, PR, PR+PC). For instance, ML-Select scores more than 98% for the top 3 basins (B_{1−3}) for all the easy test cases, whereas Basin-Size achieves only 79.3% for 1wapa, Basin-Size+Energy provides only 73% purity for 1hz6a, and PR+PC achieves 0% purity for 1wapa. For the medium-difficulty cases, the purity improvements resulting from ML-Select are prominent. ML-Select outperforms the four existing selection strategies in 4 out of 6 cases for B_{1−x}, where x ∈ {1, 2, 3}. For instance, ML-Select achieves a maximum of 100% and a minimum of 83% purity for 1bq9 and 1ail, whereas the remaining four methods achieve a minimum of 0% purity and a maximum of 3% purity.
The hard cases present the most challenging decoy ensembles. Even for these challenging decoy sets, ML-Select significantly outperforms the four existing selection strategies in 5 out of 7 test cases (1hhp, 2ezk, 1aoy, 2h5nd, and 1aly) for all sizes of basin selections (i.e., B_{1−x}, x ∈ {1, 2, 3}). For the two other cases (1isua and 1cc5), ML-Select performs better for the top basin for 1isua, and for 1cc5 when x ∈ {2, 3}. For instance, for the most difficult test case, 1aly, ML-Select obtains about 42% purity whereas the four other methods fail to provide a single true positive (0% purity). Table 3 compares ML-Select with MUFOLD-CL on the easy, medium, and hard test cases. For all cases, the top three clusters are fairly large, which lowers purity. For instance, the smallest of the top clusters (on 1wapa) contains 39% of all the decoys in the decoy set of size 68,000. The near-native presence in this decoy set is only 0.005%. As a result, despite containing 39.4% of the near-natives, the top cluster has low purity because of the abundant non-natives populating it. In contrast, ML-Select is more precise; it selects basins of much smaller size that consist mostly of near-natives, resulting in much higher purity.
The top G_{1−x} groups of decoys selected from each selection strategy, with x limited to 3, are analyzed. When analyzing B_{1−x}, the top x basins are merged. The analysis lists the metrics (M): the percentage of near-native decoys (n); the purity (p), which is the proportion of near-native decoys relative to the size of a group; and the relative size (s, proportional to |Ω|) of each basin.
Figure 4 shows that ML-Select offers reasonably good performance for a variety of test cases, which is not the case with the basin-based strategies. For instance, PR performs quite well for 1c8ca and 2ci2 for B_1, but it fails for 1bq9, 1ail, and 1sap. As a result, one cannot rely on this selection strategy to achieve good purity on a new test case.
In contrast, ML-Select achieves reasonably good purity over all the test cases (except for one test case, 2ci2). Hence, ML-Select stands out as a more reliable decoy selection strategy than the four existing selection methods. Figure 5 shows that ML-Select performs much better than MUFOLD-CL in terms of the purity metric. MUFOLD-CL does provide some near-natives for the medium-difficulty protein 2ci2, on which ML-Select obtains 0% purity. However, MUFOLD-CL's purity is low as well, due to the much bigger cluster size and the scarcity of near-natives in the decoy sets. Table 4 shows the Friedman statistical tests with Hommel's post-hoc analysis [58] in predicting the purity of the basins. The statistical tests are performed on all five experimental methods on all eighteen test case proteins at α = 0.05. The first column indicates the number of basins under consideration in the prediction of purity. The second column shows the methods, while the third column presents the average rank calculated from Friedman's test [59], which rejects the null hypothesis. Upon rejection of the null hypothesis, Hommel's post-hoc analysis helps to determine the statistical significance of the new technique (ML-Select) when compared to the existing methods. The fourth and fifth columns show the p-value and Hommel's critical value, respectively. The lowest average rank identifies the best method (ML-Select), which is marked with an asterisk (*). A method is significantly different from the best method if its p-value is less than the corresponding Hommel critical value at α = 0.05; such methods are shown in boldface. Overall, for all three basin selection sizes, ML-Select is the best.
Therefore, ML-Select significantly outperforms the existing basin-based selection strategies.
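As an illustration of the evaluation described above, the following minimal Python sketch computes the group metrics (n, p, s) and the Friedman average ranks over a purity matrix. It is not the authors' implementation: the function names and the toy purity values are hypothetical, and the reading of n as the share of all near-natives captured by a group and of s as group size relative to the whole decoy set are assumptions. The statistic shown omits tie correction, and Hommel's post-hoc step is not included.

```python
from typing import Dict, List, Sequence


def group_metrics(group_flags: Sequence[bool], total_near_native: int,
                  ensemble_size: int) -> Dict[str, float]:
    """Return n, p, s (in percent) for one selected group of decoys.

    group_flags[i] is True when the i-th decoy in the group is near-native.
    Assumed readings: n = share of all near-natives captured by the group,
    p = purity of the group, s = group size relative to the decoy set.
    """
    hits = sum(group_flags)
    size = len(group_flags)
    return {
        "n": 100.0 * hits / total_near_native if total_near_native else 0.0,
        "p": 100.0 * hits / size if size else 0.0,
        "s": 100.0 * size / ensemble_size,
    }


def average_ranks(scores: List[List[float]]) -> List[float]:
    """Friedman average ranks; scores[i][j] is the purity of method j on case i.

    Rank 1 goes to the highest purity; tied methods share the mean rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0  # mean of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks: Sequence[float], n_cases: int) -> float:
    """Friedman chi-square statistic (no tie correction)."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    return 12.0 * n_cases / (k * (k + 1)) * spread


# Toy purity values (made up for illustration): 3 test cases x 3 methods.
toy_purity = [[90.0, 10.0, 20.0], [80.0, 30.0, 10.0], [70.0, 20.0, 20.0]]
toy_ranks = average_ranks(toy_purity)
print(toy_ranks, friedman_statistic(toy_ranks, len(toy_purity)))
```

The method with the lowest average rank is the one that would be marked as best; in practice, a statistics library would normally be used for the test and the post-hoc correction.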

Effect of dist_thresh on performance
We varied the dist_thresh parameter in the second phase to monitor any performance deviations in ML-Select, and summarize our findings here. The improvement in the purity of the selected basins is insignificant when we alter the pre-defined distance threshold to dist_thresh ± τ, where τ ∈ {10%, 20%, 25%}. Purity varied in 15 out of 18 test cases; however, even when dist_thresh is increased by 20%, the improvement is generally insignificant. One exception is 1bq9, for which the purity of the top 3 basins increases from 83% to 94.6% when dist_thresh is raised by 20%. For all the remaining test cases, the improvement in purity is insignificant. Overall, altering the distance threshold by these factors has an insignificant impact on purity.
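The sweep described above can be sketched as follows. This is a minimal sketch that assumes the perturbation is relative, i.e., dist_thresh × (1 ± τ); the helper names and the baseline value of 2.0 are hypothetical and do not come from the paper.

```python
from typing import Dict, Sequence


def perturbed_thresholds(dist_thresh: float,
                         taus: Sequence[float] = (0.10, 0.20, 0.25)) -> Dict[str, float]:
    """Return the baseline threshold and its relative perturbations.

    Assumption: "dist_thresh ± tau" means scaling by (1 - tau) and (1 + tau).
    """
    out = {"baseline": dist_thresh}
    for tau in taus:
        pct = round(tau * 100)
        out[f"-{pct}%"] = dist_thresh * (1.0 - tau)
        out[f"+{pct}%"] = dist_thresh * (1.0 + tau)
    return out


def purity_change(baseline_purity: float, new_purity: float) -> float:
    """Change in purity, in percentage points."""
    return new_purity - baseline_purity


# Hypothetical baseline threshold of 2.0 (units are not given in the text).
for label, value in perturbed_thresholds(2.0).items():
    print(label, value)

# The 1bq9 figures quoted above: 83% -> 94.6% for the top 3 basins.
print(purity_change(83.0, 94.6))
```

Each perturbed threshold would then be passed to the second phase in place of the default, and the resulting purity compared against the baseline run.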

Discussion
The results presented in this paper suggest that the energy landscape probed by a template-free protein structure prediction method can be leveraged for decoy selection, and that this direction warrants further investigation. In particular, energy is often ignored in favor of structural similarity in clustering-based decoy selection strategies. The work presented in this paper has demonstrated that energy, when utilized in the context of the energy landscape, can be successfully employed to identify near-native decoys from a decoy ensemble.
Results from clustering-based selection methods show that these methods fail to identify exceptionally good decoys in sparsely distributed decoy ensembles. Because near-native decoys are usually scarce and far from the rest of the decoys, a clear consensus is often unavailable, and consensus-based methods such as clustering-based selection struggle on such challenging datasets. As shown in this paper, basins in the energy landscape can improve decoy selection performance. In particular, supervised learning methods applied to basins extracted from an energy landscape can not only provide better decoy selection performance, but also prove resilient against sparsely distributed decoy ensembles.
Specifically, this paper presents a novel decoy selection method, ML-Select, that employs a supervised machine learning method to identify basins comprising mostly near-native decoys. ML-Select utilizes both energy- and graph-based characteristics of basins to successfully select near-native basins, even for challenging datasets containing only a few near-natives. The results presented in this paper also show that ML-Select provides good performance on varied test cases, irrespective of the difficulty level of the decoy ensemble.
Although ML-Select shows promise for decoy selection in template-free protein structure prediction, further investigation is warranted to address its current limitations. For instance, while ML-Select is able to provide a good-quality basin, it does not assess the quality of individual decoys in the selected basin. However, the selected basin offers an informative set from which the best decoy(s) can be identified with further ranking and investigation. Further work will concentrate on utilizing decoy characteristics to incorporate a weighting scheme for identifying the best decoy(s) from a decoy ensemble. The line of inquiry pursued in this paper demonstrates a promising direction for advancing decoy selection research.

Conclusion
We proposed a novel machine learning strategy, ML-Select, for purifying the basins generated from energy landscapes. Our experimental results indicate the utility of basins in the energy landscape probed by a template-free structure prediction method for automatic decoy selection. The model has been evaluated in terms of purity (which favors fewer false positives and more true positives) and compared against four existing basin-based decoy selection strategies that perform better than a cluster-based selection strategy. We showed that ML-Select performs significantly better than all four basin-based selection strategies. Moreover, the performance of ML-Select is highly reliable, unlike the inconsistent dominance of basin-based methods over the cluster-based method. Finally, we validate the use of machine learning techniques in decoy selection, while suggesting further research in this direction for advancing the state of decoy selection. In the future, we would like to investigate other machine learning strategies and/or heuristics (similar to [60]) that first predict the difficulty of a protein and then use an ensemble of algorithms to predict the purity of the basins for the respective class of proteins.