Skip to main content


LigandRFs: random forest ensemble to identify ligand-binding residues from sequence information alone



Protein-ligand binding is important for some proteins to perform their functions. Protein-ligand binding sites are the residues of proteins that physically bind to ligands. Despite of the recent advances in computational prediction for protein-ligand binding sites, the state-of-the-art methods search for similar, known structures of the query and predict the binding sites based on the solved structures. However, such structural information is not commonly available.


In this paper, we propose a sequence-based approach to identify protein-ligand binding residues. We propose a combination technique to reduce the effects of different sliding residue windows in the process of encoding input feature vectors. Moreover, due to the highly imbalanced samples between the ligand-binding sites and non ligand-binding sites, we construct several balanced data sets, for each of which a random forest (RF)-based classifier is trained. The ensemble of these RF classifiers forms a sequence-based protein-ligand binding site predictor.


Experimental results on CASP9 and CASP8 data sets demonstrate that our method compares favorably with the state-of-the-art protein-ligand binding site prediction methods.


Protein-ligand binding is important for some proteins to perform their functions. Protein-ligand binding sites are the residues of proteins that physically bind to ligands. A ligand is a signal triggering molecule, binding to a site on a target protein. In biochemistry, a ligand is a substance, usually a small molecule, that forms a complex always with a molecule to serve a biological purpose. For instance, oxygen is poorly soluble in aqueous solutions and cannot be perfectly carried to tissues if it is only dissolved in blood serum. However, none of the amino acid side chains in proteins is suited for the reversible binding of oxygen molecules. The function is always fulfilled by certain transition metals having a strong tendency to bind oxygen, such as iron and copper. Most commonly iron is used for the oxygen transportation. Myoglobin (PDB: 3RGK) is an iron- and oxygen-binding protein to facilitate the oxygen diffusion in muscles. It is a single polypeptide consisted of 153 or 154 amino acid residues, which is found in almost all mammals, primarily in muscle tissue. Commonly, there are several ligand categories: "metal ions" (e.g., Ca, Zn, Fe, and Mg), "inorganic anions" (e.g., SO4 and PO4), "DNA/RNA" for poly-ribonucleic acid binding sites, and "organic ligands" for cofactors, substrates, and receptor agonists/antagonists (e.g., NAD, FAD, ATP, SAM, CoA, and PLP) [1], and so on.

A number of methods applied nuclear magnetic resonance (NMR) spectroscopy [29] or X-ray [10] to determining protein structures. Such structural information is essential to determine the ligand-binding sites. Pintacuda et al. employed lanthanide ions for the determination of protein-ligand binding sites [2]. Ziarek et al. emphasized a semi-automated throughput-focused method to identify practical aspects of binding site characterization and structure determination of protein-ligand complexes, by automated and semi-automated NMR assignment methods [4]. Since experimental efforts to determine ligand-binding sites are always time-consuming, computational methods are needed that can assist in the identification of such sites.

Most computational approaches searched for similar or homologous structures of the query sequence to determine its ligand-binding sites from the homologous structures [1, 1113]. For instance, in the CASP9 competition, all top performing groups were based on the structure-based approach. Although they yielded good predictions (the average Matthews correlation coefficient of 0.62 for the top 10 performing groups), such structure-based techniques are restricted by the limited number of available protein structures. Therefore, sequence-based approaches are particulary useful when similar structural information is not available. A number of sequence-based methods have been developed to predict ligand-binding states [1416]. Passerini and co-workers introduced a method for identifying histidines (in either of two states: free or metal bound) and cysteines (in either of three states: free, metal bound, or in disulfide bridges) that participated in binding of several transition metals and iron complexes [15]. Shu et al. developed a method combining support vector machines (SVM) and homology-based predictions to predict zinc-binding sites (Cys, His, Asp and Glu) in proteins from their amino acid sequences [16]. Moreover, some sequence-based predictors attended the CASP9 competition [17].

However, the problem of whether ligand-binding sites can be predicted from sequence information remains open. Little progress has been made on the sequence-based ligand-binding site prediction. Kauffman and Karypis proposed a method that combined machine learning and homology information for the sequence-based ligand-binding site prediction [18]. However, the method did not perform well. In this paper, we propose a sequence-based approach, LigandRFs (Ligand binding site prediction by the ensemble of Random Forest classifiers), to identify protein-ligand binding residues based on the co-evolutionary context of amino acid residues. First, due to the imbalanced samples between ligand-binding sites and non-binding sites, several data sets are constructed. Each of them is composed of the binding site subset (positive subset) and part of the non-binding site subset (negative subsets). All the negative subsets are disjoint with each other. Then a random forest (RF) classifier is learned on each data set. Experiments on the CASP8 and CASP9 data sets show that the consensus of these RF classifiers can yield good prediction on ligand-binding sites.


We first analyzed the amino acid preferences for ligand binding sites and non-ligand binding sites. Figure 1 illustrates the preference comparison. It can be seen that amino acids, Asp, Gly and His, frequently occur in the ligand binding sites, while amino acids, Leu and Ala, are often regarded as non-ligand binding sites. However, it may not always be the case because our current data set is relatively small. Despite of that, Asp and His are considered as hydrophilic amino acids while Leu and Ala as hydrophobic ones in literature [19], which is partially in agreement with our statistics. In fact, hydrophilic amino acids seem to be more likely to be ligand binding sites.

Figure 1

Amino acid preferences for ligand binding sites (colored in blue) and non-ligand binding sites (colored in green). The corresponding ratio of the number of each amino acid in ligand binding to that in non-ligand binding is also illustrated here (in red). PLB and nonPLB in the figure legend denote protein-ligand binding and non-protein ligand binding, respectively.

Prediction results on CASP9

In this work, we first used CASP8 data set to train our method and then test it on CASP9 protein ligand data set, both of the two data sets involve the same definition of protein ligand binding site. Also we use sliding window to encode the input vector for each residue of the protein, which is then inputted into a classifier to determine whether or not it is a ligand binding site. By using the sliding window with length seven, Table 1 shows the performance comparison of the 15 RF classifiers and that of the ensemble, on the prediction of all ligand binding sites that was extended to include the entire biologically relevant ligand. From Table 1 it can be seen that the ensemble of the 15 RF classifiers with majority voting performs well. It yields an MCC of 0.37 and an F1-score of 35.99%, which outperforms any individual RF classifier, where the best individual prediction, the 2nd classifier, achieves an MCC of 0.35 and an F1 of 33.59%.

Table 1 Overall prediction performance of the 15 RF classifiers and that of the ensemble with different votes on the CASP9 data set.

Moreover, other sliding windows for the input encoding are conducted here. Table 2 shows prediction performance on different sliding windows and two ligand binding site groups: partial ligand sites including those only being in contact with atoms of the partial ligands, and all ligand sites including those being in contact with all atoms of the partial and the extended ligands. Among the different sliding windows in Table 2 window 7 performs the best in the case of the all ligand site group, while window 17 performs the best for the partial ligand site group. To reduce the effects of sliding window selection in encoding for input vectors, the combination technique (Eq. 1) is used and the performance is listed at the last row of the Table 2 respectively for the two binding site groups. It seems that the combination technique can reduce the effects of sliding window selection and achieves a little improvement.

Table 2 Prediction performance on different sliding windows for encoding input vectors on the CASP9 data set.

We also output the prediction performance for each target in CASP9 and the details are shown in Table 3. The final prediction performance on CASP9 data set is obtained by the average of all the targets. Our predictor yields different performances over the data set, some targets obtaining good predictions and some ones performing worse. Statistics from Table 3 protein targets bound to metal ligands perform better than those bound to non-metal ions. Experiments showed that template-based prediction methods will perform much better than de novo methods in the context [1]. However, for targets T0604 and T0629, both of which are free modeling (FM) targets, our prediction on T0629 performs much better than that on T0604. The reason is seemingly that the target T0629 is bound to metal ligands while T0604 is bound to non-metal ligands. It should be noted that the ratio of the number of binding sites to the total number of residues of the target is not a significant factor on the prediction performance of our de novo method. It can be seen that the average ratios for metal binding sites and non-metal binding sites are 3.51% and 4.86%, respectively, but our predictor on those metal binding targets performs better than those non-metal ones, achieving an improvement of 0.23 by MCC.

Table 3 Performance and information on each of the CASP9 targets.

Prediction results on CASP8

We also apply our method to evaluate on the CASP8 data set by training on the CASP9 data set. In the same way, we list the prediction performance on different sliding windows for encoding input vectors and obtain the average performance over those sliding windows. Results also show that the combination of different sliding windows yields better performance than any individual sliding window in a robust way, for there is no rule to determine the sliding window size in different data sets. In this work the best one is window 7 in CASP9 while window 5 and 27 in CASP8 (see Table 4). In Table 4 the combination technique yields an MCC of 0.44 while the best sliding window technique achieves an MCC of 0.43.

Table 4 Prediction performance on different sliding windows for encoding input vectors on the CASP8 data set.

Similar results of prediction performance on CASP8 data set can be shown in Table 5 comparing to that on the CASP9 data set. The average performance by MCC on those targets bound to metal ligands is better than that bound to non-metal ligands, where the former achieves an average MCC of 0.53 and the latter achieves an MCC of 0.32 only. In addition, although targets with metal ligands contain less number of binding sites than those with non-metal ligands, our method can identify them more accurately, except for two targets, T0410 and T0487.

Table 5 Performance and information on each of the CASP8 targets.

Comparison with other binding site prediction methods

Previous experiments showed that template-based prediction methods will perform much better than de novo methods in the context [1], but our method provides a comparative prediction on protein ligand binding sites, especially for the CASP8 data set. Figures 2 and 3 illustrate prediction comparison on CASP9 and CASP8 data sets, respectively. Although our method performs worse than most of template-based methods on the CASP9 data set, it performs better than many methods on the CASP8 data set.

Figure 2

Performance comparison of different methods on the measure of MCC for the partial binding site group and the all binding site group. The number above each bar denotes the number of protein targets the method was tested on, and the green eclipse shows our predictor. Here we list all predictors in CASP9 including those tested on even one protein target.

Figure 3

Performance comparison of different methods on the measures of MCC and Z-score. The number above each bar denotes the number of protein targets the method was tested on, and the green eclipse shows our predictor. Here we list all predictors in CASP8 including those tested on even one protein target.

It is difficult to compare our model with other similar methods for there are seldom methods of predicting ligand binding sites by using only sequence information. Most of ligand binding site prediction methods applied structural information of homologous proteins for the prediction. In CASP9, FN0193 is a predictor using SVM to identify protein binding sites. It basically used sequence profile information, the results from the disorder prediction models as well as secondary structure prediction models as additional features for the ligand-binding prediction models. Another work using sequence information was FN0132, which combined sequence information and homology-based transfer to identify protein binding sites. Our method yields an MCC of 0.40, which outperforms the two methods. Other two sequence-based methods in Table 6 performed even worse, only achieving an MCC of 0.19 for FN097 and 0.06 for FN240 in CASP9. Moreover, the random predictor is also implemented here and ran 100 times. The average performance is appended at the last row of Table 6. Results show that our method outperforms the random predictor by 36 times of the MCC score and 6 times of the F1 score.

Table 6 Performance comparison of the six methods on the CASP9 data set.

Case studies

Two targets in CASP9 were free modeling (FM) targets. The first one was T0604 (PDB: 3nlc), which is a putative FAD-dependent oxidoreductase with a bound FAD molecule. Experiments from CASP9 showed that the target was the most difficult one in the FN prediction in CASP9, with a maximum MCC of 0.56 and an average score of 0.29. Our sequence-based predictor yields an MCC of 0.08. The prediction of our method on T0604 is shown in Figure 4(a). Although the prediction is not good, our method can identify ligand binding sites partially. In addition, some wrongly predicted binding sites are around those true binding sites.

Figure 4

Illustration of protein binding site prediction for Targets T0570 (a), T0604 (b), and T0629 (c). (a) Prediction on 3nlc (T0604); (b) prediction on 3no3 (T0570); (c) prediction on 2xgf (T0629). Here the correctly predicted binding site residues are colored in red, the wrongly predicted binding sites in green, and the wrongly predicted non-binding sites in blue.

Another FM target was T0629 (PDB: 2xgf). It is formed by three chains and binds seven FE ions. Each FE ion is complexed by six histidine residues, where each two histidine residues is from one chain. For the same structures of the three chains, only prediction on chain A is illustrated in Figure 4(c). Experiments in CASP9 showed that all predictors in CASP9 can correctly identified a subset of the seven binding sites. Our method can cover all of the true binding sites, and only contain five wrongly predicted binding sites where two ones are close to true binding sites. Our method yields an MCC up to 0.85 for the target, which outperforms most of the methods in CASP9.

The last case is for the target T0570 (PDB: 3no3), which binds GOL nonmetal ligand coordinated to the MG metal ion. In target T0570, residues His30, Glu59, Glu123, Ile156, Phe158, Leu178 and Trp222 are bound to the GOL nonmetal ligand, while residues Glu59, Asp61 and Glu123 are bound to the MG metal ion. Our method can identify four binding site residues: His30, Glu59, Asp61, Glu123, some of which (His30, Glu59 and Glu123) bound to the GOL ligand and some (Glu59, Asp61 and Glu123) bound to the MG ion. Although our predictor performs worse than some structure-based methods in CASP9, it can cover half of the true binding sites and yield an MCC of 0.45 for the target T0570.


Experimental results showed that structure-based predictors yielded worse predictions on targets without local homologues [1, 12]. Target T0604 is a typical case. It yielded a maximum MCC score of 0.56 for the best prediction, and an average score of 0.29. Actually since the target has only remote homologues, its sequence profile is much sparser than other targets. The final encoding vectors for expressing the residues of the target cannot reflect the evolutionary context of binding sites. The following Table 7 shows part of sequence profile for target T0604, where most of the elements are zero. Therefore, other features such as secondary structure information as well as other physico-chemical characteristics of residues should be addressed and incorporated as input features, and thus might improve the prediction based on sequence features.

Table 7 Part of the sequence profile for target T0604.

There is no evidence to show that binding sites to metal ions are easier to be identified than that to nonmetal ligands for those structure-based methods [12, 20], although non-metal ligands are larger and more residues will bind non-metal ligands than metal ligands. However, in this work our sequence-based method yields good prediction on targets bound to metal ions, and achieves an MCC of 0.55 for the CASP9 data set and 0.53 for the CASP8 data set; while predictions on targets bound to non-metal ligands are much worse. It might be that residues bound to metal ligands are more conserved in evolutionary context than that bound to non-metal ligands, and thus the former can be identified more accurate.

Moreover, prediction performance may be changed with different sliding window sizes. To reduce the effects, we took Eq. 1 to address the issue. Results on both the CASP8 and the CASP9 data sets show the success and are shown in Tables 4 and 2. The combination of sliding windows performs better than all of the individual ones, it achieves an average MCC of 0.40 (0.37 for the best individual one) for the CASP9 data set and an average MCC of 0.44 (0.43 for the best individual one) for the CASP8 data set. Therefore the selection of the sliding window in different situations can be avoided.


This paper proposes an ensemble of RF classifiers with only sequence information to predict ligand binding sites. In order to balance the ligand site data set, several data sets are constructed and composed of the binding site subset (positive subset) and one of the non-binding site subsets (negative subsets), each of the negative subsets is independent to the others. Then each data set is inputted into a RF classifier. The ensemble of these RF classifiers can yield good prediction on ligand-binding sites. The encoding schema integrating those properties and evolutionary information of amino acids is important to obtain the evolutionary context of ligand binding site residues and thus, the method yields good performance on predicting ligand binding sites. Although structure-based methods still outperform sequence-based methods, our method provides a potentially alternative solution to the binding site prediction problem, especially when similar structure information of the query is not available.


Data sets

We took the targets in the CASP9 competition as our ligand-binding site data set. As stated in CASP9, there were 30 targets with bound ligands, out of which 10 were found in complex with metal ions (Ca, Fe, Mg, Mn, Na, and Zn), 17 were in complex with non-metal ligands (ANP, BES, COA, CSA, DST, EDO, FAD, GAL, GAR, GLA, GOL, GPX, HSA, IMD, IPR, ISC, LLP, LYS, NAD, NHS, PEG, PF1, PLP, SO4, STE, TLA), and three were in complex with hybrid ligands. Moreover, ligand-binding sites in most of the targets were located within a monomer. However, in six targets, the ligand was bound in the interface between multiple chains [1]. The assignments of ligand-binding sites were carefully checked and only the targets with unambiguous assignment were retained.

In addition, we took another data set in the CASP8 meeting to validate our method. In CASP8, there were 27 targets bound to 37 ligands. Of the ligands, there are 26 metals in 18 targets (Mg, Zn, Fe, FeS, Ni, and Ca), nine nucleotides (ADP, AMP) or derivates (FAD, SAM), one metabolite (PO4) and the last target bound a heme moiety [20].

Binding site definition

For each protein, all residues, at least one heavy atom within a given distance from any heavy atom of the ligand, were defined as ligand-binding site residues. In fact, the definition of the distance cutoff was different in literature. Kauffman and Karypis collected ligand-binding residues having at least one heavy atom within 5 Å of a ligand [18]. While in CASP9, the distance cutoff was defined as the sum of the van der Waals radii of the involved atoms plus a tolerance of 0.5 Å. Different distance cutoff leads to different ligand-binding site data set, i.e., about 9% of residues are the ligand-binding residues for the former definition, while only 3.9% for the latter. For the CASP9 definition, there are in total 355 ligand-binding residues and 8718 non-ligand binding residues of the 30 proteins. For the CASP8, there are 335 ligand-binding residues in a total of 7718 residues of the 27 protein targets. Figure 5 illustrates the binding site residues to ligands for protein PDB: 3NO3, where two ions, metal ion Mg and non-metal ion GOL, are bound to the protein.

Figure 5

LIGPLOT 38 of binding site residues to ligands for PDB ID, 3NO3. (a) the binding site residues to metal ion Mg301 (colored in cyan); (b) the structure of protein 3NO3 (GOL303 is colored in ruby and Mg301 in cyan); (c) The binding sites to non-metal ion GOL303 (colored in ruby). In CASP9 experiment, only His30, Glu59, Asp61, Glu123, Ile156, Phe158, Leu178 and Trp222 are deemed as ligand binding site residues.

Feature vector representation of a residue

In the AAindex1 database [21], there are 544 amino acid properties. Many of these properties are highly correlated. We thus extracted relatively irrelevant properties with a correlation coefficient (CC) of 0.5. For each of the 544 properties, the CC to all the other properties was calculated and the number of related properties was counted. The 544 properties were thus ranked according to their numbers of related properties. From the top property, we removed from the list all the properties that were related to it. This was repeatedly done until no related pair existed in the list, which resulted in 34 properties.

For a residue i in a protein chain, the association among the neighboring residues is considered in this work. A sliding window that contains seven residues centered at the residue i is used to encode the features. An encoding schema integrating amino acid properties and sequence profile is used to represent the residue. The sequence profile for one residue created by PSI-Blast with default parameters [22] is then multiplied by each amino acid property. For instance, the profile SPk, k = 1,..., 7 for residue k in the seven residue window and one amino acid property scale, AAP j , are both vectors with 1 × 20 dimensions. Thereafter, M S K j k = S P k × A A P j for residue k represents the multiplication of the corresponding sequence profile by the scale, where × represents the element-wise product. In our previous work, we found out that the standard deviation of MS K j k reflected the evolutionary variance of the residue k along with the amino acid property AAP j [2325].

For the residue i and the amino acid property, AAP j , therefore, it is represented by a 1 × 7 vector V ij for the case of sliding window with 7 residues. For all of the 34 amino acid properties, the residue i is represented by a 1 × (7 * 34) = 1 × 238 vector; the corresponding target value T i is 1 or 0, denoting whether the residue is a ligand-binding residue or not. Our goal is to learn the relationship between the input vectors V and the corresponding target array T.

Combine different sliding windows

In fact, each sliding window for encoding the input vector of each residue may cause deviation in investigating the relationship between the residue property profile and the ligand binding site. We use a combination technique to reduce the effects of sliding window selection. Suppose there are N predictions, Pred n , n = 1, . . ., N, resulted from N sliding windows, a new prediction can be obtained by

P r e d c o m b = P r e d ¯ - 1 N n = 1 N ( P r e d n - P r e d ¯ ) 2 , w h e r e P r e d ¯ = 1 N n = 1 N ( P r e d n ) .

The first term in the right part of Eq. 1 is the mean of predictions resulted from N sliding windows and the second one shows the standard deviation of them.

Ensemble of random forest classifiers

Machine learning techniques have played very important roles in various protein-related problems, such as B-factors prediction [26], domain identification [27], function annotation [28], membrane protein type prediction [29], and protein retrieval [30]. Here we propose to use the random forest model for the binding site prediction. A random forest [31] consists of an ensemble of simple tree predictors, each of which depends on a set of random features selected independently. It is capable of producing a response when presented by a set of predictor values. Therefore, the generalization error of a random forest depends not only on the individual trees but also significantly on the correlation between them. For the ligand-binding site prediction problem, the ensemble of simple trees votes for the most popular ligand-binding site class. Previous results showed that using consensus ideas can make significant improvement in prediction accuracy [3236].

Given a set of training data D N = {(X i , Y i )}, i = 1,..., N, let the number of training instances be N, the number of features in the classifier be J, and the number of trees to build be K. For each tree, a number of j features are considered to determine the decision of the tree, where j should be much less than J and set as 1 ≤ j ≤ int(log(J) + 1) by default. For the k − th tree, a random vector ϑ k is generated, which is independent and with the same distribution of the previous ones, ϑ1,..., ϑ k−1 . The k − th tree generated from the training set and ϑ k results in a classifier CF k (x; ϑ k ), where k = 1,..., K and x is a training instance.

After all of the trees are generated, they vote for the most popular class and thus the prediction of the whole forest is,

F X = m a j o r i t y v o t e { C F k X } k = 1 K ,

where X is a query instance.

Since the binding site data set is highly imbalanced, i.e., only 3.9% of all the instances are positive samples, balancing the positive (binding site class) and the negative (non-binding site class) data is necessary to avoid the overfitting of classifiers. Since protein chains contain different ratios of binding sites to non-binding sites, 15 data sets are thus formed, D N n , n = 1,..., 15, each of which involves roughly the same number of the positive and negative samples. That is, the 15 data sets share the same positive samples, but have disjoint negative samples. A random forest classifier is trained for each of the 15 data sets. The final prediction is the majority voting of the 15 random forests.

Evaluation criteria

In this work we adopted four evaluation measures to evaluate the performance of our method, i.e., sensitivity (Sen), precision (Prec), F-measure (F1), specificity (Spe), accuracy (ACC), and Matthews correlation coefficient (MCC) [23, 37]. They are defined as follows:

S e n = T P T P + F N , P r e c = T P T P + F P , F 1 = 2 × P r e c × S e n P r e c + S e n , S e n = T N F P + T N , A c c = T N + T P T N + F P + F N + T P , M C C = T P × T N - F P × F N ( T P + F N ) ( T P + F P ) ( T N + F P ) ( T N + F N ) ,

where TP (True Positive) is the number of correctly predicted binding sites; FP (False Positive) is the number non-binding sites that are predicted to be binding sites; TN (True Negative) is the number of correctly predicted non-binding sites; and FN (False Negative) is the number of binding sites that are predicted to be non-binding sites.

Besides, Z score is also used to evaluate the performance of our method. It can be used to reduce the effects of target difficulty on the ranking. The Z score of the predictor P for a given target T is shows as:

Z P , T = M C C P , T - M C C T ¯ σ T ,

where MCC P,T is the raw MCC score for the target T given by the predictor P, M C C T ¯ is the mean MCC score for the target T, and σ T is the standard deviation of MCC scores for the target T. The final Z score for the predictor P is the mean of Z scores over all targets.


  1. 1.

    Schmidt T, Haas J, Gallo Cassarino T, Schwede T: Assessment of ligand-binding residue predictions in CASP9. Proteins. 2011, 79 (Suppl 10): 126-136. []

  2. 2.

    Pintacuda G, John M, Su XC, Otting G: NMR structure determination of protein-ligand complexes by lanthanide labeling. Acc Chem Res. 2007, 40 (3): 206-212. 10.1021/ar050087z. []

  3. 3.

    Alipanahi B, Gao X, Karakoc E, Donaldson L, Li M: PICKY: a novel SVD-based NMR spectra peak picking method. Bioinformatics. 2009, 25 (12): i268-i275. 10.1093/bioinformatics/btp225. []

  4. 4.

    Ziarek JJ, Peterson FC, Lytle BL, Volkman BF: Binding site identification and structure determination of protein-ligand complexes by NMR a semiautomated approach. Methods Enzymol. 2011, 493: 241-275. []

  5. 5.

    Jang R, Gao X, Li M: Towards fully automated structure-based NMR resonance assignment of 15N-labeled proteins from automatically picked peaks. J Comput Biol. 2011, 18 (3): 347-363. 10.1089/cmb.2010.0251. []

  6. 6.

    Alipanahi B, Gao X, Karakoc E, Li SC, Balbach F, Feng G, Donaldson L, Li M: Error tolerant NMR backbone resonance assignment and automated structure generation. J Bioinform Comput Biol. 2011, 9: 15-41.

  7. 7.

    Jang R, Gao X, Li M: Combining automated peak tracking in SAR by NMR with structure-based backbone assignment from 15N-NOESY. BMC Bioinformatics. 2012, 13 (Suppl 3): S4-10.1186/1471-2105-13-S3-S4. []

  8. 8.

    Liu Z, Abbas A, Jing BY, Gao X: WaVPeak: picking NMR peaks through wavelet-based smoothing and volume-based filtering. Bioinformatics. 2012, 28 (7): 914-920. 10.1093/bioinformatics/bts078. []

  9. 9.

    Abbas A, Kong XB, Liu Z, Jing BY, Gao X: Automatic peak selection by a Benjamini-Hochberg-based algorithm. PLoS One. 2013, 8: e53112-10.1371/journal.pone.0053112. []

  10. 10.

    Palmer RA, Niwa H: X-ray crystallographic studies of protein-ligand interactions. Biochem Soc Trans. 2003, 31 (Pt 5): 973-979. []

  11. 11.

    Dai T, Liu Q, Gao J, Cao Z, Zhu R: A new protein-ligand binding sites prediction method based on the integration of protein sequence conservation information. BMC Bioinformatics. 2011, 12 (Suppl 14): S9-10.1186/1471-2105-12-S14-S9. []

  12. 12.

    Roche DB, Tetchner SJ, McGuffin LJ: FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins. BMC Bioinformatics. 2011, 12: 160-10.1186/1471-2105-12-160. []

  13. 13.

    Gonzalez AJ, Liao L, Wu CH: Predicting ligand binding residues and functional sites using multipositional correlations with graph theoretic clustering and kernel CCA. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (4): 992-1001. []

  14. 14.

    Andreini C, Bertini I, Rosato A: A hint to search for metalloproteins in gene banks. Bioinformatics. 2004, 20 (9): 1373-1380. 10.1093/bioinformatics/bth095. []

  15. 15.

    Passerini A, Punta M, Ceroni A, Rost B, Frasconi P: Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. Proteins. 2006, 65 (2): 305-316. 10.1002/prot.21135. []

  16. 16.

    Shu N, Zhou T, Hovmöller S: Prediction of zinc-binding sites in proteins from sequence. Bioinformatics. 2008, 24 (6): 775-782. 10.1093/bioinformatics/btm618. []

  17. 17.

    CASP9 Abstract Book, Pacific Grove, California, USA: Ninth Meeting, Critical Assessment of Techniques for Protein Structure Prediction 2010. []

  18. 18.

    Kauffman C, Karypis G: LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction. Bioinformatics. 2009, 25 (23): 3099-3107. 10.1093/bioinformatics/btp561. []

  19. 19.

    Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157: 105-132. 10.1016/0022-2836(82)90515-0.

  20. 20.

    Lopez G, Ezkurdia I, Tress ML: Assessment of ligand binding residue predictions in CASP8. Proteins. 2009, 77 (Suppl 9): 138-146. []

  21. 21.

    Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36 (Database): D202-D205. []

  22. 22.

    Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.

  23. 23.

    Chen P, Li J: Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information. BMC Bioinformatics. 2010, 11: 402-10.1186/1471-2105-11-402. []

  24. 24.

    Chen P, Wong L, Li J: Detection of outlier residues for improving interface prediction in protein heterocomplexes. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (4): 1155-1165. []

  25. 25.

    Chen P, Li J, Wong L, Kuwahara H, Huang JZ, Gao X: Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences. Proteins. 2013, 81 (8): 1351-1362. 10.1002/prot.24278. []

  26. 26.

    Chen P, Wang B, Wong HS, Huang DS: Prediction of protein B-factors using multi-class bounded SVM. Protein Pept Lett. 2007, 14 (2): 185-190. 10.2174/092986607779816078.

  27. 27.

    Chen P, Liu C, Burge L, Li J, Mohammad M, Southerland W, Gloster C, Wang B: DomSVR: domain boundary prediction with support vector regression from sequence information alone. Amino Acids. 2010, 39 (3): 713-726. 10.1007/s00726-010-0506-6. []

  28. 28.

    Messih MA, Chitale M, Bajic VB, Kihara D, Gao X: Protein domain recurrence and order can enhance prediction of protein functions. Bioinformatics. 2012, 28 (18): i444-i450. 10.1093/bioinformatics/bts398. []

  29. 29.

    Wang J, Li Y, Wang Q, You X, Man J, Wang C, Gao X: ProClusEnsem: predicting membrane protein types by fusing different modes of pseudo amino acid composition. Comput Biol Med. 2012, 42 (5): 564-574. 10.1016/j.compbiomed.2012.01.012. []

  30. 30.

    Wang J, Gao X, Wang Q, Li Y: ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics. 2012, 13 (Suppl 7): S2-10.1186/1471-2105-13-S7-S2. []

  31. 31.

    Breiman L: Random forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324. []

  32. 32.

    Gao X, Bu D, Xu J, Li M: Improving consensus contact prediction via server correlation reduction. BMC Struct Biol. 2009, 9: 28-10.1186/1472-6807-9-28. []

  33. 33.

    Chen P, Li J: Prediction of protein long-range contacts using an ensemble of genetic algorithm classifiers with sequence profile centers. BMC Struct Biol. 2010, 10 (Suppl 1): S2-10.1186/1472-6807-10-S1-S2. []

  34. 34.

    Wang B, Chen P, Wang P, Zhao G, Zhang X: Radial basis function neural network ensemble for predicting protein-protein interaction sites in heterocomplexes. Protein Pept Lett. 2010, 17 (9): 1111-1116. 10.2174/092986610791760397.

  35. 35.

    Ballester PJ, Mitchell JBO: A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics. 2010, 26 (9): 1169-1175. 10.1093/bioinformatics/btq112. []

  36. 36.

    Qiu Z, Wang X: Improved prediction of protein ligand-binding sites using random forests. Protein PeptLett. 2011, 18 (12): 1212-1218.

  37. 37.

    Wang B, Chen P, Huang DS, Li Jj, Lok TM, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. 2006, 580 (2): 380-384. 10.1016/j.febslet.2005.11.081. []

  38. 38.

    Wallace AC, Laskowski RA, Thornton JM: LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. Protein Eng. 1995, 8 (2): 127-134. 10.1093/protein/8.2.127.

Download references


This work was supported Award Numbers KUS-CI-016-04 and GRP-CF-2011-19-P-Gao-Huang, made by King Abdullah University of Science and Technology (KAUST). This work was also supported by the National Natural Science Foundation of China (Nos. 61300058, 61374181 and 61472282). Publication charges for this article have been funded by the Award Numbers KUS-CI-016-04 and GRP-CF-2011-19-P-Gao-Huang, made by King Abdullah University of Science and Technology (KAUST).

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 15, 2014: Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013). The full contents of the supplement are available online at

Author information

Correspondence to Xin Gao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PC and XG conceived the study; PC participated in the experimental design; PC, JH and XG carried it out and drafted the manuscript. All authors revised the manuscript critically.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chen, P., Huang, J.Z. & Gao, X. LigandRFs: random forest ensemble to identify ligand-binding residues from sequence information alone. BMC Bioinformatics 15, S4 (2014).

Download citation


  • Random Forest
  • Site Prediction
  • Matthews Correlation Coefficient
  • Random Forest Classifier
  • Sequence Profile