Predicting RNA-binding sites of proteins using support vector machines and evolutionary information
- Cheng-Wei Cheng†1, 4,
- Emily Chia-Yu Su†2, 3, 4,
- Jenn-Kang Hwang2,
- Ting-Yi Sung4 and
- Wen-Lian Hsu1, 4
© Cheng et al; licensee BioMed Central Ltd. 2008
- Published: 12 December 2008
RNA-protein interaction plays an essential role in several biological processes, such as protein synthesis, gene expression, posttranscriptional regulation and viral infectivity. Identification of RNA-binding sites in proteins provides valuable insights for biologists. However, experimental determination of RNA-protein interaction remains time-consuming and labor-intensive. Thus, computational approaches for prediction of RNA-binding sites in proteins have become highly desirable. Extensive studies of RNA-binding site prediction have led to the development of several methods. However, these methods tend to achieve high specificity at the cost of low sensitivity.
We propose a method, RNAProB, which incorporates a new smoothed position-specific scoring matrix (PSSM) encoding scheme with a support vector machine model to predict RNA-binding sites in proteins. Besides incorporating evolutionary information from standard PSSM profiles, the proposed smoothed PSSM encoding scheme also considers the correlation and dependency from the neighboring residues of each amino acid in a protein. Experimental results show that smoothed PSSM encoding significantly enhances the prediction performance, especially for sensitivity. Using five-fold cross-validation, our method performs better than the state-of-the-art systems by 4.90%~6.83%, 0.88%~5.33%, and 0.10~0.23 in terms of overall accuracy, specificity, and Matthews correlation coefficient, respectively. Most notably, compared to other approaches, RNAProB significantly improves sensitivity by 7.0%~26.9% over the benchmark data sets. To prevent overfitting, a three-way data split procedure is incorporated to estimate the prediction performance. Moreover, physicochemical properties and amino acid preferences of RNA-binding proteins are examined and analyzed.
Our results demonstrate that the smoothed PSSM encoding scheme significantly enhances the performance of RNA-binding site prediction in proteins. This also supports our assumption that smoothed PSSM encoding can better resolve the ambiguity in discriminating between interacting and non-interacting residues by modelling the dependency on surrounding residues. The proposed method can be used in other research areas, such as DNA-binding site prediction, protein-protein interaction, and prediction of posttranslational modification sites.
- Support Vector Machine
- Support Vector Machine Classifier
- Slide Window Size
- PSSM Profile
- Window Size Selection
RNA-protein interaction plays an important role in various biological processes, such as protein synthesis, gene expression, posttranscriptional regulation, and viral infectivity. The prediction results of RNA-binding sites in proteins can provide biological insights for investigating RNA-protein interaction. For instance, the ribosome is a protein synthesis complex consisting of ribosomal RNAs (rRNAs) and proteins. Sunita et al. applied predicted RNA-binding sites to study the relationship between the RNA methyltransferase RsmC and 16S rRNA. In addition, Bechara et al. incorporated predicted results from an RNA-binding site predictor to inspect the connection between fragile X mental retardation protein and G-quartet RNA structure. Moreover, some RNA viruses, such as human immunodeficiency virus (HIV) and hepatitis C virus, have an RNA genome and replicate themselves by interacting with host proteins. Therefore, identification of the RNA interacting residues in proteins provides valuable information for understanding the mechanisms of protein synthesis, gene regulation, and pathogen-host interaction.
In recent years, rapid advances in genomic and proteomic studies have yielded a tremendous amount of DNA and protein sequences. We used the keyword "RNA-binding" to search against the National Center for Biotechnology Information (NCBI) protein sequence database on June 9, 2008, and obtained 196,686 protein sequences. However, when searching against Protein Data Bank (PDB)  for molecular/chain type containing protein and RNA, we only retrieved 684 structures. In addition, experimental determination of RNA-protein interaction remains time-consuming and labor-intensive. Therefore, computational approaches for predicting RNA-binding sites in proteins have become increasingly important to understand the mechanisms of RNA-protein interaction.
Extensive studies of RNA-protein binding site prediction have led to the development of several methods, which can be classified as follows.
1. Amino acid composition-based methods
Jeong et al. used an artificial neural network (ANN) to predict RNA-protein interacting residues based on amino acid compositions and predicted secondary structure elements. Their method achieved a Matthews correlation coefficient (MCC) of 0.29 and an overall accuracy of 77.50%, along with a specificity of 87.29% and a sensitivity of 40.30%. Terribilini et al. presented RNABindR, which uses a Naïve Bayes classifier on amino acid sequences to predict RNA-binding sites in proteins. RNABindR attained MCC, overall accuracy, specificity, and sensitivity of 0.35, 84.80%, 93% and 38%, respectively.
2. Evolutionary information-based methods
Jeong and Miyano applied an ANN to predict the RNA interacting residues based on evolutionary information from the position-specific scoring matrix (PSSM), and achieved MCC, overall accuracy, specificity, and sensitivity of 0.39, 80.20%, 91.04%, and 43.40%, respectively. The MCC was further improved to 0.41 by the incorporation of weighted profiles. Kumar et al. proposed a predictor, PPRint, using PSSM profiles in a support vector machine (SVM) model, and it achieved MCC, overall accuracy, specificity, and sensitivity of 0.45, 81.16%, 89.55%, and 53.05%, respectively.
3. Hybrid methods
Wang and Brown  developed an SVM-based classifier, BindN, using features including relative solvent accessible surface area, hydrophobicity index, side chain pKa value, molecular mass, and BLAST results. The overall accuracy, specificity, and sensitivity of BindN are 74.25%, 75.70%, and 65.78%, respectively.
Although many methods have been proposed for RNA-binding site prediction, several challenges remain. First, many previous methods yield low sensitivity in exchange for high specificity, since some biological applications, such as identification of critical residues for site-specific mutagenesis, place more emphasis on specificity than on sensitivity [6, 8]. These methods could suffer from low coverage of RNA-binding sites in high-throughput proteomic analyses. Second, the MCC values of existing methods remain in the range of 0.27~0.45, which leaves considerable room for improvement in this complementary measure of prediction performance. Finally, in most methods, parameters such as the size of the sliding window are selected from test results evaluated by n-fold cross-validation, which may lead to overestimation of the prediction performance. Thus, the reported performance would likely be lower if a more rigorous procedure were applied for parameter selection and performance evaluation.
Our method and future applications
In this study, we propose a method, RNAProB (RNA-Protein Binding site prediction), for prediction of RNA-binding residues in proteins using SVM classifiers and a new smoothed PSSM encoding scheme. Besides incorporating upstream and downstream residues in a standard PSSM generated by PSI-BLAST, smoothed PSSM encoding also considers, for each amino acid in a sequence, the dependency effect from its neighboring amino acids. Similar to the spatial domain method used in image processing, smoothed PSSM encoding calculates the evolutionary information of a central position as the sum of the values from its surrounding residues. Experimental results show that smoothed PSSM encoding performs better than the state-of-the-art approaches on the benchmark data sets. Evaluated by five-fold cross-validation, RNAProB outperforms the other approaches by 0.10~0.23 in MCC, 4.90%~6.83% in overall accuracy, and 0.88%~5.33% in specificity. Most notably, our method significantly improves sensitivity by 26.90%, 26.62%, and 7.05% for the RBP86, RBP109, and RBP107 data sets, respectively. To avoid overfitting, we also incorporate a three-way data split procedure to evaluate the prediction performance of RNAProB. Our results show that our method not only achieves a significant improvement in performance, but also attains a high prediction accuracy when evaluated by a three-way data split procedure. Moreover, our analysis indicates that smoothed PSSM could serve as a more discriminative feature for distinguishing between interacting and non-interacting residues. We believe that the proposed encoding scheme could be applicable to other research fields, such as DNA-binding sites, protein-protein interaction, and prediction of posttranslational modification sites.
[Table: Summary of three benchmark data sets. Rows: number of protein chains, X-ray crystallography resolution, number of interacting residues, number of non-interacting residues, total number of residues.]
The RBP86 data set consists of 86 protein chains extracted from RNA-protein complexes with X-ray crystallography resolution better than 3 Å in PDB. Sequence redundancy in the data set is removed so that no protein pair has a sequence identity greater than 70%. In the RNA-protein complexes, a residue is regarded as interacting with RNA if the distance between an RNA molecule and the residue in the protein is less than 6 Å. The resultant data set contains 4,568 RNA interacting residues and 15,503 non-interacting residues. The RBP86 data set has been used in Terribilini et al.  and Kumar et al. . In Kumar et al., it is also referred to as the "main" data set.
The RBP109 data set contains 109 protein sequences obtained from 56 RNA-protein complexes with X-ray crystallography resolution better than 3.5 Å in PDB. For any two protein chains, the sequence identity is no more than 30%. The numbers of interacting and non-interacting residues are 3,581 and 21,526, respectively. The RBP109 data set was downloaded from the RNABindR web server. In Terribilini et al., it is named the "RB109" data set.
Derived from 61 RNA-protein complexes in PDB, the RBP107 data set comprises 107 protein chains with X-ray crystallography resolution better than 3.5 Å and sequence identity no more than 25%. Based on a cut-off distance of 3.5 Å, the RBP107 data set contains 2,555 interacting residues and 19,496 non-interacting ones. Wang and Brown applied this data set to construct and evaluate their approach. In Kumar et al., it is referred to as the "alternate" data set.
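The distance-based labelling used to define interacting residues in the data sets above can be illustrated with a short sketch. It assumes that the distance between a residue and an RNA molecule is taken as the minimum distance between any pair of their atoms; the source does not state which atoms are compared, so this choice, the function name `is_interacting`, and the coordinate layout are assumptions made for illustration only.

```python
import math


def is_interacting(residue_atoms, rna_atoms, cutoff):
    """Return True if any residue atom lies closer than `cutoff` (in Å)
    to any RNA atom.

    residue_atoms, rna_atoms: lists of (x, y, z) coordinates in Å.
    Using the minimum atom-to-atom distance is an assumption; the text
    only gives the cut-off values (6 Å for RBP86, 3.5 Å for RBP107).
    """
    return any(
        math.dist(atom, rna_atom) < cutoff
        for atom in residue_atoms
        for rna_atom in rna_atoms
    )


# Hypothetical coordinates: one residue atom 5 Å away from one RNA atom.
residue = [(0.0, 0.0, 0.0)]
rna = [(0.0, 0.0, 5.0)]
labels = {
    "cutoff_6.0": is_interacting(residue, rna, 6.0),
    "cutoff_3.5": is_interacting(residue, rna, 3.5),
}
```

With these toy coordinates the same residue is labelled as interacting under the 6 Å cut-off but not under the 3.5 Å cut-off, which is why the proportion of interacting residues differs between the data sets.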
Support vector machines (SVM)
where w ∈ ℝ^d is a weight vector, b is a bias (constant), and Φ is a mapping function. For more flexible classification, SVM allows an instance i to lie on the wrong side of the hyperplane, controlled by a slack variable ξ_i and a cost parameter C. In SVM, a kernel function K(x_i, x_j), such as a linear, polynomial, radial basis function (RBF), or sigmoid function, is used to represent Φ(x_i)·Φ(x_j), where x_i and x_j are two data vectors. In this study, we use RBF as the kernel function in the SVM. The RBF is defined in Equation (2), where γ is a training parameter.
K(x_i, x_j) = exp(-γ‖x_i - x_j‖^2) (2)
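As a small illustration of Equation (2), the RBF kernel can be computed directly from two feature vectors. This is a minimal sketch in plain Python, not the LIBSVM implementation; the function name `rbf_kernel` is chosen here for illustration.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel of Equation (2): exp(-gamma * ||x_i - x_j||^2)."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum kernel value of 1.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
# Vectors at squared distance 2 with gamma = 0.5 give exp(-1).
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger γ makes the kernel value fall off faster with distance, which is why γ is tuned together with the cost parameter C.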
Developed by Lin et al. , LIBSVM is a powerful and well-known SVM package used by many researchers. We apply LIBSVM to implement our classifiers for prediction of RNA-binding sites in proteins.
Feature extraction and representation
Evolutionary information has been shown to be effective for RNA-binding site prediction. For this reason, we use PSI-BLAST to search against the NCBI non-redundant (nr) database and generate a PSSM based on the BLOSUM62 substitution matrix for each protein, with an e-value of 0.001 and three iterations. A PSSM is comprised of L vectors (L denotes the length of the protein), each of which contains the log-likelihoods of the different amino acids at that position. Next, we illustrate two different encoding schemes to represent the PSSM.
1. Standard PSSM encoding scheme
2. Smoothed PSSM encoding scheme
In addition to the consideration of neighbors of a residue α_i, we propose a new encoding scheme to incorporate the dependency of surrounding residues. In a standard PSSM profile, the log-likelihood at each position is calculated based on the assumption that each position is independent of the others. However, Terribilini et al. observed that RNA-binding residues tend to occur in clusters. Their analysis revealed that 95% of interacting residues in the RBP109 data set have at least one additional interacting residue among the four amino acids on either side, and 49% of those have at least four. Inspired by the consideration of adjacent pixels in the spatial domain method from image processing, we present a new encoding scheme to model the dependency or correlation among the surrounding neighbors of a central residue. Similar to the feature representation in standard PSSM encoding, we use a sliding window of size w to incorporate the evolutionary information from upstream and downstream residues. In the construction of a smoothed PSSM, each row vector of a residue α_i is represented and smoothed by the summation of ws surrounding row vectors (V_smoothed_i = V_{i-(ws-1)/2} + ... + V_i + ... + V_{i+(ws-1)/2}). For the N-terminal and C-terminal of a protein, (w-1)/2 zero vectors are appended to the head or tail of a smoothed PSSM profile. Using the smoothed PSSM encoding scheme, the feature vector of a residue α_i is represented by (V_smoothed_{i-(w-1)/2}, ..., V_smoothed_i, ..., V_smoothed_{i+(w-1)/2}). The feature values in each vector are normalized to a range between -1 and 1. Here, we apply different smoothing window sizes from 3 to 11 with a step of 2 (i.e., ws = 3, 5, ..., 11). Figure 1(B) illustrates an example of a smoothed PSSM profile. At position 9, the corresponding value of amino acid 'A' represented by a smoothed PSSM encoding is the sum (-2)+(-2)+(-3)+5+(-2)+3+0 = -1.
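The smoothing step described above can be sketched as a windowed row sum over the PSSM. The sketch below assumes that rows falling outside the protein contribute nothing to the sum (equivalent to zero padding); the source describes zero vectors for the sliding window at the termini but does not spell out the boundary rule for smoothing itself, so that choice and the function name `smooth_pssm` are assumptions.

```python
def smooth_pssm(pssm, ws):
    """Sum each PSSM row with its (ws - 1) / 2 neighbours on either side.

    pssm: list of rows, one per residue, each a list of scores.
    ws:   odd smoothing window size (3, 5, ..., 11 in the text).
    Rows outside the sequence are treated as zero vectors (assumption).
    """
    if ws < 1 or ws % 2 == 0:
        raise ValueError("ws must be a positive odd number")
    half = (ws - 1) // 2
    n_rows = len(pssm)
    n_cols = len(pssm[0]) if pssm else 0
    smoothed = []
    for i in range(n_rows):
        row = [0] * n_cols
        for k in range(max(0, i - half), min(n_rows, i + half + 1)):
            for c in range(n_cols):
                row[c] += pssm[k][c]
        smoothed.append(row)
    return smoothed


# One-column toy profile holding the seven values from the worked example
# in the text for amino acid 'A'; the central row is the residue of interest.
column_a = [[-2], [-2], [-3], [5], [-2], [3], [0]]
smoothed = smooth_pssm(column_a, ws=7)
central_value = smoothed[3][0]  # (-2)+(-2)+(-3)+5+(-2)+3+0
```

Here the central value evaluates to -1, the sum of the seven terms quoted in the text, while rows near the ends of the toy profile sum over fewer real rows because of the zero-padding assumption.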
Window size selection and parameter optimization
The workflow of window size selection and parameter optimization.
1. Sliding window size (w): 3 ≤ w ≤ 41 (step = 2).
2. Sliding window size (w): optimized w from step 1. C and γ: -3 ≤ log2C ≤ 12 (step = 1); log2γ from -3 down to -15 (step = -1).
3. Sliding window size (w): optimized w from step 1. C and γ: optimized C and γ from step 2. Smoothing window size (ws): 3 ≤ ws ≤ 11 (step = 2).
4. Sliding window size (w): optimized w from step 1. C and γ: optimized C and γ from step 2. Smoothing window size (ws): optimized ws from step 3. Weight parameter (w1 and w-1): 1 ≤ w1 ≤ 8# (step = 1), w-1 = 1.
5. Sliding window size (w): optimized w from step 1. C and γ: optimized C and γ from step 2. Smoothing window size (ws): optimized ws from step 3. Weight parameter (w1 and w-1): optimized w1 and w-1 from step 4.
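The staged workflow above optimizes one group of parameters at a time while keeping the previously selected values fixed. The following sketch shows that control flow with the search ranges listed in the workflow. The `evaluate` callback (which would train and score a classifier for a given parameter setting), the starting defaults, and the dictionary keys are hypothetical and not taken from the source.

```python
from itertools import product


def staged_search(evaluate, defaults):
    """Optimize w, then (C, gamma), then ws, then w1, one stage at a time.

    evaluate: callable taking a parameter dict and returning a score to
              maximize (hypothetical; e.g. a validation MCC).
    defaults: starting values for every parameter (assumption; the source
              does not list the values used before each stage is tuned).
    """
    stages = [
        (("w",), [(w,) for w in range(3, 42, 2)]),            # 3..41, step 2
        (("log2C", "log2gamma"),
         list(product(range(-3, 13), range(-3, -16, -1)))),   # -3..12, -3..-15
        (("ws",), [(ws,) for ws in range(3, 12, 2)]),         # 3..11, step 2
        (("w1",), [(w1,) for w1 in range(1, 9)]),             # 1..8, w-1 fixed
    ]
    best = dict(defaults)
    for names, candidates in stages:
        def score(candidate, names=names):
            trial = {**best, **dict(zip(names, candidate))}
            return evaluate(trial)
        winner = max(candidates, key=score)
        best.update(zip(names, winner))
    return best


# Toy objective with a known optimum, used only to exercise the control flow.
def toy_score(p):
    return -(abs(p["w"] - 11) + abs(p["log2C"] - 5) + abs(p["log2gamma"] + 7)
             + abs(p["ws"] - 7) + abs(p["w1"] - 3))


start = {"w": 3, "log2C": 0, "log2gamma": -3, "ws": 3, "w1": 1, "w_minus1": 1}
selected = staged_search(toy_score, start)
```

A staged search of this kind evaluates far fewer settings than a full grid over all parameters, at the cost of possibly missing interactions between parameters tuned in different stages.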
Apply PSI-BLAST to generate a standard PSSM of the protein.
Generate a smoothed PSSM of the protein using an optimized smoothing window size.
Construct a feature vector for each residue in the protein sequence by an optimized sliding window size, and normalize all feature values in the vector into a range of -1 and 1.
Use a trained SVM classifier with optimized parameters (C, γ, w1, w-1) to predict the interacting and non-interacting residues in the protein.
After the above steps, RNAProB outputs the corresponding interacting or non-interacting state of each residue in the protein.
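The feature construction step in the procedure above can be sketched as follows. The window extraction pads with zero vectors beyond the termini, as described for the encoding scheme. The normalization shown is a linear min-max scaling into [-1, 1]; the source does not specify how the scaling is done, so that function, its bounds, and both function names are assumptions.

```python
def window_features(profile, i, w):
    """Concatenate the w profile rows centred on residue i.

    profile: list of rows (standard or smoothed PSSM), one per residue.
    w:       odd sliding window size.
    Positions beyond either terminus are filled with zero vectors.
    """
    half = (w - 1) // 2
    n_cols = len(profile[0])
    features = []
    for k in range(i - half, i + half + 1):
        if 0 <= k < len(profile):
            features.extend(profile[k])
        else:
            features.extend([0] * n_cols)
    return features


def scale_to_unit_range(values, lowest, highest):
    """Linearly map values from [lowest, highest] to [-1, 1] (assumed scheme)."""
    if highest == lowest:
        return [0.0 for _ in values]
    return [2 * (v - lowest) / (highest - lowest) - 1 for v in values]


# Toy three-residue profile with two columns instead of twenty.
profile = [[1, 2], [3, 4], [5, 6]]
first_residue = window_features(profile, 0, 3)    # zero vector, then rows 0 and 1
middle_residue = window_features(profile, 1, 3)   # rows 0, 1 and 2
scaled = scale_to_unit_range([0, 5, 10], 0, 10)
```

With a 20-column PSSM and a sliding window of size w, each residue is therefore described by 20 × w feature values before being passed to the classifier.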
Training and testing
The performance of RNAProB is assessed by n-fold cross-validation and three-way data split. To compare with other approaches, we use five-fold cross-validation to evaluate the performance of RNAProB. However, to prevent data-overfitting, a three-way data split procedure is applied to assess our predictor. The performance of RNAProB is evaluated as follows.
1. n-fold cross-validation
A data set is randomly divided into five distinct non-overlapping sets of positive and negative instances (i.e., n = 5), four of which are used to train the predictor and the accuracy of the predictor is evaluated on the remaining set. This procedure is repeated five times.
2. Three-way data split
To avoid overfitting, we use a more stringent three-way data split procedure [16, 17] to evaluate the performance of RNAProB. A data set is randomly partitioned into three non-overlapping sets: a training set for classifier learning, a validation set for parameter selection, and a test set for performance evaluation. In this paper, we divide a data set into five distinct sets: three for training, one for validation, and one for testing. The procedure is also iterated five times.
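The two evaluation procedures differ only in how the five sets are assigned roles. The sketch below generates index splits for both. It assumes a simple rotation for the three-way split (fold k for testing, the next fold for validation, the remaining three for training); the source only states that the procedure is iterated five times, so the rotation, the unstratified shuffling, and all names are assumptions.

```python
import random


def make_folds(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def cross_validation_splits(folds):
    """Yield (train, test): each fold is the test set exactly once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def three_way_splits(folds):
    """Yield (train, validation, test) with rotating roles (assumed rotation)."""
    n = len(folds)
    for k in range(n):
        v = (k + 1) % n
        train = [i for j, fold in enumerate(folds) if j not in (k, v) for i in fold]
        yield train, folds[v], folds[k]


folds = make_folds(20)
cv = list(cross_validation_splits(folds))
three_way = list(three_way_splits(folds))
```

In the three-way procedure, parameters would be chosen on the validation fold only, so the test fold never influences parameter selection.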
Performance evaluation measures
For comparison with other approaches, we follow the measures used in previous work [8, 9, 18], including specificity (Spec), sensitivity (Sens), MCC, and overall accuracy (Acc). Specificity and sensitivity measure how well the binary classifier recognizes negative and positive cases, respectively. A specificity of 100% and a sensitivity of 100% imply that the classifier identifies all non-interacting residues as non-interacting and all interacting residues as interacting, respectively. When a predictor's specificity increases, its sensitivity often decreases. On the other hand, MCC, which considers both under- and over-predictions, gives a complementary measure of the prediction performance, where MCC = 1 denotes a perfect prediction, MCC = 0 indicates a completely random assignment, and MCC = -1 means a perfectly reverse correlation. Moreover, overall accuracy is the proportion of residues that are correctly classified as true positives or true negatives, and 100% overall accuracy denotes a perfect prediction. Specificity, sensitivity, MCC, and overall accuracy are defined in Equations (3), (4), (5), and (6), respectively. In the equations, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Specificity = TN/(TN + FP) × 100 (3)
Sensitivity = TP/(TP + FN) × 100 (4)
MCC = (TP × TN - FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (5)
Acc = (TP + TN)/(TP + TN + FP + FN) × 100 (6)
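The four measures can be computed from the confusion counts as in the sketch below. MCC follows the standard Matthews definition; returning 0 when a denominator is zero is a convention assumed here, and the function name `binding_site_metrics` is illustrative.

```python
import math


def binding_site_metrics(tp, tn, fp, fn):
    """Specificity, sensitivity and accuracy in percent, plus MCC."""
    def percent(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "specificity": percent(tn, tn + fp),
        "sensitivity": percent(tp, tp + fn),
        "accuracy": percent(tp + tn, tp + tn + fp + fn),
        "mcc": mcc,
    }


# Toy counts: 30 binding residues found, 10 missed, no false alarms.
metrics = binding_site_metrics(tp=30, tn=60, fp=0, fn=10)
```

In this toy case specificity is 100% while sensitivity is only 75%, which illustrates why the text reports both measures alongside MCC instead of accuracy alone.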
In addition to the above measures, we also use the receiver operating characteristic (ROC) curve  and area under the ROC curve (AUC)  to evaluate the performance of standard and smoothed PSSM encoding schemes. In an ROC curve plot, the X-axis represents false positive rate (i.e., 1-specificity) and Y-axis denotes true positive rate (i.e., sensitivity). We incorporate different thresholds in the SVM classifier to plot the true positive rates against false positive rates in an ROC curve. Moreover, AUC calculates the area under an ROC curve and the maximum value of AUC is 1, which denotes a perfect prediction. A random guess results in an AUC value close to 0.5.
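An ROC curve of the kind described above can be traced by sweeping a threshold over the classifier's decision scores, and the AUC can then be estimated with the trapezoidal rule. The sketch below assumes labels coded as 1 (interacting) and 0 (non-interacting) and real-valued decision scores; these conventions and the function names are illustrative, not taken from the source.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    The threshold is lowered through the distinct score values; tied
    scores are processed together so they produce a single point.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


perfect = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
uninformative = area_under_curve(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The three toy cases give an AUC of 1 for a perfect ranking, 0.75 when one negative outranks one positive, and 0.5 when the scores carry no information.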
To determine the thresholds in the SVM classifiers, we follow the criteria used in the previous work. We notice that the thresholds in other approaches are optimized with respect to different measures. For example, Kumar et al.  and Jeong and Miyano  both optimized their results in the RBP86 data set based on MCC. In addition, Terribilini et al.  also selected the thresholds with the best MCC for the RBP109 data set. On the other hand, Wang and Brown  determined the best thresholds in the RBP107 data set based on the average of specificity and sensitivity. Therefore, the thresholds in RNAProB are optimized with respect to MCC for the RBP86 and RBP109 data sets, while the threshold is determined by the average of sensitivity and specificity for the RBP107 data set.
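The two threshold criteria mentioned above (best MCC, or best average of sensitivity and specificity) can be expressed as interchangeable scoring functions over the same threshold sweep. This is a minimal sketch: it assumes labels coded as 1 and 0, candidate thresholds taken from the observed scores, and a residue predicted as interacting when its score is at or above the threshold. None of these details are specified in the source.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, FN when scores >= threshold are called positive."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        if score >= threshold:
            if label == 1:
                tp += 1
            else:
                fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def mean_sens_spec(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def best_threshold(scores, labels, criterion):
    """Pick the observed score that maximizes the given criterion."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: criterion(*confusion_counts(scores, labels, t)))


scores = [0.1, 0.2, 0.7, 0.9]
labels = [0, 0, 1, 1]
by_mcc = best_threshold(scores, labels, mcc)
by_average = best_threshold(scores, labels, mean_sens_spec)
```

On this separable toy example both criteria select the same threshold; on real, imbalanced data they can differ, which is why matching the criterion used by each compared method matters for a fair comparison.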
Effect of smoothed PSSM encoding scheme
Performance comparison of standard PSSM and smoothed PSSM.
Experimental results demonstrate that our proposed smoothed PSSM encoding scheme not only achieves good prediction performance, but also yields a significant improvement over standard PSSM encoding. Smoothed PSSM encoding scheme outperforms standard PSSM by 2.32%~4.60% in overall accuracy and 0.06~0.178 in MCC. The consideration of dependency among neighboring residues works well in distinguishing interacting residues from non-interacting ones; accordingly, the prediction performance of smoothed PSSM encoding scheme is substantially improved. This supports our assumption that the incorporation of the correlation between surrounding residues in PSSM profiles can significantly enhance the performance of RNA-binding site prediction.
RNAProB prediction performance on the benchmark data sets
[Table: Performance of five-fold cross-validation and three-way data split for the benchmark data sets.]
1. Performance comparison with other approaches on the RBP86 data set
Performance comparison of different approaches using five-fold cross-validation for the benchmark data sets.
2. Performance comparison with RNABindR on the RBP109 data set
Table 5 illustrates the performance comparison with RNABindR [6, 11], a Naïve Bayes-based method developed on the same data set. Using five-fold cross-validation, RNAProB achieves 0.58, 89.70%, 93.88%, and 64.62% in MCC, overall accuracy, specificity, and sensitivity, respectively, which compares favourably with 0.35, 84.80%, 93.00%, and 38.00% for RNABindR. In particular, our method outperforms RNABindR by 26.62% in terms of sensitivity.
3. Performance comparison with other approaches on the RBP107 data set
Table 5 compares the performance of RNAProB with other approaches on the RBP107 data set. Based on physicochemical properties, BindN (referred to as BindN-PCP in Table 5) attains MCC, overall accuracy, specificity, and sensitivity of 0.27, 69.32%, 69.84%, and 66.28%, respectively. With more biological features incorporated, BindN (denoted as BindN-ALL in Table 5) further improves specificity and accuracy by 5.86% and 4.93%, with a slight decrease in sensitivity. PPRint improves sensitivity to 70.09%, with the other measures comparable to those of BindN-ALL. Our method outperforms the state-of-the-art approaches by 0.10, 5.10%, 5.33%, and 7.05% in MCC, overall accuracy, specificity, and sensitivity, respectively. This demonstrates that RNAProB not only achieves accurate performance, but also substantially improves sensitivity in the prediction of RNA-binding sites.
Physicochemical preferences of interacting and non-interacting residues
Comparison of smoothed PSSM and standard PSSM
We present RNAProB, which combines a new smoothed PSSM encoding scheme with an SVM model for prediction of RNA-binding sites in proteins. In a standard PSSM profile, evolutionary information is calculated based on the assumption that each position is independent of the others. In contrast, the proposed smoothed PSSM encoding incorporates the correlation or dependency from surrounding residues. Experimental results show that smoothed PSSM encoding performs better than the state-of-the-art approaches on the benchmark data sets. Evaluated by five-fold cross-validation, RNAProB outperforms the other approaches by 0.10~0.23 in MCC, 4.90%~6.83% in overall accuracy, and 0.88%~5.33% in specificity. Most notably, our method significantly improves sensitivity by 26.90%, 26.62%, and 7.05% for the RBP86, RBP109, and RBP107 data sets, respectively. The performance improvement in RNAProB not only demonstrates that smoothed PSSM can better resolve the ambiguity in discriminating RNA interacting and non-interacting residues, but also supports our assumption that consideration of the correlation between neighboring residues can significantly enhance prediction accuracy. To prevent overfitting, a rigorous three-way data split procedure is incorporated to evaluate our prediction performance. The proposed method can be used in other research topics, such as DNA-binding site prediction, protein-protein interaction, and prediction of posttranslational modification sites.
We would like to express our sincere thanks to Dr. Kumar, Dr. Terribilini, and Dr. Jeong for kindly providing their data sets. We also thank Jia-Ming Chang, Allan Lo, Hua-Sheng Chiu, Ei-Wen Yang, Yi-Yuan Chiu, and Dr. Ming-Tat Ko for helpful discussion and useful suggestions.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.
1. Sunita S, Purta E, Durawa M, Tkaczuk KL, Swaathi J, Bujnicki JM, Sivaraman J: Functional specialization of domains tandemly duplicated within 16S rRNA methyltransferase RsmC. Nucleic Acids Res. 2007, 35 (13): 4264-4274. 10.1093/nar/gkm411.
2. Bechara E, Davidovic L, Melko M, Bensaid M, Tremblay S, Grosgeorge J, Khandjian EW, Lalli E, Bardoni B: Fragile X related protein 1 isoforms differentially modulate the affinity of fragile X mental retardation protein for G-quartet RNA structure. Nucleic Acids Res. 2007, 35 (1): 299-306. 10.1093/nar/gkl1021.
3. McKnight KL, Heinz BA: RNA as a target for developing antivirals. Antivir Chem Chemother. 2003, 14 (2): 61-73.
4. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002, 58 (Pt 6 No 1): 899-907. 10.1107/S0907444902003451.
5. Jeong E, Chung IF, Miyano S: A neural network method for identification of RNA-interacting residues in protein. Genome Inform. 2004, 15 (1): 105-116.
6. Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, Dobbs D: Prediction of RNA binding sites in proteins from amino acid sequence. RNA. 2006, 12 (8): 1450-1462. 10.1261/rna.2197306.
7. Jeong E, Miyano S: A Weighted Profile Based Method for Protein-RNA Interacting Residue Prediction. Transactions on Computational Systems Biology. 2006, 123-139.
8. Kumar M, Gromiha MM, Raghava GP: Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008, 71 (1): 189-194. 10.1002/prot.21677.
9. Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 2006, 34 (Web Server): W243-248. 10.1093/nar/gkl298.
10. Gonzalez RC, Woods RE: Digital Image Processing. 2002, Prentice Hall.
11. Terribilini M, Sander JD, Lee JH, Zaback P, Jernigan RL, Honavar V, Dobbs D: RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic Acids Res. 2007, 35 (Web Server): W578-584. 10.1093/nar/gkm294.
12. Vapnik VN: The Nature of Statistical Learning Theory. 1995, Springer.
13. LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
14. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
15. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89 (22): 10915-10919. 10.1073/pnas.89.22.10915.
16. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003, 4: 28. 10.1186/1471-2105-4-28.
17. Su EC, Chiu HS, Lo A, Hwang JK, Sung TY, Hsu WL: Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics. 2007, 8: 330. 10.1186/1471-2105-8-330.
18. Wang L, Brown SJ: Prediction of RNA-binding residues in protein sequences using support vector machines. Conf Proc IEEE Eng Med Biol Soc. 2006, 1: 5830-5833.
19. Matthews BW: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975, 405 (2): 442-451.
20. Swets JA: Measuring the accuracy of diagnostic systems. Science. 1988, 240 (4857): 1285-1293. 10.1126/science.3287615.
21. Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997, 30 (7): 1145-1159. 10.1016/S0031-3203(96)00142-2.
22. Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization. Proteins. 2006, 64 (3): 643-651. 10.1002/prot.21018.
23. Chang JM, Su EC, Lo A, Chiu HS, Sung TY, Hsu WL: PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins. 2008, 72 (2): 693-710. 10.1002/prot.21944.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.