Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set
© You et al.; licensee BioMed Central Ltd. 2014
Published: 3 December 2014
Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions is becoming increasingly popular: it provides an inexpensive way of predicting the most likely set of interactions at the entire proteome scale and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved, and new approaches are still required to overcome the limitations of the current prediction models.
In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation with a Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, so it can capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. The feature selection method mRMR was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested with the SVM algorithm to predict the interaction probability of protein pairs.
When performed on the yeast PPIs data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at the sensitivity of 90.67%. Extensive experiments are conducted to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. The results show that the proposed approach is very promising for predicting PPIs, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
Protein-protein interactions (PPIs) play key roles in various cellular processes, including metabolic cycles, DNA transcription and replication, and signalling cascades. Correctly identifying and characterizing protein interactions is critical for understanding the molecular mechanisms inside the cell. In recent years, researchers have developed a number of innovative techniques for identifying the interactions among proteins [1–3]. Due to the progress in high-throughput biological techniques such as Mass Spectrometry (MS), Tandem Affinity Purification (TAP) [1–4] and other large-scale experimental approaches for PPI identification, an immense amount of PPI data for different organisms has been accumulated [1–5].
However, the high-throughput experimental approaches are time-consuming and expensive. Thus, current PPI data generated with experimental methods cover only a small fraction of the complete PPI networks. In addition, high-throughput biological techniques suffer from high false negative and false positive rates [6–10]. It is therefore important to develop efficient and reliable computational approaches to facilitate the detection of protein interactions [11–13].
A number of computational techniques have been developed to provide either complementary information or supporting evidence to experimental methods [14–17]. Existing approaches typically use binary classification frameworks that differ in the features used to represent protein pairs. Different protein attributes or feature sources, such as protein domains, gene neighbourhood, phylogenetic profiles, gene expression, and literature mining knowledge, are employed to infer protein interactions [6–8, 11, 18–25]. There are also approaches that integrate the interaction information from several different biological data sources. However, if prior knowledge about the proteins is not available, the aforementioned approaches cannot be applied.
Recently, a number of attempts have been made to develop computational models that derive information directly from amino acid sequences to help discover new PPIs [7, 8, 11, 13, 27–30]. The experimental results in previous works showed that the information in protein primary sequences alone is sufficient to detect protein interactions [7, 11, 13]. Shen et al. developed an automatic identification system for predicting PPIs based on protein amino acid sequence information. In their study, the twenty amino acids were first clustered into seven classes according to the dipoles and volumes of their side chains, and the conjoint triad feature was then used to represent a given protein sequence based on this classification of amino acids. This approach achieved a high prediction accuracy of 83.9% on a human PPI data set. However, Shen's method does not take the neighbouring effect into account, and it is generally agreed that the interactions among proteins occur in discontinuous amino acid segments of the protein sequence. Later, Guo et al. employed the auto covariance (AC) transformation to account for the discontinuous amino acid segments in the sequence. When applied to predict Saccharomyces cerevisiae PPIs, their approach achieved a high prediction accuracy of 86.55%. In our previous studies, methods based on autocorrelation descriptors and correlation coefficients also yielded good prediction performance [8, 31, 32].
In this study, a novel feature representation method for the prediction of PPIs is proposed. We hypothesize that continuous and discontinuous amino acid segments play an important role in determining the interactions between proteins. For example, discontinuous regions consist of amino acid residues that are remote from each other in the primary protein sequence yet spatially proximate in the three-dimensional protein structure, which determines the interaction of proteins. In other words, the proposed protein representation method accounts for the interactions between sequentially distant but spatially close amino acid residues, so it is able to capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence.
To sum up, in this paper we propose a sequence-based approach for the prediction of protein-protein interactions using a support vector machine (SVM) combined with a novel multi-scale continuous and discontinuous protein feature representation. In order to reduce the dimensionality of the data and improve the accuracy of the predictor, the feature selection method minimum redundancy maximum relevance (mRMR) is employed to select a compact and discriminative feature subset. The Saccharomyces cerevisiae PPI dataset was employed to evaluate the performance of the proposed method. The experimental results demonstrate that our approach yielded 91.36% prediction accuracy with 91.94% precision at the sensitivity of 90.67%. Our proposed method was also evaluated using the independent dataset of Helicobacter pylori PPIs and achieved a high overall accuracy of 84.91%, which further demonstrates the effectiveness of the proposed method.
In this section, we first briefly introduce the PPI datasets which are employed to evaluate the proposed method. Then we discuss the evaluation strategies used in performance comparisons. Finally, we analyse the experimental results and compare our results with related research.
Benchmark PPI datasets
To evaluate the performance of the proposed method, the PPI dataset collected from the yeast core subset of the Database of Interacting Proteins (DIP) has been employed. This dataset was originally derived by Guo et al. and consists of 11,188 protein pairs, where half are from the positive dataset and half are from the negative dataset. Note that the protein pairs which contain a protein with fewer than fifty residues or have ≥40% sequence identity were removed from our PPI dataset; the remaining 5,594 protein pairs comprise the final positive dataset. Choosing negative examples is very important for training a predictor of PPIs. The common method is based on annotations of cellular localization. In this study, 5,594 protein pairs occurring in two different subcellular localizations were chosen as the negative PPI dataset.
In the experiment, five-fold cross-validation was employed to evaluate the prediction performance of the proposed method. More specifically, the PPI dataset is randomly divided into five equally sized subsets, and then each subset is used as a testing set in turn, while the other four subsets are used as the training set.
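The five-fold partitioning described above can be sketched as follows. This is a minimal illustration rather than the authors' MATLAB code; the function name `five_fold_splits` and the fixed random seed are assumptions introduced here.

```python
import random

def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example with the 11,188 protein pairs of the yeast dataset.
splits = list(five_fold_splits(11188))
```

Each pair appears in exactly one testing set, so every sample is predicted once by a model that never saw it during training.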
The prediction performance is reported as overall accuracy, sensitivity, precision and the Matthews correlation coefficient (MCC). These measures are computed from TP, FP, TN and FN, which refer to the number of true positive, false positive, true negative and false negative PPIs, respectively. MCC is widely regarded as a robust measure for class prediction methods because it takes all four counts into account. An MCC equal to 0 corresponds to a completely random prediction, whereas an MCC of 1 corresponds to a perfect prediction.
In this paper, the proposed sequence-based PPI predictor was implemented on the MATLAB platform. For the SVM algorithm, the LIBSVM implementation available from http://www.csie.ntu.edu.tw/~cjlin/libsvm, originally developed by Chang and Lin, was utilized. The Radial Basis Function was chosen as the kernel function and the optimized parameters were obtained with a grid search approach. Regarding mRMR, the implementation by Peng and Ding available from http://penglab.janelia.org/proj/mRMR/ was used. All the simulations were carried out on a computer with a 3.1 GHz 2-core CPU, 6 GB memory and the Windows operating system.
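For reference, the four counts map to the usual evaluation measures as sketched below. The formulas are the standard textbook definitions of accuracy, sensitivity, precision and MCC, assumed here to match the ones used in the paper; the function name and the zero-denominator handling are illustrative choices.

```python
from math import sqrt

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # fraction of true PPIs recovered
        "precision": ratio(tp, tp + fp),     # fraction of predicted PPIs that are true
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

example = classification_metrics(tp=90, fp=10, tn=90, fn=10)
```

With 90 of 100 positives and 90 of 100 negatives classified correctly, accuracy, sensitivity and precision are all 0.9 while MCC is 0.8, which illustrates that MCC is the stricter of the four measures.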
Prediction performance of proposed model
The DIP PPI data investigated by Guo et al. was adopted to evaluate the performance of the proposed model. Proper parameter settings can improve the SVM classification accuracy; therefore the corresponding parameters for the SVM were optimized first. Here, two parameters, C and gamma (γ), were determined using the grid search approach within a limited range. To ensure that the experimental results are valid and can be generalized to new data, the dataset is randomly partitioned into training and independent testing sets via 5-fold cross-validation. Each of the five subsets acts as an independent holdout testing dataset for the model trained on the other four subsets. Thus five models were generated for the five sets of data. The advantages of cross-validation are that the impact of data dependency is minimized and the reliability of the results is improved.
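A grid search over C and γ can be sketched as below. The source does not state the search range, so the power-of-two grids are an assumption borrowed from common LIBSVM practice, and `toy_cv_accuracy` is a hypothetical stand-in for "train an RBF SVM with these parameters and return its cross-validated accuracy".

```python
from itertools import product
from math import log2

def grid_search(cv_score, c_values, gamma_values):
    """Return (best_score, best_C, best_gamma) over the full parameter grid."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        score = cv_score(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best

# Assumed exponent ranges; not taken from the paper.
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

def toy_cv_accuracy(c, gamma):
    # Hypothetical scoring surface peaking at C = 2^3, gamma = 2^-5.
    return -((log2(c) - 3) ** 2) - (log2(gamma) + 5) ** 2

best_score, best_c, best_gamma = grid_search(toy_cv_accuracy, c_grid, gamma_grid)
```

In a real run the scoring function would wrap the SVM training and the 5-fold evaluation, so the cost of the search is the grid size times five model fits.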
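Table 1. Comparison of the prediction performance by the proposed method and some state-of-the-art works on the yeast dataset (mean ± standard deviation, %).
Model | Feature set | Sensitivity | Precision | Accuracy | MCC | AUC
Proposed method | SVM + MCD | 90.67 ± 0.69 | 91.94 ± 0.62 | 91.36 ± 0.36 | 84.21 ± 0.59 | 97.07 ± 0.12
Guo et al. | ACC | 89.93 ± 3.68 | 88.87 ± 6.16 | 89.33 ± 2.67 | N/A | N/A
Guo et al. | AC | 87.30 ± 4.68 | 87.82 ± 4.33 | 87.36 ± 1.38 | N/A | N/A
Zhou et al. | SVM + LD | 87.37 ± 0.22 | 89.50 ± 0.60 | 88.56 ± 0.33 | 77.15 ± 0.68 | 95.07 ± 0.39
Yang et al. | Cod1 | 75.81 ± 1.20 | 74.75 ± 1.23 | 75.08 ± 1.13 | N/A | N/A
Yang et al. | Cod2 | 76.77 ± 0.69 | 82.17 ± 1.35 | 80.04 ± 1.06 | N/A | N/A
Yang et al. | Cod3 | 78.14 ± 0.90 | 81.86 ± 0.99 | 80.41 ± 0.47 | N/A | N/A
Yang et al. | Cod4 | 81.03 ± 1.74 | 90.24 ± 1.34 | 86.15 ± 1.17 | N/A | N/A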
We further compared our method with those of Guo et al., Zhou et al. and Yang et al., in which SVM, SVM and KNN classifiers were applied with the conventional Auto Covariance, Local Descriptor, and Local Descriptor representations as the input feature vectors, respectively. From Table 1, we can see that the performance of all of these methods, with their different machine learning models and sequence-based feature representations, is lower than ours, which suggests that our improvement results from adopting the proposed MCD descriptor to represent the protein sequences. In short, the proposed method generally outperforms the previous approaches, with higher discriminative power for detecting PPIs based on the information in protein amino acid sequences. We are therefore more convinced that our method can be a useful tool for the protein interaction prediction community.
Comparing the prediction performance with other methods
Performance comparison of different methods on the H
Ensemble of HKNN
Incremental Feature Selection (IFS) and optimal feature subset
The incremental feature selection (IFS) procedure was used to find an optimal subset from the mRMR feature list generated above. Suppose the total number of features is N. We can then obtain N feature subsets: the first contains only the top-ranked feature, and each subsequent subset adds the next feature from the mRMR feature list. An SVM predictor was constructed for each feature subset and evaluated with 5-fold cross-validation. Finally, the IFS curve of MCC against the feature subsets was plotted, and the subset whose predictor yields the best MCC was taken as the optimal feature subset. The detailed analysis of the experimental results in this section and the IFS curve are available at the website: http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip.
In this study, we have proposed an efficient technique for predicting protein interactions from protein primary sequences by combining a novel multi-scale continuous and discontinuous (MCD) feature representation with an SVM model. The MCD representation takes into account the fact that PPIs usually occur in discontinuous segments of the protein sequence, where distant amino acid residues are brought into spatial proximity by protein folding. A protein sequence was characterized by a number of regions using the MCD representation, which is capable of capturing multiple overlapping continuous and discontinuous binding patterns within a protein sequence. In order to reduce the noise and irrelevant features which affect the prediction performance, the mRMR method was adopted for feature selection. Experimental results show that our method performed well in predicting protein interactions. The results demonstrate that the proposed approach is very promising for predicting PPIs and can be a useful supplementary tool to traditional experimental methods.
In this section, we introduce the proposed MCD-SVM approach for predicting protein interactions from protein primary sequences. The proposed approach consists of three steps: (1) protein sequences are represented as vectors using the proposed multi-scale continuous and discontinuous (MCD) feature representation; (2) minimum redundancy maximum relevance (mRMR) is used for feature selection; (3) an SVM predictor is used to perform the protein interaction prediction task.
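The IFS loop can be sketched as follows. The function name and the `evaluate` callback are assumptions: in the paper's setting the callback would train an SVM on the given feature subset and return its cross-validated MCC, which is replaced here by a toy scoring function.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the subset one feature at a time in ranked order and keep the best.

    ranked_features: feature indices ordered by mRMR (best first).
    evaluate: callable mapping a feature subset to a score such as MCC.
    """
    curve = []
    for size in range(1, len(ranked_features) + 1):
        subset = ranked_features[:size]
        curve.append((size, evaluate(subset)))  # one point of the IFS curve
    best_size, best_score = max(curve, key=lambda point: point[1])
    return ranked_features[:best_size], best_score, curve

# Toy example: each feature adds a fixed (hypothetical) gain to the score.
gains = {4: 0.5, 2: 0.2, 7: 0.1, 1: -0.05, 5: -0.1}
subset, score, curve = incremental_feature_selection(
    [4, 2, 7, 1, 5], lambda feats: sum(gains[f] for f in feats)
)
```

Because only prefixes of the ranked list are evaluated, N model evaluations suffice instead of the 2^N needed for an exhaustive subset search.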
Feature vector extraction
To successfully use machine learning methods to predict PPIs from protein sequences, one of the most important computational challenges is how to effectively represent a protein sequence by a fixed-length feature vector in which the important information content of the protein is fully encoded. Although researchers have proposed various sequence-based methods to predict new PPIs, a common flaw is that interaction information cannot be drawn from both continuous and discontinuous amino acid segments at the same time. To overcome this problem, in this study we propose a novel Multi-scale Continuous and Discontinuous (MCD) sequence representation approach that transforms protein sequences into feature vectors by using a binary coding scheme. A multi-scale decomposition technique is used to divide the protein sequence into multiple sequence segments of varying length to describe both continuous and discontinuous regions. Here, the continuous sequence segments are composed of residues which are local in the polypeptide sequence, while discontinuous regions consist of residues from different parts of the sequence, brought into spatial proximity by the folding of the protein into its native structure.
Note that the proposed representation can be adjusted simply and conveniently to multiple scales, which offers a simple, unified, and theoretically sound way of addressing these difficulties when representing a protein sequence. For a given number of bits, each protein sequence can take on only a finite number of continuous or discontinuous regions, which limits the resolution of the representation. If more bits are used for each protein sequence, a higher degree of resolution is obtained. For example, if the protein sequence is encoded in 5-bit binary form, each protein sequence may take on 30 (2^5 − 2) different regions. Higher-bit encoding requires more storage and more computing resources to process.
For each continuous or discontinuous region, three types of descriptors, composition (C), transition (T) and distribution (D), are used to represent its characteristics. C is the number of amino acids of a particular property (e.g., hydrophobicity) divided by the total number of amino acids in a local region. T characterizes the percentage frequency with which amino acids of a particular property are followed by amino acids of another property. D measures the chain length within which the first, 25%, 50%, 75%, and 100% of the amino acids of a particular property are located, respectively.
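The region count can be illustrated with a small sketch. It rests on assumptions that the text above does not spell out: the sequence is cut into one consecutive, nearly equal segment per bit, each bit states whether that segment is included, and the two excluded codes are all zeros and all ones. The function names are hypothetical.

```python
def split_into_segments(sequence, n_segments):
    """Cut the sequence into n_segments consecutive, nearly equal pieces."""
    n = len(sequence)
    bounds = [(i * n) // n_segments for i in range(n_segments + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(n_segments)]

def enumerate_regions(sequence, n_bits=4):
    """Map each binary code to the region built from the selected segments."""
    segments = split_into_segments(sequence, n_bits)
    regions = {}
    # Codes 1 .. 2^n_bits - 2: skips the all-zero and all-one codes (assumed).
    for code in range(1, 2 ** n_bits - 1):
        bits = format(code, f"0{n_bits}b")
        regions[bits] = "".join(
            seg for bit, seg in zip(bits, segments) if bit == "1"
        )
    return regions

regions = enumerate_regions("AABBCCDD", n_bits=4)
```

Under these assumptions, code 0011 selects the last two segments (a continuous region), while 1011 joins the first segment with the last two (a discontinuous region). Four bits give 14 regions and five bits give 30.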
Division of amino acids into seven groups based on the dipoles and volumes of the side chains.
Then, every amino acid in each protein sequence is replaced by an index depending on its group. For example, the protein sequence "GGYCCCYYGYYYGCCGGYYGCG" is replaced by 1132223313331221133121 based on this classification of amino acids. There are eight '1', six '2' and eight '3' in this protein sequence. The composition for these three symbols is 8 × 100%/(8+6+8) = 36.36%, 6 × 100%/(8+6+8) = 27.27% and 8 × 100%/(8+6+8) = 36.36%, respectively. There are 4 transitions from '1' to '2' or from '2' to '1' in this sequence, and the percentage frequency of these transitions is (4/21) × 100% = 19.05%. The transitions from '1' to '3' or from '3' to '1' can similarly be calculated as (6/21) × 100% = 28.57%, and the transitions from '2' to '3' or from '3' to '2' as (2/21) × 100% = 9.52%.
For each continuous or discontinuous region, the three descriptors (C, T and D) were calculated and concatenated, giving a total of 63 descriptors: 7 for C, 21 ((7 × 6)/2) for T and 35 (7 × 5) for D. Then, the descriptors from all 14 regions were concatenated into an 882-dimensional vector representing each protein sequence. Finally, a protein pair is represented by concatenating the vectors of the two individual proteins. Thus, a 1764-dimensional vector is constructed to characterize each protein pair and is used as the feature vector for input into the SVM classifier.
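The worked example can be checked with a short sketch. The three-symbol mapping (G to 1, C to 2, Y to 3) is read off the example sequence above; the function names are hypothetical, and the rounding rule used for the distribution descriptor (position of the ceil(k·f)-th occurrence, divided by region length) is an assumption, since the exact convention is not given here.

```python
from itertools import combinations
from math import ceil

def encode(sequence, groups):
    """Replace every residue by the index of its group."""
    return "".join(groups[residue] for residue in sequence)

def ctd_descriptors(encoded, symbols):
    """Composition, transition and distribution for one encoded region."""
    n = len(encoded)
    comp = {s: encoded.count(s) / n for s in symbols}

    pairs = list(zip(encoded, encoded[1:]))  # the n - 1 adjacent positions
    trans = {}
    for a, b in combinations(symbols, 2):
        changes = sum(1 for x, y in pairs if {x, y} == {a, b})
        trans[(a, b)] = changes / max(len(pairs), 1)

    dist = {}
    for s in symbols:
        positions = [i + 1 for i, x in enumerate(encoded) if x == s]
        if not positions:
            dist[s] = [0.0] * 5
            continue
        k = len(positions)
        # First occurrence, then the 25%, 50%, 75% and 100% marks (assumed rule).
        marks = [1] + [max(1, ceil(k * f)) for f in (0.25, 0.5, 0.75, 1.0)]
        dist[s] = [positions[m - 1] / n for m in marks]
    return comp, trans, dist

encoded = encode("GGYCCCYYGYYYGCCGGYYGCG", {"G": "1", "C": "2", "Y": "3"})
comp, trans, dist = ctd_descriptors(encoded, ["1", "2", "3"])
```

With the seven amino acid groups used in the paper, the same function would return 7 composition, 21 transition and 35 distribution values per region.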
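The dimension bookkeeping above can be verified in a few lines. The helper `pair_vector` is a hypothetical name, and placing the first protein's descriptors before the second's is an assumption about ordering that the text does not specify.

```python
N_GROUPS = 7     # amino acid groups
N_REGIONS = 14   # continuous and discontinuous regions per protein

n_composition = N_GROUPS                         # 7
n_transition = N_GROUPS * (N_GROUPS - 1) // 2    # 21 unordered group pairs
n_distribution = N_GROUPS * 5                    # 35: five marks per group
per_region = n_composition + n_transition + n_distribution
per_protein = per_region * N_REGIONS

def pair_vector(protein_a, protein_b):
    """Concatenate two per-protein descriptor vectors into one pair vector."""
    return list(protein_a) + list(protein_b)

example_pair = pair_vector([0.0] * per_protein, [0.0] * per_protein)
```

Note that plain concatenation is order-dependent: swapping the two proteins yields a different vector for the same pair.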
Minimum redundancy maximum relevance (mRMR)
After the feature extraction procedure, all interacting and non-interacting protein pairs in the benchmark datasets are converted into numerical feature vectors of the same dimension. In order to reduce feature redundancy and computational complexity, the Minimum Redundancy Maximum Relevance (mRMR) criterion was used in this study to select an optimal feature subset.
The mRMR method was originally proposed by Peng et al. for processing microarray gene expression data. It ranks features based on the trade-off between their relevance to the target and the redundancy among the features themselves. Features with a better trade-off between maximum relevance to the target and minimum redundancy with other features are considered better and are selected earlier in the final ordered list. The mRMR algorithm is described briefly below.
Relevance and redundancy in mRMR are quantified by the mutual information between two variables x and y,

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$

where p(x, y) is the joint probability density function of x and y, and p(x) and p(y) are the marginal probability density functions of x and y, respectively.
Given a dataset with N features, the mRMR feature evaluation continues for N rounds. Finally, an ordered feature set is obtained in which each feature has a subscript index indicating the round in which it was chosen. The earlier a feature is selected in the evaluation, the smaller its index and the better the feature. The mRMR program can be downloaded from the website at http://penglab.janelia.org/proj/mRMR/.
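The round-by-round ranking can be sketched for discrete features as follows. This is not the mRMR program referenced above: it assumes the "difference" form of the criterion (relevance minus mean redundancy with the features already chosen) and estimates mutual information from simple frequency counts; the function names are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )

def mrmr_rank(features, target):
    """Order feature columns by greedy max-relevance, min-redundancy."""
    relevance = [mutual_information(f, target) for f in features]
    remaining = list(range(len(features)))
    ordered = []
    while remaining:  # one round per feature
        def score(j):
            if not ordered:
                return relevance[j]
            redundancy = sum(
                mutual_information(features[j], features[k]) for k in ordered
            ) / len(ordered)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        ordered.append(best)
        remaining.remove(best)
    return ordered

target = [0, 0, 0, 1, 1, 1, 1, 1]
f0 = [0, 0, 0, 0, 1, 1, 1, 1]   # strongly relevant
f1 = list(f0)                   # exact duplicate of f0, hence fully redundant
f2 = [0, 0, 1, 1, 0, 0, 1, 1]   # weakly relevant, independent of f0
ranking = mrmr_rank([f0, f1, f2], target)
```

In this toy case the duplicate feature is ranked last even though its relevance equals that of the top feature, which is the behaviour the redundancy term is meant to produce.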
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a classification and regression paradigm first developed by Vapnik. It has attracted much research attention in recent years due to its demonstrated generalization performance relative to other techniques in many real-world applications, including bioinformatics. The SVM originated from structural risk minimization theory. The main difference between this technique and many other conventional classification techniques, including neural networks, is that it minimizes the structural risk instead of the empirical risk. The principle is that minimizing an upper bound on the generalization error, rather than minimizing the training error, is expected to perform better. SVM training seeks a globally optimal solution and tends to avoid over-fitting, so it is able to deal with a large number of features. A complete description of the theory of SVMs for pattern recognition is given in Vapnik's book.
The basic idea of using an SVM model for classification can be stated briefly as follows. First, map the original data into a high-dimensional feature space through a linear or non-linear mapping function, which is determined by the choice of kernel function. Then, within this feature space, seek an optimized linear division, i.e. construct a hyperplane which separates the data into two classes.
In Equation (11), C is a regularization parameter which controls the trade-off between margin and misclassification error. The training samples x_i are called support vectors only if the corresponding Lagrange multipliers α_i > 0. In this work, the Radial Basis Function (RBF) kernel, K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), is applied, because it has good boundary response and most high-dimensional data sets can be approximated by Gaussian-like distributions. In the experiment we use the well-known software LIBSVM to classify the PPI dataset.
This work is supported in part by the National Science Foundation of China, under Grants 61133010, 61373105, 61102119, 61373086, 61272339, 61171125, U1201256. The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.
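As a concrete illustration, the RBF kernel in the form used by LIBSVM, K(x, y) = exp(−γ‖x − y‖²), can be computed as below. The function name is an assumption introduced for this sketch.

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # identical points
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
```

The kernel equals 1 for identical points and decays towards 0 as they move apart; a larger γ makes the decay faster, which is why γ is tuned together with C.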
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 15, 2014: Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S15.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147. 10.1038/415141a.
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001, 98 (8): 4569-4574. 10.1073/pnas.061034498.
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415 (6868): 180-183. 10.1038/415180a.
- Krogan NJ, Cagney G, Yu HY, Zhong GQ, Guo XH, Ignatchenko A, Li J, Pu SY, Datta N, Tikuisis AP: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440 (7084): 637-643. 10.1038/nature04670.
- Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. Febs Letters. 2006, 580 (2): 380-384. 10.1016/j.febslet.2005.11.081.
- You ZH, Lei YK, Gui J, Huang DS, Zhou XB: Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics. 2010, 26 (21): 2744-2751. 10.1093/bioinformatics/btq510.
- Xia JF, Zhao XM, Huang DS: Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids. 2010, 39 (5): 1595-1599. 10.1007/s00726-010-0588-1.
- Xia JF, Han K, Huang DS: Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept Lett. 2010, 17 (1): 137-145. 10.2174/092986610789909403.
- Zhu L, You ZH, Huang DS, Wang B: LSE: A novel robust geometric approach for modeling protein-protein interaction networks. PLoS One. 2013, 8 (4).
- Lei YK, You ZH, Dong TB, Jiang YX, Yang JA: Increasing reliability of protein interactome by fast manifold embedding. Pattern Recognition Letters. 2013, 34 (4): 372-379. 10.1016/j.patrec.2012.09.004.
- Guo YZ, Yu LZ, Wen ZN, Li ML: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008, 36 (9): 3025-3030. 10.1093/nar/gkn159.
- You Z-H, Yin Z, Han K, Huang D-S, Zhou X: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics. 2010, 11.
- Shen JW, Zhang J, Luo XM, Zhu WL, Yu KQ, Chen KX, Li YX, Jiang HL: Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci USA. 2007, 104 (11): 4337-4341. 10.1073/pnas.0607879104.
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. Plos Computational Biology. 2007, 3 (4): 595-601.
- Zhao XM, Li X, Chen L, Aihara K: Protein classification with imbalanced data. Proteins. 2008, 70 (4): 1125-1132.
- Lam WWM, Chan KCC: Discovering functional interdependence relationship in PPI networks for protein complex identification. IEEE Trans Biomed Eng. 2012, 59 (4): 899-908.
- You ZH, Lei YK, Zhu L, Xia JF, Wang B: Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics. 2013, 14.
- Lei YK, You ZH, Ji Z, Zhu L, Huang DS: Assessing and predicting protein interactions by combining manifold embedding with multiple information integration. BMC Bioinformatics. 2012, 13.
- Zheng CH, Zhang L, Ng VTY, Shiu SCK, Huang DS: Molecular pattern discovery based on penalized matrix decomposition. IEEE-ACM Trans Comput Biol Bioinform. 2011, 8 (6): 1592-1603.
- Zheng CH, Zhang L, Ng TY, Shiu SCK, Huang DS: Metasample-based sparse representation for tumor classification. IEEE-ACM Trans Comput Biol Bioinform. 2011, 8 (5): 1273-1282.
- Zhao XM, Chen LN, Aihara K: Protein function prediction with high-throughput data. Amino Acids. 2008, 35 (3): 517-530. 10.1007/s00726-008-0077-y.
- Zhao XM, Chen LN, Aihara K: A discriminative approach for identifying domain-domain interactions from protein-protein interactions. Proteins. 2010, 78 (5): 1243-1253. 10.1002/prot.22643.
- Zhao XM, Cheung YM, Huang DS: A novel approach to extracting features from motif content and protein composition for protein sequence classification. Neural Networks. 2005, 18 (8): 1019-1028. 10.1016/j.neunet.2005.07.002.
- Wang SL, Zhu YH, Jia W, Huang DS: Robust classification method of tumor subtype by using correlation filters. IEEE-ACM Trans Comput Biol Bioinform. 2012, 9 (2): 580-591.
- You ZH, Yu JZ, Zhu L, Li S, Wen ZK: A mapreduce based parallel SVM for large-scale predicting protein-protein interactions. Neurocomputing. 2014.
- Qi YJ, Klein-Seetharaman J, Bar-Joseph Z: Random forest similarity for protein-protein interaction prediction from multiple sources. 2005, Singapore: World Scientific Publ Co Pte Ltd.
- Yu HJ, Huang DS: Graphical representation for DNA sequences via joint diagonalization of matrix pencil. IEEE J Biomed Health Inform. 2013, 17 (3): 503-511.
- Huang DS, Yu HJ: Normalized Feature Vectors: A novel alignment-free sequence comparison method based on the numbers of adjacent amino acids. IEEE-ACM Trans Comput Biol Bioinform. 2013, 10 (2): 457-467.
- Zhang YN, Pan XY, Huang Y, Shen HB: Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. J Theor Biol. 2011, 283 (1): 44-52. 10.1016/j.jtbi.2011.05.023.
- Pan XY, Zhang YN, Shen HB: Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features. J Proteome Res. 2010, 9 (10): 4992-5001. 10.1021/pr100618t.
- Shi MG, Xia JF, Li XL, Huang DS: Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids. 2010, 38 (3): 891-899. 10.1007/s00726-009-0295-y.
- You Z-H, Li S, Gao X, Luo X, Ji Z: Large-scale protein-protein interactions detection by integrating big biosensing data with computational model. BioMed Research International. 2014, 2014.
- Peng HC, Long FH, Ding C: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005, 27 (8): 1226-1238.
- Chang CC, Lin CJ: LIBSVM: A Library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (3).
- Zhou YZ, Gao Y, Zheng YY: Prediction of protein-protein interactions using local description of amino acid sequence. Advances in Computer Science and Education Applications, Pt Ii. Edited by: Zhou M, Tan HH. 2011, Berlin: Springer-Verlag Berlin, 202: 254-262. 10.1007/978-3-642-22456-0_37.
- Yang L, Xia JF, Gui J: Prediction of protein-protein interactions from protein sequence using local descriptors. Protein Pept Lett. 2010, 17 (9): 1085-1090. 10.2174/092986610791760306.
- Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics. 2005, 21 (2): 218-226. 10.1093/bioinformatics/bth483.
- Bock JR, Gough DA: Whole-proteome interaction mining. Bioinformatics. 2003, 19 (1): 125-134. 10.1093/bioinformatics/19.1.125.
- Nanni L: Hyperplanes for predicting protein-protein interactions. Neurocomputing. 2005, 69 (1-3): 257-263. 10.1016/j.neucom.2005.05.007.
- Nanni L, Lumini A: An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics. 2006, 22 (10): 1207-1210. 10.1093/bioinformatics/btl055.
- Li BQ, Hu LL, Niu S, Cai YD, Chou KC: Predict and analyze S-nitrosylation modification sites with the mRMR and IFS approaches. J Proteomics. 2012, 75 (5): 1654-1665. 10.1016/j.jprot.2011.12.003.
- Dubchak I, Muchnik I, Holbrook SR, Kim SH: Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA. 1995, 92 (19): 8700-8704. 10.1073/pnas.92.19.8700.
- Cortes C, Vapnik V: Support vector network. Mach Learn. 1995.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.