Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm
© Li et al; licensee BioMed Central Ltd. 2010
Received: 18 December 2009
Accepted: 16 June 2010
Published: 16 June 2010
Because a priori knowledge about function of G protein-coupled receptors (GPCRs) can provide useful information to pharmaceutical research, the determination of their function is a quite meaningful topic in protein science. However, with the rapid increase of GPCRs sequences entering into databanks, the gap between the number of known sequence and the number of known function is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally significant to develop a computational method for quick and accurate classification of GPCRs.
In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) is applied to pre-evaluate features with discriminative information while genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracy with three-layer predictor at levels of superfamily, family and subfamily are obtained by cross-validation test on two non-redundant dataset. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred.
The results with high success rates indicate that the proposed predictor is a useful automated tool in predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCRs prediction and classification, can be acquired freely on request from the authors.
G protein-coupled receptors (GPCRs), also known as 7 α-helices transmembrane receptors due to their characteristic configuration of an anticlockwise bundle of 7 transmembrane α helices , are one of the largest superfamily of membrane proteins and play an extremely important role in transducing extracellular signals across the cell membrane via guanine-binding proteins (G-proteins) with high specificity and sensitivity . GPCRs regulate many basic physicochemical processes contained in a cellular signaling network, such as smell, taste, vision, secretion, neurotransmission, metabolism, cellular differentiation and growth, inflammatory and immune response [3–9]. For these reasons, GPCRs have been the most important and common targets for pharmacological intervention. At present, about 30% of drugs available on the market act through GPCRs. However, detailed information about the structure and function of GPCRs are deficient for structure-based drug design, because the determination of their structure and functional using experimental approach is both time-consuming and expensive.
As membrane proteins, GPCRs are very difficult to crystallize and most of them will not dissolve in normal solvents . Accordingly, the 3 D structure of only squid rhodopsin, β1, β2 adrenergic receptor and the A2A adenosine receptor have been solved to data. In contrast, the amino acid sequences of more than 1000 GPCRs are known with the rapid accumulation of data of new protein sequence produced by the high-throughput sequencing technology. In view of the extremely unbalanced state, it is vitally important to develop a computational method that can fast and accurately predict the structure and function of GPCRs from sequence information.
Actually, many predictive methods have been developed, which in general, can be roughly divided into three categories. The first one is proteochemometric approach developed by Lapinsh . However, the methods need structural information of organic compounds. The second one is based on similarity searches using primary database search tools (e.g. BLAST, FASTA) and such database searches coupled with searches of pattern databases (PRINTS) . However, they do not seem to be sufficiently successful for comprehensive functional identification of GPCRs, since GPCRs make up a highly divergent family, and even when they are grouped according to similarity of function, their sequences share strikingly little homology or similarity to each other . The third one is based on statistical and machine learning method, including support vector machines (SVM) [8, 14–17], hidden Markov models (HMMs) [1, 3, 6, 18], covariant discriminant (CD) [7, 11, 19, 20], nearest neighbor (NN) [2, 21] and other techniques [13, 22–24].
Among them, SVM that is based on statistical learning theory has been extensively used to solve various biological problems, such as protein secondary structure [25, 26], subcellular localization [27, 28], membrane protein types , due to its attractive features including scalability, absence of local minima and ability to condense information contained in the training set. In SVM, an initial step to transform protein sequence into a fixed length feature vector is essential because SVM can not be directly applied to amino acid sequences with different length. Two commonly used feature vectors to predict GPCRs functional classes are amino acid composition (AAC) and dipeptide composition (DipC) [2, 7, 10, 16, 19, 20, 22], where every protein is represented by 20 or 400 discrete numbers. Obviously, if one uses AAC or DipC to represent a protein, many important information associated with the sequence order will be lost. To take into account the information, the so-called pseudo amino acid composition (PseAA) was proposed  and has been widely used to GPCRs and other attributes of protein studies [10, 31–36]. However, the existing methods were established only based on a single feature-set. And, few works tried to research the relationship between features and the functional classes of protein [37–39], or to find the informative features which contribute most to discriminate functional types. Karchin et al  also indicated that the performance of SVM could be further improved by using feature vector that posses the most discriminative information. Therefore, feature selection should be used for accurate SVM classification.
Feature selection, also known as variable selection or attribute selection, is the technique commonly used in machine learning and has played an important role in bioinformatics studies. It can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifier and to provide more insight into the underlying causal relationships . The technique has been greatly applied to the field of microarray and mass spectra (MS) analysis [41–50], which has a great challenge for computational techniques due to their high dimensionality. However, there is still few works utilizing feature selection in GPCRs prediction to obtain the most informative features or to improve the prediction accuracy.
So, a new predictor combining feature selection and support vector machine is proposed for the identification and classification of GPCRs at the three levels of superfamily, family and subfamily. In every level, minimum redundancy maximum relevance (mRMR)  is utilized to pre-evaluate features with discriminative information. After that, to further improve the prediction accuracy and to obtain the most important features, genetic algorithms (GA)  is applied to feature selection. Finally, three models based on SVM are constructed and used to identify whether a query protein is GPCR and which family or subfamily the protein belongs to. The prediction quality evaluated on a non-redundant dataset by the jackknife cross-validation test exhibited significant improvement compared with published results.
As is well-known, sequence similarity in dataset has an important effect on the prediction accuracy, i.e. accuracy will be overestimated when using high similarity protein sequence. Thus, in order to disinterestedly test current method and facilitate to compare with other existing approaches, the dataset constructed by Xiao  is used as the working dataset. The similarity in the dataset is less than 40%. The dataset contains 730 protein sequences that can be classified into two parts: 365 non-GPCRs and 365 GPCRs. The 365 GPCRs can be divided into 6 families: 232 rhodopsin-like, 44 metabotropic glutamate/pheromone, 39 secretin-like, 23 fungal pheromone, 17 frizzled/smoothened and 10 cAMP receptor. For rhodopsin-like of GPCRs, we further partitioned into 15 subfamilies based on GPCRDB (release 10.0) , including 46 amine, 72 peptide, 2 hormone, 17 rhodopsin, 19 olfactory, 7 prostanoid, 13 nucleotide, 2 cannabinoid, 1 platelet activating factor, 2 gonadotropin-releasing hormone, 3 thyrotropin-releasing hormone & secretagogue, 2 melatonin, 9 viral, 4 lysosphingolipid, 2 leukotriene B4 receptor and 31 orphan. Those subfamilies, which the number of proteins is lower than 10, are combined into a class, because they contain too few sequences to have any statistical significance. So, 6 classes (46 amine, 72 peptide, 17 rhodopsin, 19 olfactory, 13 nucleotide and 34 other) are obtained at subfamily level.
In order to fully characterize protein primary structure, 10 feature vectors are employed to represent the protein sample, including AAC, DipC, normalized Moreau-Broto autocorrelation (NMBAuto), Moran autocorrelation (MAuto), Geary autocorrelation (GAuto), composition (C), transition (T), distribution (D) , composition and distribution of hydrophobicity pattern (CHP, DHP). Here 8 and 7 amino acid properties extracted from AAIndex database  are selected to compute autocorrelation, C, T and D features, respectively. The properties and definitions of amino acids attributed to each group are shown in Additional file 1 and 2.
According to the theory of Lim , 6 kinds of hydrophobicity patterns include: (i, i+2), (i, i+2, i+4), (i, i+3), (i, i+1, i+4), (i, i+3, i+4) and (i, i+5). The patterns (i, i+2) and (i, i+2, i+4) often appear in the β-sheets while the patterns (i, i+3), (i, i+1, i+4) and (i, i+3, i+4) occur more often in α-helices. The pattern (i, i+5) is an extension of the concept of the "helical wheel" or amphipathic α-helix . Seven kinds of amino acids, including Cys (C), Phe (F), Ile (I), Leu (L), Met (M), Val (V) and Trp (W), may occur in the 6 patterns based on the observed of Rose et al . Because transmembrane regions of membrane protein are usually composed of β-sheet and α-helix, CHP and DHP are used to represent protein sequence.
The optimized feature subset selection
SVM is one of the most powerful machine learning methods, but it cannot perform automatic feature selection. To overcome this limitation, various feature selection methods were introduced [59, 60]. Feature selection methods typically were divided into two categories: filter and wrapper methods. Although filter methods are computationally simple and easily scale to high-dimensional dataset, they ignore the interaction between selected feature and classifier. In contrast, wrapper approaches include the interaction and can also take into account the correlation between features, but they have a higher risk of overfitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost . Considering the characteristics of the two methods, the mRMR belonging to filter methods is used to preselect a feature subset, and then GA belonging to wrapper methods is utilized to obtain the optimized feature subset.
Minimum redundancy maximum relevance (mRMR)
Where, p(x i ,y j ) is joint probabilistic density, p(x i ) and p(y j ) is marginal probabilistic density.
Where, S denoted that the feature subset, and |S| is the number of feature in S.
If continuous features exist in feature set, the feature must be discretized by using "mean ± standard deviation/2" as boundary of the 3 states. The value of feature larger than "mean + standard deviation/2" is transformed to state 1; The value of feature between "mean - standard deviation/2" and "mean + standard deviation/2" is transformed to state 2; The value of feature smaller than "mean - standard deviation/2" is transformed to state 3. In this case, computing mutual information is straightforward, because both joint and marginal probability tables can be estimated by tallying the samples of categorical variables in the data . More explanation about the calculation of probability can be seen from Additional file 3. Detailed depiction of the mRMR method can be found in reference , and mRMR program can be obtained from http://penglab.janelia.org/proj/mRMR/index.htm
Genetic algorithms (GA)
GA can effectively search the interesting space and easily solve complex problems without requiring a prior knowledge about the space and the problem. These advantages of GA make it possible to simultaneously optimize the feature subset and the SVM parameters. The chromosome representations, fitness function, selection, crossover and mutation operator in GA are described in the following sections.
The chromosome is composed of decimal and binary coding systems, where binary genes are applied to the selection of features and decimal genes are utilized to the optimization of SVM parameters.
Where, SVM _ accuracy is SVM classification accuracy, n is the number of selected features, N is the number of overall features.
Selection, crossover and mutation operator
The method based on chaos  is applied to the mutation operator of decimal coding. Mutation to the part of binary coding is the same as traditional GA.
The population size of GA is 30, and the termination condition is that the generation numbers reach 10000. A detailed depiction of the GA can be reference to our previous works .
Model construction and assessment of performance
For the present SVM, the publicly available LIBSVM software  is used to construct the classifier with the radial basis function as the kernel. Ten-fold cross-validation test is used to examine a predictor for its effectiveness. In the 10-fold cross-validation, the dataset is divided randomly into 10 equally sized subsets. The training and testing are carried out 10 times, each time using one distinct subset for testing and the remaining 9 subsets for training.
Here, TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
Where, N is the total number of sequences, obs(i) is the number of sequences observed in class i, p(i) is the number of correctly predicted sequences of class.
Step 1. Produce various feature vectors that represent a query protein sequence.
Step 2. Preselect a feature subset by running mRMR. Select an optimized feature subset from the preselect subset by GA and SVM. Predict whether the query protein belong to the GPCRs or not. If the protein is classified into non-GPCRs, stop the process and output results, otherwise, go to the next step.
Step 3. Preselect again a feature subset and further select an optimized feature subset. Predict which family the protein belongs. If the protein is divided into non-Rhodopsin like, stop the process with the output of results, otherwise, go to the next step.
Step 4. Preselect a feature subset again and select an optimized feature subset. Predict which subfamily the protein belongs to.
Results and discussion
Identification a GPCR from non-GPCR
Recognition of GPCR family
Following the same steps described above, the quality of various feature subsets are investigated at family level based on grid search and 10-fold cross-validation tested. The relationship between number of feature and overall accuracy is shown in Figure 3. A significant increase in overall accuracy can be observed when the number of feature increased from 1 to 301, and the highest overall accuracy of 96.99% can be achieved.
We also further perform GA for preselecting feature subset with 600 features to acquire an optimized feature subset. The processes of optimization are displayed in Figure 4 and Figure 5. It can be observed that the number of features dramatically decreased from 250 to 57 when the number of generation increased from 1 to 2300, and the best fitness and highest overall accuracy of 99.73% can be achieved. So, the optimal classifier with 57 features is used to construct classifier at family level.
The results of the optimized feature subset are also shown in Figure 6. The optimized features subset contains 2 AAC, 14 DipC, 8 D, 7 NMBAuto, 21 MAuto, 2 GAuto and 3 DHP features. The results reveal that the order of these feature groups that contributed to the classification GPCRs into 6 families is: MAuto > DipC > D > NMBAuto > DHP > AAC and GAuto.
Classification of GPCR subfamily
Because knowledge of GPCRs subfamilies can provide useful information to pharmaceutical companies and biologists, the identification of subfamilies is a quite meaningful topic in assigning a function to GPCRs. Therefore, we constructed a classifier at subfamily level to predict the subfamily belonging to the rhodopsin-like family. Rhodopsin-like family is considered because it covers more than 80% of sequences in the GPCRDB database , and the number of other family in current dataset is too few to have any statistical significance. Similarly, we also study the quality of various feature subsets from mRMR based on grid search and 10-fold cross-validation tested. The correlation between number of features and overall accuracy is also illustrated in Figure 3. Overall accuracy enhanced when the number of features increased from 1 to 300, and the highest overall accuracy of 87.56% can be obtained by using the feature subset with 418 features.
In order to get an optimized feature subset, GA is further applied to further feature selection from a preselected feature subset with 600 features. The processes of convergence are shown in Figure 4 and Figure 5. The number of features in optimized feature subset significantly decreased from 278 to 115 when the number of generation increased from 1 to 1400, and corresponding fitness value is significantly increased. Subsequently, the number of features and fitness value maintained invariable. It clearly shows a premature convergence. However, the number of features decreased from 113 to 92 when the number of generation increased from 1800 to 3100, indicating GA has ability to escape from local optima. The finally optimized feature subset with 91 features can be obtained within 3200 generations. Therefore, we developed a classifier by the features from the optimized feature subset for classifying the subfamilies of the rhodopsin-like family.
The composition of optimized feature subset is shown in Figure 6. The optimized feature subset contains 3 AAC, 17 DipC, 3 C, 6 D, 18 NMBAuto, 31 MAuto, 6 GAuto, 2 CHP and 5 DHP features. The results suggest that the order of these feature groups that contributed to the prediction subfamily belonging to the rhodopsin-like family is: MAuto > NMBAuto > DipC > D and GAuto > DHP > AAC and C > CHP.
Comparison with GPCR-CA
Comparison of different method by the jackknife at superfamily level
Success rates obtained with the GPCR-SVMFS predictor by jackknife test at subfamily level
Number of proteins
Number of correct prediction
Q i /Q(%)
It can be seen from Table 1 that the accuracy, sensitivity, specificity and MCC by GPCR-SVMFS are 97.81%, 97.04%, 98.61% and 0.9563, respectively, which are 4.7% to 7.6% improvement over GPCR-CA method . The results indicated that the GPCR-SVMFS can identify GPCRs from non-GPCRs with high accuracy using optimized feature subset as the sequence feature.
As can be seen from Figure 7, the overall accuracy of GPCR-SVMFS is 99.18%, which is almost 15% higher than that of GPCR-CA. Furthermore, the accuracies of fungal pheromone, cAMP and frizzled/smoothened family are dramatically improved. The accuracy by GPCR-SVMFS for fungal pheromone family is 100%, approximately 93% higher than the accuracy by the GPCR-CA. The accuracies of cAMP and frizzled/smoothened are 100% and 94.12% based on GPCR-SVMFS, approximately 40% and 47% higher than the accuracy by the GPCR-CA, respectively. In additional, as for secretin and metabotropic glutamate/pheromone family, the predictive accuracies are 97.44% and 97.73% by GPCR-SVMFS, approximately 23% and 16% higher than those of GPCR-CA, indicating GPCR-SVMFS is effective and helpful for the prediction of GPCRs at family level.
As shown in Table 2, the accuracies of amine, peptide, rhodopsin, olfactory and other are 93.48%, 98.61%, 88.24% and 94.12%, respectively. Meanwhile, we also have notice that the accuracy of nucleotide is lower than that of amine, peptide, rhodopsin, olfactory, which may be caused by the less protein samples contained in nucleotide class. Although the accuracy for nucleotide is only 76.92%, the overall accuracy is 94.53% for identifying subfamiliy, indicating the current method can yield quite reliable results at subfamily level.
Comparison with GPCRPred
Furthermore, in order to roundly evaluate our method we also performed it on another dataset used in GPCRPred , which is a three-layer classifier based on SVM. In the classifier, DipC is used for characterizing GPCRs at the levels of superfamily, family and subfamily. The dataset obtained from GPCRPred contains 778 GPCRs and 99 non-GPCRs. The 778 GPCRs can be divided into 5 families: 692 class A-rhodopsin and andrenergic, 56 class B-calcitonin and parathyroid hormone, 16 class C-metabotropic, 11 class D-pheromone and 3 class E-cAMP. The class A at subfamily level is composed of 14 major classes and sequences are from the work of Karchin .
The performance of GPCR-SVMFS and GPCRPred at superfamily level
The performance of GPCR-SVMFS and GPCRPred at family level
Q i /Q(%)
The performance of GPCR-SVMFS and GPCRPred at subfamily level
Class A subfamilies
Number of proteins
Q i /Q(%)
Platelet activating factor
Gonadotrophin releasing hormone
Thyrotropin releasing hormone
Predictive power of GPCR-SVMFS
In order to test the performance of GPCR-SVMFS to identify orphan GPCRs, a dataset (we called it as "deorphan") containing 274 orphan proteins are collected from the GPCRDB database (released on 2006). We further verify the 274 orphan proteins by searching accession number in the latest version of GPCRDB (released on 2009). The results indicated that 8 proteins, 19 proteins and 2 proteins belong to amine, peptide and nucleotide respectively. Finally, the dataset of 29 proteins is constructed (The dataset can be obtained from Additional file 4.
The GPCR-SVMFS is able to accurately identify 13 peptides from 19 proteins, and 2 nucleotides are completed recognized. However, none of the 8 amines is correctly identified. So, overall success rate is 19/29 = 51.72%. The result is higher than that of completely randomized prediction, because the rate of correct identification by randomly assignment is 1/6 = 16.67% if the protein samples are completely randomly distributed among the 6 possible subfamilies (i.e. amine, peptide, rhodopsin, olfactory, nucleotide and other). The results imply that GPCR-SVMFS is indeed powerful to identify orphan GPCRs.
In addition, the prediction power of GPCR-SVMFS is also evaluated at family level and subfamily level by using 8 independent dataset, which are collected based on the GPCRDB (released on 2009). Three of the 8 dataset at family level are rhodopsin-like, metabotropic and secretin-like, which contains 20290, 1194 and 1484 proteins, respectively. Other 5 dataset at subfamily level are amine, peptide, rhodopsin, olfactory and nucleotide. The 5 dataset is composed of 1840, 4169, 1376, 9977 and 576 proteins, respectively (8 dataset are given in Additional file 5, 6, 7, 8, 9, 10, 11, 12).
The prediction power of GPCR-SVMFS to independent dataset at family level
Number of proteins
Number of correct prediction
Q i /Q(%)
The prediction power of GPCR-SVMFS to independent dataset at subfamily level
Number of proteins
Number of correct prediction
Q i /Q(%)
With the rapid increment of protein sequence data, it is indispensable to develop an automated and reliable method for classification of GPCRs. In this paper, a three-layer classifier is proposed for GPCRs by coupling SVM with feature selection method. Compared with existing methods, the proposed method provides better predictive performance, and high accuracies for superfamily, family and subfamily of GPCRs in jackknife cross-validation test, indicating the investigation of optimized features subset are quite promising, and might also hold a potential as a useful technique for the prediction of other attributes of protein.
The authors acknowledge financial support from the National Natural Science Foundation of China (No. 20975117; 20805059), Ph.D. Programs Foundation of Ministry of Education of China (No. 20070558010) and Natural Science Foundation of Guangdong Province (No. 7003714)
- Papasaikas PK, Bagos PG, Litou ZI, Hamodrakas SJ: A novel method for GPCR recognition and family classification from sequence alone using signatures derived from profile hidden markov models. SAR QSAR Environ Res 2003, 14: 413–420. 10.1080/10629360310001623999View ArticlePubMedGoogle Scholar
- Gao QB, Wang ZZ: Classification of G-protein coupled receptors at four levels. Protein Eng Des Sel 2006, 19: 511–516. 10.1093/protein/gzl038View ArticlePubMedGoogle Scholar
- Eo HS, Choi JP, Noh SJ, Hur CG, Kim W: A combined approach for the classification of G protein-coupled receptors and its application to detect GPCR splice variants. Comput Biol Chem 2007, 31: 246–256. 10.1016/j.compbiolchem.2007.05.002View ArticlePubMedGoogle Scholar
- Baldwin JM: Structure and function of receptors coupled to G proteins. Curr Opin Cell Biol 1994, 6: 180–190. 10.1016/0955-0674(94)90134-1View ArticlePubMedGoogle Scholar
- Lefkowitz RJ: The superfamily of heptahelical receptors. Nat Cell Biol 2000, 2: e133-e136. 10.1038/35017152View ArticlePubMedGoogle Scholar
- Qian B, Soyer OS, Neubig RR, Goldstein RA: Depicting a protein's two faces: GPCR classification by phylogenetic tree-based HMMs. FEBS Lett 2003, 554: 95–99. 10.1016/S0014-5793(03)01112-8View ArticlePubMedGoogle Scholar
- Chou KC, Elord DW: Bioinformatical analysis of G-protein-coupled receptors. J Proteome Res 2002, 1: 429–433. 10.1021/pr025527kView ArticlePubMedGoogle Scholar
- Karchin R, Karplus K, Haussler D: Classifying G-protein coupled receptors with support vector machines. Bioinformatics 2002, 18: 147–159. 10.1093/bioinformatics/18.1.147View ArticlePubMedGoogle Scholar
- Hebert TE, Bouvier M: Structural and functional aspects of G protein-coupled receptor oligomerization. Biochem Cell Biol 1998, 76: 1–11. 10.1139/bcb-76-1-1View ArticlePubMedGoogle Scholar
- Xiao X, Wang P, Chou KC: GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 2009, 30: 1414–1423. 10.1002/jcc.21163View ArticlePubMedGoogle Scholar
- Lapinsh M, Prusis P, Uhlen S, Wikberg JES: Improved approach for proteochemometrics modeling: application to organic compound-amino G protein-coupled receptor interactions. Bioinformatics 2005, 21: 4289–4296. 10.1093/bioinformatics/bti703View ArticlePubMedGoogle Scholar
- Lapinsh M, Gutcaits A, Prusis P, Post C, Lundstedt T, Wikberg JE: Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci 2002, 11: 795–805. 10.1110/ps.2500102View ArticlePubMedPubMed CentralGoogle Scholar
- Inoue Y, Ikeda M, Shimizu T: Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput Biol Chem 2004, 28: 39–49. 10.1016/j.compbiolchem.2003.11.003View ArticlePubMedGoogle Scholar
- Bhasin M, Raghava GPS: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protien coupled receptors. Nucleic Acids Res 2004, 32: W383-W389. 10.1093/nar/gkh416View ArticlePubMedPubMed CentralGoogle Scholar
- Gupta R, Mittal A, Singh K: A novel and efficient technique for identification and classification of gpcrs. IEEE Trans Inf Technol Biomed 2008, 12: 541–548. 10.1109/TITB.2007.911308View ArticlePubMedGoogle Scholar
- Bhasin M, Raghava GP: GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res 2005, 33: W143-W147. 10.1093/nar/gki351View ArticlePubMedPubMed CentralGoogle Scholar
- Guo YZ, Li M, Lu M, Wen Z, Wang K, Li G, Wu J: Classifying G protein-coupled receptors and nuclear receptors on the basis of protein power spectrum from fast fourier transform. Amino Acids 2006, 30: 397–402. 10.1007/s00726-006-0332-zView ArticlePubMedGoogle Scholar
- Papasaikas PK, Bagos PG, Litou ZI, Promponas VJ, Hamodrakas SJ: PRED-GPCR: GPCR recognition and family classification server. Nucleic Acids Res 2004, 32: W380-W382. 10.1093/nar/gkh431View ArticlePubMedPubMed CentralGoogle Scholar
- Elrod DW, Chou KC: A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng 2002, 15: 713–715. 10.1093/protein/15.9.713View ArticlePubMedGoogle Scholar
- Chou KC: Prediction of G-protein-coupled receptor classes. J Proteome Res 2005, 4: 1413–1418. 10.1021/pr050087tView ArticlePubMedGoogle Scholar
- Khan A, Khan MF, Choi TS: Proximity based GPCRs prediction in transform domain. Biochem Biophys Res Commun 2008, 371: 411–415. 10.1016/j.bbrc.2008.04.074View ArticlePubMedGoogle Scholar
- Huang Y, Cai J, Ji L, Li Y: Classifying G-protein coupled receptors with bagging classification tree. Comput Biol Chem 2004, 28: 275–280. 10.1016/j.compbiolchem.2004.08.001View ArticlePubMedGoogle Scholar
- Davies MN, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: On the hierarchical classification of G protein-coupled receptors. Bioinformatics 2007, 23: 3113–3118. 10.1093/bioinformatics/btm506View ArticlePubMedGoogle Scholar
- Wen Z, Li M, Li Y, Guo Y, Wang K: Delaunay triangulation with partial least squares projection to latent structures: a model for G-protein coupled receptors classification and fast structure recognition. Amino Acids 2007, 32: 277–283. 10.1007/s00726-006-0341-yView ArticlePubMedGoogle Scholar
- Guo J, Chen H, Sun Z, Lin Y: A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 2004, 54: 738–743. 10.1002/prot.10634View ArticlePubMedGoogle Scholar
- Kumar M, Bhasin M, Natt NK, Raghava GP: BhairPred: prediction of β-hairpins in a protein from multiple alignment information using ANN and SVM techniques. Nucleic Acids Res 2005, 33: W154-W159. 10.1093/nar/gki588View ArticlePubMedPubMed CentralGoogle Scholar
- Chou KC, Cai YD: Using functional domain composition and support vector machines for prediction of protein subcellular location. J Biol Chem 2002, 277: 45765–45769. 10.1074/jbc.M204161200View ArticlePubMedGoogle Scholar
- Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17: 721–728. 10.1093/bioinformatics/17.8.721View ArticlePubMedGoogle Scholar
- Cai YD, Zhou GP, Chou KC: Support vector machines for predicting membrane protein types by using functional domain composition. Biophys J 2003, 84: 3257–3263. 10.1016/S0006-3495(03)70050-2View ArticlePubMedPubMed CentralGoogle Scholar
- Chou KC: Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001, 43: 246–255. 10.1002/prot.1035View ArticlePubMedGoogle Scholar
- Qiu JD, Huang JH, Liang RP, Lu XQ: Predction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. Anal Biochem 2009, 390: 68–73. 10.1016/j.ab.2009.04.009View ArticlePubMedGoogle Scholar
- Lin WZ, Xiao X, Chou KC: GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis. Protein Eng Des Sel 2009, 22: 699–705. 10.1093/protein/gzp057View ArticlePubMedGoogle Scholar
- Xiao X, Lin WZ, Chou KC: Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008, 29: 2018–2024. 10.1002/jcc.20955View ArticlePubMedGoogle Scholar
- Xiao X, Wang P, Chou KC: Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008, 254: 691–696. 10.1016/j.jtbi.2008.06.016View ArticlePubMedGoogle Scholar
- Xiao X, Wang P, Chou KC: Predicting protein quaternary structural attribute by hybridizing functional domain composition and pseudo amino acid composition. J Appl Crystallogr 2009, 42: 169–173. 10.1107/S0021889809002751View ArticleGoogle Scholar
- Xiao X, Lin WZ: Application of protein grey incidence degree measure to predict protein quaternary structural types. Amino Acids 2009, 37: 741–749. 10.1007/s00726-008-0212-9View ArticlePubMedGoogle Scholar
- Chen C, Chen LX, Zou XY, Cai PX: Predicting protein structural class based on multi-features fusion. J Theor Biol 2008, 253: 388–392. 10.1016/j.jtbi.2008.03.009View ArticlePubMedGoogle Scholar
- Gao QB, Ye XF, Jin ZC, He J: Improving discrimination of outer membrane proteins by fusing different forms of pseudo amino acid composition. Analy Biochem 2010, 398: 52–59. 10.1016/j.ab.2009.10.040View ArticleGoogle Scholar
- Gao QB, Jin ZC, Ye XF, Wu C, He J: Prediction of unclear receptors with optimal pseudo amino acid composition. Anal Biochem 2009, 387: 54–59. 10.1016/j.ab.2009.01.018View ArticlePubMedGoogle Scholar
- Ma S, Huang J: Penalized feature selection and classification in bioinformatics. Brief Bioinform 2008, 9: 392–403. 10.1093/bib/bbn027View ArticlePubMedPubMed CentralGoogle Scholar
- Xiong M, Fang X, Zhao J: Biomarker identification by feature wrappers. Genome Res 2001, 11: 1878–1887.PubMedPubMed CentralGoogle Scholar
- Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6: 148. 10.1186/1471-2105-6-148View ArticlePubMedPubMed CentralGoogle Scholar
- Ooi CH, Tan P: Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 2003, 19: 37–44. 10.1093/bioinformatics/19.1.37View ArticlePubMedGoogle Scholar
- Li L, Weinberg CR, Darden TA, Pedersen LG: Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17: 1131–1142. 10.1093/bioinformatics/17.12.1131View ArticlePubMedGoogle Scholar
- Gevaert O, Smet FD, Timmerman D, Moreau Y, Moor BD: Predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks. Bioinformatics 2006, 22: e184-e190. 10.1093/bioinformatics/btl230View ArticlePubMedGoogle Scholar
- Liu H, Li J, Wong L: A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome inform 2002, 13: 51–60.PubMedGoogle Scholar
- Prados J, Kalousis A, Sanchez JC, Allard L, Carrette O, Hilario M: Mining mass spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics 2004, 4: 2320–2332. 10.1002/pmic.200400857View ArticlePubMedGoogle Scholar
- Li L, Umbach DM, Terry P, Taylor JA: Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 2004, 20: 1638–1640. 10.1093/bioinformatics/bth098View ArticlePubMedGoogle Scholar
- Ressom HW, Varghese RS, Drake SK, Hortin GL, Abdel-Hamid M, Loffredo C A: Goldman R: Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics 2007, 23: 619–626. 10.1093/bioinformatics/btl678View ArticlePubMedGoogle Scholar
- Bhanot G, Alexe G, Venkataraghavan B, Levine AJ: A robust meta-classification strategy for cancer detection from MS data. Proteomics 2006, 6: 592–604. 10.1002/pmic.200500192View ArticlePubMedGoogle Scholar
- Peng H, Long F, Ding C: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005, 27: 1226–1238. 10.1109/TPAMI.2005.159View ArticlePubMedGoogle Scholar
- JH: Adaptation in Natural and Artificial Systems. The University of Michigan Press, USA 1975.Google Scholar
- Horn F, Weare J, Beukers MW, Horsch S, Bairoch A, Chen W, Edvardsen O, Campagne F, Vriend G: GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res 1998, 26: 275–279. 10.1093/nar/26.1.275View ArticlePubMedPubMed CentralGoogle Scholar
- Dubchak I, Muchnik I, Holbrook SR, Kim SH: Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 1995, 92: 8700–8704. 10.1073/pnas.92.19.8700View ArticlePubMedPubMed CentralGoogle Scholar
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res 2000, 28: 374. 10.1093/nar/28.1.374View ArticlePubMedPubMed CentralGoogle Scholar
- Lim VI: Algorithms for prediction of α-helical and β-structural regions in globular proteins. J Mol Biol 1974, 88: 873–894. 10.1016/0022-2836(74)90405-7View ArticlePubMedGoogle Scholar
- Schiffer M, Edmundson AB: Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys J 1967, 7: 121–136. 10.1016/S0006-3495(67)86579-2View ArticlePubMedPubMed CentralGoogle Scholar
- Rose GD, Geselowitz AR, lesser GJ, Lee RH, Zehfus MH: Hydrophobicity of amino acid residues in globular proteins. Science 1985, 229: 834–838. 10.1126/science.4023714View ArticlePubMedGoogle Scholar
- Zhang HH, Ahn J, Lin X, Park C: Gene selection using support vector machines with non-convex penalty. Bioinformatics 2006, 22: 88–95. 10.1093/bioinformatics/bti736View ArticlePubMedGoogle Scholar
- Segal MR, Dahlquist KD, Conklin BR: Regression approaches for microarray data analysis. J Comput Biol 2003, 10: 961–980. 10.1089/106652703322756177View ArticlePubMedGoogle Scholar
- Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23: 2507–2517. 10.1093/bioinformatics/btm344View ArticlePubMedGoogle Scholar
- Lv QZ, Shen GL, Yu RQ: A chaotic approach to maintain the pupulation diversity of genetic algorithm in network training. Comput Biol Chem 2003, 27: 363–371. 10.1016/S1476-9271(02)00083-XView ArticleGoogle Scholar
- Li ZC, Zhou XB, Lin YR, Zou XY: Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008, 35: 580–590.Google Scholar
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines.[http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Matthews BW: Comparison of predicted and observed secondary structure of T4 phage Iysozyme. Biochem Biophys Acta 1975, 405: 442–451.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.