- Research article
- Open Access
An improved classification of G-protein-coupled receptors using sequence-derived features
© Peng et al; licensee BioMed Central Ltd. 2010
Received: 1 March 2010
Accepted: 9 August 2010
Published: 9 August 2010
G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is therefore valuable to develop a computational method that accurately predicts GPCRs from protein primary sequences.
We propose a new method, called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497 sequence-derived features. Principal component analysis is first employed to reduce the dimension of the feature space to 32. The resulting 32-dimensional feature vectors are then fed into a simple yet powerful classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, it is further assigned to a family, subfamily, sub-subfamily and subtype by the classifiers at the second, third, fourth, and fifth levels, respectively. To train the classifiers applied at the five levels, a non-redundant dataset is carefully constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife tests on this training dataset show that the overall accuracies of PCA-GPCR at the five levels (from the first to the fifth) reach 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a dataset of 1238 GPCRs at the second level, and on two other datasets of 167 and 566 GPCRs at the fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing methods under comparison.
The comprehensive set of 1497 features is believed to capture information about amino acid composition, sequence order, and various physicochemical properties of proteins. Therefore, high accuracies are achieved when predicting GPCRs at all five levels with our proposed method.
The structure of a G-protein-coupled receptor (GPCR) generally comprises seven α-helical transmembrane domains, an extracellular N-terminus, and an intracellular C-terminus. GPCRs constitute one of the largest families of membrane proteins, and their main function is to transduce extracellular signals into intracellular reactions. Therefore, they play a key role in diverse physiological processes such as neurotransmission, secretion, cellular differentiation, cellular metabolism, and so forth. It has been estimated that almost two-thirds of drugs on the market interact with GPCRs, which indicates that GPCRs are pharmacologically important. Both academic and industrial researchers are therefore very interested in studying GPCRs to understand their structures and functions. Unfortunately, the 3D protein structures of GPCRs are largely unavailable, except for bovine rhodopsin. Although some advanced biotechnologies such as NMR make it possible to determine 3D protein structures, the experiments are generally very time-consuming and costly. In contrast, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is therefore valuable to develop a computational method to predict GPCRs from protein primary sequences.
In this paper we look into the following classification problem, which is referred to as a five-level classification problem. Given a protein sequence, we need to determine whether it is a GPCR or a non-GPCR. If it is predicted to be a GPCR, we need to further determine which family, subfamily, sub-subfamily, and subtype it belongs to. To tackle this problem, a set of distinct classifiers is generally needed for each level, as depicted in Figure 1. In the literature, many computational methods have been proposed to predict GPCRs. However, to the best of our knowledge, no existing method deals with the five-level problem completely (i.e., makes predictions at all five levels). For example, the methods presented in [9–12] predict GPCRs at a single level only (the second, third or fourth level), other methods predict GPCRs only at the third and fourth levels, and still others consider three or four levels.
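The five-level scheme described above can be read as a cascade in which each level's prediction selects the classifier used at the next level. The following is a minimal sketch of that control flow only; the function name `predict_five_levels`, the `is_gpcr` callable, and the `level_classifiers` mapping are hypothetical placeholders, not part of the published method.

```python
from typing import Callable, Dict, List, Sequence

# The four GPCRDB levels below the GPCR / non-GPCR decision.
LEVELS = ["family", "subfamily", "sub-subfamily", "subtype"]


def predict_five_levels(
    features: Sequence[float],
    is_gpcr: Callable[[Sequence[float]], bool],
    level_classifiers: Dict[str, Callable[[Sequence[float], str], str]],
) -> List[str]:
    """Return the predicted labels from the first level down to the fifth.

    Each level classifier receives the feature vector and the label chosen
    at the previous level, so it can restrict itself to that parent class.
    """
    if not is_gpcr(features):
        return ["non-GPCR"]  # the cascade stops at the first level
    labels = ["GPCR"]
    parent = "GPCR"
    for level in LEVELS:
        parent = level_classifiers[level](features, parent)
        labels.append(parent)
    return labels
```

A practical implementation would also need to stop early for classes that have no trained classifier at a deeper level; that case is left out here for brevity.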
Academic and industrial researchers alike are interested in the functional roles of GPCRs at the finest subtype level. This is mainly because each subtype demonstrates its own characteristic ligand binding property, coupling partners of trimeric G-proteins, and interaction partners of oligomerization. Therefore, discriminating the functions of one GPCR subtype from the others (i.e., prediction of GPCRs at the fifth level as shown in Figure 1) becomes very important in the effort to decipher GPCRs. However, this task can be expected to be no easier than the prediction of GPCRs at any of the first four levels. Fortunately, more and more GPCR sequences are now being accumulated in the GPCRDB database, which makes it possible to accurately predict GPCRs at all five levels. This is the main goal of our present study.
A lot of related work has been done previously. In general, there are two important components in a classification task: feature extraction and a classification algorithm. Feature extraction refers to how features are extracted from protein sequences so that each protein is represented as a fixed-length numerical vector. Various methods have been proposed to extract information from protein sequences in the past decades (see, e.g., [15–19]). The commonly used feature extraction methods are based on amino acid composition [9–11] and dipeptide composition [7, 12, 13, 20, 21], and more complicated ones include Chou's pseudo amino acid composition, the cellular automaton image approach, profile hidden Markov models, fast Fourier transform, wavelet-based time series analysis, and Fisher Score Vectors. Once protein sequences are represented by numerical vectors, any general-purpose classification algorithm can be used for classification, for instance, covariant discriminant [9–11, 16], nearest neighbor, bagging classification tree, and support vector machines [12, 20, 21, 23–25].
In this paper, we focus on predicting GPCRs at the five levels. Five groups of descriptors are used to extract information from the amino acid sequences. These five groups are (1) amino acid composition and dipeptide composition, (2) autocorrelation descriptors, (3) global descriptors, (4) sequence-order descriptors, and (5) Chou's pseudo amino acid composition descriptors. These descriptors reflect various physicochemical properties of proteins and have been adopted to predict many other protein attributes, such as protein subcellular localization [19, 26], outer membrane proteins, nuclear receptors, and protein structural classes [17, 18]. By combining these descriptors, a comprehensive set of 1497 features is calculated for each amino acid sequence. By applying principal component analysis to a dataset, we then reduce them to a set of 32 features that retain as much of the data variability as possible.
Finally, a simple yet powerful algorithm called intimate sorting is employed to predict GPCRs, and the experimental tests on the benchmark datasets show that the classifications can be improved. Jackknife tests show that the overall accuracies of the proposed method at the first, second, third, fourth, and fifth levels reach 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. Comparisons with several existing methods show that the proposed method consistently achieves higher prediction performance.
Results and Discussion
Predicting GPCR at five levels
Comparison with BLAST-based classification
The most straightforward method for predicting GPCRs might be based on homology search with sequence alignment tools such as BLAST and PSI-BLAST. A given GPCR sequence is then assigned to the class of its most similar GPCR sequence. However, as pairwise sequence similarities get lower, such an alignment-based method rarely yields satisfactory predictions. For instance, when applied to the dataset GDFL for the prediction at the first level, the BLAST-based method achieved an overall accuracy of 74.58%, which is 14.92% lower than that of PCA-GPCR. Note that PCA-GPCR is instead an alignment-free method. These experimental results therefore suggest that an alignment-free method is very promising for highly accurate prediction of GPCR classes.
Comparison with previous methods
In order to demonstrate the superior performance of PCA-GPCR, we make comparisons with a number of previous methods. Depending on the predictive capability of previous methods, the comparisons are made at a single level and at the first two levels, as follows.
Comparison at a single level
The number of proteins in four datasets and the corresponding prediction accuracies.
Comparison with GPCR-CA at the first two levels
We further compare our method with GPCR-CA on the dataset D365, which comprises GPCRs from the second level. Unlike the datasets tested in the above subsection, D365 contains almost no high-homology sequence pairs. Note that GPCR-CA is able to predict GPCRs at the first two levels.
Comparison with GPCR-CA in identifying the GPCRs and non-GPCRs.
Comparison with GPCR-CA for the dataset D365 in predicting GPCR families.
GPCR-CA extracts 24 features, including 20 features from amino acid composition and four features from the cellular automaton image. While the last four features were reported to reveal a protein's overall sequence patterns, four features alone might not suffice to reveal these patterns completely. In contrast, our method explores the amino acid sequences comprehensively to gain as much information from the protein primary sequences as possible. Both the amino acid composition and the dipeptide composition are utilized in our method and, moreover, the important sequence-order information and a variety of physicochemical properties of amino acids are carefully explored as well. We believe that it is this comprehensive set of features that leads our method to a higher prediction accuracy.
In this paper, we have proposed a new method called PCA-GPCR to predict GPCRs at five levels. In this method, a comprehensive set of 1497 sequence-derived features is generated from five groups of descriptors: amino acid composition and dipeptide composition, autocorrelation descriptors, global descriptors, sequence-order descriptors, and Chou's pseudo amino acid composition descriptors. These features capture information about the amino acid composition, sequence order, and various physicochemical properties of proteins. Because of the high dimensionality of the feature space, principal component analysis is used to reduce the dimension from 1497 to 32. The resulting 32-dimensional feature vectors are finally fed into a simple yet powerful intimate sorting algorithm for the prediction of GPCRs at five levels.
Evaluated on the datasets constructed from the latest version of the GPCRDB database, the overall accuracies of our method from the first level to the fifth level are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further test and compare our method with several other methods on four benchmark datasets widely used in the literature. At the second level, for a dataset containing 1238 GPCRs, the overall accuracy of our method reaches 99.76%. At the fourth level, for two different datasets that contain 167 and 566 GPCRs, the overall accuracies of our method reach 98.2% and 97.88%, respectively. They are all higher than those of the other methods under comparison. At the first two levels, we further test our method on a low-homology dataset (with only a few sequence pairs of more than 40% sequence identity). The overall accuracies achieved at the first and second levels are 95.21% and 92.6%, respectively, which are 3.57% and 9.04% higher than those of the method GPCR-CA.
We conclude that the high prediction accuracy of the proposed method is attributable to the comprehensive set of features that we constructed from five groups of descriptors. It is anticipated that our method could contribute to the characterization of novel proteins and help gain new insights into their functions, thereby facilitating drug discovery. A web server that predicts GPCRs at five levels with our proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/PCA_GPCR.
We construct a collection of non-redundant datasets from the latest release of the GPCRDB database (Version 9.9.1, September 2009) to train and evaluate the classifiers for GPCR prediction. As mentioned in the Background section, the sequences in the GPCRDB database are organized in four levels: family or class, subfamily, sub-subfamily, and subtype. We download the GPCR sequences from the GPCRDB database and then filter out the high-homology sequences using the program CD-HIT. In order to ensure that there are enough sequences to train the classifiers, we apply different thresholds in CD-HIT for sequences at different levels. They are 0.4, 0.7, 0.8, and 0.9 for the family, subfamily, sub-subfamily, and subtype levels, respectively. After filtering, only families (subfamilies, sub-subfamilies, and subtypes) with more than 10 sequences are retained for training classifiers. Because the fifth family (Taste receptors T2R) has no subfamily and only 14 of its sequences remain after filtering by CD-HIT, it is ignored in subsequent analysis. In the end, we obtained 1589, 4772, 4924, and 2741 GPCRs at the family, subfamily, sub-subfamily and subtype levels, respectively. The names of the families, subfamilies, sub-subfamilies, and subtypes, together with the number of GPCR proteins retained at each level, are listed in Additional file 1.
The GPCR protein sequences retained at the family level are used to construct a positive dataset for training and evaluation. A negative dataset of non-GPCRs is then constructed in almost the same way as in previous work, except that the latest version of ASTRAL SCOP (Version 1.75) is used. First, we download the sequences that have less than 40% identity to each other (i.e., the file with the name "seq.75;item = seqs;cut = 40"). Then, we remove those sequences of length less than 30, and those having identity above 40% using CD-HIT. Finally, a total of 10325 sequences remain, from which 1589 sequences are randomly selected to form a negative dataset. Because these selected proteins are organized into five levels, for the sake of convenience, we call them the datasets GDFL (GPCR Datasets in Five Levels). They are available at the web server provided in this paper.
In addition, in order to compare directly with other existing methods, four benchmark datasets from previous studies are also used in this study. For the sake of simplicity, they are referred to as D167, D566, D1238 and D365, respectively. All of them were constructed from older versions of the GPCRDB database. The proteins in the dataset D167 (belonging to the fourth level) are classified into four sub-subfamilies: (1) acetylcholine, (2) adrenoceptor, (3) dopamine, and (4) serotonin. The dataset D566 (belonging to the fourth level) instead comprises proteins in seven sub-subfamilies: (1) adrenoceptor, (2) chemokine, (3) dopamine, (4) neuropeptide, (5) olfactory type, (6) rhodopsin, and (7) serotonin. The dataset D1238 (belonging to the second level) comprises proteins from three families: (1) rhodopsin-like, (2) secretin-like, and (3) metabotropic/glutamate/pheromone. The last dataset D365 (belonging to the second level) comprises proteins in six families: (1) rhodopsin-like, (2) secretin-like, (3) metabotropic/glutamate/pheromone, (4) fungal pheromone, (5) cAMP receptor, and (6) frizzled/smoothened family. The numbers of proteins in the above four datasets are given in Table 1. Furthermore, 365 non-GPCR sequences are taken from the Swiss-Prot database to serve as a negative dataset against D365.
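The class-size rule applied after redundancy filtering can be illustrated with a short sketch. It assumes the CD-HIT step has already been run externally and that its output is available as (sequence id, class label) pairs; the function name `keep_large_classes` and the record layout are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# CD-HIT identity thresholds per level, as stated in the text.
CDHIT_THRESHOLDS = {
    "family": 0.4,
    "subfamily": 0.7,
    "sub-subfamily": 0.8,
    "subtype": 0.9,
}

# A class is retained only if it has MORE than this many sequences.
MIN_CLASS_SIZE = 10


def keep_large_classes(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop every class with 10 or fewer sequences left after CD-HIT.

    `records` holds (sequence_id, class_label) pairs for one level.
    """
    counts = Counter(label for _, label in records)
    return [(sid, label) for sid, label in records if counts[label] > MIN_CLASS_SIZE]
```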
The CD-HIT clustering results for the four benchmark datasets.
The physicochemical properties of the amino acids and distances between two amino acids. Recoverable table entries: average flexibility indices; free energy of solution in water; residue accessible surface area in tripeptide; normalized van der Waals volume; charge (positive, neutral, negative); secondary structure (helix, strand, coil); solvent accessibility (buried, exposed, intermediate); relative hydrophobicity (polar, neutral, hydrophobic); Grantham chemical distance; Schneider-Wrede physicochemical distance.
As mentioned in the introduction, amino acid composition has been widely used to transform GPCR sequences into 20-dimensional numerical vectors [9–11]. However, the sequence order information is then completely lost. To address this issue, dipeptide composition was proposed to represent GPCR sequences by 400-dimensional vectors, which capture local-order information and have been reported to improve classifications [7, 12, 13, 20, 21]. Recently, GPCR-CA utilized the concept of Chou's pseudo amino acid composition to represent each protein sequence by 24 features. The first 20 features correspond to the amino acid composition and the remaining four features are calculated from a so-called cellular automaton image. These four features were shown to be capable of reflecting a protein's overall sequence pattern. Inspired by this work, we seek a new set of features that captures as much information as possible from GPCR sequences. To this end, we investigate the following five groups of features, where the parameters are set to the same values as in previous work.
Amino acid composition (AAC) and dipeptide composition (DC)
Amino acid composition is defined as the occurrence frequencies of the 20 amino acids in a protein sequence:

$$f_A(i) = \frac{n_A(i)}{L},$$

where each i = 1, 2, ⋯, 20 corresponds to a distinct amino acid and $n_A(i)$ is the number of occurrences of amino acid i in the protein sequence of length L. Dipeptide composition is defined analogously over the L − 1 adjacent residue pairs of the sequence:

$$f_D(i) = \frac{n_D(i)}{L - 1},$$

where each i = 1, 2, ⋯, 400 corresponds to one of the 400 dipeptides and $n_D(i)$ is the number of occurrences of dipeptide i in the sequence.
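As an illustration, the two compositions can be computed as follows. This is a minimal sketch, assuming that dipeptide counts are taken over overlapping adjacent pairs and normalized by L − 1; the function names `aac` and `dc` and the alphabetical ordering of the amino acids are choices made here, not taken from the paper.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(seq: str) -> List[float]:
    """20 amino acid frequencies n_A(i) / L."""
    length = len(seq)
    return [seq.count(a) / length for a in AMINO_ACIDS]


def dc(seq: str) -> List[float]:
    """400 dipeptide frequencies n_D(i) / (L - 1) (assumed normalization)."""
    counts = {d: 0 for d in DIPEPTIDES}
    # Slide a window of width 2 so that overlapping pairs are all counted.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]
```

Together the two functions give the 420 (= 20 + 400) features of this group.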
Autocorrelation descriptors (AD)
where $P_0(i)$ is the property value of amino acid i, and the mean and standard deviation of the property are taken over the 20 amino acids.
where $R_i$ and $R_{i+d}$ are the amino acids at positions i and i + d along the protein sequence, respectively. As mentioned earlier, we use the same parameter values as in previous work, so the maximum value of d is 30.
where $\bar{P}$ is the average value of the property of interest along the sequence.
For each autocorrelation descriptor, we obtain 240 (= 30 × 8) features. In total, 720 (= 240 × 3) features will be obtained to describe a protein sequence.
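The counts above (three descriptors, 30 lags, eight properties) are consistent with the normalized Moreau-Broto, Moran, and Geary autocorrelations as they are commonly defined in the protein-descriptor literature. The sketch below assumes exactly those three definitions; the formulas in the code comments, the function names, and the use of the population standard deviation are assumptions made for illustration rather than a transcription of the paper's equations.

```python
from typing import Dict

import numpy as np


def normalize_property(raw: Dict[str, float]) -> Dict[str, float]:
    """Centre and scale a property over the amino acids it is defined for."""
    values = np.array(list(raw.values()), dtype=float)
    mean, std = values.mean(), values.std()
    return {aa: (v - mean) / std for aa, v in raw.items()}


def autocorrelations(seq: str, prop: Dict[str, float], max_lag: int = 30):
    """Return (moreau_broto, moran, geary), each of length max_lag.

    Assumed definitions for lag d, with P_i the property of residue i:
      Moreau-Broto: sum(P_i * P_{i+d}) / (L - d)
      Moran:  [sum((P_i - Pbar)(P_{i+d} - Pbar)) / (L - d)] / [sum((P_i - Pbar)^2) / L]
      Geary:  [sum((P_i - P_{i+d})^2) / (2 (L - d))] / [sum((P_i - Pbar)^2) / (L - 1)]
    """
    p = np.array([prop[aa] for aa in seq], dtype=float)
    length = len(p)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    centred = p - p.mean()
    var_sum = float(np.sum(centred ** 2))
    mb, moran, geary = [], [], []
    for d in range(1, max_lag + 1):
        a, b = p[:-d], p[d:]
        n = length - d
        mb.append(np.sum(a * b) / n)
        moran.append((np.sum(centred[:-d] * centred[d:]) / n) / (var_sum / length))
        geary.append((np.sum((a - b) ** 2) / (2 * n)) / (var_sum / (length - 1)))
    return np.array(mb), np.array(moran), np.array(geary)
```

Calling `autocorrelations` once per property with `max_lag=30` yields 3 × 30 values, so eight properties give the 720 features mentioned above. A sequence on which the property is constant makes the Moran and Geary denominators zero and would need separate handling.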
Global descriptors (GD)
These descriptors were first proposed by Dubchak et al.  to predict protein folding classes, and later applied to predict human Pol II promoter sequences . They are constructed as follows. Firstly, given each of the following seven amino acid properties: normalized van der Waals volume, polarity, polarizability, charge, secondary structure, solvent accessibility and relative hydrophobicity (i.e., properties 12-18 listed in Table 6), the 20 amino acids are divided into three groups according to their property values. Then, for a given amino acid sequence, we may obtain a new sequence of three symbols, each corresponding to one group of amino acids. Finally, three groups of quantities are defined on the new sequence; that is, composition (Comp), transition (Tran) and distribution (Dist), as demonstrated below.
For the sake of simplicity, suppose that a sequence is made of only two letters (A and B). Comp is defined as the occurrence frequency of each letter in the sequence. For example, consider the sequence BABBABABBABBAABABABBAAABBABABA, in which there are 14 As and 16 Bs. The occurrence frequencies of A and B are therefore 14/(14 + 16) × 100.00 = 46.67 and 16/(14 + 16) × 100.00 = 53.33, respectively. Tran represents the occurrence frequency of the pairs AB or BA. In the above sequence, there are 21 transitions from one letter to another among the 29 adjacent pairs, so Tran is computed as (21/29) × 100.00 = 72.41. Dist, on the other hand, records the relative positions of the first, 25%, 50%, 75% and 100% of the total amount of a particular letter in the sequence. In the above sequence, for example, the first, 25%, 50%, 75% and 100% of the total amount of the letter B are located at the first, 6th, 12th, 20th and 29th positions, respectively. The quantities Dist for the letter B are hence 1/30 × 100.00 = 3.33, 6/30 × 100.00 = 20.00, 12/30 × 100.00 = 40.00, 20/30 × 100.00 = 66.67 and 29/30 × 100.00 = 96.67. Similarly, we can find the Dist values for the letter A; they are 6.67, 23.33, 53.33, 73.33 and 100.00. In the end, the global descriptors of the above sequence become
(Comp;Tran;Dist) = (46.67, 53.33; 72.41; 6.67, 23.33, 53.33, 73.33, 100.00, 3.33, 20.00, 40.00, 66.67, 96.67)
Suppose there are n distinct symbols in a sequence; then the numbers of features in Comp, Tran, and Dist are n, n(n − 1)/2, and 5 × n, respectively. Recall that the 20 amino acids are divided into three groups by each amino acid property, which leads to a new sequence of three symbols (n = 3). Following the procedure demonstrated above, we obtain 21 (= 3 + 3 + 15) features to describe the new sequence of three symbols. Combining all the features extracted from the seven amino acid properties, we obtain a total of 147 (= 21 × 7) features for each input protein sequence from the global descriptors.
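The Comp, Tran, and Dist quantities can be checked with a few lines of code. The sketch below is written for an arbitrary symbol alphabet; the function name `global_descriptors` is hypothetical, and the rule of rounding the 25%, 50% and 75% ranks down to a whole occurrence is an assumption inferred from the worked two-letter example.

```python
import math
from itertools import combinations
from typing import List


def global_descriptors(seq: str, symbols: str) -> List[float]:
    """Return Comp (n values), Tran (n(n-1)/2 values) and Dist (5n values)."""
    length = len(seq)
    # Comp: percentage of each symbol.
    comp = [100.0 * seq.count(s) / length for s in symbols]
    # Tran: percentage of adjacent pairs that switch between two given symbols.
    tran = []
    for s, t in combinations(symbols, 2):
        switches = sum(1 for x, y in zip(seq, seq[1:]) if {x, y} == {s, t})
        tran.append(100.0 * switches / (length - 1))
    # Dist: relative position of the first, 25%, 50%, 75% and last occurrence.
    dist = []
    for s in symbols:
        positions = [i + 1 for i, x in enumerate(seq) if x == s]
        total = len(positions)
        if total == 0:
            dist.extend([0.0] * 5)
            continue
        ranks = [1]
        ranks += [max(1, math.floor(total * q)) for q in (0.25, 0.50, 0.75)]
        ranks.append(total)
        dist.extend(100.0 * positions[r - 1] / length for r in ranks)
    return comp + tran + dist
```

Evaluating `global_descriptors("BABBABABBABBAABABABBAAABBABABA", "AB")` gives Comp = 46.67 and 53.33, Tran = 21/29 × 100 ≈ 72.41, and the ten Dist values listed for A and B. With three symbols the function returns 3 + 3 + 15 = 21 values.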
Sequence-order descriptors (SD)
where $d(R_i, R_{i+j})$ is one of the above distances between the two amino acids $R_i$ and $R_{i+j}$ located at positions i and i + j, respectively.
where ω is a weighting factor (default ω = 0.1).
We end up with 60 (= 30 × 2) sequence-order-coupling numbers and 100 (= 50 × 2) quasi-sequence-order descriptors. In total, there are 160 features extracted from the sequence-order descriptors.
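The feature counts above (30 coupling numbers and 20 + 30 quasi-sequence-order values per distance matrix) match the usual definitions of these descriptors. The sketch below assumes those usual definitions: the coupling number at lag j is the sum of squared distances between residues j positions apart, and the quasi-sequence-order values share one common denominator. The function names and the `dist` callable are hypothetical; in practice `dist` would look up the Grantham or Schneider-Wrede matrix, which is not reproduced here.

```python
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def coupling_numbers(
    seq: str, dist: Callable[[str, str], float], max_lag: int = 30
) -> List[float]:
    """Assumed form: tau_j = sum over i of d(R_i, R_{i+j})^2 for j = 1..max_lag."""
    length = len(seq)
    return [
        sum(dist(seq[i], seq[i + j]) ** 2 for i in range(length - j))
        for j in range(1, max_lag + 1)
    ]


def quasi_sequence_order(
    seq: str,
    dist: Callable[[str, str], float],
    max_lag: int = 30,
    omega: float = 0.1,
) -> List[float]:
    """20 composition terms followed by max_lag coupling terms (assumed form)."""
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    taus = coupling_numbers(seq, dist, max_lag)
    denom = sum(freqs) + omega * sum(taus)
    return [f / denom for f in freqs] + [omega * t / denom for t in taus]
```

With `max_lag=30`, each distance matrix contributes 30 coupling numbers and 50 quasi-sequence-order values, so two matrices give the 160 features of this group.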
Chou's pseudo amino acid composition descriptors (PseAAC)
where ω is a weighting factor (default ω = 0.1). This generates 50 features from Chou's pseudo amino acid composition descriptors.
The number of features in each group of descriptors.
Group of descriptors | Number of features
Amino acid composition (AAC) and dipeptide composition (DC) | 420
Autocorrelation descriptors (AD) | 720
Global descriptors (GD) | 147
Sequence-order descriptors (SD) | 160
Chou's pseudo amino acid composition descriptors (PseAAC) | 50
Total | 1497
Principal component analysis
Principal Component Analysis (PCA) is a classical statistical method that is still widely used in modern data analysis. PCA involves a mathematical procedure that transforms a large number of (possibly) correlated variables into a smaller number of uncorrelated variables, called principal components (PCs), that retain as much variability of the data as possible. Given a data matrix $X = (X_1, X_2, ⋯, X_p)$, where each $X_i$ is a column vector of size n, n is the number of proteins of interest, and p is the number of protein sequence features, a typical PCA is performed as follows.
We can see that each PC(i) is a column vector with size n and the j-th element in PC(i) represents the i-th PC value of protein j. Thereafter, a total of p uncorrelated PCs are obtained.
In order to reduce the dimension of the feature space, only the first m PCs are used to represent each protein sequence (m ≤ p). It is generally hard to determine the optimal value of m. In this study, we aim to find a value of m that could make the overall prediction accuracy of GPCRs as high as possible, which we will further discuss later.
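A typical PCA of this kind can be sketched with NumPy as shown below. The sketch assumes that each feature is standardized before the covariance matrix is decomposed, which is a common choice but is not confirmed by the text; the function names `pca_fit` and `pca_transform` are hypothetical.

```python
import numpy as np


def pca_fit(X: np.ndarray, m: int):
    """Fit PCA on an (n proteins x p features) matrix and keep m components.

    Returns the per-feature mean and standard deviation together with a
    (p x m) matrix whose columns are the leading eigenvectors.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features carry no variance; avoid 0/0
    Z = (X - mean) / std
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest eigenvalues
    return mean, std, eigvecs[:, order]


def pca_transform(X: np.ndarray, mean, std, components) -> np.ndarray:
    """Project proteins onto the retained components; column i holds PC(i)."""
    return ((X - mean) / std) @ components
```

For the setting described here, X would have p = 1497 columns and m would be set to 32. A query protein must be projected with the mean, standard deviation, and eigenvectors obtained from the training set.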
Intimate sorting algorithm
Many classification algorithms in the literature have been used to predict GPCRs, for instance, covariant discriminant [9–11, 16], nearest neighbor, bagging classification tree, and support vector machines [12, 20, 21, 23–25]. In this study, we use a simple yet powerful algorithm called intimate sorting. This algorithm is easy to implement and, unlike some other algorithms (e.g., support vector machines), does not require any parameters to be set.
Given a query protein P and the N proteins $P_1, P_2, ⋯, P_N$ in the training set, each represented by its feature vector, the score between P and $P_i$ is

$$\Phi(P, P_i) = \frac{P \cdot P_i}{\|P\| \, \|P_i\|},$$

where $P \cdot P_i$ is the dot product of the two vectors and $\|P\|$ and $\|P_i\|$ are their norms. When P ≡ $P_i$, it can be easily seen that Φ(P, $P_i$) = 1, suggesting that they are most likely to belong to the same class. In general, we have −1 ≤ Φ(P, $P_i$) ≤ 1. The higher the value of Φ(P, $P_i$), the more likely the two proteins belong to the same class. Among the N proteins in the training set, the one with the highest score with the query protein P is picked out, which we denote by $P_k$, k ∈ [1, N]. If there is a tie, we randomly select one of the tied proteins. In the final step, the intimate sorting algorithm simply assigns P to the same GPCR class as $P_k$.
The prediction accuracy for class i and the overall accuracy are computed as

$$\mathrm{accuracy}(i) = \frac{C(i)}{\mathrm{Tot}(i)}, \qquad \mathrm{overall\ accuracy} = \frac{\sum_{i=1}^{\mu} C(i)}{\sum_{i=1}^{\mu} \mathrm{Tot}(i)},$$

where Tot(i) is the total number of sequences in class i, C(i) is the number of correctly predicted sequences of class i, and μ is the total number of classes under consideration. Note that this prediction assessment method was already adopted in several previous studies, e.g., [7, 15, 16, 24].
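The classification rule can be sketched as a nearest-neighbour search under a similarity score. The sketch assumes that Φ is the cosine of the angle between two feature vectors, which is consistent with the stated range −1 ≤ Φ ≤ 1 and with Φ = 1 for identical vectors; the function name `intimate_sort_predict` is hypothetical.

```python
import numpy as np


def intimate_sort_predict(query, train_X, train_y, rng=None):
    """Assign the query to the class of its highest-scoring training protein.

    query   : feature vector of the query protein, shape (m,)
    train_X : training feature vectors, shape (N, m)
    train_y : sequence of N class labels
    Ties for the highest score are broken at random, as described in the text.
    """
    if rng is None:
        rng = np.random.default_rng()
    query = np.asarray(query, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    # Cosine similarity (assumed form of the score); zero vectors are not handled.
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(query)
    scores = (train_X @ query) / norms
    best = np.flatnonzero(np.isclose(scores, scores.max()))
    return train_y[rng.choice(best)]
```

Because the rule has no tunable parameters, training amounts to storing the feature vectors and labels of the training proteins.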
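The assessment can be illustrated with a jackknife (leave-one-out) loop in which each protein is predicted from all the others. The sketch assumes that the per-class accuracy is C(i)/Tot(i) and that the overall accuracy is the total number of correct predictions divided by the total number of sequences; it also assumes a cosine-similarity nearest-neighbour rule and breaks ties by taking the first best match rather than a random one. The function name `jackknife_accuracy` is hypothetical.

```python
from collections import Counter

import numpy as np


def jackknife_accuracy(X, y):
    """Leave-one-out test; returns (per-class accuracy dict, overall accuracy)."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T               # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)    # a protein may not be matched to itself
    predicted = [y[k] for k in sim.argmax(axis=1)]
    tot = Counter(y)                                               # Tot(i)
    correct = Counter(t for t, p in zip(y, predicted) if t == p)   # C(i)
    per_class = {c: correct[c] / tot[c] for c in tot}
    overall = sum(correct.values()) / len(y)
    return per_class, overall
```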
Selection of m
Contribution of features
Inspired by the PCA-based feature selection method described in [37, 39], we use the following procedure to assess the contributions of the 1497 features to prediction accuracy. Recall that in the previous principal component analysis on the dataset D365 we obtained 32 eigenvectors $E_1, E_2, ⋯, E_{32}$, and each eigenvector comprises 1497 components. Let us denote $E_i = (E_{ij})$, where 1 ≤ i ≤ 32 and 1 ≤ j ≤ 1497. To find the i-th PC in PCA, $E_{ij}$ is used to weight the j-th feature. In this sense, the value $E_{ij}$ can be viewed as the weight of the contribution that the j-th feature makes to the i-th PC. To combine the contributions to all the PCs, we compute a single weight $w_j$ for each feature from its 32 components. Then, $w_j$ can be naturally viewed as the weight of the contribution that the j-th feature makes to the final prediction accuracy, because our method is based on these 32 PCs. In general, the higher the weight $w_j$, the greater the contribution of the j-th feature.
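One simple way to combine the components into a per-feature weight is to sum their absolute values over the retained eigenvectors. This particular combination rule is an assumption made here for illustration, since the exact formula is not recoverable from the text; the function names `feature_weights` and `top_features` are likewise hypothetical.

```python
import numpy as np


def feature_weights(components: np.ndarray) -> np.ndarray:
    """Combine eigenvector components into one weight per feature.

    `components` has shape (p features, m retained PCs), with column i
    holding eigenvector E_i. Assumed rule: w_j = sum over i of |E_ij|.
    """
    return np.abs(components).sum(axis=1)


def top_features(components: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k features with the largest weights, best first."""
    weights = feature_weights(components)
    return np.argsort(weights)[::-1][:k]
```

Other rules, for example weighting each eigenvector by its eigenvalue before summing, would fit the same description and may rank the features differently.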
The authors would like to thank Prof. Kuo-Chen Chou for sending us the datasets D167, D566 and D1238, and the anonymous referees for their valuable comments and suggestions. This work was partially supported by the Singapore NRF grant NRF2007IDM-IDM002-010 and MOE AcRF Tier 1 grant RG78/08.
- Horn F, Weare J, Beukers MW, Hörsch S, Bairoch A, Chen W, Edvardsen Ø, Campagne F, Vriend G: GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res 1998, 26: 275–279. doi:10.1093/nar/26.1.275
- Hébert TE, Bouvier M: Structural and functional aspects of G protein-coupled receptor oligomerization. Biochem Cell Biol 1998, 76: 1–11. doi:10.1139/bcb-76-1-1
- Ellis C: The state of GPCR research in 2004. Nat Rev Drug Discov 2004, 3: 577–626. doi:10.1038/nrd1458
- Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, Le Trong I, Teller DC, Okada T, Stenkamp RE, Yamamoto M, Miyano M: Crystal structure of rhodopsin: a G-protein coupled receptor. Science 2000, 289: 739–745. doi:10.1126/science.289.5480.739
- Gaulton A, Attwood TK: Bioinformatics approaches for the classification of G-protein-coupled receptors. Curr Opin Pharmacol 2003, 3: 114–120. doi:10.1016/S1471-4892(03)00005-5
- GPCRDB database [http://www.gpcr.org/7tm/]
- Gao QB, Wang ZZ: Classification of G-protein coupled receptors at four levels. Protein Eng Des Sel 2006, 19: 511–516. doi:10.1093/protein/gzl038
- Davies MN, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: On the hierarchical classification of G protein-coupled receptors. Bioinformatics 2007, 23: 3113–3118. doi:10.1093/bioinformatics/btm506
- Chou KC: Prediction of G-protein-coupled receptor classes. J Proteome Res 2005, 4: 1413–1418. doi:10.1021/pr050087t
- Elrod DW, Chou KC: A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng Des Sel 2002, 15: 713–715. doi:10.1093/protein/15.9.713
- Chou KC, Elrod DW: Bioinformatical analysis of G-protein-coupled receptors. J Proteome Res 2002, 1: 429–433. doi:10.1021/pr025527k
- Bhasin M, Raghava GPS: GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res 2005, 33: W143-W147. doi:10.1093/nar/gki351
- Huang Y, Cai J, Ji L, Li Y: Classifying G-protein coupled receptors with bagging classification tree. Comput Biol Chem 2004, 28: 275–280. doi:10.1016/j.compbiolchem.2004.08.001
- Kristiansen K: Molecular mechanisms of ligand binding, signaling, and regulation within the superfamily of G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structure and function. Pharmacol Ther 2004, 103: 21–80. doi:10.1016/j.pharmthera.2004.05.002
- Lin WZ, Xiao X, Chou KC: GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis. Protein Eng Des Sel 2009, 22: 699–705. doi:10.1093/protein/gzp057
- Xiao X, Wang P, Chou KC: GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 2009, 30: 1413–1423. doi:10.1002/jcc.21163
- Xiao X, Wang P, Chou KC: Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008, 254: 691–696. doi:10.1016/j.jtbi.2008.06.016
- Xiao X, Shao SH, Huang ZD, Chou KC: Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor. J Comput Chem 2006, 27: 478–482. doi:10.1002/jcc.20354
- Chou KC: Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun 2000, 278: 477–483. doi:10.1006/bbrc.2000.3815
- Bhasin M, Raghava GPS: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res 2004, 32: W383-W389. doi:10.1093/nar/gkh416
- Gao QB, Wu C, Ma XQ, Lu J, He J: Classification of amine type G-protein coupled receptors with feature selection. Protein Pept Lett 2008, 15: 834–842. doi:10.2174/092986608785203755
- Papasaikas PK, Bagos PG, Litou ZI, Hamodrakas SJ: A Novel method for GPCR recognition and family classification from sequence alone using signatures derived from profile hidden Markov models. SAR QSAR Environ Res 2003, 14: 413–420. doi:10.1080/10629360310001623999
- Guo YZ, Li M, Lu M, Wen Z, Wang K, Li G, Wu J: Classifying G protein-coupled receptors and nuclear receptors on the basis of protein power spectrum from fast Fourier transform. Amino Acids 2006, 30: 397–402. doi:10.1007/s00726-006-0332-z
- Gupta R, Mittal A, Singh K: A novel and efficient technique for identification and classification of GPCRs. IEEE Trans Inform Technol Biomed 2008, 12: 541–548. doi:10.1109/TITB.2007.911308
- Karchin R, Karplus K, Haussler D: Classifying G-protein coupled receptors with support vector machines. Bioinformatics 2002, 18: 147–159. doi:10.1093/bioinformatics/18.1.147
- Chou KC, Cai YD: A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem Biophys Res Commun 2003, 311: 743–747. doi:10.1016/j.bbrc.2003.10.062
- Cai YD, Chou KC: Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J Theor Biol 2006, 238: 395–400. doi:10.1016/j.jtbi.2005.05.035
- Gao QB, Jin ZC, Ye XF, Wu C, He J: Prediction of nuclear receptors with optimal pseudo amino acid composition. Anal Biochem 2009, 387: 54–59. doi:10.1016/j.ab.2009.01.018
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. doi:10.1093/bioinformatics/btl158
- ASTRAL SCOP database [http://astral.berkeley.edu/]
- AAindex database [http://www.genome.ad.jp/dbget/aaindex.html]
- Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Structure, Function, and Genetics 2001, 43: 246–255. doi:10.1002/prot.1035
- Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ: PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2006, 34: W32-W37. doi:10.1093/nar/gkl305
- Dubchak I, Muchnik I, Holbrook SR, Kim SH: Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 1995, 92: 8700–8704. doi:10.1073/pnas.92.19.8700
- Yang JY, Zhou Y, Yu ZG, Anh V, Zhou LQ: Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides. BMC Bioinformatics 2008, 9: 11. doi:10.1186/1471-2105-9-S3-S11
- Jolliffe IT: Principal component analysis. New York: Springer; 2002.
- Chou KC, Shen HB: Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms. Nature Protocols 2008, 3: 153–162. doi:10.1038/nprot.2007.494
- Cohen I, Tian Q, Zhou XS, Huang TS: Feature selection using principal feature analysis. Univ. of Illinois at Urbana-Champaign 2002.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.