An improved classification of G-protein-coupled receptors using sequence-derived features

Background G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3 D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and characterization of novel receptors, it is therefore very valuable to develop a computational method to accurately predict GPCRs from the protein primary sequences. Results We propose a new method called PCA-GPCR, to predict GPCRs using a comprehensive set of 1497 sequence-derived features. The principal component analysis is first employed to reduce the dimension of the feature space to 32. Then, the resulting 32-dimensional feature vectors are fed into a simple yet powerful classification algorithm, called intimate sorting, to predict GPCRs at five levels. The prediction at the first level determines whether a protein is a GPCR or a non-GPCR. If it is predicted to be a GPCR, then it will be further predicted into certain family, subfamily, sub-subfamily and subtype by the classifiers at the second, third, fourth, and fifth levels, respectively. To train the classifiers applied at five levels, a non-redundant dataset is carefully constructed, which contains 3178, 1589, 4772, 4924, and 2741 protein sequences at the respective levels. Jackknife tests on this training dataset show that the overall accuracies of PCA-GPCR at five levels (from the first to the fifth) can achieve up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further perform predictions on a dataset of 1238 GPCRs at the second level, and on another two datasets of 167 and 566 GPCRs respectively at the fourth level. The overall prediction accuracies of our method are consistently higher than those of the existing methods to be compared. Conclusions The comprehensive set of 1497 features is believed to be capable of capturing information about amino acid composition, sequence order as well as various physicochemical properties of proteins. Therefore, high accuracies are achieved when predicting GPCRs at all the five levels with our proposed method.


Background
The structure of a G-protein-coupled receptor (GPCR) generally comprises seven a-helical transmembrane domains, an extracellular N-terminus, and an intracellular C-terminus [1]. GPCRs constitute one of the largest family of membrane proteins, and their main function is to transduce extracellular signals into intracellular reactions. Therefore, they play a key role in diverse physiological processes such as neurotransmission, secretion, cellular differentiation, cellular metabolism, and so forth [2]. It has been estimated that almost two-thirds of drugs on the market interact with GPCRs [3], which indicates that GPCRs are pharmacologically important. Therefore, both academic and industrial researchers are very interested in the studies on GPCRs to understand their structures and functions. Unfortunately, the 3 D protein structures of GPCRs are largely unavailable [4], except for the GPCR family bovine rhodopsin. Although some advanced biotechnologies such as NMR allow to detect the 3 D protein structures, their experiments are generally very timeconsuming and costly. In contrast, a large number of GPCR primary sequences are known [5]. To facilitate the identification and characterization of novel receptors [5], it is therefore very valuable to develop a computational method to predict GPCRs from the protein primary sequences.
Based on their binding ligand types, GPCRs are often classified into different groups, some of which are further divided into subgroups, sub-subgroups, etc. The GPCRDB database [1,6] is one of the most popular database for GPCRs, which organizes GPCRs using a hierarchical structure. As in [7,8], we call each layer of this hierarchical structure a level. The top layer is then referred to as the second level (One more layer will be added on the top of the hierarchical structure later), and the second layer is referred to as the third level, etc. According to the latest version of the GPCRDB database (Version 9.9.1, September 2009), GPCRs in the second level are classified into five families or classes (In the previous versions of the GPCRDB database, e.g., June 2006 release, GPCRs are classified into six families in this level); that is, (1) Class A Rhodopsin like, (2) Class B Secretin like, (3) Class C Metabotropic glutamate/ pheromone, (4) Vomeronasal receptors (V1R &V3R), and (5) Taste receptors T2R. For the first four families above, each is further divided into subfamilies located at the third level. Furthermore, located on the fourth and fifth levels of the hierarchical structure are the subsubfamilies and subtypes, respectively. On the other hand, given a new protein, the first step is to determine whether it is a GPCR or a non-GPCR. Therefore, we add one more level on the top of the hierarchical structure of the above classification system. It is referred to as the first level. The complete hierarchical structure of five levels is illustrated in Figure 1.
In this paper we will look into the following classification problem, which is referred to as a five-level classification problem. Given a protein sequence, we need to determine whether it is a GPCR or a non-GPCR. If it is predicted into a GPCR, we need to further determine which family, subfamily, sub-subfamily, and subtype it belongs to. To tackle this problem, a set of distinct classifiers is generally needed for each level as depicted in Figure 1. In the literature, many computational methods have been proposed to predict GPCRs. However, to our best knowledge, there are no methods that can deal with the five-level problem completely, (i.e., allow to make predictions at all the five levels). For example, the methods presented in [9][10][11][12] predict GPCRs just at a single level (the second, third or fourth level), and the methods in [13] predict GPCRs only at the third and fourth levels. The prediction methods in [8] and [7] instead considered three and four levels, respectively.
Today's academic and industrial researchers are both interested in the functional roles of GPCRs at the finest subtype level. This is mainly because each subtype demonstrates its own characteristic ligand binding property, coupling partners of trimeric G-proteins, and Figure 1 The hierarchical structure for GPCRs. The organization of GPCR sequences in the GPCRDB database does not include the first level in this figure. We add it in this study because we performed prediction at this level.
interaction partners of oligomerization [14]. Therefore, discrimination of functions of a GPCR subtype from the others (i.e., prediction of GPCRs at the fifth level as shown in Figure 1) becomes very important in the effort to decipher GPCRs. However, we can expect that it is a challenging task that shall not be easier than the prediction of GPCRs at any of the first four levels. Fortunately, more and more GPCR sequences are now being accumulated into the GPCRDB database, which makes it possible to accurately predict GPCRs at all the five levels. This is the main goal of our present study.
A lot of related work has been done previously. In general, there are two important components in a classification taskone is feature extraction and the other is a classification algorithm. Feature extraction means how to extract features from protein sequences so that each protein is represented as a fixed-length numerical vector. Various methods have been proposed to extract information from protein sequence in the past decades (See eg., [15][16][17][18][19]). The commonly-used feature extraction methods are based on amino acid composition [9][10][11] and dipeptide composition [7,12,13,20,21], and more complicated ones include Chou's pseudo amino acid composition [15], the cellular automaton image approach [16], profile hidden Markov models [22], fast Fourier transform [23], wavelet-based time series analysis [24], and Fisher Score Vectors [25]. Once protein sequences are represented by numerical vectors, any general-purpose classification algorithms can be used for classification, for instance, covariant discriminant [9][10][11]16], nearest neighbor [7], bagging classification tree [13], and support vector machines [12,20,21,[23][24][25].
In this paper, we focus on predicting GPCRs at the five levels. Five groups of descriptors are used to extract information from the amino acid sequences. These five groups are (1) amino acid composition and dipeptide composition, (2) autocorrelation descriptors, (3) global descriptors, (4) sequence-order descriptors, and (5) Chou's pseudo amino acid composition descriptors. These descriptors reflect various physicochemical properties of proteins and have been adopted to predict many other protein attributes, such as protein subcellular localization [19,26], outer membrane protein [27], nuclear receptors [28], and protein structural classes [17,18]. By combining these descriptors, a comprehensive set of 1497 features are calculated for each amino acid sequence. By applying the principal component analysis on a dataset, we then reduce them to a set of 32 features that could retain as much of the data variability as possible.
Finally, a simple yet powerful algorithm called intimate sorting is employed to predict GPCRs, and the experimental tests on the benchmark datasets show that the classifications can be improved. Jackknife test shows that the overall accuracies of the proposed method at the first, second, third, fourth, and fifth levels achieve up to 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. Comparisons with several existing methods show that the proposed method achieves higher prediction performance consistently.

Results and Discussion
Predicting GPCR at five levels For simplicity, we call the proposed method PCA-GPCR. PCA-GPCR preforms the prediction at five levels, and its flowchart structure is depicted in Figure 2. By the firstlevel classifier a new protein sequence is predicted to be either a GPCR or a non-GPCR. If it is predicted to be a GPCR, it will be further classified into one of the four families, which is done by the second-level classifier. The third-level classifiers hence determine which subfamily the protein belongs to. For some subfamilies (see Additional file 1), the fourth-level classifiers are used to determine the sub-subfamily of the protein. Finally, the fifth-level classifiers determine the subtypes of the protein, if any (see Additional file 1). We carried out the experiments on the collection of datasets GDFL (Please see the Methods section for the details of datasets). Jackknife tests show that the overall accuracies of PCA-GPCR are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34% for the five levels, respectively. The details of experimental results are presented in the Additional file 1. It is commonly believed that, the smaller number of training sequences, the less reliable a classifier to be trained. Therefore, it is not surprising to see that the prediction accuracies are higher at the first and second levels and relatively lower at the third and fourth levels. On the other hand, to filter out high-homology sequences, we used CD-HIT with a less stringent threshold (0.9) for the fifth level than the one for any other levels, which results in a larger number of training sequences for the fifth level. This might partly explain why the accuracy achieved for the fifth level (subtype) is higher than those of the second, third and fourth levels. For the convenience of public use, a web server was already developed, which is freely available at http://www1.spms.ntu.edu.sg/ chenxin/PCA_GPCR.

Comparison with BLAST-based classification
The most straightforward method for predicting GPCRs might be based on homology search by sequence alignment tools such as BLAST and PSI-BLAST [29]. A given GPCR sequence is hence predicted into the class to which its most similar GPCR sequence belongs. However, as the pairwise sequence similarities get lower, such an alignment-based method would rarely yield satisfactory predictions. For instance, when applied to the dataset GDFL for the prediction at the first level, the BLAST-based method achieved the overall accuracy of 74.58%, which is 14.92% lower than that from PCA-GPCR. Note that PCA-GPCR is instead an alignmentfree method. The above experimental results therefore show that an alignment-free method is very promising in the high accurate prediction of GPCR classes.

Comparison with previous methods
In order to demonstrate the superior performance of PCA-GPCR, we make comparisons with a number of previous methods. Depending on the predictive capability of previous methods, the comparisons are made at a single level and at the first two levels, as follows.

Comparison at a single level
Because many previous methods predicted GPCR at a single level [9][10][11][12], we also predict GPCR at just one level in order to compare with them fairly. Three benchmark datasets that contain a proportion of high-homology sequence pairs, D167, D566 and D1238, are used here (Please see Methods section for the details of these datasets). The first two datasets comprise GPCRs from the fourth level, and the last one is composed of GPCRs from the second level. The resulting prediction accuracies for these datasets are listed in Table 1. We can see that the overall accuracies for three datasets are all above 97%. To be specific, the overall accuracies of 98.2%, 97.88%, and 99.76% are achieved for the datasets D167, D566, and D1238, respectively. They are slightly higher than the accuracies reported in Refs. [7,[9][10][11][12][13]21]. Indeed, the prediction accuracies for individual families or sub-subfamilies are all very high and, in some cases, have reached 100% or nearly 100%. Because the dataset D167 has been widely used to test various methods, it is adopted here for further detailed comparisons with the other five methods [7,10,12,13,21]. The experimental results are presented in Table 2. It is evident from the table that our method achieved the highest overall prediction accuracy. Our method performs better than any other tested methods in the predictions of the GPCR sub-subfamilies except for the sub-subfamily Serotonin.

Comparison with GPCR-CA at the first two levels
We further compare our method with GPCR-CA [16] on the dataset D365, which comprises GPCRs from the second level. Unlike the datasets tested in the above subsection, D365 contains almost no high-homology sequence pairs. Note that the GPCR-CA is able to predict GPCRs at the first two levels.
The prediction accuracies of both GPCR-CA and PCA-GPCR at the first and second levels are listed in Table 3 and Table 4, respectively. At the first level, to distinguish GPCRs from non-GPCRs, our method achieves the overall accuracy of 95.21%, which is 3.57% higher than that of GPCR-CA. At the second level, the overall accuracy of our method improves over GPCR-CA by 9.04%. Meanwhile, according to the prediction accuracies of individual families, our method performs much better than GPCR-CA except for the rhodopsinlike family. It is also noticeable that a substantial improvement of 86.95% (= 95.65% -8.70%) has been made for the prediction of the fungal pheromone family (partly due to the small size of protein sequences in this family, as shown in Table 1).
GPCR-CA extracts 24 features, including 20 features from amino acid composition and four features from cellular automaton image [16]. While the last four features were reported to be able to reveal the protein's overall sequence patterns, only four features might not suffice to reveal overall sequence patterns completely. On the contrary, our method explores the amino acid sequences comprehensively to gain as much information from the protein primary sequences as possible. Both the amino acid composition and the dipeptide composition are utilized in our method and, moreover, the important sequence-order information and a variety of physicochemical properties of amino acids are carefully explored as well. We believe that it is this comprehensive set of features that lead our method to a higher prediction accuracy.

Conclusions
In this paper, we have proposed a new method called PCA-GPCR to predict GPCRs at five levels. In this method, a comprehensive set of 1497 sequence-derived features are generated from five groups of descriptorsthat is, amino acid composition and dipeptide composition, autocorrelation descriptors, global descriptors, sequence-order descriptors, and Chou's pseudo amino acid composition descriptors. These features are able to capture the information about the amino acid composition, sequence order as well as various physicochemical properties of proteins. Because of the high Tot(i) is the number of sequences observed in class i, c(i) is the number of correctly predicted sequences of class i, and ACC is the prediction accuracy. The results of the other methods are taken directly from the corresponding references. The results of GPCR-CA are directly taken from the Ref. [16].
dimensionality of the feature space, the principal component analysis is hence used to reduce the dimension from 1497 to 32. The resulting 32-dimensional feature vectors are finally fed into a simple yet powerful intimate sorting algorithm for the prediction of GPCRs at five levels. By evaluating on the datasets constructed from the latest version of the GPCRDB database, the overall accuracies of our method from the first level to the fifth level are 99.5%, 88.8%, 80.47%, 80.3%, and 92.34%, respectively. We further test and compare our method with several other methods based on four benchmark datasets widely used in the literature. At the second level, for a dataset containing 1238 GPCRs, the overall accuracy of our method reaches 99.76%. At the fourth level, for two different datasets that contain 167 and 566 GPCRs, the overall accuracies of our method reach up to 98.2% and 97.88%, respectively. They are all higher than those of the other methods under comparison. At the first two levels, we further test our method on a low-homology dataset (with only a few sequence pairs of more than 40% sequence identity). The overall accuracies thus achieved at the first level and second level are 95.21%, 92.6%, respectively, which are 3.57% and 9.04% higher than those of the method GPCR-CA.
We conclude that the high prediction accuracy of the proposed method is attributed to the comprehensive set of features that we constructed from five groups of descriptors. It is anticipated that our method could contribute more to the characterization of novel proteins and gain new insights into their functions, thereby facilitating drug discovery. A web server that predicts GPCRs at five levels with our proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/PCA_GPCR.

Datasets
We construct a collection of non-redundant datasets from the latest release of the GPCRDB database (Version 9.9.1, September 2009) [6] to evaluate and train the classifiers for the GPCRs prediction. As mentioned in the Background section, the sequences in the GPCRDB database are organized in four levels: family or class, subfamily, sub-subfamily, and subtype. We download the GPCR sequences from the GPCRDB database and then filter out the high-homology sequences using the program CD-HIT [30]. In order to ensure that there are enough sequences to train the classifiers, we apply different thresholds in CD-HIT for sequences at different levels. They are 0.4, 0.7, 0.8, and 0.9 for the family, subfamily, sub-subfamily, and subtype levels, respectively. After filtering, only families (subfamilies, sub-subfamilies, and subtypes) with more than 10 sequences are retained for training classifiers. Because the fifth family (Taste receptors T2R) has no subfamily and there are only 14 sequences remaining after filtering by CD-HIT, it is therefore ignored in subsequent analysis. At the end, we obtained 1589, 4772, 4924, and 2741 GPCRs at the family, subfamily, sub-subfamily and subtype levels, respectively. The name of families, subfamilies, sub-subfamilies, and subtype, together with the number of GPCR proteins retained at each level are listed in the Additional file 1.
The GPCR protein sequences retained at the family level are used to construct a positive dataset for training and evaluation. A negative dataset of non-GPCRs is then constructed in almost the same way as in Ref. [25], except that the latest version of ASTRAL SCOP (Version 1.75) [31] is used. First, we download the sequences that have less than 40% identity to each other (i.e., the file with the name "seq.75;item = seqs;cut = 40"). Then, remove those sequences of length less than 30, and those having identity above 40% using CD-HIT. Finally, a total of 10325 sequences remain, from which 1589 sequences are randomly selected to form a negative dataset. Because these selected proteins are organized into five levels, for the sake of convenience, we call them the datasets GDFL (GPCR Datasets in Five Levels). They are available at the web server provided in this paper.
In addition, in order to perform comparison with other existing methods directly, four benchmark datasets from previous studies are experimented in this study as well. For the sake of simplicity, they are referred to as D167, D566, D1238 and D365, respectively. We know that all of them were constructed based on the older version of the GPCRDB database. The proteins in the dataset D167   (6) frizzled/smoothened family. The numbers of proteins in the above four datasets are given in Table 1. Furthermore, 365 non-GPCR sequences are taken from the Swiss-Prot database to serve as a negative dataset against D365 [16]. The sequence homology level is an important factor that affects the effectiveness of a classification method. Therefore, it is worthwhile to take a look at the sequence similarity levels of proteins in these datasets before performing any evaluation test. For simplicity, we analyze the similarity level of the whole dataset rather than the subsets in the dataset. Chou and Elrod [9][10][11] reported that all the receptor sequences in the aforementioned datasets were generally lower than 40%, according to their definition of the average sequence identity percentage between two protein sequences. Here, we run a protein sequence clustering program called CD-HIT [30] on each dataset with the varying thresholds of sequence identity. For example, if a threshold of 0.9 is used, the proteins having pairwise residue identities of 90% or above would be placed into a same cluster. In general, the fewer resulting clusters imply the higher overall sequence similarities. The test results are shown in Table 5, where the proteins are clustered with the thresholds of 0.9, 0.8, 0.7, 0.6, 0.5 and 0.4, respectively. In particular, 100 clusters are obtained for 167 proteins in the dataset D167 with the threshold of 0.9. It indicates that there do exist high-homology protein pairs, but they only take up a small proportion of the total number (i.e., 12861 = 167 × 166/2) of distinct protein pairs. The use of the threshold of 0.4 further reduces the number of clusters to 30, which could suggest that the average sequence identity of proteins is quite low. However, to avoid the overestimation of prediction accuracy, it would be better if those high-homology sequences can be filtered out with CD-HIT. For instance, the dataset D365 does not contain any protein pairs having ≥ 40% pairwise sequence identity except in the E-cAMP receptor family, which contains too few (only 10) GPCRs to apply filtering.

Physicochemical properties
In order to capture as much information of protein sequences as possible, a variety of physicochemical properties [32] are used in the procedure of feature extraction. These physicochemical properties are listed in Table 6, of which the first eighteen are used to measure the physicochemical properties of individual amino acids and the last two to measure the physicochemical distances between two amino acids.

Sequence-derived features
As mentioned in the introduction, amino acid composition was widely used to transform GPCR sequences into 20-dimension numerical vectors [9][10][11]. However, the sequence order information would be completely lost. In order to address this issue, dipeptide composition was proposed to represent GPCR sequences by 400-dimension vectors, which captures local-order information and has been reported to improve classifications [7,12,13,20,21]. Recently, GPCR-CA [16] utilized the conception of Chou's pseudo amino acid composition [33] to represent each protein sequence by 24 features. The first 20 features correspond to the amino acid composition and the remaining four features are calculated from a so-called cellular automation image. These four features were shown capable of reflecting a protein's overall sequence pattern. Inspired by this work, we seek a new set of features that can comprehend as much information as possible from GPCR sequences. To this end, we investigate the following five groups of features, where the parameters are set to the same values as in [34].

Amino acid composition (AAC) and dipeptide composition (DC)
Amino acid composition is defined as the occurrence frequencies of 20 amino acids in a protein sequence.
That is, where each i = 1, 2, ... , 20 corresponds to a distinct amino acid and n A (i) is the number of amino acid i occurring in the protein sequence of length L.
Similarly, dipeptide composition is defined as the occurrence frequencies of the 400 dipeptides (i.e., 400 amino acid pairs). That is, where each i = 1, 2, ..., 400 corresponds to one of the 400 dipeptides and n D (i) is the number of dipeptide i occurring in the sequence.

Autocorrelation descriptors (AD)
We use three autocorrelation descriptorsnormalized Moreau-Broto autocorrelation descriptors, Moran autocorrelation descriptors and Geary autocorrelation descriptors. They are all defined based on the value distributions of the first eight physicochemical properties of amino acids along a protein sequence (see Table 6). The measurement values of these properties are first standardized before we proceed to calculate the three autocorrelation descriptors. The standardization is performed as follows.
where P 0 (i) are the property value of the amino acid i, Geary autocorrelation descriptors are defined as: For each autocorrelation descriptor, we obtain 240 (= 30 × 8) features. In total, 720 (= 240 × 3) features will be obtained to describe a protein sequence.

Global descriptors (GD)
These descriptors were first proposed by Dubchak et al. [35] to predict protein folding classes, and later applied to predict human Pol II promoter sequences [36]. They are constructed as follows. Firstly, given each of the following seven amino acid properties: normalized van der Waals volume, polarity, polarizability, charge, secondary structure, solvent accessibility and relative hydrophobicity (i.e., properties 12-18 listed in Table 6), the 20 amino acids are divided into three groups according to their property values. Then, for a given amino acid sequence, we may obtain a new sequence of three symbols, each corresponding to one group of amino acids. Finally, three groups of quantities are defined on the new sequence; that is, composition (Comp), transition (Tran) and distribution (Dist), as demonstrated below. For the sake of simplicity, suppose that a sequence is made of only two letters (A and B). Comp is defined as the occurrence frequency of each letter in the sequence. For example, we have a sequence BABBABABBABBAA-BABABBAAABBABABA, in which there are 14 As and 16 Bs. Therefore, the occurrence frequencies of A and B are 14/(14 + 16) × 100.00 = 46.67 and 16/(14 + 16) × 100.00 = 53.33, respectively. Tran is used to represent the occurrence frequency of pairs AB or BA. In the above sequence, there are 21 transitions from one letter to another, so Tran is computed as (21/29) × 100.00 = 72.14. On the other hand, Dist calculates the relative positions of the first, 25%, 50%, 75% and 100% of the total amount of a particular letter in the sequence. In the above sequence, for example, the first, 25%, 50%, 75% and 100% of the total amount of the letter B are located at the first, 6th, 12th, 20th and 29th positions, respectively. The quantities Dist for the letter B are hence 1/30 × 100.00 = 3. amino acids are divided into three groups by each amino acid property, which leads to a new sequence of three symbols (n = 3). Following the similar procedure demonstrated above, we will obtain 21 3 features to describe the new sequence (of three symbols). Combining all the features to be extracted based on the seven amino acid properties, we will obtain a total of 147 (= 21 × 7) features for each input protein sequence from the global descriptors.

Sequence-order descriptors (SD)
In order to derive sequence-order descriptors, we rely on two distance measures for amino acid pairs. One is called the Grantham chemical distance matrix [34], and the other called the Schneider-Wrede physicochemical distance matrix [19]. Then, the jth-rank sequence-ordercoupling number is defined as: where d(R i , R i +j ) is one of the above distances between the two amino acids R i and R i + j located at position i and i + j, respectively.

Chou'is pseudo amino acid composition descriptors (PseAAC)
This set of features were originally developed by Chou [33] and have been used widely to predict various attributes of proteins, such as outer membrane protein [27], nuclear receptors [28], and protein structural classes [17,18]. The Chou's pseudo amino acid composition descriptors are defined similarly as the quasi-sequenceorder descriptors. The difference lies in the coupling number τ (j), which is modified to: where Θ (R i , R i + d ) is the dth-tier correlation factor that reflects the sequence order correlation between all the most contiguous residues along a protein chain. It is defined as: where ω is a weighting factor (default ω = 0.1). It will generate 50 features from the Chou'is pseudo amino acid composition descriptors.
In summary, a comprehensive set of 1497 features, which measure the protein sequences from different aspects, will be generated from the above five descriptors. The number of features in each group of descriptors is listed in Table 7. These features are used to represent every protein sequence, and may be directly fed into a classification algorithm. Note that, however, there are some correlations or redundancies among these features such as the first twenty features in the fourth and fifth groups of features. On the other hand, the dimension of the features is too large, which might make it difficult to work with many machine learning algorithms for classification. Therefore, it is necessary to reduce the dimension. In this study, we adopt one of the most popular and powerful techniques, namely, principal component analysis, for the purpose of dimensionality reduction.

Principal component analysis
Principal Component Analysis (PCA) is a classical statistical method which is still widely used in modern data analysis. PCA involves a mathematical procedure that transforms a large number of (possibly) correlated variables into a smaller number of uncorrelated variables, called principal components (PCs), that retain as much variability of the data as possible [37]. Given a data matrix denoted by X = (X 1 , X 2 , ..., X p ), where X i is a column vector of size n which is equal to the number of proteins of interest and p denotes the number of protein sequence features, a typical PCA is performed as follows.
First, we shall standardize every X i by where X i and Var(X i ) are the mean and variance of the vector components of X i , respectively. Then, the covariance matrix of Y = (Y 1 , Y 2 , ..., Y p ) is obtained as For the covariance matrix Cov(Y), we find all its eigenvalues l 1 ≥ l 2 ≥ ... ≥ l p and the corresponding eigenvectors, We can see that each PC(i) is a column vector with size n and the j-th element in PC(i) represents the i-th PC value of protein j. Thereafter, a total of p uncorrelated PCs are obtained.
In order to reduce the dimension of the feature space, only the first m PCs are used to represent each protein sequence (m ≤ p). It is generally hard to determine the optimal value of m. In this study, we aim to find a value of m that could make the overall prediction accuracy of GPCRs as high as possible, which we will further discuss later.

Intimate sorting algorithm
Many classification algorithms in the literature have been used to predict GPCRs, for instance, covariant discriminant [9][10][11]16], nearest neighbor [7], bagging classification tree [13], and support vector machines [12,20,21,[23][24][25]. In this study, we use a simple yet powerful algorithm called intimate sorting [26]. This algorithm is easy to implement and does not need to set any parameters as some other algorithms (e.g., support vector machines).
Suppose that a training set consists of N proteins {P 1 , P 2 , ..., P N }, each of which P i is a l-dimension vector, P i = (p i, 1 , p i, 2 , ..., p i , l) T . The GPCR classes of these proteins are already known, and each protein belongs to exactly one of the μ classes. The intimate sorting algorithm aims to place a query protein P = (p 1 , p 2 , ..., p l ) T into one of the μ classes based on the information from the N proteins in the training set. To this end, a measure of similarity score between P and P i is defined as Φ( , ) , , , , , P P P P P P . When P ≡ P i , it can be easily seen that Φ(P, P i ) = 1, suggesting that they are most likely to belong to a same class. In general, we have -1 ≤ Φ(P, P i ) ≤ 1. The higher the Φ(P, P i ) value, the more likely two proteins belong to a same class. Among the N proteins in the training set, the one with the highest score with the query protein P is picked out, which we denote by P k , k [1, N]. If there is a tie, we would randomly select one of them. In the final step, the intimate sorting algorithm simply assigns P into the same GPCR class as P k .

Prediction assessment
The jackknife test is a rigorous and objective statistical test that can always yield a unique result for a given test dataset [38]. Therefore, it is often used to examine the power of a new classifier. In this paper, we also use it to evaluate our method, where proteins are singled out from the dataset one by one as a testing protein and the classifier is trained by the remaining proteins. In this sense, the jackknife test is also called the leave-one-out test. The prediction accuracies (ACC) and overall accuracy (OACC) are then measured by the following formulae: where Tot(i) is the total number of sequences in class i, C(i) the number of correctly predicted sequences of class i, and μ the total number of classes under consideration. Note that this prediction assessment method was already adopted in several previous studies, e.g., [7,15,16,24].

Selection of m
As we mentioned earlier, the number of PCs in PCA, i.e., m, remains to be determined. Here, we choose its value by aiming to achieve the overall prediction accuracy as high as possible. To this end, we use the dataset D365 to compute the overall prediction accuracies OACCs of GPCR families for varying values of m. When m ranges from 1 to 80, OACCs thus obtained are plotted in Figure 3. We found that the highest accuracy (92.6%) is achieved with m = 32. Based on this observation, we chose m = 32 in our experiments.   Table 7. The divisions of these six subsets are marked by vertical blue lines.

Contribution of features
Inspired by the PCA-based feature selection method described in [37,39], we use the following procedure to assess the contributions of the 1497 features to prediction accuracy. Recall that in the previous principle component analysis on the dataset D365 we obtained 32 eigenvectors E 1 , E 2 , ..., E 32 , and each eigenvector comprises 1497 components. Let us denote E i = (E ij ), where 1 ≤ i ≤ 32 and 1 ≤ j ≤ 1497. To find the i-th PC in PCA, E ij is used to weight the j-th feature. In this sense, the value E ij can be viewed as the weight of contributions that the j-th feature makes to the i-th PC. To combine the contributions to all the PCs, we may compute w E Then, w j can be naturally viewed as the weight of contributions that the j-th feature makes to the final prediction accuracy because our method is based on these 32 PCs. In general, the higher the weight w j , the more contributions the j-th feature makes.
The contribution of each of the 1497 features is computed and depicted in Figure 4(A), where we can see that the contributions of the amino acid composition (AAC) in the first group of descriptors are much higher than those of the dipeptide composition (DC). Therefore, we separate the AAC from the DC in the first group of descriptors in the following discussions. In addition, we find that the features from the autocorrelation descriptors (AD) made the highest contributions among all the features. Because there are 1497 features, it is not convenient to discuss the contributions of all individuals one by one. Instead, we compute the average contributions of the features in the following six subsets: AAC, DC, AD, GD, SD, and PseAAC. Their results are shown in Figure 4(B). It is evident from the figure that the highest average contribution is obtained with the features in the PseAAC subset (0.1657). The slightly lower contributions are provided by the AD and AAC subsets (0.1595 and 0.1579, respectively). On the contrary, the features in the GD and SD subsets achieve the average contributions only slightly higher than 0.14. The features in the DC subset instead achieve the least average contribution (0.1092). In summary, if we arrange the features in the six subsets in a decreasing order of their average contributions, then we obtain PseAAC, AD, AAC, GD, SD, and DC.
In particular, among the AD subset, some features made quite high contributions while the others made relatively low contributions, as we can see in Figure 4 (A). For a thorough investigation, we plot the contributions of all the features in the AD subset again in Figure  5. These features are divided into eight groups according to the physicochemical properties used to compute them. In Figure 5, the eight groups of features are separated by vertical blue lines and indicated by P1, P2, ..., P8, respectively. Note that Pi represents the i-th physicochemical property listed in Table 6. It is evident from Figure 5 that the highest contributions are due to the features computed with the physicochemical properties P3, P5 and P6; they are the polarizability parameter, residue accessible surface area in tripeptide and residue volume, respectively. For the group P3, we can further divide its 90 features into three subgroups according to three different autocorrelation descriptors (normalized Moreau-Broto, Moran, and Geary autocorrelation descriptors). These three subgroups are separated by red vertical lines in Figure 5, and indicated by P3.1, P3.2 and P3.3, respectively. In each subgroup, the feature contributions are computed with the values of d varying from 1 to 30 (from left to right on the horizontal axis). Observe that Moran and Geary autocorrelation descriptors (P3.2 and P3.3) made much higher contributions than the normalized Moreau-Broto descriptors (P3.1). Furthermore, for Moran and Geary autocorrelation descriptors, the features that are computed with a value of d in the range from 20 to 30 generally give rise to a fairly high contribution, while the maximum is attained when d = 26. The similar characteristics can also be observed for the groups P5 and P6 from Figure 5.

Additional material
Additional file 1: The information about families, subfamilies, subsubfamilies, and subtypes. The names of families, subfamilies, subsubfamilies, and subtypes used by PCA-GPCR are listed in this file. The names are derived from GPCRDB database. The number of proteins in each family, subfamily, sub-subfamily, subtype, and the corresponding accuracies are also available in this file.