Functional discrimination of membrane proteins using machine learning techniques

Background: Discriminating membrane proteins based on their functions is an important task in genome annotation. In this work, we have analyzed the characteristic features of amino acid residues in membrane proteins that perform major functions, such as channels/pores, electrochemical potential-driven transporters and primary active transporters.

Results: We observed that the residues Asp, Asn and Tyr are dominant in channels/pores, whereas the composition of the hydrophobic residues Phe, Gly, Ile, Leu and Val is high in electrochemical potential-driven transporters. The composition of all the amino acids in primary active transporters lies between those of the other two classes of proteins. We have utilized different machine learning algorithms, such as Bayes rule, logistic function, neural network, support vector machine and decision tree, for discriminating these classes of proteins. We observed that most of the algorithms discriminated them with similar accuracy. The neural network method discriminated the channels/pores, electrochemical potential-driven transporters and active transporters with a 5-fold cross-validation accuracy of 64% in a data set of 1718 membrane proteins. Using amino acid occurrence improved the overall accuracy to 68%. In addition, we have discriminated transporters from other α-helical and β-barrel membrane proteins with an accuracy of 85% using the k-nearest neighbor method. The classification of transporters and all other proteins (globular and membrane) showed an accuracy of 82%.

Conclusion: The performance of discrimination with amino acid occurrence is better than that with amino acid composition. We suggest that this method could be effectively used to discriminate transporters from all other globular and membrane proteins, and to classify them into channels/pores, electrochemical and active transporters.


Background
Membrane proteins perform a diverse variety of functions, including the transport of ions and molecules across the membrane, binding to small molecules in the extracellular space, immune recognition and energy transduction. The functional annotation of membrane proteins in genomic sequences is an important problem in bioinformatics and computational biology. Membrane transporters are a large group of proteins that span the cell membrane and form an intricate system of pumps and channels through which they deliver essential nutrients, eject waste products and help the cell to sense environmental conditions. Transporters represent a large and diverse group of proteins that differ in membrane topology, energy coupling mechanism and substrate specificity [1]. They play indispensable roles in the fundamental cellular processes of all organisms [2].
Several methods have been proposed to discriminate membrane proteins from amino acid sequence information. These methods include statistical analysis [3][4][5], hidden Markov models [6,7] and machine learning techniques [8][9][10]. However, the discrimination of membrane proteins based on their functions has not yet been well explored and is still at an early stage.
In this work, we have analyzed the characteristic features of amino acid residues in major transporters, such as channels/pores, electrochemical potential-driven transporters and primary active transporters. We have utilized different machine learning techniques for discriminating these classes of proteins and achieved a 5-fold cross-validation accuracy of 68%. The sensitivities of correctly identifying channels/pores, electrochemical and active transporters are 55%, 70% and 76%, respectively, in sets of 510, 502 and 706 proteins. The classification of channels and pores has also been carried out, which showed an accuracy of 92%. In addition, we have discriminated transporters from other α-helical and β-barrel membrane proteins, and from all other proteins (globular and membrane), with accuracies of 85% and 82%, respectively. Further, the influence of chain length on discrimination is discussed.

Data sets
We have constructed datasets for channels/pores, electrochemical transporters and active transporters from the information available in the Transport Classification Database, TCDB [11]. The TCDB has seven groups of transporters, of which three have insufficient data for analysis and one is for incompletely characterized proteins. Hence, we have used the three major groups: channels/pores, electrochemical and active transporters. The numbers of proteins belonging to these classes of transporters deposited in TCDB are 720, 989 and 1216, respectively. From these proteins, we have removed the redundant sequences using the blastclust program [12] so that no two proteins have a sequence identity of more than 20%. This algorithm showed only one sequence in most of the clusters, and we have randomly picked one sequence from the clusters with many sequences. The final dataset contains 1718 proteins, comprising 510 channels/pores, 502 electrochemical and 706 active transporters.
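The selection of one representative per cluster can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes the clustering output lists one cluster per line with whitespace-separated sequence identifiers, and the function name `pick_representatives`, the fixed random seed and the identifiers in the example are hypothetical.

```python
import random


def pick_representatives(cluster_lines, seed=0):
    """Return one randomly chosen sequence ID per cluster.

    Assumes each non-empty line lists the IDs of one cluster separated by
    whitespace (an assumed layout; adjust the parsing if yours differs).
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    representatives = []
    for line in cluster_lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        # A singleton cluster yields its only member; larger ones a random member.
        representatives.append(rng.choice(ids))
    return representatives


# Hypothetical cluster listing with made-up identifiers.
clusters = ["P1 P7 P9", "P2", "", "P3 P4"]
reps = pick_representatives(clusters)
print(len(reps), "representatives selected")
```

Fixing the seed is a design choice made here so that the non-redundant set can be regenerated exactly.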

Computation of amino acid composition and occurrence
The amino acid composition for the set of transporters has been computed using the number of amino acids of each type and the total number of residues. It is defined as:

$$\mathrm{Comp}(i) = \frac{\sum n_i}{\sum N} \qquad (1)$$

where i stands for the 20 amino acid residues, n_i is the number of residues of each type and N is the total number of residues. The summation is over all the residues in all the considered proteins. The same procedure was repeated for all three groups of transporters to obtain their amino acid compositions. The total numbers of residues in the datasets of channels/pores, electrochemical and active transporters are 259,143, 252,585 and 289,109, respectively.
The amino acid occurrence is the actual number of amino acid residues of each type present in a protein without normalizing with chain length.
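The two feature types can be illustrated with a short sketch. It assumes one-letter amino acid codes, counts only the 20 standard residues, and reports the pooled composition as a percentage; these choices and the function names are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def occurrence(sequence):
    """Raw count of each of the 20 residues in one protein (no length normalization)."""
    counts = Counter(sequence.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def composition(sequences):
    """Pooled composition over a set of proteins: sum of n_i divided by sum of N.

    Returned as percentages; N counts only the 20 standard residues
    (both are assumptions of this sketch).
    """
    totals = [0] * len(AMINO_ACIDS)
    for seq in sequences:
        for k, n in enumerate(occurrence(seq)):
            totals[k] += n
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no standard residues found")
    return [100.0 * n / grand_total for n in totals]


# Toy sequences, not taken from the dataset.
occ = occurrence("AAC")
comp = composition(["AAC", "C"])
print(occ[:2], comp[:2])
```

The occurrence vector keeps the chain-length information that the composition removes, which is the distinction examined later in the paper.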

5-fold cross-validation method and jack-knife test
We have performed a 5-fold cross-validation test for assessing the validity of the present work. In this method, the data set is divided into five groups; four of them are used for training and the remaining one is used for testing the method. The same procedure is repeated five times, so that each group is used once for testing, and the average is computed to obtain the accuracy of the method.
In the jack-knife test, n-1 data are used for training and the prediction is made on the left-out protein. This procedure is repeated n times and the average is computed to obtain the accuracy.
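Both validation schemes amount to different ways of partitioning the data indices, which can be sketched as below. The shuffling, the seed and the `evaluate` callback (which would train on one index list and return the accuracy on the other) are assumptions of this illustration, not details given in the text.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_splits(n_items):
    """Leave-one-out: each item is the test set once, the other n-1 are training."""
    for i in range(n_items):
        yield [j for j in range(n_items) if j != i], [i]


def mean_accuracy(splits, evaluate):
    """Average the accuracy returned by evaluate(train, test) over all splits."""
    scores = [evaluate(train, test) for train, test in splits]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(1718, k=5))
print([len(test) for _, test in splits])
```

With 1718 items, the five test groups differ in size by at most one protein.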

Calculation of specificity, precision, F-measure and accuracy
We have used different measures, such as sensitivity, specificity, precision, F-measure and accuracy, to assess the performance of discriminating channels/pores, electrochemical and active transporters. The sensitivity shows the fraction of a specific class of transporters that is correctly predicted, and the accuracy indicates the overall assessment. The F-measure is the harmonic mean of sensitivity and precision, 1/F = [(1/Sensitivity) + (1/Precision)]/2. These terms are defined as follows:
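A minimal sketch of these measures is given below, assuming the standard one-vs-rest definitions: sensitivity = TP/(TP+FN), precision = TP/(TP+FP), and accuracy as the fraction of all proteins assigned to their correct class. The nested-dictionary confusion matrix and the counts in the example are made up for illustration.

```python
def per_class_metrics(confusion, label):
    """Sensitivity, precision and F-measure for one class (one-vs-rest).

    confusion[actual][predicted] holds the number of proteins.
    """
    tp = confusion[label].get(label, 0)
    fn = sum(confusion[label].values()) - tp  # members of the class that were missed
    fp = sum(row.get(label, 0) for actual, row in confusion.items() if actual != label)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if sensitivity == 0.0 or precision == 0.0:
        f_measure = 0.0
    else:
        # Harmonic mean: 1/F = [(1/Sensitivity) + (1/Precision)]/2
        f_measure = 2.0 / (1.0 / sensitivity + 1.0 / precision)
    return sensitivity, precision, f_measure


def overall_accuracy(confusion):
    """Fraction of all proteins assigned to their correct class."""
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


# Made-up counts for two classes, for illustration only.
confusion = {"A": {"A": 8, "B": 2}, "B": {"A": 1, "B": 9}}
sens, prec, f = per_class_metrics(confusion, "A")
acc = overall_accuracy(confusion)
print(round(sens, 3), round(prec, 3), round(f, 3), round(acc, 3))
```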

Different machine learning algorithms used for discrimination
We have analyzed several machine learning techniques implemented in WEKA program [13] for discriminating membrane transporters from other proteins and classifying them into channels/pores, electrochemical and active transporters. This program includes several methods based on Bayes function, Neural network, Logistic function, Support vector machine, Regression analysis, Nearest neighbor, Meta learning, Decision tree and Rules. The details of all these methods are available in our earlier articles [9,10] as well as in the book on data mining [13].
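As an illustration of one of the listed approaches, a nearest-neighbor classifier over 20-dimensional composition or occurrence vectors can be sketched in a few lines. This is not the WEKA implementation; the Euclidean distance, the value of k, the majority vote and the toy two-dimensional vectors are assumptions chosen for brevity.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(vec, query), label)  # Euclidean distance (Python 3.8+)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-dimensional vectors standing in for 20-dimensional feature vectors.
train = [[10, 0], [9, 1], [0, 10], [1, 9]]
labels = ["x", "x", "y", "y"]
predicted = knn_predict(train, labels, [8, 2], k=3)
print(predicted)
```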

Amino acid composition for the 20 amino acid residues in different transporters
The amino acid compositions of the 20 amino acid residues in channels/pores, electrochemical and active transporters have been computed using Eqn. 1 and the results are presented in Table 1. Although several residues showed differences in their compositions, only a few residues have a difference of more than one (|Difference| > 1) among the three classes of transporters. The residue Asn is dominant in channels/pores among all the transporters. Interestingly, Asn plays an important role in the stability and function of β-barrel membrane proteins [4,14]. The structural analysis of the outer membrane cobalamin transporter protein (BtuB), which transports substrates across the outer membrane, showed that the residues Asn185 and Asn276 are important for the stability of the upper surface of the cyanocobalamin (vitamin B12; CN-Cbl) binding pocket [15,16], which is important for its function. In the glycerol facilitator protein, the residues Asn68 and Asn203 contribute to stability by making hydrogen bonds to form helical polar strips that connect the periplasmic and cytoplasmic vestibules [17]. Glu is another amino acid that shows a difference of more than one with electrochemical transporters. It has been shown that the residues Glu166 and Glu148 are important for the channel function in ClC chloride channel proteins [18]. The compositions of the residues Ala, Ile and Leu in channels/pores are the lowest among the three transporters. Other hydrophobic residues also show a similar tendency. This might be because the other two families are dominated by hydrophobic residues, owing to the presence of mainly transmembrane helical proteins.
The residues Phe and Leu are dominant in electrochemical transporters. In addition, the compositions of Ala, Ile, Val and Trp are higher in this class of proteins than in the other two transporters. Interestingly, in the glycerol-3-phosphate transporter the space between helices 1 and 7 is filled by nine aromatic side chains, and the occurrence of bulky aromatic residues helps to close the pore completely [19]. In lactose permease the substrate binding site is composed of residues that include Trp151 [20]. The higher occurrence of hydrophobic residues is due to the presence of long stretches of these residues in the membrane spanning segments of α-helical membrane proteins. The electrochemical transporters are mainly multiple-spanning transmembrane helical proteins, which increases the occurrence of hydrophobic residues. On the other hand, the charged residues showed the lowest composition in this class of proteins. The compositions of the residues Asp, Glu and Lys are much lower than in other transporters, and Arg is also a less favored residue. However, the analysis of three-dimensional structures showed that these charged residues are important for function. The residues Asp407, Asp480 and Lys940 are important for drug resistance in the bacterial multidrug efflux transporter [21], and the charged residues E126, R144 and E269 are found in the substrate binding sites of lactose permease [20].
In active transporters, none of the residues has the highest or lowest occurrence. All the residues have compositions that lie between those of channels/pores and electrochemical transporters. However, Glu, Gln, Phe, Arg and Lys are close to channels/pores whereas Ala, Asn, Thr and Tyr are close to electrochemical transporters. The structural analysis of the high-potential iron-sulfur protein shows that the electron transfer is mainly achieved by hydrophobic interactions [22]. In addition, aromatic residues act as binding site residues in the vitamin B12 binding protein [23].

Structural analysis of transporters
We have analyzed the three-dimensional structures of transporters deposited in TCDB and derived the propensity of the 20 amino acid residues to be in the membrane part. This has been computed as the ratio between the occurrence of each amino acid residue in the membrane part and the occurrence of the respective residue in the whole protein. The results obtained for channels/pores, electrochemical and active transporters are presented in Table 2. We observed that the membrane propensities of amino acid residues in channels/pores, electrochemical and active transporters are partially reflected in their amino acid compositions. In particular, the residues Asn and Tyr are dominant in channels/pores, the propensities of residues in active transporters are not the highest among the three transporters, and hydrophobic residues have high propensity in electrochemical transporters. We noticed that the transporters have 52-59% of their residues in the membrane part. It is noteworthy that the number of protein structures used to carry out the analysis is limited (a representative set of 22, 3 and 13 proteins in channels/pores, electrochemical and active transporters, respectively), and hence the results may change slightly when more proteins are included in the analysis.
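The propensity ratio can be sketched as follows. The sketch assumes that counts are pooled over all proteins of a class before the ratio is taken and that membrane segments are given as non-overlapping, 0-based, half-open (start, end) positions; neither detail is specified in the text, and the example sequence is made up.

```python
from collections import Counter


def membrane_propensity(proteins):
    """Ratio of membrane-part occurrence to whole-protein occurrence per residue.

    `proteins` is an iterable of (sequence, membrane_segments) pairs, where each
    segment is a 0-based, half-open (start, end) slice of the sequence.
    Counts are pooled over all proteins (an assumption of this sketch).
    """
    in_membrane = Counter()
    overall = Counter()
    for sequence, segments in proteins:
        overall.update(sequence)
        for start, end in segments:
            in_membrane.update(sequence[start:end])
    return {aa: in_membrane[aa] / overall[aa] for aa in overall}


# Made-up example: the first four residues lie in the membrane part.
propensity = membrane_propensity([("LLLLKKAL", [(0, 4)])])
print(propensity)
```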

Performance of different machine learning techniques for discriminating channels/pores, electrochemical and active transporters
We have analyzed the performance of different machine learning methods for discriminating channels/pores, electrochemical and active transporters, and the results obtained with amino acid composition are presented in Table 3. We observed that the sensitivity, precision and F-measure for electrochemical transporters are better than those for the other two classes of proteins. In addition, we have tested the performance of the present method with the jack-knife test, and the results obtained with the neural network are shown in Table 3. We noticed that the jack-knife test and 5-fold cross-validation showed similar results, with a difference of 1.8%. We have also carried out the computations with the same number of data in each class of transporters (502 proteins each) and observed that the net accuracy (66%) is marginally better than that obtained with the original dataset.
Further, this analysis showed a moderate difference in the performance of different machine learning methods (the accuracy varies from 60% to 66% in most of the methods). The differences in prediction results might be due to the different adjustable parameters used in these methods.

Influence of chain length for discrimination
The performance of different machine learning methods for discriminating channels/pores, electrochemical and active transporters with amino acid occurrence as features has been analyzed and the results are presented in Table 4. We observed that the average accuracy improved to 68% using the neural network with amino acid occurrence. It has been shown that the neural network is an efficient method for discriminating β-barrel membrane proteins [9,10]. The sensitivity is 0.55, 0.70 and 0.76 for channels/pores, electrochemical and active transporters, respectively. The precision is 0.70, 0.78 and 0.62, and the F-measure is 0.61, 0.74 and 0.68. In addition, we have tested the performance of the present method with the jack-knife test, and the results obtained with the neural network are shown in Table 4. We noticed that the jack-knife test and 5-fold cross-validation showed similar results, with a difference of 2.7%. We have also carried out the computations with the same number of data in each class of transporters (502 proteins each) and observed that the net accuracy (68%) is similar to that obtained with the original dataset.
The comparison of the results presented in Tables 3 and 4 reveals that amino acid occurrence is better than composition for discriminating transporters. Recently, a similar trend has been reported for discriminating different folding types of globular proteins [24]. These studies indicate the importance of chain length for discrimination, in that normalization by chain length reduced the prediction accuracy.
When the performance of different machine learning methods is compared, several methods showed poor sensitivity for channels/pores with occurrence, unlike with amino acid composition. For example, Naïve Bayes showed sensitivities of 0.20 and 0.76 for channels/pores and electrochemical transporters, respectively. However, several methods (e.g. k-nearest neighbor, bagging and neural network) performed well, with similar sensitivity in all three classes of transporters.
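The quoted figures can be cross-checked against one another with a few lines of arithmetic. The sketch below assumes the F-measure is the harmonic mean defined earlier and that the overall accuracy equals the class-size-weighted mean of the sensitivities; the inputs are the two-decimal values quoted in the text, so agreement is expected only to within rounding.

```python
# Values quoted in the text for the neural network with amino acid occurrence.
class_sizes = {"channels/pores": 510, "electrochemical": 502, "active": 706}
sensitivity = {"channels/pores": 0.55, "electrochemical": 0.70, "active": 0.76}
precision = {"channels/pores": 0.70, "electrochemical": 0.78, "active": 0.62}


def f_measure(sens, prec):
    """Harmonic mean of sensitivity and precision."""
    return 2.0 / (1.0 / sens + 1.0 / prec)


f_values = {name: f_measure(sensitivity[name], precision[name]) for name in class_sizes}

# Overall accuracy = correctly predicted proteins / all proteins.
correct = sum(sensitivity[name] * class_sizes[name] for name in class_sizes)
accuracy = correct / sum(class_sizes.values())

for name in class_sizes:
    print(f"{name}: F = {f_values[name]:.3f}")
print(f"overall accuracy = {accuracy:.3f}")
```

The recomputed F-measures come out at roughly 0.62, 0.74 and 0.68 and the overall accuracy at roughly 0.68, in line with the reported values given the rounding of the inputs.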

Comparison between the present method and the results obtained with BLAST search
We have analyzed the capability of BLAST to discriminate the three different types of transporters based on homology search. For each protein we have computed the sequence identity with all proteins in the three classes of transporters and assigned it to the group that has the highest sequence identity or best e-value. The calculations have been repeated for all 1718 proteins and the overall accuracy has been computed. This method showed an accuracy of 51.6% in discriminating channels/pores, electrochemical and active transporters. Our method showed an accuracy of 68%, which is superior to that of the simple BLAST search.
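The best-hit assignment can be sketched as below, starting from precomputed pairwise identities rather than running BLAST itself. Excluding the query's hit to itself and counting proteins without any hit as incorrect are assumptions of this sketch; the identifiers and identity values are made up.

```python
def assign_by_best_hit(hits, labels):
    """Assign each query the class of its most similar other protein.

    hits:   {query_id: [(subject_id, percent_identity), ...]}
    labels: {protein_id: class_name}
    The self-hit is ignored (an assumption); queries with no hit get None.
    """
    predictions = {}
    for query, query_hits in hits.items():
        candidates = [
            (identity, subject)
            for subject, identity in query_hits
            if subject != query and subject in labels
        ]
        predictions[query] = labels[max(candidates)[1]] if candidates else None
    return predictions


def best_hit_accuracy(predictions, labels):
    """Fraction of queries whose assigned class matches the true class."""
    correct = sum(1 for q, p in predictions.items() if p == labels[q])
    return correct / len(predictions)


# Made-up identifiers and identities, for illustration only.
labels = {"q1": "channels/pores", "q2": "channels/pores", "q3": "active"}
hits = {
    "q1": [("q1", 100.0), ("q2", 18.0), ("q3", 12.0)],
    "q2": [("q1", 18.0), ("q3", 19.5)],
    "q3": [],
}
predictions = assign_by_best_hit(hits, labels)
print(predictions, best_hit_accuracy(predictions, labels))
```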

Discrimination between two different classes of transporters
The amino acid composition of active transporters lies between those of channels/pores and electrochemical transporters (Table 1). Hence, we have examined the discrimination performance between pairs of transporter classes to test whether the discrimination accuracy is the highest between channels/pores and electrochemical transporters. The results are presented in Table 5. As expected, the difference in amino acid compositions is reflected in the performance of discrimination. The amino acid occurrence could discriminate channels/pores and electrochemical transporters with an accuracy of 87%. The discrimination accuracy is 73% between channels/pores and active transporters, and 81% between electrochemical and active transporters. As discussed in previous sections, the discrimination accuracy with amino acid composition is lower than that obtained with occurrence. However, we observed the same trend: channels/pores and electrochemical transporters are discriminated with the highest accuracy.

Discrimination of channels and pores
Proteins in the category of channels/pores have transmembrane channels, which consist of α-helical and β-strand spanning segments [11]. Hence, we have tested different machine learning algorithms to discriminate the channels (mainly α-helices) and pores (mainly β-strands). The results obtained with amino acid composition are shown in Table 6. We found that most of the machine learning methods discriminated the channels and pores with an accuracy in the range of 88-92%. The neural network and support vector machine showed the highest accuracy of 92.4%. The sensitivity and specificity are 93% and 92%, respectively, using the neural network. We observed a similar level of accuracy using amino acid occurrence. The classification via regression and logistic function methods discriminated the channels and pores with an accuracy of 90%. The similar performance with amino acid composition and occurrence might be due to the difference in amino acid residues in the membrane spanning regions of α-helical and β-barrel membrane proteins. The α-helical membrane proteins are dominated by stretches of hydrophobic residues, whereas polar and charged residues are interspersed in the membrane spanning segments of β-barrel membrane proteins. The high accuracy obtained for discriminating channels and pores is consistent with other methods for discriminating α-helical/β-barrel membrane proteins [3][4][5][6][7][8][9][10].

Discrimination of transporters from other membrane proteins and all other proteins
We have developed a dataset of 3336 membrane proteins with less than 20% sequence identity from the SWISS-PROT database, which includes receptors and all other types of α-helical and β-barrel membrane proteins except transporters. Using this dataset of 3336 non-transporters and the 1718 transporters, we have analyzed the performance of different machine learning algorithms; the k-nearest neighbor method could discriminate the transporters with a 5-fold cross-validation accuracy of 79.1%. The sensitivity and specificity are 69.2% and 84.2%, respectively. Further, we have repeated the computations with equal numbers of transporters and non-transporters and obtained an accuracy of 85.0%. The jack-knife test showed results similar to those obtained with the 5-fold cross-validation method.
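The relation between the reported sensitivity, specificity and overall accuracy on this unbalanced dataset can be checked with a short calculation. The sketch assumes that transporters are the positive class, so that the accuracy is the class-size-weighted mean of sensitivity and specificity.

```python
# Figures quoted in the text; transporters are taken as the positive class (assumption).
n_transporters = 1718
n_non_transporters = 3336
sensitivity = 0.692  # fraction of transporters correctly identified
specificity = 0.842  # fraction of non-transporters correctly identified

true_positives = sensitivity * n_transporters
true_negatives = specificity * n_non_transporters
accuracy = (true_positives + true_negatives) / (n_transporters + n_non_transporters)

print(f"overall accuracy = {100 * accuracy:.1f}%")
```

Because non-transporters outnumber transporters roughly two to one, the overall accuracy sits closer to the specificity than to the sensitivity.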
In addition, we have set up a dataset for 5048 proteins, which include membrane transport proteins and other membrane and globular proteins. We obtained a 5-fold cross-validation accuracy of 78.7% in discriminating transporters and non-transporters. Further, we have used the same number of proteins in transporters and non-

Discrimination on the web
We have developed web servers for (i) discriminating membrane transport proteins from all other membrane and globular proteins [25] and (ii) distinguishing channels/pores, electrochemical and active transporters [26]. These servers take the amino acid sequence as input and predict whether the protein is membrane transporter or not, and the type of the membrane transport protein.
Both servers are freely accessible from our web site [27].

Applications of the present method for new sequences
The following procedure may be used to detect the type of a new protein. First, the new sequence can be identified as a transporter or non-transporter using the discrimination method described in the previous section. It has been shown that transporters and non-transporters are discriminated with the highest accuracy of 82%. A transporter can then be further classified as a channel/pore, electrochemical or active transporter with an accuracy of 68%. Alternatively, several methods have been reported in the literature for discriminating globular proteins from α-helical [3,[28][29][30] or β-barrel [4-10,31] membrane proteins. These methods can be used to detect the membrane proteins. Membrane proteins of any kind can be classified into transporters and non-transporters with a maximum accuracy of 85%, and the transporters can be further classified into three groups. Hence, the two-way/three-way prediction system can be used to detect different types of transporters in genomic sequences. The work on the integration of prediction methods is in progress.
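The two-step procedure can be expressed as a small wrapper around two classifiers. In this sketch `is_transporter` and `transporter_class` are hypothetical callables standing in for trained models; the toy rules used in the example are placeholders with no biological meaning.

```python
def classify_sequence(sequence, is_transporter, transporter_class):
    """Two-step prediction: transporter or not, then the transporter class.

    `is_transporter` and `transporter_class` are hypothetical callables that
    stand in for trained classifiers; they are not part of any published API.
    """
    if not is_transporter(sequence):
        return "non-transporter"
    return transporter_class(sequence)


# Placeholder rules used only to exercise the control flow.
def toy_is_transporter(seq):
    return seq.count("L") >= 2


def toy_transporter_class(seq):
    return "channels/pores" if "N" in seq else "electrochemical"


result_a = classify_sequence("MKT", toy_is_transporter, toy_transporter_class)
result_b = classify_sequence("MLLNA", toy_is_transporter, toy_transporter_class)
print(result_a, result_b)
```

Keeping the two steps as separate callables means that either classifier can be replaced without touching the other.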

Conclusion
We have systematically analyzed the amino acid compositions of channels/pores, electrochemical and active transporters and revealed the similarities and differences among them. Different machine learning algorithms have been tested to discriminate these transporters and we achieved the highest accuracy of 68% using neural network with amino acid occurrence. Further, we have examined the discrimination performance between two classes of transporters, which showed the highest accuracy of 87% between channels/pores and electrochemical transporters. In addition, the channels and porins are discriminated with the accuracy of 92%. On the other hand, the transporters and other membrane proteins/all other globular and membrane proteins are discriminated with the accuracy of 85% and 82%, respectively. We suggest that this method could be effectively used to discriminate transporters and different classes of transporters in genomic sequences.

Authors' contributions
MMG conceived the project, carried out the computations and analysis, and wrote the manuscript. YY prepared the