Discrimination of outer membrane proteins with improved performance
- Changhui Yan1,
- Jing Hu1 and
- Yingfeng Wang1
https://doi.org/10.1186/1471-2105-9-47
© Yan et al; licensee BioMed Central Ltd. 2008
Received: 29 November 2007
Accepted: 24 January 2008
Published: 24 January 2008
Abstract
Background
Outer membrane proteins (OMPs) perform diverse functional roles in Gram-negative bacteria. Identification of outer membrane proteins is an important task.
Results
This paper presents a method for distinguishing outer membrane proteins (OMPs) from non-OMPs (that is, globular proteins and inner membrane proteins (IMPs)). First, we calculated the average residue compositions of OMPs, globular proteins and IMPs separately using a training set. Then, for each protein in the test set, its distances to the three groups were calculated from its residue composition using a weighted Euclidean distance (WED). Proteins in the test set were classified as OMP or non-OMP according to the smallest distance. The proposed method distinguishes OMPs from non-OMPs with 91.0% accuracy and a Matthews correlation coefficient (MCC) of 0.639. We then improved the method by including homologous sequences in the calculation of residue composition and by using a feature-selection method to select the single residues and di-peptides that are useful for OMP prediction. The final method achieves an accuracy of 96.8% with an MCC of 0.859. In direct comparisons, the proposed method outperforms previously published methods.
Conclusion
The proposed method can identify OMPs with improved performance. It will be very helpful for the discovery of OMPs on a genome scale.
Background
Outer membrane proteins (OMPs) perform diverse functional roles, including bacterial adhesion, structural integrity of the cell wall, and material transport [1–3]. The membrane-spanning regions of OMPs form a characteristic β-barrel. Discriminating OMPs from other proteins and identifying the membrane-spanning β-barrels in them are crucial for many studies. Unlike α-helical membrane proteins, which can be easily identified from their long stretches of hydrophobic residues, OMPs are more difficult to predict, mainly because their membrane-spanning regions are shorter and vary more in their properties [3]. Nevertheless, several methods have been proposed for this task. Gnanasekaran et al. [4] used profiles developed from structure-based alignments of porins to identify OMPs. Wimley [5] analyzed the structures of 15 non-redundant OMPs and developed a method to identify OMPs based on residue composition and structural features, such as the rise-per-residue of the β strands and loop length. Martelli et al. [6], Bagos et al. [7, 8], and Bigelow and Rost [9] used hidden Markov models (HMMs) to predict the topology of OMPs and discriminate OMPs from globular proteins. Liu et al. [10] developed a method that combines the residue composition of membrane-spanning regions and predicted secondary structure to identify OMPs. Natt et al. [11] used artificial neural network (ANN) and support vector machine (SVM) methods to identify β-barrels in OMPs using primary sequence, evolutionary information and physicochemical parameters as input. Their method also achieved success in discriminating OMPs. Garrow et al. [12, 13] developed a method for the discrimination of OMPs in genomes using a K-nearest neighbor method. Berven et al. [14] developed the BOMP method, which predicts OMPs by combining a pattern search, a β-barrel score, and a filter that exploits the abundance of asparagine and isoleucine in the protein.
Gromiha and Suwa [15] developed a simple statistical method to identify OMPs based on amino acid composition. Later, they extended the approach by adding residue pair information and used an SVM-based method to identify OMPs with improved performance [16].
In this study, we propose a simple method that discriminates OMPs from non-OMPs using a weighted Euclidean distance (WED) calculated from residue composition. Our results show that this method achieves 96.8% accuracy with an MCC of 0.859. In direct comparisons, the proposed method outperforms previously published methods.
Results
Discrimination between OMPs and non-OMPs
Table 1. Performance of the proposed method and comparison with previous methods available as on-line servers
Method | Variant | MCC | Accuracy (%) | Sensitivity (%) | Specificity (%)
---|---|---|---|---|---
WED a | Single b | 0.639 | 91.0 | 77.2 | 92.9
WED a | Homologous c | 0.648 | 91.4 | 76.3 | 93.5
WED a | Homologous + feature selection d | 0.859 | 96.8 | 90.7 | 97.6
BOMP (Berven et al., 2004) | | 0.817 | 96.2 | 79.8 | 98.5
PROFtmb (Bigelow and Rost, 2006) | | 0.583 | 92.3 | 37.0 | 100
TMB-Hunt (Garrow et al., 2005) | | 0.828 | 96.4 | 81.5 | 98.5
Comparisons with previously published methods
We compared the proposed method with previously published methods. As discussed in Baldi et al. [17], when the two classes in a two-class classification contain unequal numbers of examples, MCC is a better measure of classification performance than accuracy. In the discrimination of OMPs from non-OMPs, the two classes are unequal in size. Therefore, we use MCC as the primary measure when comparing methods. We also report accuracy, specificity, and sensitivity.
BOMP [14], TMB-Hunt [12, 13] and PROFtmb [9] are three top-scoring on-line servers that can discriminate OMPs. TMB-Hunt is based on the K-nearest neighbor method, BOMP combines a pattern search with a β-barrel score and a residue-abundance filter, and PROFtmb is based on a hidden Markov model (HMM). We compared the proposed method with these methods by submitting the datasets used in this study to the servers. The comparisons (Table 1, rows 4–7) show that the proposed method outperforms all the other methods. It is worth pointing out that the datasets used in this study are likely to overlap substantially with the datasets that were used to train the BOMP, TMB-Hunt and PROFtmb servers. Thus, when we evaluated these methods by submitting our datasets to their web servers, their performance might have been overestimated. Remarkably, our method still outperforms the others under this condition.
Comparisons with other published methods
Method | MCC | Accuracy (%) | Sensitivity (%) | Specificity (%)
---|---|---|---|---
WED (homologous + feature selection) a | 0.894 | 97.4 | 91.1 | 98.4
Deviation Distance [15] | 0.541 | 82.4 | 78.8 | 83.3
Neural Network [19] b | 0.716 | 91.0 | 79.3 | 93.8
Support Vector Machine [16] c | 0.816 | 93.9 | 90.9 | 94.7
Receiver Operating Characteristic (ROC) Curve
ROC curve of the proposed method.
Identification of OMPs in the Proteome of E. coli
We applied the proposed method to search for OMPs in the proteome of E. coli using α = -0.05, which corresponds to 98% specificity on the ROC curve. The E. coli proteome consists of 4,319 proteins, of which 107 (2.5% of the proteome) were predicted to be OMPs. This ratio is consistent with the previous estimate that 2–3% of the genes in Gram-negative bacteria encode OMPs [2]. Among these 107 proteins, 49 are annotated as OMPs in Swiss-Prot [20] or ePSORTdb [21], a database of protein subcellular locations that have been determined by laboratory experiments, and 15 share very high similarity with OMPs in the training set (E ≤ 0.0001 in a BLAST comparison). We are therefore highly confident that these 64 hits are true positives. In addition, 13 proteins are annotated with "Membrane", "Cell membrane" or "Multi-pass membrane protein" in Swiss-Prot. We submitted these proteins to TMHMM [22], a server for predicting the topology of transmembrane α-helical proteins, and PSORTb [23], a server for predicting subcellular locations. None of them was predicted to be a transmembrane α-helical protein (that is, an inner membrane protein) by both methods. Thus, most of these 13 proteins are very likely OMPs. The remaining 30 proteins may be new OMPs that have not been previously discovered.
We also compared our method's predictions with the proteome scanning results obtained by BOMP [14]. We chose BOMP for this comparison because its predictions for E. coli proteins are available on its server. In the E. coli proteome, BOMP predicted 103 OMPs. Of these, 59 proteins were predicted to be OMPs by both our method and BOMP. A further 48 proteins were predicted to be OMPs by our method but not by BOMP. Among them, 15 are true positives, and 9 more are annotated with "Membrane", "Cell membrane" or "Multi-pass membrane protein" in Swiss-Prot. Conversely, 44 proteins were predicted to be OMPs by BOMP but not by our method. Among them, 16 are true positives, and 7 more are annotated with "Membrane", "Cell membrane" or "Multi-pass membrane protein" in Swiss-Prot. This comparison shows that the predictions of the proposed method and BOMP overlap substantially. It also shows that each method can identify some OMPs missed by the other, which suggests that combining the two methods could achieve better performance. Another possible way to improve performance is to combine the current method with methods that predict signal peptides, since OMPs contain a signal peptide that directs them to the outer membrane.
Discussion
Simple Methods versus Complicated Methods
It is estimated that 2–3% of the genes in Gram-negative bacteria encode OMPs [2]. Identifying all OMPs (the "OMPome") in a bacterial genome is an urgent and challenging task. Compared with more complicated methods that have been used to identify OMPs, such as the k-nearest neighbor method, neural networks and SVMs, one merit of the proposed method is its simplicity and speed. The training data set is read only once, and the residue composition and the weighted Euclidean distance (WED) are fast to compute. The method proposed here will be very helpful for the discovery of the "OMPome" on a genome scale.
Euclidean distance versus weighted Euclidean distance
We used a WED (i.e., $\sqrt{\sum_i (\bar{x}_i - x_i)^2 / \bar{x}_i}$, where $x_i$ is the composition of residue $i$ in the test protein and $\bar{x}_i$ is its average composition in the group) to discriminate OMPs from non-OMPs. Our results show that this method achieves better performance than a published method that discriminates OMPs from non-OMPs based on a deviation distance (i.e., $\sum_i |\bar{x}_i - x_i|$) [15]. In this study, we also tried the Euclidean distance (i.e., $\sqrt{\sum_i (\bar{x}_i - x_i)^2}$) instead of the WED, but its performance was not as good. Compared with the Euclidean distance and the deviation distance, the WED better reveals the relation between a protein and a group. Intuitively, for the same amount of difference (i.e., $|\bar{x}_i - x_i|$), the difference becomes more significant as $\bar{x}_i$ becomes smaller. For example, take a difference of 0.01 (i.e., $|\bar{x}_i - x_i| = 0.01$). If the composition of residue $i$ in OMPs is 90% (i.e., $\bar{x}_i = 0.90$ and $x_i = 0.89$), then 0.01 does not imply a significant difference between the test protein and OMPs. But if $\bar{x}_i = 0.001$ (so that $x_i = 0.011$), then $|\bar{x}_i - x_i| = 0.01$ suggests a significant difference between the test protein and OMPs.
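The arithmetic behind this example can be written out for a single residue type $i$. The sketch below assumes that each squared difference in the WED is weighted by $1/\bar{x}_i$; the numbers are those of the example above.

```latex
\[
\frac{(\bar{x}_i - x_i)^2}{\bar{x}_i}
  = \frac{(0.01)^2}{0.90} \approx 1.1 \times 10^{-4}
  \quad \text{for } \bar{x}_i = 0.90,
\qquad
\frac{(\bar{x}_i - x_i)^2}{\bar{x}_i}
  = \frac{(0.01)^2}{0.001} = 0.1
  \quad \text{for } \bar{x}_i = 0.001 .
\]
```

Under this assumption, the same absolute difference contributes 900 times more to the squared distance when the residue is rare in the group, whereas the unweighted Euclidean term $(0.01)^2 = 10^{-4}$ is identical in both cases.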
Conclusion
In summary, this paper presents a simple method that can discriminate outer membrane proteins (OMPs) from non-OMPs with high performance: 96.8% accuracy and an MCC of 0.859. Direct comparisons show that the proposed method outperforms previously published methods. In addition to its high accuracy and MCC, the proposed method is very simple and can be easily applied to genomic data on a large scale.
Methods
Datasets
We compiled a set of outer membrane proteins (OMPs) that have been experimentally confirmed. It includes 118 proteins that are classified as "Transmembrane beta-barrels" in the SCOP database [24] and 188 proteins from the "β-Barrel porins" subclass in the Transporter Classification Database [25]. We removed redundant proteins so that the mutual identity in the dataset was less than 25%. First, the proteins were clustered by running BLASTCLUST with parameters "-S 25 -L 0.9 -b F". This step ensured that any two proteins from different clusters shared less than 25% identical residues over 90% coverage of either protein. Then, one protein was chosen from each cluster. Proteins with fewer than 50 amino acids and proteins that were not from Gram-negative bacteria were also removed. The final dataset consists of 119 OMPs. Globular proteins and α-helical membrane proteins (inner membrane proteins, IMPs) were obtained from a previous study by Park et al. [16]. We filtered these datasets so that the identity between any two proteins was less than 25%. After the filtering, 673 globular proteins and 178 IMPs were left.
Residue composition
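The residue composition of a protein was calculated using $x_i = n_i / \sum_{j=1}^{20} n_j$, where $n_i$ was the number of residues of type $i$ in the protein. The average residue composition of OMPs was calculated using $\bar{x}_{i,\mathrm{omp}} = n_{i,\mathrm{omp}} / \sum_{j=1}^{20} n_{j,\mathrm{omp}}$, where $n_{i,\mathrm{omp}}$ was the total number of residues of type $i$ in OMPs. The average residue composition of globular proteins was calculated using $\bar{x}_{i,\mathrm{glo}} = n_{i,\mathrm{glo}} / \sum_{j=1}^{20} n_{j,\mathrm{glo}}$, where $n_{i,\mathrm{glo}}$ was the total number of residues of type $i$ in globular proteins. The average residue composition of inner membrane proteins was calculated using $\bar{x}_{i,\mathrm{imp}} = n_{i,\mathrm{imp}} / \sum_{j=1}^{20} n_{j,\mathrm{imp}}$, where $n_{i,\mathrm{imp}}$ was the total number of residues of type $i$ in inner membrane proteins.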
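As an illustration, the two calculations can be sketched in Python. This is a minimal sketch rather than the released software: the function names are hypothetical, sequences are assumed to be plain one-letter strings, and residues outside the 20 standard types (for example X) are assumed to be ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def residue_composition(sequence):
    """Return x_i = n_i / sum_j n_j for a single protein sequence."""
    counts = Counter(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def group_composition(sequences):
    """Return the average composition of a group from pooled residue counts."""
    pooled = Counter()
    for seq in sequences:
        pooled.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(pooled.values())
    if total == 0:
        raise ValueError("group contains no standard residues")
    return {aa: pooled[aa] / total for aa in AMINO_ACIDS}
```

Because the group average is computed from pooled counts, longer proteins contribute more residues to it than shorter ones, which follows the "total number of residues of type i" wording above.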
Weighted Euclidean distance (WED)
For each test protein, its distance to OMPs was calculated using $D_{\mathrm{omp}} = \sqrt{\sum_{i=1}^{20} (\bar{x}_{i,\mathrm{omp}} - x_i)^2 / \bar{x}_{i,\mathrm{omp}}}$, where $x_i$ was the composition of residue type $i$ in the test protein and $\bar{x}_{i,\mathrm{omp}}$ was the average composition of residue type $i$ in OMPs. Note that $\sqrt{\sum_{i=1}^{20} (\bar{x}_{i,\mathrm{omp}} - x_i)^2}$ gives the Euclidean distance between the OMP group and the test protein. In this study, we weighted each term inside the summation with $1/\bar{x}_{i,\mathrm{omp}}$. Thus, we call $D_{\mathrm{omp}}$ a weighted Euclidean distance (WED). The WED between a test protein and globular proteins ($D_{\mathrm{glo}}$) and the WED between a test protein and inner membrane proteins ($D_{\mathrm{imp}}$) were calculated in a similar way.
Classification
Proteins were classified into the three groups based on the least WED. A test protein was predicted to be an OMP if D omp ≤ D imp and D omp ≤ D glo ; otherwise, it was predicted to be non-OMP.
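The distance and the decision rule described above can be sketched as follows. This is an illustrative sketch, not the released software: the function names are hypothetical, compositions are assumed to be dictionaries keyed by feature name, each squared difference is assumed to be weighted by the reciprocal of the group average, and features whose group average is zero are skipped to avoid division by zero.

```python
import math


def weighted_euclidean_distance(x, group_mean):
    """WED between a protein composition x and a group's average composition.

    Assumption: each squared difference is weighted by 1 / group_mean[i].
    Features with a zero group average are skipped (an assumption made here
    only to avoid dividing by zero).
    """
    total = 0.0
    for feature, mean in group_mean.items():
        if mean > 0:
            diff = mean - x.get(feature, 0.0)
            total += diff * diff / mean
    return math.sqrt(total)


def classify(x, omp_mean, imp_mean, glo_mean):
    """Return 'OMP' if no other group is closer than the OMP group."""
    d_omp = weighted_euclidean_distance(x, omp_mean)
    d_imp = weighted_euclidean_distance(x, imp_mean)
    d_glo = weighted_euclidean_distance(x, glo_mean)
    return "OMP" if d_omp <= d_imp and d_omp <= d_glo else "non-OMP"
```

Ties are resolved in favour of the OMP class, matching the "≤" in the rule above.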
Five-fold cross-validations
Five-fold cross-validations were used to evaluate the proposed method. The overall dataset was divided into five subsets. OMPs, globular proteins and IMPs were distributed into the subsets evenly. In each round of experiment, four subsets were used as the training set and the remaining subset was used as the test set. This procedure was repeated five times with each subset being used as test set once. The average performance was reported.
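One way to build such class-balanced folds is sketched below. The source does not describe how proteins were assigned to subsets, so the shuffling, the seed and the round-robin assignment are assumptions, and the function names are hypothetical.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each example index to one of k folds, class by class.

    labels: one class label per protein (e.g. 'OMP', 'GLO', 'IMP').
    Within each class, fold sizes differ by at most one.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test
```

With the dataset sizes reported above (119 OMPs, 673 globular proteins and 178 IMPs), each test subset would contain 23 or 24 OMPs.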
Including homologous sequences into the calculation of residue composition
For each protein, the BLAST program [26] was used to search for homologous sequences in the NCBI non-redundant database using an E-value threshold of 0.0001. The 50 best hits were chosen from the returned results. If fewer than 50 hits were returned, all of them were chosen. These proteins plus the query protein were used to calculate the residue composition for the query protein.
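The composition step can be sketched as below, leaving the BLAST search itself aside. The text does not say whether residue counts were pooled across sequences or per-sequence compositions were averaged; this sketch assumes pooled counts. The function name is hypothetical, and `hits` is assumed to be a list of sequences already filtered by E-value and sorted best-first.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_HITS = 50  # at most the 50 best hits are kept


def composition_with_homologs(query, hits):
    """Residue composition of a query pooled with its best homologous hits.

    Assumption: residue counts of the query and the retained hits are
    pooled before normalising, so longer sequences contribute more.
    """
    pooled = Counter()
    for seq in [query] + list(hits[:MAX_HITS]):
        pooled.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(pooled.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: pooled[aa] / total for aa in AMINO_ACIDS}
```

When a query has no hits, the function reduces to the composition of the query alone.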
Feature selection
We extended the proposed method by including the composition of di-peptides. We used a feature selection approach to search for the residues and di-peptides that are useful for OMP prediction. For the feature selection, we used the BestFirst search method in the Weka package [27]. The feature selection was conducted with a bi-directional search, starting from a set that included the 20 amino acids. In the end, we obtained a set of features that includes the compositions of 14 amino acids and 130 di-peptides.
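The di-peptide features can be computed along the following lines. This is a sketch under stated assumptions: di-peptides are taken to be overlapping pairs of adjacent residues, the composition is normalised by the number of valid pairs, and the function names are hypothetical. The selected feature list itself (14 amino acids and 130 di-peptides) is not given in the text, so the `selected` argument is left to the caller.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Fraction of each ordered pair among all adjacent residue pairs."""
    seq = sequence.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(
        p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid di-peptides")
    return {dp: counts[dp] / total for dp in DIPEPTIDES}


def select_features(composition, selected):
    """Keep only the features named in `selected` (e.g. a chosen subset)."""
    return {name: composition[name] for name in selected}
```

The reduced feature dictionary can then be used wherever the 20-residue composition was used before.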
Performance measures
Sensitivity is the fraction of OMPs that are correctly identified. Specificity is the fraction of negative proteins (non-OMPs) that are correctly identified. Accuracy is the fraction of all predictions that are correct. MCC (Matthews correlation coefficient) measures the correlation between predictions and actual class labels; it lies in the range [-1, 1], with 1 denoting perfect predictions. In a two-class classification, if the numbers of examples in the two classes are not equal, MCC is a better measure than accuracy [17]. Therefore, when comparing different methods, we focus on MCC. In addition to MCC, we also report the other measures.
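These four measures follow from the confusion-matrix counts using their standard definitions, as in the sketch below. The function name is hypothetical, both classes are assumed to be present in the evaluated set, and returning an MCC of 0 when the denominator vanishes is a convention adopted here.

```python
import math


def performance(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn: OMPs predicted as OMP / non-OMP.
    tn/fp: non-OMPs predicted as non-OMP / OMP.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For a dataset of the size used here (119 OMPs and 851 non-OMPs), labelling every protein as non-OMP would give about 87.7% accuracy but an MCC of 0 under this convention, which illustrates why MCC is preferred for unbalanced classes.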
Availability
The software and data sets are available online [28].
Declarations
Acknowledgements
This work was supported by funding from the Department of Computer Science, Utah State University to CY.
References
1. Schulz GE: Beta-barrel membrane proteins. Curr Opin Struct Biol 2000, 10: 443–447. 10.1016/S0959-440X(00)00120-2
2. Wimley WC: The versatile beta-barrel membrane protein. Curr Opin Struct Biol 2003, 13: 404–411. 10.1016/S0959-440X(03)00099-X
3. Koebnik R, Locher KP, Van Gelder P: Structure and function of bacterial outer membrane proteins: barrels in a nutshell. Mol Microbiol 2000, 37: 239–253. 10.1046/j.1365-2958.2000.01983.x
4. Gnanasekaran TV, Peri S, Arockiasamy A, Krishnaswamy S: Profiles from structure based sequence alignment of porins can identify beta stranded integral membrane proteins. Bioinformatics 2000, 16: 839–842. 10.1093/bioinformatics/16.9.839
5. Wimley WC: Toward genomic identification of beta-barrel membrane proteins: Composition and architecture of known structures. Protein Sci 2002, 11: 301–312. 10.1110/ps.29402
6. Martelli PL, Fariselli P, Krogh A, Casadio R: A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 2002, 18: S46–53.
7. Bagos P, Liakopoulos T, Spyropoulos I, Hamodrakas S: A hidden Markov model method, capable of predicting and discriminating beta-barrel outer membrane proteins. BMC Bioinformatics 2004, 5: 29. 10.1186/1471-2105-5-29
8. Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ: PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins. Nucl Acids Res 2004, 32: W400–404. 10.1093/nar/gkh417
9. Bigelow H, Rost B: PROFtmb: a web server for predicting bacterial transmembrane beta barrel proteins. Nucl Acids Res 2006, 34: W186–188. 10.1093/nar/gkl262
10. Liu Q, Zhu Y, Wang B, Li Y: Identification of beta-barrel membrane proteins based on amino acid composition properties and predicted secondary structure. Comp Biol Chem 2003, 27: 355–361. 10.1016/S1476-9271(02)00085-3
11. Natt NK, Kaur H, Raghava GPS: Prediction of transmembrane regions of beta-barrel proteins using ANN- and SVM-based methods. Proteins 2004, 56: 11–18. 10.1002/prot.20092
12. Garrow A, Agnew A, Westhead D: TMB-Hunt: An amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics 2005, 6: 56. 10.1186/1471-2105-6-56
13. Garrow AG, Agnew A, Westhead DR: TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucl Acids Res 2005, 33: W188–192. 10.1093/nar/gki384
14. Berven FS, Flikka K, Jensen HB, Eidhammer I: BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucl Acids Res 2004, 32: W394–399. 10.1093/nar/gkh351
15. Gromiha MM, Suwa M: A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2005, 21: 961–968. 10.1093/bioinformatics/bti126
16. Park KJ, Gromiha MM, Horton P, Suwa M: Discrimination of outer membrane proteins using support vector machines. Bioinformatics 2005, 21: 4223–4229. 10.1093/bioinformatics/bti697
17. Baldi P, Brunak S, Chauvin Y, Andersen CAF: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412
18. Gromiha MM, Ahmad S, Suwa M: Neural network-based prediction of transmembrane beta-strand segments in outer membrane proteins. Journal of Computational Chemistry 2004, 25: 762–767. 10.1002/jcc.10386
19. Gromiha MM, Suwa M: Discrimination of outer membrane proteins using machine learning algorithms. Proteins 2006, 63: 1031–1037. 10.1002/prot.20929
20. Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 2004, 5: 39–55. 10.1093/bib/5.1.39
21. Rey S, Acab M, Gardy JL, Laird MR, deFays K, Lambert C, Brinkman FSL: PSORTdb: a protein subcellular localization database for bacteria. Nucl Acids Res 2005, 33: D164–168. 10.1093/nar/gki027
22. Krogh A, Larsson B, Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol 2001, 305: 567–580. 10.1006/jmbi.2000.4315
23. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FSL: PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 2005, 21: 617–623. 10.1093/bioinformatics/bti057
24. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
25. Saier MH Jr, Tran CV, Barabote RD: TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucl Acids Res 2006, 34: D181–186. 10.1093/nar/gkj001
26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
27. Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco, Morgan Kaufmann; 2005.
28. Software and Data [http://www.cs.usu.edu/~cyan/OMP_WED/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.