MetaTM - a consensus method for transmembrane protein topology prediction
© Klammer et al; licensee BioMed Central Ltd. 2009
Received: 13 January 2009
Accepted: 28 September 2009
Published: 28 September 2009
Transmembrane (TM) proteins are proteins that span a biological membrane one or more times. As their 3-D structures are hard to determine, experiments focus on identifying their topology (i. e. which parts of the amino acid sequence are buried in the membrane and which are located on either side of the membrane), but only a few topologies are known. Consequently, various computational TM topology predictors have been developed, but their accuracies are far from perfect. The prediction quality can be improved by applying a consensus approach, which combines results of several predictors to yield a more reliable result.
A novel TM consensus method, named MetaTM, is proposed in this work. MetaTM is based on support vector machine models and combines the results of six TM topology predictors and two signal peptide predictors. On a large data set comprising 1460 sequences of TM proteins with known topologies and 2362 globular protein sequences it correctly predicts 86.7% of all topologies.
Combining several TM predictors in a consensus prediction framework improves overall accuracy compared to any of the individual methods. Our proposed SVM-based system also has higher accuracy than a previous consensus predictor. MetaTM is made available both as downloadable source code and as a DAS server at http://MetaTM.sbc.su.se
Transmembrane proteins are proteins that span the biological membrane one or more times. An estimated 20 - 30% of all genes in an organism code for TM proteins [1, 2]. Virtually all communication and transportation between the inside and the outside of a cell is mediated by them. Furthermore, they are vital for cell recognition and cell adhesion and serve as receptors. This makes them especially interesting for medicine, since almost half of all present-day drug targets are TM proteins.
Two major types of TM proteins can be distinguished: α-helical TM proteins and TM β-barrels. Proteins of the α-helical class are by far the more abundant, therefore only this class will be considered in this paper. Although TM proteins make up about a fifth of all known protein sequences, only about one percent of all known 3-D structures are TM proteins. This is because transmembrane proteins are very hard to crystallize. There are methods to determine the rough membrane-spanning topology of TM proteins (e. g. reporter fusion, site tagging, antibodies or mass spectrometry), but even this has been done for less than a thousandth of all known protein sequences. To bridge this gap, there is a great need for computational prediction.
The topology of an α-helical TM protein describes which parts of the amino-acid sequence are buried in the lipid bilayer, and which are facing the aqueous environment on either side of the cell (i. e. the cytoplasmic or the non-cytoplasmic side). The portions that lie within the bilayer are termed TM segments, while the ones on either the in- or the outside of the cell (or organelle) are mostly called loops. Since loops alternate between the inside and the outside of the membrane, the topology information can be reduced to the location of the first loop (N-terminal location) and the position of all TM segments.
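The reduced representation described above (N-terminal location plus TM segment positions) is enough to recover a label for every residue. The following is a minimal sketch of that expansion, not code from MetaTM; the function name, the 'i'/'o'/'M' label alphabet and the 1-based inclusive coordinates are assumptions made for illustration.

```python
from typing import List, Tuple


def expand_topology(length: int, n_term: str,
                    tm_segments: List[Tuple[int, int]]) -> str:
    """Return one label per residue: 'i' (inside), 'o' (outside), 'M' (membrane).

    tm_segments holds 1-based, inclusive (start, end) pairs sorted by position.
    The label alphabet is an illustrative assumption, not a MetaTM format.
    """
    if n_term not in ("i", "o"):
        raise ValueError("n_term must be 'i' or 'o'")
    labels = []
    side = n_term  # side of the membrane the current loop lies on
    pos = 1        # next residue that still needs a label
    for start, end in tm_segments:
        if start < pos or end < start or end > length:
            raise ValueError("segments must be sorted, disjoint and in range")
        labels.append(side * (start - pos))      # loop before the segment
        labels.append("M" * (end - start + 1))   # the TM segment itself
        side = "o" if side == "i" else "i"       # loops alternate sides
        pos = end + 1
    labels.append(side * (length - pos + 1))     # C-terminal loop
    return "".join(labels)
```

For a 12-residue toy sequence with an outside N-terminus and TM segments at 2-4 and 8-10, the function yields one outside residue, a segment, an inside loop, a second segment and an outside C-terminal loop.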
An α-helical TM segment consists of a region approximately 15 - 30 residues long with an over-representation of hydrophobic residues. This fact makes the computational prediction of TM proteins a rewarding task. However, there are also other parts of proteins that have the same physico-chemical properties, e. g. the core region of signal peptides, which are short pro-peptides (i. e. cleaved off) that guide the membrane translocation of mature proteins. Their appearance often confuses TM topology predictors.
In silico TM topology prediction
As the topology of a TM protein mostly depends on its primary amino-acid sequence, the computational prediction can be carried out fairly easily. A large number of TM topology predictors are available today, ranging from simple hydrophobicity analysis (e. g. TopPred) to more complex methods based on hidden Markov models (e. g. TMHMM, HMMTOP, Phobius) or artificial neural networks (e. g. PHDhtm, Memsat).
The use of homologous sequences can improve the accuracy of TM topology predictors by up to 10%, thus many predictors support this kind of information (e. g. PHDhtm, HMMTOP, Memsat, Phobius in its homology-supporting version PolyPhobius).
Current TM topology predictors are claimed to predict the correct topology for 70 - 85% of all proteins, but studies on whole-genome data show that this is an overestimation [13, 14]. Furthermore, different predictors have different strengths and weaknesses. Some tend to over-predict TM segments, others are very conservative and miss more of them. Most predictors also tend to falsely predict signal peptides (SPs) as TM segments, whereas only a few of them can handle this problem (e. g. Phobius, Memsat).
Because the individual predictors have different strengths and weaknesses, it seems natural to combine them and make a prediction on a meta-level. This is called a consensus prediction. The aim is to reduce method-specific weaknesses and therefore yield higher accuracies. This can be achieved by building a predictor that combines the results of various methods and, by applying some weighting and heuristics, calculates a meta-result. This meta-result, which represents the consensus of all methods, is potentially more reliable than a single prediction. Previous approaches to combining results into a consensus prediction include simple majority voting [16, 17] and Bayesian Belief Networks.
TM topology predictors
Another important feature is the SP prediction to avoid mutual false-classifications of TM segments and SPs. Two TM topology predictors (PolyPhobius and Memsat) are also capable of predicting signal peptides. However, Memsat does not deliver very reliable results for signal peptides and therefore was not used in the consensus predictor for this kind of prediction. Additionally, SignalP, a method that only predicts SPs, was included. This method can be thought of as an assisting predictor to reach a consensus on the SP prediction together with PolyPhobius.
Support Vector Machines
In order to build a model that predicts TM topology from a set of inputs, one needs to employ a machine learning method. An increasingly popular technique is the support vector machine (SVM). Here, non-linear dependencies between the input features are handled by mapping the input to a higher dimensional feature space by means of a kernel function. In kernel feature space, the SVM will construct a hyperplane that separates the two data sets with a maximal margin. This delivers good generalization and the ability to capture non-linear behavior. SVMs have been used successfully for a large number of bioinformatics prediction tasks.
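As a concrete instance of a kernel function, the sketch below evaluates the radial basis function (RBF) kernel, which is the kernel reported for the libsvm models in this paper, in its standard form K(x, y) = exp(-γ·||x - y||²). The function name is an illustrative choice. For boolean input vectors such as those used by MetaTM, the squared distance equals the number of positions in which the two vectors differ.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Two identical vectors give a kernel value of 1, and the value decays towards 0 as more predictors disagree between the two inputs.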
The MetaTM algorithm
On the top level the consensus prediction is split into two major parts: (1) the segments consensus for finding TM segments and signal peptides (SPs), and (2) the N-terminal consensus. The latter determines whether the N-terminal end of the amino-acid sequence is located on the cytoplasmic or non-cytoplasmic side (also referred to as inside and outside, respectively) of the membrane. They are both predicted independently based on two different SVM models and afterwards combined into a final consensus topology.
To have the consensus predicted by the SVM segment model, the results of the incorporated predictors for each window have to be encoded as a vector. In this case they are represented by a nine-dimensional vector with the following boolean values: Six for the TM topology predictors, two for the SP predictors, and finally one that indicates if the current window is the first for the current query sequence. This last value is an additional indicator for the prediction of signal peptides, as they can only appear at the N-terminal end of a sequence (and therefore only within the first window of a query sequence).
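The encoding described above can be sketched as follows. This is an illustration only: the function name and the order of the nine components (six TM votes, then two SP votes, then the first-window flag) are assumptions, since the text specifies the contents of the vector but not its layout.

```python
from typing import List, Sequence


def encode_segment_window(tm_votes: Sequence[bool],
                          sp_votes: Sequence[bool],
                          is_first_window: bool) -> List[int]:
    """Build the nine-dimensional boolean input for the segment model.

    tm_votes: one flag per TM topology predictor (six in total).
    sp_votes: one flag per signal peptide predictor (two in total).
    The component order is an assumption made for this sketch.
    """
    if len(tm_votes) != 6:
        raise ValueError("expected votes from six TM topology predictors")
    if len(sp_votes) != 2:
        raise ValueError("expected votes from two SP predictors")
    vector = [int(v) for v in tm_votes]
    vector += [int(v) for v in sp_votes]
    vector.append(int(is_first_window))
    return vector
```

A window in which four TM predictors and one SP predictor report a segment, and which is the first window of the query, would be encoded as a vector with six ones and three zeros.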
If the voting was positive (i. e. either an SP or TM segment should be added to the consensus prediction), the averages of the overlapping segments' start and end positions are calculated, respectively. If a window contains SPs and TM segments, only those segments which are of the same class as that predicted by the SVM model are used for the averaging (see also Figure 1C). Then all segments used for the prediction of the consensus segment are masked to not be used for following predictions. Afterwards the rest of the sequence is scanned for the next segment (see Figure 1D). Next, the cycle starts again from the beginning until no more unmasked segments are present.
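The averaging and masking step for a single window might look like the sketch below. The Segment class and function name are hypothetical, positions are rounded with Python's built-in round (the rounding rule is not stated in the text), and masking every segment in the window rather than only the averaged ones is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    start: int
    end: int
    kind: str            # "TM" or "SP"
    masked: bool = False


def consensus_segment(window: List[Segment],
                      predicted_kind: str) -> Tuple[int, int]:
    """Average start and end over the segments of the predicted class.

    Assumption: all segments in the window are masked afterwards so that
    the next scan skips them.
    """
    used = [s for s in window if s.kind == predicted_kind and not s.masked]
    if not used:
        raise ValueError("no unmasked segment of the predicted class")
    start = round(sum(s.start for s in used) / len(used))
    end = round(sum(s.end for s in used) / len(used))
    for s in window:
        s.masked = True  # exclude from following predictions
    return start, end
```

With three overlapping TM predictions and one SP prediction in the same window, a TM verdict from the model averages only the three TM segments.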
N-terminal location consensus
The N-terminal consensus in MetaTM is reached by a voting mechanism based on a second SVM model. Each predictor contributes to the result by voting either for N-terminus on the inside (cytoplasmic side) or N-terminus on the outside (non-cytoplasmic side). The results are encoded as an eight-dimensional vector with the following boolean values: six for the TM topology predictors, where 0 stands for the N-terminus being located on the inside and 1 for the outside, and two for the SP prediction of PolyPhobius and SignalP, respectively (1 if an SP has been predicted, otherwise 0). The last two values assist the N-terminal prediction such that the occurrence of an SP automatically leads to an outside N-terminal location. This is due to the biological fact that SPs are cleaved off from the remainder of the protein after it has been inserted across the membrane.
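To show how such a vote could be handed to libsvm, the sketch below builds the eight-dimensional vector and writes it in libsvm's sparse text format (a label followed by 1-based index:value pairs, with zero entries omitted). The function names, the component order and the use of label 1 for an outside N-terminus are assumptions for illustration.

```python
from typing import List, Sequence


def encode_n_terminal(n_term_votes: Sequence[str],
                      polyphobius_sp: bool,
                      signalp_sp: bool) -> List[int]:
    """Six location votes ('i' -> 0, 'o' -> 1) followed by two SP flags."""
    if len(n_term_votes) != 6:
        raise ValueError("expected votes from six TM topology predictors")
    mapping = {"i": 0, "o": 1}
    vector = [mapping[v] for v in n_term_votes]
    vector += [int(polyphobius_sp), int(signalp_sp)]
    return vector


def to_libsvm_line(label: int, vector: Sequence[int]) -> str:
    """Format one training example in libsvm's sparse text format."""
    pairs = [f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + pairs)
```

A protein for which three predictors vote "outside" and PolyPhobius reports a signal peptide would produce a line with four non-zero features.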
Comparison with single predictors
Data set categories
N-terminal location prediction results
Number of TM segments prediction results
Entire topology prediction results
Signal peptide prediction results
Comparison with previous consensus predictors
ConPred II data set prediction results
The prediction of segment and N-terminal consensus is achieved by two different support vector machine (SVM) models. We also tried a different approach, where weights were assigned to the incorporated predictors based on their prediction quality. The idea was that methods which deliver more reliable results should contribute more to the consensus. The weights of all predictors voting for a certain state (e. g. N-terminus inside or N-terminus outside in the case of N-terminal location prediction) were summed up and compared to each other. Subsequently, the state with the higher vote was considered to be the consensus result. This approach, although fairly simple, also delivered good results and was only about 1 percentage point less reliable than the approach using SVM.
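The weighted alternative can be sketched in a few lines. The predictor names and weight values used with it are placeholders, not the weights from this study, and the tie-breaking behaviour (the state encountered first wins) is an assumption.

```python
from typing import Dict


def weighted_vote(votes: Dict[str, str], weights: Dict[str, float]) -> str:
    """Sum the weights of all predictors per state; return the heaviest state.

    votes maps predictor name -> predicted state (e.g. "inside"/"outside").
    weights maps predictor name -> reliability weight (hypothetical values).
    """
    totals: Dict[str, float] = {}
    for predictor, state in votes.items():
        totals[state] = totals.get(state, 0.0) + weights[predictor]
    # On a tie, max() keeps the state that was inserted first.
    return max(totals, key=totals.get)
```

Note that a single reliable predictor can outvote two weaker ones, which is the behaviour that distinguishes this scheme from simple majority voting.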
As the prediction quality of MetaTM strongly depends on the results of the underlying predictors, the performance could be further improved by adding better methods or replacing poorly performing ones with them. Of course, our aim was to use only well-performing predictors, but new methods can easily be incorporated in the future.
We have presented a novel TM consensus method, MetaTM, that predicts the transmembrane topology and signal peptides based on the results of seven single predictors. Although MetaTM was not able to deliver the best results in all data categories, it is the most reliable method on average in all three tests (i. e. N-terminal location, number of TM segments, entire topology). For predicting the entire topology of protein sequences, the most important test in TM topology prediction, MetaTM reached an average accuracy of 86.3%, which was 4.0 percentage points better than the result of the best single predictor. Furthermore, its average signal peptide prediction quality is also better than those of its incorporated SP predictors.
Compared to ConPred II, an existing consensus predictor, MetaTM was 2.6 percentage points more accurate in terms of entire topology prediction. Due to availability limitations of the ConPred II program, the prediction quality could only be compared based on the data set and results described in the ConPred II paper. Presumably, the results would have been even more clearly in favor of MetaTM if sequences with signal peptides had been in the ConPred data set, as ConPred II does not include SP predictors.
The data set for the comparison with the single predictors comprises data from the recently published TOPDB database and the data set that was originally compiled for Phobius. TOPDB (revision 1) currently comprises 1452 α-helical TM protein sequences, of which 94 were excluded as they contain propeptides or membrane loops. The remaining 1358 sequences were combined with all 292 α-helical TM protein sequences and 2362 globular ones from the Phobius data set. There was some overlap of sequences between the two data sets, so duplicate entries were removed. This led to a final data set with 1460 TM protein sequences and 2362 globular ones, or 3822 sequences in total.
A disproportionate number of strong homologs in the selected data set could affect the result of the predictor comparison, as it could favor or disfavor a particular predictor. To rule out such bias, a pairwise comparison of all proteins in the data set was done. Only 70 pairs between 117 different proteins with more than 90% identity were found, ruling out an effect on the results.
The data set for the comparison with ConPred II, comprising 231 α-helical TM protein sequences, was downloaded from the predictor's homepage (see ). The performance of MetaTM was assessed with the underlying SVM models trained on the data set mentioned in the paragraph above. The results of ConPred II were taken from its paper, where the results are separately described for pro- and eukaryotic sequences. In our comparison, we did not make this distinction, so we recalculated the fractions of correct predictions for the entire set based on their reported results.
Homology detection and MSA
In order to reduce the duration of time-consuming homology searching, a sub-database of UniProt/SwissProt (release 55.2) was created (called SwissMemProts). The idea for this sub-database was to extract all membrane proteins from SwissProt and use the resulting subset as the database for the homology search. Membrane proteins were detected by searching for the occurrence of the string "membrane" in the CC-section of each entry. This CC-block stores the annotation of the sequence (e. g. function, sub-cellular location). If the string was found in this section, the corresponding sequence was added to the sub-database. This filtering procedure reduced the number of entries from 362,782 to 75,083 without decreasing the accuracy of MetaTM's results.
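A filter of this kind could be written as the sketch below, which reads Swiss-Prot flat-file text in which every line starts with a two-letter line code and every entry ends with a "//" line. The function names are illustrative, and treating the search as case-sensitive is an assumption, as the text does not say how the string was matched.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group flat-file lines into entries; '//' terminates each entry."""
    entry: List[str] = []
    for line in lines:
        entry.append(line.rstrip("\n"))
        if line.startswith("//"):
            yield entry
            entry = []


def mentions_membrane(entry: List[str]) -> bool:
    """True if any CC (comment) line contains the string 'membrane'."""
    return any(l.startswith("CC") and "membrane" in l for l in entry)


def filter_membrane_entries(lines: Iterable[str]) -> List[List[str]]:
    """Keep only the entries whose CC block mentions 'membrane'."""
    return [e for e in iter_entries(lines) if mentions_membrane(e)]
```

Passing an open file handle for the flat file as lines would return the entries to write into the sub-database; occurrences of the string outside the CC block do not count.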
Homologs for PolyPhobius, HMMTOP and PHDhtm were found with the BLAST algorithm (blastall Version 2.2.16). The following parameters were set: -p blastp, -e 1e-5 and -b 50. Resulting homologous sequences were aligned with the Kalign 2.0 multiple sequence alignment (MSA) tool using the default parameters, and the produced multiple alignment was passed to PolyPhobius and PHDhtm. HMMTOP does not require an aligned sequence; it rather takes the list of homologous sequences directly. The homology detection for Memsat was done with the default script that comes with the program and PSI-BLAST (blastpgp Version 2.2.16).
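For reference, the stated search parameters can be assembled into a legacy blastall invocation as sketched below. Only -p, -e and -b come from the text; the -d (database), -i (query file) and -o (output file) options are the usual blastall flags needed to complete the call, and all file names are placeholders.

```python
from typing import List


def blastall_command(query_fasta: str, database: str,
                     output_path: str) -> List[str]:
    """Argument list for a legacy blastall protein search.

    -p blastp, -e 1e-5 and -b 50 are the parameters reported in the paper;
    the paths passed in are hypothetical.
    """
    return [
        "blastall",
        "-p", "blastp",   # protein query against protein database
        "-e", "1e-5",     # E-value cutoff
        "-b", "50",       # number of database sequences to show alignments for
        "-d", database,
        "-i", query_fasta,
        "-o", output_path,
    ]
```

The list can be passed to subprocess.run on a system where the legacy blastall binary and a formatted database are installed.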
The models used in the SVM voting mechanism were created with the libsvm package (version 2.86). Two types of models have been designed, one for the segments consensus and one for the N-terminal location consensus. 10-fold cross validation was applied to train and test the model [see Additional file 2]. The cross validation sets were selected such that no proteins had more than 50% sequence identity matches between sets. The SVM models were produced with a Python script that comes with the package (called easy.py), using the radial basis function (RBF) kernel. This script automatically determines the optimal cost and RBF kernel parameters for each model, which is created during the cross validation process. For the final models (those that were trained on the entire data set), the optimized parameters are C = 2048 and γ = 4.88·10^-4 for the N-terminal location model, and C = 2 and γ = 0.125 for the segments model.
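The reported values are all powers of two, which is what one would expect from a search over an exponential grid. The sketch below enumerates such a grid, assuming the ranges commonly used as defaults by libsvm's grid search script (log2 C from -5 to 15 in steps of 2, log2 γ from 3 to -15 in steps of -2); these ranges are an assumption and are not stated in the paper.

```python
from typing import List


def log2_grid(begin: int, end: int, step: int) -> List[float]:
    """Values 2**e for e = begin, begin + step, ... up to and including end."""
    if step == 0:
        raise ValueError("step must be non-zero")
    values = []
    exponent = begin
    while (step > 0 and exponent <= end) or (step < 0 and exponent >= end):
        values.append(2.0 ** exponent)
        exponent += step
    return values


# Assumed default search ranges (not taken from the paper).
C_GRID = log2_grid(-5, 15, 2)
GAMMA_GRID = log2_grid(3, -15, -2)

# Parameters reported for the final models, written as powers of two.
N_TERMINAL_PARAMS = (2.0 ** 11, 2.0 ** -11)   # C = 2048, gamma ~ 4.88e-4
SEGMENT_PARAMS = (2.0 ** 1, 2.0 ** -3)        # C = 2, gamma = 0.125
```

Under this assumption, both reported parameter pairs fall exactly on grid points, with γ = 2^-11 corresponding to the rounded value 4.88·10^-4.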
- Wallin E, von Heijne G: Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 1998, 7: 1029–1038.
- Krogh A, Larsson B, von Heijne G, Sonnhammer E: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305: 567–580. doi:10.1006/jmbi.2000.4315
- Drews J: Drug discovery: a historical perspective. Science 2000, 287: 1960–1964. doi:10.1126/science.287.5460.1960
- von Heijne G: The membrane protein universe: what's out there and why bother? J Intern Med 2007, 261: 543–557. doi:10.1111/j.1365-2796.2007.01792.x
- Käll L, Krogh A, Sonnhammer E: A combined transmembrane topology and signal peptide prediction method. J Mol Biol 2004, 338: 1027–1036. doi:10.1016/j.jmb.2004.03.016
- von Heijne G: Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992, 225: 487–494. doi:10.1016/0022-2836(92)90934-C
- Sonnhammer E, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 1998, 6: 175–182.
- Tusnády G, Simon I: Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 1998, 283: 489–506. doi:10.1006/jmbi.1998.2107
- Rost B, Fariselli P, Casadio R: Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996, 5: 1704–1718. doi:10.1002/pro.5560050824
- Jones D: Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 2007, 23: 538–544. doi:10.1093/bioinformatics/btl677
- Viklund H, Elofsson A: Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 2004, 13: 1908–1917. doi:10.1110/ps.04625404
- Käll L, Krogh A, Sonnhammer E: An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 2005, 21(Suppl 1):i251–257. doi:10.1093/bioinformatics/bti1014
- Käll L, Sonnhammer E: Reliability of transmembrane predictions in whole-genome data. FEBS Lett 2002, 532: 415–418. doi:10.1016/S0014-5793(02)03730-4
- Melén K, Krogh A, von Heijne G: Reliability measures for membrane protein topology prediction algorithms. J Mol Biol 2003, 327: 735–744. doi:10.1016/S0022-2836(03)00182-7
- Martelli PL, Fariselli P, Casadio R: An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics 2003, 19(Suppl 1):i205–211. doi:10.1093/bioinformatics/btg1027
- Nilsson J, Persson B, von Heijne G: Consensus predictions of membrane protein topology. FEBS Lett 2000, 486: 267–269. doi:10.1016/S0014-5793(00)02321-8
- Ikeda M, Arai M, Lao DM, Shimizu T: Transmembrane topology prediction methods: a re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol 2002, 2: 19–33.
- Taylor PD, Attwood TK, Flower DR: BPROMPT: A consensus server for membrane protein prediction. Nucleic Acids Res 2003, 31: 3698–3700. doi:10.1093/nar/gkg554
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Lassmann T, Sonnhammer E: Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005, 6: 298. doi:10.1186/1471-2105-6-298
- Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Nielsen H, Krogh A: Prediction of signal peptides and signal anchors by a hidden Markov model. Proc Int Conf Intell Syst Mol Biol 1998, 6: 122–130.
- Schölkopf B, Smola A: Learning with Kernels. Support Vector Machines. Cambridge: MIT Press; 2002.
- Noble W: What is a support vector machine? Nat Biotechnol 2006, 24: 1565–1567. doi:10.1038/nbt1206-1565
- Arai M, Mitsuke H, Ikeda M, Xia J, Kikuchi T, Satake M, Shimizu T: ConPred II: a consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res 2004, 32: W390–393. doi:10.1093/nar/gkh380
- Tusnády G, Kalmár L, Simon I: TOPDB: topology data bank of transmembrane proteins. Nucleic Acids Res 2008, 36: D234–239. doi:10.1093/nar/gkm751
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Sonnhammer E, Wootton J: Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 2001, 45: 262–273. doi:10.1002/prot.1146
- Dowell R, Jokerst R, Day A, Eddy S, Stein L: The distributed annotation system. BMC Bioinformatics 2001, 2: 7. doi:10.1186/1471-2105-2-7
- Messina DN, Sonnhammer EL: DASher: a stand alone protein sequence client for DAS, the Distributed Annotation System. Bioinformatics 2009, 25: 1333–1334. doi:10.1093/bioinformatics/btp153
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.