TMbed: transmembrane proteins predicted through language model embeddings

Background: Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4–5 times underrepresented compared to non-TMPs. Today's top methods, such as AlphaFold2, accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions.

Results: Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5) to predict, for each residue, one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine, at performance levels similar to, or better than, methods using evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060).

Conclusions: Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through the millions of 3D structures soon to be predicted, e.g., by AlphaFold2.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04873-x.


Short description of Supporting Online Material
Here, we provide short explanations of the different performance metrics used during evaluation (Note S1, Fig. S4) and describe how we created the results for several other prediction methods (Note S2); we briefly explain the idea behind using a simple Viterbi decoder and its limitations (Note S3, Fig. S2); and we provide illustrative sketches of our model architecture (Fig. S1) and nested cross-validation process (Fig. S3).
We list the hardware specifications of the machines used during the project (Table S1) and the optimal hyperparameters for each of the final models (Table S2). Further, we provide performance statistics for signal peptides (Table S3) and for protein groups based on their number of transmembrane segments (Table S4). We also list statistics for the individual models and cross-validation splits (Tables S5 & S6), confusion matrices for the cross-validation and final models (Tables S7 & S8), the effect of the Gaussian filter and Viterbi decoder on segment performance (Table S9), and an estimate of the expected number of mistakes made in a hypothetical proteome (Table S10). Additionally, we show a few more "false positives" that might actually be transmembrane proteins (Fig. S5). Finally, we provide performance and annotation statistics for an out-of-distribution data set gathered from DeepTMHMM (Tables S11 & S12, Fig. S6) and a CASP-like data set of novel membrane proteins (Table S13).

Supporting Online Material
Note S1: Performance metrics explained.
We evaluated our and other methods using several standard and non-standard performance metrics, listed below. Statistics referring to a specific type of segment (i.e., transmembrane beta strands or helices, and signal peptides) are calculated using only the corresponding subset of proteins. For example, the precision and Qok values for transmembrane helices take only the 571 alpha helical transmembrane proteins (TMPs) into account, ignoring any false positive predictions made in beta barrel TMPs or globular proteins.
Recall, also called Sensitivity, is the percentage of positive samples (proteins or segments) that have been correctly predicted as such. For example, TMbed correctly identified 557 of the 571 alpha helical TMPs, i.e., a recall of about 98%.

Qtop gives the percentage of correctly predicted segments of a given type that also have the correct inside/outside orientation, i.e., whose endpoints are on the correct sides of the membrane. For example, TMbed correctly predicts 730 of 768 transmembrane beta strands (recall of about 95%). Of those 730 segments, 714 also have the correct inside/outside orientation, i.e., the Qtop value is about 98%. We consider only the first residue on each side of a segment to determine its orientation.
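As a toy illustration of how these two values relate, the counts quoted above can be reproduced with a few lines of Python (this is not part of the TMbed code):

```python
# Worked example for recall and Qtop, using the TMB counts quoted above.
observed = 768       # observed transmembrane beta strands
predicted_ok = 730   # segments predicted within the matching criteria
oriented_ok = 714    # of those, with correct inside/outside orientation

recall = predicted_ok / observed     # ~0.95 (95%)
qtop = oriented_ok / predicted_ok    # ~0.98 (98%)
print(f"recall={recall:.2%}, Qtop={qtop:.2%}")
```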
We estimate the error margin of our performance values with the 95% confidence interval (CI), i.e., 1.96 times the standard error (SE) based on the sample standard deviation (SD):

$\mathrm{CI} = 1.96 \cdot \mathrm{SE} = 1.96 \cdot \frac{\mathrm{SD}}{\sqrt{n}}$, with $\mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$,

where $n$ is the number of measurements performed and $\bar{x}$ is the mean over those. In our case, $n$ usually refers to the five cross-validation iterations.
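A minimal sketch of this computation (plain NumPy with hypothetical values; not TMbed's evaluation code):

```python
import numpy as np

# 95% confidence interval from n=5 cross-validation values (hypothetical data).
values = np.array([0.97, 0.98, 0.99, 0.98, 0.97])
n = len(values)
sd = values.std(ddof=1)   # sample standard deviation
se = sd / np.sqrt(n)      # standard error of the mean
ci95 = 1.96 * se          # 95% confidence interval half-width
print(f"{values.mean():.3f} ± {ci95:.3f}")
```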
Note S2: Other prediction methods.
In order to put the performance of TMbed into context, we made predictions for the proteins in our data set using several other methods.
DeepTMHMM (1) uses ESM-1b (2) embeddings to predict alpha helical and beta barrel transmembrane proteins (TMP). We generated all predictions using a local installation as described on their web server homepage (https://dtu.biolib.com/DeepTMHMM).

TOPCONS2 (3), OCTOPUS (4), Philius (5), PolyPhobius (6), and SPOCTOPUS (7) all predict alpha helical TMPs. TOPCONS2, Philius, PolyPhobius, and SPOCTOPUS additionally predict signal peptides. With the exception of Philius, all of these methods use evolutionary information in the form of BLAST profiles or MSAs as additional input to the protein sequence. As TOPCONS2 is a consensus prediction method combining all of the above methods, we obtained all predictions from its web server (https://topcons.net). Unfortunately, the web server rejected one of the globular proteins, P05790, because its high GA content caused it to be incorrectly interpreted as a DNA sequence.
CCTOP (8,9) is another consensus prediction method for alpha helical TMPs. It combines a total of 10 prediction methods with topology constraints determined by a homology lookup. We used their web server (https://cctop.ttk.hu/) to generate predictions for our data sets. Due to sequence length restrictions (up to 5,000 residues), we are missing predictions for one alpha helical TMP and six globular proteins.
SCAMPI2 (10) is an improved version of the older SCAMPI (11) method employed as part of TOPCONS2. We downloaded the software from its GitHub repository and used UniRef90 as the BLAST search database to generate the alignments needed for the MSA version of SCAMPI2.
HMM-TM (12) and PRED-TMBB2 (13) are methods predicting alpha helical and beta barrel TMPs, respectively. We computed predictions for our data set using their respective web servers (http://www.compgen.org/tools). For both methods, we used the most recent improvements employing hidden neural networks (14). As the online services only allow batch submissions for the single sequence versions, i.e., without the use of MSAs as input, we also installed local versions of the methods (15) and ran them offline. However, the offline version of PRED-TMBB2 does not include the protein filtering using pHMMs that the web server employs, significantly increasing its false positive rate. Unfortunately, the local MSA versions failed for some of the proteins and we were unable to fix the issue. Thus, we are missing predictions for 155 proteins (6 beta barrel TMPs, 19 alpha helical TMPs, 130 globular proteins) by HMM-TM (MSA) and 26 proteins (2 alpha helical TMPs, 24 globular proteins) by PRED-TMBB2 (MSA).
The authors of BetAware-Deep (16) kindly provided us with predictions for our data set, as the web server only allows submission of a single sequence at a time. Their method combines sequence profiles with several machine learning architectures (LSTM, CRF) to predict beta barrel TMPs.
We installed and ran an offline version of BOCTOPUS2 (17) to predict beta barrel TMPs in our data set. We generated the sequence profiles and results according to the descriptions in their GitHub repository.
For TMSEG (18) and PROFtmb (19) we used the predictions generated by our PredictProtein (20) pipeline. TMSEG and PROFtmb predict alpha helical and beta barrel TMPs, respectively, both using BLAST profiles as additional input.
We used the SignalP 6.0 (21) web server to generate additional signal peptide predictions. We chose the "slow" model mode to get accurate predictions. Just like TMbed, SignalP 6.0 uses a protein language model to generate embeddings (22).

Note S3: Viterbi decoder.
We use an untrained Viterbi decoder to translate the class probability distributions generated by our models into actual class labels for each residue in a sequence. The decoder scores state transitions according to the class probabilities predicted by the CNN model, trying to find the path with the highest sum of probabilities. We apply a score penalty of −100 to transitions not intended by our defined grammar (Fig. S2), effectively preventing the decoder from considering those transitions.
The main purpose of the decoder is to enforce a small set of rules:
1) Signal peptides may only start at the N-terminus of a sequence.
2) Signal peptides and transmembrane segments must be at least five residues long.
3) The inside/outside orientation of non-membrane parts must change after every transmembrane segment.
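To illustrate the idea, the following is a minimal penalty-based Viterbi sketch in Python (NumPy), assuming a reduced three-state grammar (inside, TMH, outside). It omits the sub-states enforcing the five-residue minimum length, the direct IN/OUT transitions discussed below, and everything else specific to the actual TMbed grammar (Fig. S2).

```python
import numpy as np

# States of this toy grammar (the real TMbed grammar has many more sub-states).
STATES = ["inside", "TMH", "outside"]
PENALTY = -100.0  # score penalty for transitions forbidden by the grammar

# Allowed: staying in a state, and entering/leaving the membrane via TMH.
trans = np.full((3, 3), PENALTY)
np.fill_diagonal(trans, 0.0)
trans[0, 1] = trans[1, 0] = 0.0  # inside <-> TMH
trans[1, 2] = trans[2, 1] = 0.0  # TMH <-> outside

def viterbi(probs):
    """probs: array of shape (L, 3) with per-residue class probabilities."""
    L, S = probs.shape
    score = np.empty((L, S))
    back = np.zeros((L, S), dtype=int)
    score[0] = probs[0]  # any state may start the sequence in this toy model
    for i in range(1, L):
        cand = score[i - 1][:, None] + trans    # cand[prev, cur]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + probs[i]  # highest sum of probabilities
    # Backtrace from the best final state.
    path = [int(score[-1].argmax())]
    for i in range(L - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [STATES[s] for s in reversed(path)]
```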
We explicitly model the transition from IN to OUT state to allow for sequence parts that pass the membrane boundaries without actually being in contact with the membrane. For example, this includes parts of beta barrel TMPs that pass through the pore formed by their own beta barrel structure (Manuscript: Fig. 1). Given the two alternatives (forbidding or allowing such direct transitions), we decided to allow them.

Table S3: Protein and segment performance for signal peptide (SP) prediction based on 661 globular proteins with SPs and 4993 globular proteins without SPs. Performance values were averaged over the five independent cross-validation test sets; error margins given for the 95% confidence interval (1.96*standard error); bold: best values for each column; italics: differences statistically significant with over 95% confidence (only computed between best and 2nd best).

Table S4: TMbed segment performance for transmembrane beta strand (TMB) and helix (TMH) prediction based on 57 beta barrel and 571 alpha helical TMPs. TMPs are subdivided into groups based on their number of transmembrane segments: a) 2 or 4 TMBs, b) 8 or more TMBs, c) a single TMH, d) 2-5 TMHs, and e) 6 or more TMHs. The numbers in parentheses indicate the number of proteins within each group. Performance values were averaged over the five independent cross-validation test sets; error margins given for the 95% confidence interval (1.96*standard error).

Table S9: Comparison between the CNN model, models combining the CNN with either the Gaussian filter or the Viterbi decoder, and the final TMbed model combining all three components. Segment performances for transmembrane beta strand (TMB) and helix (TMH) prediction are based on 57 beta barrel and 571 alpha helical TMPs. Performance values were averaged over the five independent cross-validation test sets; error margins given for the 95% confidence interval (1.96*standard error).
Table S11: Segment performance for transmembrane beta strand (TMB) and helix (TMH) prediction based on 14 beta barrel and 86 alpha helical TMPs from the DeepTMHMM data set; all proteins are non-redundant with respect to the TMbed data sets. Qok, recall, and precision were averaged over 1000 bootstrap iterations (random sampling with replacement); error margins given for the 95% confidence interval (1.96*standard deviation); bold: best values for each column; italics: differences statistically significant with over 95% confidence (only computed between best and 2nd best, or between all methods ranked 1 and those ranked lower; ignores the OPM baseline).
1 Evaluation missing for one of 85 α-TMPs.
2 OPM represents the baseline for how much the DeepTMHMM data set annotations agree with our annotations collected from the OPM database, i.e., we are using the OPM annotations as predictions for the DeepTMHMM data set. The performance statistics were evaluated for a set of 44 beta barrel and 184 alpha helical TMPs common to both data sets (TMbed and DeepTMHMM).
Figure S2: Viterbi decoder grammar. Transitions encoded in the Viterbi decoder to go from one state to another. We split transmembrane beta strands (TMB), helices (TMH), and signal peptides (SP) into sub-states to enforce minimum segment lengths of five residues. A decoded sequence must start with one of the blue states and may only end with one of the orange states.
The IN and OUT states on both sides represent the same two internal states and are only duplicated to simplify the graph.
Figure S3: Nested cross-validation. For the nested cross-validation process, we split the data set into five cross-validation splits (CVS). During each of the five outer iterations, we used one split as the test set to estimate the model's final performance and the other four (the development set) to develop the model. We further divided those four splits into a training set and a validation set. We then trained the model on the training set and evaluated its performance on the validation set, repeating the process for each hyperparameter combination. We repeated this three more times, each time using a different split as the validation set. We chose the best set of hyperparameters based on the average performance over all four validation sets, trained the model with those parameters on the development set, and evaluated its performance on the test set. We repeated this overall process four more times, each time choosing a different CVS for the test set, until each CVS had been used for testing once. This process yielded five trained models, which we used for the final TMbed ensemble; a sketch of the loop structure follows below.
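In pseudo-Python, the nested loop looks roughly like this (train_and_eval and hyperparameter_grid are hypothetical stand-ins, not TMbed code):

```python
import random

def train_and_eval(train, val, params):
    """Stub standing in for model training plus validation scoring."""
    return random.random()

splits = [0, 1, 2, 3, 4]                             # five CVS
hyperparameter_grid = [{"lr": 1e-3}, {"lr": 1e-4}]   # hypothetical grid

final_models = []
for test in splits:                        # outer loop: held-out test split
    dev = [s for s in splits if s != test]
    scores = {}
    for p, params in enumerate(hyperparameter_grid):
        vals = [train_and_eval([s for s in dev if s != val], val, params)
                for val in dev]            # inner loop: rotate validation split
        scores[p] = sum(vals) / len(vals)  # average over four validation sets
    best = hyperparameter_grid[max(scores, key=scores.get)]
    final_models.append((test, best))      # retrain on dev with best params,
                                           # then evaluate once on `test`
```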
Figure S4: Segment validation criteria. Illustration of the two criteria for a predicted segment to count as correct: 1) the start and end positions must not deviate by more than five residues, i.e., $\max(d_1, d_2) \le 5$; and 2) the intersection (overlap) $I$ between the observed and predicted segment must be at least half of their union, i.e., $\frac{I}{I + d_1 + d_2} \ge 0.5$, where $d_1$ and $d_2$ are the deviations at the two segment endpoints.
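The two criteria translate directly into code; a minimal sketch assuming inclusive residue indices (not TMbed's actual evaluation code):

```python
def segments_match(obs, pred, max_shift=5, min_overlap=0.5):
    """obs, pred: (start, end) residue indices, inclusive."""
    d1 = abs(obs[0] - pred[0])  # deviation at the start position
    d2 = abs(obs[1] - pred[1])  # deviation at the end position
    inter = max(0, min(obs[1], pred[1]) - max(obs[0], pred[0]) + 1)
    union = (obs[1] - obs[0] + 1) + (pred[1] - pred[0] + 1) - inter
    return max(d1, d2) <= max_shift and inter >= min_overlap * union

# Example: a predicted helix shifted by two residues still counts as correct.
print(segments_match((10, 30), (12, 32)))  # True
```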
Figure S5: More potential transmembrane proteins in the globular data set.
AlphaFold2 (23,24) structures of nine proteins from the globular data set: major surface antigen 4 (Q07408), normal mucosa of esophagus-specific gene 1 protein (Q9C002), Kunitz-type protease inhibitor 1 (O43278), G0/G1 switch protein 2 (P27469), maintenance of telomere capping protein 3 (P53077), sporulation protein RMD1 (Q03441), protein root UVB sensitive 2 (Q9SJX7), uncharacterized protein YDL157C (Q12082), and meiotically up-regulated gene 33 protein (O74472). For most proteins, transmembrane segments (dark purple) predicted by TMbed correlate well with membrane boundaries (dotted lines: red=outside, blue=inside) predicted by the PPM (25) web server. Images created using Mol* Viewer (26). Though our data set lists them as globular proteins, the predicted structures indicate transmembrane domains, which align with segments predicted by our method. Predictions were made with the final TMbed ensemble model.

Figure S6: Transmembrane segment length distributions for 44 beta barrel (A) and 184 alpha helical (B) transmembrane proteins common to both the TMbed and DeepTMHMM data sets. Lines: statistics for the annotated segments in each data set; Bars: statistics for the segments predicted by each method during its individual cross-validation. Panel B is cropped on the right, omitting two annotated segments (L: 40, 44) and four predicted segments (L: 37, 38, 40, 43) for TMbed.