
AllesTM: predicting multiple structural features of transmembrane proteins

Abstract

Background

This study is motivated by the following three considerations: a) the physico-chemical properties of transmembrane (TM) proteins are distinctly different from those of globular proteins, necessitating the development of specialized structure prediction techniques, b) for many structural features no specialized predictors for TM proteins are available at all, and c) deep learning algorithms make it possible to automate the feature engineering process and thus facilitate the development of multi-target methods for predicting several protein properties at once.

Results

We present AllesTM, an integrated tool to predict almost all structural features of transmembrane proteins that can be extracted from atomic coordinate data. It blends several machine learning algorithms: random forests and gradient boosting machines, convolutional neural networks in their original form as well as those enhanced by dilated convolutions and residual connections, and, finally, long short-term memory architectures. AllesTM outperforms other available methods in predicting residue depth in the membrane, flexibility, topology, and relative solvent accessibility in the bound state, while in the prediction of torsion angles, secondary structure and monomer relative solvent accessibility it lags only slightly behind the currently leading method SPOT-1D. High accuracy on a multitude of prediction targets and easy installation make AllesTM a one-stop shop for many typical problems in the structural bioinformatics of transmembrane proteins.

Conclusions

In addition to presenting a highly accurate prediction method and eliminating the need to install and maintain many different software tools, we also provide a comprehensive overview of the impact of different machine learning algorithms and parameter choices on the prediction performance.

AllesTM is freely available at https://github.com/phngs/allestm.

Introduction

Starting with the seminal work of Qian and Sejnowski [1], only the sky seems to be the limit for the application of machine learning methods to sequence-based protein structure prediction. Over the years, as the variety of employed algorithms expanded (neural networks, support vector machines, random forests), so did the scope of prediction targets – from secondary structure, transmembrane regions, and signal peptides to solvent accessibility, crystallizability, contact maps, and interaction sites. The art of predictor design is not only in the construction of an optimal machine learning framework and rigorous training and testing procedures, but also in the meticulous choice of the relevant feature space. In each particular case, the latter step requires in-depth domain knowledge and almost unavoidably involves arbitrary decisions. Recent major advances in protein structure prediction accuracy, made apparent by the latest rounds of the CASP contest [2], can be attributed to the wide adoption of deep learning methods [3, 4], which are able to learn hierarchical features and infer input/output mappings directly from complex and noisy data (Goodfellow et al., 2016). One specific advantage of this group of methods is that they allow automating the feature engineering process and thus eliminate, or at least significantly alleviate, the arguably most time-consuming step in the development of bioinformatics prediction algorithms. This, in its turn, opens up the possibility of developing multi-target prediction methods, i.e. methods that predict a whole array of protein properties directly from input sequences or evolutionary profiles.

In this work, we explore this idea focusing on integral transmembrane proteins. In spite of its functional importance and biomedical relevance, this class of proteins has been, to some extent, neglected in the field of structural bioinformatics, mostly due to the sparsity of experimental data available for training algorithms. For example, the residue contact prediction field was initiated in 1994 [5], but it was not until 2007 that specialized contact prediction methods for membrane proteins were proposed [6, 7]. Likewise, the first attempts to predict protein solubility were made almost 30 years ago [8], while for membrane proteins the first predictor of solubility and other experimental properties was developed in 2008 [9]. For some structural features, such as flexibility, secondary structure, torsion angles or solvent accessibility, membrane protein-specific predictors are not available at all. Experience shows that, in many cases, applying methods originally trained on soluble proteins to predict the structural features of transmembrane proteins leads to inferior performance.

Here we present AllesTM, an integrated tool to predict several essential structural features of transmembrane proteins: residue depth in the membrane, flexibility, topology, three-state secondary structure, relative solvent accessibility (in the protein’s monomer form as well as bound in a complex), and the torsion angles φ and ψ. By combining state of the art machine learning algorithms, a modern dataset of known transmembrane protein structures, widely accepted evaluation methods and metrics, as well as heavy automation of training and parameter selection, we were able to either outperform the currently leading methods or at least achieve comparable prediction accuracy. An additional practical benefit of AllesTM is that it eliminates the need to install and maintain many different software tools and that it is easy to install thanks to minimal dependencies and proper packaging.

Results

Z-coordinates

The depth of a residue in the membrane is predicted as a continuous number, its distance from the membrane center, ranging from − 25 to + 25 Å (Fig. 1, S1 and S2 Figs). Significant deviations from a uniform distribution of residue occurrence beyond the central portion of the transmembrane segment (i.e. outside of the range − 15 to + 15) can be explained as follows. The slightly higher occurrence of residues in the vicinity of the membrane boundaries (around − 15 and + 15) is apparently caused by interfacial helices oriented parallel to the membrane as well as by reentrant regions. Minor ‘dents’ in the distribution at approximately − 20 Å and especially at 20 Å are due to the fact that in our data many amino acid chains begin (or end) at this distance from the membrane, and the protein N- or C-termini are more frequently positioned at 20 Å than at − 20 Å. Finally, the peaks at − 25 Å and 25 Å are a trivial consequence of the fact that any z-coordinates beyond this range are clipped to these limits.

Fig. 1 Distribution of the observed and predicted z-coordinates on the independent test dataset. Only the most basic (RF) and the most advanced method (AllesTM) are shown

While the z-coordinates are the only target for which no other method is publicly available for comparison, several conclusions regarding the employed machine learning algorithms can be drawn from the performance overview shown in S1 Table. First, the accuracy increases with the receptive field of the method. The RF and the GBM, with their limited receptive field, achieve an MAE of 10.64 and 9.63, respectively, on the cross-validation dataset. The low performance of the RF is also visualized in Fig. 1, Fig. S1 and Fig. S2, showing that the predicted values deviate strongly from the expected distribution and accumulate around a z-coordinate of 0. The conv method, which has a much wider receptive field due to its multiple layers, already achieves an MAE of 6.94 on the same data, while the dconv and LSTM approaches outperform the other models with MAEs of 6.02 and 4.99, respectively. The importance of a large receptive field indicates that the depth of a residue in the membrane is influenced by long-range correlations between amino acid residues. At least for the cross-validation dataset, the MAE of the final method AllesTM (5.08) is slightly higher than the MAE of the LSTM model, but with a reduced MSE/RMSE. AllesTM produces results that resemble the observed distribution of z-coordinates quite well, especially when compared to the RF method (Fig. 1). All approaches without exception perform better on the independent test dataset than on the cross-validation dataset. This can have one of two reasons: either the independent test dataset is easier to predict, or averaging the five models derived from cross-validation increased the performance, as the averaging procedure is equivalent to additional ensembling. The expected MAE of the final method AllesTM is 3.72 with an RMSE of only 6.32, which means that there are only a few large errors.

Topology

AllesTM predicts not only the position of a residue with respect to the membrane as an absolute value, but also in which of the four types of segments the residue is situated: inner and outer side of the membrane (In/Out), transmembrane segment (TMS), or reentrant region (RER). Table 1, S2 Table, Fig. S3 and Fig. S4 show the performance of the trained algorithms compared with MEMSAT-SVM, PolyPhobius and SCAMPI on the cross-validation as well as the independent test dataset. Similar to the z-coordinate prediction, a larger receptive field has a positive impact on the prediction performance, although the effect is generally less pronounced than for the z-coordinate prediction and more prominent for the In and Out segments than for the TMS and RER segments. For example, the RF model reaches an AUC of 0.85 compared to an AUC of 0.9 for the LSTM model in the In segments, while for the TMS segments the AUC differs only by 0.01 (0.88 for the RF model and 0.89 for the LSTM model). Furthermore, in terms of accuracy AllesTM outperforms MEMSAT-SVM as well as PolyPhobius and SCAMPI. This holds true even when RER segments, a class neither PolyPhobius nor SCAMPI can predict, are excluded from the benchmark. Considering the classes In, Out, TMS, and RER separately, only for RER segments does MEMSAT-SVM achieve a higher AUC value.

Table 1 Performance summary of all methods and prediction targets evaluated on the independent test dataset. Bold face indicates the best performing method. More detailed metrics, broken down by individual classes, can be found in the supplementary materials

Again, all our models perform better on the independent test dataset by a wide margin. In particular, AllesTM, our final method, achieves an ACC of 0.85 during cross-validation and an ACC of 0.9 on the independent test dataset. In contrast, MEMSAT-SVM, PolyPhobius and SCAMPI exhibit similar performance on the two datasets (ACC of 0.74, 0.75 and 0.77 on the independent test dataset, respectively). Thus, it appears that the two datasets are of similar difficulty, but the additional ensembling of the cross-validation models boosts the prediction performance even further. With an AUC of 0.7 on RER segments, AllesTM also outperforms MEMSAT-SVM (AUC of 0.58) by a wide margin on the independent test dataset.

Flexibility

B-factors were normalized for each protein and predicted both as continuous values and in terms of two classes, i.e. flexible and non-flexible. In contrast to the z-coordinate and topology predictions, a large receptive field seems to be of less importance for the prediction of continuous and two-state flexibility. For example, comparing the performance of the GBM and the LSTM on the independent test dataset, the MAE is only 0.02 smaller for continuous flexibility (0.65 and 0.63, respectively) and the AUC is only better by 0.01 for two-state flexibility (0.65 and 0.66, respectively) (Table 1, S3 and S4 Tables, as well as Fig. S5, Fig. S6, Fig. S7 and Fig. S8). In line with these observations, the blending approach, i.e. combining different models in the final method AllesTM, results in only a small performance improvement. AllesTM achieves a correlation of 0.51 for continuous flexibility on the independent test dataset, compared to 0.4 for PROFbval and 0.13 measured for PredyFlexy. While the Neural Network (NN)-based PROFbval performs well compared to our RF model, it is not specifically trained on transmembrane proteins (TMPs) and was trained on a substantially older dataset. While the creators of PredyFlexy claim a correlation with the target variable of 0.71, we were not able to reproduce this performance. As PredyFlexy predicts flexibility by mapping structure fragments with known flexibility values to the sequence, we presume that no TMPs were included in its fragment library.

Regarding the performance of two-state flexibility predictions, a similar picture emerges. AllesTM achieves an AUC of 0.66 on the independent test dataset, 0.1 better than PROFbval. The difference in performance between cross-validation and the independent test dataset is only marginal for both the continuous and the two-state flexibility. As an example for continuous flexibility, AllesTM has an MAE of 0.66 on the cross-validation dataset but only 0.63 on the independent test dataset. PROFbval, on the other hand, performs slightly worse on the independent test dataset than on the cross-validation dataset (MAEs of 0.78 and 0.77, respectively).

Torsion angles

AllesTM only predicts the φ and ψ angles because the third dihedral angle, ω, is essentially fixed at 180°. Similar to protein flexibility, long-range interactions seem to play a minor role when predicting φ and ψ, as the methods with a larger receptive field perform only slightly better. Accordingly, ANGLOR, which uses an NN for φ and an SVM for ψ angles, and SPINE X, which uses a multi-layer NN, perform slightly worse compared to AllesTM (Table 1, S5 and S6 Tables, as well as Fig. S9, Fig. S10, Fig. S11 and Fig. S12). For φ angles in the independent test dataset, SPINE X and ANGLOR achieve an MAE of 20.68 and 19.57, respectively, while the error of AllesTM is 17.34. For comparison, the MAE of our LSTM solution is only 0.31 lower (17.03), while the MAE of our worst performing approach, the RF, is only 1.57 higher (18.91) than that of AllesTM, and still ahead of SPINE X and ANGLOR. Nevertheless, looking at the MAE, SPOT-1D outperforms AllesTM (15.85 vs 17.34), but with a slightly higher RMSE (31.69 compared to 31.03 for AllesTM). That means that AllesTM, while being on average less accurate, makes fewer large mistakes.

The results are similar for ψ angles, although SPINE X achieves a slightly lower MAE than our RF method on both the cross-validation (MAEs of 39.76 and 42.91, respectively) and the independent test dataset (MAEs of 38.33 and 39.5, respectively). In terms of RMSE, SPINE X trails behind the RF model for the ψ angles on both datasets. While performing with a similar MAE, ANGLOR achieves a much better RMSE than SPINE X on the independent test dataset (57.88 and 77.41, respectively). In this case SPOT-1D, with an MAE of 23.51 and an RMSE of 50.11, outperforms AllesTM, which has an MAE of 30.41 and an RMSE of 50.73. Both AllesTM and SPOT-1D use LSTMs with their large receptive field, and the previous comparisons suggest that the size of the receptive field has only a minor impact on φ and ψ angle prediction performance, so the strength of SPOT-1D has to lie elsewhere. The two main differences are that SPOT-1D uses predicted contacts as a feature and a larger training dataset, as it is not restricted to TMPs. There are thus three possible explanations for this advantage. The first is that contacts between residues which are so far apart in the sequence that even a large receptive field cannot incorporate them have an impact on the prediction target. An alternative explanation is that the correlated mutations of sequentially co-located residues provide a very strong signal. The third possible explanation is that the use of a larger training dataset outweighs the specifics of TMPs.

Secondary structure

Table 1 and S7 Table show the performance of AllesTM and its underlying models, SPINE X, PROFphd, PSIPRED and SPOT-1D for the three-state prediction of secondary structure, i.e. α-helix, β-strand, and coil. Looking at the overall accuracy (ACC) on the cross-validation as well as on the independent test dataset, AllesTM, PSIPRED and SPOT-1D perform best: AllesTM with an ACC of 0.84 and 0.86 on the two datasets, respectively, PSIPRED with an even higher ACC of 0.85 and 0.87, and SPOT-1D with a remarkable ACC of 0.88 and 0.89. The other models follow closely, with RF being the worst performer (ACC of 0.8 and 0.82 on the two datasets). SPINE X and PROFphd are slightly behind, with an ACC of 0.77 and 0.78 on the independent test dataset, respectively. With the overall ACC being rather close, especially for our own models, one could assume that the size of the receptive field does not play a significant role for the prediction of secondary structure. Zooming into the performance of the methods for α-helix, β-strand and coil separately reveals that this is only true for α-helices and coils. Using the AUC as a threshold-independent measure, the performance for predicting helices correlates almost perfectly with the overall ACC (Fig. S13 and Fig. S14). The RF model, for instance, with its ACC of 0.82 on the independent test dataset has an AUC of 0.8 on helices, while the LSTM achieves an ACC of 0.85 and an AUC of 0.86 on the same dataset and the same class. For β-strands, however, the situation is different. The AUCs of the RF and the GBM, the two models with the smallest receptive field, are 0.62 and 0.73 on the independent test dataset, respectively. The models with a larger receptive field, including conv, all achieve an AUC of 0.79 and above for β-strands. This is also in line with the observation that SPOT-1D especially excels at predicting β-strands (AUC of 0.88), presumably by combining an LSTM, which has a large receptive field, with contact predictions as an input feature.

Furthermore, the exceptional performance of PSIPRED and SPOT-1D leads to the conclusion that secondary structure in TMPs can be predicted without data or algorithms specifically tailored to this protein class. To support this statement, we evaluated the performance of all prediction methods only on the residues not located in the membrane (S8 Table). The ACC of AllesTM as well as of PSIPRED and SPOT-1D (and all other methods) suffers from excluding the TM segments. As an example, the ACC of AllesTM drops from 0.78 to 0.73 on the independent test dataset. This indicates that these regions, which are mostly helical, are relatively easy to predict due to their highly hydrophobic nature and degenerate amino acid composition. Additionally, this comparison shows that in some cases including data from globular proteins is beneficial for the prediction performance, be it because of the larger amount of training data or because the differences between transmembrane and globular proteins are not as relevant for this specific target.

Relative solvent accessibility

We predicted the relative solvent accessibility (RSA) for each residue in three forms: i) the monomer RSA, which is the RSA of a residue if the protein chain is not bound to any other chain, ii) the complex RSA, which is the RSA of a residue taking the whole complex into account, and iii) the change of RSA upon complex formation, i.e. an indication of whether a residue is part of the interaction interface.

Table 1, S9, S10, and S11 Tables as well as Fig. S15, Fig. S16, Fig. S17, Fig. S18, Fig. S19 and Fig. S20 show the performance of our models, including the final method AllesTM, for these three prediction targets. The prediction performance of AllesTM for the monomer and the complex RSA is very similar. For example, AllesTM achieves an MAE of 0.15 and 0.14 on the independent test data, respectively, which is representative of the other models as well. Long-range signals, which would be captured by a larger receptive field, have a noticeable impact on the MAE (e.g. 0.18 for RF and 0.15 for AllesTM on the independent test dataset predicting the monomer RSA), but not on the RMSE (e.g. 0.22 and 0.2, respectively). As the RMSE, which is similar for all models, is especially sensitive to large prediction errors, the differences in prediction performance for the RSA targets seem to be rather nuanced and to affect only the MAE. For the monomer and complex RSA benchmarks, we included SPINE X and SPOT-1D as well. SPINE X lags behind AllesTM regardless of the applied performance measure but performs better on the complex RSA (MAE of 0.17) than on the monomer RSA (MAE of 0.2). For SPOT-1D it is the other way around: it even outperforms AllesTM slightly on the monomer RSA, with an MAE of 0.14 compared to 0.15 for AllesTM, while lagging behind on the complex RSA (MAE of 0.16 compared to 0.14, respectively). Given the benefits of using contact predictions as input features for targets such as secondary structure, it is surprising that this feature seems to have a rather small impact on RSA prediction. Figure 2 supports the previous observations, showing the distribution of the actual monomer RSA values and the predicted ones. Although not explicitly visible from the numbers, the distribution of the RF and GBM models’ predictions clearly differs in shape from the observed distribution.

Fig. 2 Distribution of the observed and predicted monomer solvent accessibility values in the independent test dataset

Compared to the other targets, the prediction of RSA changes does not show a very clear picture, or at least differs from the previous observations depending on the measure applied. Regarding the RMSE, for example, the RF and GBM models are on par with the final AllesTM method, achieving 0.13 on the cross-validation data and 0.15 on the independent test data, outperforming the conv, dconv and LSTM models. For the MAE, however, the opposite is the case, as AllesTM, RF and GBM do not achieve an MAE as low as the conv, dconv and LSTM models.

Discussion

In contrast to most of the previously proposed methods, usually focused on just one property, AllesTM predicts 10 different structural features of transmembrane proteins, i.e. almost every aspect of protein structure that can be extracted from atomic coordinates. It blends several state of the art machine learning algorithms: random forests and gradient boosting machines, convolutional neural networks in their original form as well as enhanced by dilated convolutions and residual connections, and, finally, long short-term memory architectures. All predictions were carefully evaluated by 5-fold cross-validation, tested on an independent dataset, and compared to the respective state of the art methods.

We found that the size of the receptive field, i.e. the number of adjacent residues considered while making predictions for a specific residue position, has a varying impact on different prediction targets. This is also true for contact predictions, which are used by SPOT-1D as an input feature, giving the tool an advantage for some targets. Furthermore, using a method specifically geared towards TMPs is not always beneficial, or its advantage can at least be compensated by the availability of more training data, as is, for example, the case for secondary structure prediction.

Conclusions

In terms of prediction accuracy, the main results can be summarized as follows:

  • Z-coordinate. Being the only publicly available method for this particular target, AllesTM predicts Z-coordinates for individual residues with an average error of 3.72 Å (about 12% of the average membrane thickness), therefore locating the residues reasonably well.

  • Topology. AllesTM achieves an accuracy of 0.9 in a four-state prediction (inside, outside, transmembrane, and re-entrant regions) and thus outperforms the leading methods MEMSAT-SVM, SCAMPI and PolyPhobius.

  • B-factors, i.e. residue flexibility. AllesTM achieves an MAE of 0.63 for continuous value predictions and an AUC of 0.66 for two state (flexible/non-flexible) predictions. The respective values for PROFbval are 0.78 and 0.56.

  • Three-state secondary structure. With an accuracy of 0.86 AllesTM lags slightly behind PSIPRED (0.87) and SPOT-1D (0.89).

  • Relative solvent accessibility (RSA). For proteins complexed with other chains (MAE of 0.14), AllesTM outperforms SPINE X by approximately 20% and SPOT-1D by a small margin (MAE of 0.16). For monomers (MAE of 0.15), SPOT-1D is marginally better, by an MAE difference of 0.01. Predicting the RSA difference between monomers and their bound forms is unique to AllesTM and is achieved with an MAE of 0.08.

  • Torsion angles φ and ψ. AllesTM (MAEs of 17.34 and 30.41 for φ and ψ, respectively) performs better than ANGLOR (19.57 and 38.33) but worse than SPOT-1D (15.85 and 23.51).

By providing a multitude of prediction targets and avoiding the use of complex dependencies characteristic of many other prediction tools, AllesTM is easy to set up and run, making it a useful universal tool for structural bioinformatics studies. AllesTM was developed in Python and is available as a standalone tool with HHblits as its only non-Python dependency. It can be installed either from GitHub or via the Python package manager pip. See https://github.com/phngs/allestm for detailed instructions.

Materials and methods

Dataset

We retrieved from the Orientations of Proteins in Membranes (OPM) database [10] all 4357 entries belonging to either the “Alpha-helical polytopic” or the “Bitopic proteins” class, each containing one or multiple protein chains with known 3D structures. OPM entries contain information about the thickness of the membrane and the relative position of the protein with respect to it. In addition, OPM offers modified PDB (Berman et al., 2000) files, with proteins rotated and translated in such a way that each atom’s z-coordinate corresponds to the depth of that atom in the membrane. A z-coordinate of 0 means that the atom is located at the center of the membrane, while values deviating from 0 indicate the distance of an atom from the membrane center measured in Å. These distances can be either negative or positive, depending on which side of the membrane the atom is located on. Positive z-coordinates correspond to the side of the membrane facing the compartment that is more “outside” relative to the other side, where z-coordinate values are negative. For example, z-values for the inner and outer bacterial membranes are assigned in such a way that the periplasmic space is annotated with positive values for the inner membrane (negative values represent the cytoplasm) and with negative values for the outer membrane (where positive values represent the extracellular space). If the absolute value of a Cα atom’s z-coordinate exceeds half of the membrane’s thickness (about 15.1 ± 1.0 Å in our final dataset according to the OPM database), the residue is considered to be located outside of the membrane. Protein chains not crossing the membrane at least once, i.e. not having at least one Cα atom on each side of the membrane, were excluded from consideration.

We only considered protein structures solved by X-ray crystallography with a resolution of 3.5 Å or better. Protein chains shorter than 30 residues as well as those containing only one amino acid type were ignored. Because some OPM entries had no properly annotated B-factors, we required structures to have more than one different B-factor value. Amino acid sequences were extracted from the ATOM records of the PDB entries (atomseq), and only proteins with coordinate information available for the backbone atoms N, C and Cα of every amino acid were retained for further analysis. Because the atomseq records do not always correspond to the complete protein sequence provided in the seqres records, we required at least 80% of the seqres sequence to be covered by the atomseq sequence.

Missing residues can result in unrealistically large jumps between the coordinate, B-factor, torsion angle and solvent accessibility values of residues that are consecutive in the atomseq sequence but not in the seqres sequence. For this reason, we only allowed missing residues at the beginning and the end of the atomseq sequence but did not consider them during training and benchmarking of the classifier. These filtering steps resulted in a dataset of 5375 high-quality protein chains, which were subjected to the redundancy reduction and cross-validation described in the next section.

Redundancy reduction and cross-validation

While it is desirable to retain as many sequences as possible for training a machine learning model, there are practical limits to the amount of data that can be used. First, the training process can be quite computationally intensive depending on the algorithm used, and second, redundant data leads to an overestimation of the method’s performance. Therefore, we chose to combine a two-step redundancy reduction procedure with a standard 5-fold cross-validation and testing on an independent dataset (Fig. 3). First, the initial dataset containing the 5375 protein chains was made non-redundant at the 40% sequence identity level using CD-HIT [11], which resulted in a much smaller dataset of only 302 sequences well suited for computationally intensive training procedures. Thirty-one of these sequences, i.e. 10%, were set aside as independent test data and not employed for any training or parameter tuning until the final models were built. The remaining 271 proteins, or 90%, formed the foundation for the cross-validation data. In order to provide a fair assessment of our methodology, we ensured an equal distribution of protein topologies (proteins with a given number of transmembrane segments (TMS)) in both parts of the split. From the initial cross-validation data, all proteins which had a sequence identity above 30% to any protein in the independent test data were removed, resulting in 178 proteins. These sequences were randomly split into five equally sized bins, with four of these bins used for training and validation and one bin for testing, again ensuring an equal distribution of protein topologies. In order not to overestimate the performance of our predictor during cross-validation, we removed all proteins from these five bins that shared a sequence identity greater than 30% with any protein in the testing bin. This approach guarantees that the test bin is completely independent from the training and validation bins, while retaining as many proteins for training as possible. In order to be able to choose the hyperparameters of the learning algorithms (i.e. parameters that affect the model but have to be chosen prior to the training process) without using the test bin or the independent test set, the four training and validation bins were split further: 90% of the sequences were used to learn the model’s parameters (training), while 10% were used to estimate the best performing hyperparameters (validation). Similar to the previous splits, topologies were equally distributed, and proteins in the training dataset sharing more than 30% sequence identity with any protein in the validation dataset were removed. We performed these steps five times, using each bin, and therefore each protein, for testing exactly once. Because of the rigorous redundancy reduction inside each fold, the distribution by protein topology resulted in folds of slightly different sizes (S12 Table).

Fig. 3 Datasets used for 5-fold cross-validation and testing. Gray boxes indicate sequences excluded from consideration as they shared more than 30% identity with the test data. See text for further details

Prediction targets

Residue z-coordinates and protein topology

The depth of a residue in the membrane was obtained from the z-coordinate of its Cα atom in the modified PDB files derived from the OPM. Because we are mainly interested in the z-coordinates of residues inside and adjacent to the membrane boundaries, we limited the z-coordinate values to the range from − 25 to 25 Å as the membrane thickness normally does not exceed 30 Å. Values outside that range were set to − 25 and 25 Å, respectively. For training and prediction we scaled the z-coordinate values to a range from − 1 to 1, while for reporting the results the actual values were used.
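
As an illustration, the clipping and scaling described above can be written as the following minimal sketch (the function names are ours and not part of the AllesTM code base):

```python
import numpy as np

Z_LIMIT = 25.0  # z-coordinates are clipped to [-25, 25] Angstrom

def scale_z(z_coords):
    """Clip raw Calpha z-coordinates and scale them to [-1, 1] for training."""
    z = np.clip(np.asarray(z_coords, dtype=float), -Z_LIMIT, Z_LIMIT)
    return z / Z_LIMIT

def unscale_z(z_scaled):
    """Map scaled values back to Angstrom for reporting."""
    return np.asarray(z_scaled, dtype=float) * Z_LIMIT

# Residues at -30, -10, 0, 12 and 40 Angstrom from the membrane center
print(scale_z([-30.0, -10.0, 0.0, 12.0, 40.0]))  # -> [-1, -0.4, 0, 0.48, 1]
```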

In order to assign residues to discrete states, we determined four types of topological domains for each transmembrane protein. Initially, TMSs were defined as segments whose two ends lie within 10 Å of the opposite sides of the membrane. These segments were then extended in both directions until they reached the membrane boundaries. Next, reentrant regions (RERs) were identified by finding stretches of amino acids which enter and exit the membrane on the same side, consist of at least 3 amino acids and penetrate at least 3 Å into the membrane [12]. Finally, the remaining segments, i.e. those which are neither TMSs nor RERs, were annotated as domains residing on either the inner or the outer side of the membrane depending on whether their residues’ z-coordinates were negative or positive, respectively. All residues were thus assigned to one of the four discrete states – inside, TMS, outside and RER – and encoded by the binary vectors [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], respectively.
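
The resulting encoding is a simple lookup; the sketch below (state labels and function name are illustrative, not taken from the AllesTM implementation) shows how a per-residue state sequence is converted into the binary vectors above:

```python
import numpy as np

# One-hot encoding of the four topological states, in the order given in the text
TOPOLOGY_STATES = ["inside", "TMS", "outside", "RER"]

def encode_topology(states):
    """Convert a per-residue list of state labels into an L x 4 one-hot matrix."""
    one_hot = np.zeros((len(states), len(TOPOLOGY_STATES)), dtype=int)
    for i, state in enumerate(states):
        one_hot[i, TOPOLOGY_STATES.index(state)] = 1
    return one_hot

print(encode_topology(["inside", "TMS", "outside", "RER"]))
```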

Continuous and two-state flexibility

Residue flexibility was represented both as a continuous and a two-state discrete variable based on the B-factors derived from the PDB entries. Because B-factors are heavily dependent on the experimental conditions and the structure resolution, they are not directly comparable between different structures. We therefore converted them into z-scores for each protein separately using the formula

$${B}_{norm}=\frac{{B}_{raw}-\mu }{\sigma }$$

where Braw is the original B-factor of a residue’s Cα atom, μ and σ are the mean and the standard deviation of all B-factors for a particular protein, respectively, and Bnorm is the continuous residue flexibility used as a prediction target. Continuous flexibility was converted into two discrete states for flexible (Bnorm > 0.03) and rigid (Bnorm ≤0.03) [13], numerically represented as 1 and 0, respectively.
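
A minimal sketch of this normalization and discretization (our own illustration, using the threshold of 0.03 given above):

```python
import numpy as np

def normalize_bfactors(b_raw, threshold=0.03):
    """Per-protein z-score normalization of Calpha B-factors.

    Returns the continuous flexibility target B_norm and its two-state
    discretization (1 = flexible, 0 = rigid).
    """
    b = np.asarray(b_raw, dtype=float)
    b_norm = (b - b.mean()) / b.std()
    two_state = (b_norm > threshold).astype(int)
    return b_norm, two_state

b_norm, flexible = normalize_bfactors([20.1, 35.4, 50.2, 28.9])
print(b_norm, flexible)
```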

A recent study suggested an approach to estimate the maximal average values of B-factors at a given crystallographic resolution [14]. We found that in roughly half of the structures in our independent dataset the B-factors actually exceed the proposed maximal values, presumably as a consequence of the membrane proteins being removed from their natural stabilizing environment. Given the small number of available structures, we chose not to exclude the atomic records with excessive B-factor values from consideration, but this issue definitely deserves further attention.

Torsion angles

The conformation of the protein backbone can be characterized by three dihedral angles, φ, ψ and ω, which correspond to rotations around the bonds between N and Cα, Cα and C, and C and N, respectively. The values of these angles are continuous and range from − 180° to 180°. Because the torsion angle ω is essentially fixed at 180° due to the partial double-bond character of the peptide bond, we only predict the φ and ψ angles. The values of the torsion angles were scaled to the range from − 1 to 1.
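
For illustration, φ and ψ can be derived from the backbone atom coordinates with the standard four-point torsion formula; the sketch below is our own and also shows the linear scaling to [−1, 1] that we assume corresponds to the conversion mentioned above:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four atoms, e.g. phi = C(i-1), N(i), CA(i), C(i)."""
    b0 = p0 - p1
    b1 = (p2 - p1) / np.linalg.norm(p2 - p1)
    b2 = p3 - p2
    # project b0 and b2 onto the plane perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

def scale_angle(angle_deg):
    """Linear scaling of an angle from [-180, 180] degrees to the [-1, 1] range."""
    return angle_deg / 180.0

# Toy coordinates: a trans (anti-periplanar) arrangement gives 180 degrees
p = [np.array(c, dtype=float) for c in [(0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)]]
angle = dihedral(*p)
print(angle, scale_angle(angle))  # 180.0 1.0
```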

Secondary structure

Secondary structure assignments were calculated from known 3D structures using DSSP [15, 16]. While DSSP defines a total of eight different secondary structure states, we followed the common approach and collapsed them to three states. H (α-helix), G (310-helix) and I (π-helix) were converted to helix, E (extended strand) and B (isolated β-bridge) to β-sheet, and all remaining states – T (turn), S (bend) and coil – to coil. These three states were presented to the learning algorithms as binary vectors: [1, 0, 0], [0, 1, 0] and [0, 0, 1] for helix, β-sheet and coil, respectively.
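
A minimal sketch of this mapping (the dictionary below is our own illustration of the three-state collapse; parsing of the DSSP output itself is omitted):

```python
# Collapse the eight DSSP states into three classes and one-hot encode them
DSSP_TO_THREE_STATE = {
    "H": "helix", "G": "helix", "I": "helix",   # alpha-, 3-10- and pi-helix
    "E": "sheet", "B": "sheet",                 # extended strand, isolated beta-bridge
    "T": "coil", "S": "coil", "-": "coil",      # turn, bend, coil/unassigned
}
ONE_HOT = {"helix": [1, 0, 0], "sheet": [0, 1, 0], "coil": [0, 0, 1]}

def encode_secondary_structure(dssp_states):
    """Map a per-residue DSSP state string to the three-state one-hot representation."""
    return [ONE_HOT[DSSP_TO_THREE_STATE.get(s, "coil")] for s in dssp_states]

print(encode_secondary_structure("HHHEET-S"))
```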

Relative solvent accessibility

The DSSP program was also used to calculate three continuous RSA-related targets for each residue. The absolute solvent accessibility of each residue, expressed in Å2, was normalized by dividing it by the maximum solvent accessibility of the respective amino acid according to [17]. The first RSA target was calculated taking into account only the amino acid chain to which a given residue belongs, i.e. treating the protein as a monomer. The second RSA target was calculated taking into account all protein subunits, while the third target was defined as the difference between the two values and therefore represents the change of RSA upon complex formation.
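
A sketch of how the three targets can be derived for a single residue; the maximum accessibility value and the sign convention of the difference are illustrative assumptions, not values taken from [17] or from the AllesTM implementation:

```python
def rsa_targets(acc_monomer, acc_complex, max_asa):
    """Compute the three RSA targets for one residue.

    acc_monomer / acc_complex: absolute accessibilities (in A^2) from DSSP runs
    on the single chain and on the full complex; max_asa: the residue type's
    maximum solvent accessibility used for normalization.
    """
    rsa_monomer = acc_monomer / max_asa
    rsa_complex = acc_complex / max_asa
    delta_rsa = rsa_monomer - rsa_complex  # change of RSA upon complex formation
    return rsa_monomer, rsa_complex, delta_rsa

# Example for a residue with a hypothetical maximum accessibility of 180 A^2
print(rsa_targets(acc_monomer=90.0, acc_complex=30.0, max_asa=180.0))
```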

Input features

We used the amino acid sequence and the evolutionary profile of the query protein as input to the classifier. The amino acid sequence was presented to the model using standard one-hot encoding: a residue is represented by a vector of length 20, initialized with zeros, in which each position corresponds to one of the 20 standard amino acids and only the position corresponding to the current amino acid is set to one. The entire protein sequence is thus represented by an Lx20 matrix, where L is the length of the sequence.
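
A minimal sketch of this encoding (the alphabetical residue ordering is our own choice; any fixed ordering of the 20 amino acids works):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def one_hot_sequence(sequence):
    """Encode a protein sequence as an L x 20 one-hot matrix (non-standard residues stay all-zero)."""
    matrix = np.zeros((len(sequence), 20), dtype=np.float32)
    for i, aa in enumerate(sequence.upper()):
        j = AMINO_ACIDS.find(aa)
        if j >= 0:
            matrix[i, j] = 1.0
    return matrix

print(one_hot_sequence("MKTAYIAK").shape)  # (8, 20)
```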

An evolutionary profile is another Lx20 matrix representation of the input protein sequence, derived from a multiple sequence alignment (MSA). Each value of the matrix indicates to what extent a specific amino acid is over- or underrepresented at a specific position of the alignment. More precisely, each value mi,j of the Lx20 matrix is the log ratio of the relative frequency pi,j of amino acid i at column j of the MSA and the relative frequency pi of amino acid i in the entire alignment, transformed with the sigmoid function:

$${m}_{i,j}=\frac{1}{1+\exp \left(-\log \left({p}_{i,j}/{p}_i\right)\right)}$$

If the amino acid i does not occur in the entire alignment or at a given position j of the MSA, the respective values of pi or pi,j can become zero, rendering the logarithm of pi,j/pi invalid. To overcome this issue, we added a pseudo count of one to each pi,j and a pseudo count equal to the number of columns in the alignment to each pi.
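
The following sketch shows one possible interpretation of this profile calculation, with the pseudo counts applied at the level of raw counts; it is our own illustration, not the code used to build AllesTM's profiles:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_from_msa(msa):
    """Build an L x 20 evolutionary profile from an MSA (list of equal-length strings).

    Each entry is sigmoid(log(p_col / p_global)), where p_col is the frequency of an
    amino acid in a given column and p_global its frequency in the whole alignment.
    Pseudo counts (one per column cell, the number of columns for the global counts)
    keep the log ratio defined, following the interpretation described in the text.
    """
    length = len(msa[0])
    counts = np.zeros((length, 20))
    for seq in msa:
        for i, aa in enumerate(seq):
            j = AMINO_ACIDS.find(aa)
            if j >= 0:
                counts[i, j] += 1

    col_counts = counts + 1                      # pseudo count of one per cell
    global_counts = counts.sum(axis=0) + length  # pseudo count of L per amino acid

    p_col = col_counts / col_counts.sum(axis=1, keepdims=True)
    p_global = global_counts / global_counts.sum()
    return 1.0 / (1.0 + np.exp(-np.log(p_col / p_global)))

profile = profile_from_msa(["MKTAYIAK", "MRTAYVAK", "MKSAYIAR"])
print(profile.shape)  # (8, 20)
```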

MSAs were calculated from top-scoring hits obtained by searching the Uniclust30 database [18] using HHblits, an iterative hidden Markov model-based method [19], with all default parameters.

Prediction algorithms

Neural network architecture

We implemented three different NNs sharing a similar basic architecture consisting of three building blocks (Fig. 4). In the following, the differences and the similarities between these NNs as well as the hyperparameters are described in more detail.

Fig. 4 Overview of the neural network architecture. The embedding layer processes the sequence feature before its output is concatenated with the evolutionary profile feature and passed to the hidden layers. As described in the text, each hidden layer is either a convolutional layer, an LSTM layer or a dilated convolution block, and the number of such layers is determined by hyperparameter selection. Another hyperparameter is whether an identity mapping is used or not. If it is, the input to each hidden layer additionally bypasses that layer to be added or concatenated to its output. Feeding the next layer with both the input and the output of the previous layer enables it to correct errors introduced by the previous layer. In case the first hidden layer’s inputs are added to its outputs instead of being concatenated, their dimensions must be aligned to make this operation possible; the identity mapping is therefore realized by another convolutional layer with a window size of one. After the hidden layers, several additional convolutional layers with a window size of one may be used, which are connected to the output neurons representing the actual predictions

The input and embedding layers form the entry point where the sequence and the evolutionary profile are fed to the NN. Before the two Lx20 matrices representing these features are combined, the sequence feature is processed by a so-called embedding layer [20]. During the training process, this layer learns to map the vectors of length 20 containing the binary encodings of the amino acids to vectors of nembedding real numbers, with each value of the vector potentially representing an amino acid property. The rationale behind this mapping is that the vectors of two amino acids with similar properties are closer to each other in terms of Euclidean distance than those of two amino acids with different properties. The benefit of this approach is that residue-related features such as physicochemical properties do not have to be manually selected but are rather inferred from the data. The application of the embedding layer to the sequence feature matrix results in a matrix of dimension Lxnembedding, which is then combined with the evolutionary profile to form an Lx(20 + nembedding) matrix to be fed to the hidden layers.
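
A sketch of this input/embedding block in Keras. The layer sizes are illustrative, and the embedding is emulated with a bias-free Dense layer applied to the one-hot input, which is mathematically equivalent to an embedding lookup (the original implementation may instead use keras.layers.Embedding on integer-encoded residues):

```python
from tensorflow.keras import layers, Model

n_embedding = 8  # assumed embedding dimension

seq_in = layers.Input(shape=(None, 20), name="one_hot_sequence")       # L x 20
profile_in = layers.Input(shape=(None, 20), name="evolutionary_profile")

# Map the 20-dimensional one-hot vectors to n_embedding real values per residue.
embedded = layers.Dense(n_embedding, use_bias=False, name="embedding")(seq_in)

# Concatenate with the profile to obtain the L x (20 + n_embedding) input.
hidden_input = layers.Concatenate(axis=-1)([embedded, profile_in])
Model([seq_in, profile_in], hidden_input).summary()
```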

The hidden layers are the variable building blocks between the three NN architectures. They either consist of nhidden_layers standard convolutional layers (conv layers), blocks of dilated convolutional layers (dconv blocks) or layers containing Long Short-Term Memory cells (LSTM layers) [21, 22].

Conv layers are equivalent to the established sliding window approach, i.e. the same set of nhidden_neurons neurons is applied to sequence segments in each position, with the central residue serving as a prediction target. The number of residues in the segment is defined by the window length w, which is also commonly known as kernel size in the deep learning literature.

Dconv blocks contain at least two layers equivalent to conv layers, except that the residues in the window are not directly adjacent. Instead, they are spaced according to the dilation rate d. Considering the central residue at position i, a window size of 5 and a dilation rate d of 2, the complete window would contain the residues i−4, i−2, i, i+2, i+4. The benefit of this approach is the expanded receptive field. Given a specific residue for which a target is predicted, the receptive field defines how many adjacent residues are considered when making the prediction. By increasing the dilation rate exponentially for each subsequent hidden layer, the receptive field also increases exponentially while the number of free parameters only increases linearly. Each dconv block contains ndconv_layers layers, resulting in a total number of nhidden_layers × ndconv_layers hidden layers. The dilation rate d increases with each subsequent layer l in a dconv block according to the formula dl = 2^(l − 1). The conv as well as the dconv layers use the rectified linear unit (ReLU) activation function, which has replaced the sigmoid function as the most commonly used activation function in NNs [23]. The output of a ReLU is calculated as max(0, x), with x being the input to the activation function. One of ReLU’s benefits is that it does not saturate even for large input values, mitigating the problem of vanishing gradients which can occur in deep NNs when the sigmoid or tanh activation functions are used.
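
A sketch of such a dconv block in Keras, with the dilation rate growing as dl = 2^(l−1); the filter count and window size are illustrative, not the hyperparameters selected for AllesTM:

```python
from tensorflow.keras import layers, Input, Model

def dconv_block(x, n_layers=3, n_filters=64, window=3):
    """Stack of dilated 1D convolutions; the receptive field grows exponentially."""
    for l in range(1, n_layers + 1):
        x = layers.Conv1D(
            filters=n_filters,
            kernel_size=window,
            dilation_rate=2 ** (l - 1),
            padding="same",
            activation="relu",
        )(x)
    return x

inputs = Input(shape=(None, 28))   # L x (20 + n_embedding), here n_embedding = 8
outputs = dconv_block(inputs)
Model(inputs, outputs).summary()
```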

LSTM layers are a type of recurrent neural network (RNN) layer widely applied in the fields of speech recognition and text analysis. Unlike the conv layers, whose receptive field is limited by the filter or window size, RNNs are capable of processing the entire sequence. One can imagine an RNN as a filter of size one which is applied to one sequence position after the other. The difference to a convolutional layer is that, instead of being independent of the previous position, the LSTM has access to the previous position’s output, which is combined with the input of the current residue. In our model, we use LSTMs in a bidirectional way by applying two separate LSTMs, one to the original and one to the reversed sequence. For each sequence position, the outputs of the two LSTMs are processed by a tanh activation function and concatenated to serve as input for the next layer in the network. Figure 5 visualizes the difference between the receptive fields for different types of hidden layers and a specific set of hyperparameters.
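
A sketch of a bidirectional LSTM hidden layer in Keras (the number of memory cells is illustrative):

```python
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(None, 28))  # L x (20 + n_embedding)

# Two LSTMs, one over the original and one over the reversed sequence; their
# per-position outputs (tanh-activated by default) are concatenated.
bilstm = layers.Bidirectional(
    layers.LSTM(units=64, return_sequences=True), merge_mode="concat"
)(inputs)

Model(inputs, bilstm).summary()
```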

Fig. 5 Impact of different types of hidden layers on the receptive field. The receptive field reflects the number of adjacent residues considered when making a prediction for a specific residue. Using a window size of three, the classic neural network architecture consisting of convolutional layers achieves a receptive field of size seven after three hidden layers. Using an increasing dilation rate for each layer, the receptive field can be expanded to 15 residues using the same window size and number of layers. For a bidirectional LSTM, the information from the complete sequence is available at each position and each layer

An additional hyperparameter is the use of residual connections, also called skip connections. This strategy allows a given hidden layer to learn from the mistakes of the previous hidden layer. It is achieved by creating a connection, called an identity mapping, which combines the output of the previous hidden layer with the output of the current layer, essentially bypassing it. This combination is performed by either adding the respective outputs of the previous hidden layer and the current layer or by concatenating them. A special case is the identity mapping used with the first hidden layer when addition is used as the combination function. Because the dimensions of the input, i.e. an Lx(20 + nembedding) matrix, do not match the output dimensions of the first hidden layer, their elements cannot be added directly. Therefore, a conv layer with a window size of one and a number of neurons equal to the number of outputs of the first hidden layer is used as the identity mapping.
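
A sketch of such a residual connection in Keras, with the identity mapping realized as a window-size-one convolution because the channel dimensions differ; all sizes are illustrative:

```python
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(None, 28))                                  # L x (20 + n_embedding)
hidden = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)

# The input has 28 channels and the hidden output 64, so a window-size-one
# convolution aligns the dimensions before the element-wise addition;
# concatenation would not need this projection.
identity = layers.Conv1D(64, 1, padding="same")(inputs)
combined = layers.Add()([hidden, identity])

Model(inputs, combined).summary()
```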

Finally, the dense and output layers transform the results from the hidden layers into the desired output. The dense layers are a set of ndense_layers conv layers with a window size of one and ndense_neurons neurons using the ReLU activation function. Their result is processed by the output layers, each directly connected to the last dense layer. The output layers are again conv layers with a window size of one, but each with the number of neurons matching the prediction target dimensionality, i.e. one for continuous targets such as z-coordinates or angles, one for the two-state flexibility as it is represented as a binary value, and three or more for categorical targets such as secondary structure or topology. The activation function also depends on the prediction target: a linear activation for continuous targets, a sigmoid activation (trained with a binary cross-entropy loss) for binary targets, and a softmax activation (trained with a categorical cross-entropy loss) for categorical targets.
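
A sketch of target-specific output heads attached to the last dense layer, here for the z-coordinate/topology combination described in the training section below; the dense layer width and the mean-squared-error loss for the continuous target are our own assumptions:

```python
from tensorflow.keras import layers, Input, Model

dense_out = Input(shape=(None, 32), name="last_dense_layer_output")

z_coord = layers.Conv1D(1, 1, activation="linear", name="z_coordinate")(dense_out)
topology = layers.Conv1D(4, 1, activation="softmax", name="topology")(dense_out)

model = Model(dense_out, [z_coord, topology])
model.compile(
    optimizer="adam",
    loss={"z_coordinate": "mse", "topology": "categorical_crossentropy"},
)
model.summary()
```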

Neural network training

While the previously mentioned hyperparameters affect the architecture of the NNs, a few more decisions have to be made in order to optimize the training process: the batch size, the optimization algorithm and its learning rate, and the method for optimizing the hyperparameters. A batch is a number of training examples which are presented to the NN and the optimization algorithm before the weights of the model are updated. The weight update therefore depends not only on the error on a single sample, but on the average error over a specific number of samples. The optimization algorithm is used to update the weights of the neurons in the NN, while the learning rate determines how much these weights are adapted after each batch. As designing and training a state of the art NN is often described as more of an art than an exact science, we ran some initial tests resulting in the following decisions. By trial and error, we settled on Adam as the optimization algorithm of our choice, as it led to the fastest convergence and therefore reduced training time for each experiment [24]. In contrast to a global learning rate, the Adam algorithm maintains a learning rate for each trainable weight in the network and adapts the weights based on the recent magnitudes of their gradients. Furthermore, we narrowed down the number of hyperparameter values describing the architecture to the most promising candidates with respect to prediction performance and runtime based on manual trials (S13 Table).

Because a grid search, i.e. testing all possible parameter combinations, would be computationally infeasible, we employed further strategies to make the parameter search and the training more efficient. Similar prediction targets were combined into one NN containing the output neurons for multiple prediction targets, thereby reducing the number of NNs to be trained. We conducted a trial-and-error assessment of the impact that specific combinations of prediction targets have on model performance compared to training the models separately. As a result, we combined the z-coordinates with the topology, the continuous with the two-state flexibility, the two torsion angles, and the three solvent accessibility targets. Only secondary structure was predicted most accurately without the influence of other targets. This results in only five NNs per type of hidden layer/block, i.e. only 15 different NNs. Instead of an exhaustive grid search, we used a random parameter search with the possible values given in S13 Table. More precisely, for each type of NN and for each of the five folds a random set of the possible values was chosen. If the chosen set of parameters performed better for a certain combination of NN type and fold, it was accepted as the currently best solution. Two additional measures were used to further speed up the training process: early stopping and reducing the learning rate during training. Early stopping is a procedure to abort the training process if no further improvements are expected. This is achieved by evaluating the current NN on the validation set after each iteration through all training samples. If the performance on the validation set stagnates or even decreases, the training is stopped early. Because performance fluctuations are quite normal, the training proceeds for up to 10 more iterations after the currently best result has been achieved, unless another performance increase occurs. If the training is stopped, the state of the model at the iteration with the highest performance is selected as the final model. The early stopping approach has two benefits: it prevents overfitting as well as unnecessary iterations, which are computationally intensive and time-consuming. While the training stops completely after 10 iterations without improvement, the learning rate is already reduced automatically, by multiplying it by 0.1, after 5 such iterations. This approach allows for faster convergence by using a higher learning rate at the beginning of the training and more precise weight updates (albeit with a lower learning rate) at the end.
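
A sketch of this training setup with Keras callbacks (Adam optimizer, early stopping with a patience of 10 epochs, learning-rate reduction by a factor of 0.1 after 5 epochs without improvement); the toy model, random data, batch size and loss are placeholders, not the settings used for AllesTM:

```python
import numpy as np
from tensorflow.keras import layers, Input, Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# Toy single-output model and random data, only to make the snippet runnable.
inputs = Input(shape=(None, 28))
hidden = layers.Conv1D(16, 3, padding="same", activation="relu")(inputs)
outputs = layers.Conv1D(1, 1, activation="linear")(hidden)
model = Model(inputs, outputs)

x = np.random.rand(32, 50, 28)   # 32 proteins of length 50
y = np.random.rand(32, 50, 1)

callbacks = [
    # stop if the validation loss has not improved for 10 epochs, keep the best weights
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    # multiply the learning rate by 0.1 after 5 epochs without improvement
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
]

model.compile(optimizer=Adam(), loss="mse")
model.fit(x, y, validation_split=0.1, batch_size=8, epochs=200,
          callbacks=callbacks, verbose=0)
```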

The neural networks were implemented using the deep learning framework Keras [25], which is based on the TensorFlow library [26].

Random forest

In addition to the three types of NNs, a random forest (RF) was implemented using the scikit-learn package [27, 28]. A random forest is an ensemble method that employs ntrees decision trees. The rationale for using this approach is to combine many classifiers/regressors which are themselves as accurate as possible while being diverse at the same time. To achieve this, a subset of training samples is created for each tree using bootstrapping. Additional diversity is introduced by using only a fraction nfeatures of all input features for each tree. A decision tree is thus trained on a randomly selected subset of training examples and features by choosing the feature which provides the best split between the classes for the first decision; this process is then repeated for the two resulting subtrees, their subtrees, and so forth. The default criteria used for the splits are the mean squared error for regression problems and the Gini impurity for classification problems. The output of the random forest is either a value between 0 and 1 for regression and binary classification problems, or a vector of values between 0 and 1 with the length equal to the number of output classes for multiclass classification problems. Significant benefits of the random forest approach are that it does not require data normalization, is relatively robust to overfitting, and reveals the importance of each feature and its impact on the prediction. In addition to ntrees and nfeatures, two further parameters were optimized: the minimum number of samples in each tree’s leaves, nmin_samples, which effectively reduces the depth of the trees, and the window size w. Because training an individual random forest model is fast, we were able to do a complete grid search over the parameters nfeatures, nmin_samples, and w (S14 Table), while ntrees was determined by repeatedly adding 100 trees until the performance on the validation data did not increase further.
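
A sketch of the sliding-window random forest baseline using scikit-learn; the window size, hyperparameter values and random data are illustrative, not the values selected by the grid search:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def window_features(feature_matrix, window=7):
    """Turn an L x F per-residue feature matrix into L overlapping windows
    (flattened to L x (window * F)), zero-padded at the chain termini."""
    half = window // 2
    padded = np.pad(feature_matrix, ((half, half), (0, 0)))
    return np.array([padded[i:i + window].ravel() for i in range(len(feature_matrix))])

rng = np.random.default_rng(0)
features = rng.random((100, 40))   # e.g. one-hot sequence + profile for 100 residues
targets = rng.random(100)          # e.g. scaled z-coordinates

model = RandomForestRegressor(n_estimators=100, max_features=0.5, min_samples_leaf=5)
model.fit(window_features(features), targets)
print(model.predict(window_features(features))[:5])
```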

Gradient boosting machine

The fifth algorithm used is a gradient boosting machine (GBM) implemented in the XGBoost package, which can be seen as an enhancement of RFs [29,30,31]. The difference is that in RFs the trees are independent of each other and can therefore be created in parallel. In GBMs, every tree is trained to reduce the error of the preceding trees, so the trees are built sequentially (which does not mean GBMs cannot use parallel processing). This effectively results in the model focusing more on the difficult samples for which the error is large. More precisely, each subsequent tree is trained on the gradient of a given loss function with respect to the predicted values of the previous trees. The output dimensions and value ranges are equal to those of RFs. Because of the more sophisticated approach and a training methodology involving gradient descent, GBMs are generally more accurate and require fewer trees until convergence, i.e. until the point where additional trees do not lead to a performance increase. On the downside, they are more sensitive to overfitting and thus require regularization methods, which are controlled by additional hyperparameters. The parameters ntrees, nfeatures and w are similar to those of RFs, with ntrees again being optimized by early stopping. The additional parameters control the number of samples nsamples used in each tree, the learning rate r for gradient descent, the maximum depth ndepth of each tree and the two regularization parameters γ and nmin_child_weight. Because the numerous tunable hyperparameter combinations would have been impossible to test exhaustively, we used a random search approach with the values shown in S15 Table.
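
A sketch of the GBM baseline using the XGBoost scikit-learn interface (parameter placement follows recent XGBoost releases); all hyperparameter values and data are illustrative and serve only to show where the parameters named above plug in:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x_train, y_train = rng.random((200, 40)), rng.random(200)
x_val, y_val = rng.random((50, 40)), rng.random(50)

model = xgb.XGBRegressor(
    n_estimators=1000,          # upper bound; early stopping picks the actual number
    learning_rate=0.05,         # learning rate r for gradient descent
    subsample=0.8,              # fraction of samples used per tree (n_samples)
    colsample_bytree=0.5,       # fraction of features used per tree (n_features)
    max_depth=6,                # maximum depth of each tree (n_depth)
    gamma=1.0,                  # regularization parameter gamma
    min_child_weight=5,         # regularization parameter n_min_child_weight
    early_stopping_rounds=20,
)
model.fit(x_train, y_train, eval_set=[(x_val, y_val)], verbose=False)
print(model.best_iteration)
```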

Blending

Our final predictor, AllesTM, is a combination of the previously described models derived during cross-validation. Similar to the idea of combining multiple weak classifiers, on which RFs are based, we combined the three different NNs with the RF and the GBM for each prediction target using a meta classifier, a process often referred to as blending. Because of the diversity of the classifiers, each of them is able to infer different associations between the input features and the target variable. To combine the methods, we first made predictions for the validation bins in each fold with every classifier created using the training data of the corresponding fold. These predictions were then used as input to train the meta classifier. The methods of choice for the meta classifiers are linear regression for continuous targets and logistic regression for binary and categorical targets, resembling a weighted average over the base models’ predictions. This approach was implemented using the scikit-learn library. Because the models are trained during cross-validation, there are five trained models for each algorithm. In order to make predictions for the independent test dataset, as well as for the final predictor AllesTM, the outputs of these models are averaged. For example, to predict the z-coordinate of a residue, five meta models exist, each combining the three different NNs with the RF and the GBM. The results of these five meta models are then averaged to produce the final prediction, which can therefore be seen as a combination of 30 different models, i.e. five different algorithms plus one meta model for each fold.
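
A sketch of this blending step for a continuous target, using scikit-learn's linear regression as the meta model and averaging the five per-fold meta models at prediction time; the toy data and fold layout are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y_true = rng.random(500)                                            # validation-bin targets
# Columns stand for the five base models (conv, dconv, LSTM, RF, GBM)
base_preds = y_true[:, None] + 0.1 * rng.standard_normal((500, 5))

meta_models = []
for fold in range(5):                                               # one meta model per fold
    idx = slice(fold * 100, (fold + 1) * 100)
    meta_models.append(LinearRegression().fit(base_preds[idx], y_true[idx]))

def blend(new_base_preds):
    """Average the five meta models' outputs to obtain the final blended prediction."""
    return np.mean([m.predict(new_base_preds) for m in meta_models], axis=0)

print(blend(base_preds[:3]))
```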

Comparison with other methods

Because of the variety of prediction targets covered by AllesTM, we had to compare its predictive power with several individual previously published methods. The only target for which no comparison was possible was the z-coordinates. Although a method called ZPRED [32, 33] was published to predict the location of a residue in the membrane, it is not publicly available and, according to the publication, its predictions are restricted to the absolute distance of a residue from the membrane center, i.e. they do not provide information about the side of the membrane on which the residue is located. ANGLOR [34] predicts dihedral angles using an NN for the φ angles and a support vector machine (SVM) for the ψ angles. SPINE X [35] is a multi-layer NN-based tool for predicting dihedral angles as well as secondary structure and solvent accessibility. We converted the absolute values of solvent accessibility produced by SPINE X to relative values by dividing them by the maximum solvent accessibility of a given amino acid, the same approach as in AllesTM. PredyFlexy [36] derives continuous B-factor values by assigning short structural fragments from a pre-compiled library to the sequence using an SVM-powered scoring scheme. The B-factors of these fragments are then transferred to the target sequence, resulting in the prediction. Another tool predicting B-factor values is PROFbval [13]. The method is trained to emit a 10-state flexibility, which is then converted to a binary and a continuous prediction. PolyPhobius [37] and MEMSAT-SVM [38] both predict transmembrane protein topology. While the former is based on hidden Markov models, the latter uses an SVM. MEMSAT-SVM is especially interesting for comparison because of its ability to predict RERs. We also included SCAMPI [39] as a more recent topology prediction method. SCAMPI is another HMM-based method, which uses the fact that the N- and the C-terminal helices are more hydrophobic. Two further well-established methods were used to evaluate the secondary structure prediction performance, PROFphd [40] and PSIPRED [41]. Both use evolutionary information derived from MSAs and two neural networks, one for the initial prediction and a second one for smoothing. Recently, PSIPRED was retrained using more layers in order to improve its prediction performance [42]. Finally, we included SPOT-1D [43], a very recent and versatile method capable of predicting secondary structure, solvent accessibility (which we converted to relative values similar to SPINE X) as well as φ and ψ angles. It uses LSTMs and CNNs and takes the output of several other methods, e.g. contact predictions, as input to enhance prediction performance.

Evaluation metrics

In order to benchmark the classifier and to compare its performance to other methods, we calculated several performance measures. For the continuous target variables (z-coordinates, flexibility and torsion angles), we calculated mean absolute error (MAE), mean squared error (MSE), root-mean-square error (RMSE), Pearson correlation coefficient (rp) and Spearman’s rank correlation (rs) using the following equations:

$$MAE=\frac{\sum_{i=1}^n\mid {y}_i-{x}_i\mid }{n}$$
$$MSE=\frac{\sum_{i=1}^n{\left({y}_i-{x}_i\right)}^2}{n}$$
$$RMSE=\sqrt{MSE}$$
$$r_p=\frac{\sum_{i=1}^n\left({x}_i-{m}_x\right)\ast \left({y}_i-{m}_y\right)}{\sqrt{\sum_{i=1}^n{\left({x}_i-{m}_x\right)}^2\ast \sum_{i=1}^n{\left({y}_i-{m}_y\right)}^2}}$$

where yi and xi are the predicted and observed values, n is the number of predictions, and mx and my are the means of the observed and predicted values, respectively. Spearman’s rank correlation rs is obtained by applying the same correlation formula to the rank-transformed values.
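The following sketch (an illustrative helper, not part of AllesTM) computes these continuous-target metrics with NumPy and SciPy.

```python
# Continuous-target metrics (MAE, MSE, RMSE, Pearson r_p, Spearman r_s)
# matching the definitions above; illustrative helper, not AllesTM code.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def continuous_metrics(y_pred: np.ndarray, y_obs: np.ndarray) -> dict:
    err = y_pred - y_obs
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "r_p": pearsonr(y_pred, y_obs)[0],
        "r_s": spearmanr(y_pred, y_obs)[0],
    }

# Example with toy values.
print(continuous_metrics(np.array([1.0, 2.0, 3.5]), np.array([1.2, 1.8, 3.0])))
```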

For the discrete target variables (topology and two-state flexibility) we calculated accuracy (ACC), precision (P), recall (R), F1-score (F1) and Matthews correlation coefficient (MCC) as follows:

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$
$$P=\frac{TP}{TP+ FP}$$
$$R=\frac{TP}{TP+ FN}$$
$${F}_1=\frac{2\ast P\ast R}{P+R}$$
$$MCC=\frac{TP\ast TN- FP\ast FN}{\sqrt{\left( TP+ FP\right)\ast \left( TP+ FN\right)\ast \left( TN+ FP\right)\ast \left( TN+ FN\right)}}$$

where TP, TN, FP and FN are the number of true positives, true negatives, false positives, and false negatives, respectively.
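As a companion to the formulas above, the short sketch below (an illustrative helper, not AllesTM code) computes the binary classification metrics directly from confusion-matrix counts.

```python
# Binary classification metrics from confusion-matrix counts, matching the
# definitions above; denominators of zero are guarded against.
import math

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "P": precision, "R": recall, "F1": f1, "MCC": mcc}

# Example with made-up counts.
print(binary_metrics(tp=80, tn=90, fp=10, fn=20))
```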

Availability of data and materials

- The AllesTM software developed in this study is freely available in the GitHub repository at https://github.com/phngs/allestm.

- The datasets used and/or analyzed during the current study are available from the corresponding author upon request.

References

1. Qian N, Sejnowski TJ. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol. 1988;202(4):865–84.
2. Kandathil SM, Greener JG, Jones DT. Recent developments in deep learning applied to protein structure prediction. Proteins. 2019;87(12):1179–89.
3. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform. 2017;18(5):851–69.
4. Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.
5. Gobel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Proteins. 1994;18(4):309–17.
6. Fuchs A, Kirschner A, Frishman D. Prediction of helix-helix contacts and interacting helices in polytopic membrane proteins using neural networks. Proteins. 2009;74(4):857–71.
7. Hönigschmid P, Frishman D. Accurate prediction of helix interactions and residue contacts in membrane proteins. J Struct Biol. 2016;194(1):112–23.
8. Wilkinson DL, Harrison RG. Predicting the solubility of recombinant proteins in Escherichia coli. Biotechnology (NY). 1991;9(5):443–8.
9. Martin-Galiano AJ, Smialowski P, Frishman D. Predicting experimental properties of integral membrane proteins by a naive Bayes approach. Proteins. 2008;70(4):1243–56.
10. Lomize MA, Lomize AL, Pogozheva ID, Mosberg HI. OPM: orientations of proteins in membranes database. Bioinformatics. 2006;22(5):623–5.
11. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
12. Viklund H, Granseth E, Elofsson A. Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. J Mol Biol. 2006;361(3):591–603.
13. Schlessinger A, Rost B. Protein flexibility and rigidity predicted from sequence. Proteins. 2005;61(1):115–26.
14. Carugo O. How large B-factors can be in protein crystal structures. BMC Bioinformatics. 2018;19(1):61.
15. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–637.
16. Touw WG, Baakman C, Black J, te Beek TAH, Krieger E, Joosten RP, et al. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 2015;43(Database issue):D364–8.
17. Rost B, Sander C. Conservation and prediction of solvent accessibility in protein families. Proteins. 1994;20(3):216–26.
18. Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45(D1):D170–6.
19. Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9(2):173–5.
20. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. arXiv [cs.CL]. 2013.
21. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
22. Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv [cs.CV]. 2015.
23. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. Proc Mach Learn Res. 2011;15:315–23.
24. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv [cs.LG]. 2014.
25. Chollet F. Keras. 2015 [cited 2018 Mar 19]. Available from: https://keras.io.
26. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv [cs.DC]. 2016.
27. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.
28. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. arXiv [cs.LG]. 2012.
29. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–78.
30. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2000;29:1189–232.
31. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. arXiv [cs.LG]. 2016.
32. Granseth E, Viklund H, Elofsson A. ZPRED: predicting the distance to the membrane center for residues in alpha-helical membrane proteins. Bioinformatics. 2006;22(14):e191–6.
33. Papaloukas C, Granseth E, Viklund H, Elofsson A. Estimating the length of transmembrane helices using Z-coordinate predictions. Protein Sci. 2008;17(2):271–8.
34. Wu S, Zhang Y. ANGLOR: a composite machine-learning algorithm for protein backbone torsion angle prediction. PLoS One. 2008;3(10):e3400.
35. Faraggi E, Zhang T, Yang Y, Kurgan L, Zhou Y. SPINE X: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J Comput Chem. 2012;33(3):259–67.
36. de Brevern AG, Bornot A, Craveur P, Etchebest C, Gelly J-C. PredyFlexy: flexibility and local structure prediction from sequence. Nucleic Acids Res. 2012;40(Web Server issue):W317–22.
37. Käll L, Krogh A, Sonnhammer ELL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. 2005;21(Suppl 1):i251–7.
38. Nugent T, Jones DT. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. 2009;10:159.
39. Peters C, Tsirigos KD, Shu N, Elofsson A. Improved topology prediction using the terminal hydrophobic helices rule. Bioinformatics. 2016;32(8):1158–62.
40. Rost B, Sander C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins. 1994;19(1):55–72.
41. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292(2):195–202.
42. Buchan DWA, Jones DT. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res. 2019;47(W1):W402–7.
43. Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics. 2019;35(14):2403–10.


Funding

This work was supported by the DFG grant “Understanding intramembrane proteolysis” (FR 1411/13-1) to DF.

Author information


Contributions

PH and DF designed the study. MW and PH analyzed the data. SB tested the method. DF obtained funding and supervised the project. PH and DF wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Dmitrij Frishman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

- S1 Fig. Distribution of the observed and predicted z-coordinates on the cross-validation dataset.
- S2 Fig. Distribution of the observed and predicted z-coordinates on the independent test dataset.
- S3 Fig. Precision-recall and ROC curves of the predicted topology on the cross-validation dataset.
- S4 Fig. Precision-recall and ROC curves of the predicted topology on the independent test dataset.
- S5 Fig. Distribution of the observed and predicted continuous flexibility on the cross-validation dataset.
- S6 Fig. Distribution of the observed and predicted continuous flexibility on the independent test dataset.
- S7 Fig. Precision-recall and ROC curve of the predicted two-state flexibility on the cross-validation dataset.
- S8 Fig. Precision-recall and ROC curve of the predicted two-state flexibility on the independent test dataset.
- S9 Fig. Distribution of the observed and predicted φ angles on the cross-validation dataset.
- S10 Fig. Distribution of the observed and predicted φ angles on the independent test dataset.
- S11 Fig. Distribution of the observed and predicted ψ angles on the cross-validation dataset.
- S12 Fig. Distribution of the observed and predicted ψ angles on the independent test dataset.
- S13 Fig. Precision-recall and ROC curves of the predicted secondary structure on the cross-validation dataset.
- S14 Fig. Precision-recall and ROC curves of the predicted secondary structure on the independent test dataset.
- S15 Fig. Distribution of the observed and predicted relative solvent accessibility of monomers on the cross-validation dataset.
- S16 Fig. Distribution of the observed and predicted relative solvent accessibility of monomers on the independent test dataset.
- S17 Fig. Distribution of the observed and predicted relative solvent accessibility of protein chains in complexes on the cross-validation dataset.
- S18 Fig. Distribution of the observed and predicted relative solvent accessibility of protein chains in complexes on the independent test dataset.
- S19 Fig. Distribution of the observed and predicted difference in relative solvent accessibility between the bound and the unbound form of a protein chain on the cross-validation dataset.
- S20 Fig. Distribution of the observed and predicted difference in relative solvent accessibility between the bound and the unbound form of a protein chain on the independent test dataset.
- S1 Table. Performance of several algorithms, including the final method AllesTM, for z-coordinate prediction on the cross-validation and independent test datasets.
- S2 Table. Protein topology prediction performance.
- S3 Table. Continuous flexibility prediction performance.
- S4 Table. Two-state flexibility prediction performance.
- S5 Table. φ angles performance.
- S6 Table. ψ angles performance.
- S7 Table. Secondary structure prediction performance.
- S8 Table. Secondary structure prediction performance excluding residues situated in transmembrane segments.
- S9 Table. Monomer solvent accessibility performance.
- S10 Table. Complex solvent accessibility performance.
- S11 Table. Change of solvent accessibility performance.
- S12 Table. Number of sequences in the training, validation, and test datasets across the five folds. Because of the two-step redundancy reduction procedure, the number of proteins in the validation and training parts is dependent on the test data in the particular fold.
- S13 Table. Parameter values used for the different layer types during random parameter search.
- S14 Table. Parameter values used during grid search of random forest models.
- S15 Table. Parameter values used during random search of GBM models.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Hönigschmid, P., Breimann, S., Weigl, M. et al. AllesTM: predicting multiple structural features of transmembrane proteins. BMC Bioinformatics 21, 242 (2020). https://doi.org/10.1186/s12859-020-03581-8

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-020-03581-8

Keywords