Lipid exposure prediction enhances the inference of rotational angles of transmembrane helices
© Lai et al.; licensee BioMed Central Ltd. 2013
Received: 24 May 2013
Accepted: 1 October 2013
Published: 11 October 2013
Skip to main content
© Lai et al.; licensee BioMed Central Ltd. 2013
Received: 24 May 2013
Accepted: 1 October 2013
Published: 11 October 2013
Since membrane protein structures are challenging to crystallize, computational approaches are essential for elucidating the sequence-to-structure relationships. Structural modeling of membrane proteins requires a multidimensional approach, and one critical geometric parameter is the rotational angle of transmembrane helices. Rotational angles of transmembrane helices are characterized by their folded structures and could be inferred by the hydrophobic moment; however, the folding mechanism of membrane proteins is not yet fully understood. The rotational angle of a transmembrane helix is related to the exposed surface of a transmembrane helix, since lipid exposure gives the degree of accessibility of each residue in lipid environment. To the best of our knowledge, there have been few advances in investigating whether an environment descriptor of lipid exposure could infer a geometric parameter of rotational angle.
Here, we present an analysis of the relationship between rotational angles and lipid exposure and a support-vector-machine method, called TMexpo, for predicting both structural features from sequences. First, we observed from the development set of 89 protein chains that the lipid exposure, i.e., the relative accessible surface area (rASA) of residues in the lipid environment, generated from high-resolution protein structures could infer the rotational angles with a mean absolute angular error (MAAE) of 46.32˚. More importantly, the predicted rASA from TMexpo achieved an MAAE of 51.05˚, which is better than 71.47˚ obtained by the best of the compared hydrophobicity scales. Lastly, TMexpo outperformed the compared methods in rASA prediction on the independent test set of 21 protein chains and achieved an overall Matthew’s correlation coefficient, accuracy, sensitivity, specificity, and precision of 0.51, 75.26%, 81.30%, 69.15%, and 72.73%, respectively. TMexpo is publicly available at http://bio-cluster.iis.sinica.edu.tw/TMexpo.
TMexpo can better predict rASA and rotational angles than the compared methods. When rotational angles can be accurately predicted, free modeling of transmembrane protein structures in turn may benefit from a reduced complexity in ensembles with a significantly less number of packing arrangements. Furthermore, sequence-based prediction of both rotational angle and lipid exposure can provide essential information when high-resolution structures are unavailable and contribute to experimental design to elucidate transmembrane protein functions.
Integral membrane proteins participate in diverse cellular functions such as signal transductions, bioenergetics, ion transport, cell adhesion, and cell-cell recognition. It has also been estimated that about 20-30% of a typical genome encode for proteins with a transmembrane (TM) domain [1, 2]. Despite their biological importance and abundance, the mechanism by which TM proteins fold into native structures remains poorly understood due to a limited number of solved structures, accounting for less than 1% of all deposited structures in the Protein Data Bank (PDB) . Therefore, computational methods play an important role in deciphering the sequence-to-structure relationships and advancing our knowledge in this particular class of proteins.
Though recently solved structures of several amino-acid transporters, e.g., eukaryotic CLC Transporter (coded as 3ORG  in PDB) and potassium ion transporter (coded as 3PJZ  in PDB) revealed the existence of short helices in the reentrant region , the canonical topologies of TM proteins can be viewed as pairs of interacting transmembrane helices (TMHs), connecting loops and extramembraneous domains. In particular, the interaction between TMHs is an important determinant of folding and stability by the proposed two-stage model [7, 8]. Such an interaction is mediated by structural contacts at the helical interfaces with the protein itself, the ligands, as well as the lipid environment. From the perspective of structural modeling, the rotational angle of a TMH is a strong determinant of its interacting faces with the rest of the protein structure and the lipids. At the stage of conformation space sampling, we could filter out decoys that severely deviate from the predicted rotational angles. To elucidate rotational angles, Eisenberg et al. [9, 10] showed that hydrophobic scales can be used to estimate the hydrophobic moment direction to approximate the lipid-facing direction and proposed equations to calculate the rotational angles of TMHs based on such property. Later, several hydrophobicity scales or propensities have been proposed [11–15] to predict exposed residues or faces. To the best of our knowledge, there have been few advances to use lipid exposure, specifically the relative accessible surface area (rASA) in the lipid environment, to predict rotational angles. Henceforth, we use rASA for convenience to represent rASA in the lipid environment since in this paper we focus on the residues in such environment.
To determine the rotational angle of each TMH in the tertiary structure of a TM protein requires the information of lipid-facing direction, which has been defined differently in the literature. Pilpel et al.  described the lipid-facing direction as the vector opposite to the bisector of the acute angle formed by the two lines from the geometric center of the target TMH pointing to the geometric centers of the two nearest TMHs in the whole molecule. Stevens and Arkin  defined lipid-facing direction of a TMH as the vector connecting the geometric centers of the target helix and the whole molecule. The molecule could be a single chain or the complete protein. Dastmalchi et al.  accepted both the above definitions for lipid-facing direction.
Lipid exposure of each TMH has been shown useful in distinguishing between surfaces and interior interfaces, identifying potential functional residues buried in the protein core and exposed residues for protein-protein interactions [14, 17], and therefore facilitates the prediction of helix-packing conformations [17, 18]. In this work, we propose to use predicted rASA to estimate lipid-facing direction and then determine rotational angles of TMHs.
In order to train machine learning models by observed rASA from solved structures, we calculated the accessible surface area (ASA) by rolling a spherical probe along the van der Waal’s (VDW) surface of a protein molecule . The observed rASA for each residue is defined as dividing ASA by its reference value in an extended Gly-X-Gly tripeptide conformation. Several methods have been proposed to predict ASA instead of rASA in the lipid environment. For example, ASAP  uses evolutionary profiles to predict solvent accessibility by the support vector regression (SVR) and reports a Pearson correlation coefficient (PCC) of 0.62 for ASA prediction. The MPRAP  uses the support vector machine (SVM) to predict ASA of complete TM proteins by evolutionary profiles and residue distance from membrane center . A number of methods have been proposed to predict the burial or exposed status of each TM residue, where the status is determined by rASA whether below a predefined threshold. Several methods predict the status of TM residues directly from sequences without predicting their ASA; most of them rely on sequence conservation or knowledge-based propensities, including kPROT , ProperTM , TMLIP , MO  and TMX . RHYTHM  predicts the burial status of TM residues by matrix-based helix-helix contact prediction and sequence conservation, but this method requires prior knowledge of TMHs such as membrane coils or transporter/channels.
In this paper, we present TMexpo, a method to predict rotational angles of TMHs. For each TM residue, TMexpo first predicts rASA by SVR and predicts the burial or exposed status by SVM; both models use evolutionary profiles, sequence conservation, helix insertion energy and biochemical properties as features. Next, TMexpo determines rotational angles of TMHs based on the predicted rASA. In rotational angle prediction, TMexpo outperformed predictors using hydrophobicity scales and propensities by at least 19.2˚ in terms of mean absolute angular error (MAAE) on the independent test set of 21 protein chains. Notably, the prediction results showed that rotational angles of TMHs could be better inferred by predicted rASA than by existing scales. We expect the rotational angle prediction could benefit the structure prediction, especially for free modeling of transmembrane protein structures due to its difficulty and the necessities of reducing the number of packing arrangements.
The packing mechanism in which TM protein assembles in the lipids is not fully understood. One significant advance in this area is seen in the mechanism of the Sec translocon which demonstrates how TM proteins enter the membrane [25–27]. A commonly accepted model of TM protein folding is the two-stage model . Later, White and colleagues extended the two-stage model by including a four-step thermodynamic cycle of folding energy description . Several scales and propensities were developed to understand the lipid-facing direction of TMHs based on sequence analysis or knowledge-based information; however, using these scales to predict the rotational angle of TM proteins often results in low prediction accuracy. Therefore, we investigated whether an environment descriptor such as lipid exposure could better infer a geometric parameter such as the rotational angles of TMHs.
The MAAE of rotational angles determined by various approaches
MAAE in development set (554 TMHs)
MAAE in independent test set (188 TMHs)
NACCESS (observed rASA)
TMexpo (predicted rASA)
The success of determining the rotational angle via the observed rASA calculated by NACCESS may be due to the following two reasons. First, the description of observed rASA is derived from known protein structures, but the descriptions of hydrophobicity and lipid-facing propensities are derived from the sequence. Therefore, hydrophobicity and lipid-facing propensities alone are insufficient for accurate inference of helical packing, thereby rendering worse rotational angle estimation. Second, researchers had a simplified view of membrane proteins being “inside-out” proteins, which have interior polar core and exterior apolar surface [30, 31]. However, this paradigm was challenged since biased distribution of hydrophobic residues could not be detected in every membrane protein . As more solved structures become available, statistical analyses on these structures also support the above finding [33–36]. The canonical view of the “inside-out” property of membrane proteins based on hydrophobicity is challenged. On the contrary, observed rASA is a structural environment descriptor of helical packing and better infer the rotational angle.
To evaluate the capability of our proposed method for TM proteins with unknown structures, we used rASA predicted by TMexpo to determine the rotational angles of TMHs. Then we evaluated how predicted rASA could infer the rotational angle in terms of MAAE by the following two experiments. First, we used the development set of 89 chains to develop the TMexpo prediction model and tested on the independent test set of 21 chains. Second, we performed leave-one-out cross validation (LOOCV) on the development set. The results of both experiments are shown in Table 1. Both rASA prediction results of the development set and the independent test set are provided in the Additional file 1: Dataset S1. Particularly, detailed prediction results of 188 TMHs in the independent test set are reported in the Additional file 2: Table S1. Notably, 44 of 188 TMHs had the angular error less than 15˚ and their rotational angles were predicted precisely by TMexpo.
Observed from the first experiment, TMexpo-predicted rASA is shown to be comparable to the observed rASA derived from NACESS for inferring the rational angles; the predicted rASA achieved an MAAE of 48.31˚, slightly worse than the MAAE of 45.55˚ achieved by the observed rASA. Nevertheless, TMexpo-predicted rASA achieved much better MAAE than the other predictors using different hydrophobicity scales and propensities, including Eisenberg et al.’s consensus hydrophobicity scale (ES) , kPROT , ProperTM , TMLIP , and MO  by at least 19.2˚. The second experiment reported consistent results with the first experiment. Specifically, performing LOOCV on the development set resulted in the MAAE of 51.05˚, slightly worse than that inferred by the observed rASA, but better than the compared predictors by at least 20.42˚. Our results show that without known structures, TMexpo can effectively infer the rotational angle within a close margin to that inferred by the observed rASA and improve the prediction compared to hydrophobicity scales or lipid-facing propensities.
Comparison of different methods for classifying exposed/buried residues on the independent test set without interface TM residues
Next, we compared TMexpo’s performance of predicting rASA with existing methods, including TMX , RHYTHM  and MPRAP . One distinction among these methods is that RHYTHM requires prior knowledge of protein types as membrane-coil or channel for prediction. For comparison, we retrieved the prediction results for TMX, RHYTHM and MPRAP from their web servers by using their default parameters. Table 2 shows that TMexpo outperforms the compared methods across most of the measures except a slightly lower specificity compared to MPRAP by 1.16%. The specificity of TMexpo is lower than its sensitivity by 12.15%, and most of the predictors except MPRAP have the same trend as TMexpo’s results. This observation implies that the detection of buried residues may be more difficult than that of exposed residues in the TM domains. To further gain insights into this issue, we extracted buried residues in our dataset of 110 chains from the helix-packing database TMPad , and found that over 77% residues have at least one interhelical contact. This suggests that prediction of buried residues could very likely be improved by detecting contacts of interhelical interactions; however, TMexpo and most of the compared predictors retrieve features by local information of subsequences only. Interestingly, interhelical contacts may be conserved in sequences and discovered from evolutionary information such as PSSM profiles . We consider that evolutionary information is an effective feature for capturing interhelical interactions that contributes to rASA/burial status prediction; however, interhelical interaction prediction is still a challenging problem.
Comparison of different methods for classifying exposed/buried residues on 392 interface TM residues of the independent test set by rASA derived from both subunit structure and complete structure (in parentheses)
Comparison of different methods for classifying exposed/buried residues on 3,553 entire TM residues of the independent test set by rASA derived from both subunit structure and complete structure (in parentheses)
Harrington and Ben-Tal  characterized five structural features of interhelical interactions, namely, aromatic interactions, hydrogen bonds, salt bridges, and two interactions from packing motifs, that are useful for helical packing. They proposed an algorithm to pack the TMHs of TM proteins, as follows: First, the algorithm ordered the TMHs by the sequence from the N-terminus, and then iteratively grouped sequential TMHs by a scoring function based on the five types of interactions. They demonstrated helical packing on 15 diverse proteins, and the average RMSD of Cα in the native structure of the 15 reconstructed TM proteins ranged from 0.51 Å to 1.35 Å. In this subsection, we reexamined these proteins to study the rotational angle of TMHs and its relationship to helical packing. Since the protein 1AFO discussed in their work contains only one TMH, we excluded this protein in our analysis.
Rotational angle prediction on the protein chains from Harrington and Ben-Tal’s work
TM helix sequence
TM helix sequence
On the other hand, since rotational angle prediction is strongly correlated with the periodicity of a helix, predicted rotational angles may not work well for packing of helices that deviate from regular periodicities of rASA, such as those severely kinked, disrupted, highly tilted, or associated with a reentrant loop. Six out of all 73 TMHs were poorly predicted with angular errors over 100.29˚. We observed these six TMHs being classified as kinked or containing partial non-helical structure in the TM domain, and therefore the moment-based prediction performed poorly.
To illustrate the capability of the proposed method for rotational angle prediction based on predicted topology information, we submitted sequences of independent test set to three web servers including SVMSignal , TOPCONS  and MemBrain . As a pre-processing step, we removed predicted N-terminal signal peptide sequences by SVMSignal. For all predicted TMHs, a correctly predicted TMH is defined as a one-to-one overlap with the observed TMH of PDBTM, and the minimum residue number of overlaps between predicted TMHs and observed TMHs of PDBTM is eight. Within 188 TMHs of independent test set, the recall and precision of TOPCONS is 93.09% (175/188) and 100% (188/188), respectively. The recall and precision of MemBrain is 97.87% (184/188) and 95.34% (184/193), respectively. For fair comparison in rotational angles, we only discuss 175 TMHs that have been predicted well by both of the two topology predictors. Prediction results from SVMSignal, TOPCONS and MemBrain are available in the Additional file 3: Dataset S2.
To obtain observed rotational angles corresponding to each predicted TMH, we removed atoms which were predicted outside the membrane by the topology prediction for each protein chain, and then we followed the definition in this work to calculate the rotational angles. Specifically, we did not directly assign rotational angles calculated by TMH of PDBTM to predicted TMHs, but we recalculated rotational angles based on atoms of protein structure selected by predicted topology within the TM region. There are two reasons to do that. First, since the sequence of a predicted TMH is not identical to that defined by PDBTM and the definition of rotational angle depends on the helical principal axis and the Cα vector of the first residue to its lipid-facing direction vector, we cannot simply assume their structural property is similar. Second, for a predicted 3D protein structure, the TM region information comes from the topology predictor, and the rotational angle of each TMH is established on the atoms within predicted TM region, not from the PDBTM. Therefore, we have to recalculate the rotational angles of predicted TMH for comparison. Finally, we ignored any predicted TMH which includes residues that do not have structural data within PDB entity, and 155 TMHs were left for comparison.
The Additional file 4: Table S2 and the Additional file 5: Table S3 provide sequences of predicted TMHs corresponding to TMHs annotated in PDBTM, observed angle defined by residues of predicted topology, predicted angle, and moment lengths. For all 155 TMHs, the MAAE of TOPCONS is 43.04˚ and MAAE of MemBrain is 56.59˚. These two tables demonstrate the ability of TMexpo to predict rotational angles based on predicted topology. Interestingly, the MAAE of 155 TMHs based on topology predicted by TOPCONS has better results than topology annotated in PDBTM. There are two possible explanations for this observation. First, TMHs predicted by TOPCONS is longer than annotated in PDBTM, and this may help calculating the helical principle axis. Second, we excluded TMHs that have partially incomplete structural data, and the performance of 155 TMHs would differ from the dataset of 188TMHs. We conclude that rotational angles calculated by both predicted TMHs and predicted rASA are still consistent with the observed rotational angles defined by the predicted TMH. Therefore, while TMH boundary is not perfectly predicted, the predicted rotational angle can still provide useful information to the interior side of a TM protein and constrain decoys of predicted 3D structure.
We selected from our independent test set an amino acid antiporter, called AdiC, of E. coli strain O157:H7 to demonstrate how predicted rASA and rotational angle in the helical wheel presentation of TMHs facilitate the analysis of TM proteins. The E. coli strain O157:H7 is a pathogen and causes hemorrhagic diarrhea, and AdiC is a multi-spanning TM protein that enables E. coli to resist the acidic environment via exchanging extracellular arginine and intracellular agmatine [46, 47]. An arginine-bounded structure of AdiC was solved and is coded as 3L1L  in PDB.
Comparing Pearson correlation coefficients between contact-enriched set and reference set defined by different thresholds ( c )
Thresholds for contact-enriched set
PCC (number of residues) on contact-enriched set
PCC (number of residues) on reference set
c ≥ 1
c ≥ 2
c ≥ 3
Sequence-based prediction of both rotational angle and rASA can provide indispensable information for structure prediction when high-resolution structures are unavailable and contribute to experimental design to elucidate TM protein functions. In this paper, we present a novel concept of using lipid exposure to infer rotational angles and have developed a machine learning approach to predict rotational angles of TMHs. Significantly, using predicted rASA from our sequence-based model achieved an MAAE of 48.31˚ on the independent test set, which is better than that obtained by the best of the compared knowledge-based propensities (67.51˚). Furthermore, we demonstrate an application for structural analysis via an amino acid antiporter. We believe improving prediction of rotational angle can benefit the structure prediction because free modeling of TM protein structures is a tough task and reducing the number of packing arrangements is necessary.
The metric used for evaluating rotational angle prediction in this work is mean absolute angular error (MAAE). To evaluate the classification model, i.e., classifying burial and exposed status, we used the following performance measures, including Matthew’s correlation coefficient (MCC), accuracy, sensitivity, specificity, and precision. With respect to the regression model, i.e., rASA prediction, we used mean absolute error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (PCC).
See the Additional file 6: Table S4 for definitions of Pearson correlation coefficient, accuracy, sensitivity, specificity and precision.
The list of all protein chains (PDB:Chain) included in the development set and the independent test set
Independent test set
To calculate lipid exposure or exposed area of a structure, we used NACCESS program [19, 29] with the probe radius set to 2.0 Å. The size of probe radius 2.0 Å was selected to mimic the -CH2 of hydrocarbon chains, and it is identical to that used in Yuan et al. , Illergård et al.  and Lo et al. . The ASA for a residue was the sum of ASA from all atoms belonging to that residue. To extract the helical boundaries from the protein chains, we used the annotations of PDBTM. From the 89 protein chains in the development set, we obtained a total of 10,441 residues in TM domain. For the independent test set of newly solved proteins, we obtained 3,581 residues in TM domain. To annotate missing residues and missing atoms, we used PDB Validation Suite . In order to obtain rASA as a normalized measure for a TM residue, we divided the ASA values by their reference values in a Gly-X-Gly tripeptide in an extended conformation. The reference values were derived from Samantha et al. . To classify burial status of each residue for model training and testing, we followed the rASA threshold defined in Miller et al.’s work , i.e., rASA <5% to characterize buried residues and otherwise exposed, though different thresholds have been used in the literature.
We proposed residue-wise predictors based on support vector machines (SVMs), i.e., an SVM classifier to predict the burial/exposed status and a support vector regression (SVR) model to predict rASA values of each residue in TM domain. Specifically, C-SVC and epsilon-SVR implemented in LIBSVM  were used to develop the models, and both of them used the RBF kernel function. The parameters of the models were optimized by chain-wise LOOCV procedure on the development set. In LOOCV procedure, the best set of parameters to train the burial/exposed status classification model is of cost c = 21 and gamma g = 2-4; and the best set of parameters to train the real-number rASA regression model is of cost c = 2-1, gamma g = 2-5, loss function p = 10-3 and tolerance of termination criterion e = 10-2. Details of LOOCV performances can be obtained in the Additional file 8: Text S1.
Given a TM domain of a protein chain, each residue to be predicted was located at the center of a sliding window of length 17 and features were generated according to the 17-mer sequence. To train the classification model, exposed residues with label “E” were considered as positive data, and buried residues with label “B” as negative data. To train the regression model, the input was taken from the real-number rASA. We searched parameters by LOOCV procedure for the classification model and the regression model based on optimizing the MCC and the PCC, respectively. We did not directly predict ASA values because they are not normalized in a zero to one interval and this could produce bias in the presence of an outlier.
In training and testing, we excluded residues that participate in interchain contacts and the rationale is as follows: A sequence-based rASA predictor, which accepts the sequence of a structural subunit as input, can only describe structural properties of one subunit, not of the complete structure. Thus predicted rASA of those residues may be drastically different depending on their locations in the interacting interfaces. In the case of residues residing on the interchain surface, we observed rASA of these residues in a single chain may be significantly different from those seen in the complete structure with multiple subunits. Out of 110 representative protein chains used in our work, 86 protein chains are multimeric. Among all of their 9,800 TM residues without any missing atom, 2,167 (22.11%) residues have two different rASA values calculated from the single subunit and the complete structure, respectively; and the former-derived rASA is always larger than the latter-derived rASA. Notably, the maximum and average differences of the two rASA values of these 2,167 residues are 82.28% and 23.43%, respectively. Furthermore, 831 out of the 2,167 residues would be assigned inconsistent burial/exposed status according to their two different rASA values. In other words, 8.48% of the overall 9,800 transmembrane residues were considered as exposed from the perspective of a single subunit but turned out to be buried in their complete structures. For example, in 2OAR:A, 41 residues of 52 TM residues have different rASA values, and 13 residues are calculated as being exposed in the single chain but as being buried in the complete structure. It is noteworthy that the 60S (i.e., 37S by PDB indexing) and the 45 V (i.e., 22 V by PDB indexing) have drastic differences in their rASA values, i.e., 60.77% vs. 2.42% and 60.02% vs. 1.68%, as calculated by single chain and by complete structure, respectively. Since we did not know the native state of amino acids lying on the interchain surface, we excluded these residues from our training and testing data. For each protein chain, we calculated rASA for both single subunit structure and complete structure. Later, we excluded residues which were not identical in rASA by comparing the above two calculations, and also excluded residues that were missing partial or entire atoms.
In the testing stage, we performed a simple post-processing by rounding off their upper and lower bound to 1 and 0 because rASA values are contained in this interval. To derive the ASA values for each residue, we multiplied the predicted rASA values by the reference values .
In the design of TMexpo, we did not use a specific feature selection technique, and all the features used in TMexpo belong to one of the three feature groups. The first group is about interhelical contacts, specifically volume, polarity, charge and residue interhelical contact propensity. Since we have observed buried residues tend to have more interhelical contacts, and therefore we examined features related to interhelical contacts. For example, the well-known GxxxG motif can be regarded as small-xxx-small motifs , and we use volume profiles to incorporate such feature in the machine learning model. The polarity and charge can also be seen as features related to hydrogen bonding  and cation-pi interaction , respectively. To encode features into TMexpo, the volume  of each residue was divided by their maximum value 237.2 of tyrosine. The polarity was also encoded by the sigmoidal functions 1–1/(1 + e -po ), where po denotes the mean residue polarity calculated by Radzicka and Wolfenden’s method . We defined positively charged residues as 1, neutral residues as 0.5, and negatively charged residues as 0 based on the index used by Klein et al. . The residue interhelical contact propensity were developed by Lo et al. , and we used in TMexpo the normalized propensity by division of the maximum value 1.43 of cysteine.
The second group provides evolutionary information as position-specific scoring matrix (PSSM) profiles and conservation score to machine learning model. Evolutionary information is an important feature and has been incorporated in interhelical interaction predictors [38, 63]. To encode PSSM as features, the matrix was generated by performing PSI-BLAST against NCBI’s non-redundant database. This feature of a 17-mer peptide was encoded by a vector of size 17 × 20, where each entry was normalized by 1- 1/(1 + e -PSSM ). The conservation score was calculated by an algorithm developed by Capra and Singh  on the multiple sequence alignment generated by MAFFT [65, 66] based on the 17-mer peptide. We used the raw scores without using the local Z-score transformation described in their method.
The third group includes the TMH insertion energy, amphiphilicity of residues and turn propensities, which relate to structural information. The first two features can reveal residue position toward hydrophobic membrane or water interface, which is akin to Zpred features used in MPRAP that directly predict relative position from the center of membrane for each residue. The position-specific free energy of TMH insertion, termed as “free energy” to describe the hydrophobic core, was encoded by a sigmoidal function as 1-1/(1 + e -energy ), where energy denotes the free energy of TMH insertion estimated by Hessa et al.’s method . The amphiphilicity was encoded by the sigmoidal functions 1–1/(1 + e am ), where am denotes the amphiphilicity derived by Mitaku et al.’s method . We also considered the helix turn propensities in order to capture sequence information related to tight turns in naturally occurring TM helices from Monné et al. . This feature was normalized propensities to [0, 1] by dividing the maximum value 2.7 of proline.
All of the above features were normalized to a closed [0, 1] interval. A feature value close to 1 means the corresponding residue is more hydrophobic, more amphiphilic, higher polarity, positively charged, larger volume, more conserved, tends to have turns and interhelical contacts. We filled 0.5 as features for nonexistent residues in windows, except charge, interhelical contacts, and volume, we filled zeroes.
The rotational angle of a TMH was calculated as follows: First, we removed atoms which were annotated outside the membrane. Second, we computed the helical principal axis of the TMH of interest and aligned it with the z-axis with the N-terminal facing the screen, creating a top-view of the protein with respect to the target helix. Third, we identified the geometric centers of the molecule and each of the individual helices in the two-dimensional plane from the average x and y coordinates of Cα in the constituent TMH residues. We defined the lipid-facing direction of a TMH as in the opposite direction circumscribed by the geometric center of the target TMH connects to the molecular geometric center of the protein chain unit. The rotational angle of the target helix was measured as the angle rotated from the Cα vector of the first residue to its lipid-facing direction vector by clockwise motion viewed from the helix N-terminal to C-terminal. The angle ranges from 0˚ to 360˚.
For each residue, the rASA values can be seen as degree of directional lipid-facing. Therefore, for one TMH, the summation of all TMH residues’ lipid-facing tendency can characterize its rotational angle. In Equation 6, the x and the y terms are the vector summation of over n residues in a TMH. The moment length |M| is defined as Equation 7. The angle γ was solved first by inverse cosine function as Equation 8, and we determined moment direction θ by taking (360 - γ) as a result as Equation 9 if the sign of y term is negative.
Accessible surface area
Leave-one-out cross validation
Matthew’s correlation coefficient
Mean absolute angular error
Mean absolute error
Pearson correlation coefficient
Protein data bank
Relative accessible surface area
Root mean squared error
Support vector machine
Support vector regression
Van der Waal’s.
This work was supported in part by the National Science Council under grants NSC100-2319-B-010-002 and NSC101-2221-E-001-022. No additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.