AttSec: protein secondary structure prediction by capturing local patterns from attention map

Kim, Youjin; Kwon, Junseok

doi:10.1186/s12859-023-05310-3

Research
Open access
Published: 04 May 2023

BMC Bioinformatics volume 24, Article number: 183 (2023) Cite this article

Abstract

Background

Protein secondary structures, which link simple 1D sequences to complex 3D structures, are not only good features for describing the local properties of proteins but also key features for predicting the complex 3D structures of proteins. Thus, it is very important to accurately predict the secondary structure of a protein, which is a local structural property assigned according to the pattern of hydrogen bonds formed between amino acids. In this study, we accurately predict protein secondary structure by capturing the local patterns of proteins. For this objective, we present a novel prediction model, AttSec, based on the transformer architecture. In particular, AttSec extracts self-attention maps corresponding to pairwise features between amino acid embeddings and passes them through 2D convolution blocks to capture local patterns. In addition, instead of using additional evolutionary information, it takes as input a protein embedding generated by a language model.

Results

For the ProteinNet DSSP8 dataset, our model showed 11.8% better performance across all evaluation datasets compared with other models that do not use evolutionary information. For the NetSurfP-2.0 DSSP8 dataset, it showed 1.2% better performance on average. There was an average performance improvement of 9.0% for the ProteinNet DSSP3 dataset and an average of 0.7% for the NetSurfP-2.0 DSSP3 dataset.

Conclusion

We accurately predict protein secondary structure by capturing the local patterns of proteins. For this objective, we present a novel prediction model, AttSec, based on the transformer architecture. Although there was no dramatic accuracy improvement compared with other models, the improvement on DSSP8 was greater than that on DSSP3. This result implies that the proposed pairwise feature could have a remarkable effect on challenging tasks that require finely subdivided classification. The GitHub package is available at https://github.com/youjin-DDAI/AttSec.

Background

Proteins are chains of amino acids, in which approximately 20 kinds of amino acids can form a practically unlimited number of proteins by changing their arrangement. This sequence of amino acids is called the primary structure of the protein (1D sequence). In the human body, proteins are spatially coiled, bent, and folded due to the interactions of amino acids, which induces a specific three-dimensional structure (3D structure). This is called the tertiary structure of the protein. Many recent studies aim to predict this tertiary structure because several unique properties of a protein can be derived from it [1,2,3]. However, it is very difficult to predict the 3D structure directly from the 1D sequence. To alleviate this difficulty, the secondary structure of the protein, which links the 1D sequence to the 3D structure, is predicted instead. Secondary structures can serve as intermediate features for the complex 3D structures and can be used to represent the local properties of proteins. The secondary structures are typically assigned by the DSSP (Define Secondary Structure of Proteins) algorithm [4, 5]. Given the 3D coordinate file of a protein, the DSSP algorithm checks whether there is a hydrogen bond for each amino acid pair by identifying the distances between the atoms. Then, based on the local patterns of these hydrogen bonds, one of eight types of secondary structure is assigned to each amino acid (DSSP8): 3-Helix (G), \(\alpha\)-Helix (H), 5-Helix (I), hydrogen-bonded turn (T), residue in isolated \(\beta\)-bridge (B), extended strand participating in a \(\beta\)-ladder (E), bend (S), and coil (C). These types can be further grouped into three larger classes (DSSP3): helix (H), strand (E), and loop (C). While there are several ways to reduce the 8 types to 3 types, we use the general reduction: G/H/I \(\rightarrow\) H, E/B \(\rightarrow\) E, S/T/C \(\rightarrow\) C.
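The general reduction stated above can be written as a simple lookup table. The following is a minimal sketch; the names `DSSP8_TO_DSSP3` and `reduce_dssp8` are illustrative and are not taken from the AttSec package.

```python
# General DSSP8 -> DSSP3 reduction: G/H/I -> H, E/B -> E, S/T/C -> C.
DSSP8_TO_DSSP3 = {
    "G": "H", "H": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands
    "S": "C", "T": "C", "C": "C",   # loops
}


def reduce_dssp8(labels: str) -> str:
    """Map a per-residue DSSP8 label string to its DSSP3 label string."""
    return "".join(DSSP8_TO_DSSP3[state] for state in labels)


print(reduce_dssp8("GHIEBSTC"))  # HHHEECCC
```

A label outside the eight DSSP8 states raises a `KeyError`, which makes incomplete annotations visible instead of silently mapping them to a class.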

Due to the lack of data and the difficulty of prediction, conventional methods for secondary structure prediction rarely use a single sequence alone and rely heavily on additional evolutionary information. For example, Multiple Sequence Alignment (MSA) in [6] and Position-Specific Scoring Matrix (PSSM) in [7] have been generated from other databases and used together with sequence data to predict protein structure. However, constructing an MSA or PSSM for each template sequence requires considerable effort, and good performance is difficult to expect for proteins with few or no homologous sequences. To overcome this, a language model, an approach with proven performance in the field of natural language processing, was employed in [8, 9]. If a language model is pretrained with large unlabeled data and finetuned for a downstream task, it can achieve outstanding performance even if only a small amount of downstream task data is available. In this context, the embedding of a language model was used in [8, 9] to replace the evolutionary information: a language model pretrained on a pretext task with large protein sequence data was shown to produce embeddings that perform well in protein-related downstream tasks such as protein structure prediction, subcellular localization prediction, and membrane prediction. Inspired by these methods, our model also takes the protein embedding of a pretrained language model as an input instead of using additional evolutionary information. Recently, models such as SPOT-1D-LM [10] and NetSurfP-3.0 [11] have predicted protein secondary structure by using language model embeddings instead of MSA. SPOT-1D-LM employs ensemble learning by training three models with the embeddings of two different language models, ProtT5-XL-U50 and ESM-1b. Its models include one LSTM-based model and two 1D CNN-based models. Similarly, NetSurfP-3.0 uses the ESM-1b embedding and combines LSTM and 1D CNN to construct its model. In addition to using language model embeddings, both models share network structures that extract features sequentially. In contrast, our proposed model, AttSec, takes a different approach that more directly reflects the way secondary structures are assigned to each amino acid constituting a protein.

The secondary structure is determined by the patterns of hydrogen bonds. Each hydrogen bond corresponds to a pairwise feature between amino acids, so the patterns of hydrogen bonds correspond to the local patterns of these pairwise features. To implement this hierarchical view via model design, AttSec extracts the self-attention map corresponding to the pairwise features between amino acid embeddings and passes it through 2D convolutional blocks to detect the local patterns. Thus, AttSec mainly consists of two parts. The first part has multiple transformer encoder layers that estimate the self-attention maps. Different secondary structures are assigned depending on how far apart the amino acids that form hydrogen bonds are. Thus, to account for the importance of this relative distance, AttSec constructs its transformer encoder layers using relative position encoding (RPE) instead of conventional absolute position encoding (APE). In the second part, the 2D segment detector detects different patterns of hydrogen bonds from the stack of pairwise features. By using convolutional kernels with different options per block, we aim to make the detection robust.

The contributions of our method are as follows.

We use the protein embedding of a language model to replace additional evolutionary information, so that there is no significant drop in performance even for sequences with few or no homologous sequences.
We model the way protein secondary structures are assigned by transforming sequential features into pairwise features and detecting local patterns with transformer-based deep learning, in contrast to existing models that simply extract features in a sequential manner.

Methods

Dataset

We trained our model using two datasets for efficient comparison with baseline models. One is ProteinNet [12] and the other is the NetSurfP-2.0 dataset [13]. The first dataset, ProteinNet, is a benchmark dataset for protein structures and is built from PDB structures that were released as of 2016. ProteinNet provides data with different sequence identity cutoffs applied. Among these, we used the dataset with a cutoff of 95%, as used in [14]. The number of sequences in this training dataset is 39,120. However, because the secondary structure data provided by ProteinNet was incomplete and not all sequences could be assigned secondary structures with the DSSP program, we were able to use 38,000 sequences for training. As the validation set, 100 proteins were used, which were the same as those provided by [14]. The model trained in this way was evaluated using the SPOT-2016, SPOT-2016-HQ, SPOT-2018, SPOT-2018-HQ, and TEST-2018 datasets. These test datasets were also used in [14]. The training dataset, ProteinNet, includes protein structures released up to 2016. The authors of [14] constructed the SPOT-2016 dataset using proteins released between 2016 and 2020. Among them, all proteins with an e-value below 0.1 in the hidden Markov model comparison with pre-2016 proteins were removed. In addition, from the SPOT-2016 dataset, they gathered only the proteins released after 2018 to form SPOT-2018, and the datasets with the HQ suffix are subsets with a resolution constraint applied. Moreover, the TEST-2018 dataset consists of high-resolution proteins released only in 2018, filtered at a 25% identity threshold with pre-2018 proteins. Because the ProteinNet dataset provided only the 8-state labels (DSSP8) assigned by the DSSP program, an additional reduction process was required to obtain the 3-state (DSSP3) data. Thus, we created the DSSP3 data by converting DSSP8 to DSSP3 according to the general reduction method (G/H/I \(\rightarrow\) H, E/B \(\rightarrow\) E, S/T/C \(\rightarrow\) C).

The second dataset, NetSurfP-2.0, provided by [13], can be downloaded in CSV format. NetSurfP-2.0 provides 10,792 data samples with both 3-state and 8-state DSSP labels. For validation of the model trained with this data, we used 646 protein data samples, including CASP12, CB513, and TS115, as in [8]. The model trained with NetSurfP-2.0 was evaluated on NEW364, CASP12, CB513, and TS115. The CASP12, CB513, and TS115 datasets are independent datasets used in [13]. Any protein with a sequence similarity of over 25% to any protein in these three datasets was excluded from the training set, but redundancy among the test datasets was not handled. The NEW364 dataset was created in [8] to address the limitations of these three test sets. It was constructed by selecting proteins from the PDB that were published after 2019, with a resolution of 2.5 Å or better and a minimum of 20 amino acids. MMSeqs and PISCES were used to remove any proteins with more than 20% similarity to either the training data or the dataset itself.

Pretrained language model

Protein structure prediction tasks are challenging because the available datasets are small and there are few proteins whose structures are known. Thus, the careless use of complex models can cause overfitting. Fortunately, there exist extensive databases of proteins whose 3D structures are not known but whose primary sequences are known. Thus, many conventional methods utilize evolutionary information by finding sequences that are similar to the template sequences in a protein sequence database and feeding them together to the model as an input. However, because these methods cannot guarantee performance for proteins having few or no homologous proteins, recent methods attempt to extract evolutionary information from the protein sequence database in a different way. In the methods of [8, 9], language models are pretrained on large sequence data through a pretext task to generate evolutionarily meaningful protein embeddings. As with these methods, we use a pretrained language model called ProtT5-XL-U50 [8] to obtain protein embeddings. ProtT5-XL-U50, which is based on T5 [15], is trained using the BFD dataset [16, 17] and the UniRef50 dataset [18] by performing the denoising task proposed in BERT [19] as a pretext task. Given a protein primary sequence as an input, this model provides a 1024-dimensional embedding per token (per amino acid). We import the pretrained language model and use the embedding obtained at inference as an input to our model. Please note that there is no additional finetuning of the language model.
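Before a sequence is passed to ProtT5-XL-U50, it has to be turned into one token per residue. The sketch below follows the preprocessing convention described in the public ProtTrans documentation (rare residues U, Z, O, and B mapped to X, residues separated by spaces); this convention is an assumption here and is not stated in the text above, and `prepare_for_prott5` is an illustrative name. Loading the model itself is left out.

```python
import re


def prepare_for_prott5(sequence: str) -> str:
    """Return a space-separated residue string suitable for a per-residue tokenizer.

    Assumption (ProtTrans convention, not from the paper): the rare or
    ambiguous residues U, Z, O, and B are replaced by X.
    """
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


print(prepare_for_prott5("MKTUB"))  # M K T X X
```

With one token per amino acid, a protein of length p yields p embedding vectors, each of dimension 1024 as stated above.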

Proposed secondary structure prediction model

The secondary structure is a local substructure of a protein. To assign the secondary structure, the DSSP algorithm determines whether there is a hydrogen bond between amino acids and assigns one of eight secondary structures according to the pattern of the hydrogen bonds in the local region. To effectively capture these complex and hierarchical properties, we design the transformer-based deep neural network model in stages. As shown in Fig. 1a, AttSec obtains attention maps by passing the protein embedding through multiple transformer encoder layers. Each encoder layer yields one attention map per head, and the maps from all layers are stacked. Thus, for a protein with a total sequence length of p, the shape of the stacked attention map is \(p \times p \times (N \times H)\) if it passes through a transformer encoder with N layers and H heads. Then, this stack of attention maps, which corresponds to the pairwise features between amino acids, is passed through the 2D segment detector so that the convolutional blocks capture meaningful local patterns. To predict the secondary structure for each token (amino acid), we transform the 2D features obtained by the convolutional blocks into 1D features. Our model does this in a simple way by extracting only the diagonal elements of the 2D feature. Because several layers of 2D convolution blocks are stacked in the 2D segment detector, the receptive field ensures that the diagonal elements of the final feature contain information about the local pattern of pairwise interactions around the target token whose secondary structure is to be predicted. Finally, these diagonal elements pass through two fully connected layers to make the final prediction. The whole model consists of two parts: the transformer encoder shown in Fig. 1b and the 2D segment detector shown in Fig. 1c.
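The shape bookkeeping in this paragraph can be illustrated with nested lists. This is a toy sketch under stated assumptions, not the AttSec implementation: the attention values are placeholders, the function names are illustrative, and the diagonal is taken directly from the stacked maps, whereas in the model it is taken after the convolutional blocks.

```python
def stack_attention_maps(attn):
    """attn[layer][head] is a p x p matrix; return a p x p x (N*H) nested list."""
    maps = [head for layer in attn for head in layer]  # N*H matrices
    p = len(maps[0])
    return [[[m[i][j] for m in maps] for j in range(p)] for i in range(p)]


def diagonal_features(feature_map):
    """Keep feature_map[i][i] for each residue i: (p, p, C) -> (p, C)."""
    return [feature_map[i][i] for i in range(len(feature_map))]


# Toy input: N = 3 layers, H = 8 heads, p = 5 residues (identity matrices as placeholders).
N, H, p = 3, 8, 5
attn = [[[[float(i == j) for j in range(p)] for i in range(p)]
         for _ in range(H)] for _ in range(N)]

stacked = stack_attention_maps(attn)
diag = diagonal_features(stacked)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 5 5 24
print(len(diag), len(diag[0]))                            # 5 24
```

The diagonal step reduces a pairwise (p, p, C) feature to one C-dimensional vector per residue, which is the form a per-token classifier needs.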

Transformer encoder layer

For the position encoding of the transformer encoder, we use a variant of relative position encoding (RPE). Vanilla transformers [20] employ absolute position encoding (APE), which applies a sinusoidal function to the absolute position of each token. In contrast, RPE is based on the relative position of each token pair when self-attention is calculated, without considering the absolute position of the token. For our task, RPE is more suitable than APE because we regard the self-attention calculated from pairs of amino acid (token) embeddings as a feature related to the hydrogen bonds formed between amino acids. By detecting local patterns in this feature, the secondary structure can be predicted. Thus, if a learnable position embedding for the relative distance of each amino acid pair is added to the self-attention, it helps to distinguish different patterns of hydrogen bonds. This is because different secondary structures are assigned depending on the distance between the amino acids that form hydrogen bonds. For example, if a pattern in which the i-th amino acid forms a hydrogen bond with the \((i+3)\)-th amino acid appears in a local area, a 3-turn helix (G) is assigned, but if a pattern in which the i-th amino acid forms a hydrogen bond with the \((i+4)\)-th amino acid appears, a 4-turn helix (H) is assigned. Thus, we utilize RPE as the position encoding to account for the importance of the relative distance between amino acids that form hydrogen bonds. The basic RPE proposed by [15] calculates the relative position and then assigns buckets according to distance. In this study, we modify the vanilla RPE with two changes in the way the buckets are allocated. The first change is to make the relative position buckets symmetric by assigning the same bucket whenever the relative distance is the same, regardless of direction. The second change is that the range covered by the buckets increases linearly, rather than logarithmically, up to a specific distance, so that the encoding is more sensitive to the relative position of amino acids.
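The two bucket changes described above can be sketched as follows. This is one plausible reading, not the authors' code: the bucket index is taken to be the absolute offset, clipped at a cutoff. The cutoff value `max_distance = 32` and the function names are hypothetical, since the text does not give them.

```python
def symmetric_linear_bucket(i: int, j: int, max_distance: int = 32) -> int:
    """Bucket index for the token pair (i, j).

    Symmetric: depends only on |i - j|, so (i, j) and (j, i) share a bucket.
    Linear: one bucket per offset up to max_distance (a hypothetical cutoff),
    after which all larger offsets share the last bucket.
    """
    return min(abs(i - j), max_distance)


def bucket_matrix(p: int, max_distance: int = 32):
    """p x p matrix of bucket indices, used to look up a learnable bias per pair."""
    return [[symmetric_linear_bucket(i, j, max_distance) for j in range(p)]
            for i in range(p)]


print(bucket_matrix(6, max_distance=4)[0])  # [0, 1, 2, 3, 4, 4]
```

Under this reading, the offsets 3 and 4 always fall into different buckets, so a learnable bias can separate the i to i+3 pattern from the i to i+4 pattern in the helix example above.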

2D segment detector

The proposed 2D segment detector, which detects local patterns from the stacked self-attention maps, is composed of 3 detectors, each of which is constructed by stacking 3 base blocks. The base block used in our detector has the same structure as the block used in [21]. This base block includes both channel attention and pixel attention, and it enables flexible learning by calculating weights for pixel-wise features and channel-wise features, respectively. The pixel-wise features from [21] can be regarded as interaction-wise features in our detector. To enable the robust detection of local patterns, we extract various features by using different kernel options for each detector. Conv2D kernels with a size of 3, padding 1, and dilation 1 are used in the first detector, Conv2D kernels with a size of 5, padding 2, and dilation 1 are used in the second detector, and Conv2D kernels with a size of 3, padding 1, and dilation 2 are used in the third detector. The features that pass through each detector are concatenated in a dimension-wise manner. Because the repeated use of padding in the Conv blocks prevents the feature from shrinking, the shape of the final feature that passes through the 2D segment detector becomes \(p \times p \times out\_dim\), as shown in Fig. 1c.
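Whether a convolution preserves the \(p \times p\) size can be checked with the standard output-size formula for a padded, dilated convolution. The helper below is a sketch with an illustrative name, and a stride of 1 is assumed since the text does not mention stride. Under this formula, the first two settings keep the side length unchanged. For a kernel of size 3 with dilation 2, a padding of 2 would be needed to do the same, so the padding of 1 stated above is taken as given and its result is simply reported.

```python
def conv_out_size(size: int, kernel: int, padding: int, dilation: int,
                  stride: int = 1) -> int:
    """Side length after one 2D convolution (standard formula, stride assumed 1)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


p = 64  # example side length of the attention stack
print(conv_out_size(p, kernel=3, padding=1, dilation=1))  # 64
print(conv_out_size(p, kernel=5, padding=2, dilation=1))  # 64
print(conv_out_size(p, kernel=3, padding=1, dilation=2))  # 62
print(conv_out_size(p, kernel=3, padding=2, dilation=2))  # 64
```

Features can only be concatenated along the channel dimension when all three detectors return the same spatial size, so this check is worth running against whatever settings are actually used.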

Training detail

Protein sequences vary in length. In addition, because we transform sequential features into pairwise features during training, the amount of computation and memory usage differs greatly according to the length of the input sequence. Thus, long sequences must be handled carefully for stable training. Rather than cutting each sequence to a fixed length during preprocessing, we randomly crop it every epoch to enable efficient training while obtaining an augmentation effect. For training, cross-entropy loss was used, the batch size was set to 2, and the number of epochs was set to 10. As a scheduler, CosineAnnealingLR was used to prevent the model from becoming trapped in local minima. The specific details of the model are as follows: the transformer encoder has 3 layers and 8 heads, resulting in a total of 24 channels (3 \(\times\) 8) for the stacked attention map. The channel size of the convolution blocks used in the segment detector is set to 64 for all layers.
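The per-epoch random cropping can be sketched as below. The crop length `max_len` is a hypothetical parameter, as the text does not state the value used, and `random_crop` is an illustrative name. The per-residue embeddings and labels are cropped with the same window so that they stay aligned.

```python
import random


def random_crop(embedding, labels, max_len, rng=random):
    """Return an aligned random window of at most max_len residues.

    embedding: sequence of per-residue vectors; labels: per-residue labels.
    Sequences that already fit are returned unchanged.
    """
    n = len(labels)
    if n <= max_len:
        return embedding, labels
    start = rng.randrange(n - max_len + 1)  # inclusive range of valid starts
    return embedding[start:start + max_len], labels[start:start + max_len]


# Toy example: residue indices stand in for the 1024-dimensional embeddings.
rng = random.Random(0)
emb = list(range(300))
lab = "C" * 100 + "H" * 100 + "E" * 100
crop_emb, crop_lab = random_crop(emb, lab, max_len=128, rng=rng)
print(len(crop_emb), len(crop_lab))  # 128 128
```

Because the pairwise feature has shape \(p \times p \times 24\), memory grows quadratically with the sequence length p, and calling the crop anew each epoch exposes the model to different windows of the same long protein.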

Results and discussion

Performance comparison

We used two datasets for training and compared the performance of different models. The model trained with ProteinNet was compared with PSIPRED [22], SPIDER3 [23], ProteinUnet [24], and SPOT-1D-Single [14], which use only a single sequence as an input, and with SPOT-1D [25], which uses additional evolutionary information. Additionally, SPOT-1D-LM, which uses language model embeddings as our method does, was compared separately because it can only perform inference on sequences with a length of 1024 or less. These models were evaluated on the SPOT-2016 (1473 proteins), SPOT-2016-HQ (295 proteins), SPOT-2018 (548 proteins), SPOT-2018-HQ (125 proteins), and TEST-2018 (250 proteins) datasets. The model trained with NetSurfP-2.0 was compared with DeepProtVec, DeepSeqVec [26], ESM-1b, ProtT5-XL-U50, ProtT5-XXL-U50, and NetSurfP-3.0, which use the embedding of a language model as the model input, and with NetSurfP-2.0, which uses additional evolutionary information. These models were evaluated using the CASP12-FM (20 proteins), NEW364 (364 proteins), CB513 (511 proteins), and TS115 (115 proteins) datasets. The performance of all models was evaluated in terms of accuracy on all datasets.
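For reference, per-residue accuracy can be computed as in the sketch below. It assumes that accuracy means the fraction of residues whose predicted class equals the assigned DSSP class; the text does not say whether unresolved residues are masked, so no masking is applied. The function name is illustrative.

```python
def per_residue_accuracy(predicted: str, target: str) -> float:
    """Fraction of positions where the predicted state matches the DSSP state."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("empty sequence")
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


print(per_residue_accuracy("HHEC", "HHCC"))  # 0.75
```

The same function applies to both DSSP3 and DSSP8 label strings, since only the alphabet differs.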

Table 1 Average prediction accuracy of models trained with the NetSurfP-2.0 DSSP8 dataset
