ISSEC: inferring contacts among protein secondary structure elements using deep object detection

Background The formation of contacts among protein secondary structure elements (SSEs) is an important step in protein folding, as it determines the topology of the protein tertiary structure; hence, inferring inter-SSE contacts is crucial to protein structure prediction. One existing strategy infers inter-SSE contacts directly from the predicted probabilities of inter-residue contacts without any preprocessing, and thus suffers from the excessive noise in the predicted inter-residue contacts. Another strategy first defines SSEs based on protein secondary structure prediction, and then judges whether each candidate SSE pair forms a contact. However, errors in secondary structure prediction make it difficult to determine SSE boundaries accurately, and incorrectly deduced SSEs hinder the subsequent prediction of the contacts among them.
Results We here report an accurate approach, called ISSEC, that infers inter-SSE contacts using the deep object detection technique. The design of ISSEC is based on the observation that, in the inter-residue contact map, contacting SSEs usually form rectangular regions with characteristic patterns. ISSEC therefore infers inter-SSE contacts by detecting such rectangular regions. Unlike the existing approach that directly uses the predicted probabilities of inter-residue contacts, ISSEC applies deep convolution to extract high-level features from the inter-residue contacts. More importantly, ISSEC does not rely on pre-defined SSEs. Instead, it enumerates multiple candidate rectangular regions in the predicted inter-residue contact map and, for each region, calculates a confidence score that measures whether the region shows the characteristic patterns. ISSEC then employs a greedy strategy to select non-overlapping regions with high confidence scores, and finally infers inter-SSE contacts from these regions.
Conclusions Comprehensive experimental results suggested that ISSEC outperformed the state-of-the-art approaches in predicting inter-SSE contacts. We further demonstrated successful applications of ISSEC to improving the prediction of both inter-residue contacts and tertiary structure.
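The greedy selection of non-overlapping, high-confidence regions mentioned in the Results can be illustrated with a minimal sketch. The `Region` type, the `min_score` threshold of 0.5, and the strict no-overlap rule are assumptions made for illustration; they are not details taken from ISSEC itself.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A candidate rectangle in the contact map (hypothetical type).

    Row and column bounds are inclusive residue indices.
    """
    row_start: int
    row_end: int
    col_start: int
    col_end: int
    score: float
    label: str


def overlaps(a: Region, b: Region) -> bool:
    """Return True if the two rectangles share at least one cell."""
    rows_overlap = a.row_start <= b.row_end and b.row_start <= a.row_end
    cols_overlap = a.col_start <= b.col_end and b.col_start <= a.col_end
    return rows_overlap and cols_overlap


def greedy_select(candidates: List[Region], min_score: float = 0.5) -> List[Region]:
    """Keep the highest-scoring regions that do not overlap one another.

    min_score is an assumed cut-off for "high confidence".
    """
    selected: List[Region] = []
    for region in sorted(candidates, key=lambda r: r.score, reverse=True):
        if region.score < min_score:
            break  # all remaining candidates score even lower
        if not any(overlaps(region, kept) for kept in selected):
            selected.append(region)
    return selected
```

In this sketch, a candidate is discarded as soon as it shares a single cell with an already accepted region; a tolerance on the overlap area would be an equally plausible reading of "non-overlapping".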


Input of the ISSEC
ISSEC's input includes the predicted probabilities of inter-residue contacts generated by CCMpred and the 3-state secondary structure generated by PSIPRED [3]. The 3-state secondary structure was transformed into a matrix as follows: for a pair of residues i and j, we concatenate v_i and v_j into a single vector, which is used as the input feature of this residue pair. Here, v_i refers to the probability distribution over the 3-state secondary structure of residue i. Thus, we obtained an L × L × 7 feature map as input (one contact probability plus the two 3-state distributions for each residue pair).

Figure 1: Inter-SSE contacts (right panel) and the corresponding characteristic patterns in the inter-residue contact map (left panel) for protein 1a1t. (A) A regular β-β parallel contact between two β-strands; (B) a regular β-β anti-parallel contact between two β-strands; (C) an α-α contact between two α-helices.
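The construction of the L × L × 7 input can be sketched with NumPy as follows. The function name `build_feature_map` and the channel order (contact probability first, then v_i, then v_j) are assumptions for illustration; the text only fixes the total of 7 channels.

```python
import numpy as np


def build_feature_map(contact_prob, ss_prob):
    """Stack contact probabilities and pairwise secondary-structure features.

    contact_prob: (L, L) array of predicted inter-residue contact probabilities.
    ss_prob:      (L, 3) array of 3-state secondary-structure probabilities.
    Returns an (L, L, 7) array. The channel order is an assumption.
    """
    contact_prob = np.asarray(contact_prob, dtype=float)
    ss_prob = np.asarray(ss_prob, dtype=float)
    length = ss_prob.shape[0]
    if ss_prob.shape != (length, 3) or contact_prob.shape != (length, length):
        raise ValueError("expected shapes (L, L) and (L, 3)")

    # v_i is constant along each row, v_j is constant along each column.
    v_i = np.broadcast_to(ss_prob[:, None, :], (length, length, 3))
    v_j = np.broadcast_to(ss_prob[None, :, :], (length, length, 3))

    # Channel 0: contact probability; channels 1-3: v_i; channels 4-6: v_j.
    return np.concatenate([contact_prob[:, :, None], v_i, v_j], axis=-1)
```

For a protein of 5 residues, `build_feature_map(cp, ss)` returns an array of shape `(5, 5, 7)`, where the entry at position `(i, j)` holds the contact probability of the pair followed by the two secondary-structure distributions.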

Definition of inter-SSE contacts
In this study, we focused on three types of inter-SSE contacts, namely, α-α contact, β-β parallel contact, and β-β anti-parallel contact (Supplementary Figure 1). The definitions of these three classes are listed below:
• α-α contact: A pair of helices H_1 and H_2 is defined as being in contact if there is a residue-residue contact between them. Each helix should have at least 6 residues.
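The α-α definition above can be expressed as a small check. This is only a sketch: the function name `helices_in_contact`, the inclusive `(start, end)` representation of a helix, and the use of an already binarized residue-residue contact map are assumptions, since the criterion for a residue-residue contact is not restated here.

```python
def helices_in_contact(contact_map, helix_a, helix_b, min_length=6):
    """Return True if two helices satisfy the alpha-alpha contact definition.

    contact_map: square matrix (nested lists or array) whose truthy entries
                 mark residue-residue contacts (assumed to be precomputed).
    helix_a, helix_b: (start, end) residue indices, both inclusive.
    min_length: minimum helix length, 6 residues as stated in the text.
    """
    a_start, a_end = helix_a
    b_start, b_end = helix_b

    # Both helices must have at least min_length residues.
    if a_end - a_start + 1 < min_length or b_end - b_start + 1 < min_length:
        return False

    # A single residue-residue contact between the helices is sufficient.
    return any(
        contact_map[i][j]
        for i in range(a_start, a_end + 1)
        for j in range(b_start, b_end + 1)
    )
```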

Loss Function and Implementation Details
ISSEC uses a multi-task loss function [4,2] that combines the losses for classification, localization, and the segmentation mask:
• L_class represents the loss function over 4 classes (α-α contact, β-β parallel contact, β-β anti-parallel contact, and background). Here, p_i denotes the probability that the rectangular region of interest represents the i-th type of inter-SSE contact, and p*_i represents the ground truth.
• L_bbox measures the difference between the predicted position and the true position of the inter-SSE contact. Here, b_i represents one of the four points of the predicted bounding box, b*_i represents its ground truth, and L_smooth1 denotes the smooth L1-norm loss used in this term. Using the L_bbox term, ISSEC can shrink the rectangular region and calculate the true position of the contacting SSEs.
• L_mask is defined as the average binary cross-entropy loss. The rectangular region of interest is first transformed into four m × m matrices using ROIAlign, where each matrix represents the mask of one type of inter-SSE contact (α-α contact, β-β parallel contact, β-β anti-parallel contact, and background). For each element (i, j) in the matrix, y_ij denotes the true mask whereas ŷ^k_ij denotes the predicted mask for type k.
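For reference, one form of the three terms that is consistent with the symbol definitions above might read as follows, assuming the standard formulation of the cited detection frameworks [4,2]. The unweighted sum, the categorical cross-entropy for L_class, the threshold of 1 in the smooth L1 function, and the 1/m² normalization of L_mask are assumptions rather than details confirmed by the text.

```latex
\begin{align}
L &= L_{class} + L_{bbox} + L_{mask} \\
L_{class} &= -\sum_{i=1}^{4} p^{*}_{i} \log p_{i} \\
L_{bbox} &= \sum_{i=1}^{4} L_{smooth_1}\!\left(b_{i} - b^{*}_{i}\right),
\qquad
L_{smooth_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases} \\
L_{mask} &= -\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m}
\left[ y_{ij} \log \hat{y}^{k}_{ij} + (1 - y_{ij}) \log\!\left(1 - \hat{y}^{k}_{ij}\right) \right]
\end{align}
```

Under this reading, only the mask of the ground-truth type k contributes to L_mask, so the masks of different contact types do not compete with one another.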
Hyper Parameters: ResNet-50 was used as the backbone to extract image features. All convolution filters were 3 × 3, except the output convolution filters, which were 1 × 1. ReLU was used in the hidden layers.
Model Implementation and Efficiency: The model was optimized using the momentum optimizer with a learning rate of 0.002, with L2-norm regularization to prevent overfitting, and early stopping with a maximum of 100,000 iterations. The whole algorithm is implemented using TensorFlow on an NVIDIA 1070 GPU; the batch size is set to 2 due to GPU limitations, and the training time is less than 15 hours.

Figure 4: Improving inter-residue contact prediction by re-weighting the loss function of Xu's model based on the inter-SSE contacts predicted by ISSEC. In Xu's model, the CCMpred output was fed into the ResNet, which then predicted improved inter-residue contacts. In our re-weighting strategy, the CCMpred output was also fed into ISSEC to predict inter-SSE contacts (red boxes), which were then used to re-weight the loss function of Xu's model.

Here the predicted structures are shown in red whereas the native structure is shown in blue.