Distance map prediction using 2D-Recursive Neural Network
The artificial neural networks we use for predicting distance maps are based on the general-purpose 2D-RNN adaptive architecture previously described in [17, 39]; a description of the architecture of the 2D-RNN, together with details of the learning algorithm we employ, is provided in the SI. 2D-RNN-based models map 2D matrices of variable size into matrices of the same size. Here, the output O of the model represents the distance map itself, whereas the input I encodes a set of pairwise properties of the residues in the protein (Additional file 1: Figure S1).
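As an illustration of the variable-size matrix-to-matrix mapping, the following sketch implements a minimal four-plane 2D-RNN forward pass in the style of [17, 39]. All weight matrices and the hidden size are hypothetical placeholders, not parameters from the paper; tanh units follow the description in the Learning section.

```python
import numpy as np

def rnn2d_forward(I, Wx, Wh, Wo, hidden=8):
    """Minimal sketch of a four-plane 2D-RNN forward pass.
    I has shape (n, n, f); the returned output has shape (n, n).
    Wx, Wh, Wo and `hidden` are illustrative, not the paper's values."""
    n = I.shape[0]
    planes = []
    # Four lateral hidden planes, each scanning from one corner of the matrix.
    for dj, dk in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
        H = np.zeros((n, n, hidden))
        js = range(n) if dj == 1 else range(n - 1, -1, -1)
        ks = range(n) if dk == 1 else range(n - 1, -1, -1)
        for j in js:
            for k in ks:
                # Context from the previously visited cells (zeros at borders).
                prev_j = H[j - dj, k] if 0 <= j - dj < n else np.zeros(hidden)
                prev_k = H[j, k - dk] if 0 <= k - dk < n else np.zeros(hidden)
                H[j, k] = np.tanh(I[j, k] @ Wx + (prev_j + prev_k) @ Wh)
        planes.append(H)
    # The output unit combines the local input with the four hidden planes.
    ctx = np.concatenate([I] + planes, axis=-1)
    return np.tanh(ctx @ Wo)[..., 0]
```

Because every cell of the output depends on all four corner-anchored recursions, each predicted entry can in principle integrate contextual information from the entire input matrix.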
The input vector I_jk associated with the j-th and k-th residue pair contains evolutionary information, secondary structure, solvent accessibility and contact density information. The frequencies of amino acids observed in the two columns j and k of the MSA are used as evolutionary input to the network, represented as two 20-dimensional probability vectors. Structural information in the form of standard secondary structure classes (α-helix, β-sheet, random coil) is encoded using two 3-dimensional vectors, whereas relative solvent accessibility (2 classes: buried, 0-25%; and exposed, 25-100%) and contact density (4 classes) are encoded using two 2-dimensional and two 4-dimensional vectors, respectively. In total, a vector of 58 units is used as input to the ab initio model of distance map prediction.
The model that encodes the 20 common types of amino acids is termed the ‘classical’ model here. In addition to this classical model, two additional models with different evolutionary content are created. The first, so-called “complementarity” model restricts the input to seven classes of amino acids that are expected to be relevant to the stability of the fold. The complementarity model clusters the 20 amino acids into 7 classes based on their structural and physicochemical properties: (i) hydrophobic (A, F, I, L, M, V), (ii) polar (N, Q, S, T, W, Y), (iii) negatively charged (D, E), (iv) positively charged (H, K, R), (v) cysteine (C), (vi) glycine (G) and (vii) proline (P). In the second, so-called “correlation” model, the amino acid information is augmented by the correlated mutation signal (1 unit) extracted from the MSA. Correlated mutations are calculated using the PAM70 substitution matrix and Göbel’s algorithm, in which completely conserved positions and positions with > 20% gaps are discarded from the analysis.
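The seven-class clustering used by the complementarity model can be written down directly from the list above (the class ordering and function name are illustrative):

```python
# The 7 amino acid classes of the complementarity model, as listed in the text.
COMPLEMENTARITY_CLASSES = {
    "hydrophobic": "AFILMV",
    "polar": "NQSTWY",
    "negative": "DE",
    "positive": "HKR",
    "cysteine": "C",
    "glycine": "G",
    "proline": "P",
}

# Invert the table: one-letter amino acid code -> class index 0..6.
AA_TO_CLASS = {aa: i
               for i, (name, aas) in enumerate(COMPLEMENTARITY_CLASSES.items())
               for aa in aas}

def complementarity_class(aa):
    """Map a one-letter amino acid code to its complementarity class index."""
    return AA_TO_CLASS[aa.upper()]
```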
In the template-based model an additional 2-dimensional vector extracted from template PDB profiles is appended to the input vector, similarly to [30]. The first unit in this vector encodes the weighted average distance from the templates:

d̄_jk = ( Σ_p w_p d^p_jk ) / ( Σ_p w_p ),     (2)

where d^p_jk is the distance between residues j and k in the p-th template and w_p represents the weight attributed to the p-th template. The weight w_p depends on the template’s quality, q_p, and its sequence identity, id_p, to the target sequence:

w_p = q_p · id_p³.     (3)

The template quality q_p further depends on the nature of the 3D structure (X-ray, NMR); for X-ray structures it is computed from the resolution and R-factor of the deposited structure [51]. Taking into account the cube of the sequence identity between the query and the template in Equation 3 allows us to favour distances extracted from good templates over distances calculated from low-similarity templates.
Finally, the last unit in the input vector encodes the weighted average of the template coverage:

c̄ = ( Σ_p w_p cov_p ) / ( Σ_p w_p ),     (5)

where cov_p represents the coverage of the query by the p-th template, i.e. the fraction of non-gap positions in the alignment. The template-based vector defined this way performed better than a number of alternatives in preliminary testing (data not shown) and is used in the template-based models of distance maps.
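The two template-derived units can be computed as below. The function name is illustrative; the weighting w_p = q_p · id_p³ follows Equation 3, and normalization by the sum of the weights is assumed.

```python
def template_features(distances, qualities, identities, coverages):
    """Compute the two units of the template-based input vector for one
    residue pair: the weighted average template distance and the weighted
    average template coverage. Inputs are per-template lists; weights are
    w_p = q_p * id_p**3 as in Equation 3."""
    weights = [q * (ident ** 3) for q, ident in zip(qualities, identities)]
    total = sum(weights)
    if total == 0:                      # no usable template for this pair
        return 0.0, 0.0
    avg_dist = sum(w * d for w, d in zip(weights, distances)) / total
    avg_cov = sum(w * c for w, c in zip(weights, coverages)) / total
    return avg_dist, avg_cov
```

The cubic dependence on identity means a 90%-identity template outweighs a 30%-identity template of equal quality by a factor of 27, which is what biases the average towards distances from good templates.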
The input vector provided to the filtering NN contains the predicted distance d_jk obtained from the first-stage 2D-RNN (1 unit), the sequence separation between residues j and k (1 unit), the protein sequence length (1 unit) and global information extracted from the predicted distance map (15 units). The global information contains the average distance between all pairs of residues (m, n) within the segments j − 5 ≤ m ≤ j + 5 and k − 5 ≤ n ≤ k + 5. In addition to the average distance of this 11×11 residue patch centred on the (j, k) residue pair, the average distances of 14 additional patches, obtained by keeping the same separation between the pairs of residues, are also provided to the network.
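A sketch of the 15-unit global information is given below. The boundary clipping and the choice of shifting the patch centre along the diagonal of constant sequence separation are assumptions consistent with the description above; function names are illustrative.

```python
import numpy as np

def patch_average(dmap, j, k, half=5):
    """Average predicted distance over the (2*half+1) x (2*half+1) patch
    centred on residue pair (j, k), clipped at the map boundaries."""
    n = dmap.shape[0]
    rows = slice(max(0, j - half), min(n, j + half + 1))
    cols = slice(max(0, k - half), min(n, k + half + 1))
    return float(dmap[rows, cols].mean())

def global_features(dmap, j, k, shifts=range(-7, 8)):
    """15 patch averages: the central 11x11 patch plus 14 patches shifted
    along the diagonal so that the sequence separation k - j is preserved
    (the shift range of +/-7 is an assumption, not stated in the paper)."""
    n = dmap.shape[0]
    feats = []
    for t in shifts:
        jj = min(max(j + t, 0), n - 1)
        kk = min(max(k + t, 0), n - 1)
        feats.append(patch_average(dmap, jj, kk))
    return feats
```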
Learning and initialization
The 2D-RNNs composing the distance map predictors are trained by minimizing the squared error between the output and the target distances. To avoid large plateaux in the error function at the beginning of training, a modified form of the gradient-descent algorithm is used; this algorithm employs a piecewise linear function over three different ranges for the network weight updates and has been described in detail previously. The transfer functions in all network units are implemented using the tanh function. We adopt a hybrid between on-line and batch training with 1,450 batch blocks per training set, i.e. two proteins per batch: the weights of all networks are updated based on the gradient computed on groups of two proteins. The training set is shuffled at the beginning of each epoch, so the error does not necessarily decrease monotonically; if the error does not decrease for 50 consecutive epochs, the learning rate is divided by 2. Prior to learning, the weights of each unit in all neural networks are randomly initialized. Their standard deviations are controlled so as to avoid any bias and to ensure that the expected total input into each unit remains approximately in the same range.
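The learning-rate halving rule can be replayed offline from a sequence of epoch errors; the starting rate below is a hypothetical placeholder, since the paper does not state it.

```python
def lr_schedule(errors, lr0=0.1, patience=50):
    """Replay the halving rule: if the epoch error has not improved for
    `patience` consecutive epochs, divide the learning rate by 2.
    Returns the learning rate in effect after each epoch."""
    lr, best, stale = lr0, float("inf"), 0
    history = []
    for err in errors:
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr /= 2.0
                stale = 0
        history.append(lr)
    return history
```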
Due to the large number of training instances and limited computational power/time, all systems are trained in 5-fold cross-validation. Each of the five networks is trained for 1,000 epochs, saving the parameters every 5 epochs. For each network, the last three saved models are combined into a single predictor. Finally, all 5 networks are combined into a single system; this is known to slightly improve performance over individual models.
Reconstruction of protein Cα traces
The reconstruction algorithm for protein Cα traces is organized into two sequential phases, as described in detail previously. Briefly, in the first phase a random structure is generated by adding Cα positions until the whole backbone is produced. Bonds between adjacent Cα atoms are added in a random direction, with lengths drawn uniformly from the interval 3.803 ± 0.07 Å. The positions of Cα atoms belonging to a helix are modelled using the coordinates of an ideal helix with random orientation. In the second phase, the algorithm refines the initial structure by optimizing a pseudo-energy function using local moves and simulated annealing. The moves we adopt displace a single residue at a time, keeping its distances to its neighbours constant.
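The first (random-growth) phase can be sketched as follows; the helix idealization is omitted here, and the function name is illustrative.

```python
import numpy as np

def random_ca_trace(n, rng=None):
    """Phase-one initial structure: grow a Cα trace by adding each new atom
    in a random direction at a bond length drawn uniformly from
    3.803 +/- 0.07 Angstrom (ideal-helix placement omitted in this sketch)."""
    rng = rng or np.random.default_rng()
    coords = [np.zeros(3)]
    for _ in range(n - 1):
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)                       # random unit direction
        bond = rng.uniform(3.803 - 0.07, 3.803 + 0.07)
        coords.append(coords[-1] + bond * v)
    return np.array(coords)
```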
The pseudo-energy function used here is shaped to encode the constraints represented by the distance map together with various geometrical limitations. Let S = {r_i, i = 1 … n} be a sequence of n 3D coordinates, with r_i being the coordinates of the i-th Cα atom of the current protein conformation and d_ij = |r_i − r_j| the distance between atoms i and j. The first set of constraints comes from the predicted distance map, which provides the n(n − 1)/2 mutual distances between Cα atoms; the predicted distances are obtained as outputs from the second step of the overall pipeline (Figure 1). The rest of the geometric constraints include: a term which limits the distance between neighbouring Cα atoms; a term which defines clashes between residues; and a term which defines the dependence of the distance d between the first and the last residue of a β-strand on the length l (in amino acids) of the strand. Using these constraints, the pseudo-energy function can be written as a weighted sum of the corresponding penalty terms:

E(S) = α_0 E_0(S) + α_1 E_1(S) + α_2 E_2(S) + α_3 E_3(S),     (6)

where E_0 constrains neighbouring Cα distances and E_1, E_2 and E_3 penalize deviations from the predicted distance map, residue clashes and non-ideal β-strand lengths, respectively.
In all the experiments, we run the annealing protocol for 10,000 × protein-length iterations, in each of which a perturbation of a single residue is attempted. The pseudo-energy parameters are set to α_0 = 0.2, α_1 = 0.025 (distance penalty), α_2 = 0.5 (clashes) and α_3 = 2.0 (strand length), so that the conformational search is biased towards the generation of compact, clash-free structures with the recommended length of β-strands and with Cα distances approaching the distances provided by the distance map.
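A sketch of the pseudo-energy evaluation is given below. The exact functional forms of the penalty terms are not reproduced in this text, so the squared-deviation forms and the 4.0 Å clash cutoff are assumptions; the β-strand length term is omitted, and only the weights α_0-α_2 quoted above are used.

```python
import numpy as np

def pseudo_energy(coords, pred_dmap, alphas=(0.2, 0.025, 0.5),
                  bond=3.803, clash=4.0):
    """Sketch of the weighted pseudo-energy of Equation 6 with the bond,
    distance-map and clash terms. Squared-deviation penalties and the
    clash cutoff (4.0 A) are assumptions; the strand-length term is omitted."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    e_bond = np.sum((np.diagonal(d, 1) - bond) ** 2)   # adjacent Cα atoms
    iu = np.triu_indices(n, k=2)                       # pairs with |i-j| >= 2
    e_dist = np.sum((d[iu] - pred_dmap[iu]) ** 2)      # distance-map penalty
    e_clash = np.sum(d[iu] < clash)                    # count of clashes
    a0, a1, a2 = alphas
    return a0 * e_bond + a1 * e_dist + a2 * e_clash
```

The relative magnitudes of the weights make the strand and clash terms the dominant penalties, so the annealing first resolves gross geometric violations before fine-tuning the structure against the (noisier) predicted distances.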
A distance map contains no information about chirality: when a structure is reconstructed, the mirror-image structure, which has the same distance map, is equally legitimate. Therefore, in the final step we generate the mirror image of the reconstructed structure and refine it for an additional 5,000 iterations. The choice of the final reconstructed structure is based on the pseudo-energy penalties of the original and the mirror-image 3D structures (Equation 6).
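The mirror image is obtained by reflecting the coordinates through a plane; this preserves every pairwise distance while inverting chirality, which is exactly why the distance map cannot discriminate between the two.

```python
import numpy as np

def mirror_image(coords):
    """Reflect a structure through the xy-plane. The mirror image has an
    identical Cα distance map but opposite chirality."""
    m = coords.copy()
    m[:, 2] *= -1.0
    return m
```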
In addition to this simple geometry-based reconstruction algorithm, we use a fragment-based reconstruction to predict the structures of CASP9 targets from non-native 4-class maps and distance maps. In the fragment-based reconstruction implemented here, for each protein segment of length 9, 50 candidate structures in the PDB are identified using a fold recognition algorithm described previously. A move consists of swapping a segment at a random position with another (random) one from the list. Since segment geometries are generally not the same, mutual distances between any two residues in the protein are affected by a move. Moves are accepted or rejected based on the same pseudo-energy function as in the previous protocol (Equation 6) and the same simulated annealing protocol, run for 20,000 iterations. Lastly, the mirror image of the reconstructed structure is generated, a brief further reconstruction is attempted and its fitness is assessed. Given that segments from the PDB incorporate chirality information, we observe that, in the majority of cases, the correct mirror image is selected directly based on fitness.
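The acceptance rule used to accept or reject a move (fragment swap or single-residue displacement) can be sketched as below. The Metropolis form is an assumption, since the text only names simulated annealing without giving the acceptance criterion.

```python
import math
import random

def accept_move(e_old, e_new, temperature, rng=random):
    """Simulated-annealing acceptance rule (Metropolis form, assumed):
    always accept a move that lowers the pseudo-energy, and accept an
    uphill move with probability exp(-dE / T)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / temperature)
```

Lowering the temperature over the 20,000 iterations makes uphill moves progressively rarer, so the search converges towards a low-pseudo-energy conformation.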
Dataset
The dataset used to train and test the predictors is extracted from the October 2009 25% pdb_select list, containing 4,818 proteins. Since the training is computationally demanding (and its complexity quadratic in the protein length), we created a reduced version of the dataset by excluding proteins longer than 200 residues. The final dataset contains 3,645 proteins with 360,971 residues and 21,918,875 residue pairs (Additional file 1: Table S2). All systems are trained in 5-fold cross-validation by splitting the dataset into 5 approximately equal folds. Inter-residue distances used for training are measured between Cα atoms, and their distribution is plotted in Additional file 1: Figure S2. If we include only the distances between residues separated by at least 2 amino acids in the sequence, then it becomes clear from Additional file 1: Figure S2 that the majority of the data (76%) lie in the range [10 Å, 30 Å], with a mean value of 20.7 Å.
Secondary structure and relative solvent accessibility of each residue are assigned using DSSP, whereas contact density is calculated as described previously. True structural information is used for training both the ab initio and the template-based models. For testing, secondary structure, solvent accessibility and contact density are predicted using our in-house servers [42, 44, 45], respectively.
Evolutionary information in the form of amino acid probability vectors, amino acid classes and correlated mutations is calculated from MSAs. The alignments for the proteins in the training/test dataset are extracted from the non-redundant (NR) database and generated by three runs of position-specific iterative BLAST (PSI-BLAST) with parameters b = 3,000 and e = 10.
To generate the structural templates for a protein, we run PSI-BLAST against the PDB (available on April 30th 2008) using the position specific scoring matrix (PSSM) generated during the alignment process. We deliberately use a high expectation parameter (e=10) to include hits that are beyond the usual comparative modelling scope (e<0.01). Finally, in order to avoid perfect templates coming from PDB resubmissions of the same structure and close homologues, we exclude those templates whose sequence similarity exceeds 95% over the whole query.
The distribution of the sequence identity to the average/best template is given in Additional file 1: Figure S3. Not surprisingly, the average identity across all templates is generally low, with a median of 20%. Although the distribution is not uniform, all identity intervals are adequately represented: 37% of all proteins have a best hit with less than 20% sequence identity (midnight zone), for 21% of proteins the best hit lies between 20% and 30% sequence identity (twilight zone), and for the remaining 42% close homologues can be found with sequence identity in the interval 30-95%.