### Training set

The data set used in our simulations is extracted from the December 2003 25% pdb_select list [33]. We assign each residue's secondary structure and solvent accessibility using the DSSP program [34]. The relative solvent accessibility of a residue is defined as the accessibility in Å^{2} as computed by DSSP, divided by the maximum observed accessibility for that type of residue. Secondary structure is mapped from the 8 DSSP classes into three classes as follows: H, G, I → Helix; E, B → Strand; S, T, . → Coil. Relative solvent accessibility is mapped into 4 classes whose thresholds are chosen to be maximally informative, i.e. to split the set into (roughly) equally numerous classes: [0%, 4%), [4%, 25%), [25%, 50%) and [50%, ∞) exposed.
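The two class mappings can be sketched as follows. The maximum-accessibility normaliser is passed in as a parameter, since the exact per-residue table used is not specified in the text:

```python
# Sketch of the DSSP-to-class mappings described above.
# 8 DSSP secondary structure classes -> 3 classes
SS_MAP = {"H": "Helix", "G": "Helix", "I": "Helix",
          "E": "Strand", "B": "Strand",
          "S": "Coil", "T": "Coil", ".": "Coil"}

# Relative solvent accessibility thresholds from the text
RSA_BINS = [0.04, 0.25, 0.50]

def rsa_class(acc, max_acc):
    """4-class relative solvent accessibility label (0..3).

    acc: DSSP accessibility in A^2; max_acc: maximum observed
    accessibility for this residue type (table not given here)."""
    rsa = acc / max_acc
    for i, threshold in enumerate(RSA_BINS):
        if rsa < threshold:
            return i
    return 3
```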

We remove all sequences for which DSSP does not produce an output due, for instance, to missing entries (e.g. if only the C_{α} trace is present in the PDB file) or format errors. After processing by DSSP, this set (S2171) contains 2171 proteins and 344,653 amino acids. All the tests reported in this paper are run in 5-fold cross validation on S2171. The 5 folds are of roughly equal sizes, composed of 434–435 proteins and ranging between 67,345 and 70,098 residues. The datasets are available upon request.

Prediction from a multiple alignment of protein sequences rather than a single sequence has long been recognised as a way to improve prediction accuracy for virtually all protein structural features: secondary structure [9, 10, 14, 19, 22, 29, 40], solvent accessibility [11, 13, 15–18, 20, 21], beta-sheet pairing [41, 42], contact maps [43–45], etc. We exploit evolutionary information in the form of frequency profiles compiled from alignments of multiple homologous sequences. Multiple sequence alignments for the S2171 set are extracted from the NR database as available on March 3 2004, containing over 1.4 million sequences. The database is first redundancy-reduced at a 98% threshold, leaving a final 1.05 million sequences. The alignments are generated by three runs of PSI-BLAST [23] with parameters *b* = 3000 (maximum number of hits), *e* = 10^{-3} (expectation of a random hit) and *h* = 10^{-10} (expectation of a random hit for sequences used to generate the PSSM).

Data sets, training/test folds and multiple alignments are identical to those used to train and test the *ab initio* secondary structure predictor Porter [19].

### Template generation

For each of the proteins in S2171 we search for structural templates in the PDB. We base our search on PDBFINDERII [46] as available on August 22 2005. An obvious problem is that all proteins in the S2171 set are expected to be in the PDB (barring name changes), hence every protein would have a perfect template. To avoid this, we exclude from PDBFINDERII every protein that appears in S2171. We also exclude all entries shorter than 10 residues, leading to a final 66,350 chains. Because of the PDBFINDERII origin, only one chain per NMR entry is present in this set.

To generate the actual templates for a protein, we run two rounds of PSI-BLAST against the redundancy-reduced NR database described above, with parameters *b* = 3000, *e* = 10^{-3} and *h* = 10^{-10}. We then run a third round of PSI-BLAST against the PDB using the PSSM generated in the first two rounds. In this third round we deliberately use a high expectation parameter (*e* = 10) to include hits beyond the usual Comparative Modelling scope (*e* < 0.01 at the CASP6 competition [28]). From each set of hits thus found we then remove all those with sequence similarity to the query exceeding 95% over the whole query, to exclude PDB resubmissions of the same structure at different resolutions, other chains in N-mers and close homologues.
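The final filtering step can be sketched as follows, assuming each hit is available as a record carrying its whole-query sequence similarity (the record layout is illustrative, not the actual pipeline's):

```python
# Hedged sketch: discard template hits whose sequence similarity to the
# query exceeds 95% over the whole query, removing PDB resubmissions,
# other chains of N-mers and close homologues.
def filter_templates(hits, max_similarity=0.95):
    """hits: list of dicts with a 'similarity' key in [0, 1]."""
    return [h for h in hits if h["similarity"] <= max_similarity]
```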

The distribution of the sequence similarity of the best template, together with the average template similarity, is plotted in figure 7. Roughly 15% of the proteins have no hits at more than 10% sequence similarity. About 20% of all proteins have at least one very high quality (better than 90% similarity) entry in their template set.

Although the distribution is not uniform, all similarity intervals are adequately represented: for about 40% of the proteins no hit is above 30% similarity; for nearly 20% of the proteins the best hit is in the 30–50% similarity interval. Overall 74,543 residues (21.6% of the set) are not covered by any template. The average similarity for all PDB hits for each protein, not surprisingly, is generally low: for roughly 75% of all proteins in S2171 the average identity is below 30%.

To test template-based predictions in marginal similarity conditions we also extract two further template sets from which all hits are excluded that exceed, respectively, 30% and 20% sequence similarity. In this case the number of residues not covered by any template climbs, respectively, to 148,124 (43% of the total) and 193,921 (56.3%).

### Predictive architectures

To learn the mapping between our input space \mathcal{I} and output space \mathcal{O} we use two-layered architectures composed of Bidirectional Recurrent Neural Networks (BRNN) [10] (also known as 1D-RNNs, e.g. in [44]) of the same length *N* as the amino acid sequence. Similarly to [19], we use BRNNs with shortcut connections: connections along the forward and backward hidden chains span more than 1-residue intervals, creating shorter paths between inputs and outputs. These networks take the form:

\begin{array}{ccc}{o}_{j}& =& {\mathcal{N}}^{(O)}\left({i}_{j},{h}_{j}^{(F)},{h}_{j}^{(B)}\right)\\ {h}_{j}^{(F)}& =& {\mathcal{N}}^{(F)}\left({i}_{j},{h}_{j-1}^{(F)},\mathrm{...},{h}_{j-S}^{(F)}\right)\\ {h}_{j}^{(B)}& =& {\mathcal{N}}^{(B)}\left({i}_{j},{h}_{j+1}^{(B)},\mathrm{...},{h}_{j+S}^{(B)}\right)\\ j=1,\mathrm{...},N\end{array}

where *i*_{j} (resp. *o*_{j}) is the input (resp. output) of the network in position *j*, and {h}_{j}^{(F)} and {h}_{j}^{(B)} are the forward and backward chains of hidden vectors, with {h}_{0}^{(F)}={h}_{N+1}^{(B)}=0. We parametrise the output update, forward update and backward update functions (respectively \mathcal{N}^{(O)}, \mathcal{N}^{(F)} and \mathcal{N}^{(B)}) using three two-layered feed-forward neural networks.
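A minimal NumPy sketch of these recursions, with the three update functions collapsed to single layers for brevity (the actual model uses two-layered feed-forward networks, and the weight shapes here are an illustrative choice):

```python
import numpy as np

# Sketch of the BRNN recursion with shortcut connections of span S:
# each hidden state sees the current input plus the S previous (forward
# chain) or S next (backward chain) hidden states; states beyond the
# sequence ends are zero, as in the boundary condition above.
def brnn_forward(inputs, Wf, Wb, Wo, S=2):
    N, d = inputs.shape
    H = Wf.shape[0]                       # hidden vector size
    hf = np.zeros((N + S, H))             # h^{(F)}, zero-padded at the start
    hb = np.zeros((N + S, H))             # h^{(B)}, zero-padded at the end
    for j in range(N):                    # forward chain
        ctx = np.concatenate([inputs[j]] + [hf[S + j - s] for s in range(1, S + 1)])
        hf[S + j] = np.tanh(Wf @ ctx)
    for j in range(N - 1, -1, -1):        # backward chain
        ctx = np.concatenate([inputs[j]] + [hb[j + s] for s in range(1, S + 1)])
        hb[j] = np.tanh(Wb @ ctx)
    out = np.empty((N, Wo.shape[0]))      # outputs o_j
    for j in range(N):
        ctx = np.concatenate([inputs[j], hf[S + j], hb[j]])
        out[j] = Wo @ ctx
    return out
```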

#### Encoding sequence and template information

The input *i*_{j} associated with the *j*-th residue contains primary sequence and evolutionary information, and direct structural information derived from PDB templates:

{i}_{j}=({i}_{j}^{(E)},{i}_{j}^{(T)})

(1)

where, assuming that *e* units are devoted to sequence and evolutionary information, and *t* to structural information:

{i}_{j}^{(E)}=({i}_{j,1}^{(E)},\mathrm{...},{i}_{j,e}^{(E)})

(2)

and:

{i}_{j}^{(T)}=({i}_{j,1}^{(T)},\mathrm{...},{i}_{j,t}^{(T)})

(3)

Hence *i*_{j} contains a total of *e* + *t* components.

As in [19], *e* = 25: besides the 20 standard amino acids, B (aspartic acid or asparagine), U (selenocysteine), X (unknown), Z (glutamic acid or glutamine) and · (gap) are considered. The input presented to the networks is the frequency of each of the 24 non-gap symbols, plus the overall frequency of gaps, in each column of the alignment. That is, if *n*_{jk} is the total number of occurrences of symbol *j* in column *k*, and *g*_{k} the number of gaps in the same column, the *j*^{th} input to the networks in position *k* is:

\frac{{n}_{jk}}{{\displaystyle {\sum}_{v=1}^{24}{n}_{vk}}}

(4)

for *j* = 1,...,24, while the 25^{th} input is:

\frac{{g}_{k}}{{g}_{k}+{\displaystyle {\sum}_{v=1}^{24}{n}_{vk}}}

(5)

This input coding scheme is richer than simple 20-letter schemes and has proven effective in [19].
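The 25-component column encoding of Eqs. 4–5 can be sketched as follows, assuming each alignment column is given as a string with '-' or '.' marking gaps (the symbol ordering is an illustrative choice):

```python
from collections import Counter

# Sketch of the column encoding: frequencies of the 24 non-gap symbols
# among non-gap entries (Eq. 4), plus the overall gap fraction (Eq. 5).
SYMBOLS = "ACDEFGHIKLMNPQRSTVWYBUXZ"   # 20 amino acids + B, U, X, Z

def encode_column(column):
    """column: one alignment column as a string of symbols and gaps."""
    counts = Counter(column)
    gaps = counts.pop("-", 0) + counts.pop(".", 0)
    n = sum(counts.values())            # non-gap entries in the column
    if n == 0:                          # all-gap column: only the gap unit fires
        return [0.0] * 24 + [1.0]
    freqs = [counts.get(s, 0) / n for s in SYMBOLS]   # Eq. 4
    freqs.append(gaps / (gaps + n))                    # Eq. 5
    return freqs
```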

In the case of secondary structure prediction we use *t* = 10 units to represent structural information from the templates; hence the total number of inputs for a given residue is *e* + *t* = 35. The first 8 structural input units contain the average 8-class (DSSP-style) secondary structure composition of the PDB templates, while the last 2 encode the average quality of the template column. Assume that *s*_{p,j} is an 8-component vector encoding the DSSP-assigned 8-class secondary structure of the *j*-th residue in the *p*-th template as follows:

*H* = (1, 0, 0, 0, 0, 0, 0, 0)

*G* = (0, 1, 0, 0, 0, 0, 0, 0)

*I* = (0, 0, 1, 0, 0, 0, 0, 0)

*E* = (0, 0, 0, 1, 0, 0, 0, 0)

*B* = (0, 0, 0, 0, 1, 0, 0, 0)

*S* = (0, 0, 0, 0, 0, 1, 0, 0)

*T* = (0, 0, 0, 0, 0, 0, 1, 0)

· = (0, 0, 0, 0, 0, 0, 0, 1)

Then, if *P* is the total number of templates for a protein:

({i}_{j,1}^{(T)},\mathrm{...},{i}_{j,8}^{(T)})=\frac{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}{s}_{p,j}}}{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}}}

(6)

where *w*_{p} is the weight attributed to the *p*-th template. If the identity between template *p* and the query is *id*_{p}, and the quality of the template (measured as X-ray resolution + R-factor/20, as in [33] – the lower the better) is *q*_{p}, then:

{w}_{p}=i{d}_{p}^{3}/{q}_{p}

(7)

Taking the cube of the identity between template and query drastically reduces the contribution of low-similarity templates when good templates are available: for instance, a 90% identity template is weighted roughly two orders of magnitude more than a 20% one. In preliminary tests (not shown) this measure performed better than a number of alternatives.
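Eqs. 6–7 can be sketched as follows; the array-based layout of the one-hot template vectors is an implementation choice, not the paper's:

```python
import numpy as np

# Sketch of the template-weighted secondary structure input (Eqs. 6-7):
# each template p contributes its one-hot DSSP vector s_{p,j} with weight
# w_p = id_p^3 / q_p, where id_p is the identity to the query and q_p the
# template quality (resolution + R-factor/20, lower is better).
def template_ss_input(one_hots, identities, qualities):
    """one_hots: (P, 8) one-hot rows; identities, qualities: length-P."""
    w = np.asarray(identities) ** 3 / np.asarray(qualities)   # Eq. 7
    S = np.asarray(one_hots, dtype=float)
    return (w[:, None] * S).sum(axis=0) / w.sum()             # Eq. 6
```

With two equal-quality templates at 90% and 20% identity, the weights are 0.729 vs 0.008, so the high-identity template dominates the average, as the text argues.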

The final two units of *i*_{j} encode the weighted average coverage and similarity of a column of the template profile, as follows:

{i}_{j,9}^{(T)}=\frac{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}{c}_{p}}}{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}}}

(8)

where *c*_{p} is the coverage of the sequence by template *p* (i.e. the fraction of non-gaps in the alignment), and

{i}_{j,10}^{(T)}=\frac{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}i{d}_{p}}}{{\displaystyle {\sum}_{p=1}^{P}{w}_{p}}}

(9)

It is worth noting that both the structural information from templates and the two template-quality indices above are residue-based. For this reason, the case in which only templates covering fragments of a protein exist does not pose a problem for the method: residues not covered by any template simply have the template section of the input blank, and their predictions are based only on the sequence (and on sequence and template information transmitted by the forward and backward memory chains).

Template information for solvent accessibility is encoded similarly to secondary structure, except that 4 units are adopted to represent the average solvent accessibility in the PDB-derived templates (4 approximately equal classes). The two units encoding profile quality are the same as in the secondary structure case. For the comparative experiments without templates, exactly the same architectures are adopted, except that the part of the input {i}_{j}^{(T)} representing the template profile is set to zero.
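The two quality units of Eqs. 8–9 can be sketched in the same style; the all-zero return for residues with no covering template mirrors the blank-input convention described above:

```python
import numpy as np

# Sketch of the two template-quality inputs (Eqs. 8-9): weighted average
# coverage and weighted average identity over the P templates, using the
# same weights w_p = id_p^3 / q_p as the structural units.
def template_quality_inputs(identities, qualities, coverages):
    idp = np.asarray(identities, dtype=float)
    if len(idp) == 0:                   # residue not covered by any template
        return 0.0, 0.0
    w = idp ** 3 / np.asarray(qualities, dtype=float)
    i9 = (w * np.asarray(coverages)).sum() / w.sum()   # Eq. 8
    i10 = (w * idp).sum() / w.sum()                    # Eq. 9
    return i9, i10
```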

#### Filtering BRNN

We adopt a second, filtering BRNN, similarly to [19]. The network is trained to predict secondary structure (resp. solvent accessibility) given the first-layer secondary structure (resp. accessibility) predictions. The input to this second network in position *j* includes the first-layer predictions in that position, augmented by first-stage predictions averaged over multiple contiguous windows. That is, if *c*_{j1},...,*c*_{jm} are the outputs in position *j* of the first-stage network, corresponding to the estimated probabilities of residue *j* being in each of the *m* classes, the input to the second-stage network in position *j* is the array *I*_{j}:

\begin{array}{rl}{I}_{j}=\Big(& {c}_{j1},\ldots,{c}_{jm},\\ & {\displaystyle \sum _{h={k}_{-p}-w}^{{k}_{-p}+w}{c}_{h1}},\ldots,{\displaystyle \sum _{h={k}_{-p}-w}^{{k}_{-p}+w}{c}_{hm}},\\ & \cdots \\ & {\displaystyle \sum _{h={k}_{p}-w}^{{k}_{p}+w}{c}_{h1}},\ldots,{\displaystyle \sum _{h={k}_{p}-w}^{{k}_{p}+w}{c}_{hm}}\Big)\end{array}

(10)

where *k*_{f} = *j* + *f* (2*w* + 1), 2*w* + 1 is the size of the window over which first-stage predictions are averaged, and 2*p* + 1 is the number of windows considered. In the tests we use *w* = 7 and *p* = 7, as in [19]. This means that 15 contiguous, non-overlapping windows of 15 residues each are considered, i.e. first-stage outputs between positions *j* - 112 and *j* + 112, for a total of 225 contiguous residues, are taken into account to generate the input to the filtering network in position *j*. This input contains a total of 16*m* real numbers: *m* representing the *m*-class output of the first stage in position *j*, and 15*m* representing the *m*-class first-stage outputs *averaged* over each of the 15 windows. *m* is 3 in the case of secondary structure prediction and 4 for (4-class) solvent accessibility prediction.
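The construction of the Eq. 10 input can be sketched as follows; treating positions outside the sequence as contributing zero is an assumption, since the boundary handling is not specified in the text:

```python
import numpy as np

# Sketch of the filtering-network input (Eq. 10): the first-stage outputs
# at position j, followed by 2p+1 window sums of width 2w+1 centred at
# k_f = j + f*(2w+1) for f = -p..p. Out-of-sequence positions add zero.
def filter_input(C, j, w=7, p=7):
    """C: (N, m) array of first-stage class probabilities."""
    N, m = C.shape
    parts = [C[j]]
    for f in range(-p, p + 1):
        k = j + f * (2 * w + 1)                      # window centre k_f
        lo, hi = max(k - w, 0), min(k + w, N - 1)    # clip to the sequence
        win = C[lo:hi + 1].sum(axis=0) if lo <= hi else np.zeros(m)
        parts.append(win)
    return np.concatenate(parts)   # (2p + 2) * m values; 16*m for p = 7
```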

#### Training, Ensembling

Five two-stage BRNN models are trained independently and ensemble-averaged to build the final predictor. Differences among the models are introduced by two factors: stochastic elements in the training protocol, such as different initial network weights and different shufflings of the examples; and different architectures and numbers of free parameters. The training strategy is identical to that adopted for Porter [19]: 1000 epochs of training are performed for each model, and the learning rate is halved every time we do not observe a reduction of the error for more than 50 epochs. The size and architecture of the models, apart from differences caused by the different number of inputs, are the same as Porter's. The number of free parameters per model ranges between 5,800 and 8,000. The template-based models are only slightly larger (on average 7% more free parameters) than the corresponding *ab initio* ones. Averaging the 5 models' outputs leads to classification performance improvements of between 1% and 1.5% over single models. Furthermore, a copy of each of the 5 models is saved at regular intervals (every 100 epochs) during training. Stochastic elements in the training protocol (similar to that described in [14]) guarantee that the differences accumulated during training are non-trivial. An ensemble of a total of 45 such models yields a further slight improvement over the ensemble of 5 models.
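The ensembling step reduces to averaging the models' class-probability outputs; a minimal sketch:

```python
import numpy as np

# Sketch of ensemble averaging: per-residue class probabilities from the
# independently trained models are averaged, and the argmax over classes
# gives the final predicted class at each position.
def ensemble_predict(model_outputs):
    """model_outputs: list of (N, m) probability arrays, one per model."""
    avg = np.mean(model_outputs, axis=0)
    return avg, avg.argmax(axis=1)
```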