The contact map of a protein with *N* amino acids is a symmetric *N* × *N* matrix *C*, with elements *C*_{ij} defined as:

*C*_{ij} = 1 if amino acids *i* and *j* are in contact, and *C*_{ij} = 0 otherwise.

We define two amino acids as being in contact if the distance between their *C*_{α} atoms is less than a given threshold. For the definition of the PE we adopt a fixed 8 Å threshold, while in the contact map prediction stage we test 8 Å and 12 Å thresholds. Alternative definitions are possible, for instance based on different mutual *C*_{α} distances (normally in the 7–12 Å range), on *C*_{β}–*C*_{β} atom distances (normally 6.5–8 Å), or on the minimal distance between two atoms belonging to the side-chain or backbone of the two residues (commonly 4.5 Å).
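
Under this definition, a contact map can be computed directly from the *C*_{α} coordinates. A minimal NumPy sketch (the function name and the toy coordinates are ours, not the paper's):

```python
import numpy as np

def contact_map(ca_coords, threshold=8.0):
    """Binary contact map from an (N, 3) array of C-alpha coordinates.

    C[i, j] = 1 when the Euclidean distance between the C-alpha atoms of
    residues i and j is below `threshold` (in Angstroms), 0 otherwise.
    The result is symmetric by construction.
    """
    ca = np.asarray(ca_coords, dtype=float)
    diff = ca[:, None, :] - ca[None, :, :]    # pairwise displacement vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # N x N distance matrix
    return (dist < threshold).astype(int)

# Toy chain: three residues spaced 5 A apart along the x axis.
ca = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
C = contact_map(ca, threshold=8.0)
# Residues 0-1 and 1-2 are in contact (5 A < 8 A); 0-2 are not (10 A).
```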

Let λ(*C*) = {λ : *Cx* = λ*x*} be the spectrum of *C*, *X*_{λ} = {*x* : *Cx* = λ*x*} the corresponding eigenspace, and λ̄ = max{λ ∈ λ(*C*)} the largest eigenvalue of *C*. The principal eigenvector of *C*, *x̄*, is the eigenvector corresponding to λ̄. *x̄* can also be expressed as the argument which maximises the Rayleigh quotient:

*x̄* = argmax_{*x* ≠ 0} (*x*^{T}*Cx*) / (*x*^{T}*x*)

Eigenvectors are usually normalised by requiring their norm to be 1, e.g. ||*x*||_{2} = 1 ∀*x* ∈ *X*_{λ}. Since *C* is an adjacency (real, symmetric) matrix, its eigenvalues are real. Since it is a normal matrix (*A*^{H}*A* = *AA*^{H}), its eigenvectors are orthogonal. Other basic properties can also be proven: the principal eigenvalue λ̄ is positive, and the non-zero components of *x̄* all have the same sign [29]. Without loss of generality, we can assume they are positive, as in [18].
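
These properties can be checked numerically. A sketch using NumPy's `eigh`, which assumes a real symmetric input, as a contact map is; the toy matrix and function name are ours:

```python
import numpy as np

def principal_eigenpair(C):
    """Largest eigenvalue and corresponding unit-norm eigenvector of a
    symmetric contact map, with components flipped to be non-negative
    (Perron-Frobenius: the components of the PE share one sign)."""
    vals, vecs = np.linalg.eigh(C)   # eigh sorts eigenvalues in ascending order
    lam, x = vals[-1], vecs[:, -1]
    if x.sum() < 0:                  # choose the all-positive representative
        x = -x
    return lam, x

# Contact map of a short chain where consecutive residues are in contact.
C = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
lam, x = principal_eigenpair(C)
```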

Ideally, prediction of the PE should be formulated as a sequential regression task in which each amino acid is mapped into its corresponding component of the PE. Here we consider two variations to the original problem. First, we model it as a classification task with multiple classes. Second, we predict the magnified eigenvector, i.e. λ̄*x̄* instead of *x̄*. Modelling regression problems as multi-class classification is common practice, for instance in closely related tasks such as the prediction of protein solvent accessibility [26, 30, 31]. Predicting magnified eigenvector components has some advantages: by doing so we simultaneously estimate the eigenvector components and the corresponding eigenvalue (since the norm of *x̄* is equal to 1, ||λ̄*x̄*||_{2} = λ̄); moreover, the main eigenvalue correlates well with the protein length (correlation of 0.62 on our training set), hence it is likely predictable. An estimate of the eigenvalue will in general be needed when attempting to predict contact maps from the PE, either by using an algorithm similar to the one in [18], or more generally by attempting to satisfy the constraint:

*C* *x̄* = λ̄ *x̄*

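Because the PE has unit norm, a predicted magnified eigenvector carries the eigenvalue as its Euclidean norm. A minimal NumPy sketch (the toy matrix and variable names are ours):

```python
import numpy as np

# Toy contact map: a 3-residue chain with consecutive residues in contact.
C = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)

vals, vecs = np.linalg.eigh(C)             # eigenvalues in ascending order
lam = vals[-1]                             # principal eigenvalue
x = np.abs(vecs[:, -1])                    # unit-norm PE, components made positive

magnified = lam * x                        # the target the classifier discretises
lam_recovered = np.linalg.norm(magnified)  # ||lam * x||_2 = lam since ||x||_2 = 1
```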
Formally, the PE prediction task consists in learning a mapping *f*(·) : ℐ → 𝒪 from the space ℐ of labelled input sequences to the space 𝒪 of labelled output sequences. In practice, we want to predict a sequence of labels *O* = (*o*_{1}, ..., *o*_{N}) for a given sequence of inputs *I* = (*i*_{1}, ..., *i*_{N}), where each *i*_{j} ∈ *I* is the input coding of the amino acid in position *j*. For PE prediction, we assume that there is a range *R* including all magnified eigenvector components, i.e. ∀*j*, (λ̄*x̄*)_{j} ∈ *R*, and we divide the range *R* into a series of *m* disjoint intervals, i.e. *R* = *R*_{1} ∪ ... ∪ *R*_{m} with *R*_{h} ∩ *R*_{k} = ∅ for *h* ≠ *k*. We can represent each output label *o*_{j} as belonging to an alphabet of *m* symbols, i.e. *o*_{j} ∈ Σ = {1, ..., *m*}, where *o*_{j} corresponds to the class or interval *R*_{k} in which the value of the *j*-th magnified eigenvector component falls: *o*_{j} = *k* ⇔ (λ̄*x̄*)_{j} ∈ *R*_{k}.

### Predictive architecture for the PE

To learn the mapping between our inputs *I* and outputs *O* we use a two-layered architecture composed of Bidirectional Recurrent Neural Networks (BRNNs) [22] (also known as 1D-RNNs, e.g. in [13]) of the same length as the amino acid sequence. Similarly to [21], we use BRNNs with *shortcut connections*. In these BRNNs, connections along the forward and backward hidden chains span more than 1-residue intervals, creating shorter paths between inputs and outputs. These networks take the form:

*o*_{j} = 𝒩^{(O)}(*i*_{j}, *h*^{F}_{j}, *h*^{B}_{j})

*h*^{F}_{j} = 𝒩^{(F)}(*i*_{j}, *h*^{F}_{j-1}, ..., *h*^{F}_{j-S})

*h*^{B}_{j} = 𝒩^{(B)}(*i*_{j}, *h*^{B}_{j+1}, ..., *h*^{B}_{j+S})

where *h*^{F}_{j} and *h*^{B}_{j} are forward and backward chains of hidden vectors, *S* is the longest shortcut span, and *h*^{F}_{j} = 0 for *j* ≤ 0, *h*^{B}_{j} = 0 for *j* > *N*. We parametrise the output update, forward update and backward update functions (respectively 𝒩^{(O)}, 𝒩^{(F)} and 𝒩^{(B)}) using three two-layered feed-forward neural networks. In our tests the input associated with the *j*-th residue, *i*_{j}, contains amino acid information, secondary structure information, and the hydrophobicity interaction values described in [20]. Amino acid information is obtained from multiple sequence alignments of the protein sequence to its homologues, to leverage evolutionary information. Amino acids are coded as letters out of an alphabet of 25: besides the 20 standard amino acids, B (aspartic acid or asparagine), U (selenocysteine), X (unknown), Z (glutamic acid or glutamine) and the gap symbol are considered. The input presented to the networks is the frequency of each of the 24 non-gap symbols, plus the overall frequency of gaps in each column of the alignment. That is, if *n*_{jk} is the total number of occurrences of symbol *j* in column *k*, and *g*_{k} the number of gaps in the same column, the *j*-th input to the networks in position *k* is:

*n*_{jk} / Σ_{h=1..24} *n*_{hk}

for *j* = 1...24, while the 25th input is:

*g*_{k} / (*g*_{k} + Σ_{h=1..24} *n*_{hk})

This input coding scheme is richer than simple 20-letter schemes and proved effective in [21]. The secondary structure part of the input is encoded using a three-letter scheme (helix, strand, coil). We adopt both true secondary structures and secondary structures predicted by Porter [21]. When using predicted secondary structure, we carefully design our tests so that no sequence used for testing the PE prediction system is similar to sequences in Porter's training set. A single real-valued input is used to encode hydrophobicity interaction values; in [20] an optimised version of the hydrophobicity scale is shown to be highly correlated with the PE. A further unit is used to encode the protein length (normalised by a factor of 0.001).
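
The 25-input column coding just described can be sketched as follows. The ordering of the 24 symbols and the use of `-` as the gap character are our assumptions; the text fixes only the alphabet itself:

```python
import numpy as np

SYMBOLS = "ACDEFGHIKLMNPQRSTVWYBUXZ"   # 24 non-gap symbols; '-' marks a gap

def column_inputs(column):
    """25 inputs for one alignment column: the frequency of each non-gap
    symbol among the non-gap entries, plus the overall gap frequency."""
    n = np.array([column.count(s) for s in SYMBOLS], dtype=float)
    gaps = column.count("-")
    non_gap = n.sum()
    freqs = n / non_gap if non_gap else n   # avoid dividing by zero
    gap_freq = gaps / (gaps + non_gap)
    return np.concatenate([freqs, [gap_freq]])

# Column of a 4-sequence alignment: two alanines, one cysteine, one gap.
x = column_inputs("AAC-")
```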

Based on this encoding, a total of 30 units are used to represent each residue.
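
A forward pass of a BRNN with shortcut connections can be sketched as follows. We substitute single-layer tanh updates for the two-layered feed-forward networks, and the weights are random and untrained, so the sketch illustrates only the information flow, not the trained predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

def brnn_forward(inputs, hidden=8, classes=4, shortcut=3):
    """BRNN forward pass where each hidden state sees the previous
    `shortcut` states of its chain (shortcut connections)."""
    N, d = inputs.shape
    Wf = rng.normal(scale=0.1, size=(hidden, d + shortcut * hidden))
    Wb = rng.normal(scale=0.1, size=(hidden, d + shortcut * hidden))
    Wo = rng.normal(scale=0.1, size=(classes, d + 2 * hidden))

    hf = np.zeros((N + shortcut, hidden))        # forward chain, zero-padded
    hb = np.zeros((N + shortcut, hidden))        # backward chain, zero-padded
    for j in range(N):                           # forward sweep
        ctx = hf[j:j + shortcut].ravel()         # states h_{j-S}..h_{j-1}
        hf[j + shortcut] = np.tanh(Wf @ np.concatenate([inputs[j], ctx]))
    for j in range(N - 1, -1, -1):               # backward sweep
        ctx = hb[j + 1:j + 1 + shortcut].ravel() # states h_{j+1}..h_{j+S}
        hb[j] = np.tanh(Wb @ np.concatenate([inputs[j], ctx]))
    out = np.empty((N, classes))
    for j in range(N):
        z = Wo @ np.concatenate([inputs[j], hf[j + shortcut], hb[j]])
        e = np.exp(z - z.max())
        out[j] = e / e.sum()                     # softmax over the m classes
    return out

O = brnn_forward(rng.normal(size=(20, 30)))      # 20 residues, 30 inputs each
```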

We adopt a second, filtering BRNN, similarly to [21]. This network is trained to predict the PE given the first-layer PE predictions. The input to this second network in position *j* includes the first-layer predictions in position *j*, augmented by first-stage predictions averaged over multiple contiguous windows. That is, if *c*_{j1}, ..., *c*_{jm} are the outputs in position *j* of the first-stage network, with *c*_{jh} the estimated probability of the *j*-th eigenvector component being in class *h*, the input to the second-stage network in position *j* is the array *I*_{j}:

*I*_{j} = (*c*_{j1}, ..., *c*_{jm}, *a*_{-p,1}, ..., *a*_{-p,m}, ..., *a*_{p,1}, ..., *a*_{p,m}), with *a*_{f,h} = (1/(2*w*+1)) Σ_{q=k_f-w..k_f+w} *c*_{qh}

where *k*_{f} = *j* + *f*(2*w* + 1), 2*w* + 1 is the size of the window over which first-stage predictions are averaged, and 2*p* + 1 is the number of windows considered. In the tests we use *w* = 7 and *p* = 7. This means that 15 contiguous, non-overlapping windows of 15 residues each are considered, i.e. first-stage outputs between positions *j* − 112 and *j* + 112, for a total of 225 contiguous residues, are taken into account to generate the input to the filtering network in position *j*. This input contains a total of 16*m* real numbers: *m* representing the *m*-class output of the first stage in position *j*, and 15*m* representing the *m*-class outputs of the first stage *averaged* over each of the 15 windows.

Five two-stage BRNN models are trained independently and ensemble-averaged to build the final predictor. Differences among the models are introduced by two factors: stochastic elements in the training protocol, such as different initial network weights and different shuffling of the examples; and different architectures and numbers of free parameters. Averaging the 5 models' outputs leads to classification performance improvements between 1% and 1.5% over single models. In [32] a slight improvement in secondary structure prediction accuracy was obtained by "brute ensembling" of several tens of different models trained independently. Here we adopt a less expensive technique: a copy of each of the 5 models is saved at regular intervals (100 epochs) during training. Stochastic elements in the training protocol (similar to that described in [33]) guarantee that the differences between these copies are non-trivial. When an ensemble of 9 such copies of each of the 5 models is used (45 models in total), we obtain a further slight improvement over the ensemble of 5 models.

### Predictive architecture for contact maps

We build a system for the prediction of contact maps based on 2D-RNN, described in [4] and [13]. This is a family of adaptive models for mapping two-dimensional matrices of variable size into matrices of the same size. 2D-RNN-based models were among the most successful contact map predictors at the CASP5 competition [27].

As in the PE prediction case, we use 2D-RNNs with *shortcut connections*, i.e. where lateral memory connections span *S*-residue intervals, with *S* > 1. If *o*_{j,k} is the entry in the *j*-th row and *k*-th column of the output matrix (in our case, the estimated probability of residues *j* and *k* being in contact), and *i*_{j,k} is the input in the same position, the input-output mapping is modelled as:

*o*_{j,k} = 𝒩^{(O)}(*i*_{j,k}, *h*^{(1)}_{j,k}, *h*^{(2)}_{j,k}, *h*^{(3)}_{j,k}, *h*^{(4)}_{j,k})

*h*^{(n)}_{j,k} = 𝒩^{(n)}(*i*_{j,k}, *h*^{(n)}_{j∓1,k}, *h*^{(n)}_{j,k∓1})    for *n* = 1, ..., 4

where *h*^{(n)}_{j,k} for *n* = 1, ..., 4 are planes of hidden vectors transmitting contextual information from each corner of the matrix to the opposite corner; the signs of the index offsets depend on the corner the *n*-th plane starts from, and boundary hidden vectors are set to zero. We parametrise the output update and the four lateral update functions (respectively 𝒩^{(O)} and 𝒩^{(n)} for *n* = 1, ..., 4) using five two-layered feed-forward neural networks, as in [13].
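
A forward pass of such a model can be sketched in NumPy. As in the BRNN sketch, single-layer tanh updates stand in for the two-layered networks, shortcut connections are omitted for brevity, and the untrained random weights mean only the four-plane information flow is illustrated:

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn2d_forward(inputs, hidden=6):
    """Map an (N, M, d) input matrix to an (N, M) matrix of contact
    probabilities using four hidden planes, each propagating context
    from one corner of the matrix towards the opposite one."""
    N, M, d = inputs.shape
    Wn = [rng.normal(scale=0.1, size=(hidden, d + 2 * hidden)) for _ in range(4)]
    Wo = rng.normal(scale=0.1, size=d + 4 * hidden)
    steps = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # one direction per corner
    planes = []
    for (sj, sk), W in zip(steps, Wn):
        H = np.zeros((N + 2, M + 2, hidden))       # border cells stay zero
        rows = range(1, N + 1) if sj == 1 else range(N, 0, -1)
        cols = range(1, M + 1) if sk == 1 else range(M, 0, -1)
        for j in rows:
            for k in cols:                         # predecessors already done
                ctx = np.concatenate([inputs[j - 1, k - 1],
                                      H[j - sj, k], H[j, k - sk]])
                H[j, k] = np.tanh(W @ ctx)
        planes.append(H)
    out = np.empty((N, M))
    for j in range(N):
        for k in range(M):
            ctx = np.concatenate([inputs[j, k]] +
                                 [H[j + 1, k + 1] for H in planes])
            out[j, k] = 1.0 / (1.0 + np.exp(-Wo @ ctx))  # contact probability
    return out

P = rnn2d_forward(rng.normal(size=(8, 8, 5)))      # toy 8x8 map, 5 inputs/cell
```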

In our tests the input *i*_{j,k} contains amino acid information, secondary structure and solvent accessibility information, and PE information for the amino acids in positions *j* and *k* in the sequence. Amino acid information is again obtained from multiple sequence alignments.