Parallel Cascade Identification
PCI is a powerful system identification technique that aims to build a mimetic dynamic nonlinear model given the input and output data from a real system. Interested readers are directed to ref [12] for a detailed description of the algorithm. A PCI model consists of a parallel arrangement of cascade models. Training the ith cascade path begins with fitting the (R+S+1) taps of the FIR filter in the L component. These values are taken from cross-correlations between the input data, x [n], and the desired cascade output, yi [n], computed for offsets in the range [-S, +R]. If a higher order cross-correlation is computed, then a one-dimensional slice is used in combination with strategically placed Kronecker delta functions [12]. The output of the ith L component, ui [n], is computed via a convolution of the FIR filter and the input data. To overcome transient effects introduced by the convolution of the FIR filters with a protein chain input, each protein chain was first zero-padded by R residues at the start of each chain, and by S residues at the end of each chain, after each cross-correlation (to obtain the FIR filter) was computed [25]. The polynomial of order I forming the ith N component is then used to best-fit ui [n] to yi [n]. The residual, yi+1 [n], between the actual cascade output, zi [n], and the desired cascade output, yi [n], forms the desired output for training the next cascade path. For the case of a multi-input single-output (MISO) PCI model, as is used throughout this study, the cross-correlations to fit the FIR filter are computed between the desired cascade output and one or more inputs selected randomly with replacement [12].
Before accepting a new candidate cascade into the PCI model, a minimum mean squared error (MSE) reduction criterion may be imposed. The stringency of this test is controlled by the architectural parameter P and is related to a standard correlation test to help ensure that the model will not fit solely noise [12]. Training continues with the fitting of new cascade paths until either 1) a pre-determined maximum number of candidate cascades are consecutively rejected (a value of 150 is used in the present study), or 2) a maximum number of cascades, maxC, are identified and added to the model. During optimization, the value of maxC was set to 85, primarily to reduce the computational requirements of assessing each PCI parameter set. The value of maxC was increased to 500, without risk of overfitting, when training the final PCI classifiers over the antiTest dataset since the size of the training data increased in length to 449112 residues as compared to only 112685 residues for the optimization subset. Note that maxC is an upper limit, and that the actual number of cascades in a model may be much less due to the use of the MSE reduction criterion. In fact, the number of cascades in the final PCI-MSE sequence-to-structure binary models were E = 482, T = 295, H = 500 while the post-PCI consensus classifier which combined PCI and PSIPRED had fewer cascades (E = 205, T = 250, H = 447).
Secondary structure prediction is a 3-state problem. By using a multi-level output (e.g. H = 1, T = 0, E = -1), a single 3-state PCI classifier can be used to achieve this. Rather than using a single 3-state PCI classifier, we can instead create three binary sub-problems, as depicted in Figure 1. Here, each binary model is trained on a specialized version of the training data where the output has been set to 1 for one primary secondary structure state, and -1 for the other two states. For example, when training a binary_H classifier, the output data for the primary state, H, were mapped to 1, while the remaining states, E and T, were mapped to -1 [33]. Each binary classifier is therefore an expert in recognizing one of the three states. During testing, the MSE-test score from each binary classifier is computed. The MSE-test score is a measure of mean-squared error between the actual model output and the nominal model output for the class of interest, scaled by the variance observed during training. More details are provided in ref [21, 22]. A 3-input MIN decision function (see Figure 1) examines the three MSE-test scores treating them as distances: the state for which the MSE score is smallest is selected. A measure of confidence is calculated as follows:
where d1 and d2 are respectively the smallest and next-smallest MSE-test scores from all three binary classifiers.
Preparation of the datasets
A list of 3107 sequence-unique protein chains was downloaded from the EVA system on 2 May 2004. Proteins whose amino acid sequence was not known with certainty or whose secondary structure was not available were removed from the dataset. This filtering resulted in 2713 protein chains remaining, of which the average chain length was 204 residues, the minimum and maximum chain lengths were 11 and 1290 residues respectively, and the total number of amino acid residues was 554085. The dataset was then divided into 5 subsets: S1 (543 chains), S2 (543 chains), S3 (542 chains), S4 (542 chains), and S5 (543 chains). Position-specific scoring matrices (PSSM) were computed for each sequence using PSI-BLAST [11] run for 3 iterations with a E-score threshold of 0.001 as used by PSIPRED [6].
Due to the nature of the EVA system, any proteins added to the system after the date on which a protein chain list is downloaded from EVA are guaranteed not to be homologous to any proteins contained on that chain list [24]. As just stated, such a protein chain list was downloaded on 2 May 2004 and those data were used to develop the PCI-based classifiers described in this study. On 5 April 2007, at the end of the study, a list of 365 new protein chains added to the EVA system since 2 May 2004 was downloaded and was used to construct a final test dataset. Of the 365 newly added protein chains, a subset of 125 protein chains had been tested by EVA on their dates of deposition into the PDB [34] against a battery of 9 contemporary methods. Unfortunately, EVA had not run all 365 new proteins against all 9 methods. The subset of 125 protein chains was selected to form the final EVA test set since archived results were available from EVA for each chain for all 9 methods. This dataset provides a unique opportunity to directly compare PCI's performance with a number of leading methods in a fair and objective way. The final EVA test set totalled 12905 residues, with an average of 103 residues per chain, and exhibited minimum and maximum chain lengths of 30 and 644 residues respectively. Note that a number of protein chains had one or more unknown residues in their sequence. These chains were kept in the test set since PSI-BLAST is able to handle such residues and still produce meaningful PSSM data.
The final PCI-based classifiers and post-PCI consensus combinations of PCI and PSIPRED were trained on the complete dataset of all original EVA data downloaded in May of 2004 (i.e. subsets S1 through S5) for a total of 2713 protein chains. These methods were then evaluated using the final EVA test dataset which shared no significant sequence similarity with any of the 2713 training proteins. Detailed prediction results were downloaded and parsed from the EVA system for 9 contemporary methods over the same dataset of 125 protein chains. The following methods are included in the comparison: PHD [3], PHDpsi [4], PROF_king [5], PROFsec [3], PSIPRED [6], Sable [10], Sable2 [10], SCRATCH (SSPro3) [7], and SSPro4 [7] and represent methods using ANNs [3–7, 10], information theory [5], and LDA [5]. Note that it would appear that the EVA system is no longer being updated on a regular basis however it remains a unique resource of results over multiple methods for a large database of sequence-dissimilar proteins.
PSI-BLAST requires a database of protein sequences to search against. In this study, a local copy of the NCBI "nr" (non-redundant) database was made on 21 June 2004, containing 1,865,463 sequences totalling 619,299,334 residues. Prior to use, this database was filtered for unwanted low complexity or coiled-coil elements using the pfilt program written by David Jones as part of the PSIPRED (v.2.45) suite of programs [6]. Note that the SwissProt database is used on the live webserver version to reduce computational time. No significant or systematic change in performance is observed when the sequence database is changed.
Ground truth secondary structure assignments for each protein chain were obtained using the DSSP program [35]. In the current study, the eight DSSP output classes are mapped to three states as follows: H={H,G,I}, E={E,B}, T={T,S,-}, where '-' denotes 'other'. This conversion is recommended as being a conservative approach [2]. This resulted in 20% of residues assigned to class E (β-strands), 33% of residues assigned to class H (α-helices), and the remaining 47% of residues assigned to class T (loops, turns, or non-regular structure).
Measuring prediction accuracy
Prediction accuracy is often measured using the Q3 score which is defined as the percentage of all residues that were predicted to be in the correct secondary structure state. By using a correlation coefficient [26], a more relevant evaluation of prediction accuracy is achieved. Matthews' correlation coefficient [26] (CC) combines sensitivity and specificity into a single measure and is widely employed to measure prediction accuracy. One weakness of the Q3 score is that it considers all errors to have equal cost despite the fact that not all types of errors are equally detrimental to the usefulness of a secondary structure prediction [27, 28]. The output of secondary structure prediction systems are often used to guide methods of tertiary structure prediction. Errors that involve the misclassification of a strand as a helix, or vice-versa, are particularly damaging to the eventual accuracy of the tertiary structure prediction [2]. To reflect this fact, it is common to report not only the Q3 and Matthews' correlation coefficient, but also the BAD score [2] for each secondary structure prediction. The BAD score is defined as the percentage of all predictions in which a strand and a helix state were confused. Lastly, we also report the segment overlap (SOV) score reflecting the degree of overlap between predicted and observed structural segments as defined in ref. [27, 28].
Optimization of PCI parameters through genetic algorithms
Each potential parameter set (consisting of R, S, I, and P) was represented by a chromosome having four genes. These were: 1) total memory length (not counting lag 0), g1 = (R+S); 2) degree of anticipation, g2 = (S/(R+S)); 3) degree of polynomial, g3 = I; and 4) cascade acceptance criteria, g4 = P. Possible values of the g4 gene were limited to multiples of 5 since a more coarse-grained search over a wider range was found to be most suitable for the P parameter. In order to evaluate parameter sets, 3-fold cross-validation testing was performed over the S1 optimization subset of 543 protein chains (112685 residues total). The average Matthews' correlation coefficient observed over the three folds was used as the criterion function. The GA was run for 26 generations with a population size of 24 chromosomes. The mutation rate was set to 0.25 and Booker's variable cross-over rate was used [36]. Although the parameter set which gave the highest prediction accuracy over the optimization data set was ultimately selected, several parameter sets were identified which gave suitable prediction accuracy over the cross-validation subsets and PCI's accuracy was not highly sensitive to architectural parameter values.
The following method was used in order to optimize the architectural parameters of three binary 3-input structure-to-structure PCI-MSE classifiers (or post-PCI classifiers) depicted in Figure 2: Three binary sequence-to-structure PCI models, characterized by the parameters given in Table 1, were trained on the S1 subset and then applied to the S1 subset thereby providing three raw PCI output signals. During optimization of the structure-to-structure binary PCI models, these three raw output signals were applied to the three inputs of each binary post-PCI model.
Consensus combination of PCI with PSIPRED
In this study, PCI was used to build a consensus classifier which combined PCI-MSE and PSIPRED [6], a leading ANN-based prediction method [24], as shown in figure 3. A pre-trained copy of PSIPRED v2.45 was downloaded, compiled, and run locally such that the source of the PSSM data could be controlled. The online "live" version of PSIPRED (found at http://bioinf.cs.ucl.ac.uk/psipred/) makes use of a slightly different sequence database for generating the PSSM data than is used in this study, and its sequence database is also updated weekly [37]. When run locally, the PSIPRED program provides three output values for each residue, indicating the likelihood that each residue belongs to each of the three secondary structure classes. Each likelihood value fell in the range [0,1] and was transformed into a distance by subtracting each likelihood value from 1. A number of approaches to combining PCI and PSIPRED outputs were evaluated over the cross-validation subsets (S2, S3, and S4). The use of a 6-input PCI consensus classifier (i.e. a structure-to-structure cascaded PCI-MSE classifier) was identified as the most promising approach where the three PSIPRED distances were combined with three binary PCI model outputs. The so-called post-PCI classifier was characterized by the structure-to-structure PCI architectural parameters listed in Table 1 (i.e. optimization was not repeated). Note that a single 6-input 3-state post-PCI classifier may have been used in place of 3 binary 6-input PCI classifiers, but early testing on the cross-validation subsets showed inferior performance with this approach.