svmPRAT: SVM-based Protein Residue Annotation Toolkit

Background: Over the last decade, several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines, as they provide accurate and generalizable prediction models.
Results: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models.
Conclusions: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems.
Availability: http://www.cs.gmu.edu/~mlbio/svmprat


Background
Experimental methods to determine the structure and function of proteins have been outpaced by the abundance of available sequence data. As such, over the past decade several computational methods have been developed to characterize the structural and functional aspects of proteins from sequence information [1][2][3]. Support vector machines (SVMs) [4,5], along with other machine learning tools, have been extensively used to successfully predict the residue-wise structural or functional properties of proteins [6][7][8][9][10]. The task of assigning every residue a discrete class label or continuous value is defined as a residue annotation problem. Examples of structural annotation problems include secondary structure prediction [8,9,11], local structure prediction [12,13], and contact order prediction [14][15][16]. Examples of functional annotation problems include prediction of interacting residues [6] (e.g., DNA-binding residues and ligand-binding residues), solvent accessible surface area estimation [10,17], and disorder prediction [7,18].
We have developed a general purpose protein residue annotation toolkit called svmPRAT. This toolkit uses a support vector machine framework and is capable of predicting either a discrete label or a continuous value. To the best of our knowledge, svmPRAT is the first tool that is designed to allow life science researchers to quickly and efficiently train SVM-based models for annotating protein residues with any desired property. The protocol for training the models and predicting the residue-wise property is similar in nature to the methods developed for the different residue annotation problems [6][7][8][9][10].
svmPRAT can utilize any type of sequence information associated with residues. Features of the residue under consideration, as well as neighboring residues, are encoded as fixed length feature vectors. svmPRAT also employs a flexible sequence window encoding scheme that differentially weighs information extracted from neighboring residues based on their distance to the central residue. This flexibility is useful for problems in which distant neighbors matter mainly in aggregate; for residue-wise contact order estimation, for example, models with a coarser encoding of distant positions generally performed better in our experiments.
The svmPRAT implementation includes standard kernel functions (linear and radial basis functions) along with a second-order exponential kernel function shown to be effective for secondary structure prediction and pairwise local structure prediction [9,19]. The kernel functions implemented are also optimized for speed by utilizing fast vector-based operation routines within the CBLAS library [20]. svmPRAT is capable of learning two-level cascaded models that use predictions from the first-level model to train a second-level model. Such two-level models are effective in accounting for residue properties that depend on the properties of nearby residues (i.e., the functional or structural property is sequentially autocorrelated). This form of cascaded learning performs well for secondary structure prediction [9,17]. svmPRAT is made available as a pre-compiled binary on several different architectures and environments.
In this paper svmPRAT has been evaluated on a wide suite of prediction problems, which include solvent accessible surface area estimation [10,17], local structure alphabet prediction [12,13], transmembrane helix segment prediction [21], DNA-protein interaction site prediction [6], contact order estimation [15], and disordered region prediction [7,18]. svmPRAT has been used in the development of a transmembrane helix orientation prediction method called TOPTMH [22], shown to be one of the best performers on a blind independent benchmark [23]. The svmPRAT framework was also used for prediction of ligand-binding sites [24] and was initially prototyped for the YASSPP secondary structure program [9].
Support vector machines are a powerful tool for classification and regression tasks. However, adapting them to the particular case of protein sequence data can be onerous. svmPRAT is a tool that allows SVMs to be applied readily to sequence data by automating the encoding process and incorporating a number of different features that are specifically designed for the problem of protein residue annotation.

Implementation
svmPRAT approaches the protein residue annotation problem by utilizing local sequence information (provided by the user) around each residue in a support vector machine (SVM) framework [25,26]. svmPRAT uses the classification formulation to address the problem of annotating residues with discrete labels and the regression formulation for continuous values. The svmPRAT implementation utilizes the publicly available SVMlight program [27].
svmPRAT provides two main programs, one for learning annotation models (svmPRAT-L) and the other for predicting labels from learned models (svmPRAT-P). The svmPRAT-L program trains either a classification or regression model for solving the residue annotation problem. For classification problems, svmPRAT-L trains one-versus-rest binary classification models. When the number of unique class labels is two (e.g., disorder prediction), svmPRAT-L trains only one binary classification model to differentiate between the two classes. When the number of unique class labels is greater than two (e.g., three-state secondary structure prediction), svmPRAT-L trains one-versus-rest models for each of the classes, i.e., if there are K discrete class labels, svmPRAT-L trains K one-versus-rest classification models. For continuous value estimation problems (e.g., solvent accessible surface area estimation), svmPRAT-L trains a single support vector regression (ε-SVR) model.
The svmPRAT-P program assigns a discrete label or continuous value to each residue of the input sequences using the trained models produced by svmPRAT-L. In classification problems, svmPRAT-P uses the K one-versus-rest models to predict the likelihood of a residue being a member of each of the K classes. svmPRAT-P assigns the residue the label or class which has the highest likelihood value. For regression problems, svmPRAT-P estimates a continuous value for each residue.
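The label assignment step described above can be illustrated with a minimal sketch. It is not svmPRAT's code: the function name `predict_labels`, the dictionary layout, and the example scores are all hypothetical, and the raw scores are assumed to come from K already trained one-versus-rest models.

```python
def predict_labels(scores_per_class):
    """Assign each residue the class whose one-versus-rest model scores highest.

    scores_per_class maps a class label to a list of raw scores,
    one score per residue of the sequence (hypothetical layout).
    """
    labels = list(scores_per_class)
    n_residues = len(scores_per_class[labels[0]])
    return [
        max(labels, key=lambda c: scores_per_class[c][i])
        for i in range(n_residues)
    ]


# Made-up raw scores for a 4-residue sequence and three classes (H, E, C).
scores = {
    "H": [0.9, 0.2, -0.5, -1.0],
    "E": [-0.3, 0.4, 0.1, -0.2],
    "C": [-0.8, -0.1, 0.6, 0.3],
}
predicted = predict_labels(scores)
print(predicted)
```

With two classes, a single binary model suffices and the sign of its score plays the role of the maximum taken here.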

Input Information
The input to svmPRAT consists of two types of information. Firstly, to train the prediction models, true annotations are provided to svmPRAT-L. For every input sequence used for training, a separate file is provided. Each line of the file contains an alphanumeric class label or a continuous value, i.e., the true annotation for one residue of the sequence.
Secondly, svmPRAT can accept any general user-supplied features for prediction. For a protein, svmPRAT accepts any information in the form of feature matrices. Both svmPRAT-L and svmPRAT-P accept these input feature matrices. svmPRAT-L uses these feature matrices in conjunction with the true annotation files to learn predictive models, whereas svmPRAT-P uses the input feature matrices with a model to make predictions for the residues.
A feature matrix F for a protein sequence X is of dimensions n × d, where n is the length of the protein sequence and d is the number of features or values associated with each position of the sequence. As an example, Figure 1(a) shows the PSI-BLAST derived position specific scoring matrix (PSSM) of dimensions n × 20. For every residue, the PSSM captures evolutionary conservation information by providing a score for each of the twenty amino acids. Other examples of feature matrices include the predicted secondary structure matrices and position independent scoring matrices.
We use $F_i$ to indicate the $i$th row of matrix $F$, which corresponds to the features associated with the $i$th residue of $X$. svmPRAT can accept multiple types of feature matrices per sequence. When multiple types of features are considered, the $l$th feature matrix is specified by $F^l$.

Information Encoding
When annotating a particular residue, svmPRAT uses features of that residue as well as information about neighboring residues. Window encoding, also called wmer encoding, is employed to accomplish this. For a sequence $X$ of length $n$, we use $x_i$ to denote the $i$th residue of the sequence. Given a user-supplied width $w$, the wmer at position $i$ of $X$ ($w < i \le n - w$) is defined to be the $(2w + 1)$-length subsequence of $X$ centered at position $i$. That is, the $w$ residues immediately before and after $x_i$ are part of wmer($x_i$). The feature vectors of residues in this window, $F_{i-w}, \ldots, F_{i+w}$, are concatenated to produce the final vector representation of residue $x_i$. If each residue has $d$ features associated with it, the wmer encoding vector has length $(2w + 1) \times d$ and is referred to as wmer($F_i$).

Kernel Functions
svmPRAT implements several kernel functions to capture similarity between pairs of wmers. Selection of an appropriate kernel function for a problem is key to the effectiveness of support vector machine learning.
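Because every kernel in the following sections operates on wmer encodings, a small sketch of that encoding may be helpful. The function name `wmer_encode` is hypothetical and indices are 0-based. The text defines wmers only for positions with a full window, so the zero-padding used here near the sequence ends is an assumption of this sketch rather than documented svmPRAT behavior.

```python
def wmer_encode(F, i, w):
    """Concatenate rows F[i-w] .. F[i+w] into one vector of length (2w+1)*d.

    F is an n x d feature matrix given as a list of rows; i is 0-based.
    Rows that would fall outside the sequence are replaced by zeros
    (an assumption; the source does not specify boundary handling).
    """
    n, d = len(F), len(F[0])
    vec = []
    for k in range(i - w, i + w + 1):
        vec.extend(F[k] if 0 <= k < n else [0.0] * d)
    return vec


# Toy 4 x 2 feature matrix (n = 4 residues, d = 2 features per residue).
F = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
center = wmer_encode(F, 1, 1)  # full window around the second residue
edge = wmer_encode(F, 0, 1)    # window hanging off the start of the sequence
print(center, edge)
```

Each encoded vector has length (2w + 1) × d, here 3 × 2 = 6, regardless of where the residue sits in the sequence.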

Linear Kernels
Given a pair of wmers, wmer($x_i$) and wmer($y_j$), a linear kernel function can be defined between their feature matrices wmer($F_i$) and wmer($G_j$), respectively, as

$$K_w(x_i, y_j) = \sum_{k=-w}^{w} \langle F_{i+k}, G_{j+k} \rangle \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot-product operation between two vectors.
Some problems may require only approximate information for residue neighbors that are far away from the central residue, while nearby residue neighbors are more important. For example, the secondary structure state of a residue is in general more dependent on the nearby sequence positions than on the positions that are further away [28]. svmPRAT allows a window encoding, shown in Figure 1(d), where the positions away from the central residue are aggregated to provide a coarser representation while the positions closer to the central residue provide a finer representation. This two-parameter linear window kernel is denoted $W_{w,f}$ and computes the similarity between features wmer($F_i$) and wmer($G_j$) as

$$W_{w,f}(x_i, y_j) = \sum_{k=-f}^{f} \langle F_{i+k}, G_{j+k} \rangle + \left\langle \sum_{k=f+1}^{w} F_{i+k}, \sum_{k=f+1}^{w} G_{j+k} \right\rangle + \left\langle \sum_{k=-w}^{-f-1} F_{i+k}, \sum_{k=-w}^{-f-1} G_{j+k} \right\rangle \qquad (2)$$

The parameter $w$ governs the size of the wmer considered in computing the kernel. Rows within $i \pm f$ contribute an individual dot product to the total similarity, while rows outside this range provide only aggregate information. In all cases, $f$ is less than or equal to $w$, and when $f$ equals $w$ the window kernel becomes a sum of the per-position dot products. This is the most fine-grained similarity measure considered and is equivalent to the one-parameter dot product kernel of Equation 1, which weighs all positions of the wmer equally. Thus, the two kernels $K_w$ and $W_{w,w}$ are equivalent.
Specifying $f$ to be less than $w$ merges neighbors distant from the central residue into only a coarse contribution to the overall similarity. For $f < w$, distant sequence neighbors are represented by only compositional information rather than by the specific positions where their features occur.
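The two-parameter window kernel can be sketched as follows. Rows within f of the center contribute individual dot products, and the rows beyond f on each side are summed before a single dot product is taken. Summation (rather than averaging) of the distant rows, zero-padding at the sequence ends, and all function names are assumptions of this sketch, not svmPRAT's actual implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def row(F, k):
    """Row k of F, or a zero row if k lies outside the sequence (assumed padding)."""
    return F[k] if 0 <= k < len(F) else [0.0] * len(F[0])


def summed(F, lo, hi):
    """Element-wise sum of rows lo..hi inclusive (zero vector if the range is empty)."""
    total = [0.0] * len(F[0])
    for k in range(lo, hi + 1):
        total = [t + x for t, x in zip(total, row(F, k))]
    return total


def window_kernel(F, i, G, j, w, f):
    """Similarity of wmers centered at F[i] and G[j] (0-based), with 0 <= f <= w."""
    assert 0 <= f <= w
    # Fine-grained part: one dot product per position within f of the center.
    fine = sum(dot(row(F, i + k), row(G, j + k)) for k in range(-f, f + 1))
    # Coarse parts: distant rows are pooled on each side before the dot product.
    right = dot(summed(F, i + f + 1, i + w), summed(G, j + f + 1, j + w))
    left = dot(summed(F, i - w, i - f - 1), summed(G, j - w, j - f - 1))
    return fine + right + left


F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
G = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
full = window_kernel(F, 2, G, 2, w=2, f=2)    # same as the plain dot product kernel
coarse = window_kernel(F, 2, G, 2, w=2, f=0)  # only the center is position-specific
print(full, coarse)
```

In this toy example the coarse setting scores the pair higher than the fine setting because the flanking rows share composition but not exact positions, which is the behavior the f < w case is meant to capture.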

Exponential Kernels
svmPRAT implements the standard radial basis kernel function (rbf), defined for some parameter $\gamma$ by

$$K_{rbf}(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right) \qquad (3)$$

svmPRAT also implements the normalized second order exponential (soe) kernel function, shown to better capture pairwise information and improve accuracy for the secondary structure and local structure prediction problems [9,19]. Given any base kernel function $K$, we define $K_2$ as

$$K_2(x, y) = K(x, y) + \left(K(x, y)\right)^2 \qquad (4)$$

which is a second-order kernel in that it computes pairwise interactions between the elements $x$ and $y$. We then define $K_{soe}$ as

$$K_{soe}(x, y) = \exp\left(1 + \frac{K_2(x, y)}{\sqrt{K_2(x, x)\, K_2(y, y)}}\right) \qquad (5)$$

which normalizes $K_2$ and embeds it into an exponential space.
By setting a specific $\gamma$ parameter value and using normalized unit length vectors in Equation 3, it can be shown that the standard rbf kernel is equivalent (up to a scaling factor) to a first order exponential kernel, which is obtained by replacing $K_2(x, y)$ with only the first-order term $K(x, y)$ in Equation 4 and plugging this modified $K_2(x, y)$ into the normalization framework of Equation 5.
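The exponential kernels and the stated relationship between the rbf kernel and a first order exponential kernel can be checked numerically. The sketch below uses a plain dot product as the base kernel and hypothetical function names. The value γ = 1/2 is one choice for which the identity holds for unit-length vectors (since ‖x − y‖² = 2 − 2⟨x, y⟩), with scaling factor e²; the source does not state which value it has in mind.

```python
import math


def dot(u, v):
    """Base kernel: plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def rbf(x, y, gamma):
    """Radial basis function kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def k2(base, x, y):
    """Second-order kernel: base kernel value plus its square."""
    k = base(x, y)
    return k + k * k


def soe(base, x, y):
    """Normalized second order exponential kernel.

    Undefined when k2(x, x) or k2(y, y) is zero (e.g., an all-zero vector).
    """
    norm = math.sqrt(k2(base, x, x) * k2(base, y, y))
    return math.exp(1.0 + k2(base, x, y) / norm)


def first_order_exp(base, x, y):
    """Same normalization as soe, but using only the first-order term."""
    norm = math.sqrt(base(x, x) * base(y, y))
    return math.exp(1.0 + base(x, y) / norm)


x = [0.6, 0.8]  # unit length
y = [1.0, 0.0]  # unit length
lhs = first_order_exp(dot, x, y)
rhs = math.exp(2.0) * rbf(x, y, gamma=0.5)
print(lhs, rhs, soe(dot, x, y))
```

Because of the normalization, `soe` returns e² for any vector compared with itself, which keeps kernel values on a common scale across wmers of different magnitude.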

Integrating Information
When multiple types of information are provided to svmPRAT in the form of different feature matrices, the kernel function and the per-residue information encoding remain the same for each of the feature matrices. The final kernel fusion is accomplished using a weighted linear combination across the original base kernels. The weights for feature matrices can be set by the user.
For example, we can use the fusion of second-order exponential kernels on different features of a protein sequence. Considering two sequences with $k$ sets of feature matrices $F^l$ and $G^l$ for $l = 1, \ldots, k$, our fusion kernel is defined as

$$K_{fusion}(x_i, y_j) = \sum_{l=1}^{k} \omega_l \, K_{soe}(F^l_i, G^l_j) \qquad (6)$$

where the weights $\omega_l$ are supplied by the user. In most cases, these weights can be set to be equal, but they may be altered according to domain-specific information.
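The weighted kernel fusion can be sketched in a few lines. Here each residue is represented by one already encoded vector per feature type, the base kernel is a plain dot product, and the names `fusion_kernel`, `xs`, `ys`, and `weights` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soe(x, y):
    """Normalized second order exponential kernel over a dot-product base."""
    def k2(a, b):
        k = dot(a, b)
        return k + k * k
    return math.exp(1.0 + k2(x, y) / math.sqrt(k2(x, x) * k2(y, y)))


def fusion_kernel(xs, ys, weights):
    """Weighted sum of per-feature-type soe kernels.

    xs[l] and ys[l] are the encoded vectors of the two residues for the
    l-th feature type; weights[l] is the user-supplied weight for that type.
    """
    return sum(w * soe(x, y) for w, x, y in zip(weights, xs, ys))


# Two feature types (say, a profile-derived and a structure-derived one).
xs = [[1.0, 0.0, 2.0], [0.5, 0.5]]
ys = [[0.0, 1.0, 2.0], [0.9, 0.1]]
equal = fusion_kernel(xs, ys, weights=[0.5, 0.5])
profile_heavy = fusion_kernel(xs, ys, weights=[0.8, 0.2])
print(equal, profile_heavy)
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the fused similarity can be passed to the SVM learner in the same way as any single kernel.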

Cascaded Models
Several prediction algorithms developed for secondary structure prediction, such as PHD [17], PSIPRED [11], and YASSPP [9], use a two-level cascaded prediction framework. This two-level framework trains two models, referred to as the $L_1$ and $L_2$ models, which are connected together in a cascaded fashion. Both the $L_1$ and $L_2$ models train K one-versus-rest binary classification models for predicting a discrete label or a single ε-SVR regression model for estimating a continuous value. The predictions from the first-level $L_1$ model are used as an input feature matrix along with the original features for training an $L_2$ model [9].
Such cascaded predictions can be accomplished within svmPRAT's framework in the following way. First, the entire training set is used to train an $L_1$ classification/regression model using the original input features. This is followed by an n-fold cross-validation step to generate predictions for the entire training set using the fold-specific trained $L_1$ models. In each iteration, 1/n-th of the dataset is set aside for prediction whereas the remainder of the dataset is used for training. The predictions from the $L_1$ model are then used as a new input feature along with the original features to train an $L_2$ model. The user may specify any desired weighting between the original features and the $L_1$ model predictions according to Equation 6. The final result is a cascaded prediction.
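The cascaded training procedure can be sketched as below. To keep the example self-contained, a trivial stand-in learner (projection onto the difference of class means) replaces the SVM, examples are treated as a flat list rather than grouped by protein, and the first-level prediction is appended as a single extra feature instead of being window-encoded and weighted. All of these simplifications, and every name in the code, are assumptions of this sketch.

```python
def train_stub(X, y):
    """Stand-in learner (NOT an SVM): score = projection onto the class-mean difference."""
    d = len(X[0])

    def mean(rows):
        return [sum(r[c] for r in rows) / len(rows) if rows else 0.0 for c in range(d)]

    m_pos = mean([x for x, t in zip(X, y) if t > 0])
    m_neg = mean([x for x, t in zip(X, y) if t <= 0])
    direction = [a - b for a, b in zip(m_pos, m_neg)]
    midpoint = [(a + b) / 2.0 for a, b in zip(m_pos, m_neg)]
    return lambda x: sum((xi - mi) * di for xi, mi, di in zip(x, midpoint, direction))


def train_cascade(X, y, n_folds, train=train_stub):
    """Train L1, obtain out-of-fold L1 predictions, then train L2 on augmented features."""
    n = len(X)
    l1_scores = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        kept = [i for i in range(n) if i % n_folds != fold]
        fold_model = train([X[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            # Each training example is scored by a model that never saw it.
            l1_scores[i] = fold_model(X[i])
    l1 = train(X, y)  # L1 model trained on the entire training set
    l2 = train([x + [s] for x, s in zip(X, l1_scores)], y)
    return lambda x: l2(x + [l1(x)])


X = [[0.0], [1.0], [0.1], [0.9], [0.2], [0.8]]
y = [-1, 1, -1, 1, -1, 1]
predict = train_cascade(X, y, n_folds=3)
print(predict([0.95]), predict([0.05]))
```

Using out-of-fold predictions as the second-level input avoids training the second-level model on first-level scores that are over-optimistic because the first-level model had already seen those examples.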

Efficient Implementation
The runtime performance of svmPRAT is tied to the speed of computing the kernel function values between pairs of wmers. All the implemented kernel functions have to compute a dot product between the vector representations.
svmPRAT optimizes the computation time for the dot product based kernel functions given by Equation 2 by using the optimized CBLAS (Basic Linear Algebra Subprograms) routines that are a part of the ATLAS library project [20]. The CBLAS routines provide the standard building blocks for performing vector-based and matrix-based computations. In particular, the efficient vector operations available through CBLAS are used within svmPRAT's kernel function implementations. This allows svmPRAT to train models and generate predictions for test cases quickly.
We ported the CBLAS routines to all the architectures on which svmPRAT was compiled and provide binaries compiled with and without the CBLAS routines (see the Availability Section).

Predictions Output
For classification problems, svmPRAT's prediction program produces two outputs in text files. For every residue, raw prediction scores from the one-versus-rest SVMs are reported. In addition, each residue is assigned a class based on the maximum prediction score of the models. For regression problems, the output is a text file containing the estimated value produced by the ε-SVR model.

Model Selection
svmPRAT provides an evaluation program called svmPRAT-E that allows the practitioner to determine the best set of parameters for a particular prediction problem using cross validation. For ease of use, a simple Perl script is provided which invokes svmPRAT-E for a fixed set of parameters to determine the best kernel and window lengths.

Results
svmPRAT has been used in two previous experimental settings with success. TOPTMH is a transmembrane-helix segment identification and orientation system which utilizes svmPRAT [22]. It has achieved the best performance on a static independent benchmark [23]. The work by Kauffman et al. used svmPRAT to predict the ligand-binding residues of a protein [24]. This was shown to improve the quality of homology models of the protein's binding site.
In this work, we illustrate the capabilities of svmPRAT on a wide range of prediction problems. These case studies illustrate the effectiveness and generality of the software for sequence annotation problems. Problems involving disordered regions, DNA-protein interaction sites, residue contact order, and general local structure class are covered in the subsequent sections. Table 1 shows some characteristics of the datasets used in each problem and the reference work from which the data was derived.

Disorder Prediction
Some proteins contain regions which are intrinsically disordered in that their backbone shape may vary greatly over time and external conditions. A disordered region of a protein may have multiple binding partners and hence can take part in multiple biochemical processes in the cell, which makes such regions critical in performing various functions [29]. Disorder prediction is an example of a binary classification problem for sequence data. Disordered region prediction methods like IUPred [30], Poodle [18], and DISPro [7] make predictions using physicochemical properties of the amino acids or evolutionary information within a machine learning tool like bi-recurrent neural networks or SVMs.
svmPRAT was used to discriminate between residues belonging to ordered versus disordered regions. We assessed the value of several feature sets on this problem as an illustration of how svmPRAT may combine sequence information. The feature sets were PSI-BLAST PSSMs (P), BLOSUM62 sequence features (ℬ), and predicted secondary structure (S). See the Materials Section for an explanation of the different input features.
The parameters $w$ and $f$ of the base window kernel ($W$) were varied to demonstrate their effects on prediction performance. Finally, linear (lin), radial basis function (rbf), and second order exponential (soe) kernels were all used to show how the similarity computation in $W$ may be further processed to improve performance. Table 2 shows the classification performance of various svmPRAT models on the disorder prediction problem. To notate the models, we use the features as the main-level text and the kernel as the superscript (e.g., $PS^{soe}$ uses PSSMs and secondary structure in the second order exponential kernel). ROC and $F_1$ scores are reported for ten-fold cross validation, which was the experimental protocol used to benchmark DISPro [7]. Comparing the ROC performance of the $P^{soe}$, $P^{rbf}$, and $P^{lin}$ models across different values of $w$ and $f$, we observe that the soe kernel shows superior performance to the lin kernel and slightly better performance compared to the normalized rbf kernel used in this study. This is in agreement with the results of our previous studies for predicting secondary structure [9] and predicting RMSD between subsequence pairs [19], where the soe kernel outperformed the rbf kernel.

Table 1 (notes): #C, #Seq, #Res, #CV, and % denote the number of classes, sequences, residues, number of cross validation folds, and the maximum pairwise sequence identity between the sequences, respectively. 8 represents the regression problem.

Table 2 (notes): DISPro [7] reports a ROC score of 0.878. The numbers in bold show the best models for a fixed $w$ parameter, as measured by ROC. P, ℬ, and S represent the PSI-BLAST profile, BLOSUM62, and YASSPP scoring matrices, respectively. soe, rbf, and lin represent the three different kernels studied using $W_{w,f}$ as the base kernel. * denotes the best classification results in the sub-tables, and ** denotes the best classification results achieved on this dataset. For the best model we report a $Q_2$ accuracy of 84.60% with an se rate of 0.33.

The performance of svmPRAT on the disorder prediction problem improved by using the P, ℬ, and S feature matrices in combination rather than individually. Table 2 shows results for the successive use of $P$, $PS$, and $PSℬ$ features in the soe kernel: the additional features tend to improve performance. The flexible encoding introduced by svmPRAT shows some merit for the disorder prediction problem. The best performing fusion kernel shows comparable performance to DISPro [7], which encapsulates profile, secondary structure, and relative solvent accessibility information within a bi-recurrent neural network.

Runtime Performance of Optimized Kernels
We benchmarked the learning phase of svmPRAT on the disorder dataset, comparing the runtime performance of the program compiled with and without the CBLAS subroutines. These results are reported in Table 3 and were computed on a 64-bit Intel Xeon CPU 2.33 GHz processor for the $P^{lin}$, $P^{rbf}$, and $P^{soe}$ kernels, varying the wmer size from 11 to 15. Table 3 also shows the number of kernel evaluations for the different models. Using CBLAS, speedups ranging from 1.7 to 2.3 are achieved for disorder prediction. Similar speedups were noted for other prediction problems.
Disorder Prediction at CASP8: CASP is a biennial protein structure prediction competition which includes a disorder prediction category (Competition Website: http://predictioncenter.org). We submitted predictions of disordered residues to CASP8, the latest iteration of the competition. Our MARINER server (group 450) used svmPRAT as the backend prediction tool. CASP8 featured 125 target proteins with 27,775 residues, of which 11.2% were disordered residues.
The svmPRAT model employed for CASP8 was trained using profile information embedded within the soe kernel with a wmer size of 9. Table 4 gives the top performers from the disorder prediction category of CASP8. svmPRAT showed encouraging results compared to methods that are fine-tuned for disorder prediction.
The blind evaluation done in CASP8 demonstrates the ability of svmPRAT to adapt readily to different prediction problems.
The results from the CASP assessors were published recently [31] and show that the top performers based on a weighted accuracy are consensus-based methods.
Poodle is an SVM-based approach that uses two sets of cascaded classifiers trained separately for long and short disordered regions. svmPRAT can easily train two separate cascaded models for long and short disordered regions and thus incorporate the domain insight introduced by Poodle in an efficient and quick manner.

Table 3 (notes): The runtime performance of svmPRAT was benchmarked for learning a classification model on a 64-bit Intel Xeon CPU 2.33 GHz processor. #KER denotes the number of kernel evaluations for training the SVM model. NO denotes the runtime in seconds when the CBLAS library was not used, YES denotes the runtime in seconds when the CBLAS library was used, and SP denotes the speedup achieved using the CBLAS library.

Contact Order Prediction
Pairs of residues are considered to be in contact if their Cβ atoms are within a threshold radius, generally 12 Å. Residue-wise contact order [15] is defined as the average distance separation between contacting residues within a sphere of set threshold. Contact order prediction is an example of a regression problem for sequence data: the value to be predicted is a positive integer rather than a class. To predict contact order, Song and Burrage [15] used support vector regression with a variety of sequence features including PSI-BLAST profiles, predicted secondary structure from PSIPRED [11], amino acid composition, and molecular weight. Critical random networks have also been applied to solve the problem [16].
We used svmPRAT to train ε-SVR regression models for residue-wise contact order estimation. PSSM and predicted secondary structure, P and S respectively, were used as features in the soe kernel. The window kernel parameters $w$ and $f$ were varied again to study their impact. Evaluation was carried out using 15-fold cross validation on the dataset of Song and Burrage [15]. Table 5 shows the average per-protein correlation coefficient (CC) and RMSE values of svmPRAT models. The best performing model used a fusion of the P and S feature matrices and improves CC by 21% and RMSE by 17% over the ε-SVR technique of Song and Burrage [15]. Their method used the standard rbf kernel with similar local sequence-derived amino acid and predicted secondary structure features.
The major improvement of our method can be attributed to our fusion-based kernel setting with efficient encoding and the normalization introduced by the second order exponential kernel (Equation 5). For the window kernel parameters, we observe that models trained with $f < w$ generally show better CC and RMSE values for residue-wise contact order prediction.

Protein-DNA Interaction Site Prediction
When it is known that the function of a protein is to bind to DNA, it is highly desirable from an experimental point of view to know which parts of the protein are involved in the binding process. Interaction is typically defined in terms of contacts between the protein and DNA in their co-crystallized structure: residues within a distance threshold of the DNA are considered interacting while the remaining residues are considered non-interacting. This is another example of a binary classification problem for sequence data. Several researchers have presented methods to identify DNA-binding residues. DISIS [6] uses support vector machines and a radial basis function kernel with PSSMs, predicted secondary structure, and predicted solvent accessibility as input features while Ahmad and Sarai employ a neural network method with PSSMs as input [32].
svmPRAT was used to train binary classification models on the DISIS dataset [6]. Following DISIS, we performed 3-fold cross validation on our models, ensuring that the sequence identity between the different folds was less than 40%. During the experiments, we found that window kernels with $w = f$ performed the best and therefore omit other values for the parameters. Table 6 gives the performance of svmPRAT models on DNA interaction site prediction. The model obtained by combining the P and S features gives a raw $Q_2$ accuracy of 83%. DISIS uses a two-level approach to solve this problem. The first level, which uses SVM learning with profiles, predicted secondary structure, and predicted solvent accessibility as inputs, gives $Q_2$ = 83%, to which our approach compares favorably. DISIS further smooths this initial prediction using a rule-based approach that improves accuracy. We have not yet explored this type of rule-based approach.

Local Structure Alphabet Prediction
The notion of local, recurring substructure in proteins has existed for many years, primarily in the form of the secondary structure classifications. Many local structure alphabets have been generated by careful manual analysis of structures, such as the DSSP alphabet [33]. More recently, local structure alphabets have been derived through purely computational means. One such example is the Protein Blocks of de Brevern et al. [13], which were constructed through the use of self-organizing maps. The method uses residue dihedral angles during clustering and attempts to account for order dependence between local structure elements, which should improve predictability.
We chose to use the Protein Blocks [13] as our target alphabet as it was found to be one of the best local structure alphabets according to conservation and predictability [12]. There are sixteen members in this alphabet, which significantly increases prediction difficulty over traditional three-state secondary structure prediction.
We used a dataset consisting of 1600 proteins derived from the SCOP database version 1.57, classes A to E [34]. This dataset was previously used for learning profile-profile alignment scoring functions using neural networks [35]. To compute the true annotations, we used the three-dimensional structures associated with the proteins to assign each residue one of the Protein Blocks.
We used a small subset of the 1600 proteins to tune the $w$ and $f$ windowing parameters with the soe kernel. We found that $w = f$ worked well on the subset and subsequently restricted the large-scale experiments to this case. Three-fold cross validation was done on all 1600 proteins for each parameter set and for both the soe and rbf kernels. Table 7 reports the classification accuracy in terms of the $Q_{16}$ accuracy and average ROC scores for different members of the Protein Blocks.
From Table 7 we see that the soe kernel provides a small performance boost over the rbf kernel. The addition of predicted secondary structure information from YASSPP (S features) improves the $Q_{16}$ performance, as would be expected for local structure prediction. Our $Q_{16}$ results are very encouraging, since they are approximately 67%, whereas the prediction accuracy for a random predictor would be only 6.25%. Competitive methods for predicting Protein Blocks from sequence reported a $Q_{16}$ accuracy of 40.7% in [36] and 57.9% in [12].

Datasets
Our empirical evaluations are performed for different sequence annotation problems on previously defined datasets. Table 1 presents information regarding the source and key features of the different datasets used in our cross validation and comparative studies. We ensured that the pairwise sequence identities within the different datasets were less than 40%.
We utilized cross validation as our primary evaluation protocol. In n-fold cross validation, data are split into n sets. One of the n sets is left out while the others are used to train a model. The left out data are then predicted and the performance is noted. This process repeats with a different set left out until all n sets have been left out once. The average performance over all n-folds is reported. Where possible, we used the same splits of data as have been used in previous studies to improve the comparability of our results to earlier work.
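The n-fold protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not svmPRAT's implementation: the function names, the round-robin fold assignment, and the caller-supplied `train_fn` and `score_fn` are all assumptions made for the example.

```python
from statistics import mean


def n_fold_splits(num_items, n):
    """Yield (train_indices, test_indices) once for each of the n folds."""
    indices = list(range(num_items))
    # Round-robin assignment (an assumption): fold k holds items k, k+n, k+2n, ...
    folds = [indices[k::n] for k in range(n)]
    for k in range(n):
        test = folds[k]
        # Every item that is not in the left-out fold is used for training.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def cross_validate(items, n, train_fn, score_fn):
    """Return the average score over n folds.

    train_fn(training_items) -> model and score_fn(model, test_items) -> float
    are hypothetical callables supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in n_fold_splits(len(items), n):
        model = train_fn([items[i] for i in train_idx])
        scores.append(score_fn(model, [items[i] for i in test_idx]))
    return mean(scores)
```

Here `items` stands for whatever unit is being split. Reusing the splits of an earlier study, as the text describes, would mean replacing `n_fold_splits` with those fixed index lists.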

Evaluation Metrics
We measure the quality of the classification methods using receiver operating characteristic (ROC) scores. The ROC score is the area under the curve that plots the fraction of true positives against the fraction of false positives for different classification thresholds [37]. In all experiments, the ROC score reported is averaged over the n folds of cross validation. When the number of classes is larger than 2, we compute one-versus-rest ROC scores and report the average across all classes.
[Table caption: The numbers in bold show the best models for a fixed w parameter, as measured by ROC. P and S represent the PSI-BLAST profile and YASSPP scoring matrices, respectively. soe, rbf, and lin represent the three different kernels studied using W_{w,f} as the base kernel. * denotes the best classification results in the sub-tables, and ** denotes the best classification results achieved on this dataset. For the best model we report a Q_2 accuracy of 83.0% with an se rate of 0.34.]
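As an illustration of the ROC score and its one-versus-rest average, the sketch below computes the area under the ROC curve through its equivalent pairwise form: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function names and the dictionary layout of `class_scores` are assumptions for the example, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_score(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def one_vs_rest_roc(true_classes, class_scores):
    """Average ROC score over classes.

    class_scores[c] holds one score per example for class c
    (a hypothetical layout chosen for this sketch).
    """
    per_class = []
    for c, scores in class_scores.items():
        # Treat class c as positive and every other class as negative.
        binary = [1 if t == c else 0 for t in true_classes]
        per_class.append(roc_score(binary, scores))
    return sum(per_class) / len(per_class)
```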
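We also compute other standard statistics defined in terms of the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These standard statistics are the following:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
For K-way classification, performance is summarized by Q_K, defined as
$$Q_K = \frac{1}{N} \sum_{i=1}^{K} TP_i$$
where N is the total number of residues and TP_i is the number of true positives for class i. Q_K values are reported as percentages.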
The ROC score serves as a good quality measure in the case of unbalanced class sizes, where Q_K may be high simply by predicting the most frequent class. This is often true for binary classification problems with very few positive examples. In such cases, it is essential to observe the precision and recall values, which penalize the classifiers for under-prediction as well as over-prediction. The F_1 score is the harmonic mean of precision and recall, lies between 0 and 1, and is a good performance measure for different classification problems.
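The following sketch computes these statistics from lists of true and predicted labels. It uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F_1 as their harmonic mean, and Q_K as the fraction of residues whose class is predicted correctly). The function names, and the convention of returning 0 when a denominator is zero, are assumptions of this example.

```python
def binary_counts(true_labels, predicted_labels, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1; zero denominators yield 0.0 (a convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def q_k(true_labels, predicted_labels):
    """Fraction of residues assigned their true class (sum of TP_i over N)."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Unbalanced example: 5 positives among 100 residues, all predicted negative.
truth = [1] * 5 + [0] * 95
always_negative = [0] * 100
tp, fp, tn, fn = binary_counts(truth, always_negative)
print(q_k(truth, always_negative), precision_recall_f1(tp, fp, fn))
```

In the unbalanced example, always predicting the majority class yields a high Q_2 while recall and F_1 are zero, which is the situation the paragraph above warns about.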
Regression performance is assessed by the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between the predicted and observed values for every protein in the datasets. The CC statistic ranges from -1 to +1, with larger values being better, while RMSE is non-negative, with lower values implying better predictions. The results reported are averaged across the different proteins and cross validation folds.
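A minimal sketch of the two regression statistics, computed for a single protein from its predicted and observed residue-wise values, is given below. The function names are illustrative. Averaging over proteins and folds, as described above, would be a further mean over the per-protein results.

```python
from math import sqrt


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("CC is undefined when either sequence is constant")
    return cov / sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    n = len(predicted)
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
```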
For the best performing models, we also report the standard error, se, of the Q_K and CC scores, defined as
$$se = \frac{\sigma}{\sqrt{N}} \qquad (11)$$
where σ is the sample standard deviation and N is the number of data points. This statistic helps assess how much performance varies between proteins.
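The standard error can be computed directly from a list of per-protein scores. The short sketch below assumes that σ is the sample standard deviation with the usual N − 1 denominator, which is what Python's `statistics.stdev` computes. The function name is illustrative.

```python
from math import sqrt
from statistics import stdev


def standard_error(values):
    """Sample standard deviation divided by the square root of the count."""
    return stdev(values) / sqrt(len(values))


# Hypothetical per-protein scores, used only to exercise the function.
example_scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(example_scores))
```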

Input Information

Position Specific Scoring Matrices
For a sequence of length n, PSI-BLAST [38] generates a position-specific scoring matrix (PSSM) referred to as P.
The dimensionality of P is n × 20, where the 20 columns of the matrix correspond to the twenty amino acids. The profiles in this study were generated using the version of PSI-BLAST available in NCBI's 2.2.10 release of the BLAST package. PSI-BLAST was run as blastpgp -j 5 -e 0.01 -h 0.01 and searched against NCBI's NR database, which was downloaded in November of 2004 (2,171,938 sequences).
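For readers who want to script profile generation, the sketch below assembles a full command line around the options quoted above. Only `-j 5 -e 0.01 -h 0.01` come from the text. The `-i` (query), `-d` (database), and `-Q` (ASCII PSSM output) options are those of the legacy blastpgp program as we understand them, and the file and database names are hypothetical. The command is only built and printed, not executed.

```python
import shlex


def blastpgp_command(query_fasta, database, pssm_out):
    """Build a blastpgp argument list; nothing is run here."""
    return [
        "blastpgp",
        "-i", query_fasta,   # query sequence file (assumed option, not in the text)
        "-d", database,      # database to search (assumed option, not in the text)
        "-j", "5",           # option value quoted in the text
        "-e", "0.01",        # option value quoted in the text
        "-h", "0.01",        # option value quoted in the text
        "-Q", pssm_out,      # ASCII PSSM output file (assumed option, not in the text)
    ]


cmd = blastpgp_command("query.fasta", "nr", "query.pssm")
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting problems if file names contain spaces.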

Predicted Secondary Structure Information
We used the YASSPP secondary structure prediction server [9] with default parameters to generate the S feature matrix of dimensions n × 3. The (i, j)th entry of this matrix represents the propensity for residue i to be in state j, where j ∈ {1, 2, 3} corresponds to the three secondary structure elements: alpha helices, beta sheets, and coil regions.

Position Independent Scoring Matrices
Position independent sequence features were created for each residue by copying the residue's corresponding row of the BLOSUM62 scoring matrix. This resulted in an n × 20 feature matrix referred to as ℬ.
By using both PSSM and BLOSUM62 information, an SVM learner can construct a model that is based on both position independent and position specific information. Such a model is more robust to cases where PSI-BLAST could not generate correct alignments due to lack of homology to sequences in the NR database.
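The construction of the position independent features and their combination with a profile can be sketched as follows. To keep the example short, it uses only a three-residue excerpt of BLOSUM62 (A, R, N), so the resulting matrix has 3 columns rather than 20. The function names and the toy profile are assumptions, and column-wise concatenation is one plausible way to present both sources to a learner, not necessarily how svmPRAT combines them.

```python
# Excerpt of BLOSUM62 restricted to residues A, R, N (columns in that order).
TOY_BLOSUM62 = {
    "A": [4, -1, -2],
    "R": [-1, 5, 0],
    "N": [-2, 0, 6],
}


def position_independent_features(sequence, matrix):
    """Copy each residue's substitution-matrix row: one row per residue."""
    return [list(matrix[residue]) for residue in sequence]


def combine_features(*feature_matrices):
    """Concatenate per-residue rows from several n-row feature matrices."""
    n = len(feature_matrices[0])
    if any(len(m) != n for m in feature_matrices):
        raise ValueError("all feature matrices must have one row per residue")
    combined = []
    for i in range(n):
        row = []
        for m in feature_matrices:
            row.extend(m[i])
        combined.append(row)
    return combined


sequence = "ARNA"
B = position_independent_features(sequence, TOY_BLOSUM62)
# Hypothetical stand-in for a PSSM restricted to the same three columns.
P = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
PB = combine_features(P, B)
```

Identical residues always receive identical rows in `B`, whereas rows of a real PSSM depend on position. This is the distinction between the two feature types drawn above.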

Conclusions
In this work we have presented a general purpose support vector machine toolkit that builds protein sequence annotation models. Dubbed svmPRAT, the toolkit's versatility was illustrated by testing it on several types of annotation problems. These included binary classification to identify transmembrane helices and DNA-interacting residues, K-way classification to identify local structural class, and continuous predictions to estimate the residue-wise contact order. During our evaluation, we showed the ability of svmPRAT to utilize arbitrary sequence features such as PSI-BLAST profiles, BLOSUM62 profiles, and predicted secondary structure, which may be used with several kernel functions. Finally, svmPRAT allows the incorporation of local information at different levels of granularity through its windowing parameters. Our experiments showed that this allows it to achieve better performance on some problems. svmPRAT's key features include: (i) implementation of standard kernel functions along with a powerful second-order exponential kernel, (ii) use of any type of sequence information associated with residues for annotation, (iii) a flexible window-based encoding scheme, (iv) optimization for speed using fast solvers, (v) the capability to learn two-level cascaded models, and (vi) availability as pre-compiled binaries for various architectures and environments.
We believe that svmPRAT provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem. In the future, we intend to use this annotation framework to predict various 1D features of a protein and effectively integrate them to provide valuable supplementary information for determining the 3D structure of proteins.

Web Interface
Although svmPRAT is easy to use and is available across a wide variety of platforms and architectures, we also provide biologists with a web server, MONSTER (Minnesota prOteiN Sequence annotaTion servER), for predicting local structure and function. svmPRAT serves as the backend for MONSTER, which can be accessed via http://bio.dtc.umn.edu/monster.