Contact order calculation
Given a protein primary sequence, let us use L to denote its length, i.e., the number of amino acid residues in the sequence. The i-th residue is denoted as a
i
. For two distinct residues a
i
and a
j
, if there are two non-hydrogen atoms, one from each residue, within 6 Å, then a
i
and a
j
form a contact. L
ij
= |i - j| denotes the number of residues separating this contact. Assuming there are in total N contacts in the protein, the Abs_CO of this protein is defined as
(1)
where the summation goes over all contacting pairs (a
i
, a
j
) in the protein [6, 11]. The relative CO, or simply CO, is defined as
(2)
Because the relative CO is defined as Abs_CO normalized over protein length, exactly the same prediction accuracy can be achieved for Abs_CO as for CO. In this study, we focus on calculating and predicting Abs_CO, from which the corresponding CO can be trivially calculated. We implemented a contact order calculator that determines the Abs_CO value from the PDB coordinates of an input protein using the methods described above. The program was tested and validated against a large number of files for which the Abs_CO values had been previously published.
Prediction by homology
Many protein properties, including tertiary structure, secondary structure and solvent accessibility can be predicted via homology [31]. In other words, the properties of a query sequence can be predicted by directly transferring the properties or features of a homologous protein to the query protein. Since CO is a property that is a function of structure, we hypothesized that the calculated CO of known 3D structures could be used to predict the CO of homologous proteins. In implementing this approach we calculated the Abs_CO (using the method described in the last "Contact order calculation" section) for 16, 499 non-redundant proteins obtained from the PDB. These proteins were selected using the PDB culling/filtering service called PISCES [32]. Structures were initially selected using a 95% identity sequence-redundancy cutoff and a requirement for better than 3 Å resolution (for X-ray structures). Structures were further processed by removing disordered structures (secondary structure content < 10%) as well as all membrane proteins (membrane beta barrel and transmembrane helix proteins). The resulting CO database consisted of 16, 499 sequences in FASTA format with the Abs_CO value listed in the sequence name header. A local copy of BLAST [33] was installed which used this FASTA-formatted CO database as the search database. For a hit to be considered to be significant the query sequence must exhibit more than 20% sequence identity (computed as the number of identical residues divided by the query sequence length) to a protein in the CO database and the query sequence must be ± 40% of the length of the matching homologue. If these two criteria are met, then the contact order is transferred to the query protein. If any of these criteria is not met, then the contact order is predicted using the method described in the next "Prediction by regression" section. Tests through 5-fold cross validation on the CO database were performed using a variety of sequence identity cutoffs and sequence-length thresholds to assess their influence on both the accuracy and the coverage (coverage refers to the percentage of query sequences that could be predicted by this homology-based method). Overall, the 20% sequence identity cutoff and the 40% length threshold provided the best accuracy-to-coverage tradeoff.
Prediction by regression
In order to deal with the situation where no homologue can be found to predict the CO value (the last "Prediction by homology" section) we developed and tested a regression-based approach that permits accurate prediction of CO for any water-soluble protein. Let p(α) denote the percentage of residues in alpha-helices and p(β) denote the percentage of residues in beta strands in the protein. We observed that Abs_CO correlates well with a linear combination of p(α), p(β), and the protein length L. Given this observation we decided to use linear regression to optimize the correlation between Abs_CO and the protein primary and secondary structures, as follows:Abs_CO = χ1 · p(α) + χ2 · p(β) + χ3 · L + c, (3)
where χ
i
, i = 1, 2, 3, are the coefficients of the three factors p(α), p(β), and L, and c is a constant value in the linear regression. Note that for proteins with unknown three-dimensional structure, their secondary structures are also unknown. Therefore, as part of the linear regression process as specified in Formula (3), we predict their secondary structure content using Proteus [31]. Proteus is a secondary structure predictor that uses sequence alignment to achieve highly accurate predictions (Q3 accuracy score of 81.3% or greater), where homologs are identified using an E-value of < 0.01 and the secondary structures derived from VADAR [34] and the PPT-Database [35].
Using a large dataset of 933 high resolution three-dimensional protein structures (see Results section), the parameters in Formula (3) localize at χ1 = -6.8968, χ2 = 7.6216, χ3 = 0.0612, and c = 8.0397.
Subsequently, given any query protein, we may use Proteus again to predict p(α) and p(β) values, and then report its Abs_CO asAbs_CO = -6.8968p(α) + 7.6216p(β) + 0.0612L + 8.0397.
In the Results section, we will demonstrate the effectiveness of this stunningly simple prediction method.
In addition to this 3-factor CO predictor (Formula (4), and denoted as F3-LR), which has been implemented on our web server, we also developed other linear equations that considered more factors that might be strongly correlated to Abs_CO. For example, we added four other factors to F3-LR to create a 7-factor linear regression formula. These four factors are 1) the number of beta hairpins (two adjacent beta-strand segments form a hairpin if they are separated by 2 to 5 residues), 2) the number of distant beta strands (two adjacent beta-strand segments are considered "distant" if they are separated by at least 5 residues), 3) the number of Cysteine residues (C), and 4) the number of hydrophobic amino acid residues (V, I, L, M, F, W, C). Among these four factors, the latter two are obtained from the primary sequence, while the former two are extracted from the secondary structures predicted using Proteus. This method is denoted as F7-LR. The third method known as F27-LR considers 27 factors. These 27 factors include the first 5 factors in the F7-LR method, (the other two factors in the F7-LR method, the number of Cysteine residues and the number of hydrophobic amino acid residues, are replaced by) 19 amino acid frequencies of the 20 ones in the target protein, and 3 hydrophobicity frequencies defined as follows. For each amino acid type, its frequency in the target protein is defined as the number of occurrences divided by L, the length of the protein. Since the sum of all 20 such frequencies is 1, only 19 of them are included in the regression (to avoid redundancy). Next, for each residue in the target sequence, the hydrophobicity information of both the preceding and the succeeding residues are recorded. As a result, every residue, except the first and the last, is associated with one of the four labels: "HH", "HP", "PH", and "PP", where 'H' denotes hydrophobic and 'P' denotes hydrophilic. The frequency of "HH" is defined as the number of residues labeled with "HH" divided by L - 2. The other three frequencies are similarly defined, and their sum is exactly 1. For the same reason, only 3 of them are included in the regression.
We also tested two other regression methods: Support Vector Regression (SVR) [36] and Neural Network (NN) [37]. Combining these two regression methods, we have F3-SVR, F7-SVR, F27-SVR, and F3-NN, F7-NN, F27-NN. Performance of these nine different regression methods was assessed using a number of criteria to identify the best performing approach (see Results).
Public web server
We have implemented the above contact order calculator, the homology-based contact order predictor, and the linear regression based contact order predictors as a public web server [38]. The input to the server can be either a three-dimensional structure (either uploading the PDB file or key in the PDB id), or the primary sequence of the query protein. When the input is a sequence, our server will first use BLAST to identify sequences that are either identical or homologous to those in our CO database. There are three possible scenarios: 1) If the input is a 3D structure, or the input sequence exactly matches a known structure in our database, our server will calculate its Abs_CO directly using Formula (1); 2) If the input is a sequence and the BLAST search finds a homolog that is not an exact match but satisfies the criteria described in the "Prediction by homology" section, the pre-computed Abs_CO of the homologue is used as the predicted Abs_CO of the query sequence; 3) If the input is a sequence and has no BLAST match that falls into the second scenario, our server will call Proteus to predict the secondary structure content for the query protein, and then report its Abs_CO using Formula (4). Average calculation times are around 35 seconds for the CO calculator and about 27 seconds for the CO predictor.