DescFold: A web server for protein fold recognition
© Yan et al; licensee BioMed Central Ltd. 2009
Received: 9 July 2009
Accepted: 14 December 2009
Published: 14 December 2009
Machine learning-based methods have been proven to be powerful in developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) Protein Science, 14: 431-444], a machine learning-based method called DescFold was established by using Support Vector Machines (SVMs) to combine the following four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast e-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on the improvement of DescFold by incorporating more powerful descriptors and setting up a user-friendly web server.
In seeking more powerful descriptors, the profile-profile alignment score generated from the COMPASS algorithm was first considered as a new descriptor (i.e., PPA). When considering a profile-profile alignment between two proteins in the context of fold recognition, one protein is regarded as a template (i.e., its 3D structure is known). Instead of a sequence profile derived from a Psi-blast search, a structure-seeded profile for the template protein was generated by searching its structural neighbors with the assistance of the TM-align structural alignment algorithm. Moreover, the COMPASS algorithm was used again to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). We trained and tested the new DescFold on a total of 1,835 highly diverse proteins extracted from the SCOP 1.73 version. When the PPA and PSPA descriptors were introduced, the fold recognition performance of the new DescFold improved substantially. Using the SCOP_1.73_40% dataset as the fold library, the DescFold web server based on the trained SVM models was further constructed. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins was selected from the SCOP 1.75 version. At a false positive rate of less than 5%, the new DescFold is able to correctly recognize structural homologs at the fold level for nearly 46% of the test proteins. Additionally, we benchmarked the DescFold method against several well-established fold recognition algorithms using the LiveBench targets and the Lindahl dataset.
The new DescFold method was intensively benchmarked to have very competitive performance compared with some well-established fold recognition methods, suggesting that it can serve as a useful tool to assist in template-based protein structure prediction. The DescFold server is freely accessible at http://18.104.22.168/DescFold/index.html.
Template-based protein structure prediction methods (often known as comparative modeling and fold recognition) typically involve the following three steps. First, a (remote) homologous protein with known structure is identified as a template for a query sequence. The second step is to obtain an optimal alignment between the query sequence and the template sequence. Finally, a refined 3D model of the query protein can be generated based on the template structure. With more and more protein structural templates deposited in the current PDB database (http://www.rcsb.org/pdb/home/home.do), template-based methods are increasingly powerful and their applications to many aspects of life sciences are widely explored.
The key step in template-based methods is to identify a structure template that shares a similar 3D structure with the query sequence. When the query protein shares significant sequence similarity with the template, classical sequence alignment methods, such as Blast, FASTA, and the Smith-Waterman or Needleman-Wunsch dynamic programming algorithms, are suitable and accurate in detecting their homologous relationship. Generally, the template-based method for dealing with such "easy" templates is referred to as comparative modeling. However, proteins with weak sequence similarity are also frequently found to share similar 3D folds. Such remote homology relationships can be hard to detect with classical sequence alignment methods. To find a template that shares only remote homology with the query protein, profile-sequence (or sequence-profile) alignment methods such as Psi-blast, Rps-blast, Impala, and Hidden Markov Models (HMMs) have been used, and they often result in a marked improvement. Nevertheless, the profile-sequence (or sequence-profile) alignment methods also perform poorly when the investigated protein pairs are situated in the twilight or midnight zone. Considerable effort has therefore been devoted to developing more sensitive and powerful remote homology detection techniques, called fold recognition. During the last decade, fold recognition has received considerable attention and a variety of elegant fold recognition methods (e.g., FFAS, 3D-PSSM, Fugue, mGenThreader, ORFeus, MUSTER, and SP5) have been developed. The overall good performance of these techniques has been widely demonstrated in the CASP and CAFASP competitions as well as in real-time LiveBench experiments.
The basic strategy of fold recognition methods consists in comparing the query sequence with all the structures within a fold library. According to the measured compatibility between sequence and structure, the fold recognition method can identify the template with the best fit. The well-established fold recognition methods can be roughly grouped into three main categories: (1) structure-seeded profile-based; (2) profile-profile alignment-based; and (3) machine learning-based. In the first category, 3D-PSSM and Fugue are probably the two best-known representative algorithms. For instance, 3D-PSSM is based on a hybrid fold recognition approach using sequence profiles and structure-seeded profiles (i.e., 3D profiles) coupled with predicted secondary structure information and solvation potential. In the second category, the profile-profile alignment methods have recently been proven to be very powerful in remote homology identification as well as in generating accurate sequence alignments [20, 21]. Generally, the profile-profile alignment method uses dynamic programming to obtain a direct alignment between two sequence profiles generated through Psi-blast searching [22, 23]. To improve the performance of the profile-profile alignment, structural information (e.g., predicted secondary structure) was also frequently added to measure the similarity of two positional vectors [14, 16]. In the third category, machine learning-based methods were employed to combine different sequence and structural information into fold recognition systems [13, 24–27]. In mGenThreader, for instance, a neural network was used to combine pair-wise potentials, solvation potentials, and various alignment parameters. In the past several years, Support Vector Machines (SVMs) have also been widely used to build binary classifiers, which allow the prediction of whether a sequence belongs to a single structural fold or not. Provided there are sufficient data in different protein folds, a set of binary classifiers can be trained and integrated into a fold recognition system (i.e., a multi-class predictor). A key step in establishing an SVM classifier is to find effective kernel functions, which measure the similarity between any pair of protein sequences. There are some established kernel functions such as the spectral kernel, the profile-based string kernel, and the mismatch string kernel.
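To make the notion of a kernel function concrete, the following minimal sketch computes a k-mer spectrum kernel between two sequences as the inner product of their k-mer count vectors. It is an illustration only: the function names and the default k = 3 are assumptions made here, and the sketch is not the implementation used in any of the cited methods.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a: str, seq_b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of two sequences.

    k = 3 is an illustrative default, not a value taken from the text.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())
```

Two sequences that share many short subsequences receive a large kernel value, whereas sequences with no k-mer in common receive zero.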
A machine learning-based fold recognition method called DescFold was developed in our previous work. In DescFold, any measurement between two proteins or any feature vector extracted from a protein sequence can be defined as a descriptor. For example, the amino acid composition of a protein can be regarded as a descriptor; the e-value obtained from a Blast search of protein A against protein B can also be considered as a descriptor between A and B. Based on such a broad definition, the fold identification capabilities of thirteen descriptors were evaluated and four optimal descriptors were selected to construct the original version of DescFold with the assistance of SVMs. Although SVMs were frequently used to build discriminative models between various protein folds, it should be emphasized that the SVMs here were employed to distinguish structurally similar and dissimilar protein pairs. The four implemented descriptors were a profile-sequence-alignment-based descriptor using Psi-blast e-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast e-values and bit scores, a descriptor based on the alignment of secondary structural elements (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. Although the original DescFold was reported to significantly outperform a standard Psi-blast search, it showed weaker performance than some well-established methods when tested on the LiveBench-8 targets.
In the present study, we focus on developing an improved DescFold method through the following efforts. First, a profile-profile-alignment-based (PPA) descriptor was incorporated into the new DescFold method. Of the existing profile-profile alignment algorithms, COMPASS is one of the best-performing methods, and possesses good computational efficiency. Additionally, COMPASS is freely accessible to the community. In this work, the alignment scores resulting from the COMPASS algorithm [23, 32] were defined as a PPA descriptor between a sequence pair. In the context of fold recognition, one of the two aligned sequences is regarded as a template, meaning that a structure-seeded profile is available for the template, which may contain different evolutionary information than a sequence profile derived from its homologous sequences. Second, the structure-seeded profile for the template sequence was generated by searching its structural neighbors with the assistance of TM-align. Again, the COMPASS algorithm was used to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). Finally, we also set up a user-friendly web server for DescFold, and have made it freely accessible to the research community. Here, we present details on the improvement resulting from the two newly introduced profile-profile alignment related descriptors, the construction of the DescFold web server, and the intensive benchmark results of testing DescFold against some state-of-the-art fold recognition methods.
Results and Discussion
The performance of individual descriptors based on the SCOP_1.73_1835 dataset
Based on the SCOP 1.73 version, we compiled a total of 1,835 sequence-dissimilar but structurally related proteins into a highly diverse protein dataset named SCOP_1.73_1835. Then, we used the SCOP_1.73_1835 dataset to benchmark the six different descriptor types in leave-one-out fold identification experiments. Each time, a protein in SCOP_1.73_1835 was selected as a "test" protein and the remaining proteins were regarded as a fold library. By calculating the pair-wise similarity scores defined in different descriptors, the "test" protein was scanned against the fold library and the protein with the most significant similarity score (i.e., the top hit) was recorded. If the top hit and the test protein belonged to the same SCOP superfamily, a correct fold identification was assigned. When the above experiment is performed over all the SCOP_1.73_1835 proteins, a descriptor's performance can be simply quantified in terms of sensitivity, i.e., the fraction of proteins with correctly identified structural homologs. More details about the construction of the different types of descriptors and the compilation of the SCOP_1.73_1835 dataset are outlined in the Methods section.
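The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `similarity` callable stands in for any one descriptor and is assumed to return larger values for more significant hits (an e-value would first have to be transformed accordingly), and all names are hypothetical rather than taken from the DescFold code.

```python
from typing import Callable, Dict, List


def leave_one_out_sensitivity(
    proteins: List[str],
    superfamily: Dict[str, str],
    similarity: Callable[[str, str], float],
) -> float:
    """Fraction of proteins whose top hit shares their SCOP superfamily."""
    correct = 0
    for query in proteins:
        # Every other protein in the dataset acts as the fold library.
        library = [p for p in proteins if p != query]
        # Assumption: higher similarity means a more significant hit.
        top_hit = max(library, key=lambda template: similarity(query, template))
        if superfamily[top_hit] == superfamily[query]:
            correct += 1
    return correct / len(proteins)
```

For a dataset of 1,835 proteins, a return value of 688/1835 would correspond to the 37.49% sensitivity reported for the Rps-blast-based descriptor.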
The sensitivities of fold identification using different descriptors are listed in Table 1. Of the four descriptors used in the original DescFold, the Rps-blast- and Psi-blast-based descriptors yield sensitivities of 37.49% and 36.84%, respectively. Predicted secondary structure has been proven to be useful in protein fold recognition/classification, and it can be effectively encoded by the SSEA-based descriptor [13, 24, 36]. The SSEA-based descriptor allows a correct identification rate of 28.56%. The motif-based descriptor is only able to generate successful fold identification for approximately 20% of the total protein sequences. Generally, the performance ranking of these four descriptors is in good agreement with the results from our previous study, although the descriptors were evaluated over two different datasets.
Table 1. Sensitivity of fold recognition based on individual descriptors.
SSEA: 524/1835 = 28.56%
Psi-blast: 676/1835 = 36.84%
Rps-blast: 688/1835 = 37.49%
Motif: 360/1835 = 19.62%
1083/1835 = 59.02%
1052/1835 = 57.33%
The overall performance of DescFold based on the SCOP_1.73_1835 dataset
Sensitivity of DescFold using different descriptors.
SSEA + Psi-blast + Rps-blast: 937/1835 = 51.06%
SSEA + Psi-blast + Rps-blast + motif: 1025/1835 = 55.86%
SSEA + Psi-blast + Rps-blast + motif + PPA: 1248/1835 = 68.01%
SSEA + Psi-blast + Rps-blast + motif + PPA + PSPA: 1322/1835 = 72.04%
The ROCn scores and the corresponding sensitivity values of DescFold using different descriptors.a
ROC16,744 (Sn)b, c
ROC83,720 (Sn)b, c
ROC167,440 (Sn)b, c
SSEA + Psi-blast + Rps-blast
SSEA + Psi-blast + Rps-blast + motif
SSEA + Psi-blast + Rps-blast + motif + PPA
SSEA + Psi-blast + Rps-blast + motif + PPA + PSPA
The DescFold web server and a large-scale benchmarking experiment on the SCOP_1.75_1866 dataset
Comparison with some well-established fold recognition methods
In this work, our DescFold method was first benchmarked against some state-of-the-art fold recognition methods based on the LiveBench targets. As a real-time fold recognition benchmark program, the LiveBench server submits newly released PDB proteins to the participating fold recognition servers every week, and evaluates the corresponding results. Here, we have selected the LiveBench-2008.1 targets (283 proteins) and LiveBench-2008.2 targets (513 proteins) as two reference test sets to compare the performance of DescFold and some well-established fold recognition methods. Although many fold recognition servers participated in the LiveBench-2008.1 and LiveBench-2008.2 experiments, we compared our DescFold method with only five popular fold recognition methods among them: 3D-PSSM, Fugue, mGenThreader, Inub, and FFAS.
Comparison of receiver operator characteristics (≤ 10 false positives) and sensitivity for different fold recognition methods based on all LiveBench-2008.1 targets.
Receiver operator characteristics (≤ 10 false positives)
Comparison of receiver operator characteristics (≤ 10 false positives) for different fold recognition methods based on all LiveBench-2008.2 targets.
Receiver operator characteristics (≤ 10 false positives)
The Lindahl dataset was also employed to further benchmark the performance of our DescFold method. Based on the SCOP database (version 1.39), the Lindahl dataset contains 976 proteins, in which the sequence identity for any protein pair is < 40%. In this dataset, 555, 434 and 321 sequences have at least one matching structural homolog at the family, superfamily and fold levels, respectively. Following the same strategy and procedures as we used with the SCOP_1.73_1835 dataset to develop the DescFold method, we retrained the DescFold method on the Lindahl dataset. By employing the same assessment procedure as reported in the literature [16, 25, 40], the top 1 and the top 5 matched templates for each query sequence were used to evaluate the sensitivity of recognition performance. Since the Lindahl dataset was based on an old version of SCOP, it may be quite subjective to benchmark different methods based on this dataset. Ideally, the sequence and structural information of these 976 proteins should not be included in deriving the DescFold prediction models. More stringently, the sequence and structural homologs of these 976 proteins should also not be used. In the present study, we used the SCOP database (version 1.73) to derive the PSPA and motif-based descriptors. For instance, the PSPA descriptor used the SCOP_1.73_40% dataset to construct the structure-seeded profile, which may inevitably contain structural homologs of these 976 proteins. Meanwhile, the motif-based descriptor relied on the SCOP_1.73_95% dataset to derive the motif-fold compatibility, which may also utilize some sequence homologs of these 976 proteins. To allow for a fair comparison, we designed two DescFold predictors. In the first predictor (DescFold_I), both the PSPA and motif-based descriptors were skipped. In the second predictor (DescFold_II), the PSPA descriptor was still not considered, while the motif-based descriptor was kept. To derive the motif-based descriptor for DescFold_II, however, the sequence homologs of these 976 proteins in the SCOP_1.73_95% database were filtered out using a Blast e-value threshold of 0.01.
The sensitivity of different methods on the Lindahl dataset at the family, superfamily, and fold levels.
Family level (%)
Superfamily level (%)
Fold level (%)
Although considerable effort was made to ensure that the above two benchmark experiments were intensive and strict, we are still not able to guarantee a fully unbiased assessment. Regarding the benchmark based on the LiveBench targets, the fold libraries are different for the assessed methods, which may have some effect on the performance of the corresponding methods. For the comparative analysis based on the Lindahl dataset, the performance figures of the other methods were collected from different publications. In this case, the sequence databases used to generate the profiles are not the same, which may result in different performance to some extent. Meanwhile, some methods may already have been significantly updated since their benchmark performance on the Lindahl dataset was published. As pointed out by Cheng and Baldi, such benchmark experiments can only provide a rough assessment rather than a very precise measurement. Even so, both of the aforementioned benchmark experiments indicate that the performance of DescFold is comparable to that of some well-established peer methods.
In this work, we developed an improved DescFold method by incorporating two new profile-profile alignment related descriptors (i.e., the PPA and PSPA descriptors). Because the profile-profile alignment is able to capture evolutionary information that was missed in our original DescFold, the new DescFold achieves much better performance. The new DescFold method was benchmarked against some other state-of-the-art fold recognition techniques by using the LiveBench targets and the Lindahl dataset. Our DescFold method demonstrates competitive performance in comparison to the existing methods. To allow for practical applications, we have made it freely accessible to the community through a user-friendly web server.
Concerning future development, the following two efforts should be made to maintain DescFold as a competitive fold recognition system. Firstly, the fold library of DescFold should be regularly updated. To provide a more comprehensive fold library, those experimentally determined structures which are not included in the SCOP database should also be taken into account. Secondly, seeking new descriptors is still the most important direction for development of a better predictor. On the one hand, machine learning-based methods allow the incorporation of more descriptors into a fold recognition system, which may yield better performance. On the other hand, the introduced descriptors will inevitably increase the complexity of the prediction model and obscure the contribution of each individual descriptor. Therefore, a new descriptor candidate should be carefully assessed before its acceptance for inclusion in future versions of DescFold. With such assessment, we expect that machine learning-based methods will not only result in a fold recognition system with higher accuracy, but also strengthen our fundamental understanding of the evolutionary relationship between protein sequence and structure.
In this work, we rely heavily on the SCOP database (version 1.73) to construct the DescFold method. The corresponding SCOP sequences and structural data were obtained from the ASTRAL website (http://astral.berkeley.edu/). To train and test the DescFold prediction models, two SCOP protein sequence subsets, one filtered by a 10% cut-off for sequence identity and the other by an e-value threshold of 0.01, were downloaded separately from the ASTRAL website. Then, only sequences that occurred in both of the above subsets were kept. We also excluded sequences that are too short (less than 60 amino acids). Moreover, only one representative protein was retained for each SCOP family. Finally, 1,835 protein sequences were kept and compiled into a dataset, which we named SCOP_1.73_1835 [see Additional file 1]. To construct the fold library of the DescFold web server, the SCOP_1.73_40% database with a total of 9,282 proteins was downloaded, in which the sequence identity among the proteins is equal to or less than 40%. The SCOP_1.73_40% database was also used as the database to search for structural neighbors for each template. Additionally, we used the SCOP_1.73_95% dataset to derive the motif-based descriptor, in which the sequence identity for any sequence pair is ≤ 95%. A total of 15,273 protein sequences in the current SCOP_1.73_95% dataset were downloaded.
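The filtering steps used to compile SCOP_1.73_1835 can be sketched as below. The data structures and names are hypothetical, and because the text does not state how the representative of each family was chosen, the sketch simply keeps the first identifier in sorted order; this choice is an assumption made purely for illustration.

```python
from typing import Dict, List, Set


def compile_training_set(
    identity_subset: Set[str],   # ASTRAL subset filtered at 10% sequence identity
    evalue_subset: Set[str],     # ASTRAL subset filtered at an e-value of 0.01
    sequences: Dict[str, str],   # protein identifier -> amino acid sequence
    family: Dict[str, str],      # protein identifier -> SCOP family
    min_length: int = 60,
) -> List[str]:
    """Keep proteins present in both subsets, long enough, one per family."""
    kept: List[str] = []
    seen_families: Set[str] = set()
    for pid in sorted(identity_subset & evalue_subset):
        if len(sequences[pid]) < min_length:
            continue  # sequences shorter than 60 residues are excluded
        if family[pid] in seen_families:
            continue  # a representative of this family was already kept
        # Assumption: the first identifier in sorted order is the representative.
        seen_families.add(family[pid])
        kept.append(pid)
    return kept
```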
To perform a large-scale benchmark of our DescFold server, a stringent test set was selected from a newer SCOP version (i.e., SCOP 1.75) based on the following criteria. Firstly, all proteins that exist in SCOP 1.75 but not in SCOP 1.73 were downloaded. Secondly, only proteins whose fold types already existed in SCOP 1.73 were retained. Thirdly, proteins sharing a Blast e-value of less than 0.1 with any protein in the SCOP_1.73_40% library were discarded. Finally, 1,866 proteins from the SCOP 1.75 version were compiled into a test dataset called SCOP_1.75_1866 [see Additional file 2].
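The three selection criteria can be expressed as a simple filter. In this sketch, `best_evalue` is a hypothetical precomputed mapping from each candidate protein to its lowest Blast e-value against the SCOP_1.73_40% library; a protein that is absent from the mapping is assumed to have no Blast hit at all. All names are illustrative.

```python
from typing import Dict, List, Set


def select_test_set(
    scop_175: Set[str],             # protein identifiers in SCOP 1.75
    scop_173: Set[str],             # protein identifiers in SCOP 1.73
    fold_of: Dict[str, str],        # SCOP 1.75 protein -> fold type
    known_folds: Set[str],          # fold types already present in SCOP 1.73
    best_evalue: Dict[str, float],  # lowest Blast e-value against the library
    evalue_cutoff: float = 0.1,
) -> List[str]:
    """Apply the three criteria for the stringent test set."""
    selected: List[str] = []
    for pid in sorted(scop_175 - scop_173):       # criterion 1: new in SCOP 1.75
        if fold_of[pid] not in known_folds:       # criterion 2: known fold type
            continue
        # Criterion 3: discard proteins with a detectable library homolog.
        if best_evalue.get(pid, float("inf")) < evalue_cutoff:
            continue
        selected.append(pid)
    return selected
```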
The NCBI non-redundant (NR) sequence database was downloaded from ftp://ncbi.nlm.nih.gov/blast/ (November, 2008). The NR database was further clustered at a cut-off of 90% identity (global alignment mode) by using CD-hit, and the resulting NR90 database, containing 4,205,215 sequences, was used to perform the Psi-blast search. To derive the motif-based descriptor, the PROSITE database (release 20.9), which contains 1,322 patterns and 720 profiles, was obtained from http://www.expasy.org/prosite/.
Thus, the Psi-blast-based descriptor (i.e., evalue_modPsi-blast(A, B) and ScorePsi-blast(A, B)) can be used to measure the sequence similarity between A and B.
The Psi-blast search can be conducted in a reverse way via Rps-blast (i.e., profile B against sequence A). As with the Psi-blast-based descriptor, the Rps-blast-based descriptor also results in two parameters, evalue_modRps-blast(A, B) and ScoreRps-blast(A, B).
To derive the SSEA-based descriptor for two query sequences A and B, the following three steps were involved. First, the secondary structures of the two query sequences were predicted by PSIPRED. Second, the predicted secondary structural string for each sequence was converted into secondary structure elements such that "H" represents a helix element, "E" denotes a strand element, and "C" stands for a coil element. Third, the two series of secondary structure elements were aligned using a dynamic programming algorithm with a scoring scheme proposed by Przytycka et al. The resulting alignment score SSEA(A, B), ranging from 0 to 1, was regarded as the SSEA-based descriptor. For more details about the SSEA-based descriptor, please refer to our previous work.
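The second step, converting a per-residue secondary structure string into a series of elements, can be illustrated with a run-length encoding. Keeping the length of each element alongside its type is an assumption of this sketch, and the alignment scoring scheme of the third step is deliberately not reproduced here.

```python
from itertools import groupby
from typing import List, Tuple


def to_elements(ss_string: str) -> List[Tuple[str, int]]:
    """Collapse consecutive identical states into (state, length) elements.

    The input is a per-residue string over the alphabet H (helix),
    E (strand) and C (coil), such as a PSIPRED prediction.
    """
    return [(state, len(list(run))) for state, run in groupby(ss_string)]
```

For example, the string "CCHHHHCEEE" becomes a coil of length 2, a helix of length 4, a coil of length 1, and a strand of length 3.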
Profile-profile-alignment-based (PPA) descriptor
The COMPASS algorithm [23, 32] was employed to derive a profile-profile-alignment-based descriptor between proteins A and B. First, a Psi-blast search was carried out to generate sequence profiles A and B, with the same parameter settings as we used to calculate the Psi-blast-based descriptor. Second, the two multiple alignments generated from the Psi-blast search (i.e., profiles A and B) were processed by COMPASS to obtain a profile-profile alignment. The resulting two parameters, evalue PPA (A, B) and Score PPA (A, B) were regarded as the similarity measurement between A and B (i.e., the PPA descriptor). Similar to Eq.(1), the evalue PPA (A, B) was further converted into evalue_mod PPA (A, B).
Profile-structural-profile-alignment-based (PSPA) descriptor
Considering a protein pair A and B in the context of fold recognition, protein A is regarded as the query sequence and protein B is a structural template. Thus, the profile for protein B can also be obtained by searching its structural neighbours. To derive a PSPA descriptor between A and B, sequence profile A and structure-seeded profile B were generated. Sequence profile A was generated as described in deriving the Psi-blast-based descriptor, while the structure-seeded profile was obtained through the following steps. First, we searched structural template B against the SCOP_1.73_40% structural database using the TM-align structural alignment method with default parameters. The search resulted in 9,282 pair-wise structural alignments. Second, only those structural hits with a TM-align score > 0.6 were kept. Generally, a structural hit with a TM-align score > 0.6 is considered significant, meaning protein B and the corresponding hit share significant structural similarity. Third, we took sequence B as the reference sequence, in which no gaps were allowed, and trimmed the residues of the structural hits that were aligned with gap regions of sequence B in the corresponding pair-wise alignment. Finally, the corresponding pair-wise sequence alignments were combined into a multiple sequence alignment (i.e., structure-seeded profile B). When sequence profile A and structure-seeded profile B were prepared, the COMPASS algorithm was used again to derive the PSPA descriptor (evalue_mod PSPA (A, B) and Score PSPA (A, B)).
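The construction of the structure-seeded profile from pair-wise structural alignments can be sketched as follows. The input format, a list of (TM-align score, aligned B, aligned hit) tuples with "-" marking gaps, is an assumption about how the TM-align output might be represented after parsing; the function names are hypothetical.

```python
from typing import List, Tuple


def project_onto_template(aligned_b: str, aligned_hit: str) -> str:
    """Drop every alignment column in which template B has a gap.

    The result has exactly one character per residue of B, so B itself
    stays ungapped; positions where the hit has a gap remain '-'.
    """
    if len(aligned_b) != len(aligned_hit):
        raise ValueError("aligned strings must have equal length")
    return "".join(h for b, h in zip(aligned_b, aligned_hit) if b != "-")


def structure_seeded_profile(
    template_seq: str,
    hits: List[Tuple[float, str, str]],
    min_score: float = 0.6,
) -> List[str]:
    """Stack the template and its significant structural hits into an alignment."""
    rows = [template_seq]
    for score, aligned_b, aligned_hit in hits:
        if score > min_score:  # keep only significant structural neighbours
            rows.append(project_onto_template(aligned_b, aligned_hit))
    return rows
```

Every row of the returned alignment has the same length as sequence B, which is what allows the pair-wise alignments to be combined into a single multiple sequence alignment.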
Construction of DescFold
Based on the same strategy as detailed in our previous work, the aforementioned descriptors were combined into a fold recognition system termed DescFold with the assistance of SVMs. Similar to a 5-fold cross-validation, the protein pairs in the SCOP_1.73_1835 dataset (i.e., 1835 × 1834/2 = 1,682,695 pairs) were divided into five subsets of nearly equal size. Here, the SVM was trained to distinguish two different types of protein pairs (i.e., structurally similar and structurally dissimilar pairs). For the first type of protein pairs (i.e., positive instances), both proteins belong to the same superfamily. For the second type of protein pairs (i.e., negative instances), the two proteins are from different superfamilies. Of the total 1,682,695 protein pairs, 8,244 pairs were considered positive instances and their labels were set to +1, while 1,674,451 pairs were considered negative instances and their labels were set to -1. The aforementioned six descriptors were input as the feature vector for each protein pair, which contains a total of ten parameters. Taking a protein pair A and B as an example, the corresponding ten parameters are evalue_modPsi-blast(A, B), ScorePsi-blast(A, B), evalue_modRps-blast(A, B), ScoreRps-blast(A, B), SSEA(A, B), Motif_Score(A, B), evalue_mod PPA (A, B), Score PPA (A, B), evalue_mod PSPA (A, B), and Score PSPA (A, B).
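The pair counts quoted above, and the labelling of pairs as positive or negative instances, can be checked with a short sketch. The function names are hypothetical and the superfamily mapping is assumed to be available as a dictionary.

```python
from itertools import combinations
from typing import Dict, Tuple


def count_pairs(n_proteins: int) -> int:
    """Number of unordered protein pairs, n(n - 1)/2."""
    return n_proteins * (n_proteins - 1) // 2


def label_pairs(superfamily: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Label each unordered pair +1 (same superfamily) or -1 (different)."""
    labels: Dict[Tuple[str, str], int] = {}
    for a, b in combinations(sorted(superfamily), 2):
        labels[(a, b)] = 1 if superfamily[a] == superfamily[b] else -1
    return labels
```

With 1,835 proteins, `count_pairs` gives 1,682,695, which equals the sum of the 8,244 positive and 1,674,451 negative instances.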
To predict whether a given protein pair was structurally similar or dissimilar, the subset to which this pair belongs was labeled the "test" set, whereas the four remaining subsets were labeled "training" sets. An SVM model was developed for each of the "training" sets. The ratio of the positive to negative instances in each training dataset is approximately 1:200. Such an unbalanced training dataset affects the prediction performance of the established SVM models, and we found that the optimal ratio in the training set was 1:2.5. Each training dataset was therefore adjusted by discarding a random selection of the negative pairs prior to training. The whole training resulted in four separate SVM models, and the prediction score was obtained as the average of the decision values from the four different SVM models. Furthermore, the raw prediction score (RPS) was converted into a Z-Score. We randomly selected 3,000 pairs from the 1,682,695 protein pairs, and calculated the average value (AVE) and standard deviation (SD) of these pairs' prediction scores. For a query sequence, a Z-Score can then be calculated: Z = (RPS - AVE)/SD.
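The rebalancing of a training set to a 1:2.5 ratio and the Z-Score conversion can be sketched as below. The function names and the fixed random seed are illustrative, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import random
import statistics
from typing import List, Sequence


def downsample_negatives(
    n_positives: int,
    negatives: Sequence,
    neg_per_pos: float = 2.5,
    seed: int = 0,
) -> List:
    """Randomly keep about neg_per_pos negative pairs per positive pair."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n_keep = min(len(negatives), int(round(neg_per_pos * n_positives)))
    return rng.sample(list(negatives), n_keep)


def z_score(rps: float, background_scores: Sequence[float]) -> float:
    """Z = (RPS - AVE) / SD over the scores of randomly selected pairs."""
    ave = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)  # assumption: sample standard deviation
    return (rps - ave) / sd
```

In the setting described above, `background_scores` would hold the prediction scores of the 3,000 randomly selected protein pairs.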
Libsvm was employed as the SVM algorithm in our work. The applied kernel was the linear function and the other parameters were set to their default values. We also tried the automatic parameter optimization provided by Libsvm, but it did not result in a better performance. Instead of performing any further parameter optimization, we only used the default SVM parameters in our DescFold method. According to the randomized grouping of five subsets, the 5-fold cross-validation was repeated 5 times. Finally, the average performance was reported.
The F-scores of ten input features used in building the SVM models.
Score PSPA (A, B)
Score PPA (A, B)
evalue_mod PSPA (A, B)
evalue_mod PPA (A, B)
Construction of the web server of DescFold
To aid the research community, a web server for DescFold was constructed and is freely available at http://22.214.171.124/DescFold/index.html. To sufficiently represent the known protein structural space, the 9,282 proteins in the SCOP_1.73_40% dataset were used as the fold library. For computational efficiency, the Psi-blast-derived profiles, predicted secondary structure elements, S motif (fold|sequence), and structure-seeded profiles of the template proteins were pre-calculated. To search a query sequence against the fold library (i.e., SCOP_1.73_40%), a total of 9,282 protein pairs were involved. For each protein pair, the corresponding six descriptors were calculated. Then, the resulting ten parameters were used as the input for five SVM models trained in the above section, and the prediction score was obtained as an average value over the decision scores from the five different SVM models. Moreover, the prediction scores for all protein pairs were converted into Z-Scores. Finally, the top hits ranked by Z-Scores were reported. Users have options to display the top hits by setting the number of hits and the cut-off of Z-Score. The default number is ten and the maximal number is 50.
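The final ranking step of the server can be sketched as follows: the decision values of the SVM models are averaged for each template, converted into Z-Scores, filtered by an optional cut-off, and truncated to the requested number of hits. The background mean and standard deviation are assumed to be supplied by the caller, and all names are hypothetical rather than taken from the server code.

```python
import statistics
from typing import Dict, List, Optional, Sequence, Tuple


def rank_templates(
    decision_values: Dict[str, Sequence[float]],  # template -> one value per SVM model
    background_mean: float,
    background_sd: float,
    n_hits: int = 10,                  # default number of reported hits
    z_cutoff: Optional[float] = None,  # optional user-defined Z-Score cut-off
    max_hits: int = 50,                # maximal number of reported hits
) -> List[Tuple[str, float]]:
    """Return the top hits as (template, Z-Score), best first."""
    n_hits = min(n_hits, max_hits)
    scored: List[Tuple[str, float]] = []
    for template, values in decision_values.items():
        raw_score = statistics.mean(values)  # average over the SVM models
        z = (raw_score - background_mean) / background_sd
        if z_cutoff is None or z >= z_cutoff:
            scored.append((template, z))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n_hits]
```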
Availability and requirements
Project Name: DescFold
Project home page: http://126.96.36.199/DescFold/index.html
Operating system: Online service is web based; local version of the software should be run on a Linux platform.
Programming language: Perl.
Other requirements: None.
Any restrictions to use by non-academics: None.
We thank the anonymous referees whose constructive comments were very helpful in improving the quality of this work. We are grateful to Drs. Ruslan Sadreyev and Nick Grishin at the Howard Hughes Medical Institute for kindly providing the standalone version of the COMPASS package. We also extend our gratitude to Dr. Yang Zhang at the University of Kansas, whose TM-align program was used to derive the structure-seeded profiles in this work. This research was supported by grants from the State High Technology Development Program (2008AA02Z307) and the National Key Basic Research Project of China (2009CB918802).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.