- Open Access
Protein local 3D structure prediction by Super Granule Support Vector Machines (Super GSVM)
© Chen and Johnson; licensee BioMed Central Ltd. 2009
- Published: 8 October 2009
Understanding the relationship between protein sequence and 3D structure is a major research area in bioinformatics. Predicting a complete protein tertiary structure from sequence information alone remains impractical. This paper aims to reveal the hidden knowledge linking sequence motifs to local tertiary structure.
In this paper, we propose a Super Granule Support Vector Machine (Super GSVM) model to obtain high-quality protein sequence motifs and to predict local tertiary structure information from sequence information alone.
The proposed model overcomes an inherent shortcoming of applying SVMs to such a large dataset: the computational cost of training support vectors on more than half a million samples. The satisfactory prediction results show that the Super GSVM model generates good protein sequence clusters and can capture hidden sequence-to-structure information. The model also has strong potential for applying SVMs in other research areas with huge datasets.
- Distance Threshold
- Sequence Cluster
- Sequence Segment
- Information Granule
- Granular Computing
Understanding the relationship between protein sequence and 3D structure is one of the most important research tasks in both biology and bioinformatics. Based on many biochemical experiments, it is believed that the sequence is the sole determinant of a polypeptide's structural conformation. This means that all the information necessary to specify protein interaction sites is embedded in the polypeptide's amino acid sequence.
In order to discover the protein sequence-to-structure relationship, Han and Baker used the K-means clustering algorithm to produce high-quality protein clusters from protein sequence frequency profiles [2, 3]. Subsequently, they used the sequence clusters combined with a Hidden Markov Model (HMM) to predict local protein structures. In their work, the clustering algorithm plays the central role in relating protein sequences to local structures. However, conventional clustering algorithms assume that the distance between data points can be calculated with exact precision. When the distance function is not well characterized, this approach may not reveal the sequence-to-structure relationship efficiently.
Support Vector Machines (SVMs) have proven their value in various research domains. SVMs apply the soft-margin idea, which tolerates mislabelled examples while maximizing the margin, and they handle non-linear classification by implicitly mapping input samples into a higher-dimensional space in which a maximum-margin hyperplane is constructed. From this point of view, SVMs may be more effective than the K-means clustering algorithm at discovering the non-linear sequence-to-structure relationship. Nevertheless, because of their high computational cost, SVMs are not well suited to large datasets. It is almost impossible to train an SVM on more than half a million data segments, which are used to generate recurring protein sequence patterns. As a result, combining SVMs with granular computing might be a key step towards uncovering the sequence-to-structure relationship. Using the divide-and-conquer principle, granular computing divides a complex data-mining problem into a series of smaller and computationally simpler problems.
In this paper, we explain how to combine the power of SVMs and granular computing to uncover the hidden relationship between sequence and structure. A detailed report on local protein structure prediction results based on sequence information is also provided.
Super Granule Support Vector Machines (Super GSVM)
Protein local 3D structure prediction
Table 1. The improvement in the number of high-quality sequence clusters in each information granule, with the total number of clusters.
Table 2. Prediction accuracy on three clustering groups. Different distance thresholds and different clustering groups yield different prediction accuracy; Table 2 gives a detailed report.
Table 3. Prediction coverage on three clustering groups under different distance thresholds.
We try to discover the sequence-to-structure relation by predicting protein 3D information, measured mainly by dmRMSD, from sequence knowledge alone. Other tertiary structure knowledge, such as torsion angles, could also be adopted in our future work as an additional source of 3D structural information. In addition, since the Ranking-SVM gives ranking information on the target examples rather than a simple yes or no, it may be possible to develop a strong voting mechanism that yields better prediction accuracy.
In this work, we propose a Super GSVM model to discover hidden protein sequence-to-structure information. We cluster sequence profiles to find recurring sequence patterns and evaluate the clusters by secondary structure similarity. We then build a Ranking-SVM for each cluster to improve its secondary structure similarity. Finally, based on the sequence clusters and the corresponding Ranking-SVMs, we predict the tertiary structure of the testing sequences. If the sequence similarity between a testing segment and an existing cluster is verified, we predict that the 3D structure of the testing segment is similar to the representative 3D structure of that sequence cluster. No tertiary structure information is involved in the training process; this is how the model demonstrates a relation between primary structure and tertiary structure. Although the prediction accuracy is not yet perfect, this approach offers a new way to discover protein sequence-to-structure information, and we believe future work can build on these methods. The proposed Super GSVM model is also applicable to other scientific areas with very large datasets.
Training dataset and independent testing dataset
The training dataset used in this work includes 2710 protein sequences obtained from the Protein Sequence Culling Server (PISCES). No two sequences in this dataset share more than 25% sequence identity. Sliding windows of nine successive residues are generated from each protein sequence, so each window represents one sequence segment of nine contiguous positions. More than 560,000 segments are generated by this method and clustered into 800 clusters. The frequency profile from HSSP is constructed from the alignment of each protein sequence in the Protein Data Bank (PDB) with all the sequences considered homologous in the sequence database. Based on the 3D coordinates obtained from the PDB, we also calculate the distance matrix between all nine α-carbons and append this information to each data segment for prediction purposes.
The latest release of PISCES includes 4345 PDB files. Compared with the dataset in our experiment, 2419 PDB files are excluded. Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset, which contains 490,426 segments. Based on the 3D coordinates obtained from the PDB, we also calculate the distance matrix between all nine α-carbons and append this information to each data segment for testing purposes.
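The segment generation described above can be illustrated with a minimal Python sketch. The function names, the toy sequence, and the toy coordinates are assumptions made for illustration only; they are not taken from the authors' implementation, and real α-carbon coordinates would come from PDB files.

```python
import math
from itertools import combinations

WINDOW = 9  # window size stated in the text


def sliding_windows(sequence, size=WINDOW):
    """Return every run of `size` consecutive residues in `sequence`."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def pairwise_distances(coords):
    """Distances between all distinct pairs of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


# Toy sequence (hypothetical, 16 residues) -> 16 - 9 + 1 = 8 segments.
segments = sliding_windows("MKTAYIAKQRQISFVK")
print(len(segments), segments[0])

# Toy alpha-carbon coordinates (hypothetical): nine points on a straight line.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(WINDOW)]
print(len(pairwise_distances(toy_coords)))  # number of distinct pairs
```

A protein of length n yields n - 8 segments with this window size, and each nine-residue segment carries 36 distinct inter-residue distances.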
Primary structure distance and tertiary structure distance (dmRMSD)
Here, L is the window size and N is 20, representing the 20 different amino acids. F_k(i, j) is the value at row i and column j of the matrix that represents the sequence segment. F_c(i, j) is the value at row i and column j of the matrix that represents the centroid of a given sequence cluster.
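The equation for the primary structure distance is not reproduced in this text. The sketch below assumes a city-block (L1) form, that is, the sum over all L rows and N columns of |F_k(i, j) - F_c(i, j)|, which is consistent with the variable definitions above but should be read as an assumption rather than as the authors' exact formula. The function name and the toy matrices are hypothetical.

```python
def profile_distance(segment, centroid):
    """Sum of absolute element-wise differences between two L x N matrices.

    Assumed city-block form; `segment` plays the role of F_k and
    `centroid` the role of F_c.
    """
    if len(segment) != len(centroid):
        raise ValueError("matrices must have the same number of rows")
    total = 0
    for row_k, row_c in zip(segment, centroid):
        if len(row_k) != len(row_c):
            raise ValueError("rows must have the same number of columns")
        total += sum(abs(fk - fc) for fk, fc in zip(row_k, row_c))
    return total


# Toy example with L = 2 positions and N = 3 columns (real profiles use
# L = 9 and N = 20). Values are hypothetical frequency percentages.
toy_segment = [[50, 30, 20], [0, 100, 0]]
toy_centroid = [[40, 40, 20], [10, 80, 10]]
print(profile_distance(toy_segment, toy_centroid))  # 10+10+0 + 10+20+10 = 60
```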
In the dmRMSD calculation, one set of distances represents the 3D structure of the predicted sequence cluster and the other is the structure information to be predicted. M is the number of distances in the distance matrix. Since the window size we use is nine, M = 36, the number of distinct pairs among nine α-carbons (9 × 8 / 2). In this work, we count a prediction of local 3D structure as successful if the dmRMSD is less than 1.5 Å.
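As a hedged illustration, dmRMSD can be computed as the root mean square deviation between two lists of M inter-residue distances. The function and variable names are hypothetical, the toy distances are invented, and the exact notation of the original equation is not recoverable from this text.

```python
import math

SUCCESS_CUTOFF = 1.5  # angstroms, the success threshold stated in the text


def dm_rmsd(predicted, target):
    """Root mean square deviation between two equal-length distance lists."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("need two non-empty lists of equal length")
    m = len(predicted)
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / m)


# Toy example: every one of the 36 distances differs by exactly 1 angstrom.
cluster_distances = [5.0] * 36   # stands in for the cluster's representative structure
segment_distances = [6.0] * 36   # stands in for the structure to be predicted
score = dm_rmsd(cluster_distances, segment_distances)
print(score, score < SUCCESS_CUTOFF)
```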
Secondary structure similarity measure
The secondary structure similarity of a cluster is the average, over all window positions, of the frequency of the most common secondary structure at each position:

\frac{\sum_{i=1}^{ws} \max(p_{i,H}, p_{i,E}, p_{i,C})}{ws}

Here, ws is the window size and p_{i,H} is the frequency of occurrence of helix at position i among the segments of the cluster; p_{i,E} and p_{i,C} are defined in a similar way. To obtain the representative secondary structure for each position, we use the max function to identify the most frequent secondary structure. As a simplified example, consider a cluster with a window size of three that contains three members whose secondary structures are (HEH), (CHH), and (HEH). The representative secondary structure for the first position is H (66%), for the second position E (66%), and for the third position H (100%). Therefore, the average secondary structure similarity is (66% + 66% + 100%)/3, or approximately 77%.
Our original idea of using sequence clusters to predict tertiary structure rests on two assumptions: (1) if the structural homology of a cluster exceeds 70%, the cluster can be considered structurally identical, and (2) if the structural homology of a cluster is between 60% and 70%, the cluster can be considered weakly structurally homologous. We group our sequence clusters into excellent, good, and fair clusters according to whether their secondary structure similarity is higher than 80%, between 70% and 80%, or between 60% and 70%, respectively. Intuitively, higher-quality sequence clusters have better 3D structure prediction power. Therefore, the first step of Super GSVM not only generates sequence clusters but also refines them into higher-quality ones.
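The worked example above can be checked with a short Python sketch. The function name is hypothetical; the calculation follows the description in the text (take the most frequent secondary structure at each position, then average the frequencies over the window).

```python
from collections import Counter


def secondary_structure_similarity(structures):
    """Average over positions of the frequency of the most common state."""
    if not structures or len(structures[0]) == 0:
        raise ValueError("need at least one non-empty structure string")
    window = len(structures[0])
    if any(len(s) != window for s in structures):
        raise ValueError("all structure strings must share one window size")
    total = 0.0
    for column in zip(*structures):  # one column per window position
        most_common_count = Counter(column).most_common(1)[0][1]
        total += most_common_count / len(structures)
    return total / window


# The three-member cluster from the text: (HEH), (CHH), (HEH).
similarity = secondary_structure_similarity(["HEH", "CHH", "HEH"])
print(round(similarity, 4))  # (2/3 + 2/3 + 1) / 3 = 7/9
```

The exact value is 7/9, about 0.78; the text arrives at a slightly lower figure because it rounds 2/3 down to 66%.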
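The grouping by secondary structure similarity can be written as a small lookup. This is a sketch only: the text does not say which group a cluster falls into when its similarity is exactly 80%, 70%, or 60%, so the boundary handling below is an assumption, as is the function name.

```python
def quality_group(similarity):
    """Map a secondary structure similarity in [0, 1] to a cluster group.

    Boundary handling is assumed: a value exactly on a threshold falls
    into the lower group, and values of 0.60 or less get no group.
    """
    if similarity > 0.80:
        return "excellent"
    if similarity > 0.70:
        return "good"
    if similarity > 0.60:
        return "fair"
    return None


for value in (0.85, 0.75, 0.65, 0.50):
    print(value, quality_group(value))
```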
Fuzzy Greedy K-means (FGK) Model
Granular computing represents information in the form of aggregates, also called "information granules" [9, 16]. For a huge and complicated problem, it uses the divide-and-conquer concept to split the original task into several smaller subtasks, reducing time and space complexity. In the process of splitting the original task, it also captures the problem without including meaningless information. As opposed to traditional data-oriented numeric computing, granular computing is knowledge-oriented.
Super Granule Support Vector Machine (Super GSVM)
Figure 1 shows a sketch of the proposed Super GSVM model. The whole model can be divided into two parts: (1) granulating the training dataset and building the Ranking-SVM for each cluster (in the figure, from the top down to "Collect all extracted clusters and Ranking-SVMs"); and (2) local structure prediction using the sequence clusters and the corresponding Ranking-SVMs (in the lower part of the figure, from left to right).
The first part starts by softly dividing the huge training dataset into several smaller information granules with the Fuzzy C-means clustering algorithm. For each information granule, we perform the Greedy K-means clustering algorithm. Since each cluster is much smaller than the initial training dataset, it becomes feasible to train a Ranking-SVM based on secondary structure for each cluster and to obtain the rank of all members within the cluster. In our previous report, filtering out the lowest-ranking 20% of members produced clusters with the highest biological and biochemical quality. In this paper, we likewise filter out the lowest-ranking 20% of members to yield higher-quality clusters. The improved results are shown in Table 1. Although we cluster the training dataset into 799 clusters, training the Ranking-SVMs on all clusters still took 3 months. Finally, we collect all the sequence clusters and the Ranking-SVM models for the second part: local structure prediction.
The second part of the Super GSVM uses the clusters and the Ranking-SVMs generated in the first part to predict protein local 3D structure from sequence information alone. Note that no 3D information is involved in the first part of the Super GSVM, because we want to discover the relationship between primary sequence and tertiary structure. For each independent testing sequence segment, we calculate the primary sequence distance (by formula (1)) to all sequence clusters. If a closest cluster can be found within a given distance threshold, we temporarily assign the testing segment to that cluster. We then feed the testing segment into the Ranking-SVM generated from that cluster to obtain the rank of the testing segment. If the rank of the testing segment is within the upper 80%, the segment is taken to belong to this cluster, and we predict that the testing segment has an Average Distance Matrix (ADM) similar to that of the cluster. If the rank is not within the upper 80%, the segment is taken not to belong to this cluster, so we search for the next closest cluster and repeat the process. If no cluster within the given distance threshold accepts the testing segment, the segment cannot be predicted. Because sequence clusters represent recurring patterns, or sequence motifs, and such motifs occur at only a handful of locations in a whole protein sequence, we can only address "local" structure prediction. Different distance thresholds generate different prediction results. Tables 2 and 3 and the results section give a detailed analysis of the different parameter settings.
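The filtering step can be sketched as follows, assuming that each cluster member already has a numeric score from a ranking model, with higher scores meaning better rank. The function name, the `drop_fraction` parameter, and the toy scores are hypothetical; training the Ranking-SVM itself is outside the scope of this sketch.

```python
def keep_top_ranked(members, scores, drop_fraction=0.20):
    """Drop the lowest-scoring fraction of members; return the rest, best first."""
    if len(members) != len(scores):
        raise ValueError("members and scores must have the same length")
    ranked = sorted(zip(scores, members), key=lambda pair: pair[0], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)  # rounds down for small clusters
    return [member for _, member in ranked[:len(ranked) - n_drop]]


# Toy cluster of ten segments with made-up scores 0.0, 0.1, ..., 0.9.
toy_members = [f"s{i}" for i in range(10)]
toy_scores = [i / 10 for i in range(10)]
kept = keep_top_ranked(toy_members, toy_scores)
print(len(kept), kept[0], kept[-1])  # the two lowest-scoring members are removed
```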
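The prediction procedure above can be summarised in a short Python sketch. Everything here is illustrative: `distance` stands in for formula (1), `rank_fraction` is a hypothetical callable that stands in for the cluster's Ranking-SVM and returns the segment's rank as a fraction of the cluster (0 for the top, 1 for the bottom), and treating "within the threshold" as less than or equal is an assumption.

```python
def predict_segment(segment, clusters, distance, rank_fraction,
                    threshold, keep_fraction=0.80):
    """Return the nearest cluster that accepts `segment`, or None."""
    # Order clusters from closest to farthest; the index breaks ties.
    candidates = sorted(
        (distance(segment, cluster["centroid"]), index)
        for index, cluster in enumerate(clusters)
    )
    for dist, index in candidates:
        if dist > threshold:
            break  # every remaining cluster is farther away
        cluster = clusters[index]
        if rank_fraction(cluster, segment) <= keep_fraction:
            return cluster  # caller uses this cluster's ADM as the prediction
    return None  # segment cannot be predicted


# Toy setup: scalar "profiles" and a fixed lookup in place of Ranking-SVMs.
toy_clusters = [
    {"name": "A", "centroid": 10, "adm": "ADM-A"},
    {"name": "B", "centroid": 14, "adm": "ADM-B"},
    {"name": "C", "centroid": 40, "adm": "ADM-C"},
]


def toy_distance(segment, centroid):
    return abs(segment - centroid)


def toy_rank(cluster, segment):
    return {"A": 0.9, "B": 0.5, "C": 0.1}[cluster["name"]]


hit = predict_segment(11, toy_clusters, toy_distance, toy_rank, threshold=5)
print(hit["name"] if hit else None)  # A is closest but rejects; B accepts
miss = predict_segment(11, toy_clusters, toy_distance, toy_rank, threshold=2)
print(miss)  # B lies outside the tighter threshold, so no prediction
```

A tighter threshold trades coverage for accuracy in this sketch: fewer segments find a cluster, but those that do are closer to its centroid.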
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- Karp G: Cell and molecular biology (concepts and experiments). 3rd edition. John Wiley & Sons Inc; 2002:52–65.
- Han KF, Baker D: Recurring local sequence motifs in proteins. Journal of Molecular Biology 1995, 251(1):176–187. 10.1006/jmbi.1995.0424
- Han KF, Baker D: Global properties of the mapping between local amino acid sequence and local structure in proteins. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(12):5814–5818. 10.1073/pnas.93.12.5814
- Bystroff C, Baker D: Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of Molecular Biology 1998, 281: 565–577. 10.1006/jmbi.1998.1943
- Bystroff C, Thorsson V, Baker D: HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology 2000, 301: 173–190. 10.1006/jmbi.2000.3837
- Zhong W, He J, Harrison R, Tai PC, Pan Y: Clustering Support Vector Machines for Protein Local Structure Prediction. Expert Systems With Applications 2007, 32(2):518–526. 10.1016/j.eswa.2005.12.011
- Cortes C, Vapnik V: Support-Vector Networks. Machine Learning 1995, 20(3):273–297.
- Chang CC, Lin CJ: Training nu-support vector classifiers: theory and algorithms. Neural Computation 2001, 13: 2119–2147. 10.1162/089976601750399335
- Yao YY: Perspectives of granular computing. Proceedings of 2005 IEEE International Conference on Granular Computing 1: 85–90.
- Chen B, Pellicer S, Tai PC, Harrison R, Pan Y: Efficient Super Granular SVM Feature Elimination (Super GSVM-FE) Model for Protein Sequence Motif Information Extraction. International Journal of Functional Informatics and Personalised Medicine 2008, 1(1):8–25. 10.1504/IJFIPM.2008.018290
- Chen B, Tai PC, Harrison R, Pan Y: FGK model: An Efficient Granular Computing Model for Protein Sequence Motifs Information Discovery. IASTED CASB Dallas, proceeding 2006, 56–61.
- Chen B, Tai PC, Harrison R, Pan Y: FIK model: A Novel Efficient Granular Computing Model for Protein Sequence Motifs and Structure Information Discovery. IEEE BIBE Washington D.C., proceeding 2006, 20–26.
- Wang G, Dunbrack RL Jr: PISCES: a protein sequence culling server. Bioinformatics 2003, 19(12):1589–1591. 10.1093/bioinformatics/btg224
- Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Struct Funct Genet 1991, 9(1):56–68. 10.1002/prot.340090107
- Zhong W, Altun G, Harrison R, Tai PC, Pan Y: Improved K-means Clustering Algorithm for Exploring Local Protein Sequence Motifs Representing Common Structural Property. IEEE Transactions on Nanobioscience 2005, 4(3):255–265. 10.1109/TNB.2005.853667
- Lin TY: Data Mining and Machine Oriented Modeling: A Granular Computing Approach. Journal of Applied Intelligence, Kluwer 2002, 13(2):113–124. 10.1023/A:1008384328214
- Kolodny R, Linial N: Approximate protein structural alignment in polynomial time. Proceedings of the National Academy of Sciences of the United States of America 2004, 101: 12201–12206. 10.1073/pnas.0404383101
- Zagrovic B, Pande VS: How does averaging affect protein structure comparison on the ensemble level? Biophysical Journal 2004, 87: 2240–2246. 10.1529/biophysj.104.042184
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.