Using distances between Top-n-gram and residue pairs for protein remote homology detection
© Liu et al.; licensee BioMed Central Ltd. 2014
Published: 24 January 2014
Protein remote homology detection is one of the central problems in bioinformatics and is important for both basic research and practical applications. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve state-of-the-art performance. Exploring feature vectors that incorporate the position information of amino acids or other protein building blocks is a key step toward improving the performance of SVM-based methods.
Two new methods for protein remote homology detection were proposed, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method, in which the feature vector representation for protein is based on the distances between residue pairs. SVM-DT is a profile-based method, which considers the distances between Top-n-gram pairs. Top-n-gram can be viewed as a profile-based building block of proteins, which is calculated from the frequency profiles. These two methods are position dependent approaches incorporating the sequence-order information of protein sequences. Various experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. Experimental results showed that these two new methods are very promising. Compared with the position independent methods, the performance improvement is obvious. Furthermore, the proposed methods can also provide useful insights for studying the features of protein families.
The better performance of the proposed methods demonstrates that position dependent approaches are efficient for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp
Protein remote homology detection is a central problem in computational biology, which refers to the detection of evolutionary homology in proteins with low sequence similarity. Through evolution, structure is more conserved than sequence. Thus, knowledge of protein structure and evolution is important for predicting the functions of proteins, which will promote protein binding studies, rational drug design, and many other related fields and applications. Protein remote homology detection identifies proteins from different families, and therefore can be applied to predict the structure and function of specific proteins.
Unfortunately, protein remote homology detection is still a challenging problem in bioinformatics, and therefore accurate and efficient computational approaches are needed. During the past two decades, a number of computational methods have been proposed for protein remote homology detection, which can be divided into two major categories: generative methods and discriminative algorithms. Early solutions were generative methods, which trained a model to represent a protein family and then evaluated a query sequence according to this model. For example, BLAST, PSI-BLAST, and Hidden Markov Models (HMMs) searched the protein database based on a model built by both positively labeled and unlabeled proteins. Generative methods did not perform well because they only make use of positive training samples to build the models for the prediction. Some generative methods have been improved by developing more sensitive profiles. For example, the HHsearch method used hidden Markov models to calculate a novel profile, COMPASS generated numerical profiles and constructed optimal profile-profile alignments, and FFAS was based on a new procedure for profile generation that takes into account all the relations within the family. Some online servers are available, including FORTE, RANKPROP, webPRC, PHYRE, GenThreader, COMA, and BioShell.
The discriminative approaches mark protein sequences with a set of labels, positive if they are in the protein family and negative otherwise. These methods attempt to learn the distinction between different classes, and both positive and negative samples are used in training. The most popular discriminative methods for the remote homology detection problem are based on the Support Vector Machine (SVM). SVM learns a linear decision boundary from both positive and negative training samples to discriminate between the unseen protein sequences. A key feature of SVM is that it needs a fixed-length input vector. Thus, researchers have introduced different types of kernel functions and feature vectors for protein representation. Most of these kernel functions were based on sequence composition and profiles. For features based on sequence composition, some methods were based on the similarity between a protein sequence and other sequences in the training sets. For example, Fisher kernel and SVM-Pairwise measured the similarity from the local alignment between proteins, but the alignment score falls into a twilight zone when the protein sequence similarity is below 35% at the amino acid level. Later, these methods were improved by introducing several kernels, such as LA kernel and SVM-HUSTLE. However, these methods ignored protein structure and evolutionary information, which led to limited performance improvement. Some kernels were based on sequence features, whose feature vectors were calculated from subsequences, incorporating protein structure information or amino acid position information. For instance, in Motif kernel, a protein sequence was represented in a vector space indexed by a set of motifs over an alphabet and subsequences. Spectrum Kernel searched all possible subsequences of length k from an alphabet to form a feature map. SVM-I-sites encoded structure information into the feature vectors.
Mismatch kernel was calculated based on shared occurrences of (k, m)-patterns in the data. LSA improved the performance of building-block-based methods. SVM-N-Peptide reduced the size of the amino acid alphabet for increasing values of k. The performance of the sequence-based methods is not satisfactory because these methods only use sequence features, without evolutionary information or three-dimensional structure. In profile-based methods, the feature vectors were extracted from profiles, which contain evolutionary information. These methods performed better than the sequence-based methods because a profile is much richer than an individual protein sequence in encoding information. Protein evolution involves changes of single residues, insertions and deletions of several residues, gene doubling, and gene fusion. As these changes accumulate over a long period of time, many similarities between the initial and resultant protein sequences are gradually eliminated, but the corresponding proteins may still share many common features, such as similar structure and function. A profile describes this kind of evolutionary information, and therefore the profile-based kernels outperform the sequence-based kernels for protein remote homology detection. For instance, SW-PSSM introduced two classes of kernel functions constructed from protein similarity measures by employing effective profile-to-profile scoring schemes. Profile kernel used probabilistic profiles to define position-dependent mutation neighborhoods along protein sequences. A Top-n-gram-based approach was proposed for protein remote homology detection. A Top-n-gram can be viewed as a profile-based building block of proteins obtained by combining the most frequent amino acids in the profiles. The proteins were converted into fixed-length vectors by the occurrences of each Top-n-gram and input into SVM for the prediction.
Although this method showed some improvement in predictive performance, it completely ignores the sequence-order information. Recent studies showed that sequence-order effects are relevant for protein remote homology detection. For example, SVM-PDT incorporated the sequence-order information by considering the amino acid physicochemical properties of any two residues in a protein within a given distance. ODH provided distance histograms for any possible pair of k-mers based on distances between short oligomers, and outperformed other position independent approaches. In the ACC method, the sequence-order information was captured by the auto-cross covariance (ACC) transformation. SVM-HMMSTR can capture the sequential ordering of the local structures. SVM-RQA used recurrence quantification analysis (RQA) to detect autocorrelation patterns along the protein sequences.
Motivated by the success of the position dependent methods, in this study we extend the Top-n-gram-based method by considering the sequence-order information, which is measured by all the possible Top-n-gram pairs within a given distance. In this approach, first, each amino acid in a protein sequence is converted into a Top-n-gram based on the frequency profiles calculated from multiple sequence alignments. Second, the feature vector is calculated from the occurrences of all the Top-n-gram pairs within a given distance threshold. Finally, this feature vector is input into SVM for the prediction.
As shown by a series of publications [34–38], to develop a useful statistical prediction method or model for a biological system, one needs to follow these steps: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) provide the downloadable source code or a web-server for the prediction method. Below, we describe how we deal with these steps.
SCOP 1.53 superfamily benchmark
We used a common benchmark to evaluate the performance of our methods for protein remote homology detection, which can be downloaded at http://noble.gs.washington.edu/proj/svm-pairwise/. This benchmark has been used by many studies, which provides good comparability with previous approaches [4, 16, 18, 28–30, 35, 36]. There are 54 families and 4352 proteins selected from SCOP version 1.53. All protein sequences were extracted from the Astral database, and no pairwise alignment is more significant than an E-value of 10^-25. Proteins within one SCOP family were taken as positive test samples, and proteins outside the family but within the same superfamily were taken as positive training samples. Negative samples were selected from outside of the superfamily and were separated into training and test sets.
Distance-based Top-n-gram (DT) and distance-based Residue (DR)
In this study, two approaches were proposed to convert protein sequences into feature vectors, including Distance-based Top-n-gram approach (DT) and Distance-based Residue approach (DR). First of all, we will introduce the process of the Distance-based Top-n-gram approach.
In a previous study, a Top-n-gram-based method was proposed for protein remote homology detection, which showed better predictive performance than some state-of-the-art methods, including SVM-LA, SVM-pairwise, and SVM-pattern. A Top-n-gram can be viewed as a profile-based building block of protein sequences, which contains the evolutionary information extracted from frequency profiles. Each amino acid in a protein sequence can be converted into a corresponding Top-n-gram by combining the n most frequent amino acids in the corresponding column of a frequency profile, and the order of the amino acids in a Top-n-gram reflects the different importance of these amino acids during evolution. By replacing all the amino acids in a protein with their corresponding Top-n-grams, a protein sequence can be represented as a sequence of Top-n-grams instead of a sequence of amino acids. For more details on Top-n-grams, please refer to that study.
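As a rough illustration of the conversion described above, the following minimal Python sketch turns a frequency profile into a sequence of Top-n-grams. The function name `top_n_grams`, the alphabet ordering in `AMINO_ACIDS`, the input layout (one list of 20 frequencies per position), and the tie-breaking rule are assumptions made for this example and are not taken from the original implementation.

```python
# Assumed ordering of the 20 standard amino acids (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_grams(profile, n=1):
    """Convert a frequency profile into a list of Top-n-grams.

    profile: one column per sequence position; each column is a list of
             20 frequencies ordered like AMINO_ACIDS (assumed layout).
    Returns one string of length n per position, with the most frequent
    amino acid first.
    """
    grams = []
    for column in profile:
        # Rank residue indices by descending frequency.
        # sorted() is stable, so ties keep alphabet order (an assumption).
        ranked = sorted(range(len(AMINO_ACIDS)), key=lambda i: -column[i])
        grams.append("".join(AMINO_ACIDS[i] for i in ranked[:n]))
    return grams


def column(**freqs):
    """Hypothetical helper: build a 20-element column from keyword frequencies."""
    return [freqs.get(a, 0.0) for a in AMINO_ACIDS]


# Toy profile with three positions.
toy_profile = [
    column(A=0.7, G=0.3),
    column(L=0.5, I=0.4, V=0.1),
    column(K=0.6, R=0.4),
]
top1 = top_n_grams(toy_profile, n=1)
top2 = top_n_grams(toy_profile, n=2)
```

With n = 1 each position is represented by its single most frequent amino acid, so the Top-1-gram alphabet has the same 20 symbols as the amino acid alphabet.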
In order to incorporate the sequence-order information into the prediction, a Distance-based Top-n-gram (DT) approach was proposed, which extends the original Top-n-gram-based feature vector by considering the relative position information of Top-n-gram pairs in protein sequences. The proposed feature vector was calculated by counting the occurrences of all possible Top-n-gram pairs within a certain distance threshold. In this study, the Top-1-gram was selected to construct the Distance-based Top-n-gram feature vector in order to reduce the dimension of the feature vectors and reduce the computational cost. Therefore, we will introduce the proposed Distance-based Top-n-gram approach based on Top-1-gram.
The dimension of the feature vector is 20 + 20 × 20 × d_MAX, where 20 is the size of the alphabet of Top-1-grams.
The Distance-based Residue approach (DR) is similar to the Distance-based Top-1-gram approach (DT), except that the native protein sequence is directly converted into the feature vector without replacing the amino acids with Top-1-grams.
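The counting scheme can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the vector layout (20 single-symbol counts followed by 20 × 20 ordered-pair counts for each distance 1..d_MAX) is inferred from the dimension 20 + 20 × 20 × d_MAX, and the use of raw, unnormalized counts, the alphabet ordering, and the name `distance_features` are illustrative choices.

```python
# Assumed ordering of the 20-letter alphabet (amino acids or Top-1-grams).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distance_features(sequence, d_max):
    """Count single symbols and ordered symbol pairs at distances 1..d_max.

    Returns a list of length 20 + 20 * 20 * d_max. The first 20 entries
    are symbol counts; each following block of 400 entries holds the pair
    counts for one distance (assumed layout).
    """
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    k = len(AMINO_ACIDS)
    vector = [0] * (k + k * k * d_max)
    for pos, symbol in enumerate(sequence):
        if symbol not in index:
            continue  # skip non-standard symbols (an assumption)
        vector[index[symbol]] += 1
        for d in range(1, d_max + 1):
            if pos + d >= len(sequence):
                break
            partner = sequence[pos + d]
            if partner not in index:
                continue
            offset = k + (d - 1) * k * k
            vector[offset + index[symbol] * k + index[partner]] += 1
    return vector


# DR: apply directly to the amino acid sequence.
dr_vector = distance_features("ACA", d_max=2)
# DT: apply to the Top-1-gram sequence (one letter per position).
dt_vector = distance_features("".join(["A", "L", "K"]), d_max=2)
```

Because Top-1-grams are single letters, the same function serves both DR and DT; only the input sequence differs.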
Construction of SVM classifiers and classification
SVM learns a linear decision boundary from both positive and negative training samples to discriminate between the unseen protein sequences. A key feature of SVM is that it needs a fixed-length input vector. The proteins in the training set and test set were transformed into fixed-dimension feature vectors following the process introduced above, and then the training vectors were input into SVM to construct the classifier. The SVM gives a discriminative score for each protein in the test set, which indicates the predicted class of the protein. In order to have better comparability with other SVM-based methods, we employed the publicly available Gist SVM package version 2.3 (http://www.chibi.ubc.ca/gist/index.html). The default parameters of the Gist package were used.
In order to evaluate the performance of SVM-based methods on an unbalanced dataset, we used the receiver operating characteristic (ROC) score and the ROC50 score to measure the performance of different methods. The ROC score is the normalized area under a curve that plots true positives against false positives for different possible classification thresholds, and the ROC50 score is the area under the ROC curve up to the first 50 false positives. The discriminative score obtained from the SVM classifier can be used to calculate the ROC score and ROC50 score.
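To make the two scores concrete, here is a minimal Python sketch that computes them from a ranked list of discriminative scores. The function name `roc_scores`, the handling of tied scores (they are ranked in input order), and the normalization of the truncated area by min(50, number of negatives) are assumptions for this example; the text above does not specify these details.

```python
def roc_scores(scores, labels, max_fp=50):
    """Return (ROC, ROC-n) for binary labels (1 = positive, 0 = negative).

    ROC   : normalized area under the full true-positive vs
            false-positive curve.
    ROC-n : normalized area up to the first max_fp false positives
            (ROC50 when max_fp = 50).
    Requires at least one positive and one negative sample.
    """
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = 0
    fp = 0
    area_full = 0
    area_trunc = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
            # Each false positive adds a column whose height is the
            # number of true positives ranked above it.
            area_full += tp
            if fp <= max_fp:
                area_trunc += tp
    roc = area_full / (n_pos * n_neg)
    roc_n = area_trunc / (n_pos * min(max_fp, n_neg))
    return roc, roc_n


# Toy example: two positives and two negatives.
roc, roc50 = roc_scores([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
_, roc1 = roc_scores([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], max_fp=1)
```

A perfect ranking, in which every positive scores above every negative, gives 1.0 for both scores.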
Results and discussion
The impact of the d_MAX value on the performance of SVM-DT and SVM-DR
Comparative results of previous approaches
Nine state-of-the-art protein remote homology detection methods were selected for comparison with the proposed methods: Monomer-dist, SVM-Top-n-gram, SVM-Top-n-gram-LSA, SVM-PDT-Profile, PseAACIndex-Profile, SVM-LA, SVM-Pairwise, BioSVM-2L, and HHSearch. HHSearch is a generative method, and the other eight methods, as well as the proposed SVM-DR and SVM-DT, are discriminative methods based on SVM. They differ in their strategies for constructing the feature vectors and kernel functions. The feature vector of Monomer-dist was based on the distances between short oligomers. SVM-Top-n-gram constructed the feature vectors from the occurrences of Top-n-grams, which can be viewed as profile-based "building blocks" of proteins. SVM-Top-n-gram-LSA further improved this method by employing Latent Semantic Analysis (LSA). SVM-PDT-Profile combined the amino acid physicochemical properties in the Amino Acid Index (AAIndex) with the frequency profiles for the prediction. PseAACIndex-Profile combined the Pseudo Amino Acid Composition (PseAAC) proposed by Chou with a profile-based protein representation to convert proteins into fixed-length vectors. The kernel of SVM-LA measured the similarity between a pair of proteins by taking into account all the optimal local alignment scores with gaps between all possible subsequences. BioSVM-2L constructed two-layer SVM classifiers with profile-based kernels. In SVM-Pairwise, each protein was represented as a vector of pairwise similarities to all proteins in the training set. HHSearch is one of the best protein remote homology detection methods and employs profile-based hidden Markov models.
Average ROC and ROC50 scores over 54 families for different methods.
SVM-LA (β = 0.5)
SVM-PDT-Profile (β = 8, n = 2)
PseAACIndex-Profile (λ = 5)
BioSVM-2L (1st+2nd layers)
Correlations between discriminative features and protein family
Where M is the matrix of sequence representatives. The magnitude of each element in w represents the discriminative power of the corresponding feature. We used the L2-norm of the discriminant weight vector w of each Top-1-gram pair and residue pair to measure the importance of the features.
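The weight-based analysis can be sketched as follows. This is a hedged illustration rather than the authors' procedure: it assumes that w is the weighted sum of the training feature vectors (the columns of M) with signed SVM coefficients `alphas`, that the feature layout is 20 single-symbol entries followed by one 400-entry block per distance, and that the L2-norm for a pair is taken over its d_MAX distance-specific weights. The names `discriminant_weights` and `pair_importance` are hypothetical.

```python
import math

# Assumed ordering of the 20-letter alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def discriminant_weights(train_vectors, alphas):
    """Compute w as the alpha-weighted sum of training vectors.

    alphas are assumed to carry the class sign already
    (positive for positive samples, negative for negative samples).
    """
    w = [0.0] * len(train_vectors[0])
    for x, a in zip(train_vectors, alphas):
        for j, value in enumerate(x):
            w[j] += a * value
    return w


def pair_importance(w, d_max):
    """L2-norm of the weights of each ordered pair across distances 1..d_max."""
    k = len(AMINO_ACIDS)
    importance = {}
    for i, a in enumerate(AMINO_ACIDS):
        for j, b in enumerate(AMINO_ACIDS):
            comps = [w[k + (d - 1) * k * k + i * k + j]
                     for d in range(1, d_max + 1)]
            importance[a + b] = math.sqrt(sum(c * c for c in comps))
    return importance


# Toy example with d_max = 2 (dimension 20 + 400 * 2 = 820).
dim = 20 + 400 * 2
x_pos = [0] * dim
x_pos[21] = 3    # pair "AC" at distance 1
x_pos[421] = 4   # pair "AC" at distance 2
x_neg = [0] * dim
x_neg[20] = 2    # pair "AA" at distance 1
w = discriminant_weights([x_pos, x_neg], [1.0, -1.0])
ranking = pair_importance(w, d_max=2)
```

Sorting `ranking` by value in descending order would then list the pairs that contribute most to the decision for a given family.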
In this study, we proposed two methods, SVM-DT and SVM-DR, for protein remote homology detection, in which the feature vectors were constructed based on the occurrences of Top-n-gram pairs or residue pairs at distances shorter than a distance threshold d_MAX. These approaches can be viewed as position dependent methods that incorporate the sequence-order information. SVM-DR is a sequence-based method whose advantage is that it does not need the time-consuming multiple sequence alignment step. SVM-DT is a profile-based method, which achieves more accurate predictive performance but requires higher computational cost due to the generation of Top-n-grams. Recently, position dependent methods have attracted much attention. Remote homology proteins share low sequence similarity, and therefore structure information is a key to improving the predictive performance. These position dependent methods partly incorporate the structure information by considering the relative order of residues or other building blocks occurring in protein sequences. One example is Monomer-dist, proposed by Lingner et al., which used the distances between short oligomers to produce the feature vectors and gave rise to very high-dimensional feature vectors. In contrast, SVM-DR efficiently reduced the dimension of feature vectors by only considering the residue pairs at distances shorter than a distance threshold d_MAX. SVM-DT further improved SVM-DR by using Top-n-grams to replace the residues in proteins and produced feature vectors based on Top-n-gram distances. This profile-based method used the evolutionary information in profiles and therefore showed better performance than the sequence-based methods and the position independent methods, such as SVM-Top-n-gram, indicating that the distance-based approaches are relevant for discrimination.
Recent studies showed that besides sequence and profile information, other features describing the physicochemical properties of amino acids can accurately detect the protein homologies, such as the amino acid index (AAIndex) [29, 41]. We are looking forward to incorporating these features into the proposed distance-based framework and exploring new mathematical and statistical models for the representation of protein sequences.
This work was supported by the National Natural Science Foundation of China (No. 61300112, No. 61370165, No. 61173075, No. 61272383, and No. 61370010), the Natural Science Foundation of Guangdong Province (No. S2012040007390 and No.S2013010014475), the Scientific Research Innovation Foundation in Harbin Institute of Technology (Project No. HIT.NSRIF.2013103), the Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2012-002). MOE Specialized Research Fund for the Doctoral Program of Higher Education 20122302120070, Shenzhen Foundational Research Funding JCYJ20120613152557576, Shenzhen International Cooperation Research Funding GJHZ20120613110641217, and Open Projects Program of National Laboratory of Pattern Recognition.
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 2, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S2.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Karplus K, Barrett C, Hughey R: Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics. 1998, 14 (10): 846-856. 10.1093/bioinformatics/14.10.846.
- Söding J: Protein Homology Detection by HMM-HMM Comparison. Bioinformatics. 2005, 21 (9): 951-960.
- Sadreyev RI, Tang M, Kim B-H, Grishin NV: COMPASS Server for Homology Detection: Improved Statistical Accuracy, Speed and Functionality. Nucleic Acids Res. 2009, 37 (Web Server): W90-W94. 10.1093/nar/gkp360.
- Jaroszewski L, Li Z, Cai X-H, Weber C, Godzik A: FFAS Server: Novel Features and Applications. Nucleic Acids Res. 2011, 39 (Web Server): W38-W44.
- Tomii K, Akiyama Y: FORTE: a Profile-Profile Comparison Tool for Protein Fold Recognition. Bioinformatics. 2004, 20 (4): 594-595. 10.1093/bioinformatics/btg474.
- Noble WS, Kuang R, Leslie C, Weston J: Identifying Remote Protein Homologs by Network Propagation. The FEBS journal. 2005, 272 (20): 5119-5128. 10.1111/j.1742-4658.2005.04947.x.
- Brandt BW, Heringa J: WebPRC: The Profile Comparer for Alignment-Based Searching of Public Domain Databases. Nucleic Acids Res. 2009, 37 (Web Server): W48-W52. 10.1093/nar/gkp279.
- Kelley LA, Sternberg MJ: Protein Structure Prediction on The Web: A Case Study Using The Phyre Server. Nat Protoc. 2009, 4 (3): 363-371. 10.1038/nprot.2009.2.
- Lobley A, Sadowski MJ, Jones DT: pGenTHREADER and pDomTHREADER: New Methods for Improved Protein Fold Recognition and Superfamily Discrimination. Bioinformatics. 2009, 25 (14): 1761-1767. 10.1093/bioinformatics/btp302.
- Margelevicius M, Venclovas MLC: COMA Server for Protein Distant Homology Search. Bioinformatics. 2010, 26 (15): 1905-1906. 10.1093/bioinformatics/btq306.
- Gront D, Blaszczyk M, Wojciechowski P, Kolinski A: BioShell Threader: Protein Homology Detection Based on Sequence Profiles and Secondary Structure Profiles. Nucleic Acids Res. 2012, 40 (Web Server): W257-W262.
- Noble WS, Pavlidis P: Support Vector Machine and Kernel Principal Components Analysis Software Toolkit. Columbia University. 2002.
- Jaakkola T, Diekhans M, Haussler D: Using the Fisher Kernel Method to Detect Remote Protein Homologies. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. 1999, 149-158.
- Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol. 2003, 10 (6): 857-868. 10.1089/106652703322756113.
- Rost B: Twilight zone of protein sequence alignments. Protein Eng. 1999, 12 (2): 85-94. 10.1093/protein/12.2.85.
- Saigo H, Vert JP, Ueda N, Akutsu T: Protein Homology Detection Using String Alignment Kernels. Bioinformatics. 2004, 20 (11): 1682-1689. 10.1093/bioinformatics/bth141.
- Shah AR, Oehmen CS, Webb-Robertson B-J: SVM-HUSTLE--an Iterative Semi-Supervised Machine Learning Approach for Pairwise Protein Remote Homology Detection. Bioinformatics. 2008, 24 (6): 783-790. 10.1093/bioinformatics/btn028.
- Ben-Hur A, Brutlag D: Remote Homology Detection: A Motif Based Approach. Bioinformatics. 2003, 19 (Suppl 1): i26-i33. 10.1093/bioinformatics/btg1002.
- Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel for svm Protein Classification. Proc Pacific Symposium on Biocomputing. 2002, 566-575.
- Hou Y, Hsu W, Lee ML, Bystroff C: Efficient Remote Homology Detection Using Local Structure. Bioinformatics. 2003, 19 (17): 2294-2301. 10.1093/bioinformatics/btg317.
- Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics. 2004, 20 (4): 467-476. 10.1093/bioinformatics/btg431.
- Dong QW, Wang XL, Lin L: Application of Latent Semantic Analysis to Protein Remote Homology Detection. Bioinformatics. 2006, 22 (3): 285-290. 10.1093/bioinformatics/bti801.
- Ogul H, Mumcuoglu EU: A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets. BioSystems. 2007, 87 (1): 75-81. 10.1016/j.biosystems.2006.03.006.
- Rangwala H, Karypis G: Profile-Based Direct Kernels for Remote Homology Detection and Fold Detection. Bioinformatics. 2005, 21 (23): 4239-4247. 10.1093/bioinformatics/bti687.
- Kuang R, Ie E, Wang K, Wang K, Siddiqi M: Profile-Based Direct Kernels for Remote Homology Detection and Motif Extraction. J Bioinform Comput Biol. 2005, 3 (3): 527-550. 10.1142/S021972000500120X.
- Liu B, Wang X, Lin L, Dong Q, Wang X: A Discriminative Method for Protein Remote Homology Detection and Fold Recognition Combining Top-n-grams and Latent Semantic Analysis. BMC Bioinformatics. 2008, 9: 510. 10.1186/1471-2105-9-510.
- Liu B, Wang X, Chen Q, Dong Q, Lan X: Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PLoS ONE. 2012, 7 (9): e46633. 10.1371/journal.pone.0046633.
- Lingner T, Meinicke P: Remote Homology Detection Based on Oligomer Distances. Bioinformatics. 2006, 22 (18): 2224-2231. 10.1093/bioinformatics/btl376.
- Liu X, Zhao L, Dong Q: Protein Remote Homology Detection Based on Auto-Cross Covariance Transformation. Computers in Biology and Medicine. 2011, 41 (8): 640-647. 10.1016/j.compbiomed.2011.05.015.
- Hou Y, Hsu W, Lee L, Bystroff C: Remote Homolog Detection Using Local Sequence-Structure Correlations. Proteins. 2004, 57 (3): 518-530. 10.1002/prot.20221.
- Yang Y, Tantoso E, Li K-B: Remote Protein Homology Detection Using Recurrence Quantification Analysis and Amino Acid Physicochemical Properties. Journal of Theoretical Biology. 2008, 252 (1): 145-154. 10.1016/j.jtbi.2008.01.028.
- Zou Q, Wang Z, Wu Y, Liu B, Lin Z, Guan X: An Approach for Identifying Cytokines Based On a Novel Ensemble Classifier. BioMed Research International. 2013, 686090. 10.1155/2013/686090.
- Zhang Y, Liu B, Dong Q, Jin VX: An improved profile-level domain linker propensity index for protein domain boundary prediction. Protein and Peptide Letters. 2011, 18 (1): 7-16. 10.2174/092986611794328717.
- Liu B, Wang X, Lin L, Tang B, Dong Q, Wang X: Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics. 2009, 10: 381. 10.1186/1471-2105-10-381.
- Liu B, Wang X, Lin L, Dong Q, Wang X: Exploiting three kinds of interface propensities to identify protein binding sites. Computational Biology and Chemistry. 2009, 33 (4): 303-311. 10.1016/j.compbiolchem.2009.07.001.
- Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, Dong Q, Chou K-C: Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. DOI: btt709.
- Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: SCOP Database in 2004: Refinements Integrate Structure and Sequence Family Data. Nucleic Acids Research. 2004, 32 (Database): D226-D229.
- Brenner SE, Koehl P, Levitt M: The ASTRAL Compendium for Sequence and Structure Analysis. Nucleic Acids Res. 2000, 28 (1): 254-256. 10.1093/nar/28.1.254.
- Liu B, Wang X, Zou Q, Dong Q, Chen Q: Protein Remote Homology Detection by Combining Chou's Pseudo Amino Acid Composition and Profile-Based Protein Representation. Molecular Informatics. 2013, 32: 775-782. 10.1002/minf.201300084.
- Muda HM, Saad P, Othman RM: Remote Protein Homology Detection and Fold Recognition Using Two-Layer Support Vector Machine Classifiers. Computers in Biology and Medicine. 2011, 41 (8): 687-699. 10.1016/j.compbiomed.2011.06.004.
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: Amino Acid Index Database, Progress Report 2008. Nucleic Acids Res. 2008, 36 (Database): D202-D205.
- Burns CS, Aronoff-Spencer E, Dunham CM, Lario P, Avdievich NI, Antholine WE, Olmstead MM, Vrielink A, Gerfen GJ, Peisach J: Molecular Features of the Copper Binding Sites in the Octarepeat Domain of the Prion Protein. Biochemistry. 2002, 41 (12): 3991-4001. 10.1021/bi011922x.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.