A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis
© Liu et al; licensee BioMed Central Ltd. 2008
Received: 23 March 2008
Accepted: 01 December 2008
Published: 01 December 2008
Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences.
In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods.
The method based on Top-n-grams significantly outperforms the methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. Therefore, the Top-n-gram is a good building block of protein sequences and can be widely used in many computational biology tasks, such as sequence alignment, domain boundary prediction, the design of knowledge-based potentials and the prediction of protein binding sites.
Protein homology detection is one of the most intensively researched problems in bioinformatics. Researchers are increasingly depending on computational techniques to classify proteins into functional or structural classes by means of homologies. Most methods can detect homologies at high levels of sequence similarity, while accurately detecting homologies at low levels of sequence similarity (remote homology detection) is still a challenging problem.
Many powerful methods and algorithms have been proposed to address the remote homology detection and fold recognition problems. Some methods are based on the pairwise similarities between protein sequences. The Smith-Waterman dynamic programming algorithm finds an optimal similarity score according to a predefined objective function. RANKPROP relies upon a precomputed network of pairwise protein similarities. Some heuristic algorithms, such as BLAST and FASTA, trade reduced accuracy for improved efficiency. These methods do not perform well for remote homology detection, because the alignment score falls into a twilight zone when the sequence similarity is below 35% at the amino acid level. Later methods address this problem by incorporating family information. These methods are based on a proper representation of protein families and can be split into two groups: generative models and discriminative algorithms. Generative models provide a probabilistic measure of association between a new sequence and a particular family. Such methods, for example the profile hidden Markov model (HMM), can be trained iteratively in a semi-supervised manner using both positively labeled and unlabeled samples of a particular family by pulling in close homologs and adding them to the positive set. Discriminative algorithms such as Support Vector Machines (SVM) provide state-of-the-art performance. In contrast to generative models, discriminative algorithms focus on learning a combination of features that discriminates between the families. These algorithms are trained in a supervised manner using both positive and negative samples to establish a discriminative model. The performance of SVM depends on the kernel function, which measures the similarity between any pair of samples. There are two approaches for deriving the kernel function. One approach is the direct kernel, which calculates an explicit sequence similarity measure.
Another approach is the feature-space-based kernel, which chooses a proper feature space, represents each sequence as a vector in that space, and then takes the inner product (or a function derived from it) between these vector-space representations as a kernel for the sequences.
The LA kernel is one of the direct kernel functions. This method measures the similarity between a pair of protein sequences by taking into account all the optimal local alignment scores with gaps between all possible subsequences. Another method is SW-PSSM, which is derived directly from explicit similarity measures and utilizes sequence profiles. Experimental results show that this kernel is superior to previously developed schemes.
SVM-Fisher is one of the early attempts with the feature-space-based kernel. This method represents each protein sequence as a vector of Fisher scores extracted from a profile hidden Markov model (HMM) for a protein family. SVM-pairwise is another successful method, in which each protein sequence is represented as a vector of pairwise similarities to all protein sequences in the training set. The SVM-I-sites method compares sequence profiles to the I-sites library of local structural motifs for feature extraction. This method was later improved by using the order and relationships of the I-site motifs. SVM-BALSA represents each protein sequence by a vector of Bayesian alignment scores associated with each training sample. The feature space of the Spectrum kernel consists of all short subsequences of length k ("k-mers"). SVM-n-peptide improves on the Spectrum kernel by using reduced amino acid alphabets to lower the dimensions of the feature vectors. The Mismatch kernel considers a k-mer to be present if a sequence contains a substring which differs from the k-mer in at most a predefined number of mismatches. The Profile kernel considers a k-mer to be present if a sequence contains a substring whose PSSM-based ungapped alignment score with the k-mer is above a predefined threshold. Lingner and Meinicke exploit the distances between k-mers and introduce a feature vector based on these distances. The feature vector of the eMOTIF kernel is based on motifs extracted with the unsupervised eMOTIF method from the eBLOCKS database. GPkernel is another motif kernel based on discrete sequence motifs, which are evolved using genetic programming. SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences and assigns a statistical measure of homology between a pair of sequences.
Multiple sequence alignments of protein sequences contain a great deal of evolutionary information. This information can be obtained by analyzing the output of PSI-BLAST [26, 27]. Since the protein sequence frequency profile is a richer encoding of a protein than the individual sequence, it is of great significance to use such evolutionary information for protein remote homology detection and fold recognition. In our previous study, we introduced a feature vector of protein sequences based on binary profiles, which contain the evolutionary information of the protein sequence frequency profiles. The protein sequence frequency profiles are converted into binary profiles with a probability threshold: when the frequency of a given amino acid is higher than the threshold, it is converted into the integral value 1; otherwise it is converted into 0. A binary profile can be viewed as a building block of proteins. It has been successfully applied in many computational biology tasks, such as domain boundary prediction, knowledge-based mean force potentials and protein binding site prediction. Although the methods based on binary profiles give encouraging results, binary profiles have several shortcomings. Firstly, the threshold used to convert the protein sequence frequency profiles into binary profiles is set by experience. Because there is no systematic method to optimize the threshold, there is no guarantee of finding the best one. Secondly, binary profiles cannot discriminate amino acids with different frequencies in the protein sequence frequency profiles. The amino acids whose frequencies are higher than the threshold are all converted into 1, which ignores the fact that these amino acids occur with different frequencies and have different importance during evolutionary processes.
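As a small illustration of the binary-profile conversion described above (toy frequency values over a reduced alphabet; not the paper's code):

```python
# Convert one column of a sequence frequency profile into a binary profile.
# Frequencies above the threshold become 1, the rest 0 (hypothetical values).
def to_binary_profile(frequencies, threshold):
    return [1 if f > threshold else 0 for f in frequencies]

column = [0.02, 0.40, 0.15, 0.01, 0.30]    # toy 5-letter alphabet column
print(to_binary_profile(column, 0.13))     # -> [0, 1, 1, 0, 1]
```

Note how 0.40, 0.15 and 0.30 all collapse to the same value 1, which is exactly the loss of frequency information criticized above.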
To overcome these shortcomings, in this study we present a novel building block of proteins called Top-n-grams to use the evolutionary information of the protein sequence frequency profiles and apply this novel building block to remote homology detection and fold recognition. The protein sequence frequency profiles calculated from the multiple sequence alignments outputted by PSI-BLAST  are converted into Top-n-grams by combining the n most frequent amino acids in each amino acid frequency profile. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram and then the corresponding vectors are inputted to SVM. In our previous studies [28, 32], we applied LSA  to protein remote homology detection. Several basic building blocks have been investigated as the "words" of "protein sequence language", including N-grams , patterns , motifs  and binary profiles . Here, we also demonstrate that the use of LSA on Top-n-grams can improve the prediction performance of both remote homology detection and fold recognition.
Results and discussion
Comparative results of various methods
[Table 1: Comparison against different methods on the SCOP 1.53 superfamily benchmark — average ROC and ROC50 scores for SVM-Top-n-gram (n = 1, 2, 3), SVM-Bprofile (Ph = 0.13), SVM-Bprofile-LSA (Ph = 0.13) and SVM-LA (β = 0.5); data not shown.]
[Table 2: Comparison against different methods on the SCOP 1.67 fold benchmark — average ROC and ROC50 scores for SVM-Top-n-gram (n = 1, 2, 3), SVM-Bprofile (Ph = 0.11) and SVM-Bprofile-LSA (Ph = 0.11); data not shown.]
The influence of n on remote homology and fold recognition
Top-n-grams are generated by combining the n most frequent amino acids in the amino acid frequency profiles (see the Methods section for details). Here, we investigate the influence of n on the prediction performance. As shown in Table 1 and Table 2, in terms of ROC50 scores our method performs well for n = 1 and n = 2, with a slight decrease for n = 3. The third most frequent amino acids in the amino acid frequency profiles are less likely to occur at the specific sequence positions during evolutionary processes, which might be the reason for the decrease in prediction performance for n = 3.
LSA can improve the performance of building-block-based methods for both remote homology and fold recognition problems
The method based on Top-n-gram significantly outperforms the other building-block-based methods and most existing methods
[Table: Comparison against different methods on the SCOP 1.53 fold benchmark; data not shown.]
One important aspect of any homology detection method is its computational efficiency. In this regard, SVM-Top-n-gram-combine-LSA is comparable with the Profile kernel and more efficient than SVM-Bprofile-LSA, SVM-Pattern-LSA, SVM-Motif-LSA, SVM-pairwise, SVM-LA and SW-PSSM, but slightly less efficient than SVM-Ngram-LSA, PSI-BLAST and the method without LSA.
[Table: Time complexities of different SVM-based methods for remote homology detection; data not shown.]
[Table: Numbers of "words" of the five building blocks for remote homology detection; data not shown.]
Correlations between Top-n-grams and protein families
One of the main advantages of our method is the possibility to analyze the correlations between Top-n-grams and protein families and reveal biologically relevant properties of the underlying protein families.
In this study, we present a novel representation of protein sequences based on Top-n-grams and apply latent semantic analysis to improve the prediction performance of both protein remote homology detection and fold recognition. Top-n-grams constitute a novel building block of protein sequences, and the analysis presented here is based on evolutionary information without using any structural information. Compared with other building-block-based methods, such as N-grams, patterns, motifs or binary profiles, additional evolutionary information is extracted from the protein sequence frequency profile. The experimental results show that this additional evolutionary information is relevant for discrimination.
Using a standard classifier, the results show that the prediction performance of our method is highly competitive with state-of-the-art methods for remote homology detection and fold recognition. Although the Profile kernel and SW-PSSM yield better results, it should be noted that both methods involve several parameters. The performance of the Profile kernel depends on a smoothing parameter, and the performance of SW-PSSM depends on three parameters: gap opening, gap extension and a continuous zero-shift parameter. The extensive use of parameters bears the risk of adapting the model to a particular test set, which complicates a fair comparison of different methods and the application of the methods to different datasets, because a new dataset is likely to require readjustment of these parameters. Furthermore, because performance can decrease for non-optimal values of these parameters, a time-consuming model selection process would be necessary for these methods to achieve optimal results. In contrast, the homogeneity of ROC scores for Top-1-grams and Top-2-grams indicates the good generalization performance of our method, which obviates the tuning of any parameter; therefore our method does not require time-consuming optimization and avoids the risk of fitting the model to the test set.
Another advantage of our approach arises from the explicit feature space representation: the possibility to measure the correlations between Top-n-grams and protein families. Compared with other building-block-based methods, Top-n-grams significantly improve the detection performance while preserving the favourable interpretability of the building-block-based methods by an explicit feature space representation. As shown in results and discussion section, our method allows the researchers to analyze the correlations between Top-n-grams and protein families and reveal biologically relevant properties of the underlying protein families. In contrast, direct-kernel-based methods without explicit feature spaces, such as SW-PSSM and SVM-LA, do not provide an intuitive insight into the associated feature space for further analysis of relevant sequence features. Therefore, these kernels do not offer additional utility for researchers who are interested in finding the characteristic features of protein families.
Top-n-grams constitute a novel building block of protein sequences, which contains the evolutionary information extracted from the protein sequence frequency profiles. The results show that this building block has significantly higher discriminative power than many other building blocks. Because this novel building block can flexibly represent proteins at both the sequence level and the residue level, it can be used to investigate the whole protein sequence as well as individual residues. Many applications of Top-n-grams are conceivable, e.g. sequence alignment, domain boundary prediction, the design of knowledge-based potentials and the prediction of protein binding sites.
SCOP 1.53 superfamily benchmark
We use a common superfamily benchmark to evaluate the performance of our method for protein remote homology detection, which is available at http://www.ccls.columbia.edu/compbio/svm-pairwise/. This benchmark has been used by many studies of remote homology detection methods [6, 11, 32], so it provides good comparability with previous methods. The benchmark contains 54 families and 4352 proteins selected from SCOP version 1.53. These proteins are extracted from the Astral database and include no pair with a sequence similarity higher than an E-value of 10^-25. Proteins with lengths less than 30 are removed, because PSI-BLAST cannot generate profiles for short sequences. For each family, the proteins within the family are taken as positive test samples, and the proteins outside the family but within the same superfamily are taken as positive training samples. Negative samples are selected from outside of the superfamily and are separated into training and test sets.
SCOP 1.67 fold benchmark
A recently established fold benchmark is used for protein fold recognition. The benchmark contains 3840 proteins and 86 superfamilies. The proteins, extracted from SCOP version 1.67, are filtered with the Astral database and contain no pair with a sequence similarity higher than 95%. Proteins with lengths less than 30 are removed. For each tested superfamily, there are at least 10 proteins in its positive training and test sets. The proteins within one superfamily are taken as positive test samples, while the others in the same fold are taken as positive training samples. The negative test samples are selected from one random superfamily in each of the other folds, and the negative training samples are selected from the remaining proteins. Because most of the proteins within a fold have a very low degree of similarity, this fold benchmark is considerably harder than the superfamily benchmark.
SCOP 1.53 fold benchmark
The SCOP 1.53 fold benchmark [10, 36] is another fold benchmark used for protein fold recognition. The benchmark contains 23 superfamilies and 4352 proteins selected from SCOP version 1.53. These proteins are extracted from the Astral database and include no pair with a sequence similarity higher than an E-value of 10^-25. Proteins with lengths less than 30 are removed. Proteins within the same superfamily are considered positive test samples, and proteins within the same fold but outside the superfamily are considered positive training samples. For each tested superfamily, there are at least 10 positive training samples and 5 positive test samples. Negative samples are chosen from outside the positive sequences' fold and split equally into test and training sets.
Generation of protein sequence frequency profiles
The pseudo-count g_i of amino acid i is calculated as:

g_i = Σ_{j=1..20} (f_j / p_j) · q_ij

where f_i is the observed frequency of amino acid i, p_j is the background frequency of amino acid j, and q_ij is the score of amino acid i being aligned to amino acid j in the BLOSUM62 substitution matrix, which is the default score matrix of PSI-BLAST.
The target frequency is then calculated with the pseudo-count as:

Q_i = (α·f_i + β·g_i) / (α + β)
where β is a free parameter set to a constant value of 10, the value initially used by PSI-BLAST, and α is the number of different amino acids in the given column minus one.
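Under the definitions above, the computation can be sketched as follows. This is a toy three-letter alphabet with hypothetical f, p and q values standing in for the 20 amino acids and the BLOSUM62-derived scores; the pseudo-count g_i is computed as the profile-weighted sum over the substitution scores, the standard PSI-BLAST-style form implied by the definitions above:

```python
# Pseudo-count g_i and target frequency Q_i for one profile column
# (toy 3-letter alphabet; f, p and q are hypothetical stand-ins for the
# observed frequencies, background frequencies and substitution scores).
def target_frequencies(f, p, q, beta=10.0):
    n = len(f)
    alpha = sum(1 for fi in f if fi > 0) - 1      # distinct residues minus one
    g = [sum(f[j] / p[j] * q[i][j] for j in range(n)) for i in range(n)]
    return [(alpha * f[i] + beta * g[i]) / (alpha + beta) for i in range(n)]

f = [0.5, 0.5, 0.0]                    # observed frequencies in the column
p = [0.25, 0.25, 0.5]                  # background frequencies
q = [[0.10, 0.05, 0.02],               # toy substitution scores q_ij
     [0.05, 0.10, 0.02],
     [0.02, 0.02, 0.20]]
Q = target_frequencies(f, p, q)
print(Q)
```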
Converting protein sequence frequency profiles into Top-n-grams
In order to use the evolutionary information of the protein sequence frequency profiles, the protein sequence frequency profiles are converted into Top-n-grams. For each amino acid frequency profile, the frequencies of the 20 standard amino acids describe the probabilities of the corresponding amino acids appearing in the specific sequence positions. The higher the frequency is, the more likely the corresponding amino acid occurs. It is reasonable to use the n most frequent amino acids to represent the amino acid frequency profiles, because these n amino acids are most likely to occur at a given sequence position during evolutionary processes. The following details how to convert protein sequence frequency profiles into Top-n-grams.
For each amino acid frequency profile, the frequencies of the 20 standard amino acids are sorted in descending order, and then the n most frequent amino acids are selected and combined according to their frequencies. We call this combination of the n amino acids a Top-n-gram. Each Top-n-gram differentiates the different frequencies of the n amino acids by their different positions in the Top-n-gram. The above process is iterated until all amino acid frequency profiles in the protein sequence frequency profile are converted into Top-n-grams. In other words, a protein sequence can be converted into k Top-n-grams, where k is the length of the protein sequence.
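As an illustration (using a toy reduced alphabet rather than the 20 standard amino acids, and hypothetical frequency values), the conversion of a frequency profile into Top-n-grams can be sketched as:

```python
# Convert each position's frequency profile into a Top-n-gram by taking
# the n most frequent amino acids in descending frequency order.
def to_top_n_grams(profile, alphabet, n):
    grams = []
    for column in profile:  # one frequency vector per sequence position
        ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
        grams.append("".join(alphabet[i] for i in ranked[:n]))
    return grams

alphabet = "ACDE"                       # toy 4-letter alphabet
profile = [[0.1, 0.6, 0.2, 0.1],        # C most frequent, then D
           [0.5, 0.1, 0.1, 0.3]]        # A most frequent, then E
print(to_top_n_grams(profile, alphabet, 2))   # -> ['CD', 'AE']
```

A protein of length k thus yields k Top-n-grams, one per position, and the order of letters within each gram encodes the frequency ranking.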
Construction of SVM classifiers and classification
The dimension of the feature vectors of Top-n-grams is 20^n. In this paper, Top-1-grams, Top-2-grams and Top-3-grams are investigated; the dimensions of their feature vectors are 20 (20^1), 400 (20^2) and 8000 (20^3), respectively. The training proteins are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram, and then the vectors are inputted to SVM to construct the classifier for a specified class. The test proteins are vectorized in the same way as the training proteins and fed into the classifier constructed for a given class to separate the positive and negative samples. The SVM assigns each protein in the test set a discriminative score which indicates the predicted positive level of the protein. The proteins with discriminative scores higher than the threshold of zero are classified as positive samples and the others as negative. The above process is iterated until each class is tested.
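The vectorization step can be sketched as follows, again over a toy alphabet (`to_feature_vector` and the sample grams are illustrative, not the paper's code): each protein becomes a fixed-dimension count vector with one entry per possible Top-n-gram.

```python
from collections import Counter
from itertools import product

# Count occurrences of each Top-n-gram to build a fixed-dimension vector.
# For the real method the alphabet has 20 letters, giving 20**n dimensions.
def to_feature_vector(top_n_grams, alphabet, n):
    counts = Counter(top_n_grams)
    # enumerate all |alphabet|**n possible grams in a fixed order
    return [counts["".join(p)] for p in product(alphabet, repeat=n)]

vec = to_feature_vector(["CD", "AE", "CD"], "ACDE", 2)
print(len(vec), sum(vec))   # 16-dimensional vector, 3 total occurrences
```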
The focus of this study is to find a suitable representation of protein sequences. In order to exclude differences owing to particular realizations of the SVM-based learning algorithm, and for best comparability with other methods, we employ the publicly available Gist SVM package version 2.1.1 (http://svm.sdsc.edu). The SVM parameters are left at the Gist defaults, except that the kernel function is set to the Radial Basis Function (RBF).
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text . Recently, LSA was introduced in computational biology, such as protein secondary structure prediction  and protein remote homology detection . The process of LSA is as follows:
Firstly, a word-document matrix W of co-occurrences between words and documents is constructed. The elements of W indicate the number of times each word appears in each document. The dimensions of W are M × N, where M is the total number of words and N is the number of given documents. Each word count is normalized to compensate for differences in document lengths and in the overall counts of different words in the document collection.
Secondly, in order to recognize related or synonymous words and reduce the dimensions, singular value decomposition (SVD) is performed on the word-document matrix W. Let K be the rank of W. W is decomposed into three matrices by SVD:

W = U S V^T
where U is the left singular matrix with dimensions M × K, S is the diagonal matrix of singular values (s_1 ≥ s_2 ≥ ... ≥ s_K > 0) with dimensions K × K, and V is the right singular matrix with dimensions N × K.
Thirdly, the dimensions of the solution are reduced by discarding the smaller singular values in S and ignoring the corresponding columns of U (rows of V). Only the top R (R ≪ min(M, N)) dimensions, for which the singular values in S are higher than a threshold, are selected for further processing. Therefore, the dimensions of U, S and V are reduced to M × R, R × R and N × R, leading to noise removal and data compression. Values of R ranging from 200 to 300 are typically used for information retrieval. In this study, the best results are achieved when R takes a value around 300. For a test document which is not in the training set, the unseen document would have to be added to the original training set and the latent semantic analysis model recomputed. Because SVD is a computationally expensive process, it is not practical to perform SVD every time a new test document arrives. Instead, the test document vector t can be approximated from the mathematical properties of the matrices U, S and V as follows:

t = dU
where d is the original vector of the test document, which is similar to the columns of the matrix W.
By performing SVD on the word-document matrix W, the column vectors of W are projected onto the orthonormal basis formed by the column vectors of the left singular matrix U. The columns of SV^T give the coordinates of the vectors in this basis. Thus, the column vectors S·v_j^T, or equivalently the row vectors v_j·S, characterize the position of document d_j in the R-dimensional space. Each vector v_j·S is a document vector uniquely associated with a document in the training set.
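The pipeline described above — build W, factor it with SVD, truncate to rank R, and fold in an unseen document via t = dU — can be sketched with NumPy (toy dimensions; the real word-document matrix has one row per Top-n-gram and one column per training protein):

```python
import numpy as np

# Toy word-document matrix W (M words x N documents).
W = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U S V^T

R = 2                       # keep the top-R singular values (noise removal)
U_r = U[:, :R]              # M x R
S_r = np.diag(s[:R])        # R x R
V_r = Vt[:R, :].T           # N x R

# Training documents in the latent space: rows of V S.
doc_vectors = V_r @ S_r

# Fold in an unseen document d (raw word counts) without recomputing SVD:
d = np.array([1., 0., 2., 0.])
t = d @ U_r                 # latent representation, t = dU
print(t.shape)              # (2,)
```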
In this study, Top-n-grams are treated as the "words" and the protein sequences are viewed as the "documents". By collecting the weight of each word in the documents, the word-document matrix is constructed, and latent semantic analysis is then performed on the matrix to produce the latent semantic representation vectors of the protein sequences, leading to noise removal and a compact description of the protein sequences. The latent semantic representation vectors are then inputted into SVM to give the final results.
Two measures are used to evaluate the quality of the methods: the Receiver Operating Characteristic (ROC) score and the ROC50 score. A ROC score is the normalized area under a curve that plots true positives against false positives for different classification thresholds. A score of 1 denotes perfect separation of positive samples from negative ones, whereas a score of 0 indicates that none of the sequences selected by the algorithm is positive. A ROC50 score is the area under the ROC curve up to the first 50 false positives.
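As a minimal sketch of how a ROC50 score can be computed from ranked discriminative scores (this is illustrative, not the evaluation code used in the paper; it assumes binary labels and that higher scores rank first):

```python
# ROC50: area under the ROC curve up to the first 50 false positives,
# normalized so that a perfect ranking of all positives scores 1.
def roc_n(scores, labels, max_fp=50):
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    tp = fp = area = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
            area += tp          # one rectangle of height tp per false positive
            if fp == max_fp:
                break
    n_pos = sum(labels)
    denom = n_pos * min(max_fp, len(labels) - n_pos)
    return area / denom if denom else 0.0

print(roc_n([0.9, 0.8, 0.7, 0.1], [1, 1, 0, 0]))   # perfectly ranked -> 1.0
```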
To determine whether two methods have statistically different ROC or ROC50 scores on a particular benchmark, the Wilcoxon signed-rank test is used. The p-values are Bonferroni-corrected for multiple comparisons.
The chi-square statistic between a Top-n-gram t and a protein family c is calculated as:

χ²(t, c) = N·(A·D − C·B)² / ((A + C)·(B + D)·(A + B)·(C + D))

where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of protein sequences.
The statistic is then averaged over all protein families:

χ²_avg(t) = Σ_{i=1..m} P_r(c_i) · χ²(t, c_i)

where P_r(c_i) is the probability of protein family c_i and m is the total number of protein families.
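Assuming the correlation measure is the standard chi-square co-occurrence statistic (the quantities A, B, C, D and N above are exactly its 2×2 contingency-table entries; the precise statistic is an assumption here), it can be sketched as:

```python
# Chi-square statistic between a Top-n-gram t and a protein family c,
# computed from the 2x2 contingency table A, B, C, D described above
# (assumed to be the standard chi-square co-occurrence measure).
def chi_square(A, B, C, D):
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def avg_chi_square(tables, priors):
    # priors[i] is Pr(c_i); tables[i] is (A, B, C, D) for family c_i
    return sum(p * chi_square(*t) for p, t in zip(priors, tables))

print(chi_square(10, 0, 0, 10))   # perfect co-occurrence -> 20.0 (= N)
```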
The authors would like to acknowledge insightful discussions with Chengjie Sun, Deyuan Zhang and Minghui Li, and helpful suggestions by the anonymous reviewers. We also thank Pal Saetrom, Tony Handstad and Huzefa Rangwala for providing the results of their work. Special thanks go to Yuanhang Meng for his comments, which significantly improved the presentation of the paper. Financial support was provided by the National Natural Science Foundation of China (60673019 and 60435020) and the National High Technology Development 863 Program of China (2006AA01Z197).
- Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol 1981, 147(1):195–197.
- Noble WS, Kuang R, Leslie C, Weston J: Identifying remote protein homologs by network propagation. The FEBS Journal 2005, 272(20):5119–5128.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. J Mol Biol 1990, 215(3):403–410.
- Pearson WR: Rapid and Sensitive Sequence Comparison with Fastp and Fasta. Methods Enzymol 1990, 183:63–98.
- Rost B: Twilight zone of protein sequence alignments. Protein Eng 1999, 12(2):85–94.
- Lingner T, Meinicke P: Remote homology detection based on oligomer distances. Bioinformatics 2006, 22(18):2224–2231.
- Karplus K, Barrett C, Hughey R: Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics 1998, 14(10):846–856.
- Qian B, Goldstein RA: Performance of an Iterated T-HMM for Homology Detection. Bioinformatics 2004, 20(14):2175–2180.
- Vapnik VN: Statistical Learning Theory. New York; 1998.
- Rangwala H, Karypis G: Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005, 21(23):4239–4247.
- Saigo H, Vert JP, Ueda N, Akutsu T: Protein Homology Detection Using String Alignment Kernels. Bioinformatics 2004, 20(11):1682–1689.
- Jaakkola T, Diekhans M, Haussler D: A Discriminative Framework for Detecting Remote Protein Homologies. J Comput Biol 2000, 7(1–2):95–114.
- Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003, 10(6):857–868.
- Hou Y, Hsu W, Lee ML, Bystroff C: Efficient Remote Homology Detection Using Local Structure. Bioinformatics 2003, 19(17):2294–2301.
- Hou Y, Hsu W, Lee L, Bystroff C: Remote homolog detection using local sequence-structure correlations. Proteins 2004, 57:518–530.
- Webb-Robertson B-J, Oehmen C, Matzke M: SVM-BALSA: Remote homology detection based on Bayesian sequence alignment. Comput Biol Chem 2005, 29(6):440–443.
- Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pac Symp Biocomput 2002, 564–575.
- Ogul H, Mumcuoglu EU: A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets. BioSystems 2007, 87(1):75–81.
- Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics 2004, 20(4):467–476.
- Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. J Bioinform Comput Biol 2005, 3:527–550.
- Ben-Hur A, Brutlag D: Remote homology detection: A motif based approach. Bioinformatics 2003, 19(Suppl 1):i26–i33.
- Nevill-Manning CG, Wu TD, Brutlag DL: Highly Specific Protein Sequence Motifs for Genome Analysis. Proc Natl Acad Sci USA 1998, 95(11):5865–5871.
- Su QJ, Lu L, Saxonov S, Brutlag DL: eBLOCKS: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Res 2005, 33:D178–D182.
- Håndstad T, Hestnes AJ, Sætrom P: Motif kernel generated by genetic programming improves remote homology and fold detection. BMC Bioinformatics 2007, 8:23.
- Shah AR, Oehmen CS, Webb-Robertson B-J: SVM-HUSTLE – an iterative semi-supervised machine learning approach for pairwise protein remote homology detection. Bioinformatics 2008, 24(6):783–790.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res 1997, 25(17):3389–3402.
- Dowd SE, Zaragoza J, Rodriguez JR, Oliver MJ, Payton PR: Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST). BMC Bioinformatics 2005, 6:93.
- Dong Q, Lin L, Wang XL: Protein Remote Homology Detection Based on Binary Profiles. Proc 1st International Conference on Bioinformatics Research and Development (BIRD), Germany 2007, 212–223.
- Dong Q, Wang X, Lin L, Xu Z: Domain boundary prediction based on profile domain linker propensity index. Comput Biol Chem 2006, 30(2):127–133.
- Dong Q, Wang X, Lin L: Novel knowledge-based mean force potential at the profile level. BMC Bioinformatics 2006, 7:324.
- Dong Q, Wang X, Lin L, Guan Y: Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics 2007, 8:147.
- Dong QW, Wang XL, Lin L: Application of Latent Semantic Analysis to Protein Remote Homology Detection. Bioinformatics 2006, 22(3):285–290.
- Bellegarda J: Exploiting Latent Semantic Information in Statistical Language Modeling. Proc IEEE 2000, 88(8):1279–1296.
- Dong Q, Lin L, Wang XL, Li MH: A Pattern-Based SVM for Protein Remote Homology Detection. 4th International Conference on Machine Learning and Cybernetics, Guangzhou, China 2005, 3363–3368.
- Damoulas T, Girolami MA: Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 2008, 24(10):1264–1270.
- Supplementary data for "Profile-based direct kernels for remote homology detection and fold recognition" [http://bioinfo.cs.umn.edu/supplements/remote-homology/]
- Shawe-Taylor J, Cristianini N: Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press; 2000.
- Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 1998, 14: 55–67.View ArticlePubMedGoogle Scholar
- Floratos A, Rigoutsos I: Research Report On the Time Complexity of the TEIRESIAS Algorithm. 98A000290 1998.Google Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2: 28–36.PubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers. UCSD Technical Report CS94–351
- Landauer TK, Foltz PW, Laham D: Introduction to Latent Semantic Analysis. Discourse Processes 1998, 25: 259–284.View ArticleGoogle Scholar
- Yang Y, Pedersen JA: A comparative study on feature selection in text categorization. In 14th international conference on machine learning. San Francisco, USA; 1997:412–420.Google Scholar
- Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res 2000, 28(1):254–256.PubMed CentralView ArticlePubMedGoogle Scholar
- Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998, 14(5):423–429.View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Position-Based Sequence Weights. J Mol Biol 1994, 243(4):574–578.View ArticlePubMedGoogle Scholar
- Ganapathiraju M, Klein-Seetharaman J, Balakrishnan N, Reddy R: Characterization of protein secondary structure, Application of latent semantic analysis using different vocabularies. IEEE Signal Processing Magazine 2004, 21: 78–87.View ArticleGoogle Scholar
- Gribskov M, Robinson NL: Use of Receiver Operating Characteristic (Roc) Analysis to Evaluate Sequence Matching. Comput Chem 1996, 20(1):25–33.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.