Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection
© Jung et al; licensee BioMed Central Ltd. 2008
Received: 08 January 2008
Accepted: 01 July 2008
Published: 01 July 2008
Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement.
The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognizes 30% of homologous proteins at > 0.99 ROC scores, while the original PPA features, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At > 0.90 ROC50 scores, 25% of proteins with NMF features correctly detect remotely related proteins, whereas with the original PPA features only 1% of proteins detect remote homologs. In addition, we investigate the effect of the number of positive training examples and the number of basis vectors on performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins.
The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.
Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. Due to the non-negativity constraint, the parts produced by NMF can be interpreted as subsets of elements that tend to occur together in sub-portions of the dataset. In this way, NMF can be applied to multidimensional datasets in order to discover patterns and to help interpret large biological datasets. This unique ability makes NMF a potentially promising method for biological sequence analysis.
Proteins are said to have a common fold if they share a similar spatial arrangement of major secondary structures. Proteins in the same fold may have low sequence similarity, but they often share similar functions. Fold recognition aims to detect the group of proteins that share a common fold with a query protein. It can provide valuable information about the functional role and structure of unknown proteins.
In general, there have been two common approaches for remote homolog detection. The first approach is solely based on sequence information, whereas the second approach uses structural information in addition to sequence information. The Hidden Markov Model (HMM) method, PSI-BLAST, FFAS, Picasso, and COMPASS can be classified into the first approach. GenTHREADER, 3D-PSSM, FUGUE, and PROSPECT [11, 12] represent the second approach. Currently most remote homolog detection methods are based on profile-profile alignment (PPA). Some examples are the SP3 method, ProfNet, COACH, and the HMM-HMM comparison method (HHsearch). Although these methods are reliable for recognizing relatively close homologs related at the family level, there is still difficulty in finding related remote homologs, reaching only 25% sensitivity at 90% specificity at the superfamily level and very low sensitivity at the fold level.
Recently, the introduction of the support vector machine (SVM), a machine learning method, has brought remarkable performance improvement in remote homolog searches. Examples are SVM-HMMSTR, SVM-I-sites, SVM-pairwise, SVM-Fisher, and profile-profile alignment with SVM. More recently, several kernel methods such as local alignment kernels, profile-based direct kernels, and cluster kernels have been developed for more powerful remote homolog detection. Among them, the method based on profile-profile alignment combined with SVM detects 14% of remotely related proteins with 90% specificity at the fold level. Even though previous SVM-based methods are able to recognize the essential features from alignments of remotely related proteins, they do not provide features with intuitive biological meaning. In addition, the dimension of the profile-profile alignment feature vectors for SVM is quite high, considering the number of intrinsic feature vectors for fold recognition. In such cases, the effect referred to as the curse of dimensionality may occur and negatively influence the classification of a given data set.
Feature extraction techniques can be applied to reduce this problem. There are several linear feature extraction techniques, such as principal component analysis (PCA), independent component analysis (ICA), and multidimensional scaling (MDS), as well as nonlinear techniques such as ISOMAP, LLE, and self-organized feature maps. Among them, nonnegative matrix factorization (NMF) is a linear technique that is characterized by its unique ability of intuitive part-based representation of the original features. In previous studies, NMF has been applied to biclustering of gene expression data and to discovering semantic features. As an attempt to popularize the NMF method in the biological data analysis community, LS-NMF and bioNMF have been developed.
In this paper, we investigate the possibility of applying NMF to the profile-profile alignment features used for fold recognition and remote homolog detection. We expect that NMF will extract essential features from the profile-profile alignment (PPA), improving the performance of fold recognition and identifying remote homolog relationships more accurately. The PPA features have two characteristics that make them appropriate for NMF. First, not all PPA features are needed for recognition of each fold. Instead, a small portion of the sequence is usually enough for each decision, while some portions coming from poor alignments or improper profiles may act as "noises." This suggests that NMF can improve the performance of SVM by explicitly using a part-based representation of essential features with lower dimensionality. Second, the PPA score is essentially the sum of log odds scores, which can be decomposed linearly. The assumption of linear decomposition used in NMF fits well with this characteristic.
Performance comparison for fold recognition at the fold level
In this section, we describe the fold recognition performance of SVM with NMF features compared to that of PPA features, along with HHsearch and PSI-BLAST results. In this work, we define the fold recognition problem in the most rigorous way, in that we only consider the situation where no proteins from the same superfamily as the query protein are in the template library. To validate our method, we create a training set (2437 proteins) and a testing set (630 proteins and 34 folds) using SCOP 1.67 in such a way that the two sets do not share the same superfamily members.
The mean ROC50 scores at the fold and superfamily level
Performance comparison at the superfamily level (remote homolog detection)
Benchmarking with LSTM, LA-kernel, and SW-PSSM at the fold and superfamily level
Table 2. Performance comparison between the present method (NMF), LSTM, LA-kernel, and SW-PSSM
LSTM is a fast model-based protein homology detection method without alignment, and LA-kernel is an SVM-based method using a string alignment kernel (please see Availability & requirements below). The best performing method, SW-PSSM, is a profile-based local alignment kernel method (please see Availability & requirements below). We measure the ROC and ROC50 scores for 34 folds at the fold level and 95 superfamilies at the superfamily level. For LSTM, we use the default parameters and weight (-c lstmpars.mem12.ws12.txt and -w weight.mat), and for the LA-kernel we use version 0.3.2 with β = 0.5 (the recommended value). For SW-PSSM, we use two parameter sets: the default parameter set (gap opening = 2.0, gap extension = 10, zero-shift = 0.0) and another parameter set (gap opening = 3.0, gap extension = 0.75, zero-shift = 1.5) that was reported to be best-performing in the original paper. From Table 2, we note that the performance of the SVM output with NMF features is better than that of LSTM and LA-kernel, while SW-PSSM shows slightly better or comparable performance compared to the SVM output with NMF features. It should be noted, however, that our method and the other methods are not directly comparable, and we have not tried to develop a better scoring scheme because that is not the main objective of this study. Furthermore, because our method produces more reliable similarity scores between the sequences than conventional PSSM methods, it may be possible to develop a more accurate new kernel-based method using our SVM output scores.
Variation of performance improvement as a function of the number of positive training examples
Performance variation with number of NMF basis vectors
When we choose 70 as the number of NMF basis vectors, this fixed value may cause a problem for templates whose sequence lengths are less than 70. In those cases, some of the NMF basis vectors can be either duplicated or become zero vectors, possibly leading to performance degradation or improvement for NMF cases. However, the ratio of such events is only 3.2%, so this small effect can be ignored. Furthermore, our experiments show that performance degradation occurs for templates with long sequences rather than for those with short sequences. Further investigation is needed to determine the relationship between the number of NMF basis vectors and effective feature extraction.
NMF basis vectors overlap with functional sites and structurally conserved regions of proteins
The main assumption of this work is that NMF can capture essential features of proteins. In this regard, NMF can reduce redundant regions in each protein, and its basis vectors provide useful information about proteins such as functional sites or structurally conserved regions. To verify our conjecture, we conduct a statistical analysis of the basis vectors by comparing them with functional sites of proteins and structurally conserved regions.
Due to the part-based representation, NMF basis vectors consist of several blocks of nonzero scores. We compare blocks of NMF basis vectors with functional sites from the PROSITE database and structurally conserved regions from multiple structural alignments. To validate NMF's ability to detect functional sites and structurally conserved regions, we make two types of random basis vectors. A 'block random basis vector' is a vector in which nonzero blocks are randomly re-distributed along the vector, whereas a 'point random basis vector' is a vector in which nonzero scores are randomly re-distributed. We also create 'PPA vectors', which are composed of the top 5% PPA scores.
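The two kinds of random control vectors can be sketched in a few lines of Python. This is only one possible reading of "randomly re-distributed": the function names `point_random` and `block_random` are hypothetical, and the block version assumes that a block is a maximal run of nonzero scores that is moved as a unit while zeros are moved individually.

```python
import random


def point_random(vector, rng):
    """Shuffle every score independently (a 'point random basis vector')."""
    shuffled = list(vector)
    rng.shuffle(shuffled)
    return shuffled


def block_random(vector, rng):
    """Shuffle whole nonzero runs (a 'block random basis vector').

    Assumption: each maximal run of nonzero scores is one segment and each
    zero is its own segment; the segments are then permuted at random.
    Two blocks may end up adjacent after shuffling.
    """
    segments, run = [], []
    for value in vector:
        if value != 0:
            run.append(value)
        else:
            if run:
                segments.append(run)
                run = []
            segments.append([value])
    if run:
        segments.append(run)
    rng.shuffle(segments)
    return [value for segment in segments for value in segment]
```

Both functions preserve the multiset of scores, so any difference in overlap with annotated regions comes only from where the scores are placed.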
From the two statistical analyses on functional sites and structurally conserved regions, we note that NMF basis vectors represent essential parts of sequences. The ability of an SVM to recognize remote homologs depends on its feature vectors. SVMs are trained well if the feature vectors consist only of features that are essential for remote homolog detection. In this regard, NMF reconstructs the original PPA feature vectors with noise reduction and preserves core regions such as functionally important sites and structurally conserved regions. Therefore, NMF improves the performance of fold recognition.
In this work, we investigate the possibility of applying NMF to improve the profile-profile alignment features for fold recognition and remote homolog detection. We show that when the NMF feature extraction method is used, the performance is greatly improved compared to previous fold recognition algorithms. The main reason for the improvement is that the NMF feature extraction method reduces "noises" and extracts essential features from PPAs. Due to this noise reduction property, SVM with NMF features shows a great performance improvement for fold recognition at the fold level and remote homolog detection at the superfamily level. We also find that the improvement is bigger when the data set is larger, and that the number of basis vectors needs to be optimized for the best performance. As evidence for NMF's ability to extract the essential features from sequences, we find a close relationship between NMF basis vectors and functional sites or structurally conserved regions of proteins. This supports our conjecture that NMF basis vectors explicitly represent essential features of proteins.
Feature extraction using NMF gives us an intuitive understanding of the feature vectors. We can extract more useful feature vectors and analyze them to better understand the biological meaning of protein sequences, which makes the NMF feature extraction method a promising tool not only for fold recognition but also for the analysis of large-scale biological data. Furthermore, as we point out in the Results section, our method produces more accurate similarity scores between the sequences than conventional PSSM methods, which would allow us to develop a more accurate kernel-based method based on our method.
In future work, we can use NMF to improve alignment quality. In the fold recognition problem, improvement in sequence alignment accuracy remains a challenge, as existing methods still do not always reach the level of the best alignment possible. Accurate sequence alignments undoubtedly increase the performance of fold recognition. Our results indicate that NMF methods can remove false alignments by allowing only a combination of essential features in the profile-profile alignment. In this regard, we believe that NMF can be a promising method to improve sequence alignment accuracy. Furthermore, by analyzing NMF basis vectors, we can extract intuitive information from protein sequences, which may be used for motif search.
We construct the template library based on the SCOP ASTRAL Compendium version 1.67. Proteins in the library share less than 40% sequence identity with each other. The domains in the classes a, b, c, d, and e are used, and discontinuous domains are removed. For fold recognition we randomly divide all templates into a training set (2437 proteins) and a testing set (630 proteins and 34 folds), such that the two sets do not contain members of the same superfamily. In this setting, protein domains within the same fold are considered as positive training examples. For remote homolog detection we also randomly divide all templates into a training set (2342 proteins) and a testing set (435 proteins and 94 superfamilies), such that the two sets do not share the same family members. For both fold recognition and remote homolog detection, protein domains outside the same fold are considered as negative examples.
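The constraint that the training and testing sets never share a superfamily (or a family, for remote homolog detection) amounts to a group-wise split. The following Python sketch illustrates the idea only; the function name `split_by_group`, the `test_fraction` parameter, and the input format are assumptions, and the sketch does not enforce the further requirement that each test fold still has positive examples in the training set.

```python
import random
from collections import defaultdict


def split_by_group(domains, test_fraction, rng):
    """Split domains so that no group appears in both sets.

    domains: list of (domain_id, group_id) pairs, where group_id is a
    superfamily for fold recognition or a family for remote homolog
    detection (hypothetical input format).
    """
    groups = defaultdict(list)
    for domain_id, group_id in domains:
        groups[group_id].append(domain_id)
    keys = sorted(groups)          # sort first so the shuffle is reproducible
    rng.shuffle(keys)
    n_test = max(1, int(len(keys) * test_fraction))
    test = [d for g in keys[:n_test] for d in groups[g]]
    train = [d for g in keys[n_test:] for d in groups[g]]
    return train, test
```

Assigning whole groups rather than individual domains is what prevents close relatives of a test protein from leaking into the training set.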
Profile-profile alignment feature vectors and feature extraction with NMF
The score for aligning position i of a template q with position j of a template t is

$$m_{ij} = \sum_{k=1}^{20} \left[ f^{q}_{ik} S^{t}_{jk} + S^{q}_{ik} f^{t}_{jk} \right]$$

where $f^{q}_{ik}$, $f^{t}_{jk}$, $S^{q}_{ik}$, and $S^{t}_{jk}$ are the frequencies (i.e. $f^{q}_{ik}$ and $f^{t}_{jk}$) and the PSSM scores (i.e. $S^{q}_{ik}$ and $S^{t}_{jk}$) of amino acid k, at position i of a template q and position j of a template t, respectively. If gaps occur, fixed negative scores are assigned. For each template of length n in the training set, alignments with the other templates in the training set are generated. These alignments are then each transformed into an (n+2)-dimensional feature vector, (sa_1, sa_2, sa_3, ..., sa_n, total_score, sequence_length), where sa_i, total_score, and sequence_length are the profile-profile alignment score at position i of the template, the total profile-profile alignment score, and the length of the template, i.e., n, respectively. The total score and sequence length are normalized when the SVM is trained.
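As an illustration of how one alignment becomes an (n+2)-dimensional feature vector, here is a minimal Python sketch. Several things in it are assumptions rather than details from the text: the per-position score uses a symmetric frequency-times-PSSM form, the gap score value `GAP_SCORE` is a placeholder, and the total score is taken to be the sum of the per-position scores.

```python
GAP_SCORE = -1.0  # hypothetical value; the text only says "fixed negative scores"


def position_score(f_q, s_q, f_t, s_t):
    """Score one aligned column pair (assumed symmetric form).

    f_q, f_t: amino acid frequencies of the two profiles at the aligned
    positions; s_q, s_t: the corresponding PSSM scores. Each is a sequence
    with one entry per amino acid.
    """
    return (sum(f * s for f, s in zip(f_q, s_t))
            + sum(s * f for s, f in zip(s_q, f_t)))


def feature_vector(position_scores, aligned):
    """Return (sa_1, ..., sa_n, total_score, sequence_length).

    position_scores: one score per template position.
    aligned: one boolean per template position, False where the position
    is aligned to a gap and therefore receives GAP_SCORE.
    """
    sa = [score if ok else GAP_SCORE
          for score, ok in zip(position_scores, aligned)]
    return sa + [sum(sa), float(len(sa))]
```

For a template of length 3 with a gap at the second position, `feature_vector([1.0, 2.0, 3.0], [True, False, True])` yields a vector of length 5 whose last entry is the template length.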
Next, NMF is applied to these feature vectors to build the NMF feature vectors by performing matrix factorization. At the matrix factorization step, the total score and sequence length in the original feature vectors are excluded (leaving n-dimensional feature vectors), and are later added to the NMF feature vectors. All n-dimensional feature vectors are placed in the columns of an n × m matrix V, where m is the number of training examples in the dataset. Given a nonnegative matrix V, we find nonnegative matrix factors W and H such that:

$$V \approx W H$$
The matrix is then approximately factorized into an n × r matrix W and an r × m matrix H, where W is a set of r basis vectors and H is a set of coefficient vectors for the m training examples. Before NMF, we need to transform the n-dimensional feature vectors using a sigmoid function to make the matrix V nonnegative. The sigmoid function changes the range of the original PPA feature vectors from [-5, 5] to [0, 1]. Transforming feature vectors with a sigmoid function gives better results than adding the constant bias score 5 to the n-dimensional feature vectors, indicating that the sigmoid function is more appropriate for preserving the original feature vector space. NMF is conducted using the recently proposed projected gradient method instead of the original multiplicative learning rule for faster convergence. We add two features to the r-dimensional nonnegative feature vectors: the total score and the sequence length, which are normalized to have zero mean and unit variance. The final feature vector is therefore an (r+2)-dimensional coefficient vector. Figure 10 summarizes the modified procedure of feature extraction using NMF.
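The feature extraction steps can be sketched end to end in plain Python. Note two deliberate substitutions: the sketch uses the multiplicative update rule of Lee and Seung rather than the projected gradient method used in this work, because it is shorter to write down, and it assumes the standard logistic function as the sigmoid. The helper names (`matmul`, `nmf`, `nmf_features`) and the iteration count are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sigmoid(x):
    """Standard logistic function (assumed form of the sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def nmf(v, r, iters=200, seed=0, eps=1e-9):
    """Factorize nonnegative V (n x m) as W (n x r) times H (r x m).

    Multiplicative updates are a stand-in for the projected gradient
    method named in the text.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return w, h


def standardize(values):
    """Scale to zero mean and unit variance (constant input maps to zeros)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values)) or 1.0
    return [(x - mean) / std for x in values]


def nmf_features(ppa_vectors, total_scores, lengths, r):
    """Return the basis W and one (r+2)-dimensional vector per example."""
    # Each example becomes one column of V after the sigmoid transform.
    v = transpose([[sigmoid(x) for x in vec] for vec in ppa_vectors])
    w, h = nmf(v, r)
    z_total, z_len = standardize(total_scores), standardize(lengths)
    return w, [coeffs + [zt, zl]
               for coeffs, zt, zl in zip(transpose(h), z_total, z_len)]
```

The columns of W play the role of the basis vectors discussed in the text, and each column of H, extended by the two standardized extra features, is the vector passed to the SVM.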
SVM is implemented using SVM-light (please see Availability & requirements below), a freely available SVM software package, with the radial basis function as the kernel, $K(x, y) = \exp(-\gamma \lVert x - y \rVert^{2})$. We use the default options for SVM except that the value of γ is fixed to 0.005 for the original PPA features as in the previous work, and 0.0055 is chosen for the NMF features after trying several values of γ (0.001, 0.005, 0.0055, 0.01, 0.05, and 0.1). To each SVM output we add the mean output score of the SVMs that belong to the same fold. This modification reduces the variance of the SVM outputs, which stabilizes the performance of the SVMs.
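The kernel and the fold-mean adjustment of the SVM outputs can be illustrated as follows. This is a sketch under assumptions: it takes the radial basis function in its usual form exp(-γ‖x − y‖²), and it reads the adjustment as "one SVM per template, and each raw output is increased by the mean raw output of all SVMs whose templates belong to the same fold". The function names are hypothetical and no SVM-light call is shown.

```python
import math
from collections import defaultdict


def rbf_kernel(x, y, gamma=0.0055):
    """Radial basis function kernel; 0.0055 is the value chosen for NMF features."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def add_fold_mean(outputs, folds):
    """Add to each SVM output the mean output of the SVMs in the same fold.

    outputs[i]: raw output of the SVM trained for template i on one query.
    folds[i]: fold label of template i.
    """
    by_fold = defaultdict(list)
    for output, fold in zip(outputs, folds):
        by_fold[fold].append(output)
    means = {fold: sum(vals) / len(vals) for fold, vals in by_fold.items()}
    return [output + means[fold] for output, fold in zip(outputs, folds)]
```

Under this reading, a single template with an unusually high or low output is pulled toward the consensus of its fold, which is one way to see why the variance of the outputs decreases.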
Testing and performance assessment
We generate the profile-profile alignments between the test proteins in the testing set and the templates in the training set, and transform them into feature vectors, which are then evaluated by the trained SVMs to produce outputs. Fold recognition performance is measured by the receiver operating characteristic (ROC) scores and the ROC50 scores. The ROC score is defined as the area under the ROC curve, the plot of true positives as a function of the number of false positives. At the fold level, proteins in the same fold but in different superfamilies are identified as homologs. At the superfamily level, proteins in the same superfamily but in different families are considered as homologs. Otherwise, proteins in different folds are defined as non-homologs.
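For readers who want to reproduce the scoring, the ROC and ROC50 scores can be computed from a ranked list as in the Python sketch below. It follows the usual definition of ROC_n (the normalized area under the curve up to the first n false positives); the function name `roc_n` is hypothetical, ties are broken by input order, and when fewer than n negatives exist the sketch normalizes by the number of negatives actually present, which is one convention among several.

```python
def roc_n(scores, labels, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: SVM outputs, higher means more likely a homolog.
    labels: True for homologs, False for non-homologs.
    n=None gives the full ROC score; n=50 gives ROC50.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    limit = total_neg if n is None else min(n, total_neg)
    if total_pos == 0 or limit == 0:
        return 0.0
    true_pos = false_pos = area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    return area / (limit * total_pos)
```

A perfect ranking gives 1.0 for both scores, while ROC50 penalizes only the errors near the top of the list.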
Comparison of NMF basis vectors with functionally important sites and structurally conserved regions
To define functional sites, we use ScanProsite, which provides predicted functional sites by scanning protein sequences for patterns of functional sites stored in the PROSITE database. We use MUSTANG_v.3 for multiple structural alignments and define regions that are over 95% structurally conserved in each fold as structurally important sites.
If we apply NMF to all the alignments between a query and the proteins in both the positive and negative sets, it becomes difficult to decide which basis vectors represent the biologically meaningful features of the query protein. Therefore, for meaningful analysis of NMF basis vectors, we extract the basis vectors by applying NMF only to alignments with sequences in the positive set. We create the positive set not only from the training set but also from testing set. As a result, there are 3118 new alignments, from which new NMF basis vectors are calculated.
We use the top 5% values of nonzero scores in basis vectors to eliminate outlier scores. A functional or structural region is considered to be matched if the portion of the region that corresponds to a block exceeds a cutoff value of 50%. We find that the choice of the cutoff value is not very critical. For example, when we change the cutoff value from 50% to 65%, the number of matched regions remains virtually the same. To validate NMF's ability to match biologically meaningful regions, we create two types of basis vectors for comparison. The first type is called 'PPA vectors', which are composed of the top 5% of raw PPA scores. The other type consists of two different kinds of random basis vectors: the 'block random basis vector' and the 'point random basis vector.' In a 'block random basis vector', nonzero blocks of the NMF feature vectors are randomly re-distributed, whereas in a 'point random basis vector', all nonzero scores of the NMF feature vectors are randomly re-distributed. We use 'block random basis vectors' because of a unique structural property of NMF feature vectors: they typically consist of several blocks of nonzero scores. Therefore, it is more meaningful to compare NMF feature vectors with 'block random basis vectors'. As shown in Figure 8 and Figure 9, 'point random basis vectors' have the fewest overlaps with functional sites or structurally conserved regions, since functional motifs and structurally conserved sites are usually present as blocks.
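The matching rule can be made concrete with a short Python sketch. It rests on one interpretation of the text: positions holding the top 5% of nonzero scores are kept, consecutive kept positions form blocks, and a region counts as matched when a single block covers more than the cutoff fraction of the region. The function names and the half-open `(start, end)` interval convention are assumptions.

```python
import math


def top_fraction_mask(vector, fraction=0.05):
    """Mark positions whose score is among the top `fraction` of nonzero scores."""
    nonzero = sorted((v for v in vector if v > 0), reverse=True)
    if not nonzero:
        return [False] * len(vector)
    keep = max(1, math.ceil(len(nonzero) * fraction))
    threshold = nonzero[keep - 1]
    return [v > 0 and v >= threshold for v in vector]


def blocks(mask):
    """Return half-open (start, end) intervals of consecutive True positions."""
    found, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(mask)))
    return found


def region_matched(region, block_list, cutoff=0.5):
    """True if some block covers more than `cutoff` of the region."""
    r_start, r_end = region
    length = r_end - r_start
    for b_start, b_end in block_list:
        overlap = min(r_end, b_end) - max(r_start, b_start)
        if overlap > 0 and overlap / length > cutoff:
            return True
    return False
```

Counting how many annotated regions return True for NMF basis vectors, and comparing that count with the counts for the control vectors, reproduces the kind of comparison described here.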
Availability & requirements
LSTM is downloaded from: http://www.bioinf.jku.at/software/LSTM_protein/
LA-kernel is downloaded from: http://sunflower.kuicr.kyoto-u.ac.jp/~hiroto/project/homology.html
SW-PSSM is downloaded from: http://bioinfo.cs.umn.edu/supplements/profile-kernels
SVM-light is downloaded from: http://svmlight.joachims.org/
This work is supported by the CHUNG Moon Soul Center for BioInformation and BioElectronics (CMSC), and by the Korea Science and Engineering Foundation (grant number: 2007-04158).
- Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401(6755):788–791. doi:10.1038/44565
- Pascual-Montano A, Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Marqui RD: bioNMF: a versatile tool for non-negative matrix factorization in biology. BMC Bioinformatics 2006, 7: 366. doi:10.1186/1471-2105-7-366
- Karplus K, Barrett C, Cline M, Diekhans M, Grate L, Hughey R: Predicting protein structure using only sequence information. Proteins 1999, Suppl 3: 121–125. doi:10.1002/(SICI)1097-0134(1999)37:3+<121::AID-PROT16>3.0.CO;2-Q
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000, 9(2):232–241.
- Heger A, Holm L: Picasso: generating a covering set of protein family profiles. Bioinformatics 2001, 17(3):272–279. doi:10.1093/bioinformatics/17.3.272
- Sadreyev R, Grishin N: COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003, 326(1):317–336. doi:10.1016/S0022-2836(02)01371-2
- Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999, 287(4):797–815. doi:10.1006/jmbi.1999.2583
- Kelley LA, MacCallum RM, Sternberg MJ: Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000, 299(2):499–520. doi:10.1006/jmbi.2000.3741
- Shi J, Blundell TL, Mizuguchi K: FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 2001, 310(1):243–257. doi:10.1006/jmbi.2001.4762
- Kim D, Xu D, Guo JT, Ellrott K, Xu Y: PROSPECT II: protein structure prediction program for genome-scale applications. Protein Eng 2003, 16(9):641–650. doi:10.1093/protein/gzg081
- Xu Y, Xu D: Protein threading using PROSPECT: design and evaluation. Proteins 2000, 40(3):343–354. doi:10.1002/1097-0134(20000815)40:3<343::AID-PROT10>3.0.CO;2-S
- Zhou H, Zhou Y: SPARKS 2 and SP3 servers in CASP6. Proteins 2005, 61 Suppl 7: 152–156. doi:10.1002/prot.20732
- Ohlson T, Elofsson A: ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics 2005, 6: 253. doi:10.1186/1471-2105-6-253
- Edgar RC, Sjolander K: A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004, 20(8):1301–1308. doi:10.1093/bioinformatics/bth090
- Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21(7):951–960. doi:10.1093/bioinformatics/bti125
- Hou Y, Hsu W, Lee ML, Bystroff C: Remote homolog detection using local sequence-structure correlations. Proteins 2004, 57(3):518–530. doi:10.1002/prot.20221
- Hou Y, Hsu W, Lee ML, Bystroff C: Efficient remote homology detection using local structure. Bioinformatics 2003, 19(17):2294–2301. doi:10.1093/bioinformatics/btg317
- Liao L, Noble WS: Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol 2003, 10(6):857–868. doi:10.1089/106652703322756113
- Jaakkola T, Diekhans M, Haussler D: A discriminative framework for detecting remote protein homologies. J Comput Biol 2000, 7(1–2):95–114. doi:10.1089/10665270050081405
- Han S, Lee BC, Yu ST, Jeong CS, Lee S, Kim D: Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics 2005, 21(11):2667–2673. doi:10.1093/bioinformatics/bti384
- Saigo H, Vert JP, Ueda N, Akutsu T: Protein homology detection using string alignment kernels. Bioinformatics 2004, 20(11):1682–1689. doi:10.1093/bioinformatics/bth141
- Rangwala H, Karypis G: Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005, 21(23):4239–4247. doi:10.1093/bioinformatics/bti687
- Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS: Semi-supervised protein classification using cluster kernels. Bioinformatics 2005, 21(15):3241–3247. doi:10.1093/bioinformatics/bti497
- Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A: Biclustering of gene expression data by Non-smooth Non-negative Matrix Factorization. BMC Bioinformatics 2006, 7: 78. doi:10.1186/1471-2105-7-78
- Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A: Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics 2006, 7: 41. doi:10.1186/1471-2105-7-41
- Wang G, Kossenkov AV, Ochs MF: LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics 2006, 7: 175. doi:10.1186/1471-2105-7-175
- Hochreiter S, Heusel M, Obermayer K: Fast model-based protein homology detection without alignment. Bioinformatics 2007, 23(14):1728–1736. doi:10.1093/bioinformatics/btm247
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, 34(Database issue):D227–30. doi:10.1093/nar/gkj063
- Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: a multiple structural alignment algorithm. Proteins 2006, 64(3):559–574. doi:10.1002/prot.20921
- Dunbrack RL Jr.: Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006, 16(3):374–384. doi:10.1016/j.sbi.2006.05.006
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32(Database issue):D189–92. doi:10.1093/nar/gkh034
- Lin CJ: Projected Gradient Methods for Non-negative Matrix Factorization. Volume 352. Department of Computer Science, National Taiwan University; 2005.
- Gribskov M, Robinson NL: The use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers Chem 1996, 20: 25–34. doi:10.1016/S0097-8485(96)80004-0
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.