Volume 7 Supplement 1

## NIPS workshop on New Problems and Methods in Computational Biology

# Protein Ranking by Semi-Supervised Network Propagation

- Jason Weston
^{1}Email author, - Rui Kuang
^{2, 4}, - Christina Leslie
^{2}and - William Stafford Noble
^{3}

**7(Suppl 1)**:S10

https://doi.org/10.1186/1471-2105-7-S1-S10

© The Author(s); licensee BioMed Central Ltd. 2006

**Published: **20 March 2006

## Abstract

### Background

Biologists regularly search DNA or protein databases for sequences that share an evolutionary or functional relationship with a given query sequence. Traditional search methods, such as BLAST and PSI-BLAST, focus on detecting statistically significant pairwise sequence alignments and often miss more subtle sequence similarity. Recent work in the machine learning community has shown that exploiting the global structure of the network defined by these pairwise similarities can help detect more remote relationships than a purely local measure.

### Methods

We review RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges. The original RankProp algorithm is unsupervised. Here, we describe a semi-supervised version of the algorithm that uses labeled examples. Three possible ways of incorporating label information are considered: (i) as a validation set for model selection, (ii) to learn a new network, by choosing which transfer function to use for a given query, and (iii) to estimate edge weights, which measure the probability of inferring structural similarity.

### Results

Benchmarked on a human-curated database of protein structures, the original RankProp algorithm provides significant improvement over local network search algorithms such as PSI-BLAST. Furthermore, we show here that labeled data can be used to learn a network without any need for estimating parameters of the transfer function, and that diffusion on this learned network produces better results than the original RankProp algorithm with a fixed network.

### Conclusion

In order to gain maximal information from a network, labeled and unlabeled data should be used to extract both local and global structure.

## Background

Pairwise sequence comparison is the "killer app" of bioinformatics. Algorithms like BLAST [1] and PSI-BLAST [2] allow a user to search a database of DNA or protein sequences using a single query sequence. The output of the search is a list of database sequences (called *targets*) that are ranked according to their similarity to the query. The similarities discovered by the algorithm may help the user infer functional properties of the query and target sequences; for example, a query sequence of unknown function that retrieves from the database a large collection of kinases is likely itself to be a kinase. This straightforward application is familiar to most molecular biologists. The web engine of the most popular pairwise sequence comparison algorithm, the BLAST server at the NCBI, runs 50,000 searches per day.

Early methods for detecting subtle sequence similarities were designed explicitly with respect to a simple model of molecular evolution. They measure similarities between protein pairs by computing the cost of mutation, insertion and deletion. The Smith-Waterman algorithm [3] is a provably optimal, quadratic time, dynamic programming algorithm to solve this problem, and BLAST is a linear time heuristic approximation algorithm [1].

More sophisticated solutions to this problem involve learning from data. In an example of this approach, an HMM or other generative model is constructed from a training set, which is accumulated iteratively from the target database. SAM-T98 [4] is an example of an iterative profile HMM method, and PSI-BLAST [2] is essentially a fast approximation of this approach. PSI-BLAST builds an alignment-based statistical model of a local region in the protein similarity network and then iteratively collects additional sequences from the database, adding them to the multiple alignment. The main idea is to characterize with a statistical model the family or superfamily that the query protein comes from. This model is then more capable than the original query sequence of finding other similar proteins. From a machine learning perspective, these methods can be described as *unsupervised learning* methods. While they learn from unlabeled database sequences in order to build a probabilistic model, they do not make use of known structural or functional information, which is available for a subset of the database of target proteins.

### RankProp

In this article we review RankProp, an unsupervised protein ranking algorithm. We also discuss some extensions to the algorithm that leverage the use of labeled data to make it a *semi-supervised learning* method. RankProp [5, 6] works by defining a protein similarity network. In this network, nodes are proteins, and edges represent pairwise protein similarities identified using the PSI-BLAST algorithm. Given a query (a node in this network) the algorithm performs a diffusion operation on the graph. Each node is assigned an initial activation level. In subsequent iterations, the activation level of a node is given by the weighted combination of neighboring nodes plus a stability term based on its initial activation. The query node has its activation set to a constant value. Repeated iterations of this procedure result in a diffusion of the query's activation level across the network, until a fixed point is reached. The output of the algorithm is a list of target proteins, ranked by activation level. A target protein can achieve a high ranking by being connected to many proteins similar to the query, even if its direct connection to the query is not a strongly weighted edge. RankProp has been shown to significantly outperform BLAST and PSI-BLAST, when tested on its ability to recognize remote homologs in the SCOP [7] database of protein domain structures [5]. RankProp is inspired by two separate sources: first, diffusion techniques from machine learning [6], and second by the Google ranking algorithm, PageRank [8].

An analogy can also be made between a protein database search, where one submits a query protein and is returned a ranked list of proteins, and a web search, where one enters a query word or phrase and retrieves a ranked list of web pages. RankProp is similar to PageRank [8], the ranking system used by Google. Both PageRank and RankProp are based upon constructing a network of object nodes with edges weighted by similarity between them. A diffusion operation is then performed to capture global properties of the network. PageRank ranks a web page more highly if it is linked with other highly ranked pages. See [6] for more details.

### RankProp with labeled data

We hypothesize that extending RankProp to make use of both *unlabeled* and *labeled* data will provide a significant improvement in the resulting rankings, compared to the rankings produced by the original RankProp algorithm. In this context, labels come from 3D structural information. The 3D structure that a protein assumes after folding largely determines its function in the cell. It is far easier to determine experimentally a protein's primary amino acid sequence than it is to discover its 3D structure. However, protein structure databases contain thousands of solved 3D structures. Thus, from a machine learning perspective, in addition to the large amount of unlabeled data — on the order of one million protein sequences — we also have available a small amount of labeled data — roughly 27,000 proteins with known 3D structures, organized into structural classes. We investigate three different ways of using this labeled data: (i) as a validation set to choose from competing similarity networks, (ii) to design edge weights that correspond exactly to the similarity measure of interest, the probability of homology, and (iii) to learn which similarity network to use on a per-query basis. Of these methods, the third method provides the best-performing algorithm across a range of evaluation metrics and superfamily sizes.

## Results

### Basic approach

The RankProp algorithm requires a protein similarity network as input. The protein similarity network represents the similarity between pairs of proteins by assigning a weight to each edge. The degree of similarity between two sequences is commonly summarized in an E-value, which is the expected number of times that this degree of sequence similarity would occur in a random database of the given size. RankProp bases its edge weights on E-values returned from PSI-BLAST searches, using a radial basis transfer function

*W*_{
ij
}= *exp*(-*E*_{
ij
}/*σ*), (1)

where *E*_{
ij
}is the E-value between protein *i* and *j*, and *W*_{
ij
}is the corresponding weight. In this way, edges between similar sequences are assigned large weights. The transfer function introduces a hyper-parameter *σ*, the radial basis width, which controls the importance of very similar proteins relative to distant ones.

We evaluate RankProp output using a 3D structure-based gold standard [7], measuring the extent to which known homologs occur above non-homologs in the ranked list. The protein network consists of 7329 SCOP domains and 101, 602 proteins from Swiss-Prot version 40. The SCOP domains were split into two portions: 379 superfamilies (4071 proteins) for training and 332 (2899 proteins) for testing (used as queries). For more details of the dataset see the Methods section.

We use receiver operating characteristic (ROC) curves to measure performance. The ROC score [11] is used to compare methods for a given query by measuring the area under a curve that plots true positives as a function of false positives for varying classification thresholds, where a true positive is an example that belongs to the same superfamily as the query, and a false positive is an example that is not in the same fold. The ROC_{
n
}score computes this score up to the *n* th false positive [12]. A value of 1 implies that the algorithm successfully assigns all the true relationships higher scores than the false relationships. For a random ranking of this data, the expected ROC_{50} score is close to 0 because most of the sequences are not related to the query.

Comparison of PSI-BLAST and RankProp variants. The table lists ROC_{1}, ROC_{10} and ROC_{50} scores, averaged across 2899 SCOP domains in the test set, for PSI-BLAST and five variants of the RankProp algorithm.

ROC | ROC | ROC | |
---|---|---|---|

PSI-BLAST | 0.5973 | 0.6167 | 0.6406 |

RankProp | 0.5964 | 0.6658 | 0.7169 |

RankProp | 0.5924 | 0.6671 | 0.7249 |

RankProp | 0.6040 | 0.6661 | 0.6999 |

Probabilistic RankProp | 0.6145 | 0.6660 | 0.7195 |

Adaptive RankProp | 0.6510 | 0.7000 | 0.7420 |

So far, we have described RankProp as a purely unsupervised approach. However, we would also like to make use of the labeled data available, which can be done by learning some aspect of the network with the available labels. In the following subsections, we consider three ways to achieve this goal.

### Model selection of radial basis width

RankProp takes as input a weighted protein similarity network. Clearly, the quality of the rankings produced by RankProp depends critically on the quality of the initial weights. If we can parameterize the weights in some way, then these parameters can be inferred using labeled data.

Our first method for making use of labeled data simply learns the radial basis width parameter *σ* from Equation (1). This approach requires running RankProp with each sequence in a labeled training set for each value of *σ*. The ROC_{
n
}scores of the resulting rankings can then be used to select an optimal value for *σ*. This procedure was performed in the original RankProp paper [5]. Table 1 shows that selecting an appropriate value of *σ* can significantly affect the performance of the RankProp algorithm.

### The probability of homology network

The RankProp algorithm with the transfer function (1) requires that the user specify the radial basis width parameter *σ* in advance. However, selecting an appropriate value for *σ* is difficult because the resulting edge weights have no clear semantics. Perhaps the most intuitive choice of edge weight between two proteins is the probability that the two proteins are structurally related, since our final measurement of success is the ranking performance based on the same structural relation.

*i*and

*j*is the probability that

*i*and

*j*belong to the same superfamily. Using a labeled set of proteins, we discretize the range of possible E-values, and for each resulting E-value bin, we compute the frequency of pairs of proteins in that bin being in the same superfamily. Figure 4 compares the resulting empirical mapping to the original mapping for various values of

*σ*. This probabilistic method is parameterless and yields a transfer function that is similar to the best performing value

*σ*= 100 [5] of the original algorithm.

Overall, the probabilistic network provides performance comparable or superior to all values of *σ* we tried, as shown in Figure 3 and Table 1, measured using ROC_{1}, ROC_{10} and ROC_{50} scores. However, the improvement is less convincing at the ROC_{10} and ROC_{1} levels, i.e., when the very highest ranked proteins become increasingly important. It is not surprising that RankProp has similar ROC_{1} performance to PSI-BLAST, because examples very close to the query using the original similarity metric are usually already highly ranked.

Although probabilistic RankProp does not outperform the original RankProp when used with the best choice of *σ*, the simple scheme of chosing *σ* via cross validation is very costly computationally. Given that the probabilistic network yields performance that is as good as the best choice of *σ*, we feel that it is a useful technique.

### Adaptive model selection of radial basis width

*σ*for

*a particular query*depends on the local density of the protein similarity network around that query. This effect can be seen in Figure 5. Here, we estimate the network density around a query by the number of strongly weighted (E-value less than 0.01) edges from the query to its neighbors. The figure shows that larger values of the radial basis width

*σ*produce better error rates for queries in more densely populated areas of the network. In other words, RankProp appears to perform better on smaller superfamilies for small

*σ*, and better on larger superfamilies for large

*σ*.

This observation suggests our third strategy for making use of labeled data: *given the density of the network in the region of each query, use the labeled training data to learn which value of σ gives optimal performance*. Accordingly, we choose several values of *σ* (10,100, and 1000) and solve three corresponding regression problems to predict the ROC_{
n
}score given a histogram of number of hits with E-value less than *t* as input for each *σ*. In this experiment we use ROC_{1}, but in principle we could optimize for any error measure. Then, given the query, we choose the value of *σ* that is predicted to give the best ROC_{1} score by the regressions. We call this method "adaptive RankProp." The results, given in Figure 3 and Table 1, show improved ROC_{1}, ROC_{10} and ROC_{50} scores. We note that the supplementary material of [5] also suggests to adapt the width *σ* per query, but there a hand-built rule of thumb was suggested, rather than choosing the width by learning from data.

## Discussion

In this article we reviewed the RankProp algorithm and suggested some ways of using labeled data to further improve the basic algorithm. Based on our experiments, we advocate making use of all available information — in this case using both labeled and unlabeled data — to achieve the best possible results.

The basic way to use labeled data with the ranking problem is to optimize the parameters of the model of choice, which in this case is the protein similarity network. In the previous sections, we defined three possible parameterizations and then optimized them, in each case yielding good results. However, many other parameterizations are possible. For example, one could build a network based upon several measures of similarity, including pairwise sequence similarity metrics (BLAST, PSI-BLAST), common motif occurrences (MotifProp [13]), and predicted secondary structure similarity. The relative weights of these measures could then be learned. Another possibility is to put label information explicitly into the network: if two proteins are known to be homologs, then the edge weight can be set to one, and if they are known *not* to be homologs, then it can be set to zero. However, some preliminary experiments (not shown) indicated that this approach does not improve ranking performance.

An implementation of RankProp is now available on the Santa Cruz Gene Sorter, http://genome.ucsc.edu, featuring a pre-computed network of human genes. We plan to extend this implementation to use a larger database, and establish a separate web server capable of processing new queries, rather than operating on a pre-defined network.

Finally, one important difference between RankProp and existing methods such as BLAST and PSI-BLAST is that RankProp does not return an E-value or other confidence measure along with each ranked protein. Defining such a confidence measure is the subject of our current research.

## Conclusion

The RankProp algorithm uses global network information to give improved protein rankings by performing diffusion on a graph built with PSI-BLAST similarity scores. PSI-BLAST improves upon BLAST by incorporating unlabeled data into its search algorithm, but advanced machine learning techniques appear to extract extra information useful for this task. In this article, we showed how labeled data can be used to further improve the unsupervised diffusion technique by learning various parameters of the similarity network. These results may have implications for other ranking problems in bioinformatics as well, as long as a suitable similarily network can be defined.

## Methods

### Data Preparation

We tested the quality of the protein rankings produced by RankProp, using as a gold standard the human-annotated SCOP database of protein 3D structural domains [7]. SCOP has been used as a gold standard in many previous studies (e.g., [14–16]). Sequences were extracted from verson 1.59 of the database, purged using ASTRAL [17] so that no pair of sequences shares more than 95% identity. The resulting collection of 7329 SCOP domains was split into two portions: 379 superfamilies (4071 proteins) for training and 332 (2899 proteins) for testing. Note that training and testing sequences never come from the same superfamily.

The SCOP database is organized hierarchically into classes, folds, superfamilies and families. For the purposes of this experiment, two domains that come from the same superfamily are assumed to be homologous, and two domains from different folds are assumed to be unrelated. For pairs of proteins in the same fold but different superfamilies, their relationship is uncertain, and so these pairs are not used in evaluating the algorithm.

In all the experiments reported here, the SCOP database was concatenated with 101,602 proteins from Swiss-Prot version 40. Using this larger database benefits both PSI-BLAST and RankProp.

### PSI-BLAST

PSI-BLAST (v 2.2.2) was used for comparison with RankProp. PSI-BLAST was run with default parameters, including the BLOSUM 62 matrix, but with an E-value threshold of 10,000 for reporting results. PSI-BLAST was allowed to run a maximum of six iterations, which previous work indicates is sufficient for good performance [16], using the default E-value threshold of 0.005 for inclusion in the model.

### RankProp

The protein similarity network for RankProp was built using the same version of PSI-BLAST as above. In the network *K* used by RankProp, the weight *K*_{
ij
}associated with a directed edge from protein *i* to protein *j* is exp(-*S*_{
j
}(*i*)/*σ*), where *S*_{
j
}(*i*) is the E-value assigned to protein *i* given query *j*. For efficiency, the number of outgoing edges from each node is capped at 1000, unless the number of target sequences with E-values less than 0.05 exceeds 1000.

Given the similarity network, the RankProp algorithm can then be described as follows:

**1. Initialization**:

*y*

_{1}(0) = 1;

*y*

_{ i }(0) = 0

- 2.
**for***t*= 0,1, 2,...**do** - 3.
**for***i*= 2**to***m***do** - 4.
*y*_{ i }(*t*+ 1) ←*K*_{1i}+ $\alpha {\displaystyle {\sum}_{j=2}^{m}{K}_{ji}{y}_{j}(t)}$ - 5.
**end for** - 6.
**until convergence** - 7.
**Termination**: Let ${y}_{i}^{*}$ denote the limit of the sequence {*y*_{ i }(*t*)}. Then ${y}_{i}^{*}$ is the ranking score of the*i*^{ th }point (largest ranked first).

Given a set of objects (in this case, proteins) *X* = {*x*_{1}, ..., *x*_{
m
}}, let *x*_{1} be the *query* and *x*_{2}, ..., *x*_{
m
}be the database (*targets*) that we would like to rank. Let *K* be the matrix of object-object similarities, i.e., *K*_{
ij
}gives a similarity score between *x*_{
i
}and *x*_{
j
}, with *K* normalized so that $\sum}_{j=2}^{m}{K}_{ji$ = 1 for all *i*. For computational efficiency, we set *K*_{1i}= *K*_{i 1}for all *i*, so that we can compute weights involving the query using a single execution of PSI-BLAST. Let *y*_{
i
}, *i* = 2, ..., *m*, be the initial ranking "score" of a target. In practice, for efficiency, the algorithm is terminated after a fixed number *I* of iterations, and *y*_{
i
}(*I*) is used as an approximation of ${y}_{i}^{*}$. In our experiments, RankProp was run for *I* = 20 iterations, which experiments in the supplement to [5] show brings the algorithm very close to convergence. The parameter *α* ∊ [0,1] is set *a priori* by the user. For *α* = 0, no global structure is found, and the algorithm's output is just the ranking according to the original distance metric. All our experiments use *α* = 0.95, looking for clear structure in the data. However, in principle this hyperparameter could be selected using labeled data as well.

### RankProp with probability of homology network

It is possible to define a similarity network for RankProp without resorting to the adjustment of free parameters. This is accomplished by making use of labeled data. To perform superfamily detection using RankProp (performing network propagation) the most natural weight to assign to an edge from protein *i* to protein *j* is the probability that *i* and *j* belong to the same superfamily. We suggest computing exactly this probability from labeled data. The method requires a training set of *m* proteins with known labels *y*_{
i
}, and a matrix *D* of PSI-BLAST E-values for these proteins, where *D*_{
ij
}= *S*_{
j
}(*i*) is the E-value between proteins *i* and *j* using *j* as the query and *i* as the target. We then compute a histogram of empirical frequencies for the PSI-BLAST E-values. More specifically, we choose bin centers *v*_{
k
}, and compute *n*_{
k
}, the number of times *D*_{
ij
}falls into the bin centered at *v*_{
k
}, and *s*_{
k
}, the number of times that the latter occurs when *i* and *j* are in the same superfamily. We then compute *s*_{
k
}/*n*_{
k
}, the empirical probability belonging to the superfamily of interest for the bin. The mapping $\widehat{p}(x)$ that converts a PSI-BLAST E-value to a probability of homology is created is done by locating the two closest bins and using linear interpolation on the estimated probabilities. We (arbitrarily) choose the bin centers *v* = (10^{-20}, 10^{-15}, 10^{-10}, 10^{-9.5}, ..., 10^{-4.5}, 10^{-4}, 10^{-3.75}, ..., 10^{3}).

The resulting map is given in Figure 4 and compared to the *exp*(-*S*_{
j
}(*i*)/*σ*) function of the original RankProp algorithm for different values of *σ*. The results show that it is as good as or better than any choice of *σ*. Although it does not improve over the best choice of *σ*, this method provides a very straightforward and computationally efficient method for building a strongly performing network. In comparison, choosing the value of *σ* by validation set is far slower to compute, as it involves running RankProp many times. The method described in this section simply requires one pass through the matrix of E-values generated from PSI-BLAST to compute its network.

### RankProp with adaptive training

In adaptive RankProp, one chooses a different value of *σ* for building the protein similarity network per test example. To implement this approach, a supervised machine learning method is used to predict the best choice of *σ* for a given query. The choice is made by using a regression function to predict the ROC score than one would achieve on that query for a given value of *σ*. We learn a separate regression function for each possible choice of *σ* and choose the value of *σ* with the highest predicted ROC score.

The input to each regression problem is a 5-dimensional vector, where the features count the number of E-values returned by PSI-BLAST using the given query that are less than 1e-10, 1e-5, 0.1, 1, and 10, respectively. The regression output is the predicted ROC_{1} score on a validation set when RankProp is trained with the given value of *σ*. Both input and output features can be generated for a training set, so the regression can be learned, and then applied to a new test example.

We subtracted the mean from the outputs and normalized the inputs to have mean zero and standard deviation one, and used linear least squares to learn the regression.

## Declarations

### Acknowledgements

We would like to thank Andre Elisseeff and Denyong Zhou for their help with this work, and Jim Kent and Mark Diekhans for their help with implementing the RankProp search capability for the human genome browser. This work is supported by the NSF (awards EIA-0312706 and DBI-0243257) and the NIH (GM74257-01). WSN is an Alfred P. Sloan Research Fellow.

## Authors’ Affiliations

## References

- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ:
**A basic local alignment search tool.***Journal of Molecular Biology*1990,**215:**403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar - Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ:
**Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.***Nucleic Acids Research*1997,**25:**3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar - Smith T, Waterman M:
**Identification of common molecular subsequences.***Journal of Molecular Biology*1981,**147:**195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar - Karplus K, Karchin R, Barrett C, Tu S, Cline M, Diekhans M, Grate L, Casper J, Hughey R:
**What is the value added by human intervention in protein structure prediction?***Proteins*2001,**45**(S5):86–91. 10.1002/prot.10021View ArticleGoogle Scholar - Weston J, Elisseeff A, Zhou D, Leslie C, Noble WS:
**Protein ranking: from local to global structure in the protein similarity network.***PNAS USA*2004,**101:**6559–6563. 10.1073/pnas.0308067101PubMed CentralView ArticlePubMedGoogle Scholar - Zhou D, Weston J, Gretton A, Bousquet O, Schoelkopf B:
**Ranking on Data Manifolds.***Neural Information Processing Systems*2003.Google Scholar - Murzin AG, Brenner SE, Hubbard T, Chothia C:
**SCOP: A structural classification of proteins database for the investigation of sequences and structures.***Journal of Molecular Biology*1995,**247**(4):536–540. 10.1006/jmbi.1995.0159PubMedGoogle Scholar - Page L, Brin S, Motwani R, Winograd T:
**The PageRank Citation Ranking: Bringing Order to the Web.***Tech. rep., Stanford Digital Library Technologies Project*1998. [citeseer.ist.psu.edu/page98pagerank.html] [citeseer.ist.psu.edu/page98pagerank.html]Google Scholar - Weston J, Leslie C, Zhou D, Elisseeff A, Noble WS:
**Cluster Kernels for Semi-supervised Protein Classification.***Advances in Neural Information Processing Systems 17*2003.Google Scholar - Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C:
**Profile-based string kernels for remote homology detection and motif extraction.***Computational Systems Biology Conference*2004.Google Scholar - Hanley JA, McNeil BJ:
**The meaning and use of the area under a receiver operating characteristic (ROC) curve.***Radiology*1982,**143:**29–36.View ArticlePubMedGoogle Scholar - Gribskov M, Robinson NL:
**Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching.***Computers and Chemistry*1996,**20:**25–33. 10.1016/S0097-8485(96)80004-0View ArticlePubMedGoogle Scholar - Kuang R, Weston J, Noble WS, Leslie C:
**Motif-based Protein Ranking with Network Propagation.***submission to Bioinformatics*2005.Google Scholar - Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C:
**Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.***Journal of Molecular Biology*1998,**284**(4):1201–1210. 10.1006/jmbi.1998.2221View ArticlePubMedGoogle Scholar - Jaakkola T, Diekhans M, Haussler D:
**Using the Fisher kernel method to detect remote protein homologies.**In*Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology*. Menlo Park, CA: AAAI Press; 1999:149–158.Google Scholar - Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF:
**Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.***Nucleic Acids Research*2001,**29**(14):2994–3005. 10.1093/nar/29.14.2994PubMed CentralView ArticlePubMedGoogle Scholar - Brenner SE, Koehl P, Levitt M:
**The ASTRAL compendium for sequence and structure analysis.***Nucleic Acids Research*2000,**28:**254–256. 10.1093/nar/28.1.254PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.