The distance-profile representation and its application to detection of distantly related protein families

Background Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space and the computed scores are often noisy and frequently fail to recognize distantly related protein families. Results We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins measured by the statistical difference between their corresponding probability distributions. Conclusion Using several performance measures we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSIBLAST and FUGUE.


Background
The ongoing sequencing efforts continue to discover the sequences of many new proteins, whose function is unknown. Currently, protein databases contain the sequences of about 1,800,000 proteins, of which more than half are partially or completely uncharacterized [1]. Typically, proteins are analyzed by searching for homologous proteins that have already been characterized. Homology establishes the evolutionary relationship among different organisms, and the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. However, detecting homology between proteins can be a difficult task.
Our ability to detect subtle similarities between proteins depends strongly on the representations we employ for proteins. Sequence and structure are two possible representations of proteins that hinge directly on molecular information. The essential difference between the representation of a protein as a sequence of amino acids and its representation as a 3D structure traditionally dictated different methodologies, different similarity or distance measures and different comparison algorithms. The power of these representations in detecting remote homologies differ markedly. Despite extensive efforts, current methods for sequence analysis often fail to detect remote homologies for sequences that have diverged greatly. In contrast, structure is often conserved more than sequence [2][3][4], and detecting structural similarity may help infer function beyond what is possible with sequence analysis. However, structural information is sparse and available for only a small part of the protein space.
One may argue that the weakness of sequence based methods is rooted in the underlying representation of proteins, the model used for comparison and/or the comparison algorithm, since in principle, according to the central dogma of molecular biology, (almost) all the information that is needed to form the 3D structure is encoded in the sequence. Indeed, in recent years better sequence-based methods were developed [5][6][7][8]. These methods utilize the information in groups of related sequences (a protein or domain family) to build specific statistical models associated to different groups of proteins (i.e. generative models) that can be used to search and detect subtle similarities with remotely related proteins. Such generative models assume a statistical source that generates instances according to some underlying distributions, and model the process that generates samples and the corresponding distributions.
When seeking a new similarity measure for proteins that departs from sequence and structure, a natural question is what is the "correct" encoding of proteins? Several works studied the mathematical representation of protein sequences based on sequence properties such as amino acid composition or chemical features [9][10][11][12]. However, these representations had limited success, since they did not capture the essence of proteins as ordered sequences of amino acids.
Recently, alternative representations of protein sequences based on the so-called kernel methods were proposed. These methods are drawn from the field of machine learning and strive to find an adequate mapping of the protein space onto the Euclidean space where classification techniques such as support vector machines (SVM) or artificial neural networks (ANN) can be applied. Under the kernel representation, each protein is typically mapped to a vector in a (high-dimensional) feature space, and the resulting vector is termed feature vector. Subsequently, an inner product is defined in the feature space in order to estimate the (dis)similarity among different proteins. A major advantage of the kernel methods is that with an adequate choice of the kernel function, the feature vectors need not be computed explicitly in order to evaluate the similarity relationships. In addition, the users may build a specific feature space such that the kernel function directly estimates these relationships. However, string kernels do not easily lend themselves to this property, and therefore they need to be computed explicitly. The main difference between the different kernel methods reside in the definition of feature elements which are either related to the parameters of some generative process for each group of related proteins or some measure of similarity among the protein sequences.
For instance, the SVM-Fisher algorithm [13] uses the Fisher kernel which is based on hidden Markov models (HMMs). The components of the feature vector are the derivatives of the log-likelihood score of the sequence with respect to the parameters of a HMM that has been trained for a particular protein family. Tsuda et. al. [14] implemented another representation based on marginalized and joint kernels and showed that the Fisher kernel is in fact a special case of marginalized kernel. They also experimented with the marginalized count kernels of different orders, that are similar to the spectrum kernel which was first introduced by [15]. The spectrum kernel is evaluated by counting the number of times each possible klong subsequence of amino acids (k-mer) occurs in one given protein. The marginalized count kernel takes into account both the observed frequency of different subsequences and the context (e.g. exon or intron for DNA sequences). The mismatch-spectrum kernel [16] is a generalization of the spectrum kernel that considers also mutation probabilities between k-mers which differ by no more than m characters. The homology kernel [17] is another biologically motivated sequence embedding process that measures the similarity between two proteins by taking into account their respective groups of homologous sequences. It can be thought of as an extension of the mismatch-spectrum kernel by adding a wildcard character and distinguishing the mismatch penalty between two substrings depending on whether the sequences are grouped together or not.
The covariance kernel is another type of kernel that uses a framework similar to the one employed by our method. The covariance kernel approach is probabilistic and much work is focused on the implementation of the generative models. In [18], the covariance kernel is based on the mutual information kernel which measures the similarity between data samples and a certain generative process. The generative process is characterized by a mediator distribution defined between the (usually vague) prior and the posterior distribution. On the other hand, [19] focuses on the representation of biological sequences using the probabilistic suffix tree [20] as the generative model for different groups of related proteins. The proposed kernel generates a feature vector for protein sequences, where each feature corresponds to a different generative model and its value is the likelihood of the sequence based on that model. Finally, we mention another related work [21] that uses the notion of pairwise kernel. Under this framework, each protein is represented by a vector that consists of pairwise sequence similarities with respect to the set of input sequences. It is also worth mentioning the many related studies in the field of natural language processing and text analysis. For example, an approach that in some ways is similar to the pairwise and covariance kernels and in other ways is related to the spectrum kernel approach, is used in [22] to represent verbs and nouns in English texts, with nouns represented as frequency vectors over the set of verbs (based on their association with different verbs) and vice versa. The nouns and verbs are then clustered based on this representation. Studies that attempted to devise automatic methods for text categorization and web-page classification often use similar techniques, where the text is represented as a histogram vector over a vocabulary of words (e.g. [23]).
Here we study a general framework of protein representation called the distance-profile representation, that utilizes the global information in the protein space in search of statistical regularities. The representation draws on an association measure between input samples (e.g. proteins and protein families) and can use existing measures of similarity, distance or probability (even if limited to a subset of the input space for which the measure can be applied). This representation induces a new measure of similarity for all protein pairs based on their vectorial representations. Our representation is closely related to the covariance and pairwise kernels described above. However, it is the estimation of pvalues through statistical modeling of the background process, coupled with the transformation to probability distributions, the noise reduction protocols and the choice of the distance function that result in a substantial impact on the performance, and we demonstrate how an adequate choice of the score transformation and the distance metric achieves a considerable improvement in detection of remote homologies.
This paper is organized as follows. We first introduce the notion of distance-profile. We describe how to process the feature vectors through noise reduction, and pvalue transformation followed by normalization. We compare the performance of our new method against several standard algorithms by testing them on a large set of protein families.

The distance-profile representation
Our goal is to seek a faithful representation of the protein space that will reflect evolutionary distances even if undetectable by means of existing methods of comparison. We explore a technique based on the distance-profile technique described in [25] and its derivative as applied to protein sequences in [26]. The power of the representation stems from its ability to recover structure in noisy data and boost weak signals [25,26].
The distance-profile representation is simple and can be applied to arbitrary spaces , if there exist an association measure between instances of and instances of such as a distance function, similarity function or a probability measure. Given an instance X in the input space , a reference set {Y 1 , Y 2 , ... Y n } of entities in (e.g. proteins or protein families, sequences or generative models) and an underlying association measure, we associate with the instance X a position in a high dimensional space, where every coordinate is associated with one member of the reference set and its value is the similarity with that particular reference object. I.e., we map X to a vector of dimension n in the host space This simple representation leads to the definition of a new distance or similarity measure among samples based on their vectorial representation, and when the reference set is identical to the input space ( = ), an iterative application of this representation can be used to form hierarchical clustering over the input samples [25]. In this paper we demonstrate the application of this method to the problem of homology detection between distantly related proteins.
One might observe the resemblance of our method with pairwise kernels and covariance kernels mentioned in the 'Background' section. However, it is the processing of the feature vectors and the choice of the metric, as is laid out next, which are the crucial ingredients that differentiate our method from the previous studies. As exemplified in this paper, under the proper transformations the distanceprofile representation has mathematical and statistical interpretations that have other implications, and it is these transformations that deem this method very effective for homology detection, database search and clustering.

The reference set
Remotely related proteins usually share little sequence similarity, however, they are expected to have similar structures. Therefore, as a reference set for our experiments we chose a non-redundant structure library consisting of domain structures that represent the current protein structure space. The set is derived from the SCOP database [27], release 1.57. Specifically, we used the Genetic Domain Sequence dataset that we downloaded from the Astral webpage [28]. The dataset is obtained from 14,729 entries in the Protein Data Bank (PDB) [29] and contains only protein domains with less than 40% identity between pairs of sequences. Our library consists of 3,964 distinct SCOP domains covering in total 644 folds, 997 superfamilies, and 1,678 families. For notation purpose, we denote this library of template proteins by SCOP-DB.

The association measure
We rely on "structure-aided" sequence alignment to bridge the gap between the sequence space and structure space. Given the library of structures = {Y 1 , Y 2 , Y 3 , ..., Y n }, every protein sequence P is mapped to a structure-specific n-dimensional feature vector, as is described above, where the association measure S(P, Y i ) is the similarity score of the sequence-structure alignment between the sequence P and the template structure Y i , computed with the FUGUE threading algorithm [30] (see Appendix).
From this point on we discuss the application of the distance-profile representation with threading-based association measures. However, we note that the methods described in this paper can be used to process feature vectors using other association measures. In the 'Discussion' section we show that a similar approach applied to sequence-profile alignment scores also produces a considerable improvement in detecting remote homologies.

Processing feature vectors: pvalue conversion and normalization
The choice of the underlying association function S(P, Y i ) can have a drastic impact on the effectiveness of the representation and we tested several variations.
The score association measure A possible choice is obviously the score reported by the algorithm that compares entities of with entities of . (for example, the zscore reported by the FUGUE threading algorithm). We denote feature vectors that are based on the score association measure by P score .

The pvalue association measure
If the association measure is distributed over a wide range, the most significant scores will inevitably shadow other numerically less important but still significant matches, thus reducing the sensitivity of the representation. This is the case with most types of similarity scores, including the threading zscores assigned by the FUGUE program.
To address this problem we convert the zscores to their underlying cumulative distribution function (cdf) value, where the amplitude of outlier zscores is reduced to within a reasonable range. As was shown in [31], the zscores of local alignment scores follow the extreme value distribution (EVD), see Figure 1, whose cumulative function F(x) takes the form Based on this background distribution we replace the original zscores with a new association measure such that S'(P, Y i ) = F(S(P, Y i )) where S(P, X i ) is the similarity zscore reported by FUGUE. With this transformation, all coordinates are bounded between 0 and 1, with high zscores transformed to values close to 1. Note that the pvalue of a given zscore x is pvalue(x) = 1 -F(x). We denote feature vectors that are based on the F(x) association measure by P pvalue . It should also be noted that in practice F(x) = 1pvalue(x) = 1 for large x because of machine precision limitations. Therefore, the pvalues that are associated to significant zscores (typically above 5) are approximated by their empirical distribution, thus allowing distinction between a pair of highly significant yet numerically disparate zscores, e.g. 10 versus 60.
The probability association measure the third variation we tested is based on a simple normalization of each feature vector to form a probability distribution. This transformation enables us to explore distance measures that are suited for probability vectors, as described in section 'Metrics and score functions'. Indeed, in this representation, the normalized vector entries can    The empirical cdf of the zscores obtained by FUGUE and the fitted EVD cdf with parameters λ = 2.0665 and µ = 1.3611

Figure 1
The empirical cdf of the zscores obtained by FUGUE and the fitted EVD cdf with parameters λ = 2.0665 and µ = 1.3611. Empirical cdf of z−score Fitted cdf of EVD be considered as the coefficients of a mixture model where the components models are the protein structures, each one inducing a different probability distribution over the protein sequence space. This interpretation emphasizes the similarity with covariance methods which also resort to a probabilistic representation of different protein families as described in the 'Background' section. We denote feature vectors that are based on this association measure by P prob .

Reducing noise: sparse feature vectors
Our reference set is composed of proteins that belong to different protein families and folds (see section 'The reference set'). Within that data set no two proteins share more than 40% sequence identity. Therefore, for a given query protein we expect to observe only a few significant similarity values in the vector P. That is, the entries that correspond to the structural templates of protein families that are related to the query. In other words, the feature vectors contain many entries that are essentially random and meaningless. These random numbers will contribute to the differences between feature vectors, thus masking possibly significant similarities. To reduce noise due to unrelated proteins we eliminate all entries with zscore below a certain threshold τ, or pvalue above a certain threshold τ', to reflect the fact that the corresponding sequence-structure pair is considered irrelevant (Another alternative is to weight the differences by the significance of the measurements. However, to speed up the processing and comparison of feature vectors we adopted the threshold approach.) The parameter τ (or τ') is optimized to maximize performance, as described in section 'Parameter optimization'. Note that entries with low zscores that are filtered in this step (assigned 0 zscore) remain zero under the transformation to the cdf pvalue as described above.
The processed feature vector is denoted by .
The noise reduction is applied to the original feature vectors and is followed by the pvalue conversion and the normalization to yield new feature vectors P pvalue and P prob , respectively. To illustrate the impact of our procedures on the distance profiles, let us consider the example of two closely related proteins d1qmva_(Thioredoxin peroxidase 2) and d1hd2a_ (Peroxiredoxin 5) that belong to family c.47.1.10 under the SCOP denomination. The sequences were threaded against the SCOP library of structural domains, and feature vectors were compiled from the zscores reported by FUGUE. In Table 1 we report the 2163th up to the 2169th entries of their original feature vectors as well as their transformations after noise reduction, pvalue conversion and normalization. As this example demonstrates, the zscore entries are noisy and spread over a wide numerical range. For instance, the 2167th entry of both vectors correspond to their threading score versus the structure of d1prxa_ (HorF6 peroxidase), another protein that is in the same SCOP family c.47.1.10. While the zscore reaches 15.79 for d1qmva_, it is only 6.38 for d1hd2a_. These large differences will inevitably result in large distances between the feature vectors despite the fact that they have significant zscore values in the same positions. The pvalue conversion and normalization (third and fourth rows in the table) resolve this problem by rescaling scores to within a fixed interval.

Metrics and score functions
Under the distance-profile representation, the similarity (distance) between two protein sequences P and P' is defined as the similarity (distance) of their corresponding feature vectors S(P, P') = f(P, P')P score   The function f can be a similarity function or a distance function and we considered several different variants. We tested the L 2 norm (the Euclidean metric) and the L 1 norm (the Manhattan distance). For probability distributions we also tested the Jensen-Shannon (JS) measure of divergence [34]. Given two probability distributions p and q, [35] defined as and r = λp + (1 -λ)q can be considered as the most likely common source distribution of both distributions p and q, with λ as a prior weight (without a priori information, a natural choice is λ = 1/2). Unlike the Kullback-Leibler measure, the JS measure is symmetric and bounded. It ranges between 0 and 1, where the divergence for identical distributions is 0. This measure has been used successfully in [7,36,37] to detect subtle similarities between statistical models of protein families and in [38] for automatic domain prediction from sequence information.
As an alternative approach to assess the similarity of a pair of proteins based on their distance-profile representation we propose the pvalue-distance (PD) function. This function assesses the distance between two proteins by estimating the probability to observe a random protein with a feature vector inside the volume delimited by their two feature vectors. The smaller the volume is, the more similar are the two vectors. The function operates on the pvalues used to form the feature vectors P pvalue . Given two feature vectors P and Q that correspond to proteins P and Q, we consider one coordinate i at a time and estimate the total probability mass of samples whose i-th feature is bounded between the feature values p i and q i as is illustrated in Figure 2a. Since each representative in the reference set induces a complex high-dimensional distribution over the protein sequence space, the one-dimensional pvalue measure p i can only serve to approximate a certain perimeter in the sequence space of sequences that are as similar or more similar to the i-th source than the protein P. Therefore, we use the least significant pvalue of p i and q i as an upper bound estimator of the volume of relevant instances, as illustrated in Figure 2b. The total volume is computed by taking the product over all coordinates.
Formally, consider the two pvalue feature vectors and that are obtained by mapping each zscore z i to its pvalue p i using the EVD background distribution as described in section 'Processing feature vectors'. Their pvalue distance is defined as In practice, the PD score is evaluated through its logarithm: (It should be noted that here the measure pvalue(z) is used directly to define the feature values, as opposed to F(z) = 1 -pvalue(z) that was used before when compiling the feature vectors P pvalue . However, to simplify notation we also refer to these feature vectors as P pvalue.) To distinguish all the measures discussed in this section from the association measures discussed in section 'Processing feature vectors', we refer to all of them from now on as distance metrics, although they are not necessarily metrics or distance functions.

Dataset preparation
We use the SCOP classification of protein structures [27] as our benchmark. The SCOP database is built mostly based on manual analysis of protein structures and is characterized by a hierarchy with four main levels: class, fold, superfamily and family. Proteins that belong to the same family display significant sequence similarity that indicates homology. At the next level (superfamily), families are grouped into superfamilies based on structural similarity and weak sequence similarity (e.g. conserved functional residues). Proteins that belong to different families within the same superfamily are considered remotely related. It is this level that has been used in many studies to evaluate sequence comparison algorithms (e.g. [7,39,40]). The challenge is to automatically detect similarities between families within the same superfamily, that were established manually by the SCOP experts.
To determine the optimal parameters for the distance-profile representation and compare its performance to other algorithms, we split the library SCOP-DB into a training set and a test set. Since our purpose is to test the ability to find remotely related proteins at the superfamily and fold levels, we first discard all proteins that have fewer than 5 remote homologs in SCOP-DB (i.e. are in superfamilies of size 5 or less). From the remaining 2,570 sequences we randomly select 100 for the training set, and the rest (2,470 sequences) are compiled into the test set.

Performance indices
To evaluate the performance of a given method for a specific query protein we compare the protein against SCOP-DB, sort the results and assess the correlation of the sorted list with established homology relations. In our experiments we consider two proteins to be related (a positive pair) if they belong to the same SCOP superfamily. All other pairs are treated as negatives. We also consider a more relaxed definition, where proteins are deemed related if they belong to the same SCOP fold.
A popular measure of performance used in signal detection and classification is the ROCk measure [41]. This is the cumulative count of positive samples detected until k negative samples are met in the sorted list of results. We use four different indices to assess performance, all are variations on commonly used sensitivity and accuracy measures. These indices measure the ability of a given algorithm to recognize different levels of structural similarity between protein sequences and within neighborhoods of varying sizes: • The ROC1 superfamily index (ROC1-S).
Given the sorted list of results for a query protein p, the ROC1-S index totals the number of proteins in the same superfamily as p that are observed from the top of the sorted list until the first false match (i.e. different superfamily) appears. Likewise, the ROC1-F index is defined by counting the number of proteins in the same fold as p from the top of the sorted list until the first match that involves two proteins with different folds. The last two indices are characterized by the following generic definition: the top-X-Y index for a protein p counts the total number of proteins sharing the same Y SCOP denomination among the n X closest sequences of p, where n X is the total number of sequences in the library that have the same X SCOP denomination as p itself (For example, to compute the top-fold-fold index for a query protein that belongs to a SCOP fold containing n proteins, we look at the top n proteins in the sorted list and count how many of them are The pvalue distance Figure 2 The pvalue distance. Left: Often is the case that the distance between two measurements depends not only on the relative nominal difference between the measurements but also on the absolute magnitude of the each one of the measurements. For example, two measurements z 1 and are statistically more similar to each other than the two measurements z 2 and . That is to say that there are fewer measurements with score as high as z 1 and and therefore fewer instances that have similar properties. The measurements in our case are the zscores that indicate the significance of the match between a sequence and a structural template (the source). For a given zscore z, the pvalue measure pvalue(z) is an estimate of the total probability mass in the protein sequence space of sequences that match the structural template with zscore ≥ z. Right: The least significant pvalue of the two associated with the two measurements is an estimate of the mass of sequences with similar properties (note pvalue1 <pvalue2).
actually in the same fold as the query protein). Self-similarity is ignored in the assessment of all performance indices.
Note that all these indices are closely related to sensitivity measures at different levels, however, the relevant neighborhood is calibrated on a per-superfamily/fold basis. And while the two ROC indices stop as soon as one false match is encountered, the top-X-Y indices credit a method that detects many true positives at the top even if mixed with a few false positives. Therefore, a method yielding a lower ROC1 index but higher TSS (or TFF) should be still considered successful since it clusters the query close to a larger number of related objects.
To obtain the overall performance of a method with respect to a set of queries, we simply take the sum of all their corresponding performance indices as the global result. I.e. given a query set Q = {q 1 , ..., q n }, the performance of a method M, using index I is where I(M, q i ) is the performance of the method M for the protein q i (for example, TFF(FUGUE, q i ) is the TFF performance of FUGUE on protein q i ).

Normalized performance indices
The global performance indices might be affected by the specific make up of the query set, since superfamilies and folds vary greatly in size. In order to reduce a potential bias due to large superfamilies/folds that perform very well, we use also normalized performance indices. To compute these, we divide each index by its upper bound, i.e. the total number of proteins in the SCOP-DB library that are classified to the same superfamily/fold as the query (except itself). For example, for a query protein q that belongs to a fold F of size n F the normalized TFF measure is given by This ratio is essentially the sensitivity of the method M on the query q, over a match list of size n F . The size of the relevant match list changes with each query.
The resulting ratios are then averaged at the superfamily level so as to obtain a representative average performance index per protein superfamily that is bounded between 0 and 1. Finally, the final index is computed by averaging over all the representative indices. I.e. given a query set Q = {q 1 , ..., q n } that are classified to k different superfamilies F 1 , ..., F k with n i queries in superfamily F i , then the overall performance of a method M, using the normalized index I N is given by

Parameter optimization
Our method depends on three parameters: the noise reduction level (the z-score cutoff threshold), the association measure and the distance metric. We consider all possible combinations among five zscore cutoff values (τ = 2.5 to 4.5 by increment of 0.5), three association measures that are P score (zscore), P pvalue (pvalue) and P prob (prob), and four distance metrics: L 1 , L 2 , the JS divergence measure and the new pvalue-distance (PD) measure. Note that the JS measure is only applicable to normalized vectors since it requires the input vector to represent a probability distribution and the PD measure is only applicable to the pvalue vectors.
To find the best parameters we first establish the feature vectors for each sequence in SCOP-DB. Each combination of an association function (zscore, pvalue, prob) and a noise threshold τ leads to a different set of feature vectors.
Next, we compute the distance between the feature vector of each training sequence and the vectors of all sequences in SCOP-DB, using one of the distance metrics of section 'Metrics and score functions'. Each combination of feature vectors and a distance metric results in a different set of "match lists". These sets are then evaluated using the performance indices described above. The results are reported in Table 2. For clarity, the best combination of parameters is printed in boldface for each index. To determine the optimal set of parameters, we first examine the effect of the association function. The results are unanimous with all five performance indices: the pvalue association measure achieves the best performance, followed by the prob association measure. However, the pvalue association measure is only effective when coupled with the PD distance metric, while the prob association measure seems to produce good results with all distance metrics and especially with L 1 and the JS divergence measure. As for the distance metric, its influence depends on the association function. The PD measure is applicable only to the pvalue vectors but it produces excellent results, much better than the traditional L 1 and L 2 metrics. Under the prob association measure, L I and JS metrics produce the best results while L 2 leads to significantly worse performance. With the two other association measures L 1 and L 2 metrics yield similar results. Finally, the optimal value of τ is approximately around 3.5 for most combinations.
In conclusion, these results suggest that the best performance is achieved with the pvalue association measure, the pvalue-distance (PD) metric and a zscore cutoff threshold The same conclusions are reached when using the normalized performance indices. It should be noted that although the prob association measure is not as good as the pvalue measure (combined with the PD metric), it still produces very good performance overall (with the JS or the L 1 metrics), further justifying the statistical interpretation of our representation and its equivalence with the coefficients of a mixture model over independent sources (as discussed in section 'Conclusions'). Figure 3a,b illustrate the distribution of the pairwise L 1distances between proteins under the representations P score and P prob , respectively. We observe that the pairwise distance between feature vectors P score spreads over a very large range and it is difficult to set a natural threshold below which feature vectors can be considered similar. In contrast, for P prob , about 95% of the pairwise distances are equal to 2, the maximum L 1 distance. This is the distance between pairs of normalized feature vectors (probability distributions) whose set of non-zero features do not overlap. The distribution shown in Figure 3b only focuses on those pairwise distances smaller than 2. The combination of noise reduction, pvalue conversion and normalization procedures effectively delimits the range of the pairwise distances, and any distance smaller than 2 indicates common features between the feature vectors. Figure 3c shows the empirical distribution of the PD measure between feature vectors P pvalue . The distribution is multi-modal (see Appendix for a more detailed discussion). This emirical distribution is used to estimate the significance of the PD measure.

Performance Comparison
We compare the performance of our algorithm against several existing algorithms, namely FUGUE [30], BLAST [42] and PSIBLAST [5]. Table 3 reports the results for all algorithms based on the training sequences. The DP algorithm was run with the optimal parameters that were determined in the previous section. With BLAST, we simply compare the query sequence with all sequences in SCOP-DB. With FUGUE, we thread the query sequence into each one of the structural templates of the proteins in SCOP-DB. For reference, we also list the results obtained with the structure comparison algorithm URMS [43]. The URMS results provide a rough upper bound on the expected performance since the algorithm is directly using the structural information which underlies many of the homology relationships in SCOP.
As Table 3 shows, FUGUE improves the ROC1-S and ROC1-F indices by 6 and 9% over PSIBLAST while the TSS and TFF indices are increased by a magnitude of 24 to  (Table 4). Quantitatively, the ROC1-S and ROC1-F are improved by about 90% and the TSS, TFF indices by about 42%. This is a substantial improvement that is larger than the relative improvement of PSIBLAST with respect to BLAST or that of FUGUE with respect to PSIBLAST, thus indicating that the statistical fingerprints of the distance-profile representation encode more information and are more sensitive than a direct comparison of the objects they operate on. The results agree with the previous indices, and the distance-profile method significantly outperforms all other methods as the amount of positives detected reaches almost 12,500 (compared to 8,000 with FUGUE and 7,000 with PSIBLAST) when the number of negatives reaches 50. A similar ROC50 curve is observed at the fold level as well. The results above demonstrate that our method can effectively detect remote homologies. However, an interesting question that one might raise is: under what scenarios the distance-profile representation can improve the results over the original method from which it is derived. To answer this question, we contrast the performance of DP-FUGUE and FUGUE for individual queries. Specifically, for each query in the test set we plot the normalized TSS index of DP-FUGUE vs. the TSS index of FUGUE for the same query. As the graph shows, in most cases the improvement occurs when the TSS index of the original method is above 20%, and on average one may expect an improvement if it is above Left: Distribution of pairwise distances between feature vectors P score (τ = 3.5, L 1 metric) Figure 3 Left: Distribution of pairwise distances between feature vectors P score (τ = 3.5, L 1 metric). Middle: Distribution of pairwise distances between feature vectors P prob (τ = 3.5, L 1 metric, distance 2 ignored). Right: Distribution of the PD measure for feature vectors P pvalue (τ = 3.5). The last distribution is plotted in logscale for distances smaller than 40.  Table 3: Comparison between BLAST, PSIBLAST, FUGUE, DP-FUGUE (τ = 3.5, PD, pvalue) and URMS on the training set. The normalized indices (in percentages) are given in parentheses. PSIBLAST's parameters were set to h = 1e -5 , e = 100 and j = 10 (although no improvement was observed after the fourth iteration). FUGUE was run using the default parameters. 10% (i.e. if as little as 10% of the superfamily are observed in the proximity of the query within a neighborhood as big as the superfamily).

The distance-profile representation over sequence-profile metrics
To test the effectiveness of the distance-profile method on other types of input we applied it to feature vectors that were generated with PSIBLAST, a sequence to profile alignment algorithm [5]. The feature values are set to the log (evalue) of the similarity score, as reported by PSI-BLAST after four iterations, unless the program converged before (as Table 3 demonstrates, the performance plateaus after four iterations). The PSIBLAST evalue-based feature vectors are processed in a similar fashion to the FUGUE zscore-based feature vectors. Each evalue e is mapped to its corresponding pvalue pvalue(e) = 1 -exp(-e) as in [44] and the value of the corresponding feature is defined as 1 -pvalue(e) = exp(-e). If the evalue is greater than a given threshold τ' then we reset the value of the fea-ture. Finally, the normalization converts the resulting feature vectors to probability distributions.
The parameters are optimized using a similar procedure to the one described in section 'Parameter optimization'. The optimal parameters are sought among the combinations of 4 evalue cutoff thresholds (τ' = 0.01, 0.1, 1, 10), two association measures (P pvalue (pvalue), P prob (prob)) and three possible metrics (L 1 , L 2 and JS divergence measure). In this case, the best combination based on the training set is (τ' = 10, L 1 , prob). Table 5 compares the performance indices associated to PSIBLAST after 4 iterations with DP-PSIBLAST with parameters (τ' = 10, L 1 , prob). The distance-profile representation clearly improves the performance compared to PSIBLAST (increases the indices by about 20% to 38%) and even to FUGUE in some cases, but is not as powerful as the distance-profile representation when applied to FUGUE (compare to the results of Table 3). It is also interesting to note that the overall improvement of DP-PSI-BLAST over PSIBLAST seems smaller than that of DP-FUGUE over FUGUE. As we showed at the end of the previous section, the magnitude of improvement depends on the initial success of the association measure among the queries and the reference set. Since PSIBLAST yields less instances with performance that exceeds the minimal threshold (i.e. cases that can be improved using the DP representation), it follows that the overall improvement of DP representation on PSIBLAST is less significant compared to that over FUGUE. Detailed examples are discussed in the Appendix.  ROC50 curves over the test set Figure 4 ROC50 curves over the test set. Along each curve we indicate the e-value (or z-score) at different rates of false positives. Thus, at each significance threshold one can estimate the ratio between the number of true and false matches. The effect of multiple sequence alignment on the performance All our experiments were performed in a single-query mode. I.e. in each test case a single query sequence is compared against the database. However, there are multiple reports [40,45,46] that suggest that significant performance gain can be obtained when using a multiple sequence alignment (MSA) or a sequence profile as a query.
To test the effect of MSA on the performance we had the change our experimental setup. We tested the impact of the new setting on PSIBLAST and FUGUE based on 7 SCOP families that were randomly chosen from all families for which the performance of PSIBLAST and FUGUE was poor. For each family we generated a MSA of all sequences in the family using CLUSTALW [47]. Each MSA was also converted to a position specific scoring matrix (profile). These MSA were used as queries for FUGUE (instead of the individual sequences) and the profiles as input for PSIBLAST, in search for related sequences in SCOP-DB. To fairly compare the results of PSIBLAST and FUGUE in the MSA mode to the DP method we had to run our method under a similar setup. The "MSA" mode of the DP method utilizes the information from all the sequences in a protein family in the same spirit a MSA does so, by combining the distance profiles associated to each member of the SCOP family. For each sequence in SCOP-DB, we take the average of its distance versus each member of the family in question in order to compute the "family-specific" distances with respect to SCOP-DB.
In Table 6, we summarize for each of the SCOP families the performance of FUGUE, PSIBLAST and DP-FUGUE under the MSA-query mode. Family members are not counted since they were already used to build the MSA. The adjusted indices thus indicate how many remote homologs are discovered in the MSA-query mode. For comparison we also report the average performance under the single-query mode. As the results demonstrate, the MSA mode improves over the standard single mode, and in most cases our method performs better than FUGUE and PSIBLAST, in particular in terms of the TSS index (reflecting how well remote homologs are clustered at the top of the ranked list). It should be noted that when we analyzed FUGUE in a similar manner to DP-FUGUE (i.e. by averaging over the individual family members) the results improved over the MSA mode of FUGUE, but not as much as DP-FUGUE in MSA mode.
We should comment that the MSA setup differs from our original idea of using the DP representation to perform unsupervised clustering of objects based on their distance profile. In the unsupervised learning mode we do not have information on the family association of each sequence, and therefore it is difficult to define the exact set of related sequences from which to generate a multiple sequence alignment.

Superfamily and fold prediction with the distance-profile method
We tested the power of our method on a new set of protein sequences that were added to SCOP after we compiled our benchmark. In total we found 624 sequences belonging to 453 new families within known superfamilies, 267 sequences associated to 182 new superfamilies within known folds and 375 sequences in 245 new folds, all with less than 40% identity between pairs of sequences. Each one of these new sequences was compared against all the sequences in SCOP-DB using all of the methods evaluated in this paper, and the matches were sorted based on the score or distance, as before. To apply the distance-profile method the sequences were first processed and mapped to feature vectors as described in section 'Results'. Table 7 summarizes the results using our four performance indices. We note that only the ROC1-F and TFF indices are reported for sequences belonging to new superfamilies within known folds (since none of these new sequences have superfamily members in the reference set). As the table demonstrates, for sequences belonging to new families, our method consistently improves homology detection. The improvement over FUGUE is significant with the ROC1-S and ROC1-F indices increasing by more than 100% (40% with normalized measures) while the TSS and TFF indices are improved by more than 30% (15% in normalized form). For sequences belonging to new superfamilies DP-FUGUE also improves over the three other methods, however, the improvement is smaller. As for the sequences in new folds, our experiment indicates that in most cases DP-FUGUE cannot improve the results, the reason being that FUGUE itself typically is unable to detect any significant match with members of our reference set. Finally, we conducted another experiment to study the performance of our method in detecting distant homologs that share little sequence identity. In order to do so, we selected from the new SCOP sequences those having less than 20% sequence identity with respect to the proteins in our reference set. We obtained 277 such sequences associated to new families and 103 to new superfamilies.

Conclusion
We study a new method for remote homology detection that utilizes global information on the proximity of entities in the protein space. Our method relies on the distance-profile representation of proteins and protein families that maps each query protein to a high-dimensional feature space, where the coordinates are determined by some association measures with respect to a reference set. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. We experimented with several different types of association measures and demonstrated how an adequate choice of distance metric combined with a proper transformation of the feature vectors through noise reduction, pvalue conversion and normalization can greatly increase the performance of homology recognition (or prediction) compared to the existing approaches.
Interestingly, excellent performance is obtained with normalized feature vectors that correspond to probability distributions. The success of the distance-profile method in general and especially when using probability distribu-tions suggests a relation to mixture models [48]. Specifically, one can consider this representation as the coefficients of a mixture model or of a functional expansion, similar to the Taylor polynomial expansion. Given a set of basis functions such as polynomial functions one can span the complete space of continuous well-behaved functions with the right coefficients. The same principle applies here as our reference set essentially defines a set of basis functions. In statistical terms, each element of the reference set induces a different probability distribution over the protein sequence space. In our experiments the reference set is composed of protein structures, each one can be perceived as a different generative model. The likelihood of generating a sequence according to a model can be estimated by computing the probability that the sequence will fold into the corresponding structure, as measured with the pvalue association measure over the threading similarity score. Although these probability distributions (that correspond to different elements in the reference set) do not necessarily meet the requirement of orthogonality to be considered "basis functions", a sufficiently diverged set of proteins is expected to have the desired properties. It has yet to be defined more precisely what sufficiently diverged means and the minimal required diversity.
One intriguing aspect that has not been fully addressed is the interaction between the association function, the distance metric and the zscore cutoff value. In some cases the coupling is not surprising. For example, when the prob association function is used, the L 2 metric clearly underperforms compared to L 1 and JS metrics, as expected, since the vectors compared correspond to probability distribu- tions. With the other association measures, L 1 and L 2 perform similarly. Study of the theoretical aspect of this phenomenon will help understand how to take further advantage of these feature vectors to cluster the protein sequences more accurately.
A word of caution is in order here regarding the evaluation. While SCOP is considered the gold standard, it is not perfect and often one can find mis-classifications [7,24,49]. While it is hard to estimate the exact rate of errors, it is unlikely that they exceed thousands and we do not anticipate the results to change drastically even if these mis-classification were corrected.
We should also mention that the DP representation is not effective for detailed, atom-resolution prediction of 3D structure or for site-specific functional annotation, since it cannot produce alignments. Another weakness is that the distance-profile representation and the new pvalue-distance measure may fail to distinguish two proteins if they have almost identical "preferences" for the known structures but different preferences for other, unknown structures that are yet to be determined. However, given the current size of the protein structure space, it is expected that for most proteins the available structural information (as embodied in our reference set) is sufficient to estimate their proximity. Indeed, only 20% of the new SCOP 1.67 sequences were assigned to new folds, Clearly, as more structures are determined and integrated into the refer-ence set, the distance-profile representation is expected to improve.
One potential contribution of this work is the possibility of combining the transformations described in section 'Processing feature vectors' to the feature vectors used by existing kernel methods. These techniques can effectively reduce noise and increase the accuracy of classification of the feature space. In addition, since the feature vectors are typically sparse after noise reduction, an efficient computation of the kernel function can be implemented in a high-dimensional feature space.
Finally, a major advantage of the distance-profile representation is in its great flexibility. The underlying association measure can be based on sequence, structure, predicted function, threading, or any other similarity measure. Clearly, more distinctive association measures will create better representations. Indeed, the FUGUE zscore is clearly a better choice than the PSIBLAST evalue since FUGUE exploits sequence-structure alignment information rather than just sequence alignment information.
If the association measure can report a significance value (such as zscore or evalue) that emphasizes extremes and pinpoint the interesting cases, the statistical measure will be preferred over the raw score. This is especially useful when the raw score is meaningless by itself. In these cases one needs a yardstick or a scale to tell what is close and what is far and the statistical estimates provide such a  scale. Nevertheless, it is important to note that the distance-profile representation and the induced similarity measure are quite robust and work well even with raw association scores, noisy or corrupted data, and weak signal-to-noise ratio [25]. All our data, including the FUGUE results, the PSIBLAST results and the feature vectors are available at [50].
To perform sequence-structure threading of a query sequence, the process typically starts by obtaining a set of structural conformations from a database of known structures. The amino acid sequence of the query protein is aligned to each conformation in search of an alignment that would produce the minimal total energy (that depends on the structural environment of each reside, or the types of neighboring residues as determined by the sequence-structure alignment). The most likely candidate conformations are the ones yielding the lowest energy. If the energy values are significantly low compared for example to those obtained for shuffled sequences, then these structural conformations can be considered as compatible with the query sequence.
Many different approaches and implementations of threading algorithms have been proposed in the past [30,[51][52][53][54]. The exact details of the alignment algorithm and the computation of the total energy vary from one method to another. Unfortunately, most of them are not publicly available to allow an extensive comparison of sequences and structures. Of the few that are available, we chose FUGUE for our study. FUGUE [30] is a sequencestructure alignment algorithm that uses environment-spe-Distribution of pvalue-distances between feature vectors P pvalue with thresholding (τ = 3.5) Figure 6 Distribution of pvalue-distances between feature vectors P pvalue with thresholding (τ = 3.5). Thresholding introduces complex effect on the distribution of the PD measure and makes the derivation of its significance level difficult. Right: The complete distribution is plotted in logscale. The linear correlation in logscale suggests an exponential decay. Left: Zoomin on the range [0,40]. The distribution is multi-modal due to the contributions of a different number of elements to sum as in Equation (4). The smallest nonzero PD measures start at approximately 4.426, which corresponds to a match of two feature vectors P pvalue at one entry where the z-score is equal to 3.5, i.e. a pvalue of 0.0120 or -log pvalue = 4.426. The distribution decreases rapidly and increases again at PD ≈ 9, corresponding to a match of two feature vectors at two distinct entries with zscore ≈ 3.5. The pattern repeats periodically as the number of significant entries common to both feature vectors increases. cific substitution tables and structure-dependent gap penalties to evaluate the alignment score. It switches between local and global alignment based on the ratio between the length of the query sequence and that of the structural profile. Previous experimental results have shown that FUGUE outperforms other methods in fold recognition such as PSIBLAST [5], SAM-PSIBLAST [32], HMMER-PSI-BLAST [33], THREADER [51] and GenTHREADER [53]. It is worth mentioning that FUGUE should be considered as a structure-based sequence alignment algorithm rather than a sequence-structure alignment algorithm per se, since it does not involve the definition and the minimization of any energy function to measure the goodness of fit of the query sequence to the template structure. However, its use of substitution tables and structure-dependent parameters provides additional information which is unavailable to pure sequence-based methods. In FUGUE, the compatibility of each sequence-structure pair is assessed through the zscore, which measures the departure of the observed threading score value from its mean, normalized by the standard deviation (where the mean and standard deviation are computed based on the distribution of alignment scores over shuffled sequences. Zscores of meaningful matches are then shifted such that the minimal zscore starts at 0. It is claimed that a zscore less than 2 typically implies an uncertain match between the template structure and the query amino acid sequence, and a zscore larger than six implies an almost certain match between the protein and the template folding structure [30]. It should be noted that threading measures are asymmetric. Given two proteins, A and B with known structures, then S(B, A) does not necessarily equal to S (A, B). Actually, for most protein pairs the equality S(B, A) = S(A, B) does not hold, as different protein structures have different "sequence capacities". These capacities affect the ability of an arbitrary sequence to conform with the given structure, and therefore also the probability that the structure will be energetically favorable for the given sequence. However, the asymmetry may be a fundamental feature of the protein space that we may want to preserve and study later on in our analysis.

The statistical significance of the PD measure
We computed the distribution of pvalue-distances (PD) between pairs of feature vectors P pvalue of unrelated protein sequences. The distribution is a complex one, and features multiple modes (Figure 6a). Each mode corresponds to the occurrence of one additional term in the summation contributing to the final value of the PD measure as described in Eq (4), i.e. the two feature vectors share one additional entry where the pvalues are both significant. One may characterize the PD measure as the sum of a random number of EVD random variables. Unfortunately, analysis of such a distribution is generally intractable and therefore we can only estimate the significance level of our PD measure based on the empirical distribution.
In Figure 6b, we plot the distribution of the PD measure over its full range (up to 900) in log scale. We observe that the frequency of occurrence of large PD values (typically above 100) decreases exponentially, i.e. linearly in log scale (the increasingly dispersed pattern at large PD value > 600 is mainly due to data sparsity). This phenomenon can be explained by examining again the definition of the PD measure given in (4). If we consider a zscore as a random variable, its associated pvalue is a random variable uniformly distributed between 0 and 1. Based on the assumption that and are independent (which is true if the two feature vectors are randomly drawn from our database), max( , ) follows a triangular distribution and -log max( , ) is an exponential random variable. Therefore, -log PD can be viewed as a sum of exponential random variables. From the statistical literature, we know that the sum of k independent and identically distributed exponential random variables with parameter α can be modeled by the Gamma function.
For large x, we have where ε is an additive constant. Hence, log f G (x; k, α) approximately decreases with x in a linear fashion. In our case, clearly the summands in (4) are not independent of each other since the reference set is composed of groups of related sequences. Nonetheless, one may expect that the dependency is not too strong because these sequences share less than 40% sequence identity. The plot indeed suggests that the distribution of the PD measure follows this trend. Table 9 reports the results for four specific queries. For simplicity, proteins are designated by their SCOP family name followed by a number indicating their relative position in the original SCOP file. For instance, the first protein in the SCOP file belonging to the family a. 1.1.1 is  denoted by a.1.1.1.1, and so on. The queries were picked so as to demonstrate that the best performance for a given combination of a query protein and a performance index can be obtained with any of the methods we tested. However, on average the distance-profile method is significantly more sensitive and cases like c.37.1.3.6 are rare.

Examples of homology detection
Tables 10 and 11 list the closest neighbors of the protein queries d2tgf__ (Transforming growth factor alpha (designated g.3.11.1.8) and d1hdoa_ (Biliverdin IX beta reductase) (designated c.2.1.2.26) with each one of the five competing methods. As these demonstrate, a significantly larger number of proteins that are biologically related to the query sequence are placed at the top of the neighbor list with the distance-profile method. For example, the protein domain g.3.11.1.8 belongs to the superfamily g. Comparison of the normalized TSS index of FUGUE and DP-FUGUE Figure 5 Comparison of the normalized TSS index of FUGUE and DP-FUGUE. Left: Scatter plot based on all queries in the test set that belong to SCOP superfamilies containing at least 10 members. Right: Averaged scatter plot (for each TSS value of FUGUE we average all TSS indices of DP-FUGUE that are associated with that value). Points situated above the diagonal represent successful cases where the TSS index is improved when using the DP representation and vice versa. of the sorted list and our DP-FUGUE method further improves the index to 10 TP. The differences for the protein domain c.2.1.2.26 are even more substantial, with BLAST, PSIBLAST and DP-PSIBLAST reporting only two TP before the first FP, FUGUE reporting 5 TP, while DP-FUGUE reporting 28 TP, thus significantly enhancing the clustering of the members of the superfamily c.2.1. We observe that the performance of the DP method depends on the configuration of the sorted list obtained using the initial association measure. Our method works best when some true positives can be found among the top ones on the list, even if proceeded by or mixed with false positives; it refines the results by clustering even closer the similar elements while pushing further away the false positives. Our method does not improve the results in the case of c.37.1.13.6. However, we note that this is a difficult case of homology detection since there are more than 86 members in the superfamily c.37.1 while all methods detect only a very few true positives at the top of the sorted list. There are several cases where FUGUE lists a FP at the top of the list but a couple of TPs are among the closest ones. In these cases, DP can enhance the mapping and improve over FUGUE, as is also demonstarted in Figure 5. However, when there is only one TP at the top of the list, our method does not always improve over FUGUE. This is most difficult when the top true positive match is only marginally higher than the first false positive.