The ongoing sequencing efforts continue to discover the sequences of many new proteins, whose function is unknown. Currently, protein databases contain the sequences of about 1,800,000 proteins, of which more than half are partially or completely uncharacterized [1]. Typically, proteins are analyzed by searching for homologous proteins that have already been characterized. Homology establishes the evolutionary relationship among different organisms, and the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. However, detecting homology between proteins can be a difficult task.

Our ability to detect subtle similarities between proteins depends strongly on the representations we employ for proteins. Sequence and structure are two possible representations of proteins that hinge directly on molecular information. The essential difference between the representation of a protein as a sequence of amino acids and its representation as a 3D structure traditionally dictated different methodologies, different similarity or distance measures and different comparison algorithms. The power of these representations in detecting remote homologies differ markedly. Despite extensive efforts, current methods for sequence analysis often fail to detect remote homologies for sequences that have diverged greatly. In contrast, structure is often conserved more than sequence [2–4], and detecting structural similarity may help infer function beyond what is possible with sequence analysis. However, structural information is sparse and available for only a small part of the protein space.

One may argue that the weakness of sequence based methods is rooted in the underlying representation of proteins, the model used for comparison and/or the comparison algorithm, since in principle, according to the central dogma of molecular biology, (almost) all the information that is needed to form the 3D structure is encoded in the sequence. Indeed, in recent years better sequence-based methods were developed [5–8]. These methods utilize the information in groups of related sequences (a protein or domain family) to build specific statistical models associated to different groups of proteins (*i.e. generative models*) that can be used to search and detect subtle similarities with remotely related proteins. Such generative models assume a statistical source that generates instances according to some underlying distributions, and model the process that generates samples and the corresponding distributions.

When seeking a new similarity measure for proteins that departs from sequence and structure, a natural question is what is the "correct" encoding of proteins? Several works studied the mathematical representation of protein sequences based on sequence properties such as amino acid composition or chemical features [9–12]. However, these representations had limited success, since they did not capture the essence of proteins as ordered sequences of amino acids.

Recently, alternative representations of protein sequences based on the so-called *kernel* methods were proposed. These methods are drawn from the field of machine learning and strive to find an adequate mapping of the protein space onto the Euclidean space where classification techniques such as support vector machines (SVM) or artificial neural networks (ANN) can be applied. Under the kernel representation, each protein is typically mapped to a vector in a (high-dimensional) *feature space*, and the resulting vector is termed *feature vector*. Subsequently, an inner product is defined in the feature space in order to estimate the (dis)similarity among different proteins. A major advantage of the kernel methods is that with an adequate choice of the kernel function, the feature vectors need not be computed explicitly in order to evaluate the similarity relationships. In addition, the users may build a specific feature space such that the kernel function directly estimates these relationships. However, string kernels do not easily lend themselves to this property, and therefore they need to be computed explicitly. The main difference between the different kernel methods reside in the definition of feature elements which are either related to the parameters of some generative process for each group of related proteins or some measure of similarity among the protein sequences.

For instance, the SVM-Fisher algorithm [13] uses the *Fisher kernel* which is based on hidden Markov models (HMMs). The components of the feature vector are the derivatives of the log-likelihood score of the sequence with respect to the parameters of a HMM that has been trained for a particular protein family. Tsuda *et. al.* [14] implemented another representation based on *marginalized* and *joint kernels* and showed that the Fisher kernel is in fact a special case of marginalized kernel. They also experimented with the marginalized count kernels of different orders, that are similar to the *spectrum kernel* which was first introduced by [15]. The spectrum kernel is evaluated by counting the number of times each possible *k*-long subsequence of amino acids (*k*-mer) occurs in one given protein. The marginalized count kernel takes into account both the observed frequency of different subsequences and the context (e.g. exon or intron for DNA sequences). The *mismatch-spectrum kernel* [16] is a generalization of the spectrum kernel that considers also mutation probabilities between *k*-mers which differ by no more than *m* characters. The *homology kernel* [17] is another biologically motivated sequence embedding process that measures the similarity between two proteins by taking into account their respective groups of homologous sequences. It can be thought of as an extension of the mismatch-spectrum kernel by adding a wildcard character and distinguishing the mismatch penalty between two substrings depending on whether the sequences are grouped together or not.

The *covariance kernel* is another type of kernel that uses a framework similar to the one employed by our method. The covariance kernel approach is probabilistic and much work is focused on the implementation of the generative models. In [18], the covariance kernel is based on the *mutual information kernel* which measures the similarity between data samples and a certain generative process. The generative process is characterized by a mediator distribution defined between the (usually vague) prior and the posterior distribution. On the other hand, [19] focuses on the representation of biological sequences using the probabilistic suffix tree [20] as the generative model for different groups of related proteins. The proposed kernel generates a feature vector for protein sequences, where each feature corresponds to a different generative model and its value is the likelihood of the sequence based on that model. Finally, we mention another related work [21] that uses the notion of *pairwise kernel*. Under this framework, each protein is represented by a vector that consists of pairwise sequence similarities with respect to the set of input sequences. It is also worth mentioning the many related studies in the field of natural language processing and text analysis. For example, an approach that in some ways is similar to the pairwise and covariance kernels and in other ways is related to the spectrum kernel approach, is used in [22] to represent verbs and nouns in English texts, with nouns represented as frequency vectors over the set of verbs (based on their association with different verbs) and vice versa. The nouns and verbs are then clustered based on this representation. Studies that attempted to devise automatic methods for text categorization and web-page classification often use similar techniques, where the text is represented as a histogram vector over a vocabulary of words (e.g. [23]).

Here we study a general framework of protein representation called the *distance-profile* representation, that utilizes the global information in the protein space in search of statistical regularities. The representation draws on an association measure between input samples (e.g. proteins and protein families) and can use existing measures of similarity, distance or probability (even if limited to a subset of the input space for which the measure can be applied). This representation induces a new measure of similarity for all protein pairs based on their vectorial representations. Our representation is closely related to the covariance and pairwise kernels described above. However, it is the estimation of pvalues through statistical modeling of the background process, coupled with the transformation to probability distributions, the noise reduction protocols and the choice of the distance function that result in a substantial impact on the performance, and we demonstrate how an adequate choice of the score transformation and the distance metric achieves a considerable improvement in detection of remote homologies.

This paper is organized as follows. We first introduce the notion of distance-profile. We describe how to process the feature vectors through *noise reduction*, and *pvalue transformation* followed by *normalization*. We compare the performance of our new method against several standard algorithms by testing them on a large set of protein families.