 Methodology article
 Open Access
Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'
BMC Bioinformatics volume 9, Article number: 394 (2008)
Abstract
Background
Many proposed statistical measures can efficiently compare protein sequences to further infer protein structure, function and evolutionary information. They share the same idea of using the k-word frequencies of protein sequences. Until now, however, the information carried by the sequences related to a given protein has not been used for protein sequence comparison. This paper proposes a scheme to construct a protein 'sequence space' associated with the protein sequences related to the given protein, and compares the performance of statistical measures with and without the information in the protein 'sequence space'. This paper also presents two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure).
Results
We tested the statistical measures, with and without the protein 'sequence space', on three data sets. This not only offers a systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparisons of statistical measures based on protein sequences. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first, performed via ROC (Receiver Operating Characteristic) analysis, assesses the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second assesses how well our measure performs in phylogenetic analysis. From these experiments, several conclusions can be drawn, and from them we obtain novel guidelines for the use of the protein 'sequence space' and statistical measures.
Conclusion
Alignment-based measures have a clear advantage when the data are highly redundant. Among the statistical measures, the most efficient is the novel gsm.k introduced in this article, followed by cos.k. When the data become less redundant, the proposed gre.k achieves a better performance, but all the other measures perform poorly on classification tasks. Almost all the statistical measures improve by exploiting the information in the 'sequence space' as the word length increases, especially for less redundant data. The reasonable results of the phylogenetic analysis confirm that Gdis.k based on the 'sequence space' is a reliable measure for phylogenetic analysis. In summary, our quantitative analysis verifies that exploiting the information in the 'sequence space' is a promising way to improve the ability of statistical measures to compare proteins.
Background
Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to an explosive growth of biological sequence databases. For example, there are several well-known protein databases: Pfam [1] (a secondary database of multiple alignments and profile hidden Markov models), SCOP [2] (a secondary database containing protein family and structural information), Swiss-Prot [3] (a primary database of protein sequences), and the Protein Information Resource (PIR) [4] (a primary database of protein sequences). This deluge of data, in turn, raises new questions in protein sequence analysis, such as how to classify protein sequences, infer their evolutionary information, and predict their structures.
In protein sequence analysis, important computational methods include similarity search, phylogenetic analysis and sequence classification. Similarity search [5–7] queries a database of sequences with known function and uses the structures and functions of the most closely matched sequences to analyze the structure and function of the query sequence. Phylogenetic analysis [8–12] is the study of the evolutionary history among species. It can also help pharmaceutical researchers determine which species share medicinal qualities [13]. Protein classification [14, 15] aims at a biologically meaningful partition. It has several advantages: when proteins are grouped into a family, the grouping provides clues about the general features of the family and evolutionary evidence about the proteins, and allows the biological function of a new sequence to be inferred from its similarity to sequences of known function. Moreover, protein classification can facilitate the discovery of protein three-dimensional structure, which is very important for understanding protein function. However, these computational methods rely heavily on the (dis)similarity measures defined among biological sequences.
Because of the importance of research into (dis)similarity measures, numerous efficient algorithms have been developed, but challenges remain. Moreover, we believe that further improvements in (dis)similarity measures will allow us to design more effective tools, which can help us look further back in evolutionary time. One of the most common dissimilarity measures in this area is the edit distance obtained by aligning two sequences. It is defined as the minimum number of insertions, deletions, and replacements of characters required to transform the first protein sequence into the second. This measure, however, faces two difficulties: (i) the computational cost on large biological databases [16, 17]; (ii) the choice of scoring scheme [16]. Therefore, alignment-free measures are actively pursued to overcome the limitations of alignment-based protein analysis.
Many efficient alignment-free measures for sequence comparison have been proposed, but they are still at an early stage of development compared with alignment-based methods. One comprehensive review [16] reported several concepts of (dis)similarity measures, such as the Euclidean distance [18], Mahalanobis distance [19], Kullback-Leibler discrepancy [20], cosine distance [21] and Pearson's correlation coefficient [22]. Recently, several novel alignment-free measures have been designed for protein sequence analysis, such as S1 and S2 [23], the W-metric [14], the Universal Similarity Metric [15], local decoding [24], CLUSS [25] and Long Short-Term Memory [26].
In the statistical measures, each sequence is mapped into an n-dimensional vector according to its k-word frequencies. Linear algebra is then employed to define the similarity score between sequences represented in this vector space. The kld measure extended by Wu et al. (2001) is computed from two vectors of relative k-word frequencies obtained over a sliding window from two given DNA sequences. However, in applications where some entries of the vectors are equal to 0 or 1, kld becomes unsuitable. In this paper, we present two statistical measures that overcome this limitation of kld. The contents can be summarized as follows:

1. We present a scheme to build the protein 'sequence space' based on score matrices (amino acid substitution matrices) and to calculate the k-word frequencies of the protein 'sequence space'.

2. Two statistical measures, gre.k and gsm.k, are proposed as extensions of the Jensen-Shannon divergence. They are based on k-word frequencies and the Jensen-Shannon divergence. Although these two concepts are not new, their generalization is the novel aspect of these measures. In particular, the statistical measure Gdis.k is proved to be a valid distance measure.

3. Our measures are applied in extensive tests, e.g., protein sequence classification and phylogenetic analysis. The performance of our measures is compared with that of alignment-based measures and the existing statistical measures. Through the experiments, we want to address the following questions with the aid of a well-known statistical index: (A) how well our statistical measures perform compared with the existing statistical measures and alignment-based ones; (B) which statistical measures perform better when exploiting the information in the protein 'sequence space'; (C) whether the classification ability of the statistical measures depends on the choice of score matrix; (D) whether our measure, Gdis.k, is a valid distance measure for phylogenetic analysis.
Results and discussion
Classification of protein sequences
The proposed statistical measures are used to classify protein sequences. Several benchmark data sets of non-homologous protein structures have been developed in the last few years [27–30]. In this study, we have chosen the 36 protein domains of [27] (CK), the Rost and Sander data set (RS) and the 86 prototype protein domains of [28] (SP). The Chew-Kedem data set (Additional file 1) was introduced in [27] and further studied in [31]. It consists of 36 protein domains drawn from PDB entries of three classes (alpha/beta, mainly-alpha, mainly-beta). Although this data set has been used extensively, its main drawbacks are its small size and high redundancy. The Rost and Sander data set (RS126) (Additional file 2) was designed for the secondary structure prediction of proteins with a pairwise sequence similarity of less than 25% [32], and it has been used as test data to evaluate the performance of similarity measures [33]. Here, we not only compare the proteins' secondary structures, but also analyse the performance of (dis)similarity measures according to the classification of the proteins given by SCOP, release 1.69 [34]. We adopt this manually curated database, which contains expert knowledge, as our gold standard at the class level. The data set is trimmed to exclude sequences belonging to classes with fewer than 5 elements, which yields a data set of 121 protein sequences, denoted by RS. The Sierk-Pearson data set (Additional file 3), which consists of a non-redundant subset of 2771 protein families and 86 non-homologous protein families from the CATH protein domain database [35], was introduced in [28]. We estimate the redundancy of the data with the CD-HIT program, which clusters protein databases at a given sequence homology threshold [36]. Running CD-HIT with a 70% homology threshold reveals that 29, 120 and 86 sequences of the data sets CK, RS and SP, respectively, fall below the threshold. These results clearly indicate that CK is highly redundant, RS has low redundancy, and SP is even less redundant.
The experiments aim at evaluating the classification ability of the alignment-based measures and the statistical measures. The evaluation procedure is based on a binary classification of each protein pair, where 1 corresponds to the two protein sequences sharing the same class, and 0 otherwise.
Given a data set of size n, an n × n similarity/distance matrix can be obtained from each measure. The entries of the upper triangle of this matrix constitute a similarity vector of length $\binom{n}{2}$, which is used as the prediction. Likewise, we obtain a vector of length $\binom{n}{2}$ consisting of 1s and 0s as class labels. A perfect measure would completely separate the negative set from the positive set. Of course, this does not happen in practice, and the classes are interspersed. ROC curves make it possible to assess the accuracy of this separation without choosing any distance threshold as the separation point. In particular, the AUC summarizes the relative accuracy of each measure in a single number.
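The evaluation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names `pairwise_labels`, `upper_triangle` and `auc` are hypothetical, the AUC is computed with the rank-sum (Mann-Whitney) identity rather than by integrating a plotted curve, and higher scores are assumed to mean "more similar" (a distance matrix would need its scores negated first).

```python
from itertools import combinations


def pairwise_labels(classes):
    """Return 1 for each unordered pair sharing a class, else 0."""
    return [1 if a == b else 0 for a, b in combinations(classes, 2)]


def upper_triangle(matrix):
    """Flatten the strict upper triangle of an n x n matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity.

    Assumes higher scores indicate 'same class'. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up similarity values for four sequences in two classes).
classes = ["alpha", "alpha", "beta", "beta"]
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
]
toy_auc = auc(upper_triangle(similarity), pairwise_labels(classes))
```

In this toy case both same-class pairs score higher than every different-class pair, so `toy_auc` equals 1.0; a measure that ranks pairs at random would give a value near 0.5.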
The measures evaluated are the alignment-based measures, our statistical measures (gre.k and gsm.k) and the six statistical measures outlined in the Methods section (ed.k, cos.k, se.k, W.k, s1.k and s2.k). The alignment-based measures are Clustal X and Needleman-Wunsch (global alignment) or Smith-Waterman (local alignment) raw scores, with no correction for statistical significance, using ten score matrices (BLOSUM40, BLOSUM45, BLOSUM62, BLOSUM80, BLOSUM100, PAM40, PAM80, PAM120, PAM200, PAM250) and linear or affine gap penalties, with a gap penalty of 8. All statistical measures based on the k-word frequencies of the protein sequence and of the protein 'sequence space' are run with k from 1 to 4, where the protein 'sequence space' is constructed from one of the score matrices (BLOSUM40, BLOSUM45, BLOSUM62, BLOSUM80, BLOSUM100, PAM40, PAM80, PAM120, PAM200, PAM250). For each measure, separate tests are run for every combination of parameter values, and the best combination is chosen to represent the measure. ROC curves are computed to evaluate and compare the performance of our methods and the other (dis)similarity measures.
The ROC curves obtained for the classifications are presented in Figures 1, 2, 3. Figure 1(a), Figure 2(a) and Figure 3(a) show the ROC curves of the alignment-based measures and the statistical measures based on the k-word frequencies of protein sequences. Figure 1(b), Figure 2(b) and Figure 3(b) show the ROC curves of the alignment-based measures and the statistical measures based on the k-word frequencies of the protein 'sequence space'. The better (dis)similarity measures have plots with higher values of sensitivity for equal values of specificity, resulting in higher values for the areas under the curves. The AUC value is typically used as a measure of overall discrimination accuracy. Table 1 provides the areas under the ROC curves (AUC) obtained from all the (dis)similarity measures for the data sets CK, RS and SP.
Question A
In the CK experiment, Figure 1 and Table 1 show that alignment-based measures perform better than alignment-free measures. NWaffine.b45 outperforms the other alignment-based measures, with an area under the ROC curve of 0.860. Among the statistical measures based on the k-word frequencies of protein sequences, gsm.2 is clearly more efficient than the other measures, with an area under the ROC curve of 0.791. The next best measure is cos.1, with an area under the ROC curve of 0.729, and the other measures lag behind. Among the statistical measures based on the k-word frequencies of the protein 'sequence space', gsm.4.b100 is significantly better than the other statistical measures, followed by se.3.b100.
In the RS experiment, Figure 2 and Table 1 indicate that some statistical measures perform as well as alignment-based measures. By exploiting the information in the protein 'sequence space', the statistical measure gsm.k performs better than the alignment-based measures. Among the alignment-based measures, NWaffine.b40 performs best. Among the statistical measures based on the k-word frequencies of protein sequences, cos.1 outperforms the other measures. Among the statistical measures based on the k-word frequencies of the protein 'sequence space', gsm.3.b40 is significantly better than all other measures, with an area under the ROC curve of 0.627, and the next best measure is gre.4.b100.
In the SP experiment, Figure 3 and Table 1 illustrate that some statistical measures defined by the k-word frequencies of protein sequences outperform alignment-based measures. When the information in the protein 'sequence space' is added, all the statistical measures, except for se.k and s2.k, perform better than the alignment-based measures. Among the alignment-based measures, SW measures perform better than NW measures. Among the statistical measures based on the k-word frequencies of protein sequences, gre.1 outperforms the other measures, followed by cos.1 and eu.1. Among the statistical measures based on the k-word frequencies of the protein 'sequence space', gre.1.p40 has an area under the ROC curve of 0.575, better than the other statistical measures, and the next best measures are cos.1.p40 and eu.1.p40.
From the above three experiments, we can see that alignment-based measures have a clear advantage when the data are highly redundant. The most efficient statistical measure is the novel gsm.k introduced in this report. When the data become less redundant, the proposed gre.k achieves a better performance, but all the alignment-based and existing measures perform poorly on all classification tasks. Inspection of the ROC curves themselves (Figures 1, 2, 3) further illustrates these comparisons between (dis)similarity measures.
Question B
The main goal of constructing the protein 'sequence space' is to improve the classification ability of (dis)similarity measures by extracting information from related protein sequences. However, not all (dis)similarity measures are suitable for this scheme. To find which statistical measures are suitable, we define a function DAUC(measure, score matrix, k) that evaluates whether the classification ability of a (dis)similarity measure improves:

$$\mathrm{DAUC}(\text{measure}, \text{score matrix}, k) = \mathrm{AUC}(\text{measure}, \text{score matrix}, k) - \mathrm{AUC}(\text{measure}, k)$$

where AUC(measure, score matrix, k) denotes the area under the ROC curve of the statistical measure based on the k-word frequencies of the protein 'sequence space' constructed from the score matrix, and AUC(measure, k) denotes the area under the ROC curve of the measure defined by the k-word frequencies of the protein sequence.
From the definition of DAUC, it is easy to see that DAUC ≥ 0 indicates that using the protein 'sequence space' improves the classification ability of the (dis)similarity measure. The DAUC values for the data sets CK, RS and SP are presented in Figures 4, 5, 6.
As would be expected, the DAUC values of the different measures (Figures 4, 5, 6) show two clear trends. (i) The DAUC values increase from k = 1 to k = 4 for all three data sets. When the word length is equal to 4, the classification ability of almost all the statistical measures is improved. It should be noted that the discrimination of statistical measures based on higher-order word frequencies, such as eu.k, se.k and cos.k, normally worsens [14], because the high dimension of the frequency vectors and the relatively short length of the sequences themselves cause the frequency vector F to be very sparse. Interestingly, the construction of the protein 'sequence space' maintains the accuracy and overcomes the difficulty arising from higher-order words. (ii) There is a dependency between the usefulness of the protein 'sequence space' and the level of redundancy of the data. When the data are highly redundant, as for CK, the 'sequence spaces' are more similar to one another. Consequently, the (dis)similarity measures based on the 'sequence space' achieve only a small improvement (Figure 4 (k = 4)), although the classification accuracy still improves as the word length increases. For the less redundant data, such as RS and SP, all the statistical measures based on the 'sequence space' improve significantly when the word length increases to 4 (Figures 5, 6 (k = 4)).
Question C
Using protein 'sequence space' contributes to the accuracy of protein classification. However, the construction of protein 'sequence space' relies heavily on the score matrix. In order to evaluate the influence of different score matrices, the function MAUC(measure, score matrix) is defined by
where AUC(measure, score matrix, k) denotes the area under the ROC curve of the statistical measure based on the k-word frequencies of the protein 'sequence space' built from the score matrix. The MAUC values of all the statistical measures based on the ten score matrices for the three data sets are presented in Figure 7. Figure 7 largely confirms that the performance of the measures depends on the score matrix. The changes of DAUC for the data sets CK, RS and SP are similar. Among the BLOSUM score matrices, BLOSUM40 and BLOSUM100 give the larger improvements in the classification ability of the statistical measures. Among the PAM score matrices, PAM120 or PAM250 improves the classification ability of all the (dis)similarity measures more clearly on the highly redundant data, except for the measures eu.k and gre.k, whereas PAM40 or PAM80 improves the classification ability of the (dis)similarity measures more clearly on the less redundant data.
Phylogenetic analysis
Since Gdis.k is a statistical distance measure, it is further tested in the analysis of phylogenetic relationships. Given a set of protein sequences, their phylogenetic relationships can be obtained through the following main operations: first, the k-word frequencies of the protein 'sequence space' are calculated; second, the statistical distances are calculated and arranged into a distance matrix; finally, the phylogenetic relationships are obtained with the neighbor-joining program in the PHYLIP package [37].
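As an illustration of the hand-over between the second and third steps, the sketch below writes a square distance matrix in the plain-text layout that PHYLIP distance programs read: a first line with the number of taxa, then one row per taxon whose name occupies the first ten characters. The function name `to_phylip` and the toy distances are hypothetical, and the distance measure itself is not implemented here.

```python
def to_phylip(names, dist):
    """Format a square distance matrix as PHYLIP-style text.

    names: list of taxon labels (truncated or padded to 10 characters).
    dist:  list of rows, dist[i][j] being the distance between taxa i and j.
    """
    n = len(names)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be square and match the names")
    lines = [f"{n:5d}"]
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Toy example with made-up distances between three proteins.
example = to_phylip(
    ["SMC1", "SMC4", "Rad50"],
    [[0.0, 0.2, 0.7], [0.2, 0.0, 0.6], [0.7, 0.6, 0.0]],
)
```

The resulting text can be saved to a file and passed to a neighbor-joining implementation; names longer than ten characters are truncated, so they should be made unique beforehand.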
The data set includes 68 SMC proteins, 5 Rad50 proteins and 5 MukB proteins (Additional file 4), which have been widely studied [38–42]. Our distance measure is applied to these data, and the results are shown in Figure 8. To assess the robustness of an estimated tree under perturbations of the input alignment, it is customary to perform a bootstrap analysis, in which entire columns of the alignment are resampled with replacement. Here the bootstrap technique is employed to evaluate the tree topologies by resampling the sequences 100 times. The phylogenetic tree is drawn with the MEGA program [11]; bootstrap values lower than 50 are hidden. In general, either an independent method can be developed to evaluate the accuracy of phylogenetic relationships, or their validity can be tested by comparing them with authoritative ones. Here, we adopt the latter approach to test the validity of our measure.
Question D
Our results are quite consistent with the accepted taxonomy and with authoritative studies [40–42] in the following three aspects. First, all the organisms are clearly separated from each other. Among the SMC proteins, SMC1 and SMC4 are consistently grouped closely (these are the larger SMC subunits of the cohesin and condensin SMC heterodimers, respectively), and the smaller subunits, SMC3 and SMC2, also group closely. SMC5 and SMC6 are grouped together, which is consistent with the fact that they heterodimerize as part of a DNA repair complex [42, 43]. Second, it is obvious from this tree that the closest relatives of the SMC proteins are the Rad50 proteins, followed by the MukB proteins. Many of these Rad50 superfamily proteins have the conserved N-terminal FKS (or FRS) motif (located before the Walker A site), which is present in most of the SMC proteins [41]. Finally, among the SMC proteins, SMC1 and SMC4 are observed to be closer to the SMC proteins, followed by SMC2, SMC3, SMC5 and SMC6 [41, 42]. This suggests that the duplication events giving rise to each subfamily must have occurred either before or very soon after the origin of eukaryotes. The rate of accepted amino acid substitution varies among different eukaryotic taxa within each subfamily. Condensin SMCs appear to show a higher substitution rate than cohesin SMCs, and the mean distances within subfamilies of these proteins (averaged across all condensin and cohesin SMCs for each pairwise comparison between different organisms) are about half (0.54 ± 0.134) the corresponding distances between SMC5 and SMC6 proteins [41]. These reasonable results confirm that Gdis.k is a reliable distance measure for phylogenetic analysis.
Conclusion
Prior to this research, statistical measures were perceived as adequate for the analysis of biological data mainly because of their flexibility and scalability with data set size. In particular, some of them have been quantitatively compared for the recognition of SCOP relationships [14]. This article presents a novel way to compare protein sequences by exploiting the information in the 'sequence space', together with two new statistical measures: gre.k and gsm.k. It offers the first systematic and quantitative experimental assessment of statistical measures based on the protein sequence and the protein 'sequence space', which naturally complements the many available comparisons based on protein sequences.
The accuracy of each (dis)similarity measure in classifying protein sequences is assessed through experiments on highly redundant and less redundant data sets. The comparative index AUC is a good measure of the overall accuracy of a classification scheme. The proposed statistical distance measure, Gdis.k, is further tested in the analysis of phylogenetic relationships.
For the highly redundant data, alignment-based measures have a clear advantage. Among the statistical measures, gsm.k, followed by cos.k, is clearly the most efficient (Figure 1 and Table 1). When the data become less redundant, all the statistical measures, except for se.k and s2.k, outperform the alignment-based measures by exploiting the information in the protein 'sequence space', and the proposed gre.k achieves the best performance (Figure 3 and Table 1). The scheme for constructing the 'sequence space' provides more information than the protein sequence alone and contributes to the accuracy of protein classification, especially for the less redundant data sets such as RS and SP. Almost all the statistical measures based on the 'sequence space' improve significantly when the word length increases to 4 (Figures 4, 5, 6). In addition, the reasonable results of the phylogenetic analysis illustrate the validity of our distance measure for phylogenetic analysis.
Overall, our comparison study highlights the need for alignment-free measures to extract as much information as possible. This understanding can guide the development of more powerful measures for protein sequence comparison, with possible future improvements in evolutionary, structural and functional studies. It is worth noting, however, that although exploiting the information in the 'sequence space' improves the classification ability of some (dis)similarity measures, they all perform very poorly on less redundant data, with AUC values near the random-classification value of 0.5. That is to say, they may be useless in practice. We therefore expect further investigation of statistical methods, especially for data sets with low redundancy.
Methods
Word statistics
Word statistics in protein sequence
There is a large body of literature on word statistics [45], in which sequences are interpreted as a succession of symbols and are analyzed through the frequencies of their short segments. A k-word is a series of k consecutive letters in a sequence. The k-word statistical analysis consists of counting the occurrences of k-words in a given sequence. For a sequence s, the count of a k-word w, denoted by c(w), is the number of occurrences of w in s. The standard approach for counting k-words in a sequence of length m is to use a sliding window of length k, shifting the frame one residue at a time from position 1 to m − k + 1. In this method, k-words are allowed to overlap in the sequence. In this way, a sequence can be represented by an n-dimensional vector $C_k^s$ made up of k-word counts

$$C_k^s = \left(c(w_{k,1}), c(w_{k,2}), \cdots, c(w_{k,n})\right)$$

where n is the number of all possible k-words. For example, for the protein sequence s = VCST, the vector of 2-word counts has the entry 1 for each of the words VC, CS and ST and the entry 0 for the remaining 397 of the 400 possible 2-words.
The frequencies of k-words, $F_k^s$, can be calculated by

$$F_k^s = \left(f(w_{k,1}), f(w_{k,2}), \cdots, f(w_{k,n})\right), \qquad f(w_{k,i}) = \frac{c(w_{k,i})}{\sum_{j=1}^{n} c(w_{k,j})}$$
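The sliding-window counting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `kword_counts` and `kword_frequencies` are hypothetical, and the frequencies are assumed to be the counts divided by the total number of k-words in the sequence.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kword_counts(seq, k):
    """Count overlapping k-words with a window shifted one residue at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_frequencies(seq, k):
    """Relative frequency of every possible k-word (20**k entries)."""
    counts = kword_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    words = ("".join(w) for w in product(AMINO_ACIDS, repeat=k))
    return {w: counts[w] / total for w in words}


counts = kword_counts("VCST", 2)
freqs = kword_frequencies("VCST", 2)
```

For s = VCST and k = 2 the window visits VC, CS and ST, so `counts` holds one occurrence of each and `freqs` assigns 1/3 to each of them and 0 to the other 397 words.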
Word statistics in protein 'sequence space'
The number of possible protein sequences is enormous. When a protein sequence is given, we are interested in its related proteins, and we denote them as the 'sequence space' of the given protein.
Substitution matrices represent the similarity of amino acids: each entry $m_{ij}$ of a substitution matrix $[m_{ij}]$ represents the 'normalized probability' (score) that amino acid i mutates into amino acid j. Let i ℵ j denote that the amino acids i and j are similar. Usually, two amino acids i and j are considered similar if $m_{ij} > 0$. That is to say,

i ℵ j if $m_{ij} > 0$, ∀ i, j ∈ Ω

where Ω = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Note that substitution matrices are symmetric, i.e., a being similar to b implies that b is similar to a. This similarity of amino acids is, however, not a transitive relation: a may be similar to b and b similar to c while a is not similar to c. Therefore, the 20 amino acids cannot be partitioned into disjoint similarity classes on the basis of this relation.
We therefore bypass similarity classes and consider instead a star set, which is easy to implement. A star set only assumes that the relation between the center and each vertex is known. We construct a star set from the center and all its vertices, and write the center as the first element of the set to distinguish one set from the others. For example, S is similar to A, T and N in the BLOSUM62 substitution matrix, so with S as the center they constitute the star set {S, A, T, N} presented in Figure 9. For convenience, we write the star set {S, A, T, N} as ℵS = {x | x ℵ S, x ∈ Ω}. In this way, the 20 amino acids yield the 20 star sets presented in Table 2 for the BLOSUM62 substitution matrix.
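A star set can be read directly from one row of a substitution matrix. The sketch below does this for the serine example, using the S row of BLOSUM62 in the conventional ARNDCQEGHILKMFPSTWYV column order; the function name `star_set` is hypothetical, and only this single row is included rather than the whole matrix.

```python
# Column order and the S (serine) row of the BLOSUM62 matrix.
BLOSUM62_ORDER = "ARNDCQEGHILKMFPSTWYV"
BLOSUM62_ROW_S = [1, -1, 1, 0, -1, 0, 0, 0, -1, -2,
                  -2, 0, -1, -2, -1, 4, 1, -3, -2, -2]


def star_set(center, order, row):
    """Return the star set of `center`: the center first, then every
    other amino acid whose substitution score with it is positive."""
    similar = [aa for aa, score in zip(order, row)
               if score > 0 and aa != center]
    return [center] + similar


serine_star = star_set("S", BLOSUM62_ORDER, BLOSUM62_ROW_S)
```

Here `serine_star` is `["S", "A", "N", "T"]`, the same members as the star set {S, A, T, N} in the text; the members after the center simply follow the column order of the matrix.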
Our work derives a way to build the 'sequence space' with the help of star sets. From the definition of a star set, each amino acid corresponds to a star set. For example, the star set of the amino acid S is ℵS = {S, A, T, N} according to the BLOSUM62 substitution matrix. Given two protein sequences $P = p_1 p_2 \cdots p_n$ and $Q = q_1 q_2 \cdots q_n$,

$$p_i \in \aleph q_i \ \ \forall\, i = 1, \dots, n \ \Rightarrow\ P \,\Im\, Q$$

where P ℑ Q denotes that the protein sequences P and Q are related. Given a protein sequence s, its 'sequence space', denoted by $SP_s$, is defined as follows:

$$SP_s = \{P \mid P \,\Im\, s,\ \mathrm{length}(P) = \mathrm{length}(s)\}$$
where P is a protein sequence and length(P) denotes its length. The protein 'sequence space' can be constructed as follows: for each protein sequence, beginning with the first amino acid, we scan through the sequence and substitute the corresponding star set for the amino acid at each position. The resulting set of protein sequences is the 'sequence space' of the protein sequence. For example, given the protein sequence s = VCST, the star sets of V, C, S, and T are {V, M, I, L}, {C}, {S, A, T, N} and {T, S} according to the BLOSUM62 substitution matrix, and the 'sequence space' of s is {V, M, I, L}{C}{S, A, T, N}{T, S}.
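The construction can be made concrete by enumerating the members of a 'sequence space'. The sketch below is illustrative only: the name `sequence_space` is hypothetical, and the star sets are hard-coded from the worked example's sequence space {V, M, I, L}{C}{S, A, T, N}{T, S} instead of being derived from a full substitution matrix.

```python
from itertools import product

# Star sets for the residues of the example sequence, taken from the
# worked example in the text (BLOSUM62); the center is listed first.
STAR_SETS = {
    "V": ("V", "M", "I", "L"),
    "C": ("C",),
    "S": ("S", "A", "T", "N"),
    "T": ("T", "S"),
}


def sequence_space(seq, star_sets):
    """Yield every sequence obtained by replacing each residue of `seq`
    with a member of that residue's star set."""
    choices = [star_sets[res] for res in seq]
    for combo in product(*choices):
        yield "".join(combo)


space = list(sequence_space("VCST", STAR_SETS))
```

For s = VCST the space contains 4 × 1 × 4 × 2 = 32 sequences, starting with VCST itself. Because the size grows multiplicatively with sequence length, explicit enumeration is only practical for short examples.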
Once the protein 'sequence space' is built, its k-word frequencies can be computed similarly. A segment of k symbols from a finite alphabet A with 20 letters is called a k-word. The set $W_k = (w_{k,1}, w_{k,2}, \cdots, w_{k,Y})$ consists of all possible k-words that can be extracted from the protein 'sequence space', and has Y elements, where $Y = 20^k$. The counts of k-words in the protein 'sequence space', denoted by $C_k^{sp_s} = \left(c^{sp_s}(w_{k,1}), c^{sp_s}(w_{k,2}), \cdots, c^{sp_s}(w_{k,Y})\right)$, can be calculated by sliding a window of width k through the protein 'sequence space'. For example, for the protein sequence s = VCST, whose 'sequence space' is {V, M, I, L}{C}{S, A, T, N}{T, S}, we can get a vector of 2-word counts
Similarly, one can then calculate the k-word frequencies of the protein 'sequence space', denoted as $F_k^{sp_s}$, by
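One possible reading of the sliding-window count over a 'sequence space' is sketched below. The weighting is an assumption, since the exact counting rule is not visible in this text: every k-word that can be formed from the star sets inside a window contributes one count per window position, and frequencies are the counts divided by their total. The names `space_kword_counts` and `space_kword_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product


def space_kword_counts(position_sets, k):
    """Count k-words over a sequence space given as one star set per position.

    Assumption: each word generated by a window counts once per window.
    """
    counts = Counter()
    for i in range(len(position_sets) - k + 1):
        for word in product(*position_sets[i:i + k]):
            counts["".join(word)] += 1
    return counts


def space_kword_frequencies(position_sets, k):
    """Normalize the counts; only words with non-zero count are returned."""
    counts = space_kword_counts(position_sets, k)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Sequence space of s = VCST from the text: {V,M,I,L}{C}{S,A,T,N}{T,S}.
vcst_space = ["VMIL", "C", "SATN", "TS"]
space_counts = space_kword_counts(vcst_space, 2)
space_freqs = space_kword_frequencies(vcst_space, 2)
```

Under this assumption the three windows of the VCST example generate 4, 4 and 8 words respectively, so `space_counts` holds 16 distinct 2-words with one count each, compared with 3 words for the sequence alone.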
Statistical distance measures
Previous (dis)similarity measures
We first describe the six previous statistical measures for biological sequences.
Many statistical measures for sequence comparison fix a short word length k, compute the frequencies of all k-words in each sequence, and assess the similarity of the two frequency vectors.
1. Euclidean distance (ed.k)
The Euclidean distance is one of the most common dissimilarity measures for biological sequences. The dissimilarity score between two protein sequences X and Y is the Euclidean distance between their k-word frequency vectors $F_k^X = \left(f(w_{k,1}^X), f(w_{k,2}^X), \cdots, f(w_{k,n}^X)\right)$ and $F_k^Y = \left(f(w_{k,1}^Y), f(w_{k,2}^Y), \cdots, f(w_{k,n}^Y)\right)$ [18]
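A minimal sketch of this distance on sparse k-word frequency vectors is given below. The function name `euclidean_distance` is hypothetical, frequencies are assumed to be stored as dictionaries in which missing words have frequency 0, and the ordinary (square-rooted) Euclidean distance is used; some authors work with the squared form instead.

```python
import math


def euclidean_distance(fx, fy):
    """Euclidean distance between two k-word frequency dictionaries."""
    words = set(fx) | set(fy)
    return math.sqrt(
        sum((fx.get(w, 0.0) - fy.get(w, 0.0)) ** 2 for w in words)
    )


# Toy frequency vectors sharing one of their two words.
fx = {"VC": 0.5, "CS": 0.5}
fy = {"VC": 0.5, "ST": 0.5}
d_xy = euclidean_distance(fx, fy)
```

In the toy case the vectors differ by 0.5 in two coordinates, so the distance is the square root of 0.5.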
2. Cosine of the angle (cos.k)
In order to derive estimation of relatedness from the vector definitions of biological sequences, Stuart et al. (2002) proposed the pairwise cosine for generating accurate gene and species phylogenies from whole genome sequences.
Cosine is a standard measure of vector similarity, and its application for this purpose can be understood intuitively.
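The cosine of the angle between two k-word frequency vectors can be sketched as below. This uses the standard definition (dot product divided by the product of the vector norms); the function name `cosine_similarity` and the dictionary representation are assumptions made for illustration.

```python
import math


def cosine_similarity(fx, fy):
    """Cosine of the angle between two k-word frequency dictionaries."""
    dot = sum(v * fy.get(w, 0.0) for w, v in fx.items())
    norm_x = math.sqrt(sum(v * v for v in fx.values()))
    norm_y = math.sqrt(sum(v * v for v in fy.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)


# Toy frequency vectors sharing one of their two words.
cos_xy = cosine_similarity({"VC": 0.5, "CS": 0.5}, {"VC": 0.5, "ST": 0.5})
```

Because frequencies are non-negative, the value lies between 0 (no shared words) and 1 (identical composition); the toy pair above gives 0.5.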
3. Standardized Euclidean distance (se.k)
The above measures explore the use of Euclidean distances and correlations between k-word frequency representations of sequences. Standardized Euclidean distance takes into account the data covariance structure
where S = [s_{ij}] represents the covariance matrix of k-word frequencies. The standardized Euclidean distance forces cov(f_{i}, f_{j}) = 0 for i ≠ j. Therefore, in this distance measure the correlations between different k-words are ignored and only the variances of the individual k-words are accounted for. The standardized Euclidean distance was first proposed for sequence comparison by Wu et al. (1997).
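The diagonal form described above can be sketched as follows. The example assumes that the variance of each k-word frequency is estimated over a reference collection of frequency profiles, that the distance is the square root of the variance-weighted sum of squared differences, and that k-words with zero variance are skipped; all three choices are assumptions made for illustration.

```python
import math
from statistics import pvariance


def word_variances(profiles):
    """Population variance of each k-word frequency across a collection."""
    words = set().union(*profiles)
    return {w: pvariance([p.get(w, 0.0) for p in profiles]) for w in words}


def standardized_euclidean(fx, fy, variances):
    """Euclidean distance with each squared difference divided by s_ii."""
    total = 0.0
    for word, var in variances.items():
        if var > 0.0:  # assumption: constant k-words carry no information
            diff = fx.get(word, 0.0) - fy.get(word, 0.0)
            total += diff * diff / var
    return math.sqrt(total)


profiles = [
    {"A": 0.50, "C": 0.50},
    {"A": 0.25, "C": 0.75},
    {"A": 0.75, "C": 0.25},
]
variances = word_variances(profiles)
d = standardized_euclidean(profiles[1], profiles[2], variances)
print(d)
```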
4. Kullback-Leibler discrepancy (kld)
Let P_{1} and P_{2} be two probability distributions on a universe X. The Kullback-Leibler divergence (kld), or relative entropy, of P_{1} with respect to P_{2}, denoted as kld(P_{1}, P_{2}), is defined by the Lebesgue integral [46],
Although relative entropy is not a true metric, it satisfies many important mathematical properties. Wu et al. (2001) applied the Kullback-Leibler discrepancy to compare DNA sequences based on the frequencies of all k-words.
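For discrete k-word frequencies the integral reduces to a sum. The sketch below assumes the usual discrete form with base-2 logarithms and the convention 0 · log 0 = 0; it also shows that the value becomes infinite as soon as a k-word occurs in the first sequence but not in the second.

```python
import math


def kld(p, q):
    """Discrete Kullback-Leibler divergence of p with respect to q (in bits)."""
    total = 0.0
    for word, pw in p.items():
        if pw == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        qw = q.get(word, 0.0)
        if qw == 0.0:
            return math.inf  # word present in p but absent from q
        total += pw * math.log2(pw / qw)
    return total


p = {"A": 0.5, "C": 0.5}
q = {"A": 0.25, "C": 0.75}
r = {"A": 1.0}

print(kld(p, q))  # finite and positive
print(kld(q, p))  # differs from kld(p, q): the measure is not symmetric
print(kld(p, r))  # inf, because "C" never occurs in r
```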
5. W-metric (W.k)
Replacing the covariance matrix S used in the standardized Euclidean distance by an amino acid substitution matrix, Vinga et al. (2004) proposed and demonstrated the use of the W-metric as a novel k-word composition metric
where W is an amino acid substitution matrix such as BLOSUM or PAM. W.k is a distance defined between protein sequences, which bridges alignment-based metrics and measures based solely on k-word composition.
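A minimal sketch of such a weighted quadratic form for 1-words, assuming the W-metric has the shape (F^X − F^Y)ᵀ W (F^X − F^Y). The three-letter alphabet and the weight matrix `toy_weights` are hypothetical values chosen for illustration; they are not BLOSUM or PAM scores.

```python
def w_metric(fx, fy, weights, alphabet):
    """Quadratic form of the frequency difference weighted by a matrix."""
    diff = [fx.get(a, 0.0) - fy.get(a, 0.0) for a in alphabet]
    n = len(alphabet)
    return sum(diff[i] * weights[i][j] * diff[j]
               for i in range(n) for j in range(n))


alphabet = "ACD"
# Hypothetical symmetric weights, NOT taken from a real substitution matrix.
toy_weights = [
    [4, 0, -2],
    [0, 9, -3],
    [-2, -3, 6],
]
identity = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

fx = {"A": 0.5, "C": 0.5}
fy = {"A": 0.5, "D": 0.5}

print(w_metric(fx, fy, toy_weights, alphabet))
print(w_metric(fx, fy, identity, alphabet))  # squared Euclidean distance
```

With the identity matrix the form reduces to the squared Euclidean distance, which makes the role of the off-diagonal substitution scores explicit.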
6. S_{1} and S_{2} (s1.k and s2.k)
S_{1} and S_{2} are statistical measures for protein sequences based on comparing the k-words that appear in each sequence [23]. If the sets {W}_{k}^{X}=\left({w}_{k,1}^{X},{w}_{k,2}^{X},\cdots ,{w}_{k,n}^{X}\right) and {W}_{k}^{Y}=\left({w}_{k,1}^{Y},{w}_{k,2}^{Y},\cdots ,{w}_{k,n}^{Y}\right) consist of all possible k-words that can be extracted from proteins X and Y, respectively, S_{1} and S_{2} can be computed by
where Match({\text{W}}_{\text{k}}^{\text{X}}, {\text{W}}_{\text{k}}^{\text{Y}}) is the total number of k-words shared by the two proteins X and Y, the constant c is a normalizing factor, and Word({\text{W}}_{\text{k}}^{\text{X}}) and Word({\text{W}}_{\text{k}}^{\text{Y}}) denote the total numbers of k-words occurring in proteins X and Y.
Novel statistical distance measures
We describe two novel statistical measures for protein sequence comparison based on k-word frequencies.
1. Generalized relative entropy (gre.k)
Relative entropy is one of the most important concepts in both statistical biology and information theory. It has been explored in similarity measures such as kld and SimMM [17, 20] to compare biological sequences. However, in an application where P_{k} is equal to 0 or 1, kld(P_{1}, P_{2}) → ∞, so the similarity measure kld becomes unsuitable. For such an application, we generalize relative entropy with the help of the Jensen-Shannon divergence; the resulting measure, denoted by gre.k, is defined by
Now, if f({w}_{k,t}^{X}) is equal to 0 or 1,
So gre.k can deal with all kinds of k-word frequencies.
2. Gapped similarity measure (gsm.k)
From the definition of gre.k, it is worth noting that the frequencies of k-words present in both sequences contribute differently to gre.k, whereas the frequencies of k-words present in only one sequence make no contribution to gre.k, because if f({w}_{k,t}^{X}) or f({w}_{k,t}^{Y}) is equal to 0,
Similarly, the measures S_{1} and S_{2} focus on the appearances of k-words but ignore their frequencies. To extract information from all the k-words, we investigate a novel statistical measure for protein sequence comparison, called the gapped similarity measure
In the definition of the score function, the frequencies of all the k-words in the protein sequences are considered. Indeed, the measure gsm.k is the edit score between the k-word frequencies of the two protein sequences X and Y. If a k-word w appears in both sequences, the edit score is f({w}_{k,w}^{X})\cdot {\mathrm{log}}_{2}\left(\frac{2\cdot f({w}_{k,w}^{X})}{f({w}_{k,w}^{X})+f({w}_{k,w}^{Y})}\right). If a k-word w appears in protein sequence X but not in Y, it is as if w had been deleted from Y, and we choose the maximum value of the function f({w}_{k,w}^{X})\cdot {\mathrm{log}}_{2}\left(\frac{2\cdot f({w}_{k,w}^{X})}{f({w}_{k,w}^{X})+f({w}_{k,w}^{Y})}\right) as the gap penalty, according to the following proposition.
Proposition. If {F}_{k}^{X}=\left(f({w}_{k,1}^{X}),f({w}_{k,2}^{X}),\cdots ,f({w}_{k,n}^{X})\right) and {F}_{k}^{Y}=\left(f({w}_{k,1}^{Y}),f({w}_{k,2}^{Y}),\cdots ,f({w}_{k,n}^{Y})\right) are two k-word frequency vectors of length n,
Proof: To find its maximum, we rewrite
Since \frac{f({w}_{k,t}^{X})}{f({w}_{k,t}^{X})+f({w}_{k,t}^{Y})}\le 1, we can get
Thus
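The bound behind the gap penalty can be checked numerically. The sketch below assumes the per-word score quoted in the text, f^X · log2(2 f^X / (f^X + f^Y)), together with the convention that a k-word absent from X scores 0; `score_term` is an illustrative name. Since 2 f^X / (f^X + f^Y) ≤ 2, the logarithm is at most 1 and the term never exceeds f^X, with equality when the k-word is missing from Y.

```python
import math


def score_term(fx, fy):
    """Per-word score fx * log2(2 * fx / (fx + fy)), taken as 0 when fx == 0."""
    if fx == 0.0:
        return 0.0
    return fx * math.log2(2.0 * fx / (fx + fy))


grid = [i / 10 for i in range(11)]
bound_holds = all(score_term(a, b) <= a + 1e-12 for a in grid for b in grid)

print(bound_holds)            # the term never exceeds fx on this grid
print(score_term(0.3, 0.0))   # word missing from Y: the term equals fx
print(score_term(0.3, 0.3))   # equal frequencies: the term vanishes
print(score_term(0.1, 0.3))   # fy > fx: the term is negative
```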
Similarly, the symmetric form of gsm.k, denoted as Gdis.k, between two sequences X and Y is defined by
A distance metric, D(·,·), should satisfy the following conditions:

1. D(S, Q) ≥ 0, where equality holds iff S = Q (identity).
2. D(S, Q) = D(Q, S) (symmetry).
3. D(S, Q) ≤ D(S, T) + D(T, Q) (triangle inequality).
In the appendix, we prove that the statistical measure, Gdis.k, defined above satisfies the three conditions and is, therefore, a valid distance metric.
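The three conditions can also be tested empirically on a finite set of objects. The sketch below is a generic checker, not the authors' proof: a finite test can only refute the metric property, never establish it. The function name is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import itertools
import math


def violates_metric(items, dist, tol=1e-12):
    """Return the first violated condition on `items`, or None if all hold."""
    for s, q in itertools.product(items, repeat=2):
        d = dist(s, q)
        # Identity: non-negative, and zero exactly when s equals q.
        if d < -tol or (abs(d) <= tol) != (s == q):
            return "identity"
        if abs(d - dist(q, s)) > tol:
            return "symmetry"
    for s, t, q in itertools.product(items, repeat=3):
        if dist(s, q) > dist(s, t) + dist(t, q) + tol:
            return "triangle inequality"
    return None


points = [(0.0,), (1.0,), (2.0,)]
euclid_result = violates_metric(points, math.dist)
squared_result = violates_metric(points, lambda a, b: math.dist(a, b) ** 2)

print(euclid_result)   # Euclidean distance passes on these points
print(squared_result)  # squared distance fails: 4 > 1 + 1
```

Applied to a matrix of pairwise Gdis.k values, the same check gives a quick sanity test of an implementation.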
Evaluation methods
Similarity/dissimilarity measures are compared by considering how well they classify protein sequences, as well as by computing receiver operating characteristic (ROC) curves. ROC analysis goes back to signal detection and classification problems and is now widely used [47]. This approach is employed in binary classification of continuous data, with cases usually categorized as positive (1) or negative (0). The classification accuracy can be measured by plotting, for different threshold values, the number of true positives (TP), expressed as sensitivity or coverage, against the number of false positives (FP), expressed as (1 − specificity), each properly normalized [Eq. 22].
A ROC curve is simply the plot of sensitivity versus (1 − specificity) for different threshold values. The area under a ROC curve (AUC) is widely used to quantify the quality of a classifier because it is a threshold-independent performance measure and is closely related to the Wilcoxon rank-sum (Mann-Whitney) test [48]. For a perfect classifier the AUC is 1, and for a random classifier the AUC is 0.5.
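The AUC can be computed without drawing the curve, using its pairwise (Mann-Whitney) form: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The sketch assumes that larger scores indicate greater similarity; for a distance measure the scores would be negated first.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (pairwise form)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
tied = auc([0.5, 0.5], [1, 0])
print(perfect, partial, tied)
```

The double loop is quadratic in the number of cases; a rank-based computation gives the same value for larger data sets.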
Availability
Software name: SMPSSS
Software home page: http://math.dlut.edu.cn/daiqi/SMPSSS.html
Operating system(s): Windows
Programming languages: Perl
License: web server freely available without registration
Restrictions to use by non-academics: on request
Appendix
The proof of valid distance metric
Lemma 1. For a real convex function f on its domain [a, b], ∀ x_{i} ∈ [a, b], λ_{i} > 0 (i = 1, 2, ⋯, n), {\displaystyle {\sum}_{i=1}^{n}{\lambda}_{i}}=1, Jensen's inequality can be stated as:
Proof: Let {x}_{0}={\displaystyle {\sum}_{i=1}^{n}{\lambda}_{i}{x}_{i}}, so that x_{0} ∈ [a, b]. We expand f(x) around x_{0}, and by Taylor's theorem, we have that
Since f is a real convex function on its domain [a, b], f"(ξ) ≥ 0. Thus we have f(x) ≥ f(x_{0}) + f'(x_{0})(x − x_{0}).
For all x_{ i }∈ [a, b], we can obtain that
Multiplying the above inequalities with λ_{ i }, we have
Summing the above inequalities,
Thus, we obtain that
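For reference, the chain of inequalities in this proof can be written out as follows. This is a sketch assuming the standard form of Jensen's inequality for a twice-differentiable convex f, with x_0 = \sum_i \lambda_i x_i, \sum_i \lambda_i = 1, and \xi a point between x and x_0.

```latex
\begin{align*}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(\xi)(x - x_0)^2
      \;\ge\; f(x_0) + f'(x_0)(x - x_0), \\
f(x_i) &\ge f(x_0) + f'(x_0)(x_i - x_0), \qquad i = 1, 2, \dots, n, \\
\lambda_i f(x_i) &\ge \lambda_i f(x_0) + \lambda_i f'(x_0)(x_i - x_0), \\
\sum_{i=1}^{n} \lambda_i f(x_i)
     &\ge f(x_0) + f'(x_0)\Bigl(\sum_{i=1}^{n} \lambda_i x_i - x_0\Bigr) = f(x_0), \\
f\Bigl(\sum_{i=1}^{n} \lambda_i x_i\Bigr) &\le \sum_{i=1}^{n} \lambda_i f(x_i).
\end{align*}
```

The term in parentheses in the fourth line vanishes by the definition of x_0, which is what turns the summed inequalities into the final statement.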
Proposition 1. ∀ x, y > 0,
Proof: Let f(x) = x ln x, x > 0. We calculate f'(x) and f"(x),
Thus f(x) is a real convex function.
According to Lemma 1, we have
Then
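The steps of this proof can be sketched as below. The derivatives follow directly from f(x) = x ln x; the final inequality assumes that Proposition 1 is the two-point case of Lemma 1 with equal weights \lambda_1 = \lambda_2 = 1/2, which is the form matching the score used by gsm.k.

```latex
\begin{align*}
f'(x) &= \ln x + 1, \qquad f''(x) = \frac{1}{x} > 0 \quad (x > 0), \\
\frac{x + y}{2} \ln\frac{x + y}{2} &\le \frac{1}{2}\bigl(x \ln x + y \ln y\bigr), \\
x \ln\frac{2x}{x + y} + y \ln\frac{2y}{x + y} &\ge 0 .
\end{align*}
```

The last line is obtained by multiplying the second by 2 and moving (x + y) \ln\frac{x + y}{2} to the right-hand side.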
If {F}_{k}^{X}=\left(f({w}_{k,1}^{X}),f({w}_{k,2}^{X}),\cdots ,f({w}_{k,n}^{X})\right) and {F}_{k}^{Y}=\left(f({w}_{k,1}^{Y}),f({w}_{k,2}^{Y}),\cdots ,f({w}_{k,n}^{Y})\right) are the k-word frequency vectors of protein sequences X and Y, respectively, we define the similarity score, denoted by ss.k, as follows:
where
Proposition 2.
Proof: Firstly, we need to show that
Case 1: f({w}_{k,t}^{I}) = f({w}_{k,t}^{J}) = 0, it satisfies the above inequality.
Case 2: Exactly one of f({w}_{k,t}^{I}) and f({w}_{k,t}^{J}) is equal to zero. Without loss of generality, assume f({w}_{k,t}^{I}) = 0 and f({w}_{k,t}^{J}) ≠ 0; we can easily get that
Case 3: f({w}_{k,t}^{I}) ≠ 0 and f({w}_{k,t}^{J}) ≠ 0. Using Proposition 1, we can easily obtain the inequality (24).
To find its maximum, we use the Proposition in the Methods section to get that
Theorem 1. The statistical measure Gdis.k(X,Y) is a distance metric.
Proof: By the definition of ss.k(X, Y) and Proposition 2, Gdis.k satisfies two of the required properties: (1) positivity: Gdis.k(X, Y) ≥ 0 and Gdis.k(X, Y) = 0 ⇔ {F}_{k}^{X} = {F}_{k}^{Y}; (2) symmetry: Gdis.k(X, Y) = Gdis.k(Y, X). We now need to show that Gdis.k satisfies the triangle inequality: Gdis.k(X, Y) ≤ Gdis.k(X, Z) + Gdis.k(Z, Y).
Case 1: {F}_{k}^{X} = {F}_{k}^{Y} = {F}_{k}^{Z}, it satisfies the triangle inequality.
Case 2: Among the three k-word frequency vectors, two are equal. Without loss of generality, assume {F}_{k}^{X} ≠ {F}_{k}^{Y} and {F}_{k}^{X} = {F}_{k}^{Z}; we can easily obtain that Gdis.k(X, Y) ≤ Gdis.k(X, Z) + Gdis.k(Z, Y).
Case 3: {F}_{k}^{X} ≠ {F}_{k}^{Y} ≠ {F}_{k}^{Z}. From the definition of ss.k and Proposition 2, we have
Since Gdis.k(X, Y) = ss.k(X, Y)/n + 2 ≤ 4.
Thus Gdis.k(X, Y) ≤ Gdis.k(X, Z) + Gdis.k(Z, Y).
Abbreviations
AUC: Area Under the Curve
b..: BLOSUM..
CATH: Hierarchical Classification of Protein Domain Structures
CD-HIT: Cluster Database at High Identity with Tolerance
CK: Chew-Kedem Data
cos.k: Cosine of the Angle Based on k-word Frequencies of Protein Sequence
cos.k.matrix: Cosine of the Angle Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
DAUC: Different Area under the Curve
ed.k: Euclidean Distance Based on k-word Frequencies of Protein Sequence
ed.k.matrix: Euclidean Distance Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
FP: False Positives
Gdis.k: Gapped Distance Measure Based on k-word Frequencies
gre.k: Generalized Relative Entropy Based on k-word Frequencies of Protein Sequence
gre.k.matrix: Generalized Relative Entropy Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
gsm.k: Gapped Similarity Measure Based on k-word Frequencies of Protein Sequence
gsm.k.matrix: Gapped Similarity Measure Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
kld: Kullback-Leibler Discrepancy
MAUC: Maximal Area under the Curve
MEGA: Molecular Evolutionary Genetics Analysis
NW: Needleman-Wunsch Measure
NWlinear.matrix: Needleman-Wunsch Measure Using Score Matrix and Linear Gap Penalty
NWaffine.matrix: Needleman-Wunsch Measure Using Score Matrix and Affine Gap Penalty
p..: PAM..
pfam: Protein Family
PIR: Protein Information Resource
ROC: Receiver Operating Curve
RS: Rost and Sander Data
s1.k: S1 Measure Based on k-word Frequencies of Protein Sequence
s1.k.matrix: S1 Measure Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
s2.k: S2 Measure Based on k-word Frequencies of Protein Sequence
s2.k.matrix: S2 Measure Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
SCOP: Structural Classification of Proteins
se.k: Standardized Euclidean Distance Based on k-word Frequencies of Protein Sequence
se.k.matrix: Standardized Euclidean Distance Based on k-word Frequencies of Protein 'Sequence Space' Constructed According to Score Matrix
SM: Smith-Waterman Measure
SMlinear.matrix: Smith-Waterman Measure Using Score Matrix and Linear Gap Penalty
SMaffine.matrix: Smith-Waterman Measure Using Score Matrix and Affine Gap Penalty
SMC: Structural Maintenance of Chromosomes
SP: Sierk-Pearson Data
SPs: 'Sequence Space' of Sequence s
SS.k: Similarity Score Based on k-word Frequencies
Swiss-Prot: Swiss-Prot Database
TP: True Positives
W.k.matrix: W-metric Based on k-word Frequencies and Score Matrix
References
1. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths JS, Howe KL, Marshall M, Sonnhammer ELL: The Pfam Protein Families Database. Nucleic Acids Res 2002, 30: 276–280.
2. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, 32: D226–D229.
3. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28: 45–48.
4. Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Ledley RS, Lewis KG, Mewes HW, Orcutt BC, Suzek BE, Tsugita A, Vinayaka CR, Yeh LSL, Zhang J, Barker WC: The Protein Information Resource, an integrated public resource of functional annotation of proteins. Nucleic Acids Res 2002, 30: 35–37.
5. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
6. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402.
7. Pham TD: Spectral distortion measures for biological sequence comparisons and database searching. Pattern Recog 2007, 40: 516–529.
8. Felsenstein J: Evolutionary trees from DNA sequences, a maximum likelihood approach. J Mol Evol 1981, 17: 368–376.
9. Felsenstein J: Inferring phylogenies from protein sequences by parsimony, distance and likelihood methods. Meth Enzymol 1996, 266: 418–427.
10. Huelsenbeck JP, Ronquist F: MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 2001, 17: 754–755.
11. Kumar S, Tamura K, Nei M: MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform 2004, 5(2):150–163.
12. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19: 1572–1574.
13. Komatsu K, Zhu S, Fushimi H, Qui TK, Cai S, Kadota S: Phylogenetic analysis based on 18S rRNA gene and matK gene sequences of Panax vietnamensis and five related species. Planta Med 2001, 67: 461–465.
14. Vinga S, Gouveia-Oliveira R, Almeida JS: Comparative evaluation of word composition distances for the recognition of SCOP relationships. Bioinformatics 2004, 20(2):206–215.
15. Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G: Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinformatics 2007, 8: 252–272.
16. Vinga S, Almeida J: Alignment-free sequence comparison – a review. Bioinformatics 2003, 19: 513–523.
17. Pham TD, Zuegg J: A probabilistic measure for alignment-free sequence comparison. Bioinformatics 2004, 20: 3455–3461.
18. Blaisdell BE: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 1986, 83: 5155–5159.
19. Wu TJ, Burke JP, Davison DB: A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics 1997, 53: 1431–1439.
20. Wu TJ, Hsieh YC, Li LA: Statistical measures of DNA dissimilarity under Markov chain models of base composition. Biometrics 2001, 57: 441–448.
21. Stuart GW, Moffett K, Baker S: Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics 2002, 18: 100–108.
22. Fichant G, Gautier C: Statistical method for predicting protein coding regions in nucleic acid sequences. Comput Appl Biosci 1987, 3: 287–295.
23. Wu KP, Lin HN, Sung TY, Hsu WL: A New Similarity Measure among Protein Sequences. Proceedings of IEEE CSB2003 Computer Society Bioinformatics Conference 2003, 347–352.
24. Didier G, Laprevotte I, Pupin M, Hénaut A: Local decoding of sequences and alignment-free comparison. J Comput Biol 2006, 13: 1465–1476.
25. Kelil A, Wang S, Brzezinski R, Fleury A: CLUSS: Clustering of Protein Sequences Based on a New Similarity Measure. BMC Bioinformatics 2007, 8: 286–305.
26. Hochreiter S, Heusel M, Obermayer K: Fast model-based protein homology detection without alignment. Bioinformatics 2007, 23: 1728–1736.
27. Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family. Algorithmica 2003, 38: 115–129.
28. Sierk M, Pearson W: Sensitivity and Selectivity in Protein Structure Comparison. Protein Sci 2004, 13(3):773–785.
29. Thiruv B, Quon G, Saldanha SA, Steipe B: Nh3D: A Reference Dataset of Non-Homologous Protein Structures. BMC Struct Biol 2005, 5: 12.
30. Word JM, Lovell SC, LaBean TH, Taylor HC, Zalis ME, Presley BK, Richardson JS, Richardson DC: Visualizing and Quantifying Molecular Goodness-of-Fit: Small-Probe Contact Dots with Explicit Hydrogen Atoms. J Mol Biol 1999, 285(4):1711–1733.
31. Krasnogor N, Pelta DA: Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric. Bioinformatics 2004, 20(7):1015–1021.
32. Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232: 584–599.
33. Barthel D, Hirst JD, Blażewicz J, Burke EK, Krasnogor N: ProCKSI: A Decision Support System for Protein (Structure) Comparison, Knowledge, Similarity and Information. BMC Bioinformatics 2007, 8: 416.
34. SCOP: Structural Classification of Proteins [http://scop.mrc-lmb.cam.ac.uk/scop]
35. Pearl F, et al.: The CATH Domain Structure Database and Related Resources Gene3D and DHS Provide Comprehensive Domain Family Information for Genome Analysis. Nucleic Acids Res 2005, 33(D):D247–D251.
36. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659.
37. Felsenstein J: PHYLIP - Phylogeny inference package (version 3.2). Cladistics 1989, 5: 164–166.
38. Saitoh N, Goldberg I, Earnshaw WC: The SMC proteins and the coming of age of the chromosome scaffold hypothesis. BioEssays 1995, 17: 759–766.
39. Lowe J, Cordell SC, Ent F: Crystal structure of the SMC head domain: an ABC ATPase with 900 residues antiparallel coiled-coil inserted. J Mol Biol 2001, 306: 25–35.
40. Hirano M, Hirano T: Hinge-mediated dimerization of SMC protein is essential for its dynamic interaction with DNA. EMBO J 2002, 21: 5733–5744.
41. Cobbe N, Heck MM: SMCs in the world of chromosome biology from prokaryotes to higher eukaryotes. J Struct Biol 2000, 129: 123–143.
42. Soppa J: Prokaryotic structural maintenance of chromosomes (SMC) proteins: distribution, phylogeny, and comparison with MukBs and additional prokaryotic and eukaryotic coiled-coil proteins. Gene 2001, 278: 253–264.
43. Taylor EM, Moghraby JS, Lees JH, Smit B, Moens PB, Lehmann AR: Characterization of a novel human SMC heterodimer homologous to the Schizosaccharomyces pombe Rad18/Spr18 complex. Mol Biol Cell 2001, 12: 1583–1594.
44. Fujioka Y, Kimata Y, Nomaguchi K, Watanabe K, Kohno K: Identification of a novel non-SMC component of the SMC5/SMC6 complex involved in DNA repair. J Biol Chem 2002, 277: 21585–21591.
45. Reinert G, Schbath S, Waterman MS: Probabilistic and statistical properties of words: an overview. J Comput Biol 2000, 7: 1–46.
46. Kroupa T: Measure of divergence of possibility measures. Proceedings of the 6th Workshop on Uncertainty Processing (WUPES'2003), Hejnice, Czech Republic 173–181.
47. Egan JP: Signal Detection Theory and ROC Analysis. Academic Press, New York; 1975.
48. Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog 1997, 30: 1145–1159.
Acknowledgements
The authors thank all the anonymous referees for their valuable suggestions and support. In particular, the authors thank Prof. Susana Vinga for providing the MATLAB code for the W-metric.
Authors' contributions
QD conceived the method and prepared the manuscript. QD implemented the software and performed the ROC analysis. QD and TMW contributed to the discussion and have approved the final manuscript.
Electronic supplementary material
12859_2008_2379_MOESM4_ESM.pdf
Additional file 4: The protein data used in phylogenetic analysis. The protein sequences used in phylogenetic analysis with abbreviated names, full names and Accession numbers. (PDF 192 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Dai, Q., Wang, T. Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'. BMC Bioinformatics 9, 394 (2008). https://doi.org/10.1186/1471-2105-9-394
DOI: https://doi.org/10.1186/1471-2105-9-394
Keywords
 Statistical Measure
 Receiver Operator Characteristic Curve
 Sequence Space
 Score Matrix
 Classification Ability