Volume 9 Supplement 6

## Symposium of Computations in Bioinformatics and Bioscience (SCBB07)

# An improved string composition method for sequence comparison

- Guoqing Lu
^{1}Email author, - Shunpu Zhang
^{2}and - Xiang Fang
^{2}

**9(Suppl 6)**:S15

https://doi.org/10.1186/1471-2105-9-S6-S15

© Lu et al; licensee BioMed Central Ltd. 2008

**Published: **28 May 2008

## Abstract

### Background

Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison–one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences.

### Results

We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods.

### Conclusion

We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.

## Background

The increasing proliferation of biological sequence data has created tremendous opportunities for biologists and medical researchers to address both fundamental issues (e.g., molecular evolution) and practical problems (e.g., drug design). On the other hand, it poses many computational challenges for theoretical scientists to create efficient and reliable methods or algorithms for sequence analyses and knowledge mining. Sequence comparison, an essential operation for gene finding and protein function annotation, is one such challenge. The methods for sequence comparison are classified into two categories, alignment-based and alignment-free. The alignment-based sequence analysis methods have both fundamental and computational limitations [1–4]. For example, these methods cannot deal with changes like chromosome reversal or gene translocation. They also encounter difficulties in aligning dissimilar sequences. Another drawback with sequence alignment is its computational complexity, where no optimal solution can be achieved when a large number of sequences are compared. Consequently, considerable efforts have been made to seek for alternative, i.e., alignment-free, methods for sequence comparison.

The alignment-free methods seen in the past few decades can be divided into three categories: gene contents [5–7], data compression [8–11], and string (or word) composition [12–18]. Of these methods, the string-composition-based methods, especially the composition vector (CV) method [12] and the complete composition vector (CCV) method [16], have received substantial attention. The CV method uses strings of a fixed length whereas the CCV method uses strings of multiple lengths. The CCV method was found to provide finer evolutionary information than the CV method; however, it has disadvantages regarding computing time and memory usage. Both of the above mentioned methods apply a Markov model assumption to estimate the random background of observed frequencies, which has been found to be problematic, as detailed in Section 2. In this paper, we will provide an improved CCV (ICCV) method and demonstrate that this new method is more robust and efficient in performing sequence comparison compared with the existing CCV method. The issue of how to build a more informative CCV, i.e., how to select the maximum vector string length for better evolutionary information representation, will be addressed as well.

The contents of this paper are arranged as follows. In the Methods section, we point out the two aforementioned problems in the existing CV or CCV methods and describe our new ICCV method. In the Results section, we compare the CCV and ICCV methods through simulations and experimental data analysis. In the Discussion section, we discuss the potential impact of the simple assumption of a uniform and independent model and issues related to selecting the maximum string length for CCV construction.

## Methods

### Existing CV and CCV methods

Define *S* as a DNA sequence consisting of *N* nucleotides. Let *f*(*α*_{1}...*α*_{
k
}) be the observed frequency of the *k*-mer string *α*_{1}...*α*_{
k
}, where *α*_{
i
}is one of the four nucleotides A, C, T, or G and *k* is the string length (1 ≤ *k* <*N*). We define ${S}_{k}=({f}_{1},{f}_{2},\mathrm{...},{f}_{{4}^{k}})$ as a vector of observed frequencies for a given *k*, where 4^{
k
}is the number of *k*-mer strings, and let *γ*_{
K
}= (*S*_{1}, *S*_{2}, ..., *S*_{
K
}) as a combined vector for some constant *K* (*K* < *N*), where *K* is the maximum string length considered. From the perspective of molecular evolution, *S*_{
k
}or *γ*_{
K
}reflects both random mutation and selection, and the random background needs to be normalized in order to represent genetic information contributed by natural selection. After the normalization of observed frequencies, *S*_{
k
}is converted into a composition vector (CV), and *γ*_{
K
}is transformed into a complete composition vector (CCV).

*k*-mer strings in

*S*was originally proposed by Brendel et al. [20] and has been used with minor modifications for phylogenetic studies of prokaryotes and viruses [12, 16]. We have found two problems associated with string frequency normalization in existing methods. To explicate these problems, we reiterate the normalization equation of the observed frequency of

*α*

_{1}...

*α*

_{ k }, i.e.,

*f*(

*α*

_{1}...

*α*

_{ k }), described in [12] as below:

where ${f}^{0}({\alpha}_{1}\cdots {\alpha}_{k})=\frac{f({\alpha}_{1}\cdots {\alpha}_{k-1})f({\alpha}_{2}\cdots {\alpha}_{k})}{f({\alpha}_{2}\cdots {\alpha}_{k-1})}.\frac{(N-k+1)(N-k+3)}{{(N-k+2)}^{2}}$ for *k* ≥ 3.

*f*(

*α*

_{1}...

*α*

_{ k }) and the estimated expected frequency

*f*

^{0}(

*α*

_{1}...

*α*

_{ k }). We computed both quantities for

*k*= 3, 4, 5 using a randomly chosen virus sequence. The correlation coefficients between

*f*

^{0}(

*α*

_{1}...

*α*

_{ k }) and

*f*(

*α*

_{1}...3;

*α*

_{ k }) are 0.92, 0.92 and 0.86, respectively, for

*k*= 3, 4, 5 with

*p*< 0.0001 (Fig. 1).

Define *E*_{0}[*f*(*α*_{1}...*α*_{
k
})] as the true expected frequency of *k*-mer string *α*_{1}...*α*_{
k
}in *S*. Since there exists a highly positive correlation between *f*^{0}(*α*_{1}...*α*_{
k
}) and *f*(*α*_{1}...*α*_{
k
}), the difference between them tends to be smaller than the difference between *f*(*α*_{1}...*α*_{
k
}) and *E*_{0}[*f*(*α*_{1}...*α*_{
k
})], indicating the information contributed by selective evolution is underestimated.

Another problem associated with Eq. [1] is the denominator. As originally proposed in [20], a square root needs to be applied to the denominator. Without such an operation, the normalized frequency tends to be over-standardized.

### Improved CCV (ICCV) method

*k*-mer string and the frequency variance in a given sequence

*S*based upon this simple assumption. Define

*x*

_{ i }as follows:

*i*= 1, 2, 3, ...,

*N*-

*k*+ 1 and

*N*-

*k*+ 1 is the maximum frequency one can observe for string

*α*

_{1}...

*α*

_{ k }in DNA sequence

*S*of length

*N*. Therefore, it can be shown that $f({\alpha}_{1}...{\alpha}_{k})={\displaystyle \sum _{i=1}^{N-k+1}{x}_{i}}$. The expectation and variance of

*f*(

*α*

_{1}⋯

*α*

_{ k }) are given as

where ${J}_{t}=\{\begin{array}{l}1,\phantom{\rule{0.5em}{0ex}}\text{if(}{a}_{1}\mathrm{...}{a}_{k-t}\text{)=(}{a}_{t+1}\mathrm{...}{a}_{k}\hfill \\ 0,\phantom{\rule{0.5em}{0ex}}\text{otherwise}\hfill \end{array}$, for *t* = 1, 2, 3,..., *k* - 1. For a full derivation of the above equation, readers may refer to [21, 22].

*k*-mer string is given as:

for *k* ≥ 1.

We construct an improved CCV (ICCV) with the normalized frequencies of all *k*-mer strings computed using Eq. [2]. Since *E*[*f*(*α*_{1}...*α*_{
k
})] is a theoretical value based on *N* and *k*, it is independent of *f*(*α*_{1}...*α*_{
k
}) for a fixed *k*. Therefore, the ICCV method we proposed does not experience the underestimation problem of the existing CCV methods. Another advantage of ICCV over CCV is that ICCV is constructed for any *k* but CCV is constructed for *k* > 3. The latter neglects the evolutionary information contained in *1*-mer and *2*-mer strings.

### Distance measurement

*α*= (

*a*

_{1},

*a*

_{2}, ...,

*a*

_{ T }) and β = (

*b*

_{1},

*b*

_{2}, ...,

*b*

_{ T }) be the CCV or the ICCV of two DNA sequences

*A*and

*B*, respectively. To calculate

*D*(

*A*,

*B*), the distance between

*A*and

*B*, we adopt a distance measurement in this paper as detailed below:

where $C(\alpha ,\beta )=\frac{{\displaystyle \sum _{i=1}^{T}{a}_{i}\times {b}_{i}}}{({\displaystyle \sum _{i=1}^{T}{a}_{i}^{2}\times {\displaystyle \sum _{i=1}^{T}{b}_{i}^{2}}{)}^{1/2}}}$. *C*(*α*, *β*) is the cosine of the angle between *α* and *β*.

### Data sets

Besides the simulated data sets, we used a real dataset to compare the ICCV and the CCV methods. Fifty-four influenza A viral HA sequences were used. Each has approximately 1,659 base pairs. Based upon alignment-based phylogenetic analyses, each sequence was assigned a clade number by the International H5N1 Evolution Working Group (RO Donis, personal communication) [23].

### Data analysis and visualization

Statistical package R version 2.5.1 was used for programming and implementation of the CCV and ICCV methods. The trees were generated using the Neighbor-joining program in the PHYLIP 3.6.4 package. The resulting phylogenetic trees were displayed with MEGA 4.

## Results

### Analysis of simulation data sets

*K*= 6 show the same topology of six sequences, as shown in Fig. 3. However, the ICCV tree provides much higher bootstrapping values in support of two major clades. This indicates the ICCV method is more robust in resolving phylogenetic relationships of remotely related clades than the existing CCV method.

### Application on influenza A virus lineage Analysis

*K*= 7 agree with each other in the clade designation (denoted as 0, 1, ..., 9), but at the sub-clade level it appears that the designation based on the ICCV method is more convincing. For example, it is logical to assign the viral strain dk/Guangxi/13/4 to Sub-clade 2.4 as shown in the ICCV tree. However, this is not the case when examining the CCV tree. In addition, the positions of Clade 3, Clade 7 and Sub-clade 2.3.3 on the ICCV tree are not the same as on the CCV tree. When comparing trees generated from different methods, both the ICCV tree and the tree constructed by the H5N1 Working Group have exactly the same topology, which suggests that the ICCV method is more dependable than the existing CCV method.

## Discussion

### Does the uniform and independent assumption matter?

As we can envision, the only potential weakness associated with the ICCV method is the assumption of a uniform and independent model. It has been shown that the null hypothesis of equiprobable occurrence of different nucleotides is reasonable in the context of the DNA structures that have evolved from a "primordial soup' or 'base pool' containing equal quantities of each base [21]. Sege and Saxberg (1982) [24] have discussed this issue thoroughly. The hypothesis of independent occurrence of different nucleotides has also been accepted in numerous situations, particularly in the analysis of relatively short strings [21]. Arritia et al. [25] showed that the approximation of actual dependence in a DNA sequence to the theory of independence of bases is quite good.

We used our influenza H5N1 virus sequence database to examine the assumptions of uniformity and independence. Chi-square tests reject that the four nucleotides A, C, T, and G occur in equal probabilites (*p* < 0.0001) or occur independently of one another (*p* < 0.0001). Although the assumption does not generally hold, both results from the analyses of simulated data and experimental data showed that our improved method is more robust than the existing CCV method, indicating that the violation of the assumption on base composition has no significant impact on the accuracy of the ICCV method.

### Is increasing the maximum string length necessary?

*K*= 3, 4, 5, 7, 8, 9, 10 (Fig. 5). For the purpose of comparison, we also show the ICCV tree for

*K*= 6 from Fig. 3. In Fig. 5, it is clearly shown that as

*K*increases from 3 to 5, the supporting values significantly improve. However, this trend declines as

*K*increases from 6 to 10. Obviously, in this case,

*K*= 5 or 6 is a cutoff point, which means increasing

*K*after a certain number may not necessarily improve the result. Therefore, it might not be the case that increasing the maximum string length would result in a vector containing finer evolutionary information.

The reason for this is that the overlapping nature of strings with multiple lengths causes the overlap of evolutionary information carried by each individual CV. As multiple CVs are combined into a complete CV, the complete CV collects the exclusive evolutionary information that each CV contains, but at the same time the overlapping information that individual CVs contain is also summed up. Therefore, increasing the string length *K* to a certain point will certainly improve the result, but the trend of improvement reaches its peak and afterwards declines. The question is how to choose an optimal string length for construction of the CCV, which will be discussed next.

### How to choose an optimal string length for the CCV

*W*of length

*M*, which provides an empirical nucleotide distribution for the class of sequences in the dataset. Then

*S*

_{ k }for

*W*is computed. Since

*S*

_{ k }is the vector of observed frequencies of all the

*k*-mer strings in

*W*, $\frac{{S}_{k}}{M-k+1}=({q}_{k(1)},{q}_{(k2)},\mathrm{...},{q}_{k({4}^{k})})$ is the observed probability for all the

*k*-mer strings. As for a random sequence, the probability for all the

*k*-mer strings is $p=({p}_{1},{p}_{2},\mathrm{...},{p}_{{4}^{k}})$, where ${p}_{i}=\frac{1}{{4}^{k}}$ for i = 1, 2,...., 4

^{ k }. Therefore, we can determine the difference between these two probability distributions by their Kullback-Leibler distance:

*D*_{
k
}(*W*) would be small if the two distributions are close to each other, which indicates that *S*_{
k
}does not contain rich evolutionary information and should be excluded from calculating the ICCVs.

*A*and calculated

*D*

_{ k }(

*A*) for

*k*= 1, 2, ..., 30. Similarly, we calculated

*D*

_{ k }(

*B*) for

*k*= 1, 2, ..., 30, where

*B*is a randomly generated sequence with the same length as sequence

*A*. Then we computed $Q(k)=\frac{{D}_{k}(A)}{{D}_{k}(B)}$ for

*k*= 1, 2, ...., 30 (Fig. 6). In Fig. 6, we can see that the magnitude of

*Q*(

*k*) is fairly large when

*k*is small. As

*k*increases,

*Q*(

*k*) starts to decrease, and then it reaches a steady state at

*Q*(

*k*) = 1 when

*k*is larger than 7. The reason for this is that the effect of selective evolution is more significant on shorter strings than it is on longer strings. Therefore,

*D*

_{ k }(

*A*) is much larger than

*D*

_{ k }(

*B*) when

*k*is small. However, as

*k*increases, the effect of selective evolution on

*k*-mer strings starts to decline. Thus, the behavior of

*k*-mer strings in sequence

*A*becomes more similar to that of a random sequence and

*D*

_{ k }(

*A*) becomes closer to

*D*

_{ k }(

*B*), which indicates that less evolutionary information is carried by

*S*

_{ k }. For the experimental dataset, since

*Q*(

*k*) is fairly close to 1 when

*k*is larger than 7, an appropriate choice for the maximum (optimal) string length

*K*would be 7.

## Conclusion

In this paper, we show that the existing CV and CCV methods underestimate the evolutionary information contained in a DNA sequence due to the Markov model assumption and the denominator used for the normalization of observed string frequencies. Experiments using simulated and experimental data sets demonstrated that our ICCV method generates more accurate and robust results compared with the currently used CCV method. The consistency between the ICCV tree and the alignment-based tree recommended by the International H5N1 Evolution Working Group indicates that the ICCV method is a valuable alternative to the alignment-based methods. It is also shown that the violation of the assumption about base composition has no significant impact on the accuracy of the ICCV method. As to the issue related to maximum string length, we believe that it is not necessary to use relatively long strings to construct the CCV due to the overlapping nature of strings with variable length. We suggest a practical approach for choosing the optimal string length for the CCV.

## Declarations

### Acknowledgements

This publication was made possible by NSF Grant Number NSF DEB-0732969 and NIH Grant Number NIH 1R01LM009219-01A1. XF is grateful to the Dean's Office in the College of Arts and Science at the University of Nebraska at Omaha for providing a Graduate Assistantship. GL acknowledges the UCR Award from the University of Nebraska at Omaha. We are grateful to the WHO/OIE/FAO H5N1 Evolution Working Group for sharing the sequence data and the clade designation information. We also want to thank Mary Christman for proofreading the final manuscript.

This article has been published as part of *BMC Bioinformatics* Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.

## Authors’ Affiliations

## References

- Wiens JJ, Servedio MR:
**Phylogenetic analysis and intraspecific variation: performance of parsimony, likelihood, and distance methods.***Syst Biol*1998,**47:**228–53. 10.1080/106351598260897View ArticlePubMedGoogle Scholar - Attwood TK:
**Genomics: the Babel of bioinformatics.***Science*2000,**290:**471–473. 10.1126/science.290.5491.471View ArticlePubMedGoogle Scholar - Pearson WR: Protein sequence comparison and protein evolution. Tutorial-ISMB2000 Tutorial-ISMB2000Google Scholar
- Vinga S, Almeida J:
**Alignment free sequence comparison-a review.***Bioinformatics*2003,**19:**513–523. 10.1093/bioinformatics/btg005View ArticlePubMedGoogle Scholar - Herniou E, Luque T, Chen X, Vlak J, Winstanley D, Cory J, O'Reilly D:
**Use of whole genome sequence data to infer baculovirus phylogeny.***J Virol*2001,**75:**8117–8126. 10.1128/JVI.75.17.8117-8126.2001PubMed CentralView ArticlePubMedGoogle Scholar - House C, Fitz-Gibbon S:
**Using homolog groups to create a whole-genomic tree of free-living organisms: An update.***J Mol Evol*2002,**54:**539–547. 10.1007/s00239-001-0054-5View ArticlePubMedGoogle Scholar - Snel B, Bork P, Huynen MA:
**Genomes in flux: the evolution of archaeal and proteobacterial gene content.***Genome Research*2002,**12:**17–25. 10.1101/gr.176501View ArticlePubMedGoogle Scholar - Otu HH, Sayood K:
**A new sequence distance measure for phylogenetic tree construction.***Bioinformatics*2003,**19:**2122–30. 10.1093/bioinformatics/btg295View ArticlePubMedGoogle Scholar - Benedetto D, Caglioti E, Loreto V:
**Language trees and zipping.***Physical Review Letters*2002,**88:**048702. 10.1103/PhysRevLett.88.048702View ArticlePubMedGoogle Scholar - Chen X, Kwong S, Li M:
**A compression algorithm for DNA sequences and its applications in genome comparison.**In*Proceedings of the Sixth Annual International Computing and Combinatorics Conference (RECOMB)*. ACM Press; 2000:107–117.Google Scholar - Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H:
**An information-based sequence distance and its application to whole mitochondrial genome phylogeny.***Bioinformatics*2001,**17:**149–154. 10.1093/bioinformatics/17.2.149View ArticlePubMedGoogle Scholar - Hao B, Qi J:
**Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance.***J Bioinform Comput Biol*2004,**2:**1–19. 10.1142/S0219720004000442View ArticlePubMedGoogle Scholar - Yu ZG, Zhou LQ, Anh VV, Chu KH, Long SC, Deng JQ:
**Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment.***J Mol Evol*2005,**60:**538–545. 10.1007/s00239-004-0255-9View ArticlePubMedGoogle Scholar - Wan XF, Wu X, Lin G, Holton SB, Desmone RA, Shyu CR, Guan Y, Emch ME:
**Computational identification of reassortments in avian influenza viruses.***Avian Dis*2007,**51:**434–439. 10.1637/7625-042706R1.1View ArticlePubMedGoogle Scholar - Qi J, Wang B, Hao BI:
**Whole proteome prokaryote phylogeny without sequence alignment: A**K**-string composition approach.***J Mol Evol*2004,**58:**1–11. 10.1007/s00239-003-2493-7View ArticlePubMedGoogle Scholar - Wu X, Wan X, Wu G, Xu D, Lin G:
**Phylogenetic analysis using complete signature information of whole genomes and clustered Neighbour-Joining method.***Int J Bioinform Res Appl*2006,**2:**219–248.View ArticlePubMedGoogle Scholar - Stuart G, Moffet K, Baker S:
**Integrated gene and species phylogenies from unaligned whole genome sequence.***Bioinformatics*2002,**18:**100–108. 10.1093/bioinformatics/18.1.100View ArticlePubMedGoogle Scholar - Stuart G, Moffet K, Leader J:
**A comprehensive vertebrate phylogeny using vector representation of protein sequences from whole genomes.***Mol Biol Evol*2002,**19:**554–562.View ArticlePubMedGoogle Scholar - Wu X, Cai Z, Wan X, Hoang T, Goebel R, Lin G:
**Nucleotide composition string selection in HIV-1 subtyping using whole genomes.***Bioinformatics*2007,**23:**1744–1752. 10.1093/bioinformatics/btm248View ArticlePubMedGoogle Scholar - Brendel V, Beckmann JS, Trifonov EN:
**Linguistics of nucleotide sequences: Morphology and comparison of vocabularies.***J Biomol Struct Dyn*1986,**4:**11–21.View ArticlePubMedGoogle Scholar - Gentleman JF, Mullin RC:
**The distribution of the frequency of occurrence of nucleotide subsequence, based on their overlap capability.***Biometrics*1989,**45:**35–52. 10.2307/2532033View ArticlePubMedGoogle Scholar - Wu TJ, Burke JP, Davison DB:
**A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words.***Biometrics*1997,**53:**1431–1439. 10.2307/2533509View ArticlePubMedGoogle Scholar - Zhang S, Fang X, Davis T, Ruben D, Lu G:
**Multidimensional scaling and model-based clustering analyses for the clade assignments of the HPAI H5N1 viruses.**In*Options for the Control of Influenza VI*. London. Blackwell; 2007:in press.Google Scholar - Sege RD, Saxberg BEH:
**A statistical test for comparing several nucleotide sequences.***Nucleic Acids Research*1982,**10:**375–389. 10.1093/nar/10.1.375PubMed CentralView ArticlePubMedGoogle Scholar - Arritia R, Gordon L, Waterman WS:
**The Erdös-Rényi law in distribution, for coin tossing and sequence matching.***Annals of Statistics*1990,**18:**539–570. 10.1214/aos/1176347615View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.