An improved string composition method for sequence comparison
© Lu et al; licensee BioMed Central Ltd. 2008
Published: 28 May 2008
Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although the dominant approach among biologists, has both fundamental and computational limitations. Consequently, alignment-free methods have been explored as important alternatives for estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer from certain statistical problems and thereby underestimate the amount of evolutionary information in genetic sequences.
We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods.
We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.
The increasing proliferation of biological sequence data has created tremendous opportunities for biologists and medical researchers to address both fundamental issues (e.g., molecular evolution) and practical problems (e.g., drug design). On the other hand, it poses many computational challenges for theoretical scientists to create efficient and reliable methods or algorithms for sequence analyses and knowledge mining. Sequence comparison, an essential operation for gene finding and protein function annotation, is one such challenge. The methods for sequence comparison are classified into two categories, alignment-based and alignment-free. The alignment-based sequence analysis methods have both fundamental and computational limitations [1–4]. For example, these methods cannot deal with changes like chromosome reversal or gene translocation. They also encounter difficulties in aligning dissimilar sequences. Another drawback of sequence alignment is its computational complexity: no optimal solution can be achieved when a large number of sequences are compared. Consequently, considerable effort has been made to seek alternative, i.e., alignment-free, methods for sequence comparison.
The alignment-free methods seen in the past few decades can be divided into three categories: gene contents [5–7], data compression [8–11], and string (or word) composition [12–18]. Of these methods, the string-composition-based methods, especially the composition vector (CV) method and the complete composition vector (CCV) method, have received substantial attention. The CV method uses strings of a fixed length, whereas the CCV method uses strings of multiple lengths. The CCV method was found to provide finer evolutionary information than the CV method; however, it has disadvantages regarding computing time and memory usage. Both of the above-mentioned methods apply a Markov model assumption to estimate the random background of observed frequencies, which has been found to be problematic, as detailed in the Methods section. In this paper, we provide an improved CCV (ICCV) method and demonstrate that this new method is more robust and efficient in performing sequence comparison than the existing CCV method. The issue of how to build a more informative CCV, i.e., how to select the maximum vector string length for better evolutionary information representation, is addressed as well.
The contents of this paper are arranged as follows. In the Methods section, we point out the two aforementioned problems in the existing CV or CCV methods and describe our new ICCV method. In the Results section, we compare the CCV and ICCV methods through simulations and experimental data analysis. In the Discussion section, we discuss the potential impact of the simple assumption of a uniform and independent model and issues related to selecting the maximum string length for CCV construction.
Existing CV and CCV methods
Define S as a DNA sequence consisting of N nucleotides. Let f(α1...α k ) be the observed frequency of the k-mer string α1...α k , where α i is one of the four nucleotides A, C, T, or G and k is the string length (1 ≤ k < N). We define S k as the vector of observed frequencies of all k-mer strings for a given k, where 4^k is the number of k-mer strings, and let γ K = (S1, S2, ..., S K ) be the combined vector for some constant K (K < N), where K is the maximum string length considered. From the perspective of molecular evolution, S k or γ K reflects both random mutation and selection, and the random background needs to be normalized out in order to represent the genetic information contributed by natural selection. After the normalization of observed frequencies, S k is converted into a composition vector (CV), and γ K is transformed into a complete composition vector (CCV).
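The construction of S k and γ K described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, raw overlapping counts stand in for the observed frequencies (the text does not say whether f is a count or a proportion), and the lexicographic ordering of strings is an arbitrary choice.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # ordering is an arbitrary choice for this sketch


def kmer_counts(seq, k):
    """Count overlapping k-mers; windows containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def frequency_vector(seq, k):
    """Return S_k: the counts of all 4**k k-mers in a fixed lexicographic order."""
    counts = kmer_counts(seq, k)
    return [counts["".join(w)] for w in product(ALPHABET, repeat=k)]


def combined_vector(seq, K):
    """Return gamma_K = (S_1, ..., S_K), flattened into a single list."""
    vec = []
    for k in range(1, K + 1):
        vec.extend(frequency_vector(seq, k))
    return vec
```

For a sequence of length N, `combined_vector(seq, K)` has 4 + 4^2 + ... + 4^K entries, which is why memory usage grows quickly with K.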
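In the existing methods, the observed frequency f(α1...α k ) is normalized against an expected frequency f0(α1...α k ) that is estimated under the Markov model assumption from the observed frequencies of shorter strings, for k ≥ 3.
Define E0[f(α1...α k )] as the true expected frequency of the k-mer string α1...α k in S. Because f0(α1...α k ) is itself computed from observed frequencies, there is a strong positive correlation between f0(α1...α k ) and f(α1...α k ). The difference between them therefore tends to be smaller than the difference between f(α1...α k ) and E0[f(α1...α k )], indicating that the information contributed by selective evolution is underestimated.
Another problem with the normalization equation of the existing methods is its denominator. As originally proposed, a square root needs to be applied to the denominator. Without such an operation, the normalized frequency tends to be over-standardized.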
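To make the two problems concrete, the sketch below implements a Markov-model normalization of the kind used in CV-style methods. It rests on assumptions that are not taken from this text: the expected frequency is estimated as f0 = f(α1...α k-1 ) f(α2...α k ) / f(α2...α k-1 ), raw counts are used in place of probabilities (ignoring the small length correction), and the function names and the `sqrt_denominator` switch are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov_expected(seq, word):
    """Assumed Markov estimate f0 of the count of `word` (requires length >= 3)."""
    k = len(word)
    if k < 3:
        raise ValueError("the Markov estimate needs k >= 3")
    left = kmer_counts(seq, k - 1)[word[:-1]]     # f(a1...a(k-1))
    right = kmer_counts(seq, k - 1)[word[1:]]     # f(a2...ak)
    middle = kmer_counts(seq, k - 2)[word[1:-1]]  # f(a2...a(k-1))
    return left * right / middle if middle else 0.0


def normalized(seq, word, sqrt_denominator=False):
    """(f - f0) / f0, or (f - f0) / sqrt(f0) when sqrt_denominator is True."""
    f = kmer_counts(seq, len(word))[word]
    f0 = markov_expected(seq, word)
    if f0 == 0:
        return 0.0  # convention chosen for this sketch only
    denom = math.sqrt(f0) if sqrt_denominator else f0
    return (f - f0) / denom
```

Because `markov_expected` is built from counts taken from the same sequence, it moves together with the observed count, which is the correlation problem described above; the `sqrt_denominator` flag shows the alternative denominator.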
Improved CCV (ICCV) method
for k ≥ 1.
We construct an improved CCV (ICCV) from the normalized frequencies of all k-mer strings, computed under the uniform and independent model. Since E[f(α1...α k )] is a theoretical value based on N and k, it is independent of f(α1...α k ) for a fixed k. Therefore, the proposed ICCV method does not experience the underestimation problem of the existing CCV methods. Another advantage of ICCV over CCV is that ICCV is constructed for any k ≥ 1, whereas CCV is constructed only for k ≥ 3. The latter neglects the evolutionary information contained in 1-mer and 2-mer strings.
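A rough sketch of this idea follows. Under a uniform and independent model, each of the N − k + 1 windows matches a given k-mer with probability 1/4^k, so the expected count is (N − k + 1)/4^k, which depends only on N and k. The square-root denominator used below is a stand-in assumption and is not claimed to be the exact ICCV normalization; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def expected_count(N, k):
    """Expected count of any fixed k-mer under the uniform, independent model."""
    return (N - k + 1) / 4 ** k


def iccv(seq, K):
    """Normalized counts for all k-mers, k = 1..K, concatenated into one vector."""
    N = len(seq)
    vec = []
    for k in range(1, K + 1):
        counts = Counter(seq[i:i + k] for i in range(N - k + 1))
        e = expected_count(N, k)
        for w in product("ACGT", repeat=k):
            # Assumed normalization: (observed - expected) / sqrt(expected).
            vec.append((counts["".join(w)] - e) / math.sqrt(e))
    return vec
```

Unlike a Markov estimate, `expected_count` never looks at the sequence content, so the background term cannot track the observed counts.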
Here C(α, β) is the cosine of the angle between the vectors α and β.
Besides the simulated data sets, we used a real data set to compare the ICCV and the CCV methods. Fifty-four influenza A viral HA sequences were used. Each has approximately 1,659 base pairs. Based upon alignment-based phylogenetic analyses, each sequence was assigned a clade number by the International H5N1 Evolution Working Group (RO Donis, personal communication).
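The cosine between two composition vectors can be computed as in the sketch below. The conversion to a distance, (1 − C)/2, is an assumption borrowed from common practice in CV-style methods and may differ from the distance actually used here; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed distance (1 - C) / 2, which lies between 0 and 1."""
    return (1 - cosine(u, v)) / 2
```

A matrix of such pairwise distances is the kind of input a Neighbor-joining program expects.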
Data analysis and visualization
The statistical package R (version 2.5.1) was used for programming and implementation of the CCV and ICCV methods. The trees were generated using the Neighbor-joining program in the PHYLIP 3.6.4 package. The resulting phylogenetic trees were displayed with MEGA 4.
Analysis of simulation data sets
Application to influenza A virus lineage analysis
Does the uniform and independent assumption matter?
As far as we can envision, the only potential weakness of the ICCV method is the assumption of a uniform and independent model. It has been shown that the null hypothesis of equiprobable occurrence of different nucleotides is reasonable in the context of DNA structures that have evolved from a 'primordial soup' or 'base pool' containing equal quantities of each base. Sege and Saxberg (1982) have discussed this issue thoroughly. The hypothesis of independent occurrence of different nucleotides has also been accepted in numerous situations, particularly in the analysis of relatively short strings. Arratia et al. showed that the approximation of actual dependence in a DNA sequence by the theory of independence of bases is quite good.
We used our influenza H5N1 virus sequence database to examine the assumptions of uniformity and independence. Chi-square tests reject the hypotheses that the four nucleotides A, C, T, and G occur with equal probabilities (p < 0.0001) and that they occur independently of one another (p < 0.0001). Although the assumption does not generally hold, the analyses of both simulated and experimental data showed that our improved method is more robust than the existing CCV method, indicating that the violation of the assumption on base composition has no significant impact on the accuracy of the ICCV method.
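As an illustration of the uniformity check only (the independence test is not covered), a chi-square goodness-of-fit statistic for equal base frequencies can be computed as sketched below. The function name is hypothetical, and the constant is the standard 5% critical value for 3 degrees of freedom; an exact p-value would require a statistics library.

```python
from collections import Counter

CHI2_CRITICAL_DF3_ALPHA_05 = 7.815  # standard table value, 3 degrees of freedom


def uniformity_chi_square(seq):
    """Chi-square statistic for H0: A, C, G and T are equally likely."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no A, C, G or T")
    expected = n / 4
    return sum((counts[b] - expected) ** 2 / expected for b in "ACGT")


def rejects_uniformity(seq):
    """True when the statistic exceeds the 5% critical value."""
    return uniformity_chi_square(seq) > CHI2_CRITICAL_DF3_ALPHA_05
```

With long sequences even small departures from equal base frequencies produce a large statistic, so a formal rejection does not by itself say how much the violation affects the resulting trees.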
Is increasing the maximum string length necessary?
The reason for this is that the overlapping nature of strings with multiple lengths causes the overlap of evolutionary information carried by each individual CV. As multiple CVs are combined into a complete CV, the complete CV collects the exclusive evolutionary information that each CV contains, but at the same time the overlapping information that individual CVs contain is also summed up. Therefore, increasing the string length K to a certain point will certainly improve the result, but the trend of improvement reaches its peak and afterwards declines. The question is how to choose an optimal string length for construction of the CCV, which will be discussed next.
How to choose an optimal string length for the CCV
D k (W) would be small if the two distributions are close to each other, which indicates that S k does not contain rich evolutionary information and should be excluded from calculating the ICCVs.
In this paper, we show that the existing CV and CCV methods underestimate the evolutionary information contained in a DNA sequence due to the Markov model assumption and the denominator used for the normalization of observed string frequencies. Experiments using simulated and experimental data sets demonstrated that our ICCV method generates more accurate and robust results compared with the currently used CCV method. The consistency between the ICCV tree and the alignment-based tree recommended by the International H5N1 Evolution Working Group indicates that the ICCV method is a valuable alternative to the alignment-based methods. It is also shown that the violation of the assumption about base composition has no significant impact on the accuracy of the ICCV method. As to the issue related to maximum string length, we believe that it is not necessary to use relatively long strings to construct the CCV due to the overlapping nature of strings with variable length. We suggest a practical approach for choosing the optimal string length for the CCV.
This publication was made possible by NSF Grant Number NSF DEB-0732969 and NIH Grant Number NIH 1R01LM009219-01A1. XF is grateful to the Dean's Office in the College of Arts and Science at the University of Nebraska at Omaha for providing a Graduate Assistantship. GL acknowledges the UCR Award from the University of Nebraska at Omaha. We are grateful to the WHO/OIE/FAO H5N1 Evolution Working Group for sharing the sequence data and the clade designation information. We also want to thank Mary Christman for proofreading the final manuscript.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.
- Wiens JJ, Servedio MR: Phylogenetic analysis and intraspecific variation: performance of parsimony, likelihood, and distance methods. Syst Biol 1998, 47: 228–53. 10.1080/106351598260897
- Attwood TK: Genomics: the Babel of bioinformatics. Science 2000, 290: 471–473. 10.1126/science.290.5491.471
- Pearson WR: Protein sequence comparison and protein evolution. Tutorial-ISMB2000
- Vinga S, Almeida J: Alignment free sequence comparison-a review. Bioinformatics 2003, 19: 513–523. 10.1093/bioinformatics/btg005
- Herniou E, Luque T, Chen X, Vlak J, Winstanley D, Cory J, O'Reilly D: Use of whole genome sequence data to infer baculovirus phylogeny. J Virol 2001, 75: 8117–8126. 10.1128/JVI.75.17.8117-8126.2001
- House C, Fitz-Gibbon S: Using homolog groups to create a whole-genomic tree of free-living organisms: An update. J Mol Evol 2002, 54: 539–547. 10.1007/s00239-001-0054-5
- Snel B, Bork P, Huynen MA: Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Research 2002, 12: 17–25. 10.1101/gr.176501
- Otu HH, Sayood K: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003, 19: 2122–30. 10.1093/bioinformatics/btg295
- Benedetto D, Caglioti E, Loreto V: Language trees and zipping. Physical Review Letters 2002, 88: 048702. 10.1103/PhysRevLett.88.048702
- Chen X, Kwong S, Li M: A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Sixth Annual International Computing and Combinatorics Conference (RECOMB). ACM Press; 2000:107–117.
- Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17: 149–154. 10.1093/bioinformatics/17.2.149
- Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. J Bioinform Comput Biol 2004, 2: 1–19. 10.1142/S0219720004000442
- Yu ZG, Zhou LQ, Anh VV, Chu KH, Long SC, Deng JQ: Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment. J Mol Evol 2005, 60: 538–545. 10.1007/s00239-004-0255-9
- Wan XF, Wu X, Lin G, Holton SB, Desmone RA, Shyu CR, Guan Y, Emch ME: Computational identification of reassortments in avian influenza viruses. Avian Dis 2007, 51: 434–439. 10.1637/7625-042706R1.1
- Qi J, Wang B, Hao BI: Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach. J Mol Evol 2004, 58: 1–11. 10.1007/s00239-003-2493-7
- Wu X, Wan X, Wu G, Xu D, Lin G: Phylogenetic analysis using complete signature information of whole genomes and clustered Neighbour-Joining method. Int J Bioinform Res Appl 2006, 2: 219–248.
- Stuart G, Moffet K, Baker S: Integrated gene and species phylogenies from unaligned whole genome sequence. Bioinformatics 2002, 18: 100–108. 10.1093/bioinformatics/18.1.100
- Stuart G, Moffet K, Leader J: A comprehensive vertebrate phylogeny using vector representation of protein sequences from whole genomes. Mol Biol Evol 2002, 19: 554–562.
- Wu X, Cai Z, Wan X, Hoang T, Goebel R, Lin G: Nucleotide composition string selection in HIV-1 subtyping using whole genomes. Bioinformatics 2007, 23: 1744–1752. 10.1093/bioinformatics/btm248
- Brendel V, Beckmann JS, Trifonov EN: Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. J Biomol Struct Dyn 1986, 4: 11–21.
- Gentleman JF, Mullin RC: The distribution of the frequency of occurrence of nucleotide subsequence, based on their overlap capability. Biometrics 1989, 45: 35–52. 10.2307/2532033
- Wu TJ, Burke JP, Davison DB: A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics 1997, 53: 1431–1439. 10.2307/2533509
- Zhang S, Fang X, Davis T, Ruben D, Lu G: Multidimensional scaling and model-based clustering analyses for the clade assignments of the HPAI H5N1 viruses. In Options for the Control of Influenza VI. London. Blackwell; 2007:in press.
- Sege RD, Saxberg BEH: A statistical test for comparing several nucleotide sequences. Nucleic Acids Research 1982, 10: 375–389. 10.1093/nar/10.1.375
- Arratia R, Gordon L, Waterman MS: The Erdös-Rényi law in distribution, for coin tossing and sequence matching. Annals of Statistics 1990, 18: 539–570. 10.1214/aos/1176347615
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.