An improved string composition method for sequence comparison

Background Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison–one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences. Results We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods. Conclusion We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.


Background
The increasing proliferation of biological sequence data has created tremendous opportunities for biologists and medical researchers to address both fundamental issues (e.g., molecular evolution) and practical problems (e.g., drug design). On the other hand, it poses many computational challenges for theoretical scientists to create efficient and reliable methods or algorithms for sequence analyses and knowledge mining. Sequence comparison, an essential operation for gene finding and protein function annotation, is one such challenge. The methods for sequence comparison are classified into two categories, alignment-based and alignment-free. The alignmentbased sequence analysis methods have both fundamental and computational limitations [1][2][3][4]. For example, these methods cannot deal with changes like chromosome reversal or gene translocation. They also encounter difficulties in aligning dissimilar sequences. Another drawback with sequence alignment is its computational complexity, where no optimal solution can be achieved when a large number of sequences are compared. Consequently, considerable efforts have been made to seek for alternative, i.e., alignment-free, methods for sequence comparison.
The alignment-free methods seen in the past few decades can be divided into three categories: gene contents [5][6][7], data compression [8][9][10][11], and string (or word) composition [12][13][14][15][16][17][18]. Of these methods, the string-compositionbased methods, especially the composition vector (CV) method [12] and the complete composition vector (CCV) method [16], have received substantial attention. The CV method uses strings of a fixed length whereas the CCV method uses strings of multiple lengths. The CCV method was found to provide finer evolutionary information than the CV method; however, it has disadvantages regarding computing time and memory usage. Both of the above mentioned methods apply a Markov model assumption to estimate the random background of observed frequencies, which has been found to be problematic, as detailed in Section 2. In this paper, we will provide an improved CCV (ICCV) method and demonstrate that this new method is more robust and efficient in performing sequence comparison compared with the existing CCV method. The issue of how to build a more informative CCV, i.e., how to select the maximum vector string length for better evolutionary information representation, will be addressed as well.
The contents of this paper are arranged as follows. In the Methods section, we point out the two aforementioned problems in the existing CV or CCV methods and describe our new ICCV method. In the Results section, we compare the CCV and ICCV methods through simulations and experimental data analysis. In the Discussion section, we discuss the potential impact of the simple assumption of a uniform and independent model and issues related to selecting the maximum string length for CCV construction.

Existing CV and CCV methods
Define S as a DNA sequence consisting of N nucleotides. Let f(α 1 ...α k ) be the observed frequency of the k-mer string α 1 ...α k , where α i is one of the four nucleotides A, C, T, or G and k is the string length (1 ≤ k <N). We define as a vector of observed frequencies for a given k, where 4 k is the number of k-mer strings, and let γ K = (S 1 , S 2 , ..., S K ) as a combined vector for some constant where K is the maximum string length considered. From the perspective of molecular evolution, S k or γ K reflects both random mutation and selection, and the random background needs to be normalized in order to represent genetic information contributed by natural selection. After the normalization of observed frequencies, S k is converted into a composition vector (CV), and γ K is transformed into a complete composition vector (CCV).
The method to normalize the observed frequencies of different k-mer strings in S was originally proposed by Brendel et al. [20] and has been used with minor modifications for phylogenetic studies of prokaryotes and viruses [12,16]. We have found two problems associated with string frequency normalization in existing methods.
a a a a a a a Define E 0 [f(α 1 ...α k )] as the true expected frequency of kmer string α 1 ...α k in S. Since there exists a highly positive correlation between f 0 (α 1 ...α k ) and f(α 1 ...α k ), the difference between them tends to be smaller than the difference between f(α 1 ...α k ) and E 0 [f(α 1 ...α k )], indicating the information contributed by selective evolution is underestimated.
Another problem associated with Eq. [1] is the denominator. As originally proposed in [20], a square root needs to be applied to the denominator. Without such an operation, the normalized frequency tends to be over-standardized.

Improved CCV (ICCV) method
We assume that the four bases A, C, T, and G occur randomly with equal chance and derive the expected frequency of a k-mer string and the frequency variance in a given sequence S based upon this simple assumption. Define x i as follows: Correlation between the observed frequencies and the estimated expected frequencies of strings with length k = 3, 4, 5, respectively, in a randomly selected sequence from our database Figure 1 Correlation between the observed frequencies and the estimated expected frequencies of strings with length k = 3, 4, 5, respectively, in a randomly selected sequence from our database. Reference lines in the plots designate y = x. where , for t = 1, 2, 3,..., k -1. For a full derivation of the above equation, readers may refer to [21,22].
With both expectation and variance derived, the normalization function for the observed frequency of a k-mer string is given as: We construct an improved CCV (ICCV) with the normalized frequencies of all k-mer strings computed using Eq. [2]. Since E[f(α 1 ...α k )] is a theoretical value based on N and k, it is independent of f(α 1 ...α k ) for a fixed k. Therefore, the ICCV method we proposed does not experience the underestimation problem of the existing CCV methods. Another advantage of ICCV over CCV is that ICCV is constructed for any k but CCV is constructed for k > 3. The latter neglects the evolutionary information contained in 1-mer and 2-mer strings.

Distance measurement
Let α = (a 1 , a 2 , ..., a T ) and β = (b 1 , b 2 , ..., b T ) be the CCV or the ICCV of two DNA sequences A and B, respectively. To calculate D (A, B), the distance between A and B, we adopt a distance measurement in this paper as detailed below: where . C(α, β) is the cosine of the angle between α and β.

Data sets
To generate simulation data sets to compare the performance of the ICCV and the CCV methods, we adopted a similar approach as in [8]. In brief, an ancestor sequence was randomly picked from our influenza virus database; and the progeny sequences were derived through simulation using different types of mutations (insertion, deletion, substitution, inversion, transposition or translocation) and following a pre-defined tree topology (Fig. 2). Six types of mutations at the rate of 9-15% were applied to generate A1 and A2 from A, and B1 and B2 from B. Three types of mutations (insertion, deletion, substitution) at the rate of 2-5% were used to generate A0 from A and B0 from B. A total of 1000 data sets were generated for phylogenetic analysis.
Besides the simulated data sets, we used a real dataset to compare the ICCV and the CCV methods. Fifty-four influenza A viral HA sequences were used. Each has approximately 1,659 base pairs. Based upon alignment-based phylogenetic analyses, each sequence was assigned a clade number by the International H5N1 Evolution Working Group (RO Donis, personal communication) [23].

Data analysis and visualization
Statistical package R version 2.5.1 was used for programming and implementation of the CCV and ICCV methods. The trees were generated using the Neighbor-joining The predefined tree topology used to generate nucleotide sequences for the simulation study Figure 2 The predefined tree topology used to generate nucleotide sequences for the simulation study.

Ancestor Sequence
program in the PHYLIP 3.6.4 package. The resulting phylogenetic trees were displayed with MEGA 4.

Analysis of simulation data sets
Both CCV and ICCV trees for K = 6 show the same topology of six sequences, as shown in Fig. 3. However, the ICCV tree provides much higher bootstrapping values in support of two major clades. This indicates the ICCV method is more robust in resolving phylogenetic relationships of remotely related clades than the existing CCV method.

Application on influenza A virus lineage Analysis
As shown in Fig. 4, ICCV and CCV trees for K = 7 agree with each other in the clade designation (denoted as 0, 1, ..., 9), but at the sub-clade level it appears that the designation based on the ICCV method is more convincing. For example, it is logical to assign the viral strain dk/Guangxi/ 13/4 to Sub-clade 2.4 as shown in the ICCV tree. However, this is not the case when examining the CCV tree. In addition, the positions of Clade 3, Clade 7 and Sub-clade 2.3.3 on the ICCV tree are not the same as on the CCV tree. When comparing trees generated from different methods, both the ICCV tree and the tree constructed by the H5N1 Working Group have exactly the same topology, which suggests that the ICCV method is more dependable than the existing CCV method.

Does the uniform and independent assumption matter?
As we can envision, the only potential weakness associated with the ICCV method is the assumption of a uniform and independent model. It has been shown that the null hypothesis of equiprobable occurrence of different nucleotides is reasonable in the context of the DNA structures that have evolved from a "primordial soup' or 'base pool' containing equal quantities of each base [21]. Sege and Saxberg (1982) [24] have discussed this issue thoroughly. The hypothesis of independent occurrence of different nucleotides has also been accepted in numerous situations, particularly in the analysis of relatively short strings [21]. Arritia et al. [25] showed that the approximation of actual dependence in a DNA sequence to the theory of independence of bases is quite good.
We used our influenza H5N1 virus sequence database to examine the assumptions of uniformity and independence. Chi-square tests reject that the four nucleotides A, C, T, and G occur in equal probabilites (p < 0.0001) or occur independently of one another (p < 0.0001). Although the assumption does not generally hold, both results from the analyses of simulated data and experimental data showed that our improved method is more robust than the existing CCV method, indicating that the violation of the assumption on base composition has no significant impact on the accuracy of the ICCV method. Wu et al. (2007) [19] suggested that increasing the maximum string length results in a vector containing finer evolutionary information. To investigate this issue, we used the same simulated sequences data as in section 3.1, and constructed the ICCV trees for K = 3,4,5,7,8,9,10 (Fig. 5). For the purpose of comparison, we also show the ICCV tree for K = 6 from Fig. 3. In Fig. 5, it is clearly shown that as K increases from 3 to 5, the supporting values significantly improve. However, this trend declines as K increases from 6 to 10. Obviously, in this case, K = 5 or 6 is a cutoff point, which means increasing K after a certain number may not necessarily improve the result. Therefore, it might not be the case that increasing the maximum string length would result in a vector containing finer evolutionary information.

Is increasing the maximum string length necessary?
The reason for this is that the overlapping nature of strings with multiple lengths causes the overlap of evolutionary information carried by each individual CV. As multiple CVs are combined into a complete CV, the complete CV collects the exclusive evolutionary information that each CV contains, but at the same time the overlapping information that individual CVs contain is also summed up. Therefore, increasing the string length K to a certain point will certainly improve the result, but the trend of improvement reaches its peak and afterwards declines. The question is how to choose an optimal string length for construction of the CCV, which will be discussed next.

How to choose an optimal string length for the CCV
Firstly, all the DNA sequences in the dataset are concatenated into a single sequence W of length M, which provides an empirical nucleotide distribution for the class of sequences in the dataset. Then S k for W is computed. Since S k is the vector of observed frequencies of all the k-mer Consensus trees of simulated sequences constructed based on the CCV and ICCV methods for K = 6 Figure 3 Consensus trees of simulated sequences constructed based on the CCV and ICCV methods for K = 6. Phylogenetic trees obtained from the experimental sequence set using the CCV and ICCV methods for K = 7 Figure 4 Phylogenetic trees obtained from the experimental sequence set using the CCV and ICCV methods for K = 7.  6). In Fig. 6, we can see that the magnitude of Q(k) is fairly large when k is small. As k increases, Q(k) starts to decrease, and then it reaches a steady state at Q(k) = 1 when k is larger than 7.
The reason for this is that the effect of selective evolution is more significant on shorter strings than it is on longer strings. Therefore, D k (A) is much larger than D k (B) when k is small. However, as k increases, the effect of selective evolution on k-mer strings starts to decline. Thus, the behavior of k-mer strings in sequence A becomes more similar to that of a random sequence and D k (A) becomes closer to D k (B), which indicates that less evolutionary information is carried by S k . For the experimental dataset, since Q(k) is fairly close to 1 when k is larger than 7, an appropriate choice for the maximum (optimal) string length K would be 7. Consensus trees obtained from simulated sequences using the ICCV method for K = 3, 4,..., 10, respectively Figure 5 Consensus trees obtained from simulated sequences using the ICCV method for K = 3, 4,..., 10, respectively.

Conclusion
In this paper, we show that the existing CV and CCV methods underestimate the evolutionary information contained in a DNA sequence due to the Markov model assumption and the denominator used for the normalization of observed string frequencies. Experiments using simulated and experimental data sets demonstrated that our ICCV method generates more accurate and robust results compared with the currently used CCV method. The consistency between the ICCV tree and the alignmentbased tree recommended by the International H5N1 Evolution Working Group indicates that the ICCV method is a valuable alternative to the alignment-based methods. It is also shown that the violation of the assumption about base composition has no significant impact on the accuracy of the ICCV method. As to the issue related to maximum string length, we believe that it is not necessary to use relatively long strings to construct the CCV due to the overlapping nature of strings with variable length. We suggest a practical approach for choosing the optimal string length for the CCV.