Universal seeds for cDNA-to-genome comparison
- Leming Zhou^{1},
- Jonathan Stanton^{1} and
- Liliana Florea^{1, 2}Email author
DOI: 10.1186/1471-2105-9-36
© Zhou et al; licensee BioMed Central Ltd. 2008
Received: 20 June 2007
Accepted: 23 January 2008
Published: 23 January 2008
Abstract
Background
To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increases, it becomes desirable to identify a small set of universal seeds that perform optimally or near-optimally on a large range of comparisons.
Results
Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that with high confidence good seeds for more distant comparisons perform well on closer comparisons, within 98–99% of the optimal seeds, and thus represent universal good seeds.
Conclusion
We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.
Background
The next few years are expected to bring a significant increase in the number of available genomes, driven by advances in sequencing technologies [1]. As genome sequencing projects outpace the generation of native mRNA and protein sequences, gene annotation projects for these genomes will need to rely instead on cDNA information from other species. While existing alignment programs align cDNA and the corresponding genomic sequences accurately, they are inadequate for cross-species comparisons [2]. Beginning with blast [3, 4], most alignment programs have used a seed-and-extend technique to produce local alignments, starting from exact or near-exact word matches (seeds) between the two sequences and extending them to a local alignment in several stages. Blast uses an exact match of 11 contiguous positions, represented by a vector of 1s (11_{ c }= 11111111111). Such a seed is called continuous. More recently, spaced seeds have been introduced, which allow wildcard positions in the seed pattern, marked with 0s. For instance, Kent and Zahler [5] used a seed that allowed for mismatches at the wobble codon position (WABA_{12} = 11011011011011011) to increase the sensitivity of cDNA-genomic sequence alignment. For the same weight, or number of 1 positions, spaced seeds were shown to perform better than continuous seeds in most cases [6].
Recently, Ma et al. [7] introduced a framework for predicting the sensitivity of a spaced seed given a model of alignment, and showed how to determine the theoretical seed sensitivity with an efficient dynamic programming method [8]. Given a Bernoulli model of alignments with the parameter p (i.e., the probability that any one alignment position is a match) estimated from the average sequence identity, they determined optimal seeds for human-mouse genomic sequence alignment. These seeds were later implemented in the programs PatternHunter and blastz [9]. Subsequent studies used increasingly complex alignment models, such as the order 1 inhomogeneous 3-periodic Markov models in [6, 10] or the higher order (3–5) inhomogeneous Markov models of [11], often in the specialized context of coding sequences. Extensions of the seed models were proposed to further improve sensitivity or extend the range of applicability to non-nucleotide sequences, for instance multiple spaced seeds [12], or vector seeds [13]. In the latter, each position in the seed pattern is a real value representing the weight of a match or substitution at that position in the total match score, and a seed match is declared when the total score exceeds an a priori fixed threshold. However, these methods increase the memory and running time of searches, both of which are critical for practical high-throughput applications.
While early alignment and seed models used the traditional {0, 1} alphabet, to further improve the seed sensitivity transition-only wildcards were introduced to differentiate between transitions and transversions, first implemented in blastz [9]. Later, Noé and Kucherov [14] formalized the concept and showed that spaced seeds with transitions extend the sensitivity range of {0, 1} seeds, and Zhou and Florea [11] additionally provided a framework for specificity calculation, and showed that they offer better sensitivity-specificity tradeoffs than {0, 1} seeds in practice.
When comparing coding sequences, such as in cDNA-to-genome alignments, the codon organization, higher-order position dependencies [15] and specific transition-transversion biases [16] inherent in the gene sequences are likely to be reflected in the alignment patterns. In [6, 10], a three-state Markov model of order 1 is used to simulate the three codon positions, while Brejová et al. [17] introduce two models, the first a three-state Markov model of order 0, and the second a more complex Bernoulli formulation in which each codon is modeled independently. Recently, we proposed to use higher order (3–5) inhomogeneous Markov models with transitions [11] to capture both transition-transversion biases and sequence compositional patterns.
As the number of sequenced organisms increases, the range of possible pairwise species comparisons will grow quadratically with the number of species. For practical reasons it becomes desirable to identify a small number of seeds that would perform well for a wide range of comparisons, at varying evolutionary distances. Earlier, Choi et al. [18–20] determined and analyzed best seeds for genomic sequence comparisons for several sequence identity levels under a Bernoulli model of alignment, in a greatly simplified model of evolutionary distance. At low weights (10–12), seeds performed optimally or near-optimally across a wide range of sequence similarity levels. At higher weights, however, there was significant fluctuation in seed sensitivity, which led them to the hypothesis that different seeds will be needed for each type of comparison. This simple p-level representation of evolutionary sequence divergence is likely inadequate for coding sequences, which are under a more diverse set of evolutionary pressures. We approach the question of designing universal good seeds for cDNA-to-genome comparisons, i.e. seeds that perform well for a large number of comparisons, starting from a complex representation of alignment as a high order 3-periodic inhomogeneous Markov model incorporating transitions [11]. Using comparisons between human and four others species (mouse, dog, chicken and zebrafish) to sample a wide range of evolutionary distances, we analyze the distributions of seed sensitivities statistically to characterize and identify universal good seeds. In the remainder of this section, we provide a brief introduction to the problem of designing optimal spaced seeds.
Spaced Seeds
{0,1} spaced seeds
Alignment programs typically use short exact or approximate matches of an a priori specified pattern (seed) to detect local alignments between two sequences. A continuous seed, such as the one used in blast [3, 4], requires an exact match of a fixed length k between the sequences, and is represented as a vector of 1s (11_{ c }= 11111111111). In contrast, spaced seeds [7] allow for approximate matches by including wildcard positions in the pattern, marked with 0s (e.g., WABA_{12} = 11011011011011011). The number of 1 positions in the pattern is called the weight of the seed, and the length of the pattern is called the span. Conventionally, seeds must start with a 1 position. In keeping with previous studies, we will use a fixed seed span k = 22 [6, 10, 11].
An alignment is represented as a string of 0s (mismatches) and 1s (matches) generated from a model $\mathcal{M}$, for instance a Bernoulli or a Markov model. A seed S = s_{1} ... s_{ k }, s_{1} = 1, is said to detect the alignment w = w_{1} ... w_{ L }∈ {0, 1}^{ L }if there is an approximate match for the pattern in the alignment string such that all 1 positions in the seed pattern map to 1 positions in the alignment, i.e. there exists i = 1..L - k + 1 such that w_{i+l-1}= 1 for all l with s_{ l }= 1. If such an i exists, the seed is said to occur in the alignment w at position i. The theoretical sensitivity of a seed is then defined as the probability that it will detect a random alignment of length L generated from the model $\mathcal{M}$, or equivalently:Sn(S) = P({w ∈ {0, 1}^{ L }|S detects w}) [6, 8]. Traditionally, the alignment length L is 64, determined as the average length of a gap-free alignment in human-mouse comparisons. By definition, an optimal seed is a seed with the highest sensitivity. For a given seed, its sensitivity can be computed exactly using dynamic programming [8, 10, 21]. Optimal seeds can then be produced by exhaustively searching the seed space [8], while close approximations can be obtained with fast heuristics, such as hill-climbing [6, 10] or exploiting the seed structure to reduce the search space [20].
{0, 1, x} spaced seeds for cDNA-to-genome alignment
Unlike genomic sequences, gene sequences exhibit higher order dependencies between positions [15], more pronounced transition-transversion biases [16], and distinct manifestations at the three codon positions, which are likely to translate into alignment patterns. To account for these characteristics, in previous work [11] we proposed a framework that differentiated between transitions and transversions, by using an additional alphabet symbol x, as well as among the three codon positions, using a third order inhomogeneous 3-periodic Markov model of cDNA-to-genome alignments. In the seed pattern, x marks a position that allows transitions but not transversions, while in the alignment model it simply represents a transition. The weight of the new symbol is 0.5. We denote (n_{1}, n_{0}, n_{ x }) the class of seed patterns with n_{1}, n_{0} and n_{ x }symbols of 1, 0 and x, respectively, n_{1} + n_{0} + n_{ x }= k. For a given weight W, there may be multiple (n_{1}, n_{0}, n_{ x }) combinations with n_{1} + 0.5·n_{ x }= W . For instance, for the weight W = 10 and span k = 22, the following combinations are possible: (10, 12, 0), (9, 11, 2), ..., (0, 2, 20).
In the following sections, we investigate the behavior of seeds, and in particular best seeds, among four comparisons or cDNA-to-genome alignment models (human-mouse, human-dog, human-chicken and human-zebrafish) to obtain a robust characterization of universal good seeds.
Results
The repertoire of species to be sequenced is expected to increase dramatically over the next few years, driven by new, more effective and increasingly reliable sequencing technologies. As the number and phylogenetic diversity of genomes increases, designing optimized seeds for each pair of compared species quickly becomes impractical. It becomes desirable to identify a limited number of seeds that perform well, if not optimally, for a large number of comparisons. We call these universal good seeds. We examine seeds for four comparisons between species with significantly different evolutionary distances and mutation patterns (human-mouse, human-dog, human-chicken and human-zebrafish). For simplicity, we will refer to each comparison by the name of the second organism (DOG, MUS, CHK, ZFS). By analyzing the distribution of seed sensitivities between models statistically, we identify strategies for designing universal good seeds.
Universal good seeds
We address two questions: First, are there groups of comparisons, or equivalently alignment models, that produce similar behavior of all seeds? We call such models seed-equivalent, and one optimal seed would then satisfy all comparisons in a seed-equivalence group. Second, are there universal good seeds, i.e. seeds that are optimal or near-optimal for a large range of comparisons and evolutionary distances?
Determining seed-equivalent models
Comparison of seed sensitivity distributions between models. Seed population sizes for (12,10,0), (14,8,0), (16,6,0) are 352716, 203490 and 54264, respectively. Explanation of columns: y_{min, max}– minimum and maximum sensitivity in the compared model; x_{ max }– maximum sensitivity in the reference model; t_{α/2}σ_{ y }– half the length of the 95% prediction interval (t_{α/2}= 1.96 when α = 0.05); T(x, y) = ((a + bx_{max} + $c{x}_{\mathrm{max}}^{2}$ + $d{x}_{\mathrm{max}}^{3}$) - t_{α/2}σ_{ y })/y_{max}, where a, b, c and d are the coefficients of the regression curve, and σ_{ y }is the estimated regression standard error of prediction for a given x value. Because the number of values is large, σ_{ y }≃ σ for all y. Outliers are determined as those points that satisfy either of the following two criteria: y - $\widehat{y}$ > 2.5σ (U) or $\widehat{y}$ - y > 2.5σ (L). Here, $\widehat{y}$ = a + bx + cx^{2} + dx^{3}.
x _{ max } | y _{ max } | y _{ min } | t _{α/2} σ _{ y } | T(x, y) | U | L | |
---|---|---|---|---|---|---|---|
W = 12 | |||||||
CHK-DOG | 0.742 | 0.992 | 0.936 | 0.002 | 0.996 | 0.04% | 1.75% |
DOG-CHK | 0.992 | 0.742 | 0.429 | 0.028 | 0.947 | 0.99% | 0.09% |
CHK-MUS | 0.742 | 0.964 | 0.827 | 0.005 | 0.991 | 0.01% | 1.74% |
MUS-CHK | 0.964 | 0.742 | 0.429 | 0.017 | 0.978 | 1.70% | 0.00% |
CHK-ZFS | 0.742 | 0.476 | 0.196 | 0.006 | 0.995 | 1.52% | 0.10% |
ZFS-CHK | 0.476 | 0.742 | 0.429 | 0.006 | 0.984 | 0.04% | 1.36% |
DOG-MUS | 0.992 | 0.964 | 0.827 | 0.004 | 0.996 | 1.63% | 0.03% |
MUS-DOG | 0.964 | 0.992 | 0.936 | 0.001 | 0.999 | 0.02% | 1.95% |
DOG-ZFS | 0.992 | 0.476 | 0.196 | 0.033 | 0.866 | 1.59% | 0.01% |
ZFS-DOG | 0.476 | 0.992 | 0.936 | 0.003 | 0.995 | 0.03% | 1.71% |
MUS-ZFS | 0.964 | 0.476 | 0.196 | 0.023 | 0.934 | 1.98% | 0.00% |
ZFS-MUS | 0.476 | 0.964 | 0.827 | 0.006 | 0.988 | 0.01% | 1.70% |
W = 14 | |||||||
CHK-DOG | 0.567 | 0.971 | 0.900 | 0.005 | 0.990 | 0.00% | 1.71% |
DOG-CHK | 0.971 | 0.567 | 0.335 | 0.027 | 0.913 | 3.26% | 0.00% |
CHK-MUS | 0.567 | 0.901 | 0.758 | 0.009 | 0.985 | 0.00% | 1.66% |
MUS-CHK | 0.901 | 0.567 | 0.335 | 0.018 | 0.953 | 1.89% | 0.00% |
CHK-ZFS | 0.567 | 0.299 | 0.132 | 0.005 | 0.979 | 1.61% | 0.01% |
ZFS-CHK | 0.299 | 0.567 | 0.335 | 0.007 | 0.983 | 0.00% | 1.48% |
DOG-MUS | 0.971 | 0.901 | 0.758 | 0.005 | 0.996 | 1.43% | 0.01% |
MUS-DOG | 0.901 | 0.971 | 0.900 | 0.002 | 0.997 | 0.01% | 1.69% |
DOG-ZFS | 0.971 | 0.299 | 0.132 | 0.024 | 0.836 | 1.65% | 0.00% |
ZFS-DOG | 0.299 | 0.971 | 0.900 | 0.006 | 0.988 | 0.00% | 1.66% |
MUS-ZFS | 0.901 | 0.299 | 0.132 | 0.018 | 0.882 | 2.09% | 0.00% |
ZFS-MUS | 0.299 | 0.901 | 0.758 | 0.012 | 0.978 | 0.00% | 1.70% |
W = 16 | |||||||
CHK-DOG | 0.405 | 0.930 | 0.821 | 0.010 | 0.982 | 0.00% | 1.98% |
DOG-CHK | 0.930 | 0.405 | 0.248 | 0.021 | 0.907 | 1.65% | 0.00% |
CHK-MUS | 0.405 | 0.808 | 0.633 | 0.013 | 0.976 | 0.00% | 1.98% |
MUS-CHK | 0.808 | 0.405 | 0.248 | 0.015 | 0.944 | 1.86% | 0.00% |
CHK-ZFS | 0.405 | 0.176 | 0.086 | 0.004 | 0.959 | 1.99% | 0.00% |
ZFS-CHK | 0.176 | 0.405 | 0.248 | 0.006 | 0.986 | 0.00% | 1.90% |
DOG-MUS | 0.930 | 0.808 | 0.633 | 0.006 | 0.991 | 1.62% | 0.00% |
MUS-DOG | 0.808 | 0.930 | 0.821 | 0.003 | 0.996 | 0.04% | 1.49% |
DOG-ZFS | 0.930 | 0.176 | 0.086 | 0.015 | 0.804 | 1.68% | 0.00% |
ZFS-DOG | 0.176 | 0.930 | 0.821 | 0.012 | 0.976 | 0.00% | 1.97% |
MUS-ZFS | 0.808 | 0.176 | 0.086 | 0.012 | 0.858 | 1.96% | 0.00% |
ZFS-MUS | 0.176 | 0.808 | 0.633 | 0.018 | 0.964 | 0.00% | 1.97% |
Determining universal seeds
With the same argument as above, when the margin of error t_{α/2}σ_{ y }is small, with high confidence high-scoring seeds in the reference model are expected to lie at the top of the sensitivity range in the compared model. Hence, good seeds will largely be shared between the models. To measure how closely the predicted value of the optimal seed in the reference model (x_{ max }) approaches the optimal (real) seed in the compared model (y_{ max }), we used the statistical lower bound of the predicted interval for x_{ max }, as a worst case scenario, and compared it with y_{ max }. These ratios are shown as T (x, y) in Table 1. As expected, the values are consistently above 0.98 between the seed-equivalent models, but also when a seed in a more distant model is projected onto a closer model, and the trends are more pronounced with the larger seed weights. This level of accuracy in predicting near-optimal seeds by regression projection, albeit statistical, is comparable with that obtained with the heuristic algorithm proposed in [20]. To rule out the effects of outliers among the top-scoring candidates, we identified those seeds that fall significantly outside of the 95% confidence range (Table 1, columns 7 and 8). Although roughly 1–2% of seeds are expected to place outside the predicted range, they are scattered along the sensitivity range for the model.
Collectively, these findings suggest that simply selecting top scoring seeds in the most distant model, in our case ZFS, would lead to good seeds for all other models, and therefore gives a simple strategy for determining universal good seeds.
Evaluation
Seeds optimized for the CHK, DOG, MUS, ZFS comparisons, for weight W = 10..16, using hill-climbing. For large weights (e.g., W ≥ 16), the fixed span k = 22 may significantly constrain the range of seeds, and therefore the seeds produced under this model may not be optimal in practice.
Comparison | n _{1} | n _{0} | n _{ x } | W | Sensitivity | Seed |
---|---|---|---|---|---|---|
CHK | 9 | 11 | 2 | 10 | 0.9033819146 | 1x11011011x11000000000 |
CHK | 9 | 9 | 4 | 11 | 0.8373594847 | 1xx1011011011xx1000000 |
CHK | 10 | 8 | 4 | 12 | 0.7617553847 | 1xx1011011011xx1100000 |
CHK | 11 | 7 | 4 | 13 | 0.6781533749 | 11x1101101x011xx110000 |
CHK | 12 | 6 | 4 | 14 | 0.5907201266 | 11x11011011011xxx11000 |
CHK | 12 | 4 | 6 | 15 | 0.5096393921 | 11x110xxx1011011011xx1 |
CHK | 14 | 4 | 4 | 16 | 0.4328248093 | 11x110xx11011011011x11 |
DOG | 8 | 10 | 4 | 10 | 0.9991951771 | 11xx1101x011x100000000 |
DOG | 9 | 9 | 4 | 11 | 0.9976354479 | 11x110110x0x1x11000000 |
DOG | 10 | 8 | 4 | 12 | 0.9943213958 | 11x110x1011x1x11000000 |
DOG | 10 | 6 | 6 | 13 | 0.9882916262 | 11x1101x1x0xx011x11000 |
DOG | 11 | 5 | 6 | 14 | 0.9790689537 | 11xx11011x0x0x1011x110 |
DOG | 12 | 4 | 6 | 15 | 0.9652197277 | 11x110x1x010x1011x1x11 |
DOG | 13 | 3 | 6 | 16 | 0.9461146848 | 11x110x1x01xx1011x1111 |
MUS | 8 | 10 | 4 | 10 | 0.9946391036 | 1xx1011011xx1100000000 |
MUS | 9 | 9 | 4 | 11 | 0.9868058410 | 11x1101x011xx110000000 |
MUS | 9 | 7 | 6 | 12 | 0.9737371393 | 1x1101x010x1x011xx1000 |
MUS | 10 | 6 | 6 | 13 | 0.9538071885 | 1x1101xx10x1x011x11000 |
MUS | 11 | 5 | 6 | 14 | 0.9260863280 | 11xx110x1x010x1011x110 |
MUS | 11 | 3 | 8 | 15 | 0.8901430490 | 11x110x1x01xx1011x1xx1 |
MUS | 13 | 3 | 6 | 16 | 0.8445827211 | 11x1101xxx011x11011x11 |
ZFS | 9 | 11 | 2 | 10 | 0.7113928043 | 1x11011011x11000000000 |
ZFS | 10 | 10 | 2 | 11 | 0.6041894577 | 11x11011011x1100000000 |
ZFS | 10 | 8 | 4 | 12 | 0.4894026703 | x1x11011011011xx100000 |
ZFS | 12 | 8 | 2 | 13 | 0.3972876500 | 11x11011011011x1100000 |
ZFS | 12 | 6 | 4 | 14 | 0.3090795416 | 1xx110110x1011011x1100 |
ZFS | 13 | 5 | 4 | 15 | 0.2419994803 | 11x1101101xx11011x1100 |
ZFS | 14 | 4 | 4 | 16 | 0.1863098023 | 11xx11011011x11011x110 |
Theoretical (T) and empirical (E) sensitivities of optimal seeds in the four models. Letters C, D, M, Z indicate the CHK, DOG, MUS and ZFS comparisons, respectively. C_{ do }represents sensitivity values for the optimal DOG seeds (do) when applied to the CHK (C) model. Values are averaged within each weight group.
W | C_{ co } | C_{ do } | C_{ mo } | C_{ zo } | D_{ co } | D_{ do } | D_{ mo } | D_{ zo } |
---|---|---|---|---|---|---|---|---|
10^{ T } | 0.881 | 0.875 | 0.878 | 0.881 | 0.998 | 0.999 | 0.999 | 0.998 |
10^{ E } | 0.904 | 0.896 | 0.897 | 0.905 | 0.983 | 0.984 | 0.984 | 0.982 |
11^{ T } | 0.808 | 0.800 | 0.803 | 0.808 | 0.996 | 0.996 | 0.996 | 0.996 |
11^{ E } | 0.876 | 0.859 | 0.856 | 0.875 | 0.976 | 0.978 | 0.976 | 0.976 |
12^{ T } | 0.717 | 0.708 | 0.715 | 0.716 | 0.990 | 0.990 | 0.990 | 0.989 |
12^{ E } | 0.822 | 0.819 | 0.822 | 0.831 | 0.967 | 0.970 | 0.968 | 0.967 |
13^{ T } | 0.628 | 0.619 | 0.624 | 0.627 | 0.978 | 0.979 | 0.979 | 0.977 |
13^{ E } | 0.779 | 0.756 | 0.761 | 0.790 | 0.954 | 0.954 | 0.953 | 0.953 |
14^{ T } | 0.555 | 0.542 | 0.549 | 0.554 | 0.967 | 0.969 | 0.969 | 0.965 |
14^{ E } | 0.735 | 0.706 | 0.731 | 0.763 | 0.943 | 0.942 | 0.943 | 0.943 |
15^{ T } | 0.483 | 0.472 | 0.480 | 0.481 | 0.952 | 0.955 | 0.954 | 0.949 |
15^{ E } | 0.706 | 0.676 | 0.686 | 0.733 | 0.927 | 0.927 | 0.926 | 0.930 |
16^{ T } | 0.412 | 0.400 | 0.407 | 0.410 | 0.931 | 0.935 | 0.935 | 0.927 |
16^{ E } | 0.667 | 0.610 | 0.627 | 0.679 | 0.912 | 0.909 | 0.910 | 0.910 |
W | M _{ co } | M _{ do } | M _{ mo } | M _{ zo } | Z _{ co } | Z _{ do } | Z _{ mo } | Z _{ zo } |
10^{ T } | 0.992 | 0.992 | 0.993 | 0.992 | 0.663 | 0.647 | 0.654 | 0.663 |
10^{ E } | 0.978 | 0.978 | 0.979 | 0.978 | 0.745 | 0.716 | 0.725 | 0.751 |
11^{ T } | 0.981 | 0.982 | 0.982 | 0.981 | 0.547 | 0.529 | 0.534 | 0.547 |
11^{ E } | 0.970 | 0.969 | 0.969 | 0.969 | 0.678 | 0.636 | 0.626 | 0.676 |
12^{ T } | 0.963 | 0.963 | 0.963 | 0.961 | 0.432 | 0.417 | 0.426 | 0.433 |
12^{ E } | 0.957 | 0.958 | 0.958 | 0.957 | 0.570 | 0.562 | 0.565 | 0.591 |
13^{ T } | 0.932 | 0.933 | 0.934 | 0.930 | 0.342 | 0.329 | 0.335 | 0.344 |
13^{ E } | 0.940 | 0.938 | 0.940 | 0.941 | 0.504 | 0.464 | 0.474 | 0.525 |
14^{ T } | 0.904 | 0.905 | 0.906 | 0.900 | 0.276 | 0.260 | 0.267 | 0.278 |
14^{ E } | 0.925 | 0.922 | 0.927 | 0.929 | 0.436 | 0.394 | 0.425 | 0.488 |
15^{ T } | 0.867 | 0.870 | 0.871 | 0.861 | 0.220 | 0.207 | 0.215 | 0.222 |
15^{ E } | 0.909 | 0.906 | 0.908 | 0.913 | 0.395 | 0.358 | 0.365 | 0.447 |
16^{ T } | 0.822 | 0.825 | 0.826 | 0.815 | 0.172 | 0.160 | 0.166 | 0.173 |
16^{ E } | 0.889 | 0.882 | 0.886 | 0.890 | 0.350 | 0.283 | 0.302 | 0.368 |
Discussion
While significant effort has gone in designing optimal seeds for comparing human and mouse sequences, a remarkably few other comparisons received any attention at all. With more species being sequenced over the next few years, identifying a small number of seeds that would perform optimally or near-optimally for a large range of comparisons is becoming essential. For genomic sequences, the sequence identity level has been used as a simple measure of evolutionary distance. Since the sequence composition of genes is significantly more complex, best seeds for comparing coding sequences are expected to depend not only on the sequence identity level, but also on the pattern of mutations, in particular transition-transversion biases. Intuitively, similar alignment models should produce similar behavior of seeds. Thus, one solution to seed proliferation is to group comparisons that exhibit similar behavior of seeds into seed-equivalent classes, such that all comparisons in one class are well served by the same optimal seed. Furthermore, it would be even more desirable to identify one or a small set of seeds suitable for a wide variety of evolutionary distances and mutation patterns, herein called universal seeds.
Our statistical regression analyses of seed sensitivities among four comparisons, chosen at various evolutionary distances, showed that for some pairs of comparisons seed behavior can be predicted closely with high-confidence. In particular, it was possible to identify two sets of models that have relatively different sensitivity ranges but very similar seed behavior, and therefore can be deemed seed-equivalent. Moreover, even at larger evolutionary distances more distant comparisons are predictive of closer ones. This conclusion is corroborated by several factors, including the length of the prediction interval, the pattern of outliers, and the ratio of predicted to optimal sensitivity (Table 1). For instance, at 2.5σ regression error, all but a few of the outliers perform better than predicted in the distant model compared to the closer one. In other words, distant comparisons contain more information than closer ones. One outstanding question is how to determine whether two models are close enough to be seed-equivalent. While we have not yet found a theoretical formulation, we investigated the relationship between alignment Markov models using a conventional distance measure between their probability distributions. The Kullback-Leibler Divergence (KLD) [23] can be applied on the space of alignment words $\mathcal{X}$ = {0, 1, x}^{64} to produce a distance between the two models [see Additional file 3]:
Distances between models: KLD is the Kullback-Leibler Divergence.
Comparisons | KLD (P, Q) |
---|---|
CHK-DOG (DOG-CHK) | 4.857 (4.077) |
CHK-MUS (MUS-CHK) | 2.351 (2.012) |
CHK-ZFS (ZFS-CHK) | 0.870 (0.885) |
DOG-MUS (MUS-DOG) | 0.468 (0.491) |
DOG-ZFS (ZFS-DOG) | 8.366 (9.997) |
MUS-ZFS (ZFS-MUS) | 5.352 (6.329) |
Lastly, one natural question that arises is whether universal seeds exist for other types of comparisons, such as between genomic sequences. Our preliminary experiments using a {0, 1} Bernoulli model of alignment [7, 19] and four different sequence similarity levels to represent different evolutionary distances (p = 0.65, 0.75, 0.85 and 0.95) indicate that, again, good seeds for the more distant comparisons will perform well on the closer ones [see Additional file 4]. Moreover, for weight 12, p = 0.65, 0.75, 0.85 form a seed-equivalent cluster and the optimal seed is shared among the four models, whereas for the larger weights (14, 16) the seed-equivalent groups are sparser and no one seed is optimal for all comparisons. These findings are consistent with the observations in [19]. Thus, although more analyses are needed to test it on various models, our simple strategy for selecting universal good seeds may be more widely applicable to a variety of sequence comparison problems.
Conclusion
We performed a statistical analysis of seed sensitivities for four species-to-species cDNA-to-genome comparisons, spanning a wide range of evolutionary distances and mutation patterns, with the goal to determine criteria for selecting a small set of seeds that would perform well on a wide range of comparisons. In particular, grouping models that exhibit similar behavior of seeds into seed-equivalence classes could significantly reduce the number of optimal seeds. Most important for practical applications, the analyses showed that with high probability optimal and near-optimal seeds for the most distant available comparison will translate into good seeds for a wide range of comparisons. These insights, and the sets of optimal seeds predicted for the four comparisons and for a wide array of weights, represent a useful resource in guiding the selection of seeds for developing practical applications.
Availability
Material referenced in the paper can be found in the Additional files below and at our website [24].
Methods
Optimal seeds
Given an alignment model, we calculate seed sensitivity in the {0, 1, x} model recursively as described in [11]. For convenience, we include a summary here.
where P_{0}, P_{1} and P_{2} are the marginal probabilities of the order 2 Markov model $\mathcal{M}$, and ε is the empty word.
To determine optimal seeds, we use a hill-climbing heuristic [6, 11], starting from a random seed and swapping any two distinct symbols with the goal to optimize the score locally. If a 'better' seed is found, it becomes the start seed for variations in the next cycle, until there are no changes to the optimal score. The procedure is applied 20 times for each weight W and for each combination (n_{1}, n_{ x }, n_{0}), where n_{ i }is the number of i symbols in the seed, satisfying n_{1} + n_{0} + n_{ x }= 22 and n_{1} + 0.5·n_{ x }= W.
Seed evaluation
We evaluate seed sensitivity both theoretically and empirically to identify good seeds for practical applications. To determine the effects of species divergence on the choice of seeds, we analyze comparisons between human and four other species (dog, mouse, chicken and zebrafish), which coarsely sample the range of vertebrate evolutionary distances.
Theoretical seed evaluation
Given a seed and an alignment model, the theoretical seed sensitivity can be calculated recursively as described above. To train the alignment models for the four sets of comparisons, exons in the human genome were determined from spliced genomic alignments of human mRNA and EST sequences, produced with the programs Sim4/ESTmapper [2, 26]. A subset of coding exons and their reading frames were then determined by Fastx [27] matches against the SwissProt database [28]. Lastly, match statistics within the coding exon regions were collected from whole-genome alignments downloaded from the UCSC Genome Browser [29] and used to train the alignment models.
Empirical evaluation
This evaluation component tests the ability of a seed to accurately match dog, mouse, chicken and zebrafish coding exons with their orthologs in the human genome. For this purpose, reference sets of orthologous exons were constructed from homologous RefSeq gene pairs [30, 31] as follows. Exon coordinates in the human mRNA and in the human genome were determined from Sim4 [26] alignments of the human mRNA sequence on the human genome HG17 produced with the high-throughput program ESTmapper [2]. These coordinates were then projected onto the mRNA of the other species via the pairwise mRNA-mRNA alignment. Starting with roughly 500 gene pairs for each comparison, this procedure produced 2,408 (dog), 4,198 (mouse), 4,869 (chicken) and 4,543 (zebrafish) exon pairs for use in our empirical analysis.
During the evaluation, dog, mouse, chicken and zebrafish coding sequences are searched against the human genome. The search program scans the input sequences, looking for seed matches of length 22 in the human genome. For efficiency, the search uses a bit-encoded index of 22 bp words (22-mers) in the genome, from which non-informative bits corresponding to 0 or x positions in the seed are removed. Empirical sensitivity is calculated as the fraction of human exons in the reference set that are detected by the seed: Sn^{ e }= TP/(TP + FN) [32].
Statistical analysis
We compare the distributions of seed sensitivities between any two comparisons (DOG, MUS, CHK and ZFS) using statistical regression methods. We used a t-test to determine the 95% confidence interval ($\widehat{y}$ - t_{α/2}σ_{ y }, $\widehat{y}$ + t_{α/2}σ_{ y }) for the projection $\widehat{y}$ = R(x) of a sensitivity value x of a seed from the reference model to the compared model. Here, α = 0.05, t_{α/2}= 1.96, and σ_{ y }is the estimated regression standard error of prediction for a given x value. Because the number of values is large, σ_{ y }≃ σ for all y. A non-linear regression method with an order 3 polynomial was applied to the data: R(x) = (a +bx_{max} + $c{x}_{\mathrm{max}}^{2}$ + $d{x}_{\mathrm{max}}^{3}$). Lastly, outliers were determined as data points x whose values y in the compared model fall below or above the projected interval: y - $\widehat{y}$ > 2.5σ or $\widehat{y}$ - y > 2.5σ.
Declarations
Acknowledgements
Sensitivity calculations were performed on the "Herd" Scientific Computing Cluster at The George Washington University (NSF grant CLS20163A to JS). This work was supported in part by a Sloan Research Fellowship to LF.
Authors’ Affiliations
References
- National Human Genome Research Institute, Listing of Genome Sequencing Proposals[http://www.genome.gov/10002154]
- Florea L, Di Francesco V, Miller J, Turner R, Mobarry C, Yao A, Harris M, Walenz B, Dew I, Merkulov G, Charlab R, Deng Z, Istrail S, Li P, Sutton G: Gene and alternative splicing annotation with AIR. Genome Res 2005, 15(1):54–66.PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.View ArticlePubMedGoogle Scholar
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent W, Zahler A: Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res 2000, 10(8):1115–1125.View ArticlePubMedGoogle Scholar
- Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003) 2003, 7: 67–75.Google Scholar
- Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics 2002, 18(3):440–445.View ArticlePubMedGoogle Scholar
- Keich U, Li M, Ma B, Tromp J: On spaced seeds for similarity search. Discrete Appl Math 2004, 138(3):253–263.View ArticleGoogle Scholar
- Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13(1):103–107.PubMed CentralView ArticlePubMedGoogle Scholar
- Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA. J Comput Syst Sci 2005, 70(3):342–363.View ArticleGoogle Scholar
- Zhou L, Florea L: Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J Comput Biol 2007, 14(2):113–130.View ArticlePubMedGoogle Scholar
- Sun Y, Buhler J: Designing multiple simultaneous seeds for DNA similarity search. J Comput Biol 2005, 12(6):847–861.View ArticlePubMedGoogle Scholar
- Brejová B, Brown D, Vinar T: Vector seeds: an extension to spaced seeds. J Comput Syst Sci 2005, 70(3):364–380.View ArticleGoogle Scholar
- Noé L, Kucherov G: Improved hit criteria for DNA local alignment. BMC Bioinformatics 2004, 5: 149.PubMed CentralView ArticlePubMedGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268(1):78–94.View ArticlePubMedGoogle Scholar
- Nei M, Kumar S: Molecular evolution and phylogenetics. New York: Oxford University Press; 2000.Google Scholar
- Brejová B, Brown D, Vinar T: Optimal spaced seeds for homologous coding regions. J Bioinform Comput Biol 2004, 1(4):595–610.View ArticlePubMedGoogle Scholar
- Choi K, Zhang L: Sensitivity analysis and efficient method for identifying optimal spaced seeds. J Comp Syst Sci 2004, 68: 22–40.View ArticleGoogle Scholar
- Choi K, Zeng F, Zhang L: Good spaced seeds for homology search. Bioinformatics 2004, 20(7):1053–1059.View ArticlePubMedGoogle Scholar
- Preparata F, Zhang L, Choi KP: Quick, practical selection of effective seeds for homology search. J Comput Biol 2005, 12(9):1137–1152.View ArticlePubMedGoogle Scholar
- Kucherov G, Noé L, Roytberg M: A unifying framework for seed sensitivity and its application to subset seeds. J Bioinform Comp Biol 2006, 4(2):553–569.View ArticleGoogle Scholar
- The R Project for Statistical Computing[http://www.r-project.org]
- Cover T, Thomas J: Elements of Information Theory. New York: John Wiley & Sons, Inc; 1991.View ArticleGoogle Scholar
- George Washington University, Computational Molecular Biology Lab[http://dna.cs.gwu.edu]
- Aho A, Corasick M: Efficient string matching: an aid to bibliographic search. Comm ACM 1975, 18(6):333–340.View ArticleGoogle Scholar
- Florea L, Hartzell G, Zhang Z, Rubin G, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998, 8(9):967–974.PubMed CentralPubMedGoogle Scholar
- Pearson W, Lipman D: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu C, Apweiler R, Bairoch A, Natale D, Barker W, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin M, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, (34 Database):D187–191.Google Scholar
- UCSC Genome Browser[http://genome.ucsc.edu]
- Pruitt K, Tatusova T, Maglott D: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, (35 Database):D61–65.Google Scholar
- Wheeler D, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Edgar R, Federhen S, Geer L, Kapustin Y, Khovayko O, Landsman D, Lipman D, Madden T, Maglott D, Ostell J, Miller V, Pruitt K, Schuler G, Sequeira E, Sherry S, Sirotkin K, Souvorov A, Starchenko G, Tatusov R, Tatusova T, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, (35 Database):D5–12.Google Scholar
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353–367.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.