On finding minimal absent words
- Armando J Pinho^{1},
- Paulo JSG Ferreira^{1},
- Sara P Garcia^{1} and
- João MOS Rodrigues^{1}
https://doi.org/10.1186/1471-2105-10-137
© Pinho et al; licensee BioMed Central Ltd. 2009
Received: 06 January 2009
Accepted: 08 May 2009
Published: 08 May 2009
Abstract
Background
The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones.
Results
We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws.
Conclusion
Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
Background
There has been recent interest in absent words in DNA sequences, which are words that do not occur in a given genome. At the individual level, such words can be used as biomarkers for potential preventive and curative medical applications as derived from personal genomics efforts, while at the group level the comparison of genetic traits may impact, for example, on population genetics, or evolutionary profiles obtained from comparative genomics. It is therefore not surprising that absent words have been the subject of recent studies [1–3].
Hampikian and Andersen [1] used the term "nullomer" to designate the shortest words that do not occur in a given genome and the term "prime" to refer to the shortest words that are absent from the entire known genetic data. Herold et al. [3] used the term "unword" also to designate the shortest absent words. By definition, the nullomers/unwords of a given DNA sequence are uniquely determined by that sequence and all share a single length, namely the shortest length at which absent words exist.
The algorithm used by Hampikian and Andersen [1] to obtain the absent words tracks the occurrence of all possible words up to a user-specified length limit n, using a set of 4^{n} counters for the 4^{n} possible words of length n. This yields the existing absent words up to the given length limit, n. The approach taken by Herold et al. [3] has some computational advantages over that of Hampikian and Andersen [1], being less demanding in terms of memory and processing time.
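The counter-based strategy can be illustrated with a short Python sketch. It is not the code of Hampikian and Andersen [1]: the function names `count_words`, `decode` and `absent_words` are hypothetical, the sketch handles a single word length n rather than all lengths up to n, and it assumes the sequence contains only characters of the given alphabet.

```python
def count_words(seq, n, alphabet="ACGT"):
    """Count every word of length n in seq using |alphabet|**n counters."""
    code = {c: i for i, c in enumerate(alphabet)}
    base = len(alphabet)
    counts = [0] * base ** n
    for start in range(len(seq) - n + 1):
        idx = 0
        for c in seq[start:start + n]:
            idx = idx * base + code[c]  # word read as a base-|alphabet| number
        counts[idx] += 1
    return counts


def decode(idx, n, alphabet="ACGT"):
    """Turn a counter index back into the word of length n it stands for."""
    base = len(alphabet)
    chars = []
    for _ in range(n):
        idx, digit = divmod(idx, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


def absent_words(seq, n, alphabet="ACGT"):
    """Words of length n whose counter stayed at zero."""
    counts = count_words(seq, n, alphabet)
    return [decode(i, n, alphabet) for i, c in enumerate(counts) if c == 0]


print(absent_words("ACTAACTG", 2))
```

The memory cost of the counter table grows as 4^{n}, which is the limitation that motivates the less demanding approaches discussed in the text.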
In this paper, we generalize the concept of nullomer/unword, such that other words, not necessarily the shortest ones, can be included (for a precise definition see Definition 3). In fact, the original definition adopted by Hampikian and Andersen [1] and by Herold et al. [3] might be too limiting, because there are sequences that have only a few nullomers/unwords. For example, and according to the results presented in [3], the genome of the worm, Caenorhabditis elegans, has two nullomers/unwords, whereas the genome of the extreme thermophile, Thermococcus kodakarensis, has only one.
As stated by Herold et al. [3], longer absent words may also be of interest. For generating those longer absent words, they propose adding all unwords (say, of size k) as additional sequences to the genome and re-running the program. These additional absent words, which we call generic absent words, also include extended nullomers/unwords, i.e., words that contain nullomers/unwords. However, not all generic absent words are trivial extensions of nullomers/unwords.
Nullomers/unwords satisfy the following property, P.
Property 1 ( P ). If the leftmost or the rightmost character of a given nullomer/unword is removed, then the resulting word is no longer an absent word.
This property P does not hold for the absent words obtained by trivially extending nullomers/unwords nor for the longer absent words suggested by Herold et al. [3]. In other words, for a generic absent word, there is no way of knowing in advance if the elimination of some characters from one of the extremities of the word yields an absent word or not.
We have developed an efficient algorithm for computing these minimal absent words, which, in practice, runs in approximately linear time. Our work can be regarded as a complement to the works of Hampikian and Andersen [1] and of Herold et al. [3], in the sense that it provides a generalization of the nullomer/unword concept previously introduced, and helps to clarify their structure.
Methods
Basic definitions
Let S be a string over a finite alphabet Σ. We denote by S[p], 1 ≤ p ≤ |S|, the p-th character of S, where |S| designates the length (i.e., number of characters) of S, and by S[p_{1}..p_{2}], p_{1} ≤ p_{2}, the substring of S that starts at position p_{1} and ends at position p_{2}. Therefore, S[1..p] denotes a prefix of S and S[p..|S|] a suffix. Sr indicates the concatenation of character r to the right end of string S, whereas lS indicates the concatenation of character l to the left end of S.
For convenience, we define two additional virtual characters, # and $. They are virtual in the sense that they do not belong to the alphabet Σ. By definition, the character to the left of the first character of the string is #, and the character to the right of the last character of the string is $. In other words, we define S[0] = # and S[|S| + 1] = $.
Let 𝒫_{α}, where α is a substring of S, be the set of positions of S where α occurs, so that S[p..p + |α| - 1] = α, ∀p ∈ 𝒫_{α}, and S[p..p + |α| - 1] ≠ α, ∀p ∉ 𝒫_{α}. We define ℒ'_{α} and ℛ'_{α} to be the sets of characters that appear, respectively, to the immediate left and right of the several occurrences of α (these may include the virtual characters # and $), and the sets ℒ_{α} = ℒ'_{α} ∩ Σ and ℛ_{α} = ℛ'_{α} ∩ Σ. We also denote by ℰ_{α} ⊆ Σ × Σ the set of all pairs of characters (S[p - 1], S[p + |α|]), ∀p ∈ 𝒫_{α}, i.e., all pairs of characters "enclosing" the occurrences of α.
Definition 1 (Maximal repeated pair [4]). A maximal repeated pair in a string S is a triple (p_{1}, p_{2}, α), such that p_{1} ≠ p_{2}, p_{1}, p_{2} ∈ 𝒫_{α}, S[p_{1} - 1] ≠ S[p_{2} - 1] and S[p_{1} + |α|] ≠ S[p_{2} + |α|].
Definition 2 (Maximal repeat [4]). A substring α is a maximal repeat of S if there is at least a maximal repeated pair in S of the form (p_{1}, p_{2}, α).
Characterization
We are now ready to formally introduce the concept of minimal absent word.
Definition 3 (Minimal absent word). A string γ, |γ| ≥ 3, is a minimal absent word of S if γ is not a substring of S, but γ[2..|γ|] and γ[1..|γ| - 1] are substrings of S.
The set of generic absent words is too large to be of any practical interest.
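Definition 3 can be checked directly with substring tests. The following Python sketch is a brute-force illustration only (quadratic or worse, unsuitable for genomes); the function names and the default alphabet are assumptions made for the example.

```python
def is_minimal_absent_word(gamma, s):
    """Definition 3: gamma is absent from s, but dropping its first or
    its last character yields a substring of s (and |gamma| >= 3)."""
    return (len(gamma) >= 3
            and gamma not in s
            and gamma[1:] in s
            and gamma[:-1] in s)


def minimal_absent_words_bruteforce(s, alphabet="ACGT"):
    """Try every candidate x + r, where x is a substring of s with |x| >= 2."""
    n = len(s)
    substrings = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    found = set()
    for x in substrings:  # x plays the role of l followed by alpha
        if len(x) < 2:
            continue
        for r in alphabet:
            if is_minimal_absent_word(x + r, s):
                found.add(x + r)
    return found


print(sorted(minimal_absent_words_bruteforce("ACTAACTG")))
# prints ['AAA', 'AACTA', 'TAC']
```

For S = ACTAACTG the sketch returns AAA, TAC and AACTA, the same three words that the text later derives from the maximal repeats A and ACT.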
Theorem 1. If lαr is a minimal absent word of string S, then α is a maximal repeat in S.
Proof. According to Definition 3, if lαr is a minimal absent word of S, then lα and αr are substrings of S, i.e., lα = S[p_{1}..p_{1} + |α|] and αr = S[p_{2}..p_{2} + |α|], with p_{2} ≠ p_{1} + 1 (if p_{2} = p_{1} + 1 then lαr would be a substring of S, contradicting the assumption that it is a minimal absent word). Now consider that the character to the immediate right of lα is r' = S[p_{1} + |α| + 1] and that the character to the immediate left of αr is l' = S[p_{2} - 1]. Because lαr does not occur in S, l' cannot be the same character as l and r' cannot be the same character as r, implying that (p_{1} + 1, p_{2}, α) is a maximal repeated pair and, therefore, α is a maximal repeat in S. □
Note that this applies to minimal absent words with at least three characters, according to Definition 3. The restriction could be removed by allowing α to be the empty string. For the sake of clarity, we do not consider this case here. In fact, directly finding minimal absent words of length two requires |Σ|^{2} string matching operations, which can be performed in a reasonable time, unless the size of the alphabet is unusually large. This is why in Definition 3 we restricted the size of a minimal absent word to be at least three.
Theorem 2. A string lαr is a minimal absent word of S if and only if (l, r) ∉ ℰ_{α}, for l ∈ ℒ_{α} and r ∈ ℛ_{α}.
Proof. If (l, r) ∉ ℰ_{α}, then no occurrence of α in S has, simultaneously, an l character to its immediate left and an r character to its immediate right, implying that lαr does not occur in S. On the other hand, since l ∈ ℒ_{α}, there is at least one position in S where the substring lα occurs, the same holding for the substring αr, because r ∈ ℛ_{α}. Therefore, according to Definition 3, lαr is a minimal absent word of S.
Now, consider that lαr is a minimal absent word of S and (l, r) ∈ ℰ_{ α }. In that case, there would be a substring lαr in S, contradicting the assumption that lαr is a minimal absent word. □
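Theorem 2 translates almost literally into code: collect the left characters, the right characters and the enclosing pairs of every occurrence of α, then report each pair (l, r) that never encloses α. The Python sketch below is an illustration under simple assumptions (naive scanning with `str.find`, a non-empty α, and `None` standing in for the virtual characters # and $); the function names are hypothetical.

```python
def context_sets(s, alpha):
    """Return (L, R, E) for a non-empty alpha: left characters, right
    characters and enclosing pairs of all its occurrences in s."""
    left, right, enclosing = set(), set(), set()
    start = s.find(alpha)
    while start != -1:
        end = start + len(alpha)
        l = s[start - 1] if start > 0 else None  # None plays the role of '#'
        r = s[end] if end < len(s) else None     # None plays the role of '$'
        if l is not None:
            left.add(l)
        if r is not None:
            right.add(r)
        if l is not None and r is not None:
            enclosing.add((l, r))
        start = s.find(alpha, start + 1)         # overlapping matches allowed
    return left, right, enclosing


def maws_for_alpha(s, alpha):
    """Minimal absent words of the form l + alpha + r, following Theorem 2."""
    left, right, enclosing = context_sets(s, alpha)
    return sorted(l + alpha + r
                  for l in left for r in right
                  if (l, r) not in enclosing)


print(maws_for_alpha("ACTAACTG", "A"))    # ['AAA', 'TAC']
print(maws_for_alpha("ACTAACTG", "ACT"))  # ['AACTA']
print(maws_for_alpha("ACTAACTG", "CT"))   # []
```

The substring CT yields nothing because both of its occurrences are preceded by A, and the pairs (A, A) and (A, G) both occur.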
Finding the minimal absent words
Theorem 1 states that all minimal absent words are associated with maximal repeats. Therefore, finding all minimal absent words can be approached by first finding all maximal repeats in a string, which can be done using suffix trees in O(|S|) time [4]. Moreover, suffix trees can be built in O(|S|) time and stored in O(|S|) memory [5–7]. See [4] for an introduction to suffix trees.
Suffix trees
A suffix tree of a string S is a rooted directed tree with exactly |S| leaves (numbered 1 to |S|). Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. For any leaf p, the concatenation of the edge-labels on the path from the root to leaf p corresponds to the suffix that starts at position p, i.e., to S[p..|S|].
The condition stating that each internal node, other than the root, should have at least two children, implies that some strings do not have a suffix tree representation. In fact, this condition is violated in strings having a suffix that is a prefix of another suffix. To remedy this, a character that does not appear in any other position of S is usually appended at the end of the string (the "$" character is frequently used for this purpose [4]).
Definition 4 (Left character [4]). For each position p in S, character S[p - 1] is called the left character of p. The left character of a leaf of the suffix tree is the left character of the suffix represented by that leaf.
The characters appearing inside parentheses near the leaves of the suffix tree of Fig. 2 are the corresponding left characters. Notice the # character associated with leaf number one, corresponding to S[0].
Definition 5 (Left diverse [4]). A node v of the suffix tree is called left diverse if at least two leaves in v's subtree have different left characters.
In the suffix tree depicted in Fig. 2, nodes v_{1} and v_{2} are the only left diverse nodes. According to Theorem 7.12.2 of [4], the string α labeling the path to a node v of the suffix tree is a maximal repeat if and only if v is left diverse. Therefore, in Fig. 2, the substrings formed along the paths from the root node to each of the two nodes v_{1} and v_{2} correspond to maximal repeats. Those strings are A and ACT, which are the basis of the minimal absent words AAA, TAC and AACTA. Recall that, in a string S, there are at most |S| maximal repeats (Theorem 7.12.1 in [4]). Since each maximal repeat α gives rise to at most |ℒ_{α}||ℛ_{α}| ≤ |Σ|^{2} words of the form lαr, the number of minimal absent words of a string S is upper bounded by |S||Σ|^{2}.
Suffix arrays
Suffix trees are a powerful data structure that allowed important advances in string processing [4]. However, the space required by a suffix tree, although growing linearly with the size of the string, might still be excessive for some applications [8, 9].
Suffix arrays are an alternative data structure that is more space efficient (4 bytes per input character for strings of size up to 2^{32}, in its basic form). However, to increase the efficiency of certain tasks, they might require auxiliary information [9]. Introduced in [10, 11], the suffix arrays can be constructed in linear time from the corresponding suffix tree [4] or using direct algorithms [12–14].
Table 1. Suffix array p_{k} and auxiliary information, in this case the lcp and bwt arrays, for S = ACTAACTG.
k | p _{ k } | lcp _{ k } | bwt _{ k } | S[p_{ k }..|S|] |
---|---|---|---|---|
1 | 4 | 0 | T | AACTG |
2 | 1 | 1 | # | ACTAACTG |
3 | 5 | 3 | A | ACTG |
4 | 2 | 0 | A | CTAACTG |
5 | 6 | 2 | A | CTG |
6 | 8 | 0 | T | G |
7 | 3 | 0 | C | TAACTG |
8 | 7 | 1 | C | TG |
Table 2. Generalized suffix array p_{k} and auxiliary information for strings S_{1} = ACTAACTG and S_{2} = CGTACTA.
k | p _{ k } | lcp _{ k } | bwt _{ k } | S[p_{ k }..|S|] |
---|---|---|---|---|
1 | 16 | 0 | T | A |
2 | 4 | 1 | T | AACTG |
3 | 13 | 1 | T | ACTA |
4 | 1 | 4 | # | ACTAACTG |
5 | 5 | 3 | A | ACTG |
6 | 10 | 0 | # | CGTACTA |
7 | 14 | 1 | A | CTA |
8 | 2 | 3 | A | CTAACTG |
9 | 6 | 2 | A | CTG |
10 | 8 | 0 | T | G |
11 | 11 | 1 | C | GTACTA |
12 | 15 | 0 | C | TA |
13 | 3 | 2 | C | TAACTG |
14 | 12 | 2 | G | TACTA |
15 | 7 | 1 | C | TG |
The lcp-array contains the lengths of the longest common prefix between consecutive ordered suffixes, i.e., lcp_{k} indicates the length of the longest common prefix between S[p_{k-1}..|S|] and S[p_{k}..|S|], 2 ≤ k ≤ |S|. By convention, lcp_{1} = lcp_{|S|+1} = 0.
The bwt-array is a permutation of S, such that bwt_{k} = S[p_{k} - 1]. Remember that the character to the immediate left of S[1] has been defined to be #, a convention that explains the value of bwt_{k} for p_{k} = 1. Conceptually, the bwt-array does not provide additional information, because the left character of any character of S can be determined by direct access to S. In fact, in this paper, we use both notations, bwt_{k} and S[p_{k} - 1], interchangeably. However, in practice, the bwt-array allows sequential memory access and hence improves the performance, due to better cache use [16].
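The three arrays for the example string S = ACTAACTG can be reproduced with a few lines of Python. This is a naive sketch for illustration (sorting whole suffixes, far from the linear-time constructions cited in the text); the function names are hypothetical. Positions are 1-based as in the paper, while the Python lists themselves are 0-based, so `sa[k - 1]` holds p_{k}.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])


def common_prefix_length(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lcp_array(s, sa):
    """lcp[k - 1] is lcp_k; lcp_1 = 0 by convention."""
    return [0] + [common_prefix_length(s[sa[k - 1] - 1:], s[sa[k] - 1:])
                  for k in range(1, len(sa))]


def bwt_array(s, sa):
    """bwt[k - 1] is S[p_k - 1], with '#' as the virtual character S[0]."""
    return ["#" if p == 1 else s[p - 2] for p in sa]


S = "ACTAACTG"
sa = suffix_array(S)
print(sa)                 # [4, 1, 5, 2, 6, 8, 3, 7]
print(lcp_array(S, sa))   # [0, 1, 3, 0, 2, 0, 0, 1]
print(bwt_array(S, sa))   # ['T', '#', 'A', 'A', 'A', 'T', 'C', 'C']
```

The printed values match the p_{k}, lcp_{k} and bwt_{k} columns given for S = ACTAACTG.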
Definition 6 (Lcp-interval). Interval [i..j], 1 ≤ i < j ≤ |S|, is an lcp-interval of lcp-depth d, denoted ⟨d, i, j⟩, if
1. lcp_{i} < d,
2. lcp_{k} ≥ d, ∀i < k ≤ j,
3. lcp_{k} = d, for at least one k in i < k ≤ j,
4. lcp_{j+1} < d.
The lcp-intervals of the example string S = ACTAACTG are ⟨1, 1, 3⟩ and ⟨1, 7, 8⟩ of lcp-depth 1, ⟨2, 4, 5⟩ of lcp-depth 2, and ⟨3, 2, 3⟩ of lcp-depth 3. Note that each of these lcp-intervals corresponds to a distinct internal node of the suffix tree (see Fig. 2). For example, the lcp-interval ⟨1, 1, 3⟩ is associated with node v_{1}, whereas the lcp-interval ⟨3, 2, 3⟩ corresponds to node v_{2}. Therefore, we can think of a virtual tree of lcp-intervals having a structure similar to the corresponding suffix tree [16].
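The four conditions of Definition 6 can be tested mechanically. The Python sketch below is a direct transcription for illustration; the name `is_lcp_interval` is hypothetical, the list `lcp` stores lcp_{k} at index k - 1, and lcp_{|S|+1} is taken as 0 following the convention stated earlier.

```python
def is_lcp_interval(lcp, d, i, j):
    """Check whether <d, i, j> satisfies the four conditions of Definition 6."""
    n = len(lcp)

    def at(k):  # 1-based access, with lcp_{n+1} = 0 by convention
        return lcp[k - 1] if 1 <= k <= n else 0

    if not (1 <= i < j <= n):
        return False
    inner = [at(k) for k in range(i + 1, j + 1)]
    return (at(i) < d              # condition 1
            and min(inner) >= d    # condition 2
            and d in inner         # condition 3
            and at(j + 1) < d)     # condition 4


lcp = [0, 1, 3, 0, 2, 0, 0, 1]  # lcp-array of S = ACTAACTG
for triple in [(1, 1, 3), (1, 7, 8), (2, 4, 5), (3, 2, 3)]:
    print(triple, is_lcp_interval(lcp, *triple))  # True for all four
```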
This correspondence between lcp-intervals and internal nodes of the suffix tree is important, because it helps to map some concepts from the suffix tree data structure into the suffix array approach. For example, the notion of left diverse node can be mapped directly into the lcp-intervals. Finding if a node, associated with lcp-interval ⟨d, i, j⟩, is left diverse, is the same as finding if at least two of the characters bwt_{k} differ, for i ≤ k ≤ j. Moreover, in that case, the corresponding maximal repeat is, for example, α = S[p_{i}..p_{i} + d - 1] (note that all substrings S[p_{k}..p_{k} + d - 1], ∀i ≤ k ≤ j, are identical).
Algorithm 1 (adapted from [16, 17]) generates all lcp-intervals using the lcp-array and a stack. The "Push" and "Pop" operations have the usual meaning when associated with stack processing. The variable "top" refers to the lcp-interval, ⟨d, i, j⟩, on the top of the stack.
Algorithm 1. Computation of lcp-intervals.
Push ⟨0, 0, 0⟩
for k = 2 to |S| + 1 do {lcp_{|S|+1} = 0 empties the stack}
    i ← k - 1
    while lcp_{k} < top.d do
        lcpint ← Pop
        lcpint.j ← k - 1
        Process lcpint
        i ← lcpint.i
    end while
    if lcp_{k} > top.d then
        Push ⟨lcp_{k}, i, 0⟩
    end if
end for
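A Python rendering of Algorithm 1 may help in following the stack manipulation. This is a sketch rather than the authors' implementation: the function name `lcp_intervals` is hypothetical, the intervals are collected in a list instead of being passed to "Process", and the loop is run through k = |S| + 1 (using lcp_{|S|+1} = 0) so that intervals still on the stack at the end are reported.

```python
def lcp_intervals(lcp):
    """Enumerate lcp-intervals <d, i, j> from an lcp-array.

    lcp[k - 1] holds lcp_k for k = 1..n; lcp_{n+1} is taken as 0.
    The root interval of depth 0 is never reported.
    """
    n = len(lcp)

    def at(k):
        return lcp[k - 1] if k <= n else 0

    intervals = []
    stack = [(0, 0)]  # (d, i) pairs; j is known only when popping
    for k in range(2, n + 2):
        i = k - 1
        while at(k) < stack[-1][0]:
            d, left = stack.pop()
            intervals.append((d, left, k - 1))  # "Process lcpint"
            i = left
        if at(k) > stack[-1][0]:
            stack.append((at(k), i))
    return intervals


print(lcp_intervals([0, 1, 3, 0, 2, 0, 0, 1]))
# prints [(3, 2, 3), (1, 1, 3), (2, 4, 5), (1, 7, 8)]
```

For the lcp-array of S = ACTAACTG the sketch reports the same four intervals listed in the text, children before their parents.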
In order to find the minimal absent words, the function "Process" in Algorithm 1 executes Algorithm 2, which builds the ℒ_{α}, ℛ_{α} and ℰ_{α} sets, determines if the lcp-interval is left diverse, and, if so, outputs the minimal absent words associated with the lcp-interval ⟨d, i, j⟩.
Algorithm 2. Computation of the minimal absent words for a given lcp-interval ⟨d, i, j⟩, where α = S[p_{i}..p_{i} + d - 1].
ℰ_{α} ← ∅
for k = i to j do
    if S[p_{k} - 1] ≠ # and S[p_{k} + d] ≠ $ then
        ℰ_{α} ← ℰ_{α} ∪ {(S[p_{k} - 1], S[p_{k} + d])}
    end if
end for
if |{bwt_{k} : i ≤ k ≤ j}| > 1 then {Left diverse}
    for all l ∈ ℒ_{α} do
        for all r ∈ ℛ_{α} do
            if (l, r) ∉ ℰ_{α} then
                Substring lαr is a minimal absent word
            end if
        end for
    end for
end if
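Algorithms 1 and 2 can be combined into one small end-to-end Python sketch. It is an illustration under stated assumptions, not the publicly available implementation: the suffix array and lcp-array are built naively, the name `minimal_absent_words` is hypothetical, and the interval loop runs through k = |S| + 1 so that every lcp-interval is processed.

```python
def minimal_absent_words(s):
    """Minimal absent words (length >= 3) of s via lcp-intervals."""
    n = len(s)
    sa = sorted(range(1, n + 1), key=lambda p: s[p - 1:])  # naive suffix array

    def char(p):  # 1-based access with the virtual characters # and $
        if p == 0:
            return "#"
        if p == n + 1:
            return "$"
        return s[p - 1]

    def common(p, q):  # longest common prefix of the suffixes at p and q
        d = 0
        while p + d <= n and q + d <= n and s[p + d - 1] == s[q + d - 1]:
            d += 1
        return d

    # lcp[k - 1] holds lcp_k for k = 1..n+1, with lcp_1 = lcp_{n+1} = 0
    lcp = [0] + [common(sa[k - 1], sa[k]) for k in range(1, n)] + [0]
    words = set()

    def process(d, i, j):  # Algorithm 2 for the lcp-interval <d, i, j>
        left, right, enclosing, bwt = set(), set(), set(), set()
        for k in range(i, j + 1):
            p = sa[k - 1]
            l, r = char(p - 1), char(p + d)
            bwt.add(l)
            if l != "#":
                left.add(l)
            if r != "$":
                right.add(r)
            if l != "#" and r != "$":
                enclosing.add((l, r))
        if len(bwt) > 1:  # left diverse
            start = sa[i - 1] - 1
            alpha = s[start:start + d]
            for l in left:
                for r in right:
                    if (l, r) not in enclosing:
                        words.add(l + alpha + r)

    stack = [(0, 0)]  # Algorithm 1: (d, i) pairs
    for k in range(2, n + 2):
        i = k - 1
        while lcp[k - 1] < stack[-1][0]:
            d, first = stack.pop()
            process(d, first, k - 1)
            i = first
        if lcp[k - 1] > stack[-1][0]:
            stack.append((lcp[k - 1], i))
    return words


print(sorted(minimal_absent_words("ACTAACTG")))
# prints ['AAA', 'AACTA', 'TAC']
```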
Our results remain valid for sets of strings {S_{1}, S_{2},..., S_{z}} over a finite alphabet Σ. In this case, the minimal absent words are generated through the concatenation of the strings using a delimiting character not belonging to the alphabet. The delimiter avoids the creation of artificial substrings across string boundaries.
Table 2 shows the (generalized) suffix array [18] associated to strings S_{1} and S_{2}.
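The generalized suffix array for S_{1} and S_{2} can be reproduced with a naive Python sketch. Several choices here are assumptions made for illustration: the delimiter is written "|", each suffix is compared only up to the next delimiter, ties between identical suffixes of different strings (none occur in this example) are left in position order, and the left character of the first position of every string is reported as "#". The function name is hypothetical.

```python
def generalized_suffix_array(strings, delimiter="|"):
    """Return (sa, lcp, bwt) for the delimiter-joined strings.

    Positions are 1-based in the concatenated text; delimiter
    positions themselves are left out of the suffix array.
    """
    text = delimiter.join(strings)

    def key(p):  # suffix starting at p, cut at the next delimiter
        return text[p - 1:].split(delimiter, 1)[0]

    def common(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    positions = [p for p in range(1, len(text) + 1)
                 if text[p - 1] != delimiter]
    sa = sorted(positions, key=key)
    lcp = [0] + [common(key(sa[k - 1]), key(sa[k]))
                 for k in range(1, len(sa))]
    bwt = ["#" if p == 1 or text[p - 2] == delimiter else text[p - 2]
           for p in sa]
    return sa, lcp, bwt


sa, lcp, bwt = generalized_suffix_array(["ACTAACTG", "CGTACTA"])
print(sa)            # [16, 4, 13, 1, 5, 10, 14, 2, 6, 8, 11, 15, 3, 12, 7]
print(lcp)           # [0, 1, 1, 4, 3, 0, 1, 3, 2, 0, 1, 0, 2, 2, 1]
print("".join(bwt))  # TTT#A#AAATCCCGC
```

Because no suffix is ever compared beyond a delimiter, no common prefix spans a string boundary, which is the role the text assigns to the delimiting character.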
Results and discussion
The graphic displayed in Fig. 4 shows an apparently curious behavior: the time taken by the algorithm increases as the size of the alphabet decreases. This might be due to the fact that, for the same string length, strings over smaller alphabets imply deeper suffix trees and, since the lcp-intervals are related to the internal nodes of the suffix tree, generating them for smaller alphabets takes longer.
From the curves presented in Fig. 4, it can be seen that, in practice, the running time of the algorithm is approximately linear with the length of the string. Moreover, the number of minimal absent words (displayed in Fig. 3) also shows a similar behavior, contrasting with the exponential growth of the number of generic absent words of a string.
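Table 3. Number of minimal absent words and generic absent words of length n for some genomes.
Organism | Reference | Genome size | Minimal absent words | Generic absent words | Length, n |
---|---|---|---|---|---|
H. sapiens | Release 36.1 | ≈ 2.9 Gb | 104 | 104 | 11 |
H. sapiens | Release 36.1 | ≈ 2.9 Gb | 44 149 | 44 970 | 12 |
H. sapiens | Release 36.1 | ≈ 2.9 Gb | 2 039 862 | 2 368 682 | 13 |
M. musculus | Release m36.1 | ≈ 2.6 Gb | 190 | 190 | 11 |
M. musculus | Release m36.1 | ≈ 2.6 Gb | 52 087 | 53 573 | 12 |
M. musculus | Release m36.1 | ≈ 2.6 Gb | 2 192 708 | 2 579 838 | 13 |
D. melanogaster | FB 5 | ≈ 162 Mb | 104 | 104 | 11 |
D. melanogaster | FB 5 | ≈ 162 Mb | 172 849 | 173 674 | 12 |
D. melanogaster | FB 5 | ≈ 162 Mb | 10 040 282 | 11 335 034 | 13 |
C. elegans | WB 170 | ≈ 100 Mb | 2 | 2 | 10 |
C. elegans | WB 170 | ≈ 100 Mb | 7 664 | 7 680 | 11 |
C. elegans | WB 170 | ≈ 100 Mb | 1 092 286 | 1 151 728 | 12 |
N. crassa | Assembly 7 | ≈ 39 Mb | 2 262 | 2 262 | 11 |
N. crassa | Assembly 7 | ≈ 39 Mb | 1 064 938 | 1 082 787 | 12 |
N. crassa | Assembly 7 | ≈ 39 Mb | 20 213 298 | 27 903 272 | 13 |
S. cerevisiae S228C | SGD 1 | ≈ 12 Mb | 2 | 2 | 9 |
S. cerevisiae S228C | SGD 1 | ≈ 12 Mb | 6 435 | 6 450 | 10 |
S. cerevisiae S228C | SGD 1 | ≈ 12 Mb | 414 520 | 462 882 | 11 |
S. aureus MSSA476 | NC002953 | ≈ 2.8 Mb | 248 | 248 | 8 |
S. aureus MSSA476 | NC002953 | ≈ 2.8 Mb | 11 908 | 13 744 | 9 |
S. aureus MSSA476 | NC002953 | ≈ 2.8 Mb | 162 113 | 251 497 | 10 |
T. kodakarensis | NC006624 | ≈ 2.09 Mb | 1 | 1 | 8 |
T. kodakarensis | NC006624 | ≈ 2.09 Mb | 2 314 | 2 322 | 9 |
T. kodakarensis | NC006624 | ≈ 2.09 Mb | 136 917 | 154 340 | 10 |
M. jannaschii | NC000909 | ≈ 1.66 Mb | 3 | 3 | 6 |
M. jannaschii | NC000909 | ≈ 1.66 Mb | 126 | 150 | 7 |
M. jannaschii | NC000909 | ≈ 1.66 Mb | 3 790 | 4 834 | 8 |
M. genitalium | NC000908 | ≈ 0.58 Mb | 5 | 5 | 6 |
M. genitalium | NC000908 | ≈ 0.58 Mb | 340 | 380 | 7 |
M. genitalium | NC000908 | ≈ 0.58 Mb | 6 156 | 8 733 | 8 |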
Conclusion
Words absent from DNA data have been the subject of recent studies [1–3]. In this paper, we provided a precise characterization of a class of absent words, named minimal absent words, that extends the previously discussed class of nullomers/unwords. Our minimal absent words share with nullomers/unwords the property of being minimal, that is, the removal of one character from either end of the word yields an existing word. The set of minimal absent words is much larger than the set of nullomers/unwords, and, therefore, potentially more useful for applications that require a richer variety of absent words. We also proposed an algorithm for generating the minimal absent words that is based on suffix arrays and that, in practice, runs in approximately linear time. We hope that this algorithm and the concept of minimal absent word may shed some more light on the structure of absent words and complement the existing studies on the topic.
Declarations
Acknowledgements
This work was supported in part by the Fundação para a Ciência e a Tecnologia (FCT).
The implementation of our algorithm, available at ftp://www.ieeta.pt/~ap/maws, includes code obtained from M. Douglas McIlroy's website http://www.cs.dartmouth.edu/~doug/sarray/.
References
1. Hampikian G, Andersen T: Absent sequences: Nullomers and primes. Pacific Symposium on Biocomputing 2007, 12:355–366.
2. Acquisti C, Poste G, Curtiss D, Kumar S: Nullomers: really a matter of natural selection? PLoS ONE 2007, 2(10).
3. Herold J, Kurtz S, Giegerich R: Efficient computation of absent words in genomic sequences. BMC Bioinformatics 2008, 9(167).
4. Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997.
5. Weiner P: Linear pattern matching algorithm. 14th Annual IEEE Symposium on Switching and Automata Theory 1973, 1–11.
6. McCreight EM: A space-economical suffix tree construction algorithm. Journal of the ACM 1976, 23(2):262–272.
7. Ukkonen E: On-line construction of suffix trees. Algorithmica 1995, 14(3):249–260.
8. Kurtz S: Reducing the space requirement of suffix trees. Software–Practice and Experience 1999, 29(13):1149–1171.
9. Abouelhoda MI, Kurtz S, Ohlebusch E: Enhanced suffix arrays and applications. In Handbook of Computational Molecular Biology, Computer and Information Science Series. Edited by: Aluru S. Chapman & Hall/CRC; 2006.
10. Manber U, Myers G: Suffix arrays: a new method for on-line string searches. Proc of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms 1990, 319–327.
11. Manber U, Myers G: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 1993, 22(5):935–948.
12. Kärkkäinen J, Sanders P: Simple linear work suffix array construction. In Proc 30th Int Conf on Automata, Languages and Programming of LNCS. Volume 2719. Springer-Verlag; 2003:943–955.
13. Kim DK, Sim JS, Park H, Park K: Linear-time construction of suffix arrays. In Combinatorial Pattern Matching: Proc. of the 14th Annual Symposium, LNCS. Volume 2676. Springer-Verlag; 2003:186–199.
14. Kim DK, Sim JS, Park H, Park K: Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching: Proc. of the 14th Annual Symposium, of LNCS. Volume 2656. Springer-Verlag; 2003:200–210.
15. Burrows M, Wheeler DJ: A block-sorting lossless data compression algorithm. Digital Systems Research Center; 1994.
16. Abouelhoda MI, Kurtz S, Ohlebusch E: The enhanced suffix array and its applications to genome analysis. In Algorithms in Bioinformatics: Proc. of the 2nd Workshop, of LNCS. Volume 2452. Rome, Italy: Springer-Verlag; 2002:449–463.
17. Kasai T, Lee G, Arimura H, Arikawa S, Park K: Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial Pattern Matching: Proc. of the 12th Annual Symposium, of LNCS. Volume 2089. Springer-Verlag; 2001:182–192.
18. Shi F: Suffix arrays for multiple strings: a method for on-line multiple string searches. In Concurrency and Parallelism, Programming, Networking, and Security: 2nd Asian Computing Science Conference, ASIAN'96, of LNCS. Volume 1179. Springer-Verlag; 1996:11–22.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.