 Research article
 Open Access
On finding minimal absent words
Armando J Pinho, Paulo JSG Ferreira, Sara P Garcia and João MOS Rodrigues
https://doi.org/10.1186/1471-2105-10-137
© Pinho et al; licensee BioMed Central Ltd. 2009
 Received: 06 January 2009
 Accepted: 08 May 2009
 Published: 08 May 2009
Abstract
Background
The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones.
Results
We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws.
Conclusion
Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
Keywords
 Suffix
 Internal Node
 Maximal Repeat
 Suffix Tree
 Random String
Background
There has been recent interest in absent words in DNA sequences, which are words that do not occur in a given genome. At the individual level, such words can be used as biomarkers for potential preventive and curative medical applications as derived from personal genomics efforts, while at the group level the comparison of genetic traits may impact, for example, on population genetics, or evolutionary profiles obtained from comparative genomics. It is therefore not surprising that absent words have been the subject of recent studies [1–3].
Hampikian and Andersen [1] used the term "nullomer" to designate the shortest words that do not occur in a given genome and the term "prime" to refer to the shortest words that are absent from the entire known genetic data. Herold et al. [3] used the term "unword" also to designate the shortest absent words. According to the definition, any given DNA sequence has nullomers/unwords of a certain size, that are uniquely defined for that sequence, and also of the shortest possible size.
The algorithm used by Hampikian and Andersen [1] to obtain the absent words tracks the occurrence of all possible words up to a user-specified length limit n, using a set of 4^{n} counters for the 4^{n} possible words of length n. This yields the existing absent words up to the given length limit, n. The approach taken by Herold et al. [3] has some computational advantages over that of Hampikian and Andersen [1], being less demanding in terms of memory and processing time.
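The counter-based idea can be illustrated with a minimal Python sketch. It is not the implementation used in [1]: the function name `shortest_absent_words` is hypothetical, reverse complements are ignored, and the counters are held in a dictionary rather than a flat array.

```python
from itertools import product

def shortest_absent_words(seq, max_len, alphabet="ACGT"):
    """Return (n, words) for the smallest n <= max_len that has absent words."""
    for n in range(1, max_len + 1):
        # One counter per possible word of length n (len(alphabet)**n counters).
        counts = {"".join(w): 0 for w in product(alphabet, repeat=n)}
        for start in range(len(seq) - n + 1):
            word = seq[start:start + n]
            if word in counts:          # windows with other symbols are skipped
                counts[word] += 1
        absent = sorted(w for w, c in counts.items() if c == 0)
        if absent:
            return n, absent
    return None, []                     # every word up to max_len occurs

n, words = shortest_absent_words("ACTAACTG", 3)
print(n, words)
```

The memory needed grows as 4^{n}, which is why this approach requires a length limit.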
In this paper, we generalize the concept of nullomer/unword, such that other words, not necessarily the shortest ones, can be included (for a precise definition see Definition 3). In fact, the original definition adopted by Hampikian and Andersen [1] and by Herold et al. [3] might be too limiting, because there are sequences that have only a few nullomers/unwords. For example, and according to the results presented in [3], the genome of the worm, Caenorhabditis elegans, has two nullomers/unwords, whereas the genome of the extreme thermophile, Thermococcus kodakarensis, has only one.
As stated by Herold et al. [3], longer absent words may also be of interest. For generating those longer absent words, they propose adding all unwords (say, of size k) as additional sequences to the genome and rerunning the program. These additional absent words, which we call generic absent words, also include extended nullomers/unwords, i.e., words that contain nullomers/unwords. However, not all generic absent words are trivial extensions of nullomers/unwords.
Nullomers/unwords satisfy the following property, P.
Property 1 ( P ). If the leftmost or the rightmost character of a given nullomer/unword is removed, then the resulting word is no longer an absent word.
This property P does not hold for the absent words obtained by trivially extending nullomers/unwords nor for the longer absent words suggested by Herold et al. [3]. In other words, for a generic absent word, there is no way of knowing in advance if the elimination of some characters from one of the extremities of the word yields an absent word or not.
We have developed an efficient algorithm for computing these minimal absent words, which, in practice, runs in approximately linear time. Our work can be regarded as a complement to the works of Hampikian and Andersen [1] and of Herold et al. [3], in the sense that it provides a generalization of the nullomer/unword concept previously introduced, and helps to clarify their structure.
Methods
Basic definitions
Let S be a string over a finite alphabet Σ. We denote by S[p], 1 ≤ p ≤ |S|, the pth character of S, where |S| designates the length (i.e., number of characters) of S, and by S[p_{1}..p_{2}], p_{1} ≤ p_{2}, the substring of S that starts at position p_{1} and ends at position p_{2}. Therefore, S[1..p] denotes a prefix of S and S[p..|S|] a suffix. Sr indicates the concatenation of character r to the right end of string S, whereas lS indicates the concatenation of character l to the left end of S.
For convenience, we define two additional virtual characters, # and $. They are virtual in the sense that they do not belong to the alphabet Σ. By definition, the character to the left of the first character of the string is #, and the character to the right of the last character of the string is $. In other words, we define S[0] = # and S[|S| + 1] = $.
Let 𝒫_{α}, where α is a substring of S, be the set of positions of S where α occurs, so that S[p..p + |α| − 1] = α, ∀p ∈ 𝒫_{α} and S[p..p + |α| − 1] ≠ α, ∀p ∉ 𝒫_{α}. We define ℒ'_{α} = {S[p − 1] : p ∈ 𝒫_{α}} and ℛ'_{α} = {S[p + |α|] : p ∈ 𝒫_{α}} to be the sets of characters that appear, respectively, to the immediate left and right of the several occurrences of α, and the sets ℒ_{α} = ℒ'_{α} ∩ Σ and ℛ_{α} = ℛ'_{α} ∩ Σ. We also denote by ℰ_{α} ⊆ Σ × Σ the set of all pairs of characters (S[p − 1], S[p + |α|]), ∀p ∈ 𝒫_{α}, for which neither character is virtual, i.e., all pairs of characters "enclosing" the occurrences of α.
Definition 1 (Maximal repeated pair [4]). A maximal repeated pair in a string S is a triple (p_{1}, p_{2}, α), such that p_{1} ≠ p_{2}, p_{1}, p_{2} ∈ 𝒫_{α}, S[p_{1} − 1] ≠ S[p_{2} − 1] and S[p_{1} + |α|] ≠ S[p_{2} + |α|].
Definition 2 (Maximal repeat [4]). A substring α is a maximal repeat of S if there is at least a maximal repeated pair in S of the form (p_{1}, p_{2}, α).
Characterization
We are now ready to formally introduce the concept of minimal absent word.
Definition 3 (Minimal absent word). A string γ, |γ| ≥ 3, is a minimal absent word of S if γ is not a substring of S, but γ[2..|γ|] and γ[1..|γ| − 1] are substrings of S.
Note that the set of generic absent words is too large to be of any practical interest.
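Definition 3 can be checked directly on small inputs. The following Python sketch is a brute-force illustration only, not the algorithm proposed in this paper; the function names are hypothetical. It extends every substring of S by one character to the right and keeps the words that satisfy the definition.

```python
def is_minimal_absent_word(word, s):
    """Definition 3: word is absent from s, but both one-character trims occur."""
    return (len(word) >= 3
            and word not in s
            and word[1:] in s       # leftmost character removed
            and word[:-1] in s)     # rightmost character removed

def minimal_absent_words_bruteforce(s, alphabet="ACGT"):
    """Try every substring of s extended by one character to the right."""
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    found = set()
    for prefix in substrings:       # prefix plays the role of the word minus its last character
        for r in alphabet:
            if is_minimal_absent_word(prefix + r, s):
                found.add(prefix + r)
    return sorted(found)

print(minimal_absent_words_bruteforce("ACTAACTG"))
```

For S = ACTAACTG this returns AAA, AACTA and TAC, the three minimal absent words discussed for this example string.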
Theorem 1. If lαr is a minimal absent word of string S, then α is a maximal repeat in S.
Proof. According to Definition 3, if lαr is a minimal absent word of S, then lα and αr are substrings of S, i.e., lα = S[p_{1}..p_{1} + |α|] and αr = S[p_{2}..p_{2} + |α|], with p_{2} ≠ p_{1} + 1 (if p_{2} = p_{1} + 1 then lαr would be a substring of S, contradicting the assumption that it is a minimal absent word). Now consider that the character to the immediate right of lα is r' = S[p_{1} + |α| + 1] and that the character to the immediate left of αr is l' = S[p_{2} − 1]. Because lαr does not exist in S, l' cannot be the same character as l and r' cannot be the same character as r, implying that (p_{1} + 1, p_{2}, α) is a maximal repeated pair and, therefore, α is a maximal repeat in S. □
Note that this applies to minimal absent words with at least three characters, according to Definition 3. The restriction could be removed by allowing α to be the empty string. For the sake of clarity, we do not consider this case here. In fact, directly finding minimal absent words of length two requires |Σ|^{2} string matching operations, which can be performed in a reasonable time, unless the size of the alphabet is unusually large. This is why in Definition 3 we restricted the size of a minimal absent word to be at least three.
Theorem 2. A string lαr is a minimal absent word of S if and only if (l, r) ∉ ℰ_{α}, for l ∈ ℒ_{α} and r ∈ ℛ_{α}.
Proof. If (l, r) ∉ ℰ_{α}, then in none of the occurrences of α in S do we have, simultaneously, the character l to the immediate left of α and the character r to its immediate right, implying that lαr does not occur in S. On the other hand, since l ∈ ℒ_{α}, there is at least one position in S where the substring lα occurs, and the same holds for the substring αr, because r ∈ ℛ_{α}. Therefore, according to Definition 3, lαr is a minimal absent word of S.
Now, consider that lαr is a minimal absent word of S and (l, r) ∈ ℰ_{α}. In that case, there would be a substring lαr in S, contradicting the assumption that lαr is a minimal absent word. □
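Theorem 2 can be illustrated for a single substring α with a short Python sketch. The helper names `enclosing_sets` and `maws_around` are hypothetical, the occurrences of α are found by plain string search, and the input is assumed not to contain the virtual characters # and $.

```python
def enclosing_sets(s, alpha):
    """Return the left set, right set and enclosing pairs of alpha in s."""
    padded = "#" + s + "$"              # padded[p] is S[p] with 1-based p
    left, right, pairs = set(), set(), set()
    start = padded.find(alpha, 1)
    while start != -1:
        l, r = padded[start - 1], padded[start + len(alpha)]
        if l != "#":
            left.add(l)
        if r != "$":
            right.add(r)
        if l != "#" and r != "$":
            pairs.add((l, r))
        start = padded.find(alpha, start + 1)   # overlapping occurrences allowed
    return left, right, pairs

def maws_around(s, alpha):
    """Words l + alpha + r with l and r seen next to alpha but never together."""
    left, right, pairs = enclosing_sets(s, alpha)
    return sorted(l + alpha + r for l in left for r in right if (l, r) not in pairs)

print(maws_around("ACTAACTG", "A"), maws_around("ACTAACTG", "ACT"))
```

For S = ACTAACTG, α = A yields AAA and TAC, and α = ACT yields AACTA, whereas α = CT yields nothing because both of its occurrences are preceded by A.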
Finding the minimal absent words
Theorem 1 states that all minimal absent words are associated with maximal repeats. Therefore, finding all minimal absent words may be associated with finding all maximal repeats in a string, which can be done using suffix trees in O(|S|) time [4]. Moreover, suffix trees can be built and stored also in O(|S|) time/memory, respectively [5–7]. See [4] for an introduction to suffix trees.
Suffix trees
A suffix tree of a string S is a rooted directed tree with exactly |S| leaves (numbered 1 to |S|). Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. For any leaf p, the concatenation of the edge-labels on the path from the root to leaf p corresponds to the suffix that starts at position p, i.e., to S[p..|S|].
The condition stating that each internal node, other than the root, should have at least two children, implies that some strings do not have a suffix tree representation. In fact, this condition is violated in strings having a suffix that is a prefix of another suffix. To remedy this, a character that does not appear in any other position of S is usually appended at the end of the string (the "$" character is frequently used for this purpose [4]).
Definition 4 (Left character [4]). For each position p in S, character S[p − 1] is called the left character of p. The left character of a leaf of the suffix tree is the left character of the suffix represented by that leaf.
The characters appearing inside parentheses near the leaves of the suffix tree of Fig. 2 are the corresponding left characters. Notice the # character associated with leaf number one, corresponding to S[0].
Definition 5 (Left diverse [4]). A node v of the suffix tree is called left diverse if at least two leaves in v's subtree have different left characters.
In the suffix tree depicted in Fig. 2, nodes v_{1} and v_{2} are the only left diverse nodes. According to Theorem 7.12.2 of [4], the string α labeling the path to a node v of the suffix tree is a maximal repeat if and only if v is left diverse. Therefore, in Fig. 2, the substrings formed along the paths from the root node to each of the two nodes v_{1} and v_{2} correspond to maximal repeats. Those strings are A and ACT, which are the base of the minimal absent words AAA, TAC and AACTA. Recall that, in a string S, there might be, at most, |S| maximal repeats (Theorem 7.12.1 in [4]). This implies that the number of minimal absent words of a string S is upper bounded by |S||Σ|^{2}.
Suffix arrays
Suffix trees are a powerful data structure that allowed important advances in string processing [4]. However, the space required by a suffix tree, although growing linearly with the size of the string, might still be excessive for some applications [8, 9].
Suffix arrays are an alternative data structure that is more space efficient (4 bytes per input character for strings of size up to 2^{32}, in its basic form). However, to increase the efficiency of certain tasks, they might require auxiliary information [9]. Introduced in [10, 11], the suffix arrays can be constructed in linear time from the corresponding suffix tree [4] or using direct algorithms [12–14].
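Table 1. Suffix array p_{k} and auxiliary information, in this case the lcp and bwt arrays, for S = ACTAACTG.

| k | p_{k} | lcp_{k} | bwt_{k} | Suffix starting at p_{k} |
|---|---|---|---|---|
| 1 | 4 | 0 | T | AACTG |
| 2 | 1 | 1 | # | ACTAACTG |
| 3 | 5 | 3 | A | ACTG |
| 4 | 2 | 0 | A | CTAACTG |
| 5 | 6 | 2 | A | CTG |
| 6 | 8 | 0 | T | G |
| 7 | 3 | 0 | C | TAACTG |
| 8 | 7 | 1 | C | TG |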
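Table 2. Generalized suffix array p_{k} and auxiliary information for strings S_{1} = ACTAACTG and S_{2} = CGTACTA.

| k | p_{k} | lcp_{k} | bwt_{k} | Suffix starting at p_{k} |
|---|---|---|---|---|
| 1 | 16 | 0 | T | A |
| 2 | 4 | 1 | T | AACTG |
| 3 | 13 | 1 | T | ACTA |
| 4 | 1 | 4 | # | ACTAACTG |
| 5 | 5 | 3 | A | ACTG |
| 6 | 10 | 0 | # | CGTACTA |
| 7 | 14 | 1 | A | CTA |
| 8 | 2 | 3 | A | CTAACTG |
| 9 | 6 | 2 | A | CTG |
| 10 | 8 | 0 | T | G |
| 11 | 11 | 1 | C | GTACTA |
| 12 | 15 | 0 | C | TA |
| 13 | 3 | 2 | C | TAACTG |
| 14 | 12 | 2 | G | TACTA |
| 15 | 7 | 1 | C | TG |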
The lcp-array contains the lengths of the longest common prefix between consecutive ordered suffixes, i.e., lcp_{k} indicates the length of the longest common prefix between S[p_{k−1}..|S|] and S[p_{k}..|S|], 2 ≤ k ≤ |S|. By convention, lcp_{1} = lcp_{|S|+1} = 0.
The bwt-array is a permutation of S, such that bwt_{k} = S[p_{k} − 1]. Remember that the character to the immediate left of S[1] has been defined to be #, a convention that explains the value of bwt_{k} for p_{k} = 1. Conceptually, the bwt-array does not provide additional information, because the left character of any character of S can be determined by direct access to S. In fact, in this paper, we use both notations, bwt_{k} and S[p_{k} − 1], interchangeably. However, in practice, the bwt-array allows sequential memory access and hence improves performance, due to better cache use [16].
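The three arrays can be reproduced for the example string with a naive Python sketch. It sorts the suffixes directly, which is far from the linear-time constructions cited above and is meant only to make the definitions concrete; the function name `suffix_array_tables` is hypothetical, and index 0 of each list is unused so that the indices match the 1-based notation of the text.

```python
def suffix_array_tables(s):
    """Return 1-based lists p, lcp and bwt for s, built by naive suffix sorting."""
    n = len(s)
    p = [0] + sorted(range(1, n + 1), key=lambda start: s[start - 1:])
    lcp = [0] * (n + 2)                 # lcp[1] = lcp[n + 1] = 0 by convention
    for k in range(2, n + 1):
        a, b = s[p[k - 1] - 1:], s[p[k] - 1:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    # Left character of each suffix; '#' stands for the virtual S[0].
    bwt = [""] + [s[p[k] - 2] if p[k] > 1 else "#" for k in range(1, n + 1)]
    return p, lcp, bwt

p, lcp, bwt = suffix_array_tables("ACTAACTG")
for k in range(1, 9):
    print(k, p[k], lcp[k], bwt[k], "ACTAACTG"[p[k] - 1:])
```

The printed rows agree with the suffix array, lcp and bwt values listed for S = ACTAACTG.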
Definition 6 (Lcp-interval). Interval [i..j], 1 ≤ i < j ≤ |S|, is an lcp-interval of lcp-depth d, denoted ⟨d, i, j⟩, if
1. lcp_{i} < d,
2. lcp_{k} ≥ d, ∀i < k ≤ j,
3. lcp_{k} = d, for at least one k in i < k ≤ j,
4. lcp_{j+1} < d.
The lcp-intervals of the example string S = ACTAACTG are ⟨1, 1, 3⟩ and ⟨1, 7, 8⟩ of lcp-depth 1, ⟨2, 4, 5⟩ of lcp-depth 2, and ⟨3, 2, 3⟩ of lcp-depth 3. Note that each of these lcp-intervals corresponds to a distinct internal node of the suffix tree (see Fig. 2). For example, the lcp-interval ⟨1, 1, 3⟩ is associated with node v_{1}, whereas the lcp-interval ⟨3, 2, 3⟩ corresponds to node v_{2}. Therefore, we can think of a virtual tree of lcp-intervals having a structure similar to the corresponding suffix tree [16].
This correspondence between lcp-intervals and internal nodes of the suffix tree is important, because it helps to map some concepts from the suffix tree data structure into the suffix array approach. For example, the notion of left diverse node can be mapped directly into the lcp-intervals. Finding whether a node, associated with lcp-interval ⟨d, i, j⟩, is left diverse is the same as finding whether at least two characters of bwt_{k} differ, for i ≤ k ≤ j. Moreover, in that case, the corresponding maximal repeat is, for example, α = S[p_{i}..p_{i} + d − 1] (note that all substrings S[p_{k}..p_{k} + d − 1], ∀i ≤ k ≤ j, are identical).
Algorithm 1 (adapted from [16, 17]) generates all lcp-intervals using the lcp-array and a stack. The "Push" and "Pop" operations have the usual meaning in stack processing. The variable "top" refers to the lcp-interval, ⟨d, i, j⟩, on the top of the stack.
Algorithm 1. Computation of lcp-intervals.
Push ⟨0, 0, 0⟩
for k = 2 to |S| + 1 do {lcp_{|S|+1} = 0 closes the intervals still on the stack}
  i ← k − 1
  while lcp_{k} < top.d do
    lcpint ← Pop
    lcpint.j ← k − 1
    Process lcpint
    i ← lcpint.i
  end while
  if lcp_{k} > top.d then
    Push ⟨lcp_{k}, i, 0⟩
  end if
end for
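A minimal Python sketch of the stack procedure of Algorithm 1 follows. The generator name `lcp_intervals` is hypothetical, and the lcp list is assumed to be 1-based with the sentinel lcp_{|S|+1} = 0 present. The sketch runs the loop up to k = |S| + 1 so that this sentinel pops the intervals that end at the last suffix, such as ⟨1, 7, 8⟩ in the example.

```python
def lcp_intervals(lcp, n):
    """Yield (d, i, j) for each lcp-interval of depth d > 0.

    lcp is indexed from 1 (index 0 unused) and must satisfy lcp[n + 1] == 0.
    """
    stack = [(0, 0)]                    # (lcp-depth, left boundary); bottom sentinel
    for k in range(2, n + 2):
        i = k - 1
        while lcp[k] < stack[-1][0]:
            d, i = stack.pop()          # the popped interval ends at k - 1
            yield (d, i, k - 1)
        if lcp[k] > stack[-1][0]:
            stack.append((lcp[k], i))   # opens an interval starting at i

# lcp-array of S = ACTAACTG, with index 0 unused and the sentinel at index 9.
example_lcp = [0, 0, 1, 3, 0, 2, 0, 0, 1, 0]
print(list(lcp_intervals(example_lcp, 8)))
```

The right boundary j is only known when an interval is popped, so the sketch stores pairs on the stack instead of triples with a placeholder 0.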
In order to find the minimal absent words, the function "Process" in Algorithm 1 executes Algorithm 2, which builds the ℒ_{α}, ℛ_{α} and ℰ_{α} sets, determines whether the lcp-interval is left diverse, and, if so, outputs the minimal absent words associated with the lcp-interval ⟨d, i, j⟩.
Algorithm 2. Computation of the minimal absent words for a given lcp-interval ⟨d, i, j⟩, where α = S[p_{i}..p_{i} + d − 1].
ℒ'_{α} ← ∅ {left characters of α, possibly including #}
ℛ'_{α} ← ∅ {right characters of α, possibly including $}
ℰ_{α} ← ∅
for k = i to j do
  ℒ'_{α} ← ℒ'_{α} ∪ {S[p_{k} − 1]}
  ℛ'_{α} ← ℛ'_{α} ∪ {S[p_{k} + d]}
  if S[p_{k} − 1] ≠ # and S[p_{k} + d] ≠ $ then
    ℰ_{α} ← ℰ_{α} ∪ {(S[p_{k} − 1], S[p_{k} + d])}
  end if
end for
if |ℒ'_{α}| > 1 then {Left diverse}
  for all l ∈ ℒ_{α} do
    for all r ∈ ℛ_{α} do
      if (l, r) ∉ ℰ_{α} then
        Substring lαr is a minimal absent word
      end if
    end for
  end for
end if
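Algorithms 1 and 2 can be combined into a short end-to-end Python sketch. It is an illustration under simplifying assumptions, not the publicly available implementation: the suffix array and lcp-array are built by naive sorting rather than in linear time, the function name `minimal_absent_words` is hypothetical, and the input is assumed not to contain the virtual characters # and $.

```python
def minimal_absent_words(s):
    """Minimal absent words (length >= 3) of s, via lcp-intervals."""
    n = len(s)
    padded = "#" + s + "$"                      # padded[q] is S[q], q = 0..n+1
    p = [0] + sorted(range(1, n + 1), key=lambda q: s[q - 1:])
    lcp = [0] * (n + 2)                         # lcp[1] = lcp[n + 1] = 0
    for k in range(2, n + 1):
        a, b = s[p[k - 1] - 1:], s[p[k] - 1:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1

    def process(d, i, j):
        """Algorithm 2 for the lcp-interval <d, i, j>."""
        alpha = s[p[i] - 1:p[i] - 1 + d]
        left_all, right_all, pairs = set(), set(), set()
        for k in range(i, j + 1):
            l, r = padded[p[k] - 1], padded[p[k] + d]
            left_all.add(l)
            right_all.add(r)
            if l != "#" and r != "$":
                pairs.add((l, r))
        if len(left_all) > 1:                   # left diverse
            for l in left_all - {"#"}:
                for r in right_all - {"$"}:
                    if (l, r) not in pairs:
                        yield l + alpha + r

    words = []
    stack = [(0, 0)]                            # Algorithm 1, (depth, left boundary)
    for k in range(2, n + 2):
        i = k - 1
        while lcp[k] < stack[-1][0]:
            d, i = stack.pop()
            words.extend(process(d, i, k - 1))
        if lcp[k] > stack[-1][0]:
            stack.append((lcp[k], i))
    return sorted(words)

print(minimal_absent_words("ACTAACTG"))
```

For S = ACTAACTG the sketch reports AAA, AACTA and TAC, which come from the two left diverse lcp-intervals ⟨1, 1, 3⟩ and ⟨3, 2, 3⟩.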
Our results remain valid for a set of strings {S_{1}, S_{2},..., S_{z}} over a finite alphabet Σ. In this case, the minimal absent words are generated through the concatenation of the strings using a delimiting character not belonging to the alphabet. The delimiter avoids the creation of artificial substrings across string boundaries.
Table 2 shows the (generalized) suffix array [18] associated to strings S_{1} and S_{2}.
Results and discussion
The graphic displayed in Fig. 4 shows an apparently curious behavior: the time taken by the algorithm increases as the size of the alphabet decreases. This might be due to the fact that, for the same string length, strings over smaller alphabets imply deeper suffix trees and, since the lcp-intervals are related to the internal nodes of the suffix tree, generating them for smaller alphabets takes longer.
From the curves presented in Fig. 4, it can be seen that, in practice, the running time of the algorithm is approximately linear with the length of the string. Moreover, the number of minimal absent words (displayed in Fig. 3) also shows a similar behavior, contrasting with the exponential growth of the number of generic absent words of a string.
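Table 3. Number of minimal absent words and generic absent words for some genomes.

| Organism | Reference | Genome size | Minimal absent words | Generic absent words | Length, n |
|---|---|---|---|---|---|
| H. sapiens | Release 36.1 | ≈ 2.9 Gb | 104 | 104 | 11 |
| | | | 44 149 | 44 970 | 12 |
| | | | 2 039 862 | 2 368 682 | 13 |
| M. musculus | Release m36.1 | ≈ 2.6 Gb | 190 | 190 | 11 |
| | | | 52 087 | 53 573 | 12 |
| | | | 2 192 708 | 2 579 838 | 13 |
| D. melanogaster | FB 5 | ≈ 162 Mb | 104 | 104 | 11 |
| | | | 172 849 | 173 674 | 12 |
| | | | 10 040 282 | 11 335 034 | 13 |
| C. elegans | WB 170 | ≈ 100 Mb | 2 | 2 | 10 |
| | | | 7 664 | 7 680 | 11 |
| | | | 1 092 286 | 1 151 728 | 12 |
| N. crassa | Assembly 7 | ≈ 39 Mb | 2 262 | 2 262 | 11 |
| | | | 1 064 938 | 1 082 787 | 12 |
| | | | 20 213 298 | 27 903 272 | 13 |
| S. cerevisiae S228C | SGD 1 | ≈ 12 Mb | 2 | 2 | 9 |
| | | | 6 435 | 6 450 | 10 |
| | | | 414 520 | 462 882 | 11 |
| S. aureus MSSA476 | NC_002953 | ≈ 2.8 Mb | 248 | 248 | 8 |
| | | | 11 908 | 13 744 | 9 |
| | | | 162 113 | 251 497 | 10 |
| T. kodakarensis | NC_006624 | ≈ 2.09 Mb | 1 | 1 | 8 |
| | | | 2 314 | 2 322 | 9 |
| | | | 136 917 | 154 340 | 10 |
| M. jannaschii | NC_000909 | ≈ 1.66 Mb | 3 | 3 | 6 |
| | | | 126 | 150 | 7 |
| | | | 3 790 | 4 834 | 8 |
| M. genitalium | NC_000908 | ≈ 0.58 Mb | 5 | 5 | 6 |
| | | | 340 | 380 | 7 |
| | | | 6 156 | 8 733 | 8 |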
Conclusion
Words absent from DNA data have been the subject of recent studies [1–3]. In this paper, we provided a precise characterization of a class of absent words, named minimal absent words, that extends the class previously discussed of nullomers/unwords. Our minimal absent words share with nullomers/unwords the property of being minimal, that is, the removal of one character from either end of a nullomer/unword yields an existing word. The set of minimal absent words is much larger than the set of nullomers/unwords, and, therefore, potentially more useful for applications that require a richer variety of absent words. We also proposed an algorithm for generating the minimal absent words that is based on suffix arrays and that, in practice, runs in approximately linear time. We hope that this algorithm and the concept of minimal absent word may shed some more light on the structure of absent words and complement the existing studies on the topic.
Declarations
Acknowledgements
This work was supported in part by the Fundação para a Ciência e a Tecnologia (FCT).
The implementation of our algorithm, available at ftp://www.ieeta.pt/~ap/maws, includes code obtained from M. Douglas McIlroy's website http://www.cs.dartmouth.edu/~doug/sarray/.
References
1. Hampikian G, Andersen T: Absent sequences: Nullomers and primes. Pacific Symposium on Biocomputing 2007, 12:355–366.
2. Acquisti C, Poste G, Curtiss D, Kumar S: Nullomers: really a matter of natural selection? PLoS ONE 2007, 2(10).
3. Herold J, Kurtz S, Giegerich R: Efficient computation of absent words in genomic sequences. BMC Bioinformatics 2008, 9(167).
4. Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press; 1997.
5. Weiner P: Linear pattern matching algorithms. 14th Annual IEEE Symposium on Switching and Automata Theory 1973, 1–11.
6. McCreight EM: A space-economical suffix tree construction algorithm. Journal of the ACM 1976, 23(2):262–272.
7. Ukkonen E: On-line construction of suffix trees. Algorithmica 1995, 14(3):249–260.
8. Kurtz S: Reducing the space requirement of suffix trees. Software–Practice and Experience 1999, 29(13):1149–1171.
9. Abouelhoda MI, Kurtz S, Ohlebusch E: Enhanced suffix arrays and applications. In Handbook of Computational Molecular Biology, Computer and Information Science Series. Edited by: Aluru S. Chapman & Hall/CRC; 2006.
10. Manber U, Myers G: Suffix arrays: a new method for on-line string searches. Proc of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms 1990, 319–327.
11. Manber U, Myers G: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 1993, 22(5):935–948.
12. Kärkkäinen J, Sanders P: Simple linear work suffix array construction. In Proc 30th Int Conf on Automata, Languages and Programming, of LNCS. Volume 2719. Springer-Verlag; 2003:943–955.
13. Kim DK, Sim JS, Park H, Park K: Linear-time construction of suffix arrays. In Combinatorial Pattern Matching: Proc. of the 14th Annual Symposium, of LNCS. Volume 2676. Springer-Verlag; 2003:186–199.
14. Ko P, Aluru S: Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching: Proc. of the 14th Annual Symposium, of LNCS. Volume 2676. Springer-Verlag; 2003:200–210.
15. Burrows M, Wheeler DJ: A block-sorting lossless data compression algorithm. Digital Systems Research Center; 1994.
16. Abouelhoda MI, Kurtz S, Ohlebusch E: The enhanced suffix array and its applications to genome analysis. In Algorithms in Bioinformatics: Proc. of the 2nd Workshop, of LNCS. Volume 2452. Rome, Italy: Springer-Verlag; 2002:449–463.
17. Kasai T, Lee G, Arimura H, Arikawa S, Park K: Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial Pattern Matching: Proc. of the 12th Annual Symposium, of LNCS. Volume 2089. Springer-Verlag; 2001:182–192.
18. Shi F: Suffix arrays for multiple strings: a method for on-line multiple string searches. In Concurrency and Parallelism, Programming, Networking, and Security: 2nd Asian Computing Science Conference, ASIAN'96, of LNCS. Volume 1179. Springer-Verlag; 1996:11–22.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.