Fig. 2 | BMC Bioinformatics

Fig. 2

From: PanDelos: a dictionary-based method for pan-genome content discovery

Examples of indexing structures. The top of the image (a) shows the ESA+N indexing structure for the string WLLPPP. The string is indexed by lexicographically sorting its suffixes, and the SA, LCP and N arrays are computed according to this ordering. The indexing structure is composed of these three arrays; the other columns shown in the image are virtually extracted. The SA array stores the start positions of the suffixes and is used to keep track of the lexicographic ordering. Values in the LCP and N arrays are used to identify intervals that correspond to specific k-mers [30]. The 1-mer L is identified by an interval that covers the first two positions of the structure, while the 1-mer P covers three positions and the 1-mer W covers one position. Thus, the multiplicities of L, P and W are 2, 3 and 1, respectively. The 2-mer intervals are shown in the second column from the left. Note that the third position is not covered when 2-mer intervals are extracted, because the suffix stored there is the single final character P and so cannot be the start of any 2-mer. The second section of the image (b, c, d, e, f) shows the extension of the indexing structure to manage a set of strings. Four input strings, s1, s2, t1 and t2, are indexed. First, a global string is built by concatenating the four strings, with a special symbol (represented as N) placed at each concatenation joint. Then, as in the single-string case, the suffixes of the global sequence are sorted in lexicographic order. The sorting defines the content of the SA array, and the LCP, N and SID arrays are computed in accordance with it. The SID array records, for each suffix, the sequence from which it originates. The indexing structure helps in extracting the information, namely the multiplicities of 2-mers in every sequence, that is ideally represented in matrix b. Illustrations d, e and f show the final values that the matrices M, P1 and P−1 take after every 2-mer of the global sequence has been taken into account
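The interval idea in the caption can be illustrated with a minimal Python sketch. It is not the PanDelos implementation: it builds SA and LCP naively, omits the N array (a length check and a separator check stand in for it here, which is an assumption of this sketch), and all function names are hypothetical. The two-string input in the usage note is an invented example, not the s1, s2, t1, t2 of the figure.

```python
def suffix_array(s):
    # Start positions of the suffixes of s, in lexicographic order (naive sort).
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # lcp[r] = length of the common prefix of the suffixes in rows r-1 and r.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = s[sa[r - 1]:], s[sa[r]:]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[r] = n
    return lcp


def kmer_intervals(s, sa, lcp, k):
    # Each interval is (k-mer, first row, number of rows); the number of
    # rows is the multiplicity of the k-mer in s.
    intervals = []
    for row, pos in enumerate(sa):
        if pos + k > len(s):
            continue  # suffix shorter than k: cannot start a k-mer
        if row > 0 and lcp[row] >= k:
            kmer, first, count = intervals[-1]
            intervals[-1] = (kmer, first, count + 1)  # same k-mer as previous row
        else:
            intervals.append((s[pos:pos + k], row, 1))
    return intervals


def per_sequence_counts(strings, k, sep="N"):
    # Global string with a separator at each joint, plus a SID array that
    # maps every position to the index of its originating string.
    glob = sep.join(strings)
    sid = []
    for idx, s in enumerate(strings):
        sid.extend([idx] * len(s))
        sid.append(None)  # position of a separator
    sid.pop()  # no separator after the last string
    sa = suffix_array(glob)
    lcp = lcp_array(glob, sa)
    counts = {}
    for kmer, first, count in kmer_intervals(glob, sa, lcp, k):
        if sep in kmer:
            continue  # k-mers spanning a joint are not real k-mers
        row_counts = [0] * len(strings)
        for row in range(first, first + count):
            row_counts[sid[sa[row]]] += 1
        counts[kmer] = row_counts
    return counts


s = "WLLPPP"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
one_mers = kmer_intervals(s, sa, lcp, 1)
two_mers = kmer_intervals(s, sa, lcp, 2)
```

For WLLPPP, `one_mers` gives multiplicities 2, 3 and 1 for L, P and W, and `two_mers` skips row 2 (the third position), matching the caption. Calling `per_sequence_counts(["WLLP", "LLPP"], 2)` on the invented pair returns one row of per-string multiplicities for each 2-mer, in the spirit of matrix b.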
