Table 3 Representative search times for K = 4

From: MICA: desktop software for comprehensive searching of DNA databases

| Query | Chromosome 1: Time (sec) | Chromosome 1: Hits | Human Genome: Time (sec) | Human Genome: Hits |
|---|---|---|---|---|
| Nondegenerate 3-mers | 0.82 [0.51] | 6.96 × 10^6 | 11.5 | 8.90 × 10^7 |
| Nondegenerate 4-mers | 0.13 [0.028] | 1.69 × 10^6 | 2.5 | 2.17 × 10^7 |
| Nondegenerate 6-mers | 0.35 [0.11] | 160,702 | 5.8 | 2.05 × 10^6 |
| Nondegenerate 8-mers | 0.38 [0.11] | 16,631 | 6.2 | 213,099 |
| Nondegenerate 15-mers | 0.56 [0.11] | 1.39 | 9.0 | 5.81 |
| Nondegenerate 30-mers | 0.54 [0.10] | 1.03 | 8.3 | 1.24 |
| Nondegenerate 100-mers | 0.41 [0.069] | 1.01 | 6.1 | 1.02 |
| Nondegenerate 1000-mers | 0.14 [0.019] | 1.00 | 2.3 | 1.00 |
| Alu 30-mer fragment | 0.77 [0.095] | 1,130 | 13.6 | 14,041 |
| GDGCHC (Bsp 1286I) | 0.43 [0.11] | 398,999 | 6.9 | 4,776,086 |
| GCCNNNNNGGC (Bgl I) | 0.37 [0.12] | 44,761 | 6.1 | 520,776 |
| ACNNNNGTAYC (Bae I) | 1.55 [0.34] | 20,243 | 23.2 | 259,837 |

  1. Both DNA strands were searched using K = 4. Results for the 3- to 1000-mer searches are average values obtained by searching with multiple queries. For 3-mers, all 64 possible nondegenerate queries were tested by extending each 3-mer to a partially degenerate 4-mer. For 4-mers, all 256 possible nondegenerate queries were tested. For 6- and 8-mers, 100 randomly chosen nondegenerate queries were tested. In the case of 15- to 1000-mers, each test involved 100 nondegenerate queries that were extracted randomly from chromosome 1 and checked to confirm that a given query had no more than 10 matches in the genome. The Alu 30-mer fragment GGCCGGGCGCGGTGGCTCACGCCTGTAATC is a conserved sequence found at the 5' ends of Alu repeat elements [14]. The three partially degenerate queries are the recognition sequences for the restriction enzymes Bsp 1286I, Bgl I, and Bae I. For chromosome 1, the search times without brackets were obtained after pre-loading only file elements A – C and E – J (see Table 1) into memory, and the faster search times with brackets were obtained after pre-loading the entire file. For the entire genome, the search times include the time needed to load elements A – C and E – J of each file into memory. Thus, the data for chromosome 1 reflect the time needed to search a file that is already open, whereas the data for the entire genome reflect the time needed to search a set of unopened files. To ensure that the search times without brackets reflect MICA performance for newly opened indexes, each search was preceded by a large number of extraneous reads, which flushed the main memory of any prior data from the relevant index.
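The query sets described in the note can be illustrated with a short Python sketch. It is not MICA's indexed search: it is a naive linear scan, shown only to make concrete what a partially degenerate query and a two-strand hit count mean. The function names are hypothetical, appending `N` is only one assumed way of extending a 3-mer to a partially degenerate 4-mer, and a site that matches at the same position on both strands is counted twice here, which may differ from how MICA reports hits.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a nondegenerate DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def matches_at(seq, pos, query):
    """True if the (possibly degenerate) query matches seq starting at pos."""
    return all(seq[pos + i] in IUPAC[code] for i, code in enumerate(query))


def count_hits(seq, query):
    """Count matches on both strands by naive scanning (illustration only)."""
    total = 0
    for strand in (seq, reverse_complement(seq)):
        for pos in range(len(strand) - len(query) + 1):
            if matches_at(strand, pos, query):
                total += 1
    return total


def all_kmers(k):
    """All 4**k nondegenerate k-mers, e.g. 256 queries for k = 4."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def extend_3mers_to_4mers():
    """Assumed extension: pad each of the 64 3-mers with a trailing N."""
    return [kmer + "N" for kmer in all_kmers(3)]
```

For instance, `count_hits(seq, "GDGCHC")` would scan a toy sequence for the Bsp 1286I recognition sequence on both strands, and `all_kmers(4)` enumerates the 256 queries averaged in the 4-mer row.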