MICA: desktop software for comprehensive searching of DNA databases

Stokes, William A; Glick, Benjamin S

doi:10.1186/1471-2105-7-427

BMC Bioinformatics

Table 3 Representative search times for K = 4

Query	Chromosome 1: time (sec)	Chromosome 1: hits	Human genome: time (sec)	Human genome: hits
Nondegenerate 3-mers	0.82 [0.51]	6.96 × 10⁶	11.5	8.90 × 10⁷
Nondegenerate 4-mers	0.13 [0.028]	1.69 × 10⁶	2.5	2.17 × 10⁷
Nondegenerate 6-mers	0.35 [0.11]	160,702	5.8	2.05 × 10⁶
Nondegenerate 8-mers	0.38 [0.11]	16,631	6.2	213,099
Nondegenerate 15-mers	0.56 [0.11]	1.39	9.0	5.81
Nondegenerate 30-mers	0.54 [0.10]	1.03	8.3	1.24
Nondegenerate 100-mers	0.41 [0.069]	1.01	6.1	1.02
Nondegenerate 1000-mers	0.14 [0.019]	1.00	2.3	1.00
Alu 30-mer fragment	0.77 [0.095]	1,130	13.6	14,041
GDGCHC (Bsp 1286I)	0.43 [0.11]	398,999	6.9	4,776,086
GCCNNNNNGGC (Bgl I)	0.37 [0.12]	44,761	6.1	520,776
ACNNNNGTAYC (Bae I)	1.55 [0.34]	20,243	23.2	259,837

Both DNA strands were searched using K = 4. Results for the 3- to 1000-mer searches are average values obtained by searching with multiple queries. For 3-mers, all 64 possible nondegenerate queries were tested by extending each 3-mer to a partially degenerate 4-mer. For 4-mers, all 256 possible nondegenerate queries were tested. For 6- and 8-mers, 100 randomly chosen nondegenerate queries were tested. In the case of 15- to 1000-mers, each test involved 100 nondegenerate queries that were extracted randomly from chromosome 1 and checked to confirm that a given query had no more than 10 matches in the genome.

The Alu 30-mer fragment GGCCGGGCGCGGTGGCTCACGCCTGTAATC is a conserved sequence found at the 5' ends of Alu repeat elements [14]. The three partially degenerate queries are the recognition sequences for the restriction enzymes Bsp 1286I, Bgl I, and Bae I.

For chromosome 1, the search times without brackets were obtained after pre-loading only file elements A–C and E–J (see Table 1) into memory, and the faster search times with brackets were obtained after pre-loading the entire file. For the entire genome, the search times include the time needed to load elements A–C and E–J of each file into memory. Thus, the data for chromosome 1 reflect the time needed to search a file that is already open, whereas the data for the entire genome reflect the time needed to search a set of unopened files. To ensure that the search times without brackets reflect MICA performance for newly opened indexes, each search was preceded by a large number of extraneous reads, which flushed the main memory of any prior data from the relevant index.
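The query classes described above can be illustrated with a short Python sketch. It is not MICA code and says nothing about how MICA searches its index. It only enumerates nondegenerate k-mers, expands partially degenerate queries using the standard IUPAC nucleotide codes, and computes the reverse complement that a both-strand search would also have to match. The helper names (`all_kmers`, `expand`, `reverse_complement`) are hypothetical. Appending `N` to extend a 3-mer is an assumption, since the legend states only that each 3-mer was extended to a partially degenerate 4-mer.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC code (e.g. D = A/G/T pairs with H = T/C/A).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N",
}


def all_kmers(k):
    """Return every nondegenerate k-mer over the alphabet A, C, G, T."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def expand(query):
    """Return every nondegenerate sequence matched by a degenerate query."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in query))]


def reverse_complement(query):
    """Return the reverse complement of a (possibly degenerate) query."""
    return "".join(COMPLEMENT[c] for c in reversed(query))


# Assumption: a 3-mer is extended to a 4-mer by appending the wildcard N.
three_mers_as_4mers = [kmer + "N" for kmer in all_kmers(3)]

if __name__ == "__main__":
    print(len(all_kmers(3)), len(all_kmers(4)))   # 64 256
    for site in ("GDGCHC", "GCCNNNNNGGC", "ACNNNNGTAYC"):
        print(site, len(expand(site)), reverse_complement(site))
```

Under these definitions, GDGCHC expands to 9 nondegenerate sequences, GCCNNNNNGGC to 1,024, and ACNNNNGTAYC to 512. The first two queries are identical to their own reverse complements, whereas the Bae I query is not (its reverse complement is GRTACNNNNGT), so a both-strand search for it involves two distinct patterns.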


ISSN: 1471-2105
