Table 1 MICA file structure and data storage requirements

From: MICA: desktop software for comprehensive searching of DNA databases

| File Element | Generic Storage Requirement (bytes) | Storage Requirement for Chromosome 1 (bytes) |
|---|---|---|
| Sequence Segment | | |
| A. Segment Format | 1 | 1 |
| B. Segment Size | 4 | 4 |
| C. Sequence Properties | 1 | 1 |
| D. DNA Sequence | L | 245,522,847 (234 MB) |
| SEGMENT TOTAL | 6 + L | 245,522,853 (234 MB) |
| Index Segment | | |
| E. Segment Format | 1 | 1 |
| F. Segment Size | 4 | 4 |
| G. Index Properties | 1 | 1 |
| H. Chunk Counts Summary | 4^(K+1) | K = 4: 1,024 (1 KB); K = 6: 16,384 (16 KB) |
| I. Degenerate K-mer Count | 4 | 4 |
| J. N-Stretch Count (S) | 4 | 4 |
| K. Chunk Data Array | (4^K * C + number of nondegenerate K-mers) * 2 | K = 4: 447,573,936 (427 MB); K = 6: 476,350,748 (454 MB) |
| L. Degenerate Data Array | (number of partially degenerate K-mers) * (4 + K) | K = 4: 1,752 (1.7 KB); K = 6: 3,650 (3.6 KB) |
| M. N-Stretch Data Array | 8S | 296 |
| SEGMENT TOTAL | Typically about 2L bytes. | K = 4: 447,577,022 (427 MB); K = 6: 476,371,092 (454 MB) |
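
The storage formulas in the table can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from MICA: the function names are hypothetical, and the inputs are the chromosome 1 figures from the table and its notes (L = 245,522,847, C = 3,747, S = 37). The source does not report the number of nondegenerate or partially degenerate K-mers, so the sketch back-calculates them from the tabulated byte counts.

```python
def sequence_segment_bytes(L):
    """Elements A-D: format (1) + size (4) + properties (1) + L sequence bytes."""
    return 1 + 4 + 1 + L


def chunk_data_array_bytes(K, C, nondegenerate_kmers):
    """Element K: one 2-byte count per (K-mer, chunk) pair plus one 2-byte position per K-mer hit."""
    return (4 ** K * C + nondegenerate_kmers) * 2


def index_segment_bytes(K, S, chunk_data_bytes, degenerate_kmers):
    """Elements E-M summed in file order."""
    header = 1 + 4 + 1                        # E, F, G
    chunk_counts = 4 ** (K + 1)               # H: 4^K counts of 4 bytes each
    counters = 4 + 4                          # I, J
    degenerate = degenerate_kmers * (4 + K)   # L: 4-byte position + K-byte string
    n_stretches = 8 * S                       # M: S pairs of 4-byte integers
    return header + chunk_counts + counters + chunk_data_bytes + degenerate + n_stretches


# Chromosome 1 parameters from the table notes.
L, C, S = 245_522_847, 3_747, 37

# Byte counts copied from the table: K -> (Chunk Data Array, Degenerate Data Array).
chr1 = {4: (447_573_936, 1_752), 6: (476_350_748, 3_650)}

print("sequence segment:", sequence_segment_bytes(L))
for K, (chunk_bytes, degenerate_bytes) in chr1.items():
    # Back-calculated counts (implied by the formulas, not reported in the source).
    degenerate_kmers = degenerate_bytes // (4 + K)
    nondegenerate_kmers = chunk_bytes // 2 - 4 ** K * C
    assert chunk_data_array_bytes(K, C, nondegenerate_kmers) == chunk_bytes
    total = index_segment_bytes(K, S, chunk_bytes, degenerate_kmers)
    print(f"K = {K}: index segment = {total} bytes, ratio to L = {total / L:.2f}")
```

With these inputs the element sizes sum to the two Index Segment totals listed in the table, which confirms that the Chunk Counts Summary occupies 4^(K+1) bytes.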

  1. A MICA file consists of a Sequence Segment (elements A – D) followed by an Index Segment (elements E – M). If a sequence occupies more than 16 chunks, then loading of a MICA index consists of reading elements A – C of the Sequence Segment, skipping over the DNA sequence, and reading elements E – J of the Index Segment. The parameters for human chromosome 1 were: L = 245,522,847 bases; C = 3,747 chunks; S = 37 N-stretches (total of 22,695,000 N's).
  2. (A) A single byte is used to specify the Sequence Segment format.
  3. (B) The size of the Sequence Segment is specified by a 4-byte integer.
  4. (C) A single byte is used to record properties of the sequence, including its topology (linear or circular) and its strandedness (single- or double-stranded).
  5. (D) The DNA sequence is stored as a string of uppercase ASCII characters.
  6. (E) A single byte is used to specify the Index Segment Format.
  7. (F) The size of the Index Segment is specified by a 4-byte integer.
  8. (G) A single byte is used to record the byte order ("endianness") of the index. If the byte order is opposite to that of the machine being used to run the queries, MICA corrects the byte order when processing the index data.
  9. (H) The Chunk Counts Summary is a list of 4^K 4-byte integers representing the total number of times each nondegenerate K-mer appears in the sequence. For the MICA index, the 4-base nondegenerate DNA alphabet is arranged in the order G, A, T, C. Thus, the first nondegenerate K-mer listed is GGGG (K = 4) or GGGGGG (K = 6), and the last one listed is CCCC (K = 4) or CCCCCC (K = 6). This lexicographical order yields contiguous index reads for K-mers that end in the most common partially degenerate bases: R (A or G), Y (C or T), and W (A or T).
  10. (I) The Degenerate K-mer Count is a 4-byte integer representing the total number of partially degenerate K-mers in the sequence.
  11. (J) The N-Stretch Count S is a 4-byte integer representing the number of separate stretches of K or more consecutive N's.
  12. (K) The Chunk Data Array is divided into 4^K partitions corresponding to the 4^K nondegenerate K-mers. Each partition contains a list of 2-byte integers representing the number of times the K-mer is present in each of the C chunks, followed by a list of 2-byte integers representing the intra-chunk positions of the K-mer in each of the C chunks. The first partition contains the data for GGGG (K = 4) or GGGGGG (K = 6), and the second partition contains the data for GGGA (K = 4) or GGGGGA (K = 6), and so on.
  13. (L) The Degenerate Data Array is a list of the partially degenerate K-mers. Each partially degenerate K-mer is represented as a 4-byte integer that marks the absolute position of the K-mer, followed by a K-byte string that encodes the sequence of the K-mer.
  14. (M) The N-Stretch Data Array consists of S pairs of 4-byte integers that represent the starting positions and lengths of the N-stretches.
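
Two of the notes above lend themselves to a short illustration: the G, A, T, C ordering of K-mers (notes H and K) and the layout of the N-Stretch Data Array (note M). The sketch below is illustrative only and is not taken from the MICA source code. All function names are hypothetical, and the N-stretch decoder makes several assumptions: each pair stores the start before the length, the integers are unsigned, and the caller supplies the byte order, because the notes do not say how element G encodes it.

```python
import struct

ALPHABET = "GATC"  # base order stated in note (H)
CODE = {base: i for i, base in enumerate(ALPHABET)}
# The three partially degenerate bases named in note (H), in index order.
IUPAC_PAIRS = {"R": "GA", "W": "AT", "Y": "TC"}


def kmer_rank(kmer):
    """Position of a nondegenerate K-mer in the index: a base-4 number with G=0, A=1, T=2, C=3."""
    rank = 0
    for base in kmer:
        rank = rank * 4 + CODE[base]
    return rank


def partition_ranks(kmer):
    """Ranks to read for a K-mer whose last base is G, A, T, C, R, W, or Y."""
    prefix, last = kmer[:-1], kmer[-1]
    return [kmer_rank(prefix + base) for base in IUPAC_PAIRS.get(last, last)]


def decode_n_stretches(raw, count, byte_order="<"):
    """Decode `count` (start, length) pairs of 4-byte integers (assumed layout)."""
    values = struct.unpack(f"{byte_order}{2 * count}I", raw)
    return list(zip(values[0::2], values[1::2]))


print(kmer_rank("GGGG"), kmer_rank("GGGA"), kmer_rank("CCCC"))  # 0 1 255
print(partition_ranks("GGGR"), partition_ranks("GGGW"), partition_ranks("GGGY"))  # [0, 1] [1, 2] [2, 3]

# Made-up example bytes, not real chromosome data.
raw = struct.pack("<4I", 100, 50, 1000, 20)
print(decode_n_stretches(raw, 2))  # [(100, 50), (1000, 20)]
```

Because G and A, A and T, and T and C are neighbours in this ordering, a K-mer ending in R, W, or Y maps to two adjacent partitions, which is the contiguity that note (H) describes.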