Disk-based k-mer counting on a PC

Background The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data. Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.


KMC usage
KMC program constructs a database of statistics for input set of FASTQ files. This database can then be used from other software: directly via KMC API (described in Section 2) or by reading a textual file containing a list of k-mers and their related counters. This textual file can be obtained for a database by KMC-dump program, that is presented in Section 3 as a sample application of our KMC API. Section 4 describes the database format in detail for those interested in the low-level access to the data. Section 5 contains additional experimental results and a description of the parameters of execution of the examined programs. Section 6 contains description of how the automatic setting of parameters of KMC works.
Below we describe in detail the parameters and options of the KMC command-line tool, in version 0.3. The general syntax is: kmc [options] <input file name> <output file name> <working directory> [<working directory2> ...] or: kmc [options] <@input file names> <output file name> <working directory> [<working directory2> ...] where the parameters are: • input file name -a single file in FASTQ format (gzipped or not), • @input file names -a file name with list of input files in FASTQ format (gzipped or not), • output file name -the output database file; if such a file exists, it will be overwritten.
The parameters -sc<value>, -sf<value>, -sp<value>, -so<value>, and -sr<value> concern the internal work of KMC, i.e., their settings may affect the program's processing speed, but won't change its output. Not setting at least one parameter from this group makes KMC ignore them all.

API
In this section we describe two classes, CKmerAPI and CKMCFile. They can be used to gain access to the databases produced by KMC program.

CKMCFile class
This class handles a k-mer database. Its key methods are: • CKMCFile() -constructor, • bool OpenForRA(std::string file name) -opens two files: file name with added extension ".kmc pre" and ".kmc suf", reads their whole content to enable random access (in memory), and then closes them, • bool OpenForListing(std::string file name) -opens the file file name with added extension ".kmc pre" and allows to read the k-mers one by one (whole database is not loaded into memory), • bool ReadNextKmer(CKmerAPI &kmer, float &count) -reads next k-mer to kmer and updates its count; the return value is bool; true as long as not eof-of-file (available only when database is opened in listing mode), • bool Close() -if the file was opened for random access, the allocated memory for its content is released; if the file was opened for listing, the allocated memory for its content is released and the ".kmer" file is closed, • bool SetMinCount(uint32 x) -set the minimum counter value for k-mers; if a k-mer has count below x, it is treated as non-existent, • uint32 GetMinCount(void) -returns the value (uint32) set with SetMinCount, • bool SetMaxCount(uint32 x) -set the maximum counter value for k-mers; if a k-mer has count above x, it is treated as non-existent, • uint32 GetMaxCount(void) -returns the value (uint32) set with SetMaxCount, • uint64 KmerCount(void) -returns the number of k-mers in the database (available only for databases opened in random access mode), • uint32 KmerLength(void) -returns the k-mer length in the database (available only for databases opened in random access mode), • bool RestartListing(void) -sets the cursor for listing k-mers from the beginning of the file (available only for databases opened in listing mode). The method OpenForListing(std::string file name) invokes it automatically, but it can be also called by a user, • bool Eof(void) -returns true if all k-mers have been listed, • bool CheckKmer(CKmerAPI &kmer, float &count) -returns true if kmer exists in the database and set its count if the answer is positive (available only for databases opened in random access mode), • bool IsKmer(CKmerAPI &kmer) -returns true if kmer exists (available only for databases opened in random access mode), • void ResetMinMaxCounts(void) -sets min count and max count to the values read from the database, • bool Info(uint32 & kmer length, uint32 & mode, uint32 & counter size, uint32 & lut prefix length, uint32 & min count, uint32 & max count, uint64 & total kmers) -gets current parameters from the k-mer database, • CKMCFile() -destructor.

Example of API usage
The KMC-dump application (Figs. S1 and S2) shows how to list and print k-mers with at least min count and at most max count occurrences in the database. Fig. S1 presents parsing the command-line parameters, including -ci<value> and -cx<value>. Input and output file names are also expected. The code in Fig. S2 is for actual database handling. This database is represented by a CKMCFile object, which opens an input file for k-mer listing (the method bool OpenForListing(std::string file name) is invoked). The parameter of the method SetMinCount (SetMaxCount) must be not smaller (not greater) than the corresponding parameter -ci (-cx) with which KMC was invoked (otherwise, nothing will be listed). The listed k-mers are in the form like: AAACACCGT\t<value> where the first part is the k-mer in natural representation, which is followed by a tab character, and its associated value (integer or float). (Such format is compatible with Quake, a widely used tool for sequencing error correction.) Note that, if needed, one can easily modify the output format, changing the lines 39 and 41 in

Database format
The KMC application creates output files with two extensions: • .kmc pre -information on k-mer prefixes (plus some other data) are stored here, • .kmc suf -information on k-mer suffixes and the related counters are stored here.
All integers in the KMC output files are stored in LSB (least significant byte first) format.

The .kmc pre file structure
A .kmc pre file contains, in the order: • [header position], • [marker] (another copy, to signal the file is not truncated). [marker] 4 bytes with the letters: KMCP.

[header position]
An integer consisting of the last 4 bytes in the file. Its contains the relative position of the beginning of the field [header]. After opening the file, one should do the following: 1. Read the first 4 bytes and check if they contain the letters KMCP.
2. Jump to position 4 bytes back from end of file and read the header position x.
3. Jump to position x + 4 bytes back from end of file and read the header.

Read [data].
[header] The header contains fields describing the file .kmc pre: • uint32 kmer lengthk-mer length, • uint32 mode -mode: 0 (occurrence count) or 1 (counting according to Quake quality), • uint32 counter size -counter field size: for mode 0 it is 1, 2, 3, or 4; for mode 1 it is always 4, • uint32 lut prefix length -the length (in symbols) of the prefix cut off from k-mers; it is invariant of the scheme that 4 divides kmer length − lut pref ix length, • uint32 min count -minimum number of k-mer occurrences to write in the database (if the counter is smaller, the k-mer data are not written), • uint32 max count -maximum number of k-mer occurrences to write in the database (if the counter is greater, the k-mer data are not written), • uint64 total kmers -total number of k-mers in the database, • uint32 tmp[8] -not used in the current version. [data] Here are k-mer prefix data. More precisely, it is an array of uint64 elements, of size 4 lut pref ix length . Position x in the array stores the number of different k-mers whose prefix of length lut pref ix length is (in binary) less than x. DNA symbols are encoded as follows: A ← 0, C ← 1, G ← 2, T ← 3. For example, if the queried k-mer is ACGACTGAT and lut pref ix length = 4, then we cut off the first 4 symbols, i.e., the prefix ACGA, which is interpreted in binary as x = 24 (since 0 × 2 6 + 1 × 2 4 + 2 × 2 2 + 0 × 2 0 = 24). Now we look into "data" at locations x and x + 1 to read, e.g., 1523 and 1685. This means that in the file .kmc suf in records from 1523 to 1685 − 1 there are suffixes of the k-mers with prefix ACGA. Having got this range, we can now binary search the suffix CTGAT.

The.kmc suf file structure
A .kmc suf file contains, in order: • [marker] (another copy, to signal the file is not truncated).
The k-mers are stored with their leftmost symbol first, packed into bytes. For example, CCACAAAT is represented as 0x51 (for CCAC), 0x03 (for AAAT). Integers are stored according to the LSB (little endian) convention, floats are stored in the same way as they are stored in the memory.
total kmers is a value taken from the .kmc pre file. record t is a type representing a k-mer. Its first field is the k-mer suffix string, stored on (kmer length − lut pref ix length)/4 bytes. The next field is counter size, with the number of bytes used by the counter, which is either a 1 . . . 4-byte integer, or a 4-byte float.

Test platforms
K-mer Counter (KMC), was implemented in C++11, using gcc compiler (version 4.7.1) for the linux build and Microsoft Visual Studio 2012 for the Windows build. Experiments comprise also the tools Jellyfish and BFCounter.
Two test machines were used. One was a 4 AMD Opteron TM 6136 2.4 GHz CPUs (32 cores) server with 128 GB RAM, and fast RAID-0 disk matrix of total size 2.0 TB. The other was a "home" PC, with 6-core AMD Phenom II 1090 3.2 GHz CPU, 16 GB RAM and 3 SATA HDD of sizes 2 TB each. The hard disks at the PC machine were: two Seagate Barracuda Green ST2000DL003 with 5,900 rpm and one Hitachi Deskstar 7K3000 with 7,200 rpm.

Parameters of programs BFCounter
The main problem with running BFCounter (ver. 0.2) concerns proper estimation of the number of distinct k-mers. In the experiments we set this parameter to be about 20 percent greater than the actual number of distinct k-mers (which was known from a KMC test).

Jellyfish
Jellyfish (ver. 0.5) requires to give as a parameter the size of the hash table, which should be a bit larger than the total number of distinct k-mers. Similarly as for BFCounter we passed the value about 20 percent greater than the number of distinct k-mers (known from prior KMC execution). Unfortunately, for some cases (denoted by asterisk in the tables with experimental results), it resulted in Jellyfish crash due to the insufficient amount of RAM (128 GB of RAM was available in the test machine). In such cases we halved the parameter value and ran Jellyfish again, which produced two output files. These files should be merged then, but unfortunately, Jellyfish was unable to do that due to insufficient amount of RAM.
To allow right comparison with BFCounter (which does not count unique k-mers), we set in the execution of Jellyfish that it should not count k-mers with count value less than 2.0.
Probably due to some bug, when the output file name specified in the command line is shorter than 10 letters, Jellyfish often dramatically slows down (2-3 times). Thus, we always used long file names. DSK DSK (ver. 1.4993) was executed with default parameters that means 5 GB limit of RAM. The k-mers occurring less than 2 times were excluded.

KMC
KMC was executed with setting that k-mers occurring less than 2 times should not be counted. The parameter -p was set to 3 for two smallest data sets, to 4 for Caenorhabditis elegans data set and to 5 for two human data sets.

Results
The results presented at the following pages are for 5 data sets characterized in the main part of the paper. For 3 data sets examined in the main paper we duplicate here the tables and provide extra results (for other values of k). Table S1: k-mers counting results for D. ananassae (8.7 GB FASTQ file or 6 gzipped FASTQ files of total size 1.8 GB). RAM and disk spaces are in GB (1GB=2 30 B). Time is in seconds. The programs were used for the number of threads adjusted to the number of cores to achieve maximum speed.      Table S3: k-mers counting results for Zea mays (45.9 GB FASTQ file or 108 gzipped FASTQ files of total size 16.3 GB). Test methodology and column description are just as for Table S1.

Automatic setting of parameters in KMC
The automatic setting of parameters mechanism tries to allocate the available resources (i.e., CPU cores) in the best possible way. The optimal number of threads for the parts of the algorithm is, however, hard to obtain, since it depends on many things, like the compression method of input files, the number and speed of disks, etc. Thus, our automatic mechanism is obviously suboptimal, nevertheless, experiments show that it performs reasonably well. If the results are unsatisfactory, the KMC user can specify these parameters from command line. The most important factor of the mechanism is the number of available cores (possibly overridden if the user specifies it with -s? parameters).