We introduce a novel alignment method, KMA, and scoring scheme, ConClave, which allows for mapping of raw reads directly against redundant databases. KMA diverges from known mappers by allowing redundancy within the databases, and KMA also produces consensus sequences and a result overview. KMA was created to be as intuitive and user friendly as possible, based on our current user demands and analysis challenges. In order to solve this, KMA works in five main steps: trimming of reads, heuristic k-mer mapping, fine alignment, ConClave scoring and reference assembly (see Fig. 1).
Heuristic k-mer mapping
When mapping against databases, a query sequence can fall within two categories: the query does not look like anything in the database or the query looks like something in the database. Depending on the database, it is therefore necessary to determine at an early stage whether alignment of a given sequence is feasible or if the sequence should be discarded from further analysis. Computationally, it is much faster to map raw reads against template sequences and determine their approximate origin than it is to perform alignment giving the exact differences between sequences. In order to exploit this, KMA maps k-mers between the query sequence and template database. The matching k-mers are then used to produce a mapping score where a matching k-mer is rewarded with a score of k while k-mer mismatches are penalized with − 1. For a query sequence to be accepted for alignment, it must have a positive score based on this mapping, equivalent to a minimum identity of (k + 1)− 1 in k-mer space. In the case where a query sequence is accepted, the potential template candidates are limited by linking the matching k-mers to their respective template sequences.
Fine alignment
KMA uses a hash map of indexed k-mers from which to start alignment, where this indexing is computed for templates passing the heuristic k-mer mapping. The indexed k-mers are then used as seeds from which to start alignment, where each matching seed is extended to give an optimal score. To enable a high resolution of gaps and mismatches, KMA utilizes the Needleman-Wunsch algorithm [9] to align the pieces of query and template sequence between seed-extends. The tails of the alignment, before the first seed and after the last, are aligned with a specialized version of the Needleman-Wunsch algorithm allowing for early stopping so that leading and trailing gaps can be skipped.
ConClave scoring
As KMA is designed to map sequences against redundant databases, the possibility of a tie for best matching template should be considered more as rule than an exception. For KMA to be able to find the one most likely template for each query sequence, a novel scoring scheme, ConClave, was developed.
The ConClave scoring scheme requires that all best matching templates are initially accepted for all best matching query sequences. The alignment scores obtained are then summed together for each template candidate in order to form the ConClave score, in this way reflecting a maximum score for each template. For alignments where ties do not occur, a second score is conducted as the unique ConClave score, which are used to reflect the unique alignments.
Ties for best matching templates can now be resolved by choosing the most likely template based on the ConClave score. In case of a tie on the ConClave score, the unique ConClave score is used to choose a best matching template. For samples were two templates are equally well-suited candidates, reflected by identical ConClave scores, the algorithm will choose the first sequence added to the database.
Reference assembly
The ConClave scoring and alignment of reads enables assembly guided by a reference, resulting in a consensus sequence for the given reference / template. From the alignment, the differences between each read and the associated template are known, and this gives information about which nucleotides are called at each position in the template. Each nucleotide is then determined by majority voting, where the strength of the called nucleotide is tested with a McNemar test, giving a robust base calling performance across different sequencing platforms [10]. In case a nucleotide is not significantly overrepresented according to the McNemar test (according to a user specified α, default 0.05), it is reported in lower case.
After reference assembly, statistics for each template are collected in a single file, containing features such as template identity, coverage and depth.
Performance evaluation and comparison
The performance of KMA was evaluated on a dataset of 13 Escherichia coli from Grad et al. (2012). The dataset included 11 E. coli from the German / French outbreak in 2011 and two historical E. coli [11]. Phenotypic susceptibility tests were known for all of the 13 E. coli, and the presence of the beta-lactamase blaCTX-M-15 were verified by PCR [11].
The 13 E. coli were mapped to the ResFinder database of known resistance genes [12], where the associated phenotype of each gene were compared with phenotypic susceptibility tests.
In order to measure the performance on larger databases, a core genome MLST (cgMLST) database for E. coli was downloaded from EnteroBase (available at: http://enterobase.warwick.ac.uk [Accessed 18 January 2018]. The cgMLST scheme contains 2447741 closely related genes, which together matches the size of the human genome.
KMA was compared with six different methods: SRST2 (v. 0.1.8) [3], MGmapper (v. 2.7) [7], BWA-MEM (v- 0.7.15) [5], Bowtie2 (v. 2.2.4) [4], MiniMap2 (v. 2.6) [6] and Salmon (v. 0.9.1) [8], each capable of mapping raw reads directly against entire databases of sequences.
SRST2 uses Bowtie2 to map against a homology-reduced database, clustered at 90% identity using CD-hit [13]. From each cluster, one allele is chosen, as the one with the highest alignment score from all alignments towards that cluster. After mapping and alignment, SRST2 performs SNP calling and reports the differences between query and template sequences [3].
For MGmapper to accept a template, it has to contain at least one unique mapping read when mapped with BWA-MEM. After mapping the reads, MGmapper reports the depth and template coverage of the accepted genes [7].
Salmon was developed for quantifying RNA transcript levels using the EM-algorithm. Which partially matches the redundant database problem, as only a subset of the templates in the database are likely to be present. Salmon can be used as an extension to BWA-MEM, Bowtie2 and Minimap2, when the alignment method is set to report several alignments per read [8].
SAMtools and BEDTools were used to estimate depth and template coverage of mappings performed with BWA-MEM, Bowtie2 and Minimap2 so that these could be used for a more direct comparison as well [14, 15].
To prove the difficulty of mapping raw reads directly against redundant databases, a simulated dataset of single-end and paired-end reads was created, where each gene in the ResFinder database was split into raw reads with a length 100 bp, and an insert size of 250 for the paired end set. Error and indel rates were set to zero so that the simulated reads would represent the perfect case scenario of short-read sequencers. Meaning, that any inconsistencies from the mapping methods, would reveal each of their weaknesses in mapping against redundant databases, and give a clear view of their assumptions. Simulated reads were created with chopDB, which is included alongside the commands used to compare KMA with existing methods and the ResFinder database.