Variable-order sequence modeling improves bacterial strain discrimination for Ion Torrent DNA reads

Background: Genome sequencing provides a powerful tool for pathogen detection and can help resolve outbreaks that pose public safety and health risks. Mapping of DNA reads to genomes plays a fundamental role in this approach, where accurate alignment and classification of sequencing data are crucial. Standard mapping methods crudely treat bases as independent of their neighbors. Accuracy might be improved by using higher order paired hidden Markov models (HMMs), which model neighbor effects but introduce design and implementation issues that have typically made them impractical for read mapping applications. We present a variable-order paired HMM, which we term VarHMM, that addresses the central issues involved in higher order modeling for sequence alignment.

Results: Compared with existing alignment methods, VarHMM is able to model higher order distributions and quantify alignment probabilities with greater detail and accuracy. In a series of comparison tests, in which Ion Torrent sequenced DNA was mapped to similar bacterial strains, VarHMM consistently provided better strain discrimination than any of the other alignment methods we compared with.

Conclusions: Our results demonstrate the advantages of modeling higher order probability distributions and suggest that further development of such models would benefit read mapping in a range of other applications as well.


S1.1 Seeding
The general idea of our seeding method is to scan through the entire reference looking for short matches to the start of a read, and then to exploit similarities between reads to reduce the number of calculations and the processing time. These short-match scores are pre-calculated only once for a paired HMM, after it has been trained: the paired HMM is used to calculate scores for all possible alignment combinations that are 8 nucleotides in length, as shown in Table S1. The pre-calculated values are stored in the matrix S, with

S(i, j) = score of aligning the 8-length sequence with index i to the 8-length sequence with index j,   i, j = 0, 1, ..., 65535,

where 65536 = 4^8 is the number of possible combinations for a sequence that is 8 nucleotides in length. Each of the two sequences in an alignment is hence given an index value as seen in Table S1, where the index value is calculated as the sum

n_0·4^7 + n_1·4^6 + n_2·4^5 + n_3·4^4 + n_4·4^3 + n_5·4^2 + n_6·4^1 + n_7·4^0,

and n_0 to n_7 are numerical values that represent the nucleotide at each position in the sequence, from position 0 to 7, with A = 0, C = 1, G = 2, T = 3 (for example, AAAAAAAT gives an index value of 3).

The alignment score s is defined as

s = log(V_max) / L_read,

where V_max is the probability of the highest scoring Viterbi alignment, as described in the 'Material and Methods' section and Figure 2 of the main text, and L_read is the length of the read and is used for normalization. This equation for s is used to calculate the score of all paired HMM alignments (for example, both the pre-calculated alignments above and the alignments described in the next section).
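As a minimal illustration of this indexing scheme (a Python sketch with our own function name, not the VarHMM implementation):

```python
# Minimal sketch of the Table S1 indexing scheme (our own naming, not the VarHMM code).
NUC_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(seq: str) -> int:
    """Index of an 8-nucleotide sequence: n_0*4^7 + n_1*4^6 + ... + n_7*4^0."""
    assert len(seq) == 8
    index = 0
    for base in seq:
        index = index * 4 + NUC_VALUE[base]  # Horner's rule for the base-4 sum
    return index

assert kmer_index("AAAAAAAA") == 0
assert kmer_index("AAAAAAAT") == 3       # the example given in the text
assert kmer_index("TTTTTTTT") == 65535   # the largest of the 65536 index values

# The pre-calculated scores are then looked up as S[kmer_index(ref_8mer)][kmer_index(read_8mer)].
```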

S1.2 Making Alignments and Parameter Selection
Once the S matrix described in the previous section has been calculated, we can proceed to align a given set of reads to any reference. This involves two initialization steps followed by the main process.

Initialization

1. Only align reads whose length is greater than L_min. Sort all reads alphabetically using the Quicksort algorithm (1).
2. Scan through the reference sequence and, at each position, retrieve the segment of the reference that is 8 nucleotides in length. Convert the 8-length segment to an index value k_n (as shown in Table S1), where n refers to the position in the reference at which k_n was calculated. Each k_n value is stored in a vector of indexes r = [k_1, k_2, ..., k_N], where N = L_ref − 7 for a reference of length L_ref.
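A minimal sketch of this initialization step, under the same hypothetical naming as in the previous sketch (the index function is repeated so that the block is self-contained):

```python
NUC_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(seq: str) -> int:
    index = 0
    for base in seq:
        index = index * 4 + NUC_VALUE[base]
    return index

def build_reference_indexes(reference: str) -> list:
    """Index value k_n for every 8-length segment of the reference,
    giving the vector r = [k_1, ..., k_N] with N = len(reference) - 7."""
    return [kmer_index(reference[n:n + 8]) for n in range(len(reference) - 7)]

r = build_reference_indexes("ACGTACGTACGT")
print(len(r))   # 5 index values for a reference of length 12
```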

Main Process
1. Compare the first 8 nucleotides of a read with the previous read. If different (or if we are processing the very first read), then proceed to step 2 (otherwise go to step 3).
2. For the first 8 nucleotides of the current read m, calculate the read index value l_m. Scan through the reference and at each position n take the index value k_n = r(n) and look up the alignment score in S. Going through positions 1 to N in the reference gives S(k_1, l_m), S(k_2, l_m), ..., S(k_N, l_m). If a score is greater than s_min^8, the reference position is stored in a vector u, which contains all the positions in the reference where we think the read might align. If we only want to use exact matches, then each reference index k_n is simply compared with the read index l_m (in which case the S matrix lookup is optional), and a reference position is stored in u if k_n = l_m (a sketch of this step and the next is given after this list).
3. For each candidate position in u, we make a rough comparison between the whole read and the corresponding reference segment to eliminate obviously mismatching alignments. This is done by counting the number of A, C, G, and T nucleotides in both the read and the reference segment for every window of w_len nucleotides (if the remaining read and reference section is shorter than w_len, it is not counted as a window and is ignored). For each window we calculate the difference in nucleotide counts between the read and the reference (i.e. |#As in read − #As in reference| + ... + |#Ts in read − #Ts in reference|), and z is the average of this difference across all windows.
4. For each candidate in u with a z value less than z_max, or where the read length is less than w_len, we calculate an alignment that is 24 nucleotides in length (or shorter if the read is shorter). We then create a new array v containing the 24-length alignments that have a score greater than s_min^24.
5. For each alignment in v, we extend the alignment so that the entire read is aligned, and the best and 2nd best alignments are saved (the next read is then processed and we return to step 1). If v is empty, or if the best score is less than s_min, then the first 8 nucleotides of the read are removed and steps 1-4 are repeated (up to a maximum of N_rep times).

The values of the constants used in the above steps are shown in Table S2. The s_min^8 cutoff was selected so that only exact 8-length matches were considered for the subsequent alignment steps, giving the fastest but least sensitive seeding. The s_min^24 and z_max cutoffs were introduced to filter out alignments that appear unlikely to produce a match and thus to reduce processing time (as with s_min^8, higher s_min^24 and lower z_max values improve speed but reduce sensitivity). We set s_min^24 and z_max so that they only filtered out what we considered to be obviously mismatching alignments.
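A minimal sketch of steps 2 and 3, under the naming assumptions used above (the pre-calculated score matrix S is passed in as a plain nested list; all function names are ours):

```python
from collections import Counter

def seed_positions(r, read_index, s_matrix=None, s8_min=None):
    """Step 2: reference positions whose 8-mer matches the first 8-mer of the read.
    With s_matrix given, positions scoring above s8_min are kept; otherwise only
    exact index matches are kept (the S lookup is then optional)."""
    if s_matrix is None:
        return [n for n, k_n in enumerate(r) if k_n == read_index]
    return [n for n, k_n in enumerate(r) if s_matrix[k_n][read_index] > s8_min]

def window_count_difference(read, ref_segment, w_len):
    """Step 3: the z value, i.e. the average over all complete w_len windows of
    |#A_read - #A_ref| + |#C_read - #C_ref| + |#G_read - #G_ref| + |#T_read - #T_ref|."""
    n_windows = min(len(read), len(ref_segment)) // w_len   # incomplete windows are ignored
    diffs = []
    for w in range(n_windows):
        lo, hi = w * w_len, (w + 1) * w_len
        read_counts, ref_counts = Counter(read[lo:hi]), Counter(ref_segment[lo:hi])
        diffs.append(sum(abs(read_counts[b] - ref_counts[b]) for b in "ACGT"))
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Candidate positions with z below z_max (or reads shorter than w_len) would then go on to the 24-length alignments of step 4.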
The cutoff score s_min is used to keep only matches that we consider sufficiently well aligned. The s_min value was derived from the lowest alignment score s_lowest observed across all the training alignments (from the last iteration of a paired HMM on the training data). If we were aligning the training set itself, we would set s_min so that only alignments scoring at least as well as s_lowest are kept. In a real alignment scenario, however, there may of course be alignments that are correct but score lower than s_lowest. We therefore used a value for s_min that was 10% lower than s_lowest (to keep more alignments, a lower s_min value should be used). Rounded to the first decimal, this resulted in the same s_min value of -2.5 for all of the paired HMMs.
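Purely as an illustration of this derivation (the actual s_lowest values come from the training alignments and are not given here; we interpret "10% lower" as making the negative score 10% more negative):

```python
def derive_s_min(s_lowest, margin=0.10):
    """s_min set 10% lower than the lowest training score (more negative, since the
    normalized log scores are negative), rounded to one decimal."""
    return round(s_lowest * (1.0 + margin), 1)

# With a hypothetical s_lowest of about -2.3, this gives the -2.5 reported in Table S2.
print(derive_s_min(-2.3))   # -2.5
```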
In step 5, given a read of length L_read, we select a segment of the reference that is also of length L_read for the alignment. Our paired HMM performs global alignment, however, so there may be indels at the end of the reference segment or the read that must be handled. We apply rules R1 and R2 for these two situations. R1) If there are gaps at the end of the reference segment, we extend the segment. R2) If there are gaps at the end of the read, we remove this last part of the alignment. In the first case, we extend the segment of the reference we are aligning to and can then simply extend the alignment matrices as required (see Figure 2). In the second case, we assign the aligned sequence the score from the alignment matrix just before the gaps at the end of the read.

Finally, we consider a special case involving homopolymers or repeats that can occur for the higher order models, for example an alignment that ends in a run of C's followed by a gap. For a 2nd order paired HMM, aligning a C to a C depends on context: the probability of aligning C|AC to C|AC is, for example, different from that of aligning C|CC to C|CC. If this occurs near the end of an alignment, a higher order model might therefore not place the gap at the very end of the sequence. A 0th order model, in contrast, can be set to always place the gap at the end when using the Viterbi algorithm, because all the C alignments produce the same probability. To address this issue, the higher order models only use a 0th order model for the last 10 nucleotides of the read when performing a full alignment.

We also used Bowtie and TMAP to align the reads from the training dataset described in the Datasets section of the manuscript. We then selected the reads with a score higher than -100 for Bowtie and 50 for TMAP, as this yielded a similar number of alignments to the Last training dataset. We then trained one VarHMM with the Bowtie alignments and another VarHMM with the TMAP alignments, as described in the Training section of the manuscript. Alignment testing was performed as in manuscript Figure 4, where we again used the parameters in Table S2 for VarHMM.
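The fallback to a 0th order model for the last 10 nucleotides of a read, described earlier in this section, can be written compactly; a minimal sketch with our own function name, assuming 0-based read positions:

```python
def effective_order(model_order, read_pos, read_len, tail=10):
    """Order used for emissions at a given read position: the trained order everywhere
    except the last `tail` nucleotides of the read, where a 0th order model is used so
    that trailing gaps can always be pushed to the very end of the alignment."""
    return 0 if read_pos >= read_len - tail else model_order

assert effective_order(2, 50, 200) == 2    # interior of the read: full 2nd order context
assert effective_order(2, 195, 200) == 0   # last 10 nucleotides: fall back to 0th order
```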

S2.3 Read Mapping, c_min, and Smaller Training Sets
To measure the relationship between c_min and the number of samples in the training data, we used the alignments from Last (which are used for initial parameter estimation, as described in the Training section of the manuscript) and calculated the following values for the different orders and states, for c_min = 50, c_min = 100, and c_min = 1000:

Fraction of combinations (FC): the number of alignment combinations whose training count is above c_min, divided by the total number of possible combinations for that order and state.
Percent of combinations (PC): FC expressed as a percentage.
Percent of total counts that make up PC (PTC): the sum of the counts for the combinations included in FC and PC, divided by the total counts. For example, if c_min = 100 and we count 1000 dinucleotide alignments in total in the training data, 950 of which are A|A aligned to A|A, then there is only one case where the count is above 100 (so that FC = 1/256), while PTC = 950/1000 = 95%.
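A sketch of how FC, PC, and PTC can be computed from a table of training counts (the counts themselves come from the Last training alignments; the example data below reproduce the worked dinucleotide example from the text, with an arbitrary split of the remaining 50 counts):

```python
def coverage_statistics(counts, n_combinations, c_min):
    """FC, PC and PTC for one state and order.

    counts         : training count per observed alignment combination
    n_combinations : number of possible combinations for this order/state
                     (e.g. 256 for dinucleotide alignments, 65536 for a 3rd order M state)
    c_min          : count a combination needs before it is modeled at this order
    """
    above = {combo: c for combo, c in counts.items() if c > c_min}
    total = sum(counts.values())
    fc = len(above) / n_combinations              # fraction of combinations above c_min
    pc = 100.0 * fc                               # FC expressed as a percentage
    ptc = 100.0 * sum(above.values()) / total     # percent of all counts that they cover
    return fc, pc, ptc

# Worked example from the text: 1000 dinucleotide alignments, 950 of them A|A aligned to A|A.
counts = {("AA", "AA"): 950, ("AA", "AC"): 30, ("AC", "AA"): 20}
print(coverage_statistics(counts, n_combinations=256, c_min=100))   # (1/256, ~0.39, 95.0)
```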
The results of our analysis are shown in Table S3. The FC and PC values decreased for higher values of c_min, while the PTC values changed only a little. This is because mismatches are uncommon, while matching combinations make up a very high percentage of the training alignments; for c_min = 1000 and a 3rd order M state, for example, we have FC = 1024/65536 while PTC = 0.99893. Therefore, when increasing c_min or decreasing the training set size, we mainly exclude rarer cases from the higher order modeling of the M state.
To assess the effects of using a smaller training set together with the same c_min value, we trained VarHMM with half of the Last training reads that were used for the VarHMM model in the Results section of the manuscript. The results are shown in Figure S3; as can be seen, the decrease in accuracy was very minor despite the smaller training set.

Table S3: Training counts and percentages for different values of c_min. The first column shows the order (O1 = 1st order, O2 = 2nd order, and O3 = 3rd order) and column 2 the respective states, while FC, PC, and PTC are shown in the subsequent columns (see text for details).

S2.4 Classification
When classifying a pathogen from sequenced DNA reads, high read coverage is often limited to purified samples, and factors such as contamination from other genomic sources or the use of multiplexed sequencing can make classification significantly harder (2). Francis et al. (2) approximate such effects by using sequenced read sets with low coverage (0.001X, 0.01X, and 0.1X, which yielded sets with 92, 924, and 9237 reads, respectively). For each of these sets, they generated 1000 different datasets, each containing reads randomly selected from their full dataset of 92370 sequenced reads.
In our analysis, we tested VarHMM, Last, Bowtie, and TMAP with the R200 and R400 datasets on the DH10B and MG1655 test set, to see how classification compared with the read mapping performance described in the Results section of the manuscript. We only show results for 0.01X and 0.1X coverage for the R200 and R400 sets, as 0.001X provided as few as 19 reads (making accuracy difficult to assess), and all of the aligners achieved 100% accurate classification on the 1X datasets. Following the Francis et al. test framework, we generated 1000 random datasets from each coverage level (i.e. 1000 datasets randomly sampled from the 0.01X set and 1000 from the 0.1X set). We then measured the percentage of the random datasets that were classified to the correct strain.
Similar to the read alignment test in the manuscript Results section, we varied a threshold and assigned reads to strains. A read was assigned to a strain if it aligned to only one of the strains, or if its alignment score to one strain was better than its alignment to the other strain by at least the threshold that was used. We then classified each dataset as the strain with the most reads assigned to it (this performed better for all aligners than other approaches we tried, such as summing the probabilities of the reads assigned to each strain and selecting the strain with the highest sum).
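A minimal sketch of this assignment and voting procedure, with our own naming (each read is represented by its best alignment score to each of the two strains, or None if it did not align to that strain; the subset size of 924 corresponds to the 0.01X example above and is only a default):

```python
import random
from collections import Counter

def assign_read(score_a, score_b, threshold):
    """Assign a read to strain 'A' or 'B' if it aligned to only one strain, or if its
    score to one strain beats the other by at least `threshold`; otherwise unassigned."""
    if score_a is None and score_b is None:
        return None
    if score_b is None or (score_a is not None and score_a - score_b >= threshold):
        return "A"
    if score_a is None or score_b - score_a >= threshold:
        return "B"
    return None

def classify_dataset(read_scores, threshold):
    """Classify a dataset as the strain with the most assigned reads."""
    votes = Counter()
    for score_a, score_b in read_scores:
        strain = assign_read(score_a, score_b, threshold)
        if strain is not None:
            votes[strain] += 1
    return votes.most_common(1)[0][0] if votes else None

def classification_accuracy(all_read_scores, true_strain, threshold,
                            n_datasets=1000, subset_size=924):
    """Percent of randomly subsampled datasets classified to the correct strain,
    in the style of the Francis et al. test (1000 random subsets per coverage level)."""
    correct = sum(
        classify_dataset(random.sample(all_read_scores, subset_size), threshold) == true_strain
        for _ in range(n_datasets)
    )
    return 100.0 * correct / n_datasets
```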
The classification results can be seen in Figure S4, where VarHMM showed the best results, followed by Last and then Bowtie and TMAP (Segemehl generally showed a relatively high error rate and is not included in the plots for the sake of visual clarity).

S3 Aligner Version Details
Version details for the different alignment methods used can be seen in

S5.2 Training Results Example
Alignment and transition probabilities from training, as used with VarHMM for the 0th, 1st, and 2nd order models, are included with the source code described in the previous section and can be used to make alignments. An example of the probability and transition matrices for the 0th order HMM is also included (the values shipped with the source code are log scaled, whereas the example values are the probabilities before the log is applied). Looking at the diagonal of the probability matrix, i.e. the probabilities of matching A with A, C with C, ..., and T with T, the values are very similar to those in the Last probability matrix, which also uses a 0th order model.
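Since the values shipped with the source code are log scaled, they can be converted back to probabilities before inspecting the diagonal. A small helper sketch, assuming a 4 × 4 matrix laid out over A, C, G, T (the layout is our assumption):

```python
import math

def to_probabilities(log_matrix):
    """Exponentiate a log-scaled emission matrix back to probabilities."""
    return [[math.exp(value) for value in row] for row in log_matrix]

def match_diagonal(prob_matrix, alphabet="ACGT"):
    """Probabilities of matching A with A, C with C, G with G, and T with T."""
    return {alphabet[i]: prob_matrix[i][i] for i in range(len(alphabet))}
```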