Comparison of random and targeted NeatFreq selection methods on sequence coverage. A). Elevated k-mer counts within repetitive regions cause over-reduction when an RMKF cutoff is used. Stars mark the genomic regions identified as repeats by RepeatFinder. Reads from repetitive regions are placed in low selectivity bins because of the high frequency of similar k-mers within the data set. Over-reduction therefore occurs at multiples directly related to the count of repetitive regions. B). This histogram shows the retrieval of sequences at different RMKF cutoff levels for each of the bin selection methods. Aligned sequence coverage distribution is shown for the first 40,000 bp of the S. aureus genome using query sequences selected by the random (top) and targeted (bottom) methods. The targeted method is more effective at recruiting reads from low coverage regions, including 0-fold regions, that result from single cell amplification bias in variable coverage regions. The X-axis shows the genomic coordinate in the reference used for mapping the extracted reads and the Y-axis shows the level of coverage at each genomic position. C). The histogram gives a zoomed view of the low coverage area highlighted by an arrow in Figure 2A (region 278 kbp - 292 kbp). Alignment histograms show that the targeted algorithm, in contrast to random selection, retains the low coverage areas in the variable dataset, resulting in an increased sequencing span. D). Coverage histogram of reads aligned to the largest H1N1 Influenza genomic reference segment (log scale). Random selection from the entire dataset (without retention bins) was performed to a read count equal to that used by targeted selection at RMKF cutoff = 40. This random selection from all reads is the most subject to input coverage variability and fails to reduce deep spikes to coverage levels compatible with OLC assembly.
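The repeat effect described for panel A can be illustrated with a minimal sketch. It assumes that RMKF denotes a per-read median k-mer frequency; the toy reads, the value k = 4, and the function names are hypothetical and are not taken from the NeatFreq implementation. Each locus is sampled by exactly one read, so coverage is uniform, yet the read sequence that occurs at two loci receives twice the median k-mer count of the unique reads.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_median_kmer_frequency(reads, k):
    """Median, per read, of the data-set-wide counts of that read's k-mers.

    Assumption: this is what the caption's RMKF value measures.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [median(counts[km] for km in kmers(read, k)) for read in reads]


# Hypothetical toy data: one read per locus (uniform 1x coverage).
# The same repeat sequence is present at two loci.
unique_a = "TTGCAAGCTC"
unique_b = "CCGTGTCTAA"
repeat = "GATTACAGGC"
reads = [unique_a, repeat, unique_b, repeat]

rmkf = read_median_kmer_frequency(reads, k=4)

# Reads whose value exceeds a cutoff would be candidates for reduction.
cutoff = 1  # hypothetical cutoff for this toy data set
over_cutoff = [read for read, value in zip(reads, rmkf) if value > cutoff]
```

In this toy data set only the repeat-derived reads exceed the cutoff, even though no locus is over-sampled, which mirrors the over-reduction at repeats that the caption describes.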