Speedup of ABEA algorithm
We initially compared the optimised GPU version of the ABEA algorithm with the optimised CPU version (not the unoptimised CPU version in the original Nanopolish; see below) by executing them on publicly available raw nanopore genome sequencing data. The CPU version was run with the maximum number of threads supported on each tested system. Henceforth, the optimised CPU version is referred to as C-opti and the optimised GPU version as G-opti.
First, we benchmarked on five different systems (Table 2) using the Dsmall dataset. The speedups (including all overheads) observed for G-opti over C-opti were: ∼4.5× on the low-end-laptop and the workstation; ∼4× on the Jetson TX2 SoC; and ∼3× on the high-end-laptop and HPC (Fig. 3). The lower ∼3× speedup on the high-end-laptop and HPC (versus ≥4× on the other systems) is because the CPUs on those particular systems have comparatively more cores (12 and 40, respectively).
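Here, speedup refers simply to the ratio of end-to-end runtimes measured on the same system, i.e. speedup = T(C-opti) / T(G-opti), where the runtimes include all overheads.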
We next benchmarked on two larger datasets (Drapid and Dligation). A speedup of ∼3× was observed on all three systems for these two big datasets, Dligation and Drapid (Fig. 4). Because Dligation and Drapid contain more ultra-long reads (>100 kb) than Dsmall, the overall speedup on the SoC is limited to ∼3×, compared to ∼4× for Dsmall.
Note that comparing performance to the unoptimised CPU version in Nanopolish is not straightforward, as the time for individual components (e.g. ABEA) cannot be accurately measured because each read executes on its own code path (detailed in Supplementary Materials). We nonetheless estimated the runtime of unoptimised ABEA by injecting timestamp (gettimeofday) calls into the original Nanopolish code, directly before and after the ABEA component, to measure the runtime for each read. Nanopolish was launched with multiple threads and the summed runtimes were divided by the number of threads to obtain a reasonable estimate for ABEA. When evaluated on the Dsmall dataset, the optimised ABEA CPU version in f5c was ∼1.3-1.7× faster than the unoptimised ABEA in the original Nanopolish program (∼1.4× speedup on the Jetson TX2, workstation and HPC; ∼1.7× on the low-end-laptop; and ∼1.3× on the high-end-laptop).
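For illustration, the injected timing was conceptually of the following form. This is a minimal sketch only: align_read() is a hypothetical stand-in for the call into the ABEA component inside Nanopolish, and the bookkeeping is simplified.

#include <cstdio>
#include <sys/time.h>

// Hypothetical stand-in for the call into the ABEA component of Nanopolish;
// in the real code this performs adaptive banded event alignment for one read.
static void align_read(void) { /* ... ABEA for a single read ... */ }

// Wall-clock time in seconds, obtained via gettimeofday().
static double realtime(void) {
    struct timeval tp;
    gettimeofday(&tp, nullptr);
    return tp.tv_sec + tp.tv_usec * 1e-6;
}

static double abea_total = 0.0;   // accumulated per-read ABEA time

static void process_read(void) {
    double t0 = realtime();        // timestamp directly before ABEA
    align_read();
    abea_total += realtime() - t0; // timestamp directly after ABEA
}

int main(void) {
    const int num_reads = 1000;    // illustrative read count
    const int num_threads = 8;     // number of threads Nanopolish was launched with
    for (int i = 0; i < num_reads; i++) {
        process_read();
    }
    // Summing the per-read times and dividing by the thread count gives a
    // rough estimate of the elapsed (wall-clock) ABEA runtime.
    printf("estimated ABEA runtime: %.2f s\n", abea_total / num_threads);
    return 0;
}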
Comparative performance of f5c with Nanopolish
The overall performance of the GPU-accelerated ABEA algorithm was evaluated through a DNA methylation (5-methylcytosine) detection workflow. We compared the total runtime for methylation calling using the original Nanopolish against f5c (both the CPU-only and GPU-accelerated versions) on two publicly available nanopore datasets (see "Methods" section).
We refer to the original Nanopolish (version 0.9) as nanopolish-unopti, f5c run only on the CPU as f5c-C-opti and GPU-accelerated f5c as f5c-G-opti. We executed nanopolish-unopti, f5c-C-opti and f5c-G-opti on the full datasets Drapid and Dligation. Note that all execution instances were performed with the maximum number of CPU threads available on each system.
On the Drapid dataset, f5c-C-opti was ∼2× faster than nanopolish-unopti on the SoC and lapH, and ∼4× faster on HPC. On Dligation, nanopolish-unopti crashed on the SoC (limited to 8 GB RAM) and lapH (16 GB RAM) due to the Linux Out Of Memory (OOM) killer [12] (Fig. 5). On Dligation, f5c-C-opti on HPC was not only 6× faster than nanopolish-unopti, but also consumed only ∼15 GB of RAM, as opposed to >100 GB used by nanopolish-unopti (both with 40 compute threads). Hence, it is evident that the CPU optimisations alone yield significant improvements.
When comparing the total execution time (including disk I/O) of the entire methylation calling process with the different hardware acceleration options in f5c, f5c-G-opti was 1.7× faster than f5c-C-opti on the SoC, 1.5-1.6× faster on lapH and <1.4× faster on HPC (Fig. 5). On HPC, the speedup was limited to <1.4× because file I/O was the bottleneck. Note that only the ABEA step utilises GPU acceleration.
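This behaviour follows Amdahl's law: if a fraction p of the total runtime is spent in the GPU-accelerated ABEA step and that step is sped up by a factor s, the overall speedup is bounded by 1 / ((1 − p) + p/s). As a purely illustrative (not measured) example, if only 50% of the total runtime were ABEA (p = 0.5), even an infinitely fast GPU could not deliver more than a 2× overall speedup; an I/O-dominated run such as on HPC pushes p lower still and caps the benefit accordingly.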
For the Drapid dataset, f5c-G-opti was ∼4×, ∼3× and ∼6× faster than nanopolish-unopti on the SoC, lapH and HPC, respectively (Fig. 5). On the Dligation dataset on HPC, f5c-G-opti was a remarkable ∼9× faster.
Although parameters that may affect biological accuracy were untouched, we did observe subtle variations in the output as a consequence of hardware-dependent differences in floating point arithmetic. We assessed the impact of these subtle variations on the measurement of relative methylation frequencies by comparing results from Nanopolish, f5c-C-opti and f5c-G-opti on the Dsmall dataset, which encompasses 5 Mbases of human chromosome 20 with an average read coverage of 30×. Of the ∼32,000 surveyed CpG sites, f5c-C-opti and f5c-G-opti produced methylation frequencies differing from Nanopolish at only 6 (∼0.02%) and 65 (∼0.2%) positions, with average position-specific differences in methylation frequency of ∼1.5% and ∼0.4%, respectively. Both variants of f5c yielded overall Pearson correlations of 0.99999 with Nanopolish. Moreover, the overall correlation between Nanopolish and bisulfite sequencing data from NA12878 is 0.88723, while the correlations for f5c-C-opti and f5c-G-opti are 0.88723 and 0.88724, respectively. The impact of hardware-based differences on the calculated methylation frequencies is therefore negligible.
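The comparison itself reduces to per-site frequency differences and a Pearson correlation over matched sites. A minimal sketch of that calculation is shown below, assuming the per-CpG-site methylation frequencies of the two tools have already been matched by genomic position; the names and example values are illustrative only and are not part of f5c or Nanopolish.

#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation coefficient between two equal-length frequency vectors.
static double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

int main(void) {
    // Illustrative per-site methylation frequencies, already matched by
    // genomic position (in practice ~32,000 CpG sites for Dsmall).
    std::vector<double> freq_a = {0.00, 0.25, 0.80, 1.00, 0.50};
    std::vector<double> freq_b = {0.00, 0.25, 0.80, 1.00, 0.52};

    size_t differing = 0;
    double diff_sum = 0.0;
    for (size_t i = 0; i < freq_a.size(); i++) {
        double d = std::fabs(freq_a[i] - freq_b[i]);
        if (d > 0.0) { differing++; diff_sum += d; }
    }
    printf("sites with differing frequency: %zu\n", differing);
    if (differing > 0)
        printf("mean difference at those sites: %.4f\n", diff_sum / differing);
    printf("Pearson correlation: %.5f\n", pearson(freq_a, freq_b));
    return 0;
}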