wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data

Background: Analysing whole genome bisulfite sequencing datasets is a data-intensive task that requires comprehensive and reproducible workflows to generate valid results. While many algorithms have been developed for tasks such as alignment, comprehensive end-to-end pipelines are still sparse. Furthermore, previous pipelines lack features or show technical deficiencies, thus impeding analyses.
Results: We developed wg-blimp (whole genome bisulfite sequencing methylation analysis pipeline) as an end-to-end pipeline to ease whole genome bisulfite sequencing data analysis. It integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input. Comparing wg-blimp to previous end-to-end pipelines reveals similar setups for common sequence processing tasks, but shows differences for post-alignment analyses. We improve on previous pipelines by providing a more comprehensive analysis workflow as well as an interactive user interface. To demonstrate wg-blimp's ability to produce correct results we used it to call differentially methylated regions for two publicly available datasets. We were able to replicate 112 of 114 previously published regions, and found results to be consistent with previous findings. We further applied wg-blimp to a publicly available sample of embryonic stem cells to showcase methylome segmentation. As expected, unmethylated regions were in close proximity to transcription start sites. Segmentation results were consistent with previous analyses, despite different reference genomes and sequencing techniques.
Conclusions: wg-blimp provides a comprehensive analysis pipeline for whole genome bisulfite sequencing data as well as a user interface for simplified result inspection. We demonstrated its applicability by analysing multiple publicly available datasets.
Thus, wg-blimp is a relevant alternative to previous analysis pipelines and may facilitate future epigenetic research.


Shiny interface
We implemented a web application to make it easier to share our WGBS analysis results with researchers across multiple institutes and to simplify access to BAM files for inspection in the Integrative Genomics Viewer (IGV). We chose the R Shiny framework as it conveniently provides the necessary functionality through R, such as loading and handling tables with the data.table package and creating plots with ggplot2.
The wg-blimp user interface consists of five separate tabs: dataset selection, quality control statistics, pipeline parameters, an overview of called DMRs, and UMR/LMR/PMD segmentation.

Dataset selection
All WGBS datasets analysed by the wg-blimp analysis pipeline may be loaded into the Shiny user interface for sharing with other researchers. Figure S1 shows the corresponding part of the interface.

QC statistics
The wg-blimp workflow performs quality control checks using FastQC, Picard, Qualimap and MultiQC, and gathers methylation reports from MethylDackel. The Shiny interface provides a brief overview of the gathered metrics and links to more detailed MultiQC and Qualimap reports. Links to alignment data for use with the IGV are also provided here. Figure S2 shows the statistics tab of the interface.

Pipeline parameters
Results generated by bioinformatics pipelines commonly depend on the parameters in use.
We provide the config file along with the DMR analysis results to enable a more comprehensive discussion of results. Figure S3 depicts the parameter overview tab.

DMRs
A tab to inspect DMRs retrieved by the calling pipeline is also provided. DMRs may be filtered by:
• Intersection with CG islands
Figure S4 gives an overview of the DMR tab. Filters are applied immediately and updated reactively through Shiny.
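As an illustration, the island filter reduces to a genomic interval intersection test. The following Python sketch uses hypothetical interval tuples and coordinates; wg-blimp itself applies this filter reactively in R/Shiny:

```python
# Hypothetical sketch of the CpG-island filter from the DMR tab.
# Intervals are (chromosome, start, end) tuples with made-up coordinates.

def overlaps(a_start, a_end, b_start, b_end):
    """Two half-open genomic intervals overlap iff each starts before the other ends."""
    return a_start < b_end and b_start < a_end

def filter_by_cg_islands(dmrs, islands):
    """Keep only DMRs intersecting at least one CpG island on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in islands:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        dmr for dmr in dmrs
        if any(overlaps(dmr[1], dmr[2], s, e) for s, e in by_chrom.get(dmr[0], []))
    ]

dmrs = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 80)]
islands = [("chr1", 250, 500), ("chr2", 100, 200)]
print(filter_by_cg_islands(dmrs, islands))  # [('chr1', 100, 300)]
```

Grouping islands by chromosome first avoids comparing intervals across chromosomes; for large island sets an interval tree would replace the linear scan.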

Segmentation
wg-blimp's MethylSeekR integration preemptively computes UMRs/LMRs both with and without consideration of PMDs. α distributions are displayed to simplify the decision of whether the methylome at hand contains PMDs: if the distribution appears bimodal, one may assume PMDs to be present. The Shiny interface also provides information on the relation between the number of CpGs and the median methylation in each segment, as depicted in Figure 2. UMRs are usually longer than LMRs, which should be distinguishable in the UMR/LMR heatmap. A plot displaying information on the FDR and methylation cutoffs is also provided for quality control.
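The UMR/LMR distinction rests on the number of CpGs per hypomethylated segment. A minimal sketch, assuming a hypothetical CpG-count cutoff and made-up segment data (MethylSeekR derives its actual cutoff from the data at hand):

```python
# Illustrative UMR/LMR labelling of hypomethylated segments by CpG count.
# The cutoff of 30 CpGs and the segments are assumptions for this sketch,
# not MethylSeekR's derived values.

def classify_segments(segments, cpg_cutoff=30):
    """Label each (chrom, start, end, n_cpg) segment as UMR (CpG-rich) or LMR (CpG-poor)."""
    return [
        (chrom, start, end, "UMR" if n_cpg >= cpg_cutoff else "LMR")
        for chrom, start, end, n_cpg in segments
    ]

segments = [("chr1", 0, 5000, 120), ("chr1", 9000, 9800, 12)]
print(classify_segments(segments))
# [('chr1', 0, 5000, 'UMR'), ('chr1', 9000, 9800, 'LMR')]
```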

Analysis of public WGBS datasets using wg-blimp
We first applied wg-blimp to a published dataset of two isogenic pairs of human monocyte and macrophage samples. Details are described in the original publication (EGA accession EGAS00001001595) [1]. We also used wg-blimp to analyse a blood and sperm dataset to find methylation differences between sperm and somatic cells. This dataset is also publicly available (ENA accession PRJEB28044); pooled sperm and blood DNA (n=6 each) was extracted from healthy young (age 18-24) and old (age 61-71) men, yielding four WGBS samples [2]. For both datasets 1% lambda phage DNA was added for later quality control, and sequencing was performed on an Illumina HiSeq 2500. Protocols for the H1 ESC sample are described in the original publication (NCBI SRA accession SRR3274347) [3]. Table S1 shows the QC parameters as displayed in wg-blimp's Shiny interface for each of the datasets.

Quality control
Coverage profiles are ≥ 19× and the mean lambda phage methylation indicates successful bisulfite conversion for the monocyte/macrophage and blood/sperm datasets. Note that lambda phage conversion rates are not available for the H1 ESC sample, as reads assigned to the lambda genome were discarded during upload to the read repository and are thus not included in our computation. Since no QC parameter suggests errors, all samples were used for further analysis.
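The spike-in check itself is a simple ratio: lambda DNA is unmethylated, so the fraction of lambda cytosines read as converted estimates the bisulfite conversion rate. A sketch with invented counts for illustration:

```python
# Hypothetical bisulfite conversion check on the lambda phage spike-in.
# Counts are made up; in practice they come from methylation calls on
# reads aligned to the lambda genome.

def conversion_rate(converted_c, total_c):
    """Conversion rate = converted (unmethylated) lambda Cs / all lambda Cs."""
    return converted_c / total_c

rate = conversion_rate(converted_c=995_210, total_c=1_000_000)
print(f"{rate:.2%}")  # 99.52%
```

Equivalently, a high mean lambda methylation would signal incomplete conversion, since any apparent methylation on the unmethylated spike-in reflects unconverted cytosines.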

DMRs
The DMR algorithms used are expected to yield low numbers of calls for the monocyte/macrophage dataset and high numbers for the blood/sperm comparison. Filtering can easily be performed through wg-blimp's user interface. We also performed the monocyte/macrophage analysis using hg19 as reference to ensure the reference genome is not the primary cause of the difference in DMR calls. As expected, wg-blimp still yields a much higher number of calls: for hg19 a total of 6 691 DMRs are reported, with >90% of hg38 calls overlapping wg-blimp's lifted-over hg19 calls. Of these 6 691 DMRs, bsseq accounts for 29, 18 of which overlap the original 114 calls. The remaining difference between our 29 bsseq DMRs and the original 114 BSmooth calls can be attributed to differences in alignment, methylation calling, bsseq configuration, and further development of the bsseq software itself.
Hence, the difference in the number of DMRs called can be explained by the configuration of the DMR callers: bsseq performs strict calling, whereas metilene and camel produce more, albeit less precise, DMRs. This comparison underlines the importance of tool and parameter choices for DMR calling, as they drastically impact analysis results.

Segmentation
We focus on segmenting the H1 ESC methylome because it was previously analysed by the authors of the original MethylSeekR publication [4]. α distributions are similar across chromosomes, so MethylSeekR infers the distribution from only a single chromosome. α smaller than 1 represents a polarized distribution favoring low and high methylation; α equal to or greater than 1 suggests a uniform distribution. If an α distribution has a bimodal shape, we assume the presence of PMDs. Since the α distribution appears to be unimodal (Figure S5), we assume the H1 methylome contains no PMDs. This is consistent with the analysis of the original MethylSeekR authors, where H1 cells showed a similar α distribution whereas, among others, fetal lung fibroblasts showed a clearly distinguishable bimodal α distribution [4]. Figure S6 shows the intersection of LMRs/UMRs with exons and promoters. We here define promoters as the ranges [TSS − 1000, TSS + 1000] around transcription start sites (TSSs). As expected, 82.32% of UMRs intersect promoters, while only 12.29% of LMRs do. This is also consistent with the original MethylSeekR findings [4].
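The promoter-intersection figures reduce to a windowing and overlap computation. The following sketch uses made-up TSS positions and segments rather than the H1 annotation:

```python
# Illustrative promoter-overlap computation. TSS positions and segments
# are invented; real input would be a genome annotation and the
# MethylSeekR UMR/LMR calls.

def promoter_windows(tss_list, flank=1000):
    """Expand each (chrom, tss) into the promoter range [TSS - flank, TSS + flank]."""
    return [(chrom, tss - flank, tss + flank) for chrom, tss in tss_list]

def fraction_intersecting(segments, windows):
    """Fraction of (chrom, start, end) segments hitting at least one window."""
    def hits(chrom, start, end):
        return any(c == chrom and start < we and ws < end for c, ws, we in windows)
    n = sum(1 for chrom, start, end in segments if hits(chrom, start, end))
    return n / len(segments)

tss = [("chr1", 5000), ("chr1", 20000)]
umrs = [("chr1", 4500, 5500), ("chr1", 19900, 20050),
        ("chr1", 40000, 41000), ("chr1", 80000, 80500)]
print(fraction_intersecting(umrs, promoter_windows(tss)))  # 0.5
```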

Computational requirements
wg-blimp logs run times and memory requirements for each step. The resulting plots from the monocyte/macrophage and blood/sperm analyses are displayed in Figures S7 to S12.
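Reading such a per-step log can be sketched as follows, assuming a tab-separated benchmark file with `s` (wall-clock seconds) and `max_rss` (peak memory in MB) columns in the style written by Snakemake's `benchmark` directive; the example record is made up:

```python
# Sketch of parsing a per-step benchmark record (assumed Snakemake-style TSV).
import csv
import io

def read_benchmark(tsv_text):
    """Return (wall_clock_seconds, max_rss_mb) from the first benchmark record."""
    row = next(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return float(row["s"]), float(row["max_rss"])

example = "s\tmax_rss\n13542.7\t20514.3\n"
seconds, rss_mb = read_benchmark(example)
print(seconds / 3600, rss_mb / 1024)  # run time in hours, peak memory in GB
```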

Parameters
This section shows the configuration files and environment used for the analyses.

Run time benchmark of published WGBS pipelines
We applied the pipelines listed in Table 1 to the blood/sperm dataset (ENA accession PRJEB28044) to compare run times across different workflows. For each workflow we followed the guidelines given by the respective authors and used a setup as close to the default settings as possible. As each pipeline has a unique setup, we provide a detailed description of the setup used for each analysis.
All analyses (except the CpG Me workflow, see subsection 3.3 for details) were executed on servers equipped with two Intel Xeon E5-2695 v4 CPUs, 528 GB of memory and Debian 9 as operating system (OS). For each workflow we assigned a maximum of 64 cores and limited alignment jobs to a maximum of 16 cores where possible. This speeds up execution because all four samples can be processed concurrently, minimizing the run time lost to the sequential portions of alignment programs. We ensured each pipeline had exclusive access to server resources to prevent other processes from interfering with benchmark results. In addition, all computations were performed on a partition excluded from backups to prevent our backup processes from slowing down pipeline runs. All pipelines (except Methy-Pipe, see subsection 3.4 for details) used hg38 as reference. We excluded installation and reference index generation from the run times, as these operations only need to be executed once before potentially analysing multiple datasets.
While we encountered technical issues with several pipelines, we still report each pipeline's run time. To rule out technical issues originating from our misuse of the respective pipelines, we attempted to be as exhaustive as possible when searching for the causes of these errors. Furthermore, some of these issues may already be known to the developers and be fixed in the near future, as several pipelines are under active development. Nevertheless, we excluded BAT from the analysis because it processed only a small fraction of the raw reads (see subsection 3.1 for details). We also excluded the ENCODE pipeline because we have no adequate control over the benchmark environment hosted at DNAnexus.
We note that such a benchmark has several caveats and is only intended to provide information about the order of magnitude of each pipeline's run time for a WGBS dataset: The technical issues prevent getting exact run times and may result in underestimation of pipeline run times in cases of failure. Also, such a comparison may disfavour feature-rich pipelines, as more features included in a workflow will inherently increase its run time.
Similarly, tools may not primarily be selected for their performance, but for their stability and precision. Figure S13 shows the resulting run times. Note that wg-blimp was developed to be run on our local infrastructure and has an inherent performance advantage as a result.
We conclude that all pipelines show adequate run times for the blood/sperm dataset, with wg-blimp performing similarly to other bwa-meth based workflows. However, we note that for higher sample counts (hundreds or even thousands of samples) we would prefer one of the workflows based on bwa-meth (Nextflow methylseq (bwa-meth), snakePipes or wg-blimp) to minimize processing times.
The following sections describe configurations and scripts used for each pipeline run.
A detailed description of wg-blimp's configuration for this dataset can be found in subsection 2.5.

BAT configuration
We executed BAT using the Docker container christianbioinf/bat (digest 51f4d7aa7104) and the following script: segemehl processed only a small fraction of the raw reads before completion. We speculate that segemehl has an upper limit on the number of reads it can process in a single run. While this causes no issues when many input files with low read counts are provided, it discards most of the reads when fed single files with high read counts. As this issue drastically distorts BAT's run time for the blood/sperm dataset, we omit it from the comparison.

bicycle configuration
The following script was used to run bicycle: Inspecting bicycle's logs revealed multiple segmentation faults during the analysis.
In total, bicycle threads produced 264 log files for alignment (66 per sample). 33 of these log files reported core dumps, resulting in an expected 12.5% data loss. To ensure we were not simply using bicycle incorrectly, we restarted the script above using only 2 threads instead of 64. This run did not cause any core dumps, so we suspect bicycle's multithreading implementation to have caused the issue.

CpG Me configuration
For CpG Me/DMRichR we used the code from the official GitHub repository (https://github.com/benlaufer/CpG_Me) at git tag 1.2. A distinct difference between CpG Me and the other pipelines is its reliance on a Slurm execution system. As a result, we could not use our local server infrastructure for the benchmark, but instead used the PALMA II HPC system of the University of Münster. We note that the resources requested by CpG Me are similar to the requirements of the other pipelines described here, and we estimate run times to be comparable as a result. While run time measurements may be inaccurate when job submission queues are full, we observed only ∼50% of the 504 nodes being busy during the benchmark using the cluster's Ganglia monitoring system. Since CpG Me requires only fastq files following strict naming patterns as input, no explicit configuration is needed and none is reported here. We invoked the pipeline using the command: We could not execute DMRichR, as two methylation reports resulted in core dumps and DMRichR requires at least two samples per group: . . . We are unsure why this error occurred, as debugging applications within Slurm is more cumbersome than on our local infrastructure. For potential debugging purposes, we report the conda configuration used by the Slurm jobs:

Methy-Pipe configuration
We manually installed perl 5.26.2 and R 3.6.2 within a conda installation to run Methy-Pipe.  Please note that we intentionally set the maximum run time per job to 2 400 hours to prevent pipeline timeouts from extending run time.

Nextflow methylseq (bwa-meth) configuration
The configuration for this workflow is similar to the Bismark configuration above: We note that this workflow in its current version fails to produce methylation calls for all samples. However, this is a known issue the developers are already working on; see https://github.com/nf-core/methylseq/issues/139.

PiGx configuration
For the analysis using PiGx we used the Docker container bimsbbioinfo/pigx:publication (digest 987e466930d3) and the following configuration: We note that PiGx's differential methylation analysis took ∼22 hours to complete. We suspect its multithreading does not work as intended: when assigning 64 cores to the differential methylation analysis, excessive amounts of RAM were used, resulting in the OS terminating the process. However, even with 8 cores each fork used only ∼13% of a single core's resources, meaning the multithreaded version of this analysis does not effectively utilize multiple cores.

snakePipes configuration
For snakePipes we set up a conda environment containing the following packages: We note that snakePipes did not successfully perform DMR calling with our configuration. While no explicit error was thrown, the DMR calling process appeared to be frozen, as strace did not report any activity. We waited unsuccessfully for >450 hours for the process to complete and speculate that some conda dependencies pulled in wrong system libraries, causing processes to hang. As this issue might also affect wg-blimp in the future, we provide a Docker container as a backup in case of issues with the Bioconda installation.