Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2

Background Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be somehow compressed, due to its significant size. Methods As far as data structures employed to align bisulfite-treated reads are concerned, solutions proposed in the literature can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast and memory consuming. The latter are much slower and light. In this paper, we try to close this gap proposing a data structure for aligning bisulfite-treated reads which is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas|the aligner ERNE-BS5 2 and the caller ERNE-METH 2|have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols. Results Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter. Conclusions The new functionalities of our bisulfite aligner and caller make it a fast and memory efficient tool, useful to analyze big datasets with little computational resources, to easily process target enrichment data, and produce statistics such as protocol efficiency and coverage as a function of the distance from target regions. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0910-3) contains supplementary material, which is available to authorized users.


Creation of the indexes
Bismark + Bowtie 1 In the following, the variable path to genome folder stores the path containing the original fasta file.

Reads simulation
To simulate the reads, we used SimSeq (https://github.com/jstjohn/SimSeq) in combination to custom scripts that can be downloaded at github.com/nicolaprezza/test-bs-aligner. The main script of this pipeline is pipeline bismark and erne.sh. The pipeline simulates a methylation experiment, executes the erne/bismark aligners and methylation callers, and counts the number of correctly called methylations and total number of called methylations. To modify parameters of the simulation (e.g. number of reads, error rate, ecc...), edit testcase-bs-aligner-default.sh. The script requires the following programs to be pre-installed: -SimSeq (github.com/jstjohn/SimSeq) installed in ∼/workspace/SimSeq/ samtools (http://www.htslib.org/) fastx-mutate-tools (https://github.com/nicolaprezza/fastx-mutate-tools) installed in ∼/workspace/fastx-mutate-tools/ realpath fastx-toolkit (http://hannonlab.cshl.edu/fastx toolkit/) gawk, awk erne (http://erne.sourceforge.net/) bismark (www.bioinformatics.babraham.ac.uk/projects/bismark/) The workflow was designed as follows: firstly, we introduced 0.5% of SNPs at random positions in the input genome, producing a snps.fasta file. This file was further mutated generating two bisulfite-converted files, C to T.fasta and G to A.fasta, where we uniformly substituted with probability 0.5 Cs into Ts (methylations on forward strand) and Gs into As (methylations on reverse strand), respectively. Together with these two files, we generated a bed file containing the corresponding methylation values (0 or 1) for each cytosine in the genome.
Let N be the total number of simulated read pairs. We used SimSeq to simulate 0.02 · N read pairs with sequencing errors using snps.fasta as reference file. SimSeq was executed using the built-in Illumina error model. These reads, representing BS-conversion failures, were saved in two files query1.fastq and query2.fastq. In order to generate bisulfite-converted reads, we used SimSeq to simulate further 0.98·N pairs from the reference C to T.fasta and 0.98·N pairs from the reference G to A.fasta, being careful to select from the fist batch only pairs having the first/second read on forward/reverse strand, respectively, and from the second batch only pairs having the first/second read on reverse/forward strand, respectively. Pairs extracted from the second batch had to be further swapped (i.e. we exchanged the role of read 1 and read 2 in each pair) in order to produce a valid directional dataset. All such simulated pairs were finally appended to the end of the files query1.fastq and query2.fastq. To conclude, we introduced indels in the files query1.fastq and query2.fastq, using 0.0003 and 0.8 as open and extend probability, respectively (this corresponds at inserting 3 indels every 10000 base pairs with average indel length equal to 5). After the simulation, query1.fastq and query2.fastq were quality-trimmed using ERNE-FILTER.

Default parameters
In this section we don't write the parameters for which we used the default values. See the main text for more details on the default values.
ERNE-BS5 2 In the following, ref.ebm is the reference file created with erne-create (see above).
erne-bs5 --reference ref.ebm \ --query1 query1.fastq \ --query2 query2.fastq \ --output alignment.bam \ --threads 2 \ --no-auto-trim Bismark + Bowtie 1 In the following, the variable path to genome folder stores the path containing the original fasta file. The variables output dir and temp dir store the path to the output directory and to a directory for the temporary files, respectively.
bismark $path_to_genome_folder \ -1 query1.fastq \ -2 query1.fastq \ -o $output_dir \ --temp_dir $temp_dir Bismark + Bowtie 2 In the following, the variable path to genome folder stores the path containing the original fasta file. The variables output dir and temp dir store the path to the output directory and to a directory for the temporary files, respectively.

Common parameters (no seed errors)
In this section we don't write the parameters for which we used the default values. See the main text for more details on the default values.
ERNE-BS5 2 In the following, ref.ebm is the reference file created with erne-create (see above).
erne-bs5 --reference ref.ebm \ --query1 query1.fastq \ --query2 query2.fastq \ --output alignment.bam \ --threads 2 \ --no-auto-trim \ --seed-errors 0 Bismark + Bowtie 1 In the following, the variable path to genome folder stores the path containing the original fasta file. The variables output dir and temp dir store the path to the output directory and to a directory for the temporary files, respectively.
bismark $path_to_genome_folder \ -1 query1.fastq \ -2 query1.fastq \ -o $output_dir \ --temp_dir $temp_dir \ -n 0 Bismark + Bowtie 2 In the following, the variable path to genome folder stores the path containing the original fasta file. The variables output dir and temp dir store the path to the output directory and to a directory for the temporary files, respectively. In the following, the variable path to genome folder stores the path containing the original fasta file. The variable output dir stores the path to the output directory. Notice that with option --gzip some of the output files are compressed, but the methylation annotations (.cov file) are output in an uncompressed format. File alignment.bam is the alignment output by bismark.

ERNE-METH
In the following, alignment.bam is the alignment output by erne-bs5, and genome.fasta is the same genome used to build the index with erne-create.
The command produces methylation annotations in erne format and statistics files, all having as prefix the content of the variable out prefix. With the option --compress gz, erne-meth produces compressed methylation annotations.