Je is implemented in Java 7 and uses the htsjdk (http://samtools.github.io/htsjdk/) and picard [8] libraries. Je has been designed with extensibility in mind with each sub-module (demultiplex, demultiplex-illu, clip or markdupes) encapsulated in its own package. This is reflected on the command line level where the command to run (demultiplex, demultiplex-illu, clip or markdupes) should be specified right after the je executable followed by relevant module’s options e.g. je demultiplex < options>, where < options > is the option list. The top level class Je.java is responsible to parse this command line and invoke the appropriate sub-module’s class (for example Jeclipper.java in the jeclipper package) with user’s provided options. The sub-module class is then responsible to validate user’s options before computing.
The demultiplex command
The demultiplex command is used when the sample-encoding barcode is found at the beginning of the read (Fig. 1a, right). It can deal with SE and PE reads having barcodes in one or both reads, with or without UMIs (Additional file 1: Supplementary Text). This includes situations where barcodes contain degenerate positions (like in the individual-nucleotide resolution Cross-Linking and ImmunoPrecipitation (iCLIP) protocol), are combined with UMIs into composite barcodes (Fig. 1d, bottom) or found in different reads (e.g. sample-encoding barcode in read_1 and UMIs in read_2, Fig. 1c). Je’s demultiplex module offers many options to tune sample identification stringency (e.g. mismatch number, barcode combination), read processing (e.g. trimming, clipping) and output format (gzip compression, md5 checksum generation). In all situations that include UMIs (or degenerate barcodes), demultiplex output is fully compatible with Je’s markdupes command.
The demultiplex-illu command
The demultiplex-illu command is used when sample-encoding barcodes are provided in separate fastq file(s) and UMIs are found at the beginning of the read(s). While CASAVA’s bcl2fastq2 tool is usually used to convert bcl files to fastq files and perform demultiplexing at the same time; it can also generate non-demultiplexed fastq files together with associated fastq index files (Fig. 1a, left). This alternative proves useful when debugging new protocols that use the index position for other purposes than sample encoding; or to overcome bcl2fastq2 barcode matching limitations (e.g. only allows up to two mismatches). Je’s demultiplex-illu module offers the same options as the demultiplex module and its output is fully compatible with Je’s markdupes command.
The clip command
The clip command is used to extract UMIs from fastq files that do not require sample demultiplexing at the same time. Similarly to demultiplex and demultiplex-illu commands, extracted UMIs are added to the read headers (as expected by markdupes) and read headers are reformatted to fulfill read mappers requirements (most read mappers expect headers for read_1 and read_2 to be strictly identical). The clip module offers identical read processing (e.g. trimming, clipping) and output formatting options as the demultiplexing modules.
The markdupes command
The markdupes command extends the popular Picard’s MarkDuplicates tool [8] by adding support for UMIs embedded in read headers (as generated by the demultiplex, demultiplex-illu or clip commands). This module takes mapped reads as input (in SAM/BAM format) and identifies PCR (and optical) read duplicates based on their mapping positions and UMIs. In short, reads identified as duplicates based on their mapping locations are further regrouped based on their UMIs (Additional file 1: Supplementary Text). All reads of a UMI group are declared duplicates but one (according to the chosen scoring strategy). Finally, duplicate reads are either discarded or included in output (with bitwise flag 1024). Je’s markdupes module supports random UMIs (any combination of a k-mer can occur) or runs with a predefined list of UMIs (as in e.g. NEXTflex™ kit from Bioo Scientific). In both situations, different options (in addition to all native Picard’s MarkDuplicates options) are offered to tune UMI comparison stringency like the number of mismatches to still consider two UMIs identical, or how to handle Ns found in UMIs.
Galaxy integration
A wrapper for integration in Galaxy [9] was written for each Je sub-module following Galaxy guidelines and best practices. All wrappers (and Je code) were uploaded to the Galaxy toolshed [10] as a repository suite, enabling Galaxy administrators to either install each sub-module separately or together as a suite.