Overview
Pipeline execution begins with a preliminary assessment of read quality and other quality metrics via FastQC 0.11.8 [15]. Reads are then optionally trimmed using Trimmomatic 0.39 [46], and a post-trimming quality assessment is performed, again with FastQC. Alignment to a reference genome is performed with HISAT2 2.1.0 [37], along with pseudoalignment to the transcriptome with kallisto 0.46.1 [38] or Salmon 1.2.1 [39]. A combination of regtools 0.5.1 [40] and featureCounts (Subread 2.0.0) [14] is used to quantify genes, exons, and exon-exon junctions. At the same time, expressed regions (ERs) are optionally computed with the Bioconductor [29] R package derfinder [41]. The result is a RangedSummarizedExperiment [29] object with counts information, RData files with ER information, and plots visualizing the associated data. Variant calling is also performed for human samples, using bcftools 1.10.2 [44] to produce a VCF file [51] for the experiment. SPEAQeasy is flexible and allows newer versions of these software tools to be used in place of the ones listed above.
Configuration
Usage of SPEAQeasy involves configuring two files: the “main” script and a configuration file. The “main” script contains the command which runs the pipeline, along with options specific to the input data and fundamental choices about how the pipeline should behave. In this script, the researcher must specify whether reads are paired-end or single-end, the reference species/genome (i.e. hg38, hg19, mm10, or rn6), and the strandness pattern expected across all samples (e.g. “reverse”). Strandness is automatically inferred using pseudoalignment rates with kallisto [38], and the pipeline can be configured either to halt upon any disagreement between the asserted and inferred strandness, or simply to warn and continue. In particular, we perform pseudoalignment to the reference transcriptome using a subset of reads from each sample, trying both the rf-stranded and fr-stranded command-line options accepted by kallisto. The number of successfully aligned reads for each option is used to deduce the actual strandness of each sample. For example, an approximately equal number (40–60%) of aligned reads for each option suggests the reads lack strand-specificity and are thus “unstranded”; a large enough fold-difference between the two indicates either “reverse” or “forward” strandness. Specifically, greater than 80% of total aligned reads must have aligned using the rf-stranded option to deduce that a sample is “reverse”-stranded, and less than 20% to infer “forward” strandness. We have found these cutoffs to reliably identify inaccurate --strand specification from the user, while not being so strict as to mistakenly disagree with correct specification. Another command option in the “main” script controls whether to trim samples based on adapter content metrics from FastQC [15], trim all samples, or not perform trimming at all.
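The cutoffs described above can be sketched as a small shell function (a simplified sketch of the stated rule, not SPEAQeasy's actual code; the handling of intermediate percentages, reported here as "ambiguous", is an assumption):

```shell
# Classify strandness from the percentage of pseudoaligned reads that
# aligned under kallisto's rf-stranded option, per the cutoffs above.
classify_strand() {
  pct_rf=$1                        # integer percentage, 0-100
  if   [ "$pct_rf" -gt 80 ]; then echo "reverse"
  elif [ "$pct_rf" -lt 20 ]; then echo "forward"
  elif [ "$pct_rf" -ge 40 ] && [ "$pct_rf" -le 60 ]; then echo "unstranded"
  else echo "ambiguous"            # 20-40% or 60-80%: no clear call (assumption)
  fi
}

classify_strand 92   # reverse
classify_strand 50   # unstranded
```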
The configuration file allows fine-tuning of pipeline settings and of the hardware resource demands for each pipeline component. Ease of use is a core focus of SPEAQeasy, and configuration files for SLURM, SGE, and local Linux environments are pre-built with sensible defaults. The user is not required to modify the configuration file at all to run SPEAQeasy appropriately; however, a great degree of control and customization exists for those users who desire it. Advanced users can tweak simple configuration variables to pass arbitrary command-line arguments directly to each of the software tools invoked by SPEAQeasy. For example, when creating wiggle coverage files from BAM alignment files, the default is to normalize counts to 40 million mapped reads of 100 base pairs. This is achieved by the default value of the following variable in each configuration file:
bam2wig_args = "-t 4000000000"
Suppose a researcher were instead interested in normalizing to 40 million mapped reads of 150 base pairs, and also wanted to skip multiple-hit reads. The above variable could be adjusted to pass the appropriate command arguments to bam2wig.py [58]:
bam2wig_args = "-t 6000000000 -u"
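The value passed to -t is simply the target total coverage in bases, i.e. the number of reads times the read length, so the two settings above follow directly:

```shell
# bam2wig.py's -t target is reads x read length
echo $((40000000 * 100))   # 4000000000 (default: 40M reads of 100 bp)
echo $((40000000 * 150))   # 6000000000 (40M reads of 150 bp)
```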
The same procedure can be used to fine-tune any other software tool used in SPEAQeasy, allowing a level of control similar to directly running each step. At the same time, settings related to variables such as strandness, possible pairing of reads, and file naming choices are automatically accounted for.
Inputs
A single file, called samples.manifest, is used to point SPEAQeasy to the input FASTQ files and to associate samples with particular IDs. It is a table saved as a tab-delimited text file, containing the path to each read (or pair of reads), optional MD5 sums, and a sample ID. Sample IDs can be repeated, which allows samples initially split across multiple files to be merged automatically (Fig. 5). Input files must be in FASTQ format, with “.fq” or “.fastq” extensions supported, optionally with the additional “.gz” extension for gzip-compressed files.
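For illustration, a hypothetical samples.manifest for one paired-end sample split across two pairs of FASTQ files might be written as follows (the paths, the column ordering, and the use of 0 as a placeholder where MD5 sums are omitted are assumptions for this example; the repeated sample ID signals that the files should be merged):

```shell
# Write an example samples.manifest: for paired-end data, each row holds
# read-1 path, read-1 MD5, read-2 path, read-2 MD5, and the sample ID.
printf '%s\t%s\t%s\t%s\t%s\n' \
  /data/run1/sampleA_R1.fastq.gz 0 /data/run1/sampleA_R2.fastq.gz 0 sampleA \
  /data/run2/sampleA_R1.fastq.gz 0 /data/run2/sampleA_R2.fastq.gz 0 sampleA \
  > samples.manifest

cat samples.manifest
```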
Outputs
SPEAQeasy produces several output files, some of which are produced by the processing tools themselves (Additional file 3: Table S2) and others by SPEAQeasy for facilitating downstream analyses (Fig. 2). The main SPEAQeasy output files, relative to the specified --output directory, are:
- Under the count_objects/ directory, rse_gene_[experiment_name].Rdata, rse_exon_[experiment_name].Rdata, rse_jx_[experiment_name].Rdata, and rse_tx_[experiment_name].Rdata: these are RangedSummarizedExperiment objects [29] that contain the raw expression counts (gene and exon: featureCounts; exon-exon junctions: regtools; transcript: either kallisto or Salmon counts), the quality metrics as the sample phenotype data (Additional file 3: Table S3), and the expression feature information, which depends on the reference genome used.
- Under the merged_variants/ directory for human samples, mergedVariants.vcf.gz: this is a Variant Call Format (VCF) file [51] with information for 740 common variants that can be used to identify sample swaps. For example, if two or more brain regions were sequenced from a given donor, the inferred genotypes at these variants can be used to verify that samples are correctly grouped. If external DNA genotype information exists from a DNA genotyping chip, one can then verify that the RNA sample indeed matches the expected donor, ensuring that downstream expression quantitative trait locus (eQTL) analyses will use correctly paired RNA and DNA data.
- Under the coverage/bigWigs/ directory when SPEAQeasy is run with the --coverage option, [sample_name].bw for unstranded samples, or [sample_name].forward.bw and [sample_name].reverse.bw for stranded samples: these are base-pair coverage bigWig files standardized to 40 million 100 base-pair reads per bigWig file. They can be used for identification of expressed regions in an annotation-agnostic way [41], for quantification of regions associated with degradation such as in the qSVA algorithm [59], and for visualization on a genome browser [60], among other uses.
Software management
SPEAQeasy provides two options for managing software dependencies. If docker [61] is available on the system on which the user intends to run the pipeline, software can be managed in a truly reproducible and effortless manner. As a pipeline based on Nextflow, SPEAQeasy can isolate individual components of the workflow, called processes, inside docker containers. Containers describe the entire environment and set of software versions required to run a pipeline command (such as hisat2-align), eliminating common problems that can occur when a set of software tools (such as SPEAQeasy) is installed on a system different from the one on which it was developed. Each docker image is pulled automatically at runtime if not already downloaded (i.e. on the first pipeline run); otherwise, the locally downloaded image is used.
Because docker is not always available, and its permissions are not always trivial to configure, Linux users may alternatively install the software dependencies locally. From within the repository directory, the user would run the command:
bash install_software.sh "local"
This installs each software utility from source, where available, and as a pre-compiled binary otherwise. Because installation is performed within a subdirectory of the repository, the user need not have root access for the majority of tools. However, we require that Java and Python3 be globally installed. The motivation for this requirement is that we expect most users to have these tools already installed globally, and local installation of these tools is generally advised against because of potential conflicts with other installations on the system.
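Before running the local install, the two globally-required tools can be checked with a quick pre-flight sketch (this check is illustrative and not part of SPEAQeasy itself):

```shell
# Confirm java and python3 are on the PATH before the local install
for tool in java python3; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT found - install globally before proceeding"
  fi
done
```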
Though docker and local software installation are the officially supported and recommended methods for managing software, other alternatives exist for interested users. SPEAQeasy includes a file called conf/command_paths_long.config, containing the full paths to each software utility called during pipeline execution. In this file, users can substitute the paths to already-installed versions of any utility. Those familiar with Lmod environment modules [50] can also trivially specify, in their configuration file, module names to use for a particular SPEAQeasy process. However, this tends to be a viable option only for those with a diverse set of bioinformatics modules already installed.
Annotation
SPEAQeasy is intended to be highly flexible with annotation and reference files. By default, annotation files (the reference genome, reference transcriptome, and transcript annotation) are pulled from GENCODE [62] for human and mouse samples, or from Ensembl [63] for rat samples. The choice of species is controlled by the command flag “--reference” in the “main” script, which can hold the values “hg38”, “hg19”, “mm10”, or “rn6”. In the configuration file, simple variables control the GENCODE release or Ensembl version to be used. When a pipeline run is executed, SPEAQeasy checks whether the specified annotation files have already been downloaded; if so, the download is not performed again for the current or future runs. This reflects a general feature of SPEAQeasy, provided by its Nextflow base: processes are never “repeated” if their outputs already exist. The outputs are simply cached and the associated processes are skipped.
SPEAQeasy also offers easy control over the particular sequences included in the analysis, a feature we have not seen in other publicly available RNA-seq pipelines utilizing databases such as GENCODE or Ensembl. In particular, researchers are sometimes only interested in alignments and results associated with the canonical reference chromosomes (e.g. chr1-chr22, chrX, chrY, and chrMT for Homo sapiens). Alternatively, extra contigs (sequences commonly beginning with “GL” or “KI”) are sometimes a desired part of the analysis as well. RNA-seq workflows commonly overlook subtle disagreement between the sequences aligned against and the sequences included in downstream analysis. SPEAQeasy provides a single configuration variable, called anno_build, to avoid this issue and capture the majority of use cases. Setting the variable to “main” uses only the canonical reference sequences for the entire pipeline; a value of “primary” also includes the additional contigs present in GENCODE [62] annotation files carrying the “primary” designation in their names (e.g. GRCh38.primary_assembly.genome.fa).
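As a sketch, the corresponding setting is a one-line change in the configuration file (its exact placement within the file may vary between the pre-built configurations):

```groovy
// Restrict the entire pipeline to the canonical reference sequences
anno_build = "main"

// ...or additionally include "primary"-designated contigs:
// anno_build = "primary"
```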
Users are not limited to GENCODE/Ensembl annotation, however. Instead, one can optionally point to a directory containing the required annotation files with the main command option “--annotation [directory path]”. To specify that this directory contains custom annotation files, rather than being the location in which to place GENCODE/Ensembl files, one uses the option “--custom_anno [label]”. The label associates internally produced files with a name for the particular annotation used. The required annotation files include a genome assembly FASTA, a reference transcriptome FASTA, and a transcriptome annotation GTF. For human samples, a list of sites in VCF format [51] at which to call variants is also required. Finally, if ERCC quantification is to be performed, an ERCC index for kallisto must be provided [45].
Use cases
We expect that the majority of users will have access to cloud computing resources or a local computing cluster, managing computational resources across a potentially large set of members with a scheduler such as Simple Linux Utility for Resource Management (SLURM) or Sun Grid Engine / Son of Grid Engine (SGE). However, SPEAQeasy can also be run locally on a Linux-based machine. For each of these situations, a “main” script and associated configuration file are pre-configured for out-of-the-box compatibility. For example, a SLURM user would open run_pipeline_slurm.sh to set options for his/her experiment, and optionally adjust settings in conf/slurm.config (or conf/docker_slurm.config for docker users).
In the configuration file, simple variables such as “memory” and “cpus” transparently control the hardware resources requested for each process (such as main memory and the number of CPU cores to use). This syntax comes from Nextflow, which translates these simple user-defined options into a syntax recognized by the cluster (if applicable). However, Nextflow also makes it simple to specify cluster-specific options explicitly. Suppose, for example, that a particular user intends to run SPEAQeasy on an SGE-based computing cluster, but knows that the cluster limits the default maximum file size that can be written during a job. If a SPEAQeasy process happens to exceed this limit, the user can find the process name in the appropriate config file (Additional file 3: Table S1) and add the line clusterOptions = '-l h_fsize=100G' (this is the SGE syntax for raising the mentioned file size limit to 100G per file, a likely more liberal constraint).
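As a sketch of where such a line would go, using Nextflow's process-selector syntax ('HISAT2Align' here is a hypothetical stand-in for the actual process name taken from Additional file 3: Table S1):

```groovy
process {
    withName: 'HISAT2Align' {
        // SGE: raise the per-file size limit for this process to 100G
        clusterOptions = '-l h_fsize=100G'
    }
}
```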
We also expect a common use case would involve sharing a single installation of SPEAQeasy among a number of users (e.g. a research lab). A new user wishing to run SPEAQeasy on his/her own dataset simply must copy the appropriate “main” script (e.g. run_pipeline_slurm.sh) to a desired directory, and modify it for the experiment. All users then benefit from automatic access to any annotation files which have been pulled or built by the pipeline in the past, and by default share configuration (potentially reducing work in optimizing setup specific to one’s cluster). However, user-specific annotation locations or configuration settings can be chosen by simple command-line options, if preferred.
Test samples
The test samples were downloaded from the Sequence Read Archive (SRA) or simulated using polyester [64], depending on the organism, strandness, and pairing of the samples. Each was then subsetted to 100,000 reads.
- Human:
  - Single-end, reverse: SRS7176970 and SRS7176971 [65]
  - Single-end, forward: ERS2758385 and ERS2758384
  - Paired-end, reverse: SRS5027402 and SRS5027403 [66]
  - Paired-end, forward: samples dm3_file1 and dm3_file2 from Rail-RNA [67]; samples sample_01 and sample_02 generated with polyester [64]
- Mouse:
  - Single-end, reverse: SRS7205735 and ERS3517668
  - Single-end, forward: all files generated with polyester [64]
  - Paired-end, reverse: SRS7160912 and SRS7160911
  - Paired-end, forward: all files generated with polyester [64]
- Rat:
  - Single-end, reverse: SRS6431375
  - Single-end, forward: all files generated with polyester [64]
  - Paired-end, reverse: SRS6590988 and SRS6590989 [68]
  - Paired-end, forward: all files generated with polyester [64]