FastqPuri: high-performance preprocessing of RNA-seq data

Background: RNA sequencing (RNA-seq) has become the standard means of high-throughput analysis of gene and transcript expression. While sequence alignment was previously a time-demanding step, fast alignment methods, and even more so transcript-counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. The most time-demanding step in the analysis of RNA-seq data is now preprocessing the raw sequence data, i.e. running quality control and filtering for adapters, contamination and low quality before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software package that covers all preprocessing steps is currently missing.

Results: We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable for use within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness.

Conclusions: FastqPuri is a new tool which covers all aspects of short-read sequence data preprocessing. It was designed to meet the need for fast preprocessing of RNA-seq fastq data prior to transcript and gene counting, but it is suitable for any short-read sequencing data for which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection.
FastqPuri is most flexible in filtering undesired biological sequences, offering two approaches to optimize speed and memory usage depending on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2799-0) contains supplementary material, which is available to authorized users.


Overview
FastqPuri consists of six executables which can be run sequentially or individually. Some of them are alternatives to one another; some are optional.
-Qreport creates a quality report in html format from a fastq input file and a quality threshold minQ ∈ [0,45] below which bases are considered low quality; by default, this threshold is set to 27. Several graphs are created; graphs (5) and (6) indicate how many reads will be discarded or trimmed at a given quality threshold, information that cannot be retrieved from graphs showing averages. The data shown in the graphs are stored in a binary file used by Sreport for the summary report.
-makeTree takes a fasta file containing potential contaminating sequences, creates a tree data structure and saves it to a file to be used by trimFilter/trimFilterPE. Intended for small fasta files.
-makeBloom creates a bloom filter from potential contaminating sequences, and stores it in a file for use by trimFilter/trimFilterPE. Suited for large fasta files.
-trimFilter takes a fastq file as input and filters out or trims reads according to the presence of: (1) adapter remnants, (2) contaminations from unwanted genomes, (3) low quality base calls, (4) N's. The output consists of a fastq file containing the reads classified as good (left unmodified or trimmed), one fastq file with discarded reads per selected filter option, and a binary file with summary statistics.
-trimFilterPE filters paired-end data analogously to trimFilter. It takes two fastq files as input and applies the selected filters to each read. If one read of a pair does not pass a given filter, both reads are discarded. The output is a fastq file of surviving reads for each input file and a binary output file as in trimFilter, with the only difference that trimmed reads are counted separately for each input file.
-Sreport generates summary reports in html format from a list of Qreport binary output files or trimFilter/trimFilterPE binary output files. This produces an overview of all samples in a dataset.
We suggest running FastqPuri as follows: on single-end data with a small file of potential contaminations, e.g. rRNA, Qreport-Sreport-makeTree-trimFilter-Qreport-Sreport; on paired-end data with a large sequence file of potential contaminations, e.g. several genomes, Qreport-Sreport-makeBloom-trimFilterPE-Qreport-Sreport. Of course, FastqPuri can also filter single-end data with a bloom filter and paired-end data with a tree-based filter. Figure 1 of the main paper shows a scheme of the FastqPuri workflow.

QUALITY REPORT
The quality report is obtained by executing Qreport. It reads a fastq file and generates an html report with the information/graphs described in FIG. 1 and FIG. 2. It accepts raw fastq files as well as files that were already filtered. A nucleotide in a read is considered to have low quality if its quality score lies below a user-defined threshold. See README_Qreport.md for more details on the input parameters. Quality values follow the Phred+33 convention. We have tested the performance of Qreport against fastQC [1] using an input fastq file with 51,559,773 reads of length 51, compressed with bzip2. Since fastQC does not accept this format as input, we compare the performance of both programs running on the uncompressed file, and of Qreport running on the compressed file versus the time taken to uncompress the file plus the time taken by fastQC to run. The performance and peak memory usage are collected in TABLE I.

CONTAMINATION OF TECHNICAL SEQUENCES
The presence of technical sequences, such as adapter remnants, in next generation sequencing (NGS) data can affect alignment as well as pseudo-alignment/quasi-mapping. FastqPuri offers the possibility of identifying such remnants through the option --adapter in the executables trimFilter and trimFilterPE for single-end and paired-end data, respectively. The approach is very similar to that of trimmomatic [2]. FIG. 3 sketches the possible situations for single-end data.
Reads are scanned from the 3' end to the 5' end with a 16-nucleotide seed. If at most a user-defined number of mismatches is encountered, the seed is accepted and a local alignment is performed. We suggest allowing 1 or 2 mismatches in the seed. During alignment, a score is constructed by adding log10(4) when the bases match and penalizing mismatches with −Q/10, where Q is the quality score of the mismatched base. If the score exceeds a user-defined threshold, the read is accepted as adapter contamination. For example, a perfect match of 12 bases results in a score of 7.22, and 24 perfectly matching bases score 14.44. Two mismatches with quality scores of 30 will reduce the score by 6, such that we recommend score thresholds ranging from 5 (very sensitive) to 15 (rather strict). At the ends of the read, a seed of 8 nucleotides is also allowed, with the same score and mismatch restrictions; however, an alignment of at least 12 nucleotides is always required. If a read presents adapter contamination, it is trimmed if the remaining sequence length is at least the user-defined minimum allowed length, and discarded otherwise.
The alignments tested for paired-end data are illustrated in FIG. 4: (2) the overlap shows no read-through into the adapters, (3) the overlap between the reads presents a partial adapter read-through, (4) the reads overlap including a complete read-through into the adapters, (5) there are no reads, only adapters, so the fragments contain no useful information. As for single-end data, a 'seed-and-extend' method is applied. We use 16-nucleotide seeds; when a valid one is found, i.e. a seed with at most the user-defined number of mismatches, a score is computed in the very same way as in the single-end case. If the score exceeds the user-defined threshold, both reads are trimmed if the lengths of the remaining subsequences are at least the minimum allowed, and discarded otherwise.
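As an illustration of the scoring scheme described above, the following Python sketch computes the score for an ungapped read/adapter alignment (a simplified stand-in, not FastqPuri's actual C implementation; the function name and example adapter sequence are our own):

```python
import math

def alignment_score(read, adapter, quals):
    """Score an ungapped alignment: each matching base adds log10(4);
    each mismatch is penalized by -Q/10, where Q is the Phred quality
    of the read base at that position."""
    score = 0.0
    for r, a, q in zip(read, adapter, quals):
        if r == a:
            score += math.log10(4)
        else:
            score -= q / 10.0
    return score

# A perfect match of 12 bases scores 12 * log10(4) ~ 7.22,
# while each Q30 mismatch costs 3 points:
print(round(alignment_score("AGATCGGAAGAG", "AGATCGGAAGAG", [30] * 12), 2))
```

With the recommended thresholds, a score above roughly 5 to 15 (depending on strictness) would flag the read for adapter trimming.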
To check the performance of the code, we compared our tool with trimmomatic [2] for both single- and paired-end data using fastq file(s) with 51,559,773 reads of length 51, measuring the real time, the CPU time and the peak memory usage. Results are listed in

CONTAMINATION OF UNWANTED GENOMES
Sometimes one wants to remove reads belonging to undesired biological sequences, such as pathogen or host genomes, or potential contaminations that might interfere with subsequent analyses. This is provided as an option in trimFilter and trimFilterPE; we offer two approaches, depending on whether the total size of the unwanted sequences exceeds 10MB.

Short contaminating sequences: 4-ary tree
If the fasta file of unwanted sequences is smaller than 10MB, one can look for contaminations by constructing a 4-ary tree from the fasta file. The executable makeTree can be used to construct the tree and save it to disk, which is useful when running the code on several samples. However, constructing the tree is in general a relatively cheap task for the sequence lengths under consideration and does not represent a large overhead. Therefore, one can simply call trimFilter(PE) with the option --method TREE and the tree will be generated on the fly.
When constructing a tree, the user not only has to pass the fasta file as input, but also the depth of the tree; when looking for contaminations, a score threshold is needed as well. A 4-ary tree with branches labeled {A, C, G, T} and a user-defined depth k is constructed, including all k-mers in the unwanted genome. Then, all k-mers in a given read are tested for membership in the tree. A read is considered a contamination if the score, computed as the proportion of its k-mers found in the tree, is larger than the score threshold. This method is very fast: searching a read is O(2k(L − k + 1)), where L is the read length and the factor 2 appears because the reverse complement has to be checked as well. However, it has the drawback of being very memory intensive, which is why we have limited the construction of a tree to sequences with sizes below 10MB.
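The k-mer scoring just described can be sketched in Python as follows; for illustration, a plain set stands in for the 4-ary tree (both answer the same k-mer membership query, although the set does not reproduce the tree's memory behavior), and all function names are our own:

```python
def build_kmer_index(genome, k):
    """Index all k-mers of the contaminating sequence. FastqPuri stores
    them in a 4-ary tree; a Python set gives the same membership test."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def contamination_score(read, index, k):
    """Proportion of the read's k-mers found in the index, checking
    both the forward strand and the reverse complement."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    hits = sum(1 for km in kmers if km in index or revcomp(km) in index)
    return hits / len(kmers)
```

A read whose score exceeds the user-defined threshold would then be classified as a contamination.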

Long contaminating sequences: bloom filter
This method is designed for large sequences of up to 4GB. In this case, it is sensible to construct the bloom filter once and store it in a file, which is achieved with the executable makeBloom. Then, trimFilter(PE) can be called with the option --method BLOOM; the code looks for contaminations, provided a score threshold is passed as input.

Creating a bloom filter and checking elements
A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. Let n be the number of elements in the set S, which in our case is the number of distinct k-mers in the contaminating sequence. Then, we proceed as follows:
• Decide on the number g of hash functions used to construct the filter and on the filter length m (number of bits). This choice is made based on the false positive rate we want to achieve (see Parameter optimization).
• Create an empty bloom filter B, i.e. an array of m bits set to 0. For every element s_α ∈ S, compute the g hash functions H_i(s_α), i ∈ {1, ..., g}, and set the corresponding bits in the filter to 1, i.e. B[H_i(s_α)] = 1.
• Then, to check whether an element (a k-mer from a fastq read) s_β is in the set S, compute H_i(s_β), i ∈ {1, ..., g}, and check whether all corresponding positions in the filter are set to 1, in which case s_β might be in the set. Otherwise, it is definitely not in the set.
• A read is discarded if the fraction of its k-mers that gave a positive result when looking for them in the filter exceeds the user defined threshold.
In this process, we do not consider k-mers with N's.
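The construction and query steps above can be sketched as follows. This is a minimal illustrative bloom filter; deriving the g hash functions from salted SHA-1 digests is an assumption made purely for this sketch, and makeBloom's actual hashing differs:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: m bits, g hash functions (salted SHA-1
    digests reduced modulo m, for illustration only)."""

    def __init__(self, m, g):
        self.m, self.g = m, g
        self.bits = bytearray((m + 7) // 8)  # m bits, all 0

    def _positions(self, item):
        # Derive g bit positions H_1(item), ..., H_g(item).
        for i in range(self.g):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set B[H_i(item)] = 1 for all g hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Maybe in the set" only if all g positions are 1.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

A read would then be scored by looking up each of its k-mers with the `in` operator and comparing the positive fraction to the score threshold.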

Parameter optimization
We choose the parameters so that the desired false positive rate is achieved. Alternatively, we can pass the filter size, and then the number of hash functions to be used is tuned so that the false positive rate is minimized.
We assume that the hash functions select all positions with equal probability. The probability that a given bit in the filter B is not set to 1 after inserting one element using g hash functions is (1 − 1/m)^g. If we insert n elements, the probability that the bit is still 0 is (1 − 1/m)^{gn}, and the probability that it is 1 is 1 − (1 − 1/m)^{gn}. The false positive rate, i.e. the probability that all positions computed from the hash functions for a given element are set to 1 without the element being in the set, is then given by

p = [1 − (1 − 1/m)^{gn}]^g ≈ (1 − e^{−gn/m})^g.

For given n and m, the value of g that minimizes p follows from dp(g, n, m)/dg = 0 ⇒ g = (m/n) ln 2. The optimal number of bits per element, assuming the optimal number of hash functions is used, is m/n = −ln(p)/(ln 2)^2. The memory usage is then determined by m. FIG. 5 shows both the optimal number of bits per element and the optimal number of hash functions as a function of the false positive rate.
As an example, let us assume we want to look for contaminations from a genome of ∼3Gbps and want to keep the false positive rate at 2%. With n ≈ 3 × 10^9 k-mers and p = 0.02, the filter needs −ln(0.02)/(ln 2)^2 ≈ 8.14 bits per element, i.e. we will need a filter of ∼3.05GB.
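These numbers can be reproduced with a short calculation (a hypothetical helper written for this example, not part of FastqPuri):

```python
import math

def bloom_parameters(n, p):
    """Optimal bloom filter size for n elements at false positive rate p:
    bits per element m/n = -ln(p)/(ln 2)^2, hash functions g = (m/n) ln 2."""
    m = math.ceil(n * -math.log(p) / math.log(2) ** 2)  # filter size in bits
    g = round(m / n * math.log(2))                      # number of hash functions
    return m, g

# ~3 Gbp genome (~3e9 k-mers) at a 2% false positive rate:
m, g = bloom_parameters(3_000_000_000, 0.02)
print(m / 8 / 1e9, g)  # ~3.05 GB filter, using 6 hash functions
```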
The sensitivity can be increased by decreasing the k-mer size and the score threshold. False negatives only occur in the presence of mismatches due to variants or errors in base calling, since the filter itself does not allow for false negatives. To increase specificity, one can increase the score threshold or, obviously, reduce the false positive rate given as an input parameter. In addition to the human dataset used in the main manuscript, we also checked the performance of filtering out reads from contaminating species using data simulated with dwgsim. We generated 10^5 150bp Escherichia coli reads and 10^5 150bp human reads, and created several bloom filters for the Escherichia coli genome with k-mer size 25 and false positive rate p = 0.075. Then, we ran trimFilter on the mixed E. coli-human data to look for contaminations using score thresholds ranging from 0.05 to 0.2. FIG. 6 displays a ROC curve summarizing the results.

QUALITY FILTER
Fastq files may contain reads with low-quality cycles that might lead to alignment of the read at the wrong place or to counting the read for the wrong transcript. FastqPuri allows the user to pass a quality threshold as input, under which cycle qualities are considered poor, and the flag --lowQ followed by one of the following options:
• NO (or flag absent): nothing is done to reads with low quality.
• ALL: all reads containing at least one low-quality nucleotide are discarded.
• ENDS: look for low-quality base calls at the beginning and at the end of the read and trim both ends until the quality is above the threshold. Keep the read if the length of the remaining part is at least the minimum allowed; discard it otherwise.
• FRAC [--percent p]: discard the read if more than p% of its nucleotides have a quality below the threshold.
• ENDSFRAC [--percent p]: first trim the ends as in the ENDS option. Accept the trimmed read if the number of low-quality nucleotides does not exceed p%; discard it otherwise.
• GLOBAL --global n1:n2: trim all reads globally, n1 nucleotides from the left and n2 from the right.
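As an illustration, the ENDS trimming rule can be sketched as follows (a hypothetical Python helper, not FastqPuri's implementation):

```python
def trim_ends(quals, min_q):
    """ENDS option: trim bases with quality below min_q from both ends of
    a read; returns the half-open range [start, end) of positions kept."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end

# With the default threshold of 27, only the flanking low-quality bases go:
print(trim_ends([10, 30, 35, 30, 5], 27))  # (1, 4)
```

The read would then be kept only if `end - start` is at least the minimum allowed length.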
Sometimes we want to discard or trim reads containing nucleotides tagged as N. This can be done by passing the flag --NNN and one of the following options:
• NO (or flag absent): nothing is done to reads containing N's.
• ALL: all reads containing at least one N are discarded.

• ENDS: N's are trimmed if found at the ends, and left as is otherwise. If the trimmed read length is smaller than the minimum allowed read length, the read is discarded.
• STRIP: obtain the largest N-free subsequence of the read. Accept it if its length is at least the minimum allowed length; discard it otherwise.
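The STRIP option reduces to finding the longest N-free substring, e.g. (illustrative Python, not FastqPuri's code):

```python
def strip_n(read):
    """STRIP option: return the longest N-free substring of the read."""
    return max(read.split("N"), key=len)

print(strip_n("NNACGTNNAGN"))  # ACGT
```

The result would then be accepted only if its length reaches the minimum allowed read length.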

SUMMARY REPORTS
The executable Sreport creates three types of summary reports in html format:
-t Q Quality summary report: takes as input a folder containing *.bin files output by Qreport, and generates an html report with a table containing, for all samples, the number of reads, the number of tiles, the percentage of reads with low-quality cycles and the percentage of reads with cycles tagged as N, as well as a heatmap showing the average quality per position for all samples.
-t F Filtered data summary report, single-end: takes a folder containing *.bin files output by trimFilter and generates an html report with a table specifying the setup of the run, i.e. the filters that were applied to the files, provided the setup was the same for all samples. If the files were generated with different setups, the table would be ambiguous and is therefore not generated. The report also contains a table listing, for all samples (rows): the total number of reads; the number of accepted reads; the percentage of reads discarded due to adapter contaminations, contaminations with unwanted genomes, low-quality issues and presence of N's; and the percentage of reads trimmed due to adapter contaminations, low-quality issues and presence of N's.
-t D Filtered data summary report, paired-end: takes a folder containing *.bin files output by trimFilterPE and generates an html report very similar to the one for single-end data, with the only difference that there are two columns of trimmed-read percentages, one for read 1 and one for read 2.