FastqCleaner: an interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files

Background
Exploration and processing of FASTQ files are the first steps in state-of-the-art data analysis workflows for Next Generation Sequencing (NGS) platforms. The large amount of data generated by these technologies poses a challenge for the rapid analysis and visualization of sequencing information. Recent integration of the R data analysis platform with web visual frameworks has stimulated the development of user-friendly, powerful, and dynamic NGS data analysis applications.

Results
This paper presents FastqCleaner, a Bioconductor visual application for both quality control (QC) and pre-processing of FASTQ files. The interface shows diagnostic information for the input and output data and allows the user to select a series of filtering and trimming operations in an interactive framework. FastqCleaner combines the technology of Bioconductor for NGS data analysis with the data visualization advantages of a web environment.

Conclusions
FastqCleaner is a user-friendly, offline-capable tool that enables access to advanced Bioconductor infrastructure. The novel concept of a Bioconductor interactive application that can be used without programming skills makes FastqCleaner a valuable resource for NGS data analysis.

Electronic supplementary material
The online version of this article (10.1186/s12859-019-2961-8) contains supplementary material, which is available to authorized users.


Launching the application
The interactive application can be launched in R with the following commands:
    library('FastqCleaner')
    launch_fqc()
As an alternative, an RStudio addin (RStudio version 0.99.878 or higher required) installed with the package can be found in the Addins menu (Figure 1). This addin launches the application with a single click.

Description of the application
The application contains three main panels, as described below.

First panel
The first panel includes two elements: a dashboard for selection of trimming and filtering operations, and a menu for selection of the input file/s (Fig. 2).

Selecting operations
The "operations menu" (Fig. 2, elements 1 to 8) shows the available operations for file processing:
1. Remove by N(s): removes sequences with a number of Ns (unidentified bases) above a selected threshold value
2. Remove low complexity sequences: removes sequences with a complexity value below a threshold value
3. Remove adapters: removes adapters and partial adapters. Adapter sequences from both ends of single-end or paired-end reads can be selected. Sequences can be reverse-complemented before processing. The program can also account for indels and/or anchored adapters.
4. Filter by average quality: computes the average quality of sequences and removes those with a value below a given threshold
5. Trim low quality 3' tails: removes the 3' tails of sequences whose quality is below a given threshold
6. Trim 3' or 5' by a fixed number: removes a fixed number of bases from the 3' and/or 5' ends in the complete set of sequences
7. Filter sequences by length: removes all the sequences with a number of bases below a threshold value
8. Remove duplicated sequences: removes duplicated reads, conserving only one copy of each sequence present in the file
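The logic behind some of these operations can be pictured with a minimal base R sketch. It is not the package's implementation: the toy reads, the helper names (count_n, mean_quality) and all threshold values are hypothetical, and a Phred+33 encoding is assumed. It illustrates operations 1, 4, 6, 7 and 8 applied in sequence.

```r
# Toy reads with Phred+33 quality strings (hypothetical data)
reads <- data.frame(
  seq  = c("ACGTACGTAC", "ACNNNCGTAC", "ACGTACGTAC", "ACGT", "GGGTTTAAAC"),
  qual = c("IIIIIIIIII", "IIIIIIIIII", "IIIIIIIIII", "IIII", "##########"),
  stringsAsFactors = FALSE
)

# Number of N bases in each sequence
count_n <- function(x) {
  vapply(strsplit(x, "", fixed = TRUE), function(b) sum(b == "N"), integer(1))
}

# Mean Phred score of each read (offset 33 is an assumption)
mean_quality <- function(q, offset = 33L) {
  vapply(q, function(s) mean(utf8ToInt(s) - offset), numeric(1),
         USE.NAMES = FALSE)
}

# 1. Remove reads with more than 2 Ns
cleaned <- reads[count_n(reads$seq) <= 2L, ]
# 4. Remove reads with a mean quality below 20
cleaned <- cleaned[mean_quality(cleaned$qual) >= 20, ]
# 6. Trim one base from the 5' end and one base from the 3' end
cleaned$seq  <- substr(cleaned$seq, 2L, nchar(cleaned$seq) - 1L)
cleaned$qual <- substr(cleaned$qual, 2L, nchar(cleaned$qual) - 1L)
# 7. Remove reads shorter than 5 bases
cleaned <- cleaned[nchar(cleaned$seq) >= 5L, ]
# 8. Keep a single copy of each duplicated sequence
cleaned <- cleaned[!duplicated(cleaned$seq), ]
```

In this toy example a single read survives all five steps. The order of the steps matters: trimming before the length filter, for instance, changes which reads are discarded.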

Loading files
The "file selection menu" (Fig. 2, elements 9 to 17) contains options to handle the input file (type of file, file selection), buttons to run, clear and reset the application, and the "advanced" submenu:
9. Single-end reads / paired-end reads: type of input files
10. "FILE" button: to select an input file
11. "RUN!" button: to run the program
12. Output format: to select whether the output file should be compressed (.gz) or not
13. "CLEAR" button: to clear the configuration of the operations selected in the first panel, while keeping the input file(s)
14. "RESET" button: to restart the application, removing the input file(s) and the selected configurations
15. Selection notifier: shows the path of the selected file(s)
16. Encoding notifier: shows the encoding of the input file(s)
17. Advanced options button: to select a custom encoding and set the number of reads included in each chunk for processing, as described below

Advanced options
The "advanced options submenu" (Fig. 3) allows fine-tuning of the trimming and filtering process.
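The idea of processing a file in chunks of a fixed number of reads can be sketched in base R as follows. This is an illustration only, not the application's code: the function name process_in_chunks, the temporary file and the chunk size are hypothetical, and each FASTQ record is assumed to span exactly four lines.

```r
# Write a small FASTQ file to a temporary location (hypothetical data)
fq <- tempfile(fileext = ".fastq")
records <- unlist(lapply(1:5, function(i) {
  c(paste0("@read", i), "ACGTACGT", "+", "IIIIIIII")
}))
writeLines(records, fq)

# Apply `fun` to the sequences of each chunk of `chunk_size` reads
process_in_chunks <- function(path, chunk_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    # Four lines per record, so one chunk is 4 * chunk_size lines
    lines <- readLines(con, n = 4L * chunk_size)
    if (length(lines) == 0L) break
    seqs <- lines[seq(2L, length(lines), by = 4L)]
    results[[length(results) + 1L]] <- fun(seqs)
  }
  results
}

# Count the reads seen in each chunk
chunk_counts <- process_in_chunks(fq, chunk_size = 2L, fun = length)
```

Only one chunk is held in memory at a time, so a smaller chunk size lowers memory use at the cost of more read operations.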

Second panel
The second panel ("file operations" panel, Fig. 4) shows the operations that were successfully performed on the input file after running the program.

Third panel
The third panel ("live results" panel, Fig. 5) shows interactive diagnostic plots for both input and output files. The program takes a random sample of reads for the construction of the plots (default: 10000 reads).
24. Diagnostics plots: the plot to be shown, which can be one of the following:
• Per cycle quality: quality plots across reads for each cycle (i.e., sequence position)
• Per cycle mean quality: average quality across reads per base, for each cycle (i.e., sequence position)
• Mean quality distribution: quality distribution, using the mean quality of each read for the construction of the histogram
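The quantities behind the per cycle mean quality plot and the mean quality distribution can be sketched in base R. The function name per_cycle_mean_quality, the toy quality strings and the sample size are hypothetical, and a Phred+33 encoding is assumed; the application's own plotting code may differ.

```r
# Hypothetical Phred+33 quality strings, one per read
quals <- c("IIII#", "IIIII", "II5I#", "IIIII")

# Random sample of reads, mirroring the sampling step described above
set.seed(1)
sample_size <- 3L
sampled <- quals[sample(length(quals), min(sample_size, length(quals)))]

# Mean quality at each cycle (sequence position) across reads
per_cycle_mean_quality <- function(q, offset = 33L) {
  scores <- lapply(q, function(s) utf8ToInt(s) - offset)
  max_len <- max(lengths(scores))
  # Pad shorter reads with NA so that every read fills one matrix row
  padded <- lapply(scores, function(s) {
    c(s, rep(NA_real_, max_len - length(s)))
  })
  mat <- do.call(rbind, padded)
  colMeans(mat, na.rm = TRUE)
}

cycle_means <- per_cycle_mean_quality(sampled)

# Mean quality of each read, the input of the distribution histogram
read_means <- vapply(sampled, function(s) mean(utf8ToInt(s) - 33L),
                     numeric(1), USE.NAMES = FALSE)
```

Passing read_means to hist() would give a static counterpart of the mean quality distribution plot.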
A typical FastqCleaner workflow starts with the upload of the input file(s) (Fig. 6).

Figure 6:
File input menu. The example shows a single-end reads case (sample file 'example.fastq.gz'). For paired-end reads, the selection of the corresponding library type generates an additional button to upload the second file.
The file encoding is automatically detected by the program, but it can also be specified manually in the advanced submenu (Fig. 7). This menu also offers an option to customize the chunk size used for processing.
To use a filter, the "Use filter?" checkbox must be checked. A filter in use is indicated with a checkmark in the filter box. The program starts running after the "RUN!" button is pressed (Fig. 9).
The type of plot to be displayed and the options for its construction are available in the third panel (Fig. 11). This panel also shows the selected plot(s). To clear the operations, for example to run a different configuration, the "CLEAR" button (Fig. 11) must be pressed. The "RESET" button (Fig. 11) restarts the interface.
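Automatic detection of the quality encoding is commonly based on the range of ASCII codes found in the quality strings. The base R sketch below shows one such heuristic; it is an assumption about how detection could work, not the method used by the application, and the function name guess_offset and its cut-off values are hypothetical.

```r
# Guess the Phred offset from the ASCII range of the quality characters.
# Heuristic cut-offs (assumptions): codes below 59 are taken as Phred+33,
# codes above 74 are taken as a +64 encoding.
guess_offset <- function(quals) {
  codes <- unlist(lapply(quals, utf8ToInt))
  if (min(codes) < 59L) {
    33L
  } else if (max(codes) > 74L) {
    64L
  } else {
    NA_integer_  # range is compatible with both encodings
  }
}

guess_offset(c("IIII####", "IIIIIIII"))  # low codes present
guess_offset(c("hhhhhhhh", "hhhhcccc"))  # only high codes present
```

When the heuristic returns NA, the observed range does not discriminate between encodings, which is the kind of case where a manual setting such as the one in the advanced submenu is useful.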
Additional help can be found in the "help" button located at the top-right of the app (Fig. 12). The functions included in the package are described in the following section.

Main functions
• adapter_filter Based on the Biostrings isMatchingStartingAt and isMatchingEndingAt functions. It can remove adapters and partial adapters from the 3' and 5' sequence ends. Adapters can be anchored or not. When indels are allowed, the method is based on the "edit distance" of the sequences.
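The removal of full and partial 3' adapters can be illustrated with a minimal base R sketch. It uses exact string matching instead of the Biostrings functions named above, and the function name trim_3prime_adapter, the min_overlap parameter and the adapter sequence are hypothetical. The last line shows the edit distance between two strings with the base R function adist, as a pointer to how indels could be tolerated.

```r
# Trim a 3' adapter from a single read using exact matching only
trim_3prime_adapter <- function(read, adapter, min_overlap = 3L) {
  # Full adapter found anywhere in the read: cut from its first position
  pos <- as.integer(regexpr(adapter, read, fixed = TRUE))
  if (pos > 0L) {
    return(substr(read, 1L, pos - 1L))
  }
  # Partial adapter: the 3' end of the read matches a prefix of the adapter
  max_k <- min(nchar(adapter) - 1L, nchar(read))
  if (max_k >= min_overlap) {
    for (k in seq(max_k, min_overlap)) {
      if (endsWith(read, substr(adapter, 1L, k))) {
        return(substr(read, 1L, nchar(read) - k))
      }
    }
  }
  read  # no adapter detected
}

adapter <- "AGATCGGAAG"  # hypothetical adapter sequence
full    <- trim_3prime_adapter("ACGTACGTAGATCGGAAGTT", adapter)
partial <- trim_3prime_adapter("ACGTACGTAGATC", adapter)
none    <- trim_3prime_adapter("ACGTACGT", adapter)

# Edit distance between an adapter fragment and a copy with one deletion
edit_dist <- adist("AGATCGG", "AGTCGG")[1, 1]
```

The min_overlap value is a trade-off: a small value removes more partial adapters but also trims reads that end in a few adapter-like bases by chance.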
where $d_i = D_i / \sum_{i}^{n} D_i$, and $D_i$ represents the frequency of dinucleotides of the sequence $i$ relative to the frequency in the whole pool of sequences.
The ratio $H_i / H_r$ between $H_i$ and a reference entropy value $H_r$ is computed, and the resulting ratios are compared with a given complexity threshold. By default, the program uses a reference entropy of 3.908, which corresponds to the entropy of the human genome in bits, and a complexity threshold of 0.5.
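The comparison against the reference entropy can be sketched in base R. The sketch assumes that the entropy takes the Shannon form, $H = -\sum d \log_2 d$, computed over the relative frequencies of the overlapping dinucleotides within each read, and that reads with a ratio at or above the threshold are kept. Both points are assumptions, and the function name dinucleotide_entropy and the toy sequences are hypothetical.

```r
# Shannon entropy (in bits) of the overlapping dinucleotides of one sequence
dinucleotide_entropy <- function(s) {
  n <- nchar(s)
  if (n < 2L) return(0)
  dinucs <- substring(s, 1L:(n - 1L), 2L:n)
  d <- as.numeric(table(dinucs)) / length(dinucs)  # relative frequencies
  -sum(d * log2(d))
}

reference_entropy <- 3.908  # default reference value given in the text
threshold <- 0.5            # default complexity threshold given in the text

seqs <- c("AAAAAAAAAA", "ACGTACGTTGCA")  # hypothetical reads
ratio <- vapply(seqs, dinucleotide_entropy, numeric(1),
                USE.NAMES = FALSE) / reference_entropy

# Assumption: reads with a ratio at or above the threshold are retained
keep <- ratio >= threshold
```

Here the homopolymer has a single dinucleotide type and an entropy of zero, so it is discarded, while the second read is retained.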

• length_filter Removes sequences with a length lower than a minimum threshold value and/or higher than a maximum threshold value.