HOME-BIO (sHOtgun MEtagenomic analysis of BIOlogical entities): a specific and comprehensive pipeline for metagenomic shotgun sequencing data analysis

Background Next-Generation-Sequencing (NGS) enables detection of microorganisms present in biological and other matrices of various origin and nature, allowing not only the identification of known phyla and strains but also the discovery of novel ones. The large amount of metagenomic shotgun data produced by NGS require comprehensive and user-friendly pipelines for data analysis, that speed up the bioinformatics steps, relieving the users from the need to manually perform complex and time-consuming tasks. Results We describe here HOME-BIO (sHOtgun MEtagenomic analysis of BIOlogical entities), an exhaustive pipeline for metagenomics data analysis, comprising three independent analytical modules designed for an inclusive analysis of large NGS datasets. Conclusions HOME-BIO is a powerful and easy-to-use tool that can be run also by users with limited computational expertise. It allows in-depth analyses by removing low-complexity/ problematic reads, integrating the analytical steps that lead to a comprehensive taxonomy profile of each sample by querying different source databases, and it is customizable according to specific users’ needs.


-INTRODUCTION
This tutorial explains how to run an analysis using HOME-BIO with one test sample that it is possible to download on https://doi.org/10.5281/zenodo.4061297. The test sample is composed of two paired-end fastq files of human DNA, created in silico. Following, it is described how to download, install and run the pipeline.

-DESCRIPTION
HOME-BIO (sHOtgun MEtagenomic analysis of BIOlogical entities) is a modular and exhaustive pipeline for analysis of biological entity estimation, specifically designed for shotgun sequenced clinical samples. HOME-BIO analysis provides a comprehensive taxonomy classification by querying different databases and carries out the main steps of a metagenomics investigation. HOME-BIO is a dockerized solution for metagenomics analysis. Inside a DOCKER image, UBUNTU with Python 3.8 and Anaconda 3 (V. 2020/02) have been installed. Inside the Anaconda environment, it has been installed common software in a metagenomic investigation (MultiQC, FASTQC, Cutadapt, Bowtie, Bowtie2, STAR, Kraken2, Bracken, SPAdes, Kaiju, and krona).
The pipeline takes in input fastq files.
• "Quality Control" module allows to perform sequence read quality checks and includes FASTQC and MultiQC, to perform the quality check and the summary of quality control, respectively, while the adapter trimming and removal of low-quality reads is performed by Cutadapt. If required, HOME-BIO performs a filtering step to remove host and contaminant sequence fragments by mapping each of them on the corresponding genome(s); • "Metagenomic Shotgun" module performs taxonomic profiling, by classifying unmapped reads with Kraken2, and an additional proteinlevel classification with Kaiju; • "Assembly de novo" module uses SPAdes (option --meta) for the assembly of unmapped reads; contigs sequence are generated and then classified as described before.

-Docker and Python installation
To use HOME-BIO, it is mandatory to install and run DOCKER (https://hub.docker.com/) and install and use Python 3.
• At the following link (https://hub.docker.com/search?q=&type=edition&offering=commun ity) you can find the correct version of DOCKER for your OS and all the info about how to install and run DOCKER. • There are several options to install Python 3 depending on your OS: Linux\Debian sudo apt update sudo apt-get install python3.8 Centos\RHEL\Fedora yum update -y yum install -y python3

Windows\MacOS
Just download and install it from https://www.python.org/downloads/windows/. For Windows users: during the installation, please, allow to add python in the path in order to run it directly by typing "python".

-Pipeline installation
Run DOCKER on your PC or server and pull our DOCKER image from https://hub.docker.com/r/biohaz/home_bio or just type in your console: docker pull biohaz/home_bio:latest Now, download the GitHub repository or, if you have git clone installed, type: git clone https://github.com/carlferr/HOME-BIO.git The figure above shows the content of the HOME-BIO-master.zip file downloaded from GitHub.

-Download the databases
To run HOME-BIO analysis, you should download indexed genome/protein reference databases for bacteria, protozoa, or viruses. We provide all of them on Zenodo (https://doi.org/10.5281/zenodo.4055180) (DB_KKRAKEN2_KAIJU.zip). Download the zipped folder, and, before running the analysis, please unzip it on your machine.
In the figure above, it is possible to see the content of the DB_KRAKEN2_KAIJU.zip file, available on Zenodo. Inside, there are three folders for Kraken2 and one folder for the Kaiju database.
It is mandatory to have at least 400 GB of available space on the disk to extract the compressed folder. It is NOT mandatory to use our previously indexed databases, but the user will need to create his own databases and, then, index them with Kraken2 and Kaiju. See the instructions on Kraken2 (https://github.com/DerrickWood/kraken2/wiki/Manual) and Kaiju (https://github.com/bioinformatics-centre/kaiju) user manuals.

-Prepare the Config file
Before running HOME-BIO, the users should manually change the "config_file.txt".
The figure above shows the content of the "config_file.txt" file.
This file is divided in two parts. The first [DEFAULT] indicates the default values for each parameter, while the second part [CUSTOM] should be modified by the user.
In particular, the user will manually modify, in the [CUSTOM] section, the correct options for the analysis, following the written examples. For some options an absolute path is mandatory, while for others it is required a "yes" or "no" selection. Only the "Adapter" option requires a sequence (in capital letters) to use as adapter during the trimming step; moreover, the adapter sequences may also contain any IUPAC wildcard character (such as N), as suggested in cutadapt user guide https://cutadapt.readthedocs.io/en/stable/guide.html. For the "k-mers" option of the Assembly de novo module, HOME-BIO is set to "auto". In this way, SPAdes will choose the best k-mer length depending on the reads length.
Only a comma-separated list of k-mer sizes can be used (all values must be odd, less than 128, and listed in ascending order, e.g. 21,33,55). For more details, see the SPADES manual (http://cab.spbu.ru/files/release3.13.0/manual.html). If the contaminant filtering is not requested, the "Path contaminant genome" is not used but it is still mandatory to write something (e.g., "/your_path/" or "none"). E.g. for "Number of threads [n] = " the user should insert, in the [CUSTOM] section, an integer number [n] indicating the max number of threads to be used (in this case 16 are used). For "path to host genome" is indicated the absolute path of Bowtie2 indexed host genome folder: "/root/Scrivania/HOME-BIO-master/Genome/". The script Config_file_creator.py is able to re-generate the config file (in case it was lost or not usable anymore). The user will modify it as suggested before. Just type: python Config_file_creator.py It will create the "config_file.txt" or, if it is still present, rewrite it.

E.g.
bowtie2-build genome.fa host_genome_index If the user does not have a human reference sequence (here considered as host genome), he should follow the next step in order to download it from NCBI, decompress the genome sequence file (Homo sapiens -assembly GRCh38.p13), and, finally, create the proper index in the correct folder. In the config_file example above, options are set in order to:

E.g.
-perform a quality check before and after the trimming -run the analysis with 16 threads -use the reads in a folder in paired-end mode (here, two paired-end fastq test files were used which can be downloaded on https://doi.org/10.5281/zenodo.4061297) -use the human genome as host-genome (hg38 downloaded from NCBI and pre-indexed with Bowtie2) -do not use a contaminant genome and do not perform the filtering -use a specific adapter -use the shotgun module and the assembly de novo module -use our pre-indexed kaiju and kraken2 databases (downloaded and unzipped from Zenodo) -use the default value of confidence for Kraken2 taxonomy annotation (0.5) -set "DNA" as nucleic acid (since our test sample is DNA) -do not use a custom set of kmers length.
If the user knows about a possible contaminant in the samples, he should indicate an absolute path of the pre-indexed genome with path contaminant genome option and he should use Bowtie2 in the same way the host genome was created (bowtie2-build [options]* <reference_in> <bt2_base>). We provide an example of a pre-indexed contaminant genome available on Zenodo (http://doi.org/10.5281/zenodo.4281135). Users should unzip the file and move the content in a folder (e.g. "/root/Desktop/HOME-BIOmaster/contaminant_genome/"). Remember to compile the config file with this absolute path.

-HOW TO RUN HOME-BIO
HOME-BIO uses fastq (or fastq.gz) files as input. Remember to put all your input data in one unique folder (e.g., "Fastq_folder", "Data", or "Input").
It is possible to test the pipeline using a testing dataset (two paired-end fastq files) freely available (https://doi.org/10.5281/zenodo.4061297) on Zenodo. The figure above shows the paired-end test sample in one unique folder. Now, you can run HOME-BIO. Please, move in the HOME-BIO-master directory and just type in the console (for Linux\Debian\Centos\RHEL\Fedora): 1. FastQC and MultiQC output files with quality control report of samples (in html and tab delimited-format), before and after adapter sequence removal. It is also provided a quality check summary of all the fastq samples analyzed in HOME-BIO; 2. Fastq files obtained after low quality read filtering and the adapter sequence removal procedures; 3. Alignment of output file obtained with the host and the contaminant filtering step. For each sample processed in HOME-BIO, metrics, alignment files (in sam format), fastq files with aligned fragment, and unmapped reads (used in next steps) are generated. 4. Metagenomic annotation output files from the Shotgun module. These folders contain classified and unclassified read sequences (in fastq format) obtained by querying HOME-BIO databases. Furthermore, there are also some intermediate files used to perform the taxonomy classification (see "9_Results" explanation).
5. Assembly de novo output files. For each sample analyzed, users can find metrics and parameters information of the de novo assembly step.
In particular, contig.fasta and scaffold.fasta files contain all the sequences assembled with SPAdes. 6. Metagenomic annotation output files for the Assembly de novo module (as described before). 7. Protein-level annotation output files for the Shotgun module. This folder contains the summary files of the protein-based profiling reporting the detected entities and the related number (and percentage) of the reads mapping in HOME-BIO database; html interactive piechart, with all the entities found in the samples; tab-delimited output files with one line per reads processed indicating taxonomy classification results (name of the read, NCBI taxon IDs of identified taxa, the length or score of the best match used for the classification and the matching sequences). 8. Protein-level annotation output files for the Assembly de novo module (as described before). HOME-BIO generates also a log.txt file containing all the messages prompted by the script during the analysis.
The "9_Results" folder is the final output, resulting from each module performed in HOME-BIO. In this tutorial, both modules were used: For each sample processed, it is possible to obtain the following resulting files. The figure  below shows the content of the assembly_de_novo_module_results folder. The identical scheme is present in the shotgun_module_results. D. *_HOMEBIO_results_protein_validated*: these sample reports have the same structure of the outputs mentioned above in the C section. In addition, they contain the 7th column reporting the protein validation information. If a given taxon finds a match both in the protein validation step with Kaiju annotation and in the Kraken2 taxonomy annotation with a different reference database, it will be "protein_validated"; as in example below for bacteria Citrobacter koseri. E. *HOMEBIO_tax_count*: sample reports in tab-delimited format reporting number of genera and species (and related taxonomy) found in the samples.