metaSPARSim: a 16S rRNA gene sequencing count data simulator

Background In the last few years, 16S rRNA gene sequencing (16S rDNA-seq) has seen a surprisingly rapid increase in election rate as a methodology to perform microbial community studies. Despite the considerable popularity of this technique, an exiguous number of specific tools are currently available for proper 16S rDNA-seq count data preprocessing and simulation. Indeed, the great majority of tools have been developed adapting methodologies previously used for bulk RNA-seq data, with poor assessment of their applicability in the metagenomics field. For such tools and the few ones specifically developed for 16S rDNA-seq data, performance assessment is challenging, mainly due to the complex nature of the data and the lack of realistic simulation models. In fact, to the best of our knowledge, no software thought for data simulation are available to directly obtain synthetic 16S rDNA-seq count tables that properly model heavy sparsity and compositionality typical of these data. Results In this paper we present metaSPARSim, a sparse count matrix simulator intended for usage in development of 16S rDNA-seq metagenomic data processing pipelines. metaSPARSim implements a new generative process that models the sequencing process with a Multivariate Hypergeometric distribution in order to realistically simulate 16S rDNA-seq count table, resembling real experimental data compositionality and sparsity. It provides ready-to-use count matrices and comes with the possibility to reproduce different pre-coded scenarios and to estimate simulation parameters from real experimental data. The tool is made available at http://sysbiobig.dei.unipd.it/?q=Software#metaSPARSimand https://gitlab.com/sysbiobig/metasparsim. Conclusion metaSPARSim is able to generate count matrices resembling real 16S rDNA-seq data. The availability of count data simulators is extremely valuable both for methods developers, for which a ground truth for tools validation is needed, and for users who want to assess state of the art analysis tools for choosing the most accurate one. Thus, we believe that metaSPARSim is a valuable tool for researchers involved in developing, testing and using robust and reliable data analysis methods in the context of 16S rRNA gene sequencing. Electronic supplementary material The online version of this article (10.1186/s12859-019-2882-6) contains supplementary material, which is available to authorized users.

• estimate_library_size function: extracts real library sizes from the original dataset.
However, for the estimation of the complete parameter set, function estimate parameters f rom data has been implemented as a wrapper of the three above functions.

Presets
Here is described the set of pre-computed parameters for 16S count matrix simulation included in metaSPARSim. The parameter sets collected in metaSPARSim have been estimated using metaSPARSim from five different real data, describing more than 1100 samples across almost 90 different sample groups. These datasets were chosen to represent 16S rDNA sequencing data diversity, having all of them peculiar characteristics in terms of number of samples, constitutive groups, sequencing depths and sample source. The first two sets were estimated from data coming from experiments performed in the context of this work (R1-R2 sets), while the other ones are estimated from Human Microbiome Project (HMP) [2,3] data (R3-R4 sets) and Atacama soil [4] data (R5 ).

Real data
In this section are described in detail the two proprietary real datasets used for simulator testing.
Animal gut microbiome This sample collection had the aim of monitoring chicken gut microbiota modifications in the first 4 weeks of life in occurrence of Campylobacter spp. infection, and comparing them with healthy chickens risen in analogous conditions but within farms in which no Campylobacter spp. infection occurred. For this study, we selected four broiler farms belonging to the same supply chain, half of which became positive for Campylobacter spp. during the sampling period, while the other half showed no infection during all the 4 weeks of monitoring. Five samples were collected from each farm at 5 different time points (7th, 14th, 18th, 21th and 28th day of chickens' age), for a total of 110 caecal samples. For the two positive farms, ten samples (five positive and five negative for Campylobacter spp.) were collected for the time points in which infection was first detected. A scheme of this experiment is reported in Figure 1. Total DNA was then extracted and V3-V4 regions of 16S rRNA gene were amplified with the primers CC-TACGGGNGGCWGCAG (forward) and GACTACHVGGGTATCTAATCC (reverse), following Klindworth et al. [5], and sequenced on HiSeq2500 platform in RAPID mode (2x250 bp). Sequencing data underwent a quality control procedure using the FastQC tool [6]. Data were then cleaned by removing adapters, primers and performing dereplication of sequences using a in-house bash script. In addition, data were filtered based on the quality and length of the reads, so that only data with a quality higher than a given threshold (QPhred ≥ 20) and reads whose length exceeded 100bp were retained. All subsequent steps were performed using QIIME1 [7] pipeline (version 1.9.0). Data obtained from the filtering step underwent read pairing, in order to obtain a single file in which the reads obtained by sequencing the 16S fragments on the forward strand and on the reverse are joined by their overlapping region. Then, OTU picking step was performed, assigning reads to a particular taxonomy by directly mapping the same reads to a 16S sequences database (GreenGenes database [8], last release May 2013). Food microbiome from raw milk cheese The second study had the aim of following the dynamics of the microbial community of "Latteria" raw milk cheese during its ripening period, in natural ageing conditions and in presence of contamination by pathogens like Listeria monocytogenes and Staphilococcus aureus. The cheesemaking was made in a dairy in Friuli Venezia Giulia region, following all the typical steps of this particular raw milk cheese production. The design was made by 4 types of cheesemaking: • plain, with no contamination • contaminated with Listeria innocua, chosen for security reasons as a substitute of Listeria monocytogenes as being characterized by the same dynamical behaviour • contaminated with Staphilococcus aureus • contaminated with both Listeria innocua and Staphilococcus aureus.
Each cheesemaking was performed in triple replicate, to access biological variability. Samples from from cheese till the 30rd day of ripening period were collected, for a total of 10 sampling time points and 120 samples (12 from cheese at each time point). A scheme of this experiment is showed in Figure 2. Total RNA was extracted and retrotranscribed; then V3-V4 regions of 16S rRNA gene were amplified with the primers CCTACGGGNGGCWGCAG (forward) and GACTACHVGGGTATCTAATCC (reverse), following Klindworth et al. [5], and sequenced on HiSeq2500 platform in RAPID mode (2x250 bp). After sequencing, a control procedure using the FastQC tool [6] was performed on resultant data. After that, QIIME2 [7] pipeline (version 2017.11) was used to perform sequence quality control and feature table construction via DADA2 pipeline, which also performs phiX reads and chimeric sequences filtering. Taxonomic assignment was obtained using the QIIME2 Naive Bayes classifier pre-trained on Silva [9] database. Human Microbiome Project data Data obtained in the context of the Human Microbiome Project (HMP) ( [2,3]) were used to test for metaS-PARSim performance. The project had the aim of creating resources to easily characterize the human microbiota. Within this project, the microbial communities from 300 healthy individuals across several different sites on the human body was characterized: nasal passages, oral cavity, skin, gastrointestinal tract, and urogenital tract samples were collected and variable regions 3-5 (V35) of 16S rRNA gene were sequences ( Figure 3). The two datasets considered in this work that were derived from HMP data are composed by a subsample of this publicly available dataset (https: //qiita.ucsd.edu/study/description/1928), taking 5 random samples from 8 sampled groups belonging to the oral cavity part of data: hard palate, gingivae, tongue dorsum, teeth, palatine tonsils, throat, saliva and buccal mucosa. Samples were randomly drawn from the available. Evaluation metrics for metaSPARSim performance assessment The goodness in reproducing realistic sparsity, intensity and variability characteristics was evaluated following different both qualitative and quantitative measures: • Q-Q (quantile-quantile) plots [11]: a graphical method for comparing two probability distributions by plotting their quantiles against each other. The two compared distributions are considered to be equal when plotted data lay on the diagonal.
• boxplots [12]: a method for graphically representing numerical data collections through their quartiles.
• RDI (Raw data, Descriptive statistics, and Inferential statistics) plots [13]: this kind of plots permit to represent both punctual and distribution information, joining scatter plot, box plot and density plot together.
• Mann-Whitney U test [14]: applied to relative abundances vectors. This non parametric test is based on the null hypothesis that the two tested samples come from the same population (i.e. they both have the same median). If the resultant P -value is less than a fixed significance level (here 0.05), then the null hypothesis is rejected in favour of the alternative hypothesis, i.e. the two samples come from different populations. This test was used to check for possible statistically significant dissimilarities between real and simulated vectors of intensities and variances.
• Cohen's d [15]: applied to relative abundances vectors. It is one measure associated with the calculation of so-called effect size, a quantitative measure of the magnitude of a phenomenon; here, the effect size quantifies the size of the difference between two groups. In this analysis, we decided to include it alongside the significance test because it has been widely shown ( [16], [17], [18]) that when examining effects using large samples significant testing can be misleading because even small or trivial effects are likely to produce statistically significant results. Thus, reporting only the significant P-value for an analysis is not adequate to fully understand the results. Cohen's d is defined as the difference between two means divided by a standard deviation for the data, i.e.
where s is the pooled standard deviation, defined as where n 1 and n 2 are the two sample sizes and s 1 and s 2 are the related standard deviations. Table 1 contains cut-offs for magnitudes interpretation, as initially suggested by Cohen and expanded by Sawilowsky [19].             Table 4 Mann-Whitney U tests and effect size results in comparing real and simulated mean count distribution within groups for HMP dataset.