Beware to ignore the rare: how imputing zero-values can improve the quality of 16S rRNA gene studies results

Background 16S rRNA-gene sequencing is a valuable approach to characterize the taxonomic content of the whole bacterial population inhabiting a metabolic and spatial niche, providing an important opportunity to study bacteria and their role in many health and environmental mechanisms. The analysis of data produced by amplicon sequencing, however, brings very specific methodological issues that need to be properly addressed to obtain reliable biological conclusions. Among these, 16S count data tend to be very sparse, with many null values reflecting species that are present but got unobserved due to the multiplexing constraints. However, current data workflows do not consider a step in which the information about unobserved species is recovered. Results In this work, we evaluate for the first time the effects of introducing in the 16S data workflow a new preprocessing step, zero-imputation, to recover this lost information. Due to the lack of published zero-imputation methods specifically designed for 16S count data, we considered a set of zero-imputation strategies available for other frameworks, and benchmarked them using in silico 16S count data reflecting different experimental designs. Additionally, we assessed the effect of combining zero-imputation and normalization, i.e. the only preprocessing step in current 16S workflow. Overall, we benchmarked 35 16S preprocessing pipelines assessing their ability to handle data sparsity, identify species presence/absence, recovery sample proportional abundance distributions, and improve typical downstream analyses such as computation of alpha and beta diversity indices and differential abundance analysis. Conclusions The results clearly show that 16S data analysis greatly benefits from a properly-performed zero-imputation step, despite the choice of the right zero-imputation method having a pivotal role. In addition, we identify a set of best-performing pipelines that could be a valuable indication for data analysts. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04587-0.

Total sparsity. Table S1. Simulated Dataset 1, count matrix sparsity. Real, raw and pre-processed datasets are ordered for decreasing similarity with true sparsity (i.e. sparsity of real data). In the "Behaviour" column, datasets with overestimated sparsity are labelled with "O", while underestimation is labelled with "U".  Table S2. Simulated Dataset 2, count matrix sparsity. Real, raw and pre-processed datasets are ordered for decreasing similarity with true sparsity (i.e. sparsity of real data). In the "Behaviour" column, datasets with overestimated sparsity are labelled with "O", while underestimation is labelled with "U".  Table S3. Simulated Dataset 3, count matrix sparsity. Real, raw and pre-processed datasets are ordered for decreasing similarity with true sparsity (i.e. sparsity of real data). In the "Behaviour" column, datasets with overestimated sparsity are labelled with "O", while underestimation is labelled with "U".  Relative abundance profile Figure S1. Simulated Dataset 1 -SMAPE on relative abundance profile. For each pipeline and sample, SMAPE was computed between the relative abundance in ground truth data and the relative abundance in preprocessed data.   Figure S4. Simulated Dataset 1 -Aitchison's distance on relative abundance profile. For each pipeline and sample, Aitchison's distance was computed between the relative abundance in ground truth data and the relative abundance in pre-processed data. For each pipeline, the boxplots show the distribution of Aitchison's distance values. Distributions of Aitchison's distance values that result statistically lower (onesided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Aitchison's distance values on raw data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Aitchison's distance of raw data. Figure S5. Simulated Dataset 2 -Aitchison's distance on relative abundance profile. For each pipeline and sample, Aitchison's distance was computed between the relative abundance in ground truth data and the relative abundance in pre-processed data. For each pipeline, the boxplots show the distribution of Aitchison's distance values. Distributions of Aitchison's distance values that result statistically lower (one-sided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Aitchison's distance values on raw data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Aitchison's distance of raw data. Figure S6. Simulated Dataset 3 -Aitchison's distance on relative abundance profile. For each pipeline and sample, Aitchison's distance was computed between the relative abundance in ground truth data and the relative abundance in pre-processed data. For each pipeline, the boxplots show the distribution of Aitchison's distance values. Distributions of Aitchison's distance values that result statistically lower (one-sided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Aitchison's distance values on raw data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Aitchison's distance of raw data. Table S9. Results of statistical analysis on Aitchison's distance on relative abundance profile. For each dataset, pipeline and sample, Aitchison distance was computed between relative abundance in ground truth data and relative abundance in pre-processed data. For each of the three test datasets and for each pipeline, the corrected p-values (one-sided Mann-Whitney paired U-test on raw data Aitchison distance vs pre-processed data Aitchison distance, Benjamini-Hochberg correction, significant threshold 0.05) and the effect sizes are reported. p-values associated to statistically significant comparisons are indicated with the symbol "*" close to the p-value value. The interpretation of Cohen's d effect size is reported in brackets close to the related value: (N) negligible, (VS) very small, (S) small, (M) medium, (L) large, (VL) very large, (H) huge.           Table S15. Simulated Dataset 1 -Results of the pipelines on alpha richness index. In the "Wrong comparisons" column is reported, in decreasing order for each index, the number of group-group comparisons not agreeing with the ground truth. The percentage over the total number of comparisons is also reported in the column "%".  Table S16. Simulated Dataset 2 -Results of the pipelines on alpha richness index. In the "Wrong comparisons" column is reported, in decreasing order for each index, the number of group-group comparisons not agreeing with the ground truth. The percentage over the total number of comparisons is also reported in the column "%".  Table S17. Simulated Dataset 3 -Results of the pipelines on alpha richness index. In the "Wrong comparisons" column is reported, in decreasing order for each index, the number of group-group comparisons not agreeing with the ground truth. The percentage over the total number of comparisons is also reported in the column "%". The top-left plot shows the true structure of the data in terms of beta diversity. The remaining plots show the beta diversity obtained on the raw data and the 35 pre-processing pipelines outputs. Plots are displayed in a grid: imputation methods on the rows, normalization methods on the columns. The colours encode the different groups (i.e. experimental conditions) within the datasets.

Figure S14. Simulated Dataset 2 -Beta diversity (Bray-Curtis dissimilarity).
For each pipeline, the plots show the Non-metric Multidimensional Scaling (NMDS) dimensionality reduction on Bray-Curtis dissimilarity. The top-left plot shows the true structure of the data in terms of beta diversity. The remaining plots show the beta diversity obtained on the raw data and the 35 pre-processing pipelines outputs. Plots are displayed in a grid: imputation methods on the rows, normalization methods on the columns. The colours encode the different groups (i.e. experimental conditions) within the datasets.

Figure S15. Simulated Dataset 3 -Beta diversity (Bray-Curtis dissimilarity).
For each pipeline, the plots show the Non-metric Multidimensional Scaling (NMDS) dimensionality reduction on Bray-Curtis dissimilarity. The top-left plot shows the true structure of the data in terms of beta diversity. The remaining plots show the beta diversity obtained on the raw data and the 35 pre-processing pipelines outputs. Plots are displayed in a grid: imputation methods on the rows, normalization methods on the columns. The colours encode the different groups (i.e. experimental conditions) within the datasets. Figure S16. Simulated Dataset 1 -Beta diversity (Whittaker dissimilarity). For each pipeline, the heatmaps show the beta diversity values computed from each pair of samples within the dataset. The top-left heatmap shows the true structure of the data in terms of beta diversity. The remaining heatmaps show the beta diversity obtained by raw data and the 35 pre-processing pipelines outputs. Heatmaps are displayed in a grid: imputation methods on the rows, normalization methods on the columns.

Figure S17. Simulated Dataset 2 -Beta diversity (Whittaker dissimilarity).
For each pipeline, the heatmaps show the beta diversity values computed from each pair of samples within the dataset. The top-left heatmap shows the true structure of the data in terms of beta diversity. The remaining heatmaps show the beta diversity obtained by raw data and the 35 pre-processing pipelines outputs. Heatmaps are displayed in a grid: imputation methods on the rows, normalization methods on the columns.

Figure S18. Simulated Dataset 3 -Beta diversity. Beta diversity (Whittaker dissimilarity).
For each pipeline, the heatmaps show the beta diversity values computed from each pair of samples within the dataset. The top-left heatmap shows the true structure of the data in terms of beta diversity. The remaining heatmaps show the beta diversity obtained by raw data and the 35 pre-processing pipelines outputs. Heatmaps are displayed in a grid: imputation methods on the rows, normalization methods on the columns.
Differential abundance analysis Figure S19. Simulated Dataset 1 -Jaccard index on differentially abundant species. For each pipeline, Jaccard indices were computed between species identified as differentially abundant (DA) in pre-processed data and DA species in ground truth data. For each pipeline, the boxplots show the distribution of Jaccard indices. Distributions of Jaccard index values that result statistically larger (one-sided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Jaccard index values calculated between DA species identified in raw data and DA species in ground truth data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Jaccard index value of raw data. Figure S20. Simulated Dataset 2 -Jaccard index on differentially abundant species. For each pipeline, Jaccard indices were computed between species identified as differentially abundant (DA) in pre-processed data and DA species in ground truth data. For each pipeline, the boxplots show the distribution of Jaccard indices. Distributions of Jaccard index values that result statistically larger (one-sided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Jaccard index values calculated between DA species identified in raw data and DA species in ground truth data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Jaccard index value of raw data.

Figure S21. Simulated Dataset 3 -Jaccard index on differentially abundant species.
For each pipeline, Jaccard indices were computed between species identified as differentially abundant (DA) in pre-processed data and DA species in ground truth data. For each pipeline, the boxplots show the distribution of Jaccard indices. Distributions of Jaccard index values that result statistically larger (one-sided Mann-Whitney paired U-test, Benjamini-Hochberg correction, significant threshold 0.05) than Jaccard index values calculated between DA species identified in raw data and DA species in ground truth data are indicated with the symbol "*", followed by the interpretation of Cohen's d effect size (N: negligible, VS: very small, S: small, M: medium, L: large, VL: very large, H: huge). The vertical dashed line indicates the median Jaccard index value of raw data. Table S21. Results of statistical analysis on Jaccard index on differentially abundant species. For each pipeline and test dataset, Jaccard indices were computed between species identified as differentially abundant (DA) in pre-processed data and DA species in ground truth data. Similarly, Jaccard indices were computed between species identified as differentially abundant (DA) in raw data and DA species in ground truth data. For each test dataset and for each pipeline, a one-sided Mann-Whitney paired U-test (Benjamini-Hochberg correction, significant threshold 0.05) was performed to test whether Jaccard indices from pre-processed data are larger than Jaccard indices from raw data. Corrected p-values and effect sizes are reported in the table. pvalues associated to statistically significant comparisons are indicated with the symbol "*" close to the pvalue value. The interpretation of Cohen's d effect size is reported in brackets close to the related value: (N) negligible, (VS) very small, (S) small, (M) medium, (L) large, (VL) very large, (H) huge.