MBECS: Microbiome Batch Effects Correction Suite
BMC Bioinformatics volume 24, Article number: 182 (2023)
Despite the availability of batch effect correcting algorithms (BECA), no comprehensive tool that combines batch correction and evaluation of the results exists for microbiome datasets. This work outlines the development of the Microbiome Batch Effects Correction Suite, which integrates several BECAs and evaluation metrics into a software package for the R statistical computing environment.
The emergence of unwanted variation in next-generation sequencing applications is a well-researched challenge. Batch effects (BE) are a particular form of unwanted technical variation that can result from any distinct grouping of samples during the processing steps. The introduced variability reflects differences in, for example, environmental conditions, batches of reagents, sequencing machines, or sample handling between batches [1, 2]. Consequently, unwanted variation can negatively affect downstream statistical analyses because it acts as a confounding factor that can obscure or exaggerate the biological signal in a dataset. The extensive research into the causes of batch effects, and into strategies for preventing and correcting them, underlines the importance of this topic [4, 5]. While appropriate measures during the planning and execution of an experiment can limit the emergence and magnitude of batch effects, they are not entirely preventable and thus need to be accounted for before statistical analyses. Despite the availability of batch effect correcting algorithms (BECA) and instructive guides on mitigating BEs, no comprehensive tool that combines batch correction and evaluation of the results exists for microbiome datasets. This work introduces the Microbiome Batch Effects Correction Suite (MBECS), which integrates several established BECAs and evaluation metrics into a software package for the R statistical computing environment.
The Microbiome Batch Effect Correction Suite is designed as a software toolbox that enables users to estimate the severity of batch effects, facilitates the utilization of different BECAs, and finally provides comparative metrics to evaluate the success of each method. To that end, the package offers a convenient 5-step workflow that produces a report to guide the user in selecting the optimal results for downstream analyses.
The software builds upon the phyloseq package, which facilitates the intuitive import and export of existing microbiome datasets and enables the use of other count-based datasets. The package's data object extends the phyloseq class with additional fields that store normalized and batch-corrected feature abundance tables. All operations are performed on this single data object, which keeps track of the results, promotes tidy scripts, and enables the comparative reporting in MBECS.
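The idea of a single object that accumulates every normalized and corrected table can be sketched in a few lines. The following is a hypothetical illustration in Python, not the MBECS class or the phyloseq API; the name `AbundanceStore` and all of its fields and methods are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical table type: sample id -> feature values.
Table = Dict[str, List[float]]


@dataclass
class AbundanceStore:
    """Illustrative container only; this is NOT the MBECS data object."""
    counts: Table                               # raw feature counts
    metadata: Dict[str, Dict[str, str]]         # sample id -> covariates
    normalized: Dict[str, Table] = field(default_factory=dict)
    corrected: Dict[str, Table] = field(default_factory=dict)

    def add_normalized(self, method: str, table: Table) -> None:
        # Store a normalized table under the name of the method that made it.
        self.normalized[method] = table

    def add_corrected(self, method: str, table: Table) -> None:
        # Store a batch-corrected table under the name of the algorithm.
        self.corrected[method] = table

    def available_tables(self) -> List[str]:
        # Everything a comparative report could iterate over.
        return ["raw"] + sorted(self.normalized) + sorted(self.corrected)


store = AbundanceStore(
    counts={"s1": [10, 30], "s2": [5, 15]},
    metadata={
        "s1": {"group": "case", "batch": "A"},
        "s2": {"group": "control", "batch": "B"},
    },
)
store.add_normalized("tss", {"s1": [0.25, 0.75], "s2": [0.25, 0.75]})
store.add_corrected("bmc", {"s1": [0.0, 0.0], "s2": [0.0, 0.0]})
tables = store.available_tables()
```

Keeping the raw counts alongside every derived table is what makes a side-by-side comparison possible without re-running any step.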
The normalization methods implemented in MBECS are total-sum scaling (TSS) and centered log-ratio transformation (CLR). Available BECAs include established correction algorithms such as ComBat from the SVA package, removeBatchEffect from the limma package, and Remove Unwanted Variation 3 (RUV-3) implemented in the RUV package. Additionally, the package implements batch mean centering, percentile normalization, and singular value decomposition as correction approaches.
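As a minimal sketch of the two normalization methods named above, the following Python functions apply total-sum scaling and the centered log-ratio transformation to one sample. This is not the MBECS implementation; the function names and the pseudocount of 1.0 used to avoid taking the logarithm of zero are assumptions made for illustration.

```python
import math


def tss(counts):
    """Total-sum scaling: divide each count by the sample total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log value minus the mean of the log values.

    The pseudocount is an assumption of this sketch; it keeps zero
    counts from producing log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


sample = [10, 0, 30, 60]       # feature counts for one sample
tss_values = tss(sample)       # proportions that sum to 1
clr_values = clr(sample)       # log-ratios that sum to 0
```

TSS yields relative abundances that sum to one, while CLR values are centered on zero within each sample, which is the form most linear correction methods expect.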
Quantifying the variability in a dataset that can be attributed to batch effects is not trivial. A relative log expression (RLE) plot, for example, can indicate the presence of batch effects, yet it is not a suitable approach for determining whether they have been removed successfully by a correction algorithm. Thus, the suite implements several distinct metrics to provide the user with comprehensive information to assess the severity of BEs before and after batch-correction procedures. Available methods include constructing linear models from recorded biological and batch factors to estimate the variability attributed to batch effects before and after the correction procedures. Further approaches implemented are partial redundancy analysis and principal variance components analysis [13, 14]. Finally, the silhouette coefficient measures how well samples fit their respective biological groupings.
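To make the silhouette coefficient concrete, the sketch below computes it for a handful of samples using Euclidean distance. It is an illustration of the general definition by Rousseeuw, not the code used in MBECS; the function names, the toy coordinates, and the convention of returning 0 for a sample that is alone in its group are assumptions of this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_scores(points, labels):
    """Return one silhouette value per sample, each in [-1, 1]."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from sample i to every other sample, keyed by group.
        by_group = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_group.setdefault(lab, []).append(euclidean(p, q))
        if own not in by_group or len(by_group) < 2:
            scores.append(0.0)  # singleton group or only one group
            continue
        # a: mean distance within the own group.
        a = sum(by_group[own]) / len(by_group[own])
        # b: mean distance to the nearest other group.
        b = min(sum(d) / len(d) for lab, d in by_group.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


# Toy data: two tight clusters of samples in a 2-D ordination space.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
by_biology = silhouette_scores(points, ["case", "case", "control", "control"])
by_batch = silhouette_scores(points, ["b1", "b2", "b1", "b2"])
mean_biology = sum(by_biology) / len(by_biology)
mean_batch = sum(by_batch) / len(by_batch)
```

In this toy example the samples cluster by biological group and not by batch, so the mean silhouette is high for the biological labels and negative for the batch labels. After a successful correction one would hope to see this pattern.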
The package's native workflow, depicted in Fig. 1, creates a preliminary report upon import of the dataset. This report summarizes the covariate information, the distribution of samples across biological groups and known batches, heatmaps and box plots of the features that vary most with the batch factor, and relative log expression plots. The preliminary report also provides the metrics mentioned above to assess variability in the uncorrected data. Based on this report, the user can decide whether batch correction is required. The subsequent processing step allows the application of selected correction methods depending on the experimental design. Methods like RUV-3 specifically require technical replicates in different batches to work; batch mean centering is only applicable to datasets that comprise two-factor biological groupings, i.e., case–control studies. Therefore, the choice of methods is left to the user, and all the correction results are stored within the data object.
The third step constructs the post-correction report. This report provides comparative analyses between the uncorrected data and the results of all employed correction algorithms. The user can use these to evaluate how well each correction algorithm reduces unwanted variability while preserving the biological variation that the experimental design is meant to investigate. An instructive manual for the package and examples of preliminary and post-correction reports are available as supplemental material accompanying the online article (Additional file 1, Additional file 2, Additional file 3).
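Batch mean centering, one of the correction methods mentioned above, is simple enough to sketch directly: for each feature, the mean of every batch is subtracted from the samples in that batch. The Python function below is an illustration of that idea for a single feature, not the MBECS implementation; the function name and the toy values are assumptions.

```python
def batch_mean_center(values, batches):
    """Subtract from each value the mean of its batch (one feature)."""
    sums, counts = {}, {}
    for v, b in zip(values, batches):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [v - means[b] for v, b in zip(values, batches)]


# One feature measured in six samples; batch B is shifted by +10.
abundance = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batch = ["A", "A", "A", "B", "B", "B"]
centered = batch_mean_center(abundance, batch)
```

After centering, both batches share a mean of zero and the constant shift between them is gone. If biological groups are unevenly distributed across batches, this subtraction can also remove part of the biological signal, which is one reason to evaluate the result before using it downstream.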
The Microbiome Batch Effect Correction Suite is available as a software package for the R programming framework at Bioconductor. The latest development version can be obtained from the GitHub repository.
Availability and requirements
Project name: MBECS Microbiome Batch Effect Correction Suite
Project home page: http://www.bioconductor.org/packages/release/bioc/html/MBECS.html
Operating system(s): Platform independent
Programming language: R (>= 4.1)
Other requirements: CRAN and Bioconductor packages (methods, magrittr, phyloseq, limma, lme4, lmerTest, pheatmap, rmarkdown, cluster, dplyr, ggplot2, gridExtra, ruv, sva, tibble, tidyr, vegan, stats, utils, Matrix)
Any restrictions to use by non-academics: None
Availability of data and materials
The source code is freely available under the Artistic-2.0 license at https://github.com/rmolbrich/MBECS and at https://bioconductor.org/packages/release/bioc/html/MBECS.html. The package's vignette and examples use artificial mock-up data to illustrate the workflow and its execution. The package vignette and two example reports are available as supplementary data.
1. Chen C, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS ONE. 2011;6:e17238.
2. Čuklina J, et al. Review of batch effects prevention, diagnostics, and correction approaches. In: Matthiesen R, editor. Mass spectrometry data analysis in proteomics, methods in molecular biology. New York: Springer; 2020. p. 373–87.
3. Goh WWB, et al. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 2017;35:498–507.
4. Wang Y, Lê Cao KA. Managing batch effects in microbiome data. Brief Bioinform. 2020;21:1954–70.
5. Scherer A, editor. Batch effects and noise in microarray experiments: sources and solutions. Chichester: Wiley; 2009.
6. Zhou L, et al. Examining the practical limits of batch effect-correction algorithms: when should you care about batch effects? J Genet Genomics. 2019;46:433–43.
7. McMurdie PJ, Holmes S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE. 2013;8:e61217.
8. Kucera M, Malmgren B. Logratio transformation of compositional data: a resolution of the constant sum constraint. Mar Micropaleontol. 1998;34:117–20.
9. Leek JT, et al. The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–3.
10. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–52.
11. Gibbons SM, et al. Correcting for batch effects in case-control microbiome studies. 17
12. Gandolfo LC, Speed TP. RLE plots: Visualizing unwanted variation in high dimensional data. PLoS ONE. 2018;13:e0191629.
13. Li J, et al. Principal variance components analysis: estimating batch effects in microarray gene expression data. In: Scherer A, editor. Batch effects and noise in microarray experiments. Chichester: Wiley; 2009. p. 141–54.
14. Liu Q. Variation partitioning by partial redundancy analysis (RDA). Environmetrics. 1997;8:75–85.
15. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
Open Access funding enabled and organized by Projekt DEAL. Hauke Busch acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC 22167-390884018).
The authors declare that they have no competing interests.
Cite this article
Olbrich, M., Künstner, A. & Busch, H. MBECS: Microbiome Batch Effects Correction Suite. BMC Bioinformatics 24, 182 (2023). https://doi.org/10.1186/s12859-023-05252-w