An empirical Bayes approach to normalization and differential abundance testing for microbiome data

Background Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances, and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data. Results Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure. Conclusions Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly-used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).

The non-degenerate case. A pair of nodes, labeled as 55 and 56, were set to be differentially abundant. This led to 7 differentially abundant leaf nodes, labeled as 2-8.

Figure S2
Comparison of recall and precision with data generated from the DTM model across different β: the non-degenerate case. We simulated 100 data sets from the DTM model with θ = 0.27 and β ∈ {0.1, 2, 4, 6, 7, 8}. a and b Recall of ttest and Wilcoxon rank sum test with various normalization methods. c and d Recall and precision of DESeq2, ANCOM, metagenomeSeq, Wrench, and those of t-test and Wilcoxon rank sum test, both applied after counts were normalized by eBay or eBay-tree.

Figure S3
Comparison of recall and precision with data generated from the DTM model across different θ: the non-degenerate case. We simulated 100 data sets from the DTM model with β = 4 and θ ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}. a and b Recall of t-test and Wilcoxon rank sum test with various normalization methods. c and d Recall and precision of DESeq2, ANCOM, metagenomeSeq, Wrench, and those of t-test and Wilcoxon rank sum test, both applied after counts were normalized by eBay or eBay-tree. Figure S4 The degenerate case. Two pairs of nodes, {55, 56} and {57, 58}, were set to be differentially abundant, but only 5 leaf nodes, labeled as 2, 3, 6, 7, 8, inherited the differences.

Figure S5
Comparison of recall and precision with data generated from the DTM model across different β: the degenerate case. We simulated 100 data sets from the DTM model with θ = 0.27 and β ∈ {0.1, 2, 4, 6, 7, 8}. a and b Recall of ttest and Wilcoxon rank sum test with various normalization methods. c and d Recall and precision of DESeq2, ANCOM, metagenomeSeq, Wrench, and those of t-test and Wilcoxon rank sum test, both applied after counts were normalized by eBay or eBay-tree.

Figure S6
Comparison of recall and precision with data generated from the DTM model across different θ: the degenerate case. We simulated 100 data sets from the DTM model with β = 4 and θ ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}. a and b Recall of t-test and Wilcoxon rank sum test with various normalization methods. c and d Recall and precision of DESeq2, ANCOM, metagenomeSeq, Wrench, and those of t-test and Wilcoxon rank sum test, both applied after counts were normalized by eBay or eBay-tree.

Figure S8
Comparison of recall and precision between eBay-tree and eBay-tree (global). To detect differentially abundant taxa for the non-degenerate case, we simulated 100 data sets from the DTM model. Counts were normalized by eBaytree. Rather than do the test globally on the normalized data at the leaf-node level (purple), our phylogeny-ware detection procedure carries out local tests at tree splits (red). a and b θ = 0.27 and β ∈ {0.1, 2, 4, 6, 7, 8}. c and d β = 4 and θ ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}. Figure S9 Timings (seconds) and space (log(bytes)), averaged over 10 runs with data generated from the DTM model with n 1 = n 2 = 50, versus the number of taxa.

Figure S10
The phylogenetic tree of 50 bacterial taxa inferred by maximum likelihood. We performed sequence alignment and built the tree using PyNAST and FastTree, respectively.  Figure S11 Differentially abundant species detected by Wilcoxon rank sum test based on the tree in Figure S10. (a) Visualization of set intersections among normalization methods in Table 1 and differential abundance testing methods in Table 2. (b) The number of matches between the top K taxa identified by random forests and the top K differentially abundant taxa detected by various testing methods. Wil: Wilcoxon, metaSeq: metagenomeSeq.

Figure S12
The phylogenetic tree of 50 bacterial taxa built based on distances. We computed the distances between any two species and constructed the tree using the neighbor-joining method in MEGA7.   Figure S12. The results were obtained in the same way as in Figure 4, except that the tree in Figure S12 was used for the phylogeny-aware detection procedure. (a) Visualization of set intersections among normalization methods in Table 1 and differential abundance testing methods in Table  2. (b) The number of matches between the top K taxa identified by random forests and the top K differentially abundant taxa detected by various testing methods. metaSeq: metagenomeSeq.  Figure S14 Differentially abundant species detected by Wilcoxon rank sum test based on the tree in Figure S12. The results were obtained in the same way as in Figure S11, except that the tree in Figure S12 was used for the phylogeny-aware detection procedure. (a) Visualization of set intersections among normalization methods in Table 1 and differential abundance testing methods in Table 2. (b) The number of matches between the top K taxa identified by random forests and the top K differentially abundant taxa detected by various testing methods. Wil: Wilcoxon, metaSeq: metagenomeSeq.