 Methodology Article
 Open Access
 Published:
An empirical Bayes approach to normalization and differential abundance testing for microbiome data
BMC Bioinformatics volume 21, Article number: 225 (2020)
Abstract
Background
Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional carrying only relative information. Other characteristics of microbiome data include pronounced overdispersion in taxon abundances, and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data.
Results
Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a treebased extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogenyaware detection procedure.
Conclusions
Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonlyused methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
Background
It is well known that microbes interact with their human host. The human microbiome, which refers to the collection of microbes and their genetic information in the human body, contributes to healthy human physiology and development, and dysbiosis of microbial communities is linked to many diseases, such as obesity, type 2 diabetes, and inflammatory bowel disease [1–3]. Host genetics and environmental factors, in turn, affect the health and diversity of the human microbiome [4, 5]. However, the mechanisms underlying human health and disease remain largely unknown because of the complexity and dynamics of microbial communities. In order to understand the taxonomic composition and biological function of microbiomes, highthroughout sequencing technologies and advanced bioinformatics tools are now routinely employed in microbiome studies [6]. For example, marker gene analysis involves extracting DNA from primary samples, sequencing a highly variable region, and clustering sequence reads into Operational Taxonomic Units (OTUs) by sequence similarity (e.g., 97%). The evolutionary relationships among OTUs can also be inferred, by using a reference database, or by inferring the phylogenetic tree de novo [7].
Like differential expression analysis in microarray studies, one fundamental task in microbiome studies is differential abundance analysis, that is, to detect OTUs or species that have differential abundance between two or more experimental conditions, e.g., health versus disease [8]. Although differential expression analysis has been extensively studied, methods designed for continuous microarray data are not directly applicable for discrete microbiome data. The problem is further complicated by inherent characteristics of microbial community sequencing data [9]. In particular, the total reads per sample, known as the sequence depth or library size, can vary by orders of magnitude, and some OTUs are rare and therefore the data matrix is sparse. Consequently, there is a need to develop specialized analytical tools for microbiome data. Microbiome analysis workflows often begin with some type of normalization. Two commonlyused normalization approaches are rarefying, which subsamples the data without replacement to uniform sequence depth across samples, and total sum scaling, which divides read counts by the total count in each sample and bases downstream analyses on relative abundances [10]. While these two methods work well for the purpose of ordination, they often result in a high rate of false positives when testing for differentially abundant species [11]. Although rarefying is a recommended option in major data analysis toolkits [12, 13], it is inadmissible because it throws away some data and ignores the compositionality [10]. Microbiome data are compositional because the abundance of an OTU in a specimen is not the abundance of the corresponding taxon in the microbial ecosystem [14]. The special feature of compositional data is that a composition carries only relative abundance information.
Total sum scaling conditions on sequence depth and results in compositional data, i.e., raw proportions that sum up to 1. Since the data points map to a simplex rather than the Euclidean space, standard data analysis techniques, such as the ttest, are invalid. Instead of using the proportions directly, methods for analyzing compositional data all involve some type of transformation, the most common of which is the logratio transformation [15, 16]. Once the unitsum constraint is removed, classical statistical methods apply, with care and proper interpretation to transformed data. Indeed, logratiobased inferences are increasingly popular in downstream microbiome analyses [14, 17–19].
Note that the raw proportions from total sum scaling are operationally equivalent in every way to the original count data when logratio transformed. One major problem with this naive scaling normalization technique [20] is when the normalized data have zeros, the log transformation is problematic. One approach to this issue is to replace the zero by a small positive value and renormalize the data. Nevertheless, the choice of the constant is problemdependent and its effect on the results is not wellstudied [21]. Zero replacement is an active area of research, and statistically rigorous methods have emerged in the literature. For example, [22] and [23] respectively developed a nonparametric approach and a parametric treatment for imputing zeros. More recently, motivated by the fact that raw proportions from total sum scaling are maximum likelihood estimates of the unknown parameters under the multinomial model, [24] and [25] proposed replacement techniques from a Bayesian point of view. Assuming a Dirichlet prior for the set of proportions, a zero value is replaced by its posterior Bayesian estimate. The Bayesian method gives an estimate of the true composition, and hence can be viewed as a modelbased alternative to total sum scaling.
The posterior Bayesian estimator shrinks the maximum likelihood estimator towards the mean vector of a Dirichlet prior. The smoothed estimates are more accurate than the raw proportions for OTUs with extremely high or low read counts. However, the obvious drawback of the existing methods is that a uniform prior is used, and therefore the shrinking point is uninformative. In addition, the prior is applied to single data points, but the observations may have a lot in common, and these similarities can be used to learn from the experience of others [26]. In this paper, we propose an empirical Bayes approach to normalization. Rather than adopting an uninformative prior, we assume that the parameters of the Dirichlet distribution is unknown, and we estimate them by using all observations in the data set. In addition to uneven sequence depth, data sparsity, and compositionality, the proposed method is designed to address overdispersion and phylogeny.
It is known that microbiome data, and sequencing data in general, are overdispersed, and that the multinomial distribution does not allow for overdispersion. Overdispersion is also a natural consequence of the data laying on the simplex. To account for the excess variation, the Dirichletmultinomial (DM) distribution is commonly used in practice [27, 28]. DM is an analytically tractable compound distribution. This is a consequence of the fact that the Dirichlet distribution is a conjugate distribution to the multinomial distribution. The DM parameters are the hyperparameters in the Dirichlet prior. We estimate these parameters from OTU counts by maximum likelihood. Then, we plugin the estimates into the prior distribution, and normalize the data using the posterior mean. We further extend our method by incorporating phylogeny into the analysis. This is accomplished by using a treebased extension of DM, called the Dirichlettree multinomial (DTM) distribution [29, 30]. Loosely speaking, DTM is a product of independent local DMs on internal nodes of the phylogenetic tree. While DM intrinsically imposes a negative correlation structure among bacterial counts, DTM allows for both positive and negative correlations [31].
Results
We generated bacterial counts from a DM or DTM model, with the true vector of proportions π estimated based on a real dataset [32], which contains the counts of 60 taxa from 1897 samples, together with a phylogenetic tree describing the evolutionary relationship among these taxa. We note that, as mentioned earlier, in microbial ecology studies compositionality is not something imposed by post sequencing processing, and so microbiome data are compositions off the machine.
Simulations without tree information
We first generated taxa abundance data from the DM model, with an overdispersion parameter θ=0.15. We set the sample size n_{1}=n_{2}=50 and the number of taxa p=40. The sequencing depth was drawn uniformly from 5000 to 50000. Denote ϕ_{k}=(ϕ_{k1},...,ϕ_{kp})^{T} as the vector of true proportions in group k∈{1,2}. Initially, we generated ϕ_{1}=ϕ_{2} as a random sample from the 60dimensional vector π. We then normalized it to have unit sum, and varied the relative abundances of 6 taxa as follows:
where s∈{1,22,23}, t∈{7,11,40} for k=1, s∈{7,11,40}, t∈{1,22,23} for k=2, and β∈{0.01,0.15,0.2,0.25,0.3,0.35} represents the degree of difference between two groups. We further explored the impact of overdispersion. We fixed β=0.25 and varied θ∈{0.05,0.1,0.15,0.2,0.25,0.3}.
We estimated the recall and precision using 100 simulated data sets. The results are shown in Figs. 1 and 2. Generally, the recall and precision increased as the effect size β increased, and as the overdispersion parameter θ decreased. From the upper panels we see that the recall of the empirical Bayes method, with ttest, was higher than other normalization methods. From the lower panels we can see that eBayt and eBayWilcoxon were the overall winner. DESeq2, Wrench, metagenomeSeq had lower recall, and for small values of β, the precision of ANCOM was low.
Simulations with tree information
In this section, taxa abundances were generated from the DTM model, with the tree structure shown in Fig. 3. We set θ=0.27 and n_{1}=n_{2}=50. The depth was sampled from a uniform distribution on (5000, 50000).
The nondegenerate case. A pair of nodes, labeled as 55 and 56, were set to be differentially abundant in a similar way as in the previous section. Specifically, at each tree split, we sampled ϕ_{0} from π, and set ϕ_{1}=ϕ_{2}=(ϕ_{0},1−ϕ_{0})^{T}. We then increased the relative abundance of one of them, while decreasing the relative abundance of the other, by invoking (1) with s∈{55} and t∈{56} for k=1, and s∈{56} and t∈{55} for k=2. This led to 7 differentially abundant leaf nodes, labeled as 2–8 (Additional file 1: Figure S1). We set the effect size β∈{0.1,2,4,6,7,8}. We also used simulated data to investigate the effect of overdispersion by fixing β=4 and setting θ∈{0.05,0.1,0.15,0.2,0.25,0.3}. Figures S2 and S3 summarize the simulation results. The empirical Bayes method eBaytree was superior to other normalization methods, and when applied with ttest or Wilcoxon rank sum test, it outperformed DESeq2, ANCOM, Wrench, and metagenomeSeq.
The degenerate case. Two pairs of nodes, {55,56} and {57,58}, were set to be differentially abundant, but only 5 leaf nodes, labeled as 2, 3, 6, 7, 8, inherited the differences (Figure S4). This was achieved by taking s∈{55,57} and t∈{56,58} for k=1, and s∈{56,58} and t∈{55,57} for k=2. We set β∈{0.1,2,4,6,7,8}. On the other hand, we fixed β=4 and set θ∈ {0.05,0.1,0.15,0.2,0.25,0.3}. The simulation results were shown in Figures S5 and S6. Again, the conclusions were similar.
More simulations
Simulation from DM with a random tree. We further examined the behavior of eBaytree in the absence of tree, using the same data as in simulations without tree information. We generated the tree structure randomly, and used eBaytree for data normalization. The results are summarized in Figure S7. We can see that incorporating the tree compulsorily did not deteriorate the performance much. In the presence of tree, we also compared the performance of the phylogenyware detection procedure and the global method of applying ttest or Wilcoxon rank sum test after normalizing data using (15). Figure S8 shows that the naive method failed.
Simulated data from the gammaPoisson model. To assess the robustness of the proposed methodology, we generated taxa counts from the gammaPoisson model which was used for evaluating the performance of ANCOM [14]. We set the sample size n_{1}=20 for case and n_{2}=30 for control with p=100. To generate the difference between two conditions, for the first 5 significant features in case, we changed the proportions of those features by adding u_{ij} to the Poisson parameter μ_{ij}. For the remaining 5 features, we subtracted u_{ij} from the Poisson parameter μ_{ij}. The μ_{ij} was sampling from a Gamma distribution Gamma(200,1) and u_{ij} was sampling from a uniform distribution U((δ−1)×30,δ×30) where δ∈{1,2,3,4,5}. The simulation results in Fig. 4 show that eBayt compared favorably with ANCOM and both were superior to Wrench, DESeq2, and metagenomeSeq.
Simulated data from the zeroinflated lognormal (ZILN) model. As suggested by a referee, we also generated taxa counts from the zeroinflated lognormal model which was used for assessing the performance of metagenomeSeq [8]. We set the sample size n_{1}=n_{2}=50 and the number of taxa p=100. To generate the difference between conditions, for the first 5 significant features in one of the conditions, we changed the proportions of those features by adding 1/50×δ percentage of the sample’s total counts. For the remaining 5 features, we subtracted 1/50×δ percentage of the sample’s total counts. The δ was set to be {0.1,0.3,0.5,0.7,0.9}. As expected, we see from Fig. 5 that metagenomeSeq outperformed DESeq2. Unfortunately, methods treating microbiome data
as compositions, especially eBay and ALDEx2 in the Bayesian framework, failed in this case. The reason is that the zeros generated by the ZILN model are all structural zeros, while in eBay and ALDEx2 it is assumed implicitly that zeros are the result of undersampling. As will be discussed later, extending the empirical Bayes method to handling both structural zeros and sampling zeros is interesting and important.
Finally, in Figure S9, we show comparative timings in seconds and space in bytes for problems with n_{1}=n_{2}=50 and different numbers of taxa. While eBay was computationally more efficient, with parallel computation the computational complexity of eBaytree performed similarly with eBay.
Gut microbiota and malnutrition
Childhood undernutrition is a significant health problem in Southern Asia and sub Saharan Africa, and severe acute malnutrition (SAM) remains a major cause of child mortality worldwide [33]. For this reason, the World Health Organization updated guidelines for the improved management of SAM in infants and children [34]. In a recent study of 996 stool samples collected monthly from 50 healthy Bangladeshi children during the first 2 years of life, [32] identified bacterial taxonomic biomarkers for characterizing gutmicrobiota maturation. By applying random forests from the perspective of regression, they determined a list of 60 bacterial species, ranked in descending order of their importance to the regression. Incorporating these biomarkers into a prediction model, and applying this model to children with SAM enrolled in a randomized trial, they showed that SAM is significantly associated with microbiota immaturity.
Rather than summarizing the relative abundances of these 60 bacterial taxa into a single index (i.e., the predicted value), we revisited the problem in terms of differential abundance testing. To eliminate the effect of age, we restricted our analysis to 12 to 18monthold children. There were 20 healthy children in the singleton validation dataset and 27 children with SAM. We further filtered bacterial species with prevalance less than 20%, resulting in 50 taxa. We extracted representative sequences for these taxa, performed sequence alignment, and then constructed a phylogenetic tree (Figure S10), using the default and recommended methods PyNAST and FastTree in QIIME [12]. We applied ttest and Wilcoxon rank sum test after normalizing counts by the treebased empirical Bayes method and other methods in Table 1, and compared them to DESeq2, ANCOM, Wrench, and metagenomeSeq. Note that eBay took 0.41 seconds to analyze the data on a Macbook Pro (Intel Corei5, 1.4 GHz, 8GB RAM).
To assess the performance of our method and other methods, we recorded the lists of differentially abundant taxa. In addition, for each method, we ordered the taxa according to their pvalues, and calculated the number of matches between the top K differentially abundant taxa and the top K taxa in the ranked list of 60 bacterial species, where K=10,15,20, and 25. The results are summarized in Fig. 6 and Figure S11. From Fig. 6a, we can see that eBayt and eBaytreet detected more differentially abundant species than other methods. The two taxa detected uniquely by eBayt and eBaytreet were Rumicnococcus_sp_5_1_39BFAA and Megamonas. Million et al. [37] indicated that Ruminococcus_sp_5_1_39BFAA tends to be depleted in malnourished children, while Megamonas was reported to be significantly altered in the malnourished children compared to agematched healthy children [32]. Furthermore, Fig. 6b shows that the ranked list of taxa detected by eBayt and eBaytreet was more concordant with that identified by the random forests algorithm. FastTree infers the phylogeny by maximum likelihood. Alternatively, we computed the distances between any two species based on an evolution model [38], and then built a phylogenetic tree (Figure S12) based on these distances [39]. The corresponding results are summarized in Figures S13 and S14, and the conclusions are qualitatively similar. These results confirm that compared to healthy children, children with SAM had significant gutmicrobiota immaturity.
Gut microbiome and body mass index
Studies have shown that gut microbiome is associated with body mass index (BMI) and explains a significant fraction of BMI variation [5]. In a study of the impact of longterm dietary patterns on gut microbiome composition, [40] showed that taxa correlated with BMI also correlated with fat and percent calories from saturated fatty acids. In this study, the researchers enrolled 98 healthy volunteers and collected their stool samples as well as diet information. DNA samples were extracted and analyzed by 454/Roche pyrosequencing, and sequence reads were processed by the QIIME pipeline. To explore the relationship between BMI and gut microbiota, we reanalysed the data via differential abundance testing. Following the World Health Organization guideline, we categorized BMI as normal weight, overweight, and obese, and for simplicity we focused on the normal weight and obese individuals. After filtering the taxa with prevalence less than 10% and abundance <0.2% in all samples, we were left with 314 taxa and 70 samples. eBay took 1.678 seconds to process the data on a Macbook Pro (Intel Corei5, 1.4 GHz, 8GB RAM). The results are summarized in Fig. 7. The 9 taxa identified uniquely by eBayt were mainly from the families Lachnospiraceae and Ruminococcaceae, both of which were reported to be significantly correlated with BMI [41, 42].
Discussion
Although the important role of microbiota in human health and disease has been recognized increasingly over the past decade, data from highthroughput DNA sequencing present challenges to statistical analysis and interpretation. We have proposed an empirical Bayes technique for microbiome data normalization prior to downstream analysis. Assuming a multinomial distribution for the read counts and specifying a Dirichlet prior for the underlying proportions, our method shrinks the relative abundances towards the mean vector of the prior. The marginal distribution of the data allows for overdispersion and has the same set of parameters as the prior distribution. We estimated these parameters empirically from the data by maximizing the evidence. To incorporate the phylogenetic tree in the normalization process, we extended our method by taking as the prior a product of Dirichlet distributions that factorized over the tree. We examined the downstream effect of normalization in the context of differential abundance analysis, by applying ttest and Wilcoxon rank sum test to the normalized data. In the presence of tree, rather than using the normalized data directly, we proposed a phylogenyaware differential abundance detection procedure by carrying out local tests at tree splits.
The excessive number of zeros in bacterial counts can lead to some inefficiency in the normalization and downstream analysis. In this paper, we have introduced an empirical Bayes method to normalize data and we assume implicitly that all microbes are present in the microbial ecosystem and the zeros are the result of undersampling. However, in the presence of hundreds or thousands of bacterial species, these zeros can also represent components that are truly absent from the community [8, 9], especially when the specimens are drawn from different environments. How to normalize count data that allows zeroinflation is an interesting research topic. The zeroinflated generalized Dirichlet model [43] can potentially provide a solution to this problem. Work along this line is in progress.
Conclusions
Uneven library size, data sparsity, compositionality, and overdispersion, all make drawing valid biological inferences from microbial datasets difficult. To overcome these challenges, we proposed an empirical Bayes technique for microbiome data normalization prior to downstream analysis. We further extended our method by incorporating the phylogenetic tree into the normalization process. We examined the downstream effect of normalization in the context of differential abundance analysis. In the presence of tree, we proposed a phylogenyaware detection procedure. Results from an extensive simulation study and real data applications showed that the empirical Bayes approach was more efficient than other normalization methods, and the corresponding testing method compared favorably with stateoftheart methods.
Methods
Consider a microbiome dataset with n samples and p OTUs. For the ith sample, let x_{i}=(x_{i1},...,x_{ip})^{T} denote the vector of read counts of p OTUs, and \(N_{i}=\sum _{j=1}^{p} x_{ij}\) the total number of reads. Total sum scaling can be derived through maximum likelihood. Given N_{i}, it is natural to model the abundance vector according to a multinomial distribution, x_{i}∼Mult(π_{i};N_{i}). The probability mass function is
where \(\boldsymbol {\pi }_{i} = (\pi _{i1},...,\pi _{ip})^{\mathrm {T}}, 0<\pi _{ij}<1, \sum _{j=1}^{p}\pi _{ij} = 1\), and Γ(·) is the gamma function. Then the method of maximum likelihood yields the naive count normalization
Empirical Bayes normalization
One disadvantage of total sum scaling is that the estimates for OTUs with zero counts are simply zero, causing difficulty in downstream analyses, such as logratio based compositional data analysis. To overcome this problem, we consider a Bayesian approach. Specifically, we assume that x_{i}∼Mult(π_{i};N_{i}), and specify a prior distribution for π_{i}. We then calculate the posterior for π_{i} given x_{i}, and compute the posterior mean estimate.
The most common and convenient prior for π_{i} is the Dirichlet distribution [44]. This distribution, denoted by Dir(α), is parameterized by a pvector of positive scalars, α=(α_{1},…,α_{p})^{T}, and has probability density function
Multiplying the multinomial distribution Mult(π_{i};N_{i}) by the Dirichlet prior Dir(α) gives the posterior distribution
This is the density of Dir(x_{i}+α). The posterior mean is given by
Posterior Bayesian estimation produces nonzero estimates for the true proportions. Furthermore, it is easy to check that the posterior mean is a weighted average of the vector of raw proportions and the mean of the prior distribution:
where \(\phi _{j} = \alpha _{j}/\sum _{j = 1}^{p}\alpha _{j}\) and \(\alpha _{+} = \sum _{j = 1}^{p}\alpha _{j}\). Put another way, we shrink the maximum likelihood estimates towards our knowledge about π_{i} before we see the data.
In practice, the hyperparameters α_{j} are unknown, and so we cannot use posterior Bayesian estimates. Uniform priors, which assume that α_{1}=⋯=α_{p}, are used in [24] and [25]. The mean vector of a uniform prior, (1/p,...,1/p)^{T}, is the center or neural element of the (p−1)dimensional simplex with the Aitchison metric [16]. Nevertheless, we do not have to take this composition as the preferred shrinking point. In the rest of this section, we propose an empirical Bayes approach by empirically estimating α_{j} from the data.
Note that after integrating x_{i}∼Mult(π_{i};N_{i}) over π_{i}∼Dir(α), the marginal distribution of x_{i} is Dirichletmultinomial, x_{i}∼DirMulti(α), with probability mass function
The DM distribution has the same set of parameters as the Dirichlet prior. Furthermore, it is the most common distribution for modeling overdispersed and multivariate taxa count data [28, 45]. Let θ=1/(1+α_{+}), we call θ the overdispersion parameter. Let \({\hat {\boldsymbol \alpha }}\) be the maximum likelihood estimate. Substituting it into (6) gives the empirical Bayes solution for normalization
Phylogenyaware normalization
Suppose that the phylogenetic relationships among OTUs can be encoded by a rooted tree \(\mathcal {T} = (\mathcal {L}, \mathcal {I})\), where terminal nodes, or leaves, in \(\mathcal {L}\) correspond to OTUs, and internal nodes in \(\mathcal {I}\) represent bacterial taxa at different taxonomic levels. Figure 3 shows an example of a binary tree over 50 OTUs. For each internal node \(A \in \mathcal {I}\), let \(\mathcal {C}(A)\) be the set of child nodes of A. For each A and \(w \in \mathcal {C}(A)\), let x_{Aw} and π_{Aw} be the total count and probability in the branch from A to w. Here, for ease of notation, we omit the subscript i. One attractive property of the multinomial distribution is that it can be factorized over \(\mathcal {T}\) [29]. Specifically, let \( b_{Aw} = \pi _{Aw} / \sum _{w \in \mathcal {C}(A)}\pi _{Aw}\), b_{A}=(b_{Aw},w∈A), and x_{A}=(x_{Aw},w∈A), then
The conjugate prior for this parameterization is no longer a single global Dirichlet density, but rather a product of local Dirichlet densities, one for each internal node:
This is known as the Dirichlettree distribution [46]. The posterior distribution has the form
This density is exactly that of a Dirichlettree distribution, except that we update the hyperparameters after seeing the data.
The development so far is based on Dirichlet priors on branches. The posterior density function of π given the data can be computed by a change of variables and is given in [47]. Furthermore, the posterior mean of π is
where we define δ_{Aw}(l) to be 1, if the branch from A to w leads to \(l \in \mathcal {L}\), and 0 otherwise.
The remaining step is the same: the Bayes estimator is itself being empirically estimated from the data by maximizing the evidence, i.e., the marginal distribution of the data. This distribution, known as the Dirichlettree multinomial distribution (DTM), is a product of DM distributions that factorize over the tree
Comparing to DM, a distinctive property of DTM is that the correlations between bacterial counts can be simultaneously negative and positive [29, 31]. Since the distributions placed on different internal nodes are independent, maximum likelihood estimation can be carried out separately and in parallel. Let \(\hat {\boldsymbol {\alpha }}_{A}\) be the maximum likelihood estimate. Substituting it into (13) leads to the phylogenyaware normalization
Centered logratio transformation
The normalization methods investigated in this paper are shown in Table 1. Except for rarefying, all methods infer proportions from the raw read counts. Because proportions are constrained by the simplex, standard statistical methods for downstream analyses are not applicable. To convert proportions into linear independent components, [48] introduced the centered logratio transformation, which is an isometric transformation of the simplex with the Aitchison metric onto a subspace of real space with the Euclidean metric. Let (u_{1},…,u_{m})^{T} denote a generic mvector of proportions. This transformation has the form
Transformed data are then analyzed in the same way as standard data. We employ this strategy in this paper.
Differential abundance analysis
After effective normalization, a common downstream analysis is differential abundance testing. In this section, we examine the impact of normalization using the results from a differential abundance analysis. As with [10, 11], and [49], we focus on detecting microbes that are differentially abundant between two conditions. Table 2 lists the methods considered in this paper. For the moment, we assume that the tree information is not available. Among these, ttest and Wilcoxon rank sum test are standard methods for comparing two groups, DESeq2 [50] is modelbased and is borrowed from the field of RNAseq, and metagenomeSeq [8] and ANCOM [14] are also modelbased and are proposed specifically for microbiome sequencing data.
Note that ttest and Wilcoxon rank sum test apply to either raw counts or proportions, and ANCOM normalizes the raw counts by taking ratios relative to a reference taxon. DESeq2 and metagenomeSeq use raw counts, but each of them has a builtin normalization process. Furthermore, ANCOM involves the replacement of zeros by a small positive number. For simplicity, a pseudocount of one is added to the raw counts before applying normalization. To control the false discovery rate (FDR), all tests are corrected for multiple testing using the Benjamini–Hochberg procedure [53].
Differential abundance analysis in the presence of tree structure is somewhat complicated. To our knowledge, incorporating the dependence structure among the microbes into any of ANCOM, DESeq2, and metagenomeSeq is not trivial and deserves further study. Here, we propose a phylogenyaware detection approach based on either ttest or Wilcoxon rank sum test. One simple approach is to do the test directly after the treebased normalization (15). However, the obvious drawback of this naive approach is that the estimation error in a tree split is propagated down to all of the splits below it. To alleviate this problem, instead of a global test, we carry out local tests at tree splits. If a node is differentially abundant, then so are all of its descendants. Since the number of nodes at a split is much lower than the number of leaf nodes and the local tests can be done splitbysplit, this approach is computationally more stable and less intensive.
There is an exception. If two nodes are differentially abundant and are both ancestors of a leaf node, then it is possible that the leaf node is not differentially abundant. To finesse the problem, we note that in these degenerate cases, there must be a path from the differentially abundant nodes to the leaf node. We make a correction by locating the most recent ancestor node to this path that is nondifferentially abundant, and do test on the set of all leaf nodes of this node and update the results. The treeguided detection procedure is summarized in Algorithm 1.
Abbreviations
 OTUs:

Operational taxonomic units
 DM:

Dirichletmultinomial
 DTM:

Dirichlettree multinomial
 FDR:

False discovery rate
 SAM:

Severe acute malnutrition
 ZILN:

Zeroinflated lognormal
References
Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012; 13(4):260.
Clemente JC, Ursell LK, Parfrey LW, Knight R. The impact of the gut microbiota on human health: an integrative view. Cell. 2012; 148(6):1258–70.
Zhao L, Zhang F, Ding X, Wu G, Lam YY, Wang X, et al.Gut bacteria selectively promoted by dietary fibers alleviate type 2 diabetes. Science. 2018; 359(6380):1151–6.
Spor A, Koren O, Ley R. Unravelling the effects of the environment and host genotype on the gut microbiome. Nat Rev Microbiol. 2011; 9(4):279.
Rothschild D, Weissbrod O, Barkan E, Kurilshikov A, Korem T, Zeevi D, et al.Environment dominates over host genetics in shaping human gut microbiota. Nature. 2018; 555(7695):210–15.
Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, et al.Experimental and analytical tools for studying the human microbiome. Nat Rev Genet. 2012; 13(1):47.
NavasMolina JA, PeraltaSánchez JM, González A, McMurdie PJ, VázquezBaeza Y, Xu Z, et al.Advancing Our Understanding of the Human Microbiome Using QIIME. In: Methods in Enzymology. vol. 531. Elsevier: 2013. p. 371–444. https://doi.org/10.1016/b9780124078635.000198.
Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013; 10(12):1200.
Li H. Microbiome, metagenomics, and highdimensional compositional data analysis. Ann Rev Stat Appl. 2015; 2:73–94.
McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014; 10(4):e1003531.
Weiss SJ, Xu Z, Amir A, Peddada S, Bittinger K, Gonzalez A, et al.Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ PrePrints. 2015. https://doi.org/10.7287/peerj.preprints.1157v1.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al.QIIME allows analysis of highthroughput community sequencing data. Nat Methods. 2010; 7(5):335.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al.Introducing mothur: opensource, platformindependent, communitysupported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009; 75(23):7537–41.
Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis. 2015; 26(1):27663.
Aitchison J. The Statistical Analysis of Compositional Data. J R Stat Soc Ser B. 1982; 44(2):139–77.
Egozcue JJ, PawlowskyGlahn V, MateuFigueras G, BarceloVidal C. Isometric logratio transformations for compositional data analysis. Math Geol. 2003; 35(3):279–300.
Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLOS Comput Biol. 2012; 8(9):e1002687.
Lin W, Shi P, Feng R, Li H. Variable selection in regression with compositional covariates. Biometrika. 2014; 101(4):785–97.
Wang T, Zhao H. Structured subcomposition selection in regression and its application to microbiome data analysis. Ann Appl Stat. 2017; 11(2):771–91.
Kumar MS, Slud EV, Okrah K, Hicks SC, Hannenhalli S, Bravo HC. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics. 2018; 19(1):799.
Costea PI, Zeller G, Sunagawa S, Bork P. A fair comparison. Nat Methods. 2014; 11(4):359.
MartínFernández JA, BarcelóVidal C, PawlowskyGlahn V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol. 2003; 35(3):253–78.
MartínFernández JA, Hron K, Templ M, Filzmoser P, PalareaAlbaladejo J. Modelbased replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal. 2012; 56(9):2688–704.
Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of highthroughput sequencing datasets: characterizing RNAseq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014; 2(1):15.
MartínFernández JA, Hron K, Templ M, Filzmoser P, PalareaAlbaladejo J. Bayesianmultiplicative treatment of count zeros in compositional data sets. Stat Model. 2015; 15(2):134–58.
Efron B. LargeScale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction: Cambridge University Press; 2012.
Mosimann JE. On the compound multinomial distribution, the multivariate βdistribution, and correlations among proportions. Biometrika. 1962; 49(1/2):65–82.
Chen J, Li H. Variable selection for sparse Dirichletmultinomial regression with an application to microbiome data analysis. Ann Appl Stat. 2013; 7(1):418–42.
Wang T, Zhao H. A Dirichlettree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics. 2017; 73(3):792–801.
Tang Y, Ma L, Nicolae DL. A phylogenetic scan test on a Dirichlettree multinomial model for microbiome data. Ann Appl Stat. 2018; 12(1):1–26.
Connor RJ, Mosimann JE. Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc. 1969; 64(325):194–206.
Subramanian S, Huq S, Yatsunenko T, Haque R, Mahfuz M, Alam MA, et al.Persistent gut microbiota immaturity in malnourished Bangladeshi children. Nature. 2014; 510(7505):417.
Black RE, Victora CG, Walker SP, Bhutta ZA, Christian P, Onis MD, et al.Maternal and child undernutrition and overweight in lowincome and middleincome countries. Lancet. 2013; 382(9890):427–51.
WHO. Guideline: Updates on the management of severe acute malnutrition in infants and children: World Health Organization; 2013.
McMurdie PJ, Holmes S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE. 2013; 8(4):e61217.
Unifying the analysis of highthroughput sequencing datasets: characterizing RNAseq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome; 2(1):15.
Million M, Diallo A, Raoult D. Gut microbiota and malnutrition. Microb Pathog. 2017; 106:127–38.
Kumar S, Stecher G, Tamura K. MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016; 33(7):1870–4.
Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019; 47(W1):W256–9.
Linking LongTerm Dietary Patterns with Gut Microbial Enterotypes. Science; 334(6052):105–8. https://doi.org/10.1126/science.1208344.
Ottosson F, Brunkwall L, Ericson U, Nilsson PM, OrhoMelander M. Connection Between BMIRelated Plasma Metabolite Profile and Gut Microbiota. J Clin Endocrinol Metab. 2018; 103(4).
Finnicum CT, Doornweerd S, Dolan CV, Luningham JM, Beck JJ, Willemsen G, et al.Metataxonomic analysis of individuals at BMI extremes and monozygotic twins discordant for BMI. Twin Res Hum Genet. 2018; 21(3):203–13.
Tang ZZ, Chen G. Zeroinflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics. 2019; 20(4):698–713.
Walley P. Inferences from multinomial data: learning about a bag of marbles. J R Stat Soc Ser B. 1996; 58(1):3–57.
La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, Wang Q, et al.Hypothesis testing and power calculations for taxonomicbased human microbiome data. PLoS ONE. 2012; 7(12):e52078.
Minka T. The Dirichlettree distribution. Paper available online at: https://tminka.github.io/papers/dirichlet/minkadirtree.pdf. 1999.
Dennis IIISY. On the hyperDirichlet type 1 and hyperLiouville distributions. Commun Stat Theory Methods. 1991; 20(12):4069–81.
Aitchison J. The Statistical Analysis of Compositional Data. 1986.
Thorsen J, Brejnrod A, Mortensen M, Rasmussen MA, Stokholm J, AlSoud WA, et al.Largescale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome. 2016; 4(1):62.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014; 15:550.
Paulson JN, Pop M, Bravo HC. metagenomeSeq: Statistical analysis for sparse highthroughput sequncing. 2013. Bioconductor package.
Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995; 57(1):289–300.
Acknowledgements
Not applicable.
Funding
This research was supported in part by the National Natural Science Foundation of China (11601326, 11971017), National Key R&D Program of China (2018YFC0910500), Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), SJTU Transmed Awards Research Young Faculty Grant (YG2019QNA26, YG2019QNA37), and Neil Shen’s SJTU Medical Research Fund.
Author information
Authors and Affiliations
Contributions
TL conceived the ideas, developed the methodology, conducted the numerical studies and drafted the manuscript. HZ conceived the ideas, revised the manuscript and commented on various drafts of the manuscript. TW conceived the ideas, developed the methodology, supervised the manuscript writing and edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Figure S1. A phylogenetic tree representing the nondegenerate case. Figure S2. Comparison of recall and precision with data generated from the DTM model across different β: the nondegenerate case. Figure S3. Comparison of recall and precision with data generated from the DTM model across different θ: the nondegenerate case. Figure S4. A phylogenetic tree representing the degenerate case. Figure S5. Comparison of recall and precision with data generated from the DTM model across different β: the degenerate case. Figure S6. Comparison of recall and precision with data generated from the DTM model across different θ: the degenerate case. Figure S7. Comparison of recall and precision between eBay and eBaytree. Figure S8. Comparison of recall and precision between eBaytree and eBaytree (global). Figure S9. Timings (seconds) and space (log(bytes)), averaged over 10 runs with data generated from the DTM model with n_{1}=n_{2}=50, versus the number of taxa. Figure S10. The phylogenetic tree of 50 bacterial taxa inferred by maximum likelihood. Figure S11. Differentially abundant species detected by Wilcoxon rank sum test based on the tree in Figure S10. Figure S12. The phylogenetic tree of 50 bacterial taxa built based on distances. Figure S13. Differentially abundant species detected by ttest based on the tree in Figure S12. Figure S14. Differentially abundant species detected by Wilcoxon rank sum test based on the tree in Figure S12.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, T., Zhao, H. & Wang, T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 21, 225 (2020). https://doi.org/10.1186/s1285902003552z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902003552z
Keywords
 Bayesian shrinkage
 Differentially abundant OTUs
 MetagenomeSeq
 Phylogenyaware analysis
 Rarefying