A latent allocation model for the analysis of microbial composition and disease

Background Establishing the relationship between microbiota and specific diseases is important but requires appropriate statistical methodology. A specialized feature of microbiome count data is the presence of a large number of zeros, which makes it difficult to analyze in case-control studies. Most existing approaches either add a small number called a pseudo-count or use probability models such as the multinomial and Dirichlet-multinomial distributions to explain the excess zero counts, which may produce unnecessary biases and impose a correlation structure taht is unsuitable for microbiome data. Results The purpose of this article is to develop a new probabilistic model, called BERnoulli and MUltinomial Distribution-based latent Allocation (BERMUDA), to address these problems. BERMUDA enables us to describe the differences in bacteria composition and a certain disease among samples. We also provide a simple and efficient learning procedure for the proposed model using an annealing EM algorithm. Conclusion We illustrate the performance of the proposed method both through both the simulation and real data analysis. BERMUDA is implemented with R and is available from GitHub (https://github.com/abikoushi/Bermuda).


Background
Low-cost metagenomic and amplicon-based sequencing has provided a snapshots of microbial communities and their surrounding environments. One of the goals for case-control studies using microbiome data is to investigate whether cases differ from controls in term of the microbiome composition of a particular body ecosystems (e.g., the gut) and which taxa are responsible for any differences observed [1]. (Here, we use the generic term "taxa" to denote a particular phylogenetic classification.) These studies present microbiome data are represented as count data using operational taxonomic units (OTUs). The number of occurrences of each OTU is measured for each sample drawn from an ecosystem, and the resulting *Correspondence: shimamura@med.nagoya-u.ac.jp 4 Division of Systems Biology, Nagoya university Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, 4668550 Nagoya, Japan Full list of author information is available at the end of the article OTU counts are summarized for any level of the bacterial phylogeny, e.g., species, genes, family, order, etc. An important feature of these microbiome count data is that it is highly sparse-i.e., a very high proportion of the data entries are zero-which makes analyzing these data difficult.
A common strategy to handle these excessive zeros is to add a small number called a pseudo-count. For example, Xia et al. (2013) [2] applied a logistic normal model to their data, adjusted by a pseudo-count. Although adding a pseudo-count is a simple and widely used strategy, it can add an unnecessary bias to the data. Further, Weiss et al. (2017) [3] noted that there is no clear consensus on how to choose that value. Another common strategy to mitigate the effects of these excessive zeros is to use non-parametric statistical tests. Wagner et al. (2011) [4] described a test statistic that combines the proportion of zeros in the data with the statistics on values other than 0.
However, this statistical test can only be used for comparing two taxa. In addition, the test cannot evaluate the cooccurrence relationships between many taxa, or the effect of combination of taxa. Other strategies include modeling excess zeros using probability models [5,6]. Such an approach is called "zero-inflated" modeling, and directly models the probability of producing excessive zeros. However, zero-inflated models make an implicit assumption that microbial composition is identical among individuals. Thus, such models cannot capture the effects of individual differences in microbial composition.
Contributions This article proposes a new probabilistic model, called BERnoulli and MUltinomial Distributionbased latent Allocation (BERMUDA), to address these problems. The contributions of our work are summarized below: 1 BERMUDA is a generative statistical model that allows a set of taxa to be explained by unobserved groups and can be used to find the inherent relationship between taxa and a specific disease and to generate microbiome count data through the model. 2 In BERMUDA, the abundance of each taxon can be viewed as a mixture of various groups, which enables us to describe the differences in bacteria composition between samples. 3 We provide a simple and efficient learning procedure for the proposed model using an annealing EM algorithm that reduces the local maxima problem inherent to the traditional EM algorithm. The software package that implements the proposed method in the R environment is available from GitHub (https://github.com/abikoushi/Bermuda).
We describe our proposed model and algorithm in the "Methods" section. We also provide the efficiency of BERMUDA using synthetic and real data in "A Simulation Study" and "Results" sections, respectively.

Proposed model
Suppose that we observe a microbial count dataset with disease labels, {(w nk , y n ); n = 1, . . . , N, k = 1, . . . , K)}, where w nk is the abundance of the k-th taxon and y n is a binary outcome such that y n = 1 if the n-th sample has a certain disease and y n = 0 otherwise. Let w n be the kth row of matrix W = (w nk ) and M n = K k=1 w nk be the total reads count of the n-th sample.
We extract the associations between microbial composition and a specific disease by also supposing that there exist L latent clusters that vary with microbial composition and the disease risk. Let z n = (z n1 , . . . , z nL ) T be an indicator vector such that z nl = 1 if the n-th sample is in the l-th class and z nl = 0 otherwise. We then consider the following generative model: where ρ = (ρ 1 , . . . , ρ L ) T is the probability of developing a certain disease, is a vector of the hyperparameters of the Dirichlet prior distribution. Figure 1 displays the plate notation for the proposed model. The gray node represents an observed variable and the white node represents an unobserved variable; the latent variable z n affects both y n and w n . If the latent variable z n is given, the complete likelihood of this model is represented by the following formula: The posterior distribution is then proportional to: (3)

Parameter estimation
We find the maximum a posteriori probability (MAP) estimators, using an annealing EM (AEM) algorithm [7]. One advantage of using an AEM algorithm is that it reduces the local maxima problem from which the traditional EM algorithm suffers. In the E-step, using the inverse temperature 0 < β ≤ 1, we calculate To simplify the explanation, we set γ = α k −1. From the logarithm of (3), in the M-step, we update the parameters using: If γ = 0, MAP estimators are equivalent to maximum likelihood estimatos (MLEs).
Letφ,ρ andP be MAP estimators of φ, ρ and P. If given w n and the estimators, we can evaluate the probability that the n-th sample has the target disease. The conditional probability is given bỹ The advantage of using the Dirichlet prior distribution is that we can evaluate the abundance of the taxa whose abundance is exactly zero.
The n-th sample is then classified into the l-th class that maximizizes the conditional probability given bŷ In fitting the model, it is important to choose an appropriate number for L. In this article, we use cross-validation to choose L. From (8), we can evaluate the probability that the n-th sample has the target disease. We can then evaluate the log-loss function represented by: where J is an arbitrarily chosen subsample size for the validation data. We then select an L which minimizes (10) in this analysis.

A simulation study
In this section, we generated synthetic data and evaluated the performance of our method in order to gain insights into the accuracy of the parameters estimated using the proposed model. A simulation study was conducted as follows. An i.i.d. sample is generated by (1) where we set N = 700, M n = 10000, L = 7, γ = 10 −9 , φ = (1/7, . . . , 1/7) T , and ρ = (0, 3, 0.4, . . . , 0.9) T . P is chosen by a standard Dirichlet random number. We estimated the parameters from 10,000 replicates of the experiment. Table 1 shows the mean and standard error (se) of the estimates for ρ and φ using the proposed method. It can be observed that the estimates are unbiased to the order of 1/100. Figure 2 shows the relationship between estimates and true P in this simulation. In this figure, the points are arranged diagonally, implies that the estimator is unbiased. The overall accuracy of classification byẑ nl (9) is 0.87.

Parkinson's disease data
We first seek to identify the gut dysbiosis in relation to the development of Parkinson's disease (PD), which is thought to be associated with intestinal microbiota. We analyzed φ 0.14 0.14 0.14 0.14 0.14 0.14 0.14 Mean 0.14 0.14 0.14 0.14 0.14 0.14 0. We set γ = 10 −9 , which is equivalent to giving a weakly informative prior. The number of components L = 6 is selected using 10-fold cross-validation (Fig. 3). To ensure the stability, we iterated the cross validation 10,000  Figure 3 shows the log-loss functions for different numbers of the components L. Figure 4 presents the estimated appearance probabilities of the 20 genera. The clusters are sorted by estimated PD riskρ ( Table 2). As displayed Fig. 4, the distribution of Prevotella is quite distinctive, being concentrated in the low-risk cluster of PD. Faecalibacterium also tends to be higher in the low-risk cluster. In contrast, Akkermansia is concentrated in the high-risk cluster.

Zeller's colorectal carcinoma data
Next, we investigate the identification of gut dysbiosis associated with the development of colorectal cancer (CRC). Zeller et al. (2014) [12] studied gut metagenomes extracted from 157 persons, 91 of whom are CRC patients and 66 are controls. The data are available as an R package "curatedMetagenomicData" (https://github.com/ waldronlab/curatedMetagenomicData). In the analysis, we used the abundance of order-level taxa.
While training the model, we set γ = 10 −9 . The number of components L = 3 was selected using 10-fold cross-validation. To ensure the stability, we iterated the cross validation 10,000 times and used the mean of logloss functions. Figure 5 shows that log-loss functions for different numbers of the components, L. The clusters are sorted by the estimated CRC riskρ. Figure 6 presents the estimated appearance probabilities for each cluster. Previous studies showed that Fusobacterium flourishes in colorectal cancer cells [13]. Figure 6 shows that the abundance of Fusobacteriales is positively correlatedρ. We also observe bacteria, such as Bacteroides and Chlamydiales with monotonically increasing abundance. This result indicates that BERMUDA can be a valuable tool for discovering new disease-related bacteria.

Discussion
We evaluated the accuracy of parameter estimation using the simulated data. Table 1 and Fig. 2 shows that the proposed method can produce reasonable estimates and classify samples into true groups.
We also applied BERMUDA to the real metagenomic sequencing data in order to identify the associations between the gut microbiota and PD. We compared the results of BERMUDA with those of previous studies. Petrov et al. (2016) [14] reported that the gut microbiota of PD patients contained high levels of Christensenella, Catabacter, Lactobacillus, Oscillospira, and Bifidobacteriumm, and the control cluster was characterized by increased content of Dorea, Bacteroides, Prevotella, and Faecalibacterium. In family level analysis, Hill-Burns et al. (2017) [9] reported PD patients contained high levels of Bifidobacteriaceae, Lactobacillaceae,  Of the Verrucomicrobiaceae, it has been suggested that Akkarmansia may be related to PD. BERMUDA revealed Prevotella, Faecalibacterium, and Akkermansia associated with PD, which were commonly found in several studies. Thus, the analysis with real data demonstrates that the proposed method can identify the connection between the gut microbiota and PD, with the results are strongly supported by the previous PD research.  BERMUDA can take into account these differences and identify combinations of taxa rather than single taxa in the analysis of association with a specific disease risk. We demonstrated the applicability of BERMUDA to microbial analyses with simulation and real data. We expect that BERMUDA can be efficiently applied to studies that seek for an association between gut dysbiosis and a specific disease.