 Research
 Open Access
Bayesian compositional regression with microbiome features via variational inference
BMC Bioinformatics volume 24, Article number: 210 (2023)
Abstract
The microbiome plays a key role in the health of the human body. Interest often lies in finding features of the microbiome, alongside other covariates, which are associated with a phenotype of interest. One important property of microbiome data, which is often overlooked, is its compositionality: it can only provide information about the relative abundance of its constituent components. Typically, these proportions vary by several orders of magnitude in datasets of high dimension. To address these challenges we develop a Bayesian hierarchical linear log-contrast model which is estimated by mean field Monte Carlo coordinate ascent variational inference (CAVI-MC) and easily scales to high-dimensional data. We use novel priors which account for the large differences in scale and the constrained parameter space associated with the compositional covariates. A reversible jump Markov chain Monte Carlo, guided by the data through univariate approximations of the variational posterior probability of inclusion, with proposal parameters informed by approximating variational densities via auxiliary parameters, is used to estimate intractable marginal expectations. We demonstrate that our proposed Bayesian method performs favourably against existing frequentist state-of-the-art compositional data analysis methods. We then apply CAVI-MC to the analysis of real data, exploring the relationship of the gut microbiome to body mass index.
Introduction
The human microbiome is the combined genome of the microorganisms that live in the human body. It has been estimated that these microbes make up to 10 trillion cells, equivalent to the number of human cells [1]. Advances in genome sequencing technologies have enabled scientists to study these microbes and their function and to research microbiome-host interactions both in health and disease. The decreasing cost and increasing accessibility of nucleotide sequencing means it is the primary tool used to study the microbiome [2]. Any microbiome dataset is compositional [3]: the magnitude of a single operational taxonomic unit (OTU) depends on the sum of all OTU counts, so the data only provide information about the relative magnitudes of the compositional components. This means that standard methods of analysis such as linear regression are not applicable to microbiome data [4], unless a transformation is performed.
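The compositional nature of sequencing data can be illustrated with a toy example (our own, not from the paper): rescaling the read depth leaves the closed composition unchanged, so only relative abundances are observed.

```python
import numpy as np

def close(counts):
    """Closure operation: rescale a count vector onto the unit simplex."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Two sequencing runs of the same sample at different total depths give
# identical compositions: absolute counts carry no extra information.
run_shallow = np.array([100, 300, 600])
run_deep = run_shallow * 15        # same sample, 15x the sequencing depth
```

Because `close(run_shallow)` and `close(run_deep)` coincide, any analysis of such data can only be informed by ratios of components, which is what motivates the log-ratio transformations discussed below.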
The large dimensions of these datasets often present a problem in variable selection, where the number of covariates p exceeds the number of observations n (\(p \gg n\)) and the space of possible combinations of significant variables is large, imposing a high computational burden. Sparse variable selection over the p covariates is expected, where just a few microbes are associated with the response. Bayesian variable selection approaches have the advantage of being able to include prior knowledge and simultaneously incorporate many sources of variation. Shrinkage priors encourage the majority of regression coefficients to be shrunk to very small values, identifying associations when an estimator is applied [5]. Alternatively, introducing latent variables produces posterior distributions of model inclusion and parameter values, which enable model choice and a probabilistic understanding of the strength and nature of the association [6]. The different approaches within explicit variable selection are characterised by the location of the latent variable and its relationship with the covariates ([7,8,9]).
To model compositional data, a transformation is required to transfer the compositional vectors into Euclidean space. Various log-ratio transformations have been proposed, including the additive log-ratio (alr), the centred log-ratio (clr) [10] and, more recently, the isometric log-ratio (ilr) [11]. The ilr transformation defines balances, proportional to the log difference between two groups, which are scale invariant. In ilr linear regression models, only the first parameter can be interpreted. Thus, the only way to interpret the role of d compositional parts in explaining the response is to consider d different regression models [12].
In the context of regression, the reparameterised alr transformation (or log-contrast model), which removes the requirement for a reference category and results in a sum-to-zero constraint on the associated parameter space within the linear model, has proved useful in allowing direct inference between selected covariates and the compositional data set [13]. [14] propose an adaptive \(l_1\)-regularised regression, the log-contrast lasso. This has been extended to multiple linear constraints for subcompositional coherence across predefined groups of predictors [15]. To obtain a sparser model, [16] introduce an additional feature selection step on those variables identified in a two-step log-ratio lasso. A general approach to convex optimisation, where the model has been extended to the high-dimensional setting via regularisation, has recently been proposed by [17]. In the Bayesian framework, [18] introduce a generalised transformation matrix on the parameters, rather than the covariates, as a function of a tuning parameter c, similar to the generalised lasso. This ensures parameter estimates remain in the p space and, as c reaches infinity, the sum-to-zero constraint is imposed. By incorporating the matrix into a conjugate prior, and avoiding any singular distributions by not strictly imposing the zero-sum constraint, a Gibbs sampler for the marginal posterior of the selection parameter can be derived. Alternative Bayesian approaches treat the microbiome predictors as random, parameterised by a multivariate count model. [19] combine this with the ilr transformation in a predictive model which identifies correlations across the microbiome. [20] cluster on a categorical covariate via a Gaussian mixture model in an ANOVA-type model, but both approaches do not allow direct inference between the compositional predictors and the response.
The abundances of features in microbiome data often differ by orders of magnitude. As far as we know, this has not been explicitly accounted for in the current literature. In the Bayesian lasso [5], separate scale parameters can have a hierarchical prior placed on them, rather than this component being marginalised over, which results in the Laplace prior. In the regularisation case, the choice of hyperprior defines how the parameters are shrunk to zero. This model is easily extended to the adaptive lasso [21] by positing independent exponential priors on each scale parameter, and then augmenting each tuning parameter with additional hyperpriors.
Typically, model selection is performed using Markov chain Monte Carlo (MCMC) methods. Various stochastic search based methods have been used to explore the model space in a computationally efficient manner ([9, 22, 23]). Despite this body of work, MCMC can still be considered too slow in practice for sufficiently large scale problems. Variational inference is an alternative technique which uses optimisation to achieve computational savings by approximating the marginal posterior densities. Its success in machine learning problems has led to concerted efforts in the literature to encourage its use by statisticians ([24, 25]). The speed of variational inference gives it an advantage, particularly for exploratory regression, where a very large model is fitted to gain an understanding of the data and identify a subset of the microbiome which can be explored in more detail.
Approximate solutions arise in variational inference by restricting the family of densities which can be used as a proxy for the exact conditional density. Typically, the mean field variational family is used, where independence is assumed across the factors. Thus, by specifying conjugate priors, approximate marginal posteriors are members of the exponential family [26]. However, many models of interest, such as logistic regression and non-conjugate topic models, do not enjoy the properties required to exploit this algorithm. Using variational inference in these settings requires algorithms to be adjusted for the specific model requirements. A variety of strategies have been explored, including alternative bounds ([27, 28]), numerical quadrature [29] and Monte Carlo approximation [30].
We propose a Bayesian hierarchical linear log-contrast model for compositional data which is estimated by mean field Monte Carlo coordinate ascent variational inference. We use the alr transformation within a log-contrast model, which removes the need to specify a reference category. Sparse variable selection is performed through novel priors within a hierarchical prior framework which account for the constrained parameter space associated with the compositional covariates and the different orders of magnitude in the taxon abundances. As our constrained priors are not conjugate, Monte Carlo expectations are used to approximate intractable integrals. These expectations are obtained via a reversible jump Markov chain Monte Carlo (RJMCMC) [31] sampler, which is guided by the data through univariate approximations of the intractable variational posterior probability of inclusion. We exploit the nested nature of variational inference by proposing parameters from approximated variational densities via auxiliary parameters. Model averaging over all the explored models can be performed, and estimates of the regression coefficients, shrunk by the model uncertainty, are available. The approach accommodates high-dimensional microbial data and offers the potential to be scaled up to models with multiple responses.
We compare the performance of the proposed modelling approach with the lasso, the log-contrast lasso [14], the two-stage log-ratio lasso [16] and selbal [32] on simulated data. Our method is then applied to a subset of the “Know Your Heart” cross-sectional study of cardiovascular disease [33] in order to examine the association of the gut microbiome with body mass index (BMI). The study was conducted in two Russian cities, Novosibirsk and Arkhangelsk, enrolling 4542 men and women aged between 35 and 69 years recruited from the general population. A health check questionnaire was completed, providing information on smoking, weight and levels of alcohol consumption. We analyse the microbiome of 515 subjects from the Arkhangelsk region at the phylum and genus level, as the 16S rRNA sequencing of faecal samples was only performed for these participants, alongside age and health covariates.
Methods
Microbiome model
The microbiome data begins as raw counts for each taxon. Any zeros are replaced by a small pseudocount (typically 0.5), before each row is standardised to sum to 1. The sample space of a vector of components is a simplex for each data point, where the rows of each vector make up the design matrix \({\varvec{Q}}_{n \times d}\). The set of compositional explanatory variables can be transformed onto the unconstrained sample space \({\mathbb {R}}^{d-1}\) using the alr transformation
where \({\varvec{q}}_i\) is the ith row of \({\varvec{Q}}\) and the ratios have been arbitrarily chosen to involve the division of each of the first \(d-1\) components by the final component. The log linear model, with the alr-transformed variables as proposed by [13], can be expressed as
where \(\tilde{{\varvec{\theta }}} = (\theta _1,...,\theta _{d-1})^T\) is the corresponding \((d-1)\)-vector of regression coefficients and \(\epsilon _i\) is independent noise distributed as \(N(0, \sigma ^2)\). Although convenient, the interpretation of the model depends on the arbitrary choice of the reference category. If we expand the dot product \(alr({\varvec{q}}_i) \cdot \tilde{{\varvec{\theta }}}\) and set
the log contrast model can be conveniently expressed in matrix form [14] as
where \({\varvec{Z}}=(\log {\varvec{q}}_1,..., \log {\varvec{q}}_d)\) is the \(n \times d\) compositional design matrix and \({\varvec{\theta }} = (\theta _1,...,\theta _d)^T\) is a d-vector of regression coefficients constrained to the affine hyperplane defined by the sum-to-zero constraint.
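The preprocessing and the equivalence between the alr regression and the sum-to-zero log-contrast parametrisation can be sketched as follows (a minimal NumPy sketch; variable names are ours, not the paper's):

```python
import numpy as np

def alr(counts, pseudocount=0.5):
    """alr-transform an n x d count matrix: replace zeros by a small
    pseudocount, close each row to sum to 1, then take log-ratios of the
    first d-1 components against the final (reference) component."""
    counts = np.where(counts == 0, pseudocount, counts).astype(float)
    q = counts / counts.sum(axis=1, keepdims=True)
    return np.log(q[:, :-1] / q[:, -1:])          # n x (d-1)

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.dirichlet(np.ones(d), size=n)             # rows already on the simplex
theta_tilde = rng.normal(size=d - 1)              # alr regression coefficients

# Reparametrised log-contrast coefficients: append theta_d = -sum(theta_tilde)
theta = np.append(theta_tilde, -theta_tilde.sum())
Z = np.log(q)                                     # n x d log design matrix

alr_fit = np.log(q[:, :-1] / q[:, -1:]) @ theta_tilde
logcontrast_fit = Z @ theta                       # identical linear predictor
```

The two linear predictors agree because \(\sum_{j<d}\tilde{\theta}_j(\log q_{ij} - \log q_{id}) = \sum_{j\le d}\theta_j \log q_{ij}\) once \(\theta_d\) absorbs the negative sum, which is why the log-contrast form needs no reference category.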
This likelihood is used by [18] who specify a d-dimensional multivariate normal distribution on \(\theta\) within a “spike-and-slab” prior,
where \({\mathbb {J}}_d\) is a matrix of ones and \(\textbf{V}\) is the generalised transformation matrix which incorporates the tuning parameter c to constrain the \({\varvec{\theta }}\) parameter space and takes the form in (5) for the alr transformation. This approach allows the probability distribution to remain in the d-dimensional space as \(\textbf{V}\) is a matrix of full rank, facilitating conjugate updates, as the sum-to-zero constraint is not imposed exactly.
Interest often lies in assessing the association with the response of unconstrained data, in the form of categorical or continuous covariates, alongside the microbiome. Two additional design matrices are added to the likelihood: \({\varvec{X}}\), which comprises the scaled continuous covariates, and \({\varvec{W}}\), which contains the dummy variables for the \(g = 1,...,G\) categorical variables, coded to indicate the \(m_g\) levels with respect to the intercept. The likelihood for our model is thus expressed as
Compositional priors
The linear constraint on the unconstrained vector can be expressed in matrix form as
where \(\textbf{T}\) is an idempotent matrix of rank \(d-1\). If we originally parametrise \(\theta _j \sim N(\mu _j, \psi _j)\), where the large differences in the order of magnitude of each row of the \({\varvec{Z}}\) design matrix are accounted for by allowing each parameter \(\theta _j\) to have a separate variance parameter \(\psi _j\), then the constrained random variables associated with the compositional explanatory variables are from a singular multivariate normal distribution
with \({\varvec{\psi }}\) a vector of scale parameters. This prior respects the sum-to-zero constraint imposed by the reparametrisation of the likelihood in (6). The distribution is degenerate, as the transformation matrix \(\textbf{T}\) means the covariance matrix is singular, and it assigns probability 0 to any full-dimensional set in the d-dimensional space. [18] treat the constraint as a tuning parameter, restricting the values that \({\varvec{\theta }}\) can take whilst still remaining in the d-dimensional space, so that the marginal posterior can be obtained in closed form. Our approach imposes the constraint exactly. The singular multivariate normal prior for the compositional data can be considered to be at the unobtainable limit of c in the alr transformation approach (5), when the tuning parameter creates a singular matrix and the standard normal prior is no longer appropriate.
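To make the degenerate prior concrete, the sketch below assumes \(\textbf{T}\) is the centring matrix \(\textbf{I}_d - {\mathbb {J}}_d/d\) (a natural choice of idempotent rank-\(d-1\) constraint matrix; the paper's exact form is in its equations). Draws from the singular multivariate normal then satisfy the sum-to-zero constraint exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
T = np.eye(d) - np.ones((d, d)) / d    # idempotent centring matrix, rank d-1
psi = rng.gamma(2.0, 1.0, size=d)      # separate scale parameter per taxon

# theta = T u with u_j ~ N(0, psi_j) has covariance T D(psi) T, a singular
# matrix, so every draw lies on the sum-to-zero hyperplane.
u = rng.normal(0.0, np.sqrt(psi), size=(1000, d))
theta = u @ T                           # T is symmetric, so u @ T = (T u')'
```

The rank deficiency of the covariance \(\textbf{T} D({\varvec{\psi }})\textbf{T}\) is exactly what makes the ordinary determinant and inverse unusable, motivating the pseudo-inverse and pseudo-determinant below.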
We augment the prior on \({\varvec{\theta }}\) with dependent latent indicator variables from a product of Bernoulli distributions which have been truncated, to account for the alr transformation, so that a single taxon cannot be selected into the model
where \(\textrm{I}\) is the indicator function. This truncation is particularly important in the presence of sparsity. The full singular multivariate normal spike-and-slab prior for \(p({\varvec{\theta }}\mid {\varvec{\xi }}) = p({\varvec{\theta }}_{\xi }\mid {\varvec{\xi }})p({\varvec{\theta }}_{{\bar{\xi }}}\mid {\varvec{\xi }})\), where \({\varvec{\theta }}_\xi\) and \({\varvec{\theta }}_{{\bar{\xi }}}\) are subvectors of \({\varvec{\theta }}\) such that
\(\Sigma ^+_\xi\) denotes the Moore-Penrose pseudo-inverse of the matrix \(\textbf{T}_\xi D({\varvec{\psi }}_\xi )\textbf{T}_\xi\), defined by \(A^+= VS^+U^T\) if \(A=USV^T\) is the singular value decomposition of A, where \(S^+\) is the diagonal matrix with entries \(S^+_{ii}=1/S_{ii}\) for the nonzero diagonal entries of S and zero otherwise. The pseudo-determinant \(\text {det}^*\) is defined as the product of the nonzero eigenvalues of the matrix, and \({\varvec{\xi }}\) is a vector of zeros and ones. The \({\varvec{\theta }}_\xi\) parameters are dependent (the covariance for unit scale is equal to \(-1/d_\xi\), and for the case of \(d_\xi =2\) the correlation is \(-1\)). This prior implies a univariate spike-and-slab prior on the diagonal of the covariance matrix in (10),
A beta distribution is placed on the sparsity parameter \(\kappa\) and the hyperparameter \(b_\psi\) is given a gamma prior. This approach can be interpreted as replacing the continuous mixing density in the Bayesian lasso, which can have either hierarchical structure [21] or be marginalised over [5], with a discrete mixture. This set of explicit variable selection priors on the compositional data ensures that the marginal posterior of variable \(\xi _j\) represents the inclusion of the jth taxon in the model.
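The Moore-Penrose pseudo-inverse and pseudo-determinant used above can be computed from the eigendecomposition of the symmetric covariance (a sketch; `tol` is an arbitrary numerical zero threshold, and the centring-matrix \(\textbf{T}\) is our illustrative choice of constraint matrix):

```python
import numpy as np

def pinv_and_pseudo_det(A, tol=1e-10):
    """Moore-Penrose pseudo-inverse and pseudo-determinant of a symmetric
    positive semi-definite matrix via its eigendecomposition (equivalent
    to the SVD for such matrices)."""
    w, V = np.linalg.eigh(A)
    keep = w > tol                                   # drop the null space
    A_plus = (V[:, keep] / w[keep]) @ V[:, keep].T   # V diag(1/w) V^T
    det_star = np.prod(w[keep])                      # product of nonzero eigenvalues
    return A_plus, det_star

d = 4
T = np.eye(d) - np.ones((d, d)) / d                  # assumed constraint matrix
psi = np.array([0.5, 1.0, 2.0, 4.0])
Sigma = T @ np.diag(psi) @ T                         # singular covariance, rank d-1
Sigma_plus, det_star = pinv_and_pseudo_det(Sigma)
```

Working on the nonzero eigenvalues only is what lets the density of the degenerate prior be evaluated on its supporting hyperplane.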
Priors
The choice of the remaining prior distributions is partly a matter of convenience. The prior distributions and likelihood are semi-conjugate pairs, which means the optimal form for the mean field variational density is in the same exponential family form.
We employ a variable selection spike-and-slab prior [34] for \(\beta _s\) associated with the continuous variables in the design matrix \({\varvec{X}}\), where the parameters are independent across s. The spike is a point mass at 0 (Dirac distribution) with probability \(1-p(\gamma _s=1) = 1-\omega\) and the slab is a zero-centred Gaussian with variance w, which requires the variables to be standardised. The binary latent indicator variable \(\gamma _s\) represents the inclusion of the sth covariate in the model.
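Sampling from this spike-and-slab prior is straightforward; the sketch below uses illustrative values for \(\omega\) and w (ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
omega, w_var = 0.2, 1.0      # illustrative inclusion probability and slab variance
S = 10_000                   # prior draws for a single coefficient beta_s

gamma_s = rng.binomial(1, omega, size=S)    # latent inclusion indicators
beta_s = np.where(gamma_s == 1,
                  rng.normal(0.0, np.sqrt(w_var), size=S),  # slab draw
                  0.0)                                       # spike: exact zero
```

A fraction \(1-\omega\) of draws are exactly zero, which is the sense in which the prior performs explicit variable selection rather than mere shrinkage.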
In the case of the categorical data matrix, we are interested in selecting into the model the group of variables associated with the response, rather than a particular level. Each factor variable (or group) \(g = 1,..,G\) has \(m_g+1\) levels, which are coded as \(m_g\) dummy variables in \({\varvec{W}}\) with reference to the intercept. Motivated by the Bayesian group lasso [35], which introduces binary indicators to perform selection both between and within the groups' levels, we employ a variable selection spike-and-slab prior on the vector \({\varvec{\zeta }}_g\) with dimension \(m_g\). The spike is a point mass at 0 (Dirac distribution) with probability \(1-p(\chi _g=1) = 1-\varrho\) and the slab is a zero-centred Gaussian with variance v. The binary latent indicator variable \(\chi _g\) represents the inclusion of the gth categorical variable in the model. In the case where a factor has just 2 levels, the prior reduces to the same form as its unrestricted continuous counterpart, with a different scale parameter.
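The group version differs only in that a single indicator \(\chi_g\) switches the whole coefficient vector \({\varvec{\zeta }}_g\) in or out; an illustrative sketch with assumed hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
varrho, v_var, m_g = 0.3, 1.0, 3    # illustrative group inclusion probability,
S = 10_000                          # slab variance and number of dummy levels

chi_g = rng.binomial(1, varrho, size=S)           # one indicator per group
slab = rng.normal(0.0, np.sqrt(v_var), size=(S, m_g))
zeta_g = chi_g[:, None] * slab                    # all m_g levels enter or
                                                  # leave the model together
```

Tying all \(m_g\) dummy coefficients to one indicator is what makes the selection act on the categorical covariate as a whole rather than on individual levels.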
Hierarchical priors are also included to fully incorporate the uncertainty surrounding these parameters. The probability that a given covariate in the design matrices of \({\varvec{X}}\) and \({\varvec{W}}\) affects the response is modelled by the parameters \(\omega\) and \(\varrho\), with beta priors. Inverse gamma distributions with gamma (shape and scale) hyperpriors on their respective scales are placed on the prior variance parameters w and v.
Variational inference
We employ coordinate ascent variational inference (CAVI) [36] as our estimation procedure, rather than relying entirely on MCMC, which often requires substantial computing resources when the dimensionality of the problem is large. We use a structured mean field variational family, where dependencies between parameters are explicitly incorporated within blocks and independence is retained across the blocks ([37,38,39,40]). Each latent variable is still governed by a distinct factor in the variational density. An example of an approximating posterior block, which captures the natural dependency between the latent indicator variable \(\gamma _j\) and the corresponding regression coefficient \(\beta _j\) directly associated with the design matrix \({\varvec{X}}\), is
This leads to a natural type of approximation for hierarchical Bayesian models, where the hierarchical structure of the prior often suggests a good hierarchical structure for the posterior approximation. The full structured mean field approximation distribution \(q({\varvec{\vartheta }})\), where \({\varvec{\vartheta }}\) represents all of the latent variables in the model, is defined in the Additional file 1: Sect. 1. The full DAG of the Monte Carlo coordinate ascent variational inference (CAVI-MC) model is in Additional file 1: Fig. S2.
Unconstrained updates
The variational inference updates are available analytically for all unconstrained parameters and hyperparameters in the model. Derivations are given in the Additional file 1: Sect. 1. The updates involve a combination of univariate and multivariate calculations. The regression parameters directly associated with the \({\varvec{X}}\) and \({\varvec{W}}\) design matrices have joint updates in the same spikeandslab form as their priors. The conjugate update for \(q(\beta _{s}, \gamma _{s})\) is
with free parameters
and
where \((\cdot )^{(1)}\) denotes the q expectation. The conjugate update for \(q({\varvec{\zeta }}_g, \chi _g)\) is
where the free parameters for \({\varvec{\zeta }}_g\) are updated by the multivariate extension of the previous univariate update,
The marginal expectation of \({\varvec{\zeta }}_{g}\) and \(\beta _s\) is the mean of the conditional density when the parameter is included in the model, shrunk by the probability of being included in the model. The nested q density update for each free parameter is the expectation of the log joint distribution with respect to all the other factors. Thus, any update involving a marginal expectation from a parameter with a spike-and-slab prior involves a form of regularisation.
The selection of the spike-and-slab priors for \(\beta _s\), \({\varvec{\zeta }}_g\) and \({\varvec{\theta }}\), with sparsity-inducing hyperparameters for variable selection, shrinks the parameter estimates in the variational updates rather than performing explicit variable selection as in MCMC. These estimates are a useful proxy for the final model effects but, as opposed to a model with regularisation priors, the expectation of the model indicator parameters gives us the probability of a covariate being associated with the response. In the case of \({\varvec{\zeta }}_g\), which is associated with the gth categorical covariate, the parameterisation has a convenient interpretation. Each element in the vector is free to vary, but all elements are shrunk by the same value. Thus the expectation \((\chi _{g})^{(1)}\) is the probability of the categorical covariate (rather than the individual levels) being included in the model.
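The generic shape of such a conjugate spike-and-slab coordinate update can be sketched for a single continuous covariate (the paper's exact free-parameter expressions are derived in its Additional file 1; `resid`, `tau`, `w_var` and `omega` stand for the current q expectations, and this is the standard conjugate form under those priors rather than a verbatim transcription):

```python
import numpy as np

def update_beta_spike_slab(x_s, resid, tau, w_var, omega):
    """One coordinate-ascent update for a spike-and-slab block
    q(beta_s, gamma_s) in a linear model.

    x_s   : column of X for covariate s
    resid : y minus the expected contribution of all other covariates
    tau   : q expectation of the noise precision 1/sigma^2
    w_var : slab prior variance w
    omega : q expectation of the prior inclusion probability
    """
    var_star = 1.0 / (tau * x_s @ x_s + 1.0 / w_var)   # slab variance
    mu_star = var_star * tau * (x_s @ resid)           # slab mean
    logit = (np.log(omega / (1.0 - omega))             # posterior inclusion
             + 0.5 * np.log(var_star / w_var)          # probability on the
             + 0.5 * mu_star**2 / var_star)            # logit scale
    p_star = 1.0 / (1.0 + np.exp(-logit))
    return mu_star, var_star, p_star, p_star * mu_star  # last: shrunken E_q[beta_s]
```

The final returned value is the slab mean shrunk by the inclusion probability, the regularised marginal expectation described above.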
CAVI-MC
The conditional vector update \(q({\varvec{\theta }}\mid {\varvec{\psi }},{\varvec{\xi }})\) is available analytically and takes the form
where \(\delta _0\) is the Dirac distribution on the subvector \({\varvec{\theta }}_{{\bar{\xi }}}\) with updates
The truncated Bernoulli prior distributions for \({\varvec{\xi }}\) and the unique scale parameter \(\psi _j\) for each element in \({\varvec{\theta }}\) prevent a conjugate posterior update for the joint block \(q({\varvec{\theta }},{\varvec{\psi }},{\varvec{\xi }})\). All other updates are available analytically.
The difficult-to-compute joint \(q({\varvec{\theta }}, {\varvec{\psi }}, {\varvec{\xi }})\) update is performed by inserting a Monte Carlo step within the mean field variational inference approach. We take advantage of the structure of the target density \(p({\varvec{\vartheta }},\textbf{y}) \equiv f({\varvec{\vartheta }})\) (the data \(\textbf{y}\) is omitted for notational purposes as it is fixed), which has the form
for rdimensional constant vector \({\varvec{\eta }}\), vector function \(T({\varvec{\vartheta }})\) and relevant scalar functions \(h > 0\). In our case this admits the factorisation for all \(j \notin {\mathcal {J}}\)
where \({\mathcal {J}}\) is the set of all analytically available updates. This allows us to avoid generating and storing the samples from the approximating densities, which would involve considerable computational cost, by using the q marginal expectations in the Monte Carlo estimate for \(q({\varvec{\theta }}\mid {\varvec{\psi }},{\varvec{\xi }})\). [30] show that, under regularity conditions, a CAVI-MC recursion will get arbitrarily close to a maximiser of the evidence lower bound with any given high probability.
The MCMC approach involves two move types: within-model moves, where the samples are generated from a Metropolis-Hastings sampler, and between-model moves, which are sampled from an RJMCMC. The samplers involve using some form of the joint approximating posterior \(q({\varvec{\theta }},{\varvec{\psi }}, {\varvec{\xi }}\mid {\varvec{y}}) \propto q({\varvec{\theta }}\mid {\varvec{\psi }}, {\varvec{\xi }},{\varvec{y}})q({\varvec{\xi }},{\varvec{\psi }}\mid {\varvec{y}})\), which is simplified as \(q({\varvec{\theta }}\mid {\varvec{\psi }}, {\varvec{\xi }},{\varvec{y}})\) has the conjugate spike-and-slab form (14).
We randomly choose either a between-model move, which consists of sequentially updating \({\varvec{\xi }}\), \({\varvec{\psi }}\mid {\varvec{\xi }}\) and \({\varvec{\theta }}\mid {\varvec{\psi }}, {\varvec{\xi }}\), or a within-model move, where \({\varvec{\xi }}\) is not updated. This naturally leads to questions regarding the proposals for \({\varvec{\psi }}\), which has a constrained support, and \({\varvec{\xi }}\), which has the potential to be a very large binary space.
Between-model RJMCMC: approximating \(q({\varvec{\xi }}, {\varvec{\psi }}\mid {\varvec{y}})\) to \(p({\varvec{\xi }}\mid {\varvec{\vartheta }})\) for the proposal distribution \(j_m({\varvec{\xi }},{\varvec{\xi }}')\)
The choice of priors for the parameters associated with the microbiome features, the indicator vector \({\varvec{\xi }}\) and the set of scale parameters \({\varvec{\psi }}_\xi\), prevents a conjugate update for \(q({\varvec{\theta }}, {\varvec{\psi }}, {\varvec{\xi }})\). An MCMC step is introduced to sample from the intractable q approximating posterior. To search the binary space we use an RJMCMC where the proposal for \(\psi _j\), conditional on \(\xi _j = 1\), is from the q approximating density of the auxiliary parameter \(\Delta _j\)
where the calculation of the free parameters \(a_{\Delta _j}^*\) and \(b_{\Delta _j}^*\) is explained in the next section. \({\varvec{\theta }}\) is generated directly from the singular multivariate normal target distribution (14).
There is considerable research on sampling high-dimensional binary vectors. [41] propose a general model for the proposal which combines local moves with global ones by changing blocks of variables. They find that the acceptance rates for Metropolis-Hastings samplers that include, exclude or swap a single variable improve. [22] extend their model with adaptive parameters which change during the mixing of the MCMC. Motivated by incorporating information from the data into the proposal parameters, we use the variational inference posterior distribution \(q({\varvec{\xi }},{\varvec{\psi }}\mid \textbf{y})\), which is only available up to a constant of proportionality,
to obtain a univariate approximation relative to the jth element to guide the RJMCMC. These normalised probabilities are used to obtain our proposal probabilities in a birth-death and swap sampling scheme. Similar to adaptive parameters in MCMC, these selection probabilities are updated at each iteration of the CAVI.
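A sketch of how such normalised inclusion probabilities can drive the birth-death and swap proposals (our own skeleton, with the Metropolis-Hastings acceptance step omitted; only the truncation logic and probability-weighted choices follow the scheme described above):

```python
import numpy as np

rng = np.random.default_rng(3)

def propose_xi(xi, incl_prob, rng):
    """One between-model proposal for the binary taxon-inclusion vector.

    incl_prob holds univariate approximations of the variational posterior
    inclusion probabilities: a birth adds an inactive taxon drawn in
    proportion to incl_prob, a death removes an active taxon in proportion
    to 1 - incl_prob, and a swap does both.  Proposals leaving exactly one
    active taxon are rejected (returned as None), respecting the truncation
    that forbids selecting a single taxon under the sum-to-zero constraint.
    """
    xi = xi.copy()
    move = rng.choice(["birth", "death", "swap"])
    off, on = np.flatnonzero(xi == 0), np.flatnonzero(xi == 1)
    if move in ("birth", "swap") and off.size > 0:
        p = incl_prob[off] / incl_prob[off].sum()
        xi[rng.choice(off, p=p)] = 1
    if move in ("death", "swap") and on.size > 0:
        p = (1.0 - incl_prob[on]) / (1.0 - incl_prob[on]).sum()
        xi[rng.choice(on, p=p)] = 0
    return None if xi.sum() == 1 else xi

xi0 = np.zeros(6, dtype=int)
xi0[:2] = 1                                  # two taxa currently in the model
probs = np.array([0.9, 0.8, 0.1, 0.2, 0.3, 0.05])
proposals = [propose_xi(xi0, probs, rng) for _ in range(200)]
```

Weighting births towards taxa with high approximate inclusion probability, and deaths towards those with low probability, is what lets the data guide the exploration of the binary model space.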
The pseudo-determinant in (19) is approximated by removing the constraints \(\textbf{T}_{\xi }\) and taking the MCMC expectation conditional on \(\xi _{j} = 1\). So for the jth element the approximation is
where the curly brackets \(\{\}\) denote an MCMC expectation and \({0}\!\!\!/\) defines an expectation over all nonzero values. A similar approach can be used to approximate the determinant containing \(\Sigma _{\theta _\xi }\)
where \({\bar{\sigma }}_{\theta_j}^2\) is the nonzero variance average over the MCMC iterations, obtained by extracting the diagonal from \(\Sigma _{\theta _{(\xi ,\psi )}}\) at each iteration. If the jth term has not been included in the model the term is approximated by
After approximating \(\Sigma _{\theta _{\xi }}\) by a scalar for each jth element, the matrix dot product reduces to
To account for the cross-product terms, which contain the elements of \({\varvec{\xi }}\) not equal to j and the associated \({\varvec{\mu }}_{\theta }\) terms, a combination of conditional expectations and marginal expectations, which shrink the values in proportion to the probability of being zero, is used. As \(\xi _{j}\) cannot be separated from the sum in the numerator \(d_{\xi }\), two approximations of the matrix dot product are used, conditional on the expectation from the previous chain.
Defining the expectations with respect to the parameter currently being updated from the previous MCMC by a curly bracket as:

- \(\{\mu _{\theta _{j}}\}^{\{1\}}_{{0}\!\!\!/}\): conditional expectation given \(\xi _{j}=1\), the weighted average of the nonzero terms from the previous chain,
- \(\{\mu _{\theta _{j}}\}^{\{1\}}\): expectation with respect to q from the previous chain,
- \(\{d_{\xi }\}^{\{1\}}\): expectation with respect to q from the previous chain,
the approximation of the dot product \((\textbf{T}_{\xi } {\varvec{\mu }}_{\theta _{\xi }})^T \textbf{T}_{\xi } {\varvec{\mu }}_{\theta _{\xi }}\) for \(\{d_{\xi }\}^{\{1\}} >2\) is thus
and
Although \(\{ d_{\xi } \in {\mathbb {N}}_0 \mid d_{\xi } \le d, d_{\xi } \ne 1 \}\), the support of the MCMC expectation \(\{d_{\xi }\}^{\{1\}}\) is the positive real line, so we threshold at 2. When \(\{d_{\xi }\}^{\{1\}}>2\), the probabilities used in the proposal distribution for the RJMCMC, derived by approximating Equation (19) and normalising, are
which contains the variational expectations and an MCMC conditional expectation from the previous iterations. This is then used to propose the various move types in the RJMCMC.
Pseudo updates for MCMC proposals
A conjugate update for the parameters associated with the microbiome features, \(q({\varvec{\theta }}, {\varvec{\psi }}, {\varvec{\xi }})\), is prevented by the choice of priors for the indicator vector \({\varvec{\xi }}\) and the set of scale parameters \({\varvec{\psi }}_\xi\). Samples from the intractable q approximating posterior are instead simulated via an MCMC step. The move types in the RJMCMC for \({\varvec{\xi }}\) use an element-wise approximation of the joint \(q({\varvec{\xi }})\) density (23). For the proposal distribution of \({\varvec{\psi }}\), we use the model likelihood and an unconstrained approximation to the constrained priors. In order to do this we define auxiliary parameters (upper-case Greek letters) which are unconstrained versions of the constrained parameters. We derive pseudo variational updates from an unconstrained model with a simpler prior parametrisation, then use the q approximating distribution of the relevant auxiliary parameter as our proposal for \({\varvec{\psi }}\). We can think of the auxiliary parameters as introducing an alternative directed acyclic graph (DAG) which is updated first, helping us to approximate the model in order to guide the MCMC step (depicted in the Additional file 1: Fig. S1). These updates are refined by the full variational inference updates, which account for the constraint at each iteration. The parameter \(\kappa\) and the hyperparameters \(a_{\Delta }, b_{\Delta }\), which are set to \(a_{\psi }, b_{\psi }\), provide a link back to the constrained model.
The series of pseudo variational updates are determined from a simple prior parametrisation where the parameters associated with the compositional covariates are not constrained to sum to 0. This unconstrained model has the following prior parametrisation
where \({\varvec{\Omega }}\) are the unconstrained versions of the \({\varvec{\theta }}\) parameters and \({\varvec{\Delta }}\) are the variance parameters for \({\varvec{\Omega }}\), both dependent on the model selection parameters \({\varvec{\Upsilon }}\). The prior for the model selection parameter \(\Upsilon _{j}\) is a simple Bernoulli distribution. The remaining priors and likelihood take the form defined in the initial prior parametrisation. The introduction of independence across each univariate \((\Omega _{j}, \Delta _{j}, \Upsilon _{j})\) block (where the data are treated as unconstrained) ensures the q expectations are all available in closed form (derived in the Additional file 1: Sect. 1).
Despite the similarities of the prior parametrisation to (13), the addition of a separate scale parameter \(\Delta _{j}\) for \(\Omega _{j}\) prevents a joint conjugate update on the \((\Omega _{j}, \Delta _{j}, \Upsilon _{j})\) block. Instead we update \(q(\Omega _j, \Upsilon _j)\) (for \(j=1,\ldots,d\)) before updating \(q(\Delta _j \mid \Upsilon _j)\). Both require expectations conditional on \(\Upsilon _j\) as well as the typical marginal expectations. The full \(q(\Omega _{j}, \Upsilon _{j})\) update is
The binary form of the pseudo update for \(\Omega _j\) and \(\Upsilon _j\) enables us to determine the values of the conditional expectations. In (24), conditioning on the value of \(\Upsilon _j\), we have under q
which allows us to set the expectations in the normal variance update as \({\mathbb {E}}_q[\Delta _j^{-1} \mid \Upsilon _j = 1 ]\)
Conditioning in this way prevents the averaging over \(\Upsilon _j\) which shrinks the marginal expectation, creating an update of the same form as (13). Using the form of (25) to determine the conditional expectation and normalising gives the probability of inclusion
The univariate approximation of \(q({\varvec{\xi }},{\varvec{\psi }} \mid {\varvec{y}})\) (23) can be interpreted as a refinement of \((\Upsilon _{j})^{(1)}\), using MCMC expectations and information on all elements of \({\varvec{\xi }}\) to partially account for the constraint in the probability of inclusion.
The spike-and-slab form of the pseudo update for \(q(\Delta _{j} \mid \Upsilon _{j})\) again allows us to back out the conditioning in the conditional expectation \({\mathbb {E}}_q[\Omega ^2_j \mid \Upsilon _j]\) in \(b^*_{\Delta _j}\).
As the update for \(\Delta _{j}\) is conditional on \(\Upsilon _j\), the free parameters in the proposal distributions are not a function of shrunken estimates. The \(q(\Delta _j \mid \Upsilon _j, \textbf{y})\) auxiliary approximating density, which has the appropriate support and is informed by the data, is then used to propose scale parameters for \({\varvec{\psi }}_\xi\) in the MCMC move.
Algorithm
CAVI is performed by iterating through the analytical variational updates, maximising the evidence lower bound (ELBO) with respect to each coordinate direction whilst fixing the other coordinate values. For the \(q({\varvec{\theta }},{\varvec{\psi }},{\varvec{\xi }})\) block, an MCMC step is implemented to obtain Monte Carlo estimates of the intractable marginal expectations of the approximating densities. The proposal probabilities for the sampling scheme are a function of the data and the free parameters, and are updated at each iteration of the CAVI.
For each run we compute the ELBO (derived in Additional file 1: Sect. 1) with the updated free parameters, until it converges to a local optimum. The ELBO is no longer monotonically increasing because of the Monte Carlo variability, but we are able to declare convergence when the random fluctuations around a fixed point are small. The implementation of the overall approach is described in Algorithm 1, with the MCMC move detailed in Algorithm 2.
It is computationally inefficient to start with a large number of iterations m, when the current variational distribution can be far from the maximiser. The software allows the user to specify a smaller number of iterations to begin with before increasing the number of iterations as the algorithm becomes more stable, improving the accuracy of the Monte Carlo estimates.
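The convergence rule and the growing Monte Carlo budget described above can be sketched as follows; the function names, window size and tolerance are illustrative choices, not the published implementation.

```python
import numpy as np

def has_converged(elbo_trace, window=5, tol=1e-2):
    """Declare convergence when the noisy ELBO fluctuates within a small
    band around a fixed point: with Monte Carlo estimates the ELBO is no
    longer monotonically increasing, so a strict increase test fails."""
    if len(elbo_trace) < window:
        return False
    recent = np.asarray(elbo_trace[-window:])
    return np.ptp(recent) < tol  # peak-to-peak spread of the last few values

def mcmc_budget(vi_iter, initial=5000, final=10000, switch_at=5):
    """Start with a smaller number of MCMC iterations while the current
    variational distribution may be far from the maximiser, then increase
    the budget for more accurate Monte Carlo estimates."""
    return initial if vi_iter < switch_at else final
```

At each CAVI iteration `t`, the sampler would be run for `mcmc_budget(t)` iterations and the loop exited once `has_converged(elbo_trace)` holds.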
Simulation study
We validate the performance of our variational inference model, focusing on the compositional element, in a simulation experiment against four frequentist variable selection approaches with freely available software: the \(L_1\) lasso [42], the log-contrast lasso [14], the two-stage log-ratio lasso [16] and selbal [32]. All except the vanilla lasso explicitly account for the compositional nature of the design matrix within the feature selection.
We simulate data from an additive log-ratio model. An \(n \times d\) data matrix \(\textbf{O} = (o_{ij})\) is drawn from a multivariate normal distribution \(N_d({\varvec{\mu }}_o, \Sigma _o)\), and then the compositional covariate matrix \(\textbf{Q}= (q_{ij})\) is obtained via the transformation \(q_{ij} = \frac{\exp (\tau o_{ij})}{\sum _{k=1}^d \exp (\tau o_{ik})}\). The covariates thus follow a logistic normal distribution [43]. To account for the differences in the order of magnitude of the components so common in microbiome data, we fix \(\tau = 2\) and let \(\mu _{oj} = \log (d \times 0.5)\) for \(j = 1,\ldots, 5\) and \(\mu _{oj}=0\) otherwise. Thus 5 of the compositional (or microbiome) features have a larger order of magnitude. We vary the number of compositional features d from 45 to 200 for \(n=100\), to ensure a setting where the features outnumber the observations. As the correlations between the abundances of features in the microbiome can vary quite considerably according to the taxonomy class, we choose three settings for \(\Sigma _o\): \(\Sigma _o = \textbf{I}\) or \((\rho ^{|i-j|} )\) with \(\rho = 0.2\) or \(0.4\).
We select 6 compositional features to be associated with the response, 3 of which have a larger order of magnitude, via a d-dimensional \({\varvec{\theta }}\) vector with non-zero elements \({\varvec{\theta }}_{\xi } = (1, -1.5, 0.5, 1, -1.5, 0.5)\), which satisfy the sum-to-zero constraint. The signal-to-noise ratio (SNR) is defined as \(\text {SNR} = \text {Mean}(|{\varvec{\theta }}_\xi |) / \sigma\). To generate the data with the various choices of SNR, this expression is solved for \(\sigma\) and 100 simulations are generated.
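The data-generating mechanism just described can be sketched in a few lines; the support indices of \({\varvec{\theta }}\) and its signed values (reconstructed so that they sum to zero with 3 unique non-zero values) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_compositional(n=100, d=45, rho=0.2, snr=1.0, tau=2.0):
    """Sketch of the simulation design: logistic-normal compositional
    covariates with 5 high-abundance features and 6 features associated
    with the response via a log-contrast model."""
    # mean vector: the first 5 features sit on a larger order of magnitude
    mu = np.zeros(d)
    mu[:5] = np.log(d * 0.5)
    # covariance: identity, or rho^|i-j| correlation between features
    if rho > 0:
        idx = np.arange(d)
        sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    else:
        sigma = np.eye(d)
    o = rng.multivariate_normal(mu, sigma, size=n)
    # map each row to the simplex: covariates are logistic normal
    e = np.exp(tau * o)
    q = e / e.sum(axis=1, keepdims=True)
    # 6 associated features (3 of high abundance), 3 unique non-zero
    # values, satisfying the sum-to-zero constraint (assumed indices)
    theta = np.zeros(d)
    theta[[0, 1, 2, 5, 6, 7]] = [1.0, -1.5, 0.5, 1.0, -1.5, 0.5]
    # solve SNR = Mean(|theta_xi|) / sigma_eps for the noise scale
    sigma_eps = np.mean(np.abs(theta[theta != 0])) / snr
    # log-contrast model: with sum-zero theta the closure constant cancels
    y = np.log(q) @ theta + rng.normal(0.0, sigma_eps, size=n)
    return q, y, theta
```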
Rearranging the simulated data model, as we have only 3 unique non-zero values of \({\varvec{\theta }}\),
we obtain the all-pairs log-ratio model with a 0 intercept, as in [16].
The log-contrast lasso of [14] is an \(L_1\)-penalised lasso on the log-transformed variables with the additional constraint that the sum of the coefficients is zero. This is fitted using glmnet.constr in R, by augmenting the data with an additional data point with all features equal to 1 and a response value of zero. By assigning this point a large weight, the resulting parameter estimates will approximately sum to zero. The two-stage log-ratio lasso procedure of [16] builds on the [14] model by adding a forward selection step which effectively prunes the model for a sparser solution, whilst maintaining the parameter constraint. The selbal [32] is a balance selection algorithm which starts with a search for the two taxa whose balance, a log-contrast where the coefficients of the linear function sum to zero, is most closely associated with the response. Once the first two-variable balance is selected, the algorithm performs a forward selection process to add further variables to the model. For all of the comparison methods, prediction and cross-validation are performed over a grid of values to determine model selection and tuning parameter estimation.
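The data-augmentation trick used to impose the sum-to-zero constraint can be illustrated with ordinary least squares standing in for the penalised fit (a simplification, not the glmnet.constr implementation): appending a single heavily weighted all-ones row with a zero response drives the coefficient sum towards zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy log-transformed design and a response from a sum-zero coefficient vector
n, d = 50, 5
Z = rng.normal(size=(n, d))
beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0])  # sums to zero
y = Z @ beta_true + 0.1 * rng.normal(size=n)

# augment with one all-ones row and zero response; scaling the row by
# sqrt(w) is equivalent to giving it weight w in the least-squares fit
w = 1e6
Z_aug = np.vstack([Z, np.sqrt(w) * np.ones(d)])
y_aug = np.append(y, 0.0)

beta_hat, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
constraint_violation = abs(beta_hat.sum())  # approximately zero
```

With an \(L_1\) penalty added, the same augmentation yields the approximately constrained log-contrast lasso estimates described above.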
As the focus of the simulation study is on the compositional element, the parameters associated with the unrestricted design matrix are omitted from the Bayesian structure in the CAVI-MC. Vague priors are placed on the hyperparameters for the CAVI-MC model (highlighted in red in the DAG (Additional file 1: Fig. S2)). Standard variable selection in high-dimensional data with spike-and-slab priors in a Bayesian framework is well understood [44]. The sparsity of the compositional features is controlled by the choice of \(a_\kappa\) and \(b_\kappa\) in the Beta hyperprior on \(\kappa\). We fix their choice by specifying a prior average number of covariates, \(d^*\), expected to be included in the model, setting
We perform the simulations with \(d^*\) equal to 6 or 12 and report the average, reflecting uncertainty in \(d^*\).
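The display equation fixing \(a_\kappa\) and \(b_\kappa\) from \(d^*\) is not reproduced above. A common convention (an assumption on our part, not the paper's stated formula) matches the Beta prior mean of \(\kappa\) to \(d^*/d\), so that the prior expected model size under independent Bernoulli(\(\kappa\)) indicators is \(d^*\):

```python
def beta_hyperparams(d, d_star, a_kappa=1.0):
    """With xi_j | kappa ~ Bernoulli(kappa) and kappa ~ Beta(a, b), the
    prior expected model size is d * a / (a + b).  Fix a and solve for b
    so this equals d_star (a common convention, assumed here)."""
    b_kappa = a_kappa * (d - d_star) / d_star
    return a_kappa, b_kappa

# e.g. d = 45 features with d_star = 6 expected inclusions
a, b = beta_hyperparams(d=45, d_star=6)
expected_size = 45 * a / (a + b)  # recovers d_star
```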
Since the optimisation problem for maximising the ELBO is non-convex, the approach can be sensitive to the initialisation of the variational parameters. For each simulation, the variational parameters are initialised via random samples from the associated prior distributions. Twenty-five variational inference iterations are performed for each run (although the algorithm typically converges after approximately 6 iterations). The initial number of between-model MCMC iterations is set to 5000, before 10,000 iterations are performed after the 5th set of variational inference updates.
To assess the performance of the approaches we use metrics which evaluate the ability to select the correct variables and estimate the appropriate effects. The prediction error (PE), defined as \(\text {PE} = \frac{1}{n_{test}}(\textbf{y}_{test}-\textbf{Z}_{test}\hat{{\varvec{\theta }}}_{train})^T(\textbf{y}_{test}-\textbf{Z}_{test}\hat{{\varvec{\theta }}}_{train})\), is computed using an independent test sample of size 5. We compute the \(l_2\) loss \(\Vert \hat{{\varvec{\theta }}} - {\varvec{\theta }}\Vert ^2_2\) and the squared bias to assess the accuracy of the coefficient estimates. To assess the accuracy of the variable selection, the true positive rate (TPR or sensitivity) and false positive rate (FPR or 1 − specificity) are reported, where positives and negatives in the context of the frequentist approaches refer to non-zero and zero coefficients respectively. For each of these metrics, the respective standard deviation across datasets is included. Variable selection for the CAVI-MC is performed by thresholding the marginal approximate posterior distribution \({\mathbb {E}}[q(\xi _j \mid \textbf{y})]\) at 0.5. The approximate posterior mean is used for the parameter estimates of the Bayesian model.
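The evaluation metrics above are straightforward to compute; a minimal sketch (function names are illustrative):

```python
import numpy as np

def prediction_error(y_test, Z_test, theta_hat):
    """PE = (1/n_test) * (y - Z theta_hat)^T (y - Z theta_hat)."""
    r = y_test - Z_test @ theta_hat
    return (r @ r) / len(y_test)

def selection_rates(theta_hat, theta_true, eps=1e-12):
    """TPR and FPR, treating non-zero true coefficients as positives."""
    pos = np.abs(theta_true) > eps
    sel = np.abs(theta_hat) > eps
    tpr = sel[pos].mean() if pos.any() else np.nan
    fpr = sel[~pos].mean() if (~pos).any() else np.nan
    return tpr, fpr

def select_by_threshold(incl_prob, threshold=0.5):
    """Bayesian variable selection: threshold the marginal approximate
    posterior inclusion probabilities E[q(xi_j | y)]."""
    return incl_prob > threshold
```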
The vanilla lasso and the log-contrast lasso consistently detect the true parameters for low SNRs, but this comes at the considerable cost of a high false positive rate. This failure to capture the sparsity of the true model is a function of the number of compositional covariates and the correlation between them. The results for \(\rho =0\) with \(d=45\) and \(d=200\) are plotted in Figs. 1 and 2. When d is 45 and \(\rho =0.4\), these two approaches can incorrectly select over a third of the covariates. This is not the case with the selbal, the log-ratio lasso or the CAVI-MC. These methods control for false positives whilst still maintaining a high probability of identifying the correct features.
The proposed CAVI-MC Bayesian method outperforms the constrained lasso and selbal with respect to FPR and prediction error (Tables 1, 2, 3, 4, 5, 6, 7 and 8). The performance of the CAVI-MC is very similar to that of the log-ratio lasso for moderate and strong SNRs. The CAVI-MC requires slightly more signal to detect the true parameters, but consistently outperforms the log-ratio lasso in controlling for false positives. The Bayesian approach has the additional benefit of a posterior distribution for each of the associated compositional parameters \(\theta _j\) in the model, and greater model flexibility. Whereas the log-ratio lasso is restricted to models of the form of (29), the CAVI-MC can accurately capture models with any number of unique non-zero values.
Each method's ability to detect the true parameters in the model deteriorates in the presence of large correlation and low SNR (Tables 1 and 2). The selbal appears to be the most robust method for larger correlation but clearly struggles to select the correct features even with much higher SNRs. The between-model moves in the CAVI-MC rely on an RJMCMC which is guided by an approximation of the likelihood. When the signal is low and the correlation high, this reduces the ability to guide the sampler to the true parameters within a large binary space. The approximation of \(q({\varvec{\xi }},{\varvec{\psi }} \mid \textbf{y})\) for univariate proposals is slightly less effective at identifying the correct features, compared with the log-contrast lasso approach which identifies the initial variables for the log-ratio lasso.
The \(l_2\) loss and squared bias diagnostics (Additional file 1: Tables 9–16) indicate that the CAVI-MC estimates the model well, as it typically outperforms all but the log-ratio lasso. Given that the true model in the simulation study is a log-ratio model (29), the log-ratio lasso benefits from estimating a much smaller number of parameters than the other methods. As the CAVI-MC is more flexible than the log-ratio lasso, the squared bias for this simulation scenario is typically larger, but this comes with the distinct advantage of being able to accurately capture a much larger space of models.
The performance of the CAVI-MC when varying the thresholding value for \({\mathbb {E}}[q(\xi _j \mid \textbf{y})]\), for an SNR of 0.83, is plotted in Fig. 3, where the purple point represents the threshold value of 0.5. Despite the log-ratio lasso having larger TPRs, the points for \(\rho =0.2\) and \(\rho =0.4\) fall inside the CAVI-MC ROC curve (further ROC curves are in Additional file 1: Figs. S3–S5).
Data
We apply our proposed method to a subset of the main study in Arkhangelsk, containing 515 men and women aged between 35 and 69 years recruited from the general population, from the “Know your Heart” cross-sectional study of cardiovascular disease [33]. As part of the study, participants were asked to volunteer faecal samples for analysis of the gut microbiome. The relative abundances of the microbes were then determined by 16S rRNA sequencing (using the variable regions V3–V4) followed by taxonomic classification using a Naive Bayes classifier [45]. A baseline questionnaire captured unconstrained covariate information on age, sex and smoking status. Information on alcohol consumption from the questionnaire and biomarker data was used to derive a categorical factor with four levels of alcohol use.
The gut microbiome plays an important role in energy extraction and obesity [46], which we illustrate by regressing body mass index (BMI) against the microbiome at the phylum and genus levels alongside the unconstrained covariates. The counts are transformed into relative abundances, after adding a small constant of 0.5 to replace the zero counts [47], and then log-transformed. BMI is also log-transformed and the continuous age covariate is standardised.
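The preprocessing steps above can be sketched as follows (a minimal sketch; the function name and argument layout are illustrative, not the paper's code):

```python
import numpy as np

def preprocess(counts, bmi, age, pseudo=0.5):
    """Zero replacement with a small pseudocount, closure to relative
    abundances, log transform of the compositions and of BMI, and
    standardisation of the continuous age covariate."""
    counts = np.where(counts == 0, pseudo, counts).astype(float)
    rel = counts / counts.sum(axis=1, keepdims=True)  # rows on the simplex
    log_rel = np.log(rel)                             # compositional covariates
    log_bmi = np.log(bmi)                             # response
    age_std = (age - age.mean()) / age.std()          # standardised covariate
    return log_rel, log_bmi, age_std
```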
Vague priors are placed on the hyperparameters for the CAVI-MC model. Given previous results from analyses of the microbiome and BMI, \(d^*\) for the hyperprior on \(\kappa\) is set to 8. The birth-death or swap move parameter \(\phi\) is set to 0.5. Four runs of the CAVI-MC algorithm are performed, each with different initialisation values for the q expectations, and the ELBO is monitored to confirm convergence. For each run 20 variational inference iterations are performed (although the algorithm typically converges after approximately 6 iterations). The initial number of between-model MCMC iterations is set to 5000, before 10,000 iterations are performed after the 5th set of variational inference updates.
Despite the different initial starting points, the CAVI-MC converges to the same maximum. Thresholding the marginal expectation of the approximate posterior distributions at 0.5, we find that at the phylum level an increase in Firmicutes (which has a −0.8 correlation with Bacteroidetes) and a decrease in Synergistetes are associated with an increase in BMI. At the genus level, an increase in BMI is associated with an increase in Roseburia and a reduction in Oscillospira. The corresponding marginal expectations of the approximating posterior \({\mathbb {E}}[q({\varvec{\xi }} \mid \textbf{y})]\) for the phylum and genus levels are plotted in Figs. 4 and 5. We also find BMI to be positively associated with age. The ELBO for each model at each microbiome level indicates an optimum has been reached (Additional file 1: Figs. S6 and S7), with each run finding the same local optimum.
Our findings appear to be consistent with previous studies. The ratio of Firmicutes to Bacteroidetes at the phylum level is considered to be a biomarker for obesity ([48, 49]). Increases in the physical training of rats have led to an increase in their levels of Synergistetes [50]. At the genus level, [51] identifies Roseburia to be positively correlated with obesity in children, and [52] determines Oscillospira to be negatively associated with BMI.
Discussion
Our Bayesian hierarchical linear log-contrast model, estimated by structured mean field Monte Carlo coordinate ascent variational inference, improves Bayesian regression modelling for compositional data. Sparse variable selection is performed through priors which fully account for the constrained parameter space associated with the compositional covariates. We introduce Monte Carlo expectations to approximate integrals which are not available in closed form. These expectations are obtained via an RJMCMC with proposal parameters informed by approximating variational densities via auxiliary parameters with pseudo updates. As long as there is sufficient signal to guide the RJMCMC, the approach leads to a high TPR and low FPR compared with frequentist compositional approaches.
The CAVI-MC suffers when the SNR is low and the correlation is high. Addressing the correlation by adapting the prior parametrisation may help to improve the model in these settings. One approach is to use a Markov random field prior [53] which imposes a structure on the selection of \({\varvec{\xi }}\). [18] use this prior to incorporate the phylogenetic relationship among the bacterial taxa alongside a model which partially accounts for the constraint on the parameters. Alternatively, to avoid having to predefine the structure of the taxa, a Dirichlet process could be used to account for the correlation of the microbiome by clustering the covariates [54] prior to the regression.
At the genus level, despite the CAVI-MC identifying associations between BMI and Roseburia and Oscillospira, some of the other microbiome features which have been found to be associated with BMI were not detected. Bifidobacterium has been found to be negatively associated with BMI in children [55]. This taxon was also found to be associated with BMI in adults, alongside a negative association between BMI and Methanobrevibacter [56]. However, associations between BMI and the gut microbiome at the genus level are subject to a high degree of variation across studies [57]. This may be partly explained by the tools used to construct the microbiome datasets, which can produce quite different results from the same sample [58].
As genetic sequencing becomes more widely available, interest grows in modelling the relationship between the microbiome and a complex set of phenotypes, such as blood concentrations of lipids or other metabolites. Bayesian hierarchical models have been introduced for multiple outcomes ([59, 60]), which leverage shared information to improve predictor selection. These approaches often use the simplifying assumption of conditionally independent residuals to allow different covariates to be associated with different responses. In future work, we would like to explore this multiple-response extension to our model, using a hierarchical approach to allow information on the shared parameters to be pooled whilst incorporating correlation between the responses to aid variable selection.
Supplementary material
Supplementary Material, which contains the derivations of all of the analytical updates for the CAVI-MC, is available online.
Availability of data and materials
The Know Your Heart data that support the findings of this study are available from the International Project on Cardiovascular Disease in Russia (IPCDR) Steering Group (at the London School of Hygiene and Tropical Medicine), but restrictions apply to the availability of these data, which were used under licence for the current study, and so they are not publicly available. Data are however available from the authors upon reasonable request and with permission of the IPCDR Steering Group. Software in the form of Python code, together with a sample input data set and complete documentation, is available on request from the corresponding author.
Abbreviations
alr: Additive log-ratio
BMI: Body mass index
CAVI: Coordinate ascent variational inference
CAVI-MC: Monte Carlo coordinate ascent variational inference
clr: Centred log-ratio
DAG: Directed acyclic graph
ELBO: Evidence lower bound
ilr: Isometric log-ratio
MCMC: Markov chain Monte Carlo
OTU: Operational taxonomic unit
PE: Prediction error
RJMCMC: Reversible jump Markov chain Monte Carlo
SNR: Signal-to-noise ratio
References
Sender R, Fuchs S, Milo R. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 2016;14(8):1–14. https://doi.org/10.1371/journal.pbio.1002533.
Franzosa EA, Hsu T, SirotaMadi A, Shafquat A, AbuAli G, Morgan XC, Huttenhower C. Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nature Rev Microbiol. 2015;13:360–72. https://doi.org/10.1038/nrmicro3451.
Gloor GB, Macklaim JM, PawlowskyGlahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front Microbiol. 2017;8:1–6. https://doi.org/10.3389/fmicb.2017.02224.
Li H. Microbiome, metagenomics, and highdimensional compositional data analysis. Ann Rev Stat Appl. 2015;2:73–94. https://doi.org/10.1146/annurevstatistics010814020351.
Park T, Casella G. The Bayesian lasso. J Am Stat Assoc. 2008;103(482):681–6. https://doi.org/10.1198/016214508000000337. arXiv:0804.3173v7.
Guan Y, Stephens M. Bayesian variable selection regression for genomewide association studies and other largescale problems. Ann Appl Stat. 2011;5(3):1780–815. https://doi.org/10.1214/11AOAS455. arXiv:1110.6019.
George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88(423):881–9. https://doi.org/10.1080/01621459.1993.10476353. arXiv:0703063.
Kuo L, Mallick B. Variable selection for regression models. Indian J Stat. 1998;60(1):65–81.
Dellaportas P, Forster JJ, Ntzoufras I. On Bayesian model and variable selection using MCMC. Stat Comput. 2002;12(1):27–36. https://doi.org/10.1023/A:1013164120801.
Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B Methodol. 1982;44(2):139–77. https://doi.org/10.1007/9789400941090.
Egozcue JJ, PawlowskyGlahn V, MateuFigueras G, BarcelóVidal C. Isometric logratio transformations for compositional data analysis. Math Geol. 2003;35(3):279–300. https://doi.org/10.1023/A:1023818214614.
Hron K, Filzmoser P, Thompson K. Linear regression with compositional explanatory variables. J Appl Stat. 2012;39(5):115–1128. https://doi.org/10.1080/02664763.2011.644268.
Aitchison J, BaconShone J. Log contrast models for experiments with mixtures. Biometrika. 1984;71(2):323–30. https://doi.org/10.1093/biomet/71.2.323.
Lin W, Shi P, Feng R, Li H. Variable selection in regression with compositional covariates. Biometrika. 2014. https://doi.org/10.1093/biomet/asu031.
Shi P, Zhang A, Li H. Regression analysis for microbiome compositional data. Ann Appl Stat. 2016;10(2):1019–40. https://doi.org/10.1214/16AOAS928. arXiv:1603.00974.
Bates S, Tibshirani R. Logratio lasso: scalable, sparse estimation for logratio models. Biometrics. 2019;75(2):613–24. https://doi.org/10.1111/biom.12995. arXiv:1709.01139.
Combettes PL, Müller CL. Regression models for compositional data: general logcontrast formulations, proximal optimization, and microbiome data applications. Stat Biosci. 2021;13(2):217–42. https://doi.org/10.1007/s12561020092832. arXiv:1903.01050.
Zhang L, Shi Y, Jenq RR, Do KA, Peterson CB. Bayesian compositional regression with structured priors for microbiome feature selection. Biometrics. 2020;77(3):824–38. https://doi.org/10.1111/biom.13335.
Koslovsky MD, Hoffman KL, Daniel CR, Vannucci M. A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. Ann Appl Stat. 2020;14(3):1471–92. https://doi.org/10.1214/20AOAS1354.
Li Q, Jiang S, Koh AY, Xiao G, Zhan X. Bayesian modeling of microbiome data for differential abundance analysis. 2019. arXiv:1902.08741.
Leng C, Tran MN, Nott D. Bayesian adaptive lasso. Ann Inst Stat Math. 2014;66(1):221–44. https://doi.org/10.1007/s1046301304296. arXiv:1009.2300.
Lamnisos D, Griffin JE, Steel MFJ. Adaptive Monte Carlo for Bayesian variable selection in regression models. J Comput Graph Stat. 2013;22(3):729–48. https://doi.org/10.1080/10618600.2012.694756.
Nott DJ, Kohn R. Adaptive sampling for Bayesian variable selection. Biometrika. 2005;92(4):747–63. https://doi.org/10.1093/biomet/92.4.747.
Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: a review for statisticians. J Am Stat Assoc. 2017;112(518):859–77. https://doi.org/10.1080/01621459.2017.1285773. arXiv:1601.00670.
Ormerod JT, Wand MP. Explaining variational approximations. Am Stat. 2010;64(2):154. https://doi.org/10.1198/tast.2010.09058.
Carbonetto P, Stephens M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 2012;7(1):73–108. https://doi.org/10.1214/12BA703.
Jaakkola TS, Jordan MI. A variational approach to Bayesian logistic regression models and their extensions. In: Sixth International Workshop on Artificial Intelligence and Statistics. 1997.
Bishop CM, Svensen M. Bayesian hierarchical mixtures of experts. In: Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI). 2003. p. 57–64.
Honkela A, Valpola H. Unsupervised variational Bayesian learning of nonlinear models. In: Advances in Neural Information Processing Systems. 2005.
Ye L, Beskos A, De Iorio M, Hao J. Monte Carlo coordinate ascent variational inference. Stat Comput. 2020;30:887–905. https://doi.org/10.1007/s1122202009924y.
Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–32. https://doi.org/10.1093/biomet/82.4.711.
RiveraPinto J, Egozcue JJ, PawlowskyGlahn V, Paredes R, NogueraJulian M, Calle ML. Balances: a new perspective for microbiome analysis. mSystems. 2018;3(4):1–12. https://doi.org/10.1128/msystems.0005318.
Cook S, Malyutina S, Kudryavtsev AV, Averina M, Bobrova N, Boytsov S, Brage S, Clark TG, Benavente ED, Eggen AE, Hopstock LA, Hughes A, Johansen H, Kholmatova K, Kichigina A, Voevoda M, Westgate K, Leon DA. Know your heart: rationale, design and conduct of a crosssectional study of cardiovascular structure, function and risk factors in 4500 men and women aged 35–69 years from two Russian cities. Wellcome Open Research. 2018;3:1–29.
George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sin. 1997;1(7):339–73.
Xu X, Ghosh M. Bayesian variable selection and estimation for group lasso. Bayesian Anal. 2015;10(4):909–36. https://doi.org/10.1214/14BA929.
Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. Introduction to variational methods for graphical models. Mach Learn. 1999;37(2):183–233. https://doi.org/10.1023/A:1007665907178.
Salimans T, Knowles DA. Fixedform variational posterior approximation through stochastic linear regression. Bayesian Anal. 2013;8(4):837–82. https://doi.org/10.1214/13BA858. arXiv:1206.6679.
Bishop CM, Winn J. Variational message passing. J Mach Learn Res. 2006;6(1):661.
Hoffman MD, Blei DM. Structured stochastic variational inference. J Mach Learn Res. 2015;38:361–9 arXiv:1404.4114.
Xing EP, Jordan MI, Russell S. A generalized mean field algorithm for variational inference in exponential families. In: UAI 03: Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. 2002;pp. 583–591.
Lamnisos D, Griffin JE, Steel MFJ. Transdimensional sampling algorithms for Bayesian variable selection in classification problemswith many more variables than observations. J Comput Graph Stat. 2009;18(3):592–612. https://doi.org/10.1198/jcgs.2009.08027.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996;58(1):267–88. https://doi.org/10.1111/j.25176161.1996.tb02080.x.
Aitchison J, Shen SM. Logisticnormal distributions: some properties and uses. Biometrika. 1980;67(2):261–72.
Scott JG, Berger JO. Bayes and empiricalBayes multiplicity adjustment in the variableselection problem. Ann Stat. 2010;38(5):2587–619. https://doi.org/10.1214/10AOS792. arXiv:1011.2333v1.
Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, Huttley GA, Gregory Caporaso J. Optimizing taxonomic classification of markergene amplicon sequences with QIIME 2’s q2featureclassifier plugin. Microbiome. 2018;6(90):1–17. https://doi.org/10.1186/s401680180470z.
Tseng CH, Wu CY. The gut microbiome in obesity. J Formos Med Assoc. 2019;118:3–9. https://doi.org/10.1016/j.jfma.2018.07.009.
Aitchison J. The statistical analysis of compositional data. Caldwell, NJ: Blackburn Press; 2003.
Armougom F, Henry M, Vialettes B, Raccah D, Raoult D. Monitoring bacterial community of human gut microbiota reveals an increase in Lactobacillus in obese patients and methanogens in anorexic patients. PLoS ONE. 2009;4(9):1–8. https://doi.org/10.1371/journal.pone.0007125.
Davis CD. The gut microbiome and its role in obesity. Nutr Today. 2016;51(4):167–74. https://doi.org/10.1097/NT.0000000000000167.
de Oliveira Neves VG, de Oliveira DT, Oliveira DC, Oliveira Perucci L, dos Santos TAP, da Costa Fernandes I, de Sousa GG, Barboza NR, GuerraSá R. Highsugar diet intake, physical activity, and gut microbiota crosstalk: implications for obesity in rats. Food Sci Nutr. 2020;8(10):5683–5695. https://doi.org/10.1002/fsn3.1842
Yuan X, Chen R, McCormick KL, Zhang Y, Lin X, Yang X. The role of the gut microbiota on the metabolic status of obese children. Microb Cell Fact. 2021;20(1):1–13. https://doi.org/10.1186/s12934021015489.
Chen Y, Zheng H, Xia Zhang G, Lan Chen F, Dan Chen L, Cong Yang Z. High Oscillospira abundance indicates constipation and low BMI in the Guangdong gut microbiome project. Sci Rep. 2020;10(1):1–8. https://doi.org/10.1038/s4159802066369z.
Chen Y, Welling M. Bayesian structure learning for Markov random fields with a spike and slab prior. Uncertainty in Artificial Intelligence  Proceedings of the 28th Conference, UAI 2012, 2012;pp. 174–184. arXiv:1206.1088.
Curtis SMK, Ghosh SK. A Bayesian approach to multicollinearity and the simultaneous selection and clustering of predictors in linear regression. J Stat Theory Pract. 2011;5(4):715–35. https://doi.org/10.1080/15598608.2011.10483741.
Ignacio A, Fernandes MR, Rodrigues VAA, Groppo FC, Cardoso AL, AvilaCampos MJ, Nakano V. Correlation between body mass index and faecal microbiota from children. Clin Microbiol Infect. 2016;22(3):258–12588. https://doi.org/10.1016/j.cmi.2015.10.031.
Schwiertz A, Taras D, Schäfer K, Beijer S, Bos NA, Donus C, Hardt PD. Microbiota and SCFA in lean and overweight healthy subjects. Obesity. 2010;18(1):190–5. https://doi.org/10.1038/oby.2009.167.
Verdam FJ, Fuentes S, De Jonge C, Zoetendal EG, Erbil R, Greve JW, Buurman WA, De Vos WM, Rensen SS. Human intestinal microbiota composition is associated with local and systemic inflammation in obesity. Obesity. 2013;21(12):607–15. https://doi.org/10.1002/oby.20466.
Nearing JT, Douglas GM, Hayes M, Macdonald J, Desai D, Allward N, Jones CMA, Wright R, Dhanani A, Comeau AM, Langille MGI. Microbiome differential abundance methods produce disturbingly different results across 38 datasets. bioRxiv. 2021;13(1):342.
Ruffieux H, Davison AC, Hager J, Irincheeva I. Efficient inference for genetic association studies with multiple outcomes. Biostatistics. 2017;18(4):618–36. https://doi.org/10.1093/biostatistics/kxx007. arXiv:1609.03400.
Lewin A, Saadi H, Peters JE, MorenoMoral A, Lee JC, Smith KGC, Petretto E, Bottolo L, Richardson S. MTHESS: An efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues. Bioinformatics. 2016;32(4):523–32. https://doi.org/10.1093/bioinformatics/btv568.
Acknowledgements
Not applicable.
Funding
This work was supported by the UK Medical Research Council grants MR/N013638/1 and MR/M013138/1, “Methods and tools for structural models integrating multiple high-throughput omics data sets in genetic epidemiology”. The approach is applied to data from the Know Your Heart study, a component of the International Project on Cardiovascular Disease in Russia (IPCDR) funded by Wellcome Trust Strategic Award [100217], UiT The Arctic University of Norway (UiT), the Norwegian Institute of Public Health, and the Norwegian Ministry of Health and Social Affairs. The funding bodies had no role in the design of the study, data collection, analysis, interpretation of data, or in writing the manuscript.
Author information
Authors and Affiliations
Contributions
DS and AL developed the statistical methods outlined in the paper. All authors except AL and DS helped create the Know Your Heart dataset. DS drafted the manuscript with support from AL. All authors read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The Know Your Heart study, which generated the dataset used in the applied analysis in this article, complies with the Declaration of Helsinki. The study was approved by the ethics committees of the London School of Hygiene & Tropical Medicine (approval number 8808, received 24/02/2015), Novosibirsk State Medical University (approval number 75, received 21/05/2015), the Institute of Preventative Medicine, Novosibirsk (no approval number; approval received 26/12/2014), and the Northern State Medical University, Arkhangelsk (approval number 01/01-15, received 27/01/2015). Signed informed consent was obtained both at the baseline interview and at the health check. At the baseline interview, consent was obtained for passing on name, address, and telephone number to the polyclinic medical team for those deciding to have a health check; agreement to the interview itself was obtained verbally. At the health check, written informed consent was obtained for participation in the study. Participants were also given the option to consent to being recontacted by the study team in the future.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
The derivation of the CAVI-MC updates, details of the RJ-MCMC moves and model proposals, properties of the constraint matrix, and additional plots and tables.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Scott, D.A.V., Benavente, E., Libiseller-Egger, J. et al. Bayesian compositional regression with microbiome features via variational inference. BMC Bioinformatics 24, 210 (2023). https://doi.org/10.1186/s12859-023-05219-x
DOI: https://doi.org/10.1186/s12859-023-05219-x
Keywords
 Compositional
 Variational inference
 Microbiome
 Singular multivariate normal
 Markov chain Monte Carlo