 Software
 Open Access
 Published:
coda4microbiome: compositional data analysis for microbiome crosssectional and longitudinal studies
BMC Bioinformatics volume 24, Article number: 82 (2023)
Abstract
Background
One of the main challenges of microbiome analysis is its compositional nature that if ignored can lead to spurious results. Addressing the compositional structure of microbiome data is particularly critical in longitudinal studies where abundances measured at different times can correspond to different subcompositions.
Results
We developed coda4microbiome, a new R package for analyzing microbiome data within the Compositional Data Analysis (CoDA) framework in both, crosssectional and longitudinal studies. The aim of coda4microbiome is prediction, more specifically, the method is designed to identify a model (microbial signature) containing the minimum number of features with the maximum predictive power. The algorithm relies on the analysis of logratios between pairs of components and variable selection is addressed through penalized regression on the “allpairs logratio model”, the model containing all possible pairwise logratios. For longitudinal data, the algorithm infers dynamic microbial signatures by performing penalized regression over the summary of the logratio trajectories (the area under these trajectories). In both, crosssectional and longitudinal studies, the inferred microbial signature is expressed as the (weighted) balance between two groups of taxa, those that contribute positively to the microbial signature and those that contribute negatively. The package provides several graphical representations that facilitate the interpretation of the analysis and the identified microbial signatures. We illustrate the new method with data from a Crohn's disease study (crosssectional data) and on the developing microbiome of infants (longitudinal data).
Conclusions
coda4microbiome is a new algorithm for identification of microbial signatures in both, crosssectional and longitudinal studies. The algorithm is implemented as an R package that is available at CRAN (https://cran.rproject.org/web/packages/coda4microbiome/) and is accompanied with a vignette with a detailed description of the functions. The website of the project contains several tutorials: https://malucalle.github.io/coda4microbiome/
Background
Although there are still many unknowns about the specific mechanisms of action of the human microbiome, there is growing evidence of its relevance in human health [24, 38]. In recent years, much progress has been made in microbiome research thanks to highthroughput DNA sequencing technologies that allow precise quantification of the composition of the microbiome. The study of the microbiome is considered a great opportunity for improving the current treatment of some diseases and for deriving microbial biomarkers that could be used as diagnostic or prognostic tools.
Microbiome composition is dynamic and the study of microbiome changes over time is of primary importance for understanding the relationship between microbiome and human phenotypes. Longitudinal studies are costly, both economically and logistically, but there is growing evidence of the limitations of crosssectional studies for providing a full picture of the role of the microbime in human health. Microbiome longitudinal studies can be very valuable in this context, provided appropriate methods of analysis are used [34]. The analysis of microbiome data involves significant experimental and computational challenges [5]. One of them is the compositional nature of the data, which requires the use of specific methods of analysis [9, 17,18,19]. Compositional data refers to constraint multivariate nonnegative data that carry relative information. Microbiome relative abundances (proportions) are constrained by a total sum equal to one. This total constraint induces strong dependencies among the observed abundances of the different taxa. In fact, the observed abundance of each taxon is not informative and only provide a relative measure of abundance when compared to the abundances of other taxa [36]. Ignoring the compositional nature of microbiome data can lead to spurious results [28, 37]. This is particularly critical in the context of microbiome longitudinal studies where compositions measured at different times can be affected by distinct batch effects and similar quality control or filtering protocols may yield to different subcompositions at each time point.
Aitchison [2] laid the foundations of Compositional Data Analysis (CoDA), which relies on extracting the relative information of compositional data by comparing the parts of the composition. Logarithms of ratios between components (logratios) are the fundamental transformation in this framework [20, 31] and is known as the logratio approach.
Some methods used in microbiome analysis, such as ALDEx2 [13], LinDA [39, 40], ANCOM [26], ANCOMBC [23], fastANCOM [39] and LOOCM [21], perform the logratio approach to identify differential abundant taxa between two study groups. Here we introduce coda4microbiome, a new R package for analyzing microbiome data within the CoDA framework in both, crosssectional and longitudinal studies. coda4microbiome is an improvement of our previous algorithm, selbal [33], using a more flexible model and a more computationally efficient global variable selection method that results in a considerable reduction of computational time. coda4microbiome differs from most differential abundance (DA) testing methods that aim to characterize microbial communities by selecting taxa with significant different abundances between two study groups (e.g., controls vs cases). Like selbal, the aim of coda4microbiome is prediction, i.e., the method is designed to identify a model (microbial signature) containing the minimum number of features with the maximum predictive power. The algorithm relies on the analysis of logratios between pairs of components and variable selection is addressed through penalized regression on the “allpairs logratio model”, the model containing all possible pairwise logratios.
For longitudinal data, pairwise logratios measured at different time points gives a curve profile or trajectory for each sample. A summary of the shape of these individual trajectories will be the basis for the analysis. More specifically, the algorithm infers dynamic microbial signatures by performing penalized regression over the summary of the logratio trajectories (the area under these trajectories).
In both, crosssectional and longitudinal studies, after reparameterization of the initial “allpairs logratio model”, the inferred microbial signature is expressed as a function of the (logtransformed) initial variables in the form of a logcontrast model [3], i.e., a loglinear model with the constraint that the sum of the coefficients is equal to zero. The zerosum constraint ensures the invariance principle required for compositional data analysis. These microbial signatures can be interpreted as the (weighted) balance between two groups of taxa, those that contribute positively to the microbial signature and those that contribute negatively. For longitudinal data and a binary outcome (e.g. disease status), the signature provides two groups of taxa with different logratio trajectories for cases and controls.
The algorithm is implemented in the R package “code4microbiome” (https://cran.rproject.org/web/packages/coda4microbiome/). Several graphical representations of the results are provided that facilitate the interpretation of the analysis: plot of the logratio trajectories, plot of the signature (selected taxa and coefficients) and plot of the prediction accuracy of the model. In fact, coda4microbiome is not just an R package but a broader initiative that aims to bridge the gap between compositional data analysis and microbiome research. To this end, we are conducting training activities and developing materials that are available at the website of the project: https://malucalle.github.io/coda4microbiome/
The methodology for crosssectional data is described in “Microbioal signature based on logratio analysis: crosssectional studies” Section and illustrated with data from a pediatric Crohn's disease study (“Crosssectional data: Crohn’s disease (CD) study” Section). The methodology for longitudinal data is described in “Microbial signature based on logratio analysis: longitudinal studies” Section and illustrated in “Longitudinal data: early childhood and the microbiome study” Section with data from the “Early childhood and the microbiome (ECAM) study” [7, 8]. We performed a simulation study for benchmarking of several microbiome analysis algorithms (“Simulation study” and “Simulation study results” Section) and applied coda4microbiome in several real datasets (“Real datasets analysis” Section).
Materials and methods
Microbioal signature based on logratio analysis: crosssectional studies
Assume we have n subjects with phenotype \(Y = \left( {Y_{1} , \ldots ,{ }Y_{n} } \right)\) and denote by \(X_{i} = \left( {X_{i1} ,{ }X_{i2} , \ldots ,X_{iK} } \right)\) the microbiome composition of subject i for K taxa. \(X\) can represent either relative abundances (proportions) or raw read counts. We approach the identification of those taxa that are associated to the outcome through penalized regression on the “allpairs logratio model”, a generalized linear model containing all possible pairwise logratios [6]:
The regression coefficients in Eq. (1) are estimated to minimize a loss function \(L\left( \beta \right)\) subject to an elasticnet penalization term on the regression coefficients\(:\)
A common reparameterization of the penalization parameters is \(\lambda_{1} = \lambda \left( {1  \alpha } \right)/2\) and \(\lambda_{2} = \lambda \alpha\) where \(\lambda\) controls the amount of penalization and \(\alpha\) the mixing between the two norms.
For the linear regression model the loss function is given by the residual sum of squares
where \(M\) is the matrix of all pairwise logratios and has dimension \(n\) by \(K\left( {K  1} \right)/2\). The expression of the optimization problem (2) for other models, like the logistic regression, can be found in Friedman et al. [14]. We use the function cv.glmnet() from the R package glmnet [14] to solve (2) within a crossvalidation process that provides the optimal value of \(\lambda\) with a default value for \(\alpha\) equal to 0.9. Noncompositional covariates are previously modeled with Y and the fitted values are considered as “offset” in the penalized regression.
The result of the penalized optimization provides a set of selected pairs of taxa, those with a nonnull estimated coefficient. The linear predictor of the generalized linear model (2) provides a microbial signature score for each individual, \(i \in \left\{ {1, \ldots , n} \right\}\), \(M_{i} = \mathop \sum \limits_{1 \le j < k \le K } \hat{\beta }_{jk} \cdot{\text{log}}\left( {X_{ij} /X_{ik} } \right)\), which is associated with the phenotype \(Y_{i}\). Because of the linearity of the logarithm, the microbial signature M can be rewritten in terms of the selected single taxa which is more interpretable than in terms of pairs of taxa:
where \(\hat{\theta }_{j} = \mathop \sum \limits_{k = j + 1}^{K} \hat{\beta }_{jk}  \mathop \sum \limits_{k = 1}^{j  1} \hat{\beta }_{kj}\), that is, the sum of the coefficients \(\hat{\beta }\) that correspond to a logratio that involves component j [6].
It can be proved that \(\mathop \sum \limits_{j = 1}^{K} \hat{\theta }_{j} = 0\) and thus, the microbial signature M is a logcontrast function involving the selected taxa (those with \(\hat{\theta }_{j} \ne 0)\). This ensures the invariance principle required for proper compositional data analysis and it facilitates the interpretation of the microbial signature. Indeed, expression \(\mathop \sum \limits_{j = 1}^{K} \hat{\theta }_{j} \cdot{\text{log}}\left( {X_{j} } \right)\) in (3) can be interpreted as a weighted balance between two groups of taxa, \(G_{1}\) and \(G_{2}\), the taxa with a positive coefficient vs those with a negative coefficient [36].
Microbial signature based on logratio analysis: longitudinal studies
Summary of logratio trajectories
Assume n subjects with fixed phenotype \(Y = \left( {Y_{1} ,{ } \ldots ,{ }Y_{n} } \right)\). Subject i has been observed in \(L_{i} { }\) time points, \(\left( {t_{i1} ,{ }t_{i2} ,{ } \ldots ,{ }t_{{iL_{i} }} } \right)\). We denote by \(X_{i} \left( {t_{ij} } \right) = \left( {X_{i1} \left( {t_{ij} } \right),{ }X_{i2} \left( {t_{ij} } \right),{ } \ldots ,X_{iK} \left( {t_{ij} } \right)} \right)\) the microbiome composition of subject i at time \(t_{ij}\), where K is the number of taxa which is assumed to be the same for all the individuals and all the time points. \(X_{i} \left( {t_{ij} } \right)\) can represent either relative abundances (proportions) or raw counts. We denote by \(logX_{i} \left( {t_{ij} } \right)\) the logarithm transformation of microbiome abundances after zero imputation [27]. The logabundance trajectory of component A for individual i is denoted by \(logX_{iA} = \left( {logX_{iA} \left( {t_{i1} } \right), logX_{i2A} \left( {t_{i2} } \right), \ldots ,logX_{iA} \left( {t_{{iL_{i} }} } \right)} \right)\) and the logratio trajectory between components A and B for individual i is given by:
We summarize the logratio trajectory between components A and B for individual i within two time points \(l_{1}\) and \(l_{2}\) as the integral of the logratio trajectory:
where the values of the logratio for \(t \notin \left( {t_{i1} , t_{i2} , \ldots , t_{{iL_{i} }} } \right)\) are linearly interpolated.
We do not take the absolute value in Eq. (4) because the sign of the integral is informative: Positive values of \(s_{i} \left( {A,B} \right)\) correspond to trajectories of component A above trajectories of component B, that is, larger relative abundances of A with respect to B, while negative values represent the opposite. Values of \(s_{i} \left( {A,B} \right)\) around zero can represent similar abundances between A and B over time or a nonhomogeneous trend between A and B within the observed region.
Another advantage of the summary \(s_{i} \left( {A,B} \right)\) is computational. Since the integral is linear, \(s_{i} \left( {A,B} \right)\) is equal to the difference between the integrals of logtransformed microbiome abundances of taxa A and taxa B:
Thus, the number of integrals to be calculated is of the order of K, the number of taxa, instead of \(K\left( {K  1} \right)/2\), the number of pairwise logratios.
Microbial signature based on logratio analysis
To identify those logratios that are most associated with the outcome Y, we implement glm penalized regression on the logratio summaries of all pairs of taxa:
where \(s\left( {j,k} \right)\) is the summary of the logratio trajectory corresponding to components \(X_{j}\) and \(X_{k}\).
Equation (5) is identical to Eq. (1) for crosssectional studies except for the change of the pairwise logratios by the summary of the logratios trajectories. Thus, the inference and variable selection process is performed similarly with elasticnet penalized regression within a crossvalidation process using cv.glmnet() from the R package glmnet [14].
For each individual, \(i \in \left\{ {1, \ldots , n} \right\}\), the microbial signature score is given by \(M_{i} = \mathop \sum \limits_{1 \le j < k \le K } \hat{\beta }_{jk} \cdot s_{i} \left( {j,k} \right)\). Because of the linearity of the integrals used as summaries of the logratio trajectories and following the same reparameterization than in Eq. (3), M can be rewritten in terms of the selected single taxa which is more interpretable than the selected pairs of components:
where \(\hat{\theta }_{j} = \mathop \sum \limits_{k = j + 1}^{K} \hat{\beta }_{jk}  \mathop \sum \limits_{k = 1}^{j  1} \hat{\beta }_{kj}\).
Since \(\mathop \sum \limits_{k = 1}^{K} \hat{\theta }_{k} = 0\), the microbial signature M is the integral of the trajectory of a logcontrast function involving the selected taxa (those with \(\hat{\theta }_{k} \ne 0)\) and, similarly to the signatures for crosssectional data, it can be interpreted as a weighted balance between two groups of taxa, \(G_{1}\) and \(G_{2}\), the taxa with a positive coefficient vs those with a negative coefficient.
coda4microbiome main functions
The package coda4microbiome [10] contains several functions that implement the proposed algorithms. The method for the identification of microbial signatures in crosssectional studies (“Microbioal signature based on logratio analysis: crosssectional studies” Section) is implemented in function coda_glmnet() and the method for longitudinal data (“Microbial signature based on logratio analysis: longitudinal studies” Section) is implemented in function coda_glmnet_longitudinal().
The library also contains additional functions like plot_signature_curves() that provides a plot of the signature trajectories or filter_longitudinal() that filters those individuals and taxa with enough longitudinal information.
The coda4microbiome methodology is visually described with a pictogram in the supplementary material (Additional file 1: Fig. S1).
Simulation study
We performed a case–control simulation study to evaluate the discrimination (or classification) performance and computational burden of coda4microbiome in comparison to other methods used for microbiome analysis: selbal [33], ANCOMBC [23], ALDEx2 [13], DESeq2 [25], edgeR [32], metagenomeSeq [30], and LinDA [40].
Both, coda4microbiome and selbal, provide a classification model (microbial signature) that defines how the selected taxa are combined. For the other methods that only provide a set of differentially abundant taxa, the classification model was obtained by fitting a logistic regression model containing the DA taxa. Metagenomic simulated data was generated using faecal samples from the “Global Patterns” dataset [11] as template, following the data generation model described by Weiss et al. [37]. This model generates true positives taxa so that their relative abundances match their real abundance in the environment. As in Weiss et al. [37], some of the simulation parameters were fixed for all scenarios: the number of most prevalent taxa to keep in simulation template (2000 taxa), the number of true positive taxa (100 taxa), and the sequencing depth (2000 reads). Both categories had the same number of samples, being 50 or 100 samples in each group. The effect size of the true positive taxa was set to 1.25, 1.5, 2, 5, 10 or 20. This results in a total of 12 simulated scenarios, and we considered 10 replicates for each one. After simulated metagenomic datasets were generated, taxa with less than 5% of prevalence among samples were removed, and 1 count was added to all taxa abundances to overcome problems with logtransformations.
To evaluate the discrimination accuracy of the different methods, a fivefold crossvalidation process was applied. Samples in each simulation set were randomly grouped into five different cvfold groups, ensuring the same number of cases and controls in each one. For every crossvalidation fold, the train set includes all cvfold groups except one, used for testing the model afterwards. The same cvfold groups assignment in a simulation set were used for testing all the algorithms. For DA methods, a taxa selection step was performed on the train set based on the significance of the Benjamini–Hochberg adjusted pvalue [4] with a threshold of 0.05. Relative abundances of selected taxa were used to fit a logistic regression model able to classify the two groups. coda4microbiome and selbal were trained on the train set and the obtained microbial signature was evaluated on the test set. For all methods, the measure of performance was the Area Under the ROC Curve (AUC). We also compared the number of taxa selected by each method and the computational time.
Results
Crosssectional data: Crohn’s disease (CD) study
We illustrate coda4microbiome algorithm for crosssectional studies with data from a pediatric Crohn’s disease (CD) study [16]. The dataset, available at coda4microbiome package, includes microbiome compositions of 975 individuals, 662 with CD and 313 without any symptoms. The abundance table agglomerated at the genus level contains 48 genera.
We implemented coda4microbiome::coda_glmnet() function to the Crohn’s dataset. The algorithm identifies that the outcome is binary and implements a penalized logistic regression. The results of the analysis provide a first plot (Fig. 1) showing the crossvalidation accuracy (AUC) curve from cv.glmnet(). For the default lambda (“lambda.1se”), the algorithm selects 27 pairwise logratios that, as we will see later, correspond to 24 different taxa.
The results of coda_glmnet include the number, the name, and the coefficients of the selected taxa. These can be visualized in a bar plot where the selected taxa and the corresponding coefficients are represented (Fig. 2).
A third plot describes the discrimination capacity of the selected microbial signature (Fig. 3). This is accompanied with three classification accuracy measures: the apparent AUC, i.e., the AUC of the signature applied to the same data that was used to generate the model, and the mean and sd of the crossvalidation AUC obtained from the output of cv.glmnet(). For this dataset, the apparent AUC is 0.84 and the mean (sd) crossvalidation AUC are 0.82 (0.0081).
When the outcome is a continuous numerical variable, coda_glmnet() function implements penalized linear regression and Fig. 3 is a scatter plot between predictions and the outcome values.
Longitudinal data: early childhood and the microbiome study
To illustrate coda4microbiome for longitudinal studies we use data from the “Early childhood and the microbiome (ECAM) study” that followed a cohort of 43 U.S. infants during the first 2 years of life for the study of their microbial development and its association with earlylife antibiotic exposures, cesarean section, and formula feeding [7, 8].
Metadata and microbiome data were downloaded from https://github.com/caporasolab/longitudinalnotebooks. Initially the data contained information on 43 child and 445 taxa at the genus level. We filtered those individuals and taxa with enough information for timecourse profiling: we removed individuals with only one timepoint observation and those taxa with less than 30 children (70% of individuals) with at least 3 nonzero observations over the followup period. After this filtering, the data reduced to 42 children and 37 taxa.
Here we focus on the effects of the diet on the early modulation of the microbiome by comparing microbiome profiles between children with breastmilk diet (bd) vs. formula milk diet (fd) in their first 3 months of life.
Using function coda_glmnet_longitudinal(), we identified a microbial signature with maximum discrimination accuracy between the two diet groups. The signature is defined by the relative abundances of two groups of taxa, \(G_{1}\) and \(G_{2}\), where \(G_{1}\) is composed of 6 taxa (those with a positive coefficient in the regression model) and \(G_{2}\) is composed of 2 taxa (those with a negative coefficient) (Table 1 and Fig. 4). Group \(G_{1}\) is mainly dominated by three taxa within the order Clostridiales (family Ruminococcaceae (2) and gender Blautia) and one taxon within the gender Actinomyces. Two taxa (g_Veillonella and f_Lachnospiraceae) have a coefficient close to zero and will have a very small contribution to the signature. Group \(G_{2}\) is composed by two taxa within the genders Haemophilus and Staphylococcus.
The trajectories of the microbial signature over the observed period are represented in Fig. 5, where the color of the curves corresponds to the diet group. Each trajectory represents the relative mean abundances between the two taxa groups for each child. We can see that the two groups are clearly separated. Those children under breastmilk diet (in orange) usually have trajectories below zero, which means they have more relative mean abundance of g_Haemophilus and g_Staphylococcus with respect to the relative abundance of taxa in group \(G_{1}\), while children with formula milk diet (in blue) have more relative abundance of taxa in group \(G_{1}\) relative to \(G_{2} .\)
Figure 6 displays the distribution of the microbial signature scores for the two diet groups and offers a visual assessment of the (apparent) discrimination accuracy of the signature. Quantitatively, the apparent discrimination accuracy of the signature (i. e. the AUC of the signature applied to the same data that was used to generate the model) is 0.96 and the mean crossvalidation AUC is 0.74 (sd = 0.10).
The results are consistent with previous studies on the association of the infant gut microbiome composition and breastmilk feeding practices. In Fehr et al. [12], Haemophilus parainfluenzae and Staphylococcus were found to be enriched with exclusive breastmilk feeding together with lower prevalence of Actinomyces at 3 months. Lachnospiraceae (Blautia) was enriched among infants who were no longer fed breastmilk. Similar results are reported in Laursen et al. [22] where the duration of exclusive breastfeeding was negatively correlated with genera within Lachnospiraceae (e.g., Blautia) and genera within Ruminococcaceae. Positive correlations with exclusive breastfeeding were observed for g_Bifidobacterium and Pasteurellaceae (Haemophilus).
Simulation study results
Figure 7 show the number of selected taxa by each method for simulated datasets with different effect sizes (1.25, 1.5, 2, 5, 10 and 20) and 100 samples per group. Similar results are obtained for simulations with 50 samples per group (results not shown). For all methods, except for coda4microbiome and selbal, the larger the effect size, the more taxa are selected, as it is expected since the power of the DA tests increases with larger effect sizes. The opposite is true for coda4microbiome and selbal, as the effect size increases, less variables are needed in the model to obtain good classifications. Despite the fold effect and sample size, coda4microbiome finds a predictive microbial signature with less features than selbal, with similar AUCs.
Figures 8 provides two different representations of the discrimination accuracy (AUC) of each method: a boxplot distribution and a line plot of the mean cvAUC for the 10 replicates of each scenario and 50 samples per group. Figure 9 provides the same information for the case of 100 samples per group. The numerical results (mean and sd) are detailed in Table 2. coda4microbiome and selbal perform similarly in all scenarios. The higher classification accuracy of coda4microbiome and selbal is especially remarkable in scenarios with low effect size and a small sample size (Fig. 8). For an effect size equal to 1.25 the mean cvAUC of these two methods is 0.763 and 0.795, respectively, while all the other methods have mean cvAUC below 0.6. For an effect size of 1.5, coda4microbiome and selbal discrimination is 0.943 and 0.912, respectively, and only DESeq2 and edgeR have a good performance, though with lower discrimination values (0.851 and 0.841, respectively). All the other methods methods have mean cvAUC below 0.6. For larger effect sizes the performance of all the methods if good (discrimination around 1) except for LinDA that has a poor performance in all the scenarios. Similar results are obtained for larger sample sizes (Fig. 9). In this case, DESeq2 and edgeR perform very well, with discrimination accuracy still slightly lower than coda4microbiome and selbal when the effect size is equal to 1.25 but slightly larger to coda4microbiome when the effect size is equal to 1.5. All methods, except LinDA, reach AUCs over 0.9 in simulations with a fold effect of 2, 5 or 10. On scenarios with very high fold effect, such as 20, classification performance decreases for most of the methods except for coda4microbiome, selbal and ALDEx2.
Figure 10 show the computational times for each method for different effect sizes (1.25, 1.5, 2, 5, 10 and 20) and 100 samples per group. Similar results are obtained for simulations with 50 samples per group (results not shown). ANCOMBC and selbal are the two methods that spent more time in the analysis. coda4microbiome is clearly more computationally efficient than selbal.
Real datasets analysis
In order to better compare the computational times of coda4microbiome over selbal, we applied both methods to 8 real datasets available at [28]. Table 3 shows the great improvement of coda4microbiome in comparison with selbal, especially when using the recommended function selbal.cv() that, as coda4microbiome, implements crossvalidation in the analysis and thus, provides more robust results. In fact, because of the computational burden of selbal.cv(), we only could use selbal() function in the simulations. Table 3 is ordered according to the number of taxa in each dataset and this allows to easily see the high correlation between computational time and the number of features.
Discussion
coda4microbiome algorithm represents an improvement of our previous algorithm selbal [33]. Both, coda4microbiome and selbal search for two groups of taxa, A and B, that are jointly associated with the outcome of interest Y. The main differences between both algorithms are (1) the model for combining the relative abundances of taxa in group A and B, (2) the process for selecting the taxa that will constitute the microbial signature and (3) the type of study that can be approached with each method:

(1)
selbal expresses the microbial signature as an ilr balance between A and B [31], i.e., as the logratio of the geometric mean abundances of taxa in group A vs taxa in group B. Instead, coda4microbiome microbial signature is expressed as a logcontrast model where those taxa with a positive coefficient define group A, those with a negative coefficient define group B and those with a zero coefficient are not part of the microbial signature.

(2)
selbal performs forward selection and coda4microbime implements elasticnet penalized regression variable selection.

(3)
selbal is only available for crosssectional studies while coda4microbime is implemented for both crosssectional and longitudinal studies.
In summary, coda4microbiome improves selbal by considering a more general model (an ilr balance is a special logcontrast), a more powerful variable selection process (forward selection does not ensure a global optimum) and can be used in both, crosssectional and longitudinal studies.
The results of our simulations indicate that when the aim is classification, DA tests followed by fitting a regression model with the selected significant taxa perform worse than coda4microbiome or selbal, which are methods specifically developed for model prediction. coda4microbiome performs very well even in situations where the fold change of the associated taxa is quite low (e.g. 1.25), which is probably the case for most of real microbiome associations. Under such small fold effects, other methods such as edgeR, DESeq2, ALEDx2, ANCOMBC, MetagenomeSeq and LinDA perform poorly. Selbal instead, performs similarly to coda4microbiome with good discrimination accuracy for the same simulation scenarios. Though selbal and coda4microbiome have similar classification power, the latest requires less computational time which is an important advantage especially for datasets with a large number of features.
Conclusions
We developed an R package for microbiome analysis that deals with the compositional nature of microbiome data in both, crosssectional and longitudinal studies. coda4microbiome provides a set of functions to explore and study microbiome data within the CoDA framework, with a special focus on identification of microbial signatures that can serve as biomarkers of disease risk and prognostic. The results are expressed as the (weighted) balance between two groups of taxa, those that contribute positively to the microbial signature and those that contribute negatively. The interpretability of results is of major importance in this context. The package provides several graphical representations that facilitate the interpretation of the analysis and the identified microbial signatures.
The main difference between coda4microbiome and other CoDA methods that also employ the logratio approach, such as ALDEx2 [13], ANCOMBC [23] or fastANCOM [39], is that they perform differential abundance testing while coda4microbiome is focused on prediction. coda4microbiome improves our previous algorithm, selbal [33]. Both have similar performance but coda4microbiome is more computationally efficient.
Longitudinal microbiome studies, especially those focused on the human microbiome, have usually low resolution: the number of individuals is small, each individual has few observation times, the observations of the different individuals are not made at exactly the same time, the data are very variable, the expected behavior of the abundance trajectories is not linear or quadratic, etc. This makes it difficult to justify and implement a parametric modeling of trajectories and limits the use of models for longitudinal data (time series, mixed models). In this context, a description of the trajectories such as the one we propose, although less precise, allows to extract valuable information from the data as we have shown in the example. Other longitudinal data modeling strategies [1, 15, 29, 35] could be used in longitudinal microbiome studies with higher resolution such as laboratory or animal experimental studies. Simulation studies should be performed to assess the performance of coda4microbiome for longitudinal microbiome data against other existing methods.
With this new R package, we aim to enhance microbiome analysis by taking into consideration the compositional nature of microbiome data through the use of compositional data analysis methods.
Availability of data and materials
The algorithm is implemented as an R package code4microbiome available at CRAN (https://cran.rproject.org/web/packages/coda4microbiome/) Project name: coda4microbiome. Project home page: https://malucalle.github.io/coda4microbiome/. Operating system(s): Platform independent. Programming language: R. Other requirements: R (≥ 3.5.0). License: MIT + file LICENSE. Any restrictions to use by nonacademics: none. The datasets used to illustrate the algorithm is available as a data object in the “coda4microbiome” package.
References
Äijö T, Müller CL, Bonneau R. Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing. Bioinformatics. 2018;34(3):372–80. https://doi.org/10.1093/bioinformatics/btx549.
Aitchison J. The statistical analysis of compositional data. J R Statist Soc. 1982;44:139–77.
Aitchison J, BaconShone J. Log contrast models for experiments with mixtures. Biometrika. 1984;71:323–30.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57(1):289–300.
Bharti R, Grimm DG. Current challenges and bestpractice protocols for microbiome analysis. Brief Bioinform. 2021;22(1):178–93.
Bates S, Tibshirani R. Logratio lasso: scalable, sparse estimation for logratio models. Biometrics. 2019;75:613–24.
Bokulich NA, Chung J, Battaglia T, Henderson N, Jay M, Li H, Lieber AD, Wu F, PerezPerez GI, Chen Y, Schweizer W, Zheng X, Contreras M, DominguezBello MG, Blaser MJ. Antibiotics, birth mode, and diet shape microbiome maturation during early life. Sci Transl Med. 2016;8:343ra82. https://doi.org/10.1126/scitranslmed.aad7121.
Bokulich NA, Dillon MR, Zhang Y, Rideout JR, Bolyen E, Li H, Albert PS, Caporaso JG. q2longitudinal: longitudinal and pairedsample analyses of microbiome data. mSystems. 2018;3:e00219e318. https://doi.org/10.1128/mSystems.0021918.
Calle ML. Statistical analysis of metagenomics data. Genomics Inform. 2019;17(1): e6.
Calle ML, Susin A. coda4microbiome: Compositional Data Analysis for Microbiome Studies https://cran.rproject.org/package=coda4microbiome. (2022).
Caporaso JG, Lauber CL, Walters WA, BergLyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA. 2011;108(SUPPL. 1):4516–22. https://doi.org/10.1073/PNAS.1000080107/SUPPL_FILE/PNAS.201000080SI.PDF.
Fehr K, Moossavi S, Sbihi H, Finlay B, Turvey SE, Azad MB. Breastmilk feeding practices are associated with the cooccurrence of bacteria in mothers’ milk and the infant gut: the CHILD Cohort study. Cell Host & Microbiome. 2020;28(2):285297.e4. https://doi.org/10.1016/j.chom.2020.06.009.
Fernandes AD, Reid JN, Macklaim JM, et al. Unifying the analysis of highthroughput sequencing datasets: characterizing RNAseq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22. https://doi.org/10.18637/JSS.V033.I01.
Gerber GK, Onderdonk AB, Bry L. Inferring dynamic signatures of microbes in complex host ecosystems. PLoS Comput Biol. 2012;8(8):e1002624. https://doi.org/10.1371/journal.pcbi.1002624.
Gevers D, Kugathasan S, Denson LA, VázquezBaeza Y, Van Treuren W, Ren B, Schwager E, Knights D, Song SJ, Yassour M, Morgan XC, Kostic AD, Luo C, González A, McDonald D, Haberman Y, Walters T, Baker S, Rosh J, Stephens M, Heyman M, Markowitz J, Baldassano R, Griffiths A, Sylvester F, Mack D, Kim S, Crandall W, Hyams J, Huttenhower C, Knight R, Xavier RJ. The treatmentnaïve microbiome in newonset Crohn’s disease. Cell Host Microbe. 2014;15:382–92.
Gloor GB, Wu JR, PawlowskyGlahn V, Egozcue JJ. It’s all relative: analyzing microbiome data as compositions. Ann Epidemiol. 2016;26(5):322–9. https://doi.org/10.1016/j.annepidem.2016.03.003.
Gloor GB, Reid G. Compositional analysis: a valid approach to analyze microbiome high throughput sequencing data. Can J Microbiol. 2016;62(8):692–703. https://doi.org/10.1139/cjm20150821.
Gloor GB, Macklaim JM, PawlowskyGlahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front Microbiol. 2017;8:2224.
Greenacre M. Compositional data analysis. Annu al Rev Stat Appl. 2021;8:271–99.
Hu Y, Satten GA, Hu YJ. LOCOM: a logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control. Proc Natl Acad Sci. 2022;119(30): e2122788119.
Laursen MF, Andersen LBB, Michaelsen KF, Mølgaard C, Trolle E, Bahl MI, Licht TR. Infant gut microbiota development is driven by transition to family foods independent of maternal obesity. MSphere. 2016;1(1):e00069e115. https://doi.org/10.1128/mSphere.000691.
Lin H, Peddada S. Analysis of compositions of microbiomes with bias correction. Nat Commun. 2020;11(1):1–11. https://doi.org/10.1038/s41467020170417.
Lo BC, et al. Gut microbiota and systemic immunity in health and disease. Int Immunol. 2021;33:197–209.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15(12):1–21. https://doi.org/10.1186/S1305901405508/FIGURES/9.
Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis. 2015;26:27663. https://doi.org/10.3402/mehd.v26.27663.
MartínFernández JA, Hron K, Templ M, Filzmoser P, PalareaAlbaladejo J. Modelbased replacement of rounded zeros in compositional data: classical and robust approaches. Comput Stat Data Anal. 2012;56:2688–704.
Nearing JT, Douglas GM, Hayes MG, MacDonald J, Desai DK, Allward N, Jones CAM, Wright RJ, Dhanani AS, Comeau AM, Langille MGI. Microbiome differential abundance methods produce different results across 38 datasets. Nat Comm. 2022;13:342.
Park Y, Ufondu A, Lee K, Jayaraman A. Emerging computational tools and models for studying gut microbiota composition and function. Curr Opin Biotechnol. 2020;66:301–11. https://doi.org/10.1016/j.copbio.2020.10.005.
Paulson JN, Colin Stine O, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013;10(12):1200–2. https://doi.org/10.1038/nmeth.2658.
PawlowskyGlahn V, Egozcue JJ, TolosanaDelgado R. Modeling and analysis of compositional data: Statistics in practice. Chichester: Wiley; 2015.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. https://doi.org/10.1093/BIOINFORMATICS/BTP616.
RiveraPinto J, Egozcue JJ, PawlowskyGlahn V, Paredes R, NogueraJulian M, Calle ML. Balances: a new perspective for microbiome analysis. MSystems. 2018;3(4):1–12. https://doi.org/10.1128/msystems.0005318.
Schmidt T, Raes J, Bork P. The human gut microbiome: from association to modulation. Cell. 2018;172:1198–215. https://doi.org/10.1016/j.cell.2018.02.044.
Silverman JD, Durand HK, Bloom RJ, Mukherjee S, David LA. Dynamic linear models guide design and analysis of microbiota studies within artificial human guts. Microbiome. 2018;6:202. https://doi.org/10.1186/s4016801805843.
Susin A, Wang Y, Lê Cao KA, Calle ML. Variable selection in microbiome compositional data analysis. NAR Genomics Bioinform. 2020;2(2):lqaa029.
Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, VázquezBaeza Y, Birmingham A, Hyde ER, Knight R. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5(1):1–18. https://doi.org/10.1186/s401680170237y.
Zheng D, Liwinski T, Elinavet E. Interaction between microbiota and immunity in health and disease. Cell Res. 2020;30:492–506.
Zhou C, Wang H, Zhao H, et al. fastANCOM: a fast method for analysis of compositions of microbiomes. Bioinformatics. 2022;38(7):2039–41.
Zhou H, He K, Chen J, Zhang X. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol. 2022;23(1):1–23. https://doi.org/10.1186/S13059022026555/FIGURES/5.
Acknowledgements
Not applicable.
Funding
This work was partially supported by the Spanish Ministry of Economy, Industry and Competitiveness, references PID2019104830RBI00 (M.L.C), PID2021123657OBC33 (A.S) and PID2021122136OBC21 (A.S.).
Author information
Authors and Affiliations
Contributions
MC and AS participated in all the stages of the project: implementation of the application, writing the documentation and preparing the manuscript, analysis and interpretation of results. MP implemented the simulation analysis. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Pictogram of coda4microbiome algorithm.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Calle, M.L., Pujolassos, M. & Susin, A. coda4microbiome: compositional data analysis for microbiome crosssectional and longitudinal studies. BMC Bioinformatics 24, 82 (2023). https://doi.org/10.1186/s12859023052053
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023052053
Keywords
 Compositional data analysis
 Logratio analysis
 Longitudinal studies
 Microbiome analysis
 Microbial signatures
 Penalized regression