tidyMicro: a pipeline for microbiome data analysis and visualization using the tidyverse in R

Background The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities typically are of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results We developed the R package “tidyMicro” to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts’ control over workflow. A complete vignette is provided to aid new users in analysis workflow. Conclusions tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions on standard analyses to improve interpretability results while maintaining object malleability to encourage open source collaboration. The simple examples and full workflow from the package are reproducible and applicable to external data sets.

fields. Technological advances in sequencing accuracy, depth, and throughput now allow comprehensive analysis of microbial communities through both targeted (i.e., markergene surveys, such as 16S rRNA genes) and untargeted (i.e., shotgun metagenomics) approaches. However, the high dimensional nature of these data can be difficult to manage. Research projects often involve management and integration of multiple sequencebased datasets, along with the corresponding clinical data. Merging these data sets together into a workable format can be difficult and time consuming.
This problem inspired microbiome analysis pipelines to construct their own data structures for data management, the most popular in R [6] being the phyloseq-class [7,8]. The majority of omics analysis pipelines available in R rely on "S4" level objects since they provide flexibility beyond data frames and lists. However, these data structures impose severe limitations to the users' control over their analysis and do not accommodate functions from external packages. Furthermore, popular microbiome pipelines focus heavily on community level analyses and exploratory visualizations (e.g. ordination plots, stacked bar charts of raw abundances). Those who offer taxon level models do not provide model visualization techniques. Model visualizations tailored to microbiome data would improve collaboration and convey "big picture" results better than using large summary tables alone.
We introduce a new R package, "tidyMicro, " that extends typical components of microbiome analysis tools by incorporating several new data analysis and visualization tools while following the principles of the tidyverse [9]. These data structures and workflow principles grant the capability to adopt user extensions and integrate external functions, a unique feature that distinguishes tidyMicro from other R-based microbiome analysis pipelines. The viability of the pipeline was evaluated using published sequencing data from a study on bacterial community relationships to bronchopulmonary dysplasia in preterm infants [10] and data from a study on the nasal microbiome of hospital inpatients suffering from Staphylococcus aureus infections [11]. The package is currently available through GitHub and CRAN.

Implementation
The tidyMicro pipeline consists of 5 macro operations with several options within each (Fig. 1). All operations are implemented and supported through R, with graphics and coding style following the general style of the tidyverse. The first step in the pipeline is merging OTU table(s) and clinical data together into a "tidy" format ( Fig. 1a). From here, several functions are provided to explore the data visually (Fig. 1b), calculate and analyze community diversity measures (i.e. alpha-diversity and beta-diversity analyses; Fig. 1c), model abundance and prevalence of individual taxa (Fig. 1d), and summarize and visualize model results (Fig. 1e).
In this paper, we demonstrate a standard workflow using two published data sets. We use data from a study that set out to define the nasal microbiome of hospital inpatients who are persistently colonized with methicillin-resistant Staphylococcus aureus (MRSA) compared with matched, non-colonized controls [11] as our primary example set. We also use a subset of a published data set from a study assessing connections between airway taxa and bronchopulmonary dysplasia (BPD) severity in prematurely born infants [10]. The BPD study allows us to demonstrate a new visualization technique for longitudinal data and the package's ability to handle multiple OTU tables. These examples do not illustrate all possible options and functions available through tidyMicro. Instead they serve as a useful overview of the pipeline's primary functions with full descriptions of novel features. It also demonstrates the pipeline's ability to create reproducible microbiome analyses through popular open source software.

MRSA study
In the MRSA study researchers acquired samples from an active MRSA screening from the Department of Veterans Affairs Eastern Colorado Health Care System. Patients were screened for nasal MRSA colonization at admission, inter-ward transfer, weekly intervals of a ≥ 7 day stay in any ward, discharge, and death. An MRSA persistent carrier case was defined as someone from whom at least 5 swabs had been taken, at least 1 week apart, and ≥ 80% of nasal swabs were positive for MRSA. An MRSA noncarrier control was defined as a someone from whom at least 5 swabs had been taken, at least 1 week apart, and all swabs were negative for MRSA. Noncarrier controls were then matched to persistent carriers based on known colonization risk factors including age, diabetes mellitus, long-term care residence (nursing home), and several others. Bacterial profiles were determined by broad-range amplification and sequence analysis of 16S ribosomal RNA (rRNA) genes. We have classified the sequencing reads into their genus level taxonomic ranks. This cohort's demographics are summarized in Table 1.

Bronchopulmonary dysplasia
In the BPD study, tracheal aspirate samples were collected at 7, 14, and 21 days of life (± 48 h) from pre-term infants requiring mechanical ventilation. This subset of the original data set contains sequences from 24 infants, all ventilated at 7 days of life. Only 15 of these 24 had samples from all 3 time points. Bacterial profiles were determined by broad-range amplification and sequence analysis of 16S rRNA genes. We have classified Fig. 1 Flowchart of the tidyMicro pipeline. The supplied OTU table(s) must be in the standard format output by QIIME with column names that match a sequencing library names column in the clinical data. The initial step is merging all OTU tables and clinical data using the tidy_micro function (a). From here, the tidyMicro set can be used for exploratory visuals (b), community level analyses (c), and taxa level analyses (d, e) the sequencing reads into their phylum, class, order, and family level taxonomic ranks to demonstrate the pipeline's ability to manage multiple OTU tables. These cohort demographics are summarized in Table 2.

Data Structure
In a typical OTU table format [12][13][14], the first column contains OTU names and the following columns contain sequence read counts for each sample (sequencing library). The tidy_micro function can read in any number of OTU tables from targeted or untargeted sequencing methods and merge them with clinical data into a tidy data format. Merging four OTU tables from sequences classified to the phylum, class, order, and family level taxonomic ranks is demonstrated below: This creates a data frame with a hierarchical structure with a "block" for each OTU table and taxa "blocks" within each OTU table block (Fig. 2). The taxa blocks will contain the taxa counts for each sample and the samples' corresponding clinical information.
Within each taxa block tidy_micro also calculates taxa presence/absence, relative abundance, and centered log ratio (CLR) transformed counts Z p = log , X p is the observed count of taxa p , and P is the number of taxa in the OTU table. Aitchison introduced the CLR as a useful transformation of compositional data to a Euclidean space [15]. We add 1 sequencing depth to each taxa's count before the CLR transformation in order to avoid issues with log(0) . Once data are wrangled into this framework the high dimensional nature of the data becomes manageable.

Object Flexibility
The tidyMicro data format complies to the three primary principles of tidy data [16]: 1. each variable forms a column, 2. each observation forms a row, and 3. each type of observational unit forms a table. This malleable data structure allows users to easily work with familiar packages and functions outside of this pipeline. For example, creating a subset of a tidyMicro set is easily accomplished using standard operations within R such as subset or filter. New functions necessary for unique analyses can easily be applied within the OTU table or taxa blocks through apply functions or the tidyverse's d_ply family. Similarly, all visualizations are created using ggplot2, meaning they can be easily customized through additional "geoms". This ability to integrate base and external functions is unique among microbiome analysis pipelines. This flexibility will be key in fostering innovation from future users and encouraging the collaborative nature of open source software.

Exploratory data analyses
Preliminary data exploration is straightforward with a tidyMicro set. We provide several wrapper functions for basic summary statistics of taxa abundance. Tabular outputs are complemented by graphical representations of abundance data, which can be much easier to interpret. We provide several standard data visualization tools including ordination plots (Fig. 3a, b), stacked bar charts (Fig. 3c), and heatmaps (Fig. 3d), as well as two novel visualization methods. The Rocky Mountain plot (Fig. 4) is a variant of the Manhattan plot that displays both the direction and magnitude of correlations between a variable of interest and each taxon within an OTU table, resembling the peaks and valleys seen along the front range of the Rocky Mountains. All taxa are colored by phylum, and correlations that exceed a desired magnitude will be automatically labeled using ggrepel [17]. Neither the CLR transformed counts nor the taxa abundances are normally distributed, so a rank-based correlation (e.g. Spearman, Kendall) is recommended and provided in the package. Three Mode principal component and principle coordinate analyses are tools used to account for the correlation structures of repeated measures within the ordination framework [18]. They collapse over the time component of variation and perform principle component / coordinate analyses on the remaining two dimensions (subject and taxa). For the motivating example, we display plots (using the 15 infants with all 3 sequencing time points) made using principle components (Fig. 5a) and principle coordinates (Fig. 5b).

Taxa filtering
This package provides taxa "filters". Counts from taxa below a specified prevalence and/ or abundance cutoff will be aggregated into an "Other" taxa category. There is also an option to specify taxa that you are not interested in analyzing to be aggregated into this "Other" taxa. This is analogous to the "pruning" function in phyloseq. Filtering can be applied within the initial tidy_micro step or externally at a later stage. Below is an example where counts from taxa present in fewer than 5% of sequence libraries and taxa with less than 0.01% abundance in all sequencing libraries are aggregated together.

Diversity calculations and analyses
Community-level analyses are an important step for understanding microbial community diversity. The community-level measures summarize key features within individual communities (alpha-diversity) and between individual communities (beta-diversity). As is standard among microbiome pipelines, tidyMicro provides optimized algorithms for calculating the following alpha diversity indices from bootstrapped rarefied sequence datasets: S obs , Chao1, Shannon's and Simpson's diversity indices, and Shannon's and Simpson's evenness indices. Also included is Good's coverage estimator, with options to filter out sequencing libraries with low coverage and/or low sequencing depth. tidyMicro gives the option to calculate alpha diversity indices for every OTU table or only one  specified table. If the study contains an OTU table from multiple environments, each  table's diversity might be useful. Since diversities measure how many different taxa are present and how evenly they are distributed, aggregating counts together into few groups could bias these estimates. For this reason, we recommend only calculating diversity measures on the lowest taxonomic rank if the study has OTU tables of different taxonomic ranks from the same site. Calculating alpha diversities for the family level OTU table is demonstrated below: For beta diversity analyses we incorporate the vegan [19] package, which provides a large array of methods to generate dissimilarity tables. vegan also provides a function to test for differences in beta diversities in a pseudo-regression framework using PERMANOVA.

Analyses of Individual Taxa
In parallel with community analyses (i.e., alpha-and beta-diversity), identification of individual taxa that differ between groups of subjects often is of interest to investigators. Taxon abundance can be difficult to model due to strong right skews (Fig. 6a), "U" or "J" shapes (Fig. 6b), and sparsity (Fig. 6c). tidyMicro provides multiple modeling options and automated model summaries enabling greater user control over their analyses of taxon abundances. This level of flexibility and focus on taxa level models is a novel feature of tidyMicro.
Negative binomial and beta binomial regression are the current gold standards for modeling taxon abundance. Each method has advantages and disadvantages. Negative binomial models are often adequate and easier to fit, but they are unable to capture any "U" shapes that may arise in taxa abundances. Beta binomial models offer more flexibility but require more computational time. The negative binomial model coefficients also represent estimated rate ratios, which are a more straightforward representation of associations than the odds ratios provided by beta binomial models. tidyMicro has built in functions to fit negative binomial models using the MASS package [20] or beta binomial models using the VGAM package [21] to every taxa within a specified OTU table. Negative binomial models include sequencing depths as an offset term by default to deal with the compositionality of these data, and the beta binomial models handle this by directly modeling the taxa RA. We provide several useful summaries of the models and taxa abundances. An example of fitting negative binomial models to all genus-level taxa with MRSA infection and smoking status as covariates with sequencing depth as an offset term is demonstrated below: These more complicated parametric models may have convergence issues when applied to microbiome studies and simpler methods may need to be considered. Consequently, tidyMicro includes functions to perform rank-sum tests on taxa abundance or Chi-squared tests on taxa presence to address this potential issue. These functions give the user the option to run the test on all taxa or just the taxa for which the regression models failed to converge.

Model Summaries and Visualizations
Output from (potentially) hundreds of negative binomial or beta binomial models can be difficult to interpret. For this reason, tidyMicro includes model summaries and several model-visualization tools. The modeling functions produce summary tables from each taxa's model (Additional file 1 ) and estimate tables that contain rate ratios (or odds ratios), Wald confidence intervals, and false discovery rate (FDR) adjusted p-values (Additional file 2). When interaction terms are present, the appropriate main effect nb_mods micro_set, table = genus, MRSA_Positive, Smoking, Offset = TRUE . estimates and covariances are summed together before being exponentiated into rate ratios (or odds ratios) and confidence intervals in the estimate table.
tidyMicro includes heatmaps and forest plots of β parameter estimates to help visualize consistent trends. The package also introduces novel regression-based rocky mountain plots and novel parametric stacked bar charts. The regression-based rocky mountain plots are similar to the association based rocky mountain plots used for exploratory data analysis (Fig. 4). FDR p-values of a specified covariate are log transformed and the magnitude is plotted along the y-axis. For positive β estimates, the log(FDR p-value) is multiplied by -1 so the direction along the y-axis corresponds to the direction of the estimated relationship (Fig. 7). Parametric stacked bar charts back-transform estimated β parameters to estimated taxa abundance. Back-transformed estimates allow the user to plot estimated continuous trends in composition and estimated composition controlling for confounders. Both main effects (Fig. 8a) and interactions (Fig. 8b) can be plotted with the parametric stacked bar charts. An example of plotting the relationship between age and MRSA infection is demonstrated below: The top_taxa argument is used to aggregate the estimated abundances of all but the 5 most abundant into an "Other" category. One can also specify an abundance cutoff for taxa to be excluded from the "Other" category. This functionality exists in all stacked bar chart functions within tidyMicro.

Fig. 7
Rocky mountain plot made from negative binomial models. Relationships between MRSA infection and genus level taxa abundance after controlling for smoking status were estimated using negative binomial models using log(sequencing depth) as an offset. All models were fit using the glm.nb function in the MASS package. False discovery rate (FDR) adjusted p-values of estimated β coefficients are log transformed, and the magnitude is plotted along the y-axis. For positive β estimates, the log( FDR p-value) is multiplied by -1, so the direction along the y-axis corresponds to the direction of the estimated relationship

Discussion
As microbiome data and analyses become more common, user accessibility to reliable and extensible analysis pipelines has become increasingly important. Boosting user control through easily understandable and manipulatable data objects is a key step in expanding analysts' toolkits. The tidyMicro pipeline helps meet this need by introducing the principles of tidy data and fully leveraging the tidyverse into microbiome analyses. We also extend functionality available in other popular microbiome pipelines with novel exploratory visualization methods (e.g., Rocky Mountain plots, Three Mode PCA and PCoA), a strong focus on modeling of individual taxa abundances with multiple modeling options, and automated summary and estimate tables.