Data collection
We extracted pre-processed OTU tables along with the corresponding metadata from Qiita [26], a microbiome database and study management platform. Qiita uses third party plugins including QIIME (Quantitative Insights Into Microbial Ecology) (http://qiime.org/) or QIIME 2 (https://qiime2.org/) to process microbial 16S rRNA sequences of each study, which are contributed by users on this platform. Qiita classifies the microbes using the GreenGenes database and generates OTU tables. OTU tables display the counts of all the microbe species present in every sample. Each study and the corresponding raw count data comes from different individuals and institutions, which makes our analysis comprehensive. The metadata includes information about the samples, such as location and sample identification. For easier analysis of the collected data, we separated the downloaded studies from QIITA and pooled them into four independent datasets. We performed our main analysis on one of the pooled datasets and used the other three datasets for validation. In Additional file 1: Table S1, we specify all the studies used and which pooled dataset it belongs to (main, validation datasets 1, 2 or 3). Data from the main dataset is presented in the figures unless explicitly stated otherwise. The data from the OTU tables and metadata are transformed using a log2 scale and were subsequently uploaded onto a web-based tool for analyzing big data called Hegemon [7, 8, 27,28,29].
StepMiner algorithm
To classify a relationship, thresholds are first determined for each microbe using the StepMiner algorithm. The StepMiner algorithm [30] is a tool that helps identify stepwise transitions (either step-up or step-down transitions) calculated using sum-of-square errors. After the data is normalized, steps are defined as the sharpest change between low microbe count and high microbe count. In order to fit a step function, the StepMiner algorithm computes the average of the values on both sides of the step for all possible step positions. The midpoint of the step position that minimizes the square error is chosen as the threshold for each respective microbe. The step is placed at the largest jump from low values to high values and sets the threshold at the point where the step crosses the original data. The microbe counts are normalized and transformed into log2 scale before Boolean analysis. Microbe counts (in log2 scale) are further classified as either ‘high’, ‘low’, or ‘intermediate’. If t is the microbe count threshold, levels above t + 0.5 are ‘high’, levels below t – 0.5 are ‘low’, and levels between t – 0.5 and t + 0.5 are ‘intermediate’. Points in the intermediate region are ignored because these points might appear on either side of the threshold due to noise.
Boolean implication analysis
There are six possible Boolean implications: symmetric (opposite and equivalent) or asymmetric (low → low, low → high, high → low, high → high). The asymmetric relationships are determined by checking if one of the four quadrants in the scatter plot is significantly sparse compared with other quadrants. If A low → B low and A high → B high are both sparsely populated, then A is equivalent to B. If A high → B low and A low → B high are both sparsely populated, then A is opposite to B. The BooleanNet statistic tests [27] determine whether there is a Boolean relationship between A and B. Consider the relationship A low → B high. First, test if the microbe counts in the sparse quadrant are significantly less than the expected counts in an independence model. Let a00, a01, a10, and a11 represent the quadrants in which the microbe counts of A and B are low and low, low and high, high and low, and high and high, respectively.
$$total={a}_{00}+{a}_{01}+{a}_{10}+{a}_{11}$$
$$\mathrm{number of A low counts }= n{A}_{low}=({a}_{00}+{a}_{01})$$
$$\mathrm{number of B low counts }= n{B}_{low}=({a}_{00}+{a}_{10})$$
$$expected=\left(\frac{n{A}_{low}}{total}\times \frac{n{B}_{low}}{total}\right)\times total=\left({a}_{00}+{a}_{01}\right)\times \frac{\left({a}_{00}+{a}_{10}\right)}{total}$$
$$S\; statistic=\frac{expected-observed}{\sqrt{expected}}$$
Second, the observed values in the sparse quadrant are not ideal for Boolean implication formula. They are assumed to be erroneous points for the purpose of analysis only as was described previously in the context of gene expression analyses [27]. However, these points may or may not be erroneous from a real biological point of view. We wanted to discover the general trends of Boolean implication relationships with such a strong assumption. A maximum likelihood estimate of this error rate is then computed:
$$error\;rate=\frac{1}{2}(\frac{{a}_{00}}{{a}_{00}+{a}_{01}}+\frac{{a}_{00}}{{a}_{00}+{a}_{10}})$$
If both tests succeed, the low-low quadrant is considered sparse, so the implication A low → B high is true. An implication is considered significant if the S statistic is greater than 3 and the error rate is less than 0.1. Microbe relationships that pass both of these tests are now considered candidate invariants.
The OTU tables are then uploaded onto Hegemon to visualize the Boolean relationships on scatter plots for comparing two microbes against each other. In each graph, one microbe species’ counts (using OTU ID A) is plotted on the x-axis, and another microbe species’ counts (OTU ID B) is plotted on the y-axis. Each data point represents a sample and the counts are plotted on a log–log scale. Using the graphs constructed on Hegemon, we can visually confirm if a Boolean implication relationship, determined using the BooleanNet statistics, is present.
Metadata analysis
We used a Hegemon function that calculates the differential analysis for specific factors in the metadata using t-tests in R software framework (R version 3.4.4—2018–03-15). We then selected the OTU IDs that had higher mean differential values and higher –log10(p) values. After the list of potential IDs were generated, we visually confirmed on Hegemon whether certain factors such as environment versus animal or body site affects the microbe counts.
Computation of false discovery rate
To evaluate the significance of the Boolean implication relationships found, we computed the FDR for each of the pooled datasets by randomly permuting the counts for each microbe independently 10 times and determining the Boolean relationships for this randomized dataset using the method described above. The FDR is the ratio of the average number of Boolean relationships in the randomized dataset to the original dataset.
Correlation analysis
Correlation analysis was conducted on a subset of 500 microbes from the main dataset. The Pearson correlation coefficients and Boolean relationships for each pair of microbes were calculated to compare these two methods of analysis. The package “pearsonr” from the python library “scipy.stats” was used to perform the calculations.