ProgPerm: Progressive permutation for a dynamic representation of the robustness of microbiome discoveries

Background Identification of features is a critical task in microbiome studies that is complicated by the fact that microbial data are high dimensional and heterogeneous. Masked by the complexity of the data, the problem of separating signals (differential features between groups) from noise (features that are not differential between groups) becomes challenging and troublesome. For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, as the probability of a type I error (false positive) increases dramatically with the large numbers of hypotheses. Moreover, the grouping effect of interest can be obscured by heterogeneity. These factors can incorrectly lead to the conclusion that there are no differences in the microbiome compositions. Results We translate and represent the problem of identifying differential features, which are differential in two-group comparisons (e.g., treatment versus control), as a dynamic layout of separating the signal from its random background. More specifically, we progressively permute the grouping factor labels of the microbiome samples and perform multiple differential abundance tests in each scenario. We then compare the signal strength of the most differential features from the original data with their performance in permutations, and will observe a visually apparent decreasing trend if these features are true positives identified from the data. Simulations and applications on real data show that the proposed method creates a U-curve when plotting the number of significant features versus the proportion of mixing. The shape of the U-Curve can convey the strength of the overall association between the microbiome and the grouping factor. We also define a fragility index to measure the robustness of the discoveries. Finally, we recommend the identified features by comparing p-values in the observed data with p-values in the fully mixed data. Conclusions We have developed this into a user-friendly and efficient R-shiny tool with visualizations. By default, we use the Wilcoxon rank sum test to compute the p-values, since it is a robust nonparametric test. Our proposed method can also utilize p-values obtained from other testing methods, such as DESeq. This demonstrates the potential of the progressive permutation method to be extended to new settings. Supplementary Information The online version supplementary material available at 10.1186/s12859-021-04061-3.

Theorem 2. Suppose n 1 , n 2 and k are non-negative integers. The combination coefficient n1 k n2 k approaches its maximum, when k equals the closest integer greater than n1n2−1 n1+n2+2 .

S2. COMPUTATIONAL TIME
We describe the computational burden as follows. In each permutation scenario, we have ν draws. In each draw, we perform p independent tests. If we add the number of tests across the all the permutation scenarios, we have K k=1 νp = N p K k=1 log n 1 k + log n 2 k ≤ N p log K k=0 n 1 k n 2 k = N p log N K ≤ N pK(1 + log N − log K) If the two groups have the same sample size, meaning n 1 = n 2 , then K = N/2. The time complexity will be less than 0.85 × pN 2 . For example, if the sample size N is 100, the number of microbiome features p is 1000, the time of running a Wilcoxon Rank Sum Test is 0.004 seconds, then the total time of executing the progressive permutation will be 9 hours. However, utilizing the parallel computing (implemented via the "doParallel" R package) on an 8-core computer produces a running time of 30 minutes.

S3. DISTRIBUTION OF ZEROS
Zero-Inflation is one of the main characteristics of microbiome data. Figure S1 shows the distribution of zeros across samples and variables of SimData 1, SimData 2 and SimData 3. The distribution is comparable to the distribution of zeros in DeFilippo Data S2. In this section, we provide the results of the permutation method with Wilcoxon test and DESeq. We plot the traces of − log 10 p-values in Figure S3, S5, S7, and S9. We plot the number of significant features in Figure S4, S6, S8, and S10. In general, the traces of p-values from DESeq method spread out more (with a wider range) than the traces of p-values from the Wilcoxon test. For setting 1, the number of significant features does not approach zero in the full permutation scenario when using the permutation method with DESeq. Both the methods are implemented on simulation data Set 1, which varies the zero inflation parameters for each variable. The data contains a dense signal, where the number of true differential features is 70.

S5. RESULTS OF CONTINUOUS OUTCOME
In medical research, continuous outcomes, such as BMI, clinical scores, are available and linked with human microbiome. However, "the curse of dimensionality" can also be confronted when we use a canonical multiple linear regression to estimate the associations. More taxa would be identified as significantly associated with outcome variable, if more OTUs were generated and included in linear models. Because the coefficient of determination R 2 (R is multiple correlation between outcome and regressors) is increasing with the number of regressors in the model [1]. Besides, the sign and amount of coefficient estimates are usually messed up by heterogeneity of the data. With these similar motivations in the binary response case, we extended the progressive permutation for a continuous outcome.
We describe the progressive permutation procedure for a continuous outcome as follows. Suppose we collect N samples of microbiome specimens and obtain p microbiome features. We use k = {0, 1, · · · , K} to describe the progressive permutation scenarios. k = 0 describes the original data without any permutation. We use K to describe the maximal permutation scenario which can be specified by user. For instance, we set up K to be 10 in the following application. At each permutation scenario k, we randomly split the outcome variable into two parts with sample sizes of n 1 k = N k K and n 2 k = N (1 − k K ) in each part. We permute the outcome values in the first part and keep the second part unchanged. Then we perform p correlation tests to associate each microbiome feature with the permutated continuous outcome (such as Kendall 's tau or Spearman Rank Correlation tests) and obtain all the p-values. We can calculate the number of significant taxa as nsig(k) = p j=1 I pj (k)≤α , where α is the prespecified significance level. We expect to see the lowest nsig(k) in the fully mixing scenario K. The number of significant hits nsig(k) decreases with the proportion of mixing k/K.
At each permutation scenario k (1 ≤ k < K), we start from a random seed and try a subset of ν = N log N n 1 k draws. Therefore, for each variable j, we obtain ν samples of p-values p j (k) and numbers of significant taxa nsig(k). We summarize the distribution of these samples by their medians and 2.5%-97.5% quantile intervals. To visualize these p-values in an organized manner, we rank the significance of all the variables in the observed data, and then plot their − log 10 p-values with the same order across permutation scenarios. In general, the paralleled traces of − log 10 p-values of more significant variables will be on the top of less significant ones. With the increase of mixing, the significant p-values gradually become nonsignificant, indicating that the signal is weaker and the noise is stronger. As there would be almost no signal if data were fully mixed, all the p-values are close to 1 at the full permutation scenario k = K. Therefore, we could observe a decreasing trend in the number of significant features from no permutation scenario (k = 0) to the full permutation scenario k = K.
We applied the progressive permutation method to a real data set, which investigates the associations between microbiome and fatigue. The fatigue score ranges from 0 to 10. We treated it as a continuous outcome. The sample data set includes 88 samples and 841 microbial taxa. The generated results are shown as follows. The overall association between microbiome compositions and fatigue is not very strong, since the trace plot of number of significant features is flat (shown in FIG. S11a). There are not many features with -log10(p-value) more than 1.3 (-log10(0.05)) (FIG. S11b). The number of significant features obtained from differential tests on observed data is 35. Most of the effect sizes range between -3 and 3 (FIG. S11c). The correlations range between -0.25 and 0.25 (FIG. S11d), which are weak. The conclusion of weak overall association can be made from FIG. S12, as AOI is 0.031, AUMC is 0.029 and slope is -0.031. FIG. S13 lists all the fragility index of the top 35 significant features (p-value less than 0.05 in observed data) with a decreasing order. Enterococcus faecalis and Enterococcaceae are the two taxa Plot of traces of − log 10 p-values vs. proportion of mixing. Both methods are implemented on simulation data Set 1, which varies the zero inflation parameters for each variable. The data contains a sparse signal, where the number of true differential features is 30.
with the highest number. Currently, to identify significant features, we compare the p-values of the original data with the p-values of full permutation scenario (separating signals from random noises). FIG.'S14 does a coverage plot of the top 35 features with decreasing order and identifies 18 features to be selected ones. FIG. S14 lists the effect sizes of all the 18 selected features. For instance, Enterococcus faecalis is positively associated with fatigue.
In summary, progressive permutation with continuous variable can not only show the overall association with microbiome compositions, but also identify the significance of individual features. Analogously, we could extend the progressive permutation model to other types of outcomes, such as count data, ordinal variable, etc. By choosing proper test statistics for each data type, we could show the similar converging trend of the p-values.
The distribution of p-value is crucial for correction of false positive rates. As we can see in FIG. S16, with increase of mixing proportion, the p-value leaves the area of significance (p < 0.05) and redistribute themselves to nonsignificant areas. In other words, p-values changes from a skewed distribution (peak is close to 0) to a uniform distribution. Because the p-value is uniformly distributed when the null hypothesis is true meaning that there is no signal in the full permutation scenario. These information might be useful to adjust p-values according to null distributions. Both methods are implemented on simulation data Set 1, which varies the zero inflation parameters for each variable. The data contains a sparse signal, where the number of true differential features is 30.

Coverage plot
FIG. S14: Coverage plot of the top 50 features with decreasing order. The color dots denote the -log10(p-value) of top 50 features in the observed data (mixing proportion is 0). The horizontal bars describe the 95% quantile confidence intervals of the -log10(p-value) in the full permutation scenario (the mixing proportion is 1).