Methodology details
The method consists of four steps, and a flowchart of the method is presented Fig. 1. In the first step (Fig. 1b), OTUs are divided into OTU sets or blocks which will be discussed in detail later. In the second step (Fig. 1c), within each block, a treebased approach will be used to generate a ranking of OTUs based on the depth importance measures that account for their marginal contribution to the phenotype. In the third step (Fig. 1d), we empirically determine the number of top OTUs to form a supertaxon as describe in detail as follows. In the last step (Fig. 1e), top OTUs within each local block are then aggregated into a supertaxon and marginal logistic regression is applied to associate the supertaxon with the phenotype. We describe details in each step as follows.
Set partition
OTUs are first divided into sets by their sequence similarity or biological clustering through RDP classifier [17]. The hierarchical information will be utilized to assist in partition at different levels (Genus, Family, Order, Class).
Ranking
We consider the following generalization of the logistic model,
$$\log \frac{{P\left( {Y_{i} = 1} \right)}}{{P\left( {Y_{i} = 0} \right)}} = F\left( {v_{i} , z_{i} } \right),$$
where \(F\) is a function not limited to be linear, \(i\) is subject index (\(i = 1, \ldots ,n\)), \(Y_{i}\) indicates the disease status, \(v_{i} = \left( {v_{i,1g} , \ldots ,v_{i,Jg} } \right)^{T}\) includes \(J_{g}\) OTU features in an OTU set \(g\) for. \(i\)th subject, and \(z_{i}\) denotes all confounding covariates to be adjusted. To estimate the unknown function \(F\), we consider the treebased method which allows potential nonlinear relations and any possible interactions between OTUs.
We adopt the ranking approach and modify the aggregation approach in Hu et al. [14] to identify the supertaxon of OTUs. It is based on the ranked but generally weak association between an individual OTU and a disease of interest. Within each partitioned set, OTUs are then ranked by their importance in terms of disease discrimination ability. To account for joint effects of multiple OTUs, the importance of each OTU is measured by the socalled depthimportance in a randomforest framework [18, 19], which is a proxy to its effect size and leads to an ordering of OTUs within the set. Specifically, for a set of OTU features \(g\), we construct forest \(f\) consisting of a total number of \(\left f \right\) trees. Each tree in the forest is built without pruning based on a randomly selected subset of variants in \(g\). We define \(v_{jg} = \left( {v_{1,jg} , \ldots ,v_{n,jg} } \right)^{T}\) as the \(j\)th OTU in \(g\)th set across subjects. Then, depth importance score of \(v_{jg}\) in tree \(T\) is defined as
$$V_{T} \left( {v_{jg} } \right) = \sum\limits_{{t \in T,{ }t{ }is{ }split{ }by{ }v_{jg} }} {2^{{  L_{t} }} G_{t} } ,$$
where \(L_{t}\) is the depth of node \(t\) and \(G_{t}\) is the χ.^{2} independence test statistics of node \(t\). The calculation of depth importance considers both effect size and depth of an OTU in the tree. We refer the readers to previous randomforest framework for more detailed calculations on depth importance score [18, 19]. Then the overall depth importance score for \(v_{jg}\) is given by
$$V_{f} \left( {v_{jg} } \right) = \frac{1}{\left f \right}\sum\limits_{T \in f} {V_{T} \left( {v_{jg} } \right)} ,$$
overall \(\left f \right\) trees in the forest \(f\). With the overall depth importance score \(V_{f} \left( {v_{jg} } \right)\) calculated, we can obtain the ordering of OTUs within the set \(g\). Let \(d_{jg}\) be the index of the OTU with the \(j\)th largest depth importance score for \(g\)th set. For example, if we have five OTUs in set \(g\), i.e., \(J_{g} = 5\) and overall depth importance scores for those OTUs are \(V_{f} \left( {v_{g} } \right) = \left( {V_{f} \left( {v_{1g} } \right),V_{f} \left( {v_{2g} } \right),V_{f} \left( {v_{3g} } \right),V_{f} \left( {v_{4g} } \right),V_{f} \left( {v_{5g} } \right)} \right)^{T} = \left( {0.2, 0.4,0.3,0.6,0.1} \right)^{T}\), the OTU with the largest overall depth importance score is the 4th OTU, i.e., \(d_{1g} = 4\). Thus, we end up with ordered OTUs and their ranks \(d_{g} = \left( {d_{1g} , \ldots ,d_{5g} } \right)^{T} = \left( {4,2,3,1,5} \right)^{T}\). For \(i\)th subject, let \(v_{i,djg}\) be the value of \(d_{jg}\)th OTU in \(g\)th set. Define
$$x_{ig} = \left\{ {\begin{array}{*{20}c} {\mathop {\min }\limits_{{1 \le j \le J_{g} }} \left\{ {j: v_{{i,d_{jg} g}} > 0} \right\}, if \exists v_{{i,d_{jg} g}} > 0,} \\ {J_{g} + 1, otherwise,} \\ \end{array} } \right.$$
The creation of \(x_{ig}\) aims to provide guidance on selecting best cutoff in the following steps.
Best cutoff selection
To select the top OTUs, we first need to define how to form a supertaxon from top OTUs. There are two ways of formation. One is called supertaxon for binary features (STB), where the OTUs data are turned into binary one, indicating that we only care about the presence or absence of an OTU. The other is named as supertaxon for continuous features (STC), where the OTUs data are kept as original and abundance information can be utilized. Both STB and STC are proposed to deal with the sparsity issue and complementary each other under various sparsity levels. For STB, the variable is turned into binary for each threshold; that is, for a threshold \(c\),
\(S_{ig} = I\left( {x_{ig} < c} \right)\),
where \(I\left( \cdot \right)\) is the indicator function, and \(c \in \left\{ {x_{1g} , \ldots , x_{ng} } \right\}\). For STC, the variable is turned into mean of nonzero features for each threshold; that is, for a threshold \(c\),
$$S_{ig} = {{\sum\limits_{{d_{jg} \le c}} {v_{igdjg} } } \mathord{\left/ {\vphantom {{\sum\limits_{{d_{jg} \le c}} {v_{igdjg} } } {\sum\limits_{{d_{jg} \le c}} {I\left( {v_{igdjg} > 0} \right)} }}} \right. \kern\nulldelimiterspace} {\sum\limits_{{d_{jg} \le c}} {I\left( {v_{igdjg} > 0} \right)} }},$$
where \(I\left( \cdot \right)\) is the indicator function, and \(c \in \left\{ {x_{1g} , \ldots , x_{ng} } \right\}\). Both transformations are designed to reduce the effect from the unobserved taxa within the set. While STB focuses more on the effect of taxa’s presence, STC further takes the expression level into account. For different sparsity levels of taxa, STB and STC are expected have their own advantages. With the formed supertaxon, a univariate logistic regression is carried out to investigate its effect, and the final threshold is the one gives the smallest pvalue among all possible thresholds.
Marginal associations
With selected OTUs, a supertaxon can be constructed with those OTUs in the depthimportanceordered OTU list and the total number of OTUs used to form the supertaxon is the same as the final threshold. The logistic regression is performed to assess the association and effect between a supertaxon and disease status. We refer readers to Hu et al. [14] for more details.
Simulation setup
In the simulation, performance of our proposed method in detecting differentially expressed OTU features was evaluated. We adopt the same model for generating the synthetic OTUs data as in Osborne et al. [20] to account for high correlations among OTUs. We compare our method with four classical methods listed below.

(1)
DESeq2 [21]: an RNAseq based method that models the observed OTU abundances using negative binomial (NB) distribution.

(2)
Zeroinflated beta regression (ZIBR): an extension of the generalized linear model (GLM) approach that takes sparse nature of OTU data into account [12].

(3)
Analysis of compositions of microbiomes with bias correction (ANCOMBC) method [22]: it models observed abundances using an offsetbased loglinear model. It was shown to control the false discovery rate (FDR) and competed very well with other methods in terms of power in a review paper on comparing statistical methods in differential abundance analysis for microbiome data [22].

(4)
The original method, supervariants, proposed in Hu et al. [14]: it captures the potential interactions among OTUs through randomforest model for binary features without splitting OTUs into sets, which can be considered as a special case of our approach based on OTU level.
Specifically, to mimic the dependence among OTUs, we simulate correlated OTUs from the following model [20].
$$n_{i} \sim N\left( {3000, 250} \right)$$
$$X_{i} \sim {\text{Multinomial}}\left( {h_{i} ,n_{i} } \right)$$
$$h_{i} \sim Dirichlet\left( {\alpha_{i} } \right)$$
$$\log \left( {\alpha_{i} } \right) \sim MVN\left( {Y_{i} B + B_{0} ,\Omega } \right)$$
where \(Y\) is a binary covariate to mimic the casecontrol study, with \(Y = 0\) for \(\frac{N}{2}\) subjects and \(Y = 1\) for the other \(\frac{N}{2}\) subjects. Of them, \(n_{i}\) is the total count in \(i\)th sample. \(h_{i}\) is the relative abundances and \(\alpha_{i}\) is the absolute abundance. Through \({\Omega }\), we can capture the correlations among OTUs. Following Osborne et al. [20], we set \(B_{0j} \sim U\left( {6,8} \right)\) with probability of 0.2 and \(B_{0j} \sim U\left( {2,4} \right)\) with probability of 0.8, which allows that some variables have larger counts and others to be sparser, as common in microbiome data. This controls the average sparsity for OTUs to be around 0.6 in the simulated dataset.
For evaluating type I error rate, we set \(B\) to 0. \({\Omega }\) is generated from the random graph as described in Osborne et al. [20]. The simulation is carried with \(N = 800\) subjects and 1000 OTUs into 20 sets or blocks (50 OTUs per block). Since there are multiple blocks, we evaluate type I error rate through familywise error rate (FWER), where an error is made if any set is rejected. To control the familywise error rate, we divide the data into two even sets. The first set is used for performing steps a–d in Fig. 1 while the second set is applied to the final step in Fig. 1. For the competing four methods, since they are only able to perform univariate testing for each OTU, the FWER in this case is calculated if a set is rejected if any OTU in the set is rejected. The FWER is calculated at nominal level 0.05.
In terms of power evaluation, as in the previous setting, we simulate 800 samples with 1000 OTUs into 20 blocks (50 OTUs per block). Of those, only 40 OTUs are associated with the binary covariates and scattered evenly in the first 4 blocks (10 true OTUs / block). We simulate elements in \(B\) as follows:

(1)
For 40 binary covariates associated OTUs, \(B_{j} \sim U\left( {0.5,1} \right)\) with probability of 0.5 and \(B_{j} \sim U\left( {0.1,5} \right)\) with probability of 0.5.

(2)
For the rest of OTUs, \(B_{j} = 0\).
\(\Omega\) is generated from the random, hub and cluster graph as described in Osborne et al. [20]. The power performances are evaluated based on two perspectives. First, we evaluate the blocklevel identification rate, defined as the number of times a block is selected over 500 repeats. For ZIBR, ANCOMBC and supervariants, since they all generate results for single OTUlevel testing, a block is regarded as selected if any OTU in the block is selected. We also evaluate the sensitivity (TP/(TP + FN)), specificity (TN/(FP + TN)), precision (TP/(TP + FP)) in terms of blocklevel selection. Second, for assessing the OTUlevel identification, we also report the average sensitivity, specificity and precision over repeats.
To further investigate the effects of sparsity levels on methods performance, more numerical studies are conducted to compare the performances of STB and STC under various sparsity levels. The goal of these studies is to provide some empirical experience about the scenarios where one method gains over the other. Specifically, by varying sparsity levels, we investigate on the cases where STB outperforms STC and vice versa to provide some guidance for practical usage.
The same model is used to simulate correlated OTUs. Again, we simulate 800 samples with 1000 OTUs into 20 blocks (50 OTUs per block). Of those, only 40 OTUs are associated with the binary covariates and scattered evenly in the first 4 blocks (10 true OTUs / block). \({\Omega }\) is generated from random graph. We simulate elements in \(B\) as follows:

(1)
For 10 binary covariates associated OTUs in 1st block, \(B_{j} \sim U\left( {0.1,0.5} \right)\).

(2)
For 10 binary covariates associated OTUs in 2nd block, \(B_{j} \sim U\left( {  0.5,  0.1} \right)\)

(3)
For 10 binary covariates associated OTUs in 3rd block, \(B_{j} \sim U\left( {0.5,1} \right)\)

(4)
For 10 binary covariates associated OTUs in 4th block, \(B_{j} \sim U\left( {  1,  0.5} \right)\)

(5)
For the rest of OTUs, \(B_{j} = 0\).
Additionally, we set \(B_{0j} \sim U\left( {a,a + 2} \right)\) with probability of 0.2 and \(B_{0j} \sim U\left( {2,4} \right)\) with probability of 0.8, where \(a\) controls sparsity level. The sparsity varies from 0.4 to 0.8 with \(a\) ranging from 4 to 8. The performance is evaluated through blocklevel identification rate, sensitivity, specificity and precision in terms of blocklevel selection.