Randomized quantile residuals for diagnosing zero-inflated generalized linear mixed models with applications to microbiome count data

Background For differential abundance analysis, zero-inflated generalized linear models, typically zero-inflated NB models, have been increasingly used to model microbiome and other sequencing count data. A common assumption in estimating the false discovery rate is that the p values are uniformly distributed under the null hypothesis, which demands that the postulated model fit the count data adequately. Mis-specification of the distribution of the count data may lead to excess false discoveries. Therefore, model checking is critical to control the FDR at a nominal level in differential abundance analysis. Increasing studies show that the method of randomized quantile residual (RQR) performs well in diagnosing count regression models. However, the performance of RQR in diagnosing zero-inflated GLMMs for sequencing count data has not been extensively investigated in the literature. Results We conduct large-scale simulation studies to investigate the performance of the RQRs for zero-inflated GLMMs. The simulation studies show that the type I error rates of the GOF tests with RQRs are very close to the nominal level; in addition, the scatter-plots and Q–Q plots of RQRs are useful in discerning the good and bad models. We also apply the RQRs to diagnose six GLMMs to a real microbiome dataset. The results show that the OTU counts at the genus level of this dataset (after a truncation treatment) can be modelled well by zero-inflated and zero-modified NB models. Conclusion RQR is an excellent tool for diagnosing GLMMs for zero-inflated count data, particularly the sequencing count data arising in microbiome studies. In the supplementary materials, we provided two generic R functions, called rqr.glmmtmb and rqr.hurdle.glmmtmb, for calculating the RQRs given fitting outputs of the R package glmmTMB. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04371-6.

with differential abundance under different conditions. For example, many human microbiome studies aim to identify microbial taxa with differential abundance in healthy and diseased patients [1]. In microbiome data, the microbial taxa are represented by nucleic acid sequences called operational taxonomic units (OTUs) at different levels in an independent taxonomic way [2,3]. In this paper, we use microbiome sequencing data as an example in our discussion for the simplicity of terminologies. However, the applicability of the methods discussed in this paper is not limited to microbiome count data.
Generalized linear models (GLM) are commonly used to model the sequencing count data. Negative-binomial (NB) based regression models are used in many widely used bioinformatics analysis tools and methods [4][5][6][7][8]. Excessive zeros are commonly observed in sequencing count data. For microbiome data, the reason for the excessive zeros is either due to the absence of taxa (structural zeros) or the presence of taxa with a low frequency, which results in observed counts below detection limits (sampling zeros). One way to deal with excessive zeros is to use a zero-inflated model [9], which is a mixture of a regular count regression model, such as Poisson or NB model, and logistic regression to model the excessive zeros. Another way is to use a zero-modified model, also called a hurdle model [10], with one part being a logistic regression model to model the zeros and the other part being the truncated count regression model (e.g. truncated NB) to model the positive count data. Moreover, subjects in microbiome data often have clustering structures, for example, humans from the same family or plants from the same plot. To model the association of the abundance of taxa with such environmental factors, we often use random effects to account for the clustering structure in microbiome study [8,11]. Increasing evidence [12] shows that the zero-inflated models can give better fits (measured by AIC) to sequencing count data than the corresponding models without a zero-inflation component. As such, recently zero-inflated generalized linear models with or without random effects, typically zero-inflated NB (ZINB) models, have been increasingly used to model microbiome and other sequencing count data [1,[11][12][13][14][15][16][17][18][19][20][21][22]. In addition to the applications in sequencing count data, zero-inflated generalized linear mixed models (GLMM) have also been widely applied to model count data arising in a wide variety of fields, such as ecology and epidemiology [23][24][25][26][27][28][29]. The aforementioned features, including zero-inflation, over-dispersion, and clustering, are also commonly observed in the count data collected from these areas.
A common drawback of using a parametric model such as a ZINB model is that the model may fail to provide an adequate fit to a dataset. For example, Hawinkel et al. [30] proposed a specific smooth test for checking the GOF of NB models with applications to a large set of sequencing count datasets and concluded that NB models do not fit well to many of the sequence datasets. The model mis-specification problem has been largely neglected in today's statistical modelling practice, including in bioinformatics. However, the conclusions drawn from poorly fit models may be seriously misleading. In differential abundance analysis, the p values for all taxa are typically converted into q-values or passed to an FDR controlling procedure [31] for controlling the false discovery rate (FDR) at a nominal level. A common assumption in estimating FDR is that the p values are uniformly distributed under the null hypothesis, which holds when the postulated model fits the data of all taxa adequately. When the model is mis-specified, the distribution of the p values under the null hypothesis may be far away from the uniform distribution, often resulting in an underestimate of the true FDR and excess false discoveries. The excess false discoveries may then lead to costly but fruitless follow-up studies on the falsely identified taxa. This problem has been discussed in detail and demonstrated with simulation studies by [17,32,33]. As such, model checking is critical to control the FDR at a nominal level in differential abundance analysis.
It is challenging to conduct model checking and diagnostics for generalized linear models for count data. Akaike's information criterion (AIC) is commonly used to compare the goodness-of-fits of competing models. However, AIC cannot check whether a postulated model is close enough to the true model (e.g. the adequacy of a model). Examining the normality of Pearson's residuals is a standard tool for diagnosing normal regression. Pearson and deviance residuals are often used to diagnose generalized linear models. However, both Pearson and deviance residuals are far from normality for count regression. In particular, Pearson and deviance residuals cluster on curves due to the discreteness [34,35]. Due to the lack of normality, it is challenging to conduct model checking and diagnostics with Pearson and deviance residuals for count regression. Recently, a few GOF tests based on the cumulative sums of residuals [36,37] have been developed for the zero-inflated models [38]. However, these GOF tests cannot be used to reveal the nature of model discrepancy for suggesting certain strategies to improve a poorly fit model. The smooth test proposed in [30] is difficult to be extended to more flexible models, such as zero-inflated models. In addition, the smooth test is a likelihood ratio test (LRT). The validity of the p value of an LRT test itself depends on the correctness of the assumed null model.
The method of randomized quantile residual (RQR) was proposed by Dunn and Smyth [39] to overcome the challenges of diagnosing count regression. The central idea of the RQR is to randomize the predictive p value (i.e. tail probability of CDF for response) into a uniform random number. With this randomization, the distribution of RQRs is a standard normal under the true model for the dataset. Therefore, we can conduct model diagnostics with RQRs for count regression models in the same way for normal regression. Recently, Feng et al. [35] compared the performance of conducting GOFs with RQRs in generalized linear models and concluded that the GOF tests with RQRs are well-calibrated and have good power. The method of RQR has also been increasingly applied to some zero-inflated regression models without considering random effects [40][41][42][43][44]. However, to the best of our knowledge, the method of RQR has not been applied to sequencing count data. Furthermore, the performance of RQRs in diagnosing zero-inflated GLMMs for sequencing count data has not been extensively investigated in the literature.
The primary objective of this article is to demonstrate that the method of RQR performs very well for diagnosing zero-inflated GLMMs and is particularly suitable for checking whether such models provide adequate fits to sequencing count data. The rest of the article is organized as follows. Sections "Generalized linear mixed models for zero-inflated data" and "Randomized quantile residuals" describes zero-inflated GLMMs and the method of RQRs respectively. In Section "Simulation studies" we report the results of performing large-scale simulation studies to investigate the performance of RQRs for zero-inflated GLMMs. The simulation studies show that the probabilities of type I errors of the GOF tests with RQRs are very close to the nominal level, and the GOF tests have excellent power; where X i and X i are fixed factors for modelling i and p i respectively, which may or may not be identical; = ( 1 , 2 , … , p ) T and ̃= (̃1,̃2, … ,̃p) T are the corresponding p-dimensional vectors of unknown regression coefficients; Z i and Z i are the p-dimensional vectors for the random effects of the conditional count and logistic components of the model, respectively; u = (u 1 , u 2 , … , u q ) T and ũ = (ũ 1 ,ũ 2 , … ,ũ q ) T are the unobserved random effects vectors, which are often assumed to be normally distributed u ∼ N (0, Σ) , where Σ is a positive definite variance-covariance matrix.

Zero-modified (hurdle) mixed-effects models
Zero-modified models are also called hurdle models [10]. Both zero-modified and zeroinflated models can be used to model excess zeros in the response variable. In contrast to zero-inflated models, zero-modified models treat zero-count and non-zero outcomes as two completely separate categories, rather than treating the zero-count outcomes as a mixture of structural and sampling zeros. A zero-modified model is composed of two components. A component is a probability distribution for describing the probability of observing value zero or not. The other component models the positive count data using a truncated negative binomial or truncated Poisson, by removing the zero part from the Poisson or NB distribution, and the denominator is to re-normalize the probability so that it still sums to 1. In particular, the PMF and CDF for y i of a zero-modified model are then written as follows: where I(⋅) is the indicator function, which is equal to 1 when the condition in bracket is true and equal to 0 otherwise. Similar to zero-inflated models, one can choose different model, g(⋅) , for modelling counts. In ZMNB models, g(y i ) = f NB (y i ; i , ) ; in ZMP models, g(y i ) = f Pois (y i ; i ) for ZMP model. The i and i are similarly linked to covariates using Eq. (5).

Comparison of zero-inflated and zero-modified models
When the same g(⋅) is chosen, the conditional distributions for y i given y i > 0 in the zero-inflated and zero-modified model are identical-both described with the PMF g(y i )∕(1 − g(0)) . The difference of these two models lies in the modelling of P(y i = 0) . In zero-modified models, P(y i = 0) = i is linked to covariates directly. In contrast, in zeroinflated models, P(y i = 0) = p i + (1 − p i )g(0) is not linked to covariates directly; instead, the mixture proportion p i is linked to covariates. However, we see that when g(0) is very small, which occurs when i is large, these two models are very close.

Randomized quantile residuals
Examining the residuals of a regression model is a standard tool for assessing normal regression [45]. Pearson residuals is the raw residual divided by the square root of the variance, written as r i = , where ̂i is the fitted value and V (y i ) is the estimated variance of y i respectively. Deviance residuals are often defined for generalized linear models [46]. Deviance residual for y i is defined as the part attributed to y i in the deviance, which is the difference of log-likelihood of the fitted model to that of a saturated model. For zero-inflated and zero-modified models, it is challenging to find a reasonable saturated model. Most importantly, the distributions of Pearson and deviance residuals are not normal for count regression [35,39] under the true model. Therefore, the graphical examination of Pearson and deviance residuals are often not informative for diagnosing count regression models. Quantitative assessment of the overall GOF with Pearson and deviance residuals are often based on 2 approximation for their sampling distributions. The Pearson 2 statistic is written as, The asymptotic distribution of D and X 2 under the true model is often assumed to be 2 n−p , where n is the sample size and p is the number of parameters. However, the use of this asymptotic distribution for both X 2 and D lacks theoretical underpinning.
The method of randomized quantile residual (RQR) [39] was proposed to overcome the difficulties of using traditional residuals for diagnosing regression models for discrete outcomes. The idea of RQR is to transform the tail probability of each response value into the equivalent standard normal quantile. Let F (y i ; i , ) denote the cumulative distribution function (CDF) for random variable y i , which is parametrized by i (covariate related) and (covariate unrelated, such as size parameter of NB distribution). If the CDF is continuous, F (y i ; i , ) is uniformly distributed on (0, 1) RQRs can then be defined as q i = Φ −1 {F (y i ;̂i,̂)}, where Φ −1 () is the quantile function of a standard normal distribution. If the CDF is discrete, randomization is added to make it continuous. To be more specific, let p(y i ; i , ) denote the PMF of y i . The randomized tail probability can be defined as: where u i is a uniform random variable on [0, 1] , and F (Y i − ; i , ) is the lower limit of F at y i . When F is discrete, we let a i = lim y→y i − F (y i ; i , ) and b i = F (y i ; i , ) , then the RQR for y i is calculated as Feng et al. [35] gives a detailed explanation of the RQR and illustrates the RQR using a simple GLM model with nonlinear effects.
From the definition (8), the computation of RQRs is straightforward once we can compute the CDF of y i . For zero-inflated and zero-modified models, these CDFs are given by Eqs. (2) and (7). We generated two R functions, i.e, rqr.glmmtmb and rqr.
hurdle.glmmtmb to calculate RQRs for diagnosing different types of mixed effects counts models. The function rqr.glmmtmb is designed for diagnosing Poisson, NB, ZIP and ZINB mixed-effects models and the function rqr.hurdle.glmmtmb is designed for diagnosing ZMP and ZMNB mixed-effects models. Both functions only request input the fitting results from a model fitted in the glmmTMB package [47], and then these two functions output the RQRs for the corresponding model. These functions are provided in the Additional file 1.
Under the true model with the true parameters, the distribution of RQRs is a standard normal. Based on this null distribution, we could conduct residual diagnostics for count regression models, including zero-inflated GLMMs, in the same way for normal regression models with Pearson's residuals, including overall GOF tests and graphical examinations such as residual plots and Q-Q plots. However, the standard normality holds only when the true model with the true parameters is used in Eq. (8). The actual performance of the RQR in particular models with parameters estimated with finite samples still demands empirical investigation. Feng et al. [35] show that the performance of the RQR is good for generalized linear models. In this paper, we investigate the performance of the RQR in zero-inflated GLMMs with simulated datasets that look like actual microbiome count data.

Simulation studies
In this section, we present simulation studies to evaluate the performance of RQRs. We simulate data from the ZINB, ZMB, ZIP, and ZMP model, respectively, with varying degrees of excess zeros and over-dispersion. For illustrative purposes, we first assess the GOFs of the true model in comparison with the misspecified models using RQRs and Pearson residuals graphically for a single simulated dataset. Then we simulate 3000 replicate samples to assess the performance of the overall GOF test by testing the normality of the RQRs. The histogram of normality test p values and the probability of rejecting the wrong model are presented for comparing the performance of RQRs and Pearson residuals. Section "Description of data generating process" describes data generating process. Section "Simulation results" presents the results of the simulation studies. Section "Illustration of model diagnostics with RQRs for a single dataset" illustrates the performance of RQRs based on a single simulated dataset, and Section "Results of GOF tests based on RQRs with multiple simulated datasets "presents the results for simulated studies based on replicated datasets.

Data generation
We first simulate dataset from zero-inflated model with the outcome variable Y i , i = 1, … , n , generated as follows, 1. Generate a binary variable H i indicating whether Y i is a structural zero or not, with the probability of p(H i = 0) = p i , which is linked to the fixed and random factors using a logistic link function: where ̃X i (m) denotes the coefficient associated with the mth fixed factor, m = 1, … , s and ũ Z i (t) , t = 1, … , T , denotes the coefficient for the tth random effect term. 2. If H i = 0, Y i = 0 ; otherwise, Y i is generated from a NB or Poisson model with mean i , which is linked to the fixed and random effect terms as follows: where X i (m) represents the coefficient associated with the mth fixed-effect factor, m = 1, … , s , and u Z i (t) denotes the coefficient associated with the tth random factor. log (T i ) denotes the offset term to adjust for the varying total sequence reads across the samples. We also generate data from a zero-modified model with a data generation process similar to one for the zero-inflated model described above. The only difference is that when H i > 0 , Y i is generated from a truncated Poisson or NB model. In the zero-modified model, i represents the mean and i represents the proportion of zeros.

Parameter settings
We generate datasets with s = 3 fixed factors and t = 2 random factors and different sample size n = 50, 100, 200, 400 . Each fixed factor has three levels, and each random factor has two levels. The regression coefficients for the fixed-effects covariates i follow a normal distribution with mean = 0 , and standard deviation = 0.1 , and the coefficients for the random effects u i follow a normal distribution with mean = 0 , and standard deviation = 2 . The total read T i follows a Poisson distribution with a mean = 300,000 . The shape parameter follows a unif(2, 3) distribution.
To investigate the robustness of the performance of RQRs, we consider four scenarios by varying ̃0 and 0 , in Eqs. (10) and (11), which control the zero proportions (ZP) and dispersion of count component, respectively. More specifically, in scenarios 1 and 2, ̃0 = 3.5 , which represents the high ZP, while in scenarios 3 and 4, ̃0 = −5.5 , which represents the low ZP. In scenarios 1 and 3, 0 = −5.5 for NB model and 0 = −5.7 for Poisson model, which represents the relatively high count data, while in scenarios 2 and 4, 0 = −7.8 for NB model and 0 = −8 for Poisson model, representing the relatively low count data.

Illustration of model diagnostics with RQRs for a single dataset
In this section, we illustrate RQRs in comparison to Pearson residuals for diagnosing six GLMMs for a single dataset generated with ZMNB models with scenario 4 parameter settings (low count, low ZP). RQRs are calculated by our created function rqr.glmmtmb or rqr.hurdle.glmmtmb. Figure 1 depicts the results simulated from ZMNB model when n = 400 . The panels in the first column display the RQRs versus fitted values, from which we can see that residuals from model ZMNB and ZINB are randomly scattered around y = 0 without any discernible pattern. The standardized residuals for those two models are within -3 to 3, which indicates the ZINB model has similar fitting results as the ZMNB model, and both fit the data well. However, the RQRs for the models ZMP and ZIP are not evenly distributed around y = 0 , suggesting that ZMP and ZIP models do not fit the data well. This indicates that ZMP and ZIP models fail to model the over-dispersion adequately. The RQRs from the NB model show a decreased pattern, and the RQRs from the Poisson model are clustered at the top and bottom, indicating that the Poisson model could not handle over-dispersion and excessive zeros well. The (11) panels in the second column display the scatter plots of Pearson residuals versus fitted values, which indicates that the Pearson residuals fail to provide meaningful information regarding the GOF of the models, which is not surprising, as Pearson residuals are theoretically not normally distributed for count regression.
The normality of the RQRs is examined using Q-Q normality plots, as shown in the panels of the third column of Fig. 1. The points in the Q-Q plots for ZMNB and ZINB align closely to the straight line with a slope of 1, which indicates these two models fit

Q−Q plot for Pearson residuals
Theoretical Quantiles Sample Quantiles Fitted values  the data reasonably well. In contrast, the Q-Q plots for the ZMP, ZIP, and Poisson models are depicted as two separate lines with a substantial gap, which clearly indicates that the distributional assumptions for these models are not consistent with the true data. The points of the QQ plot of RQRs for the NB model shown in Fig. 1 follow more closely along the straight line than the ZMP, ZIP, and Poisson models, since NB distributions have heavy tails at two sides when is small. However, the misfit of NB models for the zero-inflated data can still be clearly identified in the scatter plot of RQRs. This demonstrates the advantage of examining the scatter plots of RQRs against fitted values for diagnosing model fit. Q-Q plots for the Pearson residuals are depicted in the panels of the fourth column of Fig. 1, all of which are curves, which poses a challenge for visually checking the model fit. The performance of RQRs is also examined when data are simulated from ZMP, ZINB and ZIP models, respectively, as shown in Additional file 1: Figs. S1, S2 and S3. The results indicate that RQRs are able to diagnose the model fit while Pearson residuals provide very limited information for checking the model fit.

Results of GOF tests based on RQRs with multiple simulated datasets
This section presents the results of examining the performance of RQRs for diagnosing the GOF of the regression models based on 3000 replicated datasets from the true model. The Shapiro-Wilk (SW) normality test for the residuals is used as the overall GOF test. When the model is true, the p values obtained from SW normality test are expected to be a uniform distribution. Additional file 1: Fig. S4 shows that when the true model is the ZMNB model, the p values obtained from the SW normality test of RQRs for ZMNB and ZINB are uniformly distributed in all four scenarios. Similar to the results presented in Section "Illustration of model diagnostics with RQRs for a single dataset", both ZMNB and ZINB models perform well when data are simulated from ZMNB, while the rest four models fail to adequately capture the over-dispersion or zeroinflation. Additional file 1: Fig. S5 indicates that the p values of the SW normality test for the Pearson residuals are all concentrated around zero; therefore, Pearson residuals fail to distinguish the true and wrong models. We further investigate the type I error rate of the SW normality test (probability of rejecting the true model) for RQRs and Pearson residuals at varying sample sizes, n = 50, 100, 200 , and 400. Ideally, the type I error of the SW normality test should be around 0.05. Table 1 presents the probability of rejecting the model when the 3000 replicated datasets of size n = 100 are simulated from ZMNB, ZINB, ZMP, and ZIP, respectively. In each scenario, we summarize the zero proportion, 5%, 50%, and 95% quantiles of non-zero counts over the 3000 replicated datasets. We also summarize the number of converged model fittings for the 3000 replicated datasets, shown as N in the last columns of Tables 1 and 2. When the true model is ZMNB or ZINB, the type I error rates are close to 0.05, and the probabilities of rejecting the ZMP, ZIP, NB and Poisson models are all very high, indicating that RQRs are able to identify the misspecified models. When the true model is ZMP or ZIP, the type I error rates for ZMNB, ZINB, ZMP, and ZIP models are around 0.05 under different scenarios. In each scenario, the probabilities of rejecting the NB model are above 0.17, and the probabilities of rejecting the Poisson model are 1, suggesting RQRs can detect that NB and Poisson models are not appropriate for modelling zero-inflated data. In particular, the Poisson model is unable to model excess zeros. Moreover, zero proportion and the dispersion of the positive count data do not influence the type I error rates for the correctly specified model. Note that the convergence rates for all the models are higher when the true model is ZMNB or ZINB, as compared to the scenarios when data are generated from ZMP or ZIP model. Such difference is attributed to the need to use an extremely large of NB distribution for approximating the Poisson distribution; this often breaks down the model fitting with the package glmmTMB. However, this problem may be solved by other GLMM packages.
When the sample size n = 50, 200 and 400 (Additional file 1: Tables S1, S2; Table 2), the results are consistent with result when sample size n = 100 . Hence, the type I error rate is not affected by the sample size. However, the convergence rate decreases when the sample size decreases. Only about 100 model fittings converged when n = 100 compared to about 900 converged replicated datasets when n = 400 in the scenario when data are simulated from the ZMP model. The convergence issue is more likely to occur when the sample size is too small to estimate the parameters reliably.
In comparison to the RQRs, Pearson residuals cannot differentiate the true and wrong models, as displayed in Additional file 1: Tables S3-S6. The SW normality tests based on Pearson residuals for all the models have high probabilities of rejecting models regardless of the sample sizes, zero proportion, and scale of count data. For example, the probabilities of rejecting the true model ZMNB are all equals to 1 under all scenarios and different sample sizes. Therefore, Pearson residuals are useless compared with RQRs for testing the overall GOF of the methods.

Application to a real human microbiome dataset
In this section, a real human microbiome dataset will be introduced. We apply various models discussed previously to this dataset and use RQRs to test the GOF of all models.

Data sources and descriptions
As a response to the epidemic of worldwide obesity, efforts to identify the relationship between host and environmental factors and energy balance have increased. Comparisons of the distal gut microbiota of genetically obese mice and their lean littermates have revealed that obesity is associated with two dominant bacterial divisions, i.e., the Bacteroidetes and the Firmicutes [48]. The human distal gut harbours a vast ensemble of microbes helping to break down otherwise indigestible material. It is often of interest to investigate the relationship between gut microbial ecology and body fat in humans [49]. Each distinct microbe species can be assigned to a diverse taxonomic rank based on shared characteristics, including species, genus, family, order, class, phylum, kingdom, and domain. The OTU data used in our application were generated at the genus level, which is the commonly used OTU level for microbiome sequencing analysis, and there are 14 different genera in total [50]. Each sample consists of 154 individuals, and we characterize individuals into 31 monozygotic (MZ) twin pairs, 23 dizygotic (DZ) twin pairs and 46 mothers. Twins were between 21 and 32 years old and were of European (EA) or African (AA) ancestry, respectively. Individuals were classified as obese/ overweight if body mass index (BMI) ≥ 25, or lean if BMI < 25. Fecal samples were frozen immediately after they were produced for extracting the DNAs of the bacteria, then the 16S rRNA sequencing method was used to group the bacteria into different OTUs with a sequence identity threshold of 97% [11]. Two subjects were dropped from samples for quality control. Among the rest of the 152 individuals, 34 were measured once, and 118 were measured twice (time point 1 and time point 2) for fecal samples. There are 281 OTU measures on the genus level in total. For each measurement, OTU count at each genus level, as well as the total number of reads per measure, were recorded. Figure 2 shows the histograms of the four genera selected from the data for the purpose of illustration of the distribution of the OTU measures, which all exhibit right skewness.

Model checking with the RQR method
In this analysis, ancestry and obesity were selected as the host factors while age and family as the random factors. Then, ZMNB, ZMP, ZINB, ZIP, NB and Poisson models were fitted to each of the 14 genus-level OTUs. First, we fitted Poisson and NB models to the original dataset. However, the model checking results based on examining the normality of their RQRs showed that these models do not fit the original data very well (results shown in [51]). The OTU counts at the genus level contain very few  Fig. 2 Histograms of the OTU counts of four selected genera actual zeros. Therefore, zero-inflated models cannot fit the original data better. Considering that small OTU counts at the genus level are likely caused by the mismatching in sequence alignment of reads, we truncated the OTU counts to be zero when their values are less than 10 for most genera except the following four genera. The truncation thresholds for Bacteroides, Ruminococcus,Faecalibacterium and Lachnospiraceae are set to be 50, 50, 100, and 150, respectively. We will present the model diagnostics results for the truncated datasets. Figure 3 shows the Q-Q plots of the RQRs of six models fitted to Euba OTU counts. We see that the Q-Q plots of the RQRs of the ZMNB model and the ZINB model fall along a straight line with a slope of 1 and just a few points slightly deviating from the diagonal line, which indicates that RQRs are normally distributed. The Q-Q plots for the other four models exhibit curvature patterns. Therefore, these Q-Q plots show that only ZMNB and ZINB appear to fit the dataset well. Table 3 shows the p values for the SW normality test of RQRs for all 14 OTUs at the genus level. For easy visual inspection of RQRs, we sort the genera by the test p values of the ZINB model. The first column lists 14 different genera in the twin study OTU data. If the p value for the SW normality test is less than 0.05, the model may not fit the data well. RQRs contain randomness. As a result, we calculate the mean of the SW test p values based on RQRs by replicating RQRs 100 times. As shown in Table 3, the ZMNB and ZINB models provide reasonable fits to this data with all SW p values greater than 0.05. However, the SW p values for the ZMP, ZIP, NB, Poisson, NB, and Poisson models are mostly very small (except the p value of NB for Blau). These small SW test p values indicate that these models do not fit the data well. We also use the Akaike information criterion (AIC) to compare the six models. Table 4 presents the AIC values of all of the six models. The ZMNB and ZINB models also have smaller AIC values compared to

Discussion and conclusion
Model checking is critical to control the FDR at a nominal level in differential abundance analysis for sequencing count data. In this paper, we conduct large-scale simulation studies to investigate the performance of the RQRs diagnosing zero-inflated GLMMs, which are often applied to model sequencing count data. Our simulation studies show that the type I error rates of the GOF tests with RQRs are very close to the nominal level. In addition, the scatter plots and Q-Q plots of RQRs are useful