ZINB model
Essential genes are likely to have no insertions or very few counts (because mutants with transposon insertions in those regions are not viable), while non-essential genes are likely to have counts near the global average for the dataset. Insertion counts at TA sites in non-essential regions are typically expected to approximate a Poisson distribution. This expectation is based on a null model in which the expected fraction of insertions at a site is determined by the relative abundance of those clones in the library, and the observed counts in a sequencing experiment come from a stochastic sampling process. This process is expected to follow a multinomial distribution [24], which is approximated by the Poisson for sufficiently large numbers of reads (total dataset size) [25].
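As a quick illustration of this null model, a short simulation in R (with hypothetical, uniform clone abundances, not from the original study) shows that marginal site counts drawn from a multinomial are nearly Poisson, with variance approximately equal to the mean, once the total read count is large:

```r
# Illustrative simulation (hypothetical library): multinomial sampling of reads
# across TA sites yields approximately Poisson-distributed marginal counts
# when the total number of reads is large.
set.seed(1)
n_sites <- 1000                            # number of TA sites in the library
n_reads <- 1e6                             # total dataset size
probs   <- rep(1 / n_sites, n_sites)       # equal clone abundances, for simplicity
counts  <- rmultinom(1, size = n_reads, prob = probs)[, 1]
c(mean = mean(counts), var = var(counts))  # both close to n_reads / n_sites
```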
Let Y={yg,c,i,j} represent the set of observed read counts for each gene g, in condition c∈{c1..cn}, at TA site i=1..Ng, for replicate j=1..Rc. We are interested in modeling the gene- and condition-specific effects on the counts, p(y|g,c,i,j). We treat the observations at individual TA sites and in different replicates as independent, identically distributed (i.i.d.) samples drawn from the distribution for the gene and condition:
$$p(y|g,c,i,j) = p(y|g,c) $$
Read-count data are often modeled using the Negative Binomial (NB) distribution [25]. The NB distribution can be thought of as a Poisson distribution with over-dispersion, arising from an extra degree of freedom:
$$ NB(y \mid p,r) = {y+r-1 \choose y} p^{y}(1-p)^{r} $$
(1)
$$y|g,c \sim NB(p_{g,c},r_{g,c}) $$
where p is a success probability (i.e. of a mutant getting a transposon insertion at a particular site), and r, often called a size parameter, represents the dispersion. Unlike the Poisson distribution, which has a single rate parameter λ and for which the variance is restricted to equal the mean, the extra parameter in the NB allows for fitting counts with a variance greater or less than expected (i.e. different from the mean). The NB distribution converges to a Poisson as r→∞ [26]. A common re-parameterization of the NB distribution is to specify it by the mean, μ, and the dispersion parameter, r, which together determine the success probability, p, through the following relationship:
$$p = \frac{\mu}{\mu + r} $$
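This relationship is easy to verify numerically; the sketch below (with illustrative values for μ and r) checks Eq. (1) against R's built-in NB density, noting that R's dnbinom() defines prob as the complement of the p used here:

```r
# Check that the (mu, r) parameterization matches Eq. (1) with p = mu/(mu + r).
# Note: R's dnbinom() uses prob = r/(r + mu), the complement of p above.
mu <- 50; r <- 2; y <- 0:10
p      <- mu / (mu + r)
manual <- choose(y + r - 1, y) * p^y * (1 - p)^r  # Eq. (1)
viaR   <- dnbinom(y, size = r, mu = mu)           # built-in (mu, size) form
all.equal(manual, viaR)                           # TRUE
```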
In practice, TnSeq data often have an excess of empty sites (TA sites with counts of 0), exceeding what would be expected under a typical NB distribution. Because essential genes typically constitute only 10–20% of the genome in most organisms, a library with transposon insertions at 50% of its sites (i.e. 50% saturation) means that even non-essential genes will have a large fraction of sites with counts of zero. Thus, while the NB distribution may be sufficient to model counts in other domains, TnSeq requires more careful consideration.
One way to solve this problem is to model the read-counts for a gene g and condition c as coming from a Zero-Inflated Negative Binomial distribution (ZINB) instead:
$$ y|g,c \sim ZINB (\pi_{g,c}, r_{g,c}, \mu_{g,c}) $$
(2)
where
$$\begin{array}{*{20}l} ZINB(y \mid \pi, r, \mu) &= \left\{\begin{array}{ll} \pi + (1-\pi) \times NB(0 \mid r, \mu) & y = 0\\ (1-\pi) \times NB(y \mid r, \mu) & y >0 \end{array}\right. \end{array} $$
Here the π parameter represents the probability that a count of zero is extraneous (i.e. does not belong to the NB distribution), and can be interpreted as approximating the probability that an empty site is essential (i.e. empty due to fitness costs incurred through its disruption, rather than by stochastic absence). In this way, both read-counts (through the r and μ parameters of the NB distribution) and insertion density (through π) can be used to differentiate genes that are essential in one condition and non-essential in another.
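The ZINB density above is straightforward to transcribe into code; the following R function is a minimal sketch (parameter names mirror the text):

```r
# ZINB density, following the two-case definition above:
# pi is the probability of an extraneous zero; mu and r parameterize the NB.
dzinb <- function(y, pi, mu, r) {
  nb <- dnbinom(y, size = r, mu = mu)     # NB component
  ifelse(y == 0,
         pi + (1 - pi) * nb,              # zeros: point mass plus NB zeros
         (1 - pi) * nb)                   # positive counts: NB only
}
dzinb(0, pi = 0.3, mu = 100, r = 2)       # e.g. probability of an empty site
```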
Generalized linear model
To capture the conditional dependence of the ZINB parameters (μ, r, π) on the experimental conditions, we adopt a generalized linear model (GLM) approach with a log link function. This is done independently for each gene g. We use Yg to represent the subset of all observed counts in gene g at any TA site, in any condition, in any replicate (Yg is illustrated as a column vector in Fig. 1). The vector of expected means μg of the ZINB distribution (non-zero component) for each observation in gene g is expressed as:
$$ ln \ \boldsymbol{\mu}_{g} = \boldsymbol{X}_{g} \boldsymbol{\alpha}_{g} $$
(3)
where Xg is a binary design matrix (see Fig. 1), indicating the experimental condition for each individual observation (insertion count at a TA site) in gene g, and αg is a vector of coefficients for each condition. For m observations and n conditions, the size of Xg will be m×n and the size of αg will be n×1. Hence, there will be n coefficients for each gene, one for estimating the mean non-zero count for each condition. The conditional expectations for the non-zero means for each condition can be recovered as: \(\langle \mu _{g,c_{1}}, \ \ldots, \ \mu _{g,c_{n}} \rangle =exp(\boldsymbol {\alpha }_{g})\).
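In R, such a design matrix can be built with model.matrix(); the example below uses a small hypothetical set of observations to show the structure of Xg:

```r
# Binary design matrix X_g for a gene with observations in two conditions
# (hypothetical data; one row per insertion count, one column per condition).
sample_info <- data.frame(condition = factor(c("c1", "c1", "c2", "c2")))
Xg <- model.matrix(~ 0 + condition, data = sample_info)  # m x n indicator matrix
Xg
# Given fitted coefficients alpha_g, condition means are recovered as exp(alpha_g).
```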
If additional covariates distinguishing the samples are available, such as library, timepoint, or genotype, they may be conveniently incorporated in the linear model with an extra matrix of covariates, Wg (m×k for k covariates), to which a vector of k parameters βg will be fit:
$$ ln \ \boldsymbol{\mu}_{\boldsymbol{g}} = \boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\alpha}_{\boldsymbol{g}} + \boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\beta}_{\boldsymbol{g}} $$
(4)
For the dispersion parameter of the NB, τ (or size parameter r=1/τ), we allow each gene to have its own dispersion but, for simplicity, assume that it does not differ among conditions. Hence it is fitted by a common intercept:
$$ln \ r_{g} = \rho_{g} $$
Finally, for the zero-inflation (Bernoulli) parameter, π, we fit a linear model depending on condition, with a logit link function (a conventional choice for incorporating probabilistic variables bounded between 0 and 1 as terms in a linear model):
$$ logit(\boldsymbol{\pi}_{g}) = \left\langle ln \left(\frac{\pi_{g,c}}{1-\pi_{g,c}} \right) \right\rangle_{c=1..n} = {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\gamma}_{\boldsymbol{g}}} $$
(5)
Thus each gene will have its own local estimate of insertion density in each condition, πg,c=exp(γg,c)/(1+exp(γg,c)). In the case of covariates, logit(πg)=Xgγg+Wgδg, where Wg are the covariates for each observation, and δg are the coefficients for them.
Putting these all together:
$$ {\begin{aligned} \begin{array}{lll} p(y|g,c) & = & ZINB(\mu_{g,c},r_{g},\pi_{g,c}) \\ & = & ZINB(exp({\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\alpha}_{\boldsymbol{g}} + \boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\beta}_{\boldsymbol{g}}}),exp(\rho_{g}), logit^{-1}({\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\gamma}_{\boldsymbol{g}}+\boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\delta}_{\boldsymbol{g}}})) \end{array} \end{aligned}} $$
(6)
The parameters of the GLM can be estimated by maximum likelihood using iteratively re-weighted least squares (IWLS). In this work, we use the pscl package in R [27].
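A minimal sketch of the per-gene fit using pscl::zeroinfl() is shown below; dat is a hypothetical data frame with one row per observation in gene g (an insertion count, its condition, and any covariates):

```r
library(pscl)

# Count component (left of "|"): log-linear model for the NB mean (alpha_g);
# zero component (right of "|"): logit model for zero-inflation (gamma_g).
# dist = "negbin" fits a single dispersion theta, matching the shared rho_g.
fit <- zeroinfl(count ~ 0 + condition | 0 + condition,
                data = dat, dist = "negbin")

exp(coef(fit, model = "count"))   # condition-specific non-zero means mu_g,c
```

Covariates as in Eq. (4) would enter additively on each side of the "|", e.g. `count ~ 0 + condition + batch | 0 + condition + batch` for a hypothetical batch covariate.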
Correcting for saturation differences among TnSeq datasets
An important aspect of comparative analysis of TnSeq data is the normalization of datasets. Typically, read-counts are normalized such that the total number of reads is balanced across the datasets being compared. Assuming read-counts are distributed as a mixture of a Bernoulli distribution (responsible for the zeros) and another distribution, g(x), responsible for the non-zero counts, i.e.,
$$\begin{array}{*{20}l} f(x) &= \left\{\begin{array}{ll} \theta \times \mathrm{g}(x) & x > 0\\ (1-\theta) \times \text{Bern}(x| p=0) & x =0 \end{array}\right. \end{array} $$
then the expected value of this theoretical read-count distribution (with mixture coefficient θ) is given by:
$$ {\mathrm{E}}\left[ f(x)\right] = \theta \times {\mathrm{E}}\left[ g(x)\right] $$
(7)
The expected value of such a distribution can be normalized to match that of another dataset, fr(x) (such as a reference condition, with saturation θr), by multiplying it by a factor, w, defined in the following way:
$$\begin{array}{*{20}l} {\mathrm{E}}\left[ f_{r}(x)\right] &= w \times {\mathrm{E}}\left[ f(x)\right]\\ \theta_{r} \times {\mathrm{E}}\left[ g_{r}(x)\right] &= w \times \left(\theta \times {\mathrm{E}}\left[ g(x)\right] \right) \end{array} $$
$$ w = \frac{\theta_{r} \times {\mathrm{E}}\left[ g_{r}(x)\right]}{\theta \times {\mathrm{E}}\left[ g(x)\right]} $$
(8)
This guarantees that the expected read-count is the same across all datasets. TTR normalization (i.e. total trimmed read-count, the default in TRANSIT [15]) estimates E[g(x)] in a robust manner (excluding the top 1% of sites with the highest counts, to reduce the influence of outliers, which can distort normalization and lead to false positives).
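A sketch of the normalization factor of Eq. (8), using a TTR-style trimmed mean of the non-zero counts to estimate E[g(x)] (the exact trimming details in TRANSIT may differ; this is illustrative):

```r
# Normalization factor w from Eq. (8): theta is estimated by the fraction of
# non-zero sites, and E[g(x)] by the mean of non-zero counts after excluding
# the top 1% of sites (to reduce the influence of outliers).
norm_factor <- function(counts, ref_counts) {
  nzmean <- function(x) {
    nz <- x[x > 0]
    mean(nz[nz <= quantile(nz, 0.99)])    # trimmed mean of non-zero counts
  }
  theta     <- mean(counts > 0)           # saturation of this dataset
  theta_ref <- mean(ref_counts > 0)       # saturation of the reference
  (theta_ref * nzmean(ref_counts)) / (theta * nzmean(counts))
}
```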
While TTR works well for methods like resampling (which only depend on the expected counts being equivalent under the null-hypothesis), it does not work well for methods designed to simultaneously detect differences in both the local magnitudes of counts (non-zero mean) and the saturation (fraction of non-zero sites) such as ZINB. This is because TTR in effect inflates the counts at non-zero sites in datasets with low saturation, in order to compensate for the additional zeros (to make their expected values equivalent). This would cause genes to appear to have differences in (non-zero) mean count (μg,a vs μg,b), while also appearing to be less saturated (πg,a vs πg,b), resulting in false positives.
To correct for differences in saturation, we incorporate offsets in the linear model as follows. First, assume there are d datasets (combining all replicates over all conditions). Let the statistics of each dataset be represented by a d×1 vector of non-zero means, M (genome-wide averages of insertion counts at non-zero sites), and a d×1 vector of the fraction of sites with zeros in each dataset, Z. For the m observations (insertion counts at TA sites) in gene g, let Dg be the binary design matrix of size m×d indicating the dataset for each observation. Then the linear equations above can be modified to incorporate these offsets (a specific offset for each observation, depending on which dataset it comes from):
$$ ln(\boldsymbol{\mu}_{g}) = {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\alpha}_{\boldsymbol{g}} + \boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\beta}_{\boldsymbol{g}}} + ln({\boldsymbol{D}_{\boldsymbol{g}} \boldsymbol{M}}) $$
(9)
$$ logit(\pi_{g}) = {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\gamma}_{\boldsymbol{g}} + \boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\delta}_{\boldsymbol{g}}} + logit({\boldsymbol{D}_{\boldsymbol{g}} \boldsymbol{Z}}) $$
(10)
Note that M and Z are just vectors of empirical constants in the linear equation, not parameters to be fit. Hence the fitted coefficients (αg,βg,γg,δg) effectively estimate the deviations of the local insertion counts in a gene relative to the global mean and saturation of each dataset. For example, if observation yg,c,i,j comes from dataset d (where i and j are indexes of TA site and replicate), and the global non-zero mean of that dataset is Md, then exp(αg,c) estimates the ratio of the expected mean insertion count for gene g in condition c to the global mean for dataset d (ignoring covariates):
$$\frac{\mu_{g,c}}{M_{d}} = exp(\alpha_{g,c}) $$
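In the pscl implementation, these per-observation constants can be supplied as offset() terms in both parts of the model formula; the sketch below assumes dat$dataset is a hypothetical factor identifying the source dataset of each observation, with M and Z ordered by its levels:

```r
# Offsets of Eqs. (9)-(10): M holds the global non-zero means and Z the
# fractions of empty (zero) sites for each dataset, indexed by factor level.
dat$off_mu <- log(M[dat$dataset])      # ln(D_g M), offset on the count mean
dat$off_pi <- qlogis(Z[dat$dataset])   # logit(D_g Z), offset on zero-inflation

fit <- zeroinfl(count ~ 0 + condition + offset(off_mu) |
                        0 + condition + offset(off_pi),
                data = dat, dist = "negbin")
```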
Statistical significance
Once the ZINB model is fit to the counts for a gene, it is necessary to evaluate the significance of the fit. Tests on individual coefficients (e.g. Wald tests) could be used to evaluate whether each differs significantly from 0. However, for assessing whether there is an overall effect as a function of condition, we compare the fit of the data Yg (a set of observed counts for gene g) to a simpler model (ZINB without conditional dependence) and compute the difference of log-likelihoods (the log-likelihood ratio):
$$ -2 \{ {\mathcal{L}}_{0}(Y_{g}|\Theta_{0}) -{\mathcal{L}}_{1}(Y_{g}|\Theta_{1}) \} = -2 \ ln\left(\frac{L_{0}(Y_{g}|\Theta_{0})}{L_{1}(Y_{g}|\Theta_{1})}\right) $$
(11)
where the two models are given by:
$$ \begin{array}{ll} M_{1}: & L_{1}({Y_{g}}|\boldsymbol{X}_{g},\Theta_{1}) = ZINB({Y_{g}}|\boldsymbol{X_g},\boldsymbol{\mu}_{g},r_{g},\boldsymbol{\pi}_{g})\\ & ln \ \boldsymbol{\mu}_{g} = {\boldsymbol{X}_{\boldsymbol{g}}\boldsymbol{\alpha}_{\boldsymbol{g}}}, \ ln \ r_{g} = \rho_{g}, \ logit(\boldsymbol{\pi}_{g}) = {\boldsymbol{X}_{\boldsymbol{g}}\boldsymbol{\gamma}_{\boldsymbol{g}}} \\ M_{0}: & L_{0}({Y_{g}}|\Theta_{0}) = ZINB({Y_{g}}|\mu_{g},r_{g},\pi_{g}) \\ & ln \ \mu_{g}=\alpha^{0}_{g}, \ ln \ r_{g}=\rho_{g}, \ logit(\pi_{g}) = \gamma^{0}_{g} \\ \end{array} $$
(12)
where Θ1=〈αg,ρg,γg〉 and \(\Theta _{0}=\left \langle \alpha ^{0}_{g},\rho _{g}, \gamma ^{0}_{g} \right \rangle \) are the collections of parameters for the two models, and where \(\alpha ^{0}_{g}\) and \(\gamma ^{0}_{g}\) in M0 are just scalars fitted to the grand mean and saturation of the gene over all conditions.
The likelihood ratio statistic above is expected to be distributed as χ2 with degrees of freedom equal to the difference in the number of parameters (Wilks’ Theorem):
$$ -2 \ ln\left(\frac{L_{0}\left(Y_{g}|\Theta_{0}\right)}{L_{1}\left(Y_{g}|\Theta_{1}\right)}\right) \sim \chi^{2}_{df=df(M_{1})-df(M_{0})} $$
(13)
For the condition-dependent ZINB model (M1), the number of parameters is 2n+1 (n each for αg and γg, plus ρg). For the condition-independent ZINB model (M0), there are only 3 scalar parameters \(\left (\alpha ^{0}_{g}, \rho _{g}, \gamma ^{0}_{g}\right)\) used to model the counts pooled across all conditions. Hence df=2n+1−3=2(n−1). The point of the test is to determine whether the additional parameters, which should naturally improve the fit to the data, are justified by the extent to which they increase the likelihood of the fit. The cumulative distribution function of the χ2 distribution is used to calculate p-values from the log-likelihood ratio, which are then adjusted by the Benjamini-Hochberg procedure [28] to correct for multiple tests (limiting the false-discovery rate to 5% over all genes in the genome being tested in parallel).
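The test reduces to a few lines of R; the sketch below reuses the condition-dependent fit from the earlier example and fits the intercept-only null model for the same gene:

```r
# Likelihood-ratio test (Eqs. 11-13): compare the condition-dependent fit
# (`fit`, from the sketch above) to an intercept-only ZINB null model.
m0   <- zeroinfl(count ~ 1 | 1, data = dat, dist = "negbin")
stat <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(m0)))
n    <- nlevels(dat$condition)                  # number of conditions
pval <- pchisq(stat, df = 2 * (n - 1), lower.tail = FALSE)

# After collecting p-values over all genes:
# padj <- p.adjust(pvals, method = "BH")        # Benjamini-Hochberg adjustment
```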
Importantly, if a gene is detected to be conditionally-essential (or have a conditional growth defect), it could be due to either a difference in the mean counts (at non-zero sites), or saturation, or both. Thus the ZINB regression method is capable of detecting genes that have insertions in approximately the same fraction of sites but with a systematically lower count (e.g. reduction by X%), possibly reflecting a fitness defect. Similarly, genes where most sites become depleted (exhibiting reduced saturation) but where the mean at the remaining sites (perhaps at the termini) remains about the same would also be detectable as conditional-essentials.
Covariates and interactions
If the data include additional covariates, then the W terms will be included in the regressions for both models M1 and M0:
$$ {\begin{aligned} \begin{array}{ll} M_{1}: & L_{1}({Y_{g}}|{\boldsymbol{X}_{\boldsymbol{g}},\boldsymbol{W}_{\boldsymbol{g}}},\Theta_{1}) = ZINB({Y_{g}}|\boldsymbol{\mu}_{g},r_{g},\boldsymbol{\pi}_{g}) \\ & ln \ {\boldsymbol{\mu}_{\boldsymbol{g}} = \boldsymbol{X}_{\boldsymbol{g}}\boldsymbol{\alpha}_{\boldsymbol{g}} \underline{+\boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\beta}_{\boldsymbol{g}}}}, \ ln \ r_{g} = \rho_{g}, \ logit(\boldsymbol{\pi}_{g}) = {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\gamma}_{\boldsymbol{g}} \underline{+\boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\delta}_{\boldsymbol{g}}}} \\[1cm] M_{0}: & L_{0}({Y_{g}}|\boldsymbol{W}_{g},\Theta_{0}) = ZINB({Y_{g}}|\boldsymbol{W_g},\boldsymbol{\mu}_{g},r_{g},\boldsymbol{\pi}_{g}) \\ & ln \ \boldsymbol{\mu}_{g}=\alpha^{0}_{g} {\underline{+\boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\beta}_{\boldsymbol{g}}}}, \ ln \ r_{g}=\rho_{g}, \ logit(\boldsymbol{\pi}_{g}) = \gamma^{0}_{g} {\underline{+\boldsymbol{W}_{\boldsymbol{g}}\boldsymbol{\delta}_{\boldsymbol{g}}}} \\ \end{array} \end{aligned}} $$
(14)
In this way, the covariates W will increase the likelihoods of both models similarly, and the LRT will be evaluating only the improvement of the fits due to the conditions of interest, X, i.e. the residual variance explained by X after taking known factors W into account. Although the number of parameters in both models will increase, the difference in degrees of freedom will remain the same.
If the covariates represent attributes of the samples that could be considered to interact with the main condition, then one can account for interactions by including an additional term in the regression. An interaction between variables occurs when the dependence of the parameter estimates (mean counts or saturation) on the main condition variable is influenced by the value of another attribute (e.g. treatment of the samples), which can cause the coefficients for a condition to differ as a function of the interacting variable. For example, suppose we have samples of two strains (e.g. knockout vs wildtype) that have been cultured over several time points (e.g. 1–3 weeks). Then we might naturally expect that there will be variability across all 6 strain-time combinations (considered as independent conditions), e.g. due to differences between time points. In fact, some genes might exhibit a gradual increase or decrease in counts over time, which could be expressed as a slope (i.e. as a regression coefficient for time, treated as a continuous attribute). For the purpose of addressing the main question, which is whether there is a systematic difference in insertion counts between the strains, we want to discount (or adjust for) the effects of time. However, the difference between the strains could manifest itself as a difference in the slopes (the time-dependent effect on the counts), which might differ for each strain. Treating covariates as interactions allows us to capture and test for these effects by incorporating separate coefficients for each combination of values (i.e. independent slopes for each strain).
Interactions can be incorporated in the ZINB regression model by including the product of the conditions with the interacting covariates in the regression for M1:
$$ \begin{array}{llll} M_{1}: & ln \ \boldsymbol{\mu}_{g} & = & {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\alpha}_{\boldsymbol{g}} +\boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\beta}_{\boldsymbol{g}} +\underline{\boldsymbol{X}_{\boldsymbol{g}} \otimes \boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\lambda}_{\boldsymbol{g}}}} \\ & logit\left(\boldsymbol{\pi}_{g}\right) & = & {\boldsymbol{X}_{\boldsymbol{g}} \boldsymbol{\gamma}_{\boldsymbol{g}} +\boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\delta}_{\boldsymbol{g}} +\underline{\boldsymbol{X}_{\boldsymbol{g}} \otimes \boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\eta}_{\boldsymbol{g}}}} \\ M_{0}: & ln \ \boldsymbol{\mu}_{g} & = & \alpha^{0}_{g} +{\boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\beta}_{\boldsymbol{g}}} \\ & logit\left(\boldsymbol{\pi}_{g}\right) & = & \gamma^{0}_{g} +{\boldsymbol{W}_{\boldsymbol{g}} \boldsymbol{\delta}_{\boldsymbol{g}}} \end{array} $$
(15)
where Xg⊗Wg represents column-wise products for each pair of columns in Xg and Wg (resulting in a matrix of dimensions m×(n·k) for n conditions and k interaction variables). Thus, if there is a general trend in the counts for a gene over time, it will be captured by the coefficients of Wg (vectors βg and δg), which are included in both models. However, if the variables Xg and Wg interact, then the coefficients of the product term (λg and ηg) will be non-zero, allowing the slopes to differ between the strains. Importantly, because the objective is to test for the significance of the interaction, the additive term for the covariate is retained in the null model but the product term is not; the likelihood-ratio test thus assesses the specific improvement in likelihood attributable to the interaction, while factoring out the information (i.e. the general trend) attributable to the interaction variable on its own (independent of the main condition).
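In R's formula syntax, the interaction model of Eq. (15) can be expressed with the * operator, which expands to main effects plus the product term (a reparameterization of Xgαg+Wgβg+Xg⊗Wgλg); strain and time below are hypothetical columns of dat:

```r
# Interaction test: the product term (strain:time) appears only in m1;
# the additive time trend is retained in both models, per Eq. (15).
m1 <- zeroinfl(count ~ strain * time | strain * time, data = dat, dist = "negbin")
m0 <- zeroinfl(count ~ time          | time,          data = dat, dist = "negbin")

stat <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df   <- length(coef(m1)) - length(coef(m0))   # extra coefficients in m1
pval <- pchisq(stat, df = df, lower.tail = FALSE)
```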
Treatment of mice
Mice were anesthetized with 5% isoflurane and sacrificed by cervical dislocation.