### Model for absolute quantification

Absolute quantification is used to estimate the expression of a target transcript in one or more sample-types. For *absolute* gene expression we write the proposed model as follows:

$$\begin{aligned} X_{ij} = \theta _{ik(j)} + \delta _{j} + \varepsilon _{ij} \end{aligned}$$

(3)

where \(X_{ij}\) are again the completely observed gene expression values, \(\theta _{ik(j)}\) are the true values of gene expression for gene *i* in the sample-type *k* to which sample *j* belongs, \(\delta _{j}\) represents a global shift in expression across samples, and \(\varepsilon _{ij}\) now captures both biological and technical variability. In addition to estimating absolute expression within each sample-type, parameter estimates from this model can be used to assess relative expression between sample-types.
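As a concrete illustration, when the data are fully observed the model in Eq. 3 can be fit with simple moment estimators: estimate each sample's global shift \(\delta_j\) from its column mean, then average within sample-types. The sketch below is illustrative only (a moment-based stand-in, not the paper's exact fitting procedure; the function name is our own):

```python
import numpy as np

def fit_absolute_model(X, sample_type):
    """Moment-estimator sketch for X_ij = theta_ik(j) + delta_j + eps_ij.
    X: (genes x samples) array of fully observed expression values.
    sample_type: length-J array mapping each sample j to its sample-type k."""
    # global shift per sample: deviation of each sample's mean from the grand mean
    delta = X.mean(axis=0) - X.mean()
    Xc = X - delta  # remove sample-level shifts
    types = np.unique(sample_type)
    # average the shift-corrected values within each sample-type
    theta = np.column_stack([Xc[:, sample_type == k].mean(axis=1) for k in types])
    return theta, delta
```

With noise-free data whose sample-type means are balanced, the estimator recovers \(\theta\) and \(\delta\) exactly, which makes the identifiability convention (mean-zero \(\delta_j\)) easy to see.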

### Model for relative quantification

Relative quantification is used to estimate the change in expression of a target transcript between two sample types, typically experimental test and control samples. Due to the significant impact of batch effects on genomic data [22], it is increasingly common for experiments to include a matched control sample for each sample or group of samples analyzed. In this case, the parameters of interest are no longer the average expression within each sample type; rather, they are the differences in expression between the test and control samples. These control samples can be included in a model to directly adjust for batch effects. Specifically, we partition samples into batches \(J^{'}_{n}\), with \(n(j) = n\) for \(j\in J^{'}_{n}\), and introduce \(\gamma _{in}\) into the model to capture the batch effect for gene *i* in samples from batch *n* as follows:

$$\begin{aligned} X_{ij}&= \Delta _{ik(j)}+ \gamma _{in(j)} + \delta _{j} + \varepsilon _{ij}, \nonumber \\ \Delta _{ik(j)}&= \theta _{ik(j)} - \theta ^{Control}_{ij} \end{aligned}$$

(4)

Here, the parameter of interest, \(\Delta _{ik(j)}\), is the difference in expression between the test and control sample. We assume that control samples are denoted by \(k=0\) and \(\Delta _{i0}=0~\forall i\), such that:

$$\begin{aligned} X_{ij} = {\left\{ \begin{array}{ll} \Delta _{ik(j)} + \gamma _{in(j)} + \delta _{j} + \varepsilon _{ij} &{} \text{ if } j \notin J_0 \\ \gamma _{in(j)} + \delta _{j} + \varepsilon _{ij} &{} \text{ if } j \in J_0 \end{array}\right. } \end{aligned}$$

(5)

Rarely, non-detects may occur in the control samples as well as in the test samples. In this case, we propose a two-step process: first, apply the model in Eq. 3 to the control samples and perform SI, MI, or DirEst; second, use the model in Eq. 4 to obtain estimates of the non-detects for the test samples.
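To make Eq. 5 concrete, a toy moment-based estimator with \(\delta_j = 0\) reads the batch effects \(\gamma_{in}\) off the matched controls and then averages the control-adjusted test samples. This is an illustrative sketch under those simplifying assumptions, not the paper's fitting procedure:

```python
import numpy as np

def fit_relative_model(X, batch, condition):
    """Sketch for Eq. 5: Delta_{ik} = theta_{ik} - theta^{Control}.
    X: genes x samples; batch[j] = n(j); condition[j] = k(j), with 0 = control.
    Assumes delta_j = 0 and at least one control per batch."""
    batches = np.unique(batch)
    # batch effect per gene, estimated from the control samples in each batch
    gamma = np.column_stack([X[:, (batch == n) & (condition == 0)].mean(axis=1)
                             for n in batches])
    # subtract each sample's matched control level
    Xadj = X - gamma[:, np.searchsorted(batches, batch)]
    ks = np.unique(condition[condition != 0])
    Delta = np.column_stack([Xadj[:, condition == k].mean(axis=1) for k in ks])
    return Delta, gamma
```

Because the control samples carry the batch effect directly, \(\Delta_{i0}=0\) holds by construction for this estimator.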

### Estimation of model parameters

We perform parameter estimation using an Expectation Conditional Maximization (ECM) procedure [23]; in this section we outline the process. Our algorithm is given in more detail in Additional file 1: Appendix D. Let \(Y=(X,W)\) be the observed data, where *X* represents the complete data and *W* denotes the non-detects. We first replace the non-detects with initial estimates and obtain an initial estimate of the missing data mechanism, \((\beta _0^{(0)}, \beta _1^{(0)})\). In the first step of the ECM algorithm, we calculate \(E^{(t)}(W)\) and update \(\theta _{ij}^{(t)}\); in the second step, based on the results of the first, we calculate \(E^{(t)} (W^2)\) and \(\sigma _i^{2^{(t)}}\). This process is repeated until the change in the likelihood falls below a specified threshold.
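For intuition, the alternating updates can be sketched for a single gene whose non-detects are right-censored at Ct = 40, using closed-form truncated-normal moments in place of \(E^{(t)}(W)\) and \(E^{(t)}(W^2)\). This is a simplified one-gene illustration, not the paper's full algorithm (which also updates the missing data mechanism):

```python
import math

def phi(x):  # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):  # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def ecm_censored_normal(y_obs, n_cens, limit=40.0, tol=1e-8, max_iter=500):
    """Toy E/CM loop for one gene: y_obs are detected values, n_cens values
    are censored above `limit`. Returns (mu_hat, sigma2_hat)."""
    J = len(y_obs) + n_cens
    mu, s2 = sum(y_obs) / len(y_obs), 1.0
    for _ in range(max_iter):
        s = math.sqrt(s2)
        a = (limit - mu) / s
        lam = phi(a) / max(1.0 - Phi(a), 1e-12)      # inverse Mills ratio
        Ew = mu + s * lam                             # E[W | W > limit]
        Ew2 = mu * mu + s2 + s * lam * (limit + mu)   # E[W^2 | W > limit]
        mu_new = (sum(y_obs) + n_cens * Ew) / J       # CM-step 1: update mean
        s2_new = (sum((v - mu_new) ** 2 for v in y_obs)
                  + n_cens * (Ew2 - 2 * mu_new * Ew + mu_new ** 2)) / J  # CM-step 2
        converged = abs(mu_new - mu) + abs(s2_new - s2) < tol
        mu, s2 = mu_new, s2_new
        if converged:
            break
    return mu, s2
```

Because the censored values lie above the detection limit, the estimated mean is pulled above the naive mean of the detected values alone.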

### Improved estimation of the missing data mechanism

It is possible to observe perfect separation between observed and non-detected transcripts, a common problem in logistic regression. In this case, estimation of the parameter values becomes unstable; however, one can use a Bayesian procedure to obtain stable estimates of the generalized linear regression coefficients [24]. We adopt this approach when estimating the parameters of the missing data mechanism, \((\beta _0, \beta _1)\).
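As an illustration of how a prior stabilizes separated data, the sketch below fits a logistic regression with a zero-mean Gaussian prior on the coefficients (equivalently, a ridge penalty) by gradient ascent. It is a generic stand-in for the Bayesian procedure of [24], not that exact method:

```python
import math

def penalized_logistic(x, z, prior_var=4.0, lr=0.05, iters=5000):
    """Fit logit P(z=1) = b0 + b1*x with a N(0, prior_var) prior on (b0, b1),
    so even perfectly separated data yield finite coefficient estimates."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += zi - p           # log-likelihood gradient, intercept
            g1 += (zi - p) * xi    # log-likelihood gradient, slope
        # the Gaussian prior contributes -beta / prior_var to each gradient
        b0 += lr * (g0 - b0 / prior_var)
        b1 += lr * (g1 - b1 / prior_var)
    return b0, b1
```

On perfectly separated data the unpenalized MLE diverges; with the prior, the posterior mode stays finite while preserving the sign of the relationship.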

### Multiple imputation

We incorporated different sources of variability in the MI procedure: non-systematic variation, uncertainty in the linear model parameters (\(\theta\) in Eqs. 3 and 4), and parameters of the missing data mechanism (\(\beta _0\) and \(\beta _1\) in Eq. 2), as well as each combination of these sources of variability.

#### Uncertainty in linear model parameters

When the data contain missing values, the estimates of the model parameters carry additional uncertainty due to the missing data. We can account for this added uncertainty by introducing additional variation in the parameter estimates. Instead of using point estimates \({\hat{\theta }}\), we draw *M* different \({\hat{\theta }}_m\) \((m=1, \ldots , M)\) from the estimated distribution of \(\theta \sim {\mathrm {N}}({\hat{\mu }}_\theta , {\hat{\sigma }}_\theta )\).
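The draw step might look like the following minimal sketch (the function name is illustrative; one draw would feed each of the M imputed data sets):

```python
import random

def draw_thetas(theta_hat, se_theta, M=20, seed=1):
    """Draw M plausible values of theta from its estimated sampling
    distribution N(theta_hat, se_theta^2)."""
    rng = random.Random(seed)
    return [rng.gauss(theta_hat, se_theta) for _ in range(M)]
```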

#### Uncertainty in the missing data mechanism

Similarly, one can account for the uncertainty in the missing data mechanism by introducing additional variability in the corresponding parameter estimates. To preserve the dependence between the parameters, we assume \((\beta _0,\beta _1)\) are jointly \({\mathrm {MVN}}({\hat{\mu }}_\beta , {\hat{\Sigma }}_\beta )\). Researchers can draw *M* pairs of \((\beta _0,\beta _1)\) from the estimated distribution and use these estimates in the imputation procedure. This step introduces additional variability in the resulting complete data sets that reflects uncertainty in the estimated logistic regression model parameters.
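Sketching the joint draw (function name illustrative; \({\hat{\mu }}_\beta\) and \({\hat{\Sigma }}_\beta\) would come from the fitted logistic model):

```python
import numpy as np

def draw_betas(mu_beta, Sigma_beta, M=20, seed=1):
    """Draw M (beta0, beta1) pairs jointly from MVN(mu_beta, Sigma_beta),
    preserving the correlation between intercept and slope."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu_beta, Sigma_beta, size=M)
```

Drawing the pair jointly, rather than each coefficient independently, is what preserves the dependence noted above.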

#### Biological variability and measurement error

Suppose the model parameters are known and we are interested in applying MI to obtain an estimate of differential gene expression. In this case, given all the model parameters, the imputed values will be identical and equal to the conditional expectation of the missing data point without additional variability. Such a result is undesirable, as it will lead to artificially small variance estimates. To better estimate the uncertainty of the missing value itself under known true parameters of the modeling framework, we must include non-systematic biological variability and measurement error in the MI procedure. We assume that these sources of variability together are normally distributed with mean zero and variance equal to the residual variance from the EM procedure.
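The noise-addition step for a single non-detect can be sketched as follows (function name illustrative; `resid_var` is the residual variance from the EM fit):

```python
import random

def impute_with_noise(cond_mean, resid_var, M=20, seed=1):
    """Draw M imputed values around the non-detect's conditional mean,
    adding N(0, resid_var) noise so the imputations are not all identical."""
    rng = random.Random(seed)
    s = resid_var ** 0.5
    return [cond_mean + rng.gauss(0.0, s) for _ in range(M)]
```

Without the added noise all M imputations would collapse to `cond_mean`, which is exactly the artificially-small-variance problem described above.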

### Direct estimation of model parameters

An alternative approach to handling missing data is to directly estimate the parameters of interest. For a fixed gene, in sample *j*, let \(w_j\) denote an unobserved value of the gene expression \(y_j\) and let \(z_j\) be an indicator of an observed expression value as in Eq. 1. The MLE of the variance for a given gene is:

$$\begin{aligned} {\widehat{\sigma }}^2_{MLE}&= \frac{1}{J} \sum _{j=1}^J\Big ((E(w_j^2)-2E(w_j)\theta _{k(j)} +\theta _{k(j)}^2)(1-z_j) \nonumber \\&+ (y_j^2-2 y_j\theta _{k(j)} +\theta _{k(j)}^2 )z_j \Big ). \end{aligned}$$

(6)

In contrast, the sample variance estimate following SI for a given gene is:

$$\begin{aligned} {\widehat{\sigma }}^2_{SI}&= \frac{1}{J} \sum _{j=1}^J \Big ((E(w_j)^2-2E(w_j)\theta _{k(j)} +\theta _{k(j)}^2 )(1-z_j) \nonumber \\&\quad + (y_j^2-2 y_j\theta _{k(j)} +\theta _{k(j)}^2 )z_j \Big ). \end{aligned}$$

(7)

Derivations of these equations are given in Appendix A.

These equations differ only in the first term within the summation: \(E(w_j^2)\) versus \(E(w_j)^2\). Taking the difference between Eqs. (6) and (7) yields:

$$\begin{aligned} {\widehat{\sigma }}^2_{MLE} - {\widehat{\sigma }}^2_{SI}&= \frac{1}{J} \sum _{j=1}^J \Big (E(w_j^2) - E(w_j)^2 \Big )(1-z_j) \nonumber \\&= \frac{1}{J} \sum _{j :\, z_j=0} \Big (E(w_j^2) - E(w_j)^2 \Big ) = Q \sigma _{w} \propto \sigma _{w} > 0, \end{aligned}$$

(8)

where \(Q >0\) is the proportion of non-detects for the given gene. Since \(E(w_j^2)\) is greater than \(E(w_j)^2\), the variance estimated after SI is smaller than the MLE of the variance. As the number of non-detects increases, *Q* increases, and hence so does the difference between \({{\hat{\sigma }}}^2_{MLE}\) and \({{\hat{\sigma }}}^2_{SI}\). Note that these relationships hold for any given gene.
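The gap in Eq. 8 can be checked numerically: for values censored above the detection limit, \(E(w_j^2) - E(w_j)^2\) is the conditional variance of the censored tail, which is strictly positive. A Monte Carlo sketch (all parameter values are illustrative):

```python
import random

def moment_gap_demo(mu=38.0, sigma=1.5, limit=40.0, n=200000, seed=7):
    """Estimate E(w^2) - E(w)^2 for a normal truncated above `limit`.
    This is the per-observation gap that makes sigma2_SI < sigma2_MLE."""
    rng = random.Random(seed)
    # keep only the draws that fall in the censored (non-detect) region
    tail = [v for _ in range(n) if (v := rng.gauss(mu, sigma)) > limit]
    Ew = sum(tail) / len(tail)
    Ew2 = sum(v * v for v in tail) / len(tail)
    return Ew2 - Ew * Ew  # strictly positive conditional variance
```

The gap is also smaller than \(\sigma^2\) itself, since truncation reduces variance; SI discards exactly this component of the variability.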

### Simulation study design

We performed simulation studies with either 16 or 90 genes, 6 sample-types, and 4, 6, or 10 replicates within each sample-type, such that \(J_k\) is the same size \(\forall \, k\). We denote the number of replicates within each sample-type as *m*. Finally, we assume a common missing data mechanism parameterized as in Eq. 2. The parameter values in the simulations were chosen close to the estimated values from [14]. Specifically, we set \(\beta _0=-35.7\) and \(\beta _1=1\); the \(\sigma ^2_i\) were generated from *Unif*(0.06, 1.3); \(\theta _{ij(k)}\) were generated from \(N(\mu _\theta , \sigma ^2_{\theta } I)\), where \(\sigma ^2_{\theta }=3\); and \(\mu _{\theta }\) was generated from a truncated normal distribution with mean 31, standard deviation 3.5, and truncation range 20 to 40.5. We set \(\delta _j\) to 0, simulated \(\varepsilon\) from \(N(0, \sigma ^2_{i} I)\), and obtained the complete data by combining \(\theta\), \(\delta\), and \(\varepsilon\) as in Eq. 3. To obtain the missing data indicators, we drew from a \(Binom(p_{ij(k)})\), where \(p_{ij(k)}=P(Z_{ij}=1)\) is calculated as in Eq. 2 for each data point. Individual generated data points were replaced with 40 according to the missing data indicators, or whenever the generated Ct value was greater than or equal to 40. We compared performance under the logit, probit, and cloglog links. For each scenario, we repeated this procedure 100 times. To summarize the performance of each estimation technique, we report the 25th, 50th, and 75th quantiles of each assessment measure across all genes and samples. For example, in the case of 16 genes and 6 sample-types, there are 16 different \(\sigma ^2_{i}\) and \(16 \times 6 = 96\) distinct values of \(\theta _{ij(k)}\) for each simulated data set.
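One replicate of this design can be sketched as follows. Treating the logistic probability as the chance of a non-detect (consistent with the signs of \(\beta_0\) and \(\beta_1\) above) is an assumption of this sketch, as is restricting to the logit link:

```python
import math
import random

def simulate_qpcr(genes=16, types=6, reps=4, beta0=-35.7, beta1=1.0, seed=3):
    """Generate one simulated data set roughly following the design above.
    Ct values flagged by the logistic mechanism, or at/above 40, become 40."""
    rng = random.Random(seed)

    def trunc_norm(m, s, lo, hi):
        # rejection sampling from a truncated normal
        while True:
            v = rng.gauss(m, s)
            if lo <= v <= hi:
                return v

    data = []
    for _ in range(genes):
        s2 = rng.uniform(0.06, 1.3)                     # sigma_i^2 ~ Unif(0.06, 1.3)
        mu_t = trunc_norm(31.0, 3.5, 20.0, 40.5)        # truncated-normal mu_theta
        row = []
        for _ in range(types):
            theta = rng.gauss(mu_t, math.sqrt(3.0))     # theta ~ N(mu_theta, 3)
            for _ in range(reps):
                ct = theta + rng.gauss(0.0, math.sqrt(s2))
                p_missing = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * ct)))
                if ct >= 40.0 or rng.random() < p_missing:
                    ct = 40.0                            # non-detect
                row.append(ct)
        data.append(row)
    return data
```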

### A penalized EM algorithm incorporating non-random missingness (PEMM)

We compared the proposed DirEst and MI methods to the PEMM algorithm proposed by [12]. PEMM incorporates an Inverse-Wishart penalty into an EM algorithm to estimate the mean and covariance of multivariate Normal data in the presence of missing values. PEMM was initially applied to proteomic abundance levels under the following assumptions: (1) the probability of a missing value does not depend on the missingness of other values given the abundance data and other covariates, and (2) the missingness of each feature is independent of the abundance of other features and covariates. In contrast, our methods assume one common missing data mechanism and estimate it based on all the available information. While PEMM was developed for proteomics data, we adapted it to qPCR data by transforming the Ct values to a scale similar to proteomic abundance data, subtracting each Ct value from the largest possible value. PEMM proposes the following model for the missing data mechanism: \(Pr(Z=1 \mid Y) = c \exp (-\phi Y)\) for positive abundance *Y*, constant *c*, and tuning parameter \(\phi\). We tested PEMM with \(\phi =0.5, \, 1, \, 2\) in the missing-data mechanism and chose \(\phi =1\), the value that gave the highest correlation between the complete-data mean estimates and the PEMM mean estimates. The PEMM algorithm is implemented in the R package PEMM (version 1.0).
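The Ct transformation and the exponential missingness model can be expressed as follows. The function names, the default constant *c*, and the convention that the probability refers to a value going missing are sketch assumptions, not the PEMM package's API:

```python
import math

def ct_to_abundance(ct, max_ct=40.0):
    """Map a Ct value to an abundance-like scale by subtracting it from the
    largest possible value, as described above (higher = more abundant)."""
    return max_ct - ct

def pemm_missing_prob(y, c=1.0, phi=1.0):
    """PEMM-style missingness model: probability proportional to exp(-phi*y)
    for positive abundance y, so low-abundance features go missing more often."""
    return min(1.0, c * math.exp(-phi * y))
```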

### Masking procedure for the experimental data to assess method performance

To assess the applicability of the proposed methods to real data, we used a masking procedure to create missing data for which the observed values are known. First, all genes with missing data were removed, resulting in a completely observed data set from which we obtained \({{\hat{\theta }}}_{complete}\) and \({{\hat{\sigma }}}^2_{complete}\). Second, the new detection limit was set at Ct = 30, and expression values greater than 30 were masked. Importantly, rather than creating missing points based on the probabilistic model, which would require assumptions regarding the unknown missing data mechanism, we truncated the data. Given that PEMM makes different assumptions about the missing data mechanism than DirEst and MI, we restrict our comparison between these methods to the masked data. We calculated \({{\hat{\theta }}}_{method}\) and \({\hat{\sigma }}^2_{method}\) for the data truncated at Ct = 30, PEMM, DirEst, and MI, and compared them to the estimates from the complete data.
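The masking step itself is straightforward; a minimal sketch (function name illustrative):

```python
import numpy as np

def mask_at_limit(X, limit=30.0, mask_value=np.nan):
    """Treat expression values above `limit` as non-detects so that method
    estimates can be checked against the known complete data.
    Returns the masked matrix and a boolean 'observed' indicator."""
    X = np.asarray(X, dtype=float)
    observed = X <= limit
    masked = np.where(observed, X, mask_value)
    return masked, observed
```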