Multiple imputation and direct estimation for qPCR data with non-detects

Background Quantitative real-time PCR (qPCR) is one of the most widely used methods to measure gene expression. An important aspect of qPCR data that has been largely ignored is the presence of non-detects: reactions failing to exceed the quantification threshold and therefore lacking a measurement of expression. While most current software replaces these non-detects with a value representing the limit of detection, this introduces substantial bias in the estimation of both absolute and differential expression. Single imputation procedures, while an improvement on previously used methods, underestimate residual variance, which can lead to anti-conservative inference. Results We propose to treat non-detects as non-random missing data, model the missing data mechanism, and use this model to impute missing values or obtain direct estimates of model parameters. To account for the uncertainty inherent in the imputation, we propose a multiple imputation procedure, which provides a set of plausible values for each non-detect. We assess the proposed methods via simulation studies and demonstrate the applicability of these methods to three experimental data sets. We compare our methods to mean imputation, single imputation, and a penalized EM algorithm incorporating non-random missingness (PEMM). The developed methods are implemented in the R/Bioconductor package nondetects. Conclusions The statistical methods introduced here reduce discrepancies in gene expression values derived from qPCR experiments in the presence of non-detects, providing increased confidence in downstream analyses.

x jk β k ) 2 Let z j = 1 if y j observed, and z j = 0 if y j is missing, and denote missing y j by w j .
Let U j = (y j , E(w j )), One can update θ treating E(w j ) as "data", then update σ 2 as: The MLE of the variance for any gene is:

Appendix B -Additional Simulation results
Supplementary Supplementary Figure 2: The difference in the variance estimates for the Dataset 3. Similar to Figure 1, on the horizontal axis is the average of the two estimates of the standard deviation, and the difference between the direct MLE estimate and the SI estimate of σ is on the vertical axis.Supplementary Figure 4: Within replicate residuals stratified by the presence and handling of nondetects.The average ∆Ct values were calculated over the replicates for gene i and sample-type k.The residuals, for each gene and sample-type were summarized and are plotted here.Both panels show the distribution of residuals from left to right: without non-detects in dataset, with SI applied to the data, with the application of MI with different sources of variability (noise, liner model parameters -θ, combination of the noise and θ, parameters of the logistic regression modelβ's respectively), combination of uncertainty in θ and β's, and the combination of the three sources of variability combined.

Appendix D -Estimation via Expectation Conditional Maximization (ECM)
Assume Ct values follow a normal distribution, denoted by y = (x, w), where x represents the observed data and w represents the missing data.For a fixed gene i = 1, ...I, the gene and sampletype specific mean is denoted by θ k(j) and σ 2 is a common variance: Missing indicator z is a binomial random variable.z = 1 if the Ct value is observed and z = 0 if the Ct value is missing, so z j = 1 if y j = x j observed, and z j = 0 if y j = w j is missing.
z ∼ Bernoulli(p), and f (z|p We model p explicitly as where β 0 and β 1 are the coefficients from logistic regression.The likelihood can be written as: Define η = (θ k(j) , σ 2 , β 0 , β 1 ), and Furthermore, The log-likelihood can be written: Note y j = (x j , E(w j )), and One can update θ treating E(w j ) as "data", then update σ 2 as: We perform parameter estimation by using an Expectation Conditional Maximization (ECM) procedure, in this section we outline this process.Let Y be the observed data, X represent the complete data, W denote the non-detects, and Z be a missing data indicator.We model the missing data mechanism as 2 parameter sigmoid function, for example for a logit link: Note that the joint P r(Z, W ) = P r(W |Z) × P r(Z), and the marginal probability P r(Z) = P r(Z, W )dW .
The algorithm proceeds as follows: 1. Set initial values of non-detects to a constant, a maximum possible Ct value, say 40.W = 40.
Obtain the initial estimate of missing data mechanism based on the observed data and initial values of non-detects, β0 and β1 , using logistic, probit regression, or complementary log-log transformation.
2. Initialize the mean and the variance of the gene expression, θ and σ2 , as sample mean and sample variance 3. E-step 1: Compute E (1) (W ) = W × P r(Z, W ) P r(Z) dW .

Table 1 :
Direct estimation, probit link.The 25 th (small left), 50 th (large center), and 75 th (small right) quantiles of the bias and MSE are reported.

Table 2 :
Direct estimation, cloglog link.The 25 th (small left), 50 th (large center), and 75 th (small right) quantiles of the bias and MSE are reported.

Table 3 :
Mean imputation results, logit link.The 25 th (small left), 50 th (large center), and 75 th (small right) quantiles of the bias and MSE are reported.