
Extending digital PCR analysis by modelling quantification cycle data



Digital PCR (dPCR) is a technique for estimating the concentration of a target nucleic acid by loading a sample into a large number of partitions, amplifying the target and using a fluorescent marker to identify which partitions contain it. The standard analysis uses only the proportion of partitions containing target to estimate the concentration, and depends on the assumption that the initial distribution of molecules among partitions is Poisson. In this paper we describe a way to extend such analysis using the quantification cycle (Cq) data that may also be available; rather than assuming the Poisson distribution, the more general Conway-Maxwell-Poisson distribution is used.


A software package for the open source language R has been created for performing the analysis. This was used to validate the method by analysing Cq data from dPCR experiments involving 3 types of DNA (attenuated, virulent and plasmid) at 3 concentrations. Results indicate some deviation from the Poisson distribution, which is strongest for the virulent DNA sample. Theoretical calculations indicate that the deviation from the Poisson distribution results in a bias of around 5 % for the analysed data if the standard analysis is used, but that it could be larger at higher concentrations. Compared to the estimates of subsequent efficiency, the estimates of 1st cycle efficiency are much lower for the virulent DNA, moderately lower for the attenuated DNA and close for the plasmid DNA. Further method validation using simulated data gave results closer to the true values, and with lower standard deviations, than the standard method for concentrations up to approximately 2.5 copies/partition.


The Cq-based method is effective at estimating DNA concentration and is not seriously affected by data issues such as outliers and moderately non-linear trends. The data analysis suggests that the Poisson assumption of the standard approach does lead to a bias that is fairly small, though more research is needed. Estimates of the 1st cycle efficiency being lower than estimates of the subsequent efficiency may indicate samples that are mixtures of single-stranded and double-stranded DNA. The model can reduce or eliminate the resulting bias.


Background

Digital Polymerase Chain Reaction (dPCR) is a technique first published in [1] that is used to quantify deoxyribonucleic acid (DNA) and other nucleic acids such as ribonucleic acid (RNA) for a variety of applications such as absolute quantification [2], copy number variation [3] and rare mutation detection [4]. It is now being used as a reference method to assign the copy number concentration of reference materials [5]. Samples are loaded onto a chip in a large number of separate partitions and then a series of cycles of the Polymerase Chain Reaction (PCR) are used to amplify the nucleic acid in the partitions. Fluorescent markers are used to detect which partitions contain nucleic acid.

The most basic data produced by this process are the counts of positive and negative reactions. These count data are sufficient, under the standard assumption [1] that the molecules in the partitions are initially independently distributed following a Poisson distribution [6], to calculate an estimate of the concentration of the target nucleic acid. The estimate for the mean molecules per partition based on the Poisson assumption is

$$ \tilde{\mu}=- \log \left(\frac{n_0}{n}\right) $$

where n_0 is the number of negative partitions out of a total of n, and log refers to the natural logarithm. This estimate follows the classical (also called frequentist) statistical method of maximum likelihood. If the Poisson distribution assumption is invalid then the estimate is likely to be biased.
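As a concrete illustration (not part of the authors' edpcr package; the partition counts below are made up), the count-based estimate can be computed as:

```python
import math

def poisson_estimate(n_negative, n_total):
    """Standard count-based estimate of mean molecules per partition.

    Under the Poisson assumption, P(partition is negative) = exp(-mu),
    so mu is estimated by -log(observed negative fraction)."""
    if not 0 < n_negative <= n_total:
        raise ValueError("need at least one negative partition")
    return -math.log(n_negative / n_total)

# hypothetical chip panel: 170 negative partitions out of 765
mu_hat = poisson_estimate(170, 765)  # about 1.504 molecules/partition
```

Note that the estimate is undefined when every partition is positive, which is why the method breaks down at high concentrations.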

In some dPCR instruments, the fluorescence is measured after each PCR cycle in what is known as real-time dPCR. The data are processed to produce the amplification curve for each partition, which for positive partitions includes a phase of exponential growth and, eventually, a plateau with no further growth. This provides a measure of fluorescence as a proxy for the amount of the target at each cycle, and is used to calculate the quantification cycle (Cq) for each positive partition. This is defined as the cycle at which fluorescence reaches a fixed threshold [7], with cycle treated as a continuous variable. The threshold is chosen so that it is crossed during the phase when fluorescence is growing exponentially. A common method is to fit a curve to the data and calculate the point at which it crosses the threshold. Such data have the potential to provide more information than the counts do, particularly about the value and uncertainties of the relevant concentration.

One approach to analysing Cq data is the retroflex method described in [8], where a continuous extension of the Poisson distribution is used to approximate the distribution of the data. In this paper we describe and illustrate a method of analysing Cq data from dPCR experiments that is appropriate for concentrations up to approximately 2.5 copies/partition, and that allows for possible departures from the Poisson distribution.



Methods

The standard method requires the assumption of a single-parameter distribution, such as the Poisson distribution, because the simple count data only provide information about whether the number of initial molecules in a partition is zero or at least one. The justification for the Poisson distribution comes from the Poisson limit theorem, which in part depends on the independence of the positions of the DNA molecules within the fluid. If there are significant dependencies, for example due to molecules sticking together or repelling each other, then there may be some deviation from the Poisson distribution. This may depend on factors such as the length of the DNA strands and the partition size.

The Poisson distribution has probability mass function

$$ P\left(X=x;\mu \right)={e}^{-\mu}\frac{\mu^x}{x!},\ x=0,\ 1,\dots,\ \mu >0. $$

A less restrictive distribution is the Conway-Maxwell-Poisson (CMP) distribution [9], which has probability mass function

$$ P\left(X=x;\lambda, \nu \right)=\frac{1}{Z\left(\lambda, \nu \right)}\frac{\lambda^x}{{\left(x!\right)}^{\nu }},\ x=0,\ 1,\dots; \lambda >0,\ \nu \ge 0, $$

where Z(λ, ν) is the normalising constant. For ν = 1 it is equivalent to the Poisson distribution and the variance equals the mean; for ν < 1 the variance is greater than the mean, and for ν > 1 it is less.
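A sketch of how the CMP probabilities can be evaluated numerically (illustrative Python, not the authors' implementation; the truncation point is an assumption that is adequate for moderate λ and ν):

```python
import math

def cmp_pmf(x_max, lam, nu, trunc=100):
    """Conway-Maxwell-Poisson probabilities P(X = x) for x = 0..x_max.

    Z(lam, nu) has no closed form in general, so the infinite sum is
    truncated; log-weights avoid overflow from (x!)**nu."""
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(trunc)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [wi / z for wi in w[:x_max + 1]]

# nu = 1 recovers the Poisson distribution
p = cmp_pmf(3, 1.5, 1.0)  # p[0] is close to exp(-1.5)
```

Setting ν below 1 and computing the moments of the returned probabilities confirms the overdispersion described above.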

Figure 1 provides a comparison between the Poisson and CMP distributions, where P(X = 0) is the same for each. The means are 1.40 (CMP with ν = 1.2), 1.50 (Poisson) and 1.62 (CMP with ν = 0.8).

Fig. 1

Comparison of Poisson and CMP distributions with P(X = 0) the same for each. Probabilities are given by Eq. (2) for the Poisson distribution and Eq. (3) for the CMP distribution, with λ chosen so that P(X = 0 | λ, ν) = e^−1.5

Our model of Cq data first requires a model of the growth of the number of molecules over the PCR amplification cycles. If the number of molecules at cycle c is given by N(c), then for c > 0

$$ N(c)=N\left(c - 1\right)+\mathrm{Binom}\left(N\left(c-1\right),{E}_c\right) $$

where Binom(n, p) represents a binomial random variable with n trials and success probability p, and E_c is the efficiency at cycle c. This is because each of the N(c − 1) molecules from the previous cycle is duplicated with probability E_c.
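The growth model in Eq. (4) is straightforward to simulate; the sketch below is illustrative Python (the authors provide equivalent functionality via the rcq function of their R package edpcr):

```python
import random

def simulate_growth(n0, efficiencies, rng=None):
    """Simulate N(c) = N(c-1) + Binom(N(c-1), E_c) over successive
    PCR cycles, with one efficiency value per cycle."""
    rng = rng or random.Random()
    n = n0
    for e_c in efficiencies:
        # each existing molecule duplicates with probability e_c
        n += sum(rng.random() < e_c for _ in range(n))
    return n

# with perfect efficiency every cycle, molecules double each cycle
n_final = simulate_growth(3, [1.0] * 4)  # 3 * 2**4 = 48
```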

In the model the efficiency for the first cycle is E_1 and for all subsequent cycles is E. Equation (4) can be used with Eq. (3) as the initial distribution of molecules to calculate the distribution after a chosen modest number of cycles. The distribution after further growth is modelled as normal. The fact that Eq. (4) represents a Galton-Watson branching process [10] is used to derive its mean and variance. Introducing the parameter A, defined as the relative fluorescence per molecule, leads to a distribution for relative fluorescence. This can then be used to derive an approximation to the distribution of Cq values for a given threshold value h.
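The moments behind this normal approximation can be written out explicitly (a sketch consistent with the variance terms appearing in the likelihood below, using the standard Galton-Watson formulas with offspring mean 1 + E and offspring variance E(1 − E) per molecule). For i molecules at cycle c_0, and writing \( {G}_c={\left(1+E\right)}^{c-{c}_0} \),

$$ \mathrm{E}\left[N(c)\right]=i{G}_c,\kern2em \mathrm{Var}\left[N(c)\right]=i{G}_c\left({G}_c-1\right)\frac{1-E}{1+E}. $$

Multiplying by the relative fluorescence per molecule A then gives a normal approximation for fluorescence with mean iAG_c and variance i A^2 G_c(G_c − 1)(1 − E)/(1 + E).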

The default Cq values provided by the data analysed later show clear trends. Additional analysis suggested that the trends could be removed through normalising the amplification curves and then calculating the Cq values (see Additional file 1). This approach could not be properly tested as the amplification curves generally appeared to be a few cycles short of reaching the plateau stage. It is not obvious what causes the differing plateaus, though one potential factor is varying temperature across the panels. The trends appear approximately linear in many cases, and so a linear trend is included in the model.

Censoring may be required for outliers, as they can represent some technical deviation from the model. The exclusion of such values that are inconsistent with the model should improve the performance of the analysis and the accuracy of the results. High outliers may be caused by a problem in amplifying the molecule. Our model censors high outliers, treating them as partitions with one molecule, rather than using the Cq values. The model could be similarly extended to deal with low outliers by treating them as counts of partitions with more than one molecule, though this was not done for the present study. As discussed later, low outliers do lead to spurious results for one of the analysed data sets.

The full vector of parameters is \( \boldsymbol{\theta} = \left(\mu, \nu, E, {E}_1, A, {b}_x, {b}_y\right) \), where μ is the mean number of initial molecules per partition. The overall likelihood is

$$ L\left(\boldsymbol{\theta}; \mathbf{c},\mathbf{x},\mathbf{y},\mathbf{n}\right)\propto p{\left(0,0;\mu, \nu \right)}^{n_0}\,p{\left(0,1;\mu, \nu \right)}^{n_1}\times {\displaystyle \prod_{j=1}^{n_2}}{\displaystyle \sum_{i=1}^{m{2}^{c_0}}}p\left(i,{c}_0;\mu, \nu, E,{E}_1\right)\left[\Phi \left(h,\ iA{G}_{c{'}_j},\ i{A}^2{G}_{c{'}_j}\left(\frac{1-E}{1+E}\right)\left({G}_{c{'}_j}-1\right)\right)-\Phi \left(h,\ iA{G}_{c{'}_j+\delta },\ i{A}^2{G}_{c{'}_j+\delta}\left(\frac{1-E}{1+E}\right)\left({G}_{c{'}_j+\delta }-1\right)\right)\right] $$

where \( c{'}_j = {c}_j - {b}_x\left({x}_j-0.5{n}_x\right) - {b}_y\left({y}_j-0.5{n}_y\right) \) are the detrended Cq data, \( {G}_c={\left(1+E\right)}^{c-{c}_0} \), Φ(h, m, v) is the distribution function at h of a normal distribution with mean m and variance v, and p(j, c; μ, ν) is the probability of there being j molecules at cycle c in a partition given parameters μ and ν. The values of p(j, c; μ, ν) are calculated from Eq. (3) for c = 0 and then through repeated application of Eq. (4) for cycles up to c_0. The value c_0 = 6 was chosen as it is the smallest value that achieves sufficient precision (see Fig. 2), and computational time increases rapidly as c_0 increases further. See Additional file 2 for the derivation and more details.

Fig. 2

Density plot of simulated data superimposed on model density. Simulation was performed using the rcq function from the R package edpcr for N = 10^6 partitions with E = 0.95, E_1 = 0.85, ν = 1.2, μ = 1.5 and c = 25.5 (a location parameter used to calculate A). This used Eq. (3) to select N(0) and repeated applications of Eq. (4) to simulate subsequent growth. Cq values were calculated based on exponential growth between the cycles immediately before and after the threshold was crossed. The density plot of the simulated Cq values uses a Gaussian kernel with a bandwidth of 0.01. The model density is calculated using Eq. (5) with the same parameters

The data comprise n = (n_0, n_1), where n_0 is the count of partitions with no Cq value (no molecules) and n_1 is the count of high censored Cq values (one molecule); \( \mathbf{c}=\left({c}_1,\dots, {c}_{n_2}\right) \), the other Cq values; and \( \mathbf{x}=\left({x}_1,\dots, {x}_{n_2}\right) \) and \( \mathbf{y}=\left({y}_1,\dots, {y}_{n_2}\right) \), the x- and y-locations of the associated partitions. The only other input required is the threshold value h. All the data can be extracted from the dPCR experiment itself.

This model is a very good approximation as shown in Fig. 2, where a density plot of simulated data (using Eqs. (3) and (4)) almost entirely obscures the associated density plot (Eq. (5)) of the model with the same parameters.

As a Bayesian approach is being used, prior distributions are required for the parameters. We used non-informative uniform priors for μ and ν; where suitable prior information is available, gamma-distributed priors could be used instead. Prior information about the efficiency E can be provided by preliminary quantitative PCR (qPCR) experiments. However, such estimates of qPCR efficiency are imprecise [11] and need not be the same as the dPCR efficiency. For E we used qPCR estimates of efficiency to select a prior of Beta(190, 10), which has a mean of 0.95 and 95 % of its mass between 0.92 and 0.98. For E_1 lower values are more plausible, and so a prior of Beta(18, 2) was used, with mean 0.9 and 95 % of its mass between 0.74 and 0.99. For the remaining parameters there is little prior information, and so we use the non-informative priors π(A) ∝ A^−1, π(b_x) ∝ 1 and π(b_y) ∝ 1.

Single-strand adjustment

There are various reasons why E_1 could differ from E. For example, the initial molecule may be more difficult to amplify than the replicates of the target sequence because of its extra length or because of degradation. On the other hand, efficiency may decrease as PCR reagents become degraded or are consumed.

Another possible factor is the presence of single-stranded DNA. In the first amplification cycle a single-stranded molecule can only be converted to a double-stranded molecule, which is equivalent to a double-stranded molecule failing to amplify. The standard method counts single-stranded DNA as full molecules, so if they are present it will tend to overestimate μ [12]. If the difference between E and E_1 is entirely because of this issue, then the estimated parameters Ê and Ê_1 can be used to estimate the proportion of single-stranded DNA. This leads to an estimate for μ given by multiplying its original estimate \( \widehat{\mu} \) by an adjustment factor that is between 0.5 and 1:

$$ {\widehat{\mu}}_{\mathrm{adj}}=\frac{\widehat{\mu}\ }{2}\left(1+ \min \left(1,\frac{{\widehat{E}}_1}{\widehat{E}}\right)\right) $$
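In code the adjustment is a one-liner (illustrative Python; the estimates Ê and Ê_1 would come from the fitted model, and the example values below are hypothetical):

```python
def single_strand_adjusted(mu_hat, e_hat, e1_hat):
    """Adjust the estimate of mu under the assumption that the whole gap
    between E_1 and E is due to single-stranded molecules, each of which
    the standard analysis counts as a full molecule (Eq. 6)."""
    factor = 0.5 * (1.0 + min(1.0, e1_hat / e_hat))  # between 0.5 and 1
    return mu_hat * factor

# e.g. if E_1 is half of E, the adjustment factor is 0.75
adjusted = single_strand_adjusted(2.0, 0.9, 0.45)  # 2.0 * 0.75 = 1.5
```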


Experimental data

Full experimental details are given in [13]. Data were generated in an experiment performed by LGC on a BioMark 48.770 machine made by Fluidigm Corporation. The raw data produced by this experiment comprised fluorescence measurements made at the end of each of the 40 cycles for each partition on several chips. The chips contained 48 panels, each with 770 partitions arranged in 70 rows and 11 columns. The raw data were converted into Cq values for positive partitions by the 'Fluidigm Digital PCR Analysis' software using an algorithm that is not publicly available. The Cq data are provided in Additional file 3.

The experiment was performed using 3 types of DNA: attenuated genomic DNA (gDNA), virulent gDNA and linearised plasmid DNA. The attenuated type was M. tuberculosis (MTb) H37Ra gDNA, while the virulent type was MTb H37Rv gDNA. These were both sourced from ATCC and have lengths of 4,419,977 bp and 4,411,532 bp respectively. The plasmid DNA comprised a genetic construct containing the full sequences of the 16S rRNA and rpoB genes of MTb H37Rv, synthesised and inserted into a pUC19 plasmid vector. It had a length of 8486 bp. We shall refer to these types as A, V and P respectively. Assays Jiang_16S and UCL_16S were used for the amplification of the 16S gene and their primers are described in [14] and [15], while assays GN_rpoB1 and GN_rpoB2 were used for the amplification of the rpoB gene and their primers were designed using Primer Express (Applied Biosystems). Both targets were present once in the genome of each of the different DNA types.

There were 4 mastermixes, but only Gene Expression Mastermix (Life Technologies) was used for the present analysis. There were three dilutions (identified as 2A, 2B and 3). True values for their concentrations were not available. There were three replications of each combination of dilution, DNA type and assay, with each DNA type tested on a different chip. The fluorescent marker used was FAM and the passive reference ROX was used to normalise the measurements. ‘No template control’ panels were included and showed no issues. See [13] for more information, including the MIQE checklist [16].


Computation

Numerical methods are required to perform analysis using the model we have described. We have produced the software package edpcr for the software platform R, which was used to perform the analyses and create the plots in this paper. R can be freely downloaded from [17] and the package can be installed from within R using the command install.packages("edpcr", repos = "").

The first stage of analysis is to calculate the mode of the posterior distribution via an optimisation algorithm. If a frequentist analysis is performed rather than a Bayesian one, then no prior distributions are used and the mode is the maximum likelihood estimate (MLE) of the parameters. For different initial values of E and E_1 the optimisation algorithm may find different local maxima. We used the combinations {E, E_1} = {0.9, 0.9}, {0.9, 0.75}, {0.85, 0.9}, {0.85, 0.75}, {0.9, 0.6} and {0.8, 0.9}, with the local mode having the highest posterior density selected as the overall mode.

A sample from the posterior density may then be produced by the random walk Metropolis algorithm [18]. The Geweke diagnostic [19] can be used to help confirm convergence.
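The sampler itself is generic; below is a minimal random-walk Metropolis sketch in Python (illustrative only, run here on a toy one-dimensional target rather than the full posterior of Eq. (5)):

```python
import math
import random

def metropolis(log_post, x0, step, n_iter, seed=0):
    """Random-walk Metropolis: propose x' = x + N(0, step**2) and
    accept with probability min(1, post(x') / post(x))."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        # accept or reject using the log-posterior difference
        if lp_prop >= lp or rng.random() < math.exp(lp_prop - lp):
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# toy target: standard normal log-density (up to a constant)
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 20000)
```

In practice the chain would be run on the full parameter vector θ, with convergence checked by diagnostics such as Geweke's.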

For more information on the method of analysis see Additional file 4.

Results and discussion

Figure 3 contains plots of the Cq data and density plots of the detrended Cq data for 3 data sets. Each density plot is overlaid by the density function of the model using the posterior mode parameter estimates. The data sets are for different dilutions and molecule types, but all use the Jiang_16S assay. The model fits the data sets well, though less well at the highest concentration, dilution 2A.

Fig. 3

Plots of Cq data (left) and density plots of Cq data with fitted model (right). Model fit (red) shows posterior mode parameters. Data are for dilution 2A and type A (a), dilution 2B and type V (b) and dilution 3 and type P (c). The assay is Jiang_16S. Density plots (blue) are for detrended data (defined immediately after Eq. (5)) and use Gaussian kernels with bandwidth 0.01. They include vertical lines representing the proportion of negative partitions

Figure 4 provides the posterior mode estimates of the parameters μ, ν, E and E_1. The estimates are generally similar for the same type and dilution; however, there are outliers, which are clearest for the E and E_1 estimates.

Fig. 4

Posterior modes for μ, ν, E and E_1. Ordered first by type (A, V, P), then dilution (2A, 2B, 3), then assay (Jiang_16S, UCL_16S, GN_rpoB1 and GN_rpoB2), then replicate (1-3). The line at ν = 1 represents the value ν takes for the Poisson distribution. The unfilled circles represent the outlier with a low mode for E and the cross represents the outlier with a low mode for E_1

If E_1 is close to 1 then the peak corresponding to one molecule failing to amplify in the first cycle is small, in which case there is a risk that the parameter estimates will misalign the peaks. This appears to be the situation for the point plotted as a cross, from the type P, dilution 2A data, which has a very low E_1 estimate while the other estimates within the same group are close to 1. In that case a local mode for different starting values of E and E_1 was consistent with the estimates for the other type P, dilution 2A data sets.

There is another outlier, plotted as an unfilled circle, for which the estimate of E is very low and the estimate of E_1 is high. This appears to be due to some low outliers in the data causing a misfitting of the model: when the mode was recalculated with them censored (treated as a count of more than one molecule), the estimates were consistent with those of the other data sets.

Other possible causes of misfitting are the presence of trends that are not taken into account by the simple linear trends of the model and changes in variability. These features are typically present in the data sets to varying degrees (see changes in gradient and variability in Fig. 3), but misfitting is avoided through reasonably informative priors for E and E 1.

Deviation from the Poisson distribution

The estimates of ν shown in Fig. 4 provide insight into deviation from the Poisson distribution. They appear to depend on the DNA type, but not on other variables such as dilution. The medians of the estimates for the different DNA types, which are insensitive to the outliers, are 1.02 for A, 1.14 for V and 0.95 for P. Excluding the 2 outliers, the differences of the means from 1 are strongly significant for V and P, with t-test p-values of 10^−4 and 0.02 respectively, but not for A, where the t-test p-value is 0.45.

Figure 5 illustrates the theoretical relative bias that would exist for an estimate of μ using the standard method if ν actually took the value 0.8, 0.9, 1.1 or 1.2. It is a plot of (μ − [−log(P(X = 0; μ, ν))])/μ, where −log(P(X = 0; μ, ν)) is the estimate of the mean based on the Poisson distribution when the distribution is actually CMP with mean μ and dispersion ν. For example, if the true concentration is μ = 2.0 and ν = 0.8 then P(X = 0; μ, ν) = 0.162, so the count-based estimate of μ is −log(0.162) = 1.82 and the relative bias is 0.09. The differences between the count-based and Cq-based estimates of the concentration in Fig. 6 are consistent with these results with respect to size and sign. The issue of outliers due to misfitting the model of Cq data (see earlier discussion) does not affect the count-based estimates.
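This bias curve is reproducible from the definitions alone; the sketch below is illustrative Python, not the authors' code, with λ found by a simple bisection so that the CMP mean equals μ:

```python
import math

def _cmp_stats(lam, nu, trunc=400):
    """Truncated CMP distribution: return (P(X = 0), mean)."""
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(trunc)]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    mean = sum(x * wx for x, wx in enumerate(w)) / z
    return w[0] / z, mean

def relative_bias(mu, nu):
    """(mu - [-log P(X=0; mu, nu)]) / mu: relative bias of the Poisson
    count-based estimate when the true distribution is CMP."""
    lo, hi = 1e-9, 1e6
    for _ in range(100):  # bisect on lambda until the CMP mean is mu
        lam = math.sqrt(lo * hi)
        _, mean = _cmp_stats(lam, nu)
        lo, hi = (lam, hi) if mean < mu else (lo, lam)
    p0, _ = _cmp_stats(math.sqrt(lo * hi), nu)
    return (mu - (-math.log(p0))) / mu
```

For example, at μ = 2.0 this gives a bias of roughly +0.09 for ν = 0.8, zero for ν = 1, and the sign flips for ν > 1.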

Fig. 5

Relative bias in μ calculated assuming the Poisson distribution, for different values of ν. The plot is of (μ − [−log(P(X = 0; μ, ν))])/μ

Fig. 6

Estimates of μ ordered by dilution and then type. Count-based estimates use Eq. (1), Cq-based estimates are the posterior modes for Eq. (5), and the Cq-based estimates with single-strand adjustment are adjusted based on Eq. (6). Outliers correspond to the outliers in Fig. 4

Figure 5 indicates that the theoretical bias due to deviation from the Poisson distribution ranges up to about 5 % over the range of concentrations examined. We have not examined concentrations above about 2.5 molecules/partition, but if similar deviation from the Poisson distribution exists at higher concentrations then, based on the theoretical analysis, the bias should increase. We cannot rule out greater deviation from the Poisson distribution, with more substantial biases, for other experiments and DNA types. In particular, it is not possible to predict how big the bias will be for droplet digital PCR (ddPCR), as the different method of partitioning the sample could lead to different values of ν. For ddPCR the Cq approach is impractical for estimating ν, but the bias could be detected indirectly by examining how the estimate of μ varies across a range of dilutions. The retroflex method in [8] is not based on count data, and the effect of deviation from the Poisson distribution on it is likely to be more limited.


Efficiency estimates

Figure 4 shows that the estimates for E are very consistent, while the estimates for E_1 appear to depend on type and dilution. The biggest effect is from type. For P the estimates of E_1 are close to the respective estimates of E, while the estimates of E_1 for type A are lower and those for type V are lower still. It makes sense that the E_1 estimates are highest for the plasmid DNA type, as it is much shorter than the others. It is not possible to determine from the data alone how much of the difference between E and E_1 is due to the single-strand issue. As an illustration of the effect on the estimates if the full difference is due to the single-strand issue, the adjusted estimates of μ (using Eq. (6)) are presented in Fig. 6 along with the count-based and Cq-based estimates.

MCMC results

MCMC samples can provide information about the posterior distribution beyond the mode, such as estimates of the mean, variance and quantiles. They can also indicate a poor fit to the data, such as for the outlier with the low estimate of E (the unfilled circle in Fig. 4). Figure 7 contains trace and density plots for μ, ν, E and E_1 from MCMC samples for that outlier data set and one of the other two replicates. The trace plots for the outlier move significantly away from the initial values, which shows that the optimisation algorithm failed to find the mode and suggests the possibility of poor data leading to poor estimates of the mode. Examination of the data shows low outliers, and if these are censored as discussed earlier then the problems are resolved.

Fig. 7

Trace and density plots for MCMC samples of posterior distributions. Data are for two of the replicate experiments for dilution 2A, type V, assay GN_rpoB1. a Plots for replicate 2. b Plots for replicate 3, for which the parameter estimates are the outliers represented by the unfilled circles in Figs. 4 and 6


Simulation study

Data were simulated using the rcq function from the edpcr package. This uses Eqs. (3) and (4), with Cq values calculated based on exponential growth between the cycles immediately before and after the threshold was crossed. One hundred data sets were simulated for each combination of μ = 0.5, 1.5, 2.5 or 3.5, E_1 = 0.9 or 0.75 and ν = 0.8, 1 or 1.2. Each data set was for 770 partitions. Posterior modes for uniform priors (equivalent to MLEs) were found using Eq. (5), but excluding the linear trend parameters b_x and b_y.

Results are presented in Table 1. For μ = 0.5 and 1.5 the Cq-based estimates consistently have lower bias (their means are closer to the true values) and lower standard deviations than the count-based estimates, except for μ = 1.5 and ν = 1.2, where the standard deviation is higher. The bias and standard deviation are generally better for μ = 2.5 and generally worse for μ = 3.5. This indicates good performance for concentrations up to about 2.5 copies/partition.

Table 1 Sample means and standard deviations of μ, ν and adjustment factor estimates for simulated data


Conclusions

The standard method of dPCR analysis uses only count data. In this paper we have introduced a new method of analysis that also uses the Cq data often produced by dPCR experiments. This method estimates the concentration of the target without the standard assumption that the initial distribution of the target is Poisson. It also produces estimates of E_1, the 1st cycle amplification efficiency, and E, the subsequent amplification efficiency. Low estimates of E may be useful for identifying problems with the reagents. If the estimate of E_1 is less than that of E, then this may indicate that the sample is a mixture of single-stranded and double-stranded DNA. The estimates can be used to take this into account via Eq. (6).

Our Cq-based method was validated by simulation and demonstrated by applying it to data from different types and dilutions of DNA. Deviation from the Poisson distribution was identified for the virulent gDNA and the plasmid DNA. We believe this is the first time that the Poisson distribution assumption has been tested. The bias from assuming the Poisson distribution was small for this particular case, and on that basis the count-based method is still appropriate for routine applications, where the potential bias could reasonably be ignored. We do recommend caution with respect to estimates involving high concentrations, as the theoretical calculations suggest the bias could be higher (see Fig. 5). Where highly accurate quantitation is required and the count-based method is used, an uncertainty contribution for the bias should be considered in any overall uncertainty. Further use of the Cq-based method and other research are required to better establish the size of the biases across different types of sample and experiment, and to determine when the Cq-based method may be preferable. If the Cq-based method is used, it should only be used for concentrations up to about 2.5 molecules/partition. The method could also be used as a diagnostic to identify whether ν is close to 1 and whether E_1 is close to E.



Abbreviations

CMP: Conway-Maxwell Poisson
Cq: Quantification cycle
DNA: Deoxyribonucleic acid
dPCR: digital PCR
gDNA: genomic DNA
MCMC: Markov chain Monte Carlo
MLE: Maximum likelihood estimate
PCR: Polymerase chain reaction
qPCR: quantitative PCR
RNA: Ribonucleic acid


References

1. Sykes PJ, Neoh SH, Morley AA, et al. Quantitation of targets for PCR by use of limiting dilution. Biotechniques. 1992;13:444–9.
2. Sanders R, Huggett JF, Bushell CA, Cowen S, Scott DJ, Foy CA. Evaluation of digital PCR for absolute DNA quantification. Anal Chem. 2011;83:6474–84.
3. Hindson BJ, Ness KD, Masquelier DA, Belgrader P, Heredia NJ, Makarewicz AJ, et al. High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal Chem. 2011;83:8604–10.
4. Pohl G, Shih IM. Principle and applications of digital PCR. Expert Rev Mol Diagn. 2004;4(1):41–7.
5. Bhat S, Emslie KR. Digital polymerase chain reaction for characterisation of DNA reference materials. Biomol Detect Quantif. 2016;
6. Dube S, Qin J, Ramakrishnan R. Mathematical analysis of copy number variation in a DNA sample using digital PCR on a nanofluidic device. PLoS One. 2008;3(8):e2876. doi:10.1371/journal.pone.0002876.
7. Huggett J, Bustin SA. Standardisation and reporting for nucleic acid quantification. Accred Qual Assur. 2011;16:399–405.
8. Mojtahedi M, D'herouel AF, Huang S. Direct elicitation of template concentration from quantification cycle (Cq) distributions in digital PCR. Nucleic Acids Res. 2014. doi:10.1093/nar/gku603.
9. Shmueli G, Minka TP, Kadane JB, Borle S, Boatwright P. A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution. J R Stat Soc Ser C Appl Stat. 2005;54:127–42.
10. Harris TE. The theory of branching processes. North Chelmsford: Courier Corporation; 2002.
11. Svec D, Tichopad A, Novosadova V, Pfaffl MW, Kubista M. How good is a PCR efficiency estimate: recommendations for precise and robust qPCR efficiency assessments. Biomol Detect Quantif. 2015;3:9–16.
12. Bhat S, Curach N, Mostyn T, Bains GS, Griffiths KR, Emslie KR. Comparison of methods for accurate quantification of DNA mass concentration with traceability to the international system of units. Anal Chem. 2010;82:7185–92.
13. Devonshire AS, Honeyborne I, Gutteridge A, Whale AS, Nixon G, Wilson P, Jones G, Mchugh TD, Foy CA, Huggett JF. Highly reproducible absolute quantification of Mycobacterium tuberculosis complex by digital PCR. Anal Chem. 2015;87:3706–13.
14. Jiang LJ, Wu WJ, Wu H, Ryang SS, Zhou J, Wu W, Li T, Guo J, Wang HH, Lu SH, Li YJ. Rapid detection and monitoring therapeutic efficacy of Mycobacterium tuberculosis complex using a novel real-time assay. Microbiol Biotechnol. 2012;22:1301–6.
15. Honeyborne I, McHugh TD, Phillips PP, Bannoo S, Bateson A, Carroll N, Perrin FM, Ronacher K, Wright L, van Helden PD, Walzl G, Gillespie SH. Molecular bacterial load assay, a culture-free biomarker for rapid and accurate quantification of sputum Mycobacterium tuberculosis bacillary load during treatment. J Clin Microbiol. 2011;49:3905–11.
16. Huggett JF, Foy CA, Benes V, Emslie K, Garson JA, Haynes R, et al. The digital MIQE guidelines: minimum information for publication of quantitative digital PCR experiments. Clin Chem. 2013;59:892–902.
17. The R Project for Statistical Computing. Accessed 16 Sept 2015.
18. Roberts GO, Gelman A, Gilks WR. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann Appl Probab. 1997;7:110–20.
19. Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Vol 196. Minneapolis: Federal Reserve Bank of Minneapolis, Research Dept; 1991.



Acknowledgements

The authors thank Alison Devonshire for providing the data and scientific guidance, Rebecca Sanders, Jim Huggett and Simon Cowen for providing scientific guidance, Roberto Puch-Solis for manuscript review and the anonymous referees for their helpful comments.


Funding

This paper has been funded by the European Metrology Research Programme (EMRP) project NEW 04 "Novel mathematical and statistical approaches to uncertainty evaluation". The EMRP is jointly funded by the EMRP participating countries within EURAMET (European Association of National Metrology Institutes) and the European Union. The funding body had no further role.

Availability of data and materials

The dataset analysed during the current study is included in the supplementary information files.

The software package edpcr can be downloaded from within R using the command install.packages("edpcr", repos = "").

Authors’ contributions

PW developed the model, created the software package, performed the analysis and wrote the paper and Additional files 1, 2 and 3. SE conceived and directed the project, provided support with developing and packaging the software and reviewed the manuscript. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information



Corresponding author

Correspondence to Philip J. Wilson.

Additional files

Additional file 1:

Description and illustration of normalisation method. (PDF 335 kb)

Additional file 2:

Derivation of Eq. (5). (PDF 405 kb)

Additional file 3:

Collated Cq data GE. The data used in this paper, collated from dPCR output data files. (CSV 7402 kb)

Additional file 4:

Algorithm used for computational analysis. (PDF 393 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.



Cite this article

Wilson, P.J., Ellison, S.L.R. Extending digital PCR analysis by modelling quantification cycle data. BMC Bioinformatics 17, 421 (2016).
