CORNAS: coverage-dependent RNA-Seq analysis of gene expression data without biological replicates

Background In current statistical methods for calling differentially expressed genes in RNA-Seq experiments, the assumption is that an adjusted observed gene count represents an unknown true gene count. This adjustment usually consists of a normalization step to account for heterogeneous sample library sizes, and then the resulting normalized gene counts are used as input for parametric or non-parametric differential gene expression tests. A distribution of true gene counts, each with a different probability, can result in the same observed gene count. Importantly, sequencing coverage information is currently not explicitly incorporated into any of the statistical models used for RNA-Seq analysis. Results We developed a fast Bayesian method which uses the sequencing coverage information determined from the concentration of an RNA sample to estimate the posterior distribution of a true gene count. Our method has better or comparable performance compared to NOISeq and GFOLD, according to the results from simulations and experiments with real unreplicated data. We incorporated a previously unused sequencing coverage parameter into a procedure for differential gene expression analysis with RNA-Seq data. Conclusions Our results suggest that our method can be used to overcome analytical bottlenecks in experiments with limited number of replicates and low sequencing coverage. The method is implemented in CORNAS (Coverage-dependent RNA-Seq), and is available at https://github.com/joel-lzb/CORNAS. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1974-4) contains supplementary material, which is available to authorized users.

Loading volume for sequencing. The final product of PCR would yield ≈ 40 µL of 200 nM (nanoMolar). The amount then gets diluted 20,000 times to a loading amount of 120 µL for the flow cell. The 40 µL is first diluted 100 times to 2-3 nM, and then further diluted 200 times as aliquots.
1.2 What is the total mRNA found in a sample?
To identify a sample's sequencing coverage, we will need to first identify what is the size of the mRNA population to compute the sample's sequenced proportion from. While it is ideal to obtain the number of total mRNA available prior to library preparation, the biases mentioned above make it difficult, if not impossible, to allow accurate estimates of the total mRNA in the sample. We reasoned that the amount of cDNA produced at the step prior to PCR would provide us the most reliable means for computation because: 1. The fragmentation step causes homogeneity of the cDNA molecule sizes.
2. The volume and concentration after PCR is known.
3. The number of PCR cycles is known.

What is the original amount of cDNA before PCR?
We can calculate this quantity since the cDNA molecules would have similar molecular weights after size selection (≈ 500 nt). The PCR final volume of 40 µL has 200 nM (200 nmol/L) concentration of 500 bp cDNA molecules, which translates to 4.818 × 10 12 of cDNA molecules. Assuming complete replication efficiency, a cDNA molecule would be amplified 2 14 times for 14 cycles of PCR. Therefore the number of cDNA before PCR, in an ideal case where all cDNA are amplifiied, is 4.818×10 12 2 14 = 294, 067, 382.

The true coverage of RNA-Seq experiments
PCR is required to improve the chance of one cDNA to be picked for sequencing by having it copied ten thousand times. The chance of picking the original amount for each cDNA species prior to PCR should be very high if the dilutions are perfectly homogenous after PCR. If we work with the assumptions above as the ideal case, we are able to calculate the coverage of sequencing, based on the number of sequenced reads obtained over the number of cDNAs available before PCR.
Define the random variable X which has a beta distribution with mean α/(α + β). We can use X to model the deviation from perfect amplification by considering the random variable 2 − X. Let k be the number of PCR cycles, and N0 the initial number of DNA fragments. Assuming perfect amplification, the number of fragments after k cycles of amplification is If we assume amplification efficacy in each cycle is independent of one another, then the actual number of fragments after k cycles is Thus, the expected relative effect of variation in amplification efficiency is given by For the variance of Sa/Sp, we have The following table gives the expected proportion of fragments under a beta model of amplification variation relative to perfect amplification.  Figure S3: DEG set agreement between methods in analysing 12 comparisons between three human liver and four kidney samples. The axes represents the number of DEG called for each method, while the circle size approximates the intersect size. Two types of sample loading concentrations were used, 3 pM (high) and 1.5 pM (low). Details can be found in Table S1.  Figure S4: CORNAS sensitivity against false positive rates (FPR) for data simulated to have 100% 95%, 90%, 85% and 80% PCR amplification efficiencies, facetted according to the expected coverage estimates at 100% PCR amplification efficiency (0.5, 0.25, 0.1 and 0.01).