Somatic mutations acquired in each cell during and after embryogenesis are passed to its descendant cells, so that, within the same individual, different populations of somatic cells carry slightly different DNA, resulting in genomic mosaicism. Somatic mutations accumulate with age [1,2,3], and their accumulation is also affected by environmental factors such as tobacco smoking and alcohol consumption [4]. Somatic mutations can cause not only cancer but also diverse neurological diseases, including cortical malformations, epilepsy, intellectual disability, and neurodegeneration [5, 6]. Some somatic mutations give cells a proliferative advantage and may ultimately cause cancer, whereas others affect cellular functions without a proliferative effect. This makes the detection of mosaic mutations important for understanding the mechanisms of various diseases.
Although whole genome sequencing of bulk tissue has been used to detect somatic mutations, it is not sensitive enough to detect mosaic mutations present below 1% variant allele frequency (VAF), i.e., a heterozygous mutation present in less than 2% of the cells. This hurdle has been overcome by single-cell DNA sequencing (scDNA-seq), which has recently emerged as an efficient tool for studying mosaic mutations [7,8,9]. Since the amount of starting DNA in a single cell is very low, an additional DNA amplification step is required. There are two broad types of amplification methods: cell cloning and enzymatic Whole Genome Amplification (WGA). Either can be used, depending on the experimental design. WGA methods, unlike cell cloning, amplify DNA extracted directly from single cells, making it possible to sequence the DNA of cells that cannot be cultured, such as neurons. There are three types of WGA methods: DOP–PCR (Degenerate Oligonucleotide–Primed Polymerase Chain Reaction) [10], MDA (Multiple Displacement Amplification) [11] and MALBAC (Multiple Annealing and Looping–Based Amplification Cycles) [12], each with its own advantages and drawbacks. MDA is the most widely used WGA method owing to its longer fragment length (up to 70 kbp), low error rate during amplification, and higher fraction of the genome amplified compared with the other WGA methods [13].
MDA is an exponential amplification method in which DNA is amplified under isothermal conditions using the high-fidelity phi29 polymerase, which has proofreading activity [11]. However, phi29 polymerase is sensitive to template fragmentation occurring during cell lysis, as well as to the presence of blocking sites where DNA damage prevents amplification. This may lead to uneven coverage and to over-fragmented or completely damaged DNA, which may in turn lead to allelic imbalance, where one allele is under-amplified and the other is over-amplified. Even though MDA produces a high yield of DNA, biases such as allelic imbalance and overrepresentation of C-to-T mutations introduced during lysis can affect downstream variant detection.
Before proceeding to high-coverage Whole Genome Sequencing (WGS), it is important to select cells with successful amplification that exhibit little or no bias. Uneven amplification, whose ultimate manifestation is allelic drop-out (i.e., random and drastic overrepresentation of one allele over the other), makes it difficult to separate false positives from real somatic variants. For example, deamination of a cytosine during cell lysis on one strand of one allele is expected to yield a 25% allele frequency under balanced amplification (one of the four template strands at a diploid locus carries the change) and, based on that, can be marked as an artifact. However, if the other, non-deaminated allele is not amplified, the allele frequency of the artifact becomes 50%, making it indistinguishable from a heterozygous variant. Thus, using a cell with a high allelic drop-out rate will result in more false positives and will also reduce sensitivity, as variants in dropped-out regions cannot be discovered.
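The 25% versus 50% arithmetic above can be made explicit with a minimal sketch. It assumes an idealized model in which both strands of an allele are amplified equally and each allele contributes in proportion to a relative yield; the function name `artifact_vaf` and its parameters are hypothetical and used only for illustration.

```python
from fractions import Fraction


def artifact_vaf(damaged_allele_yield, other_allele_yield):
    """Expected allele frequency of a lesion on ONE strand of ONE allele.

    Both arguments are relative amplification yields of the two alleles
    (illustrative assumption: integers, with 0 meaning full drop-out).
    """
    total = damaged_allele_yield + other_allele_yield
    if total == 0:
        raise ValueError("locus drop-out: neither allele was amplified")
    # Only one of the two strands of the damaged allele carries the change,
    # so half of that allele's products show the artifact.
    return Fraction(damaged_allele_yield) / 2 / total


balanced = artifact_vaf(1, 1)   # both alleles amplified equally
dropout = artifact_vaf(1, 0)    # the undamaged allele is lost
print(balanced, dropout)
```

Under these assumptions the balanced case gives 1/4 and the drop-out case gives 1/2, the latter being the frequency expected for a true heterozygous variant.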
PCR can be used as a first quality control to test for the presence of several random genomic loci, usually chosen on different chromosomes, in the amplified DNA. For instance, multiplex PCR of 4 loci in a single reaction can serve as a rapid quality control, in which cells are considered to have good-quality amplification if at least 3 loci are detected [14]. However, this test is quite limited, as there might be regions outside the 4 loci with non-uniform amplification. Similarly, failing the test does not imply low amplification quality outside the 4 loci. It is therefore essential to look at the genome as a whole. A few methods for checking amplification quality in silico from WGS data have been proposed. Statistical models have been used to detect amplification bias from sequencing depth [15]. Amplification quality prior to sequencing has also been determined using power spectral density to estimate the uniformity of amplification, which can otherwise be masked by non-unique read mapping, assembly gaps and locus dropouts (where neither allele is amplified) [16], and using the median absolute pairwise difference (MAPD) [17]. However, these methods either rely on at least 20×–30× coverage or do not evaluate allelic imbalance, which is important to assess in order to ensure full coverage of all haplotypes in a cell.
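To illustrate why depth-based metrics such as MAPD capture coverage uniformity but not allelic imbalance, the following is a minimal sketch of MAPD computed from read counts in consecutive genomic bins. It is not the implementation of [17]; normalizing by the mean count and skipping empty bins are simplifying assumptions, and the function name `mapd` is illustrative.

```python
import math
from statistics import median


def mapd(bin_counts):
    """Median absolute pairwise difference of log2 ratios of adjacent bins.

    bin_counts: read counts in consecutive, equally sized genomic bins.
    Empty bins are skipped (a simplification made for this sketch).
    """
    covered = [c for c in bin_counts if c > 0]
    if len(covered) < 2:
        raise ValueError("need at least two covered bins")
    mean_count = sum(covered) / len(covered)
    # log2 ratio of each bin relative to the mean depth
    log_ratios = [math.log2(c / mean_count) for c in covered]
    # absolute difference between each pair of neighbouring bins
    diffs = [abs(b - a) for a, b in zip(log_ratios, log_ratios[1:])]
    return median(diffs)


print(mapd([10, 10, 10, 10]))   # perfectly even depth
print(mapd([10, 20, 10, 20]))   # depth alternating two-fold between bins
```

Because the metric uses total depth per bin only, a region in which one allele is lost and the other is over-amplified to the same total depth would not change its value.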
Here, we describe a method to determine the extent of allelic imbalance introduced by MDA into the amplified DNA using shallow (< 1×) sequencing coverage. The method is based on the allele frequency distribution of heterozygous SNPs, which, for a diploid genome, should be approximately Gaussian and centered around 50%. In the case of non-uniform amplification, the majority of the SNPs will instead appear homozygous, suggesting a high rate of allelic drop-out during amplification.
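The idea can be sketched in a few lines, with the caveat that this is an illustration under stated assumptions and not the method described in this paper. It assumes that known germline heterozygous SNPs have been phased, so that reads can be pooled per haplotype over runs of neighbouring SNPs (at < 1× coverage a single SNP rarely has enough reads to give a meaningful frequency). The function names, the segment size, the read threshold and the tolerance are all hypothetical.

```python
def haplotype_fractions(snp_counts, snps_per_segment=50, min_reads=10):
    """Pool reads over runs of consecutive SNPs; return haplotype-1 fractions.

    snp_counts: list of (hap1_reads, hap2_reads) per heterozygous SNP,
    ordered along a chromosome (phasing is assumed to be known).
    """
    fractions = []
    for start in range(0, len(snp_counts), snps_per_segment):
        segment = snp_counts[start:start + snps_per_segment]
        hap1 = sum(h1 for h1, _ in segment)
        hap2 = sum(h2 for _, h2 in segment)
        # Skip segments with too few reads to estimate a fraction
        if hap1 + hap2 >= min_reads:
            fractions.append(hap1 / (hap1 + hap2))
    return fractions


def homozygous_like_share(fractions, tolerance=0.1):
    """Share of segments whose fraction lies within `tolerance` of 0 or 1."""
    if not fractions:
        raise ValueError("no segment had enough reads")
    extreme = sum(1 for f in fractions if f <= tolerance or f >= 1 - tolerance)
    return extreme / len(fractions)


# Toy data: the first four SNPs lost haplotype 2, the last four are balanced.
counts = [(2, 0)] * 4 + [(1, 1)] * 4
fracs = haplotype_fractions(counts, snps_per_segment=4, min_reads=4)
print(fracs, homozygous_like_share(fracs))
```

In a well-amplified cell the pooled fractions would cluster around 0.5 and the homozygous-like share would be close to zero; a high share would point to frequent allelic drop-out.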