Samples
Reference panel of sorted cell-types
For the methylome-wide investigation (N = 9) and for the targeted reference panel (N = 7) we used DNA extracted from sorted blood cells from US adult volunteer subjects that had been obtained from Virginia Blood Services. The different cell populations were isolated by positive selection using EasySep™ kits (Stemcell Technologies), which apply magnetic nanoparticles coated with antibodies against a particular surface antigen (CD molecules). Specifically, we isolated CD3+, CD14+, CD15+ and CD19+ cells corresponding to T-cells, monocytes, granulocytes, and B-cells, respectively.
Samples with known cell-type mixtures
In the lieu of bulk samples with known cell-type proportions we used genomic DNA from the sorted cells, from five individuals, to create bulk samples with known mixtures, in duplicates. The DNA from the sorted cells were pooled in predetermined proportions as indicated in Additional file 2: Table S3.
Samples with known methylation levels and bulk blood samples
We used fully (> 98%) CpG methylated human genomic DNA (cat# SD1131, Thermo Scientific) in combination with unmethylated DNA from the same individual to create samples with known methylation levels. First, an aliquot (400 ng) of the fully methylated DNA was used for whole genome amplification with the REPLI-g Mini kit (Qiagen) to create a corresponding non-methylated (0%) samples with the same genetic sequence. Next, the fully methylated and the non-methylated samples were pooled (Additional file 1: Figure S2) to create samples with 100%, 75%, 50%, 25% and 0% methylation levels.
In addition, we also included DNA extracted from buffy coat of whole blood, in duplicates, that are used to evaluate the methylation assessment for the included amplicons.
Blanks without DNA for quality control purposes
Finally, we included two “blanks”, which are identical to the samples with known methylation levels with the exception that DNA is excluded (i.e., replaced with the buffer in which the DNA is dissolved). Traditionally, the main purpose of blanks are to serve as negative controls to detect contamination. However, here they will also allow us to determine a background noise level for the sequencing depth required, which is critical for proper quality control.
Methylome-wide investigation to identify cell-type specific methylation markers
Methylation sites with unique cell-type specific methylation profiles for four of the most common leucocyte types in blood (CD3+, CD14+, CD15+ and CD19+ cells corresponding to T-cells, monocytes, granulocytes, and B-cells, respectively) were identified using methylation profiles from the sorted blood cells from nine individuals. As previously described [12], the methylation profiles were generated using methyl-binding domain sequencing (MBD-seq) [7, 13], which allows for assessment of nearly all 28 million CpG sites in the human genome [10]. The methylation data was processed with the RaMWAS Bioconductor package [14]. RaMWAS quantifies methylation by estimating the number of fragments covering a CpG site using a non-parametric estimator of the fragment size distribution [15]. To identify cell-type specific makers, we calculated priority scores for each CpG by counting the number of samples with fragment coverage [15] higher than 0.3 for the target cell-type and fragment coverage lower than 0.3 for all other cell-types. Thus, the maximum priority score was 9 individuals × 4 cell-types = 36.
Primer design and evaluation
For 53 loci that, according to the priority score described above, were determined to have unique methylation profiles for different cell-types, we designed primers for targeted amplicon bisulfite-sequencing using the Juno platform (Fluidigm). First, primers were designed for each site of interest using the Pyromark Assay Design 2.0 software (Qiagen). Second, platform-specific adapters were added and oligo properties such as melting temperature, hairpins, dimers, and genomic mismatches were evaluated using the OligoAnalyzer tool (Integrated DNA Technologies). Third, to detect non-specificity across the bisulfite-converted genome promising designs were further evaluated in silico via BiSearch [16, 17]. Finally, prior to using the primers on the Juno platform they were evaluated off the platform using the same cycling conditions (Additional file 2: Table S4) in 10ul PCR reactions, followed by evaluation of the amplification profile on a Bioanalyzer (Agilent).
Assaying DNA methylation levels with targeted amplicon bisulfite sequencing
Genomic DNA was bisulfite converted using the EZ DNA Methylation-Lightning Kit (#D5030/D5032; company) followed by quantification with NanoDrop Spectrophotometer (ThermoFisher Scientific). A protocol for targeted assessment of DNA methylation using microfluidics has been developed previously [18]. Here we capitalize on this approach to develop a modified protocol for the Juno platform (Fluidigm) to generate targeted amplicon libraries for next-generation sequencing. For further details see the Additional files 1 and 2. Finally, the concentration of (pools of) libraries were measured with Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific), pooled in equal molarities and sequenced with 75 bp single-end reads on a NextSeq500 platform (Illumina). The sequencing data was processed with BS-Seeker [19] using Bowtie2 [20] for alignment while allowing for one mismatch.
All amplicons were on target with an average per target read coverage of 2877 reads per sample (SD = 823). Across the entire genome, only five additional loci received a read coverage greater than 10, where the highest average coverage observed was 70. Thus, the amount of reads observed outside of the targeted loci were well below the reads observed on target and were even lower than the average number of reads observed on target for the blank controls (i.e., lower than the background noise level).
To call methylation levels for each investigated CpG (i.e., the percentage methylation) the number of observed reads with a cytosine at the targeted site was divided by the total number of reads at this location. If the number of reads for a specific amplicon, from a particular samples, did not exceed five times the background level (i.e., the average number of reads for this site observed for the blanks), the reads for this amplicon were excluded from further analysis.
Estimating cell-type proportions
Cell-type proportions for each bulk sample were estimated with the help of a reference panel, that is generated with DNA from sorted cells, using either ordinary least squares regression (OLS) [9] or the weighted least squares regression (WLS) approach implemented in the R-packageMuSiC [11]. Although the estimations differ, both methods essentially use the same model. It is important to note that a separate regression analysis is performed for each subject. For each regression analysis, we regress the bulk methylation levels of each selected CpG on the mean methylation levels of the corresponding CpG in the reference panel. Intuitively speaking, the model assumes that the amount of methylation in bulk tissue is a weighted sum of the average methylation levels for each cell-type, with weights being equal to the cell-type proportion in bulk. However, as the methylation levels of these cell-types are unknown, the approach uses the mean methylation levels of the reference panel. The reference panel will not perfectly match the true cell-type methylation profiles for each subject and the model accounts for this by allowing residuals. Alternatively, the model can be explained by stating that it correlates the bulk methylation levels with the methylation levels of each cell-type in the reference panel. The higher the correlation with a specific cell-type, the more cells of that type is present in bulk. For example, if a reference cell-type would be perfectly correlated with bulk methylation levels, the model would infer that all cells are of that type. In contrast, if bulk methylation levels would be uncorrelated with a given reference cell-type, the model would infer that this cell-type is absent from the bulk tissue.
More formally we can write the model as:
$$Y_{i}^{bulk} = \sum\limits_{c = 1}^{{n_{c} }} {p_{i}^{c} R^{c} + E_{i} }$$
where the m × 1 vector \(Y_{i}^{bulk}\) represents the bulk methylation measurements for subject i of the m selected CpGs, the m × 1 vector Rc contains the mean methylation of the m CpGs across the reference panel for cell-type c, and Ei is a m × 1 vector with residuals. The OLS estimator simply minimizes the sum of square differences between the observed and predicted values. Whereas in the OLS estimation each of the m CpGs have the same weight, in WLS some CpGs affect the estimation more than others. Specifically, MuSiC assigns higher weights to CpGs with consistent methylation levels across subjects (i.e., low cross-subject variance) and high variability across cell types (i.e., are more informative). In addition to using weights there are other estimation differences. For example, MuSiC employs a tree-guided procedure that recursively zooms in on closely related cell-types. This may contribute to the observation that MuSiC may outperform other methods [21, 22] especially when cell-types are closely related [11].