Nucleosome organization referenced landscapes filtration
NUCOME filters nucleosome organization samples based on six quality control measurements. The definitions of sequencing coverage, AA/TT/AT dinucleotide frequency, nucleosomal DNA length, nucleosome depletion at TSSs and enrichment of well-positioned nucleosome arrays at DHSs have been described previously [18]. Nucleosome fuzziness downstream of TSSs was defined as the coefficient of variance (CV) of the distance between + 1, + 2, + 3 and + 4 nucleosomes, while the position of nucleosomes was defined as the local maximum positions. For the measurement of nucleosomal DNA length, samples were ranked by the deviation of the measurement values to 146 bp in ascending order. For the measurement of nucleosome fuzziness downstream TSSs, samples were ranked based on the measurement values in ascending order. For each of the other four measurements, samples were ranked based on the measurement values in descending order. For each sample, a ‘Pass’ or ‘Fail’ label was assigned for each QC measurement, except for sequencing coverage. The criteria of ‘Pass’ and ‘Fail’ for AA/TT/AT dinucleotide frequency, nucleosomal DNA length, nucleosome depletion at TSSs and enrichment of well-positioned nucleosome arrays at DHSs were defined previously [18]. For nucleosome fuzziness downstream of TSSs, samples with values lower than 0.4 were defined as ‘Pass’, while others were defined as ‘Fail’.
For each cell or tissue type, the high-quality sample was selected via three indicators. First, the total number of ‘Pass’ QC measurements for each sample was defined as the first rank indicator (Ngood in Fig. 1a). Then, we count the number of QC measurements for each sample such that the rank quantile better than the average rank quantile for the samples comes from the same cell type as the second indicator (Cbetter in Fig. 1a). Third, we calculated the sum of rank quantiles of all QC measurements for each sample as the third indicator (Rbetter in Fig. 1a). The final ranking order for the samples was determined by ordering Ngood, Cbetter and Rbetter successively. For each cell or tissue type, the sample with the top rank was selected as the ‘High Quality Sample’ (Additional file 1: S1 in Fig. 1a).
For each cell or tissue type, the ‘Pooled Sample’ was pooled by samples with high correlation that come from the same cell type. First, we calculated nucleosome occupancy in 1 kb windows throughout the whole genome for each sample. Then, the Pearson correlation coefficients between samples from the same cell type and treatment condition was calculated. Samples with Pearson correlation coefficients higher than 0.6 were pooled as the ‘Pooled Sample’ (Additional file 1: S2 in Fig. 1a).
‘High Quality Sample’ was defined as the referenced nucleosome organization landscape for each cell type and treatment condition, unless the ‘High Quality Sample’ was included in the samples which were used to build the ‘Pooled Sample’. Samples with coverage lower than five-fold were filtered out.
Nucleosome organization features on TSS and DHS
Here, we defined several nucleosome organization features on TSSs and DHSs and calculated these features on individual gene or DHS for each referenced nucleosome organization map.
The nucleosome array score downstream of the TSS was calculated as the average nucleosome array profile signal from the TSS to the downstream 1 kb region versus the average signal of the whole genome for the given sample. The nucleosome depletion level at a TSS was calculated as the degree of nucleosome occupancy deficiency at the TSS. The maximum nucleosome occupancy at the central 200 bp bin of the site was defined as Ncenter, and the maximum nucleosome occupancy surrounding the central bin spanning 200 bp on both sides was defined as Nbackground. If Ncenter was larger than Nbackground, the nucleosome depletion level of the site was assigned to 0. Otherwise, the nucleosome depletion level of the site was calculated as 1 – Ncenter/Nbackground. The nucleosome occupancy at the TSS was calculated as the average nucleosome occupancy signal at the central 100 bp bin of the TSS.
The + 1 nucleosome, the number of nucleosomes, the linker length and the standard deviation of linker length were calculated as follows. First, the nucleosome occupancy profile for each TSS was scanned and smoothed from the TSS to the downstream 1 kb bin. The local maximal signal was detected, and the relative distance to TSS was defined as candidate of nucleosome positions. If the adjacent nucleosome position was closer than 100 bp, the nucleosome position with higher nucleosome occupancy was maintained. Finally, the nucleosome position downstream of the TSS was defined. The first nucleosome position was defined as the + 1 nucleosome. The linker length was calculated as the average distance between adjacent nucleosome positions subtracting 147 bp, which is the standardized nucleosomal DNA length. The standard deviation of linker length and nucleosome count can be calculated after the nucleosome positions are defined. If the + 1 nucleosome was undetectable, the + 1 nucleosome, linker length and standard deviation of linker length were stored as -1, so that they can be easily recognized. If the number of nucleosomes was less than 2, the linker length and standard deviation of linker length were stored as -1. If the number of nucleosomes was less than 3, the standard deviation of linker length was stored as -1.
The nucleosome organization features on DHSs were calculated in the same way as TSSs. There are two major differences. The nucleosome array score upstream of the DHS was also calculated. The -1 nucleosome, the number of nucleosomes, the linker length and the standard deviation of the linker length 1 kb upstream were calculated by scanning the nucleosome occupancy profile 1 kb upstream of the DHS.
Prediction of TF binding sites
We collected TF ChIP-seq data from the Cistrome DB database [26] as the actual TF binding profile. Qualified ChIP-seq samples were selected by the QC measurements provided in the database. We scanned the entire genome for significant motif hits for each TF by using BINOCh [27]. For each TF ChIP-seq sample, motif hits overlapping with TF binding peaks were defined as the positive samples (bound sites). The negative samples were defined as other motif hits that did not overlap with any detected peaks in the ChIP-seq sample (unbound sites). All positive and negative samples were used for evaluating the TF binding prediction. A ten-fold cross validation strategy was applied during the model training and testing. The motif score calculated by BINOCh was the first predictor. Nucleosome organization features acted as the second predictor to improve the prediction performance. Here, we include the nucleosome depletion level and nucleosome occupancy on motif hits and the nucleosome array score flanking the motif hits in the prediction model. First, the nucleosome depletion level of the motif hits was calculated in the same way as the nucleosome depletion level on the TSS, while the given site was replaced by the center of the motif hits. Then, nucleosome occupancy on motif hits was calculated as the average nucleosome occupancy at the central 100 bp bin of the sites. The nucleosome array score on the upstream or downstream 1 kb bin of the site was calculated as the average nucleosome array profile signal upstream or downstream of the 1 kb bin of the given site.
A logistic linear regression conducted by a ‘glm’ function in R was performed to match the actual TF binding status (bound or unbound) by scoring motif score and nucleosome organization information. The R package ‘pROC’ was used to evaluate the prediction power by calculating the true-positive rate and true-negative rate with different thresholds, and area under the curve (AUC) scores calculated by the ‘auc’ function in R were used as an indicator to evaluate the performance of the TF binding prediction. The prediction improvement was calculated as AUC (motif score + nucleosome organization information) – AUC (motif score).