Deciphering variance in epigenomic regulators by k-mer factorization

Variation in chromatin organization across single cells can help shed important light on the mechanisms controlling gene expression, but scale, noise, and sparsity pose significant analysis challenges. Here, we develop gkm-PCA, an approach to infer variation in transcription factor (TF) activity across samples through an unsupervised analysis of the variation in the DNA sequences associated with an epigenomic mark. gkm-PCA first represents each sample as a vector of DNA word frequencies for the DNA sequence surrounding an epigenomic mark of interest, and then decomposes the resulting matrix of k-mer frequencies per sample to find hidden structure in the data. This allows both unsupervised grouping of samples and identification of the TFs that distinguish groups. Applied to single cell ATAC-seq data, gkm-PCA readily distinguished cell types, treatments, batch effects, experimental artifacts, and cycling cells. The structure within the k-mer landscape can be further related to differentially active TFs, with each variable component reflecting a set of co-varying TFs, which are often known to physically interact. For example, in K562 cells, AP-1 TFs emerge as the central determinant of variability in chromatin accessibility through their diverse interactions with other TFs and variable mRNA expression levels. We provide a theoretical basis for why cooperative TF binding (and any associated epigenomic mark) is inherently more variable than non-cooperative binding. gkm-PCA and related approaches will be valuable for gaining a mechanistic understanding of the trans determinants of chromatin variability between cells, treatments, and individuals.

including at high throughput (9, 10). 17 However, single cell epigenomics data is inherently sparse, since every locus is present at 18 only two copies per diploid cell (9), such that ascertaining the state of an individual cell is 19 challenging. One solution is to pool signals -either across cells (e.g., of the same known 20 type or a discovered cluster) (8) or across loci sharing a known trait (e.g., binding by a 21 Methods). Of the 27 significant PCs, 13 distinguished different replicates 4 ( Supplementary Fig. 2), indicating that at least some of the variability captured on these 5 PCs represents differences between batches. We excluded these PCs from subsequent 6 analysis, and tested the remaining 14 PCs, that showed primarily cell-cell variability 7 (Methods) for enriched TFs. Overall, 40.5% (167/412) of expressed TFs with known 8 motifs were associated with at least one gkm-PC, but this number may be inflated 9 because many TF binding sites are so similar. 10 We considered some of the possible causes for the cell-cell variation in the (inferred) 11 activity of TFs. In particular, TFs with variable activity may be more variably expressed 12 at the RNA level, leading to cell-cell variation at the protein level, or generally lowly 13 expressed, such that the protein level is significantly impacted by bursts of transcription. 14 (There are, of course, other options, independent of RNA or expression levels, such as 15 variation in upstream signaling molecules that affect the TF's activity.). To consider the 16 first two options, we used scRNA-seq of untreated K562 cells (23) to compare the 17 average expression levels and variability (mean corrected CV) in expression across single 18 cells for our k-mer-based "variable" and "constant" TFs. 19 We found that the TFs that were most enriched among the PCs, and hence inferred to 20 have the most variable activity, were expressed on average at lower levels than the least 21 enriched TFs (Wilcoxon rank sum test P=0.08; Supplementary Fig. 3a), but the two 1 groups had a similar mean-corrected coefficient of variation (CV) (Wilcoxon rank sum 2 test P=0.54; Supplementary Fig. 3b; Methods). Most TFs tend to have a low mean-3 corrected CV, with notable exceptions including the AP-1 proteins JUN, FOSL1, BATF, 4 and ATF3 (Supplementary Fig. 3c). 5

PCs help identify TF-TF interactions 6
Finally, we hypothesized that different TFs that are co-enriched (or co-depleted) on the 7 same PC could reflect dependencies or interactions between the activity of those TFs, 8 such as cooperative binding in a complex or through one TF rendering the sites of the 9 other accessible (Fig. 1a -bottom). However, because many TFs have very similar 10 specificities and are difficult to distinguish from their cognate motifs alone, we first 11 eliminated any motifs that closely match another more highly enriched motif (Methods gkm-PCA provides a new approach to leverage scATAC-seq data, to partition cells by 10 distinct epigenomic landscapes, and to understand their regulatory underpinning. It can 11 help identify cell types and the transcriptional regulators that mediate underlying 12 differences in chromatin. Here, we found that gkm-PCA distinguishes cell types, cycling 13 cells, and experimental artifacts, and discovered a large number of significant PCs in all 14 datasets analyzed, each appearing to represent one or more TFs. 15 One possible explanation for the variation in inferred TF activity across single cells is 16 variation in the expression of the TF between the cells, as has been previously shown by 17 scRNA-seq, RNA-FISH, and single cell protein staining (e.g. (31-33); reviewed in (34)). 18 However, we found that TFs associated with cell-cell epigenomic variability across 19 untreated K562 cells are relatively lowly expressed in all cells, but not particularly 20 variable across cells, as reflected by scRNA-seq. One possible explanation is that 21 variation would be more apparent post-transcriptionally, such as in protein translation, 1 modification, or stability, either because of direct regulation of these steps or because of 2 separation of time scales. Consistent with this possibility, low mRNA expression levels 3 generally result in more variable (noisier) protein levels (35) since transcription or decay 4 of a single mRNA results in greater fold differences in low-abundance genes. An 5 alternative explanation is that a TF would show variable binding dependent on a variable 6 co-factor, while itself not being variable (e.g. Fig. 1a -bottom). 7 The primary axes of variation in the K562 scATAC-seq data, as reflected by the gkm-8 PCs, appear to represent the combined actions of multiple TFs, often known to interact 9 physically. This may reflect cooperative binding by these TFs. Cooperative binding 10 mediated by physical interaction between TFs (Supplementary Fig. 4) or by mutual 11 competition with nucleosomes (36) results in a steeper binding curve, such that small 12 changes in concentration around the critical point result in larger changes in occupancy 13 than in a non-cooperative setting. Thus, cell-cell variability in TF concentration around 14 this point will result in higher occupancy/accessibility variability than would be expected 15 in the non-cooperative case. 16 Cooperativity may also provide some insight into the prevalence of AP-1 factors in our 17 analysis, whose binding sites were enriched in many gkm-PCs for both treatment-18 associated and cell-variable PCs. AP-1 TFs are bZIP TFs and can form a large number of 19 heterodimers with other bZIP TFs (29), some of whose motifs were also found to be 20 enriched on the same gkm-PCs as the AP-1 factors. The strong enrichment of AP-1 21 14 motifs in variable k-mer axes associated with scATAC-seq indicates that AP-1 factors 1 may themselves be associated with mediating chromatin accessibility. Indeed, it has been 2 suggested previously that AP-1 factors have pioneer activity (37). 3 A remaining challenge -present whenever motifs are used to infer TF binding -is the 4 definitive identification of causal TFs when many TFs have similar motifs and the 5 specificities of many TFs remains unknown (21). One advantage of a k-mer-based 6 approach is that much of the analysis can be done without ever knowing the identities or 7 specificities of the TFs. In this way, our knowledge deficits regarding TF binding 8 specificities are shifted from the analysis to the interpretation stage, knowing that the 9 specificities themselves can be captured in k-mer space. As we learn more about how TFs 10 function, our interpretation of the k-mer space will improve. 11 As cell numbers grow, it is likely that additional insights can be gleaned, allowing us to 12 detect less variable TFs and TF-TF combinations. We anticipate that gkm-PCA will also 13 be useful in the study of other chromatin profiles collected across single cells (e.g., 14 scChIP-seq (8)). It can also help understand variation in chromatin organization in the 15 analysis of many bulk samples, for example, those collected across individuals in a 16 population (e.g., (2-7)). Although other k-mer based methods have been applied to study 17 of variation in cis (18), we anticipate that the unsupervised approach of gkm-PCA will be 18 useful in dissecting variation in trans. With epigenomic data of ever increasing 19 complexity, tools and approaches like those we described will continue to provide insight 20 into the regulation of chromatin. 21 A summary of the data processing steps and tools used is included in Supplementary 3 Data was obtained from the Gene Expression Omnibus, accession GSE65360. Samples 6 were demultiplexed, and reads trimmed for Nextera adaptors and mapped to the human 7 genome (hg19) using Bowtie2 (38) using paired reads (-X 2000), as described previously 8 (9). Regions of interest were defined as windows of 50 bp to either side of the 5' end of 9 mapped reads, representing the integration sites of the Tn5 transposase, merging 10 overlapping regions (which removes duplicate reads). DNA sequences were then 11 extracted from these loci using twoBitToFa (39) and scanned for k-mer content using 12 AMUSED (https://github.com/Carldeboer/AMUSED), considering both DNA strands, to 13 yield a vector of k-mer frequencies for each cell that was used in subsequent analyses, 14 including all gapped k-mers from length 2 to 8. We stopped at a length of k=8 because 15 for k>8 k-mer frequencies become very sparse when analyzing as few loci per cell as are 16 present in scATAC-seq data, although larger k may be more suitable to analysis of bulk 17 samples. Cells with fewer than 3,162 (10 3.5 ) distinct Tn5 integration loci were excluded 18 from subsequent analyses to remove dead cells and cells with poor data quality. 19 The individual cells' k-mer frequency vectors were merged and scaled so that each k-mer 20 had mean 0 and a standard deviation (SD) of 1, and this matrix was decomposed into its 21 principal components. For all analyses, PCA was done with the prcomp R function and 1 the number of significant PCs was estimated using the permutationPA function for the 2 jackstraw R package (20), tSNE was done using the tsne R package using the default 3 parameters and including only the significant PCs. Because the frequencies of k-mers of 4 varying G+C-content are so correlated to G+C content itself, the first PC often has a 5 significant G+C-content component and should be analysed carefully (e.g., GG tends to 6 occur more frequently with higher G+C-content, and so the two will be correlated and 7 both will be anticorrelated with A+T-rich k-mers). 8

Scoring cells for cell cycle signatures 9
Using the ENCODE Repli-seq data for K562 cells (19), the genome was divided into 10 replication domains using a percent signal cutoff of 25%, where any region with a signal 11 greater than this cutoff was considered a domain for the respective stage of the cell cycle. 12 ATAC-seq reads were then counted within each domain to yield a matrix of ATAC-seq 13 read counts for each domain in each cell. This matrix was scaled by the total number of 14 reads per cell, yielding a matrix of proportions of reads per domain per cell, and the ratio 15 of (G2+S1+S2+S3+S4)/G1 (termed (G2+S)/G1 above) was used to distinguish cycling 16

cells. 17
Identifying PCs that distinguish treated from untreated K562 cells 18 Every cell was "scored" by its position as it is projected onto the respective PC axis. The 19 area under the ROC curve (AUROC) statistic and rank sum P-value, representing how 20 well the projected cell positions divide the cells into treated and untreated cells, were 21 calculated, and the PCs with the AUROC furthest from 0.5 (i.e. those for which treated 1 cells are either enriched or depleted by the PC) were considered those that segregated 2 treated from untreated best. 3 Identifying TF-specific PCs 4 Ungapped 8-mer protein binding microarray Z-scores and position weight matrices 5 (PWMs) for all human TFs (inferred or directly determined) were downloaded from CIS-6 BP (21). For PWMs, gapped k-mer scores were derived by finding the maximum log-7 odds score for that k-mer in the PWM. These scores were then converted into Z-scores by 8 centering them about the median and scaling them to the median absolute deviation, 9 taking a Z-score of >3 as "bound" and leaving others as "unbound" k-mers. For PBM Z-10 scores, Z-scores between experiments for the same TF were combined using Stouffer's 11 method and those k-mers with a Z-score above 3 were considered "bound", with others 12 "unbound". 13 With this set of "bound" and "unbound" k-mers for each TF, the enrichment of each TF 14 in each PC axis was calculated using the minimum hypergeometric test (40). Briefly, the 15 bound and unbound k-mers were ranked by their PC weights and, moving in increasing 16 rank order, hypergeometric P-values were calculated representing the enrichment of k-17 mers bound by that TF amongst the top N most highly (lowly) weighted k-mers. Exact P-18 values (considering the dependence between tests) were not calculated and instead 19 multiple hypothesis testing correction using Bonferroni's method was done as if the tests 20 were independent, yielding a more conservative P-value (to minimize the number of non-21 K562s were considered (41). 4 Because many TFs share similar k-mer binding profiles and the number of k-mers 5 considered for PWM motifs was so high, these appeared to have a high false positive rate 6 and so we set the threshold for significance much lower for PWM motifs (P<10 -112 ) than 7 for 8-mer Z-scores (P<10 -2 ). (log 10 (P-values) are "inflated" with PWMs as a result of 8 common shared submotifs and a very large number of gapped k-mers; we chose these 9 cutoffs based on the "elbow" of the log-P-value distributions, which are similar at these 10 values.) To eliminate redundant motifs and select only the most enriched of each group of 11 related motifs, the most enriched (or depleted) motif was retained and any redundant 12 motifs (k-mer Pearson R > 0.5) were eliminated until all TFs were either eliminated due 13 to redundancy or selected to represent the PC. 14 Comparison to K562 single-cell RNA-seq 15 A matrix of single cell count data was downloaded from GEO (GSE90063) for wild type 16 K562 cells (23) and a negative binomial distribution was fit to the gene-wise mean and 17 variance, representing a theoretical minimum variance dependent on the mean, and this 18 was used to calculate the theoretical minimum log coefficient of variation (CV). We then 19 subtracted the theoretical minimum CV from the observed log CV per gene to get the 20 excess CV over that expected from its dependence on the mean ("mean-corrected CV"). 21 We then compared the distributions of the mean-corrected CV and expression mean for 1 TFs that had a significant enrichment among the cell-variable PCs and those that did not, 2 using the Wilcoxon rank sum test. Cell-variable PCs excluded any PCs that significantly 3 distinguished any replicate from the other two (Bonferroni-corrected Wilcoxon rank sum 4 test P < 0.1), and also excluded PC1 because of the association with G+C content. 5 TF cooperativity occupancy 6 As described previously (42), a TF's (x) fractional occupancy of a single binding site (O x ) 7 depends on its concentration ([x]) and the dissociation constant (Kd x ) of its binding site in 8 the following relationship, which represents 1 minus the probability the binding site will 9 not be bound: 10

11
If TF x can also bind with a partner y, occupancy of x depends on x binding in isolation, 12 as before, but also binding with y as a xy heterodimer, depending on the concentration 13