Materials
Data sources
In this study, data are provided by ReMap 2018 [21]. ReMap endeavors to identify and characterize regulatory regions through a large-scale integrative analysis of DNA-binding protein experiments. The 2018 human update uniformly annotated and processed 3180 ChIP-seq experiments, including some biological replicates, creating a catalogue of 35.5 million peaks (after merging) for 485 TRs in a variety of cell types and tissues. The regions of interest, or Cis-Regulatory Modules (CRMs), selected for this study are defined as regions binding at least two different regulatory proteins across all the cell lines and tissues of ReMap, in order to mitigate variation coming from non-standardized sources. A CRM can contain from two to a few thousand peaks, organized in one or more Cis-Regulatory Element(s). In ReMap, this adds up to 1.6 million CRMs; however, current estimates suggest that only on the order of a few hundred thousand of them are biologically significant.
Data selection
We used as a query a subset of the aforementioned CRMs, keeping the 65,535 CRMs with at least 100 peaks across all cell lines, to focus on the densest genomic regions. We processed only a subset of representative cell lines and selected only certain relevant TRs, to reduce the sparsity of the resulting tensor representations. The list of selected sources is provided in Additional file 10: Table S1. Our goal was to consider TRs with high biological significance, comparable abundances, and interesting combinations. In practice, for each cell line, we take the TRs with the most experiments; however, if a selected TR has a known collaborator further down the list, said collaborator may take the place of a previously selected but isolated TR (e.g., MYC/MAX).
Autoencoder model
Artificial neural networks (ANNs) are assemblies of neurons, which are logistic units outputting a result dependent on a linear combination of their inputs. In a network, the output of one layer of neurons is fed to the next layer. The weights of each neuron are learned by backpropagation. More specifically, an autoencoder is an ANN whose goal is to learn, for each provided example, a compressed representation sufficient to rebuild it in the most efficient manner, which entails discarding signal noise [31]. Applications of autoencoders include dimensionality reduction, anomaly detection, information retrieval and image processing.
Here we performed a lossy compression of our CRM representations, i.e., we transformed them into shorter vectors capable of returning similar information. When performing a lossy compression, noise and other non-information are the first elements to be lost, but so are fine-grained details. More interestingly, anomalies (in our case atypical peaks) are lost because no regularity involving them is found. Compressing also introduces artifacts, which in our case are phantom peaks the model was expecting to see.
As any compression algorithm implicitly maps the compressed vector into a feature space, and learning such mappings based on certain criteria (i.e., minimized loss) is a quintessential machine learning task, there is a close connection between machine learning and compression. Deep (convolutional) autoencoders are particularly suited to this task [32] and can be tailored efficiently to variations of the problem such as group anomaly detection [33]. The rebuilt image is not a cleaned image but a compressed one, unlike in denoising autoencoders [34]; those cannot be used here since we have no ground truth, i.e., no a priori information on which peaks are good or not.
Existing anomaly detection approaches (see a recent review [35]) solve slightly different problems, and ours (anomaly detection based on combinations of sources/dimensions) is not directly comparable. The closest existing relative is the detection of anomalous vertices in a dynamic graph [36], which gives precedent to our approach of assigning a score to each vertex at each time step depending on its behavior (see a recent review [37]). Here we use an autoencoder to learn such a score, for which the use of graph convolution is precedented [38], although instead of edges we have multi-view bags of items. Another related approach is multivariate time series anomaly detection, but here we seek to label anomalous features, not anomalous points.
Data representation
Each putative CRM is represented as a 3D tensor of peak presence \(T \in \mathbb{R}^{X \times Y \times Z}\), containing a representation of the ChIP-seq peaks falling into this region: the x, y, z dimensions are respectively the nucleotide position, the experiment/series ID, and the Transcription Factor involved in the ChIP-seq. Each cell line is analysed separately. The value at each position of the tensor is 1 if there is a peak, 0 otherwise. CRMs longer than 3200 bp are truncated, and those shorter are padded with zeros (3.2 kbp was the 9th decile of length). The use of an additional third dimension for parallel integration of different data views is precedented, even in bioinformatics, and there is precedent for the use of tensor decompositions on such representations [39]. Our autoencoder approach inscribes itself in this lineage.
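As an illustration, a minimal sketch of how such a binary tensor could be built from a list of peaks is shown below (the function and variable names are hypothetical, not taken from the atyPeak code):

```python
# Minimal sketch (not the authors' code): building the binary peak-presence
# tensor for one CRM, with the layout (position, dataset, TR) described above.
import numpy as np

CRM_LENGTH = 3200  # CRMs longer than this are truncated, shorter ones zero-padded

def build_crm_tensor(peaks, n_datasets, n_trs, crm_start):
    """peaks: list of (start, end, dataset_idx, tr_idx) tuples in genomic coordinates."""
    T = np.zeros((CRM_LENGTH, n_datasets, n_trs), dtype=np.float32)
    for start, end, d, tr in peaks:
        lo = max(0, start - crm_start)
        hi = min(CRM_LENGTH, end - crm_start)
        if hi > lo:
            T[lo:hi, d, tr] = 1.0   # 1 where a peak is present, 0 elsewhere
    return T
```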
We then downscale the tensor by a factor of 10 along the X axis ("squishing"), since the data has low granularity along that axis, to allow the use of smaller, easier-to-train convolutional kernels. Also, to counteract the lower rebuilt values at the margins of the tensor in CNNs (Convolutional Neural Networks), we add a padding of meaningless zeros at the beginning and end of the X axis, with a length of twice the convolutional kernels' length on each side.
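A possible implementation of the squishing and padding steps is sketched below; the exact pooling operation used for the downscaling is not specified here, so the max-pooling shown is an assumption:

```python
# Sketch of the downscaling ("squishing") and zero-padding steps, assuming the
# tensor layout from build_crm_tensor above; the pooling choice is an assumption.
import numpy as np

def squish_and_pad(T, factor=10, kernel_size=20):
    x, y, z = T.shape
    # Downscale along X by max-pooling blocks of `factor` positions
    squished = T[: (x // factor) * factor].reshape(x // factor, factor, y, z).max(axis=1)
    # Pad with zeros over twice the kernel length on each side of the X axis
    pad = 2 * kernel_size
    return np.pad(squished, ((pad, pad), (0, 0), (0, 0)), mode="constant")
```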
However, CNNs have trouble training on sparse data. Encoder-decoders handle sparse data better when combined with a sparse training strategy [40], where the density of the tensors used in training is tweaked during said training. This inspired our crumbing approach (Additional file 8: Fig. S9), applied to real biological data only: to help the model learn in spite of the sparsity, for each tensor element with a nonzero value v, we add 0.1*v at each position along the same Y or Z axis as a hint. Crumbing is cumulative.
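The sketch below illustrates one possible reading of the crumbing operation; the exact positions receiving the hint are our interpretation of the description above:

```python
# Hedged sketch of "crumbing": for each nonzero value v at (x, y, z), 0.1*v is
# added at the same position along the rest of the dataset (Y) axis for that TR
# and along the rest of the TR (Z) axis for that dataset. Additions are
# cumulative; the element itself is not treated specially in this simplification.
import numpy as np

def add_crumbs(T, fraction=0.1):
    crumbs = np.zeros_like(T)
    for x, y, z in zip(*np.nonzero(T)):
        v = T[x, y, z]
        crumbs[x, :, z] += fraction * v   # hint across all datasets for this TR
        crumbs[x, y, :] += fraction * v   # hint across all TRs for this dataset
    return T + crumbs
```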
Model architecture
The structure of the model is detailed in Fig. 1b. Our model has two parts: an encoder creating a latent representation and a decoder retrieving the original tensor. As with all autoencoders, the model is trained to try to rebuild the original CRM representation as its output. The full model parameters are available on GitHub. Our model was implemented using Keras 2.3 [41], with Tensorflow 1.15 and NumPy 1.18.1 [42].
Convolutional encoding
The CRM representations are viewed through sliding convolutional filters, to focus on correlations between the TRs and datasets. A convolutional filter gets as input a slice of a matrix and outputs a weighted sum of its elements, with the weights forming the filter proper. The first layers are two successive convolutions with two different types of kernels (combinations of datasets, then combinations of TRs).
Let n be the number of TRs in the cell line, m the number of datasets and k the size of the convolutional kernels. The kernel shapes are (k, m, 1, channels = 1) for the first convolution and (k, 1, n, F) for the second, where F is the number of filters in the previous layer. As there is no ordering to the datasets or TRs, we perform a depthwise convolution and read the entire dimension at once. The default k is 20. We use only one layer per dimension. We use few kernels, fewer than the size of the later Dense layers, creating a bottleneck [43], but larger Dense layers still improve rebuilding (Fig. 3).
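A minimal Keras sketch of these two stacked convolutions is given below; only the kernel shapes and the regularisation value follow the text, while the filter counts, padding and the use of an l2 penalty are assumptions:

```python
# Sketch of the two stacked convolutions over the (X, datasets, TRs, 1) tensor.
from tensorflow.keras import layers, regularizers, models

def build_encoder(x_len, m_datasets, n_trs, k=20, n_filters=16):
    inp = layers.Input(shape=(x_len, m_datasets, n_trs, 1))
    # First convolution: combinations of datasets (kernel reads the full dataset axis)
    h = layers.Conv3D(n_filters, kernel_size=(k, m_datasets, 1),
                      activation="relu",
                      kernel_regularizer=regularizers.l2(2.5e-3))(inp)
    # Second convolution: combinations of TRs over the first latent space
    h = layers.Conv3D(n_filters, kernel_size=(k, 1, n_trs),
                      activation="relu",
                      kernel_regularizer=regularizers.l2(2.5e-3))(h)
    return models.Model(inp, h)
```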
The combinations are learned over a short window across the region, given by the variety and stride of the convolutional kernels. Convolutional filters have a kernel regularisation of 2.5E−3 by default, and Dense layers (see below) have a low Dropout regularisation (10%), except for the encoded layer, which has none so it can specialize. We observed that a stacked approach processing one dimension at a time can lessen the training problems associated with a large number of dimensions.
Convolutional filters are known to be useful in finding combinations of elements across dimensions, including in biological sequences [44]. Multi-view integration is traditionally done by using different strides for the filters, or by processing each view separately followed by a feature fusion [23]. In contrast, as our dimensions are incomparable, we express a hierarchy between them by integrating dataset combinations first, learning a first latent space which is then passed to another convolutional layer, which learns its own latent space based on TR combinations (across the other dimension) of the values of the first latent space.
Integrative layers and decoding
The convolutional layers are followed by 4 regular (Dense) integrative layers, to learn complex combinations. On each layer, only the last dimension (filters) carries weights, resulting in Time-Distributed layers with no communication along the X dimension. The ReLU activation function is used.
We obtain an encoded dimension in the fourth Dense layer. For the decoding, we consider each element of the non-human-readable encoded dimension as a latent variable. A first layer of convolutional filters reads the entire dimension at once to produce a first decoding, and a second and final layer has one filter per source (TR-dataset pair). This is done because, unlike in classical images, there is no order to the features. The final layers reshape the result back to the original tensor shape.
There is no communication along the X axis, unlike in NLP models such as Transformers or LSTMs, as we focus on local combinations. However, along the other axis of the encoded dimension, the encoding layer has access to the state of all other learned neurons, making this partly reminiscent of a transposed attention mechanism [45]. Custom time-based layers and constraints could be added here. This is not necessary to work with large, overlapping ChIP-seq peaks, but might be needed to integrate the new ChIP-exo and DAP-seq data of ReMap 2020.
As such, even though the full tensor represents the CRM, the latent learning is focused on local combinations analogous to CREs (local clusters), within a window the size of the convolutional kernels.
Loss used
The loss used when training the model is the Mean Squared Error, or L2 loss. Hence, when the model adds phantoms from the same correlation group, it must lower the value of the original peaks. This forces a compromise, unlike an L1 loss, since \(\left\| \begin{pmatrix} 0 \\ 2 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right\|_{2}^{2} < \left\| \begin{pmatrix} 0 \\ 2 \end{pmatrix} - \begin{pmatrix} 2 \\ 0 \end{pmatrix} \right\|_{2}^{2}\). Using an L2 loss on an autoencoder has previously been shown to be effective for denoising even when no cleaned images, only corrupted/noisy versions, are available [46].
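The inequality can be checked numerically; the snippet below is only a worked example of the arithmetic, not part of the pipeline:

```python
# L2 favors the compromise (1, 1) over the concentrated error (2, 0) when the
# target is (0, 2): the squared errors are 2 and 8 respectively.
import numpy as np

target = np.array([0.0, 2.0])
compromise, concentrated = np.array([1.0, 1.0]), np.array([2.0, 0.0])
l2 = lambda a, b: np.sum((a - b) ** 2)
print(l2(target, compromise), l2(target, concentrated))  # 2.0 < 8.0
```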
The best results were obtained with the Adam ("Adaptive learning rate for sparse data") optimizer [47], which is effective on problems featuring large data and very noisy and/or sparse gradients; it adapts the learning rate based on gradient moments. Our base learning rate for the Adam optimizer is 1E−4, which is 0.1× its default. We keep the model with the lowest loss across all epochs of the training process. Training is stopped when the loss stops diminishing, with a patience of 5 and a minimum delta of 2.5E−4 (chosen empirically) for the decrease to be considered significant. Batch size is 48 CRMs, with 10 batches per epoch on artificial data and 48 on real data. We can also use a weighted loss by applying a weight to the loss of each dataset or TR separately.
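A training configuration matching these hyperparameters could look as follows (the monitored quantity and the use of Keras' EarlyStopping callback are assumptions; `model` and `X_train` are placeholders):

```python
# Sketch of a training setup with Adam at 1e-4, patience 5 and min_delta 2.5e-4.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def train_autoencoder(model, X_train):
    """model: assembled Keras autoencoder; X_train: stacked CRM tensors (placeholders)."""
    model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse")
    early_stop = EarlyStopping(monitor="loss", patience=5, min_delta=2.5e-4,
                               restore_best_weights=True)  # keep the lowest-loss model seen
    model.fit(X_train, X_train, batch_size=48, epochs=200, callbacks=[early_stop])
    return model
```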
Anomaly score
Finally, each peak gets an anomaly score based on the autoencoder reconstruction error. The better the peak according to the model, the higher the score. Such an approach has precedent in signal processing [48].
We define an anomaly score to compare a tensor X to its rebuilding by a model noted h(X). We have A the anomaly score tensor defined as:
$$A\left[ x,y,z \right] = \begin{cases} 0 & \text{if } X[x,y,z] = 0 \\[4pt] 1 - \dfrac{X[x,y,z] - h\left( X \right)[x,y,z]}{X[x,y,z]} & \text{otherwise.} \end{cases}$$
Dividing by the original value accounts for the potential crumbing. The score of each peak is the maximum value in A across all nucleotides of the peak. This is necessary to correct cases where peaks in close vicinity get a high score only on the parts where they overlap, and because peaks smaller than the convolutional filter's length would otherwise get a lower rebuilt score. Peaks can sometimes be divided between two (or more) of our 3200-bp windows, getting one score per rebuilt matrix: we merge them by assigning the mean of these scores.
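A simplified sketch of the score computation, assuming the tensor layout used above, is given below (peak bookkeeping and window merging are only outlined):

```python
# Sketch of the per-peak anomaly score from the tensor X and its rebuilding h(X).
import numpy as np

def anomaly_tensor(X, hX):
    A = np.zeros_like(X)
    nz = X != 0
    A[nz] = 1.0 - (X[nz] - hX[nz]) / X[nz]   # equals h(X)/X where a peak is present
    return A

def peak_score(A, peak):
    """peak: (x_start, x_end, dataset_idx, tr_idx); score = max over its nucleotides."""
    x0, x1, d, tr = peak
    return A[x0:x1, d, tr].max()

# Peaks split across 3200-bp windows receive the mean of their per-window scores.
```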
Normalization of correlation group biases
This normalization aims to correct bias in rebuilding based on the learned correlation groups. We calculate a weight for each source (meaning each {dataset, TR} pair), based on the following steps, to be applied to all anomaly scores computed for this source.
μX(T) designates the 2D matrix of the mean across the X axis (region length) of the 3D tensor T, and maxX(T) its maximum. M[s] designates the value of the matrix M for the current source. h(T) is the tensor obtained as output when the tensor T is passed as input to a trained atyPeak model. We note Ft a "full CRM", a 3D tensor representation where every possible source with nonzero abundance is present along its entire length with a value of 1. Let F = μX(Ft). The correlation group of a source can be estimated (see Interpretability) by preparing a CRM representation, denoted Us, containing only a peak for the given source along its entire length. We obtain such a request mask as R = μX(h(Us)).
The first step corrects for intra-group bias in rebuilding, due to learning bias (usually a too high learning rate) or abundance differences within the group; in that case, the sum of the rebuilt CRM will be biased too. We get the first weight \(k_{1} = \sum h\left( T \right) / \sum T\).
For inter-group bias, recall that the rebuilt value of a peak (its value in h(T)) is proportional to how complete its correlation group is. But groups have different sizes (cf. background groups versus precise groups). Our goal is that peaks get the same score when their group's current completeness, relative to its average completeness, is the same. We define the occupancy of a CRM for the current source as M*maxX(Us)*IW, where M = R/max(R) and IW is the matrix of intra-group weights for all sources; these are pointwise multiplications, not matrix products. We use a Monte Carlo approach, iterating over a portion of all CRMs and calculating the mean of all their occupancies μ(θ), excluding zeroes so as to keep only the correlators when the source is actually present and self-correct for relative loneliness. The second weight is k2 = θF/μ(θ), where θF is the occupancy calculated on Ft.
Thirdly, if a source is in several groups, phantoms from several groups can accumulate, which is not accounted for at step 2. We evaluate how much the sources that are not in the request will contribute. We calculate the mean and max negative occupancies (η) exactly as above, except that we use a negative mask Mn instead of the request R, where Mn = (F − R/R[s])*F[s]. We weight by the average presence of these other peaks to get k3 = 1 − ((h(F)[s]/F[s])*(μ(η)/ηF)).
The final weight is k = k1 * k2 * k3. To prevent overcorrection of sources that were not learned by the model, all k are capped at 10. For now, having a CRE with more peaks than average results in higher rebuilt values, as we consider that, for CREs in particular, more TRs mark denser/better CREs. This assumption could be changed here by penalizing values above the corrected average quality.
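A hedged sketch of how these weights could be combined is shown below; only k1, the product and the cap at 10 are explicit in the text, and k2 and k3 are assumed to be computed beforehand following the occupancy procedure described above:

```python
# Sketch of the per-source weight combination; T and hT are numpy tensors.
def intra_group_weight(T, hT):
    return hT.sum() / T.sum()            # k1 = sum(h(T)) / sum(T)

def final_weight(k1, k2, k3, cap=10.0):
    return min(k1 * k2 * k3, cap)        # applied to all anomaly scores of the source
```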
The final step consists of centering and scaling/normalizing the scores by TR, under the assumption that no TR is inherently of better quality than the others; having more correlators (i.e., less sparse data for the same dimension, more datasets per TR) remains a benefit. For each peak, if its score at this step is s, its final score is \(s_{f} = 750*\left( 1 + \frac{s - \mu_{TR}}{2\sigma_{TR}} \right)\), where μTR and σTR are respectively the mean and standard deviation of the scores observed at the previous step for this source's TR. Note that scores are usually not normally distributed.
We center around 750 to use a larger part of the score scale for cases where the local cluster (CRE) is less complete than average, which are the cases we want to flag. If you choose to use a non-normalized score, compare each score to the median score for its source. This normalization is a step in the right direction, independently moving the score averages of different TRs closer together (Additional file 15: Fig. S13), but warrants further research.
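The per-TR normalization itself is straightforward; a minimal sketch, assuming scores are grouped by TR beforehand, is given below:

```python
# Sketch of the per-TR centering/scaling step; `scores` maps each TR name to the
# array of scores observed for that TR at the previous step (hypothetical layout).
import numpy as np

def normalize_by_tr(scores):
    normalized = {}
    for tr, s in scores.items():
        mu, sigma = s.mean(), s.std()
        normalized[tr] = 750 * (1 + (s - mu) / (2 * sigma))
    return normalized
```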
Training and interpreting the model
We provide scripts to directly process a BED file in ReMap format with diagnostic figures and usage instructions in the Readme.
Impact of data characteristics/scaling on required information budget
As discussed in Results, the aggressivity of the compression is determined by the relative information budget: the ratio between the quantity of information to be learned in the data (itself a function of the number of TR and/or dataset combinations) and the model's entropic capacity (how much information can be stored in the compressed representation). Adequate hyperparameter tuning is a widespread problem in deep learning, as a higher information budget will predictably increase the model's Vapnik–Chervonenkis dimension and make it more prone to overfitting.
In our case, the entropic capacity is mostly increased by increasing the dimension of all Dense layers and the number of convolutional filters on one side (more is higher), but also by diminishing the learning rate (LR) on the other, which was often necessary to reach lower losses even with all other parameters constant. We saw in Fig. 3 that, to achieve the same aggressivity, the required entropic capacity scales up with the quantity of information in the data. Figure 2B, on the contrary, is an example of overprecision. Larger dimensions (more datasets and/or TRs in the database) require a higher information budget, even with no additional information (Additional file 3: Fig. S3B). However, lower learning rates are more of a necessary condition than larger models to reach higher precisions with higher dimensions (Additional file 3: Fig. S3C). Learning larger correlation groups (composed of more sources) is also harder.
The most frequent sources are learned in more precise groups, while the rarer ones often appear grouped together in more "background" groups. All groups, but especially the latter, are not expected to be fully complete (meaning all the sources are present) in the real data. More generally, sources that are comparatively too rare (empirically a 1/5 difference) may be completely disregarded by the model, as they are seen as systematic noise. All these tendencies are more visible in high-dimensionality examples or those with higher imbalances, and can be alleviated by using a weighted loss: Additional file 4: Fig. S4A shows that dimensions with a higher weight will be focused on and get more precise groups.
We also show in Additional file 4: Fig. S4B that the model is capable of learning overlapping groups (where the groups are "G1" and "G1 + G2" instead of "G1" and "G2" as in Additional file 1: Fig. S1). However, this required learning adjustments, with higher weighting on the rarest dimensions to direct the learning and, more importantly, early stopping. With a variety of other parameters, peaks in G2 produce only marginal phantoms for G1, or we get too precise or non-homogeneous groups (Additional file 4: Fig. S4C). Note that G1 will often not produce phantoms of G2 (although it should, and does sometimes happen, as in HeLa), so be careful to look at the estimated groups for all sources when interpreting the model. Relatedly, even in non-overlapping groups the watermark (i.e., the lonely control peak that we added at the same position in most of the CRMs and that does not particularly correlate with other groups, see Artificial data) does not create phantoms anywhere else. However, watermark phantoms are produced by peaks from (certain sources in) the G1 and G2 groups. The rarer of those two groups often erroneously produces stronger phantoms, a tendency reduced when this rarer group is weighted more.
Loss and training
Due to the high dimensionality and sparsity of our data, we used lower learning rates (Additional file 9: Table S2) and large batches to counter overfitting and batch effects. We also used early stopping, in most cases stopping even before a low loss plateau is reached, to prevent the model from adding bias in a futile attempt to improve.
With different random seeds, we observed over several runs small but real deviations in scores and estimated correlation groups. As with most machine learning approaches, we recommend averaging over several runs (2–3) for both these applications.
Training the model takes around 10–30 min per cell line for smaller models (HeLa) and 1–5 h for larger ones (K562 and MCF7). However, reading and processing the source BED files accounts for a large part of this time and the approach is not CPU bound. Times were measured on an i7-7820HQ with an SSD drive. GPU use did not significantly improve running times. Production of the resulting BED file after training is also time consuming (around 12 h for K562 but 40 min for HeLa), so it is advised to check some rebuilt matrices before proceeding.
Interpretability
To interpret the latent variables in the encoded dimension (Fig. 1, Additional file 6: Fig. S6 and Additional file 14: Fig. S10), we use a gradient ascent method to build a hypothetical CRM tensor that would maximally activate each individual row of the encoded dimension layer [49].
We seek \(TM = \left\{ \mathop{\mathrm{argmax}}\limits_{T} \, a_{E,i}(T), \; \forall i \in \left[ 1;\#\left( E \right) \right] \right\}\). We add some blur at regular intervals on the Y and Z axes during gradient ascent, for more natural-looking results. By default we use a learning rate of 1, 50 gradient ascent steps, and a blur standard deviation of (σx, σy, σz) = (0.2, 1E−2, 1E−2) applied every 5 steps. For each latent variable of the encoded dimension, the gradient is calculated across the entire length. As the Dense layers are not connected across the X axis, we are considering local combinations. Since this is not the next-to-last layer, the final result is a complex non-linear combination of those variables; it should instead be seen as a highlight of the model's focus during learning.
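The sketch below illustrates the principle of this activation maximization using the TF 2 eager API (the paper's stack is Keras 2.3 / TF 1.15, so this is an illustration rather than the original implementation; the gradient normalization and the omission of the blur are simplifications):

```python
# Sketch of activation maximization for one latent unit of the encoder.
import tensorflow as tf

def maximize_latent_unit(encoder, unit_index, input_shape, steps=50, lr=1.0):
    T = tf.Variable(tf.random.uniform((1,) + input_shape))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = tf.reduce_mean(encoder(T)[..., unit_index])
        grad = tape.gradient(activation, T)
        T.assign_add(lr * grad / (tf.norm(grad) + 1e-8))  # gradient ascent step
        # every 5 steps, a small Gaussian blur would be applied on the Y/Z axes
    return T.numpy()[0]
```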
Another type of interpretability is based on the same procedure used in the normalization (Additional file 5: Fig. S11). We create a CRM representation Us that is empty except for one peak for a given source along its entire length. By looking at R = h(Us), we see which phantoms are added by the model, and deduce that these are part of the same correlation group as the source we are considering. Due to the peculiarities mentioned above when learning overlapping groups, look at the estimated groups of all sources, as a source A may impact the score of B without B appearing in A's estimated group. Passing R does not always result in values of 1, due to complex nonlinearities, but it is a good approximation.
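In practice this can be done as sketched below; the threshold used to call a phantom part of the group is an assumption:

```python
# Sketch of the "request" interpretability: build Us with a single source present
# along the whole region, pass it through the trained model, and read the phantoms.
import numpy as np

def estimated_correlation_group(model, x_len, m_datasets, n_trs,
                                dataset_idx, tr_idx, threshold=0.1):
    Us = np.zeros((1, x_len, m_datasets, n_trs, 1), dtype=np.float32)
    Us[0, :, dataset_idx, tr_idx, 0] = 1.0           # one peak along the full length
    R = model.predict(Us)[0, ..., 0].mean(axis=0)    # mean over the X axis -> (m, n)
    return np.argwhere(R > threshold)                # sources rebuilt above the threshold
```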
Note that a learned correlation group of "ABCDE" does not necessarily mean that A, B, C, D and E are always found together, as seen in the artificial data where the model learned the entire G1 and G2 groups, which are almost never found complete in the artificial CRMs. As such, rarer sources can be grouped in more background groups without necessarily forming a complex. For both types of interpretability, negative weights are likely due to sum averaging and should not be focused on. Indeed, the rebuilt tensor is not simply the sum of the estimated correlation groups of the sources present.
Q-score quantifies the quality of the reconstruction
Autoencoders are usually evaluated based on the reconstruction error [50]. Here, we instead propose a new method more adapted to the problem at hand. To rigorously choose the information budget, we propose to verify that the model correctly learns generated pairwise correlations. On one hand, if two dimensions (datasets or TFs) correlate, finding them both together in the region of interest should result in a higher score for them than when they are found alone; on the other hand, if they do not correlate, this should have no impact. To estimate this we design a Q-score, which is lower in better models.
The Q-score is defined as:
$$\forall \left( {i,j} \right) \in \Lambda ,\;Q = \sum\limits_{i,j} {\sqrt {A_{i} *A_{j} } *\left( {P + R} \right)}$$
where
$$C = \left[ {R\left( {i,j} \right) > \hat{\mu }\left( R \right)} \right]$$
$$\epsilon = 0.05/\# \left( \Lambda \right)$$
$$P = \left( {C - \left[ {P\left( {\mu \left( {alone} \right) = \mu \left( {both} \right)} \right) < \epsilon } \right]} \right)^{2}$$
$$R = \left( {C - \left[ {P\left( {\mu \left( {phantom} \right) = \mu \left( {none} \right)} \right) < \epsilon } \right]} \right)^{2}$$
Here, Λ is the set of all TRs and all datasets (so all possible Y and Z dimensions, excluding the X dimension of peak position) and the brackets are Iverson brackets denoting indicator variables. Note that we only compare TRs with other TRs and datasets with other datasets, because a dataset and a TR are not mutually exclusive, and issues can arise when considering a dimension that is only present as noise when another is present. For the same reason, we consider only positive correlation coefficients later.
C asserts whether the Pearson correlation coefficient between the two considered dimensions is higher than the mean correlation coefficient. It is calculated on the tensor representations of the CRMs at the nucleotide level.
For P and R, we take 10,000 CRM tensor representations T and their rebuildings h(T). For each of them, we record the values for the (i, j) dimensions of interest (averaged across the X axis). We compare the average rebuilt values in different scenarios. For P: when a peak of i was present in T, does the presence of j in the same CRM result in a higher rebuilt value for i? And for R: when i was absent, does the presence of j result in higher phantom values than when j is absent? To perform these comparisons, we use a Welch test to determine whether the means are different, with a Bonferroni correction, i.e., a p-value threshold of 0.05/#(Λ). We then weight the result by the relative abundances of the dimensions, Ai and Aj. We do not normalize the scores with the procedure discussed before, because we compare a source with its own values.
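The pairwise comparison can be sketched as follows, assuming the X-averaged rebuilt values have been collected beforehand (array names are hypothetical):

```python
# Sketch of the comparison behind P: Welch's t-test with a Bonferroni threshold;
# `rebuilt_i_with_j` / `rebuilt_i_alone` hold the X-averaged rebuilt values of
# dimension i with and without dimension j present in the CRM.
from scipy.stats import ttest_ind

def means_differ(rebuilt_i_with_j, rebuilt_i_alone, n_pairs, alpha=0.05):
    # equal_var=False gives Welch's test; the threshold is Bonferroni-corrected
    _, p_value = ttest_ind(rebuilt_i_with_j, rebuilt_i_alone, equal_var=False)
    return p_value < alpha / n_pairs
```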
Artificial data
We use artificial regions to confirm the model can discover correlation groups. They are meant to approximate real genomic CRMs, hence the generation process and parameters are based on true data.
We define a probabilistic model to generate the artificial data. The output of this model is an ensemble P of peaks, whose characteristics are their start and end, and the IDs of the TF and the experiment they belong to. Hence we have \(P = \left\{ \left( s_{i}, e_{i} \right), act_{i} \in \left\{ 0,1 \right\}, TF_{i} \in \mathbb{N}, series_{i} \in \mathbb{N} \right\}\), which is then converted into a 3D tensor representation, as explained in Data representation. The generation itself consists of three steps, detailed below and sketched in code after the third step; each step is run once per generated artificial CRM. Unless specified otherwise, all random variables used are Poisson R.V. of λ = 1. See Additional file 1: Fig. S1 for an illustration of the dimensions.
First, we place a control peak called a watermark along the entire length of the CRM, for the 1st TF in the 1st dataset, representing ubiquitous TRs. It will be very frequent but not particularly correlated with other sources, and so forms its own correlation group. It has a customizable probability of appearing (default 75%), to prevent the model from learning it and only it when it is too frequent.
Second, we place a stack of correlating peaks from different TRs and datasets, at roughly the same positions. The stack will belong to one of two or more TR "correlation groups". The groups are made by splitting the set of all TRs in two or more, or by making groups of 4. Group-choosing probabilities are equal by default but can be weighted.
Only one such group is picked per generated artificial CRM. We then pick a common center for the peaks, uniformly at random across the region. We then pick K + 1 datasets without replacement among all predetermined reliable datasets (by default, the last half of them). In these datasets we will place N + 1 peaks. For each peak to be added, we randomly select P + 1 TRs from the current correlation group, with replacement. N, K and P are random variables. For each TR selected, we separately move the center by a distance jd (uniform R.V. between −200 and +200), take a peak length L at random (L is a log-normal R.V. of μ = 250, σ = 0.25) and, finally, write the exact same peak in all the datasets selected previously. Note that since the artificial data draws peaks at random, there is a larger number of possible combinations than there usually is in real data of the same dimensionality.
Third, noise peaks are placed uniformly at random from all datasets and TRs, to represent false positives and atypical peaks, which by nature do not respect the existing correlation groups. To represent false negatives, each peak has a probability t = 1/4 of being removed at this step. Then, we randomly position F + 1 peaks (F is a R.V.) by drawing their characteristics randomly as previously. Noise cannot be placed in the watermark.
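As referenced above, the sketch below strings the three steps together; dimensions and bookkeeping are simplified, and the interpretation of the log-normal parameter μ = 250 as the median peak length is an assumption:

```python
# Hedged sketch of the three generation steps (watermark, correlated stack, noise).
# Only the distributions named in the text (Poisson lambda=1, uniform jitter in
# [-200, 200], log-normal length, 75% watermark probability, 1/4 removal rate)
# are taken from it; everything else is simplified.
import numpy as np

rng = np.random.default_rng()

def peak_length():
    return int(rng.lognormal(np.log(250), 0.25))  # assuming mu=250 is the median length

def generate_artificial_crm(region_len, n_datasets, n_trs, groups):
    peaks = []
    # Step 1: watermark peak for the 1st TF in the 1st dataset, 75% of the time
    if rng.random() < 0.75:
        peaks.append((0, region_len, 0, 0))
    # Step 2: one correlated stack drawn from a randomly chosen correlation group
    group = groups[rng.integers(len(groups))]
    center = rng.integers(region_len)
    reliable = list(range(n_datasets // 2, n_datasets))       # "last half" of datasets
    datasets = rng.choice(reliable, size=min(rng.poisson(1) + 1, len(reliable)),
                          replace=False)
    for _ in range(rng.poisson(1) + 1):                        # N + 1 peaks
        for tr in rng.choice(group, size=rng.poisson(1) + 1):  # P + 1 TRs, with replacement
            c = center + rng.integers(-200, 201)
            half = peak_length() // 2
            for d in datasets:
                peaks.append((max(0, c - half), min(region_len, c + half), d, tr))
    # Step 3: false negatives, then uniformly random noise peaks outside the watermark
    peaks = [p for p in peaks if rng.random() > 0.25]
    for _ in range(rng.poisson(1) + 1):                        # F + 1 noise peaks
        d, tr = rng.integers(n_datasets), rng.integers(n_trs)
        if (d, tr) == (0, 0):
            continue                                           # noise cannot hit the watermark
        c, half = rng.integers(region_len), peak_length() // 2
        peaks.append((max(0, c - half), min(region_len, c + half), d, tr))
    return peaks
```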