Distributed gene expression modelling for exploring variability in epigenetic function

Background Predictive gene expression modelling is an important tool in computational biology due to the volume of high-throughput sequencing data generated by recent consortia. However, the scope of previous studies has been restricted to a small set of cell-lines or experimental conditions due an inability to leverage distributed processing architectures for large, sharded data-sets. Results We present a distributed implementation of gene expression modelling using the MapReduce paradigm and prove that performance improves as a linear function of available processor cores. We then leverage the computational efficiency of this framework to explore the variability of epigenetic function across fifty histone modification data-sets from variety of cancerous and non-cancerous cell-lines. Conclusions We demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched -omics data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1313-1) contains supplementary material, which is available to authorized users.


Introduction
Computational frameworks for modelling gene expression as a function of gene-localised epigenetic features are becoming increasingly common in life sciences research.Previous studies by our lab [1,2,3] and others [4,5] have leveraged the statistical power of modelling genes as observations of regulatory activity (versus variables in network-based analyses [6]) to gain new insight into the function and interactions of transcription factors, histone modifications and DNA methylation.Recent applications include: inference of transcription factor roles from their respective binding motifs [7]; identification of regulatory elements responsible for differential expression patterns [8]; exploring the relationship between gene expression and chromatin organisation [2]; and comparative analysis of the transcriptome across distant species [9].
Despite the wealth of high-throughput sequencing data made available by recent large-scale consortia (e.g.ENCODE), previous predictive modelling studies have focused on a very small number of cell-lines (typically 1-to-3 [7,8]) despite the obvious benefits of broader, integrative analyses.We attribute this to the size of sequencing data, the computational complexity of naïve model implementations and widespread inability of published frameworks to leverage multi-core and/or distributed architectures.In this study, we apply the MapReduce programming paradigm [10] with domain-specific complexity optimisations to provide efficient, parallelised gene expression modelling.
A recent study by Jiang et al. has suggested that RNA-(transcriptomic) and ChIP-seq (epigenetic) data generated in the same condition (i.e. the same cell-line) introduces statistical bias and that specialised methods are necessary for accurately modeling the expression of cancer cells [11].This study investigates both of these concerns, exploiting the computational efficiency of our MapReduce implementation and conducting an integrative analysis of six histone modifications across eight dissimilar ENCODE cell-lines.Firstly, we extend our predictive modelling framework to include L 2 -regularisation, which is specifically designed to prevent overfitting to experimental noise rather than meaningful biological relationships.We then quantify the extent of condition-specific bias by training and testing models on all 64 directed, pairwise combinations of cell-lines.

ENCODE cell-line data
Matched mRNA transcript abundance (RNA-seq) and histone modification (ChIP-seq) data were downloaded from ENCODE [12] for the eight cell-lines summarised in Table 1.These dissimilar cell-lines are the full set for which data is available for the histone modifications listed in Table 2.The remaining histone modifications available from ENCODE are unsuitable for this study as they assert their functional role in non-promoter regions (e.g.H3K36me3 in the 3 -UTR).The MapReduce implementation of gene expression modelling presented in this study could be trivially extended to model more cell-lines if the data were made available.Repressor/Bivalency Facultative heterochromatin

MapReduce
MapReduce is programming paradigm which adapts the map-reduce functional programming construct for distributed and fault-tolerant data processing on commodity hardware.First developed by Google [10], MapReduce is now widely used by companies such as Amazon, Facebook, Google and Yahoo! for parallelised processing of data on terabyte and petabyte scales.Although many of the advanced features of MapReduce are less relevant in the life sciences domain [13] (e.g.optimisation of network communications, as a single sequencing data-set can often fit in main memory), the ability to seamlessly interleave sequential and parallel processing can be leveraged to reduce the processing time of gene expression modelling linearly in the number of available CPU cores.A program implemented using the MapReduce paradigm consists of a sequence, µ 1 , ρ 1 , µ 2 , ρ 2 , . . ., µ R , ρ R , of mappers (µ r ) and reducers (ρ r ) operating over key; value pairs.Formally, a MapReduce program executes the following steps on input U 0 until the final reducer (ρ R ) halts [14]: The computational benefit of MapReduce follows from its inherent parallelisability, as many instances of µ r are able to process their key; value simultaneously (likewise with ρ r , although all instances of µ r−1 must halt before any ρ r can commence).The following sections detail mapper and reducer implementations for various stages of the predictive gene expression modelling pipeline described in our previous studies [1,2,3].

Quantifying transcriptional regulatory interactions
The strength of association between a gene, m ∈ (1, 2, . . ., M ), and epigenetic feature, n ∈ (1, 2, . . ., N ), can be calculated from a ChIP-seq data-set: where R n is the set of ChIP-seq reads for n, d(r, m) is the distance (bp) separating read r from the TSS of m, and φ maps a gene-read pair to their strength of association.The maximum bin-width, d * , is typically set to 2000 to approximate the average width of ChIP-seq binding regions.Different implementations of φ are used for histone modifications (constrained sum-of-tags) versus transcription factors (exponentially decaying affinity) due to their dissimilar ChIP-seq binding profiles [2]:

Linear regression with least squares fitting
Suppose X ∈ R M ×N is a matrix of gene-level epigenetic scores, where M is the number of genes and N is the number of epigenetic variables (M N ).It is commonplace to model the relationship between X and a vector of gene-expression values, Y ∈ R M , in the following form: where β parameterises the linear relationship between gene expression and epigenetics, and ε are the genespecific errors.Such models can be fitted using ordinary least squares: yielding the following model-based predictions of gene expression, Ŷ : Given two matrices, A ∈ R X×Y and B ∈ R Y ×Z , the product C ∈ R X×Z : C = A × B can be formulated as follows: where: This formulation of matrix multiplication can be implemented using the MapReduce paradigm, as demonstrated below.
Our implementation of linear regression with least squares fitting involves decomposing β into the product A −1 B, where A ∈ R N ×N : A = MRMultiply(X , X) and B ∈ R N : B = MRMultiply(X , Y ).
The product A −1 B is calculated using standard matrix multiplication due to the unnecessary overhead of MapReduce for small matrices.

Regularised least squares regression
Regularisation is a common method of overcoming the issue of over-fitting regression-based models to experimental noise rather than meaningful biological relationships.Regularisation involves penalising the fitted parameters, β, by an empirically-tuned hyperparameter, λ: Presuming || • || is the L 2 (Euclidean) norm, our MapReduce implementation can be trivially extended to support regularisation (implementing ridge regression).Specifically, given: where I N is the N × N identity matrix, it follows that: This implementation yields the same asymptotic time complexity as ordinary least squares regression.Moreover, the existence theorem for general ridge regression demonstrates that it is always possible to tune λ (e.g. using cross-validation) to reduce the mean square error of model predictions [15,16].This is particularly important when introducing a large number of epigenetic variables into a predictive model; e.g. a systematic analysis of the roles of dozens of transcription factors from their ChIP-seq binding profiles.In this study, λ is assigned the largest possible value such that the mean 10-fold cross-validated error is within 1 standard error of the minimum (solved iteratively).Unlike the L 2 norm, the L 1 norm is often used to enforce sparsity in β under the assumption that most variables in X are practically unrelated to Y .This is less relevant in the context of gene expression modelling due to the well-established functional importance of epigenetic regulators for which ChIP-seq data is widely available.Moreover, the L 1 norm is not differentiable and thus not amenable to a closed-form MapReduce solution, although the parallelisation of iterative solutions is discussed elsewhere [17].

Histone modifications are predictive of gene expression in both cancerous and normal cell-lines
L 2 -regularised linear regression models of genome-wide mRNA transcript abundance were constructed as functions of the following histone modifications: H2A.Z, H3K4me3, H3K9ac, H3K9me3, H3K27ac and H3K27me3.For each model, the regularisation parameter, λ, was fitted using 10-fold cross-validation.The adj.R 2 performance of each model is presented in Fig. 1, along with a density plot of predicted ( Ŷ ) versus measured (Y ) transcript abundance.It is evident that histone modifications are accurate predictors of gene expression in both cancerous (top row, mean adj.R 2 = 0.608) and normal cell-lines (bottom row, mean adj.R 2 = 0.581), despite recent studies suggesting that specialised models are necessary to appropriately model cancerous cells [11].
Fig. 2 presents the results of hierarchically clustering cell-lines by mRNA transcript abundance residuals (ε = Y − Ŷ ).Interestingly, the three mesodermal derivatives GM12878, K562 and HUVEC form a distinct cluster.RNA-sequencing data for the least similar cell-line (A549) was generated at Cold Spring Harbor Laboratory whereas all other transcriptomic data was generated at the California Institute of Technology, suggesting that batch effects may be a contributing factor.It is also evident that the expression levels of many genes are consistently over-or under-estimated across all eight cell-lines.Taken together, these results indicate that genespecific residuals are non-random but instead indicative of genes that are inherently difficult to model from histone modification data.The existence of genes with transcriptional activity apparently decoupled from the local epigenetic landscape has been explored in detail in our previous study [2].

The regulatory roles of histone modifications are cell-line invariant
To assess the extent to which condition-specific bias influences the reported accuracy of gene expression predictions, we trained and tested models on all 64 directed, pairwise combinations of cell-lines.The adj.R 2 performance for these models are presented in Fig. 3(a).These results demonstrate significant non-symmetry, with dissimilarity between columns (predictions) but not rows (training observations).This demonstrates that the It is worth noting that that models trained and tested using data from a single cell-line (boldfaced along the diagonal of Fig. 3(a)) only marginally outperform models trained on dissimilar cell-lines and, moreover, that these margins are significantly less than the inherent variation between columns.These findings suggest that, in the context of gene expression modelling, training and testing models on data generated under the same experimental conditions (i.e. the same cell-line) is not a significant source of statistical bias.

MapReduce reduces the asymptotic time complexity of gene expression modelling
For M genes and a ChIP-seq data-set containing R mapped reads, the asymptotic time complexity class of generating a column X ,n of X is Θ(M R).By first preprocessing the list of gene TSS loci (invariant between epigenetic datasets) into a balanced binary search tree and observing that the vast majority of reads are within d * bp of exactly zero-or-one gene, our MapReduce implementation of calculating X ,n yields the following complexity when distributed across P MapReduce nodes: which must be completed separately for each epigenetic feature, n ∈ (1, 2, . . ., N ).

Cell
Despite recent studies presenting specialised methods for modelling cancerous gene expression [11], we find no evidence of variation in the statistical relationship between histone modifications and mRNA transcript abundance in normal-versus-cancerous cell-lines.Although our results demonstrate that some cell-lines are inherently more difficult to model than others, this trait appears to be more closely associated with the extent of cellular differentiation than carcinogenic state; e.g.models of h1-hESC embryonic stem cells perform 12% worse than terminally-differentiated GM12878 lymphoblasts.Although the NHEK (Normal Human Epidermal Keratinocytes) cell-line is both terminally-differentiated and exhibits the worst-performing models, this may be attributed to the phenotypic plasticity of keratinocytes between epithelial and mesenchymal states (necessary for wound healing).We therefore speculate that the predictability of a cell-line's genome-wide expression levels from epigenetic data is proportional to its transcriptomic rigidity; i.e. cells with signal-induced phenotypic plasticity are less likely to exhibit a stable, predictive epigenome.
Interestingly, hierarchical clustering of the 8 investigated cell-lines by mRNA transcript abundance residuals (gene-level prediction errors) was able to group the closely-related, mesodermal-derivative cell-lines GM12878, K562 and HUVEC; again, carcinogenic state appeared to have little effect on the propensity of two cell-lines to cluster together.Taken together with the observation that many genes exhibited large and consistent residuals across all cell-lines, these results suggest that gene-level residuals are non-random and, moreover, that the transcriptional activity of many genes are decoupled from their local epigenetic landscape [2].

Fig. 3 .
Fig. 3. (a) Genome-wide accuracy of mRNA transcript abundance predictions (adj.R 2 ) for models trained and tested on each pairwise combination of cell lines.These results are strikingly non-symmetric, with significant dissimilarity between columns (predictions) but not rows (training observations).(b) Distribution of each fitted model parameter, βm, across all cell-lines considered in this study.

Table 1 .
All ENCODE cell-lines for which matched ChIP-seq data was available for the full set of histone modifications considered in this study (listed in Table2).

Table 2 .
All histone modifications considered in this study.The remaining histone modifications available from ENCODE are unsuitable for this study as they assert their functional role in non-promoter regions (e.g.H3K36me3 in the 3 -UTR).
0 controls the strength of exponential decay for quantifying transcription factor interactions and is typically set to d 0 = 5000.The resultant matrix of gene-level epigenetic scores, X ∈ R M ×N , is then log (or arsinh)-transformed and quantile-normalised for use in a regression model.Given ChIP-seq data for epigenetic feature n represented in UCSC wiggle (.WIG) format: each column X ,n ∈ R M : X ,n = col n (X) of the epigenetic score matrix can be efficiently calculated using MapReduce, as demonstrated below.Similar formulations can be derived for other ChIP-seq file formats.Algorithm 2 MREpigeneticScores(X ,n ) procedure MAPPER µ ( X ,n; locus; value ) for each gene m do if |d(locus, m)| ≤ d * then EMIT( xm,n; value × φ(locus, m) ) procedure REDUCER ρ ( xm,n; {v1, v2, . . ., vK } ) EMIT xm,n; K i=1 vi