 Research Article
 Open access
 Published:
Distributed gene expression modelling for exploring variability in epigenetic function
BMC Bioinformatics volume 17, Article number: 446 (2016)
Abstract
Background
Predictive gene expression modelling is an important tool in computational biology due to the volume of highthroughput sequencing data generated by recent consortia. However, the scope of previous studies has been restricted to a small set of celllines or experimental conditions due an inability to leverage distributed processing architectures for large, sharded datasets.
Results
We present a distributed implementation of gene expression modelling using the MapReduce paradigm and prove that performance improves as a linear function of available processor cores. We then leverage the computational efficiency of this framework to explore the variability of epigenetic function across fifty histone modification datasets from variety of cancerous and noncancerous celllines.
Conclusions
We demonstrate that the genomewide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotypeinvariant, and that models trained on matched omics data from noncancerous celllines are able to predict cancerous expression with equivalent genomewide fidelity.
Background
Computational frameworks for modelling gene expression as a function of genelocalised epigenetic features are becoming increasingly common in life sciences research. Previous studies by our lab [1–3] and others [4, 5] have leveraged the statistical power of modelling genes as observations of regulatory activity (versus variables in networkbased analyses [6, 7]) to gain new insight into the function and interactions of transcription factors, histone modifications and DNA methylation. Recent applications include: inference of transcription factor roles from their respective binding motifs [8]; identification of regulatory elements responsible for differential expression patterns [9]; exploring the relationship between gene expression and chromatin organisation [2]; and comparative analysis of the transcriptome across distant species [10].
Despite the wealth of highthroughput sequencing data made available by recent largescale consortia, previous predictive modelling studies have focused on a very small number of celllines (typically 1to3 [8, 9]) despite the obvious benefits of broader, integrative analyses. We attribute this largely to the size of sequencing data and widespread inability of published frameworks to decompose tasks into parallelisable units. Although some studies have considered accelerated GPU implementations [11], this imposes strict memory constraints and does not readily extend to largescale, distributed systems of commodity hardware. In this study, we demonstrate how the MapReduce programming paradigm can be applied to a broad class of regression modelling that captures popular formulations of predictive gene expression modelling [1]. Importantly, we prove general asymptotic speedup in number of processing cores that is not bound to specific hardware infrastructure; i.e. cloud versus enterprise or distributed versus sharedmemory multicore systems.
A recent study by Jiang et al. has suggested that RNA (transcriptomic) and ChIPseq (epigenetic) data generated under the same conditions (i.e. the same cellline) introduce statistical bias and that specialised methods are necessary for accurately modeling the expression of cancer cells [12]. This study investigates both of these concerns, exploiting the computational efficiency of our distributed implementation to conduct an integrative analysis of six histone modifications across eight dissimilar ENCODE celllines. First, we extend our predictive modelling framework to include L ^{2}regularisation, which is specifically designed to prevent overfitting to experimental noise rather than meaningful biological relationships. We then quantify the extent of conditionspecific bias by training and testing models on all 64 directed, pairwise combinations of celllines.
Methods
ENCODE cellline data
Matched mRNA transcript abundance (RNAseq) and histone modification (ChIPseq) data were downloaded from ENCODE [13] for the eight celllines summarised in Table 1. These dissimilar celllines are those for which data are available for the histone modifications listed in Table 2. The remaining histone modifications available from ENCODE are unsuitable for this study as they assert their functional role in nonpromoter regions (e.g. H3K36me3 in the 3^{′}UTR). The MapReduce implementation of gene expression modelling presented in this study could be trivially extended to model more celllines if the data were made available.
MapReduce
MapReduce is programming paradigm which adapts the mapreduce functional programming construct for distributed and faulttolerant data processing on commodity hardware. First developed by Google [14], MapReduce is now widely adopted for parallelised processing of data on terabyte and petabyte scales. A program implemented using the MapReduce paradigm consists of a sequence, 〈μ _{1},ρ _{1},μ _{2},ρ _{2},…,μ _{ R },ρ _{ R }〉, of mappers (μ _{ r }) and reducers (ρ _{ r }) operating over 〈key;value〉 pairs. Formally, a MapReduce program executes the steps described in Algorithm 1 on input U _{0} until the final reducer (ρ _{ R }) halts [15].
The computational benefit of MapReduce follows from its inherent parallelisability, as many instances of μ _{ r } are able to process their 〈key;value〉 simultaneously (likewise with ρ _{ r }, although all instances of μ _{ r−1} must halt before any ρ _{ r } can commence). The following sections detail mapper and reducer implementations for each stage of the standard predictive gene expression modelling pipeline. For additional details on the implementation or rationale of these stages, please refer to references [1–3].
Quantifying transcriptional regulatory interactions
The strength of association between a gene, m∈(1,2,…,M), and epigenetic feature, n∈(1,2,…,N), can be calculated from a ChIPseq dataset specific to some cellline/condition:
where R _{ n } is the set of ChIPseq reads for n, d(r,m) is the distance (bp) separating read r from the TSS of m, and ϕ maps a generead pair to their strength of association. The maximum binwidth, d ^{∗}, is traditionally set to 2000 to approximate the average width of ChIPseq binding regions. Different implementations of ϕ are used for histone modifications (constrained sumoftags) versus transcription factors (exponentially decaying affinity) due to their dissimilar ChIPseq binding profiles [2]:
where hyperparameter d _{0} controls the strength of exponential decay for quantifying transcription factor interactions and is traditionally set to d _{0}=5000. The resultant matrix of genelevel epigenetic scores, \(\mathbf {X} \in \mathcal {R}^{M\times N}\), is then log (or arsinh)transformed and quantilenormalised for use in a regression model.
Given ChIPseq data for epigenetic feature n represented in UCSC wiggle (.WIG) format:
each column, \(X_{\star, n} \in \mathcal {R}^{M} : X_{\star, n} = \text {col}_{n}\left (\mathbf {X}\right)\), of the epigenetic score matrix can be efficiently calculated using MapReduce using the procedure described in Algorithm 2. Equivalent formulations can be derived for other ChIPseq file formats.
Linear regression with least squares fitting
Suppose \(\mathbf {X} \in \mathcal {R}^{M\times N}\) is a matrix of genelevel epigenetic scores (defined above), where M is the number of genes (including a unity term for model bias) and N is the number of epigenetic variables (M≫N). It is commonplace to model the relationship between X and a vector of gene expression values, \(Y \in \mathcal {R}^{M}\), as follows:
where β parameterises the linear relationship between gene expression and local epigenetic features, and ε are the genespecific errors. Such models can be fitted using ordinary least squares:
yielding the following modelbased predictions of gene expression, \(\hat {Y}\):
Given two general matrices, \(\mathbf {A} \in \mathcal {R}^{X\times Y}\) and \(\mathbf {B} \in \mathcal {R}^{Y\times Z}\), the product \(\mathbf {C} \in \mathcal {R}^{X\times Z} : \mathbf {C} = \mathbf {A}\times \mathbf {B}\) can be reformulated (without loss of generality) as:
where:
This formulation of matrix multiplication can be implemented by the MRMultiply function defined in Algorithm 3.
Our implementation of linear regression with least squares fitting involves decomposing \(\hat {\beta }\) into the product A ^{−1} B, where \(\mathbf {A} \in \mathcal {R}^{N\times N} : \mathbf {A} = \text {MRMultiply}(\mathbf {X}^{\top }, \mathbf {X})\) and \(B \in \mathcal {R}^{N} : B = \text {MRMultiply}(\mathbf {X}^{\top }, Y)\). The product A ^{−1} B is calculated using standard, singleprocessor multiplication as the communication overhead of MapReduce cannot be amortised across small matrices.
Regularised least squares regression
Regularisation is a common method of overcoming the issue of overfitting regressionbased models to experimental noise rather than meaningful biological relationships. Regularisation involves penalising the fitted parameters, β, by an empiricallytuned hyperparameter, λ:
Presuming · is the L ^{2} (Euclidean) norm, our MapReduce implementation can be trivially extended to support regularisation (implementing ridge regression). Specifically, given:
where I _{ N } is the N×N identity matrix, it follows that:
It is evident that this implementation yields the same asymptotic time complexity as ordinary least squares regression. Moreover, the existence theorem for general ridge regression demonstrates that it is always possible to tune λ (e.g. using crossvalidation) to reduce the mean square error of model predictions [16, 17]. This is particularly important when introducing a large number of epigenetic variables into a predictive model; e.g. a systematic analysis of the roles of dozens of transcription factors from their ChIPseq binding profiles. In this study, λ is assigned the largest possible value such that the mean 10fold crossvalidated error is within 1 standard error of the minimum (solved iteratively).
Unlike the L ^{2} norm, the L ^{1} norm is often used to enforce sparsity in β under the assumption that most variables in X are physically decoupled from Y. This is less relevant in the context of gene expression modelling due to the wellestablished functional importance of epigenetic regulators for which ChIPseq data is widely available. Moreover, the L ^{1} norm is not differentiable and thus not amenable to a closedform MapReduce solution, and the parallelisation of iterative solutions is discussed elsewhere [18]. A singlenode implementation of our code (see Additional file 1) is provided for convenient reproduction of our experimental results.
Results and discussion
MapReduce enables timeefficient gene expression modelling
For M genes and a ChIPseq dataset containing R mapped reads, the asymptotic time complexity class of generating a column X _{⋆,n } of X is Θ(MR). By first preprocessing the list of gene TSS loci (invariant between epigenetic datasets) into a balanced binary search tree and observing that the vast majority of reads are within d ^{∗} bp of exactly zeroorone gene, our MapReduce implementation of calculating X _{⋆,n } yields the following complexity when distributed across P MapReduce nodes:
which must be completed separately for each epigenetic feature, n∈(1,2,…,N).
For \(\mathbf {X} \in \mathcal {R}^{M\times N}\) and \(Y \in \mathcal {R}^{M}\), the asymptotic time complexity of ordinary least squares fitting \(\hat {\beta } = f(\mathbf {X}, Y)\) can also be derived:
Observing that R≫M≫N for gene expression modelling and by distributing the calculation of A and B across P MapReduce nodes, the overall complexity reduces to:
thus this MapReduce implementation of gene expression modelling yields an optimal Θ(P) improvement in asymptotic time complexity without the need to parallelise matrix inversion or transpose operations. The following results sections demonstrate how this improved performance can allow us to gain new insights from the largescale integration of publicly available datasets.
Histone modifications are predictive of gene expression in both cancerous and normal celllines
L ^{2}regularised linear regression models of genomewide mRNA transcript abundance were constructed as functions of the following histone modifications: H2A.Z, H3K4me3, H3K9ac, H3K9me3, H3K27ac and H3K27me3. For each model, the regularisation parameter, λ, was fitted using 10fold crossvalidation. The adj. R ^{2} performance of each model is presented in Fig. 1, along with a density plot of predicted (\(\hat {Y}\)) versus measured (Y) transcript abundance. It is evident that histone modifications are accurate predictors of gene expression in both cancerous (top row, mean adj. R ^{2}=0.608) and normal celllines (bottom row, mean adj. R ^{2}=0.581), despite recent studies suggesting that specialised models are necessary to appropriately model cancerous cells [12].
Figure 2 presents the results of hierarchically clustering celllines by mRNA transcript abundance residuals (\(\varepsilon = Y  \hat {Y}\)). Interestingly, the three mesodermal derivatives GM12878, K562 and HUVEC form a distinct cluster. RNAsequencing data for the least similar cellline (A549) was generated at Cold Spring Harbor Laboratory whereas all other transcriptomic data was generated at the California Institute of Technology, suggesting that batch effects may be a contributing factor. It is also evident that the expression levels of many genes are consistently over or underestimated across all eight celllines. Taken together, these results indicate that genespecific residuals are nonrandom and indicative of genes that are inherently difficult to model from histone modification data. The existence of genes with transcriptional activity apparently decoupled from the local epigenetic landscape has been explored in detail in our previous study [2].
The regulatory function of histone modifications are cellline invariant
To assess the extent to which conditionspecific bias influences the reported accuracy of gene expression predictions, we trained and tested models on all 64 directed, pairwise combinations of celllines. The adj. R ^{2} performance for these models are presented in Fig. 3 a. These results demonstrate significant nonsymmetry, with dissimilarity between columns (predictions) but not rows (training observations). This demonstrates that the transcriptional regulatory roles of histone modifications are cell line invariant at a genomewide level (within the constraints of a linear model); e.g. A549 and GM12878 expression can be accurately predicted by models trained on any cellline, despite their diversity in lineage, tissue and karyotype. These results are further supported by Fig. 3 b, which demonstrates consistency in the fitted model parameters, \(\hat {\beta }\), across all celllines.
It is worth noting that that models trained and tested using data from a single cellline (boldfaced along the diagonal of Fig. 3 a) only marginally outperform models trained on dissimilar celllines and, moreover, that these margins are significantly less than the inherent variation between columns. These findings suggest that, in the context of gene expression modelling, training and testing models on data generated under the same experimental conditions (i.e. the same cellline) is not a significant source of statistical bias.
Conclusions
Many previous predictive modelling studies have been limited in scope to 13 celllines due to the computational expense of modelling highthroughput sequencing data. In this study, we introduced a MapReduce implementation of gene expression modelling that is able to obtain a full Θ(P) improvement in asymptotic time complexity when distributed across P CPUs (e.g. as part of multicore PC or highperformance cluster). This formulation and corresponding complexity analysis is intended to demonstrate the minimal set of operations that should be parallelised to yield Θ(P) improvement. Practically, machine learning pipelines implemented in TensorFlow [19], FlumeJava [20] or similar technologies would minimise execution time on conventional hardware without the added difficulty of implementing mappers and reducers. For illustrative purposes, a pure MapReduce implementation was applied in this study to model more than 50 epigenetic and matched transcriptomic datasets across 8 dissimilar ENCODE celllines. We encourage other researchers to investigate similar optimisations to increase the volume of data modelled in future integrative analyses.
Despite recent studies presenting specialised methods for modelling cancerous gene expression [12], we find no evidence of variation in the statistical relationship between histone modifications and mRNA transcript abundance in normalversuscancerous celllines. Although our results demonstrate that some celllines are inherently more difficult to model than others, this trait appears to be more closely associated with the extent of cellular differentiation than carcinogenic state; e.g. models of h1hESC embryonic stem cells perform 12 % worse than terminallydifferentiated GM12878 lymphoblasts. Although the NHEK (Normal Human Epidermal Keratinocytes) cellline is both terminallydifferentiated and exhibits the worstperforming models, this may be attributed to the phenotypic plasticity of keratinocytes between epithelial and mesenchymal states (necessary for wound healing). We therefore speculate that the predictability of a cellline’s genomewide expression levels from epigenetic data is proportional to its transcriptomic rigidity; i.e. cells with signalinduced phenotypic plasticity are less likely to exhibit a stable, predictive epigenome.
Interestingly, hierarchical clustering of the 8 investigated celllines by mRNA transcript abundance residuals (genelevel prediction errors) was able to group the closelyrelated, mesodermalderivative celllines GM12878, K562 and HUVEC; again, carcinogenic state appeared to have little effect on the propensity of two celllines to cluster together. Taken together with the observation that many genes exhibited large and consistent residuals across all celllines, these results suggest that genelevel residuals are nonrandom and, moreover, that the transcriptional activity of many genes are decoupled from their local epigenetic landscape. These observations are consistent with and extend upon the findings of our earlier studies [2, 3], and we hope that future studies will leverage distributed computational modelling to further accelerate progress in this field.
Abbreviations
 ChIP:

Chromatin immunoprecipitation
 ChIPseq:

ChIP with massively parallel sequencing
 GRN:

Gene regulatory network
 H1/2A/2B/3/4:

Histone proteins
 H2A.Z:

Histone H2 variant
 hESC:

Human ESC
 HxKyz :

Modification z of lysine y on histone Hx
 mRNA:

Messenger RNA
 RNA:

Ribonucleic acid
 RNAseq:

Highthroughput RNA sequencing
 TSS:

Transcription start site
References
Budden DM, Hurley DG, Crampin EJ. Predictive modelling of gene expression from transcriptional regulatory elements. Brief Bioinform. 2015; 16(4):616–28.
Budden DM, Hurley DG, Cursons J, Markham JF, Davis MJ, Crampin EJ. Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics Chromatin. 2014; 7(36):1–12.
Budden DM, Hurley DG, Crampin EJ. Modelling the conditional regulatory activity of methylated and bivalent promoters. Epigenetics Chromatin. 2015;8(21).
Karlić R, Chung HR, Lasserre J, Vlahoviček K, Vingron M. Histone modification levels are predictive for gene expression. Proc Natl Acad Sci. 2010; 107(7):2926–931.
Ouyang Z, Zhou Q, Wong WH. ChIPSeq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci. 2009; 106(51):21521–1526.
Budden DM, Crampin EJ. Information theoretic approaches for inference of biological networks from continuousvalued data. BMC Syst Biol. 2016; 10(89):1–7.
Hurley DG, Cursons J, Wang YK, Budden DM, Crampin EJ, et al.NAIL, a software toolset for inferring, analyzing and visualizing regulatory networks. Bioinformatics. 2015; 31(2):277–8.
McLeay RC, Lesluyes T, Partida GC, Bailey TL. Genomewide in silico prediction of gene expression. Bioinformatics. 2012; 28(21):2789–96.
Cheng C, Gerstein M. Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells. Nucl Acids Res. 2012; 40(2):553–68.
Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, et al. Comparative analysis of the transcriptome across distant species. Nature. 2014; 512(7515):445–8.
Olejnik M, Steuwer M, Gorlatch S, Heider D. gCUP: rapid GPUbased HIV1 coreceptor usage prediction for nextgeneration sequencing. Bioinformatics. 2014; 30(22):3272–273.
Jiang P, Freedman ML, Liu JS, Liu XS. Inference of transcriptional regulation in cancers. Proc Natl Acad Sci. 2015; 112(25):7731–736.
ENCODE Project Consortium, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.
Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008; 51(1):107–13.
Karloff H, Suri S, Vassilvitskii S. A model of computation for MapReduce. In: Proceedings of the twentyfirst annual ACMSIAM symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics: 2010. p. 938–48.
Chawla J. The existence theorem in general ridge regression. Stat Probab Lett. 1988; 7(2):135–7.
Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970; 12(1):55–67.
Zinkevich M, Weimer M, Li L, Smola AJ. Parallelized stochastic gradient descent. In: Advances in neural information processing systems. Neural Information Processing Systems Foundation: 2010. p. 2595–603.
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al.TensorFlow: Largescale machine learning on heterogeneous systems. arXiv preprint arXiv:1603.04467 (2016).
Chambers C, Raniwala A, Perry F, Adams S, Henry RR, Bradshaw R, Weizenbaum N. FlumeJava: easy, efficient dataparallel pipelines. In: ACM Sigplan Notices, vol. 45, No. 6. ACM: 2010. p. 363–75.
Acknowledgements
None
Funding
This work was supported by an Australian Postgraduate Award [DMB]; the Australian Federal and Victoria State Governments and the Australian Research Council through the ICT Centre of Excellence program, National ICT Australia (NICTA) [DMB]; and the Australian Research Council Centre of Excellence in Convergent BioNano Science and Technology (project number CE140100036) [EJC]. The views expressed herein are those of the authors and are not necessarily those of NICTA or the Australian Research Council.
Availability of data and materials
All data is described in Tables 1–2 and is publicly available from the ENCODE consortium: https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html.
Authors’ contributions
Analysis and interpretation of data: DMB and EJC. Study design and concept: DMB and EJC. Software development and data processing: DMB. Drafting the paper: DMB. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
A singlenode implementation of our code is provided for convenient reproduction of our experimental results. (ZIP 363 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Budden, D.M., Crampin, E.J. Distributed gene expression modelling for exploring variability in epigenetic function. BMC Bioinformatics 17, 446 (2016). https://doi.org/10.1186/s1285901613131
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901613131