Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis
© Jaffe et al. 2015
Received: 29 April 2015
Accepted: 30 October 2015
Published: 6 November 2015
The Erratum to this article has been published in BMC Bioinformatics 2016 17:302
Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of “batch” correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for wide-spread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature.
We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272).
Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the “cleaned” data, including sex, common copy number effects and sample or cell line-specific molecular behavior.
Our analyses indicate that data “cleaning” can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised “cleaning”, because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding “cleaning” process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272.
High-throughput experiments are commonplace in molecular biology and aim to identify genomic measurements associated with clinical phenotypes and biological mechanisms. Microarrays and next generation sequencing are popular measurement tools for these experiments and assay tens of thousands of genes at once. Typically, data normalization and preprocessing approaches reduce technical variability [1, 2] but there often remain high levels of systematic heterogeneity in the data which can obscure biological phenomena under study. Because its impact is often severe, understanding of such heterogeneity should be an integral part of processing and exploration of genomic data.
Much of this underlying variability is observed to be systematic with the order in which samples are processed, and therefore is commonly referred to as “batch effects” . There are currently two general classes of “batch” correction methods: those that use linear modeling when batches are known or assumed (e.g. ComBat ) and those that attempt to identify and control for potential batch effects (e.g. surrogate variable analysis [SVA] , remove unwanted variation (RUV) , among others). While uncorrected “batch effects” still appear in published high-throughput data, an additional, more subtle, yet perhaps more common issue has arisen: the incorrect or imprecise definition of biological enquiry during “cleaning” of genomic data, resulting either in the removal of important biological signal, or the retention of unwanted latent variability. It is important to note that all of these algorithms depend on well-designed studies to properly identify these “batch effects”, i.e. where the outcome of interest is balanced across potential batches, as in randomization procedures—otherwise, it is difficult to attribute variance in the data to “batch” or biology of interest .
The first class of batch correction methods may miss artifacts due to biology, and unannotated technical variation, while the second class of factor-based estimation may remove biological variation of interest. Here we present three illustrative data analyses using SVA (as a surrogate for almost any factor-based approach) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects. We do this using two publicly available gene expression datasets, one from differentiating pluripotent cells, and another in the developing and aging human brain. Our analysis indicates that artifact discovery is an important component of high-throughput analysis pipelines but caution should be exercised in supervising the discovery of artifacts to avoid removing biological signal of interest. In particular, researchers should be aware that many biological effects of potential importance can be removed if they are not explicitly protected during the cleaning process.
A summary of SVA
Regardless of data processing methods used, an explicit definition of the precise biological question under study is particularly crucial in genomics investigation. Using SVA formalizes this process in that this “biological model” is an explicitly defined mathematical model passed to SVA. Effects specified in this model are preserved while systematic heterogeneity that affects many measurements unrelated to these effects are identified and subsequently adjusted for in subsequent statistical analyses. This approach has previously been shown to result in more accurate and stable gene rankings, improved false discovery estimation and correct p-value distributions . Under this framework, the biological effects interrogated in the data must be limited to those specified in this model as passed to the SVA algorithm. If this is not the case, effects of interest may be treated as latent heterogeneity and removed from the data. The impact of researchers’ conception of the biological enquiry on the nature of possible discovery should not be underestimated in this process.
SVA’s purely data-driven methods for the estimation of and adjustment for unwanted systematic variance take particular advantage of the breadth of high-dimensional genomics data. The algorithm does not require a priori information about what variables measured by the researcher might represent a “batch effect”. By using the structure of thousands of measures uncorrelated to the effect under study, SVA estimates unwanted effects and allows sculpting of a dataset to focus on an explicitly defined biological effect. This is important as commonly-used “batch” variables, such as microarray scan/hybridization date, are likely surrogates for unmeasured variables that are better estimated by the data themselves . To determine the number of surrogate variables (SVs) to estimate (where more SVs reflects a higher degree of correction applied to the data) the SVA algorithm can take user input, or use an automated approach via permutation testing to estimate the number of SVs present in the data. The correct usage of SVA has the potential to increase statistical power when analyzing experimental data, but note that while increasing the number of SVs reduces the variability in the dataset, it may also reduce variability in the direction of the effect of interest based on the iterative algorithm used to estimate the SVs.
To best assess the biological effect of interest, all estimated SVs, along with the defined “biological model” are included for the adjustment of data in downstream differential expression statistical analyses . We note that the SVA algorithm can permit correlation between these SVs and the outcome of interest . Model selection that removes surrogate or measured variables will lead to p-values that are smaller than their true value (anti-conservative biases) , greatly increasing probability of identifying false positives. Additionally, SVs can be regressed out of the data to obtain “cleaned” data for visualization (as we do in this report), however differential expression statistics should not be performed on this “clean” data, as this too can lead to anti-conservative bias resulting from between-sample correlation being introduced by regressing out the SVs and from inflating variance partitioning related to the effect of interest, as the total variance of the system has been reduced without being taken into account during the linear modeling. We also suggest incorporating a priori biological data, when available, into the evaluation of data “cleaning” by any method, with the goal of using known biology as an additional guide in this process (as we also do in examples here).
Batch correction increases ability to detect defined effects
First, we analyze gene expression data from differentiating pluripotent cells  (GSE32923) using the principles outlined above as a guide in applying SVA to focus on particular biological effects in the data. These analyses illustrate how the details of defining biological enquiry in the “cleaning” process impact discovery on both a global and gene-specific level. Human pluripotent cell lines were differentiated towards neuroectodermal and mesendodermal fates. Biological replicates (duplicates) for each cell line in each condition were collected after 8 days of differentiation and hybridized to Agilent G4112F arrays to measure gene expression. Data were preprocessed and normalized using the ‘limma’ R package  with background correction and quantile normalization (Additional file 1).
To perform SVA here, we define the biological model as the average change in gene expression associated with each treatment (model coded in R as: “expression ~ treatment”, where treatment has 3 levels: undifferentiated, mesendoderm or neurectoderm). Hence, SVA will preserve this mean effect while removing other systematic heterogeneity in the data that effect many measures uncorrelated with this mean effect. After performing SVA and subsequently regressing out the 27 automatically determined SVs in this dataset, biological patterns are clearer in this global expression landscape (Fig. 1c). Now, the 1st PC constructs an axis of differentiation primarily towards a mesendodermal fate (it is of interest that the neuroectodermal fate also moves through this dimension). The 2nd PC identifies an axis that is exclusively represented in neuroectodermal differentiation. Importantly, these samples no longer cluster by processing date, indicating that SVA has successfully accounted for the batch effects while preserving biological signal (Fig. 1d). When combined with precise biological and technical data, these dimension-reducing plots provide a starting point for evaluating both global patterns of biology across the samples under study as well as the extent of technical variation in the data.
It is crucial to note here that by defining the biological effect of interest to be the average change in expression with treatment, SVA removes many individual sample-specific expression traits. For example, there are multiple cell lines which do not induce PAX6 expression in the neuroectodermal differentiation condition (Fig. 2a), but this is not observable after SVA (Fig. 2b). This is desirable if the goal is to estimate the mean effect of each treatment, but detrimental if expression phenotypes of individual cell lines are of interest (see discussion for additional details).
We note that similar reductions in technical variability are seen with the “ComBat” empirical Bayes batch correction approach , which also can also utilize an explicit definition of biological enquiry (Additional file 4: Figure S3A and B). However, the empirical Bayes method places less emphasis on the biological model, mostly reducing global variation even without specifying a biological model (e.g. using an intercept-only model, Additional file 4: Figure S3C and D). On the gene-level, ComBat produced smaller p-values on average (65.8 and 72.7 % of probes had smaller p-values under ComBat compared to SVA for the neuroectodermal versus mesendodermal fates, and neuroectodermal versus the undifferentiated state, respectively).
Batch correction may remove secondary effects of interest
In Figs. 1 and 2, our biological model included only a treatment effect. If we are interested in exploring additional effects, this initial model specification is not complete. Analysis of additional effects requires their explicit inclusion in the biological model along with the primary effect of interest. Statistical analysis can then be applied to the data while incorporating the latent factors identified by the SV analysis.
We explore another example at the GSTT1 gene (glutathione S-transferase theta 1), which has common copy number variants (CNVs) in human populations  that are associated with GSTT1 expression differences and local genetic structure of neighboring genes GSTT2 and GSTT2B . The expression data analyzed without SVA suggests three different copy numbers (0, 1 and 2) at this gene (Fig. 3d) which are validated with SNP microarray intensities in a subset of these samples (Additional file 5: Figure S4). Without accounting for GSTT1 copy number (i.e. using the treatment-only biological model from Figs. 1 and 2) SVA removes the expression differences related to GSTT1 CNV’s (Fig. 3e). However, accounting for copy number at this location in the biological model passed to SVA retains the copy number effect in the data (Fig. 3f; model coded in R as: “expression ~ treatment + CNV”).
Hence, when directed to preserve only the effect of treatment, SVA finds and removes wide-spread systematic expression effects associated with both sex and GSTT1 copy number (among many other effects). This serves as a positive control indicating that SVA can successfully identify these known effects in expression data. This is also a cautionary note, reinforcing the notion that analyses must remain within the bounds set by the biological question defined at the outset of the data “cleaning”. Hence, any re-analyses of existing data with new questions requires an entirely novel re-processing of the data to address these new questions, not simply the calculation of additional statistics on the same processed data. Here we can clearly see how strong an impact our prior conception of the biological system can have on discovery in genomics data.
Batch correction may remove non-linear expression patterns
Next, we re-analyzed an existing dataset measuring gene expression in the dorsolateral prefrontal cortex region (DLPFC) of the brain across the human lifetime. . Data was processed and normalized as previously described . Briefly, background correction and loess normalization were performed on raw two-channel intensity data, and low quality probes were removed from subsequent analysis, leaving 30,176 probes on 269 samples across the lifespan (fetal through the aged).
Patterns of gene expression across age are dynamic and non-linear: the largest changes occur during fetal life and infancy, and decrease in magnitude with age . Here we expand on the previous modeling of age patterns across the lifespan in several key ways: 1) We applied splines to capture non-linear gene expression effects while ensuring patterns of gene expression are continuous across the lifespan The previous analysis used age by decade interaction terms, which are not necessarily continuous. 2) We estimated and adjusted for a much higher number of SVs. The previous analysis used only 2 SVs, here we allowed SVA to automatically determine this number: 31 SVs were used. This much increased “cleaning” further tuned this dataset to age effects (no doubt at the expense of many other effects such as those assessed in Fig. 3). Hence, this newly processed data should only be used for the estimation of canonical, mean patterns of expression across the lifetime. 3) We regressed out SVs while allowing the effects of age and mean gene expression (the intercept) to remain in the data. Previously, SVs were regressed out while ignoring possible correlation between SVs and age, potentially obscuring some age effects.
We applied three related spline models to the dataset, which were A) A linear spline with a knot at birth [2° of freedom], i.e. a line fit to expression across age in fetal life, and a second line fit to expression across age in postnatal life. B) A 2nd degree basis spline with a knot at birth [3° of freedom], i.e. a curve fit to expression across age in fetal life, and a second curve fit to expression across age in postnatal life. C) A 2nd degree basis spline with knots at birth, 1, 10, 20 and 50 years [8° of freedom], i.e. a curve fit to expression across age within each age range between these knots. Each model also allowed an offset at birth, because there were no samples in the third trimester of fetal life.
Inspection of the impact of these different models on SVA’s adjustment of individual gene data demonstrates that increased flexibility in modeling effects can result in both higher fidelity to the original data and the increased ability to distinguish specific dynamic patterns (Fig. 5): Models A and B introduce greater adjustments to the original normalized data than model C. Model C, however effectively reduces within age variance while preserving a gene expression pattern apparent in the normalized data in which dynamics are limited to the range of birth-10 year. While perhaps reducing some variance, both models A and B produce a pattern of expression across age where dynamics are spread from birth-50 year, demonstrating the increased resolution of pattern identification possible with the more flexible model C. This newly processed data is available along with original raw data at the GEO repository: GSE30272.
Using SVA, ComBat or related tools that require the precise definition of biological enquiry can increase the power to identify specific signals in complex genomic datasets. However, especially when high levels of correction are used (e.g. when the automatic number of SVs generated by SVA is used), biological discovery is directly limited by researchers’ prior conception of the system under study. Hence, we must be precise and deliberate in the design and analysis of experiments and the resulting data, and also mindful of the limitations we impose with our own perspective.
Here we have discussed primarily how to focus as much as possible on a single narrowly defined question. Methods such as SVA can also be used in the more open exploration of genomics data. For this, the definition of the biological model of interest would have to be designed specifically for flexibility, to allow more diverse effects to coexist in the SV-adjusted data. Perhaps equally important would be the reduction of the amount of “cleaning” performed. In the use of SVA this would be done by tempering the number of SVs identified and used for the adjustment of the data so that large components of unwanted heterogeneity can be removed while leaving intact a broad range of biological effects to explore. We do note that performing SVA does not necessarily produce overly optimistic results, as the SVA algorithm allows for potential correlation between the primary variables and estimated latent variables. Regardless, this is a balancing act that necessarily entails iterative data processing and assessment of global and gene-specific impacts of the analysis, and can be guided by perspectives and plots that we illustrate in this report.
While we have primarily explored the SVA batch correction algorithm, we note comparable global and gene-level results using the ComBat batch-correction algorithm  within the stem cell dataset. However, as the ComBat algorithm requires categorical “batch” variables, we could not optimally implement it in the brain dataset - several of the top principal components associated with the quantitative RNA integrity numbers (RIN), suggesting that tissue quality likely induces “batch”-like effects in the data. Similarly, while we view the Remove Unwanted Variation (RUV) algorithm  a natural extension of the SVA algorithm, we found that many housekeeping genes , typically used as “negative control genes” in the algorithm, were differentially expressed by differentiation in the stem cell dataset (Additional file 7: Figure S6) and development/birth in the brain dataset, highlighting the difficulty in selecting an a priori set of negative controls required by the algorithm. The selection of “negative control genes” that associate with the outcome of interest has a similar effect as SVA model misspecification displayed in Fig. 3. Without studies dedicated to the identification of expressed yet outcome-independent genes within the system under study, the selection of such control genes is quite difficult, although recent work has suggested that the approach may be robust to the specific choice of control genes , particularly when technical replicate samples have been generated .
The analysis structures described here focus on mean effects across all samples studied. Analysis of experiments in dynamic systems including replicates of a wide diversity of individual subjects and well-characterized genomes will be necessary to move beyond the study of average population effects, into functional genomics where we may begin to estimate the impact of individual genomes on precise molecular and cellular phenotypes. In this context, the use of data “cleaning” must be used with extreme caution as it can remove a great deal of information from genomic data as we have demonstrated here.
All code for processing, analyzing and visualizing is available at: https://github.com/andrewejaffe/StemCellSVA.
We would like to thank the Lieber and Maltz families for their gracious gift that has made this work possible.
The Lieber Institute for Brain Development for AEJ, TH, JEK, DRW, JC, RDGM and CC; NIH Grant R01 GM105705-01 for JTL.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93.View ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–64.View ArticlePubMedGoogle Scholar
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9.View ArticlePubMedGoogle Scholar
- Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.View ArticlePubMedGoogle Scholar
- Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:1724–35.View ArticlePubMedGoogle Scholar
- Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13(3):539–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Lambert CG, Black LJ. Learning from our GWAS mistakes: from experimental design to scientific method. Biostatistics. 2012;13(2):195–203.View ArticlePubMedPubMed CentralGoogle Scholar
- Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–3.View ArticlePubMedPubMed CentralGoogle Scholar
- Mallon BS, Chenoweth JG, Johnson KR, Hamilton RS, Tesar PJ, Yavatkar AS, et al. StemCellDB: the human pluripotent stem cell database at the national institutes of health. Stem Cell Res. 2012;10(1):57–66.View ArticlePubMedPubMed CentralGoogle Scholar
- Smyth GK. Limma: linear models for microarray data. In: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W, editors. Bioinformatics and computational biology solutions using r and bioconductor. New York: Springer; 2005. p. 397–420.View ArticleGoogle Scholar
- Jaffe AE, Feinberg AP, Irizarry RA, Leek JT. Significance analysis and statistical dissection of variably methylated regions. Biostatistics. 2012;13(1):166–78.View ArticlePubMedGoogle Scholar
- Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, et al. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Georgala PA, Carr CB, Price DJ. The role of Pax6 in forebrain development. Dev Neurobiol. 2011;71(8):690–709.View ArticlePubMedGoogle Scholar
- Landi S. Mammalian class theta GST and differential susceptibility to carcinogens: a review. Mutat Res. 2000;463(3):247–83.View ArticlePubMedGoogle Scholar
- Zhao Y, Marotta M, Eichler EE, Eng C, Tanaka H. Linkage disequilibrium between two high-frequency deletion polymorphisms: implications for association studies involving the glutathione-S transferase (GST) genes. PLoS Genet. 2009;5(5):e1000472.View ArticlePubMedPubMed CentralGoogle Scholar
- Colantuoni C, Lipska BK, Ye T, Hyde TM, Tao R, Leek JT, et al. Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature. 2011;478(7370):519–23.View ArticlePubMedPubMed CentralGoogle Scholar
- Eisenberg E, Levanon EY. Human housekeeping genes are compact. Trends Genet. 2003;19(7):362–5.View ArticlePubMedGoogle Scholar
- Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896–902.View ArticlePubMedPubMed CentralGoogle Scholar
- Peixoto L, Risso D, Poplawski SG, Wimmer ME, Speed TP, Wood MA, et al. How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets. Nucleic Acids Res. 2015;43(16):7664–74.View ArticlePubMedPubMed CentralGoogle Scholar
- Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002;30:e15.View ArticlePubMedPubMed CentralGoogle Scholar