Removing batch effects from purified plasma cell gene expression microarrays with modified ComBat
© Stein et al.; licensee BioMed Central. 2015
Received: 8 January 2015
Accepted: 27 January 2015
Published: 25 February 2015
Gene expression profiling (GEP) via microarray analysis is a widely used tool for assessing risk and other patient diagnostics in clinical settings. However, non-biological factors such as systematic changes in sample preparation, differences in scanners, and other potential batch effects are often unavoidable in long-term studies and meta-analysis. In order to reduce the impact of batch effects on microarray data, Johnson, Rabinovic, and Li developed ComBat for use when combining batches of gene expression microarray data.
We propose a modification to ComBat that centers data to the location and scale of a pre-determined, ‘gold-standard’ batch. This modified ComBat (M-Combat) is designed specifically in the context of meta-analysis and batch effect adjustment for use with predictive models that are validated and fixed on historical data from a ‘gold-standard’ batch.
We combined data from MIRT across two batches (‘Old’ and ‘New’ Kit sample preparation) as well as external data sets from the HOVON-65/GMMG-HD4 and MRC-IX trials into a combined set, first without transformation and then with both ComBat and M-ComBat transformations. Fixed and validated gene risk signatures developed at MIRT on the Old Kit standard (GEP5, GEP70, and GEP80 risk scores) were compared across these combined data sets.
Both ComBat and M-ComBat eliminated all of the differences among probes caused by systematic batch effects (over 98% of all untransformed probes were significantly different by ANOVA with 0.01 q-value threshold reduced to zero significant probes with ComBat and M-ComBat). The agreement in mean and distribution of risk scores, as well as the proportion of high-risk subjects identified, coincided with the ‘gold-standard’ batch more with M-ComBat than with ComBat. The performance of risk scores improved overall using either ComBat or M-Combat; however, using M-ComBat and the original, optimal risk cutoffs allowed for greater ability in our study to identify smaller cohorts of high-risk subjects.
M-ComBat is a practical modification to an accepted method that offers greater power to control the location and scale of batch-effect adjusted data. M-ComBat allows for historical models to function as intended on future samples despite known, often unavoidable systematic changes to gene expression data.
Multiple myeloma (MM) is a systemic hematopoietic malignancy of plasma cells that expands within the bone marrow. As plasma cells become cancerous and multiply, levels of monoclonal immunoglobulins in blood increase and osteolytic bone lesions form in the majority of patients. Myeloma is also characterized by genetic heterogeneity at onset as well as variability in patient response to treatment –survival in patients ranges from months to more than fifteen years.
Gene expression profiling (GEP) via microarray analysis measures the expression levels of tens of thousands of genes simultaneously, and has been widely used in clinical practice for cancer classification, risk stratification, and treatment selection. This technique involves analyzing RNA extracted from purified plasma cells, pulled from the bone marrow, on high-density microarray gene chips [2,3]. Many gene signatures have been identified using GEP data to predict outcome of patients with newly diagnosed disease and identify those with high-risk disease. The GEP5 , GEP70 , GEP80 , EMC-92 , MRC-IX-6 , Millenium-100 , and IFM-15  are all gene signatures for identifying high-risk myeloma in newly diagnosed patients. The GEP70 risk score identified 70 genes linked to shorter overall survival, duration of complete response, event free survival, and progression to clinical disease [5,11-13]. The GEP5 has recently been identified as the optimal 5-gene subset within the GEP70 , and the GEP80 was developed based on differentially expressed genes between baseline and 48-hours after receiving Bortezomib . All three of these signatures (GEP5, GEP70, and GEP80) were trained and validated using samples from the Myeloma Institute for Research and Therapy (MIRT). Because these GEP-based risk models are affected by the expression of individual genes, systematic differences in expression due to batch effects must be corrected when comparing gene signatures across different batches.
Ambient conditions during sample preparation and handling such as room temperature, humidity, and ozone levels
Sites/laboratories: different laboratories may have different operating procedures
Sample preparation (including RNA isolation, amplification, and hybridization): different reagent lots may perform differently
Despite knowing the many potential sources of batch effects, batch effect associated information is not necessarily recorded for all samples. Oftentimes, data analysts are left with surrogates such as processing date and preparation group to use when correcting for batch effects . Here, we will generalize the term ‘batch effect’ to mean any type of systematic bias between two or more groups of samples due to differences in sample preparation, scanner, laboratory of analysis, etc. Many batch effects are unavoidable due to the large sample size requirements and potentially lengthy time required to complete a study. Combining data from different batches without correcting for batch effects can lead to, at a minimum, increased noise and less power to detect a real biological signal, and at a maximum, false biological conclusions. Careful consideration is needed in identifying and removing batch effects before further downstream analyses occur.
Combining batches of samples can be performed both within a single study or on samples from different studies in a meta-analysis framework. The remainder of this paper will focus on the latter scenario of combining larger cohorts of patients across different studies with a modified alternative of ComBat. The method introduced below, M-ComBat, would work in smaller sample, single cohort situations; however, ComBat or other methods may be more applicable in the single cohort context depending on the focus of the study.
Baseline MM purified plasma cell samples
All MIRT samples are baseline purified plasma cells from the bone marrow of patients enrolled on UARK Total Therapies and processed on Affymetrix U133Plus 2.0 microarrays (Santa Clara, CA) between 2004 and 2014. In keeping with institutional, federal, and Helsinki Declaration guidelines, all identifiable patients gave written informed consent for undergoing bone marrow sampling for gene expression profiling and the institutional review board of the University of Arkansas for Medical Sciences approved the research studies. Samples analyzed at the Myeloma Institute were prepared with either the One-Cycle and Two-Cycle Target Labeling and Control Reagents (‘Old’ Kit) or, beginning in 2009, the 3’ IVT Express Kit (‘New’ Kit) . All baseline MM GEP MIRT data are current as of November 11, 2014. There are a total of 1071 baseline MM GEP samples included in this study from two main batches: ‘Old’ Kit samples processed at Myeloma Institute (n=928) and ‘New’ Kit samples (n=143). Additional external data sets will be used to validate both the M-ComBat method and the GEP5, GEP70, and GEP80 risk signatures. The publicly available data set of previously untreated patients (HOVON-65/GMMG-HD4, n=288, accession number: GSE19784, hereon referred to as HOVON-65) will be used as well as baseline data from the MRC-IX trial (n=273) [8,18]. Raw intensity values were MAS5 normalized and converted to log2 scale.
Here we will treat the MIRT: Old Kit data as our ‘gold-standard’, reference batch because it was used to train and validate the GEP5, GEP70, and GEP80. When using M-ComBat, the designation of the reference batch should relate to focus of investigation when combining data across batches. Here we are investigating three risk signatures (GEP5, GEP70, and GEP80) which are based on a particular standard (MIRT: Old Kit). Alternatively if we were investigating the EMC-92 risk signature, we would choose the HOVON-65/GMMG-HD4 data as the reference batch as it is the training set of the EMC-92. In a meta-analysis context where we are performing new analyses or searching for new signatures on combined data, there are no set rules on choosing a reference batch; however, it may be best to choose the largest cohort, most relevant, or most familiar as the reference batch. However, the only benefit of M-ComBat over Combat in that context is that a newly discovered signature would be fixed to a known standard rather than the arbitrary location of ComBat transformed data.
ComBat  is a highly effective method of removing batch effects based on an empirical Bayes framework that allows for borrowed strength across probes. ComBat has proven itself to be a strong method, particularly with smaller sample sizes , and continues to be a widely used technique [14,15,21]. Chen et al.  found ComBat to be ‘best able to reduce and remove batch effects while increasing precision and accuracy’ when compared against five other popular batch effect removal tools.
ComBat centers data to the overall, grand mean of all samples which results in an adjusted data matrix that is shifted to an arbitrary location that no longer coincides with the location of any original batch. When using validated and fixed gene signatures on ComBat transformed data, these gene signatures on ‘gold-standard’ data will also be shifted. Many of these ‘gold-standard’ samples were likely used to train and build the gene signature; therefore, altering these samples when combining data sets will also alter the performance and proportion of high-risk individuals identified for a fixed risk signature on the original, reference training data.
Furthermore, the mean and variance estimates used in the final batch effect adjusted data are calculated using the gene-wise mean and variance estimates of the ‘gold-standard’, reference batch, i=r.
M-ComBat can be applied in R using a modified version of the ComBat script from the sva package. The altered script (including a small, simulated example) is available online for public use at http://github.com/SteinCK/M-ComBat.
We will illustrate both ComBat and M-ComBat as well as the GEP5, GEP70, and GEP80 risk signatures by transforming baseline purified plasma cell GEP samples from UARK Total Therapies across two kinds of sample preparation (Old and New kits) as well as two external data sets (HOVON-65 and MRC-IX). ComBat will be performed assuming a parametric model with no covariates. The four distinct batches will be shifted by M-ComBat to the MIRT: Old Kit ’gold-standard’ as this was the standard of data used to develop and train the GEP5, GEP70, and GEP80 signatures.
Both ComBat and M-ComBat completely eliminated significant batch effect related differences across the four distinct batches. Prior to transformation, over 98% of all probes were significantly differently expressed across at least one batch (probe-wise ANOVA found 53,827 of 54,675 q-values  below a 0.01 false discovery rate threshold). After performing either ComBat or M-ComBat, zero probes remained significantly differently expressed across batches according to the same threshold.
M-ComBat eliminated the overwhelming statistical differences among probes caused by systematic batch effects. In addition to eliminating differences in probes, the agreement in mean and distribution of risk scores coincided with that of the ‘gold-standard’ batch with M-ComBat. The proportion of high-risk patients identified following M-ComBat for external data sets more closely matched that of the reference group as well. The performance of risk scores improved overall using either ComBat or M-Combat; however, using M-ComBat and the original, optimal risk cutoffs allowed for greater ability in our study to identify a smaller cohort of high-risk subjects.
By eliminating the differences in means between all probes, all mean risk scores examined coincided with one another across batches following M-ComBat. This allows for potentially increased power in the risk score on external data sets following M-ComBat transformation. M-ComBat provides the structure to compare risk scores more fairly and realistically across different data sets lending itself directly to the further validation and comparison of yet discovered gene signatures scores.
Modified ComBat eliminates batch effect related differences while adjusting all samples to a pre-defined standard. The modification transforms samples to the mean and variance of the ‘gold-standard’ batch without altering samples from the original batch. This allows for validated and fixed predictive models to function properly on external data sets and offers a less biased framework for comparing gene signatures. Rather than redefining cutpoints of GEP-based scores for new batches, M-ComBat allows for the same cutpoints and logic used to build these GEP-based scores to function properly on external data sets.
Differences in scanner, laboratory, and sample preparation contributed significantly toward systematic batch effects seen when combining gene expression samples across these four batches. The non-biological variation introduced by differences in sample preparation kit, scanner, and laboratory had dramatic impacts on overall GEP expression (seen in PCA plot) as well as the smaller subsets of probes that define gene signatures. These variations are often unavoidable in long-term studies, and both ComBat and M-ComBat are equally as effective in eliminating these confounding batch effect related differences. However, M-ComBat offers an intelligent framework to overcome these changes and the power to control the location and scale of the transformed data.
M-ComBat allows for gene expression meta-analyses to combine data from different studies to a known standard. The benefit of a fixed standard in meta-analysis is important for gene signature comparison, differential expression analysis, and the search for new gene signatures on a larger subset of combined samples on a known standard.
M-ComBat is a practical modification to an accepted method that offers greater power to control the appearance of batch-effect adjusted data. M-ComBat allows for historical models to function as intended on future samples despite known, often unavoidable systematic changes to gene expression data.
I would like to acknowledge the help of Clyde Bailey and D. Ryan Williams in understanding the data. Also would like to acknowledge the help of Rachel Hunter-Merrill in organizing and filtering of the data.
- Fonseca R, Bergsagel PL, Drach J, Shaughnessy J, Gutierrez N, Stewart AK, et al. International myeloma working group molecular classification of multiple myeloma: spotlight review. Leukemia. 2009; 23(12):2210–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhan F, Hardin J, Kordsmeier B, Bumm K, Zheng M, Tian E, et al. Global gene expression profiling of multiple myeloma, monoclonal gammopathy of undetermined significance, and normal bone marrow plasma cells. Blood. 2002; 99(5):1745–57.View ArticlePubMedGoogle Scholar
- Zhan F, Tian E, Bumm K, Smith R, Barlogie B, Shaughnessy J. Gene expression profiling of human plasma cell differentiation and classification of multiple myeloma based on similarities to distinct stages of late-stage b-cell development. Blood. 2003; 101(3):1128–40.View ArticlePubMedGoogle Scholar
- Heuck C, Qu P, van Rhee F, Waheed S, Usmani S, Epstein J, et al. Five gene probes carry most of the discriminatory power of the 70-gene risk model in multiple myeloma. Leukemia. 2014; 28:2410–13.View ArticlePubMedPubMed CentralGoogle Scholar
- Shaughnessy J, Zhan F, Burington BE, Huang Y, Colla S, Hanamura I, et al. A validated gene expression model of high-risk multiple myeloma is defined by deregulated expression of genes mapping to chromosome 1. Blood. 2007; 109(6):2276–84.View ArticlePubMedGoogle Scholar
- Shaughnessy J, Qu P, Usmani S, Heuck CJ, Zhang Q, Zhou Y, et al. Pharmacogenetics of bortezomib test-dosing identifies hyperexpression of proteasome genes, especially psmd4, as novel high-risk feature in myeloma treated with total therapy 3. Blood. 2011; 118(13):3512–24.View ArticlePubMedPubMed CentralGoogle Scholar
- Kuiper R, Byoyl A, de Knegt Y, van Vliet MH, van Beers EH, van der Holt B, et al. A gene expression signature for high-risk multiple myeloma. Leukemia. 2012; 26:2406–13.View ArticlePubMedGoogle Scholar
- Dickens NJ, Walker BA, Leonoe PE, Johnson DC, Brito JL, Zeisig A, et al. Homozygous deletion mapping in myeloma samples identifies genes and an expression signature relevant to pathogenesis and outcome. Clin Cancer Res. 2010; 16(6):1856–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Mulligan G, Mitsiades C, Bryant B, Zhan F, Chng WJ, Roels S, et al. Gene expression profiling and correlation withoutcome in clinical trials of the proteasome inhibitor bortezomib. Blood. 2007; 109(8):3177–88.View ArticlePubMedGoogle Scholar
- Decaux O, Lode L, Magrangeas F, Charbonnel C, Gouraud W, Jezequel P, et al. Prediction of survival in multiple myeloma based on gene expression profiles reveals cell cycle and chromosomal instability signatures in high-risk patients and hyperdiploid signatures in low-risk patients: a study of the intergroupe francophone du myelome. J Clin Oncol. 2008; 26(29):4798–805.View ArticlePubMedGoogle Scholar
- Shaughnessy JD, Haessler J, van Rhee F, Anaissie E, Pineda-Roman M, Cottler-Fox M, et al. Testing standard and genetic parameters in 220 patients with multiple myeloma with complete data sets: superiority of molecular genetics. Br J Haematology. 2009; 137(6):530–6.View ArticleGoogle Scholar
- Biran N, Jagannath S, Chari A. Risk stratification in multiple myeloma, part 1: Characterization of high-risk disease. Clical Adv Hematol Oncol. 2013; 11(8):489–503.Google Scholar
- Dhodapkar MV, Sexton R, Waheed S, Usmani S, Papanikalaou X, Nair B, et al. Clinical, genomic, and imaging predictors of myeloma prgoression from anymptomatic monoclonal gammopathies (swog s0120). Blood. 2014; 123(1):78–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, et al. A comparison of batch effect removal methods for enhancement of prediction performance using maqc-ii microarry gene expression data. Pharmacogenomics J. 2010; 10:278–91.View ArticlePubMedPubMed CentralGoogle Scholar
- Kupfer P, Guthke R, Pohlers D, Huber R, Koczan D, Kinne RW. Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis. BMC Med Genomics. 2012; 5:23.View ArticlePubMedPubMed CentralGoogle Scholar
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010; 11:733–9.View ArticlePubMedGoogle Scholar
- Zhang Q, Heuck C, Qu P, Usmani S, Williams R, Zhou Y, et al. Gene expression profiling-based risk stratification scores in multiple myeloma can be highly sensitive towards sample preparation. Blood. 2013; 122(21):1865.Google Scholar
- Morgan G, Gregory W, Davies F, Bell S, Szubert A, Brown J, et al. The role of maintenance thalidomide therapy in multiple myeloma: Mrc myeloma ix results and meta-analysis. Blood. 2012; 119(1):7–15.View ArticlePubMedGoogle Scholar
- Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007; 8(1):118–27.View ArticlePubMedGoogle Scholar
- Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: An evaluation of six batch adjustment methods. PLoS ONE. 2011; 6(2):17238.View ArticleGoogle Scholar
- Konstantinopoulos PA, Cannistra SA, Fountzilas H, Culhane A, Pillay K, Rueda B, et al. Integrated analysis of multiple microarray datasets identifies a reproducible survival predictor in ovarian cancer. PLoS ONE. 2011; 6(3):18202.View ArticleGoogle Scholar
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003; 100(16):9440–5.View ArticlePubMedPubMed CentralGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.