Pan-cancer analysis of systematic batch effects on somatic sequence variations
© The Author(s). 2017
Received: 13 December 2016
Accepted: 6 April 2017
Published: 11 April 2017
The Cancer Genome Atlas (TCGA) is a comprehensive database that includes multi-layered cancer genome profiles. Large-scale collection of data inevitably generates batch effects introduced by differences in processing at various stages from sample collection to data generation. However, batch effects on the sequence variation and its characteristics have not been studied extensively.
We systematically evaluated batch effects on somatic sequence variations in pan-cancer TCGA data, revealing 999 somatic variants that were batch-biased with statistical significance (P < 0.00001, Fisher’s exact test, false discovery rate ≤ 0.0027). Most of the batch-biased variants were associated with specific sample plates. The batch-biased variants, which had a unique mutational spectrum with frequent indel-type mutations, preferentially occurred at sites prone to sequencing errors, e.g., in long homopolymer runs. Non-indel type batch-biased variants were frequent at splicing sites with the unique consensus motif sequence ‘TTDTTTAGTT’. Furthermore, some batch-biased variants occur in known cancer genes, potentially causing misinterpretation of mutation profiles.
Our strategy for identifying batch-biased variants and characterising sequence patterns might be useful in eliminating false variants and facilitating correct interpretation of sequence profiles.
KeywordsBatch effect TCGA Pan-cancer Mutation
The Cancer Genome Atlas (TCGA) is a comprehensive database that includes multi-layered genome profiles collected from more than 30 cancer types, including genomic mutations, mRNA and miRNA, DNA copy number, DNA methylation and protein expression profiles. Large-scale collection of data inevitably generates batch effects. For example, processing steps ranging from sample collection to data generation are performed on different days using different lots of reagents or at different sites, potentially introducing batch effects into the data profiles. To compile the TCGA data, human tumour samples were collected from hundreds of Tissue Source Sites (TSS) and sent to two different Biospecimen Core Resources (BCRs), which processed quality-checked and stored specimens along with patients’ clinical information. On the other hand, genomic data were generated from multiple Genome Sequencing Centers (GSCs) and the Genome Characterization Centers (GCCs). These diverse data processing steps were barcoded with a unique sample name, which included the IDs of the TSS, participant, sample, vial, portion, analyte, plate and GSC (http://cancergenome.nih.gov/).
Previously, batch effects in genomic data have been addressed in diverse platforms including microarrays, RNA-seq and Exome-seq data [1–10]. In TCGA, systematic biases in variable batches have also been reported . MBatch, a web-based analysis tool (http://bioinformatics.mdanderson.org/tcgabatcheffects), has been developed to investigate batch effects comprehensively. However, most of these studies are applicable only to continuous variables such as gene expression levels, but not to mutation data. Moreover, the overall patterns or characteristics of the sequence variant calls affected by batch effect have not been investigated extensively. Thus, in this study, we systematically evaluated batch effects on sequence variants identified from whole-exome data of TCGA. By examining barcodes that can introduce batch effects (e.g., TSS, Plate ID, GSC, etc.), we identified 999 batch-biased variants from pan-cancer data. Systematic errors were generated by platform-dependent sequencing reactions, as well as batch effects due to different sampling conditions. By comparing mutational spectra of the batch-biased variants with the unbiased variants, we observed that batch-biased variants preferentially occurred at the sites prone to sequencing errors. Moreover, we found that the batch-biased variants had unique sequence patterns that could be recognised and eliminated from the data. Our systematic evaluation of batch-biased variants might be helpful in diminishing false-positive calls from the large-scale and heterogeneous TCGA profiles.
Data collection and processing
Of the 34 cancer types in TCGA (legacy level2, date 2016-05-24), we collected data from 20 cancer types which have more than 200 samples per cancer type. These included 7,502 tumor samples from 65 mutation annotation format (MAF) datasets, which identified somatic mutations in cancers using various analysis pipelines and/or sequencing platforms. Of the selected MAF files, those with extremely high frequency of indels (>500,000), no indels, or sample sizes less than 100 were filtered out. Finally, we used 46 MAF datasets from 19 cancer types for our analysis (Additional file 1: Figure S1 and S2; Additional file 2: Table S1).
Of the barcodes encoded in the sample names, we tested those with sufficient batch size (greater than 5) for each barcode ID, including the IDs for TSS, plate, and GSC. The BCR ID and date of shipment were also included in the barcodes. The batches (TSS, plate, or GSC) were obtained from the sample barcode IDs, and ‘date of shipment’ and ‘BCR ID’ were obtained from the clinical information and TCGA batch code table (https://github.com/saketkc/tcga-python/tree/master/TCGA/data), respectively. In addition, we also evaluated batch effects by clinical features such as patient’s age, gender, race, smoking history, ethnicity, tumor size, tumor grade, tumor stage, histological type, and country.
Additional annotation of mutations, including functional category, genomic region, and coding strand, were performed using the ANNOVAR software . The flanking sequences of mutation sites were obtained using R library ‘Biostrings’. Homopolymer runs of each nucleotide were calculated by counting the number of continuous identical nucleotides (A, T, G, and C) within the flanking sequences of a variant.
Estimation of batch-biased variant calls
To estimate batch effects on sequence variations, the number of samples with a variation in the tested barcode ID and other IDs vs. the number of samples without a variation were compared by Fisher’s exact test for each variation and each barcode ID. Filtering out the barcode IDs with mutation frequency < 3 in each MAF file, batch-biased variants were identified with a stringent statistical significance threshold of P-value < 0.00001. Then, if the two barcode IDs had more than 75% of the same batch-biased variants, the barcode ID with the highest frequency of batch-biased variants was chosen as representative.
Phylogenic tree analysis
The pairwise distance from sequences of all mutation sites was evaluated using R library ‘phangorn’. The phylogenic tree was constructed using Neighbour-Joining tree estimation method.
Consensus sequence and motif analysis
The position-specific consensus sequence of variants was evaluated using sequence-logo viewer Weblogo 3 with default parameters . De novo motif sequence for batch-biased variants at splicing sites was identified using Homer with default parameters .
Identification of batch-biased sequence variants in TCGA data
To determine possible batch effects on the sequence variant calls, we monitored the association of the mutation frequencies with batches prepared using different processing and data generation steps. Batch effects were estimated by applying Fisher’s exact test to each batch ID, as described in Methods. This analysis revealed 2,636 somatic sequence variants (false discovery rate; FDR ≤ 0.0027) associated with non-biological batches. Ultimately, 999 variants with representative batch IDs were determined as potential batch-biased variants (see Fig. 1a).
The list of biased batch IDs
plate_id ~ 0886
plate_id ~ 0966
plate_id ~ 0928
plate_id ~ 1040
plate_id ~ 0969
plate_id ~ 2284
plate_id ~ A10C
plate_id ~ A10M
plate_id ~ 1719
plate_id ~ A289
plate_id ~ 1705
plate_id ~ A12Z
plate_id ~ 2340
plate_id ~ 2340
plate_id ~ A071
plate_id ~ A188
plate_id ~ 1252
plate_id ~ 2114
In addition, to evaluate possible effects of mutation similarity of the samples on the batch effects, we performed phylogenic tree analysis on KIRC and LUAD data that had the most frequent batch-biased variants. This analysis revealed that the samples harboring batch-biased variants were clustered together, indicating that these samples had similar mutation profiles (Fig. 2d). However, the mutation profiles excluding the batch-biased variants did not cluster together. Thus, we could rule out the possibility that those batch-associated variants are the result of the similar mutation profiles among the samples.
Comparison of mutation spectrum of the batch-biased and unbiased variants
We also evaluated the possible association of GC content with the batch-biased variant calls. By calculating the percentage of GC content in the 20 nucleotides flanking the mutation sites for each of the mutation types (AT/GC and indel/non-indels, i.e., indel–AT, indel–GC, non-indel–AT and non-indel–GC), we found that the batch-biased non-indel–GC variants had significantly higher GC contents than the unbiased variants (P = 9.66 × 10−3, Student’s T-test). Conversely, the batch-biased non-indel–AT variants showed significantly lower GC content than the unbiased variants (P = 3.82 × 10−10, Student’s T-test). By contrast, the frequency of indel-type batch-biased variants was not related to the neighbouring GC content levels (Fig. 3b). In addition, batch-biased indels had frequent frameshift insertions, whereas the unbiased indels had frequent frameshift deletions (Fig. 3c). Of the non-indel variants, the batch-biased variants had more frequent stop-loss mutations than the unbiased non-indel variants (P = 1.38 × 10−5, Fig. 3c). These findings imply that generation of batch-biased variants can largely be attributed to neighbouring sequences with preferential mutation types and characteristic GC contents.
Previous genomic landscape studies of the sequence mutations exhibited that the mutation rates vary considerably among genomic locations . When we evaluated the distribution of the genomic coordinates of the variants in relation to gene positions, we found that the batch-biased variants occurred frequently at splicing sites, although the unbiased variants did not (P = 1.62 × 10−48, Fig. 3d). This result implies the preferential occurrence of erroneous batch-biased variant calls at splicing sites. Furthermore, we found that the batch-biased variants had higher mutation rates than the unbiased variants (Fig. 3e). Taken together, our results suggest that batch-biased erroneous sequence variants have unique genomic features and preferentially occur at the positions prone to sequencing errors. Thus, sequencing errors might be more frequent when the sample batches were prepared from lower-quality tissues, either due to inadequate processing of the tissue or errors in data generation.
Homopolymer runs are associated with batch-biased variants
Batch-biased variants occur frequently at splicing site
Batch-biased variants in significantly mutated genes (SMGs)
In this study, by performing a pan-cancer analysis of exome sequencing data from TCGA, we evaluated possible batch effects on somatic mutation calls and identified 999 potential batch-biased variants. Batch-biased sequence variants were frequently found in specific cancer types: KIRC, LUAD and UCEC. Most of the batch effects were identified from plate IDs 0886 and 0928, implying possible problems with tissue quality on these plates.
Mutation spectrum analysis of batch-biased calls revealed sequencing platform–dependent, error-prone features of the batch-biased variants. Our results imply that the sample batches with poor tissue quality may lead to batch-biased sequencing errors which might preferentially occur at the sites prone to sequencing error rather than randomly.
In fact, sequencing platform–dependent substitution errors, particularly in the Illumina sequencing systems, have been reported previously [16, 21–23]. Frequent sequencing errors are associated with higher GC content and the presence of long G/C homopolymers . To reduce false calls, many of variant calling methods have been developed that filter variants based on minimum depth of coverage, masking of repetitive sequence regions and trimming of poor-quality bases from the ends of reads [25–27]. However, although batch-biased sequencing errors occurred preferentially at sequencing error–prone sites, the batch-biased errors differed significantly from the platform-dependent sequencing errors. We demonstrated that indels were more frequent than substitution mutations in the batch-biased variants, whereas Illumina sequencers generate more substitution-type errors more often than indel type . Comparison of batch-biased variants with the unbiased variants also revealed distinct sequence patterns and mutation spectra between them. These results imply that the biased variants occur preferentially at sites prone to platform-dependent sequencing error, but they also have unique genomic features independent of previously established error-prone sites.
Interestingly, we found that the batch-biased variants were frequent at splicing sites (see Fig. 3), where they may be associated with homopolymer runs at flanking regions. Indeed, analysis of the flanking sequences of the batch-biased splicing sites revealed consensus sequences: upstream long T homopolymer in the intronic region and downstream dinucleotide ‘TT’. This finding is consistent with previous studies reporting that the rates of erroneous sequencing increase at long homopolymer runs [16–19]. Nearly all introns of the U2 type are spliced by the major spliceosome and flanked by GT-AG splice site dinucleotides . The polypyrimidine tracts of GT-AG and GC-AG splice sites have C-rich and T-rich sequences, respectively . Thus, batch-biased T > A/A > T variants might occur preferentially at U2-type GC-AG splicing sites. Indeed, preferential sequencing errors at splicing sites have been reported previously in multiple studies. Sequencing errors are prone to occur at the G base of 3′ splice dinucleotides because of the suppression of G after incorporation of A in the dye-terminator sequencing reaction . Moreover, the ‘G after A’ problem is further enhanced by the polypyrimidine tract that precedes the acceptor AG. Sequencing errors at homopolymer runs of T and C can be generated by polymerase slippage. However, because mutations at splicing sites are implicated in a number of cellular processes and diseases , preferential batch-biased mutations at splicing sites may lead to misinterpretation of the functional effects of mutation profiles.
In addition, as a proof-of-concept, we demonstrated that some SMG genes harbour these batch-biased variants with relatively higher mutation rates (see Fig. 6). This result implies that batch-biased variants might occur even at cancer genes, emphasising those batch-biased variants should be considered carefully in mutation data analysis.
TCGA data are still processing update, correcting erroneous variant calls. However, many previous studies using the legacy TCGA data might be affected by the batch biases. Therefore, our findings regarding batch bias effect might be helpful in validating or re-analyzing studies using these legacy data.
Our strategy for identifying batch-biased variants and characterising sequence patterns might be useful for eliminating false variant calls, and could thus help to interpret sequence profiles correctly.
Biospecimen Core Resource
Breast invasive carcinoma
Genome Characterization Center
Genome Sequencing Center
Kidney renal clear cell carcinoma
Kidney renal papillary cell carcinoma
Lower grade glioma
Liver hepatocellular carcinoma
Mutation annotation format
Skin cutaneous melanoma
Significantly mutated gene
The Cancer Genome Atlas
Tissue Source Sites
Uterine corpus endometrial carcinoma
The results shown here are in whole or in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
This research was supported by grants from the Korea Healthcare Technology R&D Project, Ministry of Health & Welfare (H15C1551) and from the National Research Foundation of Korea (NRF) funded by the Korea government (MSIP) (NRF-2012R1A5A2048183). The funding body played no role in the design or conclusions of this study.
Availability of data and materials
All data generated or analysed during this study are included in this published article or the Additional files.
Conceived and designed the study: JHC, SEH, HGW. Analysed the data: JHC, SEH, HGW. Wrote the paper: JHC, HGW. All authors read and approved the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.View ArticlePubMedGoogle Scholar
- Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, Liu C. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One. 2011;6(2):e17238.View ArticlePubMedPubMed CentralGoogle Scholar
- Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–3.View ArticlePubMedPubMed CentralGoogle Scholar
- Consortium SM-I. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14.View ArticleGoogle Scholar
- Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014;42(21):e161.
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11(10):733–9.View ArticlePubMedGoogle Scholar
- Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC bioinformatics. 2010;11:94.View ArticlePubMedPubMed CentralGoogle Scholar
- Guo Y, Zhao S, Su PF, Li CI, Ye F, Flynn CR, Shyr Y. Statistical strategies for microRNAseq batch effect reduction. Translational cancer research. 2014;3(3):260–5.PubMedPubMed CentralGoogle Scholar
- Carson AR, Smith EN, Matsui H, Braekkan SK, Jepsen K, Hansen JB, Frazer KA. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies. BMC bioinformatics. 2014;15:125.View ArticlePubMedPubMed CentralGoogle Scholar
- Guo Y, Zhao S, Sheng Q, Ye F, Li J, Lehmann B, Pietenpol J, Samuels DC, Shyr Y. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics. 2014;103(5-6):323–8.View ArticlePubMedGoogle Scholar
- Lauss M, Visne I, Kriegner A, Ringner M, Jonsson G, Hoglund M. Monitoring of technical variation in quantitative high-throughput datasets. Cancer informatics. 2013;12:193–201.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.View ArticlePubMedPubMed CentralGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14(6):1188–90.View ArticlePubMedPubMed CentralGoogle Scholar
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.View ArticlePubMedPubMed CentralGoogle Scholar
- Araya CL, Cenik C, Reuter JA, Kiss G, Pande VS, Snyder MP, Greenleaf WJ. Identification of significantly mutated regions across cancer types highlights a rich landscape of functional molecular alterations. Nat Genet. 2016;48(2):117–25.View ArticlePubMedGoogle Scholar
- Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform. 2016;17(1):154–79.View ArticlePubMedGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51.View ArticlePubMedPubMed CentralGoogle Scholar
- Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: accurate indel calls from short-read data. Genome Res. 2011;21(6):961–73.View ArticlePubMedPubMed CentralGoogle Scholar
- Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011;12(11):R112.View ArticlePubMedPubMed CentralGoogle Scholar
- Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q, McMichael JF, Wyczalkowski MA, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502(7471):333–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Schirmer M, Ijaz UZ, D’Amore R, Hall N, Sloan WT, Quince C. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res. 2015;43(6):e37.View ArticlePubMedPubMed CentralGoogle Scholar
- Xuan J, Yu Y, Qing T, Guo L, Shi L. Next-generation sequencing in the clinic: promises and challenges. Cancer Lett. 2013;340(2):284–95.View ArticlePubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011;39(13):e90.View ArticlePubMedPubMed CentralGoogle Scholar
- Shin S, Park J. Characterization of sequence-specific errors in various next-generation sequencing systems. Mol Biosyst. 2016;12(3):914–22.View ArticlePubMedGoogle Scholar
- Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009;10(8):R83.View ArticlePubMedPubMed CentralGoogle Scholar
- Kao WC, Stevens K, Song YS. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. 2009;19(10):1884–95.View ArticlePubMedPubMed CentralGoogle Scholar
- Kao WC, Song YS. naiveBayesCall: an efficient model-based base-calling algorithm for high-throughput sequencing. J Comput Biol. 2011;18(3):365–77.View ArticlePubMedGoogle Scholar
- Sheth N, Roca X, Hastings ML, Roeder T, Krainer AR, Sachidanandam R. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Res. 2006;34(14):3955–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Boldina G, Ivashchenko A, Regnier M. Using profiles based on nucleotide hydrophobicity to define essential regions for splicing. Int J Biol Sci. 2009;5(1):13–9.View ArticlePubMedGoogle Scholar
- Platzer M, Hiller M, Szafranski K, Jahn N, Hampe J, Schreiber S, Backofen R, Huse K. Sequencing errors or SNPs at splice-acceptor guanines in dbSNP? Nat Biotechnol. 2006;24(9):1068–70.View ArticlePubMedGoogle Scholar
- Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem. 2003;72:291–336.View ArticlePubMedGoogle Scholar