Skip to main content
Fig. 3 | BMC Bioinformatics

Fig. 3

From: Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets

Fig. 3

a gPCA p-value vs preserved data variance plot for Dataset 3 (Johnson et al., 2007), showing the scores for data before correction (*gPCA = .225), and after correction by ComBat and Harman batch effect removal methods. For Harman, the fractions in the labels denote the adjustable confidence threshold (=1-probability of overcorrection) for batch noise removal. Hn-.95 is highlighted as it may be the setting of choice for a typical dataset. On the vertical axis, the larger the p-value the lower the probability of batch noise presence as detected by gPCA (Reese et al, 2013). Raw data p-value of .225, indicates that the batch noise component in Dataset 3 is not as predominant as Datasets 1 and 2. The worst case scenario for both methods is that there is no batch effect in the dataset and what they do remove is genuine biological signal. ComBat removes 49 % (with gPCA p-value = 1) of the data variance, which is about the same proportion it removed Dataset 1 (48 %; gPCA p-value = .233), which had the most prevalent batch effect (gPCA p-value = .008). Harman (Hn.95) removes 17 % (gPCA p-value = .52), when it removed 37 % (gPCA p-value = .63) from Dataset 1. Hn-75 matches ComBat’s gPCA p-value of 1 while removing 20 percentage points less data variance. b A plot of first and second PCs for Dataset 3 before correction (Johnson et al., 2007). The three colours represent the three processing batches. The shapes represent four distinct experimental conditions. The figure indicates that within-treatment variability is a larger source of data variance than batch effects in the top two principal components

Back to article page