Systematic noise degrades gene co-expression signals but can be corrected

Background In the past decade, the identification of gene co-expression has become a routine part of the analysis of high-dimensional microarray data. Gene co-expression, which is mostly detected via the Pearson correlation coefficient, has played an important role in the discovery of molecular pathways and networks. Unfortunately, the presence of systematic noise in high-dimensional microarray datasets corrupts estimates of gene co-expression. Removing systematic noise from microarray data is therefore crucial. Many cleaning approaches for microarray data exist, however these methods are aimed towards improving differential expression analysis and their performances have been primarily tested for this application. To our knowledge, the performances of these approaches have never been systematically compared in the context of gene co-expression estimation. Results Using simulations we demonstrate that standard cleaning procedures, such as background correction and quantile normalization, fail to adequately remove systematic noise that affects gene co-expression and at times further degrade true gene co-expression. Instead we show that a global version of removal of unwanted variation (RUV), a data-driven approach, removes systematic noise but also allows the estimation of the true underlying gene-gene correlations. We compare the performance of all noise removal methods when applied to five large published datasets on gene expression in the human brain. RUV retrieves the highest gene co-expression values for sets of genes known to interact, but also provides the greatest consistency across all five datasets. We apply the method to prioritize epileptic encephalopathy candidate genes. Conclusions Our work raises serious concerns about the quality of many published gene co-expression analyses. RUV provides an efficient and flexible way to remove systematic noise from high-dimensional microarray datasets when the objective is gene co-expression analysis. The RUV method as applicable in the context of gene-gene correlation estimation is available as a BioconductoR-package: RUVcorr. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0745-3) contains supplementary material, which is available to authorized users.

2 encephalopathy. Thus, a good set of negative controls would be a set of genes that can be assumed to play no role in the disease process.
One possible set of negative control genes would be so-called housekeeping genes. Housekeeping genes are involved in the basic maintenance of the cell, and their expression levels tend to be constant across a wide range of conditions. As noted in the main text, published lists of housekeeping genes are available, and one of the methods used to discover these genes is to identify genes whose expression levels are observed to be fairly constant [2].
In selecting a set of negative controls for the analyses in this paper, we elected to not use the housekeeping genes and simply used a set of genes that we observed to be fairly constant within our own datasets. Selecting our own set of negative controls in this manner also allowed us to ensure that we selected a set of genes with a representative range of expression levels. More specifically, we selected our negative controls as follows: First, we calculated the inter-quantile range of every gene. We then binned the genes according to their mean expression levels. Finally, we selected a fraction of the genes from each bin with the lowest IQR and designated these genes as our negative controls.
In the analyses of this paper, this method for selecting negative control genes seems to have produced good results. However, this method for selecting negative control genes may not be best.
In particular, we have not investigated the performance of alternative sets of negative controls (e.g. published lists of housekeeping genes); such an investigation is beyond the scope of this paper. Housekeeping genes may also not be ideal as they have unknown correlation structure which may not be representative of background co-expression.
We would also like to point out some potential pitfalls associated with using this method for selecting negative controls. The primary concern is that the genes we select may not actually be negative controls. Recall our model, Y = Xβ + W α + . We note that there are three sources of variation that contribute to the overall observed variation in gene expression levels the variation of interest (Xβ), the unwanted variation (W α), and random noise ( ). By selecting genes that show an overall low level of variation, we hope to exclude genes that exhibit any variation of interest. However, it is also possible that we might select genes that simply have an unusually low level of unwanted variation or random noise. A more pernicious problem may arise if X and W are negatively correlated. In such cases, the variation of interest and the unwanted variation may partially cancel out and thus may prevent the effective removal of the systematic noise.
Unfortunately, it is impossible to assess whether these caveats apply in our real datasets.
Finally, note that RUV-random assumes a mixed-effects model, in which the W α term is mod-3 eled as random. The negative controls are used to estimate the covariance of this term. However, by systematically selecting genes observed to have a low level of variation as our negative controls, we may bias our estimate of the covariance of W α.

RLE Plots for Large Data Sets
RLE plots are valuable graphical tools for estimating parameters required for the application of RUV-random. Unfortunately, for large data sets RLE plots loose their appeal, as it is impractical to compare hundreds or even thousands of boxplots. Here, we present a new RLE plot that neatly summarizes the information for all samples into one plot and additionally offers the possibility of adding information on known sources of systematic noise.
The original RLE plot is a boxplot of the devation from the median gene expression calculated for each gene in a sample. Using the width of the boxplot against the median deviation, it is possible to plot all samples in the same plot. While this still allows the easy identification of outliers, it has the added benefit of providing a study-wide perspective. In particular, samples can be colored according to the levels of a known source of systematic noise (compare Figure 1). This is implemented in the function RLEPlot in the package RUVcorr. The left-hand panel shows novel RLE plot for RUV-random treated data and the right-hand panel displays the untreated data. The samples are colored by which batch they belong to. The corresponding legend can be found on the far left. For a more thorough description of the novel RLE plot see the text.

FIVE LARGE DATASETS ON GENE EXPRESSION IN THE HUMAN BRAIN
We downloaded the raw (i.e. non-normalized) data for each study. Data for the Colantuoni et al study [3] (acession number: GSE30272) and the Hernandez et al study [4]  Due to the differences in study design and technology preprocessing protocols varied. However, for all datasets we applied background correction [8] and then quantile-normalization [9] or in case of the Colantuoni et al dataset loess-normalization [10]. In the following we briefly describe the preprocessing steps taken for each study. Both studies relied on Agilent technology to measure gene expression (see [11] for more details). Analysis was conducted using the R package limma.  S1 Table for the names of the removed arrays). These were consequently removed from further analysis. We then proceeded to apply the standard normalization approaches. Since the datasets in this study included some replicate samples. We averaged the samples for these. We then removed transcripts that were nonhuman or not annotated as well as all technical controls. Transcripts with multiple measurements were represented by the measurement with the largest median value across all samples. Finally, we log 2 transformed all gene expression values.

Dataset by Kang et al
In this study, gene expression was measured using the Affymertix Human Exon 1.0 ST array.
Analysis was conducted with the R package aroma.affymetrix [13]. First we checked array density plots and spot images identifying 11 abnormal or failed arrays that we removed from further analysis (see S1 Table for the names of the removed arrays). We then normalized the data using the approaches outlined above. After this probe level modelling was applied to the differently treated datasets including the untreated data. We then removed transcripts that were non-human or not annotated as well as all technical controls introduced by the manufacturer. Transcripts with multiple measurements were represented by the measurement with the largest median value across all samples. Finally, we log 2 transformed all gene expression values.

Dataset by Colantuoni et al
In this study, gene expression was measured using a custom Illumina two-color microarray with a common reference. Using the R package limma [14] we calculated the M values for further analysis, which is standard procedure when dealing with two-color microarrays with a common reference. These were used to create array density plots as well as PCA plots. In combination with MA-plots for each sample, these plots helped us to identify three failed or abnormal arrays which were consequently removed from further analysis (see S1 In this study, gene expression was measured using an Illumina HumanHT-12 V3.0 beadchip. Analysis was conducted with the R package limma. First we checked array density plots, MAplots and PCA-plots identifying 5 abnormal or failed arrays that we removed from further analysis (see S1 Table for the names of the removed arrays). We then normalized the data using the approaches outlined above. We then removed transcripts that were non-human or annotated as 'bad matches' by the manufacturer. Transcripts with multiple measurements were represented by the measurement with the largest median value across all samples. Finally, we log 2 transformed all gene expression values.

PRIORITZING EE CANDIDATE GENES
In 2014, Oliver et al provided new insights into the mechanism of EE using gene prioritization [15]. In this paper, we follow their approach with few modifications and some updates concerning known EE genes and candidates.

Gene Prioritization Method
Candidate genes are prioritized when they are co-expressed with at least one known EE gene.
Thus, for each candidate and each known EE genes combination, co-expression needs to be established. Oliver et al used the PCC (or a weighted version introduced by Horvath [16] in case of multiple samples per genes) in order to establish the amount of co-expression. In order to avoid declaring spuriously correlated genes co-expressed, they used a thresholding approach. We also make use of the PCC combined with a thresholding approach. Here, we set the threshold for an absolute value of the PCC that corresponds to the top 0.2 of ranked random genes. These random genes were ranked according to their maximum absolute correlation with any of the known EE genes. This process was repeated 1000 times and the average maximum absolute correlation was calculated and set as a threshold. Note that the size of the randomly selected genes corresponded to the number of candidates in the various datasets.

Known EE Genes and Candidates
The list of known EE genes used in the Oliver et al paper was updated from 29 known EE genes to 33 confirmed EE genes (as known in September 2014). Candidate genes included all candidates used in the Oliver et al paper, well-established genes for other types of epilepsies and recent significant findings from the ILAE meta-analysis of genome-wide association studies [17].
Genes 500kb up-or downstream of the location of these findings were included in the list. Table   I gives the number of candidates and known EE genes found in each dataset. Note that for the application of RUV-random, all known EE genes and candidates were excluded from being selected as negative control genes.

SIMULATION
The simulation framework is based on the RUV linear model, which is described in the Section Methods and Materials of the accompanying paper. For the purposes of investigating the performance of different cleaning procedures in the context of gene-gene correlations, it is necessary to known the true underlying correlation structure between the simulated genes. Therefore, the fundamental concept of this simulation is that Σ = Cor(Xβ) represents the genuine gene-gene correlation structure induced by the biological signal. This setup allows control of the average absolute value of the gene-gene correlations through the dimensionality of X and β, p. In the following we explain the simulation setup in detail.

Simulation Setup
For the simulation the parameters of the RUV linear model where simulated in the following manner: • X is a m × p matrix with x i,j ∼ N (0, 1) for i = 1, ..., m and j = 1, ..., p.
• β is a p × n matrix with where n c is the number of negative control genes.
Note that we define Σ i,j = 0 when either i = (p − n c + 1), ..., p or j = (p − n c + 1), ..., p. To simulate correlation between X and W define X = X|W is m × (k + p) matrix with rows X i ∼ N (0, Σ X ) for i = 1, .., m. The covariance matrix is defined by Σ X = with g denoting the dimensionality of the shared subspace of X and W . Hence, the parameter g controls the average correlation between the columns of X and W . If g is increased then the average correlation increases.
In our R-package RUVcorr the sizes of all parameters presented above can be varied. For the simulation scenarios presented here we used the following values, unless they were varied for specific simulations: • Number of samples m = 1000 • Number of genes n = 2500 of which n c = 2000 are negative control genes • Level of systematic noiseα = 2 • Variance of random noise˜ = 0.01 • Dimensionality of systematic noise k = 10 • Dimensionality of factor of interest p = 5 • No correlation between X and W Note that in the application of RUV-random generally all 2000 simulated negative control genes were used. We also assumed that the values of k and ν were known.

Further Simulation Scenarios
We investigated the performance of RUV-random when the systematic noise level was varied (compare Table II). Interstingly, the performance of RUV-random in the simulations is absolutely unchanged no matter how much systematic noise is present in the data. When combining RUV-random with other cleaning approaches we had to find the optimal choice for the parameters k and ν first. Interestingly, the optimal choices for k were generally very large indicating that the application of QN or BC introduced new systematic noise that RUVrandom struggles to remove. Generally, RUV-random on its own did much better than combining approaches (compare Table III). However, it needs to be remembered that we did not simulate measured background noise.

Incorrectly Specified Negative Control Genes
We simulated the effect of incorrectly specified negative control genes. For this different proportions of our 2000 negative controls were selected to be genes that were simulated to have co-expression with other genes. Figure 2 demonstrates that even 50% incorrectly specified negative control genes had little effect on the performance of RUV-random. Indeed, it seems as long as there are enough genuine negative control genes RUV-random perfoms well. Thus, we investigated how many negative control genes are necessary for RUV-random to perform well. Figure 3 shows that with a number twice as large as k correlations estimated from the RUV-random corrected simulated data are already very close to the truth. However, we would still strongly recommend using as many negative controls as possible.

REAL DATA
In this Section we briefly comment on the application of RUV-random to the real datasets and we provide some more information on the results of the prioritzation of the candidate EE genes.

Application of RUV-random to the Real Datasets
As discussed earlier, the application of RUV-random requires the researcher to choose various parameters. For this paper we limit ourselves to giving our parameter choices for each dataset (compare Table IV), but do not provide detailed explanation for our choices. Briefly, we tried to selectk and ν in a way to avoid over-correction and removal of the factor of interest, while at the same trying to ensure that all systematic noise from known sources was removed. To this end we used the methods discussed in the Section Methods and Materials of the paper and Section 1 1.1. For all datasets we used 4000 empirically chosen negative control genes for RUV-random. Note that the performance of RUV-random is determined by the absolute number of negative control genes and not by the proportion of negative control genes to all genes in the dataset.