Error, reproducibility and sensitivity: a pipeline for data processing of Agilent oligonucleotide expression arrays
© Chain et al; licensee BioMed Central Ltd. 2010
Received: 4 November 2009
Accepted: 24 June 2010
Published: 24 June 2010
Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples.
We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrate that intrarray variability is small (only around 2% of the mean log signal), while interarray variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units ( 6% of mean). The common practise of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflect the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identifies an underlying structure which reflect some of the key biological variables which define the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years and collected by a variety of operators.
This study outlines a robust and easily implemented pipeline for extracting, transforming normalising and visualising transcriptomic array data from Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non biological) intra- and interarray variability, and for a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further more extensive studies of the systems biology of eukaryotic cells.
The application of microarray technology to study the expression of thousands of genes in biological samples has become commonplace. The diversity of technologies initially explored has been replaced by a landscape dominated by a small number of proprietary platforms, which differ principally in the type of probe used for hybridisation. Each platform has advantages and disadvantages, and knowledge of technical reproducibility and sources of data variability is crucial to optimising experimental design and data analysis. The reproducibility and comparability of several different platforms has been rigorously examined in the MAQC project, and overall the different platforms give reassuringly similar results, with similar accuracy and sensitivities .
We have collected a large number of array data sets from Agilent human genome arrays . The latest releases of these arrays have approximately 44300 features, which include various control oligonucleotides, and a set of 41001 different oligonucleotide 60 mers complimentary to unique human mRNA sequences. Of these 41001, the latest Agilent annotation lists 29,806 as corresponding to known genes and/or ORFs, of which 19392 are unique. Around 12000 probes correspond to as yet unannotated stretches of human genome. The Agilent platform is designed to be used as a two colour system, probing and then detecting hybridisation of two different cDNA samples labelled with different fluorescent dyes on the same array (typically Cy3 in the green channel and Cy5 in the red channel).
Although there are now many published studies using the Agilent arrays, there is a paucity of studies systematically investigating the reproducibility and sources of variation in this microarray platform. There is also relatively little software for implementation of data preprocessing for this platform. Using a set of data generated from this array platform, we dissect the major contributors to experimental variability within the data. We develop a new R package (agilp) to extract data from Agilent raw data files, and implement a robust Loess normalisation to a mean of multiple experiments. This reduces variability to a level which allows accurate and informative comparisons between array data sets collected at different times and by different operators. We also demonstrate that the low level expression detected by the Agilent arrays for the majority of genes in any one cell type likely corresponds to a genuine high degree of transcriptomic complexity, and is unlikely to arise from weak non specific cross-hybridisation. The results of the study thus give confidence to studies which use this technology and quantify important performance parameters which can be used to optimise the design and interpretation of future transcriptomic experiments.
Impact of logarithimic transformation
We initially examined a set of 77 data sets ( arrays 1-77 in Additional File 1, Table S1) collected in our laboratory over a period of about three years. In each experiment the Agilent human whole genome expression arrays were probed with a mixture of two cDNA preparations labelled with Cy3 and Cy5 respectively. The Cy3 sample in every experiment was a sample of "reference" cDNA prepared from the Stratagene Universal Human Reference reagent. The same batch of reference RNA was used in all cases, although several different batches of labelled cDNA were made. The Cy5-labelled cDNA was prepared from RNA extracted from monocyte derived macrophages or dendritic cells, or the Mutz-3 leukaemic cell line cultured and/or stimulated in a variety of different ways. A summary of the cell type/experimental stimulus for each array is provided in Additional file 1 Table S1. The Cy5 channel for this first set of arrays therefore always provided data for the experimental sample, while the Cy3 provided data for the reference.
The raw output from the scanner shows a large dynamic range from around 60 to 200000 arbitrary units. These data are commonly log transformed prior to manipulation, in order to make the standard deviation (SD) independent of magnitude, and improve the fit to a Gaussian normal distribution. Both these assumptions were therefore tested on the set of reference arrays.
The normality of the distribution of the signal across the arrays was tested using the Shapiro Wilk test for each probe individually. The fit varied widely for different probes. However, the proportion of probes whose distribution did not differ significantly from normality (p > 0.05) increased from 55% to 78% when Log2 transformed data were analysed. Log2 transformed data were therefore used for all further analysis.
Variation attributable to intra array "experimental error."
Intra and inter array variation, before and after normalisation
Log transformed without normalisation2
Log transformed with linear regression normalisation2,3
Log transformed with LOESS linear regression normalisation2,3
Variation attributable to inter array experimental "error."
We next examined the variation which arises from experiments involving different arrays. In order to isolate the variation arising from technical variables (e.g hybridisation conditions, inter array variation etc.) from variation arising from differences in the biological samples being tested, we first analysed variation in the Cy3 signal, which represents hybridisation of an identical biological sample 77 times (Table 1, second column). The average SD for each probe across the arrays for the Cy3 signal was 0.56, representing 6.6% of the mean signal. By comparison, the SD for the Cy5 channel, which includes additional variation from biological variables (e.g. different cell type, different stimuli etc.), as well as unknown intrinsic biological variability was 0.91 (Table 1).
Reduction in variability by normalisation via linear regression
In order to reduce variability by correcting for systematic differences between arrays we normalised the data by simple linear or LOESS local linear regression against the mean signal of all the samples (i.e. a model in which the mean and the slope is held constant).
We applied both simple linear and LOESS local linear regression (dividing up the range into 10 local data sets) across the whole set of arrays. Using the normalised data, we calculated the average SD for each probe (the genewise SD) across the set of arrays (Table 1). Linear regression normalisation decreased the average SD of the Cy3 reference array data set to 0.31. LOESS normalisation decreased the SD further to 0.24 (3% of the mean log intensity for the whole array) . 95% of genes had a SD < 0.46 log2 units. The equivalent SD values for the experimental (Cy5) data were also correspondingly decreased (Table 1). We also evaluated quantile normalisation, the preprocessing normalisation implemented in the widely used Affymetrix RMA [3, 4] method, which reduced the average SD for the Cy3 channel to 0.29.
Further reduction of variation by normalisation via ratio to the reference sample
The covariance between experimental and reference signal was therefore calculated for each probe across the set of arrays (Table 1). The average covariance for all probes was then used to calculate the corrected SD of the difference (ratio) between experimental and reference data sets. Using the ratio for each probe slightly decreased the overall SD of the normalised data (Table 1). The "probe specific" normalisation therefore provides an additional small benefit (in terms of reduction in variability) over and above the global normalisation introduced above.
Unexpectedly, for probes showing the biggest stimulation indices (i.e. the greatest response to interferon), the stimulation index in the green channel is positively correlated to the stimulation index observed in the red channel. This effect is only seen at high stimulation indices, and the correlation is lost once the difference in signal is less than 3 log2 units.
Low intensity signals in Agilent array hybridisation
These data might support a hypothesis that a large number of genes are being transcribed in any one cell type. However, the conclusion is reliant on the the negative control supplied by Agilent. We therefore tested the hypothesis independently. In view of the very large evolutionary distance between Arabidopsis and humans, and the process of Agilent selection of probe sequences so as to avoid repetitive elements or highly conserved common sequences, we predicted that the plant RNA samples should show very little true hybridisation to the arrays. In contrast, if the low level signals were as a result of random low stringency interactions, this should occur to a similar extent using the plant as the human samples. The log signal frequency distribution for the average of three Arabidopsis samples (the data could not be normalised in the same way, since the assumptions that the mean signal is the same is obviously false) was compared to the signal distribution for human samples (fig 4c). The majority of the probes from the arrays hybridised to the Arabidopsis samples gave signals close to or just above the background probe.
Array reproducibility viewed through multidimensional scaling
Having established the basic parameters of error/variability, we examined whether global patterns of gene expression were reproducible and robust when comparing data collected over a time period of several years, and by different operators. The same data set of arrays log2 transformed and Loess normalised as detailed above were analysed using classical multidimensional scaling (MDS), a powerful unsupervised approach to identify relationships between sets of arrays. Since the distance matrices are based on Euclidean distance metrics this analysis is mathematically equivalent to Principal Component Analysis.
K is the number of dimensions, n is the number of arrays, and λi is the ith eigenvalue. A value of around 0.8 (at which around 80% of the SD has been accounted for) has been proposed as a good cut-off for adequately representing the original data (e.g. ). Ten dimensions were therefore analysed. The justification for including this unusually large number of dimensions is discussed further below.
In addition, we analysed a set of 12 newly prepared Mutz-3 derived dendritic cell arrays (fig 9b) including a set of three additional unstimulated cultures (i.e. biological replicates). All twelve sets cocluster with the old Mutz-3 dendritic cells in dimensions 1 and 3 (i.e. away from monocyte derived DC in dimension 3). Experimental variation which may reflect batch differences between the first and second set of experiments is also detectable in higher dimensions (not shown).
In this study, we have developed a pipeline for processing and analysis of data obtained using the Agilent human 44 K expression array platform. We have assessed commonly adopted preprocessing computational procedures, tested reproducibility of the method and the determinants of variability. Interestingly, investigation of the true background signal intensity, suggested that there is low-level transcriptional activity across most of the genome even in highly differentiated macrophages and dendritic cells. Finally, we have illustrated the application of multidimensional scaling to interrogate complex relationships between individual expression profiles in a large data set.
We confirmed that logarithmic transformation of the raw data facilitates downstream data processing by significantly reducing the dependence of the SD on the signal intensity, and by improving approximation to a normal distribution for many probe signals. Log2 transformation did not completely standardise variance vis a vis intensity, and in situations where this becomes critical, additional more complex alternative transformations can easily be incorporated e.g. .
A large number of different approaches to data normalisation have been explored in order to improve the reproducibility of array data, particularly with the objective of comparing data generated from different experiments [8–12] . Many of these were designed for the Affymetrix platform (e.g. RMA, dCHIP, GCOS) and are not readily implemented for the Agilent platform . Nevertheless, we compared the Loess normalisation used in this study to the quantile normalisation which is the basis for data preprocessing in RMA and RMA based packages [3, 4]. Quantile normalisation and LOESS normalisation reduced the average standard error of the data similarly. However, as discussed in the original implementation of the RMA normalisation algorithm , quantile normalisation may "overnormalise" giving identical or very similar values at the extremes of the quantile range. This problem is less serious in dealing with data from the Affymetrix platform, which uses multiple probe sets per gene, and has inter probe variances often greater than the inter array variance. One of the most detailed published comparisons of alternative normalisation approaches did also include a baseline Loess normalisation procedure similar to the one used here . The study found relatively subtle differences between normalisation techniques, but did not report absolute levels of error, rendering comparison difficult.
The use of two colour ratio normalisation, which is possible with the Agilent platform but not Affymetrix, is in fact implemented in the proprietary normalisation supplied by Agilent. The advantage of this normalisation, over global array normalisations, is that it can theoretically correct for probe specific systematic errors. However, as shown in equation 1, ratios decrease variance only when correlation between channels is high, and noise in the "reference" channel is small. Previous studies have in fact suggested that the Agilent normalisation protocol is sub-optimal. In fact, as illustrated in Table 1, the actual decrease in variance observed by introducing ratio normalisation with the two colour array signals is very modest. This modest gain needs to be weighed against disadvantages of the ratio measure. Firstly, ratios behave counter intuitively in terms of biological processes at both extremes of the signal range, adopting increasingly large negative values which tend to infinity as the experimental signal approaches zero. Secondly, we detected an unexpected interaction between independent Cy3 and Cy5 signals at high signal intensities, although the mechanisms remains unclear. Although the effect is only seen for a small number of genes, it may nevertheless prejudice the outcome of experiments which use a reference sample in one channel. Overall, the use of the two colour format to run a reference array for each experimental sample is not supported by our study, a conclusion in line with previous studies .
In summary, the preprocessing steps implemented in this study significantly reduce the variation due to experimental, rather than biological variation. This value provides a good estimate of the overall SD of the signal attributable to the various sources of intrinsic "experimental error". Given the Gaussian nature of log signal variation for most probes, differences between experiments of one log2 unit should be detectable even with rather low sample sizes. However, future studies using spike-in controls, and including comparisons to q-PCR data will be required to validate the analysis given here, and to determine the true discriminatory power of the data. More precise modelling, using probe specific variances computed from the large set of reference arrays, or more sophisticated Bayesian inference may further improve the quality of the data.
The presence of a very large number of probes showing low level hybridisation prompted us to try and distinguish background from low level gene transcription. Hybridisation of the Arabidopsis samples established a realistic background cut-off to determine the boundary at which genes may be considered to be ON or OFF, and showed transcriptional activity for approximately 70% within human macrophage samples. This degree of transcriptional diversity suggests transcription is going on across the majority of the genome even in a single cell type, and is in accord with recent estimates of cell transcriptional complexity obtained from deep sequencing [15, 16], and also with the recent upward revision of estimates of the frequencies of transcriptional start sites within the human genome . This data cannot however address the question of whether the mRNA message are translated or even complete.
In general, the power of genome-wide arrays lies in its ability to define a global transcriptomic response, made up of complex patterns of interacting genes. In order to analyse the extent to which these patterns remained reproducible and robust in the face of experimental error and variation, we used MDS which is equivalent to principal component analysis for continuous Euclidean distance measurements and retains more of the information than conventional hierarchical clustering.
MDS analysis of the macrophage/dendritic cell data set (after Loess normalisation) confirmed that the key underlying structure of the data set is robust, and that sample relationships are conserved even when samples are collected at different times, and by different operators. For example, the distinction between macrophages and dendritic cells, between normal and leukaemic dendritic cells, and between resting and LPS stimulation were all very robust to underlying experimental variation. Of interest, the biologically meaningful variables such as those described above, usually segregated from the "artifactual" variables (such as when the experiments was carried out, or the array format) into a different dimension. Since the MDS axes are orthogonal this suggests that the biological and artifactual variables are often determined by non-overlapping sets of gene probes, and the identification of these genes is the subject of on-going analysis.
Sample collection. The methods for human cell sample collection, array hybridisation and sample collection have all been described previously [18, 19]. A summary of all the samples used is given in Additional File 1 Table S1.
The reference sample is the Stratagene Universal Human Reference reagent, which consists of a mixture of RNA from ten different human cell lines (#740000, Stratagene, Agilent Technologies, Stockport, UK).
Seeds of Arabidopsis thaliana (L.) Heynh. were obtained from the Nottingham Arabidopsis Stock Centre. Plants were grown on 0.8% agar with 0.25 × MS salts, adjusted to pH 5.6 with NaOH, under conditions described previously . At 19 days after sowing, non-senescent, fully-expanded rosette leaves were pooled from 8-22 plants and snap-frozen at -70°C. The experiment was triplicated in three sequential blocks. RNA was initially extracted from tissue samples using a modified TRIzol extraction method . Extracted total RNA was then purified using the 'RNA Cleanup' protocol for RNeasy columns with on-column DNase digestion to remove residual chromosomal DNA (Qiagen, Crawley, West Sussex, UK).
Labelled cRNA samples were generated from RNA samples using the Low RNA Input Fluorescent Linear Amplification Kit according to the manufacturer's instructions (Agilent Technologies, Santa Clara, CA, USA). cRNA was synthesised from the double-stranded cDNA using T7 RNA polymerase, incorporating Cyanine 5-labelled CTP fluorescent dyes (PerkinElemer Life and Analytical Sciences, Boston, MA, USA). Labelled cRNA samples were cleaned using the 'RNA Cleanup' protocol for RNeasy columns (Qiagen) performed at 4°C and eluted using two 30 μL volumes of nuclease-free H2O. Mean dye incorporation for labelled cRNA was 19.53 (± 1.99 SEM, n = 3) pmol dye μg-1 cRNA.
Hybridisation of the Agilent 4 × 44 K whole human genome cDNA microarrays according to manufacturer's instructions http://www.agilent.com. Array images were acquired with Agilent's dual-laser microarray scanner G2565BA (5 μ resolution) and signal data were collected with dedicated Agilent Feature Extraction software (v9.5.1).
The original scanner output files and the normalized microarray data (see below) have been submitted to the ArrayExpress database http://www.ebi.ac.uk/arrayexpress, accession number E-TABM-942). A summary of the samples is given in supplementary Table S1.
The R code for the extraction of raw data from scanner files, removal of duplicates, and Loess normalisation, is available (as function "AAProcess" in package agilp) from the R CRAN repository. http://cran.r-project.org/web/packages/.
The statistical analysis was mostly carried out in R. The functions Shapiro.test and ks.test were used to test for normality. Multidimensional scaling was carried out using the functions distanceMatrix (using Euclidean distance) from the package ClassDiscovery, part of the OOMPA suite of libraries http://bioinformatics.mdanderson.org/Software/OOMPA/, and the function cmdscale from the stats package.
This study was supported in part by a BBSRC/Unilever CASE Studentship awarded to JR. This work is part of Unilever's on-going effort to develop novel ways of delivering consumer safety. JT and MN were supported by a Wellcome Trust fellowship to MN. The project was also supported by a grant to BC from the NIHR UCLH/UCL Comprehensive Biomedical Research Centre. We are very grateful to Jane Galbraith (Dept. Statistics, UCL) for extensive help and advice on aspects of the statistical analysis.
- Shi L, et al.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24: 1151–1161. 10.1038/nbt1239View ArticlePubMedGoogle Scholar
- Agilent Whole Human Genome Expression Arrays2010. [http://www.chem.agilent.com/en-us/products/instruments/dnamicroarrays/wholehumangenomeoligomicroarraykit/pages/default.aspx]
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- McCall MN, Bolstad BM, Irizarry RA: Frozen robust multiarray analysis (fRMA). Biostatistics 2010, 11: 242–253. 10.1093/biostatistics/kxp059View ArticlePubMedPubMed CentralGoogle Scholar
- Everitt BS, Dunn J: Multidimensional scaling. In Applied multivariate data analysis. 2nd edition. Hodder Arnold, London, UK; 2001:93–122.View ArticleGoogle Scholar
- Rasaiyaah J, Noursadeghi M, Kellam P, Chain B: Transcriptional and functional defects of dendritic cells derived from the MUTZ-3 leukaemia line. Immunology 2009, 127: 429–441. 10.1111/j.1365-2567.2008.03018.xView ArticlePubMedPubMed CentralGoogle Scholar
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(Suppl 1):S105-S110.View ArticlePubMedGoogle Scholar
- Kroll TC, Wolfl S: Ranking: a closer look on globalisation methods for normalisation of gene expression arrays. Nucleic Acids Res 2002, 30: e50. 10.1093/nar/30.11.e50View ArticlePubMedPubMed CentralGoogle Scholar
- Dabney AR, Storey JD: Normalization of two-channel microarrays accounting for experimental design and intensity-dependent relationships. Genome Biol 2007, 8: R44. 10.1186/gb-2007-8-3-r44View ArticlePubMedPubMed CentralGoogle Scholar
- Dabney AR, Storey JD: A new approach to intensity-dependent normalization of two-channel microarrays. Biostatistics 2007, 8: 128–139. 10.1093/biostatistics/kxj038View ArticlePubMedGoogle Scholar
- Fan J, Niu Y: Selection and validation of normalization methods for c-DNA microarrays using within-array replications. Bioinformatics 2007, 23: 2391–2398. 10.1093/bioinformatics/btm361View ArticlePubMedGoogle Scholar
- Wang D, Zhang CH, Soares MB, Huang J: Systematic approaches for incorporating control spots and data quality information to improve normalization of cDNA microarray data. J Biopharm Stat 2007, 17: 415–431. 10.1080/10543400701199544View ArticlePubMedGoogle Scholar
- Patterson TA, Lobenhofer EK, Fulmer-Smentek SB, Collins PJ, Chu TM, Bao W, Fang H, Kawasaki ES, Hager J, Tikhonova IR, Walker SJ, Zhang L, Hurban P, de Longueville F, Fuscoe JC, Tong W, Shi L, Wolfinger RD: Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project. Nat Biotechnol 2006, 24: 1140–1150. 10.1038/nbt1242View ArticlePubMedGoogle Scholar
- Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, 8: 118–127. 10.1093/biostatistics/kxj037View ArticlePubMedGoogle Scholar
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008, 321: 956–960. 10.1126/science.1160342View ArticlePubMedGoogle Scholar
- 't Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, Boer JM, van Ommen GJ, den Dunnen JT: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res 2008, 36: e141. 10.1093/nar/gkn705View ArticlePubMedPubMed CentralGoogle Scholar
- Birney E, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447: 799–816. 10.1038/nature05874View ArticlePubMedGoogle Scholar
- Noursadeghi M, Tsang J, Miller RF, Straschewski S, Kellam P, Chain BM, Katz DR: Genome-wide innate immune responses in HIV-1-infected macrophages are preserved despite attenuation of the NF-kappa B activation pathway. J Immunol 2009, 182: 319–328.View ArticlePubMedPubMed CentralGoogle Scholar
- Rasaiyaah J, Noursadeghi M, Kellam P, Chain B: Transcriptional and functional defects of dendritic cells derived from the MUTZ-3 leukaemia line. Immunology 2009, 127: 429–441. 10.1111/j.1365-2567.2008.03018.xView ArticlePubMedPubMed CentralGoogle Scholar
- Hampton CR, Bowen HC, Broadley MR, Hammond JP, Mead A, Payne KA, Pritchard J, White PJ: Cesium toxicity in Arabidopsis. Plant Physiol 2004, 136: 3824–3837. 10.1104/pp.104.046672View ArticlePubMedPubMed CentralGoogle Scholar
- Hammond JP, Bowen HC, White PJ, Mills V, Pyke KA, Baker AJ, Whiting SN, May ST, Broadley MR: A comparison of the Thlaspi caerulescens and Thlaspi arvense shoot transcriptomes. New Phytol 2006, 170: 239–260. 10.1111/j.1469-8137.2006.01662.xView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.