Missing channels in two-colour microarray experiments: Combining single-channel and two-channel data
© Lynch et al; licensee BioMed Central Ltd. 2007
Received: 08 September 2006
Accepted: 25 January 2007
Published: 25 January 2007
There are mechanisms, notably ozone degradation, that can damage a single channel of two-channel microarray experiments. Resulting analyses therefore often choose between the unacceptable inclusion of poor quality data or the unpalatable exclusion of some (possibly a lot of) good quality data along with the bad. Two such approaches would be a single channel analysis using some of the data from all of the arrays, and an analysis of all of the data, but only from unaffected arrays. In this paper we examine a 'combined' approach to the analysis of such affected experiments that uses all of the unaffected data.
A simulation experiment shows that while a single channel analysis performs relatively well when the majority of arrays are affected, and excluding affected arrays performs relatively well when few arrays are affected (as would be expected in both cases), the combined approach out-performs both. There are benefits to actively estimating the key-parameter of the approach, but whether these compensate for the increased computational cost and complexity over just setting that parameter to take a fixed value is not clear. Inclusion of ozone-affected data results in poor performance, with a clear spatial effect in the damage being apparent.
There is no need to exclude unaffected data in order to remove those which are damaged. The combined approach discussed here is shown to out-perform more usual approaches, although it seems that if the damage is limited to very few arrays, or extends to very nearly all, then the benefits will be limited. In other circumstances though, large improvements in performance can be achieved by adopting such an approach.
Cyanine-5 (Cy5) and cyanine-3 (Cy3) are popular dyes in use for microarray experiments. The destructive effect of ozone on the Cy5 dye, in particular, has been discussed [1, 2], and a mechanism for the effect reported (see for example the poster Garner J, Howerton K, Schwalm J and Getts R "Development of a Protective Coating Designed to Preserve the Absorption and Emission Properties of Flourescent Dyes Used on Microarrays" currently available from ). The consequence of this phenomenon is that for two-channel microarray experiments, conducted without protection in a high-ozone environment, one channel of information may be partially or completely corrupted.
While this relationship may not have been accepted universally , the fact that arrays with a defective Cy5 channel exist is not disputed. The advice from a producer of microarray systems  is either to avoid high ozone levels when conducting experiments (passively or actively), or to use one of a number of Cy5 stabilizing solutions that have been shown to prevent a good deal of the degradation.
It is indeed ideal to avoid this problem, however it remains the case that some experiments have suffered and will suffer in this way. The task is then to conduct a valid and informative analysis. Options include discarding the arrays that have problems in their Cy5 data, or discarding the Cy5 data from all arrays and performing a Single Channel analysis. Further options are to ignore the problem (and presumably either display suitable caution in the interpretation or trust that the effect is regular enough that it can be normalized away), or to abandon the data entirely and repeat the experiment. One can imagine circumstances where all of these approaches might suffice, however one can also see that there are drawbacks to each.
In this paper we are motivated by a dataset where the Cy5 channel has been corrupted for a number of arrays. We comment initially on whether this may have been due to ozone, but focus on an approach to their analysis regardless of the cause of the problem (although with the assumption that this is a non-trivial disruption to the Cy5 channel). We further show that this approach produces sensible and consistent results, without incurring the problems of the approaches described above. These results will be substantiated by a large simulation experiment that we also present.
Characteristics of the study samples
Providing ≥2 Cancer Arrays
Providing ≥2 Normal Arrays
Providing Normal and Cancer Arrays
In addition, for some specific comparisons, results are presented of a Single Channel analysis applied only to the Cy3 channel of arrays that have been affected by ozone. We would not suggest that this would ever be a sensible approach, but the approach provides a limit to the Combined(kfix = X) family of distributions as will be clarified at the appropriate juncture.
We focus on the task of distinguishing cancerous samples from normal (non-cancerous) samples through modelling of gene-expression levels. This is a well-understood problem, and allows for a comparison of the performance of the methods without distraction. We readily acknowledge that this may not be the most important biological comparison, but reaffirm that the purpose of this exercise is to compare the methods of analysis.
Performance of the five methods in simulation
Method of Analysis
Percentage arrays damaged
Combined (kfix = 0.5)
Combined (kfix = 0.25)
At all levels of damage, the Combined(kest) method presented within this paper performed best for detecting differentially expressed genes, while the Combined(kfix) approaches were generally close seconds. As an example of the advantage of the active estimation of k, when 80% of the arrays are damaged, the Combined(kest) approach correctly identifies 872 differentially expressed genes that are missed by the Combined(kfix = 0.5) approach, while only 468 are discordant in the other manner (See Additional file 1). With these values, McNemar's test returns a p-value of less than 0.0001. At high levels of damage, the Single Channel analysis approaches this level of performance, whilst at low levels the Unaffected Data analysis performs relatively well.
The Inclusive analysis surprisingly outperforms the Single Channel analysis when only 20% of arrays are damaged and outperforms the Unaffected data analysis when 80% of the arrays are damaged, but loses out in every other comparison (McNemar's test: p < 0.0001 in each case). In both of the cases when it shows superiority it has available 80 channels of undamaged information that the inferior method does not. That this is enough to overcome the 20 damaged arrays included in the victorious comparison with the Single Channel method is understandable, that it overcomes the 80 damaged arrays included in the victorious comparison with the Unaffected data approach is less so, and may reflect our failure to simulate spatial structure in the ozone damage, allowing the damage to average out over such a number of arrays. The Unaffected analysis outperforms the Single Channel approach at damage levels of 20% and 40% but is outperformed at the 60% and 80% levels (McNemar's test: p < 0.0001 in each case). The Combined(kfix = 0.5) approach outperforms the Combined(kfix = 0.25) approach at all levels of damage.
When the magnitude of the simulated effect size (i.e. the modelled log-ratio) is greater than 0.5, the stronger methods consistently correctly return 95% of the differentially expressed genes. The greatest interest for differentiation between methods then lies in those genes where the effect was simulated to be less than 0.5. Here, the Single Channel and Inclusive approaches perform far more poorly, with the Unaffected data approach doing very well at low damage levels, but not performing to the same standard at other damage levels.
Full details of the pairwise comparisons are presented in Additional file 1.
Application to example data
The rankings of the top genes exhibiting differential expression between normal and cancer samples are listed in Additional file 2. Broadly speaking, the same genes are appearing towards the top of all of the lists. There are though distinctions to be made. The table was ordered by the results from the Combined(kest) analysis, and taking this as a reference one can see that the Combined(kfix) and Single Channel analyses perform most similarly.
Numerical comparison of analysis methods applied to the real data
Comb. (k est )
Comb. (kfix = 0.5)
Comb. (kfix = 0.25)
Combined(kfix = 0.5)
Combined(kfix = 0.25)
SC (damaged arrays)
It can be seen that there is no obvious tendency for the arrays to divide into ozone-affected and ozone-unaffected clusters. Some low-level grouping of the same ozone-effect type will occur by chance of course, and some further will be driven by pairs that are homogeneous and originate from the same sample (not indicated). So there really is no evidence of a noticeable effect here. Moreover, the replicates cluster together well, demonstrating that the loss of the Cy5 channel data is being overcome in a satisfactory manner.
Estimation of k
Looking at Figure 5. If we note that the Unaffected data analysis is essentially a Combined(kfix = 0) approach, and the Single Channel analysis applied only to the damaged arrays is essentially a Combined(kfix = ∞) approach then we can see that, for example when the reference is Unaffected, the order of k = 0, k = 0.25, k = 0.5, k = ∞ is preserved in the ordering of the lines. The Combined(kest) approach slots in a little after the Combined(kfix = 0.5) method in this ordering. This might be taken as weak evidence in favour of a Combined(kfix > 0.5) approach, but observation of Figure 8 makes such a choice hard to countenance.
The values of log(k) produced from the simulation experiment are also presented in Figure 8. As might be anticipated, the mixing of components of the data to generate new datasets leads to some homogenization of the estimates of k. This is apparent despite the fact that the simulated datasets are smaller than the motivating dataset meaning that the individual estimates of k might be expected to be less precise and thus lead to a heavier-tailed distribution. Nevertheless, the distribution of log(k) estimated from the simulated data is recognisably of the form demonstrated by the example data, albeit that the two modes have somewhat converged.
There is a distinct spatial aspect to the damaged data. This is clear both from the initial presentation of the data (Figure 1) and the results from the Inclusive analysis. The simulated data did not include a spatial effect in the damage inflicted. Nevertheless, they (along with the analysis of our example data), demonstrated that the inclusion of those damaged data would be detrimental to the analysis.
It is no surprise either that the Single Channel analysis performs well regardless of the percentage of damaged arrays in the experiment as it does not use the channel that is being damaged. The other methods naturally improve as more arrays survive intact, because they have more reliable information available. Therefore it is inevitable that there will be a proportion of missing channels above which the Single Channel approach will do better than the Unaffected Data analysis and below which it will perform worse. That it appears to be at about the 50% mark in our simulations is a little more surprising. It is worthy of note that in those cases when the Single Channel approach does outperform the Unaffected Data analysis, the margin is somewhat greater than when the reverse is true.
The Unaffected Data analysis could be regarded as a Combined analysis with k set to zero. From this viewpoint, it is not surprising that the Combined(kfix = 0.25) approach lies between the Unaffected Data analysis and the Combined(kfix = 0.5) approaches in the results that it produces. If the Unaffected Data analysis has available about half of the arrays (as it does in our sample), and the variance of the log-ratio is approximately half of the variance of the log-expression levels used in the Single Channel analysis, then we might anticipate that the precision of the estimates from the two approaches (Unaffected Data and Single Channel) would be approximately equal to each other.
That the Single Channel and Unaffected Data approaches therefore differ so markedly in their results for the example data, despite our anticipating that their overall performances might be broadly similar, is evidence that a unifying approach such as the Combined analysis really does have something to offer. The results of the simulation experiment suggest that the increase in information that the Combined approach has available more than makes up for the extra complexity of the model. The successful pairing of replicates in the clustering shown in Figure 7 gives additional evidence of the value of this approach. In the simulation experiment, all of the methods are guaranteed high specificity due to the ratio of differentially expressed to non-differentially expressed genes. With 18,575 out of 22,575 simulated genes being non-differentially expressed, the formula for specificity will be at worst (18,575 - x)/18,575 where x is the number of genes nominated as being differentially expressed. So we would have to take a list of 1,858 genes as being differentially expressed before even the least specific of methods could drop below 90% specificity.
Complementary to this is the problem that differentially expressed genes differ in their level of differential expression. There will be some genes that, due to their large associated effect, even the least discriminatory methods of analysis would struggle to miss. A method of analysis may then appear to be competent in the ROC curve merely by harvesting the low-hanging fruit. Results that are so self-evident though are unlikely to be of great interest. It is in the detection of genes with a small differential expression that the improvements offered by the Combined(kest) approach are most apparent.
Our motivating example, and thus our simulated data, originates from an experiment where the affected channel was a common reference. This is clearly not always the case, and we can speculate on the effect that relaxing this condition might have. We will assume that we still have a situation where there are three groups (normal, cancer and reference), and we are only primarily interested in a comparison between the normal and cancer groups. In such circumstances, a Single Channel analysis will not make use of the reference data in the comparison of interest, save for using it to help model the residual variances.
It may be desirable to ignore entirely arrays providing only reference data in a Single Channel analysis, unless one is confident that the minimal extra information provided will outweigh the strong modelling assumptions being made both in the Single Channel normalization and the linear modelling itself when those arrays are included. A Combined analysis method on the other hand will actively use arrays providing information only on the reference group, provided there exist undamaged arrays within which a reference-to-normal or reference-to-cancer comparison is being made. Additionally of course, the Combined analysis is able to make use of normal and cancer samples on the unaffected channels of damaged arrays. Thus we would anticipate that the current experimental design is one where the Single Channel would perform at its best relative to the other methods, and that in other designs the advantages offered by a Combined approach would be even more convincing.
The values of 0.5 and 0.25 were obtained from experimental observations of two-channel cDNA microarrays produced at the Peter MacCallum Cancer Institute and the University of California Berkeley respectively. That the Agilent arrays appear to produce values of k that are not dissimilar is evidence towards some commonality across platforms. However, that k = 0.5 seems to perform better than k = 0.25 implies that the Single Channels contain more information relative to the log-ratios than anticipated from the cDNA microarrays. This is in keeping with the Agilent technologies' reputation for being particularly robust, and suggests that in fact there may be scope for a platform-dependent improved Bayesian shrinkage step in the active estimation of k.
Ignoring the issue of damage to one of the channels results in poor results if such damage exists. Moreover the more damaged arrays there are, the worse the performance of such an approach. The problems are not only due to an increased variance resulting from the loss of information (as seen in the simulation), but also a spatial bias that is evident in the results. A Single Channel approach on the other hand may perform very well in suitable conditions (damage to a majority of arrays, and the lost channel not containing direct information about the contrast of interest), but is outperformed when only a small number of arrays are damaged, by an approach based on simply discarding those arrays. A complementary result is that an analysis of Unaffected arrays performs better the more undamaged arrays there are, but is less reliable than a Single Channel analysis if there are few undamaged arrays.
A Combined analysis has been shown here to offer improvements over the other analyses considered. It could be regarded as producing results that are a compromise between those from a Single Channel analysis and those from an Unaffected Data analysis, and in objective tests it outperforms both. The additional modelling assumptions and complexity are handsomely compensated for by the additional information obtained when including all of the available data. Indeed, if one is using a prescribed value of k in the analysis, then many of the rewards are obtained for a minimum of additional computational effort.
A 'Combined' approach to the linear modelling
Our approach takes the common linear modelling approach  for log-ratios and adjusts it so that instead of estimating terms representing relative levels of expression for the different groups we estimate terms representing absolute levels of expression. For example, instead of estimating the level for cancer sample relative to a reference, C R , and the level for a normal sample relative to a reference, N R , as would normally be the case,
we instead formulate the equation in terms of absolute values (since a log-ratio is simply a linear combination of two log-expression levels) corresponding to cancer, normal and reference samples (C, N, and R respectively).
The design matrix at this point has become singular. However this formulation allows us to insert additional rows corresponding to arrays providing only a single channel of data and thus a single log-expression value:
The design matrix is no longer singular, and the only thing that prevents the fitting of a simple linear model is that the residual errors of the log-expression measurements have a different form to those of the log-ratios. We assume that the only difference is a scaling of the variance, and that for any particular gene, this is constant across the arrays. Thus we can write the equation
This model could be fitted as it is, however we prefer to first estimate k and then fit the model for three reasons. First, the data could by chance lead to extreme estimates of k, and in a two stage approach we have the opportunity to adjust the value to prevent this. Second, this approach allows for easier investigation of the importance of the estimate of k. Finally, this approach allows us to link in to existing packages such as Limma  more easily.
Given we are taking this approach, the model can be rewritten once more to give
that is of a form more readily integrable into existing microarray analysis packages.
The interpretation of k is as the ratio of the variance of the residuals arising from the arrays providing two channels and the variance of the residuals arising from the arrays providing only one channel. There are a number of approaches to obtaining an initial estimate of k, and an equally diverse range of refinements that can be applied. Our experience is that any sensible combination will lead to broadly the same results. Our approach is to take the simplest estimate of k by fitting a model with k set to be equal to one and estimating the variances of firstly those residuals associated with log-ratios and secondly those associated with log-intensities. Once 22,575 estimates of k are obtained (one for each spot on the array), then we use a basic shrinkage method to move those estimates inwards towards their common average. We are not employing the full empirical Bayesian shrinkage method , but have noticed no differences due to this, in part because the large sample sizes seen here mean that the prior distribution is not very influential in any case. Full details of the method we have used for shrinking the estimates of k can be found in Additional file 3.
The values of 0.25 and 0.5 were considered for use as a fixed estimate of k based on observations previously made by one of the authors  (available from ). There it is observed that in one experiment, contrasts formed from single channels have a standard deviation of the order of twice that for within-array log-ratios, while in another experiment the ratio was closer to three.
Assuming that the observations forming the single channel contrasts were independent and individually had variance v, it is easy to see that the variance of the single channel contrast would have been 2v, while from our definition of k the variance of the log-ratios would have been kv. From the experimental observations we have = 2 or = 3 resulting in estimates for k of 0.5 and 0.22 which for simplicity we approximate as 0.5 and 0.25.
Normalization for the combined method
We are aiming for an object that contains M-values (log-ratios) from the unaffected arrays, and G-values (Cy3 log-intensities) from the affected arrays. The basic policy is to create an idealized set of G-values from the good arrays and normalize the bad arrays to these. We begin by performing a loess within-array normalization, although other preferences regarding the choice of within-arrays normalization can easily be accommodated. Within the good arrays, we then perform a 'scale' normalization between arrays and back calculate log-intensities for the green channel. A quantile normalization of these green log-intensities provides us our vector of idealized values. The green log-intensities are then normalized to these values by their within-array ranks.
For the production of heatmaps and clustering, this object is clearly troublesome since the arrays providing M-values will naturally cluster together apart from those supplying G-values. We can address this by back-calculating R values from the normalized unaffected arrays. Since in our design each represents the same reference sample, we take the mean red log-intensity for each spot from the unaffected arrays to get a vector of R values. From this vector we subtract the green log-intensities of the ozone-affected arrays to obtain M-values.
There is still a danger at this point of separating out the affected and unaffected arrays if performing a cluster analysis, so having obtained M-values for all arrays, a final normalization is required. For the unaffected arrays, a quantile normalization is performed, generating a vector of M-values that are seen on every array. The M-values on the ozone-affected arrays are then normalized to this vector by rank order (the highest M-value being replaced by the highest value in the vector, and so on). This leaves an object that can be successfully used for generating heatmaps or performing cluster analyses.
To compare the methods available, we generated datasets using an approach inspired by that of Lonnstedt . A real dataset of 55 microarrays, relating to homogenous samples, containing data for 21,073 spots was used in the generation of a dataset of 100 microarrays containing data for 22,575 spots. These 100 were divided into two groups, one of size 70 and the other of size 30. 4,000 of the simulated spots were differentially expressed between the two groups.
From the real data we have a matrix of log-ratios M ag and a corresponding matrix of log-intensities for the red channel R ag where 1 ≤ a ≤ 55, and 1 ≤ g ≤ 21,073. From this we create a vector of the average log-ratios associated with the genes, avM, where avM g = ∑ a M ag , and a matrix, Δ, of the differences between log-ratios and their gene average, where Δ ag = M ag - avM g .
A new matrix of log-ratios, M', is then built row-by-row, where = avMx(h)+ Δg(b),f(x(h))+ ε. Here x() associates each row of M' with an element of avM at random, g() represents a random sample (with replacement) of 100 values from 1 to 55, and f() returns a row of Δ that is in some sense near the row associated with the element of avM that was returned by x(). ε adds a small amount of noise into the mix to avoid identical elements appearing. Here 1 ≤ b ≤ 100, and 1 ≤ h ≤ 22,575.
A new matrix of corresponding log-intensities, R', for the red channel is then constructed as = Rg(b),f(x(h)), where g,f, and x are defined as above. For 4000 of the simulated log-ratios, in 30 of the simulated arrays, some differential expression was introduced by adding an extra term into the formula for . After examination of the results of models fitted to the real data, the differential expression was simulated as being from a mixture of three normal distributions, with means at -0.9, 0, and 0.9, variances of 0.25, 1, and 0.25 and weights of 0.4375, 0.125, and 0.4375 respectively.
Green log intensities were then back-calculated from the red log-intensities and the log-ratios. The green and red values were used as the basis of the analysis. Where appropriate, the red log-intensities were damaged as follows: From the real data, matrices of log-intensities, RD and GD, from damaged microarrays and RU and GU from undamaged arrays were constructed using only those microarrays that compared a cancer sample to the reference and with some undamaged microarrays omitted so as to create matrices of equal size.
A matrix of predicted red log-intensities (for the case had there been no damage) was produced for the damaged arrays as PD ij = RU ij GD ij /GU ij and from this, an ozone damage matrix, OD, was produced such that OD ij = RD ij /PD ij . Damage to the red intensities was then simulated by multiplying log-intensities by values sampled from OD.
The simulation experiment
One hundred datasets were simulated in the specified manner, consisting of four groups of 25 datasets, each group with the red channel corrupted for a different proportion of the arrays. The damage levels simulated were such that either 20, 40, 60 or 80 of the 100 arrays in a dataset would have their red channel corrupted. Six analyses were then performed on each simulated dataset. These were an analysis of only the single green channel, an analysis of all of the data (ignoring the problem entirely), an analysis of those arrays that did not experience problems with the red channel, and three analyses as presented in this paper (one actively estimating k as described above, one with k prespecified to be 0.25 and another with the value set at 0.5).
"Single channel analysis": Quantile normalization of the 'green' channel followed by linear modelling.
"Inclusive analysis": For all of the data, regardless of the damage to the 'red' channel, construction of log-ratios, within-array loess normalization to the average intensities, linear modelling of the log-ratios.
"Unaffected data" analysis: As for the 'Inclusive analysis', but only for those arrays with unaffected 'red' channels.
The "Combined(kest)" approach: as presented in this article, using k estimated from the data.
The "Combined(kfix)" approach: as presented in this article, but with k set equal to 0.25.
The "Combined(kfix)" approach: as presented in this article, but with k set equal to 0.5.
Analysing data from the simulation experiment
Methods were primarily assessed by their sensitivity and specificity for all lengths of genelist. Additionally, since 2,000 genes were known to be differentially expressed, we also looked specifically at gene-lists of length 2,000. Basic comparisons were via percentages of differentially expressed genes correctly returned, which were then broken down both by proportion of missing channels and size of differential effect. A more powerful comparison of any two methods is possible by a McNemar-type test (actually the exact binomial equivalent). All such paired comparisons are conducted in a pairwise manner. No contrary results (i.e. A > B, B > C, C > A) were seen.
We thank Ian Mills for useful preliminary discussions, Gordon Smyth for much useful advice, John Marioni for assistance in formatting the figures and Simon Tavaré for comments on the manuscript. We thank the referees for a number of helpful suggestions for improving the exposition of the paper. NPT and AGL are supported by a CRUK program grant to Simon Tavaré. The data were produced using funding from The British Urological Foundation, The Shackman Charitable Trust and the Addenbrooke's Hospital Charities Trust.
- Fare TL, Coffey EM, Dai HY, He YDD, Kessler DA, Kilian KA, Koch JE, LeProust E, Marton MJ, Meyer MR, Stoughton RB, Tokiwa GY, Wang YQ: Effects of atmospheric ozone on microarray data quality. Analytical Chemistry 2003, 75(17):4672–4675.View ArticlePubMedGoogle Scholar
- Gershon D: Microarrays go mainstream. Nature Methods 2004, 1(3):263–270.View ArticleGoogle Scholar
- Genisphere Inc. Documentation[http://www.genisphere.com/pdf/DyeSaverPoster.pdf]
- Stability studies of dyes in microarray applications Tech Rep 63–0052–39, Amersham Biosciences 2003.Google Scholar
- Improving microarray results by preventing ozone-mediated flourescent signal degradation Tech Rep 5989–0875EN, Agilent Technologies, Inc 2005.Google Scholar
- Natalie Thorne's Home Page[http://www.damtp.cam.ac.uk/user/npt22/index.html]
- Smyth G: Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. New York: Springer; 2005:397–420.View ArticleGoogle Scholar
- Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004., 3:Google Scholar
- Thorne N: Single-channel normalisation and analysis of two-colour cDNA microarray data. PhD thesis. University of Melbourne; 2004.Google Scholar
- Lonnnstedt I, Rimini R, Nilsson P: Empirical Bayes microarray ANOVA and grouping cell lines by equal expression levels. Statistical Applications in Genetics and Molecular Biology 2005., 4:Google Scholar
- R Development Core Team: R: A language and environment for statistical computing.R Foundation for Statistical Computing, Vienna, Austria; 2005. [http://www.R-project.org] [ISBN 3-900051-07-0]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.