Accounting for Uncertainty in Probeset Summarisation
The first step in a typical analysis is to load in data from Affymetrix CEL files, using the ReadAffy function from the affy package [9]. puma makes extensive use of phenotype data, which maps arrays to the condition or conditions of the biological samples from which the RNA hybridised to the array was extracted. It is recommended that this phenotype information is supplied at the time the CEL files are loaded. If the phenotype information is stored in the AffyBatch object in this way, it will then be made available for all further analyses. Details of how to include such phenotype information are included in the puma User Guide [8].
The recommended summarisation method to use within puma is multi-mgMOS [2]. The following code shows how to use this method on an example data set included in the pumadata package. We also create a summarisation using RMA [10] for comparison.
> library(pumadata)
> data(affybatch.estrogen)
> eset_estrogen_mmgmos <- mmgmos(affybatch.estrogen)
> eset_estrogen_rma <- rma(affybatch.estrogen)
mmgmos takes significantly longer to run than rma. The above commands took 225 and 4 seconds respectively to complete on a 2.93 GHz Intel Core 2 Duo MacBook Pro. multi-mgMOS performs well on the affycomp benchmark [11, 12], giving the best score for 3 of the 14 measures (5. Signal detect slope, 6. low.slope and 9. Obs-intended-fc) on the HGU95 spike-in data set, and 1 of the 14 measures (10. Obs-(low)int-fc slope) on the HGU133 data set (data correct as of June 1, 2009).
Propagating Uncertainty in Principal Component Analysis
A useful first step in any microarray analysis is to look for gross differences between arrays. This can give an early indication of whether arrays are grouping according to the different factors being tested. This can also help to identify outlying arrays, which might indicate problems, and might lead an analyst to remove some arrays from further analysis. Principal components analysis (PCA) is often used for determining such gross differences. puma has a variant of PCA called Propagating Uncertainty in Microarray Analysis Principal Components Analysis (pumaPCA) which can make use of the uncertainty in the expression levels determined by multi-mgMOS. The following code shows what samples have been hyrbridised to each array, and then runs both pumaPCA and standard PCA (using prcomp) on the results obtained from the summarisation steps in the previous section. Following this, the 8 arrays used are plotted on the first two principal components using each method.
> pData(eset_estrogen_mmgmos)
estrogen time.h
low10-1.cel absent 10
low10-2.cel absent 10
high10-1.cel present 10
high10-2.cel present 10
low48-1.cel absent 48
low48-2.cel absent 48
high48-1.cel present 48
high48-2.cel present 48
> pumapca_estrogen <- pumaPCA(eset_estrogen_mmgmos)
> pca_estrogen <- prcomp(t(exprs(eset_estrogen_rma)))
> par(mfrow = c(1, 2))
> plot(pumapca_estrogen, legend1pos = "right", legend2pos = "top",
+ main = "pumaPCA")
> plot(pca_estrogen$x, xlab = "Component 1", ylab = "Component 2",
+ pch = unclass(as.factor(pData(eset_estrogen_rma)[,
+ 1])), col = unclass(as.factor(pData(eset_estrogen_rma)[,
+ 2])), main = "Standard PCA")
pumaPCA is much more computationally demanding than standard PCA. The above pumaPCA and prcomp calls took 31 and 0.03 seconds respectively to complete on a 2.93 GHz Intel Core 2 Duo MacBook Pro. We can see from the phenotype data that this experiment has 2 factors (estrogen and time.h), each of which has two levels (absent vs present, and 10 vs 48), hence this is a 2 × 2 factorial experiment. For each combination of levels we have two replicates, making a total of 2 × 2 × 2 = 8 arrays. It can be seen from Figure 1 that the first component appears to be separating the arrays by time, whereas the second component appears to be separating the arrays by presence or absence of estrogen. Note that grouping of the replicates is much tighter with multi-mgMOS/pumaPCA. With RMA/PCA, one of the absent.48 arrays appears to be closer to one of the absent.10 arrays than to the other absent.48 array. This is not the case with multi-mgMOS/pumaPCA. We have seen similar patterns in other experiments (data not shown).
Identifying differentially expressed genes
There are many different methods available for identifying differentially expressed (DE) genes. puma incorporates the Probability of Positive Log Ratio (PPLR) method [4]. The PPLR method can make use of the information about uncertainty in expression levels provided by multi-mgMOS. This proceeds in two stages. Firstly, the expression level information from the different replicates of each condition is combined using the function pumaComb to give a single expression level (and standard error of this expression level) for each condition. Following this, differentially expressed genes are determined using the function pumaDE. The following code determines DE genes from the estrogen data using multi-mgMOS/PPLR, and also, for comparison purposes, using RMA/limma.
> eset_estrogen_comb <- pumaComb(eset_estrogen_mmgmos)
> pumaDERes <- pumaDE(eset_estrogen_comb)
> limmaRes <- calculateLimma(eset_estrogen_rma)
Note that running the pumaComb command is typically the most time-consuming step in a typical puma analysis. As an example, the above command took 78 minutes to run on a 2.93 GHz Intel Core 2 Duo MacBook Pro. The computation time of this step can be decreased significantly when computed in parallel. Figure 2 shows typical run times when using different numbers of compute nodes on a Beowulf cluster. The pumaDE and calculateLimma commands each took less than a second to run.
Because this is a 2 × 2 factorial experiment, there are a number of contrasts that could potentially be of interest. puma will automatically calculate contrasts which are likely to be of interest for the particular design of a data set. For example, the following command shows which contrasts puma will calculate for this data set.
> colnames(statistic(pumaDERes))
[1] "present.10_vs_absent.10"
[2]"absent.48_vs_absent.10"
[3] "present.48_vs_present.10"
[4]"present.48_vs_absent.48"
[5] "estrogen_absent_vs_present"
[6] "time.h_10_vs_48"
[7] "Int__estrogen_absent.present_vs_time.h_10.48"
Here we can see that there are seven contrasts of potential interest. The first four are simple comparisons of two conditions. The next two are comparisons between the two levels of one of the factors. These are often referred to as "main effects". The final contrast is known as an "interaction effect". In more simple cases, where there are just two conditions, puma will create just one contrast.
Suppose we are particularly interested in the interaction term. We saw above that this was the seventh contrast identified by puma. The following commands will identify the gene deemed to be most likely to be differentially expressed due to the interaction term by the RMA/limma approach. We then plot the expression levels of this gene in the four conditions as determined by RMA and multi-mgMOS.
> topLimmaIntGene <- topGenes(limmaRes, contrast = 7)
> par(mfrow = c(1, 2))
> plotErrorBars(eset_estrogen_rma, topLimmaIntGene)
> plotErrorBars(eset_estrogen_mmgmos, topLimmaIntGene)
The gene shown in Figure 3 would appear to be a good candidate for a DE gene. There seems to be an increase in the expression of this gene due to the combination of the estrogen = absent and time = 48 conditions. The within condition variance (i.e. between replicates) appears to be low, so it would seem that the effect we are seeing is real. The plot of Figure 4 tells a somewhat different story. Again, we see that the expected expression level for the absent:48 condition is higher than for other conditions. Also, we again see that the within condition variance of expected expression level is low (the two replicates within each condition have roughly the same value). However, we can now see that we actually have very little confidence in the expression level estimates (the error bars are large), particularly for the time = 10 arrays. Indeed the error bars of absent:10 and present:10 both overlap with those of absent:48, indicating that the effect previously seen might actually be an artifact.
The following code determines and plots the gene most likely to be differentially expressed due to the interaction term using multi-mgMOS and pumaDE. This analysis was not possible using previous implementations of multi-mgMOS and PPLR, as the PPLR method was only able to determine differential expression between two levels of a single condition.
> toppumaDEIntGene <- topGenes(pumaDERes, contrast = 7)
> plotErrorBars(eset_estrogen_mmgmos, toppumaDEIntGene)
Figure 5 shows the gene determined by multi-mgMOS/PPLR to be most likely to be differentially expressed due to the interaction term. There appears to be lower expression of this gene due to the combination of the estrogen = absent and time = 48 conditions. Unlike with the gene shown in the plot of Figure 4 there is no overlap in the error bars between this condition, and the other conditions. Hence, this would appear to be a better candidate for a gene differentially expressed due to the interaction term. The combination of multi-mgMOS and PPLR (as implemented in the functions mmgmos and pumaComb/pumaDE) gave the strongest performance amongst 42 combinations of summarisation and DE detection methods in version 1.1 of the AffyDEComp benchmark [13].
Clustering with pumaClust
The following code will identify seven clusters from the output of mmgmos:
> pumaClust_estrogen <- pumaClust(eset_estrogen_mmgmos,
+ clusters = 7)
Clustering is performing .....................................................................
Done.
The result of this is a list with different components such as the cluster each probeset is assigned to and cluster centers. The following code will identify the number of probesets in each cluster, the cluster centers, and will write out a csv file with probeset to cluster mappings:
> summary(as.factor(pumaClust_estrogen$cluster))
> matplot(t(pumaClust_estrogen$centers))
> write.csv(pumaClust_estrogen$cluster, file = "pumaClust_clusters.csv")
Examples of improved performance on real and simulated data sets of PUMA-CLUST when compared with a standard Gaussian mixture model (MCLUST) are given in [5]
Analysis using remapped CDFs
There is increasing awareness that the original probe-to-probeset mappings provided by Affymetrix are unreliable for various reasons. Various groups have developed alternative probe-to-probeset mappings, or "remapped CDFs", and many of these are available either as Bioconductor annotation packages, or as easily downloadable cdf packages. One of the issues with using remapped CDFs is that many probesets in the remapped data have very few probes. This makes reliable estimation of the expression level of such probesets even more problematic than with the original mappings. Because of this, we believe that even greater attention should be given to the uncertainty in expression level measurements when using remapped CDFs than when using the original mappings. In the puma User Guide [8], we give an example of using a remapped CDF package created using AffyProbeMiner [1]. We show that, as with the standard Affymetrix annotation, we can improve results of both PCA and DE detection using puma methods on the remapped data.
Application beyond Affymetrix microarray data
Although the methods within puma were originally designed for use with Affymetrix microarray expression data, there is considerable scope for application beyond this domain. This is particularly so for the functions pumaComb, pumaDE, pumaPCA and pumaClust. pumaComb and pumaDE can be used in situations where the probability of a difference between means is required from data which has associated standard errors. One directly related application is the analysis of Illumina BeadArray data. Rather than using multi-mgMOS to determine standard errors of expression levels as is recommended with Affymetrix data, the empirical standard errors output by Illumina's BeadStudio software, or the Bioconductor package beadarray [14] can be used directly with pumaComb and pumaDE to determine differentially expressed genes. More generally, pumaComb and pumaDE can be used as an alternative to a t-test to determine probabilities of differences between means of data from different classes, where those data have both point estimates and standard errors associated with those estimates. Similarly, pumaPCA and pumaClust can be applied more generally as alternatives to methods such as standard PCA and standard clustering algorithms respectively.