- Research article
- Open Access
Unsupervised assessment of microarray data quality using a Gaussian mixture model
- Brian E Howard1,
- Beate Sick2 and
- Steffen Heber1
https://doi.org/10.1186/1471-2105-10-191
© Howard et al; licensee BioMed Central Ltd. 2009
- Received: 02 December 2008
- Accepted: 22 June 2009
- Published: 22 June 2009
Abstract
Background
Quality assessment of microarray data is an important and often challenging aspect of gene expression analysis. This task frequently involves the examination of a variety of summary statistics and diagnostic plots. The interpretation of these diagnostics is often subjective, and generally requires careful expert scrutiny.
Results
We show how an unsupervised classification technique based on the Expectation-Maximization (EM) algorithm and the naïve Bayes model can be used to automate microarray quality assessment. The method is flexible and can be easily adapted to accommodate alternate quality statistics and platforms. We evaluate our approach using Affymetrix 3' gene expression and exon arrays and compare the performance of this method to a similar supervised approach.
Conclusion
This research illustrates the efficacy of an unsupervised classification approach for the purpose of automated microarray data quality assessment. Since our approach requires only unannotated training data, it is easy to customize and to keep up-to-date as technology evolves. In contrast to other "black box" classification systems, this method also allows for intuitive explanations.
Keywords
- Gaussian Mixture Model
- Exon Array
- Unlabeled Instance
- Expert Annotation
- Affymetrix Exon Array
Background
Recently, the MicroArray Quality Control (MAQC) consortium found that most microarray platforms will generate reproducible data when used correctly by experienced researchers [1]. Despite this positive result, it has been suggested that 20% or more of the data available in public microarray data repositories may be of questionable quality [2]. For this reason, discriminating between high and low quality microarray data is of the highest importance, and several recent publications have dealt with this problem; detailed reviews are provided by Wilkes et al. [3] and Eads et al. [4].
Several approaches have emphasized the importance of measuring, either directly or indirectly, the integrity of the RNA samples used in the experiment (e.g. [5–7]). Other research has focused on spatial artifacts: problems that typically arise during hybridization due to bubbling, scratches and edge effects [8, 9].
In the case of Affymetrix GeneChips, which we will use to demonstrate our method, there are standard benchmark tests provided by the manufacturer [10]. A standard complementary approach is to use the R statistical software, along with the BioConductor [11] "affy" [12] and "affyPLM" [13] packages, to produce a series of diagnostic plots for the assessment of GeneChip quality (see additional file 1: Fig S3, S4). A review of the quality control features available in BioConductor can be found in [14], and a variety of software packages are now available to assist in the automation of this process [15–19].
In general, the goal of these approaches is to identify chips that are outliers – either in relation to other chips in the same experiment or the entire theoretical population of similar chips. Often, it is assumed that a rational decision regarding data quality is made only after considering several quasi-orthogonal dimensions of quality. Chips are typically rejected only after a preponderance of the evidence indicates poor quality; a slightly unusual score on a single metric is frequently ignored, while a number of moderately or highly unusual scores on a variety of quality metrics is often grounds for exclusion of a particular chip from further analysis. However, there are no universal, robust thresholds available for the identification of outliers according to the various quality variables. Instead, decisions are necessarily made using historical data, either implicitly or explicitly.
Therefore, recent efforts have focused on providing a "holistic", accurate, and automatic interpretation of diagnostic plots and quality metrics. Burgoon et al. [20] describe a custom, in-house protocol for assessing data quality of two-color spotted cDNA arrays. The authors advocate a "Quality Assurance Plan" that incorporates quality control at every level of the experimental procedure.
Another example is the RACE system [15, 16]. This system utilizes various statistics extracted from the BioConductor diagnostic plots, along with a random forest classifier, to automatically identify low quality data. However, like the quality assurance protocol described by Burgoon et al., the RACE system relies on a large expert-annotated data set. For this reason, it is difficult to keep the system up-to-date in the face of rapidly changing technology, with new chip types continually being introduced into the market. A further challenge is to adapt such a system to similar, but slightly different, types of data such as Affymetrix SNP arrays, exon arrays, or arrays produced by other manufacturers such as Illumina and Agilent.
In this paper we investigate a method for unsupervised classification that was designed with these considerations in mind. First, we describe how to frame the interpretation of microarray quality indicators as an unsupervised classification problem using a Gaussian mixture model. We show how the model parameters can be estimated using the Expectation-Maximization (EM) algorithm [21], and how they can be used to construct a Naïve Bayes classifier for identifying low quality data.
Previous work has demonstrated that naïve Bayes classifiers perform well with labeled training sets in the supervised version of the problem discussed in this paper [15]. The combination of naïve Bayes together with EM has been used with considerable success in other problem domains, including text classification [22]. Gaussian mixture models have been applied to automatic quality assessment of speech signals [23] and mass spectrometry data [24], and in other stages of the microarray processing pipeline, including identification of differentially expressed genes [25], assessment of concordance between similar microarray data sets [26], and even quality control at the spot detection and image fluorescence analysis level [27]. However, this is the first research we are aware of that employs this estimation approach, in conjunction with a naïve Bayes classifier, for the purpose of array-level quality control of microarray data.
In the following sections, we describe the datasets used in this research, and explain the implementation of both the supervised and unsupervised versions of the quality classifier. We demonstrate that the performance of the unsupervised classifier is comparable to a supervised classifier constructed from expert-labeled data. We also apply the algorithm to Affymetrix exon array data, and compare the observed quality indicator distributions with those obtained from 3' expression arrays.
Methods
Datasets
Our first dataset is a set of 603 Affymetrix raw intensity microarray data files, from 32 distinct experiments downloaded from the NCBI GEO database [28]. A variety of Affymetrix GeneChip 3' Expression array types are represented in the dataset, including: ath1121501 (Arabidopsis, 248 chips; GEO accession numbers: GSE5770, GSE5759, GSE911 [29], GSE2538 [30], GSE3350 [31], GSE3416 [32], GSE5534, GSE5535, GSE5530, GSE5529, GSE5522, GSE5520, GSE1491 [33], GSE2169, GSE2473), hgu133a (human, 72 chips; GSE1420 [34], GSE1922), hgu95av2 (human, 51 chips; GSE1563 [35]), hgu95d (human, 22 chips; GSE1007 [36]), hgu95e (human, 21 chips; GSE1007), mgu74a (mouse, 60 chips; GSE76, GSE1912 [37]), mgu74av2 (mouse, 29 chips; GSE1947 [38], GSE1419 [39, 40]), moe430a (mouse, 10 chips; GSE1873 [41]), mouse4302 (mouse, 20 chips; GSE5338 [42], GSE1871 [43]), rae230a (rat, 26 chips; GSE1918, GSE2470), and rgu34a (rat, 44 chips; GSE5789 [44], GSE1567 [45], GSE471 [46]). These experiments cover many of the species commonly analyzed using the GeneChip platform, and were selected to represent a variety of tissue types and experimental treatments.
BioConductor Quality Control Statistics
| Quality Statistic | Description |
|---|---|
| mean.raw.int, sd.raw.int, median.raw.int, interQuartile.raw.int | mean, standard deviation, median and inter-quartile range of the raw log intensity distribution. |
| q.5.raw.int, q.95.raw.int | 5th and 95th percentiles of the raw log intensity distribution. |
| slope.bias, p.bias | slope parameter and associated p-value of the linear regression of log expression level versus probe number, as computed by the R affy library function AffyRNAdeg(). |
| mean.norm.int, sd.norm.int, median.norm.int, interQuartile.norm.int, q.5.norm.int, q.95.norm.int | mean, standard deviation, median, inter-quartile range, and 5th and 95th percentiles of the normalized log intensity distribution. |
| PLM.w.q.0.001, PLM.w.q.0.01, PLM.w.q.0.1, PLM.w.q.0.2 | 0.1th, 1st, 10th and 20th percentiles of the probe-level model weights, computed using affyPLM library functionality. |
| PLM.res.q.0.01, PLM.res.q.0.1, PLM.res.q.0.25, PLM.res.q.0.75, PLM.res.q.0.9, PLM.res.q.0.99 | 1st, 10th, 25th, 75th, 90th, and 99th percentiles of the probe-level model residuals, computed using affyPLM library functionality. |
| RLE.median, RLE.interQuartile, RLE.lower.whisker, RLE.upper.whisker | median, inter-quartile range, lower tail and upper tail of the relative log expression (RLE), computed using affyPLM library functionality. |
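For illustration, the raw-intensity summaries in the first rows of this table can be computed directly from a chip's probe intensities. The following is a minimal sketch, not the BioConductor code; it assumes the raw probe intensities of a single chip are available as a numeric vector and that a log2 scale is used.

```python
# Minimal sketch (not the BioConductor implementation): summary statistics of a
# chip's raw log intensity distribution, matching the first rows of the table above.
# The log2 scale is an assumption for illustration.
import numpy as np

def raw_intensity_summaries(intensities):
    log_int = np.log2(np.asarray(intensities, dtype=float))
    q5, q25, q75, q95 = np.percentile(log_int, [5, 25, 75, 95])
    return {
        "mean.raw.int": log_int.mean(),
        "sd.raw.int": log_int.std(ddof=1),
        "median.raw.int": float(np.median(log_int)),
        "interQuartile.raw.int": q75 - q25,
        "q.5.raw.int": q5,
        "q.95.raw.int": q95,
    }
```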
Affymetrix Expression Console Quality Control Statistics (Exon Arrays)
| Quality Statistic | Description |
|---|---|
| pm.mean | mean of the raw intensity for all PM probes, prior to any normalization. |
| bgrd.mean | mean of the raw intensity for all probes used to compute background intensity. (Note: may be higher than pm.mean because the GC compositions of the background probes and the PM probes can be quite different.) |
| pos.vs.neg.auc | area under the ROC curve discriminating between positive control probesets and negative control probesets. |
| probeset.mean, probeset.stdev | mean and standard deviation of probeset signals after normalization. |
| probeset.mad.residual.mean, probeset.mad.residual.stdev | mean and standard deviation of the absolute deviations of the RMA probe-level model residuals from the median across chips. |
| probeset.rle.mean, probeset.rle.stdev | mean and standard deviation of the absolute values of the relative log expression (RLE) for all probesets. |
Expert Annotation
A domain expert analyzed the 3' expression dataset (dataset 1) and assigned quality scores according to a procedure which is based on experience gained during almost three years of bioinformatics support within the Lausanne DNA Array Facility (DAFL). This quality control procedure is described in [15]. Briefly, the chip scan images and the distributions of the log scale raw PM intensities are visualized. Smaller discrepancies between chips are common and can often be removed by normalization. Remaining discrepancies usually indicate low quality data, possibly caused by problems in the amplification or labelling step. The general 5' to 3' probe intensity gradient averaged over all probe sets on a chip is also examined. The slope and shape of the resulting intensity curves depend on the RNA sample source, the amplification method, and the array type. In general, the specific shape of the curves is less important for the quality check than their agreement across the experiment. Pseudo-images representing the spatial distribution of residuals and weights derived from the probeset summarization model are very important diagnostics. Small artifacts are not critical when using robust analysis methods; however, extended anomalies are taken as an indication of low quality. In addition, box plot representations of the Normalized Un-scaled Standard Error (NUSE) from the probe level model fit and the Relative Log Expression (RLE) between each chip and a median chip are examined. These plots are used to identify problematic chips showing an overall deviation of gene expression levels from the majority of all measured chips. A chip may be judged as having poor quality if it is an apparent outlier in the experiment-wide comparison of several quality measures. Each array was given a quality score of 0, 1 or 2, with 0 being "acceptable quality" (519 chips), 1 being "suspicious quality" (56 chips) and 2 being "unacceptable quality" (28 chips). For the purposes of classification, chips with scores of 1 or 2 were combined into the composite "low quality" class.
Supervised Naïve Bayes Classifier
Previous research has demonstrated that quality assessment of microarray data can be successfully automated with the use of a supervised classifier [15, 20]. The goal of supervised classification is to utilize an annotated training dataset to learn a function that can be used to correctly classify unlabeled instances. In the case of microarray quality assessment, the training dataset consists of the quality control features computed for each chip, combined with the quality annotation for each chip.
By making the simplifying assumption that all features are conditionally independent, naïve Bayes classifiers attempt to directly model the probability that a particular data point belongs to each class. Given the class label, each feature is assumed to follow an independent, univariate distribution. These distributions are, of course, unknown, but the maximum likelihood parameter estimates can be determined from a labeled training set. Then, for each unlabeled instance, Bayes' rule can be applied to compute the conditional probability that the instance belongs to each of the possible classes. Because we had prior success performing classification on a similar data set using Naïve Bayes with Gaussian feature distributions [15], we again chose to model the features using independent normal distributions. However, the approach could easily be adapted to use alternative distributions, for example, Student's t-distribution or the skew-normal distribution.
Under the naïve Bayes model, the posterior probability that a chip belongs to the low quality class is

$$\Pr(c = 1 \mid x) = \frac{\Pr(c = 1)\,\prod_{i=1}^{p} f_{1,i}(x_i)}{\Pr(c = 0)\,\prod_{i=1}^{p} f_{0,i}(x_i) + \Pr(c = 1)\,\prod_{i=1}^{p} f_{1,i}(x_i)} \qquad (1)$$

where c ∈ {0,1} signifies the class label, with 0 denoting "high quality" and 1 denoting "low quality"; x = (x_1, ..., x_p) is a length-p vector of features describing the unlabeled instance; and f_{c,i} is the Gaussian density for the ith feature conditional on class c, so that, for example, f_{1,i} is the Gaussian density for the ith feature among low quality chips:

$$f_{c,i}(x_i) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}\exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right) \qquad (2)$$

For the purposes of classification, this algorithm assigns class 1 to an unlabeled instance x if Pr(c = 1 | x) > t, where t is a threshold parameter, ordinarily set to 0.5 in order to approximate the Bayes optimal decision rule. By varying this parameter, it is also possible to construct ROC curves which display the tradeoff between sensitivity and specificity for various decision thresholds.
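As a concrete illustration of equations 1 and 2, the sketch below fits per-class Gaussian densities by maximum likelihood and evaluates the posterior for an unlabeled chip. This is illustrative Python, not the authors' Matlab implementation (see Availability and requirements); function names are ours, and the threshold t is exposed so that an ROC curve can be traced by varying it.

```python
# Sketch of a supervised Gaussian naive Bayes quality classifier (equations 1-2).
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(X, y):
    """X: (N, p) quality features; y: (N,) labels, 0 = high quality, 1 = low quality.
    Returns per-class priors, means, and standard deviations (maximum likelihood)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mu": Xc.mean(axis=0),
            "sigma": Xc.std(axis=0) + 1e-6,   # small floor to avoid zero variance
        }
    return params

def prob_low_quality(params, x):
    """Pr(c = 1 | x) from equation 1, computed in log space for numerical stability."""
    log_joint = {
        c: np.log(p["prior"]) + norm.logpdf(x, p["mu"], p["sigma"]).sum()
        for c, p in params.items()
    }
    m = max(log_joint.values())
    z = {c: np.exp(v - m) for c, v in log_joint.items()}
    return z[1] / (z[0] + z[1])

# Classify as low quality when Pr(c = 1 | x) > t; varying t traces an ROC curve.
def classify(params, x, t=0.5):
    return prob_low_quality(params, x) > t
```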
Unsupervised Naïve Bayes Classifier
The standard (supervised) approach to constructing a naïve Bayes classifier employs maximum likelihood estimation to infer the distribution parameters of each classification feature from an expert-annotated training set. It is, however, also possible to construct an "unsupervised" naïve Bayes classifier by using an unannotated dataset as input. In this case, the EM algorithm is used to infer the feature distributions, assuming an appropriate Gaussian mixture model, as described in the following section.
Gaussian Mixture Model and the EM Algorithm
The naïve Bayes classification model described above requires parameter estimates for the quality control metrics, conditional on each quality class. In the absence of annotated data, however, the quality classes of the unannotated training instances are additional unknowns that must be estimated along with the distributional parameters. We model the unannotated dataset using a Gaussian Mixture Model, under the assumption that microarray data can be reasonably classified into the dichotomy of "high quality" and "low quality" chips, and that the unlabeled training set contains examples of each. Under this two-component mixture, the likelihood of the unlabeled dataset is

$$L(\mu, \sigma^{2}, \phi \mid x) = \prod_{i=1}^{N} \sum_{j=0}^{1} \phi_{j} \prod_{k=1}^{p} \mathrm{normpdf}(x_{ik},\, \mu_{jk},\, \sigma^{2}_{jk}) \qquad (3)$$

where:

- x is an N × p matrix containing the p feature values for the N items in the dataset, with x_i denoting the length-p feature vector for the ith data point.
- μ is a 2 × p parameter matrix containing, in each column, the means μ_{0k} and μ_{1k} for the kth feature; μ_j is the length-p parameter vector for the jth Gaussian mixture component (j ∈ {0,1}).
- σ² is a 2 × p parameter matrix containing, in each column, the variances σ²_{0k} and σ²_{1k} for the kth feature; σ²_j is the length-p parameter vector for the jth Gaussian mixture component.
- φ is a length-2 vector of mixing proportions, where φ_j denotes the prior probability that a chip belongs to mixture component j.
The likelihood function in equation 3 can be maximized using the EM algorithm [21]. The EM algorithm is a well-known method for maximizing mixture model likelihood functions by iteratively performing two steps:
- E Step: Estimate the unknown class labels, based on the current estimates for the other parameters.
- M Step: Given the current class labels, compute the maximum likelihood estimators for the parameters μ, σ², and φ.
To implement the EM algorithm, we introduce an additional N × 2 matrix, w, which contains, for each data point i, the current guesses for p(c_i = 0) and p(c_i = 1). After initializing all parameters and the weight matrix, w, to random values, the EM algorithm iterates the following updates until convergence:

$$\phi_{j} = \frac{1}{N}\sum_{i=1}^{N} w_{ij} \qquad (4)$$

$$\mu_{jk} = \frac{\sum_{i=1}^{N} w_{ij}\, x_{ik}}{\sum_{i=1}^{N} w_{ij}} \qquad (5)$$

$$\sigma^{2}_{jk} = \frac{\sum_{i=1}^{N} w_{ij}\,(x_{ik} - \mu_{jk})^{2}}{\sum_{i=1}^{N} w_{ij}} \qquad (6)$$

$$w_{ij} = \frac{\phi_{j} \prod_{k=1}^{p} \mathrm{normpdf}(x_{ik},\, \mu_{jk},\, \sigma^{2}_{jk})}{\sum_{l=0}^{1} \phi_{l} \prod_{k=1}^{p} \mathrm{normpdf}(x_{ik},\, \mu_{lk},\, \sigma^{2}_{lk})} \qquad (7)$$
where normpdf(x, μ, σ²) denotes the probability density of a normal distribution with mean μ and variance σ², evaluated at x. Equations 4–6 constitute the M step, and equation 7 is the E step. Because the algorithm can converge to local optima, it is prudent to run the algorithm several times after random restarts. Additionally, each variance estimate σ²_{jk} was constrained to be ≥ 0.001 to avoid convergence to a trivial solution. Further details concerning this implementation of the EM algorithm and the associated Gaussian mixture model can be found in [62]. Once estimates have been obtained for μ, σ² and φ, any unlabeled instance can be classified according to these mixture components using naïve Bayes, according to equation 1 (or equivalently, equation 7, in the case of the original unlabeled dataset). Since our assumption is that low quality chips are outliers with respect to these quality features, we use the mixture component with the smaller estimated mixing proportion, φ_j, to identify the low quality class.
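The following sketch illustrates this estimation procedure for the two-component, diagonal-covariance mixture. It is illustrative Python rather than the authors' Matlab implementation; the variance floor of 0.001 and the random initialization of w follow the description above, while the function name, fixed iteration count, and convergence handling are our own simplifications. In practice the algorithm would be restarted several times and the highest-likelihood solution retained.

```python
# Sketch of EM for a 2-component Gaussian mixture with per-feature (diagonal)
# variances, following equations 3-7. X is the N x p matrix of quality features.
import numpy as np
from scipy.stats import norm

def em_mixture(X, n_iter=200, var_floor=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    N, p = X.shape
    w = rng.random((N, 2))
    w /= w.sum(axis=1, keepdims=True)             # random initial responsibilities
    for _ in range(n_iter):
        # M step (equations 4-6): update phi, mu, sigma^2 from current weights
        nk = w.sum(axis=0)
        phi = nk / N
        mu = (w.T @ X) / nk[:, None]
        sigma2 = np.stack([
            (w[:, j:j + 1] * (X - mu[j]) ** 2).sum(axis=0) / nk[j] for j in range(2)
        ])
        sigma2 = np.maximum(sigma2, var_floor)    # avoid degenerate (trivial) solutions
        # E step (equation 7): recompute posterior component memberships
        log_w = np.stack([
            np.log(phi[j]) + norm.logpdf(X, mu[j], np.sqrt(sigma2[j])).sum(axis=1)
            for j in range(2)
        ], axis=1)
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
    return phi, mu, sigma2, w

# The component with the smaller mixing proportion phi_j is taken to be the
# low quality class, since low quality chips are assumed to be the minority.
```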
Feature Selection
In order to achieve optimal classification performance, it is important to select an appropriate subset of the classification features. Ideally, this subset should include independent features that are each individually predictive of the class label.

Finally, the p scores were sorted in ascending order, to rank the features by their ability to predict the correct class label. Denote the rank of feature j according to the value of this score as S[j].
To identify correlations among the quality control features, we next computed the p × p Pearson correlation matrix. Let ρ_{jk} denote the correlation between features j and k, and ρ_{[j]k} represent the rank of the correlation of feature j with feature k among all other features correlated with k, with features ranked in order of descending correlation. To select a subset of n features, we used the following forward selection algorithm:
where F denotes the set of previously selected features. The constants c1 and c2 in this expression are weighting factors that can be modified to control the tradeoff between selection for independent features and features that are highly correlated with the class label. We used 0.5 for each.
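Because the selection criterion equation itself did not survive formatting above, the sketch below shows only one plausible reading of the procedure: at each step, greedily add the candidate feature that minimizes a weighted combination (weights c1 and c2) of its predictiveness rank S[j] and a redundancy penalty derived from its correlation ranks with the features already in F. The function name and the precise combination rule are hypothetical.

```python
# Hypothetical sketch of the forward selection step; the authors' criterion
# combines the ranks S[j] and the correlation ranks with weights c1 = c2 = 0.5,
# but its exact form is not reproduced here.
import numpy as np

def forward_select(S, corr_rank, n, c1=0.5, c2=0.5):
    """S: (p,) predictiveness ranks (1 = most predictive).
    corr_rank[j, k]: rank of feature j's correlation with feature k
    (1 = most correlated). Returns indices of n selected features."""
    p = len(S)
    selected = [int(np.argmin(S))]                 # seed with the most predictive feature
    while len(selected) < n:
        best, best_score = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # a small correlation rank with any selected feature signals redundancy,
            # so reward candidates whose minimum rank is large (weak correlation)
            redundancy_rank = min(corr_rank[j, k] for k in selected)
            score = c1 * S[j] - c2 * redundancy_rank   # hypothetical combination
            if score < best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```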
Results and discussion
Parameter Estimates
3' Expression Arrays



Mixture Model Parameter Estimates. Supervised (MLE) and Unsupervised (EM) estimates shown are for the following features from the 3' expression arrays: (A) 5th percentile of raw intensities, (B) inter-quartile range of the Relative Log Expression (RLE), (C) 25th percentile of the probe-level model residuals, and (D) the 20th percentile of the probe-level model weights. All features were normalized relative to other chips in the same experiment, using the SCORE function (see Table 1).
The EM estimates appear to be reasonable in all cases, given the original intent of each quality metric. For example, given the normalized (log-scale) expression values, the RLE metric measures the distribution of the quantity M_gi = x_gi − m_g for each chip, where x_gi is the log expression measurement for probeset g on chip i, and m_g is the median expression of probeset g across all arrays. In general, since it is ordinarily assumed that the majority of genes are not differentially expressed across chips, the quantity M_gi is expected to be distributed with median 0. In addition, chips that more frequently have extreme expression values will have a large inter-quartile range for this statistic. Figure 1b indicates that, as expected, low quality chips were indeed more likely to have a large inter-quartile range for the RLE statistic.
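As a concrete illustration, the RLE quantities described above can be computed as follows. This is a small sketch assuming a matrix of normalized log-scale expression values with probesets in rows and chips in columns; the function name and data layout are assumptions for illustration, not the affyPLM implementation.

```python
# Sketch of the RLE summaries: M_gi = x_gi - m_g, summarized per chip.
import numpy as np

def rle_summaries(log_expr):
    """log_expr: (G, n) normalized log expression values (probesets x chips).
    Returns the per-chip median and inter-quartile range of M_gi."""
    m_g = np.median(log_expr, axis=1, keepdims=True)   # median of each probeset across arrays
    M = log_expr - m_g                                  # relative log expression
    rle_median = np.median(M, axis=0)                   # expected to be near 0 for good chips
    q75, q25 = np.percentile(M, [75, 25], axis=0)
    return rle_median, q75 - q25                        # a large IQR flags suspect chips
```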
Parameter estimates for the other metrics also agree with our expectations. For example, the estimates for metrics relating to probe-level model weights and residuals reflect the expectation that low quality chips should have larger residuals and more down-weighted probesets (Figure 1c, d). Similarly, the estimates indicate that low quality chips are more likely to have RNA degradation plots that are different from other chips in the same experiment. The low quality chips also tend to have both mean raw and mean normalized intensities that are either significantly higher or lower than other chips in the same experiment.
Exon Arrays
The Affymetrix exon array platform is different from the 3' expression array platform in several important ways [63]. For example, the 3' expression array targeting the human genome (Hgu133) has, on average, 1 probeset pair for each well-annotated gene; each probeset consists of 11 individual 25-mer probes, which primarily target the 3' region of the gene. In contrast, the Human Exon 1.0 ST array has 1 probeset for each exon of each gene in the target genome. Each probeset contains, in general, 4 (rather than 11) 25-mer probes. Unlike 3' expression arrays, exon arrays lack mismatch probes. Instead, the background expression level for each probe is estimated by averaging the intensities of approximately 1000 surrogate genomic and anti-genomic background probes having the same GC content as the target probe. Because most genes consist of several exons, the median number of probes per gene increases from 11 on the 3' array to 30–40 on the exon array [64]. However, genes with fewer exons are covered by fewer probes. In fact, there are a few thousand well-annotated single exon genes covered by only 4 probes [63]. Furthermore, the feature size on the exon arrays has been reduced from 11 × 11 microns on the HGU133 array to 5 × 5 microns on the Human Exon 1.0 ST array (about 1/5 the area). This change may increase the expression variance, at least at the probeset level [63]. Exon arrays also utilize a different hybridization protocol which uses sense-strand labeled targets, and results in DNA-DNA hybridizations rather than the DNA-RNA hybridizations used with traditional 3' arrays [65]. These differences suggest that the distributions of key quality control indicators may differ between the two platforms.
Comparison of Parameter Estimates for 3' Expression Arrays and Exon Arrays. Each diagram illustrates the unsupervised Gaussian parameter estimates for one of the quality control features, for each of the two chip types. Estimates shown are for the following features: (A) Upper tail of the Relative Log Expression (RLE), computed using the affyPLM functionality, (B) median of the raw intensity distribution, (C) 10th percentile of the probe-level model residuals, and (D) inter-quartile range of the RLE.
Parameter Estimates for Exon Array Expression Console QC Features. Shown are the parameter estimates obtained using the EM algorithm for various exon array quality control features available in the Affymetrix Expression Console software. Estimates shown are for the following features: (A) mean of the absolute deviation of the RMA probe level model residuals from the median across chips, (B) standard deviation of signal from positive control probesets after normalization, (C) standard deviation of signal from all probesets after normalization, and (D) area under ROC curve discriminating between positive control probesets and negative controls.
Classifier Performance Evaluation
3' Expression Arrays
Classifier Performance. Unsupervised versus supervised classifier using labeled data sets of various sizes. When the full labeled training dataset (~540 labeled instances per fold) is available, the performance of the unsupervised classification method (EM+Naïve Bayes) and the supervised classification method (MLE+Naïve Bayes) are equivalent on the test dataset. When the amount of labeled data is limited, but unlabeled data is abundant, the unsupervised method outperforms the supervised method.
In many real world scenarios the amount of unlabeled data available greatly exceeds the amount of expert-labeled data. To test the performance of the two classifiers under these conditions, we performed additional 10-fold cross-validation experiments similar to the previous test. However, in this case, the supervised MLE + Naïve Bayes classifier was trained using random subsets of instances from each labeled training fold, while the EM + Naïve Bayes classifier was constructed using the entire unlabeled training fold. Subsets containing 10, 20, 30, 60, 75, and 100 instances were used to train the supervised classifier. The ROC curves in Figure 4 indicate that the EM + Naïve Bayes classifier appears to have an advantage when the amount of unlabeled training data available greatly exceeds the amount of expert-labeled data. For example, the unsupervised method clearly outperforms the supervised method when 30 or fewer labeled instances were available. Table S2 (available in additional file 1) contains the resulting confusion matrix for the case in which 30 labeled training instances were used, with a classification threshold of 0.5.
Exon Arrays
To demonstrate the general applicability of our method, we constructed unsupervised classifiers using the two sets of quality control variables and the entire unlabeled training set. These classifiers were then used to predict classification labels for each data point. Figure S6 (in additional file 1) shows a Venn diagram comparing the classification results for classifiers constructed using the BioConductor quality features and the Expression Console quality features. In most, but not all, cases, the classifiers agree on the characterization of each chip with regard to quality. In addition, both classifiers agree that approximately 39% of the data is low quality. Additional file 3 contains the classification labels obtained using unsupervised classifiers constructed using each set of quality variables.
Simulation Results
The agreement between the quality control feature distribution parameters estimated using the supervised maximum likelihood method and the estimates obtained with the unsupervised Gaussian mixture model suggests that our domain expert has uncovered a plausible dichotomy of chips within our dataset. To further confirm that the chips classified as having low quality were indeed more likely to negatively impact tests for differential expression, we performed a simple simulation. The procedure involved adding an offset to the observed expression measurements for a subset of the probesets on a set of "treatment" arrays, and then comparing these arrays with a set of unmodified "control" arrays sampled from the same experiment (details not shown). Among those chips designated by the expert as low quality, the majority (approximately 70%) impaired the ability to detect simulated differential expression when included in an analysis, compared to only about 10% of the chips classified as having high quality.
Conclusion
In this paper we have illustrated the efficacy of an unsupervised classification approach to assessing microarray data quality. Our method uses unlabeled training data to identify apparent distinctions between "good" and "bad" quality chips within the dataset. The method then integrates measurements obtained across a variety of quality dimensions into a single composite quality score which can be used to accurately identify low quality data.
Our method is flexible and can be easily adapted to accommodate alternate quality statistics and platforms. Because this technique requires only unannotated training data, it is easy to keep the resulting classifier up-to-date as technology evolves, and the adaptable nature of the system makes arbitrary, universal quality score thresholds unnecessary. Moreover, since a naïve Bayes classification approach involves the estimation of the underlying, univariate distributions for each of the classification parameters, this method allows for intuitive explanations that offer an advantage over other "black box" classification systems [66, 67]. For example, under this framework, it is possible to infer which diagnostic plots and features are most relevant for the classification of a particular chip. These plots can then be presented to the user in order to explain the classification. A quality control method that incorporates an interpretation of standard diagnostic plots is an extension of a familiar process already used by many labs, and good diagnostic plots can provide powerful and convincing evidence of data quality artifacts.
An important caveat for this, and any quality control methodology, is that the decision about what to do with the detected low quality chip(s) is dependent on the experimental design, the number of low quality chips detected, and the magnitude of the defects encountered. In many cases, low quality chips still contain valuable information, and in some cases the most effective strategy may be to simply down-weight these chips rather than discarding them entirely [68].
Nevertheless, with the availability of a variety of rapidly growing public repositories for microarray data, the continual appearance of new microarray chip types, and the increasing usage of genomics data by research organizations worldwide, the development of robust and flexible methods for microarray quality assessment is now more important than ever. An advantage of the approach described in this paper is that, once a classifier has been constructed, the run-time required to automatically classify new instances is minimal. This makes the method ideal for use as a component of a batch processing system, such as a screening tool for use with public databases, or as a step in a meta-analysis pipeline.
Availability and requirements
- Project name: Unsupervised Assessment of Microarray Data Quality Using a Gaussian Mixture Model.
- Availability: A Matlab implementation of these algorithms and the corresponding analyses is available in additional file 5.
- Operating system: Implemented and tested under Windows XP.
- Programming language: Matlab 7.0.1.15, service pack 1.
- Other requirements: Matlab Statistics Toolbox version 6.1.
- License: Brian E. Howard. Free for non-commercial use.
- Any restrictions to use by non-academics: Contact corresponding author.
Declarations
Acknowledgements
Funded by the NCSU/EPA Cooperative Training Program in Environmental Sciences Research, Training Agreement CT833235-01-0 with North Carolina State University.
References
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, et al.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 2006, 24(9):1151–1161.
- Larsson O, Wennmalm K, Sandberg R: Comparative microarray analysis. OMICS: A Journal of Integrative Biology 2006, 10(3):381–397.
- Wilkes T, Laux H, Foy CA: Microarray data quality – review of current developments. OMICS: A Journal of Integrative Biology 2007, 11(1):1–13.
- Eads B, Cash A, Bogart K, Costello J, Andrews J: Troubleshooting microarray hybridizations. Methods in Enzymology 2006, 411: 34–49.
- Copois V, Bibeau F, Bascoul-Mollevi C, Salvetat N, Chalbos P, et al.: Impact of RNA degradation on gene expression profiles: assessment of different methods to reliably determine RNA quality. Journal of Biotechnology 2007, 127(4):549–59.
- Archer KJ, Dumur CI, Joel SE, Ramakrishnan V: Assessing quality of hybridized RNA in Affymetrix GeneChip experiments using mixed-effects models. Biostatistics 2006, 7(2):198–212.
- Jones L, Goldstein DR, Hughes G, Strand AD, Collin F, et al.: Assessment of the relationship between pre-chip and post-chip quality measures for Affymetrix GeneChip expression data. BMC Bioinformatics 2006, 7: 211.
- Reimer M, Weinstein JN: Quality assessment of microarrays: visualization of spatial artifacts and quantitation of regional biases. BMC Bioinformatics 2005, 6: 166.
- Stokes TH, Moffitt RA, Phan JH, Wang MD: chip artifact CORRECTion (caCORRECT): a bioinformatics system for quality assurance of genomics and proteomics array data. Annals of Biomedical Engineering 2007, 35(6):1068–80.
- Affymetrix, Inc: GeneChip expression analysis, data analysis fundamentals. Affymetrix, Santa Clara, CA; 2003. [http://www.affymetrix.com/support/downloads/manuals/data_analysis_fundamentals_manual.pdf]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004, 5(10):R80.
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–15.
- Bolstad B: affyPLM: methods for fitting probe-level models. BioConductor version 2.0 package. [http://bioconductor.org/packages/2.0/bioc/html/affyPLM.html]
- Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005.
- Heber S, Sick B: Quality assessment of Affymetrix GeneChip data. OMICS: A Journal of Integrative Biology 2006, 10(3):358–68.
- Psarros M, Heber S, Sick M, Thoppae G, Harshman K, Sick B: RACE: Remote Analysis Computation for gene Expression data. Nucleic Acids Research 2005, 33: W638–43.
- Howard BE, Perera I, Im YJ, Winter-Sederoff H, Sick B, Heber S: Quality assessment of Affymetrix GeneChip data using the EM algorithm and a naïve Bayes classifier. In Proceedings of the IEEE 7th International Symposium on Bioinformatics & Bioengineering (BIBE 2007): 14–17 October 2007; Cambridge, MA. Edited by: Jack Y Yang, Mary Qu Yang, Michelle M Zhu, et al. IEEE; 2007:145–150.
- Lee EK, Yi SG, Park T: arrayQCplot: software for checking the quality of microarray data. Bioinformatics 2006, 22(18):2305–7.
- Lozano JJ, Kalko SG: AMarge: Automated Extensive Quality Assessment of Affymetrix chips. Applied Bioinformatics 2006, 5(1):45–47.
- Burgoon LD, Eckel-Passow JE, Gennings C, Boverhof DR, Burt JW, et al.: Protocols for the assurance of microarray data quality and process control. Nucleic Acids Research 2005, 33(19):e172.
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39(1):1–38.
- Nigam K, McCallum A, Thrun S, Mitchell T: Text classification from labeled and unlabeled documents using EM. Machine Learning 2000, 39(2/3):103–134.
- Falk TH, Chan W-Y: Objective speech quality assessment using Gaussian mixture models. 22nd Biennial Symposium on Communications: 31 May – 3 June 2004; Ontario, Canada; 2004.
- Wong JWH, Sullivan MJ, Cartwright HM, Cagney G: msmsEval: tandem mass spectral quality assignment for high-throughput proteomics. BMC Bioinformatics 2007, 8: 51.
- Najarian K, Zaheri M, Rad AA, Najarian S, Dargahi J: A novel mixture model method for identification of differentially expressed genes from DNA microarray data. BMC Bioinformatics 2004, 5: 201.
- Lai Y, Adam B, Podolsky R, She J: A mixture model approach to the tests of concordance and discordance between two large-scale experiments with two-sample groups. Bioinformatics 2007, 23(10):1243–1250.
- Asyali MH, Alci M: Reliability analysis of microarray data using fuzzy c-means and normal mixture modeling based classification methods. Bioinformatics 2005, 21(5):644–9.
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 2002, 30(1):207–10.
- William DA, Su Y, Smith MR, Lu M, Baldwin DA, Wagner D: Genomic identification of direct target genes of LEAFY. PNAS 2004, 101(6):1775–80.
- Ramonell K, Berrocal-Lobo M, Koh S, Wan J, Edwards H, Stacey G, Somerville S: Loss-of-function mutations in chitin responsive genes show increased susceptibility to the powdery mildew pathogen Erysiphe cichoracearum. Plant Physiology 2005, 138(2):1027–36.
- Vanneste S, De Rybel B, Beemster GT, Ljung K, De Smet I, et al.: Cell cycle progression in the pericycle is not sufficient for SOLITARY ROOT/IAA14-mediated lateral root initiation in Arabidopsis thaliana. Plant Cell 2005, 17(11):3035–50.
- Bläsing OE, Gibon Y, Günther M, Höhne M, Morcuende R, et al.: Sugars and circadian regulation make major contributions to the global regulation of diurnal gene expression in Arabidopsis. Plant Cell 2005, 17(12):3257–81.
- Armstrong JI, Yuan S, Dale JM, Tanner VN, Theologis A: Identification of inhibitors of auxin transcriptional activation by means of chemical genetics in Arabidopsis. PNAS 2004, 101(41):14978–83.
- Kimchi ET, Posner MC, Park JO, Darga TE, Kocherginsky M, et al.: Progression of Barrett's metaplasia to adenocarcinoma is associated with the suppression of the transcriptional programs of epidermal differentiation. Cancer Research 2005, 65(8):3146–54.
- Flechner SM, Kurian SM, Head SR, Sharp SM, Whisenant TC, et al.: Kidney transplant rejection and tissue injury by gene profiling of biopsies and peripheral blood lymphocytes. American Journal of Transplantation 2004, 4(9):1475–89.
- Haslett JN, Sanoudou D, Kho AT, Han M, Bennett RR, et al.: Gene expression profiling of Duchenne muscular dystrophy skeletal muscle. Neurogenetics 2003, 4(4):163–71.
- Lin KK, Chudova D, Hatfield GW, Smyth P, Andersen B: Identification of hair cycle-associated genes from time-course gene expression profile data by using replicate variance. PNAS 2004, 101(45):15955–60.
- Giambonini-Brugnoli G, Buchstaller J, Sommer L, Suter U, Mantei N: Distinct disease mechanisms in peripheral neuropathies due to altered peripheral myelin protein 22 gene dosage or a Pmp22 point mutation. Neurobiology of Disease 2005, 18(3):656–68.
- Chen Z, Herman AE, Matos M, Mathis D, Benoist C: Where CD4+CD25+ T reg cells impinge on autoimmune diabetes. Journal of Experimental Medicine 2005, 202(10):1387–97.
- Herman AE, Freeman GJ, Mathis D, Benoist C: CD4+CD25+ T regulatory cells dependent on ICOS promote regulation of effector cells in the prediabetic lesion. Journal of Experimental Medicine 2004, 199(11):1479–89.
- Li J, Grigoryev DN, Ye SQ, Thorne L, Schwartz AR, et al.: Chronic intermittent hypoxia upregulates genes of lipid biosynthesis in obese mice. Journal of Applied Physiology 2005, 99(5):1643–8.
- Cheng H, Aleman TS, Cideciyan AV, Khanna R, Jacobson SG, Swaroop A: In vivo function of the orphan nuclear receptor NR2E3 in establishing photoreceptor identity during mammalian retinal development. Human Molecular Genetics 2006, 15(17):2588–602.
- Jacobson JR, Barnard JW, Grigoryev DN, Ma SF, Tuder RM, Garcia JG: Simvastatin attenuates vascular leak and inflammation in murine inflammatory lung injury. American Journal of Physiology. Lung Cellular and Molecular Physiology 2005, 288(6):L1026–32.
- Ovando BJ, Vezina CM, McGarrigle BP, Olson JR: Hepatic gene downregulation following acute and subchronic exposure to 2,3,7,8-tetrachlorodibenzo-p-dioxin. Toxicological Sciences 2006, 94(2):428–38.
- Gonzalez R, Yang YH, Griffin C, Allen L, Tique Z, Dobbs L: Freshly isolated rat alveolar type I cells, type II cells, and cultured type II cells have distinct molecular phenotypes. American Journal of Physiology. Lung Cellular and Molecular Physiology 2005, 288(1):L179–89.
- Fischer MD, Gorospe JR, Felder E, Bogdanovich S, Pedrosa-Domellöf F, et al.: Expression profiling reveals metabolic and structural components of extraocular muscles. Physiological Genomics 2002, 9(2):71–84.
- Zhang Z, Lotti F, Dittmar K, Younis I, Wan L, et al.: SMN deficiency causes tissue-specific perturbations in the repertoire of snRNAs and widespread defects in splicing. Cell 2008, 133(4):585–600.
- Sandberg R, Neilson JR, Sarma A, Sharp PA, Burge CB: Proliferating cells express mRNAs with shortened 3' untranslated regions and fewer microRNA target sites. Science 2008, 320(5883):1643–7.
- Chahrour M, Jung SY, Shaw C, Zhou X, Wong ST, et al.: MeCP2, a key contributor to neurological disease, activates and represses transcription. Science 2008, 320(5880):1224–9.
- Xing Y, Stoilov P, Kapur K, Han A, Jiang H, et al.: MADS: a new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. RNA 2008, 14(8):1470–9.
- Soreq L, Gilboa-Geffen A, Berrih-Aknin S, Lacoste P, Darvasi A, et al.: Identifying alternative hyper-splicing signatures in MG-thymoma by exon arrays. PLoS ONE 2008, 3(6):e2392.
- Douglas D, Hsu JH, Hung L, Cooper A, Abdueva D, et al.: BMI-1 promotes Ewing sarcoma tumorigenicity independent of CDKN2A repression. Cancer Research 2008, 68(16):6507–15.
- Platts AE, Dix DJ, Chemes HE, Thompson KE, Goodrich R, et al.: Success and failure in human spermatogenesis as revealed by teratozoospermic RNAs. Human Molecular Genetics 2007, 16(7):763–73.
- Hu Z, Zimmermann BG, Zhou H, Wang J, Henson BS, et al.: Exon-level expression profiling: a comprehensive transcriptome analysis of oral fluids. Clinical Chemistry 2008, 54(5):824–32.
- Huang RS, Duan S, Shukla SJ, Kistner EO, et al.: Identification of genetic variants contributing to cisplatin-induced cytotoxicity by use of a genomewide approach. American Journal of Human Genetics 2007, 81(3):427–37.
- Hung LH, Heiner M, Hui J, Schreiner S, Benes V, Bindereif A: Diverse roles of hnRNP L in mammalian mRNA processing: a combined microarray and RNAi analysis. RNA 2008, 14(2):284–96.
- Kwan T, Benovoy D, Dias C, Gurd S, Provencher C, et al.: Genome-wide analysis of transcript isoform variation in humans. Nature Genetics 2008, 40(2):225–31.
- French PJ, Peeters J, Horsman S, Duijm E, Siccama I, et al.: Identification of differentially regulated splice variants and novel exons in glial brain tumors using exon expression arrays. Cancer Research 2007, 67(12):5635–42.
- Cahoy JD, Emery B, Kaushal A, Foo LC, Zamanian JL, et al.: A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. Journal of Neuroscience 2008, 28(1):264–78.
- Bengtsson H, Simpson K, Bullard J, Hansen K: aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Tech Report #745, Department of Statistics, University of California, Berkeley; 2008.
- Affymetrix, Inc: Quality assessment of exon and gene arrays. Affymetrix, Santa Clara, CA; 2007. [http://www.affymetrix.com/support/technical/whitepapers/exon_gene_arrays_qa_whitepaper.pdf]
- Ng A: Mixtures of Gaussians and the EM algorithm. CS229 Lecture notes. Stanford University, Palo Alto, CA; 2006. [http://www.stanford.edu/class/cs229/notes/cs229-notes8.pdf]
- Robinson MD, Speed TP: A comparison of Affymetrix gene expression arrays. BMC Bioinformatics 2007, 8(1):449.
- Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Schwitzer A, Awad T, Sugnet C, Dee S, Davies C, Williams A, Turpaz Y: Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006, 7: 325.
- Abdueva D, Wing MR, Schaub B, Triche TJ: Experimental comparison and evaluation of the Affymetrix exon and U133Plus2 GeneChip arrays. PLoS ONE 2007, 2(9):e913.
- Mozina M, Demsar J, Kattan M, Zupan B: Nomograms for visualization of naïve Bayesian classifiers. In Proc. of Principles and Practice of Knowledge Discovery in Databases (PKDD-2004): 20–24 Sept. 2004; Pisa, Italy. Edited by: Jean-Francois Boulicaut, Floriana Esposito, Fosca Giannotti, Dino Pedreschi. ACM; 2004:337–348.
- Poulin B, Eisner R, Szafron D, Lu P, Greiner R, et al.: Visual explanation of evidence in additive classifiers. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI): 16–20 July 2006; Boston, MA.
- Ritchie ME, Diyagama D, Neilson J, van Laar R, Dobrovic A, Holloway A, Smyth G: Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics 2006, 7: 261.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.