DNA microarrays are commonly used to measure, in parallel, the steady-state concentration of tens of thousands of mRNAs, providing an estimate of the expression level of the corresponding genes. They come in two flavors: 1) spotted arrays, which allow the simultaneous measurement of two samples on the same array and which we will refer to as multi-channel arrays; 2) Affymetrix GeneChip arrays, which have a significantly higher density but allow the hybridization of only one sample, and which we will refer to as High-Density Arrays (HDAs). A typical piece of information that investigators seek to extract from microarrays is the list of differentially expressed genes (DEGs) between a treatment and a control condition. This can take the form of either a defined subset or an ordering of the whole transcriptome to which some meaningful (statistically and/or practically) threshold is applied. Until recently, most efforts to derive a statistical method for this problem have focused on a general approach applicable to both types of arrays. The strategy when analyzing HDAs is to estimate the absolute expression level of a given gene in each condition and then compute the log-ratio of the expression levels between the two conditions. Methods applied to log-ratios from multi-channel arrays can readily be applied to these computed ratios.

A fundamental issue when analyzing microarray data is the imbalance between the dimensionality of the data (tens of thousands of measures for each sample) and the number of samples or replicates (between 2 and 60 for most studies). On the theoretical side, this imbalance makes it necessary to adjust statistical thresholds for the context of multiple testing [1, 2], seriously complicating the accurate estimation of sensitivity and specificity. Because widely used statistical methods are based on an estimate of the gene-specific variation, they technically require at least 2 replicates per condition. In the context of multi-channel arrays, experts in the field have recommended the use of 6 replicates [3], but, for practical reasons, most laboratories have settled for 3 replicates per condition. Proof-of-principle experiments with no replication (one array per condition) are still performed routinely by several labs and are typically analyzed by the simple method of identifying genes as putative DEGs if they show a ratio above 2 or below 0.5.
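The simple 2-fold rule mentioned above can be sketched in a few lines. The intensity values below are invented for illustration; only the thresholding logic reflects the rule described in the text:

```python
import numpy as np

# Hypothetical intensities for a no-replication design: one array per condition.
control = np.array([120.0, 45.0, 300.0, 80.0, 15.0])
treatment = np.array([250.0, 40.0, 140.0, 170.0, 16.0])

ratio = treatment / control
# 2-fold rule: flag genes as putative DEGs if the ratio is >= 2 or <= 0.5.
putative_deg = (ratio >= 2.0) | (ratio <= 0.5)
print(putative_deg.tolist())  # → [True, False, True, True, False]
```

Note that this rule ignores gene-specific variation entirely, which is precisely why it is only used when no replicates are available.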

In the work presented here, a significant increase in the accuracy of DEG detection from a low number of replicate arrays will be obtained by taking advantage of the design of HDAs themselves. All Affymetrix GeneChips contain *k* pairs of 25-mer DNA oligonucleotides, called probe pairs, per probe-set. The probe-sets are designed to measure the expression of a single gene. Each probe pair is composed of a perfect match (PM) and a mismatch (MM) probe, the former matching the target sequence exactly over its 25 nucleotides and the latter containing a single mismatch at position 13. MM probes were introduced to estimate the level of non-specific hybridization of the corresponding PM probe, but the recent trend has been to ignore the MM measures, use only the PM measures, and seek alternative methods to estimate non-specific hybridization. Since the measure of interest is the expression level of the genes, it is necessary to transform the set of *k* probe-pair intensities into a single probe-set measurement. This step is referred to as the expression summary. It is still debated which expression summary method gives the best results, but both MAS5.0 [4] and RMA [5] have been widely used. A recent summarization approach, FARMS [6], has been found to compare favorably to more classical summarization approaches and will be included in the comparisons.
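To make the expression-summary step concrete, the sketch below implements Tukey's median polish, the summarization used by RMA. It assumes the input is a (probes × arrays) matrix of background-corrected, normalized log2 PM intensities; the probe-set values are invented for illustration:

```python
import numpy as np

def median_polish(x, n_iter=10):
    """Tukey's median polish on a (probes x arrays) matrix of log2 PM
    intensities. Returns overall + column (array) effects, i.e. one
    expression value per array, as in RMA's summarization step."""
    x = x.copy()
    overall = 0.0
    row = np.zeros(x.shape[0])
    col = np.zeros(x.shape[1])
    for _ in range(n_iter):
        rmed = np.median(x, axis=1)        # sweep out row (probe) medians
        row += rmed
        x -= rmed[:, None]
        overall += np.median(row)
        row -= np.median(row)
        cmed = np.median(x, axis=0)        # sweep out column (array) medians
        col += cmed
        x -= cmed[None, :]
        overall += np.median(col)
        col -= np.median(col)
    return overall + col                   # per-array expression summaries

# Hypothetical probe-set: 4 probe pairs measured on 3 arrays; the third
# array carries roughly twice the concentration of the first two.
probes = np.log2(np.array([[200., 210., 400.],
                           [150., 160., 310.],
                           [500., 520., 980.],
                           [ 90.,  95., 180.]]))
vals = median_polish(probes)
print(vals)
```

The robustness of the median (as opposed to the mean) is what makes this summary resistant to individual outlier probes within a probe-set.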

The publication of the Choe *et al*. dataset in 2005 [7] provided an objective benchmark to evaluate the accuracy of differential gene expression detection for various combinations of methods. In this work the authors amplified 3,866 cRNAs and prepared two samples, 'control' (C) and 'spike' (S), in which 1,331 cRNAs were spiked at concentration ratios ranging from 1.2 to 4. The two samples were then each hybridized to three Affymetrix GeneChips and scanned using the standard Affymetrix protocol. The work was done on the Affymetrix *Drosophila* array (DrosGenome1), where each gene is represented by a probe-set of 14 probe pairs of 25-mer DNA oligonucleotides; a total of 14,010 probe-sets are present on the array. Two key characteristics of this dataset are important to keep in mind when carrying out the statistical analysis: 1) a large proportion (72%) of the probe-sets corresponds to cRNAs that were not present in the samples, leading to a bimodal intensity distribution; 2) only concentration ratios above 1 were introduced in the spiked sample, inducing a strong asymmetry in the distribution of log-ratios. Using this validation dataset, it was shown [7] that a preferred set of steps can be identified to optimize the detection of DEGs on HDAs. The optimal pipeline identified uses MAS5.0 for background correction, probe-level normalization and PM adjustment, the medianpolish algorithm (borrowed from RMA [5]) for expression summaries, a loess normalization on the probe-set level data, and finally Cyber-T [8] to identify DEGs.

The most successful and widely used methods to identify DEGs are currently based on variations of a regularized *t*-statistic in which the standard deviation term is weighted [8], has a constant added to it [9, 10], or both [11]. The regularization terms are always estimated, in various ways, from the dataset being analyzed to account for experiment-specific behavior of the data. Recent work by Barrera *et al*. [12] attempts to identify DEGs directly from the probe-level data using a two-way ANOVA approach. Their results suggest that this class of approaches can outperform probe-set level methods, or give comparable results with fewer replicates, but their validation relies on the Affymetrix Latin Square dataset, which contains DEGs with concentration ratios of 2 and above for a small number of genes (a total of 126 spiked genes). All statistical methods tested (data not shown) perform well on this dataset, and it is often difficult to quantify slight improvements in accuracy.
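The "constant added to the standard deviation" variant of the regularized *t*-statistic can be sketched as follows. This is a minimal illustration, not any of the cited implementations: the fudge constant `s0` is fixed here, whereas in practice it is estimated from the whole dataset, and the simulated intensities are invented:

```python
import numpy as np

def regularized_t(x, y, s0=0.5):
    """Per-gene regularized t: a constant s0 is added to the pooled
    standard error to damp spuriously large statistics for genes with
    very small estimated variance. x, y are (genes x replicates)."""
    nx, ny = x.shape[1], y.shape[1]
    mx, my = x.mean(axis=1), y.mean(axis=1)
    # pooled within-condition variance, per gene
    sp2 = (((x - mx[:, None]) ** 2).sum(axis=1) +
           ((y - my[:, None]) ** 2).sum(axis=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    return (my - mx) / (se + s0)

rng = np.random.default_rng(0)
x = rng.normal(8.0, 0.3, size=(5, 3))      # 5 genes, 3 control replicates
y = x + rng.normal(0.0, 0.3, size=(5, 3))  # 3 treatment replicates
y[0] += 2.0                                # gene 0 is differentially expressed
t = regularized_t(x, y)
print(int(np.abs(t).argmax()))             # → 0
```

Without `s0`, a gene whose few replicates happen to agree closely would get an enormous *t* value regardless of its effect size; the constant floors the denominator and suppresses these false positives.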

In this paper, I propose a method that achieves better accuracy than the preferred pipeline identified by Choe *et al*. [7] and demonstrate that this method does not require the three technical replicates provided by the dataset to correctly identify 75% of the DEGs with less than 10% false positives. The proposed method, called PL-LM (Probe-Level Linear Model), uses a linear model to fit the treatment effect versus a baseline control directly on the probe-level data. The treatment effect and average intensity fitted by the linear model are then modeled by a Gaussian Mixture Model (GMM), which is used to separate the DEGs from the non-DEGs. The outcome of the GMM is the probability of each probe-set belonging to the cluster of DEGs; thus the PL-LM method as a whole should be characterized as a feature selection algorithm. This second stage of the analysis borrows an idea introduced by Jia and Xu [13], who used a GMM to cluster genes showing similar expression responses versus a quantitative phenotype. The average intensity and treatment effect in PL-LM are equivalent to β_{k0} and β_{k1} in [13]. There are a few fundamental differences between the two approaches: 1) in PL-LM, the optimization of the GMM is decoupled from the fitting of the linear model; 2) the linear model of PL-LM is based on probe-level data; and 3) the clustering step aims essentially at separating DEGs from non-DEGs.
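The probe-level linear-model stage can be illustrated with an ordinary least-squares fit. This is an illustrative sketch, not the exact PL-LM formulation: for one probe-set, the log2 intensity of each observation is modeled as a probe-specific affinity plus a treatment effect shared across probes, and the probe-set values below are synthetic:

```python
import numpy as np

def probe_level_effect(intensities, treated):
    """Fit, for one probe-set, log2 intensity = probe affinity + shared
    treatment effect on treated arrays, by ordinary least squares.
    `intensities` is (probes x arrays); `treated` is a boolean per array.
    Returns (average intensity, treatment effect)."""
    p, n = intensities.shape
    y = intensities.ravel()                        # one row per observation
    X = np.zeros((p * n, p + 1))
    for i in range(p):
        X[i * n:(i + 1) * n, i] = 1.0              # probe-affinity dummies
        X[i * n:(i + 1) * n, p] = treated.astype(float)  # shared effect
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[:p].mean(), beta[p]

treated = np.array([False, False, False, True, True, True])
# Hypothetical probe-set: 4 probes, 3 control + 3 treated arrays (log2
# scale); each probe shifts by exactly +1 under treatment.
probes = np.array([[7.0, 7.1, 6.9, 8.0, 8.1, 7.9],
                   [8.5, 8.4, 8.6, 9.5, 9.4, 9.6],
                   [6.0, 6.1, 5.9, 7.0, 7.1, 6.9],
                   [7.8, 7.9, 7.7, 8.8, 8.9, 8.7]])
avg, effect = probe_level_effect(probes, treated)
print(round(effect, 2))  # → 1.0
```

In a full analysis, the fitted (average intensity, treatment effect) pairs across all probe-sets would then be clustered with a two-component GMM, and each probe-set's posterior probability of belonging to the DEG component would serve as its score.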