Statistical analysis of differential gene expression relative to a fold change threshold on NanoString data of mouse odorant receptor genes
 Evelien Vaes^{1},
 Mona Khan^{1} and
 Peter Mombaerts^{1}Email author
DOI: 10.1186/147121051539
© Vaes et al.; licensee BioMed Central Ltd. 2014
Received: 24 June 2013
Accepted: 14 January 2014
Published: 4 February 2014
Abstract
Background
A challenge in gene expression studies is the reliable identification of differentially expressed genes. In many highthroughput studies, genes are accepted as differentially expressed only if they satisfy simultaneously a p value criterion and a fold change criterion. A statistical method, TREAT, has been developed for microarray data to assess formally if fold changes are significantly higher than a predefined threshold. We have recently applied the NanoString digital platform to study expression of mouse odorant receptor genes, which form with 1,200 members the largest gene family in the mouse genome. Our objectives are, on these data, to decrease false discoveries when formally assessing the genes relative to a fold change threshold, and to provide a guided selection in the choice of this threshold.
Results
Statistical tests have been developed for microarray data to identify genes that are differentially expressed relative to a fold change threshold. Here we report that another approach, which we refer to as tTREAT, is more appropriate for our NanoString data, where false discoveries lead to costly and timeconsuming followup experiments. Methods that we refer to as tTREAT2 and the running fold change model improve the performance of the statistical tests by protecting or selecting the fold change threshold more objectively. We show the benefits on simulated and real data.
Conclusions
Genewise statistical analyses of gene expression data, for which the significance relative to a fold change threshold is important, give reproducible and reliable results on NanoString data of mouse odorant receptor genes. Because it can be difficult to set in advance a fold change threshold that is meaningful for the available data, we developed methods that enable a better choice (thus reducing false discoveries and/or missed genes) or avoid this choice altogether. This set of tools may be useful for the analysis of other types of gene expression data.
Keywords
Gene expression studies Statistical test Differential gene expression Fold change criterion NanoStringBackground
Multiplex gene expression studies that are based on microarrays and nextgeneration sequencing result in the generation of large datasets. The simultaneous analysis of the expression of thousands of genes in these highthroughput approaches challenges statistical methods and data interpretation. Traditional statistical tests and analysis tools are therefore often insufficient.
One of the most popular types of biological experiments is a twosample comparison. Gene expression studies often seek to identify genes that are Differentially Expressed (DE) between RNA samples from two types of biological conditions, such as gene knockout mice compared to wildtype mice. DE genes can give insights into biological mechanisms or pathways, and form the basis for further experiments. The traditional statistical method for identifying DE genes between two samples is the student’s ttest. But a microarray comparative experiment is faced with simultaneously assessing a large number of genes based on a small number of biological or technical replicates. Assessing such a large dataset with a small sample size challenges statistical methods: dealing with many genes but few replicates may lead to large Fold Changes (FC) driven by outliers, and to small error variances [1, 2]. In order to overcome such problems, the ttest has been modified for microarray data analysis. The general idea behind these modifications is to obtain more stable estimates of the error variance of a given gene by “borrowing” information from all available genes. This goal is often obtained by applying (empirical) Bayesian methods. Examples of these modified tests for microarray data are the SAM test [3], the regularized ttest [3, 4], and the Bstatistic [1, 5], as reviewed [2, 6, 7].
Other statistical approaches for microarray data analysis have introduced linear models [2, 8–10]. These models allow for more flexibility, for instance when comparing more than two samples or introducing additional sources of variation. Such an example is Lin et al. [11], where a complex experimental design and research goal are addressed by setting various contrasts in the design of the linear model. The bioconductor package limma, developed by Smyth [12], applies a genewise linear model, and allows for the analysis of complex experiments (comparing many RNA samples), as well as more simple replicated experiments with two RNA samples. The package has further developed the ideas of Lönsstedt and Speed [1], which have been reset within the framework of linear models in order to address the challenges of microarray data analysis. The statistic (applied in limma) to identify DE genes is referred to as the moderated tstatistic, denoted by t* [5, 12]. In the calculation of t*, shrinkage of the estimated error variances towards a pooled estimate is obtained through an empirical Bayes approach.
For certain biological problems, it is important to rank genes according to their FC or to impose on a gene to attain a predefined FC threshold before calling it DE. In such situations there remains often a disconnect between the concepts of a statistically significant differential expression (based on the p value) and a biologically meaningful differential expression (FC higher than a predefined threshold value). Even the modified tests may result in genes with small FCs to be considered statistically significant.
In order to integrate these statistical and biological concepts, a gene can be defined as DE when it satisfies a p value and a FC criterion [9, 13–15]. The advantage of this combined approach is that the ttest or any of the modified tests can be combined with an adhoc FC criterion. The disadvantage of an adhoc FC criterion is that it does not take into account error variance and therefore offers no statistical confidence about future results. In addition, depending on the choice of the FC criterion and significance criterion, various interpretations of the same dataset are possible [16]. The moderated tstatistic [5] has been extended by McCarthy and Smyth into a new “test relative to a FC threshold”, abbreviated TREAT [17]. This method assesses formally whether the true differential expression is greater than a predefined FC criterion. TREAT offers greater specificity and reproducibility in identifying DE genes, compared to the combined approach of statistical test and adhoc FC criterion. TREAT has been also added to the limma package.
We have previously applied the NanoString digital platform [18] to study the expression of odorant receptor (OR) genes in mice [19, 20]. In contrast to microarray data, where analog levels of fluorescent intensity are measured, NanoString data represent digital readouts of single molecules in the form of probe counts. These probes contain unique fluorescent bar codes, and RNA abundance of up to 800 genes can be analyzed in a single reaction in a single tube. The NanoString technology thus places itself between qPCR and microarrays in terms of throughput level [21]. We have analyzed with NanoString the expression of half of the OR gene repertoire of ~1200 genes [19]. Here, we have explored statistical tools for our NanoString data, and developed a systematic approach for identifying DE genes with respect to a given FC threshold.
We explored the moderated tstatistic (t*), the derivation of which was driven by microarray data (high throughput), versus the classical tstatistic on comparative NanoString experiments (medium throughput). We found that t* does not show a protective effect (i.e. fewer false discoveries) over t on our NanoString data. But we also wanted to test whether differential expression is greater than a FC threshold. Therefore we used the analysis relative to a FC threshold together with the classical tstatistic in two approaches. The first approach is similar to TREAT as published for t* [17], and we refer to it as tTREAT. Then we addressed the arbitrary choice of the FC criterion itself, and developed a twostage approach, tTREAT2. We describe the performance of TREAT, tTREAT, and tTREAT2 on our NanoString data, both in data simulation experiments and on biological data.
The variability of the FC of a gene is inversely related to the expression level of that gene; lowly expressed genes tend to have a greater error in their measured FC levels [4, 22]. These lowly expressed genes can thus more easily reach a certain FC threshold, and the inverse is true for highly expressed genes. A nonlinear model, the Limit Fold Change model (LFC), has been developed to identify DE genes based on this relation [22]. We have applied a similar LFC model on our NanoString data. We did not use the LFC model as a tool for identifying DE genes but to set appropriate FC thresholds for genes with various ranges of expression levels, in order to avoid a subjective choice of FC criterion. The FC thresholds that we thus derived were then used in a subsequent analysis relative to a threshold in order to identify the DE genes. We refer to this setup as the running FC model, and illustrate its use and benefit on biological data.
Results
Biological data of odorant receptor gene expression
The main olfactory epithelium is located in the nasal cavity of the mouse, and detects volatile chemicals (odorants) in the inhaled air. The sense of smell must detect chemical stimuli with an immense variety in physicochemical properties. To accommodate this broad recognition, the mammalian olfactory system has evolved a large repertoire of molecular receptors, odorant receptors. These receptors are expressed by olfactory sensory neurons in the main olfactory epithelium. In the mouse, ~1200 odorant receptors are encoded by distinct genes, which form the largest gene family in its genome. It is widely accepted that a mature olfactory sensory neuron expresses only one of the ~1200 odorant receptor genes. The molecular and genetic mechanisms that regulate this one receptor  one neuron rule remain unclear, and are the focus of our research. We have demonstrated the role of the H element [23] and the P element [24] in the regulation of expression of clusters of odorant receptor genes by genetically engineering mouse strains that lack the H element [25] or the P element [19]: these are the ΔH and ΔP strains and the ΔHxΔP double knockout strain [19]. Another mouse strain is the ΔOlfr7Δ strain [19]; by means of chromosome engineering [26], we excised a 2.4 megabase region on Chromosome 9 that contains the Olfr7 cluster consisting of 99 OR genes [27]. We have also characterized temporal expression patterns of 531 odorant receptor genes in adult and aged mice [20].
Here we used datasets [19] from a NanoString analysis of 558 OR genes comparing knockout versus wildtype mouse strains. Specifically, we used NanoString data obtained from six mutant mice of the ΔHxΔP strain (cartridge MK29) compared to 12 control (wildtype) mice (cartridges MK29 and MK37, six mice each); these 18 mice are in a mixed genetic background, C57BL/6 J × 129/SvEv. Another dataset was obtained from six mice of the ΔOlfr7Δ strain, compared to six control (wildtype) mice (cartridge MK38); these 12 mice are in pure genetic background, 129/SvEv. With NanoString CodeSet Gorilla, we determined the RNA abundance for 558 OR genes from 1 μg RNA of whole olfactory mucosa tissue samples. Each lane of a NanoString cartridge represents a different RNA sample and mouse. Thus, there are 6–12 biological replicates per biological condition, and no technical replicates.
Approaches relative to a FC threshold
Our novel method tTREAT is similar to TREAT [17]. It is applied to the regular student’s tstatistic, and requires a predefined FC threshold τ. For the twostage design, tTREAT2, a second threshold θ, with θ>τ, is used in a first “stop and go” stage in which it is decided whether a gene is nonDE (stop) or whether it can proceed to the next stage (go). Then, for all the “go” genes, a tTREAT test with FC threshold τ is applied. In the running FC model, a nonlinear model for FC versus average gene expression is first used to determine various FC thresholds τ_{ i } for a number of ranges of expression levels. Genes are then binned in k gene expression levels, and the appropriate τ_{ i } is used per concentration bin in a subsequent analysis relative to a FC threshold.
Simulated data
Because NanoString data are not yet as widely studied as microarray data, we have conducted a data simulation procedure that represents one of our typical twogroup comparison NanoString experiments.
The procedure for simulating one NanoString dataset was conducted according to the following steps and distributional assumptions:

To get a general idea about the variances of NanoString gene expression data, the genes in the ΔHxΔP dataset were used as an example gene population. Biological data from ΔHxΔP mice were chosen, as they represent a noisier dataset due to the mixed genetic background of this strain [19]: the resulting simulated data will not represent the cleanest example. The ΔHxΔP dataset was used only for the next step of the simulation exercise.

Subsequently, 100 real variances across genes, ${\sigma}_{g}^{2}$, were drawn from an inversegamma distribution: Inv − Gamma(g_{1}, g_{2}). The parameters g_{1} and g_{2} are estimated by a Maximum Likelihood procedure in the above described gene population.

Each of the 100 ${\sigma}_{g}^{2}$ gave rise to three randomly drawn real differences β_{ g }. The ${\sigma}_{g}^{2}$ and three corresponding β_{ g } were included in one of the following three gene groups:
(Group 1) DE genes: The β_{ g } were drawn from a Gaussian $~\mathcal{N}\left(0,{\sigma}_{g}{v}_{\mathit{start}}\right)$ and had to satisfy the criterion: β_{ g } ≥ log_{2}(ω).
(Group 2) NonDE genes, noisy: Similarly, β_{ g } drawn from $~\mathcal{N}\left(0,{\sigma}_{g}{v}_{\mathit{start}}\right)$ but with β_{ g } < log_{2}(ω).
(Group 3) NonDE genes, regular: Obviously for this group the real β_{ g } were set to 0.
Note that empirically, the normal distribution seems acceptable for the β_{ g }, based on our NanoString data.

The β_{ g } that were produced in the previous step served as true differences that were then subsequently used to simulate three possible estimates ${\widehat{\beta}}_{g}$ from a Gaussian $~\mathcal{N}\left({\beta}_{g},{\sigma}_{g}\sqrt{{z}_{g}}\right)$. Similarly residual variances were drawn from a Chisquare distribution: $~\frac{{\sigma}_{g}^{2}}{d{f}_{g}}{\chi}^{2}\left(d{f}_{g}\right)$
Since the quantity ω used to initiate the simulation as described above defines the DE genes by the rule β_{ g } ≥ log_{2}(ω), the distributional center of the ${\widehat{\beta}}_{g}$ of DE genes actually lies a little further than the actual value of ω, depending on the value of their variance: ${\sigma}_{g}\sqrt{{z}_{g}}$. Henceforth, we will refer to ω as the FC ω with respect to which the data was simulated.
Every such simulated dataset was thus initially based on 100 ${\sigma}_{g}^{2}$ leading to a total of 900 genes. The regular nonDE genes (group 3) were fixed as 20% of the 900 genes. However, to assess whether the number of DE genes in a dataset influences the results, the percentage of DE genes (and thus also the percentage of the noisy nonDE genes) was varied between 1% and 40% (noisy nonDE genes: 79% and 40%, respectively).
 (1)
TREAT with regard to FC threshold τ
 (2)
tTREAT with regard to FC threshold τ
 (3)
tTREAT2 with a bilateral p value calculation in the second stage with regard to FC thresholds θ (stage 1) and τ (stage 2)
As the twostage design tTREAT2 relies on the choice of two related FC thresholds, one for each of the two stages, it is not straightforward to compare the performance of tTREAT2 to TREAT and/or tTREAT. Therefore we have derived various experimental schemes to show in which situations the use of a twostage design like tTREAT2 can be beneficial.
Results on simulated data
We simulated 400 realizations of a twogroup comparison NanoString experiment with 900 genes for which the DE genes are simulated with respect to a certain FC difference ω as described above. As such we know beforehand to which group (DE or nonDE) each gene in these 400 datasets belongs. We can therefore assess the performance of the procedures described in the methods section. For this purpose, we fix v_{ start } = 8 and the significance (type I error) was set at 0.05 or at 0.01. The mean of the false discoveries (false positives) and missed genes (false negatives) over the 400 generated datasets is plotted for the various statistical approaches for FCs ω (the FC difference with respect to which the data are simulated) and τ (the actual FC threshold used in the testing procedure). We also report mean Area Under the ROCcurve (AUC) values over the 400 generated data sets.
Area under the ROC curve (AUC) for various methods on simulated data
AUC results completing Figure1  

Simulated FC between test and control = 1.3  
1% DE genes  10% DE genes  20% DE genes  
TREAT, FC = 1.3  AUC* [95%PI]  96.1 [95.696.7]  96.3 [96.296.5]  96.4 [96.396.6] 
tTREAT, FC = 1.3  AUC* [95%PI]  95.9 [95.396.4]  96.2 [96.196.3]  96.2 [96.196.3] 
Simulated FC between test and control = 1.5  
1% DE genes  10% DE genes  20% DE genes  
TREAT, FC = 1.5  AUC* [95%PI]  97.3 [96.897.7]  97.3 [97.297.5]  97.3 [97.297.4] 
tTREAT, FC = 1.5  AUC* [95%PI]  97.0 [96.597.5]  97.0 [96.897.1]  96.9 [96.897.0] 
Simulated FC between test and control = 2  
1% DE genes  10% DE genes  20% DE genes  
TREAT, FC = 2  AUC* [95%PI]  98.0 [97.798.4]  98.2 [98.198.3]  98.2 [98.198.3] 
tTREAT, FC = 2  AUC* [95%PI]  97.3 [96.897.8]  97.5 [97.397.6]  97.5 [97.397.6] 
AUC results completing Figure 2  
Simulated FC between tets and control = 1.5 and Stringent test  
1% DE genes  10% DE genes  20% DE genes  
tTREAT, FC = 1.5  AUC* [95%PI]  96.7 [96.297.2]  97.0 [96.997.2]  97.0 [96.897.1] 
tTREAT, FC = 2.5  AUC* [95%PI]  91.1 [89.592.7]  93.7 [93.594.0]  93.7 [93.594.0] 
tTREAT2, FC = 2.5/1.5  AUC* [95%PI]  96.7 [96.197.3]  97.3 [97.197.4]  97.2 [97.197.4] 
AUC results completing Figure 3  
Simulated FC between test and control = 2.5 and nonstringent test  
1% DE genes  10% DE genes  20% DE genes  
tTREAT, FC = 2.5  AUC* [95%PI]  97.6 [97.298.1]  97.6 [97.497.7]  97.6 [97.597.7] 
tTREAT, FC = 1.5  AUC* [95%PI]  97.7 [97.498.0]  97.7 [97.697.8]  97.8 [97.797.8] 
tTREAT2, FC = 2.5/1.5  AUC* [95%PI]  96.7 [96.397.1]  96.7 [96.696.9]  96.7 [96.696.8] 
Benefit of tTREAT2 for a biological dataset
The running FC model
Discussion
Gene expression has been studied extensively for a wide variety of biological problems. Over the years various techniques have been developed for and applied to gene expression analysis. Ordered from low to high throughput, these techniques include qPCR, NanoString, microarrays, and RNA sequencing. These techniques come with their own specifications, advantages and disadvantages [21, 28] but each also with specific challenges for statistical analysis. Microarrays have thus far been the most widely used approach to study gene expression, and have therefore led to numerous analytical tools and statistical tests. Many of these tools can also be applied to NanoString data or qPCR data, and some have resulted in practical software packages for NanoString [29, 30].
The moderated t (referred to as t*) based TREAT, which was developed for microarray data [5, 17] works well on our NanoString data. But we found that tTREAT with regular t is even more appropriate. Moreover, we propose an alternative approach that defines a safety margin around the FC threshold. This tTREAT2 application in two subsequent stages can be beneficial when data are expected to be noisy around a chosen FC threshold. We also developed a technique for finding FC thresholds that vary with the expression level of genes, followed by applying these objectively set thresholds in tTREAT. In principle, this running FC model can be applied to gene expression data obtained with any technique.
The goal of this study is not to determine which test has the highest performance on NanoString data, although we describe the overall performance (AUC values) of TREAT, tTREAT and tTREAT2 on simulated data. Statistical tests like TREAT, finetuned to gene expression data, are difficult to outperform. In certain situations, however, an alternative test may prove more beneficial for the technology that is used and the problem that is studied. In our studies of odorant receptor gene expression, a falsely discovered gene leads to extensive followup experiments such as in situ hybridization and even generation of a new mouse strain by gene targeting. Such followup experiments are costly, lengthy, and timeconsuming. We thus need to minimize false discoveries. When using NanoString data, we find that tTREAT protects more against false discoveries at significance cutoffs of p = 0.05 and p = 0.01. It is perhaps counterintuitive that tTREAT instead of TREAT is more beneficial for our dataset, as TREAT and the moderated t statistic are known for their ability to decrease false discoveries in microarray experiments.
There are four main messages. First, we demonstrate with data simulations that tTREAT is more appropriate for our NanoString data than TREAT. We used biological data to initiate data simulation experiments, and find that tTREAT results in fewer false discoveries compared to TREAT when using significance cutoffs of p = 0.05 or p = 0.01. But when the percentage of DE genes in the simulated data is high, the advantage of tTREAT over TREAT in terms of false discoveries is countered by tTREAT missing more DE genes. When the overall performance of TREAT and tTREAT is compared by AUCs (thus independently of the chosen p value), there is no difference on our simulated data. The TREAT Bayesian approach shrinks (or blasts) the individual gene sample variances towards a pooled estimate, which works particularly well when the number of replicates (arrays) is small; often no more than two replicates are available for microarrays [5]. The NanoString counts in our dataset are not noisy and the throughput is much lower (558 genes) compared to a microarray experiment. So some of the genewise variances may be shrunk (or blasted) by too large a factor when applying TREAT on our NanoString data, as the hierarchical Bayesian model is expecting a few outlier variances.
Second, when noisy data are expected, with many nonDE genes showing signs of being increased or decreased, it is beneficial to use a safety margin around the chosen FC threshold in an analysis relative to a FC threshold. A good choice seems to set θ to τ + 0.5 or τ + 1.
Third, the particular choice of a FC threshold can influence the interpretation of biological data [16]. The running FC model can be applied in order to avoid the subjective choice of a FC threshold, or when data vary strongly with expression levels. We used the same dataset to determine the FC thresholds for various gene expression levels and to apply tTREAT, which may seem suboptimal. It would be ideal to determine FC thresholds on an independent dataset before using these thresholds in the subsequent analysis of the actual data. But often obtaining these independent data may be too expensive or impossible. We believe that performing the running FC model on the same dataset gives accurate results despite its datadriven nature in such circumstances.
Fourth, the choice of analytical tools must be driven by the research goals, the gene expression technique, and the hypothesis that is being tested. TREAT, tTREAT, or tTREAT2 either give a high number of false discoveries and few missed genes, only a few false discoveries for a higher number of missed genes, or an average but similar number of false discoveries and missed genes. The inevitable tradeoff between false discoveries and missed genes must be evaluated dependent on the experimental context. In our case  NanoString studies of odorant receptor gene expression  we want to minimize the false discoveries, because the validation and followup experiments are laborious and expensive. But for other projects, the inverse may be desirable: the number of missed genes needs to be minimized, for instance in largescale screening of molecules against a drug target, where missing a potential blockbuster drug cannot be afforded.
In the running FC model, the choice of the FC percentile in each bin in the development of the LFC model (step 1) as well as the choice of the k bins when applying the model (in step 3) affect the final results of the running FC model, thus the DE genes discovered. Depending on the percentage of DE genes that is expected, percentile values ranging from P55 (many DE genes are expected) to P95 (a few DE genes are expected) are deemed acceptable. Regarding the number of k bins, a good coverage of the full expression range of the genes under analysis should be ensured.
Performance results on a single simulated dataset (instead of 400) would have been subject to chance (in favor or disfavor of one method). Repeating the simulations 400 times permits an understanding of the variability in test performance that may be obtained between different simulations. The variability of our mean results is assessed by the 95% prediction intervals around the mean number of false discoveries (missed genes respectively) on these 400 data simulations. Neither of the statistical tests shows outliers in performance that could contradict our conclusions. Our data simulation procedure requires a “real” data input at the beginning only to get an idea about the distribution of gene error variances for NanoString. This input came from our data as described in the results. We also used an independent, published NanoString data set with ~250 mouse genes (cellcycle and G2/M related functions) [31] to initiate the set of data simulation experiments again. We find that simulations initiated by these data lead to very similar results (data not shown).
We present approaches for testing the significance of gene expression relative to a FC threshold, with a focus on our NanoString data on OR genes. When carrying out these hundreds of significance tests simultaneously, the issue of multiple testing can be raised. A significance analysis for microarrays (SAM) test has been developed [3] that encompasses both a test statistic for ranking (a slightly tuned version of student’s t) and a way to control the False Discovery Rate and thus to adjust for multiple testing. The advantage of our approaches is that they provide p values that can be adjusted by a multiple testing method of choice, ranging from Bonferroni to Benjamini and Hochberg [32]. There is a wide variety of adjustment procedures for multiple testing. The choice of the most appropriate procedure for a given analysis is not straightforward. Guidelines and criteria to be attributed to the multiple testing approaches have been described [2]. In general, the formulation of a composite nullhypothesis forces TREAT, tTREAT, and tTREAT2 to rely on a more conservative way of p value calculation [17]. The distribution of the resulting p values is thus skewed towards the larger values, which has consequences: (1) if a method for adjusting multiple testing problems relies on the uniformity of the distribution of the p values, it cannot be used as this criterion is not met and (2) we feel that when applied to NanoString data (for which the technical maximum is 800 genes), these already conservative p values should not be adjusted by yet another conservative method.
For the development of TREAT and initially also for t* [5, 17], it was decided to use genewise linear models with arbitrary coefficients and contrasts of interest (making them widely applicable) as a contextual and practical background; see also the limma package [12]. The tTREAT, tTREAT2, and the running FC model are also based on genewise linear models. Others [8] have used linear models, or, more precisely, analysis of variance models (ANOVA) to assess DE genes in microarray experiments. The main difference is that all genes are included in a single linear model, presenting the advantage that the normalization process is combined with the data analysis and thus prenormalization of the data is not necessary. Obviously for such an ANOVA model to be justified, its underlying model assumptions should be met, but they can be assessed quite straightforward. For a smaller NanoString CodeSet Century [19], which consists of 89 OR genes and 11 reference genes, we developed ANOVA models that also dealt with assessing the DE genes relative to a FC threshold. The negative controls could not be included in these ANOVA models, as their residuals were making the general residual distribution heavily tailed and therefore nonGaussian. Equality of variances was not met but within an acceptable borderline range on these genes. The results of these ANOVA models were not satisfactory, as the genes that were identified as DE genes (with FCs significantly higher than a certain predefined threshold) were sometimes very different from the TREAT, tTREAT and tTREAT2 results, and less biologically meaningful. ANOVA models including various genes in the same linear model can be applied on NanoString data and may be very useful for certain purposes. On our data, however, the genewise linear models and thus TREAT, tTREAT, or tTREAT2 were preferred over an approach that models all genes simultaneously.
Conclusions
Our genewise statistical analysis of gene expression data with significance relative to a FC threshold gives reproducible and reliable results on NanoString data of odorant receptor genes, the largest gene family in the mouse genome. Because it is difficult to set a biologically meaningful FC threshold in advance, we have developed methods that provide guidance in determining a stable FC threshold (in order to minimize false discoveries and/or missed genes) or in avoiding this choice altogether. By applying a twostage approach, a safety margin around the FC threshold can be set, which is beneficial in certain situations. Our running FC model identifies FC thresholds in an automated way, and is thus a more objective model leading to results that are more reproducible. This model is well suited for the problem of higher FC variances of lower expression values, thus avoiding a bias.
Methods
NanoString data
A NanoString experiment is performed using one or more NanoString cartridges, each containing 12 lanes. In each lane one RNA sample is assayed. The cartridge is scanned and imaged, resulting in digital readings (raw counts) for every barcode (gene) in every lane [18]. A macro can then be used to collect raw counts of cartridges of interest into an xls file. We imported these collected xls files into the R environment, version 2.14 [33]. Counts were processed to eliminate systematic experimental variability, differing amounts of input RNA, and variability in background, in three steps: first a normalization with respect to the geometric mean of the positive control spike counts, then a normalization with respect to the geometric mean of a group of five reference genes, and finally a background correction, which consists of subtracting the mean + 2*SD (standard deviation) of the negative control counts. Count values <1 were fixed to background level. We define a gene as informative if the median of the normalized NanoString counts in wildtype control mice is ≥100.
The FC for gene g can be estimated for any two samples as the ratio of the geometric mean of the normalized counts of first sample $\left({{\tilde{x}}_{g}}^{1}\right)$ over the geometric mean of the second sample’s counts $\phantom{\rule{0.25em}{0ex}}\left({{\tilde{x}}_{g}}^{2}\right):\mathit{FC}=\frac{{{\tilde{x}}_{g}}^{2}}{{{\tilde{x}}_{g}}^{1}}$. As ratios are naturally heavily skewed, log_{2}(FC) values, referred to as M values, are represented in most figures and analyses instead of the FCs. In MA plots, M values are plotted against the quantity A, $\left(\mathit{lo}{g}_{2}\left(\sqrt[2]{{{\tilde{x}}_{g}}^{1}.{{\tilde{x}}_{g}}^{2}}\right)\right)$, which represents an average of the NanoString counts for that gene. In MC plots, M values are plotted according to relative gene order on the chromosomes.
Usage of the limma package
A linear model is fitted to the NanoString expression data (digital counts) for each gene, which is key in the limma package [5, 12]. We use the log_{2} of digital NanoString counts as expression data for every lane for the limma application. Thus for a set of l lanes (assumed independent), there is a log_{2} counts vector ${\mathit{y}}_{g}^{T}=\left({y}_{g1},\cdots ,{y}_{\mathit{gl}}\right)$ for every gene g. For the linear model on any gene g, we write: E[y_{ g }] = X ψ_{ g }. The experimental design (i.e. which biological replicates belong to which RNA sample) is captured by the design matrix X, and ψ_{ g } represents the unknown vector of coefficients. What we are interested in, are certain contrasts, as given by the contrast’s matrix C (for example C^{ T } = (1, − 1) captures a two RNA group comparison): β_{ g } = C^{ T }ψ_{ g }. By fitting the linear model, the ψ_{ g } will be estimated which then easily leads to the estimates of the β_{ g }. Linear model theory also shows that $\mathrm{Cov}\left[{\widehat{\psi}}_{g}\right]={\sigma}^{2}{V}_{g}$, so the variancecovariance of the contrasts is deduced as follows: $\phantom{\rule{0.25em}{0ex}}\mathrm{Cov}\left[{\widehat{\beta}}_{g}\right]={\sigma}^{2}{\mathit{C}}^{T}{V}_{g}\mathit{C}$, for ease of notation let’s put C^{ T }V_{ g }C = Z_{ g }. Therefore, the student’s tstatistic for the contrasts in question is then typically written as follows: ${t}_{g}=\frac{{\widehat{\beta}}_{g}}{{s}_{g}\sqrt{{z}_{g}}}$, where s_{ g } represents the variance estimator of the unknown genewise variance.
tTREAT
Let τ be the FC threshold as predefined and M_{ g } the log_{2}(FC) of a gene g. If we take the example of a two RNA group comparison, ${M}_{g}={\overline{y}}_{g}^{2}{\overline{y}}_{g}^{1}$, i.e. the difference in arithmetic means of the log_{2} NanoString counts of the two groups. It can easily be seen how this comparison now comes down to a contrast when applying a linear model to gene g.
The idea of an analysis relative to a FC threshold is to test the following composite hypothesis: H_{0} : M_{ g } ≤ log_{2}(τ) against H_{1} : M_{ g } > log_{2}(τ). Thus, H_{0} is an interval of values for M_{ g } rather than a single value, which is normally 0. McCarthy and Smyth [17] determined the exact (and conservative) p value for this H_{0} on t* (called TREAT). We reformulate their ideas for t instead of t* as follows: Let t_{ obs } be the observed value of the tstatistic for a certain gene g, the p value now equals: p = P{T ≥ t_{ obs }  H_{0}}. As H_{0} is an interval, we may find an upper bound of this p value by choosing the element of H_{0} that is the most difficult to reject (hence conservative). So, let ${\widehat{M}}_{\mathit{obs}}$ be the observed value of ${\widehat{M}}_{g}$ and choose ${M}_{0}=min\left(\mathit{lo}{g}_{2}\left(\tau \right),\left{\widehat{M}}_{\mathit{obs}}\right\right)$. Then the actual p value is bounded above by p ≤ P{T ≥ t_{ obs }  M_{ g } = M_{0}}. Let δ$=\frac{{M}_{0}}{{s}_{g}\sqrt{{z}_{g}}}$ , then if t_{ obs } ≥ 0, p ≤ 2 * [P{T > t_{ obs } − δ M_{ g } = M_{0}}] and if t_{ obs } < 0, p ≤ 2 * [P{T < t_{ obs } + δ M_{ g } = M_{0}}]. In order to differentiate our test from t* and TREAT [17], we refer to it as tTREAT.
tTREAT2
In this twostage design, we reconsider the relative to a FC threshold idea at the point where the composite null hypothesis is formulated: H_{0} : M_{ g } ≤ log_{2}(τ) against H_{1} : M_{ g } > log_{2}(τ). The interval to which H_{0} applies, is thus H_{0} : − log_{2}(τ) < M_{ g } < log_{2}(τ). A larger interval, called I, is now defined around H_{0}, for example I : [−log_{2}(θ), log_{2}(θ)] with θ > τ. Then, in a first stage, called the stop or go stage, a 100% × (1 − α_{stage1}) confidence interval (CI) for M_{ g } is defined as usual: $\left[{M}_{g}{t}_{\mathit{df},1\raisebox{1ex}{${\alpha}_{\mathit{stage}1}$}\!\left/ \!\raisebox{1ex}{$2$}\right.}.{s}_{g}\sqrt{{z}_{g}},{M}_{g}+{t}_{\mathit{df},1\raisebox{1ex}{${\alpha}_{\mathit{stage}1}$}\!\left/ \!\raisebox{1ex}{$2$}\right.}.{s}_{g}\sqrt{{z}_{g}}\right]$, where df stands for the residual degrees of freedom. If this 100% × (1 − α_{stage1}) CI for M_{ g } falls strictly within I, it is decided to call the gene g as nonDE (stop), otherwise the gene g goes on to the second stage (go). For all the go genes, the second stage p value calculation is done exactly as for the tTREAT part above with regard to the interval defined by H_{0}. This approach allows for more flexibility, particularly when there is concern that the chosen tTREAT FC threshold may be too high. In such a case, θ could be set to a high value, and a lower value, e.g. θ − 1 or θ − 0.1, could be chosen for τ. In a second stage, the type I errors of both stages, α_{stage1} (defining the CI) and α_{stage2} (significance of tTREAT analysis on go genes), can be chosen freely and may differ from each other. Here we have chosen α_{stage1} to equal 0.05. The value of θ (and thus the length of the interval I) and its difference with regard to τ allow for a safety margin around τ, thus helping to reduce false positives in noisy data while keeping the false negatives fixed or reducing them as well.
Running FC model
By using a fixed FC threshold, genes with lower expression levels have a greater chance of being identified as DE because of their higher variance. In order to determine which FC threshold is appropriate for a certain expression level (or range of expression levels), we developed a running FC model. We used a model similar to [22] to find a potential FC cutoff function. Then we applied this function is to a number of ranges r of different expression levels, resulting in r different FC thresholds to use in any type of subsequently applied analysis. A stepbystep description of the running FC model is as follows.
1. Discrete relationship between FC and average gene expression levels:
Any twobytwo comparison of RNA groups is considered separately in the model. For simplicity, assume there are only two RNA groups as above. For the running FC model, the actual FC, $F{C}_{g}=\frac{{{\tilde{x}}_{g}}^{2}}{{{\tilde{\mathit{x}}}_{g}}^{1}}$ is calculated for every gene g. When FC_{g} is < 1, the reciprocal is taken. Then the M_{ g } = log_{2}(FC_{ g }) values are plotted against the average expression levels (AR_{g}) of the reference group (say group 1, in our case these are the control mice), $A{R}_{g}={{\tilde{x}}_{g}}^{1}$. Subsequently, the overall AR range is divided into bins of different widths but containing an equal number of genes. In each bin j (j = 1, ⋯ , m), the b%percentile of the FC_{ g } for all genes in that bin is determined and named bp_{ J }. All these bp_{ J } can be visualized on the MAR plot.
2. Fitting a continuous function:
As in [22], the equation bp = a + b/AR (where AR stands for the median of the AR_{g} values in a certain bin j) seems to fit the data quite well. A leastsquares fit of this equation is thus fitted to model the FC bpercentiles over various gene expression levels.
3. Defining the appropriate FC thresholds for various gene expression levels:
The model fit of step 2 is used to get actual FC threshold values for a number k of gene expression level ranges. In this case the bins of gene expression ranges are typically distributed more evenly over the full AR range, in contrast to the binning in step 1. For each of the i different ranges (i = 1, ⋯ , k) we apply the FC_{i}, as defined by the model fit of step 2, to all genes in that specific range i of expression values as a FC threshold value for any kind of analysis (TREAT, tTREAT, or tTREAT2). The complete analysis is then referred to as the running FC model.
Calculation of the Area Under the ROC curve (AUC)
The R opensource package pROC was used to calculate AUC values of statistical tests [34].
Software for the tools
A zip file of a folder containing the functions in R code [33] to use the analytic tools developed here (tTREAT, tTREAT2, running FC model, and MA and MC plots) is available as Additional file 4. The folder also contains an R script as an example of how to apply these functions. The ReadMe file illustrates and explains how to use the tools.
Availability of supporting data
The biological datasets (MK29, MK37 and MK38) supporting the results of this article are available in the NCBI Gene Expression Omnibus [35], and are accessible through GEO Series accession number GSE53876: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53876.
Abbreviations
 AUC:

Area under the ROC curve
 CI:

Confidence interval
 DE:

Differentially expressed
 FC:

Fold change
 FD:

False discoveries
 LFC:

Limit fold change model
 MG:

Missed genes
 qPCR:

quantitative polymerase chain reaction
 SAM:

Significance analysis of microarray
 TREAT:

t*tests relative to a threshold.
Declarations
Acknowledgements
We thank AnnaLena Hohmann for expert technical assistance, Annie Robert for useful discussions, and Dror Givon for helpful insights on the manuscript.
Authors’ Affiliations
References
 Lönnstedt I, Speed T: Replicated microarray data. Stat Sinica. 2002, 12: 3146.Google Scholar
 Yang YH, Speed T: Design and analysis of comparative microarray experiments. Statistical analysis of gene expression microarray data. Edited by: Speed T. 2003, Boca Raton, FL, USA: Chapman & Hall/CRC Press, 3591.Google Scholar
 Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001, 98: 1051510515.View ArticleGoogle Scholar
 Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized ttest and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509519. 10.1093/bioinformatics/17.6.509.View ArticlePubMedGoogle Scholar
 Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3PubMedGoogle Scholar
 Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002, 18: 546554. 10.1093/bioinformatics/18.4.546.View ArticlePubMedGoogle Scholar
 Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica. 2002, 12: 111139.Google Scholar
 Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol. 2000, 7: 819837. 10.1089/10665270050514954.View ArticlePubMedGoogle Scholar
 Jin W, Riley RM, Wolfinger RD, White KP, PassadorGurgel G, Gibson G: The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat Genet. 2001, 29: 389395. 10.1038/ng766.View ArticlePubMedGoogle Scholar
 Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS: Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol. 2001, 8: 625637. 10.1089/106652701753307520.View ArticlePubMedGoogle Scholar
 Lin DM, Yang YH, Scolnick JA, Brunet LJ, Marsh H, Peng V, Okazaki Y, Hayashizaki Y, Speed TP, Ngai J: Spatial patterns of gene expression in the olfactory bulb. Proc Natl Acad Sci U S A. 2004, 101: 1271812723. 10.1073/pnas.0404872101.View ArticlePubMed CentralPubMedGoogle Scholar
 Smyth GK: limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397420.View ArticleGoogle Scholar
 Patterson TA, Lobenhofer EK, FulmerSmentek SB, Collins PJ, Chu TM, Bao W, Fang H, Kawasaki ES, Hager J, Tikhonova IR, Walker SJ, Zhang L, Hurban P, de Longueville F, Fuscoe JC, Tong W, Shi L, Wolfinger RD: Performance comparison of onecolor and twocolor platforms within the MicroArray Quality Control (MAQC) project. Nat Biotechnol. 2006, 24: 11401150. 10.1038/nbt1242.View ArticlePubMedGoogle Scholar
 Peart MJ, Smyth GK, van Laar RK, Bowtell DD, Richon VM, Marks PA, Holloway AJ, Johnstone RW: Identification and functional significance of genes regulated by structurally different histone deacetylase inhibitors. Proc Natl Acad Sci U S A. 2005, 102: 36973702. 10.1073/pnas.0500369102.View ArticlePubMed CentralPubMedGoogle Scholar
 Raouf A, Zhao Y, To K, Stingl J, Delaney A, Barbara M, Iscove N, Jones S, McKinney S, Emerman J, Aparicio S, Marra M, Eaves C: Transcriptome analysis of the normal human mammary cell commitment and differentiation process. Cell Stem Cell. 2008, 3: 109118. 10.1016/j.stem.2008.05.018.View ArticlePubMedGoogle Scholar
 Dalman MR, Deeter A, Nimishakavi G, Duan ZH: Fold change and pvalue cutoffs significantly alter microarray interpretations. BMC Bioinforma. 2012, 13 (Suppl 2): S11S11. 10.1186/1471210513S2S11.View ArticleGoogle Scholar
 McCarthy DJ, Smyth GK: Testing significance relative to a foldchange threshold is a TREAT. Bioinformatics. 2009, 25: 765771. 10.1093/bioinformatics/btp053.View ArticlePubMed CentralPubMedGoogle Scholar
 Geiss GK, Bumgarner RE, Birditt B, Dahl T, Dowidar N, Dunaway DL, Fell HP, Ferree S, George RD, Grogan T, James JJ, Maysuria M, Mitton JD, Oliveri P, Osborn JL, Peng T, Ratcliffe AL, Webster PJ, Davidson EH, Hood L, Dimitrov K: Direct multiplexed measurement of gene expression with colorcoded probe pairs. Nat Biotechnol. 2008, 26: 317325. 10.1038/nbt1385.View ArticlePubMedGoogle Scholar
 Khan M, Vaes E, Mombaerts P: Regulation of the probability of mouse odorant receptor gene choice. Cell. 2011, 147: 907921. 10.1016/j.cell.2011.09.049.View ArticlePubMedGoogle Scholar
 Khan M, Vaes E, Mombaerts P: Temporal patterns of odorant receptor gene expression in adult and aged mice. Mol Cell Neurosci. 2013, 57: 120129.View ArticlePubMedGoogle Scholar
 Prokopec SD, Watson JD, Waggott DM, Smith AB, Wu AH, Okey AB, Pohjanvirta R, Boutros PC: Systematic evaluation of mediumthroughput mRNA abundance platforms. RNA. 2013, 19: 5162. 10.1261/rna.034710.112.View ArticlePubMed CentralPubMedGoogle Scholar
 Mutch DM, Berger A, Mansourian R, Rytz A, Roberts MA: The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data. BMC Bioinforma. 2002, 3: 1710.1186/14712105317.View ArticleGoogle Scholar
 Serizawa S, Miyamichi K, Sakano H: One neuronone receptor rule in the mouse olfactory system. Trends Genet. 2004, 20: 648653. 10.1016/j.tig.2004.09.006.View ArticlePubMedGoogle Scholar
 Bozza T, Vassalli A, Fuss S, Zhang JJ, Weiland B, Pacifico R, Feinstein P, Mombaerts P: Mapping of class I and class II odorant receptors to glomerular domains by two distinct types of olfactory sensory neurons in the mouse. Neuron. 2009, 61: 220233. 10.1016/j.neuron.2008.11.010.View ArticlePubMed CentralPubMedGoogle Scholar
 Fuss SH, Omura M, Mombaerts P: Local and cis effects of the H element on expression of odorant receptor genes in mouse. Cell. 2007, 130: 373384. 10.1016/j.cell.2007.06.023.View ArticlePubMedGoogle Scholar
 RamírezSolis R, Liu P, Bradley A: Chromosome engineering in mice. Nature. 1995, 378: 720724. 10.1038/378720a0.View ArticlePubMedGoogle Scholar
 Xie SY, Feinstein P, Mombaerts P: Characterization of a cluster comprising approximately 100 odorant receptor genes in mouse. Mamm Genome. 2000, 11: 10701078. 10.1007/s003350010217.View ArticlePubMedGoogle Scholar
 Beaume M, Hernandez D, Francois P, Schrenzel J: New approaches for functional genomic studies in staphylococci. Int J Med Microbiol. 2010, 300: 8897. 10.1016/j.ijmm.2009.11.001.View ArticlePubMedGoogle Scholar
 Brumbaugh CD, Kim HJ, Giovacchini M, Pourmand N: NanoStriDE: normalization and differential expression analysis of NanoString nCounter data. BMC Bioinforma. 2011, 12: 47910.1186/1471210512479.View ArticleGoogle Scholar
 Waggott D, Chu K, Yin S, Wouters BG, Liu FF, Boutros PC: NanoStringNorm: an extensible R package for the preprocessing of NanoString mRNA and miRNA data. Bioinformatics. 2012, 28: 15461548. 10.1093/bioinformatics/bts188.View ArticlePubMed CentralPubMedGoogle Scholar
 Chen HZ, Ouseph MM, Li J, Pécot T, Chokshi V, Kent L, Bae S, Byrne M, Duran C, Comstock G, Trikha P, Mair M, Senapati S, Martin CK, Gandhi S, Wilson N, Liu B, Huang YW, Thompson JC, Raman S, Singh S, Leone M, Machiraju R, Huang K, Mo X, Fernandez S, Kalaszczynska I, Wolgemuth DJ, Sicinski P, Huang T, et al: Canonical and atypical E2Fs regulate the mammalian endocycle. Nat Cell Biol. 2012, 14: 11921202. 10.1038/ncb2595.View ArticlePubMed CentralPubMedGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the False Discovery Rate  a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995, 57: 289300.Google Scholar
 Gentleman R, Ihaka R: The R Project for Statistical Computing. [http://www.rproject.org]
 Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M: pROC: an opensource package for R and S + to analyze and compare ROC curves. BMC Bioinforma. 2011, 12: 7710.1186/147121051277.View ArticleGoogle Scholar
 Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207210. 10.1093/nar/30.1.207.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.