Gene expression has been studied extensively for a wide variety of biological problems. Over the years, various techniques have been developed for and applied to gene expression analysis. Ordered from low to high throughput, these techniques include qPCR, NanoString, microarrays, and RNA sequencing. Each technique comes with its own specifications, advantages, and disadvantages [21, 28], and each poses specific challenges for statistical analysis. Microarrays have thus far been the most widely used approach to study gene expression and have therefore given rise to numerous analytical tools and statistical tests. Many of these tools can also be applied to NanoString or qPCR data, and some have resulted in practical software packages for NanoString [29, 30].

TREAT, which is based on the moderated t statistic (referred to as t*) and was developed for microarray data [5, 17], works well on our NanoString data. We found, however, that tTREAT, which uses the regular t statistic, is even more appropriate. Moreover, we propose an alternative approach, tTREAT2, that defines a safety margin around the FC threshold. This application of tTREAT in two subsequent stages can be beneficial when data are expected to be noisy around the chosen FC threshold. We also developed a technique for deriving FC thresholds that vary with the expression level of genes, and for applying these objectively set thresholds in tTREAT. In principle, this running FC model can be applied to gene expression data obtained with any technique.
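The core of a tTREAT-style test can be sketched in a few lines: the ordinary two-sample t machinery is kept, but the null hypothesis is composite, H0: |log2 FC| ≤ τ, and the conservative p value is taken at the worst-case null |μ| = τ, following the TREAT construction [17]. This is an illustrative sketch, not the authors' exact implementation; the function name and data are made up.

```python
import numpy as np
from scipy import stats

def ttreat_pvalue(group1, group2, tau):
    """p value for H0: |mean(group1) - mean(group2)| <= tau (log2 scale),
    using the ordinary (unmoderated) pooled-variance t statistic."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # pooled variance, as in the classical two-sample t test
    sp2 = ((n1 - 1) * np.var(group1, ddof=1) +
           (n2 - 1) * np.var(group2, ddof=1)) / df
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    m = abs(np.mean(group1) - np.mean(group2))
    # conservative p value at the worst-case null |mu| = tau:
    # sum of the two tail probabilities beyond (m - tau) and (m + tau)
    return stats.t.sf((m - tau) / se, df) + stats.t.sf((m + tau) / se, df)
```

For τ = 0 this reduces exactly to the two-sided two-sample t test; for larger τ the p value grows, reflecting the harder composite null.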

The goal of this study is not to determine which test performs best on NanoString data, although we describe the overall performance (AUC values) of TREAT, tTREAT, and tTREAT2 on simulated data. Statistical tests like TREAT, fine-tuned to gene expression data, are difficult to outperform. In certain situations, however, an alternative test may prove more beneficial for the technology that is used and the problem that is studied. In our studies of odorant receptor gene expression, a falsely discovered gene leads to extensive follow-up experiments such as in situ hybridization and even the generation of a new mouse strain by gene targeting. Such follow-up experiments are costly and lengthy. We thus need to minimize false discoveries. On NanoString data, we find that tTREAT protects better against false discoveries at significance cut-offs of p = 0.05 and p = 0.01. It is perhaps counterintuitive that tTREAT rather than TREAT is more beneficial for our dataset, as TREAT and the moderated t statistic are known for their ability to decrease false discoveries in microarray experiments.

There are four main messages. First, we demonstrate with data simulations that tTREAT is more appropriate for our NanoString data than TREAT. We used biological data to initiate the data simulation experiments, and find that tTREAT results in fewer false discoveries than TREAT at significance cut-offs of p = 0.05 or p = 0.01. But when the percentage of DE genes in the simulated data is high, the advantage of tTREAT over TREAT in terms of false discoveries is countered by tTREAT missing more DE genes. When the overall performance of TREAT and tTREAT is compared by AUCs (and thus independently of the chosen p value), there is no difference on our simulated data. The TREAT Bayesian approach shrinks (or inflates) the individual gene sample variances towards a pooled estimate, which works particularly well when the number of replicates (arrays) is small; often no more than two replicates are available for microarrays [5]. The NanoString counts in our dataset are not noisy, and the throughput is much lower (558 genes) than in a microarray experiment. Some of the gene-wise variances may therefore be shrunk (or inflated) by too large a factor when applying TREAT to our NanoString data, as the hierarchical Bayesian model expects a few outlier variances.
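The variance moderation underlying the moderated t [5] can be illustrated with the standard posterior formula: each gene's sample variance (on df degrees of freedom) is pulled towards a prior value s0² carrying d0 prior degrees of freedom. In practice d0 and s0² are estimated from all genes; here they are fixed constants for illustration only.

```python
import numpy as np

def moderate_variances(s2, df, d0, s0_sq):
    """Moderated variances: a degrees-of-freedom-weighted average of the
    gene-wise variances s2 (each on df d.f.) and the prior value s0_sq
    (d0 d.f.). Small df means strong pull towards s0_sq."""
    s2 = np.asarray(s2, dtype=float)
    return (d0 * s0_sq + df * s2) / (d0 + df)
```

With few replicates, extreme variances (both small and large) are pulled strongly towards the common value, which is exactly the behavior that can over-shrink (or over-inflate) gene-wise variances when, as in the NanoString dataset discussed above, few genuine outlier variances exist.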

Second, when noisy data are expected, with many non-DE genes showing signs of being increased or decreased, it is beneficial to use a safety margin around the chosen FC threshold in an analysis relative to a FC threshold. A good choice seems to be to set *θ* to *τ* + 0.5 or *τ* + 1.
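The effect of the safety margin can be sketched as follows: instead of testing against τ itself, the test is carried out against the enlarged threshold θ = τ + margin, so that genes whose fold changes hover noisily just above τ are less likely to be declared DE. This is only an illustration of the margin idea, not the full two-stage tTREAT2 procedure; the function and arguments are hypothetical.

```python
import numpy as np
from scipy import stats

def margin_test(m, se, df, tau, margin=0.5):
    """Conservative p value for H0: |true log2 FC| <= theta,
    with theta = tau + margin, given the estimated |log2 FC| m,
    its standard error se, and residual degrees of freedom df."""
    theta = tau + margin
    m = np.abs(np.asarray(m, dtype=float))
    # same worst-case tail construction as a TREAT-style test, but at theta
    return stats.t.sf((m - theta) / se, df) + stats.t.sf((m + theta) / se, df)
```

Increasing the margin monotonically increases the p value, so borderline genes near τ are filtered out at the cost of some power for genuine DE genes.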

Third, the particular choice of a FC threshold can influence the interpretation of biological data [16]. The running FC model can be applied to avoid the subjective choice of a FC threshold, or when data vary strongly with expression levels. We used the same dataset both to determine the FC thresholds for various gene expression levels and to apply tTREAT, which may seem suboptimal. It would be ideal to determine FC thresholds on an independent dataset before using these thresholds in the subsequent analysis of the actual data, but obtaining such independent data is often too expensive or impossible. In such circumstances, we believe that applying the running FC model to the same dataset gives accurate results despite its data-driven nature.

Fourth, the choice of analytical tools must be driven by the research goals, the gene expression technique, and the hypothesis that is being tested. TREAT, tTREAT, and tTREAT2 give either a high number of false discoveries and few missed genes, only a few false discoveries at the cost of a higher number of missed genes, or an intermediate but similar number of both. The inevitable trade-off between false discoveries and missed genes must be evaluated in the experimental context. In our case, NanoString studies of odorant receptor gene expression, we want to minimize false discoveries, because the validation and follow-up experiments are laborious and expensive. For other projects, the inverse may be desirable: the number of missed genes needs to be minimized, for instance in large-scale screening of molecules against a drug target, where missing a potential blockbuster drug cannot be afforded.

In the running FC model, both the choice of the FC percentile in each bin when developing the LFC model (step 1) and the choice of the k bins when applying the model (step 3) affect the final results, and thus the DE genes discovered. Depending on the percentage of DE genes that is expected, percentile values ranging from P55 (many DE genes expected) to P95 (few DE genes expected) are deemed acceptable. Regarding the number of bins k, good coverage of the full expression range of the genes under analysis should be ensured.
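The threshold-setting idea can be sketched as follows: genes are grouped into k bins by average expression, and within each bin a chosen percentile of the observed absolute log2 FCs serves as that bin's FC threshold. This is a simplified illustration of the binning and percentile choices discussed above, not the authors' exact running FC procedure; names and the equal-count binning scheme are assumptions.

```python
import numpy as np

def binwise_fc_thresholds(mean_expr, abs_log2fc, k=5, percentile=75):
    """Per-gene FC thresholds: the given percentile of |log2 FC| within
    each of k (roughly equal-count) expression bins."""
    mean_expr = np.asarray(mean_expr, dtype=float)
    abs_log2fc = np.asarray(abs_log2fc, dtype=float)
    # quantile-based bin edges cover the full expression range evenly
    edges = np.quantile(mean_expr, np.linspace(0, 1, k + 1))
    bins = np.clip(np.searchsorted(edges, mean_expr, side="right") - 1,
                   0, k - 1)
    thresholds = np.empty_like(abs_log2fc)
    for b in range(k):
        mask = bins == b
        thresholds[mask] = np.percentile(abs_log2fc[mask], percentile)
    return thresholds
```

Raising the percentile (towards P95) yields stricter thresholds, appropriate when few DE genes are expected; the quantile-based edges ensure each bin is populated across the expression range.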

Performance results on a single simulated dataset (instead of 400) would have been subject to chance (in favor of or against one method). Repeating the simulations 400 times allows us to assess the variability in test performance between different simulations. The variability of our mean results is assessed by the 95% prediction intervals around the mean number of false discoveries (and missed genes, respectively) over these 400 data simulations. None of the statistical tests shows outliers in performance that could contradict our conclusions. Our data simulation procedure requires “real” input data only at the start, to obtain an idea of the distribution of gene error variances for NanoString. This input came from our data, as described in the results. We also used an independent, published NanoString dataset with ~250 mouse genes (cell-cycle and G2/M-related functions) [31] to initiate the data simulation experiments anew. Simulations initiated with these data lead to very similar results (data not shown).
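A 95% prediction interval of the kind mentioned above can be computed with the standard normal-theory formula: mean ± t(0.975, n−1) · s · √(1 + 1/n), which predicts where the count from one new simulation run would fall. The exact construction used in the paper may differ; this sketch assumes approximately normal per-simulation counts.

```python
import numpy as np
from scipy import stats

def prediction_interval_95(x):
    """95% prediction interval for a new observation, given a sample x
    of, e.g., false-discovery counts from repeated simulations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m, s = x.mean(), x.std(ddof=1)
    # sqrt(1 + 1/n): variance of a new draw plus uncertainty in the mean
    half = stats.t.ppf(0.975, n - 1) * s * np.sqrt(1 + 1 / n)
    return m - half, m + half
```

Unlike a confidence interval for the mean, this interval does not shrink towards zero width as n grows; it converges to mean ± 1.96 s, describing between-simulation variability itself.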

We present approaches for testing the significance of gene expression relative to a FC threshold, with a focus on our NanoString data on OR genes. When hundreds of such significance tests are carried out simultaneously, the issue of multiple testing arises. The significance analysis of microarrays (SAM) test [3] encompasses both a test statistic for ranking (a slightly tuned version of Student's t) and a way to control the false discovery rate, and thus to adjust for multiple testing. An advantage of our approaches is that they provide p values that can be adjusted by a multiple testing method of choice, ranging from Bonferroni to Benjamini and Hochberg [32]. There is a wide variety of adjustment procedures for multiple testing, and choosing the most appropriate procedure for a given analysis is not straightforward; guidelines and criteria for the multiple testing approaches have been described [2]. In general, the formulation of a composite null hypothesis forces TREAT, tTREAT, and tTREAT2 to rely on a more conservative p value calculation [17]. The distribution of the resulting p values is thus skewed towards larger values, which has two consequences: (1) an adjustment method that relies on the uniformity of the p value distribution cannot be used, as this criterion is not met; and (2) we feel that when applied to NanoString data (for which the technical maximum is 800 genes), these already conservative p values should not be adjusted by yet another conservative method.
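Should adjustment nevertheless be desired, the p values from TREAT-style tests can be passed to any standard procedure. As an illustration, a minimal Benjamini-Hochberg [32] step-up adjustment (equivalent to R's `p.adjust(..., method = "BH")`) can be written as:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # raw BH quantities: p_(i) * n / i for sorted p values
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p value downwards
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```

Note that BH controls the FDR under independence or positive dependence; the skew of composite-null p values towards larger values, discussed above, makes it conservative here.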

For the development of TREAT, and initially also of t* [5, 17], gene-wise linear models with arbitrary coefficients and contrasts of interest were chosen as the practical framework, making them widely applicable; see also the limma package [12]. tTREAT, tTREAT2, and the running FC model are also based on gene-wise linear models. Others [8] have used linear models, or more precisely analysis of variance (ANOVA) models, to assess DE genes in microarray experiments. The main difference is that all genes are included in a single linear model, with the advantage that normalization is combined with the data analysis, so pre-normalization of the data is not necessary. For such an ANOVA model to be justified, its underlying model assumptions should be met, but these can be assessed quite straightforwardly. For a smaller NanoString CodeSet, Century [19], which consists of 89 OR genes and 11 reference genes, we developed ANOVA models that also assessed DE genes relative to a FC threshold. The negative controls could not be included in these ANOVA models, as their residuals made the overall residual distribution heavily tailed and therefore non-Gaussian. The equality-of-variances assumption was not fully met, but remained within an acceptable borderline range for these genes. The results of these ANOVA models were not satisfactory: the genes identified as DE (with FCs significantly higher than a certain predefined threshold) were sometimes very different from the TREAT, tTREAT, and tTREAT2 results, and less biologically meaningful. ANOVA models that include various genes in the same linear model can be applied to NanoString data and may be very useful for certain purposes. On our data, however, the gene-wise linear models, and thus TREAT, tTREAT, or tTREAT2, were preferred over an approach that models all genes simultaneously.
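The gene-wise setting shared by TREAT, tTREAT, and the running FC model amounts to one small least-squares fit per gene against a common design matrix, from which a contrast such as the group difference (the log FC) is read off. A minimal sketch, with an illustrative two-group design and made-up data:

```python
import numpy as np

def fit_gene(y, design):
    """Ordinary least-squares fit for one gene.
    Returns (coefficients, residual variance estimate)."""
    coef, res, rank, _ = np.linalg.lstsq(design, y, rcond=None)
    df = len(y) - rank
    # residual variance on df degrees of freedom (NaN if no residual d.f.)
    sigma2 = float(res[0]) / df if res.size else np.nan
    return coef, sigma2
```

In the two-group case the design has an intercept column and a group-indicator column, so the second coefficient is the estimated log FC; this per-gene estimate and variance feed directly into TREAT-style tests, whereas the single-model ANOVA approach discussed above would stack all genes into one design.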