Super-delta: a new differential gene expression analysis procedure with robust data normalization

Background Normalization is an important data preparation step in gene expression analyses, designed to remove various systematic noise. Sample variance is greatly reduced after normalization, hence the power of subsequent statistical analyses is likely to increase. On the other hand, variance reduction is made possible by borrowing information across all genes, including differentially expressed genes (DEGs) and outliers, which will inevitably introduce some bias. This bias typically inflates type I error; and can reduce statistical power in certain situations. In this study we propose a new differential expression analysis pipeline, dubbed as super-delta, that consists of a multivariate extension of the global normalization and a modified t-test. A robust procedure is designed to minimize the bias introduced by DEGs in the normalization step. The modified t-test is derived based on asymptotic theory for hypothesis testing that suitably pairs with the proposed robust normalization. Results We first compared super-delta with four commonly used normalization methods: global, median-IQR, quantile, and cyclic loess normalization in simulation studies. Super-delta was shown to have better statistical power with tighter control of type I error rate than its competitors. In many cases, the performance of super-delta is close to that of an oracle test in which datasets without technical noise were used. We then applied all methods to a collection of gene expression datasets on breast cancer patients who received neoadjuvant chemotherapy. While there is a substantial overlap of the DEGs identified by all of them, super-delta were able to identify comparatively more DEGs than its competitors. Downstream gene set enrichment analysis confirmed that all these methods selected largely consistent pathways. Detailed investigations on the relatively small differences showed that pathways identified by super-delta have better connections to breast cancer than other methods. Conclusions As a new pipeline, super-delta provides new insights to the area of differential gene expression analysis. Solid theoretical foundation supports its asymptotic unbiasedness and technical noise-free properties. Implementation on real and simulated datasets demonstrates its decent performance compared with state-of-art procedures. It also has the potential of expansion to be incorporated with other data type and/or more general between-group comparison problems. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1992-2) contains supplementary material, which is available to authorized users.

In this section, we will provide theoretical justifications to the super-delta method based on a mixed effects model for gene expression [1,2,3,4,5]. Specifically, we consider that within each phenotypic group (a = A, B), the (log)-expression level of the ith gene sampled from j = 1, 2, . . . , N a arrays are realizations of the following model y a ij = α j + x a ij , x a ij := µ a i + a ij , α j ∼ N (0, η 2 ), a ij ∼ N (0, σ 2 i ).
Here α j is a slide-specific factor, µ a i is the mean for each phenotype. x a ij can be viewed as the oracle expression level, which has biological variation and possibly some other independent variation but no slide-specific noise. We are interested in testing the following hypotheses for each i = 1, 2, ..., m: H Ideally, we would like to remove α j from Model (1) by a normalization method, and then test hypotheses (2) based on two-sample t-statistics constructed from x a ij , the "noise-free" expressions: Here N = N A + N B is the total sample size. We will call these t * i as the oracle t-statistics henceforth. In practice, even the most effective normalization cannot perfectly recover x a ij from y a ij . Our previous studies show that commonly used normalization methods can effectively reduce sample variance explained in part by α j , at a price of introducing certain bias to the normalized expressions, which in turn reduces statistical power and increases type I error rate [6,5].
In this study, we take a completely different approach. We will first compute pairwise differences called "deltas", defined as follows: Next, we compute two-sample t-statistics based on these deltas for all i = 1, 2, . . . , m; k = 1, 2, . . . , m; and i = k (a) Filter out a proportion (say c = 20%) of the most extreme test statistics from t i . Here the most extreme values are determined by |t ij |, the absolute values of t ij . This step ensures that most, if not all, δ ij 's normalized with DEGs will be excluded. Again we need to reiterate that filtering out 20% of t ij according to |t ij | is not equivalent to filtering out 10% of smallest t ij and 10% of largest t ij . If up/down regulations are unbalanced, filtering according to |t ij | will automatically take care of it yet filtering based on removing the smallest and largest 10% t ij 's will result in some bias.
(b) Now estimate the center from the rest 1 − c of the δ ij 's. We use the median for this purpose.
(c) In the end, we still need to multiply the trimmed median by √ 2.

The conditional interpretation of the estimation procedure
Because t ik are computed from the difference of the ith gene expression and all the rest genes, they are correlated by construction. Besides, t * i , the oracle statistic, is not a population characteristic so the very "estimation" needs to be interpreted in an unconventional way. In this section, we will provide a rigorous interpretation based on conditional inference. Proposition 1.1. When conditioned on a ij , δ a ik,j and δ a ik ,j are independent, for all k = k and k, k = i.
Proof. The proof is trivial because a ij is the source of randomness of the ith gene. Once it is fixed by conditioning, the only source of randomness in δ a ik,j is from a kj , which are independent of each other for different j's. Corollary 1.2. When conditioned on a ij , any t ik and t ik in t i (k = k ) are independent.
Apparently, the above corollary is true because t ik depends only on the values of δ a ik,j and t ik depends on δ a ik ,j ; and we already know that δ a ik,j and δ a ik ,j are conditionally independent. The relationship can be seen more clearly from the following flow chart: Let us denote by S 0 the set of all NDEGs (null genes) and S 1 the set of all DEGs. Obviously, Simple calculations lead to the following conclusions.
Proof. We can writeδ A ik in terms of¯ A i and¯ A k : Proposition 1.4. Sample variance of deltas computed in each phenotypic group, denoted by (s a ik ) 2 , a = A, B, has the following non-central (8) Here J denotes an N a × N a -dimensional matrix with all entries being 1. Simple calculation shows that (I − 1 n J) is symmetric, idempotent, with rank N a − 1. In fact, it is the projection matrix onto span(1) ⊥ , where 1 is a vector of all 1s. When conditioned on a ij , a ij − a kj is a normal distribution with mean a ij and variance σ 2 k . Therefore , has the following conditional non-central χ 2 distribution representation Here N = N A + N B is the total sample size and (σ p ,i ) 2 is the pooled sample variance of i computed from both phenotypic groups.
Proof. This is just because a kj −¯ a k is independent of¯ a k , for all j.
Based on Proposition 1.3, Corollary 1.5, and Proposition 1.6, we have the following conclusion.
Lemma 1.7. t ik computed from deltas follows a doubly non-central t-distribution as follows (10) Proof.
Lemma 1.8. The conditional mean value of t ik given i· , which is a doubly noncentral t-distribution, has the following large-sample approximation In particular, for k ∈ S 0 , for k ∈ S 1 , Proof. For simplicity, let us skip "| i· " in the following derivation. Namely, all expectations/variance in the following proof are conditioned on i· .

Let us denoteδ
by U and σ k s p ik by W , so t ik := U · W . By construction, we know that First, we will calculate some useful moments of W based on the large sample approximation.
Since (s p ik ) 2 follows a conditional scaled non-central χ 2 -distribution, there exists a sequence of i.i.d. normal random variables u 1 , u 2 , . . . , u N −2 such that Based on Taylor expansion, we have (14) Therefore, based on the independence of the denominator and numerator, As a special case, for k ∈ S 0 , d k = 0, therefore Similarly, we conclude that for k ∈ S 1 , E (t ik | i· ) is dominated by the d k term, which approaches ∞ with rate O(N ).
Below we will prove a similar lemma for the median. But first, we will need to understand the asymptotic properties of the higher order moments of t ik . Proposition 1.9. The conditional variance and skewness of t ik have the following asymptotic representation and Proof. Let us use the same decomposition as in the previous proof, namely t ik := U · W .
For variance, we have For skewness, we have the following computation. Denote So the skewness is Next, we prove that both the median and mean of t ik converges to 1 2 · t * i , up to an O(N −1 ) difference.
Lemma 1.10. The conditional median of t ik is In particular, for k ∈ S 0 , Proof. Based on Cornish-Fisher expansion, However, the skewness of t ik is of order N −1 . so Putting all the above conclusions together, we have the following theorem.
Theorem 1.11. Assume that σ 2 k ≡ σ 2 for all k (interchangeable covariance structure). The conditional mean and median of t ik , have the following asymptotic representation.
For k ∈ S 0 , For k ∈ S 1 , both the mean and median approaches infinity with rate O(N ): Here Sgn(d k ) is the sign of d k and P i· stands for the probability law of i· .
Proof. We only need to use the strong law of large number (SLLN) of P i· , The divergence result for k ∈ S 1 follows from the fact that d k Remark. Equations (23) justifies the use of multiplicative coefficient √ 2 in super-delta method when estimating the oracle t-statistic. The asymptotic behavior of the "S 0 -pairing" means that when N → ∞, the bulk of the empirical distribution of t i converges to a (constant times) noncentral Student t-distribution, centered at the oracle t-statistic. On the other hand, the "S 1 -pairing", which constitutes two smaller proportions (one for up-and one for down-regulated genes) of the empirical distribution of t i , will "move away" from that bulk distribution.
This phenomenon is illustrated in Figure 1.  Figure 1: An illustration of different asymptotic behavior of t ik . The large central density represents those t ik paired with NDEGs (k ∈ S 0 ). Two smaller densities represents t ik calculated from pairing with up-and down-regulated genes. The proportions are: π 0 = 0.85 for true NDEGs; π + 1 = 0.05 for down-regulated genes; π − 1 = 0.10 for up-regulated genes. Oracle t * is set to be 1.7 (represented by the red vertical line). When the up-and down-regulation effect sizes are fixed and N becomes larger, the centers for the two smaller densities of t ik converges to −∞ and +∞, respectively. Consequently, a 20% median fold trim (represented by the two dotted vertical lines) can effectively remove the two smaller clusters of t ik .