Testing for association between RNA-Seq and high-dimensional data

Background Testing for association between RNA-Seq and other genomic data is challenging due to high variability of the former and high dimensionality of the latter. Results Using the negative binomial distribution and a random-effects model, we develop an omnibus test that overcomes both difficulties. It may be conceptualised as a test of overall significance in regression analysis, where the response variable is overdispersed and the number of explanatory variables exceeds the sample size. Conclusions The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression. The R package globalSeq is available from Bioconductor. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0961-5) contains supplementary material, which is available to authorized users.


A Derivation of the test statistic
Here we derive the proposed test statistic. Note that a more intuitive explanation of the setting is given in the paper.

Setting
The setting is similar to Goeman et al. (2004). Denote the response across all samples by y = (y 1 , . . . , y n ) T , and the library sizes by m = (m 1 , . . . , m n ) T . Each y i is modelled by E[y i |r i ] = γ i exp(α + r i ), where α is the intercept, log(γ i ) an offset, and r i a realisation of the random effect. We use γ i = m i /m, where m = ( n i=1 m i ) (1/n) . For the random vector r = (r 1 , . . . , r n ) T we assume E[r] = 0 and Var[r] = τ 2 XX T , where X is the n × p covariate matrix. The aim is to test H 0 : τ 2 = 0 against H 1 : τ 2 > 0. For simplicity we define R = (1/p)XX T and let R ij denote the element in the i th row and j th column of R.

Distribution
We assume y i |r i ∼ NB(µ i , φ), where µ i > 0 and φ > 0 for all i = 1, . . . , n. Under the chosen parametrization E[y i |r i ] = µ i and Var[y i |r i ] = µ i + φµ 2 i . The density function is:

Score
Le Cessie and van Houwelingen (1995) show how to obtain the score for testing H 0 : τ 2 = 0 against H 1 : τ 2 > 0. The calculations from le Cessie and van Houwelingen (1995) start with the marginal likelihood function: The crucial step of le Cessie and van Houwelingen (1995) is to take the Taylor expansion with respect to the random effect before taking the expectation. Differentiating this approximation of L(α, τ 2 ) with respect to τ 2 , and evaluating the result at τ 2 = 0 gives the score. Under the null hypothesis only some terms of the score can be different from zero: Plugging the expressions for l (1)

Parameter estimation
Under the null hypothesis we have y i ∼ N B(µ i , φ) where µ i = γ i exp(α). Maximum likelihood estimation leads toα = log(ȳ) − log(γ). The maximum likelihood estimate for the dispersion parameter φ can be obtained by numeric maximisation.

Test statistic
In matrix notation the test statistic is T is the diagonal matrix with the diagonal elements T ii = 1/(1 + φμ i ), y •μ 0 is the entrywise product of the vectors y andμ 0 , and d is the column vector of the main diagonal of R.

B Cancer dataset Variables
The prostate cancer dataset from TCGA et al. (2013) includes data of various types and on three different levels. We used preprocessed forms of the RNA-Seq data (gene, level 3), of the DNA methylation data (human methylation 450 array, level 3), and of the DNA copy number data (CNV data extracted from SNP array, level 3). The last-mentioned data involves copy numbers measured at equally spaced loci on the genome, obtained from the segmented copy number profiles.

Samples
Our criterion for sample selection was the availability of gene expression, methylation, copy number and single nucleotide polymorphism data. This lead to a sample size of 162 individuals.

Batch effects
According to the TCGA batch effects tool (MD Anderson Cancer Center, 2016), the between batch dispersion (DB) is much smaller than the within batch dispersion (DW) in the RNA-Seq gene expression data, the copy number data, and the methylation data. Due to the small dispersion separability criteria (DSC = DB/DW) we do not correct the data for batch effects.   Table A: Type I error rates in the simulation study. Under each simulation setup from Figure A we calculate the type I error rates at the 5% (top) and 1% (bottom) significance levels. The row and column names match the entries with the lines in Figure A. As the average rates are 5.1% and 1.0% respectively, there is little concern about rejecting more true null hypotheses than expected. Additional Note: In order to verify that the type I error rate is not only maintained across genes, but also for individual genes, we simulate 5000 expression vectors y = (y 1 , . . . , y 128 ) T from the negative binomial distribution with µ = 7 and φ = 0.1. Testing for associations with the given covariate matrix X leads to the type I error rates 4.4% and 0.7% at the 5% and 1% significance levels, respectively.  Table B: Statistical power of joint and individual testing at various significance levels α. We simulate 1 000 response vectors under the alternative hypothesis (n = 128, p = r = 50, s = 1, φ = 0.01). After testing the covariates jointly as well as individually, we compare the joint p-value with the minimum of the FDR-corrected individual p-values. Joint testing rejects a higher percentage of false null hypotheses than individual testing.  At any reasonable significance level, the non-stratified permutation test (grey) rejects more null hypotheses than the stratified permutation test (black). Naturally, genetic variation is high between and low within populations. Ignoring population structure increases genetic variation and thereby statistical power, whereas accounting for population structure decreases bias.

IL22RA2
C10orf67   Figure D: Cumulative distribution functions of RNA-Seq from the application HapMap for randomly selected genes. Each row represents one gene, and each column represents one model. It is of interest how close the fitted distributions (red) come to the empirical distributions (black). If library sizes are ignored (columns 1 and 2), the negative binomial distribution with a free dispersion parameter has a much better fit than the Poisson distribution. If an offset is included (columns 3 and 4), the differences become smaller.  Figure E: Cumulative distribution functions of RNA-Seq from the application TCGA for randomly selected genes. Each row represents one gene, and each column represents one model. It is of interest how close the fitted distributions (red) come to the empirical distributions (black). Whether library sizes are ignored (columns 1 and 2) or an offset is included (columns 3 and 4), the negative binomial distribution with a free dispersion parameter has a much better fit than the Poisson distribution.