Model framework
For clarity, we first present our main methodological development for a univariate regression problem. We extend it to multiple regressions in the “Extension to multiple regressions” section.
Consider the following regression-type H-T problem:
$$ \mathbf{y} = \mathbf{1} \mu + \mathbf{x} \beta + \boldsymbol{\epsilon}, $$
(1)
$$ {\begin{aligned} &\text{where} \quad \mu, \beta \in \mathbb{R}, \quad \mathbf{y},\mathbf{x}, {\boldsymbol{\epsilon}}, \mathbf{1}=(1, \cdots, 1)' \in \mathbb{R}^{n} \\ &\quad \text{and} \quad {\boldsymbol{\epsilon}} \sim \mathcal{N}(\mathbf{0}, \Sigma); \end{aligned}} $$
$$ {\begin{aligned} H_{0}: \beta=0 \quad \text{versus} \quad H_{1}: \beta \ne 0. \end{aligned}} $$
(2)
Here, y is the response variable, x is the covariate, and ε is the error term, which follows an n-dimensional multivariate normal distribution \(\mathcal{N}\) with mean zero and a general variance-covariance matrix Σ. Considering a random vector Y in the n-dimensional space, the above problem can also be stated as
$$ \mathbf{Y} = \left(\begin{array}{c} Y_{1} \\ \vdots \\ Y_{n} \\ \end{array} \right), \qquad \mathbf{Y} \sim \left\{\begin{array}{ll} \mathcal{N}\left(\mathbf{1} \mu, \; \Sigma \right), & \text{under}~ H_{0}, \\ \mathcal{N}\left(\mathbf{1} \mu + \mathbf{x} \beta, \; \Sigma \right), & \text{under}~ H_{1}. \\ \end{array}\right. $$
(3)
In this model, μ is the intercept or grand mean, a nuisance parameter, and β is the parameter of interest, which quantifies the effect size. We express the variance-covariance matrix of ε in the form
$$ \text{cov}\left(\boldsymbol{\epsilon} \right) = \Sigma = \sigma^{2} \cdot \mathbf{S}, $$
(4)
where σ2 is a positive scalar that quantifies the magnitude of the covariance structure, and S is a symmetric, positive-definite matrix that captures its shape. Additional constraints are needed to uniquely determine σ2 and S; here, we choose a special form that subsequently simplifies our mathematical derivations. For any given Σ, define
$${\begin{aligned} \sigma^{2} := \left(\sum\limits_{i,j} \left(\Sigma^{-1}\right)_{i,j}\right)^{-1} \quad \text{and} \quad \mathbf{S} := \sigma^{-2} \Sigma = \left(\sum\limits_{i,j} \left(\Sigma^{-1}\right)_{i,j}\right) \Sigma. \end{aligned}} $$
From the above definition, we have the following convenient property
$$ \sum\limits_{i,j} \left(\mathbf{S}^{-1} \right)_{i,j}= \mathbf{1}' \, \mathbf{S}^{-1} \, \mathbf{1} = 1. $$
(5)
Hereinafter, we refer to S as the standardized structure matrix satisfying Eq. 5.
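To make the standardization concrete, here is a minimal R sketch; the compound-symmetric Σ below is purely illustrative.

```r
## Standardization of Eq. 5: given a positive-definite Sigma, compute
## sigma^2 and the standardized structure matrix S.
standardize_structure <- function(Sigma) {
  Sigma_inv <- solve(Sigma)
  sigma2 <- 1 / sum(Sigma_inv)   # sigma^2 := (sum_{i,j} (Sigma^{-1})_{ij})^{-1}
  S <- Sigma / sigma2            # S := sigma^{-2} * Sigma
  list(sigma2 = sigma2, S = S)
}

## Illustrative example: a 10x10 compound-symmetric covariance matrix
Sigma <- 2 * (0.5 * diag(10) + 0.5 * matrix(1, 10, 10))
std <- standardize_structure(Sigma)
sum(solve(std$S))   # check Eq. 5: 1' S^{-1} 1 = 1 (up to numerical error)
```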
The proposed method
As a special case of Model (3), if S is proportional to I, the identity matrix, it is well known that the regression t-test is a valid solution to this H-T problem. If S≠I, e.g. when the observed data are correlated and/or have a heterogeneous variance structure, the assumptions of the standard t-test are violated. In this paper, we propose a linear transformation, namely \(\mathbf {P}\mathbf {B} : \mathbf {Y} \to \tilde {\mathbf {Y}}\), which transforms the original data into a new set of data that are independent and identically distributed. Furthermore, we prove that the transformed H-T problem for the new data is equivalent to the original problem, so that we can approach the original hypotheses using standard parametric (or later, rank-based) tests on the new data.
To shed more light on the proposed method, we first provide a graphical illustration in Fig. 1. The proposed procedure consists of three steps.
1. Estimate \(\hat \mu (\mathbf {Y})\) (i.e. the weighted mean of the original data), and subtract \(\hat \mu \) from all data. This process is an oblique (i.e. non-orthogonal) projection from \(\mathbb {R}^{n}\) to an (n−1)-dimensional subspace of \(\mathbb {R}^{n}\). The intermediate data from this step are Y(1) (i.e. the centered data). It is clear that \(\mathbb {E} \mathbf {Y}^{(1)}\) is the origin of the reduced space if and only if H0 is true.
2. Use the eigen-decomposition of the covariance matrix of Y(1) to reshape its “elliptical” distribution into a “spherical” distribution. The intermediate data from this step are Y(2).
3. Use the QR-decomposition technique to find a unique rotation that transforms the original H-T problem into an equivalent problem of testing for a constant deviation along the unit vector. The equivalent data generated from this step are \(\tilde {\mathbf {Y}}\), and the H-T problem associated with \(\tilde {\mathbf {Y}}\) can be approached by existing parametric and rank-based methods.
In the proposed PB-transformation, the B-map performs the transformations in Steps 1 and 2; the P-map from Step 3 is designed to improve the power of the proposed semiparametric test, to be described in the “A semiparametric generalization” section.
Centering data
Using weighted least squares, the mean estimate based on the original data is \(\hat {\mu } (\mathbf {Y})=\mathbf {1}' \mathbf {S}^{-1} \mathbf {Y}\); the usual normalizing factor \(\left (\mathbf {1}' \mathbf {S}^{-1} \mathbf {1}\right)^{-1}\) equals one by Eq. 5 (for details please see Additional file 1: Section S1.1). We subtract \(\hat {\mu }\) from all data points and define the centered data as
$$ \mathbf{Y}^{(1)} := \mathbf{Y} - \mathbf{1} \hat{\mu} = \left(\mathbf{I} - \mathbf{J}\mathbf{S}^{-1}\right) \mathbf{Y}, $$
where J=1·1′ (i.e. a matrix of all 1’s). With some mathematical derivations (see Additional file 1: Section S1.1), we have
$$ \begin{aligned} \mathbb{E}\mathbf{Y}^{(1)} \,=\, \left\{\begin{array}{ll} \mathbf{0}, & \text{under}~ H_{0},\\ \left(\mathbf{I} - \mathbf{J}\mathbf{S}^{-1}\right) \mathbf{x} \beta, & \text{under} ~H_{1}; \end{array}\right. \quad \text{cov}\left(\mathbf{Y}^{(1)}\right) \,=\, \sigma^{2} \left(\mathbf{S} -\mathbf{J} \right). \end{aligned} $$
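The centering step can be sketched in a few lines of R, reusing `std$S` from the snippet above; the response `y` is placeholder data generated under H0. Note that \(\hat{\mu} = \mathbf{1}'\mathbf{S}^{-1}\mathbf{Y}\) takes this simple form only because of the standardization in Eq. 5.

```r
## Centering step: weighted mean and the oblique projection (I - J S^{-1}).
set.seed(1)
n <- nrow(std$S)
S_inv <- solve(std$S)
y <- 1 + drop(t(chol(std$S)) %*% rnorm(n))        # H0 data: mu = 1, beta = 0, cov = S
mu_hat <- drop(crossprod(rep(1, n), S_inv %*% y)) # weighted mean 1' S^{-1} y
J <- matrix(1, n, n)
y1 <- drop((diag(n) - J %*% S_inv) %*% y)         # centered data Y^(1)
all.equal(y1, y - mu_hat)                         # same thing, elementwise
```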
The B-map
Now, we focus on S−J, which is the structure matrix of the centered data. Let TΛT′ denote the eigen-decomposition of S−J. Since the data are centered, there are only n−1 nonzero eigenvalues. We express the decomposition as follows
$$ \mathbf{S} -\mathbf{J} = \mathbf{T}_{n-1} \Lambda_{n-1} \mathbf{T}_{n-1}', $$
(6)
where Tn−1∈Mn×(n−1) is a semi-orthogonal matrix containing the first n−1 eigenvectors and Λn−1∈M(n−1)×(n−1) is a diagonal matrix of nonzero eigenvalues. Based on Eq. 6, we define (see Additional file 1: Section S1.2)
$$\mathbf{B} := \Lambda_{n-1}^{1/2} \mathbf{T}_{n-1}' \mathbf{S}^{-1} \in \mathrm{M}_{(n-1) \times n}, $$
so that \(\mathbf {Y}^{(2)} := \mathbf {B}\mathbf {Y} \in \mathbb {R}^{n-1}\) has the following mean and covariance
$$ {\begin{aligned} \mathbb{E}\mathbf{Y}^{(2)} = \left\{\begin{array}{ll} \mathbf{0}_{n-1}, & \text{under}~ H_{0}, \\ \mathbf{B} \mathbf{x} \beta, & \text{under}~ H_{1}; \end{array}\right. \quad \text{cov}\left(\mathbf{Y}^{(2)}\right) = \sigma^{2} \mathbf{I}_{(n-1)\times (n-1)}. \end{aligned}} $$
(7)
We call the linear transformation represented by matrix B the “B-map”. So far, we have centered the response variable and standardized the general structure matrix S into the identity matrix I. However, the covariate and the alternative hypothesis in the original problem are also transformed by the B-map. For normally distributed Y, the transformed H-T problem in Eq. 7 can be approached by the regression t-test; however, there is no appropriate rank-based counterpart. In order to conduct a rank-based test for Y with a broader class of distributions, we propose the next transformation.
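The B-map can be sketched as follows, continuing the R objects (`S_inv`, `J`, `y`, `n`) from the previous snippets; `eigen()` returns eigenvalues in decreasing order, so the single null eigenvalue of S − J comes last.

```r
## B-map: eigen-decompose S - J and form B = Lambda^{1/2} T' S^{-1}.
eig <- eigen(std$S - J, symmetric = TRUE)
Tm  <- eig$vectors[, 1:(n - 1)]              # T_{n-1}: eigenvectors with nonzero eigenvalues
Lam <- eig$values[1:(n - 1)]                 # the n-1 nonzero eigenvalues
B   <- diag(sqrt(Lam)) %*% t(Tm) %*% S_inv   # B in M_{(n-1) x n}
y2  <- drop(B %*% y)                         # Y^(2) = B Y
round(B %*% std$S %*% t(B), 10)              # sanity check: B S B' = I (Eq. 7)
```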
The P-map
From Eq. 7, define the transformed covariate
$$ \mathbf{z} := \mathbf{B}\mathbf{x} \in \mathbb{R}^{n-1}. $$
(8)
We aim to find an orthogonal transformation that aligns z with 1n−1 in the reduced space. We construct such a transformation through the QR decomposition of the following matrix
$$\mathbf{A} = \left(\mathbf{1}_{n-1} | \mathbf{z}\right) = \mathbf{Q}\mathbf{R}, $$
where A∈M(n−1)×2 is the column-wise concatenation of the target vector 1n−1 and the vector z, Q∈M(n−1)×2 is a semi-orthogonal matrix, and R∈M2×2 is an upper triangular matrix. We also define the following rotation matrix
$$ {\begin{aligned} \text{\texttt{Rot}} &:= \left(\begin{array}{cc} \xi & \sqrt{1- \xi^{2}} \\ -\sqrt{1- \xi^{2}} & \xi \end{array}\right) \in \mathrm{M}_{2\times 2}, \quad \text{where} \\&\qquad \xi := \frac{\langle \mathbf{z}, \mathbf{1}_{n-1} \rangle}{\sqrt{n-1}\cdot \|\mathbf{z}\|} \in \mathbb{R}. \end{aligned}} $$
Geometrically speaking, ξ = cos θ, where θ is the angle between z and 1n−1.
With the above preparations, we have the following result.
Theorem 1
Matrix \(\mathbf{P} := \mathbf{I} - \mathbf{Q}\mathbf{Q}' + \mathbf{Q}\, \text{\texttt{Rot}} \, \mathbf{Q}' = \mathbf{I}_{(n-1)\times(n-1)} - \mathbf{Q} \left(\mathbf{I}_{2\times 2} - \text{\texttt{Rot}}\right) \mathbf{Q}'\) is the unique orthogonal transformation that satisfies the following properties:
$$\begin{array}{*{20}l} \mathbf{P} \mathbf{P}' &= \mathbf{P}' \mathbf{P} = \mathbf{I}_{(n-1)\times (n-1)}, \end{array} $$
(9)
$$\begin{array}{*{20}l} \mathbf{P} \mathbf{z} &= \zeta \cdot \mathbf{1}_{n-1}, \qquad \zeta := \frac{\|\mathbf{z}\|}{\sqrt{n-1}}, \end{array} $$
(10)
$$\begin{array}{*{20}l} \mathbf{P} \mathbf{u} &= \mathbf{u}, \qquad \forall \mathbf{u} ~ \text{s.t.} ~ \langle \mathbf{u}, \mathbf{1}_{n-1} \rangle = \langle \mathbf{u}, \mathbf{z} \rangle = 0. \end{array} $$
(11)
Proof
See Additional file 1: Section S1.3. □
We call the linear transformation P defined by Theorem 1 the “P-map”. Equation 9 ensures that this map is an orthogonal transformation. Equation 10 shows that the vector z is mapped to 1n−1 scaled by a factor ζ. Equation 11 is an invariance property on the linear subspace \(L_{\mathbf {z}}^{\perp }\), the orthogonal complement of the subspace spanned by 1n−1 and z, i.e. Lz=span(1n−1,z). This property defines a unique minimal map that transforms only the components of the data in Lz and leaves the components in \(L_{\mathbf {z}}^{\perp }\) invariant. A similar idea for constructing rotation matrices has been used in [22].
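A sketch of the P-map in R follows, continuing from the B-map snippet; the covariate `x` is illustrative. Because the column signs returned by `qr()` are implementation-dependent, the sketch aligns them with 1n−1 and z before building P.

```r
## P-map (Theorem 1): rotate z onto the direction of 1_{n-1}.
x <- seq_len(n)                        # placeholder covariate
z <- drop(B %*% x)                     # transformed covariate (Eq. 8)
one <- rep(1, n - 1)
Q <- qr.Q(qr(cbind(one, z)))           # semi-orthogonal Q from A = (1 | z)
Q[, 1] <- Q[, 1] * sign(sum(Q[, 1]))      # align first column with 1_{n-1}
Q[, 2] <- Q[, 2] * sign(sum(Q[, 2] * z))  # align second column with z
xi  <- sum(z) / (sqrt(n - 1) * sqrt(sum(z^2)))  # xi = cos(theta)
Rot <- matrix(c(xi, -sqrt(1 - xi^2),
                sqrt(1 - xi^2), xi), 2, 2)      # 2x2 rotation, filled column-wise
P <- diag(n - 1) - Q %*% (diag(2) - Rot) %*% t(Q)
range(P %*% z - sqrt(sum(z^2)) / sqrt(n - 1))   # check Eq. 10: P z = zeta * 1
```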
With both B and P, we define the final transformed data as \( \tilde {\mathbf {Y}} := \mathbf {P} \mathbf {Y}^{(2)} = \mathbf {P} \mathbf {B} \mathbf {Y}\), which has the following joint distribution
$$ {}\tilde{\mathbf{Y}} \!\sim\! \mathcal{N}\left(\mathbf{P}\mathbf{B} \mathbf{x} \beta, \; \mathbf{P}\mathbf{B} (\sigma^{2} \mathbf{S}) \mathbf{B}'\mathbf{P}' \right) \,=\, \left\{\begin{array}{ll} \mathcal{N}\left(\mathbf{0},\; \sigma^{2} \mathbf{I}\right), & \!\!\text{under}~ H_{0}, \\ \mathcal{N}\left(\mathbf{1} \zeta \beta,\; \sigma^{2} \mathbf{I}\right), & \!\!\text{under}~ H_{1}. \\ \end{array}\right. $$
The normality assumption implies that the \(\tilde Y_{i}\), i=1,⋯,n−1, are i.i.d. normal random variables. The location parameter of the common marginal distribution is to be tested, with σ2 unknown. Therefore, we can approach this equivalent H-T problem with the classical one-sample t-test and the Wilcoxon signed rank test (more in the “A semiparametric generalization” section).
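Putting the pieces together, the whole procedure reduces to one matrix product followed by a standard one-sample test; the plain `t.test()` and `wilcox.test()` calls below ignore the DF adjustments discussed later and merely illustrate the equivalent testing problem.

```r
## Full PB-transformation and the resulting one-sample tests.
y_tilde <- drop(P %*% B %*% y)
t.test(y_tilde)        # PB-transformed t-test of H0: beta = 0
wilcox.test(y_tilde)   # rank-based counterpart (signed rank test)
```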
Correlation estimation for repeated measurements
If Σ is unknown, we can decompose Σ in the following way
$$ \Sigma = \mathbf{W}^{-\frac{1}{2}} \, \text{\texttt{Cor}} \, \mathbf{W}^{-\frac{1}{2}}, $$
(12)
where W is a diagonal weight matrix and Cor is the corresponding correlation matrix. By definition, the weights are inversely proportional to the variances of the observations. In many real-world applications, including RNA-seq analysis, these weights can be assigned a priori based on sample quality, but the correlation matrix Cor needs to be estimated from the data. In this section, we provide a moment-based estimator of Cor for a class of correlation structures commonly used for repeated measurements. This estimator does not require computationally intensive iterative algorithms.
Let Y be a collection of repeated measures from L subjects such that the observations from different subjects are independent. With an appropriate data rearrangement, the correlation matrix of Y can be written as a block-diagonal matrix
$$\text{cor}(\mathbf{Y}) = \left(\begin{array}{ccc} \text{\texttt{Cor}}_{1} & & \\ & \ddots & \\ & & \text{\texttt{Cor}}_{L} \end{array}\right). $$
We assume that the magnitude of correlation is the same across all blocks, and denote it by ρ. Each block can be expressed as \(\text {\texttt {Cor}}_{l}(\rho) = (1-\rho) \mathbf {I}_{n_{l} \times n_{l}} + \rho \mathbf {J}_{n_{l} \times n_{l}}\) for \(l = 1, \cdots, L\), where nl is the size of the lth block and \(n={\sum \nolimits }_{l=1}^{L} n_{l}\).
We estimate the correlation based on the weighted regression residuals \(\hat {\boldsymbol {\epsilon }}\) defined by Eq. (S3) in Additional file 1: Section S2.1. Define two forms of residual sums of squares
$$SS_{1}= \sum\limits_{l}\hat{\boldsymbol{\epsilon}}_{l}'\mathbf{I}\hat{\boldsymbol{\epsilon}}_{l} \quad \text{and} \quad SS_{2}= \sum\limits_{l}\hat{\boldsymbol{\epsilon}}_{l}'\mathbf{J}\hat{\boldsymbol{\epsilon}}_{l}, $$
where \(\hat {\boldsymbol {\epsilon }}_{l}\) is the vector of weighted residuals for the lth block. With this notation, we have the following Proposition.
Proposition 1
Denote \(\Sigma _{\epsilon } = \text {cov}(\hat {\boldsymbol {\epsilon }})\) and assume that, for some nonzero σ2,
$$\Sigma_{\epsilon} = \sigma^{2} \cdot \text{diag}(\text{\texttt{Cor}}_{1}(\rho), \cdots, \text{\texttt{Cor}}_{L}(\rho)). $$
An estimator of ρ based on the first moments of SS1 and SS2 is
$$ \hat{\rho}_{\text{moment}} = \frac{SS_{2} - SS_{1}}{ \frac{1}{n} \sum\nolimits_{l=1}^{L} \left(n_{l}(n_{l}-1) \right) SS_{1}}. $$
Moreover, if \(\hat {\boldsymbol {\epsilon }} \sim \mathcal {N}(\mathbf {0}, \Sigma _{\epsilon })\) and n1=⋯=nL=n/L (i.e. balanced design), the above estimator coincides with the maximum likelihood estimator of ρ, which has the form
$$\hat \rho_{\text{MLE}} = \frac{SS_{2} - SS_{1}}{(n_{1}-1)SS_{1}}. $$
Proof
See Additional file 1: Section S2.1. □
Standard correlation estimates are known to have a downward bias [23], which can be corrected by Olkin and Pratt’s method [24]. With this correction, our final correlation estimator is
$$ \hat{\rho} = \hat{\rho}_{\text{moment}} \left[ 1+ \frac{1-\hat{\rho}_{\text{moment}}^{2}}{2(L-3)} \right]. $$
(13)
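A compact R sketch of this estimator is given below; `resid_list` is assumed to hold the weighted residual vector of each of the L blocks (subjects).

```r
## Moment-based estimator of rho with the Olkin-Pratt correction (Eq. 13).
estimate_rho <- function(resid_list) {
  L   <- length(resid_list)
  nl  <- vapply(resid_list, length, integer(1))
  n   <- sum(nl)
  SS1 <- sum(vapply(resid_list, function(e) sum(e^2), numeric(1)))  # e' I e
  SS2 <- sum(vapply(resid_list, function(e) sum(e)^2, numeric(1)))  # e' J e
  rho_m <- (SS2 - SS1) / (sum(nl * (nl - 1)) / n * SS1)             # moment estimator
  rho_m * (1 + (1 - rho_m^2) / (2 * (L - 3)))                       # Olkin-Pratt correction
}
```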
Kenward-Roger approximation to the degrees of freedom
The degrees of freedom (DF) can have a nontrivial impact on hypothesis testing when the sample size is relatively small. Intuitively, a correlated observation carries “less information” than an independent one, so the effective DF is smaller than the apparent sample size. Simple examples include the two-sample t-test and the paired t-test. Suppose there are n observations in each group; the former test has DF=2n−2 for i.i.d. observations, whereas the latter has only DF=n−1 because the observations are perfectly paired. These elementary examples indicate that we need to adjust the DF according to the correlation structure in our testing procedures.
We adopt the degrees of freedom approximation proposed by [25] (the K-R approximation henceforth) for the proposed tests. The K-R approximation is a fast moment-matching method, which is efficiently implemented in the R package pbkrtest [26]. In broad terms, we use the DF approximation as a tool to adjust the effective sample size when partially paired data are observed.
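For readers who want to reproduce a K-R adjusted test directly, the snippet below shows an illustrative `pbkrtest` call, anticipating the LMER fits of the next subsection; the data frame `dat` and its columns `y`, `x`, `subject` are hypothetical.

```r
## Kenward-Roger adjusted F-test comparing nested mixed models.
library(lme4)
library(pbkrtest)
full <- lmer(y ~ x + (1 | subject), data = dat)
null <- lmer(y ~ 1 + (1 | subject), data = dat)
KRmodcomp(full, null)   # F-test for beta with K-R adjusted DF
```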
Alternative approach using mixed-effects model
As we mentioned in the “Background” section, the H-T problem stated in Model (3) for repeated measurements can also be approached with the linear mixed-effects regression (LMER) model. Suppose the ith observation is from the lth subject; we may fit the data with a random-intercept model such that
$$Y_{i(l)} = \mu + x_{i} \beta + 1_{l} \gamma_{l} + \epsilon_{i}, $$
where 1l is the indicator function of the lth subject, \(\gamma_{l} \stackrel {i.i.d.}{\sim } N\left (0, \sigma ^{2}_{\gamma }\right)\), and \(\epsilon _{i} \stackrel {i.i.d.}{\sim } N\left (0, \sigma ^{2}_{\epsilon }\right)\). The within-subject correlation is modeled as
$$ \rho = \text{cor}\left(Y_{i(l)}, Y_{i'(l)}\right) = \frac{\sigma^{2}_{\gamma}}{\sigma^{2}_{\gamma} + \sigma^{2}_{\epsilon}}, \qquad i \ne i'. $$
(14)
The LMER model is typically fitted by a maximum likelihood or REML approach, using iterative algorithms such as EM until convergence. Weights can be incorporated in the likelihood function. The lmer() function in the R package lme4 [16] provides a reference implementation for fitting the LMER model. Due to its relatively high computational cost, the mixed-effects model has limited application in high-throughput data.
The R package lmerTest [17] performs hypothesis tests for lmer() outputs. By default, it adjusts the DF using Satterthwaite’s approximation [27]; it can optionally use the K-R approximation.
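As a point of comparison with the proposed tests, a minimal LMER analysis might look as follows; the data frame `dat` (columns `y`, `x`, `subject`) and the weight vector `w` are hypothetical.

```r
## Random-intercept LMER fit with DF-based tests from lmerTest.
library(lmerTest)                      # extends lme4::lmer with DF-based t-tests
fit <- lmer(y ~ x + (1 | subject), data = dat, weights = w)
summary(fit)                           # Satterthwaite DF by default
summary(fit, ddf = "Kenward-Roger")    # optionally use the K-R approximation
```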
A semiparametric generalization
In the sections above, we developed the PB-transformed t-test using linear algebra techniques. These techniques can also be applied to non-normal distributions to transform their mean vectors and covariance matrices. With the following proposition, we extend the proposed method to an appropriate semiparametric distribution family. Because the transformed data have the same first two moments as an i.i.d. sample (they are uncorrelated with equal variance), we can apply a rank-based test to them to test the original hypotheses. We call this procedure the PB-transformed Wilcoxon test.
Proposition 2
Let \(\check {\mathbf {Y}} := \left \{ \check {Y}_{1}, \dots, \check {Y}_{n-1}\right \}\) be a collection of i.i.d. random variables with a common symmetric density function g(y), g(−y)=g(y). Assume that \(\mathbb {E} \check {Y}_{1} = 0\), \(\text {var}(\check {Y}_{1}) = \sigma ^{2}\). Let Y∗ be a random variable that is independent of \(\check {\mathbf {Y}}\) and has zero mean and variance σ2. For every symmetric positive semi-definite \(\mathbf {S} \in \mathrm {M}_{n \times n}\), \(\mathbf {x} \in \mathbb {R}^{n}\), and \(\mu, \beta \in \mathbb {R}\), there exists a linear transformation \(\mathbf {D}: \mathbb {R}^{n-1} \to \mathbb {R}^{n}\) and constants u, v, such that
$$ \mathbf{Y} := \mathbf{D} \left(\check{\mathbf{Y}} + u \mathbf{1}_{n-1}\right) + (Y^{*} + v) \mathbf{1}_{n} $$
(15)
is an n-dimensional random vector with
$$ \mathbb{E} (\mathbf{Y}) = \mathbf{1} \mu + \mathbf{x} \beta \quad \text{and} \quad \text{cov}(\mathbf{Y}) = \sigma^{2} \mathbf{S}. $$
Furthermore, if we apply the PB-transformation to Y, the result is a sequence of (n−1) uncorrelated, equal-variance random variables that have zero mean if and only if β=0.
Proof
See Additional file 1: Section S1.4. □
The essence of this Proposition is that, starting with an i.i.d. sequence of random variables with a symmetric common p.d.f., we can use linear transformations to generate a family of distributions that is expressive enough to include non-normal distributions with an arbitrary covariance matrix and a mean vector specified by the effect to be tested. This distribution family is semiparametric because: (a) the “shape” of the density function, g(y), has infinite degrees of freedom; (b) the “transformation” (D, u, and v) has only finitely many parameters.
As mentioned before, applying both the B- and P-maps enables us to use the Wilcoxon signed rank test for the hypotheses under this semiparametric distribution family. This approach has better power than a test based on the B-map alone, as shown in the “Simulations” section. Once the PB-transformed data are obtained, we calculate the Wilcoxon signed rank statistic and follow the testing approach in [21], which approximates the asymptotic distribution of the test statistic by a t-distribution with an adjusted DF. Note that the Wilcoxon signed rank test is only valid when the underlying distribution is symmetric; therefore, the symmetry assumption in Proposition 2 is necessary. In summary, the PB-transformed Wilcoxon test provides an approximate test (up to the second-order moments) for data that follow this flexible semiparametric distributional model.
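The toy example below illustrates the PB-transformed Wilcoxon test on correlated, symmetric, non-normal data; the t(5) errors and the Cholesky-based mixing are for illustration only (Proposition 2's exact construction uses a particular transformation D), and the plain `wilcox.test()` stands in for the DF-adjusted version of [21]. `P`, `B`, and `std$S` are reused from the earlier snippets.

```r
## PB-transformed Wilcoxon test on non-normal correlated data under H0.
set.seed(2)
eps  <- drop(t(chol(std$S)) %*% rt(n, df = 5))  # covariance proportional to S
y_np <- 1 + eps                                 # mu = 1, beta = 0 (H0 is true)
wilcox.test(drop(P %*% B %*% y_np))             # should not reject, on average
```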
Extension to multiple regressions
In this section, we present an extension of the proposed methods to the following multiple regression problem
$$ \begin{aligned} \mathbf{y} &= \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \mathbf{y} \in \mathbb{R}^{n}, \quad \mathbf{X} \in \mathrm{M}_{n\times p},\\& \quad \boldsymbol{\beta} \in \mathbb{R}^{p}, \quad \boldsymbol{\epsilon} \in \mathbb{R}^{n}. \end{aligned} $$
(16)
Here the error term ε is assumed to have zero mean, but it need not have a scalar covariance matrix. For example, ε can be the sum of random effects and measurement errors in a typical LMER model, with covariance of the form specified in Eq. 4.
To test the significance of βk, k=1,…,p, we need to specify two regression models: the null and the alternative. The alternative model is simply the full Model (16), and the null model is a regression model whose covariate matrix is X−k, constructed by removing the kth covariate (Xk) from X
$$ {\begin{aligned} \mathbf{y} \!&=\! \mathbf{X}_{-k}\boldsymbol{\beta}_{-k} + \boldsymbol{\epsilon},\quad\! \mathbf{X}_{-k} \in \mathrm{M}_{n\times (p-1)},\\& \quad\!\! \boldsymbol{\beta}_{-k} \in \mathbb{R}^{p-1}, \!\quad\! \text{span}\left(\mathbf{X}_{-k} \right) \subsetneq \text{span}\left(\mathbf{X} \right). \end{aligned}} $$
(17)
Compared with the original univariate problem, we see that the nuisance covariates in the multiple regression case are X−kβ−k instead of 1μ in Eq. 1. Consequently, we replace the centering step by regressing out the linear effects of X−k
$$ \mathbf{E} := \mathbf{C} \mathbf{Y} \!:=\! \left(\mathbf{I}_{n\times n} - \mathbf{X}_{-k} \left(\mathbf{X}_{-k}' \mathbf{S}^{-1} \mathbf{X}_{-k} \right)^{-1}\mathbf{X}_{-k}' \mathbf{S}^{-1}\right) \mathbf{Y}. $$
The new B-transformation is defined through the eigen-decomposition of \(\text {cov}(\mathbf {E}) = \sigma ^{2} \left (\mathbf {S} - \mathbf {X}_{-k} \left (\mathbf {X}_{-k}' \mathbf {S}^{-1} \mathbf {X}_{-k} \right)^{-1} \mathbf {X}_{-k}' \right)\). The P-transformation is derived as before, but with the new B matrix.
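The extension can be sketched by replacing the centering matrix and rebuilding B from cov(E); the snippet below reuses `S_inv`, `n`, and `y` from the earlier snippets, with an illustrative design matrix `X` and index `k`. The P-map is then constructed from `z2` exactly as before.

```r
## Multiple-regression extension: regress out X_{-k}, then rebuild B.
set.seed(3)
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # first column: intercept
k <- 2                                                # test beta_k
Xm <- X[, -k, drop = FALSE]                           # nuisance covariates X_{-k}
G  <- solve(t(Xm) %*% S_inv %*% Xm)
E  <- drop((diag(n) - Xm %*% G %*% t(Xm) %*% S_inv) %*% y)   # residualized data
SE <- std$S - Xm %*% G %*% t(Xm)                      # cov(E) / sigma^2
eigE <- eigen(SE, symmetric = TRUE)
r  <- n - (p - 1)                                     # rank of cov(E)
B2 <- diag(sqrt(eigE$values[1:r])) %*% t(eigE$vectors[, 1:r]) %*% S_inv
z2 <- drop(B2 %*% X[, k])   # transformed covariate; build the P-map from z2 as before
```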