Recent advances in high-throughput genomic assays have allowed for the creation of expansive data sets that are useful for exploring biological variation across cells. In particular, single-cell RNA-sequencing (scRNA-seq) technologies provide gene expression measurements at the individual cell level, allowing for the analysis of variation in transcriptional activity across cells within a single sample [1,2,3,4]. While characterizing this variation is useful by itself for exploratory analysis, it is also of interest to study in a more targeted way how variation relates to cell-specific covariates, such as cell type, genotype, and cell health. Studying associations between gene expression and properties of single cells has the potential to enrich our understanding of the relationship between these covariates and transcription at single-cell resolution.
While methods for association studies have been widely developed for bulk RNA-sequencing (RNA-seq) data [5, 6], methods for studying associations on the level of individual cells are much less developed. Moreover, there are several unique challenges in manipulating and analyzing the data generated by these single-cell assays, as compared to bulk RNA-seq assays. These scRNA-seq data sets are high dimensional—there are tens of thousands of genes in the human genome—making them difficult to interpret gene-by-gene; furthermore, the count-based nature of the data—made up of counts of sequenced RNA fragments that map to a specific gene in a genome to approximate expression levels of that gene—presents a challenge for many standard statistical tools with Gaussian assumptions [7].
In this paper, we propose a statistical modeling approach based on reduced-rank regression that captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcription. Within this framework, we propose two specific models: Poisson reduced-rank regression (PRRR), which adapts a generalized linear model to the reduced rank setting, and nonnegative Poisson reduced-rank regression (nn-PRRR), which provides interpretable nonnegative regression components. In what follows, we first review several related threads of research and describe our modeling approach. Then, using simulated data and single-cell RNA-seq data, spatial gene expression data, and bulk RNA-seq data, we show that our models are useful for a wide range of association study types, including studying the transcriptional hallmarks of cell types, genes correlated with disease status, and expression QTLs.
Genome-wide association studies
Since the completion of the Human Genome Project in 2003 [8] and the HapMap project in 2005 [9], researchers have developed the genomic and statistical tools necessary to study the human genome at a large scale in order to better detect, treat, and prevent diseases. Genome-wide association studies (GWAS) are used to identify disease-causing genetic variation across complete genomes. Genetic variation often comes in the form of single nucleotide polymorphisms (SNPs) that can be compared between healthy patients and patients with a disease [10]. GWAS have found a plethora of genetic variants that are associated with common diseases such as asthma, type 2 diabetes, and more [11, 12].
In a similar vein, quantitative trait loci (QTL) studies identify associations between genetic variants and quantitative phenotypes [13, 14]. A common experimental setup is to use gene expression levels as the quantitative phenotype, in which case the association is referred to as an expression QTL (eQTL) [15, 16]. Most eQTL studies have relied on bulk RNA-seq technologies to measure the gene expression levels of samples from tissues with heterogeneous cell types [17,18,19].
In this context, the statistical eQTL problem is to estimate the pairwise association between a set of genetic variants (the covariates or explanatory variables) and the expression level of each gene (the outcome variables). This is typically performed using a linear regression model. In particular, let \({\mathbf {X}}\) be an \(N\times P\) matrix containing information about genetic variants across P SNPs for N individuals or tissue samples, and let \({\mathbf {Y}} \in {\mathbb {R}}^{N \times Q}\) be a matrix of corresponding gene expression levels across Q genes in these individuals or tissues [7]. Generically, these approaches use a model of the form
$$\begin{aligned} {\mathbf {y}}_{\cdot q} = {\mathbf {x}}_{\cdot p} b_{pq} + \varvec{\epsilon }, \end{aligned}$$
(1)
where \({\mathbf {x}}_{\cdot p}\) is the pth column of \({\mathbf {X}}\), \({\mathbf {y}}_{\cdot q}\) is the qth column of \({\mathbf {Y}}\), \(\varvec{\epsilon } \in {\mathbb {R}}^{N}\) is a vector of independent zero-mean Gaussian-distributed noise terms, and \(b_{pq} \in {\mathbb {R}}\) is a scalar coefficient representing the linear relationship between SNP p and gene q for \(p = 1, \dots , P\) and \(q = 1,\dots , Q\). Downstream tests for significance can be performed on these coefficients to identify associations [20,21,22]. Without further assumptions, this model estimates the marginal association between single SNPs and single genes independently. To accommodate polygenic contributions to phenotypes, multivariate models of the form
$$\begin{aligned} {\mathbf {Y}} = {\mathbf {X}} {\mathbf {B}} + \varvec{\epsilon }, \end{aligned}$$
(2)
have been considered, where \({\mathbf {B}} \in {\mathbb {R}}^{P \times Q}\) is a matrix of coefficients [23,24,25]. Under this framework, sparsity-inducing priors for \({\mathbf {B}}\) have been proposed in order to scale these models to high-dimensional genotype data [26, 27].
The advent of scRNA-seq technologies has opened the door for narrowing the investigation of genotype-phenotype relationships from the level of whole tissues to the level of individual cells. However, existing computational tools are insufficient for this purpose: they typically do not accommodate count-based data, and they are seldom robust to high-dimensional outcome variables. It is difficult to control the hypothesis testing error rate in many eQTL analyses, which run millions to trillions of univariate association hypothesis tests (one for each SNP-gene pair) [7, 18, 19, 28, 29].
Count-based models
A further drawback of existing association testing frameworks is their assumption of Gaussian-distributed data. Most canonical regression models assume an independent normally-distributed response variable, with \(\varvec{\epsilon } \sim {\mathcal {N}}({\mathbf {0}}, \sigma ^2 {\mathbf {I}}_N)\) in Eq. (1). However, when the data consist of count-based measurements, as for RNA-seq data, this assumption may be problematic. Various transformations have been proposed to make the response variable approximately Gaussian [30, 31], but these transformations are known to distort the data distribution in undesirable ways [32,33,34]. Count-based scRNA-seq data is discrete and nonnegative, with many gene expression counts having a value of zero. The sparsity of the data poses an additional challenge to these standard transformations [32].
An alternative to this approach is to model the gene expression data with a discrete distribution. A common choice is the Poisson distribution, whose support is restricted to the nonnegative integers and has been shown to improve the representation and interpretation of scRNA-seq data when fitting statistical models [32, 35]. A recent approach using a Poisson data likelihood proposed a naive Bayes model that assigns cell-type identities to samples in scRNA-seq data based on reference data [36]. The model uses a Poisson distribution to represent the count-based data, but the high number of zeros in the data still poses a challenge. The sparsity of the data interferes with standard estimates such as maximum likelihood estimates as rates of zero can be produced for thousands of genes, making the model sensitive to genes with low expression counts [36]. The model handles this challenge by introducing a hierarchical structure, placing posterior distributions on parameters in order to recover non-zero rate estimates for genes with zero counts in the reference data. However, the naive Bayes model also assumes independence between genes, although this assumption does not hold in practice, as expression has been observed to be correlated between genes [32, 37, 38].
Modeling multiple data modalities
Latent variable modeling approaches have also been proposed for modeling multi-view data. The most popular approach has been canonical correlation analysis (CCA, [39]) and its probabilistic variants [40,41,42]. CCA seeks a low-dimensional linear mapping of two paired data matrices such that the resulting low-dimensional projections of both matrices are maximally correlated.
$$\begin{aligned} \max \rho ({\mathbf {X}} {\mathbf {u}}, {\mathbf {Y}} {\mathbf {v}})~~~~\text {subject to } {\mathbf {u}}^\top {\mathbf {u}} = 1, {\mathbf {v}}^\top {\mathbf {v}} = 1, \end{aligned}$$
where \(\rho\) is the Pearson correlation function. The probabilistic version of this model projects the features of each data modality into a shared low-dimensional latent space, assuming heteroskedastic residual errors, maximizing the amount of variance explained in the data modalities by the latent subspace. The weights, or factor loadings, in CCA models allow us to identify covarying features across data modalities. A formal connection between CCA and reduced-rank regression has been shown [43], where the canonical subspace found by CCA is the same as the subspace of the maximum likelihood estimator for the reduced-rank regression model. Despite their connections, the unsupervised nature of CCA does not lend itself directly to association mapping between the data modalities. Conversely, reduced-rank regression has a natural association testing framework because of its regression foundation.
Recently, a latent variable model based on latent Dirichlet allocation [44, 45] for jointly modeling gene expression and genotype was proposed [46]. This model projected both genotype data—using an equivalent of the Structure model [45]—and count-based gene expression data—using a telescoping LDA model [44]—onto a shared latent subspace; we may then identify covarying genes and genotypes in a nonnegative latent representation. But discovering associations in this framework requires association testing in held-out data, which is limited by existing univariate methods and population data.
Reduced-rank regression approaches
The transcriptional states of cells tend to exhibit strong correlations between genes [47]. Thus, it is likely that the relationship between cell covariates and transcriptional phenotypes in scRNA-seq data need not be modeled gene-by-gene. Rather, it is reasonable to assume that these associations exhibit a low-dimensional structure. Furthermore, treating each gene as independent is computationally and statistically inefficient; we would like to exploit these relationships to perform fewer association tests and leverage shared variation to improve statistical power in these often small sample sizes. These ideas motivate a regression model whose coefficient matrix has low rank. Several approaches to reduced-rank regression have been developed to take advantage of this opportunity.
Consider again the linear regression model in Eq. (2). Here, \({\mathbf {B}}\) is a \(P\times Q\) matrix of regression coefficients, where P is the number of covariates, and Q is the number of genes. In most gene expression studies, Q (and sometimes P) is large, and \(\min (P, Q) \gg n\). The core assumption of reduced-rank regression (RRR) is that the matrix \({\mathbf {B}}\) has low rank [48]. In particular, the RRR model assumes \({\mathbf {B}}\) has rank \(R \ll \min (P, Q)\). This implies that \({\mathbf {B}}\) can be factorized as an outer product of two low-rank matrices, giving us the following reduced-rank regression model:
$$\begin{aligned} {\mathbf {Y}} = {\mathbf {X}} {\mathbf {B}} + \varvec{\epsilon }~~~~\text {subject to } {\mathbf {B}} = {\mathbf {U}} {\mathbf {V}}^\top , \end{aligned}$$
(3)
where \({\mathbf {U}} \in {\mathbb {R}}^{P \times R}\) and \({\mathbf {V}} \in {\mathbb {R}}^{Q \times R}\). In the context of gene expression studies, this low-rank assumption implies that the relationship between cell-specific covariates and gene expression can be described in terms of a small set of latent factors. In other words, variance in gene expression is mediated by R different programs encoded in subsets of covariates; then \({\mathbf {B}}\) captures both the covariates of interest and their effect sizes within each of the R programs.
Several estimation approaches have been proposed for RRR under the assumption of Gaussian noise. A common method is to find the parameter values that minimize the squared reconstruction error [48, 49]:
$$\begin{aligned} \min _{{\mathbf {U}}, {\mathbf {V}}} \Vert {\mathbf {Y}} - {\mathbf {X}} {\mathbf {U}} {\mathbf {V}}^\top \Vert _2^2. \end{aligned}$$
This approach corresponds to finding the maximum likelihood solution of an RRR model with Gaussian errors (\(\varvec{\epsilon }_q \sim {\mathcal {N}}(0, \sigma ^2 {\mathbf {I}})\) for \(q = 1, \dots , Q\) in Eq. 3 as \(\sigma ^2 \rightarrow 0\)). When \({\mathbf {B}}\) is assumed to have full rank (that is, \(R = \min (P, Q)\)) the minimization admits the ordinary least squares (OLS) solution:
$$\begin{aligned} \widehat{{\mathbf {B}}}_{OLS} = ({\mathbf {X}}^\top {\mathbf {X}})^{-1} {\mathbf {X}}^\top {\mathbf {Y}}. \end{aligned}$$
When \(R < \min (P, Q)\), the RRR model has an eigenvalue solution:
$$\begin{aligned} \widehat{{\mathbf {B}}}_{RRR} = \widehat{{\mathbf {B}}}_{OLS} {\mathbf {U}}_{1:R} {\mathbf {U}}_{1:R}^\top , \end{aligned}$$
where \({\mathbf {X}} \widehat{{\mathbf {B}}}_{OLS} = {\mathbf {U}}{\mathbf {D}}{\mathbf {V}}^\top\) is the SVD of the fitted values, and \({\mathbf {U}}_{1:R} = [{\mathbf {u}}_1, \cdots , {\mathbf {u}}_R]\) contains the leading R left singular vectors of \({\mathbf {X}} \widehat{{\mathbf {B}}}_{OLS}\).
Sparse approaches to RRR have been proposed as well. Sparsity in the decomposition leads to greater interpretability by including nonzero weights only on a subset of the covariates and genes for any component. One model [50] imposes sparsity on the coefficient matrix \({\mathbf {B}}\) by taking an iterative approach to estimation, solving both a sparse regression problem and the reduced-rank decomposition in alternating frames. The base algorithm solves the following optimization problem:
$$\begin{aligned} \min _{{\mathbf {U}}, {\mathbf {V}}} \frac{1}{2} \Vert {\mathbf {Y}}-{\mathbf {X}}{\mathbf {U}}{\mathbf {V}}^\top \Vert _2^2 + \lambda \sum ^{p}_{j=1}\Vert {\mathbf {U}}_j\Vert _2, \end{aligned}$$
where \({\mathbf {U}} \in {\mathbb {R}}^{P \times R}\), \({\mathbf {V}} \in {\mathbb {R}}^{Q \times R}\), \({\mathbf {U}}_p\) represents the pth row of \({\mathbf {U}}\), the rank R is specified by the modeler, and \(\lambda\) is a sparsity penalty parameter. The alternating minimization problem can be broken into two steps: optimizing \({\mathbf {U}}\), and optimizing \({\mathbf {V}}\). After parameter initialization on iteration \(\ell = 1\), on iteration \(\ell = 2, \dots , L\), the algorithm first solves an orthogonal Procrustes problem for \({\mathbf {V}}\):
$$\begin{aligned} {\mathbf {V}}^{(\ell )} = \mathop {\mathrm {arg\,min}}\limits _{{\mathbf {V}}:{\mathbf {V}}{\mathbf {V}}^\top ={\mathbf {I}}} \Vert {\mathbf {Y}} - {\mathbf {X}}{\mathbf {U}}^{(\ell - 1)}{\mathbf {V}}^\top \Vert _2^2, \end{aligned}$$
(4)
where \({\mathbf {U}}^{(\ell - 1)}\) is the estimate of \({\mathbf {U}}\) from the previous iteration. The algorithm then solves a group lasso problem for \({\mathbf {U}}\):
$$\begin{aligned} {\mathbf {U}}^{\ell } = \mathop {\mathrm {arg\,min}}\limits _{{\mathbf {U}}} \frac{1}{2} \Vert {\mathbf {Y}}{\mathbf {V}}^{(\ell )}-{\mathbf {X}}{\mathbf {U}}\Vert _2^2 + \lambda \sum ^{P}_{p=1} \Vert {\mathbf {U}}_j\Vert _2. \end{aligned}$$
(5)
Equation (4) can be solved using a singular value decomposition, and Eq. (5) can be solved using techniques for group lasso [51]. These two steps are repeated for L steps or until convergence.
Another approach developed a Bayesian RRR framework for association mapping in the GWAS setting [52]. The model—called Bayesian Extendable Reduced-Rank Regression (BERRRI)—uses a nonparametric Indian Buffet Process prior for the latent factors, which allows the rank k to be estimated from the data. BERRRI then uses a variational Bayes approximation to the posterior for inference of the model parameters. However, BERRRI does not explicitly model count-based data, and its inference procedure is not computationally tractable for genome-scale analyses.
The linear RRR model has been generalized to nonlinear functions as well. The most popular nonlinear approaches have used neural networks with multiple inputs and multiple outputs [53]. The linear RRR model is equivalent to a single-layer multi-layer perceptron with only linear transformations between layers [54, 55]. This model can be extended to the nonlinear case by including nonlinear activation functions [54, 56]. However, these models typically to do not capture count data and lack the interpretability of linear models for downstream association testing.
In this manuscript, we propose a statistical model and associated computational framework that addresses the problems that arise with modeling genotype-phenotype associations for high-dimensional phenotypes captured with count data. We propose a reduced-rank regression model that finds low-dimensional associations between genotypes (or other high-dimensional covariates) and RNA-seq data (or other high-dimensional count-based phenotypes). Relying on low-dimensional associations alleviates the problem of estimating millions of pairwise associations. Furthermore, our model uses count-based likelihoods that allow both single-cell RNA-sequencing data and bulk RNA-sequencing data. We show that our approach appropriately models gene expression data with count-based likelihoods, leads to interpretable subsets of genes and genetic variants or other covariates in each dimension, and uses flexible, computationally tractable inference methods that allow for uncertainty quantification.