 Research
 Open Access
 Published:
A Poisson reducedrank regression model for association mapping in sequencing data
BMC Bioinformatics volume 23, Article number: 529 (2022)
Abstract
Background
Singlecell RNAsequencing (scRNAseq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cellspecific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, countbased scRNAseq data pose challenges for traditional models based on Gaussian assumptions.
Results
We aim to resolve these issues by developing a reducedrank regression model that identifies lowdimensional linear associations between a large number of cellspecific covariates and highdimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNAseq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNAseq dataset, a spatial gene expression dataset, and a bulk RNAseq dataset to show its behavior in three distinct analyses.
Conclusion
We show that our statistical modeling approach, which is based on reducedrank regression, captures associations between gene expression and cell and samplespecific covariates by leveraging lowdimensional representations of transcriptional states.
Background
Recent advances in highthroughput genomic assays have allowed for the creation of expansive data sets that are useful for exploring biological variation across cells. In particular, singlecell RNAsequencing (scRNAseq) technologies provide gene expression measurements at the individual cell level, allowing for the analysis of variation in transcriptional activity across cells within a single sample [1,2,3,4]. While characterizing this variation is useful by itself for exploratory analysis, it is also of interest to study in a more targeted way how variation relates to cellspecific covariates, such as cell type, genotype, and cell health. Studying associations between gene expression and properties of single cells has the potential to enrich our understanding of the relationship between these covariates and transcription at singlecell resolution.
While methods for association studies have been widely developed for bulk RNAsequencing (RNAseq) data [5, 6], methods for studying associations on the level of individual cells are much less developed. Moreover, there are several unique challenges in manipulating and analyzing the data generated by these singlecell assays, as compared to bulk RNAseq assays. These scRNAseq data sets are high dimensional—there are tens of thousands of genes in the human genome—making them difficult to interpret genebygene; furthermore, the countbased nature of the data—made up of counts of sequenced RNA fragments that map to a specific gene in a genome to approximate expression levels of that gene—presents a challenge for many standard statistical tools with Gaussian assumptions [7].
In this paper, we propose a statistical modeling approach based on reducedrank regression that captures associations between gene expression and cell and samplespecific covariates by leveraging lowdimensional representations of transcription. Within this framework, we propose two specific models: Poisson reducedrank regression (PRRR), which adapts a generalized linear model to the reduced rank setting, and nonnegative Poisson reducedrank regression (nnPRRR), which provides interpretable nonnegative regression components. In what follows, we first review several related threads of research and describe our modeling approach. Then, using simulated data and singlecell RNAseq data, spatial gene expression data, and bulk RNAseq data, we show that our models are useful for a wide range of association study types, including studying the transcriptional hallmarks of cell types, genes correlated with disease status, and expression QTLs.
Genomewide association studies
Since the completion of the Human Genome Project in 2003 [8] and the HapMap project in 2005 [9], researchers have developed the genomic and statistical tools necessary to study the human genome at a large scale in order to better detect, treat, and prevent diseases. Genomewide association studies (GWAS) are used to identify diseasecausing genetic variation across complete genomes. Genetic variation often comes in the form of single nucleotide polymorphisms (SNPs) that can be compared between healthy patients and patients with a disease [10]. GWAS have found a plethora of genetic variants that are associated with common diseases such as asthma, type 2 diabetes, and more [11, 12].
In a similar vein, quantitative trait loci (QTL) studies identify associations between genetic variants and quantitative phenotypes [13, 14]. A common experimental setup is to use gene expression levels as the quantitative phenotype, in which case the association is referred to as an expression QTL (eQTL) [15, 16]. Most eQTL studies have relied on bulk RNAseq technologies to measure the gene expression levels of samples from tissues with heterogeneous cell types [17,18,19].
In this context, the statistical eQTL problem is to estimate the pairwise association between a set of genetic variants (the covariates or explanatory variables) and the expression level of each gene (the outcome variables). This is typically performed using a linear regression model. In particular, let \({\mathbf {X}}\) be an \(N\times P\) matrix containing information about genetic variants across P SNPs for N individuals or tissue samples, and let \({\mathbf {Y}} \in {\mathbb {R}}^{N \times Q}\) be a matrix of corresponding gene expression levels across Q genes in these individuals or tissues [7]. Generically, these approaches use a model of the form
where \({\mathbf {x}}_{\cdot p}\) is the pth column of \({\mathbf {X}}\), \({\mathbf {y}}_{\cdot q}\) is the qth column of \({\mathbf {Y}}\), \(\varvec{\epsilon } \in {\mathbb {R}}^{N}\) is a vector of independent zeromean Gaussiandistributed noise terms, and \(b_{pq} \in {\mathbb {R}}\) is a scalar coefficient representing the linear relationship between SNP p and gene q for \(p = 1, \dots , P\) and \(q = 1,\dots , Q\). Downstream tests for significance can be performed on these coefficients to identify associations [20,21,22]. Without further assumptions, this model estimates the marginal association between single SNPs and single genes independently. To accommodate polygenic contributions to phenotypes, multivariate models of the form
have been considered, where \({\mathbf {B}} \in {\mathbb {R}}^{P \times Q}\) is a matrix of coefficients [23,24,25]. Under this framework, sparsityinducing priors for \({\mathbf {B}}\) have been proposed in order to scale these models to highdimensional genotype data [26, 27].
The advent of scRNAseq technologies has opened the door for narrowing the investigation of genotypephenotype relationships from the level of whole tissues to the level of individual cells. However, existing computational tools are insufficient for this purpose: they typically do not accommodate countbased data, and they are seldom robust to highdimensional outcome variables. It is difficult to control the hypothesis testing error rate in many eQTL analyses, which run millions to trillions of univariate association hypothesis tests (one for each SNPgene pair) [7, 18, 19, 28, 29].
Countbased models
A further drawback of existing association testing frameworks is their assumption of Gaussiandistributed data. Most canonical regression models assume an independent normallydistributed response variable, with \(\varvec{\epsilon } \sim {\mathcal {N}}({\mathbf {0}}, \sigma ^2 {\mathbf {I}}_N)\) in Eq. (1). However, when the data consist of countbased measurements, as for RNAseq data, this assumption may be problematic. Various transformations have been proposed to make the response variable approximately Gaussian [30, 31], but these transformations are known to distort the data distribution in undesirable ways [32,33,34]. Countbased scRNAseq data is discrete and nonnegative, with many gene expression counts having a value of zero. The sparsity of the data poses an additional challenge to these standard transformations [32].
An alternative to this approach is to model the gene expression data with a discrete distribution. A common choice is the Poisson distribution, whose support is restricted to the nonnegative integers and has been shown to improve the representation and interpretation of scRNAseq data when fitting statistical models [32, 35]. A recent approach using a Poisson data likelihood proposed a naive Bayes model that assigns celltype identities to samples in scRNAseq data based on reference data [36]. The model uses a Poisson distribution to represent the countbased data, but the high number of zeros in the data still poses a challenge. The sparsity of the data interferes with standard estimates such as maximum likelihood estimates as rates of zero can be produced for thousands of genes, making the model sensitive to genes with low expression counts [36]. The model handles this challenge by introducing a hierarchical structure, placing posterior distributions on parameters in order to recover nonzero rate estimates for genes with zero counts in the reference data. However, the naive Bayes model also assumes independence between genes, although this assumption does not hold in practice, as expression has been observed to be correlated between genes [32, 37, 38].
Modeling multiple data modalities
Latent variable modeling approaches have also been proposed for modeling multiview data. The most popular approach has been canonical correlation analysis (CCA, [39]) and its probabilistic variants [40,41,42]. CCA seeks a lowdimensional linear mapping of two paired data matrices such that the resulting lowdimensional projections of both matrices are maximally correlated.
where \(\rho\) is the Pearson correlation function. The probabilistic version of this model projects the features of each data modality into a shared lowdimensional latent space, assuming heteroskedastic residual errors, maximizing the amount of variance explained in the data modalities by the latent subspace. The weights, or factor loadings, in CCA models allow us to identify covarying features across data modalities. A formal connection between CCA and reducedrank regression has been shown [43], where the canonical subspace found by CCA is the same as the subspace of the maximum likelihood estimator for the reducedrank regression model. Despite their connections, the unsupervised nature of CCA does not lend itself directly to association mapping between the data modalities. Conversely, reducedrank regression has a natural association testing framework because of its regression foundation.
Recently, a latent variable model based on latent Dirichlet allocation [44, 45] for jointly modeling gene expression and genotype was proposed [46]. This model projected both genotype data—using an equivalent of the Structure model [45]—and countbased gene expression data—using a telescoping LDA model [44]—onto a shared latent subspace; we may then identify covarying genes and genotypes in a nonnegative latent representation. But discovering associations in this framework requires association testing in heldout data, which is limited by existing univariate methods and population data.
Reducedrank regression approaches
The transcriptional states of cells tend to exhibit strong correlations between genes [47]. Thus, it is likely that the relationship between cell covariates and transcriptional phenotypes in scRNAseq data need not be modeled genebygene. Rather, it is reasonable to assume that these associations exhibit a lowdimensional structure. Furthermore, treating each gene as independent is computationally and statistically inefficient; we would like to exploit these relationships to perform fewer association tests and leverage shared variation to improve statistical power in these often small sample sizes. These ideas motivate a regression model whose coefficient matrix has low rank. Several approaches to reducedrank regression have been developed to take advantage of this opportunity.
Consider again the linear regression model in Eq. (2). Here, \({\mathbf {B}}\) is a \(P\times Q\) matrix of regression coefficients, where P is the number of covariates, and Q is the number of genes. In most gene expression studies, Q (and sometimes P) is large, and \(\min (P, Q) \gg n\). The core assumption of reducedrank regression (RRR) is that the matrix \({\mathbf {B}}\) has low rank [48]. In particular, the RRR model assumes \({\mathbf {B}}\) has rank \(R \ll \min (P, Q)\). This implies that \({\mathbf {B}}\) can be factorized as an outer product of two lowrank matrices, giving us the following reducedrank regression model:
where \({\mathbf {U}} \in {\mathbb {R}}^{P \times R}\) and \({\mathbf {V}} \in {\mathbb {R}}^{Q \times R}\). In the context of gene expression studies, this lowrank assumption implies that the relationship between cellspecific covariates and gene expression can be described in terms of a small set of latent factors. In other words, variance in gene expression is mediated by R different programs encoded in subsets of covariates; then \({\mathbf {B}}\) captures both the covariates of interest and their effect sizes within each of the R programs.
Several estimation approaches have been proposed for RRR under the assumption of Gaussian noise. A common method is to find the parameter values that minimize the squared reconstruction error [48, 49]:
This approach corresponds to finding the maximum likelihood solution of an RRR model with Gaussian errors (\(\varvec{\epsilon }_q \sim {\mathcal {N}}(0, \sigma ^2 {\mathbf {I}})\) for \(q = 1, \dots , Q\) in Eq. 3 as \(\sigma ^2 \rightarrow 0\)). When \({\mathbf {B}}\) is assumed to have full rank (that is, \(R = \min (P, Q)\)) the minimization admits the ordinary least squares (OLS) solution:
When \(R < \min (P, Q)\), the RRR model has an eigenvalue solution:
where \({\mathbf {X}} \widehat{{\mathbf {B}}}_{OLS} = {\mathbf {U}}{\mathbf {D}}{\mathbf {V}}^\top\) is the SVD of the fitted values, and \({\mathbf {U}}_{1:R} = [{\mathbf {u}}_1, \cdots , {\mathbf {u}}_R]\) contains the leading R left singular vectors of \({\mathbf {X}} \widehat{{\mathbf {B}}}_{OLS}\).
Sparse approaches to RRR have been proposed as well. Sparsity in the decomposition leads to greater interpretability by including nonzero weights only on a subset of the covariates and genes for any component. One model [50] imposes sparsity on the coefficient matrix \({\mathbf {B}}\) by taking an iterative approach to estimation, solving both a sparse regression problem and the reducedrank decomposition in alternating frames. The base algorithm solves the following optimization problem:
where \({\mathbf {U}} \in {\mathbb {R}}^{P \times R}\), \({\mathbf {V}} \in {\mathbb {R}}^{Q \times R}\), \({\mathbf {U}}_p\) represents the pth row of \({\mathbf {U}}\), the rank R is specified by the modeler, and \(\lambda\) is a sparsity penalty parameter. The alternating minimization problem can be broken into two steps: optimizing \({\mathbf {U}}\), and optimizing \({\mathbf {V}}\). After parameter initialization on iteration \(\ell = 1\), on iteration \(\ell = 2, \dots , L\), the algorithm first solves an orthogonal Procrustes problem for \({\mathbf {V}}\):
where \({\mathbf {U}}^{(\ell  1)}\) is the estimate of \({\mathbf {U}}\) from the previous iteration. The algorithm then solves a group lasso problem for \({\mathbf {U}}\):
Equation (4) can be solved using a singular value decomposition, and Eq. (5) can be solved using techniques for group lasso [51]. These two steps are repeated for L steps or until convergence.
Another approach developed a Bayesian RRR framework for association mapping in the GWAS setting [52]. The model—called Bayesian Extendable ReducedRank Regression (BERRRI)—uses a nonparametric Indian Buffet Process prior for the latent factors, which allows the rank k to be estimated from the data. BERRRI then uses a variational Bayes approximation to the posterior for inference of the model parameters. However, BERRRI does not explicitly model countbased data, and its inference procedure is not computationally tractable for genomescale analyses.
The linear RRR model has been generalized to nonlinear functions as well. The most popular nonlinear approaches have used neural networks with multiple inputs and multiple outputs [53]. The linear RRR model is equivalent to a singlelayer multilayer perceptron with only linear transformations between layers [54, 55]. This model can be extended to the nonlinear case by including nonlinear activation functions [54, 56]. However, these models typically to do not capture count data and lack the interpretability of linear models for downstream association testing.
In this manuscript, we propose a statistical model and associated computational framework that addresses the problems that arise with modeling genotypephenotype associations for highdimensional phenotypes captured with count data. We propose a reducedrank regression model that finds lowdimensional associations between genotypes (or other highdimensional covariates) and RNAseq data (or other highdimensional countbased phenotypes). Relying on lowdimensional associations alleviates the problem of estimating millions of pairwise associations. Furthermore, our model uses countbased likelihoods that allow both singlecell RNAsequencing data and bulk RNAsequencing data. We show that our approach appropriately models gene expression data with countbased likelihoods, leads to interpretable subsets of genes and genetic variants or other covariates in each dimension, and uses flexible, computationally tractable inference methods that allow for uncertainty quantification.
Methods
We propose a probabilistic reducedrank regression model with a Poisson data likelihood—which we call Poisson reducedrank regression (PRRR)—for association mapping in countbased sequencing data. Our approach takes the form of a reducedrank regression model with intermediate factors that explicitly model countbased data using a Poisson likelihood. These factors are interpretable and can be used to to identify and analyze the global structure of associations between cell covariates and cell phenotypes, such as gene expression levels. We ensure that inference is tractable and efficient in these models by using stochastic variational inference.
Poisson reducedrank regression (PRRR)
PRRR is designed to identify associations between cellspecific covariates and highdimensional gene expression profiles. The response matrix \({\mathbf {Y}} \in {\mathbb {N}}_0^{N \times Q}\) is a matrix containing (in this application) RNA transcript counts for Q genes in N cells, where \({\mathbb {N}}_0 = {\mathbb {N}}\cup 0\). The \(N \times P\) matrix \({\mathbf {X}}\) is a design matrix containing covariates for each cell. For example, these covariates could represent cell type, genotype, or measures of cell health.
PRRR uses a Poisson likelihood to model the transcript counts for each cell as the response variables, conditional on observed cellspecific covariates. The Poisson rate is parameterized by a lowrank linear mapping from the cell covariates.
Specifically, the transcript count of gene p in cell n, denoted by \(y_{np}\) is modeled as a draw from a Poisson distribution, \(y_{np} \sim \text {Poisson}(\lambda _{np})\). The Poisson rate \(\lambda _{np}\) is determined by a linear function of the vector of covariates for cell n, denoted as \({\mathbf {x}}_n\). We use a canonical link function from the exponential family to map the domain of the latent variables to the positive real line—similar to a GLM approach. In particular, we use a \(\log\) link function to ensure that, when pushed through the inverse link—the \(\exp\) function—the mapped linear predictor, or the Poisson rate parameter, lies in \({\mathbb {R}}_+\). The likelihood model is then
where \({\mathbf {v}}_{p\cdot }\) is the pth row of \({\mathbf {V}}\). We place Gaussian priors on columns of \({\mathbf {U}}\) and \({\mathbf {V}}\):
for \(R = 1, \dots , R\). Intuitively, \({\mathbf {U}}\) and \({\mathbf {V}}\) capture the lowrank associations between \({\mathbf {X}}\) and \({\mathbf {Y}}\).
Nonnegative PRRR
In some cases, the covariates \({\mathbf {X}}\) are entirely nonnegative — possibly representing counts or categories—in which case it may be of interest to identify nonnegative, lowrank regression coefficients that explain the associations in a completely additive fashion. For example, in eQTL mapping, the covariates are typically the count of the minor allele for each SNP, where \({\mathbf {x}}_n \in \{0, 1, 2\}\), and it may be of interest to identify a nonnegative, “partsbased” combination of SNPs that explain phenotypic variation. For these cases, we propose nonnegative Poisson reducedrank regression (nnPRRR), whose likelihood is given by
where \(s_n\) is a cellspecific size factor modeling the total number of transcripts in cell i. We fix \(s_n\) to be the total number of transcript counts in cell, \(s_n = \sum _{j=1}^p y_{np}\). We place Gamma priors on the elements of \({\mathbf {U}}\) and \({\mathbf {V}}\):
for \(p = 1,\dots ,P\), \(q = 1, \dots , Q\), and \(r = 1, \dots , R\). For all experiments, we set \(\alpha _u=\alpha _v=2\) and \(\beta _u=\beta _v=1\). Thus, the structures of the PRRR and nnPRRR models are similar except for the nonnegativity constraint and associated prior (Additional file 1: Fig. S1).
Choosing between PRRR and nnPRRR
While the PRRR and nnPRRR models are designed for the same goal — performing reducedrank regression with highdimensional, countbased outcomes—some care is required when choosing which model to apply for a particular application. The choice largely depends on the goal of the analysis. If a primary goal is to examine the lowdimensional latent factors in a dataset, then nnPRRR is often preferable because its nonnegative factors encourage a partsbased representation, which may be easier to interpret. nnPRRR also requires that the covariates are nonnegative. On the other hand, when the primary goal is prediction, PRRR may be preferable because its realvalued factors will be less constrained due to the removal of the nonnegativity constraint. However, these are only guidelines and not hard restrictions, and often it may be preferable to fit both PRRR and nnPRRR to a dataset, when possible, and study both sets of results to select the one with the more appropriate behavior.
Estimation and inference
We propose two approaches to fit our model to data: 1) computing a point estimate for the coefficients using maximum a posteriori (MAP) methods and 2) full Bayesian posterior inference for the regression coefficients using an approximate inference procedure.
MAP estimation
The MAP solution in our model is the coefficient matrices \({\mathbf {U}}_{MAP}, {\mathbf {V}}_{MAP}\) with maximum posterior probability given the data \({\mathbf {X}}\) and \({\mathbf {Y}}\). In particular,
Expanding the posterior with Bayes’ rule, we can write the MAP objective as \(\max _{{\mathbf {U}}, {\mathbf {V}}} p({\mathbf {Y}}  {\mathbf {X}}, {\mathbf {U}}, {\mathbf {V}}) p({\mathbf {U}}, {\mathbf {V}}) / Z,\) where Z is a normalizing constant that does not depend on \({\mathbf {U}}\) or \({\mathbf {V}}\). Taking a \(\log\), dropping the constant Z, and leveraging the i.i.d. assumption, we arrive at our final objective for the MAP estimate of the model parameters:
Although this maximization problem does not have a closedform solution, we use gradientbased methods to iteratively maximize this log posterior with respect to \({\mathbf {U}}\) and \({\mathbf {V}}\).
Variational inference
A fully Bayesian approach to inference, given a set of samples with paired cell covariates and transcript counts, \(\{({\mathbf {x}}_n, {\mathbf {y}}_n)\}_{n=1}^N\), would compute the posterior distribution of the parameters, \({\mathbf {U}}\) and \({\mathbf {V}}\), given the data matrices \({\mathbf {X}}\) and \({\mathbf {Y}}\). By Bayes’ rule,
However, the marginal likelihood, \(p({\mathbf {X}}, {\mathbf {Y}})\), contains an intractable integral,
To circumvent this issue, we use a variational approximation to the posterior. Specifically, we use a meanfield variational approximation, \(p({\mathbf {U}}, {\mathbf {V}}) \approx q({\mathbf {U}}, {\mathbf {V}}) = q_1({\mathbf {U}})q_2({\mathbf {V}})\), where \(q_1\) and \(q_2\) are the variational distributions. Here, we specify the variational families for PRRR and nnPRRR to be normal and log normal, respectively,
We minimize the KL divergence between the true posterior and the variational approximation, which is equivalent to maximizing a lower bound on the model evidence (called the ELBO). This lower bound for PRRR is given by
We maximize this lower bound with respect to the variational parameters using stochastic variational inference [57] as implemented in TensorFlow Probability [58]. For all experiments, we use the Adam optimizer [59] with a learning rate of 0.01.
Results
Simulation experiments
We first demonstrate the use cases of PRRR and test the robustness and accuracy of our model using simulated data.
PRRR identifies lowdimensional association maps
We first sought to confirm that PRRR identifies the lowdimensional relationships between covariates and outcomes.
To start in a setting that can be visualized, we generated a synthetic dataset in which the covariates and outcomes are both twodimensional. Specifically, we sampled data from the generative model (Eqs. 8, 9), setting \(R=1\). We forced a correlation between the covariates and outcomes. We found that PRRR could reliably detect the onedimensional association between \({\mathbf {X}}\) and \({\mathbf {Y}}\) (Fig. 1). Moreover, we are able to recover a quantification of the relationship between the covariates and outcomes, and visualize this relationship in the lowdimensional space.
We next extended this simulation study and visualization to a smallscale synthetic eQTL study. We generated \(N = 200\) synthetic genotypes based on minor allele counts, \({\mathbf {x}}_n \in \{0, 1, 2\}\), and sampled synthetic RNA transcript counts using the PRRR generative model, with \(P=Q=2\) for visualization. We fit PRRR to these data and inspected the fitted coefficients. We found that PRRR recovered these genotypeexpression relationships, and allowed for both inspection of the lowdimensional structure of these relationships, as well as investigating univariate relationships (Fig. 2). This experiment suggests that PRRR may be useful to perform eQTL mapping.
PRRR identifies the optimal rank and is robust to misspecification
PRRR, like other reducedrank regression approaches, requires selecting the rank R of the coefficient matrix. A common approach is to evaluate the goodnessoffit of the model at varying values of R, and select the one with the best fit to the data. To test whether this is feasible with PRRR, we use synthetic data that was generated from PRRR’s generative model with true rank \(R^\star =3\). We then fit the model with \(R \in \{1, 2, \dots , 10\}\) and compute the ELBO for each fit. We repeated this experiment 30 times for each value of R.
We find that the PRRR ELBO peaked at the true value of \(R=3\) (Fig. 3a), demonstrating that the model’s fit to the data was best at the true rank \(R^{\star }\). Moreover, we found that, while the goodnessoffit sharply degraded for models with \(R < R^\star\), the goodnessoffit declined more slowly for models with \(R > R^\star\). This finding confirms similar observations from previous studies [50], and suggests that setting the rank to be higher than anticipated is protective against model misspecification.
PRRR is robust to data dimension
We next sought to validate the robustness of PRRR in the presence of higherdimensional data. To do so, we generated three datasets from the PRRR model, each with a different number of response features (genes), \(Q \in \{10, 50, 100\}\). We set the sample size to \(N = 200\), and we randomly selected \(80\%\) of these samples to fit the model and held out the remaining \(20\%\) to test the model. We fit PRRR on the training data using our MAP estimation procedure and used the estimated parameters to compute the predicted Poisson rate for the heldout data, \(\widehat{\lambda }_{np} = \exp ({\mathbf {x}}_n \widehat{{\mathbf {U}}}\widehat{{\mathbf {v}}}_{p\cdot }^\top\). We then computed the goodnessoffit \(R^2\) measure between our predictions and the heldout dataset’s counts. To benchmark these predictions, we compared PRRR’s predictive performance to two competing methods: a fullrank version of PRRR and a multioutput LASSO model, as implemented in the R package glmnet [51]. We repeated this experiment ten times for each method and each datagenerating condition.
We found that PRRR reliably achieves good predictive performance across all values of the outcome dimension Q (Fig. 3b). In contrast, the fullrank and LASSO models performed worse.
To further validate PRRR and nnPRRR on different data types, we conducted a similar experiment with synthetic data generated using Splatter [60], a data simulator designed specifically for singlecell count data. We generated data for \(N=200\) samples, each belonging to one of 10 groups, and used \(Q=100\) genes. We used the onehot encoded group labels as the covariates and the synthetic gene expression as the response. We fit PRRR and nnPRRR with \(R=5\), along with Gaussian RRR and fullrank Poisson regression, and we reserved a holdout set for evaluating predictions. We found that PRRR and nnPRRR outperformed competing models in terms of their predictive \(R^2\) values (Fig. 3c).
These results suggest that accounting for the countbased data and the lowrank structure of associations is vital, and that the PRRR model successfully captures this structure.
PRRR predictions are robust to rank misspecification
To further explore the role of rank specification in our model, we performed a prediction experiment for varying settings of the rank. We generated synthetic data as before with \(R^\star = 3\) and fit the model on \(80\%\) of the data while reserving \(20\%\) for testing. For a range of ranks, \(R \in \{1, 2, 3, 4, 5, 10, 20\}\), we fit PRRR, used the fitted model to make predictions for the heldout data, and computed the \(R^2\) coefficient of determination for these predictions. We performed this experiment using both maximum a posteriori (MAP) estimation and variational inference to fit the model, repeating the experiment ten times for each rank in both cases.
We found that the predictive performance was strongest when the model was correctly specified (\(R = 3\) in this case; Fig. 4a, b). However, we observed that performance was strong across a range of misspecified ranks as well. Similar to our previous experiment, we observed that predictions were more robust for models with \(R > R^\star\) as compared to models with \(R < R^\star\).
To benchmark these predictions, we compared PRRR’s predictive performance to two competing methods: a reducedrank regression that assumes a Gaussian likelihood [48] and the multioutput LASSO model [51]. We performed a similar prediction experiment as above, computing the \(R^2\) for each model under a range of rank specifications. We found that PRRR outperformed the two competing methods in a range around the true rank \(R^\star = 3\) (Fig. 4c).
Characterizing transcriptional hallmarks of pancreatic cell types
It has been widely observed that cell type is a major driver of transcriptional variation between cells in a variety of tissue types [4, 61,62,63]. Given these observed differences between cell types, it is of interest to identify the gene expression patterns that are characteristic of each cell type. Our PRRR models present a principled approach for identifying these transcriptional hallmarks of cell type by finding a lowdimensional mapping from cell type to expression.
To test this, we fit PRRR to an scRNAseq dataset containing 1,578 cells that span \(C=14\) unique cell types in the human pancreas [64]. For cell n, we encode its cell type as a onehot vector \({\mathbf {x}}_n \in \{0, 1\}^{C}\), and we treat the response variable \({\mathbf {y}}_n\) as the vector of RNA transcript counts in this cell. We extracted the coefficient matrices \({\mathbf {U}}\) and \({\mathbf {V}}\) and studied their properties. For comparison, we also fit BERRRI [52] and a multilayer, multioutput neural network [54, 55]. For BERRRI, we set an upper limit of 13 latent factors, and set the rest of the parameters to their default settings. For the neural network, we included two hidden layers and used the Adam optimizer [59].
We found that PRRR was able to identify transcriptional markers in each cell type. Among the 14 unique cell types present in the dataset, there are five that belong to the family of islet cells (alpha, beta, gamma, delta, and epsilon cells). Given their functional relatedness, these cell types are expected to show similar gene expression patterns compared to patterns found in other cell types. The competing methods, BERRRI and a neural network, were unable to separate islet and nonislet cell types in their latent spaces (Additional file 2: Fig. S2).
Indeed, inspecting PRRR’s estimated coefficients, we find that the model captures the lowdimensional gene expression patterns in islet cells (Fig. 5). We performed a hierarchical clustering on the PRRR coefficients, which revealed that the islet cells clustered together (Fig. 5). We found a similar clustering after fitting nnPRRR on the same dataset (Additional file 3: Fig. S3). Moreover, we observed that the models separated islet cell types from nonislet cell types in the lowdimensional space (Fig. 6). Examining the factor loadings for each cell type, we found that specific factors were especially enriched for islet or nonislet cell types (Fig. 7). The isletrelated factors were enriched for pancreatic gene pathways, such as pancreas beta cells.
We also extracted the top genes for each cell type from the full PRRR coefficient matrix; these genes can be viewed as “marker genes” whose expression is correlated with certain cell type identities. We found that these marker genes correspond to celltypespecific transcription patterns, such as REG1B being the top marker gene for acinar cells (Additional file 4: Fig. S4, Additional file 5: Fig. S5).
These findings imply that the lowdimensional space may be used to compare and contrast existing classifications within the data, and to discover possible new relationships among covariates and phenotypes. They also show that PRRR is able to identify the distinct transcriptional characteristics of specific cell types and groups of cell types, and that PRRR could be used to identify marker genes for cell types.
Analyzing gene expression patterns in spatial datasets
Along with cell type, the physical organization of cells within a tissue has a strong impact on gene expression due to tissue organization structure and cellcell communication. The rise of spatial gene expression profiling technologies provides an opportunity to study how gene expression levels vary spatially across a tissue [65,66,67,68]. In particular, given gene expression data at the individual cell level with appropriate spatial context, it is of interest to identify how the expression of specific genes varies across the expanse of a tissue.
To study this with our modeling framework, we fit PRRR with rank \(R=1\) to a twodimensional spatial dataset containing 2,063 mouse brain sagittal anterior cells [69]. The \({\mathbf {X}}\) matrix is an \(N \times 2\) matrix containing twodimensional spatial coordinates for each cell n, and we treat the response variable \({\mathbf {Y}}\) as the matrix of RNA transcript counts. After fitting PRRR, we extracted the model coefficients to inspect the spatial trends in gene expression that it identified.
We found that PRRR is able to identify trends in gene expression along one latent dimension. While the model is constrained to only identify linear changes in gene expression across space, it is able to identify a general trend in increased gene expression for individual genes such as TTR and FABP7 (Fig. 8). This result demonstrates the utility of our model in the context of spatial genomics and further demonstrates the versatility of PRRR.
eQTL mapping
eQTL mapping is a common approach to finding associations between genotypes and gene expression profiles. However, this type of association mapping requires fitting a regression between millions of genotype variants and the expression of tens of thousands of genes, resulting in billions of univariate models [18]. The large number of univariate tests can cause these approaches to be prohibitive computationally and to lack sufficient statistical power.
We hypothesized that our reducedrank regression model could alleviate these issues of computational tractability and statistical power. To test this, we applied PRRR to an eQTL mapping setting to find a set of lowdimensional factors that capture the relationships between genotype and gene expression. To do so, we used data from the Genotype Tissue Expression (GTEx) Consortium [19]. For this experiment, we focused on data collected from liver tissues from 227 donors. For each donor, the data consist of paired genotype (as encoded by minor allele count \({\mathbf {x}}_n \in \{0, 1, 2\}\)) and bulk gene expression profiles. For our analysis, we used genotype data from chromosome 1 (\(p=18,892\)) and expression levels for the top \(q=5,000\) most variable genes.
We fit PRRR with \(R=10\) latent dimensions and extracted the lowrank regression coefficient matrix (Fig. 9a, b). Within each factor, we can examine associations between individual genetic variants and the expression of individual genes. For factor number r, we do this by taking the outer product of the corresponding columns of \({\mathbf {U}}\) and \({\mathbf {V}}\), respectively. That is, we compute \({\mathbf {u}}_r {\mathbf {v}}_r^\top\) and examine the strongest SNPgene relationships (Fig. 10). Investigating the latent factors more closely, we found several meaningful associations. For example, after performing a gene set enrichment analysis (Table 1), we found that the gene expression loadings for factor 9 were enriched for genes related to interferon gamma response and inflammatory response (Fig. 9c), two major functional roles of liver cells [70, 71]. We find similar results for nnPRRR, although the \({\mathbf {V}}\) matrix is much more sparse (Additional file 6: Fig. S6). The additional sparsity in the nnPRRR results aligns well with the partsbased representation of the lowdimensional space known with nonnegative matrix factorizations [72,73,74]. This experiment suggests that the PRRR and nnRRR framework may be useful for studying associations between genotypes and phenotypes, especially when there is lowdimensional correlation structure between the datasets and within each dataset individually.
Discussion
In this paper, we present two reducedrank regression models and associated variational inference approaches—Poisson RRR (PRRR) and nonnegative Poisson RRR (nnPRRR)—to jointly model associations within two highdimensional paired sets of features where the response variables are counts. In simulations, PRRR and nnPRRR are able to effectively capture associations between paired highdimensional data. Moreover, we show that these models can identify the optimal rank for the parameter matrix. In the context of sequencing data, we find that PRRR and nnPRRR may be used for robust identification of cell types, quantifying the relationships between cell types, and performing association mapping of genetic variants to correlated genes.
There are several limitations of our proposed model. First, PRRR and nnPRRR only model linear associations between inputs and outputs and are unable to account nonlinear relationships. Second, we observe that the time complexity for fitting PRRR and nnPRRR is somewhat slower than competing models (Additional file 7: Fig. S7), although the fitting time is not prohibitive. Third, a limitation of our variational inference (and any variational inference) is that the variational posterior distribution is an approximation to the full posterior and not the exact posterior. Finally, our model requires the user to specify the number of latent dimensions, which may be difficult in practice; often the user will run the method with different latent dimension values and use the results that are the most interpretable (Additional file 8).
Several extensions of the model could be considered. A nonparametric prior could allow for flexibly learning the rank of the parameter matrix, rather than requiring the rank to be prespecified, as in related work [52]. Additionally, the generalized model could be extended to different likelihood distributions. Furthermore, additional structure could be added to the latent variables, such as sparsity or a gene network [75, 76], to encode additional known structure in the covariates.
Conclusions
We present a Poisson reducedrank regression (PRRR) model, along with a nonnegative counterpart called nnPRRR, for association mapping in countbased sequencing data. PRRR is able to detect associations between a highdimensional response matrix and a highdimensional set of predictors by leveraging lowdimensional representations of the data. Using principled Bayesian modeling, PRRR is able to properly account for the countbased nature of RNA sequencing data using a Poisson likelihood. We ensure that inference is tractable and efficient in these models by applying a fast variational inference approach.
Availability of data and materials
Pancreas scRNAseq data were downloaded from GEO (GSE84133). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133. GTEx bulk gene expression and SNP data were downloaded from the GTEx portal. https://gtexportal.org/home/datasets. Spatial gene expression data were downloaded from the 10x Genomics “Datasets” page. https://www.10xgenomics.com/resources/datasets.
Code
Code is available at https://github.com/tianafitz/PRRR.
Abbreviations
 RRR:

reducedrank regression
 PRRR:

Poisson reducedrank regression
 nnPRRR:

nonnegative Poisson reducedrank regression
 eQTLs:

expression quantitative trait loci
 MLE:

Maximum likelihood estimate
 MAP:

Maximum a posteriori
References
Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, et al. mRNAseq wholetranscriptome analysis of a single cell. Nat Methods. 2009;6(5):377–82.
Sasagawa Y, Nikaido I, Hayashi T, Danno H, Uno KD, Imai T, Ueda HR. Quartzseq: a highly reproducible and sensitive singlecell RNA sequencing method, reveals nongenetic geneexpression heterogeneity. Genome Biol. 2013;14(4):1–17.
Jaitin DA, Kenigsberg E, KerenShaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, et al. Massively parallel singlecell RNAseq for markerfree decomposition of tissues into cell types. Science. 2014;343(6172):776–9.
Zeisel A, MuñozManchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, et al. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science. 2015;347(6226):1138–42.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nature Rev Genet. 2008;9(5):356–69.
Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, et al. Plink: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86(1):6–22. https://doi.org/10.1016/j.ajhg.2009.11.017.
Collins FS, Morgan M, Patrinos A. The human genome project: lessons from largescale biology. Science. 2003;300(5617):286–90.
Consortium IH, et al. The International HapMap Project. Nature. 2003;426(6968):789–96.
Bush WS, Moore JH. Chapter 11: Genomewide association studies. PLoS Computational Biology. 2012; 8(12)
Ober C, Nicolae DL. Metaanalysis of genomewide association studies of asthma in ethnically diverse North American populations. Nat Genet. 2011;43(9):887–92.
Frayling TM. Genomewide association studies provide new insights into type 2 diabetes aetiology. Nat Rev Genet. 2007;8(9):657–62.
Zeng ZB. Precision mapping of quantitative trait loci. Genetics. 1994;136(4):1457–68.
Doerge RW. Mapping and analysis of quantitative trait loci in experimental populations. Nat Rev Genet. 2002;3(1):43–52.
Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos Trans Royal Soc B Biol Sci. 2013;368(1620):20120362.
Kendziorski C, Chen M, Yuan M, Lan H, Attie AD. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics. 2006;62(1):19–27.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with rna sequencing. Nature. 2010;464(7289):768–72.
Genetic effects on gene expression across human tissues. GTEx Consortium. Nature. 2017;550:204–13.
GTEx Consortium: The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369(6509), 1318–1330 (2020).
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–75.
Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ. Simultaneous analysis of all SNPs in genomewide and resequencing association studies. PLoS Genet. 2008;4(7):1000130.
Logsdon BA, Hoffman GE, Mezey JG. A variational Bayes algorithm for fast and accurate multiple locus genomewide association analysis. BMC Bioinform. 2010;11(1):1–13.
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25(6):714–21.
Li J, Das K, Fu G, Li R, Wu R. The Bayesian lasso for genomewide association studies. Bioinformatics. 2011;27(4):516–23.
Li J, Wang Z, Li R, Wu R. Bayesian group lasso for nonparametric varyingcoefficient models with application to functional genomewide association studies. Annals Appl Stat. 2015;9(2):640.
Karczewski K, Solomonson M, Chao KR, Goodrich JK, Tiao G, Lu W, RileyGillis B, Tsai E, Kim HI, Zheng X, et al. Systematic singlevariant and genebased association testing of 3,700 phenotypes in 281,850 UK Biobank exomes. medRxiv (2021).
Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, Ganna A, Chen J, Buchkovich ML, Mora S, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45(11):1274.
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20.
Yu G. Variance stabilizing transformations of Poisson, binomial and negative binomial distributions. Stat Probab Lett. 2009;79(14):1621–9.
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for singlecell RNAseq based on a multinomial model. Genome Biology. 2019;20(1):295. https://doi.org/10.1186/s1305901918616. Accessed 02 JAN 2020.
Booeshaghi AS, Pachter L. Normalization of singlecell rnaseq counts by log (x+ 1) or log (1+ x). Bioinformatics. 2021;37(15):2223–4.
Hafemeister C, Satija R. Normalization and variance stabilization of singlecell rnaseq data using regularized negative binomial regression. Genome Biol. 2019;20(1):1–15.
Jones A, Townes FW, Li D, Engelhardt BE. Contrastive latent variable modeling with application to casecontrol sequencing experiments. arXiv preprint arXiv:2102.06731 (2021).
Grabski IN, Irizarry RA. Probabilistic gene expression signatures identify celltypes from single cell rnaseq data. bioRxiv (2020). https://doi.org/10.1101/2020.01.05.895441. https://www.biorxiv.org/content/early/2020/01/06/2020.01.05.895441.full.pdf.
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20. https://doi.org/10.1038/nbt.4096.
Van Dam S, Vosa U, van der Graaf A, Franke L, de Magalhaes JP. Gene coexpression analysis for functional classification and genedisease predictions. Brief Bioinform. 2018;19(4):575–92.
Hotelling H. Relations between two sets of variates. In: Breakthroughs in Statistics. New York: Springer; 1992. pp. 162–190.
Bach FR, Jordan MI. A probabilistic interpretation of canonical correlation analysis. Technical report.2005.
Zhao S, Gao C, Mukherjee S, Engelhardt BE. Bayesian group factor analysis with structured sparsity. The Journal of Machine Learning Research. 2016.
Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, Buettner F, Huber W, Stegle O. Multiomics factor analysisa framework for unsupervised integration of multiomics data sets. Mol Syst Biol. 2018;14(6):8124.
Tso MS. Reducedrank regression and canonical analysis. J Roy Stat Soc: Ser B (Methodol). 1981;43(2):183–9.
Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–59.
Gewirtz AD, Townes FW, Engelhardt BE. Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues. bioRxiv (2021).
Stuart JM, Segal E, Koller D, Kim SK. A genecoexpression network for global discovery of conserved genetic modules. Science. 2003;302(5643):249–55.
Anderson TW, et al. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann Math Stat. 1951;22(3):327–51.
Reinsel G, Velu R. Multivariate reducedrank regression (Lecture notes in statistcs No. 136). Berlin: Springer; 1998.
Qian J, Tanigawa Y, Li R, Tibshirani R, Rivas MA, Hastie T. Largescale sparse regression for multiple responses with applications to UK Biobank. bioRxiv. 2020. https://doi.org/10.1101/2020.05.30.125252.
Friedman J, Hastie T, Tibshirani R. A note on the group lasso and a sparse group lasso. arXiv preprint. 2010. arXiv:1001.0736.
Valente A, Ginsburg G, Engelhardt BE. Nonparametric ReducedRank Regression for MultiSNP, MultiTrait Association Mapping. 2015. arXiv:1512.02306.
Diamantaras KI, Kung SY. Multilayer neural networks for reducedrank approximation. IEEE Trans Neural Networks. 1994;5(5):684–97.
Baldi P, Hornik K. Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 1989;2(1):53–8.
Kunin D, Bloom J, Goeva A. Seed C. Loss landscapes of regularized linear autoencoders. In: International conference on machine learning. 2019; pp. 3560–3569. PMLR.
Aoyagi M, Watanabe S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005;18(7):924–33.
Hoffman MD, Blei DM, Wang C, Paisley J. Stochastic variational inference. J Mach Learn Res. 2013;14(1):1303–47.
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, Patton B, Alemi A, Hoffman M, Saurous RA. Tensorflow distributions. arXiv preprint. 2017. arXiv:1711.10604.
Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint. 2014. arXiv:1412.6980.
Zappia L, Phipson B, Oshlack A. Splatter: simulation of singlecell rna sequencing data. Genome Biol. 2017;18(1):1–15.
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.
Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, Sabeti PC. Identifying gene expression programs of celltype identity and cellular activity with singlecell RNAseq. Elife. 2019;8:43803.
Chen R, Wu X, Jiang L, Zhang Y. Singlecell RNAseq reveals hypothalamic cell diversity. Cell Rep. 2017;18(13):3227–41.
Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, Ryu JH, Wagner BK, ShenOrr SS, Klein AM, et al. A singlecell transcriptomic map of the human and mouse pancreas reveals interand intracell population structure. Cell Syst. 2016;3(4):346–60.
Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, Arlotta P, Macosko EZ, Chen F. Highly sensitive spatial transcriptomics at nearcellular resolution with slideseqV2. Nat Biotechnol. 2021;39(3):313–9.
Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, Giacomello S, Asp M, Westholm JO, Huss M, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82.
Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ. Slideseq: a scalable technology for measuring genomewide expression at high spatial resolution. Science. 2019;363(6434):1463–7.
Lee Y, Bogdanoff D, Wang Y, Hartoularos GC, Woo JM, Mowery CT, Nisonoff HM, Lee DS, Sun Y, Lee J, et al. XYZeq: Spatially resolved singlecell RNA sequencing reveals expression heterogeneity in the tumor microenvironment. Science advances. 2021; 7(17).
10x Genomics: Mouse Brain Serial Sections (SagittalPosterior), Spatial Gene Expression Dataset by Space Ranger 1.1.0, 10x Genomics. 2020.
Horras CJ, Lamb CL, Mitchell KA. Regulation of hepatocyte fate by interferon\(\gamma\). Cytokine Growth Factor Rev. 2011;22(1):35–43.
Robinson MW, Harmon C, O’Farrelly C. Liver immunology and its role in inflammation and homeostasis. Cell Mol Immunol. 2016;13(3):267–76.
Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999;401(6755):788–91.
Donoho D, Stodden V. When does nonnegative matrix factorization give a correct decomposition into parts? Adv Neural Inf Process Syst. 2003; 16.
Townes FW, Engelhardt BE. Nonnegative spatial factorization. arXiv preprint. 2021. arXiv:2110.06122.
Engelhardt BE, Adams RP. Bayesian structured sparsity from Gaussian fields. arXiv preprint. 2014. arXiv:1407.2235.
Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMFsc: leveraging genegene interactions for imputation and dimensionality reduction in singlecell expression analysis. Genome Res. 2020;30(2):195–204.
Acknowledgements
We thank Ariel Gewirtz and Isabella Grabski for helpful conversations.
Funding
This work was funded by Helmsley Trust grant AWD1006624, NIH NCI 5U2CCA233195, NIH NHLBI R01 HL133218, and NSF CAREER AWD1005627.
Author information
Authors and Affiliations
Contributions
TF, AJ, and BEE designed the method. TF implemented the method and conducted data analysis. TF, AJ, and BEE analyzed the results. TF and AJ wrote the manuscript. TF, AJ, and BEE edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
N/A
Consent for publication
N/A
Competing interests
BEE is on the SAB of Creyon Bio, Arrepath, and Freenome. TF and AJ have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Fig. S1 Graphical model for PRRR and nnPRRR.
Additional file 2
. Fig. S2 BERRRI and a neural network approach fail to identify expression patterns in islet celltypes. Shown here is the latent encoding of each cell type for each pair of latent variables. Eachpoint in each subplot represents a cell type, and cell types are colored by whether they areclassified as islet cells or not. The densities on the diagonal show the distribution of latent variablevalues for islet and nonislet cell types in each latent dimension. The left panel shows the latentvariables for BERRRI and the right plot shows the latent variables from a neural network.
Additional file 3
. Fig. S3 nnPRRR coefficients for pancreatic cell types. Heatmaps showing the full coefficientmatrix UV^{⊤} for nnPRRR (left is original, and right is on a log scale). Cell types are shown onthe rows and genes on the columns. In the left panel, white cells indicate values near zero,implying that this coefficient matrix is highly sparse.
Additional file 4
. Fig. S4 Marker genes identified by PRRR for pancreatic cell types. For each cell type, the tengenes with the highest coefficients in the matrix UV^{⊤} were extracted for each cell type. Somecell types share the same ten marker genes, which corresponds with our observation that the celltypes are largely overlapping in a PCA plot of the gene expression data (Fig. S4).
Additional file 5
. Fig. S5 PCA plot of pancreas scRNAseq data. The first two principal components (PCs) areplotted. Each point corresponds to a single cell and is colored by its annotated cell type.
Additional file 6
. Fig. S6 nnPRRR coefficients for GTEx eQTL mapping. Left: U matrix showing SNPs on therows and latent factors on the columns. Right: V matrix showing genes on the rows and latentfactors on the columns.
Additional file 7
. Fig. S7 Time complexity. Left: Time to fit each of the four models with varying sample sizes n.Right Left: Time to fit each of the four models with varying outcome dimensions q.
Additional file 8
. Fig. S8 The optimization problem for reducedrank regression (RRR) and description of GTEx experiments.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Fitzgerald, T., Jones, A. & Engelhardt, B.E. A Poisson reducedrank regression model for association mapping in sequencing data. BMC Bioinformatics 23, 529 (2022). https://doi.org/10.1186/s12859022050546
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022050546
Keywords
 Singlecell RNAsequencing
 Association mapping
 Countbased models
 Reducedrank regression