 Methodology article
 Open access
 Published:
SCeQTL: an R package for identifying eQTL from singlecell parallel sequencing data
BMC Bioinformatics volume 21, Article number: 184 (2020)
Abstract
Background
With the rapid development of singlecell genomics, technologies for parallel sequencing of the transcriptome and genome in each single cell is being explored in several labs and is becoming available. This brings us the opportunity to uncover association between genotypes and gene expression phenotypes at singlecell level by eQTL analysis on singlecell data. New method is needed for such tasks due to special characteristics of singlecell sequencing data.
Results
We developed an R package SCeQTL that uses zeroinflated negative binomial regression to do eQTL analysis on singlecell data. It can distinguish two type of geneexpression differences among different genotype groups. It can also be used for finding gene expression variations associated with other grouping factors like cell lineages or cell types.
Conclusions
The SCeQTL method is capable for eQTL analysis on singlecell data as well as detecting associations of gene expression with other grouping factors. The R package of the method is available at https://github.com/XuegongLab/SCeQTL/.
Background
Expression quantitative trait locus or eQTL analysis is an important approach for studying the association between variations in the genotype and gene expression, which may help to reveal the underlying regulation relationship. Technologies that can sequence in parallel both the genomes and transcriptomes of single cells are being developed recently [6, 8]. These technologies give us an opportunity to uncover the association between genetic variations and genes expression at singlecell level, which can help reveal detailed gene regulation mechanisms in processes like tumorigenesis and cell differentiation.
Methods for identifying eQTLs have been well studied for microarray data and bulk RNAseq data. Typical methods of eQTL mapping include linear regression and ANOVA, where the expression level is taken as the dependent variable and the genotype at a singlenucleotide variation (SNV) site is the explaining factor [7, 16]. Most of those methods are based on the assumption that expression levels or its logarithms follow normal distribution, Poisson distribution or negative binomial distribution [17]. The Krux method used a nonparametric way to identify eQTL and claim their method is more robust [15]. These existing methods including the nonparametric ones can lose their power when applied on singlecell RNAseq data because of the special characteristics of singlecell sequencing data, especially the excess of zero values.
The phenomenon of excess of zero values is common in singlecell RNAseq (scRNAseq) data [2, 9]. There are mainly two reasons. Because the amount of total RNAs in a single cell is extremely small, there is high probability that the RNA capture, reverse transcription and amplification steps may miss some transcripts, causing the expression of some expressed genes not observed in the sequencing data. This is usually called “dropout” events. Another reason is that gene expression is a stochastic process at singlecell level [11]. This results in variations of gene transcription status between cells besides variations in gene expression abundances. The possibility for a gene to have a real zero expression level or to be in the “off” status of transcription is much higher in single cells than in the pooled transcriptomes of many cells in bulk RNAseq data [5, 10]. There are two types of heterogeneity in gene expression: heterogeneity in the “onoff” status of a gene’s transcription, and heterogeneity in the abundance of expressed genes. Studying such heterogeneities is one of the major purposes of singlecell sequencing. Because of these special properties, when we analyze eQTLs on scRNAseq data, we also face two possible types of differences in gene expression associated with variation in genotypes: differences in the transcription status of a gene and differences in expression levels of an expressed gene. We call them as “status difference” and “expression level difference”, respectively. We developed SCeQTL to analyze these two types of differences that may be associated with genotype variations. The method can also be applied to analyze associations of gene expression with other types of groupings such as cell lineages or cell types.
Results
Zeroinflated generalized linear model
We model scRNAseq data as the outcome of two processes. One is that transcripts are captured in the sequencing and the corresponding gene gets nonnegative expression values. The other is that transcripts are missed or the gene is not expressed in the cell, which will result in zero values in the data. The second process causes scRNAseq data to have excess zero values. We find the negative binomial (NB) distribution can fit the nonzero parts of scRNAseq data well in a way similar to bulk RNAseq data (Fig. 3 and Fig. 4a), but there can be a high probability of a gene being zeros in the singlecell data.
Therefore, we use a zeroinflated negative binomial (ZINB) regression to model the scRNAseq data as we have done in [10]. For gene expression g and genotype s, there is probability p that the transcription is off and we have the observation of a zero value in a cell, and probability 1 − p that the gene is expressed with values being described as following a negative binomial distribution. We call the probability p as zero ratio for simplicity. We write these as
where μ and θ are the mean and shape parameter of the negative binomial distribution. We call a SNV to be an eQTL of a gene if the zero ratio p and/or the mean of negative binomial distribution μ is significantly correlated with the genotype of the SNV in a way that
and/or
where s is the genotype (or other distinguishing factor to group the cells), parameters α_{1}, α_{2}, β_{1}, β_{2} and the shape parameter θ are to be estimated from the data. Using maximum likelihood method to estimate the parameters, we get the loglikelihoods of the full model that includes the genotype as the explaining factor (β_{1} ≠ 0 or β_{2} ≠ 0) and of the reduced model that does not include the genotype (β_{1} = 0 or β_{2} = 0).
The generalized linear model contains two parts: distribution hypothesis and link function. In distribution hypothesis, the probability density function (pdf) of overdispersed exponential families is
where η is a natural parameter, T(x) is a sufficient statistic, and A(η) is used to guarantee the integral of pdf to be 1. Link function describes how the expectation of corresponding variable is related to the linear combination of independent variables:
where g is the link function. For NB distribution, the link function is
According to the generalized linear model theory [12], the deviance, which is − 2 times the loglikelihood ratio of the reduced model compared to the full model, follows an approximate chisquare distribution with k degree of freedom. The k is the difference between parameter numbers of the full model and the reduced model. We use the deviance as the test statistic to test for whether β_{1} = 0 or β_{2} = 0. By these two hypothesis tests, we can identify whether the gene expression have association with the genotype (or other factor used to group cells) and what kind of association it is. We call this method as SCeQTL and developed a software package in R to implement it (https://github.com/XuegongLab/SCeQTL/).
Simulation experiments
Singlecell parallel sequencing technologies of both the genome and transcriptome are currently only available in very few labs, and the current genomic coverage of such technologies may not be sufficient for genotyping analysis yet. So it is still hard to find public datasets including both singlecell genomic sequencing data with sufficient depth and singlecell RNAsequencing data of the same single cells. Therefore, we first did a series of simulation experiments to study the performance of SCeQTL. Both genotype and phenotype data were simulated simultaneously under different effect size. We applied SCeQTL and the widely used Matrix eQTL [16] on the simulation data for comparison. Matrix eQTL is a highlyefficient method for eQTL analysis designed for bulk data.
We simulated genotype and gene expression data of 1500 cells of three SNVs and 20 genes in each simulation experiment. Considering the probable effect of different frequencies of three genotypes (denoted by s = 0, 1, 2), we generated three SNVs (denoted as SNV 1–3) with different genotype frequencies, and conducted four experiments (denoted as Experiment 1–4) for each SNV. These experiments aimed to mimic four scenarios: transcription status set at same or different level while expression values change with genotypes, and expression values set at same or different level while transcription status change with genotypes. Under each scenario, we experimented on changes across different effect sizes. Ten significant geneSNV pairs were randomly generated for each SNV, and were taken as the ground truth in performance analysis. Gene expression metrics were generated by ZINB model of different parameters. We define three types of genes that are associated with genotypes in the simulation model: genes whose expression values differ among genotypes (Ges), genes’ transcription status differs among genotypes (Gts), and both transcription status and expression values differ among genotypes (Gs). Table 1 shows the overall design of the simulation data.
For each experiment, we get the ROC curves of both methods by calculating the false positive rate and true positive rate by comparing the detection result with the simulation model. We use Area Under Curve (AUC) of ROC curves to demonstrate performance of the two methods. Larger AUC value means higher accuracy. Figure 1 summarizes the experiment results. It shows that different proportions of genotypes do not affect results. Further checking the results in four experiments of the same SNV, we can see that when only one aspect (zero ratio or NB mean) differs (Fig. 1a, c), performance of the two methods largely overlaps: AUC value first rises, then holds with effect sizes increasing. A minor difference shown in Fig. 1a is that the power of SCeQTL rises earlier and more dramatically at smaller effect sizes. This indicates higher sensitivity of SCeQTL than Matrix eQTL for detecting Ges.
In more complicated situations when both transcription status and expression values are different (Fig. 1b, d), AUC value of SCeQTL keeps steady and high, while that of Matrix eQTL drops drastically at certain effect sizes. This suggests the superiority of SCeQTL in terms of power when detecting Gs. An explanation is that, in Fig. 1b, the increase of zero ratio gradually offset the divergence of NB mean in three genotypes with β_{2} increasing; and in Fig. 1d, the increase of NB mean gradually offset the divergence of zero ratio with β_{1} increasing. Both cases resulted in similar mean in three genotypes which Matrix eQTL cannot distinguish. But once the influence of zero ratio or NB mean became dominant, the power of Matrix eQTL would recover, as is shown in the right part of curves in Fig. 1b and the left part of curves in Fig. 1d.
Figure 2 shows gene expression distributions of three example significant eQTLs which cannot be detected by Matrix eQTL. Figure 2a shows expression levels of a Ge referred to the point in Fig. 1a when β_{2} = 0.05. SCeQTL can detect the slight difference in NB mean, but Matrix eQTL cannot. Figure 2b and c display similar situations when detecting Gs. They correspond to the point in Fig. 1b when β_{2} = 0.15 and Fig. 1d when β_{1} = 0.5, respectively. Again, SCeQTL found the differences very significant, while Matrix eQTL found them insignificant. They could support our analyses above.
Real data experiments
Currently public datasets with both genotype and transcriptome sequenced in the same single cells are still rare, and those only few available datasets still have very limited coverage in SNVs. We therefore used a real RNAseq dataset without celllevel genotype but with multiple groups of cell attributes to further study the performance of SCeQTL. The data we used is a scRNAseq dataset of human preimplantation embryo cells of different embryo days [14]. We split the cells into three groups according to the embryonic day (E5, E6 and E7) to mimic cells of three genotypes. We use this dataset to show that SCeQTL not only works for singlecell eQTL applications, but also can be applied to the more general scenarios of detecting gene expression variations that are associated with other types of grouping factors of cells.
We first checked whether the nonzero expression data were fitted well with our model using the ‘checkdist’ function in our package. We randomly picked some genes and drew QQ plot to compare gene expression distribution with several distributions. Figure 3 and Figure 4a shows that negative binomial distribution is appropriate for modeling the nonzero part of the data, while the lines in QQ plot of other distributions are far away from the diagonal. The histogram in Figure 4b shows that the dropout event is very common in singlecell RNAseq data and needs to be considered.
We applied both SCeQTL and Matrix eQTL [16] on these data for comparison. We first conducted experiments to study the distribution of pvalues of the two methods under the null hypothesis of no eQTL. Figure 5 show the pvalue distributions under null hypothesis, which were obtained by randomly generating and permuting the “genotype” (the embryonic days in this experiment) and use two methods to calculate the pvalues. The pvalue distribution of SCeQTL is close to uniform distribution between 0 and 1, while the pvalue distribution of Matrix eQTL has clear deviation from uniform.
On the eQTL results in the experiment of true embryonic day with gene expression, we found that results of the two methods largely overlapped, but there were noticeable cases on which SCeQTL worked better. Figure 5c shows an example that nonzero part had significant difference but Matrix eQTL didn’t find it. The pvalues obtained by SCeQTL and Matrix eQTL are 9.37 × 10^{−9} and 0.002, respectively. The test has been done for all the 23,981 genes in this dataset. This eQTL is very significant for SCeQTL but will not significant for Matrix eQTL after multipletest correction. One reason is that the negative binomial distribution fit the singlecell data better than normal distribution. On the other hand, the zero values in scRNAseq data caused the means of the three groups to be almost equal, so that Matrix eQTL could not detect the difference. Figure 5d gives an example that zero ratios have significant differences among the compared groups (0.76, 0.26 and 0.26) but nonzero parts shown by the boxplots are almost same. The pvalues with SCeQTL and Matrix eQTL are 1.09 × 10^{−45} and 0.004, respectively. Matrix eQTL can’t detect differences of this type either.
We also experimented on this dataset using cell lineage as the factor to distinguish three cell groups, and found SCeQTL could successfully find results that can be confirmed with biological knowledge on embryonic cell lineages. By dividing the samples by cell lineages EPI (epiblast), TE (trophectoderm) and PE (primitive endoderm), we applied SCeQTL to find genes that vary among different cell lineages. Among all 23,981 genes, SCeQTL found about 20 genes with pvalue less than 10^{−40}, 70 genes with pvalue less than 10^{−30} and 200 genes with pvalue less than 10^{−20}. In these genes that are significantly associated with cell lineages, we found some have been reported as lineage specific genes in the literature. For example, for EPIspecific genes PRDM14, GDF3, TDGF1, NODAL, SOX2 and NANOG, their pvalues are 5.7 × 10^{−45} 、 4.7 × 10^{−20}, 6.5 × 10^{−16}, 4.0 × 10^{−21}, 5.1 × 10^{−37} and 4.8 × 10^{−25}; for TEspecific genes GATA2, GATA3 and DAB2, their pvalues are 4.1 × 10^{−20}, 6.7 × 10^{−26} and 3.3 × 10^{−21}; and for PEspecific genes HNF1B, PDGFRA and GATA4, their pvalues are 5.7 × 10^{−32}, 1.7 × 10^{−23} and 2.0 × 10^{−35}, respectively. All these lineagespecific genes ranked at top 200 in our result. It is worth noting that quite a lot of these genes have shown obvious transcription status differences in our SCeQTL analysis, which may imply that gene transcription in single cells are undergone onoff regulation in many scenarios. To double check the reliability of the SCeQTL discoveries, we manually checked some of the results and found most significant genes truly have differences among the compared groups. Figure 6 shows three examples. Pvalues of SCeQTL in the examples are 4.1 × 10^{−50}, 4.8 × 10^{−33} and 1.5 × 10^{−23} respectively. These experiments showed the potential for discovering associations in single cells that cannot be identified using existing eQTL methods. The biological implication of those associations can be important and deserve further investigations.
Discussion
A limitation of the proposed SCeQTL method is that the computation cost is relatively high if applied for eQTL analysis at wholegenome scale. It can take a few minutes to analyze a few hundred geneSNV pairs on a single computing node. This is mainly due to the iterative procedures in estimating the parameters. However, for most singlecell studies, the cells are from the same tissue sample or closely related samples. We can expect that the number of SNVs among the cells that need to be studied for eQTL analysis is not too large to make the computing cost of SCeQTL a severe issue in practical applications. This will also not be an issue when we use SCeQTL to analyze the association of gene expression with other factors as we did in the application examples.
The analysis of the associations of genotypic variations with gene expression as well as alternative splicing [18] is a fundamental step for understanding the complex gene regulation system of human in health and disease. Cells are the basic units where the regulation happens. The broad existence of heterogeneities in gene expression both in quantity and in alternative splicing isoforms among cells is important in human physiology and pathology. This gives a strong motivation why functional genomics studies are moving quickly into singlecell levels. This is also true for the study of gene regulations. The current singlecell genomics technologies are still at their early stages and not widely available for largescale studies of gene regulations within single cells, but technologies are developing and evolving rapid toward these goals. We hope the proposed SCeQTL method and software provides a ready and effective tool for this development.
Conclusions
We proposed a new method for eQTL analysis on singlecell genomic and transcriptomic parallel sequencing data and developed a software package SCeQTL to implement the method. Experiments showed that the method can reveal associations that cannot be identified with existing eQTL methods developed for bulk data. It can also be applied on tasks of finding the association of gene expression with other grouping factors that distinguish cells into different types. It provides an effective tool for exploring gene regulation relationships at singlecell level.
Methods
Data preprocessing
Multiple steps of data preprocessing are necessary before using SCeQTL. Firstly, we remove the effect of the library size. We use the normalization method in DEseq [1] for this normalization. The median of the ratios of observed counts is used to measure the sequencing depth.
where g_{ij} is the expression level of gene i in sample j. The denominator is obtained by calculating the geometric mean across nonzero samples. As discussed in [1], this method is more robust than just taking the sum of all genes as sequencing depth since otherwise the highly expressed gene would dominant the result, which is often seen in singlecell gene expression data. All samples are normalized by the size factor, and we round down the resulted expression values to fit our readcounts model.
Next, singlecell RNAseq data are noisy and we need to remove genes and variants which are not suitable for the analysis. Genes with read counts less than a certain threshold (by default, <=1) are treated as not expressed and are therefore removed. We only consider genes whose variances are greater than a certain threshold (by default, > = 5). For genomic variants, only variants with at least two genotypic groups in the dataset and each genotype has at least 5 samples (cells) are further considered.
When we enter the iteration of analyzing every genevariant pairs, pairs that don’t have enough nonzero values (by default, <=5) in one genotype are reported. The estimation of distribution parameters can be far away from true values in this situation. And we find that in real data, there are samples whose expression level is much higher than the others. If we include these samples into consideration, the mean of negative binomial distribution will be overestimated. So we treat these samples as outlier and use robust z score as defined below to remove them (by default, > = 4), where MAD stands for the median absolute deviation:
Parameter estimation
The package ‘pscl’ [19] is used to estimate the parameter and calculate the loglikelihood. The package uses the EM algorithm or BFGS algorithm to iteratively update the parameter estimation.
Covariates correction
It is common that some hidden covariates may exist in the sampled population, such as age, gender, or other clinical variables. It is important to remove the effect of them from the eQTL study, as otherwise a high percentage of results will be false discoveries. SCeQTL allows user to define a covariate vector x as possible confounding factors to be considered in the analysis. With covariate vector x ∈ R^{n}, the models become
where extra parameter vectors γ_{1} and γ_{2} to be estimated. The hypothesis test process is the same as noncovariates one. As a special consideration in singlecell studies, potential correlations among single cells from the same individual or from the same cell type can be modeled in this covariate vector to make sure that the associations detected with SCeQTL are not due to those factors.
Multiple test correction
We provide two ways to control the false discovery, BenjaminiHochberg (BH) method [3] and the qvalue method. The qvalue method is implemented by R package ‘qvalue’ (http://github.com/jdstorey/qvalue). Since several publications come up with other methods for multiple test correction in eQTL mapping [4, 13], users can also select whether to let SCeQTL to report pvalue or false discovery rates and set the threshold according to other correction methods.
Availability of data and materials
The R package of the method is available at https://github.com/XuegongLab/SCeQTL/.
Abbreviations
 SCeQTL:

Single Cell expression Quantitative Trait Locus
 scRNAseq:

Single cell RNA sequencing
 ANOVA:

ANalysis Of VAriance
 MAD:

Median absolute deviation
 EM:

ExpectationMaximization
 BFGS:

Broyden–Fletcher–Goldfarb–Shanno
References
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. https://doi.org/10.1186/gb20101110r106.
Bacher R, Kendziorski C. Design and computational analysis of singlecell RNAsequencing experiments. Genome Biol. 2016;17:63.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc. 1995;Series B. 57(1):125–33 JSTOR 2346101.
Degnan J, et al. Genomics and genomewide association studies: an integrative approach to expression QTL mapping. Genomics. 2008;92:129–33.
Delmans M, Hemberg M. Discrete distribution differential expression (D3E) – a tool for gene expression analysis of singlecell RNAseq data. BMC Bioinformatics. 2016;17:110.
Dey SS, et al. Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol. 2015;33(3):285–9.
Gatti DM, et al. FastMap: fast eQTL mapping in homozygous populations. Bioinformatics. 2009;25(4):482–9.
Macaulay IC, et al. G&Tseq: parallel sequencing of singlecell genomes and transcriptomes. Nat Methods. 2015;12(6):519–22.
Miao Z, Zhang X. Differential expression analyses for singlecell RNASeq: old questions on new data. Quantitative Biol. 2016;4(4):243–60.
Miao Z, et al. DEsingle for detecting three types of differential expression in singlecell RNAseq data. Bioinformatics. 2018;34(18):3223–4.
Munsky B, et al. Using gene expression noise to understand gene regulation. Science. 2012;336(6078):183–7.
Nelder J A, and Baker R J. (1972) Generalized linear models. Encyclopedia of Statistical Sciences.
Peterson CB, Bogomolov M, Benjamini Y, et al. TreeQTL: hierarchical error control for eQTL findings. Bioinformatics. 2016;32(16):2556–8.
Petropoulos, et al. Singlecell RNASeq reveals lineage and X chromosome dynamics in human Preimplantation embryos. Cell. 2016;165(4):1012–26.
Qi J, et al. kruX: matrixbased nonparametric eQTL discovery. BMC Bioinformatics. 2014;15(1):11.
Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28(10):1353–8.
Sun W. A statistical framework for eQTL mapping using RNAseq data. Biometrics. 2012;68(1):1–11.
Yang Q, et al. ulfasQTL: an ultrafast method of composite splicing QTL analysis. BMC Genomics. 2017;18(Suppl 1):963.
Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. J Statistical Software. 2008;27(8):1–25 URL http://www.jstatsoft.org/v27/i08/.
Acknowledgements
We thank Hao Sun, Zhun Miao and Tianxing Ma for helping to test the package.
Funding
This work is partially supported by the National Key R&D Program of China grant 2018YFC0910401, NSFC grant 61721003 by the CZI HCA Seed Network Project. The funding bodies have not played any role in the design of the study, the collection, analysis, interpretation of data, or the writing of the manuscript.
Author information
Authors and Affiliations
Contributions
YH developed the method, the R package and conducted the experiments on the real data. XX designed the simulations and conducted the simulation experiments with the help from QY. XZ initiated and supervised the project. YH, XX, QY and XZ wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests. The last author XZ is a member of the editorial board (Associate Editor) of this journal.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hu, Y., Xi, X., Yang, Q. et al. SCeQTL: an R package for identifying eQTL from singlecell parallel sequencing data. BMC Bioinformatics 21, 184 (2020). https://doi.org/10.1186/s1285902035346
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902035346