 Methodology Article
 Open Access
 Published:
Databased RNAseq simulations by binomial thinning
BMC Bioinformatics volume 21, Article number: 206 (2020)
Abstract
Background
With the explosion in the number of methods designed to analyze bulk and singlecell RNAseq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method’s performance.
Results
Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNAseq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNAseq methods developers assess their procedures in nonideal (modelviolating) scenarios. Our procedures may be applied to both singlecell and bulk RNAseq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNAseq datasets.
Conclusions
Using data simulated from a theoretical model can substantially impact the results of a study. We developed more realistic simulation techniques for RNAseq data. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.rproject.org/package=seqgendiff.
Background
Due to its higher signaltonoise ratio, larger range of detection, and its ability to measure a priori unknown genes, RNAseq has surpassed microarrays as the technology of choice to measure gene expression [1]. With the advent of singlecell RNAseq technologies, researchers now even have the ability to explore expression variation at the individual cell level [2]. This presents exciting opportunities for researchers to characterize the expression heterogeneity between and within organisms, and has brought about a plentiful flow of new datasets. In the wake of these new data, an explosion of methods has been developed to analyze them. In “Application: evaluating differential expression analysis” section, “Application: evaluating confounder adjustment” section, “Application: evaluating effects of library size heterogeneity” section, and “Application: evaluating factor analysis” section we provide a large (yet terribly incomplete) list of methods designed to analyze RNAseq data.
The typical pipeline to evaluate a method is to first simulate data according to some theoretical model, then compare it to competing methods on these simulated data and show it to be superior in some fashion. This way of evaluation can be useful to see how a method works in ideal scenarios. However, real data rarely live in ideal scenarios. Real data often exhibit unwanted variation beyond that assumed by a model [3]. Theoretical distributional assumptions are also difficult to verify, and are sometimes mired in controversy [4].
In this paper, we propose an alternative approach. Rather than generate data with a prespecified signal according to some modeling assumptions, we take a real RNAseq dataset and add a prespecified signal to it. The main advantage of our approach is that any unwanted variation in the real data is maintained in the simulated data, and this unwanted variation need not be prespecified by the researcher. The way we add signal does carry assumptions, but they are flexible (Supplementary Section S1.2, Additional file 1). And we believe that this way of simulation, compared to simulating under a theoretical model, allows researchers to more realistically evaluate their methods.
This manuscript essentially generalizes the simulation techniques proposed in [5, 6], and [7]. These previous papers use binomial thinning (the approach used in this paper) in the case where there are just two groups that are differentially expressed (hereafter, the “twogroup model”). Binomial thinning is the process of subsampling counts using the binomial distribution. This subsampling is applied to different individuals heterogeneously to add signal to the observed counts. These papers did not develop methods for more complicated design scenarios, they did not present userfriendly software implementations for their simulation techniques, and they did not justify their simulation techniques as broadly. Here, we allow for arbitrary experimental designs, we release software for users to implement their own simulations, and we justify our techniques using very flexible assumptions.
There has been some other previous work on “databased” simulations in expression analyses. Datasets resulting from databased simulations (sometimes called “plasmodes” [8]) have been used in microarray studies before the development of RNAseq [9, 10]. All RNAseq databased simulation methods have so far operated in the twogroup (or finitegroup) model, without any ability to simulate data from arbitrary experimental designs. Rocke et al. [11] and [12] randomly shuffled group indicators in the twogroup model, resulting in completely null data, and methods can be evaluated on their ability to control for type I error when the data are all null. Rigaill et al. [13], in addition to generating null data by randomly shuffling group labels, incorporate multiple datasets to create some nonnull genes within their simulated datasets. Benidt and Nettleton [14] use a countswapping algorithm in the twogroup model to create differentially expressed genes when one already has two treatment groups. Kvam et al. [15, 16], and [17] create nonnull genes by multiplying counts for all individuals in a group by the foldchange in mean expression. [18] uses a binomial distribution approach to uniformly decrease the sequencing depth of an entire dataset (but not to add differentially expressed genes). Concerning nondatabased methods, [19] and [20] use real RNAseq data to estimate the parameters in a datagenerating model before simulating data from the theoretical model using these estimated parameter values. Our work is the first to extend databased RNAseq simulation beyond the finitegroup model.
Our paper is organized as follows. We first list the goals and assumptions of our simulation scheme (“Goals and assumptions” section) before motivating it with four applications (“Application: evaluating differential expression analysis” section, “Application: evaluating confounder adjustment” section, “Application: evaluating effects of library size heterogeneity” section, and “Application: evaluating factor analysis” section) and describing our process of simulating RNAseq in detail (“Generating modified RNAseq data” section). We then demonstrate how our approach can more accurately preserve structure in a real dataset compared to simulating a dataset from a theoretical model (“Features of real data” section). We show that this can alter the conclusions of a differential expression analysis simulation study (“Effects on differential expression analysis simulations” section). We then apply our simulation approach by comparing five factor analysis methods using the GTEx data [21] (“Evaluating factor analyses” section). We finish with a discussion and conclusions (“Discussion” section and “Conclusions” section).
We adopt the following notation. We denote matrices by bold uppercase letters (A), vectors by bold lowercase letters (a), and scalars by nonbold letters (a or A). Indices typically run from 1 to their uppercase version, e.g. a=1,2,…,A. Where there is no chance for confusion, we let nonbold versions of letters represent the scalar elements of matrices and vectors. So a_{ij} is the (i,j)th element of A, while a_{i} is the ith element of a. We let 1_{A} denote the Avector of 1’s and 1_{A×B} the A×B matrix of 1’s. The matrix transpose is denoted by \(\boldsymbol {A}^{\intercal }\).
Methods
Goals and assumptions
We will now describe the goals and assumptions of our simulation method, which relies on a researcher having access to a real RNAseq dataset. Suppose a researcher has a matrix \(\boldsymbol {Y} \in \mathbb {R}^{G \times N}\) of RNAseq readcounts for G genes and N individuals. Also suppose a researcher has access to a design matrix \(\boldsymbol {X}_{1} \in \mathbb {R}^{N \times P_{1}}\) with P_{1} variables. The availability of X_{1} is optional, not essential to the method, and is mostly for descriptive purposes. We assume the RNAseq counts, Y, are generated according to the following model:
where

\(\boldsymbol {\mu } \in \mathbb {R}^{G}\) is a vector of intercept terms for the genes,

\(\boldsymbol {B}_{1} \in \mathbb {R}^{G \times P_{1}}\) is the corresponding coefficient matrix of X_{1},

\(\boldsymbol {Z} \in \mathbb {R}^{N \times K}\) is a matrix of unobserved surrogate variables,

\(\boldsymbol {A} \in \mathbb {R}^{G \times K}\) is the corresponding coefficient matrix of Z, and

\(\boldsymbol {\Omega } \in \mathbb {R}^{G \times N}\) represents all other unwanted variation not accommodated by the other terms in the model,
where μ, B_{1}, Z, A, and Ω are all unknown. Given the above datagenerating process, suppose a user provides the following (known) elements:

\(\boldsymbol {X}_{2} \in \mathbb {R}^{N \times P_{2}}\), a design matrix with fixed rows (see note 3 below),

\(\boldsymbol {B}_{2} \in \mathbb {R}^{G \times P_{2}}\), the coefficient matrix corresponding to X_{2},

\(\boldsymbol {X}_{3} \in \mathbb {R}^{N \times P_{3}}\), a design matrix with rows that can be permuted (see note 3 below), and

\(\boldsymbol {B}_{3} \in \mathbb {R}^{G \times P_{3}}\), the coefficient matrix corresponding to X_{3}.
Our goal is to generate a matrix \(\tilde {\boldsymbol {Y}} \in \mathbb {R}^{G\times N}\) from Y such that
where

\(\boldsymbol {\Pi } \in \mathbb {R}^{N\times N}\) is a random permutation matrix, whose distribution controls the level of association between the columns of ΠX_{3} and the columns of Z, and

\(\tilde {\boldsymbol {\mu }}\) is a new vector of intercept terms for the genes.
We will provide the details on how to generate \(\tilde {\boldsymbol {Y}}\) from Y in “Generating modified RNAseq data” section. But we would like to first provide some notes below, and then discuss the applications of being able to generate (2) from (1). Note 1: For simplicity we use the Poisson distribution in the main text (Eqs. (1) and (2)). However, our approach is valid under much more general assumptions. In particular, we note that if the counts were generated according to a negative binomial distribution, a zeroinflated negative binomial distribution, or even a mixture of binomials and negative binomials, then our simulation scheme still preserves the structure of the data (Supplementary Section S1.2, Additional file 1). However, even when our general modeling assumptions are violated, one can show (via the law of total expectation) that if log2(E[Y])=Θ, then \(\log _{2}(E[\tilde {\boldsymbol {Y}}]) = \tilde {\boldsymbol {\Theta }}\), where we are taking elementwise logarithms of E[Y] and \(E[\tilde {\boldsymbol {Y}}]\). Thus, our procedure will produce the correct mean log2fold change in the new dataset, but the resulting mean/variance relationship might not be as assumed. Note 2: The Ω term in (1) and (2) represents the realistic and annoying features of the data. In ideal situations, Ω=0_{G×N}. However, most datasets likely include nonzero Ω, and so assessing a method’s ability to be robust in the presence of Ω, without the researcher having to prespecify Ω, is the key strength of our simulation approach. Note 3: As described below, we include both X_{2} and X_{3} in (2) to control different aspects of a simulation study. One may control the level of association between the columns of X_{1} and X_{2} as these are both observed and fixed by the user. The inclusion of X_{3} and Π allows us to try to control the level of association between ΠX_{3} and Z.
Before we discuss obtaining (2) from (1), we point out four potential applications of this simulation approach: (i) evaluating differential expression analyses (“Application: evaluating differential expression analysis” section), (ii) evaluating confounder adjustment approaches (“Application: evaluating confounder adjustment” section), (iii) evaluating the effects of library size heterogeneity on differential expression analyses (“Application: evaluating effects of library size heterogeneity” section), and (iv) evaluating factor analysis methods (“Application: evaluating factor analysis” section).
Application: evaluating differential expression analysis
One of the more common applications of RNAseq data is estimating and testing for differences in gene expression between two groups. Many packages and techniques exist to perform this task [22–39, among others], and so developing approaches and software to compare these different software packages would be of great utility to the scientific community. Generating data from the twogroup model is a special case of (1) and (2), where
and \(\boldsymbol {\Pi }\boldsymbol {x} \in \mathbb {R}^{N}\) contains a single indicator variable, indicating membership to one of two groups. Researchers may specify b and x and evaluate a method’s ability to (i) estimate b and (ii) detect which genes have nonzero b_{g}.
In many settings, a researcher may want to specify the distribution of the b_{g}’s (the elements of b). Our software implementation allows for this. In addition, following [40], we allow researchers to specify the distribution of \(b_{g}/s_{g}^{\alpha }\), where s_{g} is the sample standard deviation of the gth row of log2(Y+0.5), and α is a userspecified constant. Allowing for α=0 corresponds to the scenario of specifying the distribution of the effects, while allowing for α=1 corresponds to specifying the pvalue prior of [41].
Though the twogroup model is perhaps the most common scenario in differential expression analysis, our method also allows for arbitrary design matrices. Such design matrices have applications in many types of expression experiments [24, 42–44], and so the ability to simulate arbitrary designs gives researchers another tool to evaluate their methods in more complicated scenarios.
Application: evaluating confounder adjustment
Unobserved confounding / batch effects / surrogate variables / unwanted variation has been recognized as a serious impediment to scientific studies in the modern “omics” era [3]. As such, there is a large literature on accounting for unwanted variation, particularly in RNAseq studies [5, 6, 45–72, among others]. The glut of available methods indicates a need to realistically compare these methods.
Typically, the form and strength of any unobserved confounding is not known. So one way to assess different confounder adjustment methods would be to assume model (1) and add signal to the data resulting in the following submodel of (2):
A researcher would then explore how close a method’s estimate of B_{3} is to the truth (assuming the researcher may use both X_{1} and ΠX_{3} to obtain this estimate). The researcher can control the correlation between the columns of ΠX_{3} and the columns of Z by specifying the distribution of Π (as described in “Generating modified RNAseq data” section). Intuitively, the stronger the correlation between the columns of X_{3} and the columns of Z, the more difficult the confounder adjustment problem. This approach was used in the twogroup model in [5] and [6], but not for general design matrices.
Application: evaluating effects of library size heterogeneity
“Library size” corresponds to the number of reads an individual sample contains. Adjusting for library size is surprisingly subtle and difficult, and thus many techniques have been proposed to perform this adjustment [73–77]. The most commonlyused techniques can be viewed as a form of confounder adjustment [5]. For most methods, this form of confounder adjustment corresponds to setting one column of A in (1) to be 1_{G} and estimating the corresponding column in Z using some robust method that assumes that the majority of genes are nondifferentially expressed.
One way to evaluate the performance of a library size adjustment procedure is to see how effect size estimates change when the samples are thinned, changing the library size. First, assume we are operating in the following submodel of (1):
A researcher may specify (i) additional signal and (ii) a further amount of thinning on each sample by generating the following submodel of (2):
To evaluate the effectiveness of a library size adjustment procedure, researchers may observe the effects on the estimates of B_{3} under various amounts of library thinning (controlled by altering x_{2} and x_{3}).
Application: evaluating factor analysis
Factor analysis is a fundamental technique in every statistician’s arsenal. Since its creation by Spearman [78], literally hundreds of factor analysis / matrix decomposition / matrix factorization approaches have been developed, and new approaches are created each year to account for new features of new data [54, 79–96, to name a very few]. For RNAseq, factor analysis methods have found applications in accounting for unwanted variation [63, 64], estimating cellcycle state [97, 98], and general quality assessments [27]. Thus, techniques to realistically compare various factor analysis methods would be of great use to the scientific community. We demonstrate in this section how our simulation approaches can be used to evaluate factor analysis methods applied to RNAseq.
We suppose that the RNAseq readcounts follow the following submodel of (1):
We then suppose that the researcher generates a modified dataset that follows the following submodel of (2):
We assume that a researcher applies a factor analysis to (10) to estimate a lowrank matrix with K+P_{3} factors. That is, the researcher fits the following model,
with factor matrix \(\boldsymbol {F} \in \mathbb {R}^{N \times (K + P_{3})}\) and loading matrix \(\boldsymbol {L} \in \mathbb {R}^{G\times (K + P_{3})}\), obtaining estimates \(\hat {\boldsymbol {L}}\) and \(\hat {\boldsymbol {F}}\). These estimates are obtained without using ΠX_{3}. A researcher may evaluate their factor analysis by

1.
Assessing if any of the columns of \(\hat {\boldsymbol {F}}\) are close to the columns of ΠX_{3},

2.
Assessing if any of the columns of \(\hat {\boldsymbol {L}}\) are close to the columns of B_{3}, and

3.
Assessing if the columnspace of ΠX_{3} is close to the columnspace of \(\hat {\boldsymbol {F}}\), which would be an important consideration in downstream regression analyses [45, e.g.].
In a factor analysis, the factors and loadings are only identifiable after imposing assumptions on their structure (such as sparsity or orthogonality). Thus, researchers may vary the structure of B_{3} and ΠX_{3} and observe the robustness of their factor analysis methods to violations of their structural assumptions.
Generating modified RNAseq data
We will now discuss the approach of obtaining (2) from (1). We will use the following wellknown fact of the Poisson distribution, which may be found in many elementary probability texts:
Lemma 1
If y∼Poisson(a) and \(\tilde {y}y \sim \text {Bin}(y, b)\), then \(\tilde {y} \sim \text {Poisson}(ab)\).
Generalizations of Lemma 1 to negative binomial distributions (and mixtures of negative binomial distributions) may be found in Section S1.2 of Additional file 1.
In the case when Π is drawn uniformly from the space of permutation matrices, we have the simplified procedure described in Procedure 1. The validity of Procedure 1 follows directly from the modeling assumptions in (1) and Lemma 1. Since \(y_{gn}\sim \text {Poisson}(2^{\theta _{gn}})\phantom {\dot {i}\!}\) and \(\tilde {y}_{gn}y_{gn}\sim \text {Bin}(y_{gn},2^{q_{gn}})\), we have that \(\tilde {y}_{gn}\sim \text {Poisson}(2^{\theta _{gn} + q_{gn}})\). If we set \(\tilde {\theta }_{gn} = \theta _{gn} + q_{gn}\), then we have
Equation (13) follows from the definition of Θ from (1) and the definition of Q from Step 4 of Procedure 1. Equation (15) follows by setting \(\tilde {\boldsymbol {\mu }}\) to be μ−e.
There are two main reasons to subtract the rowwise maximum from each row in Step 4 of Procedure 1: (i) this ensures that the binomial probabilities (\(2^{q_{gn}}\phantom {\dot {i}\!}\)) are always between 0 and 1, and (ii) this allows for minimal countthinning while still obtaining our goal of (2). That is, the binomial probabilities will all be between 0 and 1, but they will be as close to 1 as possible while still yielding (2), thereby reducing the amount of discarded counts.
The main disadvantage to Procedure 1 is that the surrogate variables (Z) will be independent of the userspecified covariates (ΠX_{3}). To allow the user to control the level of association between the surrogate variables and the userprovided variables, we propose using Procedure 2 to choose Π, rather than drawing Π uniformly from the space of permutation matrices. In brief, the user specifies a “target correlation” matrix, \(\boldsymbol {R} \in \mathbb {R}^{P_{3}\times K}\), where r_{ik} is what the user desires to be the correlation between the ith column of ΠX_{3} and the kth column of Z. We then estimate the surrogate variables either using a factor analysis (such as the truncated singular value decomposition) or surrogate variable analysis [45, 49]. Note that this estimate of Z is only used to permute the rows of X_{3} and is otherwise not included in the simulated data. We then draw a new random matrix \(\boldsymbol {U} \in \mathbb {R}^{N\times P_{3}}\) from a conditional normal distribution assuming that each row of U and Z is jointly normal with covariance matrix (16), thus the correlation between the columns of U and Z will be approximately R. We then match the rows of X_{3} with the rows of U using the pairwise matching algorithm of [99], though our software provides other options to match pairs via either the GaleShapley algorithm [100] or the Hungarian algorithm [101]. This ensures that ΠX_{3} is as close to U as possible. We denote the permutation matrix that matches the rows of X_{3} with the rows of U by Π.
The resulting covariance matrix (16) used in Procedure 2 is not guaranteed to be positive semidefinite. Rather than demand the user specify an appropriate target correlation matrix (which might be in general difficult for the typical user), we modify the target correlation matrix using Procedure 3 to iteratively shrink R until the Schur complement condition for positive semidefiniteness [102] is satisfied.
Procedure 2 is a compromise between letting the user specify the full design matrix X_{3} and letting the user specify the correlation between the columns of ΠX_{3} and Z. A user might want to specify the correlation between ΠX_{3} and Z to evaluate factor analyses in the presence of correlated factors (“Application: evaluating factor analysis” section), or to evaluate how well confounder adjustment approaches cope in the presence of correlated confounders (“Application: evaluating confounder adjustment” section). In the simple case when X_{3} and \(\hat {\boldsymbol {Z}}\) are drawn from a normal distribution, Procedure 2 will permute the rows of X_{3} so that ΠX_{3} and \(\hat {\boldsymbol {Z}}\) consistently has the correct correlation structure (Theorem S1 in Additional file 1). However, for general design matrices this will not be the case. Procedure 4 (implemented in our software) provides a Monte Carlo algorithm to estimate the true correlation given the target correlation. Basically, the estimator approximates the expected value (conditional on \(\hat {\boldsymbol {Z}}\)) of the Pearson correlations between the columns of ΠX_{3} and the columns of \(\hat {\boldsymbol {Z}}\). We justify this in an intuitive way by the law of total expectation. Consider x a single column of ΠX_{3} with empirical mean and standard deviation of \(\bar {x}\) and s_{x}. Similarly consider z a single column of \(\hat {\boldsymbol {Z}}\) with empirical mean and standard deviation of \(\bar {z}\) and s_{z}. Then
The estimator in Procedure 4 is a Monte Carlo approximation to the internal expectation in (17). We explore this correlation estimator through simulation in Supplementary Section S2.1, Additional file 1.
All simulation methods introduced in this paper are implemented in the seqgendiff R package, available on the Comprehensive R Archive Network: https://cran.rproject.org/package=seqgendiff.
Results
Features of real data
Real data exhibit characteristics that are difficult to capture by simulations. In this section, we demonstrate how our binomial thinning approach maintains these features, while simulating from a theoretical model results in unrealistic simulated RNAseq data.
We took the GTEx muscle data [21], and filtered out all genes with a mean readdepth of less than 10 reads. This resulted in a dataset containing 18,204 genes and 564 individuals. We then randomly assigned half of the individuals to one group and half to the other group, and used our seqgendiff software to add a N(0,0.8^{2}) log2foldchange between groups to 25% of the genes. We similarly used the powsimR software [19] to generate data according to a theoretical negative binomial model (with parameters estimated from the GTEx muscle data), again by adding a N(0,0.8^{2}) log2foldchange between the two groups in 25% of the genes. The parameters estimated and used by powsimR include the mean normalized read counts per gene, the estimated library size factor per sample, and a nonparametric estimate of the mean/dispersion relationship of the counts. powsimR uses the mean normalized read counts, the estimated size factors, and the userprovided log2fold changes to provide a mean for the negative binomial distribution. Based on this mean, it uses the estimated mean/dispersion relationship to provide a dispersion parameter for the negative binomial distribution.
The results below are from one simulation, but the results are robust and consistent across many datasets. The reader is encouraged to change the random seed in our code to explore the robustness of our conclusions.
The structure of the powsimR dataset is very different from that observed in the seqgendiff and GTEx datasets. There seems to be more zeros in the powsimR dataset than in the seqgendiff and GTEx datasets (Supplementary Figure S2, Additional file 1), even though we simulated the powsimR dataset under the negative binomial setting and not the zeroinflated negative binomial setting. Scree plots of the three datasets show that there are a lot more small factors influencing variation in the seqgendiff and GTEx datasets than in the powsimR dataset (Fig. 1). The main source of variation in the powsimR dataset comes from the group membership, while other (unwanted) effects dominate the variation in the seqgendiff dataset (Fig. 2). It is only the fourth principle component in the seqgendiff dataset that seems to capture the group membership (Supplementary Figure S3, Additional file 1). Though this unwanted variation exists, with such a large sample size voomlimma [26] can accurately estimate the effects (Supplementary Figure S4, Additional file 1). The voom plots (visualizing the meanvariance trend [26]) are about the same in the GTEx and seqgendiff data, but the distribution of the squareroot standard deviations appears more symmetric in the powsimR dataset (Fig. 3). There is also an uncharacteristic hook in the meanvariance trend in the powsimR dataset for lowcounts. These visualizations indicate that seqgendiff can generate more realistic datasets for RNAseq simulation.
Effects on differential expression analysis simulations
The differences in real versus simulated data (as discussed in “Features of real data” section) have real implications when evaluating methods in simulation studies. To demonstrate this, we used the GTEx muscle data to simulate RNAseq data from the twogroup model as in “Features of real data” section. We did this for N=10 individuals, G= 10,000 genes, setting 90% of the genes to be null, and generating the log2fold change from a N(0,0.8^{2}) distribution for the nonnull genes. We simulated 500 datasets this way using both seqgendiff and powsimR. Each replication, we applied DESeq2 [27], edgeR [103], and voomlimma [26] to the simulated datasets. We evaluated the methods based on (i) false discovery proportion when using BenjaminiHochberg [104] to control false discovery rate at the 0.05 level, (ii) power to detect nonnull effects based on a 0.05 false discovery rate control threshold, and (iii) mean squared error of the estimates.
We wanted to make sure that the datasets generated from powsimR and seqgendiff were comparable, so we measured the proportion of variance explained (PVE) by the group membership for each gene, which we define as
where b_{3g} is log2fold change for gene g, \(\tilde {\boldsymbol {y}}_{g}\in \mathbb {R}^{N}\) is the gth row of \(\tilde {\boldsymbol {Y}}\), and V(·) returns the empirical variance of a vector. When we looked at the median (over the nonnull genes) PVE across the datasets, the seqgendiff datasets and powsimR datasets had the same median PVE on average, though there was higher variability in the median PVE among the seqgendiff datasets (Supplementary Figure S5, Additional file 1).
Boxplots of the false discovery proportion for each method in each dataset can be found in Figure 4. Both the powsimR and seqgendiff datasets indicate that only voomlimma can control false discovery rate adequately at the nominal level. However, the results based on the seqgendiff datasets indicate that there is a lot more variability in false discovery proportion than indicated by the powsimR datasets. In particular, it does not seem uncommon for seqgendiff to generate datasets with false discovery proportions well above the nominal rate. If a researcher were using only the theoretical datasets generated by powsimR, they would be overly confident in the methods’ abilities to control false discovery proportion. Supplementary Figure S6 of Additional file 1 also indicates that methods generally have much more variable power between the seqgendiff datasets than between the powsimR datasets. Interestingly, the seqgendiff datasets indicate that methods tend to have smaller mean squared error than indicated by the powsimR datasets (Supplementary Figure S7, Additional file 1).
In Additional file 1, we also compared our simulation method to SimSeq [14] when evaluating differential expression analysis methods. We used the GTEx data [21] for both SimSeq and seqgendiff. SimSeq does not allow researchers to control the effect sizes of simulated nonnull genes, as it depends on the presence of an available indicator variable that already exhibits differential expression in a real dataset. So we adjusted the effect sizes produced by seqgendiff to match those present in the GTEx data, and we found that the two databased simulation methods behave similarly (Supplementary Figure S16, Additional file 1). It bodes well that seqgendiff produces similar results to other databased approaches. The advantages, then, of seqgendiff over SimSeq are

1.
seqgendiff can use effect sizes different than those that are already present in the observed indicator variable, while SimSeq cannot.

2.
Because the effect sizes are unknown in the available indicator variable, SimSeq is unable to evaluate the estimation accuracy of effect sizes. seqgendiff can evaluate estimation accuracy.

3.
seqgendiff can use more complicated designs than the finitegroup model. SimSeq is limited to the finitegroup model.

4.
SimSeq cannot guarantee that genes that are intended to be differentially expressed in a simulated dataset are indeed differentially expressed. This depends on the quality of the available indicator variable.

5.
As a minor advantage, seqgendiff is also much faster than SimSeq. On a 2.6 GHz quadcore PC running Linux with 32 GB of memory, seqgendiff took an average of 0.2 seconds to simulate a dataset, while SimSeq took an average of 51.1 seconds to simulate a dataset. A boxplot of simulation times is presented in Supplementary Figure S17 of Additional file 1.
black
Evaluating factor analyses
As we hope we have made clear, there are many approaches to differential expression analysis (“Application: evaluating differential expression analysis” section), confounder adjustment (“Application: evaluating confounder adjustment” section), library size adjustment (“Application: evaluating effects of library size heterogeneity” section), and factor analysis (“Application: evaluating factor analysis” section). We believe it to be beyond the scope of this work to exhaustively evaluate all of these methods — especially since new methods are being developed each year. Rather, we hope our simulation procedures will be used by the research community to more realistically evaluate and benchmark their approaches to RNAseq data analysis.
However, as a final highlight to the utility of our simulation approaches, we demonstrate these simulation techniques in one application: evaluating factor analysis methods in RNAseq (“Application: evaluating factor analysis” section). We have chosen to highlight this particular application because it uses the more general simulation techniques beyond the twogroup model, which were first demonstrated in [5].
We chose to focus on the following methods based on (i) previous use in expression studies, (ii) software availability, (iii) popularity, and (iv) ease of use.

1.
Principle component analysis (PCA) [79],

2.
Sparse singular value decomposition (SSVD) [93],

3.
Independent component analysis (ICA) [84],

4.
Factors and loadings by adaptive shrinkage (flash), an empirical Bayes matrix factorization approach proposed in [96], and

5.
Probabilistic estimation of expression residuals (PEER) [54], a Bayesian factor analysis used in the popular PEER software to adjust for hidden confounders in gene expression studies.
All factor analysis methods were applied to the log2counts after adding half a pseudocount. To simulate RNAseq data, we took the muscle GTEx data [21] and removed all genes with less than an average of 10 reads per sample. Each replicate, we added a rank1 term. That is we assumed model (9) for the muscle GTEx data, then generated RNAseq data such that
where we simulated the components of x_{3} and the nonzero components of b_{3} from independent normal distributions. We varied the following parameters of the simulation study:

1.
The sample size: N∈{10,20,40}

2.
The signal strength: the standard deviation of the loadings (the b_{3g}’s) was set to one of {0.4,0.8}, with higher standard deviations corresponding to higher signal. These values were chosen to have the median PVE vary greatly between the two settings (Supplementary Figure S8, Additional file 1),

3.
The sparsity: the proportion of loadings (the b_{3g}’s) that are 0 was set to one of {0,0.9}, and

4.
The target correlations of the added factor with the first unobserved factor: r∈{0,0.5}.
This resulted in 24 unique simulation parameter settings. We also used 1000 genes each replication. For each setting, we ran 100 replications of generating data from model (19), and fitting the factors with the five methods under study assuming model (11) after we estimated the number of hidden factors using parallel analysis [105].
We chose three metrics to evaluate the performance of the different factor analysis methods:

1.
The minimum mean squared error between Πx_{3} and the columns of \(\hat {\boldsymbol {F}}\). To account for scale and sign unidentifiability, the estimated factors and the added factor were all scaled to have an ℓ^{2}norm of 1 prior to calculating the mean squared error. This measure is meant to evaluate if any of the estimated factors corresponds to the added factor.

2.
The minimum mean squared error between b_{3} and the columns of \(\hat {\boldsymbol {L}}\). We again accounted for scale and sign unidentifiability by calculating the mean squared error after scaling the estimated and true loadings to have an ℓ^{2}norm of 1.

3.
The angle between Πx_{3} and its projection onto the column space of \(\hat {\boldsymbol {F}}\). This measure is meant to evaluate if the estimated factor matrix includes Πx_{3} among its unidentified factors.
The results are presented in Supplementary Figures S9–S14 of Additional file 1. Based on these figures, we have the following conclusions:

1.
PEER performs very poorly when either the sparsity is high or when there are few samples. It also performs less well when the factors are correlated. A possible explanation is that PEER assumes a normal distribution on the factors and loadings, which is violated in the highsparsity regime and is observed in the lowsparsity regime. Though, this does not explain its poor performance in small sample size settings.

2.
SSVD estimates the loadings very poorly in lowsparsity regimes. This is to be as expected as SSVD assumes sparsity on the loadings. Surprisingly, though, it outperforms PCA in high sparsity regimes only when both the sample size and signal are also large.

3.
ICA performs very poorly in low sparsity regimes. This is to be as expected as the normal distributions placed on the factors and loadings are a worstcase scenario for ICA. However, there is no scenario where ICA performs significantly better than PCA.

4.
flash performs adequately in all scenarios and performs best in highsparsity and highsignal regimes.

5.
PCA performs adequately in most scenarios, and is only truly outperformed in high sparsity high signal regimes.
Based on these initial explorations, we would recommend users not use PEER, SSVD, or ICA and instead try either PCA or flash.
In Section S2.2 of Additional file 1, we evaluate the above factor analyses using a single cell dataset from 10X Genomics [106]. The results indicate that PCA, SSVD, and flash perform comparably in all simulation settings, while PEER and ICA have worse performance in some simulation settings. Though the results were less clear than when using the GTEx data.
Discussion
We have focused on a loglinear model because of the large number of applications this generates (“Application: evaluating differential expression analysis” section, “Application: evaluating confounder adjustment” section, “Application: evaluating effects of library size heterogeneity” section, and “Application: evaluating factor analysis” section). This linearity (on the loglink scale) is represented by the structure of the Q matrix in Procedure 1. However, it is possible to replace Q by any arbitrary G×N matrix that has nonpositive entries. This might be useful for simulations that study adjusting for nonlinear effects, such as bias due to GC content [107]. This also might be useful for evaluating nonlinear dimensionality reduction techniques such as UMAP [108] and tSNE [109], as this allows you to introduce nonlinear effects into an RNAseq dataset. However, these nonlinear effects would still be present only on the logscale.
Our simulation procedures may be applicable beyond evaluating competing methods. Vieth et al. [19] used their simulation software to estimate power given the sample size in a differential expression analysis, and thus to develop sample size suggestions. Our simulation methods may be used similarly. Given a large RNAseq dataset (such as the GTEx data used in this paper), one can repeatedly downsample the number of individuals in the dataset and explore how sample size affects the power of a differential expression analysis.
Similarly, [18] already demonstrated that binomial thinning may be used for sequencing depth suggestions. That is, a researcher may repeatedly thin the libraries of the samples in a large RNAseq dataset and explore the effects on power, thereby providing sequencing depth suggestions. Unlike [18], which does this subsampling uniformly over all counts, we allow researchers to explore the effects of heterogeneous subsampling (as in “Application: evaluating effects of library size heterogeneity” section). This might be useful if, say, researchers have more individuals in one group than in another and so wish to explore if they can sequence the larger group to a lower depth without affecting power.
In this manuscript, we have discussed our simulation techniques in the context of RNAseq. However, our techniques would also be applicable to the comparative analysis of metagenomics methods [110]. Instead of quantifying gene expression, metagenomics quantifies gene abundances within metagenomes. Our simulation techniques could be applied in this context by taking a real metagenomics dataset and adding signal to it by binomial thinning.
Conclusions
We developed a procedure to add a known amount of signal to any real RNAseq dataset. We only assume that this signal comes in the form of a generalized linear model with a loglink function from a very flexible distribution. We demonstrated how real data contain features that are not captured by simulated data, and that this can cause important differences in the results of a simulation study. We highlighted our simulation approach by comparing a few popular factor analysis methods. We found that PCA and flash had the most robust performances across a wide range of simulation settings.
Abbreviations
 DESeq2:

Differential expression analysis for sequence count data 2
 edgeR:

Empirical analysis of digital gene expression in R
 FDP:

False discovery proportion
 flash :

Factors and loadings by adaptive shrinkage
 GTEx:

Genotypetissue expression
 ICA:

Independent component analysis
 limma:

Linear models for microarray and RNAseq data
 PCA:

Principle component analysis
 PEER:

Probabilistic estimation of expression residuals
 powsimR :

Power simulations for RNAsequencing in R
 PVE:

Proportion of variance explained
 RNAseq:

Ribonucleic acid sequencing
 seqgendiff :

Sequence generation for differential expression analysis and beyond
 SimSeq:

Nonparametric Simulation of RNASeq Data
 SSVD:

Sparse singular value decomposition
 tSNE:

tDistributed Stochastic Neighbor Embedding
 UMAP:

Uniform Manifold Approximation and Projection
 voom:

variance modeling at the observational level
References
 1
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10(1):57.
 2
Hwang B, Lee JH, Bang D. Singlecell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 2018; 50(8):96.
 3
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al.Tackling the widespread and critical impact of batch effects in highthroughput data. Nat Rev Genet. 2010; 11(10):733–9.
 4
Svensson V. Droplet scRNAseq is not zeroinflated. Nat Biotechnol. 2020; 38(2):147–150. https://doi.org/10.1038/s4158701903795.
 5
Gerard D, Stephens M. Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls. Statistica Sinica. 2019;: in press. https://doi.org/10.5705/ss.202018.0345.
 6
Gerard D, Stephens M. Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation. Biostatistics. 2018. https://doi.org/10.1093/biostatistics/kxy029.
 7
Lu M. Generalized Adaptive Shrinkage Methods and Applications in Genomics Studies. ProQuest Dissertations and Theses. 2018; 1:129. http://proxyau.wrlc.org/login?url=https://search.proquest.com/docview/2161785175?accountid=8285.
 8
Mehta T, Tanik M, Allison DB. Towards sound epistemological foundations of statistical methods for highdimensional biology. Nat Genet. 2004; 36(9):943.
 9
Nettleton D, Recknor J, Reecy JM. Identification of differentially expressed gene categories in microarray studies using nonparametric multivariate analysis. Bioinformatics. 2007; 24(2):192–201.
 10
Gadbury GL, Xiang Q, Yang L, Barnes S, Page GP, Allison DB. Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates. PLoS Genet. 2008; 06;4(6):1–8.
 11
Rocke DM, Ruan L, Zhang Y, Gossett JJ, DurbinJohnson B, Aviran S. Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNASeq Data. bioRxiv. 2015. Cold Spring Harbor Laboratory. https://doi.org/10.1101/020784. https://www.biorxiv.org/content/early/2015/06/11/020784.
 12
Sun L, Stephens M. Solving the Empirical Bayes Normal Means Problem with Correlated Noise. arXiv preprint arXiv:181207488. 2018. https://arxiv.org/abs/1812.07488.
 13
Rigaill G, Balzergue S, Brunaud V, Blondet E, Rau A, Rogier O, et al.Synthetic data sets for the identification of key ingredients for RNAseq differential analysis. Brief Bioinformatics. 2016; 10;19(1):65–76.
 14
Benidt S, Nettleton D. SimSeq: a nonparametric approach to simulation of RNAsequence datasets. Bioinformatics. 2015; 02;31(13):2131–40.
 15
Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012; 99(2):248–56.
 16
Reeb P, Steibel J. Evaluating statistical analysis models for RNA sequencing experiments. Front Genet. 2013; 4:178.
 17
van de Wiel MA, Neerincx M, Buffart TE, Sie D, Verheul HM. ShrinkBayes: a versatile Rpackage for analysis of countbased sequencing data in complex study designs. BMC Bioinformatics. 2014; 15(1):116.
 18
Robinson DG, Storey JD. subSeq: Determining Appropriate Sequencing Depth Through Efficient Read Subsampling. Bioinformatics. 2014; 09;30(23):3424–6.
 19
Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power analysis for bulk and single cell RNAseq experiments. Bioinformatics. 2017; 07;33(21):3486–8.
 20
Zappia L, Phipson B, Oshlack A. Splatter: simulation of singlecell RNA sequencing data. Genome Biol. 2017; 18(1):174.
 21
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017; 550(7675):204.
 22
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2007; 08;9(2):321–32.
 23
Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010; 11(1):422.
 24
Van De Wiel MA, Leday GGR, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2012; 09;14(1):113–28.
 25
Kharchenko PV, Silberstein L, Scadden DT, Bayesian approach to singlecell differential expression analysis. Nat Methods. 2014; 11(7):740.
 26
Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014; 15:R29.
 27
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014; 15(12):550.
 28
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al.MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 2015; 16(1):278.
 29
Guo M, Wang H, Potter SS, Whitsett JA, Xu Y. SINCERA: A Pipeline for SingleCell RNASeq Profiling Analysis. PLoS Comput Biol. 2015; 11;11(11):1–28.
 30
Nabavi S, Schmolze D, Maitituoheti M, Malladi S, Beck AH. EMDomics: a robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics. 2015; 32(4):533–41.
 31
Delmans M, Hemberg M. Discrete distributional differential expression (D3E)  a tool for gene expression analysis of singlecell RNAseq data. BMC Bioinformatics. 2016; 17(1):110.
 32
Korthauer KD, Chu LF, Newton MA, Li Y, Thomson J, Stewart R, et al.A statistical approach for identifying differential distributions in singlecell RNAseq experiments. Genome Biol. 2016; 17(1):222.
 33
CostaSilva J, Domingues D, Lopes FM. RNASeq differential expression analysis: An extended review and a software tool. PLoS ONE. 2017; 12;12(12):1–18.
 34
Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Singlecell mRNA quantification and differential analysis with Census. Nat Methods. 2017; 14(3):309.
 35
Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in singlecell RNAseq data. Bioinformatics. 2018; 04;34(18):3223–4.
 36
Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from singlecell RNAseq data. Nat Commun. 2018; 9(1):284.
 37
Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D, Vert JP, et al.Observation weights unlock bulk RNAseq tools for zero inflation and singlecell applications. Genome Biol. 2018; 19(1):24.
 38
Wang T, Nabavi S. SigEMD: A powerful method for differential gene expression analysis in singlecell RNA sequencing data. Methods. 2018; 145:25–32.
 39
Wang T, Li B, Nelson CE, Nabavi S. Comparative analysis of differential gene expression analysis tools for singlecell RNA sequencing data. BMC Bioinformatics. 2019; 20(1):40.
 40
Stephens M. False discovery rates: a new deal. Biostatistics. 2016; 10;18(2):275–94.
 41
Wakefield J. Bayes factors for genomewide association studies: comparison with pvalues. Genet Epidemiol. 2009; 33(1):79–86.
 42
Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):1–25. https://doi.org/10.2202/15446115.1027.
 43
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012; 01;40(10):4288–297.
 44
Tang M, Sun J, Shimizu K, Kadota K. Evaluation of methods for differential expression analysis on multigroup RNAseq count data. BMC Bioinformatics. 2015; 16(1):360.
 45
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3(9):1724–35.
 46
Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M. HighDimensional Sparse Factor Modeling: Applications in Gene Expression Genomics. J Am Stat Assoc. 2008; 103(484):1438–56.
 47
Kang HM, Ye C, Eskin E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008; 180(4):1909–25.
 48
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, et al.Efficient control of population structure in model organism association mapping. Genetics. 2008; 178(3):1709–23.
 49
Leek JT, Storey JD. A general framework for multiple testing dependence. Proc Natl Acad Sci. 2008; 105(48):18718–23.
 50
Stegle O, Kannan A, Durbin R, Winn J. Accounting for Nongenetic Factors Improves the Power of eQTL Studies In: Vingron M, Wong L, editors. Research in Computational Molecular Biology: 12th Annual International Conference, RECOMB 2008, Singapore, March 30  April 2, 2008. Berlin: Springer Berlin Heidelberg: 2008. p. 411–22.
 51
Friguet C, Kloareg M, Causeur D. A factor model approach to multiple testing under dependence. J Am Stat Assoc. 2009; 104(488):1406–15.
 52
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong Sy, Freimer NB, et al.Variance component model to account for sample structure in genomewide association studies. Nat Genet. 2010; 42(4):348–54.
 53
Listgarten J, Kadie C, Schadt EE, Heckerman D. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci. 2010; 107(38):16465–70.
 54
Stegle O, Parts L, Durbin R, Winn J. A Bayesian Framework to Account for Complex NonGenetic Factors in Gene Expression Levels Greatly Increases Power in eQTL Studies. PLoS Comput Biol. 2010; 05;6(5):1–11.
 55
Wu Z, Aryee MJ. Subset quantile normalization using negative control features. J Comput Biol. 2010; 17(10):1385–95.
 56
Fusi N, Stegle O, Lawrence ND. Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies. PLoS Comput Biol. 2012; 01;8(1):1–9.
 57
GagnonBartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012; 13(3):539–52.
 58
Stegle O, Parts L, Piipari M, Winn J, Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protocol. 2012; 7(3):500–7.
 59
Sun Y, Zhang NR, Owen AB. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann Appl Stat. 2012; 12;6(4):1664–88.
 60
GagnonBartsch J, Jacob L, Speed T. Removing Unwanted Variation from High Dimensional Data with Negative Controls. Technical Report 820. Berkeley: University of California; 2013. http://statistics.berkeley.edu/techreports/820. Accessed Jan 2020.
 61
Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, Montgomery SB, et al.Normalizing RNAsequencing data by modeling hidden covariates with prior knowledge. PLoS ONE. 2013; 8(7):e68141. https://doi.org/10.1371/journal.pone.0068141.
 62
Yang C, Wang L, Zhang S, Zhao H. Accounting for nongenetic factors by lowrank representation and sparse regression for eQTL mapping. Bioinformatics. 2013; 29(8):1026–34.
 63
Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014; 10;42(21):e161.
 64
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014; 32(9):896.
 65
Perry PO, Pillai NS. Degrees of freedom for combining regression with factor analysis. arXiv preprint arXiv:13107269. 2015. https://arxiv.org/abs/1310.7269.
 66
Chen M, Zhou X. Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes. Sci Rep. 2017; 7(1):13587.
 67
Lee S, Sun W, Wright FA, Zou F. An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika. 2017; 104(2):303–16.
 68
Wang J, Zhao Q, Hastie T, Owen AB. Confounder adjustment in multiple hypothesis testing. Ann Statist. 2017; 10;45(5):1863–94.
 69
Caye K, Jumentier B, François O. LFMM 2.0: Latent factor models for confounder adjustment in genome and epigenomewide association studies. bioRxiv. 2018.
 70
Hung H. A robust removing unwanted variation–testing procedure via γ divergence. Biometrics. 2019; 75(2):650–662. https://doi.org/10.1111/biom.13002. https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13002.
 71
McKennan C, Nicolae D. Accounting for unobserved covariates with varying degrees of estimability in highdimensional biological data. Biometrika. 2019; 09;106(4):823–40.
 72
McKennan C, Nicolae D. Estimating and accounting for unobserved covariates in high dimensional correlated data. arXiv preprint arXiv:180805895. 2018. https://arxiv.org/abs/1808.05895.
 73
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10):R106.
 74
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNAseq experiments. BMC Bioinformatics. 2010; 11(1):94.
 75
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010; 11(3):R25.
 76
Langmead B, Hansen KD, Leek JT. Cloudscale RNAsequencing differential expression analysis with Myrna. Genome Biol. 2010; 11(8):R83.
 77
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, et al.A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinformatics. 2012; 09;14(6):671–83.
 78
Spearman C. "General Intelligence," Objectively Determined and Measured. Am J Psychol. 1904; 15(2):201–92.
 79
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933; 24(6):417.
 80
Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936; 1(3):211–8.
 81
Comon P. Independent component analysis, A new concept? Signal Processing. 1994; 36(3):287–314. Higher Order Statistics.
 82
Tipping ME, Bishop CM. Probabilistic Principal Component Analysis. J R Stat Soc Ser B Stat Methodol. 1999; 61(3):611–22.
 83
Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999; 401(6755):788.
 84
Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000; 13(4):411–30.
 85
West M. Bayesian factor regression models in the “large p, small n" paradigm In: Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, et al., editors. Bayesian Statistics 7. Proceedings of the Seventh Valencia International Meeting. Oxford: Clarendon Press: 2003. p. 733–42.
 86
Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat. 2006; 15(2):265–86.
 87
Hoff PD. Model averaging and dimension selection for the singular value decomposition. J Amer Statist Assoc. 2007; 102(478):674–85.
 88
Salakhutdinov R, Mnih A. Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning. ICML ’08. New York: ACM: 2008. p. 880–887.
 89
Ghosh J, Dunson DB. Default prior distributions and efficient posterior computation in Bayesian factor analysis. J Comput Graph Stat. 2009; 18(2):306–20.
 90
Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009; 04;10(3):515–34.
 91
Engelhardt BE, Stephens M. Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis. PLoS Genet. 2010; 09;6(9):1–12.
 92
Mayrink VD, Lucas JE. Sparse latent factor models with interactions: Analysis of gene expression data. Ann Appl Stat. 2013; 06;7(2):799–822.
 93
Yang D, Ma Z, Buja A. A Sparse Singular Value Decomposition Method for HighDimensional Data. J Comput Graph Stat. 2014; 23(4):923–42.
 94
Josse J, Wager S. BootstrapBased Regularization for LowRank Matrix Estimation. J Mach Learn Res. 2016; 17(124):1–29. http://jmlr.org/papers/v17/14534.html.
 95
Leung D, Drton M. Orderinvariant prior specification in Bayesian factor analysis. Stat Probab Lett. 2016; 111:60–66.
 96
Wang W, Stephens M. Empirical Bayes Matrix Factorization. arXiv preprint arXiv:180206931. 2018. https://arxiv.org/abs/1802.06931.
 97
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al.Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33(2):155.
 98
Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, et al.Computational assignment of cellcycle stage from singlecell transcriptome data. Methods. 2015; 85:54–61.
 99
Hansen BB, Klopfer SO. Optimal Full Matching and Related Designs via Network Flows. J Comput Graph Stat. 2006; 15(3):609–27.
 100
Gale D, Shapley LS. College Admissions and the Stability of Marriage. Am Math Mon. 1962; 69(1):9–15. http://www.jstor.org/stable/2312726.
 101
Kuhn HW. The Hungarian method for the assignment problem. Nav Res Logist Q. 1955; 2(12):83–97.
 102
Zhang F, Horn RA. In: Zhang F, (ed).Basic properties of the Schur complement: Springer; 2005, pp. 17–46.
 103
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009; 11;26(1):139–40.
 104
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995:289–300. http://www.jstor.org/stable/2346101.
 105
Buja A, Eyuboglu N. Remarks on parallel analysis. Multivar Behav Res. 1992; 27(4):509–40.
 106
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al.Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8:14049.
 107
Risso D, Schwartz K, Sherlock G, Dudoit S. GCContent Normalization for RNASeq Data. BMC Bioinformatics. 2011; 12(1):480.
 108
McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Journal of Open Source Software. 2018; 3(29):861. The Open Journal. https://doi.org/10.21105/joss.00861. https://doi.org/10.21105/joss.00861.
 109
Maaten Lvd, Hinton G. Visualizing data using tSNE. J Mach Learn Res. 2008; 9(Nov):2579–605.
 110
Jonsson V, Österlund T, Nerman O, Kristiansson E. Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics. BMC genomics. 2016; 17(1):78.
 111
The GenotypeTissue Expression (GTEx) Project. GTEx Analysis V7. 2016. https://gtexportal.org. Accessed Jan 2020.
 112
Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: SpringerVerlag; 2016. https://ggplot2.tidyverse.org. Accessed Jan 2020.
 113
R Core Team. R: A Language and Environment for Statistical Computing. Vienna; 2019. https://www.Rproject.org/.
Funding
Not applicable.
Author information
Affiliations
Contributions
Authors’ contributions
DG developed the methodology, wrote the software, implemented the study, and wrote the manuscript. The author read and approved the final manuscript.
Authors’ information
DG is an assistant professor of Statistics in the Department of Mathematics and Statistics at American University in Washington, DC.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
This PDF file contains theoretical considerations, simulation summaries and figures, and additional simulation details.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gerard, D. Databased RNAseq simulations by binomial thinning. BMC Bioinformatics 21, 206 (2020). https://doi.org/10.1186/s1285902034509
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902034509
Keywords
 RNAseq
 Simulation
 Differential expression
 Factor analysis
 Confounders
 Scaling factors