Data-based RNA-seq Simulations by Binomial Thinning

With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in un-substantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Net-work: https://cran.r-project.org/package=seqgendiff.

RNA-seq dataset and add a prespecified signal to it. The main advantage of our approach is that any unwanted variation in the real data is maintained in the simulated data, and this unwanted variation need not be prespecified by the researcher. The way we add signal does carry assumptions, but they are flexible (Supplementary Section S1.2, Additional file 1). And we believe that this way of simulation, compared to simulating under a theoretical model, allows researchers to more realistically evaluate their methods.
This manuscript essentially generalizes the simulation techniques proposed in [5,6], and [7]. These previous papers use binomial thinning (the approach used in this paper) in the case where there are just two groups that are differentially expressed (hereafter, the "two-group model"). Binomial thinning is the process of subsampling counts using the binomial distribution. This subsampling is applied to different individuals heterogeneously to add signal to the observed counts. These papers did not develop methods for more complicated design scenarios, they did not present user-friendly software implementations for their simulation techniques, and they did not justify their simulation techniques as broadly. Here, we allow for arbitrary experimental designs, we release software for users to implement their own simulations, and we justify our techniques using very flexible assumptions.
There has been some other previous work on "databased" simulations in expression analyses. Datasets resulting from data-based simulations (sometimes called "plasmodes" [8]) have been used in microarray studies before the development of RNA-seq [9,10]. All RNA-seq databased simulation methods have so far operated in the two-group (or finite-group) model, without any ability to simulate data from arbitrary experimental designs. Rocke et al. [11] and [12] randomly shuffled group indicators in the two-group model, resulting in completely null data, and methods can be evaluated on their ability to control for type I error when the data are all null. Rigaill et al. [13], in addition to generating null data by randomly shuffling group labels, incorporate multiple datasets to create some non-null genes within their simulated datasets. Benidt and Nettleton [14] use a count-swapping algorithm in the twogroup model to create differentially expressed genes when one already has two treatment groups. Kvam et al. [15,16], and [17] create non-null genes by multiplying counts for all individuals in a group by the fold-change in mean expression. [18] uses a binomial distribution approach to uniformly decrease the sequencing depth of an entire dataset (but not to add differentially expressed genes). Concerning non-data-based methods, [19] and [20] use real RNA-seq data to estimate the parameters in a datagenerating model before simulating data from the theoretical model using these estimated parameter values. Our work is the first to extend data-based RNA-seq simulation beyond the finite-group model. Our paper is organized as follows. We first list the goals and assumptions of our simulation scheme ("Goals and assumptions" section) before motivating it with four applications ("Application: evaluating differential expression analysis" section, "Application: evaluating confounder adjustment" section, "Application: evaluating effects of library size heterogeneity" section, and "Application: evaluating factor analysis" section) and describing our process of simulating RNA-seq in detail ("Generating modified RNA-seq data" section). We then demonstrate how our approach can more accurately preserve structure in a real dataset compared to simulating a dataset from a theoretical model ("Features of real data" section). We show that this can alter the conclusions of a differential expression analysis simulation study ("Effects on differential expression analysis simulations" section). We then apply our simulation approach by comparing five factor analysis methods using the GTEx data [21] ("Evaluating factor analyses" section). We finish with a discussion and conclusions ("Discussion" section and "Conclusions" section). We adopt the following notation. We denote matrices by bold uppercase letters (A), vectors by bold lowercase letters (a), and scalars by non-bold letters (a or A). Indices typically run from 1 to their uppercase version, e.g. a = 1, 2, . . . , A. Where there is no chance for confusion, we let non-bold versions of letters represent the scalar elements of matrices and vectors. So a ij is the (i, j)th element of A, while a i is the ith element of a. We let 1 A denote the Avector of 1's and 1 A×B the A × B matrix of 1's. The matrix transpose is denoted by A .

Goals and assumptions
We will now describe the goals and assumptions of our simulation method, which relies on a researcher having access to a real RNA-seq dataset. Suppose a researcher has a matrix Y ∈ R G×N of RNA-seq read-counts for G genes and N individuals. Also suppose a researcher has access to a design matrix X 1 ∈ R N×P 1 with P 1 variables. The availability of X 1 is optional, not essential to the method, and is mostly for descriptive purposes. We assume the RNA-seq counts, Y , are generated according to the following model: where • μ ∈ R G is a vector of intercept terms for the genes, • A ∈ R G×K is the corresponding coefficient matrix of Z, and • ∈ R G×N represents all other unwanted variation not accommodated by the other terms in the model, where μ, B 1 , Z, A, and are all unknown. Given the above data-generating process, suppose a user provides the following (known) elements: • X 2 ∈ R N×P 2 , a design matrix with fixed rows (see note 3 below), • B 2 ∈ R G×P 2 , the coefficient matrix corresponding to X 2 , • X 3 ∈ R N×P 3 , a design matrix with rows that can be permuted (see note 3 below), and Our goal is to generate a matrixỸ ∈ R G×N from Y such thatỹ where • ∈ R N×N is a random permutation matrix, whose distribution controls the level of association between the columns of X 3 and the columns of Z, and •μ is a new vector of intercept terms for the genes.
We will provide the details on how to generateỸ from Y in "Generating modified RNA-seq data" section. But we would like to first provide some notes below, and then discuss the applications of being able to generate (2) from (1).

Note 1:
For simplicity we use the Poisson distribution in the main text (Eqs. (1) and (2)). However, our approach is valid under much more general assumptions. In particular, we note that if the counts were generated according to a negative binomial distribution, a zero-inflated negative binomial distribution, or even a mixture of binomials and negative binomials, then our simulation scheme still preserves the structure of the data (Supplementary Section S1.2, Additional file 1). However, even when our general modeling assumptions are violated, one can show (via the law of total expectation) that if . Thus, our procedure will produce the correct mean log 2 -fold change in the new dataset, but the resulting mean/variance relationship might not be as assumed.

Note 2:
The term in (1) and (2) represents the realistic and annoying features of the data. In ideal situations, = 0 G×N . However, most datasets likely include non-zero , and so assessing a method's ability to be robust in the presence of , without the researcher having to prespecify , is the key strength of our simulation approach. Note 3: As described below, we include both X 2 and X 3 in (2) to control different aspects of a simulation study. One may control the level of association between the columns of X 1 and X 2 as these are both observed and fixed by the user. The inclusion of X 3 and allows us to try to control the level of association between X 3 and Z.
Before we discuss obtaining (2) from (1), we point out four potential applications of this simulation approach: (i) evaluating differential expression analyses ("Application: evaluating differential expression analysis" section), (ii) evaluating confounder adjustment approaches ("Application: evaluating confounder adjustment" section), (iii) evaluating the effects of library size heterogeneity on differential expression analyses ("Application: evaluating effects of library size heterogeneity" section), and (iv) evaluating factor analysis methods ("Application: evaluating factor analysis" section).

Application: evaluating differential expression analysis
One of the more common applications of RNA-seq data is estimating and testing for differences in gene expression between two groups. Many packages and techniques exist to perform this task [22-39, among others], and so developing approaches and software to compare these different software packages would be of great utility to the scientific community. Generating data from the two-group model is a special case of (1) and (2), where and x ∈ R N contains a single indicator variable, indicating membership to one of two groups. Researchers may specify b and x and evaluate a method's ability to (i) estimate b and (ii) detect which genes have non-zero b g . In many settings, a researcher may want to specify the distribution of the b g 's (the elements of b). Our software implementation allows for this. In addition, following [40], we allow researchers to specify the distribution of b g /s α g , where s g is the sample standard deviation of the gth row of log 2 (Y + 0.5), and α is a user-specified constant. Allowing for α = 0 corresponds to the scenario of specifying the distribution of the effects, while allowing for α = 1 corresponds to specifying the p-value prior of [41].
Though the two-group model is perhaps the most common scenario in differential expression analysis, our method also allows for arbitrary design matrices. Such design matrices have applications in many types of expression experiments [24,[42][43][44], and so the ability to simulate arbitrary designs gives researchers another tool to evaluate their methods in more complicated scenarios.

Application: evaluating confounder adjustment
Unobserved confounding / batch effects / surrogate variables / unwanted variation has been recognized as a serious impediment to scientific studies in the modern "omics" era [3]. As such, there is a large literature on accounting for unwanted variation, particularly in RNAseq studies [5, 6, 45-72, among others]. The glut of available methods indicates a need to realistically compare these methods.
Typically, the form and strength of any unobserved confounding is not known. So one way to assess different confounder adjustment methods would be to assume model (1) and add signal to the data resulting in the following submodel of (2): A researcher would then explore how close a method's estimate of B 3 is to the truth (assuming the researcher may use both X 1 and X 3 to obtain this estimate). The researcher can control the correlation between the columns of X 3 and the columns of Z by specifying the distribution of (as described in "Generating modified RNA-seq data" section). Intuitively, the stronger the correlation between the columns of X 3 and the columns of Z, the more difficult the confounder adjustment problem. This approach was used in the two-group model in [5] and [6], but not for general design matrices.
Application: evaluating effects of library size heterogeneity "Library size" corresponds to the number of reads an individual sample contains. Adjusting for library size is surprisingly subtle and difficult, and thus many techniques have been proposed to perform this adjustment [73][74][75][76][77]. The most commonly-used techniques can be viewed as a form of confounder adjustment [5]. For most methods, this form of confounder adjustment corresponds to setting one column of A in (1) to be 1 G and estimating the corresponding column in Z using some robust method that assumes that the majority of genes are nondifferentially expressed.
One way to evaluate the performance of a library size adjustment procedure is to see how effect size estimates change when the samples are thinned, changing the library size. First, assume we are operating in the following submodel of (1): A researcher may specify (i) additional signal and (ii) a further amount of thinning on each sample by generating the following submodel of (2): To evaluate the effectiveness of a library size adjustment procedure, researchers may observe the effects on the estimates of B 3 under various amounts of library thinning (controlled by altering x 2 and x 3 ).

Application: evaluating factor analysis
Factor analysis is a fundamental technique in every statistician's arsenal. Since its creation by Spearman [78], literally hundreds of factor analysis / matrix decomposition / matrix factorization approaches have been developed, and new approaches are created each year to account for new features of new data [54,[79][80][81][82][83][84][85][86][87][88][89][90][91][92][93][94][95][96], to name a very few]. For RNA-seq, factor analysis methods have found applications in accounting for unwanted variation [63,64], estimating cell-cycle state [97,98], and general quality assessments [27]. Thus, techniques to realistically compare various factor analysis methods would be of great use to the scientific community. We demonstrate in this section how our simulation approaches can be used to evaluate factor analysis methods applied to RNA-seq.
We suppose that the RNA-seq read-counts follow the following submodel of (1): We then suppose that the researcher generates a modified dataset that follows the following submodel of (2): We assume that a researcher applies a factor analysis to (10) to estimate a low-rank matrix with K + P 3 factors. That is, the researcher fits the following model, with factor matrix F ∈ R N×(K+P 3 ) and loading matrix L ∈ R G×(K+P 3 ) , obtaining estimatesL andF. These estimates are obtained without using X 3  In a factor analysis, the factors and loadings are only identifiable after imposing assumptions on their structure (such as sparsity or orthogonality). Thus, researchers may vary the structure of B 3 and X 3 and observe the robustness of their factor analysis methods to violations of their structural assumptions.

Generating modified RNA-seq data
We will now discuss the approach of obtaining (2) from (1). We will use the following well-known fact of the Poisson distribution, which may be found in many elementary probability texts: Generalizations of Lemma 1 to negative binomial distributions (and mixtures of negative binomial distributions) may be found in Section S1.2 of Additional file 1.
In the case when is drawn uniformly from the space of permutation matrices, we have the simplified procedure described in Procedure 1. The validity of Procedure 1 follows directly from the modeling assumptions in (1) and Lemma 1. Since y gn ∼ Poisson(2 θ gn ) andỹ gn |y gn ∼ Bin(y gn , 2 q gn ), we have thatỹ gn ∼ Poisson(2 θ gn +q gn ). If we setθ gn = θ gn + q gn , then we havẽ Equation (13) follows from the definition of from (1) and the definition of Q from Step 4 of Procedure 1. Equation (15) follows by settingμ to be μ − e.
There are two main reasons to subtract the row-wise maximum from each row in Step 4 of Procedure 1: (i) this ensures that the binomial probabilities (2 q gn ) are always between 0 and 1, and (ii) this allows for minimal countthinning while still obtaining our goal of (2). That is, the binomial probabilities will all be between 0 and 1, but they will be as close to 1 as possible while still yielding (2), thereby reducing the amount of discarded counts.
The main disadvantage to Procedure 1 is that the surrogate variables (Z) will be independent of the user-specified Procedure 1 Basic procedure to generate (2) from (1) when the permuted design matrix ( X 3 ) is independent of the surrogate variables. Input: Y , X 2 , X 3 , B 2 , B 3 .
1: Draw uniformly from the space of N × N permutation matrices. 2: Let = B 2 X 2 + B 3 X 3 . 3: Let e ∈ R G contain the row-wise maximums of .
covariates ( X 3 ). To allow the user to control the level of association between the surrogate variables and the user-provided variables, we propose using Procedure 2 to choose , rather than drawing uniformly from the space of permutation matrices. In brief, the user specifies a "target correlation" matrix, R ∈ R P 3 ×K , where r ik is what the user desires to be the correlation between the ith column of X 3 and the kth column of Z. We then estimate the surrogate variables either using a factor analysis (such as the truncated singular value decomposition) or surrogate variable analysis [45,49]. Note that this estimate of Z is only used to permute the rows of X 3 and is otherwise not included in the simulated data. We then draw a new random matrix U ∈ R N×P 3 from a conditional normal distribution assuming that each row of U and Z is jointly normal with covariance matrix (16), thus the correlation between the columns of U and Z will be approximately R. We then match the rows of X 3 with the rows of U using the pair-wise matching algorithm of [99], though our software provides other options to match pairs via either the Gale-Shapley algorithm [100] or the Hungarian algorithm [101]. This ensures that X 3 is as close to U as possible. We denote the permutation matrix that matches the rows of X 3 with the rows of U by .
The resulting covariance matrix (16) used in Procedure 2 is not guaranteed to be positive semidefinite. Rather than demand the user specify an appropriate target correlation matrix (which might be in general difficult for the typical user), we modify the target correlation matrix using Procedure 3 to iteratively shrink R until the Schur complement condition for positive semi-definiteness [102] is satisfied.
Procedure 2 is a compromise between letting the user specify the full design matrix X 3 and letting the user specify the correlation between the columns of X 3

and Z.
A user might want to specify the correlation between X 3 and Z to evaluate factor analyses in the presence of correlated factors ("Application: evaluating factor analysis" section), or to evaluate how well confounder Procedure 2 Procedure to draw a permutation matrix such that the surrogate variables are correlated with the permuted design matrix. Input: Y , X 1 , X 3 , R, and K.
1: Let A ∈ R P 3 ×P 3 be the empirical correlation matrix between the columns of X 3 . 2: Adjust R by Procedure 3. 3: Estimate Z ∈ R G×K in one of two ways: i. By surrogate variable analysis [45,49,63], using (1 N , X 1 ) as the design matrix and 1 N as the null design matrix. ii. By a factor analysis on the residuals of a regression of log 2 (Y + 0.5) on (1 N , X 1 ).
Call the centered and scaled estimates of the surrogate variables (so that the columns each have mean 0 and variance 1)Ẑ. 4: Draw the rows of U ∈ R N×P 3 from a conditional normal distribution, assuming the nth rows of U andẐ are jointly N(0 P 3 +K , ), where 5: Match the rows of the centered and scaled matrix X 3 with the rows of the centered and scaled matrix U by pair-matching [99] using Euclidean distance. Call the resulting permutation matrix , such that row i of X 3 matches with row i of U. Output: . τ ← max(τ − , 0).
adjustment approaches cope in the presence of correlated confounders ("Application: evaluating confounder adjustment" section). In the simple case when X 3 andẐ are drawn from a normal distribution, Procedure 2 will permute the rows of X 3 so that X 3 andẐ consistently has the correct correlation structure (Theorem S1 in Additional file 1). However, for general design matrices this will not be the case. Procedure 4 (implemented in our software) provides a Monte Carlo algorithm to estimate the true correlation given the target correlation. Basically, the estimator approximates the expected value (conditional onẐ) of the Pearson correlations between the columns of X 3 and the columns ofẐ. We justify this in an intuitive way by the law of total expectation. Consider x a single column of X 3 with empirical mean and standard deviation ofx and s x . Similarly consider z a single column ofẐ with empirical mean and standard deviation ofz and s z . Then The estimator in Procedure 4 is a Monte Carlo approximation to the internal expectation in (17). We explore this correlation estimator through simulation in Supplementary Section S2.1, Additional file 1.

Procedure 4
Monte Carlo procedure to estimate the true correlation matrix given the target correlation matrix. Input: Z, X 3  Draw U as in Step 4 of Procedure 2.

3:
Derive as in Step 5 of Procedure 2.

4:
Set R b ∈ R P 3 ×K to be the Pearson correlation matrix between the columns of X 3 and Z. 5: end for 6: SetR = (R 1 + · · · + R B )/B. Output:R.
All simulation methods introduced in this paper are implemented in the seqgendiff R package, available on the Comprehensive R Archive Network: https://cran.rproject.org/package=seqgendiff.

Features of real data
Real data exhibit characteristics that are difficult to capture by simulations. In this section, we demonstrate how our binomial thinning approach maintains these features, while simulating from a theoretical model results in unrealistic simulated RNA-seq data.
We took the GTEx muscle data [21], and filtered out all genes with a mean read-depth of less than 10 reads. This resulted in a dataset containing 18,204 genes and 564 individuals. We then randomly assigned half of the individuals to one group and half to the other group, and used our seqgendiff software to add a N(0, 0.8 2 ) log 2 -fold-change between groups to 25% of the genes. We similarly used the powsimR software [19] to generate data according to a theoretical negative binomial model (with parameters estimated from the GTEx muscle data), again by adding a N(0, 0.8 2 ) log 2 -fold-change between the two groups in 25% of the genes. The parameters estimated and used by powsimR include the mean normalized read counts per gene, the estimated library size factor per sample, and a nonparametric estimate of the mean/dispersion relationship of the counts. powsimR uses the mean normalized read counts, the estimated size factors, and the user-provided log 2 -fold changes to provide a mean for the negative binomial distribution. Based on this mean, it uses the estimated mean/dispersion relationship to provide a dispersion parameter for the negative binomial distribution.
The results below are from one simulation, but the results are robust and consistent across many datasets. The reader is encouraged to change the random seed in our code to explore the robustness of our conclusions.
The structure of the powsimR dataset is very different from that observed in the seqgendiff and GTEx datasets. There seems to be more zeros in the powsimR dataset than in the seqgendiff and GTEx datasets (Supplementary Figure S2, Additional file 1), even though we simulated the powsimR dataset under the negative binomial setting and not the zero-inflated negative binomial setting. Scree plots of the three datasets show that there are a lot more small factors influencing variation in the seqgendiff and GTEx datasets than in the powsimR dataset (Fig. 1). The main source of variation in the powsimR dataset comes from the group membership, while other (unwanted) effects dominate the variation in the seqgendiff dataset (Fig. 2). It is only the fourth principle component in the seqgendiff dataset that seems to capture the group membership (Supplementary Figure S3, Additional file 1). Though this unwanted variation exists, with such a large sample size voom-limma [26] can accurately estimate the effects (Supplementary Figure S4, Additional file 1). The voom plots (visualizing the mean-variance trend [26]) are about the same in the GTEx and seqgendiff data, but the distribution of the square-root standard deviations appears more symmetric in the powsimR dataset (Fig. 3). There is also an uncharacteristic hook in the mean-variance trend in the powsimR dataset for low-counts. These visualizations indicate that seqgendiff can generate more realistic datasets for RNA-seq simulation.

Effects on differential expression analysis simulations
The differences in real versus simulated data (as discussed in "Features of real data" section) have real implications when evaluating methods in simulation studies. To demonstrate this, we used the GTEx muscle data to simulate RNA-seq data from the two-group model as in "Features of real data" section. We did this for N = 10 individuals, G = 10,000 genes, setting 90% of the genes to be null, and generating the log 2 -fold change from a N(0, 0.8 2 ) distribution for the non-null genes. We simulated 500 datasets this way using both seqgendiff and powsimR. Each replication, we applied DESeq2 [27], edgeR [103], and voom-limma [26] to the simulated datasets. We evaluated the methods based on (i) false discovery proportion when using Benjamini-Hochberg [104] to control false discovery rate at the 0.05 level, (ii) power to detect nonnull effects based on a 0.05 false discovery rate control threshold, and (iii) mean squared error of the estimates.
We wanted to make sure that the datasets generated from powsimR and seqgendiff were comparable, where b 3g is log 2 -fold change for gene g,ỹ g ∈ R N is the gth row ofỸ , and V (·) returns the empirical variance of a vector. When we looked at the median (over the nonnull genes) PVE across the datasets, the seqgendiff datasets and powsimR datasets had the same median PVE on average, though there was higher variability in the median PVE among the seqgendiff datasets (Supplementary Figure S5, Additional file 1). Boxplots of the false discovery proportion for each method in each dataset can be found in Figure 4. Both the powsimR and seqgendiff datasets indicate that only voom-limma can control false discovery rate adequately at the nominal level. However, the results based on the seqgendiff datasets indicate that there is a lot more variability in false discovery proportion than indicated by the powsimR datasets. In particular, it does not seem uncommon for seqgendiff to generate datasets with false discovery proportions well above the nominal rate. If a researcher were using only the theoretical datasets generated by powsimR, they would be overly confident in the methods' abilities to control false discovery proportion. Supplementary Figure S6 of Additional file 1 also indicates that methods generally have much more variable power between the seqgendiff datasets than between the powsimR datasets. Interestingly, the seqgendiff datasets indicate that methods tend to have smaller mean squared error than indicated by the powsimR datasets (Supplementary Figure S7, Additional file 1).
In Additional file 1, we also compared our simulation method to SimSeq [14] when evaluating differential expression analysis methods. We used the GTEx data [21] for both SimSeq and seqgendiff. SimSeq does not allow researchers to control the effect sizes of simulated non-null genes, as it depends on the presence of an available indicator variable that already exhibits differential expression in a real dataset. So we adjusted the effect sizes produced by seqgendiff to match those present in the GTEx data, and we found that the two data-based simulation methods behave similarly (Supplementary Figure S16, Additional file 1). It bodes well that seqgendiff produces similar results to other data-based approaches. The advantages, then, of seqgendiff over SimSeq are 1. seqgendiff can use effect sizes different than those that are already present in the observed indicator variable, while SimSeq cannot. 2. Because the effect sizes are unknown in the available indicator variable, SimSeq is unable to evaluate the estimation accuracy of effect sizes. seqgendiff can evaluate estimation accuracy. 3. seqgendiff can use more complicated designs than the finite-group model. SimSeq is limited to the finite-group model. 4. SimSeq cannot guarantee that genes that are intended to be differentially expressed in a simulated dataset are indeed differentially expressed. This depends on the quality of the available indicator variable. 5. As a minor advantage, seqgendiff is also much faster than SimSeq. On a 2.6 GHz quad-core PC running Linux with 32 GB of memory, seqgendiff took an average of 0.2 seconds to simulate a dataset, while SimSeq took an average of 51.1 seconds to simulate a dataset. A boxplot of simulation times is presented in Supplementary Figure S17 of Additional file 1. Fig. 3 Voom plots. Voom plots [26] visualizing the mean-variance trend in RNA-seq datasets. The voom plots are visually similar for the GTEx and seqgendiff datasets. The powsimR dataset has an uncharacteristic hook near the low counts in its voom plot

Evaluating factor analyses
As we hope we have made clear, there are many approaches to differential expression analysis ("Application: evaluating differential expression analysis" section), confounder adjustment ("Application: evaluating confounder adjustment" section), library size adjustment ("Application: evaluating effects of library size hetero geneity" section), and factor analysis ("Application: evaluating factor analysis" section). We believe it to be beyond the scope of this work to exhaustively evaluate all of these methods -especially since new methods are being developed each year. Rather, we hope our simulation procedures will be used by the research community to more realistically evaluate and benchmark their approaches to RNA-seq data analysis. However, as a final highlight to the utility of our simulation approaches, we demonstrate these simulation techniques in one application: evaluating factor analysis methods in RNA-seq ("Application: evaluating factor analysis" section). We have chosen to highlight this particular Fig. 4 False discovery proportion of various methods. Boxplots of false discovery proportion (FDP) (y-axis) for various differential expression analysis methods (x-axis) when applied on different simulated datasets (color). Benjamini-Hochberg was used to control for false discovery rate at the 0.05 level (horizontal dashed line). Only voom-limma controls false discovery rate at the nominal level. The FDP is more variable among the seqgendiff datasets than among the powsimR datasets application because it uses the more general simulation techniques beyond the two-group model, which were first demonstrated in [5].
We chose to focus on the following methods based on (i) previous use in expression studies, (ii) software availability, (iii) popularity, and (iv) ease of use. empirical Bayes matrix factorization approach proposed in [96], and 5. Probabilistic estimation of expression residuals (PEER) [54], a Bayesian factor analysis used in the popular PEER software to adjust for hidden confounders in gene expression studies.
All factor analysis methods were applied to the log 2counts after adding half a pseudo-count. To simulate RNA-seq data, we took the muscle GTEx data [21] and removed all genes with less than an average of 10 reads per sample. Each replicate, we added a rank-1 term. That is we assumed model (9) for the muscle GTEx data, then generated RNA-seq data such that where we simulated the components of x 3 and the nonzero components of b 3 from independent normal distributions. We varied the following parameters of the simulation study: This resulted in 24 unique simulation parameter settings. We also used 1000 genes each replication. For each setting, we ran 100 replications of generating data from model (19), and fitting the factors with the five methods under study assuming model (11) after we estimated the number of hidden factors using parallel analysis [105]. We chose three metrics to evaluate the performance of the different factor analysis methods: 1. The minimum mean squared error between x 3 and the columns ofF. To account for scale and sign unidentifiability, the estimated factors and the added factor were all scaled to have an 2 -norm of 1 prior to calculating the mean squared error. This measure is meant to evaluate if any of the estimated factors corresponds to the added factor. 2. The minimum mean squared error between b 3 and the columns ofL. We again accounted for scale and sign unidentifiability by calculating the mean squared error after scaling the estimated and true loadings to have an 2 -norm of 1.
3. The angle between x 3 and its projection onto the column space ofF. This measure is meant to evaluate if the estimated factor matrix includes x 3 among its unidentified factors.
The results are presented in Supplementary Figures S9-S14 of Additional file 1. Based on these figures, we have the following conclusions: 1. PEER performs very poorly when either the sparsity is high or when there are few samples. It also performs less well when the factors are correlated. A possible explanation is that PEER assumes a normal distribution on the factors and loadings, which is violated in the high-sparsity regime and is observed in the low-sparsity regime. Though, this does not explain its poor performance in small sample size settings. 2. SSVD estimates the loadings very poorly in low-sparsity regimes. This is to be as expected as SSVD assumes sparsity on the loadings. Surprisingly, though, it outperforms PCA in high sparsity regimes only when both the sample size and signal are also large. 3. ICA performs very poorly in low sparsity regimes. This is to be as expected as the normal distributions placed on the factors and loadings are a worst-case scenario for ICA. However, there is no scenario where ICA performs significantly better than PCA. 4. flash performs adequately in all scenarios and performs best in high-sparsity and high-signal regimes. 5. PCA performs adequately in most scenarios, and is only truly outperformed in high sparsity high signal regimes.
Based on these initial explorations, we would recommend users not use PEER, SSVD, or ICA and instead try either PCA or flash. In Section S2.2 of Additional file 1, we evaluate the above factor analyses using a single cell dataset from 10X Genomics [106]. The results indicate that PCA, SSVD, and flash perform comparably in all simulation settings, while PEER and ICA have worse performance in some simulation settings. Though the results were less clear than when using the GTEx data.

Discussion
We have focused on a log-linear model because of the large number of applications this generates ("Application: evaluating differential expression analysis" section, "Application: evaluating confounder adjustment" section, "Application: evaluating effects of library size heterogeneity" section, and "Application: evaluating factor anal ysis" section). This linearity (on the log-link scale) is represented by the structure of the Q matrix in Procedure 1. However, it is possible to replace Q by any arbitrary G × N matrix that has non-positive entries. This might be useful for simulations that study adjusting for nonlinear effects, such as bias due to GC content [107]. This also might be useful for evaluating non-linear dimensionality reduction techniques such as UMAP [108] and t-SNE [109], as this allows you to introduce non-linear effects into an RNA-seq dataset. However, these nonlinear effects would still be present only on the log-scale.
Our simulation procedures may be applicable beyond evaluating competing methods. Vieth et al. [19] used their simulation software to estimate power given the sample size in a differential expression analysis, and thus to develop sample size suggestions. Our simulation methods may be used similarly. Given a large RNA-seq dataset (such as the GTEx data used in this paper), one can repeatedly down-sample the number of individuals in the dataset and explore how sample size affects the power of a differential expression analysis.
Similarly, [18] already demonstrated that binomial thinning may be used for sequencing depth suggestions. That is, a researcher may repeatedly thin the libraries of the samples in a large RNA-seq dataset and explore the effects on power, thereby providing sequencing depth suggestions. Unlike [18], which does this subsampling uniformly over all counts, we allow researchers to explore the effects of heterogeneous subsampling (as in "Application: evaluating effects of library size heterogeneity" section). This might be useful if, say, researchers have more individuals in one group than in another and so wish to explore if they can sequence the larger group to a lower depth without affecting power.
In this manuscript, we have discussed our simulation techniques in the context of RNA-seq. However, our techniques would also be applicable to the comparative analysis of metagenomics methods [110]. Instead of quantifying gene expression, metagenomics quantifies gene abundances within metagenomes. Our simulation techniques could be applied in this context by taking a real metagenomics dataset and adding signal to it by binomial thinning.

Conclusions
We developed a procedure to add a known amount of signal to any real RNA-seq dataset. We only assume that this signal comes in the form of a generalized linear model with a log-link function from a very flexible distribution. We demonstrated how real data contain features that are not captured by simulated data, and that this can cause important differences in the results of a simulation study. We highlighted our simulation approach by comparing a few popular factor analysis methods. We found that PCA and flash had the most robust performances across a wide range of simulation settings.