 Methodology article
 Open access
 Published:
Robust differential expression analysis by learning discriminant boundary in multidimensional space of statistical attributes
BMC Bioinformatics volume 17, Article number: 541 (2016)
Abstract
Background
Performing statistical tests is an important step in analyzing genomewide datasets for detecting genomic features differentially expressed between conditions. Each type of statistical test has its own advantages in characterizing certain aspects of differences between population means and often assumes a relatively simple data distribution (e.g., Gaussian, Poisson, negative binomial, etc.), which may not be well met by the datasets of interest. Making insufficient distributional assumptions can lead to inferior results when dealing with complex differential expression patterns.
Results
We propose to capture differential expression information more comprehensively by integrating multiple test statistics, each of which has relatively limited capacity to summarize the observed differential expression information. This work addresses a general application scenario, in which users want to detect as many as DEFs while requiring the false discovery rate (FDR) to be lower than a cutoff. We treat each test statistic as a basic attribute, and model the detection of differentially expressed genomic features as learning a discriminant boundary in a multidimensional space of basic attributes. We mathematically formulated our goal as a constrained optimization problem aiming to maximize discoveries satisfying a userdefined FDR. An effective algorithm, DiscriminantCut, has been developed to solve an instantiation of this problem. Extensive comparisons of DiscriminantCut with 13 existing methods were carried out to demonstrate its robustness and effectiveness.
Conclusions
We have developed a novel machine learning methodology for robust differential expression analysis, which can be a new avenue to significantly advance research on largescale differential expression analysis.
Background
Highthroughput technologies, such as DNA microarray [1, 2] and RNAseq (RNA sequencing) [3], have made it possible to perform genomewide profiling of various genomic features, such as, genes, transcripts, exons, DNA modifications, and so on. These technologies have been widely adopted to detect genomic features (referred to as “features” from now on) that are differentially expressed between different conditions (e.g., phenotypes, treatments, etc.). When analyzing genomewide datasets to detect differentially expressed features (DEFs), it is important to control the overall false positive rate because thousands of hypotheses are tested simultaneously. Controlling the false discovery rate (FDR – the expected proportion of false positives among all features called significant) was first introduced by Benjamini and Hochberg [4] to largescale testing problems and has been broadly applied in detecting DEFs since then. The BenjaminiHochberg (BH) approach takes the pvalues of all hypothesis tests and uses a sequential method to estimate the rejection region (i.e., pvalue threshold). More recently, researchers formulated FDR estimation in a Bayesian fashion [5–7], which assumes the distribution of the statistic as a density mixed of nulls and alternatives. The Bayesian approaches can be implemented nonparametrically using the test statistics directly rather than their pvalues. The calculation of test statistics (and their pvalues) can be deemed as a mapping from the original highdimensional observations to a single index value per feature. The ordinary ttest [8] was one of the most popular mappings for detecting differential expressions measured by DNA microarrays. The ttest assumes normality in the target data and can be prone to outliers. In addition, its variance estimation is feature specific and is impacted by great variability when only few samples/replicates are available in DNA microarray experiments. To deal with this problem, various versions of moderated tstatistics [6, 9–16] were developed to utilize information across features for regularizing variance estimation.
Statistics based on the Poisson distribution [17, 18] or the negative binomial (NB) distribution [19–22] were later proposed specifically for detecting DEFs using RNAseq data. Different from typical DNA microarray approaches that rely on hybridization to measure the expression levels of features as continuous values, RNAseq approaches use deep sequencing to produce millions of short reads corresponding to those features. The reads are then mapped onto a reference genome, which makes Poisson a natural representation of read counts. It was shown that the Poisson distribution was able to effectively characterize technical replicates in RNAseq experiments [17]. However, the Poisson distribution forced the mean and variance to be the same and predicts a smaller variance than what was observed in biological replicates [23]. To deal with this socalled overdispersion problem, PoissonSeq [24] applied a power transformation to make the data distribution look more like Poisson. Auer [25] proposed a twostage Poisson model (TSPM) to handle features with significant overdispersion evidence by a quasilikelihood approach [26]. In the meantime, the NB distribution was proposed as an alternative [23] and has been gaining momentum in analyzing RNAseq data. Compared to the Poisson distribution, the NB distribution allows the modeling of a more general meanvariance relation by taking another dispersion parameter. Several NBbased approaches, such as DESeq [20], DESeq2 [27], edgeR [22], NBPSeq [28], EBSeq [29], baySeq [30], ShrinkSeq [31], and so on, have been developed, and they mainly differ in their ways of modeling and estimating the dispersion parameter.
Recently, it was demonstrated that the moderated tstatistic, when combined with appropriate data preprocessing methods, could be powerful for detecting DEFs using RNAseq data. For example, voom [32] extended limma [11], which uses the moderated tstatistic in a pipeline wellestablished for analyzing DNA microarray data, for differential expression analysis using RNAseq data. Voom applies a logarithmic transforms to readcounts normalized by the corresponding library size, estimates the meanvariance relationship nonparametrically from the transformed data, uses the estimated relationship to generate a precision weight for each normalized observation, and finally enters them into the limma empirical Bayes analysis pipeline for detecting DEFs. In another example, vst/limma [33] applied the variancestabilizing transformation (vst) of DESeq to RNAseq data before using limma to calculate the moderated tstatistic.
The above test statistics can be viewed as attributes extracted from data to characterize the observed differential expression patterns. Most existing attribute extraction methods make specific assumptions about data distributions (e.g., Gaussian, Poisson, or NB), and then calculate a statistic (i.e., an attribute in our words) for each feature. Although those test statistics are efficient in preserving differential expression information up to certain levels, they leave plenty of room for further improvements. In real applications, the profiles of individual features in the same dataset can be governed by complex distributions, and hence may not be well represented by the assumed distribution [34]. We made a similar observation that the distributions could indeed be far more complex than those often assumed (see Figs. 2 and 11 in the Results section for examples). Individual attributes based on relatively simple distribution assumptions will have limited capacity in characterizing complex differential expression patterns, and hence can greatly affect DEF detection results. In theory, we can explicitly make every differential expression test follow a common family of distributions by designing a complex distribution form (e.g., mixture of simple distributions) to approximate all complex distributions in data. Such a complex distribution will have unknown parameters that can be estimated from data by applying the same procedure to all features. However, it can be challenging to design not only a statistic for testing differential expressions based on such a complex distribution but also a parametric DEF detection approach that uses this test statistic.
There are nonparametric approaches that do not assume data distribution, such as, SAMSeq [34] and NOISeq [35]. SAMseq utilizes the ranksum test statistic [36] to characterize differential expressions and uses resampling to adjust for different library sizes. Although the ranksum test does not assume any data distribution and is less likely to be affected by outliers, it can sometimes be considerably less capable of preserving information. NOISeq uses two simple attributes (log foldchanges and absolute expression differences), and estimates the null as the joint distribution of these two attributes from replicates (or replicates simulated from an empirically determined multinomial distribution), which is then used to calculate the odds of an observed statistic pair indicating differential expression. Nevertheless, NOISeq does not directly estimate FDR. In addition, log foldchanges and absolute expression differences can be prone to outliers and are not powerful enough for characterizing complex differential expression patterns. However, NOISeq motivated us to investigate better ways for integrating multiple attributes to detect DEFs while controlling the FDR.
In this paper, we call the above attributes “basic” because of their relatively simple forms and limited capacity in preserving differential expression information. Most of the existing DEF detection methods rely on one single basic attribute in each analysis run, which can greatly restrict their detection power. Since different basic attributes may capture distinct aspects of differential expression patterns, we anticipate that DEFs can be better differentiated from nonDEFs using multiple basic attributes, which may be extracted from data using existing tools, such as, DESeq2, voom, limma, and so on. This work addresses a general application scenario, in which users set a target FDR and ask a method to detect as many DEFs as possible. This can be formulated as a constrained optimization problem that tries to learn an optimal decision boundary in a space of multiple basic attributes to differentiate DEFs from nonDEFs. An algorithm DiscriminantCut has been developed to explore the linear decision boundary family. Extensive tests were conducted to test DiscriminantCut and compare it with several popular DEF detection methods. The results demonstrate that it is significantly advantageous to combine multiple basic attributes in detecting DEFs.
Methods
DEF detection as learning multidimensional decision boundary
Let G = {g _{ ij }}_{ i = 1 … M,j = 1 … N } contain the values of M features in N samples, in which g _{ ij } is the value of the ith feature in jth sample. Without loss of generality, we assume that samples are randomly selected from a population with two different conditions. Let Y = {y _{ j }}_{ j = 1,…,N }, where y _{ j } be the binary condition label of the jth sample. The goal is to detect features that are differentially expressed between these two conditions. We propose to treat DEF detection as finding a discriminant function h(∙) that specifies the decision boundary between DEFs and nonDEFs. Let d _{ i } = h(g _{ i1}, g _{ i2}, …, g _{ iN }; y _{1}, y _{2}, …, y _{ N }) be the discriminant value of the ith feature. The ith feature is called a DEF if d _{ i } > 0. The unknown parameters of h(∙) should be learned from X = <G, Y>. It can be challenging to design a proper h(∙) in a topdown way and learn such a function. To circumvent this problem, we can take advantage of previous research achievements in designing and calculating various statistics for testing differential expression (e.g., tstatistic, moderated tstatistic, ranksum statistic, Wald statistic for NBbased differential expression tests, etc.). We let h(g _{ i1}, g _{ i2}, …, g _{ iN }; y _{1}, y _{2}, …, y _{ N }) ≜ f(s ^{(1)}_{ i } , s ^{(2)}_{ i } , …, s ^{(K)}_{ i } ) where s ^{(1)}_{ i } , s ^{(2)}_{ i } , …, s ^{(K)}_{ i } are K different basic attributes (i.e., test statistics) of the ith feature. This design can be considered as a twolayer data summarization mapping with calculating the basic attributes as the first layer and f(∙) as the second layer. The function f(∙) should be much less complex than h(∙), and its unknown parameters can be estimated from X more easily. Our approach can be geometrically interpreted as treating each feature as a point in the multidimensional space of those K basic attributes, and learning f(∙) from a given dataset to specify a decision boundary between DEFs and nonDEFs in that space. Each basic attribute provides a certain pointofview about being differentially expressed, which is then integrated by f(∙) to produce a more comprehensive view. We leave the detailed specification of f(∙) to implementation and focus on explaining the idea for now. It will be shown later in our experiments that simple instantiations of f(∙), such as linear functions, can deliver superior performance.
As simple as it sounds, it is in fact quite significant and innovative to explicitly model DEF detection as learning a decision boundary in a multidimensional space. Conventional DEF detection approaches use topdown approaches to design single attributes to characterize differential expression information, and then find decision points in onedimensional spaces. To accurately deal with complex differential expression patterns in the traditional way, we need to design a complex data distribution and a corresponding statistic for testing differential expression, which can be challenging and often requires performing less tractable computations. Our approach is much more simple and practical, and offers a straightforward geometrical interpretation. Our novel formulation of DEF detection opens up a new avenue to advance DEF detection research by incorporating decision boundary modeling and learning techniques developed in Machine Learning community. Learning f(∙) from X is an unsupervised task because no feature is labeled as DEF or nonDEF in X. As far as we know, this kind of unsupervised learning problem (i.e., maximizing discoveries under a FDR constraint) has not captured major attentions in Machine Learning research.
Maximizing DEF detection by constrained optimization
Let \( \mathbb{D}\left(\boldsymbol{X},f\right)={\left\{{d}_i=f\left({s}_i^{(1)},{s}_i^{(2)},\dots, {s}_i^{(K)}\right)\ \right\}}_{i=1\dots M} \) be the discriminant value set including the discriminant values of all M features in X. We want to learn f(∙) from data so that the number of the detected DEFs is maximized while the FDR is under controlled by a userdefined threshold Ψ. Given a dataset X and a fixed discriminant function f(∙), the DEF set is indicated as
Let FDR(X, f) denote the corresponding FDR of Γ(X, f). This problem of learning f(∙) from X to maximize the size of Γ(X, f) subject to the FDR constraint can be mathematically written as:
Our approach is different from the optimal discovery procedure (ODP) [37] that tries to optimally capture common differential expression patterns shared among detected DEFs by rigorously exploring the relevant information across features to rank their significance of being differentially expressed. The current setup of ODP only allows one kind of hypothesis test for all features in each analysis run. Our approach tries to capture differential expression information of individual features as much as possible by scrutinizing their expression profiles from multiple “view angles” (i.e., using multiple basic attributes). We aim to maximize the number of detected DEFs at a given FDR level. It is possible that different numbers of DEFs can have the same FDR. It can be beneficial to treat up and downregulation asymmetrically (i.e., using different discriminant functions) because the induced and suppressed features may exhibit different up and downtail characteristics in the joint distribution of basic attributes (Fig. 1). Equation (2) and the following derivations are general and can be applied to detect both up and downregulated features. Before we introduce the algorithm to find the parameters of f(∙) by trying to solve Eq. (2), we explain how to estimate FDR(X, f) in the following.
FDR estimation
In practice, FDR(X, f) in Eq. (2) is unknown. To estimate the FDR of an arbitrary f(∙), we implemented the Storey framework [5] in a nonparametric fashion [10], which we briefly explain below for completeness. Let the NULL hypothesis of a feature be that it is not a DEF. Assuming there are M independent features. Table 1 lists the possible results when simultaneously testing M features for calling DEFs using f(∙), among which R(f) is an observable variable indicating the number of DEFs detected by f(∙) and V(f) is a hidden variable indicating the number of false DEFs detected by f(∙). Let D _{ f } be the variable representing discriminant value calculated by f(∙). We can write down the FDR according to [38] as a function of f(∙):
Equation (3) can be rewritten using the Bayes rule as the following:
Equation (4) utilizes the fact that P(D _{ f } > 0NULL, R(f) = 0) = 0 because no hypothesis is rejected when R(f) = 0. Below we explain nonparametric methods for estimating P(D _{ f } > 0NULL), P(NULL), and P(D _{ f } > 0R(f) > 0) given a dataset X and a fixed f(∙).
Estimate P(D _{ f } > 0NULL). The term P(D _{ f } > 0NULL) is the probability of D _{ f } > 0 when NULL is true. The distribution of D _{ f } under NULL condition, depending on both the distributions of basic attributes and f(∙), can be extremely complex. Hence it may not feasible to determine this term in an analytical form. We therefore estimate P(D _{ f } > 0NULL) by adopting the nonparametric method developed in [10], which allows us to better explore the structure of data distribution in a datadependent manner. This method randomly permutes the original dataset B times to generate the null control, and estimates P(D _{ f } > 0NULL) as:
where d ^{*}_{ i,b } is the discriminant value of the ith feature in the bth (1 ≤ b ≤ B) permutation X ^{*}_{ b } . The function \( {\widehat{\mathbb{E}}}_b\left(\cdot \right) \) uses all B permutated datasets to estimate the expected number of nonDEFs that are incorrectly classified as DEFs. A reasonable choice of \( {\widehat{\mathbb{E}}}_b\left(\cdot \right) \) is the median/mean function.
Estimate P(NULL). It is expected that P(NULL) ⋅ M features are nonDEFs (i.e., true NULL hypotheses). Below we use the pvalue concept to explain how to estimate P(NULL) from data although we do not need to estimate pvalues. Assuming that all features are independent, the pvalues of the discriminant values of these P(NULL) ⋅ M features should be uniformly distributed between 0 and 1. Therefore, for some chosen pvalue cutoff λ ∈ (0, 1), we should expect that there are (1 − λ) ⋅ P(NULL) ⋅ M nonDEFs whose pvalues are greater than λ. Let d _{ λ } denote the discriminant value whose pvalue is λ. Since it is possible for some true DEFs to have pvalues greater than λ, it is expected that \( \left(1\lambda \right)\cdot P(NULL)\cdot M\le \left\left\{{d}_i\Big{d}_i\le {d}_{\lambda },{d}_i\in \mathbb{D}\left(\boldsymbol{X},f\right)\right\}\right \) when M is large enough and λ is wellchosen. In practice, d _{ λ } can be estimated as the value smaller than λ percentile of elements in the permutation set \( {\left\{\mathbb{D}\left({\boldsymbol{X}}_b^{*},f\right)\right\}}_{b=1\dots B} \). We can hence have a conservative estimation of P(NULL) as:
We conservatively set λ = 50% and truncate \( \widehat{\mathrm{P}}(NULL) \) at 1 because a probability should never exceed 1.
Estimate P(D _{ f } > 0R(f) > 0). The probability P(D _{ f } > 0R(f) > 0) can be naturally estimated as:
where \( r\left(\boldsymbol{X},f\right)=\left\left\{{d}_i\Big{d}_i>0,{d}_i\in \mathbb{D}\left(\boldsymbol{X},f\right)\right\}\right \) is an observed value of the variable R(f) given the dataset X, and r(X, f) ∨ 1 = r(X, f) if r(X, f) > 0, otherwise 1. The term r(X, f) ∨ 1 prevents the estimated FDR from being undefined due to having 0 as the denominator. Plugging Eqs. (5–7) into Eq. (4), we have the estimated FDR as:
If the number of permutation is large enough, r(X, f) ∨ 1 will effectively set the estimated FDR as 0 when r(X, f) = 0 because, on expectation, the discriminant values of the permuted data are less significant than those of the original data. Thus we have \( {\widehat{\mathbb{E}}}_b\left(\left\left\{{d}_{i,b}^{*}>0\Big{d}_{i,b}^{*}\in \mathbb{D}\left({\boldsymbol{X}}_b^{*},f\right)\right\}\right\right)/M\le \left\left\{\left.{d}_i\right{d}_i>0,{d}_i\in \mathbb{D}\left(\boldsymbol{X},f\right)\right\}\right/M=r\left(\boldsymbol{X},f\right)/M=0 \), which makes \( {\widehat{\mathrm{FDR}}}_{\lambda}\left(\boldsymbol{X},f\right)=0 \).
DiscriminantCut algorithm
As a simple start to implement Eq. (2), we chose the discriminant function f(∙) from the linear function family \( f\left({s}_1,\dots, {s}_K\right)={\displaystyle \sum_{i=1}^K}{w}_i{s}_i\tau \), subject to \( \left{\displaystyle \sum_{i=1}^K}{w}_i\right=1 \), where {w _{ i }} and τ are the unknown parameters of f(∙) to be learned from X. We further require w _{ i } ≥ 0 when detecting upregulated DEFs and w _{ i } ≤ 0 when detecting downregulated DEFs, which effectively make \( \left{\displaystyle \sum_{i=1}^K}{w}_i\right={\displaystyle \sum_{i=1}^K}\left{w}_i\right=1 \) a L_{1} regularization that tends to yield sparse models. A simple algorithm, DiscriminantCut (DC), was designed and implemented to search for the “ideal” f*(⋅). DC performs an exhaustive search at an empirically decided resolution (Additional file 1: Algorithm S1). The algorithm first populates a set of {w _{ i }} candidates, and for each of them, tunes τ to detect as many DEFs as possible while keeping the estimated FDRs under controlled by a userdesired threshold Ψ. Since both finding f*(⋅) and estimating FDR using the same permutation set, it is possible that the final estimated FDR is biased. To address this, we referred the idea in [39]. After choosing f*(⋅), we calibrate its cutoff τ using another large independent permutation set, and then apply the recalibrated f*(⋅) to identify DEFs. The efficiency of the search was greatly improved by sorting intermediate results to facilitate quick search, binary search, and avoiding unnecessary exploration (details in Additional file 1: Section 1.1). The algorithm runs fast in practice. In our experiments, most of the runtime was spent on computing basic attributes, and the remaining computations took almost negligible time.
There are approaches for linearly combining multiple attributes (or statistics) from either dependent or independent datasets [39–41] (and the references therein). Some of them mainly explore the covariance between attributes. Some aim to minimize the pvalues of individual features by allowing each feature to has its own combination setting. Our approach does not make any assumption about the joint distribution of the attributes. We try to maximally explore differential expression information in one dataset, and force all features to share the same {w _{ i }}. In addition, our objective function explicitly models the overall goal – maximize detections constrained by a target FDR. In the future, it may worth exploring how minimizing the pvalues of individual features can benefit our goal.
Results
RNAseq simulation test
We firstly carried out a series simulation tests, in which the ground truths were known to ensure proper comparison, to assess the advantages of combining multiple basic attributes by DC. We let DC use up to three representative basic attributes: (1) s ^{T} – the moderated tstatistic from voom, (2) s ^{R} – the corrected ranksum statistic from SAM (this is different from SAMseq’s ranksum statistic that is adjusted for different library sizes by resampling), and (3) s ^{NB} – the Wald statistic for NBbased differential expression test from DESeq2. This produced seven DC configurations: DC^{T} (DC using s ^{T}), DC^{R} (DC using s ^{R}), DC^{NB} (DC using s ^{NB}), DC^{T+R} (DC using s ^{T} and s ^{R}), DC^{R+NB} (DC using s ^{R} and s ^{NB}), DC^{T+NB} (DC using s ^{T} and s ^{NB}), and DC^{T+R+NB} (DC using s ^{T}, s ^{R} and s ^{NB}). We also compared DC with 13 other RNAseq differential expression analysis methods including baySeq, DESeq, EBSeq, edgeR, NBPSeq, SAMseq, ShrinkSeq, TSPM, voom, vst/limma, PoissonSeq, DESeq2, and ODP.
Simulation design
To make the simulation tests as realistic as possible, we simulated the test datasets based on a real RNAseq dataset – the Montgomery dataset (downloaded from http://bioinf.wehi.edu.au/PublicDatasets/ as of Apr.15^{th}, 2015) [42], which contains the transcriptome of 25,702 genes in 60 extended HapMap individuals of European descent. Large number of samples in this dataset allows us to reveal that the distributions in real datasets can be indeed much more complex than often assumed. Nevertheless, the number of replicates in each simulated dataset is much smaller and is within the range of common practice. We first removed genes with extremely low expression profiles (read counts below 10 in more than half of the replicates). For each of the remaining 11,573 genes, we decided whether its read counts could be better modeled by a NB distribution or a Gaussian mixture model (GMM) in the following way. The NB and GMM distributions were estimated by using DESeq2 implemented in R and the statistics toolbox of MATLAB R2013a, respectively. The most proper number of components in a GMM was decided based on the Bayesian Information Criterion. The GMMs of ~44, ~50, and ~6% genes contained 1, 2, and 3 components, respectively. Figure 2a–c show a few typical examples. Then, for each gene, we calculated the correlation between the histograms of its readcounts and the corresponding fitted NB/GMM to decide which distribution was a better fit. The GMMs were truncated at zero because read counts should be nonnegative. The distributions of about 63.5 and 36.5% of genes can be better represented by GMM and NB (Fig. 2d), respectively. A simple experiment presented in Fig. 2d caption validates that the distributions of many genes in this dataset are more complex than what assumed conventionally (e.g., NB or Gaussian). Our choice of examining correlation between the histogram of data and its fits was based on two considerations: (a) histogram is commonly used in practice to approximate distributions, and (b) correlation is a widely adopted distance metric. This method is mainly used to show that features have complex patterns of distributions rather than as a rigorous model selection method for determining the exact ratio of GMM to NB, such as the one (63.5 vs. 36.5%) shown above. We consider it sufficient for choosing distributions, which roughly approximate the original ones, for generating data in the following simulation test.
In each simulation test run for comparing the chosen RNAseq differential expression analysis methods, we simulated N readcounts for every gene using the distribution (either NB or GMM) decided to be better in the above way, and randomly divided the simulated readcounts into two equalsize groups to obtain true nonDEFs. The simulation of a gene was repeated until its logarithmic foldchange was not larger than 4.5σ _{ N }, where σ _{ N } is the standard deviation of the logarithmic foldchange between two Nsample groups randomly chosen from the Montgomery dataset. The 4.5σ _{ N } foldchange threshold was chosen because we observed in the Montgomery dataset that the expected number of foldchanges higher than 4.5σ _{ N } is below 0.05. Then we randomly made G ^{a}_{ b } genes (a and b are the numbers of up and downregulated genes, respectively) as true DEFs in the following way. For each of the chosen genes, we multiplied or divided one of its groups by a factor uniformly sampled between 1.5 and 3.0 to provide a reasonable wide range of differences in expression. Finally, all simulated values were rounded to their nearest integers.
A series of simulation test runs were conducted under 20 different settings: 5 different sample sizes (N = 8 [4 vs. 4], 10 [5 vs. 5], 12 [6 vs. 6], 16 [8 vs. 8], and 20 [10 vs. 10]) × 4 different true DEF configurations (G ^{400}_{400} , G ^{500}_{500} , G ^{600}_{600} , and G ^{1000}_{0} ). At each of the 20 simulation settings, we ran the test 100 times and recorded the results. Our comparisons focus on two key performance factors: (1) the effectiveness of FDR control, namely whether the real FDR is effectively bounded by the target FDR; and (2) the detection power, namely the ability to detect as many true DEFs as possible without violating (1).
Integrating multiple basic attributes helps substantially
Comparing the results of different DC configurations shows that the advantage of integrating multiple basic attributes in detecting DEFs is significant. Figure 3 shows that DC^{T+R+NB} consistently outperformed the three singleattribute DC configurations under all 20 simulation test settings (5 sample sizes × 4 DEF configurations), and singleattribute DC methods (DC^{T}, DC^{R}, DC^{NB}) significantly underperformed the multiattribute ones. Here we use the results of a typical simulation test setting (6 vs. 6 and G ^{500}_{500} ) as an example. Even though some individual attributes alone may be inferior to other attributes in detecting DEFs, they can indeed provide substantial enhancements to other attributes. For example, in Table 2, DC^{R} detected no DEFs at FDR < 0.01 or FDR < 0.05. Adding s ^{R} to s ^{NB} significantly improved the results by 16.35% (paired ttest pvalue = 8.87e30) at FDR < 0.01 and by 9.62% (paired ttest pvalue = 2.32e35) at FDR < 0.05. Results across different sample sizes (Table 3) confirm the advantages of integrating multiple basic attributes. Grouping the DEFs detected by DC^{T}, DC^{NB}, and DC^{T+R+NB} accordingly to their distribution categories (Table 4), we observe that integrating multiple basic attributes helps to detect DEFs across the whole distribution spectrum. Interestingly, DC^{T} on average detected more DEFs governed by NB distributions than DC^{NB}, which to some extent resonates with the idea of voom, i.e., it is sometimes more important to model the meanvariance relationship correctly than to design the exact distribution of readcounts.
No single basic attribute dominates
We also observed that none of the basic attributes consistently performed better than other basic attributes in our simulation tests, which resonates the idea of utilizing multiple attributes. For example in Table 3, under the simulation test setting 10 vs. 10 and G ^{500}_{500} , DC^{T} on average detected more true DEFs than DC^{NB} (370.79 vs. 359.01) at FDR < 0.01, but performed worse than DC^{NB} (527.27 vs. 547.70) at FDR < 0.05. Moreover, at FDR < 0.05, DC^{T} outperformed DC^{NB} on datasets when the sample size was relatively small (e.g., 4 vs. 4, 5 vs. 5 and 6 vs. 6) while DC^{NB} outperformed DC^{T} when the sample size was larger (e.g., 10 vs. 10). Interestingly, although DC^{R} underperformed DC^{T} under most test settings, DC^{R} outperformed DC^{T} under the setting of 10 vs. 10, G ^{1000}_{0} and target FDR < 0.05 (true DEFs: 555.63 by DC^{R} vs. 542.29 by DC^{T}).
Compare DC^{T+R+NB} with other DEF detection approaches
We compared DC^{T+R+NB} and 13 other RNAseq differential expression analysis methods including baySeq, DESeq, EBSeq, edgeR, NBPSeq, SAMseq, ShrinkSeq, TSPM, voom, vst/limma, PoissonSeq, DESeq2, and ODP. Figure 4 shows the average numbers of the detected true DEFs at two typical target FDR levels (0.01 and 0.05) under a typical simulation test setting 6 vs. 6 and G ^{500}_{500} (the results of the rest test settings are provided in Additional file 1: Figures S1–S20). Among those able to effectively control the FDR, DC^{T+R+NB} in general performed the best. At target FDR < 0.01, DC^{T+R+NB} on average detected 99.44 true DEFs, which is significantly better (paired ttest pvalue = 1.79e66) than the 31.59 true DEFs detected by the best nonDC method (vst/limma). At target FDR < 0.05, DC^{T+R+NB} detected 266.22 true DEFs, which is significantly better (paired ttest pvalue = 1.32e71) than the 204.59 true DEFs detected by the best nonDC method (vst/limma). Figure 5 compares the average number of true positives detected by different approaches at different target FDR cutoffs (from 0.01 to 0.1 with a step of 0.01) under a typical simulation test setting of 6 vs. 6 and G ^{500}_{500} (the results of other test settings are provided in Additional file 1: Figures S21–40). Figure 5 and Additional file 1: Figures S21–40 show that DC in general performed the best among those effectively controlled FDR.
In some application scenarios other than ours, users may want to choose a fixed number of top DEFs. To serve this purpose, Fig. 6 compares the results using FDC (false discovery curve: true FDR vs number of detected DEFs). The FDCs of other test settings are provided in Additional file 1: Figures S41–60. Figure 6 and Additional file 1: Figures S41–60 show that DC is among the best performers including voom, vst/limma, DESeq2, edgeR, and ShrinkSeq. Here we do not show ROC (false positive rate vs. true positive rate), which is also popular for evaluating machine learning techniques and statistical analysis methods, because FDC and ROC deliver the same information from different viewpoints. Since true FDR can be estimated but usually unknown, FDC and ROC should be used with caution in our application scenario because they do not consider whether a method is able to estimate FDR well. FDC and ROC only depend on the ranks of features’ significance scores regardless of their actual values. Therefore, it is possible that two DEF detection methods can produce the same ROC/FDC although they have quite different capabilities in estimating FDR. Imagining there are two DEF detection methods. The 1^{st} method is biased towards high pvalues (i.e., it tends to generate very high pvalues for all features) because it imposes some assumptions. Calling one single significant feature using the 1^{st} method will lead to an extraordinarily high estimated FDR. On the contrary, the 2^{nd} method is biased towards small pvalues (i.e., it tends to generate very low pvalues for all features) because it imposes other assumptions. Given a target FDR, the 2^{nd} method will dramatically underestimate its true FDR and call too many false positives. Nevertheless, if the features are ranked in the same order by both methods, they will produce exactly the same ROC/FDC.
Effects of sample size and DEF configuration
Figure 7 summarizes the effects of “sample sizes + DEF configurations” on DEF detection results at target FDR < 0.05. The result of a method under a particular setting is included if its true FDR does not exceed the target FDR by 10% and it detects on average at least 0.5 true DEFs (rounds up to 1). Under most settings, DC^{T+R+NB} was able to effectively control FDR and detect more DEFs. However, when the sample size is small (4 vs. 4), the average true FDRs of DC^{T+R+NB} were 0.053 and 0.058 for G ^{400}_{400} and G ^{500}_{500} , respectively; and ODP was the only method able to detect true DEFs (1.11 under G ^{400}_{400} and 2.57 under G ^{500}_{500} ) while meeting the FDR target. When the sample size was decreased, all methods detected less DEFs, and it was more difficult to control the FDR, especially when a more stringent target FDR was imposed. For example, when N = 8 (4 vs. 4), G ^{500}_{500} , and the target FDR < 0.01, DC^{T+R+NB} on average detected less than 20 DEFs, and one single false positive alone would increase its true FDR by 0.05, which is much higher than the target FDR. Smaller sample sizes (2 vs. 2 and 3 vs. 3) were also tested. However, no method was able to control FDR well (i.e., their true FDRs > 110% × the target FDR) or detect at least 0.5 DEFs on average. Hence, the results of 2 vs. 2 and 3 vs. 3 are not shown in Fig. 7. This indicates that it remains challenging to detect DEFs governed by complex distributions when the sample size is small.
Evaluation using the SEQC/MAQCIII dataset
The US Food and Drug Administration has coordinated a largescale community effort, the Sequencing Quality Control project (SEQC/MAQCIII), to assess the performance of RNAseq across laboratories and to test different sequencing platforms and data analysis pipelines [43]. The consortium has generated a RNAseq datasets (Gene Expression Omnibus accession code: GSE47792) from two reference RNA samples, the Strategene Universal Human Reference RNA (sample A) and the Ambion Human Brain Reference RNA (sample B). This dataset contains two reference feature subsets: (1) 92 synthetic RNAs from the External RNA Control Consortium (i.e., ERCC spikeins) with four different sample A/sample B ratios (1/2, 2/3, 1 and 4); and (2) ~1000 genes whose sample A/sample B foldchanges were validated using TaqMan qRTPCR [44]. In the following comparison, we used log2 expression change threshold of 0.5 to select true DEFs from the ~1000 TaqMan qRTPCR validated genes, and obtained 693 genes denoted as the positive TaqMan genes below. However, due to the extreme difference between samples A and B [45], the positive TaqMan genes only represent a small fraction of those differentially expressed between samples A and B. If we let different DEF detection methods compare the replicates of samples A and those of samples B, their results on the positive ERCC spikeins and the positive TaqMan genes cannot accurately reflect their overall performances. In addition, all DEF detection methods will detect too many DEFs that dwarf the differences between their detection results. Thus we designed the following procedure to make the positive ERCC spikeins and the positive TaqMan genes together as a proper reference feature set for evaluating DEF detection methods.
We focused on the SEQC/MAQCIII RNAseq subset sequenced at the Australian Genome Research Facility using the Illumina HiSeq200, in which each RNA sample has 64 technical replicates (4 libraries per sample and 8lanes of 2flow cells per library). First, the lowcount genes (5+ reads in less than 10 replicates) were removed. After this step, 14 negative ERCC spikeins (ratio = 1) and 45 positive ERCC spikeins (ratio = 1/2, 2/3 or 4) were retained. Then we used the stateoftheart RNAseq normalization tool, RUVSeq [45], to normalize all 128 replicates using the negative ERCC spikeins and 1000 least differentially expressed genes (ranked by edgeR pvalues) as the in silico empirical negative control genes. In particularly, we used RUVg (Remove Unwanted Variation Using Control Genes) and followed the practice of RUVg authors described in the online methods of [45] by dropping the first unwanted factor and retained the next 6 factors. After normalizing the replicates, we randomly chose 12 replicates from one library from the sample A and divide them into two equalsize groups to form the base of nonDEFs (the results using different number of replicates are provided in Additional file 1: Figures S61–72). Occasionally we obtained two very distinct groups because the above normalization procedure could not get rid of all unwanted variations. To avoid this problem, we applied PoissonSeq to calculate the pvalues of the true nonDEFs being differentially expressed between the chosen groups, and redid grouping if the pvalue distribution of the true nonDEFs was not closed to uniform between 0 and 1. PoissonSeq was used because the Poisson distribution was reported to be effective for modelling technical replicates [17]. Finally, we replaced the values of the positive ERCC spikeins and the positive TaqMan genes in one of the chosen groups by their values in 6 randomly selected replicates of sample B. This arrangement should make the positive TaqMan genes as the true DEFs and the remaining genes as the true nonDEFs.
The data obtained above was then used to benchmark different DEF detection methods. We repeated the above procedure 100 times. The results are summarized in Fig. 8. All DC configurations and most nonDC methods were able to effectively control the FDR at both target FDR levels (0.01 and 0.05). Among those able to effectively control the FDR, DC^{T+R+NB} was the most powerful. At target FDR < 0.01, DC^{T+R+NB} on average detected 565.10 true DEFs, which is significantly better (paired ttest pvalue = 1.58e25) than the 554.85 true DEFs detected by the best nonDC method (SAMSeq). At target FDR < 0.05, DC^{T+R+NB} on average detected 577.15 true DEFs, which is significantly better (paired ttest pvalue = 5.58e27) than the 569.63 true DEFs detected by the best nonDC method (SAMSeq). The leads of DC^{T+R+NB} over nonDC methods are not as large as those in the simulation test because we used technical replicates in this experiment. Figures 9 and 10 compare the curves of “the true positives vs. the target FDR” and FDCs, respectively. The supreme performance of DC^{T+R+NB} can be explained by Fig. 11, which shows that the normalizedcount distributions of some positive TaqMan genes are complex even within the chosen technical replicate subset.
Analyze a methylation dataset
To demonstrate the general applicability of DC, we applied it to analyze a DNA methylation dataset generated by Aldinger et al. [46] using the Illumina HumanMethylation27 BeadChip, which can be downloaded as GSE34099 from GEO. This data set contains global DNA methylation of 18 Rett syndrome samples and 19 control samples. Since the nature of this dataset is quite different from typical RNAseq count data, we did not include methods developed specifically for RNAseq in this comparison. Instead, we focused on the applicability of DC and assessing the benefits of using more than one attributes. We selected five basic statistics and let DC use two of them in each run: (1) s ^{T.SAM} – the corrected tstatistic [10, 47]; (2) s ^{T.log.SAM} – the corrected tstatistic with logarithmic transformation; (3) s ^{R.SAM} – the corrected ranksum statistic [10, 47]; (4) s ^{T.voom} – the moderated tstatistic produced by voom; and (5) s ^{T.limma} – the moderated tstatistic produced by limma. The original data values were multiplied by 1000 and then rounded to the nearest integer if an attribute extraction package only accepts integer inputs. The NBbased basic attributes (e.g., the Wald statistic for the NBbased differential expression test by DESeq2) were not used because the distributions of DNA methylation features in this dataset are quite different from the NB distribution.
The results at FDR < 0.01 (Table 5) show that combining two basic attributes is significantly advantageous over utilizing single ones. For example, DC^{T.limma+T.voom} detected 63 DEFs, which is much higher than the 35 and 40 DEFs detected by DC^{T.limma} and DC^{T.voom}, respectively. This result is interesting because both s ^{T.voom} and s ^{T.limma} are moderated tstatistics and voom utilizes limma to calculate its test statistics after applying a logcount per million transformation to the original data. Nevertheless, the integration of s ^{T.voom} and s ^{T.limma} by DC can achieve significantly higher detection power than using one of them. The main reason underlying this observation is visualized in Fig. 12: The joint distribution of s ^{T.limma} and s ^{T.voom} are quite asymmetric and nonGaussian. It is more advantageous to use s ^{T.voom} and s ^{T.limma} to detect DEFs in the up and downregulated regions, respectively. Their advantages can be integrated by DC that rigorously explores the structures in the joint distribution of s ^{T.voom} and s ^{T.limma} to achieve better DEF detection results.
Discussions
Conventional methods for differential expression analysis often use individual basic attributes (e.g., foldchange, ranksum statistics, or other statistics based on simple distributional assumptions), which may significantly underestimate the complexity observed in reality. This is partially because the datasets, which were available when those analysis methods were developed, usually contained only a few replicates. It can also be due to underestimation of the underlying biological variations. We have shown in this paper that insufficient characterization of differential expression information could lead to low detection power and/or higherthanexpected FDRs. It is expected that future studies will produce sufficiently large number of replicates because the collaboration scales are quickly growing larger and the rapid advances of highthroughput technologies will bring down the experimental cost dramatically. Therefore, it is important to develop novel DEF detection methods with better capability of dealing with complex differential expression patterns. To this end, we proposed to utilize multiple basic attributes to better capture differential expression information and formulate the problem of detecting DEFs as optimizing discriminant boundary constrained by a userdefined FDR cutoff in a multidimensional space. We have developed the DiscriminantCut (DC) algorithm for dealing with a special family of discriminant functions (i.e., linear boundaries). The comparison of DC with several existing DEF detection methods using simulated datasets and the SEQC/MAQCIII RNAseq dataset confirms the advantages of DC in handling complex differential expression patterns. In addition, we also show an application of DC to analyze microarray datasets, and expect that DC can be used (maybe with slight extensions) to analyze many different types of highthroughput datasets. In the future, we will explore our approach for metaanalysis [48] that integrate multiple datasets.
Using linear discriminant functions is an effective step forward, but it may not be powerful enough to fully utilize largescale datasets. More powerful methods can be developed in the future by exploring more sophisticated discriminant function families and learning techniques. Discriminant analysis by integrating heterogeneous attributes is popular in many machinelearning research and its applications (e.g., computer vision, natural language processing, speech recognition, etc.). It is mostly done in supervised way that can rely on labelled information to perform calibration. Our approach is unsupervised and uses the estimated FDR for selfcalibration. This kind of machine learning problem has not been widely researched, and hence can be of great interest to future research.
Our approach greatly benefits from the attributes designed by previous research on differential analysis (such as, SAM, DESeq2, voom, vst/limma, and so on). We believe that we are far from fully exploring the potentials of those attributes. On the other hand, it is possible that some attributes may be redundant (i.e., can be replaced by combinations of other attributes) or their information cannot be effectively utilized by the chosen discriminant function family. Which attributes are effective depends on the characteristics of the dataset under analysis. DC already has certain attribute selection capability because it applies the L_{1} regularization. However, we believe attribute selection remains an open problem and can be domain specific. We will investigate this problem in the context of detecting DEFs in the future. As far as we know, our work is the first one that formerly introduces unsupervised multidimension discriminant analysis to DEF detection, which can be a new direction to significantly advance the DEF detection research as supported by our experimental results.
Conclusion
This paper presents a novel machine learning methodology for robust differential expression analysis, which can be a new avenue to significantly advance research on largescale differential expression analysis. The corresponding mathematical model was formulated as a constrained optimization problem aiming to maximize discoveries satisfying a userdefined FDR constraint. An effective algorithm, DiscriminantCut, was developed to solve an instantiation of this problem. Extensive comparisons of DiscriminantCut with a couple of cutting edge methods were carried out to demonstrate its robustness and effectiveness.
Abbreviations
 DC:

DiscriminantCut
 DEF:

Differential expressed features
 FDR:

False discovery rate
References
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270(5235):467–70.
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Genet. 1999;21(1 Suppl):20–4. doi:10.1038/4447.
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. doi:10.1038/nrg2484.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;57(1):289–300.
Storey JD. The false discovery rate: a Bayesian interpretation and the qvalue. Technical report of the Stanford University Department of Statistics. 2001.
Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001;96(456):1151–60.
Efron B, Storey JD, Tibshirani R. Microarray Empirical Bayes Methods, and false discovery rates: Technical report of the Stanford University Department of Statistics. 2001.
Student. The probable error of a mean. Biometrika. 1908;6(1):1–25
Long AD, Mangalam HJ, Chan BY, Tolleri L, Hatfield GW, Baldi P. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J Biol Chem. 2001;276(23):19937–44. doi:10.1074/jbc.M010192200.
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001;98(9):5116–21.
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3. doi:10.2202/15446115.1027
Cui X, Hwang JT, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6(1):59–75. doi:10.1093/biostatistics/kxh018.
Fox RJ, Dimmic MW. A twosample Bayesian ttest for microarray data. BMC Bioinf. 2006;7:126. doi:10.1186/147121057126.
Sartor MA, Tomlinson CR, Wesselkamper SC, Sivaganesan S, Leikauf GD, Medvedovic M. Intensitybased hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinf. 2006;7:538. doi:10.1186/147121057538.
Yu L, Gulati P, Fernandez S, Pennell M, Kirschner L, Jarjoura D. Fully moderated Tstatistic for small sample size gene expression arrays. Stat Appl Genet Mol Biol. 2011;10(1). doi:10.2202/15446115.1701
Lönnstedt I, Speed T. Replicated microarray data. Stat Sin. 2001;12:31–46.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. doi:10.1101/gr.079558.108.
Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics. 2010;26(1):136–8. doi:10.1093/bioinformatics/btp612.
Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7. doi:10.1093/bioinformatics/btm453.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. doi:10.1186/gb20101110r106.
Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinf. 2010;11:422. doi:10.1186/1471210511422
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. doi:10.1093/bioinformatics/btp616.
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32. doi:10.1093/biostatistics/kxm030.
Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNAsequencing data. Biostatistics. 2012;13(3):523–38. doi:10.1093/biostatistics/kxr031.
Auer PL, Doerge RW. A twostage Poisson model for testing RNASeq data. Stat Appl Genet Mol Biol. 2011;10:1.
Wedderburn RWM. QuasiLikelihood Functions, Generalized Linear Models, and the GaussNewton Method. Biometrika. 1974;61(3). doi:citeulikearticleid:10002546 doi: 10.2307/2334725
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15(12):550. doi:10.1186/s1305901405508.
Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011;10(1):1–28.
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013;29(8):1035–43.
Hardcastle T, Kelly K. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinf. 2010;11(1):1–14. doi:10.1186/1471210511422.
Van De Wiel MA, Leday GG, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013;14(1):113–28. doi:10.1093/biostatistics/kxs031.
Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15(2):R29. doi:10.1186/gb2014152r29
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinf. 2013;14:91. doi:10.1186/147121051491.
Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNASeq data. Stat Methods Med Res. 2013;22(5):519–36. doi:10.1177/0962280211428386.
Tarazona S, GarciaAlcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNAseq: a matter of depth. Genome Res. 2011;21(12):2213–23. doi:10.1101/gr.124321.111.
Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1:80–3.
Storey JD. The optimal discovery procedure: a new approach to simultaneous significance testing. J R Stat Soc Ser B (Stat Methodol). 2007;69(3):347–68.
Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B (Stat Methodol). 2002;64(3):479–98.
Xu X, Tian L, Wei LJ. Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics. 2003;4(2):223–9. doi:10.1093/biostatistics/4.2.223.
Li J, Tseng GC. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. 2011:994–1019. doi:10.1214/10AOAS393
Demetrescu M, Hassler U, Tarcolea AI. Combining significance of correlated statistics with application to panel data*. Oxf Bull Econ Stat. 2006;68(5):647–63. doi:10.1111/j.14680084.2006.00181.x.
Montgomery SB, Sammeth M, GutierrezArcelus M, Lach RP, Ingle C, Nisbett J, et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464(7289):773–7. doi:10.1038/nature08903.
Consortium SMI. A comprehensive assessment of RNAseq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14. doi:10.1038/nbt.2957.
Canales RD, Luo Y, Willey JC, Austermiller B, Barbacioru CC, Boysen C, et al. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol. 2006;24(9):1115–22. doi:10.1038/nbt1236.
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotech. 2014;32(9):896–902. doi:10.1038/nbt.2931. http://www.nature.com/nbt/journal/v32/n9/abs/nbt.2931.html#supplementaryinformation.
Aldinger KA, Plummer JT, Levitt P. Comparative DNA methylation among females with neurodevelopmental disorders and seizures identifies TAC1 as a MeCP2 target gene. J Neurodev Disord. 2013;5(1):15. doi:10.1186/18661955515.
Chu G, Li J, Narasimhan B, Tibshirani R, Tusher V. Significance analysis of microarrays users guide and technical document. 2001.
Chang LC, Lin HM, Sibille E, Tseng G. Metaanalysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinf. 2013;14(1):368.
Acknowledgement
The authors acknowledge constructive inputs from Dr. Jun Liu, Dr. Wing H. Wong, and Dr. George C. Tseng. Most simulation tests were performed on the High Performance Computing Cluster at Brandeis University.
Funding
This work is supported by the Brandeis University Graduate Fellowship.
Availability of data and materials
The executable of DiscriminantCut is available at GitHub (https://github.com/beiyuanzhe/DiscriminantCut).
Authors’ contributions
PH initiated the idea. YB and PH codesigned the mathematical model, the DC algorithm, and the experiments. YB implemented the DC algorithm and carried out all experiments. YB designed and implemented the method to significantly speed up DC. YB and PH together interpreted the experimental results and wrote the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1:
Additional documentation. (DOCX 9047 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bei, Y., Hong, P. Robust differential expression analysis by learning discriminant boundary in multidimensional space of statistical attributes. BMC Bioinformatics 17, 541 (2016). https://doi.org/10.1186/s128590161386x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590161386x