The impact of sample imbalance on identifying differentially expressed genes

Background Recently several statistical methods have been proposed to identify genes with differential expression between two conditions. However, very few studies consider the problem of sample imbalance and there is no study to investigate the impact of sample imbalance on identifying differential expression genes. In addition, it is not clear which method is more suitable for the unbalanced data. Results Based on random sampling, two evaluation models are proposed to investigate the impact of sample imbalance on identifying differential expression genes. Using the proposed evaluation models, the performances of six famous methods are compared on the unbalanced data. The experimental results indicate that the sample imbalance has a great influence on selecting differential expression genes. Furthermore, different methods have very different performances on the unbalanced data. Among the six methods, the welch t-test appears to perform best when the size of samples in the large variance group is larger than that in the small one, while the Regularized t-test and SAM outperform others on the unbalanced data in other cases. Conclusion Two proposed evaluation models are effective and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and two proposed evaluation models can provide some help in selecting suitable method to process the unbalanced data.


Background
Microarrays enable us to monitor expressions of thousands of genes simultaneously and generate enormous amount of data. Using such techniques, it is possible to explore the secret of biology at the molecular level and understand the fundamental biological processes ranging from gene function to development and to cancer [1][2][3]. In microarray experiments, the expression levels of several thousands candidate genes have been monitored in two opposite conditions, such as Treatment versus Control conditions, where each condition is represented by several samples. Unfortunately, most monitored genes are unrelated to the conditions and their expression levels do not change or change by chance, while other genes are strongly related to the conditions and truly change their expression levels according to conditions. However, these differentially expressed genes are very useful in latter research and clinical applications [2,3]. Therefore, one of the important tasks in microarray data analysis is to compare the expression levels of genes in samples drawn from two different conditions and to select genes with differential expression under those two conditions. Specifically, we are interesting in identifying which of several thousands candidate genes have had their expression levels changed by condition, given a microarray data.
One simple approach used in literature to detect differential expression genes is "fold change" method, in which a gene is declared to be differentially expressed if its average expression level varies by more than a given constant between two conditions. However, "fold change" method has been demonstrated to be unreliable and inefficient, because statistical variability is not considered [4]. Then, many sophisticated statistical approaches have been proposed [5,6]. These approaches can be roughly classified into two categories. The parametric methods based on statistical model is the first category of methods. This kind of methods include various versions of the two-sample t-test [6][7][8]. Due to the reason that gene expression data are often noisy and not normally distributed [9], the strong assumption of parametric method can be violated in practice. The second category of approaches is nonparametric statistical methods, including the Wilcoxon rank-sum test [10], the Significance Analysis of Microarray (SAM) method [11], the Empirical Bayes (EB) method [12], the mixture model method [13] and other modified nonparametric methods [14,15]. For recent reviews, please see [5,6].
However, very few studies consider the problem of sample imbalance in detecting differential expression genes and there are no studies as well as quantitive method to investigate the effect of sample imbalance on differential expression genes selection. Sample imbalance means that the size of samples in one group is very different to that in another group. In fact, the problem of sample imbalance usually appears in gene expression data, especially in the data about tumor samples. For example, the data in [16][17][18][19][20][21][22][23] are all unbalanced. There are many factors causing the problem of sample imbalance, such as the limit of source of tumor samples, budgetary constraints and reducing samples in the control group artificially and factitiously.
Coupled with the small sample in gene expression data, the problem of sample imbalance may be more serious. Consequently, two important and natural questions may be asked by biologists as follows: How does the sample imbalance affect the methods for identifying differential expression genes? Which method is more suitable for the unbalanced data? In addition, previous studies [24,25] have found that the variability of gene expression may be related to the average expression. It suggests that the two sample t-test being used should be based on unequal variances. An instant but reasonable question is: whether the above suggestion is still true on the unbalanced data.
In this paper, we investigate the new problem about the impact of sample imbalance on identifying differential expression genes. Two evaluation models based on random sampling are proposed and six famous methods are compared on both the real data and the simulated data. Under each evaluation model, the random sampling is utilized to estimate the expected performances of methods on the unbalanced data which satisfy one specific sample ratio between two groups. Then the variations of performances are used to illustrate the effect of sample imbalance on differential expression genes selection and method selection.

Results
In this section, six methods including two-sample t-test with equal variances (equalling F-test) [6], two-sample ttest with unequal variances (i.e. Welch t-test) [5,7], Wilcoxon rank-sum test [10], SAM [11], Regularized t-test [8] and the permutation-based method of Pan [15] are systematically compared on real data and simulated data according to two evaluation models. All experiments are conducted in Matlab environment on a Pentium PC with a 3.20 GHz CPU and 512 MB RAM. The processing procedure is as follows. For every pair of fixed parameters n 1 and n 2 (which are the numbers of samples in class one C 1 and class two C 2 ) in each experiment under two evaluation models, first, we randomly create a set of x independent artificial data or simulated data and test all six methods on these x data to get the results. For a specific method, each one in the x random data will only get one result for each measure, for example Overlap Rate, Precision Rate or Recall Rate. Then, these x values are treated as a random sample of size x from the fixed parameters n 1 and n 2 . Last, the expected performance of each method and its 0.95 confidence interval are calculated from this kind of random samples.

Datasets
Two real datasets are the liver dataset [21] and the prostate dataset [26]. Taking a data preprocess protocol similar to that in Dudoit et al [27], we screen out genes with missing data in more than 5% arrays, impute other missing data by 0, and then apply a base 2 logarithmic transformation. Each experiment is standardized to zero median across the genes. The prostate data finally consists of gene expression profiles of 62 primary prostate tumours and 41 normal specimens with expression values of 7931 genes. The liver data consists of gene expression profiles of 105 primary HCC and 76 non-tumor liver tissues, 7 benign liver tumor samples, 10 metastatic cancers, and 10 HCC cell lines on 11763 genes. We select two largest classes from the liver dataset to do experiments.
The simulated data is created according to the protocol in [10], where the gene expression value is a normally generated random value with a noise generated from one uniform distribution of U(-0.1, 0.1), which is very similar to real data. In each simulated data, there are 1000 genes (first 50 with differential expression and next 950 with non-differential expression) and two classes C 1 and C 2 (having n 1 and n 2 samples, respectively). For any non-differential expression gene j (i.e. 51 ≤ j ≤ 1000), its expression value a ij on each sample i is randomly generated from N(μ, 0.5) and U(-0.1, 0.1), where μ ~N(0, 0.25). For gene j ≤ 50, the value of gene j on any sample in class C 1 is generated from N(μ 1 , σ 1 ) and U(-0.1, 0.1), while that in class C 2 is generated from N(μ 2 , σ 2 ) and U(-0.1, 0.1), where μ 1 , μ 2~N (0, 0.5). For the problem of multiple testing involved in identifying differential expression genes, bonfenorri correction of the significant level α can be used to reduce the error of type I. But a very small α will be disadvantaged to compare the performances of methods. In this paper, a relatively small significant level α will be used to control the type error I. On the real data, the value of α is set to 0.0001. On the simulated data, the significant level α is set to 0.01.

Results on real data
In the experiments of the evaluation model 1, the number of samples in Class C 1 of the artificial data, which are created from the liver data or the prostate data, is always fixed at 60. The results under the evaluation model 1 are presented in figure 1. Because of the limitation of sample size in real data, in the experiments of the evaluation model 2, the value of n 1 + n 2 in the artificial data created from the liver data is fixed at 120 and that from the prostate data is fixed at 60. The results of the evaluation model 2 on real data are presented in figure 2. The expected Overlap Rates and its 0.95 confidence interval (or Error Limit) of each method at each specific SR are obtained from 100 randomly generated artificial data. Furthermore, in order to test whether the average Overlap Rate at SR ≠ 1 (denoted as i(i ≠ 1) )is significantly different with that at SR = 1 (denoted as 1 ), we make a two sample t-test, where the observations are these 100 Overlap Rates calculated from 100 random artificial data with SR = 1 and those calculated from 100 random artificial data with SR ≠ 1. So our null hypothesis states that i(i ≠ 1) = 1 , while the alternative hypothesis states that i(i ≠ 1) ≠ 1 . The p-values associated with the t-statistic in the evaluation model 1 and 2 are summarized in table 1 and 2, respectively. The experiments on real data indicate that the sample imbalance has a great influence on the performances of all six methods. As can be seen in figures 1 and 2, on both real datasets, the Overlap Rates of all methods are gradually decreasing in response to the increasing amounts of sample ratio. For example, in the figure 2(a), the margins between the average Overlap Rates at SR = 1 and that at SR = 3 on 6 methods (F, welch-t, wilcoxon, SAM, Regularized-t and Pan) are 0.2249, 0.1842, 0.2429, 0.2255, 0.2378 and 0.1932. According to the p-value showed in Table 2, we can conclude that on the real data the difference of the performance for each method between SR = 1 and SR ≠ 1 has a very high statistical confidence. Additionally, there is also a difference among the Overlap Rates of different methods. It can be seen from figure 1 and 2 that Welch t-test and the method of pan create higher Overlap Rates on the unbalanced liver data than other 4 methods, while Wilcoxon test shows a lower Overlap Rate compared with other 5 methods on the unbalanced prostate data. However, because of without true solution, we can't decide directly and strictly which one of the six methods has the best performance on real data.

Results on simulated data
In this section, under two proposed evaluation models, we generate two kinds of simulated data to compare the performances of different methods on the unbalanced data. In the first category, the differential expression genes have equal variances in sample class C 1 and sample class C 2 (i.e. σ 1 = σ 2 ), but have unequal variances (i.e. σ 1 ≠ σ 2 ) in the second category of simulated data. The result on a simulated data is the average result on 1000 random data generated with a specific sample ratio. Figure 3 shows the results on the simulated data in the  Rate) is steadily declined as the sample ratio increases. This result is consistant with that of previous experiments on the real data.

Equal variances
Furthermore, the downward trend of Recall Rate in response to the increasing amounts of sample ratio is steeper than that of Precision Rate. In other words, the The results on prostate and liver datasets under the evaluation model 2 Figure 2 The results on prostate and liver datasets under the evaluation model 2. The expected Overlap Rates of six methods as well as their error limits on prostate and liver datasets under the evaluation model 2, where the number of overall samples in the artificial data from liver data is fixed at 120 and that from the prostate data is fixed at 60. The results on prostate and liver datasets under the evaluation model 1 Figure 1 The results on prostate and liver datasets under the evaluation model 1. The expected Overlap Rates of six methods as well as their error limits on prostate and liver datasets under the evaluation model 1, where the sizes of samples in Class C 1 of the artificial data, which are created from the liver data and the prostate data, are all fixed at 60.

(b) Liver dataset
Recall Rate (the false negative) of the method for selecting differential expression genes is more sensitive than the Precision Rate (the false positive) to sample imbalance, although they are all affected by sample imbalance.
It is certain that the sample imbalance appears to have different effects between different methods. The difference between different methods become great when the degree of sample imbalance increases. In detail, the Precision Rates of the Wilcoxon rank-sum test and the Regularized t-test are higher than those of others, that is, the Wilcoxon rank-sum test and the Regularized t-test have lowest false positive rate (Type I error). Whereas, the Recall Rate of SAM is superior to that of other methods, i.e. the method of SAM has the lowest false negative rate (Type II error).
And Welch t-test shows the worst performance.

Unequal variances
In this section, under two evaluation models, the simulated data are generated in two case: the first case satisfies σ 1 ≤ σ 2 and n 1 ≥ n 2 , for example, σ 1 = 0.5, σ 2 = 1 and SR = 1, 2, 3. The second case is that σ 1 ≤ σ 2 and n 1 ≤ n 2 , for example, σ 1 = 0.   ance increases, and on the same unbalanced data there exists great variance among the performances of six methods. These features are the same as those on the simulated data with equal variances. Furthermore, there are surprising variation on the performances of all methods compared in this paper between two different types of unbalanced data with unequal variances. In the case of σ 1 = 0.5, σ 2 = 1 and n 1 ≥ n 2 , Regularized t-test shows the highest Precision Rate and Recall Rate while Welch t-test performs the worst capability. In contrast, Regularized t-test has the medium performance and Welch t-test shows the best performance when σ 1 = 0.5, σ 2 = 1 and n 1 ≤ n 2 . This surprising observation can be easily explained by figure 5.
As we can see in figure 5, the curve of each method performance under the evaluation model 2 is a function of sample ratio, which maximize its value at a specific sample ratio. These results imply that one should select a relatively feasible method to detect differentially expressed genes on an actual and specific unbalanced data. If one more suitable method has been selected to process the unbalanced data, then the result of analysis can be improved greatly.
In order to investigate the combined influence of sample ratio and varied variance on method performance, Regularized t-test and Welch t-test are selected as examples to demonstrate the dependency of the difference between methods with respect to different variances and sample ratios. Figure 6 shows the difference between Regularized t-test and Welch t-test against varied variance at different sample ratios. When σ 1 ≤ σ 2 , Regularized t-test is always superior to Welch t-test on the unbalanced data which satisfies n 1 ≥ n 2 . When σ 1 ≤ σ 2 and n 1 ≤ n 2 , the results become relatively complex. In the plot b of figure 6, several curves cross the line of zero, which implies that both methods of Regularized t-test and Welch t-test have some region of superiority. But when σ 1 ≤ σ 2 and n 1 ≤ n 2 , Welch t-test have obvious dominance. In addition, the more difference between variances σ 1 and σ 2 in unbalanced data, the higher different effects on different methods.

Discussion
From this study, it is clear that there is a great effect on the performances of methods for selecting differential expression genes by the sample imbalance. Because of many objective factors, the gene expression data always involve the problem of small sample. As mentioned earlier in the previous section, coupled with the problem of small sample, the presence of the unbalanced data makes detecting differential expression genes more difficult. The sample imbalance is an important and inevitable problem in gene expression data analysis. Hence, one need to consider the problem of sample imbalance in the design of microarray experiments and the following data analysis.
Careful experimental design is necessary to improve the result of data analysis and reduce the cost of experiment simultaneously. By the comparison between plot a and b in figure 3, we can find that the expected Recall Rates and the expected Precision Rates at SR = 1 in plot b are higher than those at SR = 6 in plot a. In other words, because of the influence of sample imbalance, the result from one gene expression data of size 60 can be superior to that from another similar gene expression data of size 70. This finding is very considerable and exciting.   Furthermore, our results also indicate that on the unbalanced data, there have a great difference between the performances of different methods, especially on the data with heterogeneity. Some previous studies [24,25] have found that the variance (for i = 1, 2) of expression values for gene j may depend on the mean expression value μ i . Hence, it will be very helpful to the result of analysis if a more suitable method has been selected to process the unbalanced data. For example, given an unbalanced data with unequal variances, one can improve the result of analysis if a feasible method from the six methods is selected. However, it is very likely that all six methods are not feasible for the unbalanced data and there is a requirement to find new methods more suitable to process the unbalanced data.
The expected performances of six methods on the simulated data with equal variances, i.e. σ 1 = σ 2 = 0.5  It should be noted that this paper does not consider the problem of determining sample size for detecting differentially expressed genes in microarray data. An interesting topic is how to assign samples between two groups in order to maximize a method performance under the constraint of the given number of overall samples n 1 + n 2 .
The results of this paper are based on six popular and typical methods for identifying differential expression genes including parametric method and nonparametric method. The similar effect of the sample imbalance on both kinds of methods leads us to believe that the findings in this paper should have, at least qualitatively, a comprehensive meaning. Also, two proposed evaluation models can be used to compare and evaluate other methods.

Conclusion
The experiments in this paper demonstrate that sample imbalance has a great effect on identifying differential expression genes and two proposed models are effective to quantify the effect of sample imbalance. Moreover, different methods have different performances on the unbalanced data and we can not find one method to be suitable for all unbalanced data in the experiments. Among the six methods, the welch t-test appears to perform best when the size of samples in the large variance group is larger than that in the small one, While the Regularized t-test and SAM outperform others on the unbalanced data in other cases. In conclusion, two proposed evaluation models and the results provide some help in selecting suitable method to process the unbalanced data.
In future work, we will apply the evaluation models to evaluate more methods, for example the methods based False Discovery Rate. Furthermore, we attempt to investi-gate the problem of determining the sample size to maximize the performance of a given differential expression genes selection method.

Methods
First, some notations used in this paper are introduced here. We assume there are n samples in the gene expression data and these n samples consist of two nonoverlapping categories named class one (C 1 ) and class two (C 2 ). In each sample, the expression values of p genes have been detected. Then the gene expression data may be represented by a n × p matrix where the element a ij is the expression value of gene j in sample i. The rows of A correspond to samples, and the ith row vector of A is called the expression profile of the ith sample. We assume that n k , k (j) and S k (j) are number of samples, sample mean and sample variance of gene j in the class C k , respectively, where k = 1, 2.

Basic concepts Definition 1 (Sample Ratio)
Given a gene expression data, let n k denotes the number of samples in class C k , k = 1, 2. Then, Sample Ratio, denoted by SR, is defined to be n 1 /n 2 , i.e. SR = n 1 /n 2 .
We use the Sample Ratio (SR) to measure the degree of sample imbalance between two groups. As revealed by definition 1, the further the value of SR departs from 1, the more serious the degree of sample imbalance is.
A key question also involved in this paper is how to evaluate the performance of a method for identifying differen-a tial expression genes, that is, how to evaluate the solution resulted from the method. For thousands of genes in a real gene expression data, it is generally unclear that which ones are differentially expressed genes. This situation has resulted in an obstacle to assess a method directly and strictly. In contrast, the true solution is known for the simulated data. So, in order to assess the performance of method directly, the simulated data are introduced. Furthermore, several measures are introduced to measure the quality of the method solution. Different measures are applicable in different situations, depending on whether a true solution is known or not.
First, we present a metric to assess the method performance for selecting differential expression genes on the real gene expression data.
Given real data, the whole real data is treated as the original data (OD) and the artificial data (AD), which satisfies the given parameters n 1 and n 2 , is generated by randomly sampling samples from the original data. Thus the Overlap Rate denoted by OR is calculated according to the following definition.

Definition 2 (Overlap Rate) Let DEG OD and DEG AD be the sets of Differentially Expressed
Genes identified by some method on the original data (OD) and the artificial data (AD), respectively, then the Overlap Rate (OR) is defined as OR = |DEG OD ∩ DEG AD |/|DEG OD |.
To assess the method performance on the simulated data, we can compare the true solution with the suggested solution by the following method. Given simulated data with p genes, any solution can be represented by a binary 1 × p vector T, where T(i) = 1 if and only if the i-th gene is differentially expressed gene (or positive gene). Suppose that T and S be the true solution and the suggested solution of a method, respectively. And let n xy denote the number of pair (i, i), for which T(i) = x and S(i) = y, where x, y = 0 or 1. Thus, n 11 is the number of true positive genes, n 01 is the number of false positive genes, n 00 is the number of true negative genes, and n 10 is the number of false negative genes. Consequently, two different metricss, Recall Rate and Precision Rate, are introduced to measure the performance of method.

Definition 3 (Recall Rate)
Suppose that S and T be the suggested solution of a differential expression gene selection method and the true solution, respectively. Then Recall Rate (RR) is defined as RR = n 11 /(n 10 + n 11 ).

Definition 4 (Precision Rate)
Suppose that S and T be the suggested solution of a differential expression gene selection method and the true solution, then Precision Rate (PR) is defined as PR = n 11 /(n 01 +n 11 ).
From the definitions 3 and 4, We can see that the Recall Rate focuses on the false negative while the Precision Rate focuses on the false positive. However, the false negative and the false positive are two different keystones in the context of selecting differential expression genes, and the false negative is inconsistent with the false positive. So in a particular problem specification, one can choose either Recall Rate or Precision Rate as the main focus.

Random sampling
For one specific method of differential expression genes selection and one given data with n 1 samples in class C 1 and n 2 samples in class C 2 , we can only get one specific value of each of the metrics OR, RR and PR of the method on the given data. So after one specific method and the set of gene expression data with size m are given, there exists the set of ORs (RRs or PRs) with size m resulted from the method. A perfect way to evaluate the performance of one   method is to run the method on the whole set of gene expression data which satisfy given parameters n 1 and n 2 and to calculate the average value of each metric. But the cardinality of the set of data with parameters n 1 and n 2 may be very large or infinite. For example, if a real microarray data has 50 and 30 samples in class C 1 and C 2 respectively, then the number of different artificial data with parameters n 1 = 40 and n 2 = 20 is > 3 × 10 17 . In order to reduce the computation cost and avoid the problem of infinity, one feasible way is to estimate the expected value of each metric and its approximate confidence interval (or Error Limit) by sampling a sample from the specific gene expression data randomly.
Lemma 1 [28]Suppose that population X has mean μ, and finite variance σ 2 , and X 1 , X 2 , ..., X n are an independent random sample of size n from the population X, then the sample mean is an C C For sample of size n ≥ 30, regardless the shape of most population, sampling theory guarantees good results [28].
When population is finite, the change is the introduction of the factor 1 -f for the variance D( ), where f = n/N is the sampling fraction and N is the size of population. The factor 1 -f is called the finite population correction (fpc). That is, the confidence interval is . In practice, the fpc can be ignored whenever the sampling fraction does not exceed 5% [29].

Evaluation models
In order to investigate the effect of sample imbalance on differential expression genes selection, one simply needs to consider the change of the performance of a method in response to different sample ratios (SRs), because the sample ratio is a measure of the degree of sample imbalance between two groups. Therefore, two evaluation models are proposed as follows.

Evaluation model 1
Let the number of samples of certain class always equal to constant C, for instance n 1 = C, and the artificial data (or the simulated data) is randomly created with different Sample Ratios. Then compare the method results on the data with various Sample Ratios.

Evaluation model 2
Let the number of all samples in the artificial data (or the simulated data) always equal to constant C, i.e. n 1 +n 2 ≡ C, and the artificial data (or the simulated data) is randomly created with different Sample Ratios. Then the method is evaluated based on these random data with particular parameter SR.

Calculating cutoff point
For the parametric method, the cutoff point of a significance level a is calculated from the assumed distribution.
In the nonparametric method, for a given significance level α, following the spirit of SAM, we find the 100(1α)% quantile of the null distribution, i.e. noted as , using the following formula where B is the number of permutations and is the value of the statistic for the i-th gene in the b-th permutation. Then the quantile value is used as the cutoff point for that statistic to select differential expression genes.