
Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster



Heterogeneously and differentially expressed genes (hDEGs) are a common phenomenon due to biological diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for disease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers.


We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non-outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs.


The proposed algorithm assesses each potential outlier in relation to the tight cluster of non-outliers without making explicit assumptions about the outlier distribution. Simulated examples and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.


A heterogeneously and differentially expressed gene (hDEG) is a gene which has an inconsistent expression pattern across its experimental samples. Typically, a large proportion of the experimental samples and the control samples form a tight cluster in low expressions. The remaining small proportion of experimental samples, namely the outliers, are observed to significantly deviate from the tight cluster towards high expressions. We use the word ‘tight’ to describe the cluster of null (or low) expressions of a hDEG as the null variance is typically small compared to the null-outlier distance. In situations where the few highly expressed outliers of a non-differential gene are caused by measurement error, it is also useful to distinguish such genes with hDEG characteristics. The existence of hDEGs has been established in various experiments [1-8]. Suppose we have the expressions of m genes. The standard t statistic under-estimates the significance in testing the difference across the control and experimental samples of a hDEG. COPA (cancer profile outlier analysis) [9] proposed modifying the Student t statistic to be a ratio of the distance between the r th (default 90th) percentile of experimental samples and the median of all samples over the median absolute distance (deviated from the whole sample median), i.e.,

$$t_i^{\mathrm{COPA}} = \frac{q_r(y_i) - \lambda_i}{\sigma_i}, \qquad i = 1, \ldots, m \qquad (1)$$

where $\sigma_i = 1.4826 \times \mathrm{med}(|x_i - \lambda_i|, |y_i - \lambda_i|)$, $x_i$ and $y_i$ represent the control samples and experimental samples of the $i$th gene respectively, $q_r(y_i)$ is the $r$th percentile of $y_i$ and $\lambda_i$ is the median of both $x_i$ and $y_i$. The quantile-median difference in (1) summarises the null-outlier distance using a single value of $y_i$. To make outlier detection more efficient, the outlier-sum (OS) statistic [10] sums over outliers, $t_i^{\mathrm{OS}} = \sum_j (y_{ij} - \lambda_i)\,\sigma_i^{-1}$, where the outliers are defined as $\{y \in y_i : y > q_{75}(x_i, y_i) + \mathrm{IQR}(x_i, y_i)\}$. The outlier robust t statistic (ORT) uses the same statistic but defines the outliers in relation to the control samples only, $\{y \in y_i : y > q_{75}(x_i) + \mathrm{IQR}(x_i)\}$ [11]. The maximum ordered subset t statistic (MOST) defines the outliers to be the top $k$ experimental samples and chooses $k$ by optimising a normalised t statistic [12]. The least sum of ordered subset square t statistic (LSOSS) [13] also compares the controls with a subset of the top $k$ experimental samples, $t_i^{\mathrm{LSOSS}} = k(\bar{y}_i^{(k)} - \bar{x}_i)\,S_i^{-1}$, where $\bar{x}_i$ is the mean of the control samples, $\bar{y}_i^{(k)}$ is the mean of the top $k$ experimental samples and $S_i$ is the pooled standard deviation of the set of control samples plus non-outlier experimental samples and the set of outlier experimental samples. $k$ is optimised iteratively to minimise the within-cluster variance.

We propose a new algorithm for detecting hDEGs with a small number of outliers by detecting outliers via gap (DOG) maximisation. What makes this approach different from the existing methods is that we assess each potential outlier in relation to a tight cluster of non-outliers. This avoids modelling the highly expressed outliers explicitly, which is especially important when the number of outliers is small. The proposed algorithm classifies each gene as a hDEG or non-hDEG by locating potential outliers and summarises it using the average of the standardised outlier expressions.
We will use simulated examples and a breast cancer dataset to illustrate the effectiveness of the proposed algorithm in detecting hDEGs with few outliers. We will also examine how the algorithms perform as the simulation conditions are varied.
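To make the COPA, OS and ORT definitions above concrete, the following is a minimal Python sketch for a single gene. The function names, the linear-interpolation percentile rule and the toy data in the usage note are illustrative assumptions and are not taken from the cited implementations; ORT is written exactly as described in the text, i.e. as the OS statistic with the outlier rule evaluated on the control samples only.

```python
import statistics


def percentile(values, r):
    """r-th percentile (0-100), linear interpolation between order statistics.

    The interpolation rule is an assumption; other conventions exist.
    """
    s = sorted(values)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def mad_scale(x, y, centre):
    """sigma_i: 1.4826 times the median absolute deviation from `centre`."""
    return 1.4826 * statistics.median([abs(v - centre) for v in list(x) + list(y)])


def copa(x, y, r=90):
    """COPA: (r-th percentile of y minus pooled median) over sigma_i."""
    lam = statistics.median(list(x) + list(y))
    return (percentile(y, r) - lam) / mad_scale(x, y, lam)


def outlier_sum(x, y, controls_only=False):
    """OS statistic; controls_only=True applies the ORT outlier rule instead."""
    pooled = list(x) + list(y)
    lam = statistics.median(pooled)
    ref = list(x) if controls_only else pooled
    q25, q75 = percentile(ref, 25), percentile(ref, 75)
    cutoff = q75 + (q75 - q25)  # q75 + IQR
    # Sum the centred expressions of experimental samples above the cutoff.
    return sum(v - lam for v in y if v > cutoff) / mad_scale(x, y, lam)
```

For example, with hypothetical controls `[9.0, 9.5, 10.0, 10.5, 11.0]` and experimental samples `[9.2, 9.8, 10.1, 10.4, 15.0]`, only the value 15.0 exceeds the OS cutoff, so the OS statistic reduces to the single standardised distance of that sample from the pooled median.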

Results and discussion

Simulated examples

Scenario 1 - identification of a single hDEG

The algorithms are compared for the detection of a single hDEG with the number of outliers varied from one to nine. The results are summarised in Table 1. For a small number of outliers, COPA, MOST and LSOSS demonstrated relatively poor performances while DOG consistently gave significant p-values.

Table 1 Scenario 1

Scenario 2 - identification of multiple hDEGs (100 genes with 50 hDEGs)

Over a critical p-value range from 0 to 0.01, DOG demonstrated the highest average cumulative Matthews correlation coefficient (cMCC, see Methods for more detail) across five sets of simulations with one to five outliers (Figure 1). Table 2 shows that DOG had very high classification rates compared with the other five algorithms. When the number of outliers exceeded two, OS, ORT and LSOSS gave more reasonable classification rates. COPA and MOST gave poor predictions overall.

Figure 1

cMCC. Scenario 2: average cMCC of the six algorithms over (0, 0.01) for 1-5 numbers of outliers.

Table 2 Scenario 2

Figure 2 shows the ROC curves for the one-outlier simulations. DOG had a superior ROC curve with a partial AUC value of 1. Figure 3 shows the same ROC curves over the complete range of false positive rates; COPA and LSOSS remained poor. We also found that as the number of outliers increased to five, most algorithms worked well with the exception of COPA.

Further simulated examples

We look at the sensitivity of DOG with respect to changes in certain assumptions and parameters.

Figure 2

ROC - one outlier. Scenario 2: ROC curves of the six algorithms in detecting single outlier-hDEGs (in close up for low false positive rates).

Figure 3

ROC - one outlier. Scenario 2: Full ROC curves of the six algorithms in detecting single outlier hDEGs.

Variable marginal null-outlier distance

We revisit the single-hDEG simulation but vary the marginal null-outlier distance (defined in ‘Experimental design’ in Methods) from 0.5 to 2 with increments of 0.1 (Table 3). DOG’s p-values increased for a reduced marginal null-outlier distance but retained the most significant mean p-values for larger marginal null-outlier distances. MOST and LSOSS failed to detect the hDEG. DOG gave accurate estimates of the outlier number when the null-outlier distance was greater than one.

Table 3 Distance effect

Non-Gaussian tight cluster

We simulated a Gaussian-mixture tight cluster, $0.5\,N(9,1) + 0.5\,N(10,1)$, to examine how DOG is affected by non-Gaussianity in the tight cluster. All other parameters were kept the same as those used in the single-hDEG simulation. The results were very similar to those seen previously (Table 4). In particular, the performances of COPA, OS and ORT improved for the simulated non-Gaussian tight cluster.

Table 4 Non-Gaussian tight cluster

Control samples containing outliers

DOG can be modified to enable the detection of hDEGs when control samples contain outliers (see ‘Allowing control samples to contain outliers’ in Methods). We illustrate this using the single-hDEG example with one outlier added to the control samples (Table 5). It can be seen that DOG accurately detected the outliers from both control and experimental samples. MOST and LSOSS failed to detect the hDEG.

Table 5 Control samples containing outlier

Breast cancer data

Figure 4 illustrates the ordered expressions of the top four hDEGs as detected by COPA, OS, ORT, MOST, LSOSS and DOG respectively (with annotations of rankings). The rankings of the genes were based on the order of the test statistics. The defining feature of DOG’s top four hDEGs, PEX6, TFP12, UGT2B4 and SLC4A2 (last row of Figure 4), is that they contain a few highly expressed outliers. Figure 5 shows the top 25 predictions of hDEGs using DOG for this data set. The existing literature has established these genes to be of biological relevance to the progression and treatment of breast cancer [14-23].

Figure 4

COPA, OS, ORT, MOST, LSOSS, DOG. Breast cancer data: log2 expressions of the top four hDEGs detected using COPA, OS, ORT, MOST, LSOSS, DOG. The vertical line indicates the separation of expressions in the tight cluster (left) and outliers (right).

Figure 5

DOG. Breast cancer data: log2 expressions of the top 25 hDEGs detected using DOG.

Most other algorithms chose genes with a reasonably large pool of differentially expressed experimental samples expressed at a more moderate level. LSOSS also generally favoured ordinary DEGs. MOST chose a set of top four genes with only one or two moderately expressed outliers. Table 6 shows how the top 100 predictions of these algorithms overlap: COPA and OS are most similar in their rankings whilst DOG has a maximum of 15% overlap with OS. Using the ordered log2 expressions of each algorithm’s unique top 100 genes, Figure 6 illustrates the median expressions minus the minimum expressions for each experimental sample index. The unique top 100 genes for DOG and COPA showed the largest change across their experimental samples, their difference being that COPA favoured hDEGs with a larger number of outliers whilst DOG picked out hDEGs with small numbers of outliers.

Using the significance analysis approach discussed in ‘Significance analysis for real data’ in Methods, we estimated p values by sampling the replicates, which gives alternative rankings of the genes based on p values. We also found the top four predictions ranked using the p values of DOG to be near identical to those ranked using its t statistics, though there were discrepancies in rankings for the lower ranking genes. Similar results were observed for the remaining five algorithms.


The difficulty in identifying hDEGs arises from the fact that only a small number of experimental samples are highly expressed at a much higher level than the non-outliers. As a result, various modified t tests target the subset of potential outliers which are then tested against the control group. In practice, for hDEGs with very few outliers, we found that these algorithms often identify hDEGs with insignificant deviations between the outliers and the tight cluster of non-outliers. Based on this observation, the proposed algorithm assesses each potential outlier in relation to the Gaussian tight cluster without making an explicit assumption about the outlier distribution. At each step, we update the posterior mean and variance of the tight cluster which are then used to evaluate the probability of an outlier being a random sample of the tight cluster. Examples of simulated and breast cancer data sets verify the suitability of the proposed algorithm in identifying hDEGs with small numbers of outliers. An extension of the algorithm which fully takes into account gene correlations will be presented in future work. For the breast cancer data, we found negligible correlations across the top ranking genes and very low correlations among the less significant genes.

Figure 6

Trends. Breast cancer data: trends of scaled medians (median minus the minimum across each sample index) across the experimental samples of the log2 expressions of each algorithm’s unique top 100 hDEGs.

Table 6 Ranking accordance


The proposed algorithm can be briefly summarised as follows. We first take the list of candidate outliers to be those experimental samples whose expressions are larger than the maximum expression of control samples. For the situation when control samples also contain outliers, see the section ‘Allowing control samples to contain outliers’ for a description of the necessary extension. The samples in the candidate list are sorted in ascending order. The algorithm then updates the tight cluster of non-outliers by sequentially testing the samples in the updated candidate list of outliers. The test is terminated when a significant deviation between a candidate sample and the tight cluster is detected. We now give the steps in more statistical detail. First, let us introduce some notation. Let x denote the control samples and y the experimental samples of a gene or a probe set (we drop the gene subscript i for simplicity). The proposed DOG algorithm has the following steps:

  1. Candidate outlier: Given the union of x and y, $z \equiv x \cup y$, we divide $z$ into the candidate outlier set $z^+ = \{z_j \in z \mid z_j > \max(x)\}$, with its elements sorted in ascending order, and the non-outlier set $z^- = \{z_j \in z \mid z_j \le \max(x)\}$.

  2. Detection: Given a critical tail probability $\alpha$ and the corresponding threshold $t_\alpha$ [24], the first element in $z^+$, $z_1^+$, is classified as the first outlier if

    $$t = \frac{z_1^+ - \mu}{\sigma} > t_\alpha$$

    in which case the algorithm terminates and $z^+$ is the set of outliers. We use a default value of $\alpha = 0.05$. The parameters $\mu$ and $\sigma^2$ are the posterior mean and posterior variance of the tight cluster. Details of estimating $\mu$ and $\sigma$ are given below.

  3. Absorption: On the other hand, if $t \le t_\alpha$, we move $z_1^+$ to the tight cluster of non-outliers: $z^- \leftarrow z^- \cup \{z_1^+\}$ and $z^+ \leftarrow z^+ \setminus \{z_1^+\}$.

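  4. Estimating the parameters of the tight cluster: The parameters $\mu$ and $\beta = \sigma^{-2}$ are updated using iterative Bayesian learning, i.e., by maximising the posterior probability [24]. Given $z \sim N(\mu, 1/\beta)$ with conjugate priors $\mu \sim N(\mu_0, 1/\sigma_0^2)$ and $\sigma^2 = 1/\beta \sim IG(a, b)$, the log-posterior is

    $$\log P(\theta \mid z, \alpha) \propto \log L(z \mid \mu, \sigma^2) + \log IG(\sigma^2 \mid a, b) + \log N(\mu \mid \mu_0, 1/\sigma_0^2)$$

    with, up to additive constants,

    $$\log L(z \mid \mu, \sigma^2) \propto \frac{n}{2}\log\beta - \sum_{z_j \in z^-} \frac{\beta (z_j - \mu)^2}{2}$$

    $$\log IG(\sigma^2 \mid a, b) \propto a \log b + (a + 1)\log\beta - b\beta$$

    $$\log N(\mu \mid \mu_0, 1/\sigma_0^2) \propto -\frac{\sigma_0^2 (\mu - \mu_0)^2}{2}$$

    where $\theta = (\mu, \beta)$ and $\alpha = (\mu_0, \sigma_0^2, a, b)$ collects the hyperparameters (not to be confused with the critical tail probability in step 2), and $n$ is the number of expressions in the tight cluster for the current iteration. For simplicity, we set $\mu_0 = \mathrm{med}(z)$ and $a = 1$, and set $b$ to the maximum variance of expressions calculated gene by gene. To simplify the notation, we let $\beta_0 = \sigma_0^2$. $\beta_0$ is updated recursively, with its initial value set to $\beta_0^{(1)} = 0.1$. The maximum a posteriori probability procedure then gives the updates

    $$\mu = \frac{\beta \sum_j z_j + \beta_0 \mu_0}{\beta n + \beta_0}; \qquad \frac{1}{\beta} = \frac{\sum_j (z_j - \mu)^2 + 2b}{n + 2a + 2}; \qquad \frac{1}{\beta_0} = \frac{(\mu - \mu_0)^2/2 + b}{a + 1}$$

    where the sums run over $z_j \in z^-$.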

Repeat 3 and 4 until the first outlier (with the lowest expression) is detected or until all candidate outliers have been classified as non-outliers.

  5. Classification: A gene for which the set $z^+$ is non-empty is classified as a hDEG.

The summary statistic for a gene is taken to be the average of the outlier statistics, $\sum_{j \in z^+} t_j / |z^+|$. We use the average as opposed to the sum of outlier contributions as we prioritise the detection of hDEGs with few outliers.
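The five steps above can be sketched in Python for a single gene as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the threshold $t_\alpha$ is taken to be the one-sided standard normal quantile, the updates of step 4 are iterated a fixed number of times (`n_inner`, a hypothetical parameter), and `b` is passed in directly instead of being computed as the maximum gene-wise variance over a whole data set.

```python
import statistics
from statistics import NormalDist


def dog(x, y, alpha=0.05, a=1.0, b=1.0, beta0=0.1, n_inner=20):
    """Sketch of DOG for one gene; returns (outliers, summary statistic)."""
    z = sorted(list(x) + list(y))
    cutoff = max(x)
    cluster = [v for v in z if v <= cutoff]     # step 1: tight cluster z^-
    candidates = [v for v in z if v > cutoff]   # step 1: candidates z^+, ascending
    mu0 = statistics.median(z)
    # Assumption: t_alpha is the upper-alpha quantile of a standard normal.
    t_alpha = NormalDist().inv_cdf(1.0 - alpha)

    mu, beta = mu0, 1.0
    while candidates:
        n = len(cluster)
        # Step 4: iterate the MAP updates for mu, beta and beta0.
        for _ in range(n_inner):
            mu = (beta * sum(cluster) + beta0 * mu0) / (beta * n + beta0)
            ss = sum((v - mu) ** 2 for v in cluster)
            beta = (n + 2 * a + 2) / (ss + 2 * b)
            beta0 = (a + 1) / ((mu - mu0) ** 2 / 2 + b)
        sigma = (1.0 / beta) ** 0.5
        # Step 2: test the lowest remaining candidate against the cluster.
        if (candidates[0] - mu) / sigma > t_alpha:
            break
        # Step 3: absorb the candidate into the tight cluster.
        cluster.append(candidates.pop(0))

    if not candidates:
        return [], 0.0  # step 5: no outliers, so the gene is not called a hDEG
    t_stats = [(v - mu) / sigma for v in candidates]
    return candidates, sum(t_stats) / len(t_stats)
```

Because the candidates are sorted, the loop stops at the first sample that deviates significantly, and every higher candidate is then reported as an outlier without further testing. The returned summary is the mean of the standardised outlier expressions, matching the averaging described above.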

Remark 1

We allow the hyperparameter $\mu_0$ to be evaluated directly from the dataset. We set $\beta_0^{(1)}$ to be 0.1; $\beta_0$ is then updated iteratively in the algorithm. We desire the tight cluster variance prior to be densely distributed around small values, thus we choose $a = 1$ and $b$ to be the maximum gene sample variance. In practice, we found that a large $b$ and a small $a \le 1$ optimise detection rates.
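As a worked check on how $a$ and $b$ enter the updates of step 4, the sketch below differentiates the log-posterior with respect to $\mu$ and $\beta$. It assumes that the likelihood contributes $\tfrac{n}{2}\log\beta$ and that the inverse-gamma prior contributes $(a+1)\log\beta - b\beta$ (its standard form with $\beta = 1/\sigma^2$); these terms are inferred from the update equations and should be read as an assumption.

```latex
\begin{align*}
\ell(\mu,\beta) &= \frac{n}{2}\log\beta
  - \frac{\beta}{2}\sum_{z_j \in z^-}(z_j-\mu)^2
  + (a+1)\log\beta - b\beta
  - \frac{\beta_0}{2}(\mu-\mu_0)^2 + \text{const},\\[4pt]
\frac{\partial \ell}{\partial \mu} = 0
  &\;\Rightarrow\; \beta\sum_{z_j \in z^-}(z_j-\mu) - \beta_0(\mu-\mu_0) = 0
  \;\Rightarrow\; \mu = \frac{\beta\sum_j z_j + \beta_0\mu_0}{\beta n + \beta_0},\\[4pt]
\frac{\partial \ell}{\partial \beta} = 0
  &\;\Rightarrow\; \frac{n+2a+2}{2\beta} = \frac{1}{2}\sum_{z_j \in z^-}(z_j-\mu)^2 + b
  \;\Rightarrow\; \frac{1}{\beta} = \frac{\sum_j (z_j-\mu)^2 + 2b}{n+2a+2}.
\end{align*}
```

Under this reading, a larger $b$ inflates the estimated cluster variance when $n$ is small, which makes the detection test in step 2 more conservative.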

Remark 2

It is clear that for a finite replicate number, the differences in the mean and variance of the tight cluster between two sequential steps are bounded. Asymptotically, as the sample size increases at each iteration, these differences converge toward zero since the posterior mean and variance converge toward the sample mean and variance and the tight cluster only absorbs probable null samples. This then guarantees asymptotic algorithmic convergence. Convergence of the parameters in step 4 for each iteration follows from standard Bayesian results [25].

Cumulative Matthews correlation coefficient

We compare DOG with COPA, OS, ORT, MOST and LSOSS using the cumulative Matthews correlation coefficient (cMCC), which is the area under the Matthews correlation coefficient (MCC, [26, 27]) over the interval $(0, p^*]$:

$$\bar{\rho} = \int_0^{p^*} \rho_p \, dp,$$

where $p^*$ denotes the upper limit of the critical p-value range (0.01 in Scenario 2) and the MCC $\rho_p$ is defined as:

$$\rho_p = \frac{TP_p \times TN_p - FP_p \times FN_p}{\sqrt{(TP_p + FP_p)(TP_p + FN_p)(TN_p + FP_p)(TN_p + FN_p)}}$$

Here, $TP_p$, $TN_p$, $FP_p$ and $FN_p$ represent the numbers of true positives (true hDEGs), true negatives (true non-hDEGs), false positives and false negatives respectively. These four quantities are determined based on a pre-defined critical p-value, i.e. $p \in (0, p^*]$.
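The cMCC can be approximated numerically as in the following Python sketch. The function names, the trapezoidal rule on an even grid of critical p-values, and the convention of returning an MCC of zero when the denominator vanishes are all assumptions made for illustration; the text does not specify how the integral is evaluated.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def confusion(p_values, is_hdeg, threshold):
    """Count TP, TN, FP, FN when genes with p <= threshold are called hDEGs."""
    tp = tn = fp = fn = 0
    for p, truth in zip(p_values, is_hdeg):
        called = p <= threshold
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def cumulative_mcc(p_values, is_hdeg, p_max=0.01, steps=100):
    """Trapezoidal approximation to the integral of the MCC over [0, p_max]."""
    grid = [p_max * k / steps for k in range(steps + 1)]
    rho = [mcc(*confusion(p_values, is_hdeg, t)) for t in grid]
    width = p_max / steps
    return sum((rho[k] + rho[k + 1]) / 2 * width for k in range(steps))
```

With this convention, a classifier that separates hDEGs from non-hDEGs perfectly at every threshold in the range attains the maximum value `p_max`, so dividing by `p_max` gives a score on the same scale as the MCC itself.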

Total classification accuracy

The total classification accuracy is defined as

$$\frac{TN_p + TP_p}{TN_p + FN_p + TP_p + FP_p}$$

where T P p , T N p , F P p and F N p have been defined above.

Receiver operating characteristic (ROC) analysis

Receiver operating characteristic (ROC) [28] analysis has been used widely in outlier detection [11-13] for evaluating a classification model when varying the classification threshold, thus it is a useful tool for analysing the robustness of a classifier. As the threshold varies, the sensitivity $TP_p / (TP_p + FN_p)$ and the false positive rate $1 - TN_p / (TN_p + FP_p)$ change accordingly. The ROC curve is then generated by linking all the pairs of false positive rates and sensitivities corresponding to a set of thresholds. The ROC curve of a desirable classifier is close to the top-left corner. In particular, we limit the false positive rate to at most 5%, as rates above this correspond to critical p values that are too large to be of practical relevance. We also calculate the area under a ROC curve (AUC) for quantitative evaluation. An AUC value close to 1 indicates a good classifier. As we truncate the false positive rate at an upper limit of 5%, we scale the AUC by this limit so that the best possible partial AUC value is one.
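The truncated and rescaled AUC described above can be sketched in Python as follows. The function names are hypothetical, thresholds are taken to be the distinct observed p-values, and adjacent ROC points are joined linearly; these are assumptions about details the text leaves open.

```python
def roc_curve(p_values, is_hdeg):
    """(false positive rate, sensitivity) pairs, from strict to lenient thresholds.

    Assumes both classes are present (at least one hDEG and one non-hDEG).
    """
    pos = sum(1 for t in is_hdeg if t)
    neg = len(is_hdeg) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(p_values)):
        tp = sum(1 for p, t in zip(p_values, is_hdeg) if p <= thr and t)
        fp = sum(1 for p, t in zip(p_values, is_hdeg) if p <= thr and not t)
        points.append((fp / neg, tp / pos))
    return points


def partial_auc(points, max_fpr=0.05):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        if f0 >= max_fpr:
            break
        if f1 > max_fpr:
            # Interpolate the segment at the truncation point.
            t1 = t0 + (t1 - t0) * (max_fpr - f0) / (f1 - f0)
            f1 = max_fpr
        area += (f1 - f0) * (t0 + t1) / 2  # trapezoid between adjacent points
    return area / max_fpr
```

Dividing by `max_fpr` is what makes a perfect classifier score exactly one on the truncated range, which is the scaling referred to in the text.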

Allowing control samples to contain outliers

In order for DOG to detect hDEGs when outliers are present in control samples, we can modify it slightly. Rather than using $z^- = \{z_j \in z \mid z_j \le \max(x)\}$ in the first step of the algorithm, we can instead use the $r$th (default 90th) percentile of the control samples as the separation between samples belonging to the tight cluster and candidate outliers. Denoting this percentile of the control samples by $\varsigma$, the selection of $z^-$ becomes $z^- = \{z_j \in z \mid z_j \le \varsigma\}$. In practice, the $r$th percentile can be specified subjectively by the modeller.
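A minimal Python sketch of this modified first step is given below. The function name and the linear-interpolation percentile rule are assumptions; only the use of the $r$th control percentile as the cutoff comes from the text.

```python
def split_by_control_percentile(x, y, r=90):
    """Split pooled samples at the r-th percentile of the controls.

    Returns (cutoff, tight_cluster, candidates), with candidates ascending.
    """
    s = sorted(x)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    cutoff = s[lo] + (s[hi] - s[lo]) * (pos - lo)  # interpolated percentile
    z = sorted(list(x) + list(y))
    cluster = [v for v in z if v <= cutoff]
    candidates = [v for v in z if v > cutoff]
    return cutoff, cluster, candidates
```

Note that with this split a highly expressed control sample lands in the candidate list alongside the experimental outliers, so it can be tested against the tight cluster in the same way.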

Significance analysis for real data

Existing literature on algorithms such as COPA, OS and ORT typically omits statistical significance when analysing real data. Here we propose a simple method for significance analysis. We assume that control samples contain no outliers. For each algorithm, we create new control and experimental replicates of a gene under the null hypothesis by sampling with replacement from only the control expressions of that gene. This is repeated 100 times to augment the set of null control and experimental samples. The null t statistics are then calculated for all genes. The p value for each gene is then calculated as the proportion of null statistics across all genes that exceed its observed t statistic.
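The resampling scheme can be sketched in Python as follows. The function names and the toy statistic used in the usage note are hypothetical; `statistic` stands for whichever algorithm's test statistic is being assessed, supplied as a function of the control and experimental samples.

```python
import random


def null_statistics(controls_by_gene, n_exp, statistic, n_rep=100, seed=0):
    """Null statistics from resampling each gene's control expressions.

    For every gene, both the null control and the null experimental samples
    are drawn with replacement from that gene's controls only.
    """
    rng = random.Random(seed)
    null = []
    for x in controls_by_gene:
        for _ in range(n_rep):
            x_null = rng.choices(x, k=len(x))
            y_null = rng.choices(x, k=n_exp)
            null.append(statistic(x_null, y_null))
    return null


def p_value(observed, null):
    """Proportion of null statistics (across all genes) exceeding `observed`."""
    return sum(1 for t in null if t > observed) / len(null)
```

As a usage example, passing `statistic=lambda x, y: max(y) - max(x)` (a toy statistic chosen only for illustration) for two genes with `n_rep=100` yields 200 pooled null values, against which each gene's observed statistic is then ranked.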

Experimental design

We first look at two simulated scenarios for comparing the algorithms. For both scenarios, the tight cluster of control samples and non-outlier experimental samples are drawn randomly from a Gaussian distribution with a mean of ten and a standard deviation of one. Both control and experimental categories have 30 replicates. The outliers are generated by adding distances to the maximum expression of the tight cluster. The distances are called marginal null-outlier distances in that such a distance measures the gap between the tight cluster and the first outlier which is closest to the tight cluster. The marginal null-outlier distances are sampled from a Gaussian distribution centred at two with a standard deviation of 0.2. Similar to examples seen in [10], we generate 10,000 non-DEGs, which gives us 10,000 null t statistics and corresponding p-values for the hDEGs. This approach is applied to each algorithm. All simulations are repeated 100 times. In the first scenario, we evaluate the algorithms for a single hDEG. In addition, we vary the number of outliers from one to nine. In the second scenario, we generate 50 non-DEGs and 50 hDEGs and vary the number of outliers from one to five. We also look at extensions of the single-hDEG experiment for testing DOG with regard to deviations from the model assumptions.

We then apply the algorithms to the histological breast cancer dataset (GDS3139 [29]), which was downloaded from the Gene Expression Omnibus (GEO). It contains 22,283 genes for 14 breast cancer patients and 15 non-cancer women. The age of the non-cancer women was matched with that of the cancer patients.

For evaluation and comparison of algorithms, we use the cumulative Matthews correlation coefficient (cMCC) and the total classification accuracy (with a critical p-value threshold of 0.01). We also carry out receiver operating characteristic (ROC) analysis [28] for variable critical p-value thresholds. Details of cMCC and ROC analyses have been given above.
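The simulation of a single gene can be sketched in Python as follows. The function name is hypothetical, and one detail is an assumption: when several outliers are requested, each receives its own independently drawn distance above the maximum of the tight cluster, since the text does not state how the additional outliers are placed.

```python
import random


def simulate_gene(n_outliers, n_rep=30, rng=None):
    """Simulate one gene: returns (controls, experimental samples).

    The last `n_outliers` experimental samples are the outliers.
    """
    rng = rng or random.Random()
    # Tight cluster: controls and non-outlier experimental samples ~ N(10, 1).
    x = [rng.gauss(10.0, 1.0) for _ in range(n_rep)]
    y = [rng.gauss(10.0, 1.0) for _ in range(n_rep - n_outliers)]
    top = max(x + y)
    # Outliers: maximum of the tight cluster plus a distance ~ N(2, 0.2).
    outliers = [top + rng.gauss(2.0, 0.2) for _ in range(n_outliers)]
    return x, y + outliers
```

Calling `simulate_gene(0)` gives a non-DEG, so repeated calls with zero outliers can serve as the pool of null genes from which null statistics are computed.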


  1. Ebina M, Martínez A, Birrer M, Linnoila R: In situ detection of unexpected patterns of mutant p53 gene expression in non-small cell lung cancers. Oncogene 2001, 20: 2579-2586. 10.1038/sj.onc.1204351
  2. Ezzat S, Smyth H, Ramyar L, Asa S: Heterogenous in vivo and in vitro expression of basic fibroblast growth factor by human pituitary adenomas. J Clin Endocrinol Metab 1995, 80: 878-884. 10.1210/jc.80.3.878
  3. Hess G, Rose P, Gamm H, Papadileris S, Huber C, Seliger B: Molecular analysis of the erythropoietin receptor system in patients with polycythaemia vera. Br J Haematol 1994, 88: 794-802. 10.1111/j.1365-2141.1994.tb05119.x
  4. Knaust E, Porwit-MacDonald A, Gruber A, Xu D, Peterson C: Heterogeneity of isolated mononuclear cells from patients with acute myeloid leukemia affects cellular accumulation and efflux of daunorubicin. Haematologica 2000, 85(2): 124-132.
  5. Miyachi H, Takemura Y, Yonekura S, Komatsuda M, Nagao T, Arimori S, Ando Y, et al.: MDR1 (multidrug resistance) gene expression in adult acute leukemia: correlations with blast phenotype. Int J Hematol 1993, 57: 31-37.
  6. Nakayama T, Watanabe M, Suzuki H, Toyota M, Sekita N, Hirokawa Y, Mizokami A, Ito H, Yatani R, Shiraishi T: Epigenetic regulation of androgen receptor gene expression in human prostate cancers. Lab Invest 2000, 80: 1789-1796. 10.1038/labinvest.3780190
  7. Suzuki M, Hurd Y, Sokoloff P, Schwartz J, Sedvall G: D3 dopamine receptor mRNA is widely expressed in the human brain. Brain Res 1998, 779: 58-74. 10.1016/S0006-8993(97)01078-0
  8. Wani G, Wani A, MD’Ambrosio S, et al.: Cell type-specific expression of the O6-alkylguanine-DNA alkyltransferase gene in normal human liver tissues as revealed by in situ hybridization. Carcinogenesis 1993, 14: 737-741. 10.1093/carcin/14.4.737
  9. Tomlins S, Rhodes D, Perner S, Dhanasekaran S, Mehra R, Sun X, Varambally S, Cao X, Tchinda J, Kuefer R, et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310: 644-648. 10.1126/science.1117679
  10. Tibshirani R, Hastie T: Outlier sums for differential gene expression analysis. Biostatistics 2007, 8: 2-8. 10.1093/biostatistics/kxl005
  11. Wu B: Cancer outlier differential gene expression detection. Biostatistics 2007, 8: 566-575.
  12. Lian H: MOST: detecting cancer differential gene expression. Biostatistics 2008, 9: 411-418.
  13. Wang Y, Rekaya R: LSOSS: detection of cancer outlier differential gene expression. Biomarker Insights 2010, 5: 69-78.
  14. Boverhof D, Burgoon L, Williams K, Zacharewski T: Inhibition of estrogen-mediated uterine gene expression responses by dioxin. Mol Pharmacol 2008, 73: 82-93.
  15. Cattaneo M, Lotti L, Martino S, Cardano M, Orlandi R, Mariani-Costantini R, Biunno I: Functional characterization of two secreted SEL1L isoforms capable of exporting unassembled substrate. J Biol Chem 2009, 284: 11405-11415.
  16. Hensen E, De Herdt M, Goeman J, Oosting J, Smit V, Cornelisse C, De Jong R: Gene-expression of metastasized versus non-metastasized primary head and neck squamous cell carcinomas: a pathway-based analysis. BMC Cancer 2008, 8: 168. 10.1186/1471-2407-8-168
  17. Hoque M, Kim M, Ostrow K, Liu J, Wisman G, Park H, Poeta M, Jeronimo C, Henrique R, Lendvai Á, et al.: Genome-wide promoter analysis uncovers portions of the cancer methylome. Cancer Res 2008, 68: 2661-2670. 10.1158/0008-5472.CAN-07-5913
  18. Iwao-Koizumi K, Matoba R, Ueno N, Kim S, Ando A, Miyoshi Y, Maeda E, Noguchi S, Kato K: Prediction of docetaxel response in human breast cancer by gene expression profiling. J Clin Oncol 2005, 23: 422-431.
  19. Missiaglia E, Blaveri E, Terris B, Wang Y, Costello E, Neoptolemos J, Crnogorac-Jurcevic T, Lemoine N: Analysis of gene expression in cancer cell lines identifies candidate markers for pancreatic tumorigenesis and metastasis. Int J Cancer 2004, 112: 100-112. 10.1002/ijc.20376
  20. Smeets A, Daemen A, Vanden Bempt I, Gevaert O, Claes B, Wildiers H, Drijkoningen R, Van Hummelen P, Lambrechts D, De Moor B, et al.: Prediction of lymph node involvement in breast cancer from primary tumor tissue using gene expression profiling and miRNAs. Breast Cancer Res Treat 2011, 129: 767-776. 10.1007/s10549-010-1265-5
  21. Smid M, Wang Y, Klijn J, Sieuwerts A, Zhang Y, Atkins D, Martens J, Foekens J: Genes associated with breast cancer metastatic to bone. J Clin Oncol 2006, 24: 2261-2267. 10.1200/JCO.2005.03.8802
  22. Sun P, Gao L, Han S: Prediction of human disease-related gene clusters by clustering analysis. Int J Biol Sci 2011, 7: 61-73.
  23. Sun C, Huo D, Southard C, Nemesure B, Hennis A, Cristina Leske M, Wu S, Witonsky D, Di Rienzo A, Olopade O: A signature of balancing selection in the region upstream to the human UGT2B4 gene and implications for breast cancer risk. Human Genet 2011, 130: 767-75. 10.1007/s00439-011-1025-6
  24. Bernardo J, Smith A, Berliner M: Bayesian Theory. New York: Wiley; 1994.
  25. Bishop C: Pattern Recognition and Machine Learning. New York: Springer; 2006.
  26. Matthews B, et al.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 1975, 405: 442-451. 10.1016/0005-2795(75)90109-9
  27. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412-424. 10.1093/bioinformatics/16.5.412
  28. McNeil H, Barbara J: The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143: 29-36.
  29. Tripathi A, King C, de la Morenas A, Perry V, Burke B, Antoine G, Hirsch E, Kavanah M, Mendez J, Stone M, et al.: Gene expression abnormalities in histologically normal breast epithelium of breast cancer patients. Int J Cancer 2008, 122: 1557-1566.


Author information



Corresponding author

Correspondence to Zihua Yang.

Additional information

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

ZRY and ZHY designed the algorithm. ZRY implemented the algorithm. ZHY analysed the algorithm on the conceived simulated examples. ZRY acquired the dataset from GEO and analysed the algorithm on the real dataset. ZRY and ZHY wrote the paper. Both authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Yang, Z., Yang, Z. Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster. BMC Bioinformatics 14, 81 (2013).



  • Cancer
  • Outlier
  • Differentially expressed genes
  • Microarray