Prediction of heterogeneous differential genes by detecting outliers to a Gaussian tight cluster

Background Heterogeneously and differentially expressed genes (hDEG) are a common phenomenon due to bio-logical diversity. A hDEG is often observed in gene expression experiments (with two experimental conditions) where it is highly expressed in a few experimental samples, or in drug trial experiments for cancer studies with drug resistance heterogeneity among the disease group. These highly expressed samples are called outliers. Accurate detection of outliers among hDEGs is then desirable for dis- ease diagnosis and effective drug design. The standard approach for detecting hDEGs is to choose the appropriate subset of outliers to represent the experimental group. However, existing methods typically overlook hDEGs with very few outliers. Results We present in this paper a simple algorithm for detecting hDEGs by sequentially testing for potential outliers with respect to a tight cluster of non- outliers, among an ordered subset of the experimental samples. This avoids making any restrictive assumptions about how the outliers are distributed. We use simulated and real data to illustrate that the proposed algorithm achieves a good separation between the tight cluster of low expressions and the outliers for hDEGs. Conclusions The proposed algorithm assesses each potential outlier in relation to the cluster of potential outliers without making explicit assumptions about the outlier distribution. Simulated examples and and breast cancer data sets are used to illustrate the suitability of the proposed algorithm for identifying hDEGs with small numbers of outliers.


Background
A heterogeneously and differentially expressed gene (hDEG) is a gene which has an inconsistent expression pattern across its experimental samples. Typically, a large proportion of the experimental samples and the control samples form a tight cluster in low expressions. The remaining small proportion of experimental samples, namely the outliers, are observed to significantly deviate from the tight cluster towards high expressions. We use the word 'tight' to describe the cluster of null (or low) expressions of a hDEG as the null variance is typically small compared to the null-outlier distance. In situations where the few highly expressed outliers of a non-differential gene are caused by measurement error, it *Correspondence: z.h.yang@qmul.ac.uk 1 Wolfson Institute for Preventive Medicine, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK Full list of author information is available at the end of the article is also useful to distinguish such genes with hDEG characteristics. The existence of hDEGs has been established in various experiments ( [1][2][3][4][5][6][7][8]).
Suppose we have the expressions of m genes. The standard t statistic under-estimates the significance in testing the difference across the control and experimental samples of a hDEG. COPA (cancer profile outlier analysis) [9] proposed modifying the Student t statistic to be a ratio of the distance between the rth (default 9th) percentile of experimental samples and the median of all samples over the median absolute distance (deviated from the whole sample median), i.e., where σ i = 1.4826 × med(x i − λ i , y i − λ i ), x i and y i represent control samples and experimental samples of the ith gene respectively, q r (y i ) is the rth percentile of y i and λ i is the median of both x i and y i . http://www.biomedcentral.com/1471-2105/14/81 The quantile-median difference in (1) summarises the null-outlier distance using a single value of y i . To make outlier detection more efficient, the outlier-sum (OS) statistic [10] sums over outliers, where the outliers are defined as {y ∈ y i : Outlier robust t statistic (ORT) uses the same statistic but defines the outliers in relation to the control samples only {y ∈ y i : y > q 75 (x i ) + IQR(x i )} [11]. Maximum ordered subset t statistic (MOST) defines the outliers to be the top k experimental samples and chooses k by optimising a normalised t statistic [12]. The least sum of ordered subset square t statistic (LSOSS) [13] also compares the controls with a subset of the top k experimental i is the mean of top k experimental samples and S i is the pooled standard deviation of the set of control samples plus non-outlier experimental samples and the set of outlier experimental samples. k is optimised iteratively to minimise the within-cluster variance.
We propose a new algorithm for detecting hDEGs with a small number of outliers by detecting outliers via gap (DOG) maximisation. What makes this approach different from the existing methods is that we assess each potential outlier in relation to a tight cluster of non-outliers. This avoids modelling the highly expressed outliers explicitly. This is especially important when the number of outliers is small. The proposed algorithm classifies each gene as a hDEG or non-hDEG by locating potential outliers and summarises it using the average of the standardised outlier expressions. We will use simulated examples and a breast cancer dataset to illustrate the effectiveness of the proposed algorithm in detecting hDEGs with few outliers. We will also show how effective test algorithms are when varying conditions.

Simulated examples Scenario 1 -identification of a single hDEG
The algorithms are compared for the detection of a single hDEG with the number of outliers varied from one to nine. The results are summarised in Table 1. For a small number of outliers, COPA, MOST and LSOSS demonstrated relatively poor performances while DOG consistently gave significant p-values.

Scenario 2 -identification of multiple hDEGs (100 genes with 50 hDEGs)
Over a critical p-value range from 0 to 0.01, DOG demonstrated the highest average cumulative Matthews correlation coefficient (cMCC, see Methods for more detail) across five sets of simulations with one to five outliers - Figure 1. Table 2 shows that DOG had very high classification rates compared with the other five algorithms. When the number of outliers exceeded two, OS, ORT and LSOSS gave more reasonable classification rates. COPA and MOST gave poor predictions overall. Figure 2 shows the ROC curves for the one-outlier simulations, it can be seen that DOG had a superior ROC curve with an partial AUC value of 1. Figure 3 illustrates the same ROC curves oover the complete range of false positive rate, COPA and LSOSS remained poor. We also found that as the number of outliers increased to five, most algorithms worked well with the exception of COPA.

Further simulated examples
We look at the sensitivitiy of DOG with respect to changes in certain assumptions and parameters.

Variable marginal null-outlier distance
We revisit the single-hDEG simulation but vary the marginal null-outlier distance (defined in Experimental design of Methods) from 0.5 to 2 with increments of 0.1 - Table 3. DOG's p-values increased for a reduced marginal null-outlier distance but retained the most significant mean p-values for larger marginal null-outlier distances. MOST and LSOSS failed to detect the hDEG. DOG gave accurate estimates of the outlier number when the null-outlier distance was greater than one.

Non-Gaussian tight cluster
We simulated a Gaussian-mixture tight cluster (0.5N (9, 1) + 0.5N (10, 1)) to examine how DOG is  affected by non-Gaussianity in the tight cluster. All other parameters were kept the same as those used in the single-hDEG simulation. The results were very similar to those seen previously - Table 4. In particular, the performances of COPA, OS and ORT have improved for the simulated non-Gaussian tight cluster.   5N (9, 1) + 0.5N (10, 1)) distributed tight cluster. M is the average number of outliers detected using DOG.

Control samples containing outliers
DOG can be modified to enable the detection of hDEGs when control samples contain outliers (see '' Allowing control samples to contain outliers of Methods. We illustrate this using the single-hDEG example with one outlier added to the control samples - Table 5. It can be seen that DOG accurately detected the outliers from both control and experimental samples. MOST and LSOSS failed to detect the hDEG.  Figure 4), is that they contain a few highly expressed outliers. Figure 5 shows the top 25 predictions of hDEGs using DOG for this data set. Existing literature have established these genes to be of biological relevance to the progression and treatment of breast cancer ( [14][15][16][17][18][19][20][21][22][23]).

Breast cancer data
Most other algorithms chose genes with a reasonably large pool of differentially expressed experimental samples expressed at a more moderate level. LSOSS also generally favoured ordinary DEGs. MOST chose a set of top four genes with only one or two moderately expressed outliers. Table 6 shows how the top 100 predictions of these algorithms overlap -COPA and OS are most similar in their rankings whilst DOG has a maximum of 15% overlap with OS. Using the ordered log 2 expressions of each algorithm's unique top 100 genes, Figure 6 illustrates the median expressions minus the minimum expressions for each experimental sample index. The unique top 100 genes for DOG and COPA showed the largest change across their experimental samples, their difference being that COPA favoured hDEGs with a larger number of outliers whilst DOG picked out hDEGs with small numbers of outliers.
Using the significance analysis approach discussed in ''Significance analysis for real data of Methods, we estimated p values from sampling the replicates which then give us alternative p values based rankings of the genes.  We also found the top four predictions ranked using the p values of DOG to be near identical to those ranked using its t statistics, though there were discrepancies in rankings for the lower ranking genes. Similar results were observed for the remaining five algorithms.

Conclusions
The difficulty in identifying hDEGs arises from the fact that only a small number of experimental samples are highly expressed at a much higher level than the nonoutliers. As a result, various modified t tests target the subset of potential outliers which are then tested against the control group. In practice, for hDEGs with very few outliers, we found that these algorithms often identify hDEGs with insignificant deviations between the outliers and the tight cluster of non-outliers. Based on this observation, the proposed algorithm assesses each potential outlier in relation to the Gaussian tight cluster without making an explicit assumption about the outlier distribution. At each step, we update the posterior mean and variance of the tight cluster which are then used to evaluate the probability of an outlier being a random sample of http://www.biomedcentral.com/1471-2105/14/81 the tight cluster. Examples of simulated and breast cancer data sets verify the suitability of the proposed algorithm in identifying hDEGs with small numbers of outliers. An extension of the algorithm which fully takes into account gene correlations will be presented in future work. For the breast cancer data, we found negligible correlations across the top ranking genes and very low correlations among the less significant genes.

Methods
The proposed algorithm can be briefly summarised as follows. We first take the list of candidate outliers to be those experimental samples whose expressions are larger than the maximum expression of control samples. For the situation when control samples also contain outliers, see section '' Allowing control samples to contain outliers for a description of the necessary extension. The samples in the candidate list are sorted in an ascending order. The algorithm then updates the tight cluster of non-outliers by testing sequentially the samples in the updated candidate list of outliers. The test is terminated when a significant deviation between a candidate sample and the tight cluster is detected. We now give the steps in more statistical detail.
First, let us introduce some notation. Let x denote the control samples and y the experimental samples of a gene or a probe set (we drop the gene subscript i for simplicity). The proposed DOG algorithm has the following steps: 1. Candidate outlier: Given the union of x and y, z ≡ x ∪ y, we divide z into the candidate outlier set where ⇑ sorts the elements of a set in an ascending order. 2. Detection: Given a critical tail probability α and the corresponding threshold t α [24]. The first element in z + , z + 1 , is classified as the first outlier if in which case the algorithm terminates and z + is the set of outliers. We use a default value of α = 0.05. The parameters μ and σ 2 are posterior mean and posterior variance derived of the tight cluster. Details of estimating μ and σ are given below. 3. Absorption: On the other hand if t ≤ t α , we move z + 1 to the tight cluster of non-outliers, z − ← z − ∪ z + 1 and z + ← z + \z + 1 . 4. Estimating the parameters of the tight cluster: The parameters μ and β = σ −2 are updated using iterative Bayesian learning, i.e., by maximising the posterior probability [24]. Given z ∼ N (μ, 1/β) with conjugate priors μ ∼ N (μ 0 , 1/σ 2 0 ) and σ 2 = 1/β ∼ IG(a, b), the log-posterior is and θ = (μ, β) and α = (μ 0 , σ 2 0 , a, b). Suppose n is the number of expressions in the tight cluster for the current iteration. For simplicity, we set μ 0 = med(z − ), a = 1, b is set to be the maximum variance of expressions calculated gene by gene. To http://www.biomedcentral.com/1471-2105/14/81 simplify the notation, we let β 0 = σ −2 0 . β 0 is updated recursively but we set its initial value to be β (1) The maximum a posteriori probability procedure then gives the updates Repeat 3 and 4 until the first outlier (with the lowest expression) is detected or until all candidate outliers have been classified as non-outliers. 5. Classification: A gene for which the set z + is non-empty is classified as a hDEG.
The summary statistic for a gene is taken to be the average of the outlier statistics j∈z + t j /|z + |. We use the average as opposed to the sum of outlier contributions as we prioritise the detection of hDEGs with few outliers.

Remark 1.
We allow the hyperparameters μ 0 to be evaluated directly from the dataset. We set β (1) 0 to be 0.1, β 0 is then updated iteratively in the algorithm.
We desire the tight cluster variance prior to be densely distributed around the small values, thus we choose a = 1 and b to be the maximum gene sample variance. In practice, we found that a large b and a small a ≤ 1 optimise detection rates.

Remark 2.
It is clear that for a finite replicate number, the difference in mean and variance of the tight cluster at two sequential steps are bounded. Asymptotically, as the sample size increases at each iteration, these differences converge toward zero since the posterior mean and variance converge toward the sample mean and variance and the tight cluster only absorbs probable null samples. This then guarantees asymptotic algorithmic convergence. Convergence of parameters in step 4 for each iteration follow from standard Bayesian results [25].

Cumulative Matthews correlation coefficient
We compare COPA, OS, ORT, MOST and LSOSS using the cumulative Matthews correlation coefficient (cMCC) which is the area under Matthews correlation coefficient (MCC, [26,27]) in the interval [ 0, p * ]: the MCC ρ p is defined as: Here, TP p , TN p , FP p and FN p represent the numbers of true positives (true hDEGs), true negatives (true non-hDEGs), false positives and false negatives respectively. These four quantities are determined based on a predefined critical p-value, i.e. p ∈ (0, p ].

Total classification accuracy
The total classification accuracy is defined as where TP p , TN p , FP p and FN p have been defined above.

Receiver operating characteristic (ROC) analysis
Receiver Operating Characteristic (ROC) [28] analysis has been used widely in outlier detection [11][12][13] for evaluating a classification model when varying the classification threshold, thus it is a useful tool for analysing the robustness of a classifier. As the threshold varies, the sensitivity TP p TP p +FN p and the false positive rate 1 − TN p TN p +FP p change accordingly. The ROC curve is then generated by linking all the pairs of false positive rates and sensitivities corresponding to a set of thresholds. The ROC curve of a desirable classifier is close to the top-left corner. In particular, we limit the false positive rate to less and equal to 5% as rates above this correspond to critical p values that are too large to be of practical relevance. We also calculate the area under a ROC curve (AUC) for quantitative evaluation. A large AUC value of close to 1 indicates a good classifier. As we truncate the false positive rate at an upper limit of 5%, we scale the AUC by this limit so that the best possible partial AUC value is one.

Allowing control samples to contain outliers
In order for DOG to detect hDEGs when outliers are present in control samples, we can modify it slightly. Rather than using z − j = {z − j ∈ z|z − j ≤ max(x)} in the first step of the algorithm, we can use instead the r th (default is 90 th ) percentile of the control samples as the separation between samples belonging to the tight cluster and candidate outliers. Suppose the 90 th percentile of the control samples is denoted by ς , the selection of z − j now follows In practice, the rth percentile can be specified subjectively by the modeller.

Significance analysis for real data
Existing literature on algorithms such as COPA, OS and ORT typically omits statistical significance when analysing real data. Here we propose a simple method for significance analysis. We assume that control samples contain no outliers. For each algorithm, we create new control and experimental replicates of a gene under the null hypothesis by sampling with replacement from only http://www.biomedcentral.com/1471-2105/14/81 the control expressions of that gene. This is repeated 100 times to augment the set of null control and experimental samples. The null t statistics are then calculated for all genes. The p value for each gene is then calculated as the proportion of null statistics across all genes that exceed its observed t statistic.

Experimental design
We first look at two simulated scenarios for comparing the algorithms. For both scenarios, the tight cluster of control samples and non-outlier experimental samples are drawn randomly from a Gaussian distribution with a mean of ten and a standard deviation of one. Both control and experimental categories have 30 replicates. The outliers are generated by adding distances to the maximum expression of the tight cluster. The distances are called marginal nulloutlier distances in that such a distance measures the gap between the tight cluster and the first outlier which is closest to the tight cluster. The marginal oull-outlier distances are sampled from a Gaussian distribution centered at two and with a standard deviation 0.2. Similar to examples seen in [10], we generate 10,000 non-DEGs which gives us 10,000 null t statistics and corresponding p-values for the hDEGs. This approach is applied to each algorithm. All simulations are repeated 100 times.
In the first scenario, we evaluate the algorithms for a single hDEG. In addition, we vary the number of outliers from one to nine. In the second scenario, we generate 50 non-DEGs and 50 hDEGs and vary the number of outliers from one to five. We also look at extensions of the single-hDEG experiment for testing DOG with regard to deviations from the model assumptions.
We then apply the algorithms to the histological breast cancer dataset (GDS3139 - [29]) which was downloaded from the gene expression omnibus (GEO, http://www. ncbi.nlm.nih.gov/geo). It contains 22,283 genes for 14 breast cancer patients and 15 non-cancer women. The age of non-cancer women was matched with that of cancer patients.
For evaluation and comparison of algorithms, we use the cumulative Matthews correlation coefficient (cMCC) and the total classification accuracy (with a critical p-value threshold of 0.01). We also carry out receiver operating characteristic (ROC) analysis [28] for variable critical pvalue thresholds. Details of cMCC and ROC analyses have been given above.