The proposed algorithm can be briefly summarised as follows. We first take the list of candidate outliers to be those experimental samples whose expressions are larger than the maximum expression of control samples. For the situation when control samples also contain outliers, see section ‘’Allowing control samples to contain outliers for a description of the necessary extension. The samples in the candidate list are sorted in an ascending order. The algorithm then updates the tight cluster of non-outliers by testing sequentially the samples in the updated candidate list of outliers. The test is terminated when a significant deviation between a candidate sample and the tight cluster is detected. We now give the steps in more statistical detail. First, let us introduce some notation. Let x
denote the control samples and y
the experimental samples of a gene or a probe set (we drop the gene subscript i
for simplicity). The proposed DOG algorithm has the following steps:
Candidate outlier: Given the union of x and y, z≡x∪y, we divide z into the candidate outlier set z
j+> max(x)} and the non-outlier set
where ⇑ sorts the elements of a set in an ascending order.
: Given a critical tail probability α
and the corresponding threshold t
]. The first element in z
, is classified as the first outlier if
in which case the algorithm terminates and z
+ is the set of outliers. We use a default value of α=0.05. The parameters μ and σ
2 are posterior mean and posterior variance derived of the tight cluster. Details of estimating μ and σ are given below.
Absorption: On the other hand if t≤t
, we move z1+ to the tight cluster of non-outliers, z
−∪z1+ and z
Estimating the parameters of the tight cluster
: The parameters μ
are updated using iterative Bayesian learning, i.e.
, by maximising the posterior probability
with conjugate priors
), the log-posterior is
. Suppose n
is the number of expressions in the tight cluster for the current iteration. For simplicity, we set μ
is set to be the maximum variance of expressions calculated gene by gene. To simplify the notation, we let
is updated recursively but we set its initial value to be
. The maximum a posteriori
probability procedure then gives the updates
Repeat 3 and 4 until the first outlier (with the lowest expression) is detected or until all candidate outliers have been classified as non-outliers.
Classification: A gene for which the set z
+ is non-empty is classified as a hDEG.
The summary statistic for a gene is taken to be the average of the outlier statistics
. We use the average as opposed to the sum of outlier contributions as we prioritise the detection of hDEGs with few outliers.
We allow the hyperparameters μ
0 to be evaluated directly from the dataset. We set
to be 0.1, β
0 is then updated iteratively in the algorithm. We desire the tight cluster variance prior to be densely distributed around the small values, thus we choose a=1 and b to be the maximum gene sample variance. In practice, we found that a large b and a small a≤1 optimise detection rates.
It is clear that for a finite replicate number, the difference in mean and variance of the tight cluster at two sequential steps are bounded. Asymptotically, as the sample size increases at each iteration, these differences converge toward zero since the posterior mean and variance converge toward the sample mean and variance and the tight cluster only absorbs probable null samples. This then guarantees asymptotic algorithmic convergence. Convergence of parameters in step 4 for each iteration follow from standard Bayesian results
Cumulative Matthews correlation coefficient
We compare COPA, OS, ORT, MOST and LSOSS using the cumulative Matthews correlation coefficient (cMCC) which is the area under Matthews correlation coefficient (MCC,
]) in the interval
the MCC ρ
is defined as:
represent the numbers of true positives (true hDEGs), true negatives (true non-hDEGs), false positives and false negatives respectively. These four quantities are determined based on a pre-defined critical p-value, i.e. p∈(0,p
Total classification accuracy
The total classification accuracy is defined as
have been defined above.
Receiver operating characteristic (ROC) analysis
Receiver Operating Characteristic (ROC)
 analysis has been used widely in outlier detection
[11–13] for evaluating a classification model when varying the classification threshold, thus it is a useful tool for analysing the robustness of a classifier. As the threshold varies, the sensitivity
and the false positive rate
change accordingly. The ROC curve is then generated by linking all the pairs of false positive rates and sensitivities corresponding to a set of thresholds. The ROC curve of a desirable classifier is close to the top-left corner. In particular, we limit the false positive rate to less and equal to 5% as rates above this correspond to critical p values that are too large to be of practical relevance. We also calculate the area under a ROC curve (AUC) for quantitative evaluation. A large AUC value of close to 1 indicates a good classifier. As we truncate the false positive rate at an upper limit of 5%, we scale the AUC by this limit so that the best possible partial AUC value is one.
Allowing control samples to contain outliers
In order for DOG to detect hDEGs when outliers are present in control samples, we can modify it slightly. Rather than using
in the first step of the algorithm, we can use instead the r
(default is 90
) percentile of the control samples as the separation between samples belonging to the tight cluster and candidate outliers. Suppose the 90
percentile of the control samples is denoted by ς, the selection of z
j− now follows
. In practice, the rth percentile can be specified subjectively by the modeller.
Significance analysis for real data
Existing literature on algorithms such as COPA, OS and ORT typically omits statistical significance when analysing real data. Here we propose a simple method for significance analysis. We assume that control samples contain no outliers. For each algorithm, we create new control and experimental replicates of a gene under the null hypothesis by sampling with replacement from only the control expressions of that gene. This is repeated 100 times to augment the set of null control and experimental samples. The null t statistics are then calculated for all genes. The p value for each gene is then calculated as the proportion of null statistics across all genes that exceed its observed t statistic.
We first look at two simulated scenarios for comparing the algorithms. For both scenarios, the tight cluster of control samples and non-outlier experimental samples are drawn randomly from a Gaussian distribution with a mean of ten and a standard deviation of one. Both control and experimental categories have 30 replicates. The outliers are generated by adding distances to the maximum expression of the tight cluster. The distances are called marginal null-outlier distances in that such a distance measures the gap between the tight cluster and the first outlier which is closest to the tight cluster. The marginal oull-outlier distances are sampled from a Gaussian distribution centered at two and with a standard deviation 0.2. Similar to examples seen in
, we generate 10,000 non-DEGs which gives us 10,000 null t statistics and corresponding p-values for the hDEGs. This approach is applied to each algorithm. All simulations are repeated 100 times. In the first scenario, we evaluate the algorithms for a single hDEG. In addition, we vary the number of outliers from one to nine. In the second scenario, we generate 50 non-DEGs and 50 hDEGs and vary the number of outliers from one to five. We also look at extensions of the single-hDEG experiment for testing DOG with regard to deviations from the model assumptions. We then apply the algorithms to the histological breast cancer dataset (GDS3139 -
) which was downloaded from the gene expression omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo). It contains 22,283 genes for 14 breast cancer patients and 15 non-cancer women. The age of non-cancer women was matched with that of cancer patients. For evaluation and comparison of algorithms, we use the cumulative Matthews correlation coefficient (cMCC) and the total classification accuracy (with a critical p-value threshold of 0.01). We also carry out receiver operating characteristic (ROC) analysis
 for variable critical p-value thresholds. Details of cMCC and ROC analyses have been given above.