Figure 1 | BMC Bioinformatics

Figure 1

From: Semi-supervised discovery of differential genes

Unsupervised extraction of significant genes. (A) An artificial data set and its generative models. The artificial gene expression data set consists of 6,400 genes for 16 samples: 3,200 inactive genes (groups (1) and (2)) and 3,200 active genes (groups (3) and (4)). Expressions of the inactive genes are generated from a normal distribution with mean 0 and variance 1.2² or 0.8² for gene group (1) or (2), respectively. Because these genes are generated from a single distribution regardless of the sample index, they are in fact inactive. Expressions of the active genes are generated from a normal distribution with mean 1.0 or -1.0 for genes that are highly or lowly expressed in a given sample, respectively, and a common variance of 1.0. The expression pattern differs between groups (3) and (4): group (3) contains eight subgroups of 200 genes, each with a high/low pattern different from those of the other subgroups, whereas in group (4) all 1,600 genes share the same high/low pattern. Thus, the genes in (4) are assumed to reflect a common biology leading to similar expressions over all samples, while (3) consists of eight gene clusters, each reflecting its own biology. (B) Histograms of three unsupervised significance scores, LR-U, ODPp-U, and ODP-U, shown separately for the four gene groups. Horizontal axes are on a log scale. A vertical line denotes the mean of each distribution. (C) ROC (receiver operating characteristic) curves of active gene detection, generated by changing the threshold for each score; the horizontal and vertical axes denote specificity, true negative/(true negative + false positive), and sensitivity, true positive/(true positive + false negative), respectively.
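
The generative model in panel (A) and the axis definitions in panel (C) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, groups (1) and (2) are assumed to hold 1,600 genes each, the inactive-gene variances are read as 1.2² and 0.8² (standard deviations 1.2 and 0.8), the high/low patterns are drawn at random because the caption does not list them, and a higher score is assumed to mean "more likely active". The scores LR-U, ODPp-U, and ODP-U themselves are not defined in the caption and are not implemented here.

```python
import numpy as np

N_SAMPLES = 16
GENES_PER_GROUP = 1600  # ASSUMPTION: groups (1) and (2) are split evenly


def make_artificial_data(seed=0):
    """Draw a 6,400 x 16 expression matrix following the caption of panel (A)."""
    rng = np.random.default_rng(seed)

    # Groups (1) and (2): inactive genes, mean 0 for every sample.
    # ASSUMPTION: standard deviations 1.2 and 0.8 (variances 1.2^2 and 0.8^2).
    g1 = rng.normal(0.0, 1.2, size=(GENES_PER_GROUP, N_SAMPLES))
    g2 = rng.normal(0.0, 0.8, size=(GENES_PER_GROUP, N_SAMPLES))

    # Group (3): eight subgroups of 200 genes, each with its own high/low pattern.
    # ASSUMPTION: patterns are random +1/-1 vectors, redrawn until all eight differ.
    while True:
        patterns = rng.choice([-1.0, 1.0], size=(8, N_SAMPLES))
        if len(np.unique(patterns, axis=0)) == 8:
            break
    means3 = np.repeat(patterns, 200, axis=0)  # shape (1600, 16)
    g3 = rng.normal(means3, 1.0)

    # Group (4): all 1,600 genes share one high/low pattern (also random here).
    pattern4 = rng.choice([-1.0, 1.0], size=N_SAMPLES)
    g4 = rng.normal(pattern4, 1.0, size=(GENES_PER_GROUP, N_SAMPLES))

    X = np.vstack([g1, g2, g3, g4])
    group = np.repeat([1, 2, 3, 4], GENES_PER_GROUP)
    is_active = group >= 3  # groups (3) and (4) are the active genes
    return X, group, is_active


def sensitivity_specificity(scores, is_active, threshold):
    """One ROC point: genes with score >= threshold are called active."""
    scores = np.asarray(scores)
    is_active = np.asarray(is_active, dtype=bool)
    called = scores >= threshold
    tp = np.sum(called & is_active)
    fn = np.sum(~called & is_active)
    tn = np.sum(~called & ~is_active)
    fp = np.sum(called & ~is_active)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

Sweeping `threshold` over the range of any per-gene score and plotting the resulting (specificity, sensitivity) pairs would trace a curve of the kind shown in panel (C).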
