Figure 5 | BMC Bioinformatics

Figure 5

From: Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set

Hypothetical histogram of a binned proximity analysis between genes (the entity) and their nearest CpG islands (the property), with Gene Ontology (GO) categories as the subject of analysis. Because the observed distribution is asymmetric, bimodal and skewed, statistics that rely on the central tendency assumption are not appropriate. The black portion of each line represents the known distribution for all genes, while the white portion represents the distribution for one specific GO category only. A Monte Carlo (MC) stochastic simulation tests the null hypothesis that there is no correlation and that the observed distribution could have arisen by chance. In each of 10,000 iterations, as many genes as are found in the GO category are picked at random according to the observed frequency distribution, and their weighted average is calculated. The fraction of iterations whose weighted average equals or exceeds the observed one gives a probabilistic estimate of how often such a value could arise by chance. Other MC-based statistical tests are possible, such as analyzing the spread of the data, but are not explored in this report.
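
The resampling procedure described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `mc_p_value`, the bin centers, and all counts below are hypothetical, and it assumes that the frequency distribution used for sampling is the all-genes (black) histogram and that the "weighted average" is the count-weighted mean of the bin centers.

```python
import random

def mc_p_value(bin_centers, background_counts, category_counts,
               n_iter=10_000, seed=0):
    """Estimate how often a random gene set matches or exceeds the
    observed weighted average (one-sided Monte Carlo test)."""
    rng = random.Random(seed)
    n_genes = sum(category_counts)  # size of the GO category

    # Observed weighted average: bin centers weighted by category counts.
    observed = sum(c * k for c, k in zip(bin_centers, category_counts)) / n_genes

    hits = 0
    for _ in range(n_iter):
        # Draw n_genes bins at random, weighted by the background histogram
        # (assumption: sampling with replacement from the all-genes counts).
        sample = rng.choices(bin_centers, weights=background_counts, k=n_genes)
        # The mean of the sampled bin centers is the sample's weighted average.
        if sum(sample) / n_genes >= observed:
            hits += 1
    return observed, hits / n_iter

# Hypothetical data: distance bins (bp) to the nearest CpG island.
bin_centers = [500, 1500, 2500, 3500, 4500]
background_counts = [400, 150, 100, 250, 100]  # all genes (made-up values)
category_counts = [5, 5, 10, 40, 20]           # one GO category (made-up values)

observed, p_value = mc_p_value(bin_centers, background_counts, category_counts)
print(f"observed weighted average: {observed}, estimated p-value: {p_value}")
```

A small estimated p-value would suggest that the GO category's weighted average is unlikely to arise from the background distribution by chance. Because the estimate is a proportion over a finite number of iterations, its resolution is limited to 1/`n_iter`.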
