Model profiles overview interface
A screenshot of the main interface window of STEM appears in Figure 3. In this window each box corresponds to one of the model temporal expression profiles. Clicking on a profile box displays a new window, described in the next subsection, with detailed information about the profile. The colored profiles have a statistically significant number of genes assigned. Colored profiles which have the same color are all similar to each other (based on correlation coefficients, see Additional file 1 for more details). These profiles are grouped together to form a cluster of significant profiles. By default profiles on the main window are ordered such that significant profiles appear before non-significant profiles, and among significant profiles those profiles of the same color appear next to each other. The profiles can be reordered based on the number of genes assigned, the number of genes expected, or their significance p-value. Additionally as we discuss below, the profiles can also be reordered based on their relevance to a given GO category (Figure 4), a user defined gene set, or profile(s) from a comparison experiment. When the profiles are reordered relevant information appears in the profile boxes.
The model overview screen is designed such that by default a user can visualize all profiles simultaneously, but as a result each profile box needs to be relatively small. At times however, a user will be interested in focusing on a small subset of neighboring profiles. The interface of STEM supports zooming and panning on any portion of the model profiles overview screen. The ability to zoom and pan is powered by the open source Java libraries of Piccolo [13].
Model profile detailed information interface
Clicking on a profile box on the model profiles overview interface displays a window with detailed information about the profile. Examples of such windows appear in Figure 5. The window contains a graph of the expression patterns for all the genes assigned to the profile, a count of the number of genes assigned, a count of the number of genes expected based on the permutation test, and the profile's p-value. The window also gives the option to display a table with all genes assigned to the profile, or to display a table for a GO enrichment analysis of the set of genes assigned to the profile (Figure 6). If the profile is a part of a non-singleton cluster of significant profiles, then there is also the option to display a table with a GO enrichment analysis for all cluster genes.
Integration with the Gene Ontology
The Gene Ontology (GO) is a structured vocabulary for describing biological processes, cellular components, and molecular functions of gene products [6]. The ontology is a hierarchy of terms organized as a directed acyclic graph. GO term annotations of gene products is available for many organisms. A popular approach to gain biological insights from a set of identified genes of interest is to determine which GO terms annotations are overrepresented among the genes in the set. A number of software packages are available which can determine GO term enrichments in a set of genes (see [16] for a recent review). STEM's integration with GO allows a user to conduct gene enrichment analyses directly in STEM, avoiding the need for a user to export into a separate file each set of genes of interest and then import them into an external GO software. Additionally, STEM implements an expected size enrichment analysis not available in other software (see below). Also unique to STEM is its ability to allow a user with a GO category of interest to easily identify the significant temporal response patterns associated with this category.
The integration with GO is designed to be simple for the user, comprehensive, and current. A user can select from a drop down menu on the main interface any of 35 gene annotation sources available from the Gene Ontology [14] or European Bioinformatics Institutes websites [15]. STEM also accepts any user provided annotation file in the official 15 column gene annotation format, or a simpler two column annotation format. In fact there is no restriction that annotations be GO terms. The set of GO annotations used can be filtered based on evidence code, and restricted to specific subset of annotations (see Additional file 1 for more details). New versions of annotations and the ontology are frequently released, but STEM makes it easy for a user to keep these files up to date by simply having the user check an appropriate field when they want STEM to download the latest annotations or ontology.
Actual and expected size gene set enrichments
STEM implements two types of gene enrichments for a set of genes assigned to the same model temporal expression profile r. The default enrichment in STEM and the method used in other software is actual size based enrichment, in which the enrichment is computed using the hypergeometric distribution based on the number of genes in the set of interest. Formally denote by N the total number of unique genes on the microarray. Denote by m the total number of genes that are in the GO category of interest. Denote by s
a
the number of gene's assigned to profile r. Based on the hypergeometric distribution the p-value of seeing v or more genes in the intersection of the category of interest and profile r can be computed as:
An advantage of the actual size enrichment is that it provides a means to externally validate a clustering algorithm, since the enrichment calculation makes no assumptions about how a set of genes was produced. Such a biological validation for the STEM clustering algorithm appears in [5].
Unlike other clustering algorithms, STEM's clustering algorithm also computes the expected number of genes matching a specific model profile. This leads to a new GO category enrichment p-value based on a profile's expected size. Formally, denote by s
e
the expected size of profile r. Then the p-value of seeing more than v genes belonging to both the category and profile r can be computed using the binomial distribution with parameters m and as:
An advantage of expected size enrichment occurs in the case in which the genes of multiple independent processes happen to have the same temporal expression pattern. In this case a temporal expression pattern could be very significant in terms of the number of genes assigned versus expected, but no GO category will appear enriched under an actual size enrichment test. However under an expected size enrichment test the GO categories could correctly be identified as being enriched. Expected size based enrichment is also useful for ordering temporal expression profiles to determine which are most relevant to a given GO category (see next subsection).
As many GO categories are being tested simultaneously, it is necessary to correct p-values using a multiple hypothesis correction. STEM can correct p-values using the Bonferonni correction, or in the case of actual size enrichment also by using a randomization test.
Bidirectional integration
STEM's integration with GO is bidirectional. In addition to allowing a user to determine for a given model profile what GO terms are significantly enriched, STEM can also determine for a given GO category what model profiles were most enriched for genes in that category. Given a GO category, STEM ranks the profiles based on their p-value enrichment for that category. The profiles on the main interface can be reordered based on either the actual or expected size enrichment. Figure 4 shows an image of the profiles of Figure 3 reordered by actual size based enrichment for the GO category DNA metabolism. In the bottom left hand corner of each profile box is the number of profile genes that belong to that GO category and the enrichment p-value. When the profiles are reordered by a GO category, upon opening the window with detailed information about the profile there is the option to plot just the subset of genes belonging to that GO category. STEM can also determine which cluster of significant profiles were most enriched for a GO category, and reorder the cluster of profiles according to the selected category.
Comparing data sets across experimental conditions
Many microarray studies include a comparison of the temporal response of genes between experimental conditions. For example, researchers have compared the temporal response of genes infected with a wildtype pathogen to those infected with a knockout mutant version of the pathogen [3] or the response of genes when exposed to a certain chemical substance to their response when not exposed [17]. STEM supports the ability to compare expression data sets across experimental conditions even when only few time points are sampled (assuming that the number of time points are the same). STEM allows a user to investigate questions such as: "for a set of genes which had temporal response X in experiment A, what significant responses did they have in experiment B?". STEM uses the hypergeometric distribution to compute the significance of overlap between gene sets of model profiles of two experiments. Since the model profiles are defined independent of the data, the boundaries in expression space that they induce will remain the same between experiments. In contrast, cluster boundaries from traditional, data driven, clustering algorithms will change between experiments. STEM is thus able to detect significant sets of genes with the same expression profiles across experiments that might otherwise be missed if the clusters were defined differently across experiments. Furthermore since the model profiles in STEM are also selected to be distinct and representative of all expression profiles, STEM will determine for all pairs of distinct expression patterns if there is a significant gene set intersection. If the clusters had been formed with a data driven clustering algorithm no such guarantee is possible.
Figure 7 shows a portion of the STEM interface which displays pairs of profiles from two experiments for which the gene set intersection is significant. The results are based on a comparison of an experiment measuring the response of gastric epithelial cells infected with the wildtype pathogen Helicobacter pylori to the response when infected with the vacA- strain [3]. The window shows that many sets of genes had a consistent response across experiments. This result is consistent with the observation made in [3] that the phenotypical response of vacA- infected cells was similar to the wildtype infected cells. The profile pairs on the comparison interface can be rearranged based on the significance of the intersection or how different the expression profiles are as measured by the correlation coefficient. On the main model profiles overview screen a user can reorder all the model profiles from one experiment based on the enrichment for a set of genes assigned to a profile or set of profiles in the other experiment.