STEM: a tool for the analysis of short time series gene expression data

Background Time series microarray experiments are widely used to study dynamical biological processes. Due to the cost of microarray experiments, and in some cases the limited availability of biological material, about 80% of microarray time series experiments are short (3–8 time points). Previously, short time series gene expression data has mainly been analyzed using more general gene expression analysis tools not designed for the unique challenges and opportunities inherent in such data. Results We introduce the Short Time-series Expression Miner (STEM), the first software program specifically designed for the analysis of short time series microarray gene expression data. STEM implements unique methods to cluster, compare, and visualize such data. STEM also supports efficient and statistically rigorous biological interpretations of short time series data through its integration with the Gene Ontology. Conclusion The unique algorithms STEM implements to cluster and compare short time series gene expression data, combined with its visualization capabilities and integration with the Gene Ontology, should make STEM useful in the analysis of data from a significant portion of all microarray studies. STEM is available for download for free to academic and non-profit users at .

Welcome to STEM! STEM is an acronym for the Short Time-series Expression Miner, a software program designed for clustering, comparing, and visualizing gene expression data from short time series microarray experiments (∼8 time points or fewer). STEM implements a novel method for clustering short time series expression data that can differentiate between real and random patterns. STEM is also integrated with the Gene Ontology (GO) [4] allowing efficient biological interpretations of the data.

STEM Clustering Method Overview
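The novel clustering method that STEM implements first defines a set of distinct and representative model temporal expression profiles. Each gene is assigned to the model profile it most closely matches, and the number of genes assigned to each profile is compared with the number expected in order to identify significant profiles. Similar model profiles can be grouped together to form clusters of profiles. The biological significance of the set of genes assigned to the same profile or the same cluster of profiles can then be assessed using a GO enrichment analysis. For a more detailed discussion of the novel method STEM uses to cluster genes and associate statistical significance with genes having the same expression profile see [3].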

Manual Overview
The remainder of the main portion of the manual contains six sections. Section 2 contains instructions on installing and starting STEM. Section 3 discusses the input to STEM including execution options and data file formats.
Section 4 describes the model profile overview interface, which allows a user to visualize a large number of model profiles on a zoomable interface and order them based on their relevance to a GO category or user defined gene set. Section 5 describes the interface for obtaining detailed information about a model profile or cluster of profiles, including a table of genes assigned and a table of GO category enrichments. Section 6 describes STEM features to compare two data sets from different experimental conditions. STEM also provides an implementation of the standard K-means clustering algorithm, which is described in Section 7. Sections 3-6 are presented assuming a user is interested in the novel STEM clustering method. Using K-means in STEM is similar, and the differences are discussed in Section 7. Most, but not all, of the information contained in this manual can also be obtained by clicking on the help icons throughout the software.

Installing and Starting STEM
• To use STEM a version of Java 1.4 or later must be installed. If Java 1.4 or later is not currently installed, then it can be downloaded from http://www.java.com.
• To install STEM simply save the file stem.zip locally and then unzip it. This will create a directory called stem.
• To execute STEM in Windows with its default initialization options simply double click on the file stem.cmd in the stem directory.
• To execute STEM from a command line, change to the stem directory and then type: java -mx1024M -ms512M -jar stem.jar
If Java gives an error message indicating that there is not enough memory available on the computer to start STEM, then remove the -ms512M option. For slightly better time performance at the cost of more memory usage, replace the -ms512M option with -ms1024M.
• STEM can be started with its initial settings specified in a default settings file. The format of a default setting file is specified in Appendix A. To have STEM load its initial settings from a default settings file, from the command line append -d followed by the name of the default settings file to the above command.
For instance, to have STEM start with the settings specified in the file defaults.txt, use the command: java -mx1024M -ms512M -jar stem.jar -d defaults.txt

Input Interface
The first window that appears after STEM is launched is the input interface (Figure 1). The interface is divided into four sections. In the top section a user specifies the expression data files and normalization options for the data. In the second section a user specifies the gene annotation information. In the third section a user specifies the desired clustering algorithm and various execution options. These three sections of the interface are described in more detail in the next three subsections. In the fourth section of the interface there is a button which, when pressed, causes STEM to execute the selected clustering algorithm and then display the output interface described in Section 4. If the data file does not have two or more time points, then results for a standard gene enrichment analysis will be displayed. For details about using STEM for standard gene enrichment analysis on non-time series data consult Appendix B.

Expression Data Info
The first field in the expression data section of the interface is the Data File field where a user specifies the input data file. An input data file consists of gene symbols, time series expression values, and optionally spot IDs. Spot IDs uniquely identify an entry in the data file, and if they are not included in the data file, then they will be automatically generated. While spot IDs must be unique, the same gene symbol may appear multiple times in the data file, corresponding to the same gene appearing on multiple spots on the array. Expression values for the same gene will be averaged using the median before further analysis on the data is conducted.
Figure 1: Above is the main input interface, which is the first screen that appears when STEM is launched. From this screen a user specifies the input data, gene annotation information, and various execution options. Pressing the execute button at the bottom of the interface causes the clustering and gene enrichment analysis algorithms to execute and then a new interface, described in Section 4, to appear.
A sample data file as it would appear in Microsoft Excel is shown in Figure 2. The first column, which appears in yellow, is optional, and if included contains spot IDs. If the data file includes the spot IDs column, then the field Spot IDs included in the data file on the input interface must be checked, otherwise the field must be unchecked. The next column, or the first column if spot IDs are not included in the data file, contains gene symbols. If a gene symbol is not available, then the field can be left empty or a '0' can be placed in it. Both the spot ID field and the gene symbol field may contain multiple entries delimited by a semicolon (';'), pipe ('|'), or comma (',').
The sub-entries in the field are only relevant in the context of gene annotations described in the next section.
The remaining columns contain the expression value at each time point ordered sequentially based on time. If an expression value is missing, then the field should be left empty.
The first row of the data file contains column headers, and each row below the column header corresponds to a spot on the microarray. Each column must be delimited by a tab. The tab-delimited input data file should be an ASCII text file or a GNU zip file of an ASCII text file. A tab-delimited text file can easily be generated in Microsoft Excel by choosing Text (Tab delimited) as the Save as type under the Save As menu. To view the contents of the data file from the interface, press the View Data File button and a table such as the one in Figure 3 will appear.
Before gene expression time series are matched against model temporal expression profiles, the time series must be transformed to start at 0. The transformation that is used to do this can be selected to be one of three types: Log normalize data, Normalize data, or No normalization/add 0. Given a time series vector of gene expression values $(v_0, v_1, v_2, \ldots, v_n)$ the transformations are as follows:
• Log normalize data -transforms the vector to $(0, \log_2(v_1/v_0), \log_2(v_2/v_0), \ldots, \log_2(v_n/v_0))$
• Normalize data -transforms the vector to $(0, v_1 - v_0, v_2 - v_0, \ldots, v_n - v_0)$
• No normalization/add 0 -transforms the vector to $(0, v_0, v_1, \ldots, v_n)$
It is recommended that after transformation a time series represent the log ratios of the gene expression levels versus the level at time point 0. Time point 0 usually corresponds to a control before the experimental conditions were applied. If the input data file contains raw expression values, as from an oligonucleotide array, then the Log normalize data option should be selected. If any values are 0 or negative and the Log normalize data option is selected, then these values will be treated as missing. If the input data file already represents the log ratio of a sample against a control, as is often the case when the data is from a two channel cDNA array, and an experiment was conducted at time point 0, then the Normalize data option should be selected.
In this case after normalization the transformed values will represent the log change ratio versus time point 0. If the input data file already contains log ratio data against a control, but no time point 0 experiment was conducted, then the No normalization/add 0 option should be selected. In this case the assumption is made that had a time point 0 experiment been conducted the expression level in both channels would have been equal. Pressing the Repeat Data button brings up an interface as shown in Figure 4. The Repeat Data button on the main input interface is yellow if there is currently one or more repeat data files specified, otherwise it is gray.
Repeat data files must have the same format as the original data file, including the same number of rows and columns. Repeat data values will be averaged with the values from the original data file using the median.
Repeat data can be selected to be from either Different time periods or The same time period. If the data is from Different time periods, then data was collected over multiple distinct time series, but presumably at the same sampling rate. If the data is from The same time period, then this implies multiple measurements were collected at each time point during one time series. If the repeat data is selected to be from The same time period, then the columns of values for the same time point in any two files could be interchanged without effect, while if the repeat data is selected to be from Different time periods this is not the case. If the repeat data is from Different time periods, the repeat data will be averaged after normalization, while if the repeat data is from The same time period the repeat data will be averaged before normalization. In the case the repeat data is from Different time periods, the repeat data can be used to filter genes with inconsistent expression patterns and also to provide noise estimates on which to base the clustering of model profiles, as explained in Section 3.3.
In the second section of the interface a user specifies the gene annotation information. Both gene symbols and spot IDs can be annotated as belonging to an official Gene Ontology (GO) category or a user defined category.
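The three transformations and the two orders of repeat averaging described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not STEM's actual implementation: the function names (transform, median_combine, combine_repeats) and the use of None for a missing value are hypothetical, and a missing or non-positive value at time point 0 is assumed to make every log ratio missing.

```python
import math
from statistics import median

def transform(values, method):
    """Transform one time series so that it starts at 0.

    values: list of floats, with None marking a missing value (assumption).
    method: "log" (Log normalize data), "normalize" (Normalize data),
            or "add0" (No normalization/add 0).
    """
    if method == "add0":
        # Prepend a 0 for the time point 0 experiment that was not run.
        return [0.0] + list(values)
    if method not in ("log", "normalize"):
        raise ValueError("unknown method: " + method)
    v0 = values[0]
    out = []
    for v in values:
        if v is None or v0 is None:
            out.append(None)
        elif method == "log":
            # Zero or negative values are treated as missing.
            out.append(math.log2(v / v0) if v > 0 and v0 > 0 else None)
        else:
            out.append(v - v0)
    return out

def median_combine(series_list):
    """Median of the non-missing values at each time point."""
    combined = []
    for column in zip(*series_list):
        present = [v for v in column if v is not None]
        combined.append(median(present) if present else None)
    return combined

def combine_repeats(original, repeats, method, same_time_period):
    """Average repeats before or after normalization, as described above."""
    all_series = [original] + list(repeats)
    if same_time_period:
        # The same time period: average first, then normalize.
        return transform(median_combine(all_series), method)
    # Different time periods: normalize each series, then average.
    return median_combine([transform(s, method) for s in all_series])
```

For example, transform([1.0, 2.0, 4.0], "log") gives the log ratios against time point 0, while combine_repeats with same_time_period=False normalizes each repeat separately before taking the median.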

Gene Annotation Info
If a gene is annotated as belonging to an official category in the Gene Ontology, then it will automatically also be annotated as belonging to any ancestor category in the ontology hierarchy. The first field in this section of the interface is the Gene Annotation Source. This field can be set to either User provided, No annotations, or one of 35 annotation data sets provided by Gene Ontology Consortium members. A full list of the 35 data sets can be found in Appendix C. More information about these annotation sets can be found at http://www.geneontology.org/GO.current.annotations.shtml, and for the annotation sets provided by the European Bioinformatics Institute (EBI) also at http://www.ebi.ac.uk/GOA/. One of the 35 data sets is the EBI UniProt set. Subsets of this data set with annotations specific to a large number of organisms can be found at http://www.ebi.ac.uk/GOA/proteomes.html and are not included in the list of 35 data sets. If one of the 35 data sets is selected, then the annotation file corresponding to the source will appear, uneditable, in the Gene Annotation File text box. If User provided is selected, then the Gene Annotation File text box will become editable, and a user can specify a gene annotation file. Selecting No annotations is equivalent to selecting User provided and leaving the field empty.
A gene annotation file can be in one of two formats:
1. The gene annotation file can be in the official 15 column gene annotation file format described at http://www.geneontology.org/GO.annotation.shtml#file. All 35 of the data sets provided by Gene Ontology Consortium members are in this format. If the file is in this format, any entry in the columns DB Object ID (Column 2), DB Object Symbol (Column 3), DB Object Name (Column 10), or DB Object Synonym (Column 11) will be annotated as belonging to the GO category specified in Column 5 of the row. If the entry in the DB Object Symbol column contains an underscore ('_'), then the portion of the entry before the underscore will also be annotated as belonging to the GO category, since under some naming conventions the portion after the underscore is a symbol for the database that is not specific to the gene. The DB Object Synonym column may have multiple symbols delimited by either a semicolon (';'), comma (','), or a pipe ('|') symbol, and all will be annotated as belonging to the GO category in Column 5. Note that the exact content of the DB Object ID, DB Object Symbol, DB Object Name, and DB Object Synonym columns varies between annotation sources; consult the README files available at http://www.geneontology.org/GO.current.annotations.shtml to find out more information about the content of these fields for a specific annotation source.
2. The alternative format for an annotation file is two columns delimited by a tab as illustrated in Figure 5.
The first column contains gene symbols or spot IDs and the second column contains category IDs. The entries in each column are delimited by a semicolon (';'), comma (','), or a pipe ('|') symbol. If the same gene symbol or spot ID appears on multiple rows, then the union of all its annotations is used.
Matching between gene symbols in the data file and the annotation file is not case sensitive. Gene annotation files can either be in an ASCII text format or a GNU zip file of an ASCII text file.
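As an illustration of the two-column annotation format, the following Python sketch reads tab-delimited lines, splits both columns on the three delimiters, and takes the union of annotations when an identifier appears on several rows. It is a minimal sketch and not STEM's parser: the function name is hypothetical, upper-casing identifiers is just one assumed way to obtain case-insensitive matching, and the file is assumed to have no header row.

```python
import re
from collections import defaultdict

# Entries within a column may be delimited by ';', ',' or '|'.
DELIMITERS = re.compile(r"[;,|]")

def parse_two_column_annotations(lines):
    """Map each gene symbol or spot ID to the set of its category IDs."""
    annotations = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip rows without a category column
        # Upper-case identifiers so that later lookups ignore case (assumption).
        ids = [s.strip().upper() for s in DELIMITERS.split(parts[0]) if s.strip()]
        categories = [c.strip() for c in DELIMITERS.split(parts[1]) if c.strip()]
        for identifier in ids:
            # Repeated identifiers accumulate the union of their annotations.
            annotations[identifier].update(categories)
    return dict(annotations)
```

A lookup for a gene symbol from the data file would then upper-case the symbol before indexing into the returned dictionary.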
Below the Gene Annotation Source field is the Cross Reference Source field, which controls the entry in the Cross Reference File text box.
At the bottom of the gene annotation section of the interface is the phrase Download the latest and then three checkboxes, Annotations, Cross References, and Ontology. If the Annotations box is checked, then the file listed in the Gene Annotation File box will be downloaded from ftp://ftp.geneontology.org/go/gene-associations/ unless it is an EBI data source, in which case it will be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/. If the Cross References box is checked, then the file listed in the Cross Reference File box will be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/. If the Ontology box is checked, then the file gene_ontology.obo will be downloaded from http://www.geneontology.org/ontology/gene_ontology.obo.
If the annotation, cross reference, or ontology file is required for use, and not present in the stem directory, then the corresponding field will be checked and there will not be an option to uncheck the field forcing download of the file(s). If the Gene Annotation Source is set to User Provided then there will not be an option to download the gene annotation file, and likewise for the cross reference source field and cross reference file. Upon pressing the execute button, the files corresponding to the checked fields will be downloaded.

Options
In the third section of the interface a user has the option to specify a variety of execution options for STEM.
The first option a user specifies is the Clustering Method, which can be set to either STEM Clustering Method or K-means. The STEM clustering method is the novel clustering method STEM implements specifically designed for short time series expression data, briefly described in Section 1.1 and described in more detail in [3]. STEM's implementation of the K-means algorithm is discussed in Section 7. Assuming the user selects the STEM Clustering Method, two options related to selecting temporal model expression profiles appear directly on the main input interface window.
Below is a more detailed description of the parameters on the filtering panel:
• Maximum Number of Missing Values -A gene will be filtered if the number of missing values exceeds this parameter.
• Minimum Absolute Expression Change -A gene will be filtered if the absolute value of its expression at every time point after the selected transformation is applied (Log normalize data, Normalize data, or No normalization/add 0 ) is below this value.
• Minimum Correlation between Repeats -This parameter controls filtering of genes which do not display a consistent temporal profile across repeat experiments and only applies if there is repeat data selected to be from Different time periods. If there is a single repeat file, a gene will be filtered if its correlation between the original data set and the repeat set is below this parameter. If multiple repeats are available, then the gene will be filtered if the median of all its pairwise correlations between experiments is below this parameter.
• Pre-filtered Gene File -This file is optional. If included, any genes listed in the file will be considered part of the initial base set of genes during a Gene Ontology (GO) enrichment analysis in addition to any genes included in the data file. Using this file thus allows one to filter genes from the data by criteria not implemented in STEM by excluding them from the data file, but still include the filtered genes as part of the base set of genes during a GO enrichment analysis. If a gene appears in both the Pre-filtered Gene File and the data file, then the gene will only be added to the base set once. The format of this file is the same as a data file, except that including the time series expression values is optional, and if included they will be ignored. As with a data file, if the field Spot IDs included in the data file is checked, then the first column will contain spot IDs and the second column will contain gene symbols, otherwise the first column will contain gene symbols.
• Significance Level -The significance level at which the number of genes assigned to a model profile as compared to the expected number of genes assigned should be considered significant. If the Correction Method parameter for multiple hypothesis testing is Bonferroni, then this parameter is the significance level before applying a Bonferroni correction. If Correction Method is False Discovery Rate, then this parameter is the false discovery rate. If Correction Method is none, then this parameter is the uncorrected significance level.
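The three gene filters described above (missing values, absolute expression change, and correlation between repeats) can be summarized in a short Python sketch. It is illustrative only: the function names and the None convention for missing values are hypothetical, and the Pearson correlation coefficient is assumed as the correlation measure.

```python
import math
from itertools import combinations
from statistics import median

def pearson(x, y):
    """Pearson correlation over time points present in both series."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation undefined for a constant series
    return sxy / math.sqrt(sxx * syy)

def passes_filters(transformed, max_missing, min_abs_change,
                   repeat_series=(), min_correlation=None):
    """Return True if a gene survives all three filters.

    transformed: the gene's time series after the selected transformation.
    repeat_series: the gene's transformed series from each experiment
        (original plus repeats); only meaningful for Different time periods.
    """
    # Maximum Number of Missing Values
    if sum(v is None for v in transformed) > max_missing:
        return False
    # Minimum Absolute Expression Change: filtered if every time point is below.
    present = [abs(v) for v in transformed if v is not None]
    if not present or max(present) < min_abs_change:
        return False
    # Minimum Correlation between Repeats: median of all pairwise correlations.
    if min_correlation is not None and len(repeat_series) >= 2:
        correlations = [pearson(a, b) for a, b in combinations(repeat_series, 2)]
        correlations = [c for c in correlations if c is not None]
        if correlations and median(correlations) < min_correlation:
            return False
    return True
```

With a single repeat file there is only one pair of experiments, so the median reduces to the single correlation between the original data set and the repeat set.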

Model Profile Options
• Correction Method -The significance level can be corrected for the fact that multiple profiles are being tested for significance. The correction can be a Bonferroni correction where the significance level is divided by the number of model profiles or the less conservative False Discovery Rate control [2]. If none is selected then no correction is made for the multiple significance tests. Note that this parameter for multiple test correction for model profiles is unrelated to the corrected p-values in a GO enrichment analysis.
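The interaction between Significance Level and Correction Method can be sketched as follows. The uncorrected p-value follows the binomial description given for the Significance ordering of profiles (parameters t and s_e/t); the False Discovery Rate branch assumes the Benjamini-Hochberg step-up procedure, which the manual does not spell out. Function names are hypothetical, and s_a is assumed to be an integer count.

```python
from math import exp, lgamma, log

def binomial_tail(s_a, s_e, t):
    """P(X >= s_a) for X ~ Binomial(t, s_e / t), computed in log space."""
    p = s_e / t
    if p <= 0.0:
        return 1.0 if s_a <= 0 else 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(max(s_a, 0), t + 1):
        log_term = (lgamma(t + 1) - lgamma(i + 1) - lgamma(t - i + 1)
                    + i * log(p) + (t - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)

def significant_profiles(p_values, alpha, method="bonferroni"):
    """Return one boolean per model profile.

    method: "none", "bonferroni", or "fdr" (Benjamini-Hochberg, an assumption).
    """
    m = len(p_values)
    if method == "none":
        return [p <= alpha for p in p_values]
    if method == "bonferroni":
        # Significance level divided by the number of model profiles.
        return [p <= alpha / m for p in p_values]
    if method == "fdr":
        order = sorted(range(m), key=lambda i: p_values[i])
        cutoff_rank = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= rank * alpha / m:
                cutoff_rank = rank  # largest rank meeting its threshold
        flags = [False] * m
        for rank, i in enumerate(order, start=1):
            flags[i] = rank <= cutoff_rank
        return flags
    raise ValueError("unknown method: " + method)
```

Working in log space avoids overflow in the binomial coefficients when several thousand genes pass the filter.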

Clustering Profiles Options
The two parameters on the clustering profile panel apply only to the STEM clustering method, not K-means clustering. The parameters control how similar two model profiles must be if they are grouped together. The two parameters are as follows:
• Minimum Correlation -Any two model profiles assigned to the same cluster of profiles must have a correlation above this parameter's value. Increasing this value will lead to more clusters with fewer model profiles per cluster, while decreasing the value will lead to fewer clusters with more model profiles per cluster.

Gene Annotations Options
On the fourth panel, shown in Figure 9, a user may specify options related to gene annotations. The first three options allow one to filter annotations when the annotation file is in the official 15 column format. The last field, the Category ID mapping file, is useful in the case in which genes are annotated as belonging to a category outside the Gene Ontology. The options on this panel are as follows:
• Only include annotations of type {Biological Process, Molecular Function, Cellular Component} -These three checkboxes allow one to filter annotations that are not of the types checked. These three checkboxes only apply if the annotations are in the official 15 column GO format, in which case the annotation type is determined by the entry in the Aspect field (Column 9). An entry of P in the Aspect field means the annotation is of type Biological Process, an entry of F means the annotation is of type Molecular Function, and an entry of C means the annotation is of type Cellular Component.

GO Analysis Options
The final advanced options panel, shown in Figure 10, controls options related to Gene Ontology (GO) enrichment analysis. Note that categories that appear in a gene annotation file, even if not part of the official Gene Ontology, are also included in a GO analysis. The parameters included on this panel are as follows:
• Minimum GO level -Any GO category whose level in the GO hierarchy is below this parameter will not be included in the GO analysis. The categories Biological Process, Molecular Function, and Cellular Component are defined to be at level 1 in the hierarchy. The level of any other term is the length of the longest path to one of these three GO terms in terms of the number of categories on the path. This parameter thus allows one to exclude the most general GO categories.
• Minimum number of genes -For a category to be listed in a gene enrichment analysis table, described in Section 5.2, the number of genes in the set being analyzed that also belong to the category must be greater than or equal to this parameter.
• Number of samples for randomized multiple hypothesis correction -This parameter specifies the number of random samples that should be made when computing multiple hypothesis corrected enrichment p-values by a randomization test. A randomization test is used when the p-value enrichment is based on the actual size of the set of genes and Randomization is selected next to the Multiple hypothesis correction method for actual sized based enrichment label. The Bonferroni correction is always used when the p-value enrichment is based on the expected size of the set of genes. The difference between actual and expected size enrichment is discussed in Section 4.3. Increasing this parameter will lead to more accurate corrected p-values for the randomization test, but will also lead to longer execution time to compute the values.
• Multiple hypothesis correction method for actual sized based enrichment -If Randomization is selected, the corrected p-value is computed based on a randomization test in which random samples of the same size as the set being analyzed are drawn. The number of samples is specified by the parameter Number of samples for randomized multiple hypothesis correction. The corrected p-value for a p-value, r, is the proportion of random samples for which there is enrichment for any GO category with a p-value less than r. A Bonferroni correction is faster, but a randomization test leads to lower p-values.

Model Profiles Overview Interface
After the STEM clustering algorithm executes, the model profile overview interface appears. An example of such an interface is shown in Figure 11. Each box corresponds to a different model temporal expression profile. The Order Clusters By button opens a dialog window that allows one to reorder the clusters of profiles, that is, profiles are reordered with the constraint that profiles of the same color must be kept together. The main gene table, the filtered gene list, ordering profiles, and ordering clusters of profiles are explained in detail in the next four subsections. The Compare option, which allows comparison with a data set from a different experimental condition, is explained in Section 6. Pressing the help icon brings up the legend that appears in Figure 12 along with additional help information. The last subsection of this section, Section 4.5, describes how one can zoom in or out on any portion of the main window.
Figure 12: The legend that appears after pressing the help icon.

Main Gene Table
The main gene table is shown in Figure 13. Clicking on a row of the table opens a new window containing detailed information about the profile to which the gene of the row was assigned. This new window is described in Section 5. An option will also appear on the newly opened window to plot only the expression of the gene of the selected row.
The columns of the table are as follows:
• Selected -An entry in this column contains a 'Yes' if the gene of the row is part of a category or gene set by which the profiles are ordered, otherwise the field is empty.
• Gene Symbol -This column contains the gene symbols. The name for this column is read from the header in the data file.
• Spot ID -An entry in this column contains a list of spot IDs of spots which contain the gene of the row.

Ordering Profiles
If a profile has more genes assigned than expected, then it is possible a gene enrichment for a category will be significant under an expected size based enrichment while it is not significant under an actual size enrichment. Likewise, if a profile has fewer genes assigned than expected, it is possible a gene enrichment for a category will be significant under an actual size based enrichment while it is not significant under an expected size based enrichment. If multiple independent processes happen to have the same temporal profile, then a significant gene enrichment for the process may be missed through an actual size enrichment, but detected through an expected size enrichment.
Clicking on a row of the table will reorder the profiles based on the p-value enrichment for the category of that row. Whether the p-value enrichment is computed based on the profile's actual size or expected size will depend on which is selected next to the label Order using enrichment p-values based on a profile's. Profiles are ordered row-wise from left to right and top to bottom based on the significance of the enrichment for the selected category.
The profile most enriched for the selected category appears in the top left corner. The next most enriched profile appears second in the top row and so on. For instance Figure 16 shows an example of the model profiles reordered based on an actual size enrichment for cell cycle genes. The numbers that appear in the bottom left hand corner of the model profile box are the number of genes assigned to the profile that also belong to the selected category and then separated by a semicolon the p-value enrichment.
Below the table are several buttons which give additional criteria to reorder profiles: • Profile ID -Reorders profiles sequentially from left to right and top to bottom by their ID number, the number in the top left corner of the profile box (top left Figure 17). Profiles which go down initially will appear first, then profiles which hold steady initially, and then last will be profiles which go up initially.
• Significance -Reorders profiles based on the p-value significance of the number of genes assigned to a profile being more than the number of genes expected (top right Figure 17). If $s_a$ genes were assigned to the profile, $s_e$ genes were expected, and a total of $t$ genes passed filter, then the uncorrected p-value of seeing $s_a$ or more genes assigned to the profile is computed based on a binomial distribution with parameters $t$ and $s_e/t$. The p-value is computed as $\sum_{i=s_a}^{t} \binom{t}{i} \left(\frac{s_e}{t}\right)^{i} \left(1 - \frac{s_e}{t}\right)^{t-i}$.
When a user presses the Order Clusters By button on the main profile window, a dialog box such as the one in Figure 19 appears. This window is a simplified version of the window that appears when a user presses Order Profiles By.
Along the bottom of the window are several yellow buttons. Which buttons appear will depend upon how the profiles are ordered, through which interface the window was opened, and whether the profile is part of a non-singleton cluster of profiles. However, every window will contain Profile Gene Table and Profile GO Table buttons. If the profiles or cluster of profiles are reordered based on a category, then two additional buttons will appear above the bottom row. Pressing the top of these two buttons will display a table of the genes that were assigned to the profile and also belong to the category by which the profiles are ordered. In Figure 23 this is the Profile cell cycle Gene Table button. Below this button is a button which gives the option to plot only the profile genes belonging to the category by which the profiles are ordered. This is the Click to plot only profile cell cycle genes button on the left side of Figure 23. Once this button is pressed, the button will be replaced with a button that says Click to plot all profile genes (right side of Figure 23), which gives the user the option to revert back to having all the profile genes plotted.

Zooming and Panning
If the profiles or cluster of profiles are ordered based on a user defined gene set, referred to as a query gene set, then there will be several additional buttons (Figure 24). The button Click to plot only profile query set genes replots the window with only profile genes that also belong to the user defined gene set. Pressing the button will cause it to be replaced with a Click to plot all profile genes button, which when pressed will revert to the original window. Above the Profile Gene Table and Profile GO Table buttons are the Profile Query Gene Table and Profile Query GO Table buttons. Pressing Profile Query Gene Table displays a table with all genes assigned to the profile that also belong to the query gene set. Pressing Profile Query GO Table displays a gene enrichment table for just the genes assigned to the profile that are also part of the query set. If the profile is part of a non-singleton cluster of profiles, then two additional buttons will appear, the Cluster Query Gene Table and Cluster Query GO Table buttons. These buttons are analogous to the Profile Query Gene Table and Profile Query GO Table buttons, but are based on all genes in the query set that are assigned to any profile that is part of the profile's cluster of profiles.
If the profile window was opened by clicking on a row in the main gene table as described in Section 4.1, then a button will appear to plot only the gene of the row that was clicked on. This is the Click to plot only gene STAM2 button on the left side of Figure 25. Once the button is pressed, the button will be replaced with the Click to plot all profile genes button (right side of Figure 25), which if pressed again will revert the window back to its original state.

Gene
• Selected -This is the same column as in the main gene table. An entry in this column contains a 'Yes' if the gene of the row is part of a category or gene set by which the profiles are ordered, otherwise the field is empty.
• Weight -This field represents the weight of the assignment of the gene to the profile. If the profile the gene most closely matches is unique, then the value is one. If there is a tie as to which profile a gene most closely matches, then this value is one divided by the number of profiles a gene most closely matches.
• Gene Symbol -This column contains the gene symbols. The name for this column is read from the header in the data file.
• Spot ID -An entry in this column contains a list of spot IDs of spots which contain the gene of the row delimited by a ';'. The header for this column is read from the data file if spot IDs are included in the data file.
• Time Point columns -The time series of gene expression levels for the gene after any selected transformation (Log normalize data, Normalize data, or No normalization/add 0). The headers for these columns are read from the data file.
As with all tables in STEM, this table can be sorted in ascending or descending order by any column by clicking on the column header. A user can also save the entire table using the Save Table button or just the gene names using the Save Gene Names button.
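The rule for the Weight column can be illustrated with a short Python sketch. The function name, the `distances` mapping, and the tolerance used to detect ties are all assumptions made for the example; this section does not state how STEM measures how closely a gene matches a profile.

```python
from typing import Dict


def assignment_weights(distances: Dict[str, float], tol: float = 1e-12) -> Dict[str, float]:
    """Map each best-matching profile ID to the weight of the gene's assignment.

    distances: hypothetical mapping from profile ID to the gene's distance
               from that profile (smaller means a closer match).
    """
    best = min(distances.values())
    # All profiles whose distance ties with the best one (within tol)
    tied = [pid for pid, d in distances.items() if abs(d - best) <= tol]
    # Weight is one for a unique best match, otherwise 1 / number of ties
    return {pid: 1.0 / len(tied) for pid in tied}
```

With a unique closest profile the single weight is one; with a two-way tie each of the two profiles receives a weight of 0.5.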

Gene Enrichment Analysis Table
From the window with details about a model profile a user has the option to display a table that includes gene enrichment for Gene Ontology (GO) categories along with any other categories that may appear in an annotation file. Figure 27 shows an example of such a table. As discussed at the beginning of the section the exact set of
The columns of a gene enrichment table are as follows:
• Category ID -The ID for the category.
• Category Name -The name for the category.
• # Genes Category -The number of genes on the entire microarray that were annotated as belonging to the category.
• # Genes Assigned -The number of genes annotated as belonging to the category that are part of the set of genes being analyzed.
• # Genes Expected -The number of genes annotated as belonging to the category that were expected to be part of the set being analyzed. This value will depend on whether an actual size or expected size profile enrichment analysis is being conducted.
• # Genes Enriched -The difference between # Genes Assigned and # Genes Expected.
• p-value -The uncorrected p-value of seeing this many or more genes from this category assigned to the set of genes being analyzed. This p-value will depend on whether an actual size or expected size enrichment analysis is being conducted. See Section 4.3 for a discussion on how the p-value is computed.
• Corrected p-value -The p-value corrected for testing a large number of GO categories. If the enrichment is based on a set's actual size and Randomization is selected as the value for Multiple hypothesis correction method for actual size based enrichment, then the corrected p-value is computed based on a randomization test. If the enrichment is computed based on a set's expected size, or Bonferroni is selected as the value for Multiple hypothesis correction method for actual size based enrichment, then the corrected p-value is computed based on a Bonferroni correction. See Section 3.3.5 for a discussion on these two methods for correcting GO enrichment p-values.
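As an illustration of how the derived columns relate to one another, the following Python sketch computes # Genes Enriched and a Bonferroni-corrected p-value for one table row. It is a minimal sketch, not STEM's implementation: the function and key names are hypothetical, and `n_categories_tested` is an assumed input, since this section does not state exactly which count STEM uses as the Bonferroni multiplier. The randomization-based correction is not shown.

```python
from typing import Dict


def enrichment_row(n_assigned: float, n_expected: float,
                   p_value: float, n_categories_tested: int) -> Dict[str, float]:
    """Derive the enrichment and Bonferroni-corrected columns for one category."""
    return {
        # "# Genes Enriched" is the difference between assigned and expected
        "genes_enriched": n_assigned - n_expected,
        "p_value": p_value,
        # Standard Bonferroni: multiply by the number of tests, cap at 1
        "corrected_p_value": min(1.0, p_value * n_categories_tested),
    }
```

For example, `enrichment_row(12, 4.5, 0.001, 200)` describes a category with 12 assigned genes against 4.5 expected, tested alongside 199 other categories.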
A gene enrichment table can be sorted by any column in ascending or descending order by clicking on the column header. The contents of the table can also be saved to a text file using the Save Table button. Clicking on a row of the gene enrichment table will display a gene table that only includes genes that belong to the category of the row and also to the set being analyzed. For example, if a user clicked on the cell cycle row, a table such as that in Figure 28 will appear, which contains only genes that were assigned to the profile being analyzed and were also annotated as being cell cycle genes.
• Maximum uncorrected intersection p-value -The maximum uncorrected intersection p-value for the intersection to be of interest.
• Minimum number of genes in intersection -The minimum number of genes in the intersection of the set of genes assigned to two profiles for the intersection to be of interest.
Pressing the yellow Compare button will launch two new windows. One of the windows contains the model profile overview screen for the comparison data set. This is the same interface that is described in Section 4. The other window is the main comparison window, an example of which is shown in  while if the profiles to the right of the yellow bar are from the original experiment then the horizontal labels will read "Original Set Profiles." A profile appears to the right of the yellow bar if the intersection of the set of genes assigned to it and the set assigned to the profile to the immediate left of the yellow bar satisfies the size and p-value constraints specified on the comparison dialog. The legend that appears when a user presses the help icon is shown in Figure 31 and explains what the various numbers on the profile boxes mean. This window, as with the main profile screen, is zoomable and pannable. Instructions for zooming and panning can be found in Section 4.5.
Clicking on a profile box to the right of a yellow bar launches a detailed model profile window that includes the option to obtain information about the genes in the intersection between the profile clicked on and the profile to the immediate left of the yellow bar (left side of Figure 32). Near the top of the window is a line of text indicating how many genes were in the intersection and the p-value of the intersection. The intersection profile window also contains a button which plots only those genes in the profile which were also assigned to the profile in its row to the left of the yellow bar in the other experiment. After pressing the Click to plot only genes in intersection button, one has the option to press the Click to plot all profile genes button to revert to the original screen. Two additional buttons that appear on the profile interface are the Profile Intersect Gene Table button and the Profile Intersect GO Table button. The Profile Intersect Gene Table button displays a gene table (Section 5.1) of genes assigned to this profile which were also assigned to the profile to the left of the yellow bar in the other experiment, that is, the genes in the intersection. The Profile Intersect GO Table button displays a table (Section 5.2) with a gene enrichment analysis for genes in the intersection set. Clicking on a profile to the left of the yellow bar opens a window which displays information about the profile, but does not provide any information about gene intersections. On the bottom of the comparison window are four yellow buttons which are used to rearrange the profile boxes on the main window. These buttons function as follows:
• Swap Rows and Columns -Interchanges which data set is to the left of the yellow bar, and which is to the right of the yellow bar.
• Order By Profile ID -This button returns the profile pairs to their default ordering. By default the profiles to the left of the yellow bar are first ordered by increasing ID. Profiles to the right of the yellow bar are then ordered within the row by increasing ID.
• Order By Significance -This reorders profile pairs based on statistical significance of the gene set intersection. In any row, the profiles to the right of the yellow bar are ordered with increasing p-value for the gene set intersection with the profile to the left of the yellow bar. The profiles to the left of the yellow bar are ordered to have increasing minimum intersection p-value significance with a profile in its row to the right of the yellow bar.
• Order By Correlation -This reorders profile pairs based on correlation. In any row, the profiles to the right of the yellow bar are ordered based on increasing correlation with the profile to the left of the yellow bar. The profiles to the left of the yellow bar are ordered to have increasing minimum correlation with a profile in its row to the right of the yellow bar.
Figure 32: On the left is an example of a model profile window that appears when a model profile box to the right of a yellow bar is pressed. On the right is the same window after the button Click to plot only genes in intersection is pressed.
As mentioned in Section 4.3, a user can reorder the profiles on the model profile overview screen based on gene enrichment for a user defined set. After the Compare button on the comparison dialog has been pressed, the user defined gene set can be defined based on sets of genes assigned to profile(s) in the other data set. This feature thus allows a user to visualize how a set of genes which all had the same expression profile(s) in one experiment responded in another experiment under different conditions. On the left of Figure 33 is the window to define a gene set by which to reorder the original data set model profiles; notice that the field Profile ID in Comparison Set is active. On the right of Figure 33 is the window to define a gene set by which to reorder the comparison data set model profiles; notice that the field Profile ID in Original Set is active. Pressing the Select button selects those genes from the other experiment assigned to the profile of the ID displayed. Note that one can select genes from multiple profiles, since selecting an additional profile ID does not clear any currently selected genes. To create a gene set based on all the genes filtered in the other experiment, set the profile ID value to "-1" and then press select genes.
Figure 33: Dialog windows to define gene sets. The dialog window on the left is used to define a gene set to reorder model profiles from the original data set, while the dialog on the right is used to define a gene set to reorder model profiles from the comparison data set.
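The Order By Significance rule described earlier in this section (sort each row by increasing intersection p-value, then sort the rows by their smallest p-value) can be sketched as follows. The data layout and names are assumptions chosen for the example; intersection p-values are taken as given inputs, and placing rows without any matching profile last is likewise an assumption.

```python
from typing import Dict, List, Tuple

# (ID of a profile right of the yellow bar, intersection p-value)
Match = Tuple[int, float]


def order_by_significance(rows: Dict[int, List[Match]]) -> List[Tuple[int, List[Match]]]:
    """Order profile pairs by the significance of their gene set intersection.

    rows maps the ID of a profile left of the yellow bar to its matches.
    """
    # Within each row: increasing intersection p-value
    ordered = [(left, sorted(matches, key=lambda m: m[1]))
               for left, matches in rows.items()]
    # Across rows: increasing minimum p-value (empty rows go last)
    ordered.sort(key=lambda row: row[1][0][1] if row[1] else float("inf"))
    return ordered
```

Ordering by correlation would follow the same two-level pattern with a correlation value in place of the p-value.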

K-means
In addition to providing a novel clustering method designed for short time series expression data [3], STEM also provides an implementation of the standard K-means algorithm for clustering. To use the K-means clustering algorithm in STEM, select K-means under Clustering Method (Figure 34). The K-means clustering algorithm partitions genes into K sets, $S_1, S_2, \ldots, S_K$, where K is an input parameter provided by a user in the field Number of Clusters (K). Each set $S_i$ has a center $c_i$ associated with it, where the center represents the mean of all genes assigned to the set $S_i$. After the transformation described in Section 3.1, a gene $x_j$ and center $c_i$ are T + 1 element vectors that can be written as $(0, x_{j1}, x_{j2}, \ldots, x_{jT})$ and $(0, c_{i1}, c_{i2}, \ldots, c_{iT})$ respectively.
The K-means algorithm tries to minimize the function
$$\sum_{i=1}^{K} \sum_{x_j \in S_i} \sum_{h=1}^{T} (x_{jh} - c_{ih})^2$$
The K-means algorithm starts with randomly selected centers; in STEM's implementation the initial centers are chosen to be randomly selected genes. The algorithm then iterates between two steps until convergence. In one step each gene is reassigned to the cluster of the center to which it is closest. In the next step the center of each cluster is recomputed based on the new assignment of genes to clusters. The algorithm terminates when no changes in assignment can be made. This algorithm is guaranteed to converge to a local minimum, but not necessarily to a global minimum. The algorithm can be repeated for a number of different random starts, with potentially a different clustering obtained from each start. Only the run with the best scoring final set of clusters is returned.
The number of random starts is specified in the field Number of Random Starts on the main input interface.
Increasing this parameter leads to a potentially slightly better clustering, at the expense of a slightly longer running time.
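The procedure described above (random genes as initial centers, alternating reassignment and center updates, several random starts, best run kept) can be sketched in plain Python. This is a minimal sketch, not STEM's code. It assumes squared Euclidean distance as the closeness measure and as the score, the function names and the `seed` and `max_iter` parameters are hypothetical, and keeping the old center when a cluster becomes empty is a choice made only for this example.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(x: Vector, c: Vector) -> float:
    """Squared Euclidean distance between a gene and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans_once(genes: List[Vector], k: int, rng: random.Random,
                max_iter: int = 1000) -> Tuple[List[int], float]:
    """Run K-means from one random start; return (assignment, score)."""
    # Initial centers are k randomly selected genes
    centers = [list(g) for g in rng.sample(list(genes), k)]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Step 1: assign each gene to the cluster of its closest center
        new_assignment = [min(range(k), key=lambda i: sq_dist(g, centers[i]))
                          for g in genes]
        if new_assignment == assignment:
            break  # no gene changed cluster, so the run has converged
        assignment = new_assignment
        # Step 2: recompute each center as the mean of its assigned genes
        for i in range(k):
            members = [g for g, a in zip(genes, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    score = sum(sq_dist(g, centers[a]) for g, a in zip(genes, assignment))
    return assignment, score


def kmeans(genes: List[Vector], k: int, n_starts: int,
           seed: int = 0) -> Tuple[List[int], float]:
    """Repeat K-means for n_starts random starts and keep the lowest score."""
    rng = random.Random(seed)
    runs = [kmeans_once(genes, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])
```

Here `k` plays the role of Number of Clusters (K) and `n_starts` the role of Number of Random Starts; more starts cost proportionally more time and can only lower, never raise, the best score found.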
After the K-means algorithm executes, the main output interface is displayed, an example of which appears in Figure 35. This interface is similar to the model profile overview interface described in Section 4 with a few differences of note. For K-means clustering each box on the interface corresponds to a cluster instead of a profile.
The time series shown in the box is the average expression of all genes assigned to the cluster. The number in the top left hand corner of the box is a Cluster ID (see Figure 36 for a legend). All K-means cluster boxes appear white since no statistical significance is associated with them. The K-means clusters are by default ordered based on ID. IDs are assigned based on the cluster average expression value at the first time point. K-means cluster boxes can be reordered on the main interface analogously to the reordering of STEM profile boxes described in Section 4.3.
Pressing the Order Cluster By button brings up the dialog box in Figure 37 through which the clusters can be reordered. The reordering criteria of the clusters can be the number of genes assigned to the cluster, or p-value enrichment for a GO category or user defined gene set.
Pressing a cluster box opens a window such as Figure 38 with detailed information about a K-means cluster, similar to the model profile detailed interface described in Section 5. From this window one can open a table of all genes assigned to the cluster, as one could do for all genes assigned to a STEM profile as described in Section 5.1.
Similarly, one can open a table with GO analysis results for the set of genes assigned to the cluster, as one could do for all genes assigned to a profile as described in Section 5.2. The GO analysis can only be based on the actual size of the cluster since there is no notion of the expected size of a K-means cluster.
Pressing the Main Gene Table button on the main K-means interface is the same as described in Section 4.1 for the STEM clustering method, except the table has the cluster the gene was assigned to instead of the profile. The Filter Gene Table is identical to that described in Section 4.2. Comparison for K-means works the same way as described in Section 6 except STEM profiles are replaced with K-means clusters. Figure 39 shows the comparison legend for the comparison interface with K-means, analogous to Figure 31 for comparison with STEM profiles.
Figure 34: Above is the main input interface described previously in Section 3 with the clustering method set to K-means. Two parameters appear when K-means is selected that do not appear when the STEM clustering method is selected. These two parameters specify the number of clusters and the number of random starts.

A Defaults File Format
As mentioned in the preliminary section, the default settings for STEM can be specified in a file and used through the -d option on the command line. Below is a sample file. The parameter names are on the left side and a tab separates them from their values. Lines which begin with a # are comments and are ignored.
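To make the layout concrete, the following Python sketch reads text in the format just described: a parameter name, a tab, then the value, with lines starting with # ignored. It is an illustration only. The function name is hypothetical, the parameter names in the usage note are made up for the example and are not taken from STEM's actual defaults file, and skipping blank lines or lines without a tab is an assumption.

```python
from typing import Dict


def parse_defaults(text: str) -> Dict[str, str]:
    """Parse tab-separated 'name<TAB>value' lines, ignoring # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line (assumed ignorable) or comment
        name, sep, value = line.partition("\t")
        if not sep:
            continue  # assumption: lines without a tab are skipped
        settings[name.strip()] = value.strip()
    return settings
```

For instance, `parse_defaults("# example\nExample_Parameter\t10\n")` yields a dictionary with the single hypothetical key `Example_Parameter` mapped to the string `"10"`; values are left as strings because the expected type of each parameter is not specified here.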

B Using STEM for Standard Gene Ontology Enrichment Analysis
STEM may be used for standard Gene Ontology enrichment analysis for non-time series data in two ways. Given a data file of genes with a single time point column, STEM will perform a Gene Ontology enrichment analysis for those genes whose absolute expression value exceeds the value specified by the Minimum Absolute Expression Change parameter. In this case the base set of genes is all genes in the data file. STEM can also be used to do an enrichment analysis for an arbitrary set of genes and an arbitrary base set of genes. The set of genes to do an enrichment analysis on is specified in the Data File, while the base set of genes is specified in the Pre-filtered Gene File. The first line of each of these files is a header line, and every line below the header contains one gene. As with a data file, the field Spot IDs included in the data file should be unchecked, unless spot IDs are the first column and gene symbols are the second column, in which case the field should be checked. After pressing execute, a gene enrichment analysis table will appear as described in Section 5.2.
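The first mode, selecting genes from a single-column data file by absolute value, amounts to a simple threshold filter. The Python sketch below illustrates the idea. The function name and the dictionary input are assumptions, and the strict comparison reflects the word "exceeds" in the text rather than any knowledge of how STEM treats values exactly equal to the threshold.

```python
from typing import Dict, List


def genes_for_enrichment(values: Dict[str, float], min_abs_change: float) -> List[str]:
    """Return genes whose absolute value exceeds the threshold.

    values: hypothetical mapping from gene symbol to its single column value.
    The base set for the enrichment analysis would be all keys of `values`.
    """
    return [gene for gene, v in values.items() if abs(v) > min_abs_change]
```

Both strongly positive and strongly negative values pass the filter, since only the magnitude is compared against the Minimum Absolute Expression Change value.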

C Gene Annotation Sources
The