sscMap: An extensible Java application for connecting small-molecule drugs using gene-expression signatures

Background: Connectivity mapping is a process for recognizing novel pharmacological and toxicological properties of small molecules by comparing their gene-expression signatures with others in a database. A simple and robust method for connectivity mapping with increased specificity and sensitivity was recently developed, and its utility was demonstrated using experimentally derived gene signatures.

Results: This paper introduces sscMap (statistically significant connections' map), a Java application designed to undertake connectivity mapping tasks using the recently published method. The software is bundled with a default collection of reference gene-expression profiles based on the publicly available dataset from the Broad Institute Connectivity Map 02, which includes data from over 7000 Affymetrix microarrays, for over 1000 small-molecule compounds, and 6100 treatment instances in 5 human cell lines. In addition, the application allows users to add their own custom collections of reference profiles and is applicable to a wide range of other 'omics technologies.

Conclusion: The utility of sscMap is twofold. First, it serves to make statistically significant connections between a user-supplied gene signature and the 6100 core reference profiles based on the Broad Institute expanded dataset. Second, it allows users to apply the same improved method to custom-built reference profiles, which can be added to the database for future referencing. The software can be freely downloaded from .


sscMap: Connecting small-molecule drugs using gene-expression signatures
The sscMap program implements the method introduced in [1] to test the connections between Gene Signatures and Reference Gene-Expression Profiles. It may be helpful to read the related papers [1,2] first to learn more about the methodology itself. This document can be regarded as a supplement to the paper that introduces the sscMap program [3].
The sscMap program can be run in two execution modes: as a command line program, or as a GUI (Graphical User Interface) application. The instructions for running the program from the command line can be found near the end of this document, while the two tours below will guide you through the GUI mode. You need Java 1.6 (or a later version) installed on your computer to run the program. After going through the two guided tours, you should have a fairly good grasp of how the program works and can then try running queries with your own gene signatures.
The sscMap software is bundled with a default collection of 6100 reference gene-expression profiles based on the Broad Institute Connectivity Map 02 dataset (http://www.broad.mit.edu/cmap/). To query this default collection of reference profiles using your own gene signature(s), you first need to prepare the query files in the same format as the examples in the folder "queries", one file for each gene signature. The files should be tab-delimited text, and gene IDs should be represented by Affymetrix HG-U133A probe-set IDs, as these are the IDs used in the default reference profiles. If your gene IDs are not already Affymetrix HG-U133A probe-set IDs, please use the corresponding Affymetrix annotation files to map them to these IDs.
The sscMap software can be extended by adding custom collections of reference profiles. Tour 2 uses a small example of such a custom extension. The section after Tour 2 gives a detailed description of the general contracts for adding a custom collection of reference profiles to sscMap.

Tour 1: Querying the bundled collection of reference profiles with 5 gene signatures

(1) Launch the program
Go to the program's main folder "sscMap"; double click the "sscmap-gui.bat" file to launch the program. If you are using Unix or Linux, you may first need to add execute permission to the file "sscmap-gui" and then start the program. The commands to type are:
    chmod +x sscmap-gui
    sscmap-gui
A Graphical User Interface similar to the one shown below should appear.

(2) Clearing up for a fresh start
Click the "Gene Signatures" menu on the menu bar, and then select the "Remove All Gene Signatures" item to clear the Gene Signature list left over from a previous run.

(3) Loading default settings
Click the menu "Setting Parameters" on the menu bar, and then select the "Load Default Settings" item.
This will bring up the "sscMap Settings" window, showing the current settings (the default settings in this case).
As we are going to use the default settings in this case, there is no need to change anything; just press the "OK" button to close the window.

(4) Load the gene signatures
Click the "Gene Signatures" menu on the menu bar; select the item "Import Gene Signature(s)".
Browse to the folder "queries" under the program main folder "sscMap", and select the five Gene Signature files there to load them into the gene signature list (multiple selection is allowed). These five Gene Signature files are: "Estrogen.sig", "HDACs.sig", "Immunosuppress.sig", "random01.sig", and "random02.sig".
They will appear as five nodes under the "Gene Signatures" node, as shown below.
Now we are ready to run the queries.

(5) Run the queries
Click the "Run Queries" menu on the menu bar, and select "Run queries with current settings".
A progress monitor window will pop up shortly, indicating the progress being made. In this example, as shown below, the program is calculating 18690 connection scores and estimating 18690 p-values. This takes a couple of hours on a typical present-day desktop computer, as the estimation of p-values is the most time-consuming part of the calculation (2 hours and 7 minutes on my laptop computer with an Intel Core Duo T2300E / 1.66 GHz processor; 1 hour and 26 minutes on my desktop computer with an AMD Athlon 64 X2 6000+ AM2 3.0GHz processor).
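The figure of 18690 is consistent with one connection score (and one p-value) per pair of gene signature and Refset: 5 gene signatures times the 3738 Refsets mentioned in step (6). The short sketch below only reproduces that arithmetic; the class and method names are hypothetical and are not part of sscMap.

```java
// Hypothetical helper for illustration only; not part of the sscMap code base.
class ScoreCount {

    // One connection score (and one p-value) per (gene signature, Refset) pair.
    static long totalScores(int nSignatures, int nRefsets) {
        return (long) nSignatures * nRefsets;
    }

    public static void main(String[] args) {
        // 5 gene signatures from Tour 1, 3738 Refsets queried in this run.
        System.out.println(totalScores(5, 3738)); // prints 18690
    }
}
```

The same product gives a rough idea of how run time grows if you load more gene signatures or query a larger custom collection.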
Once the calculation is completed, the "Results" node will be populated with five result nodes, one for each gene signature in the Gene Signature list. A graph showing the connection scores and p-values will be displayed on the right. The caption under the graph summarizes the results shown in the graph.
Pointing the mouse to a data point on the graph and pressing a mouse button will bring up a small window displaying the detailed information for that data point. When the mouse button is released, the information window will disappear.

(6) Viewing and interpreting the results
Let's take the HDACs gene signature as an example. Double clicking the result node "HDACs.sig.sscmap.plot" displays the graph for that node on the right, as shown below. The horizontal axis shows connection scores and the vertical axis shows Y = -log10(p-value). So if a data point on the plot has a Y-coordinate around 3, the original p-value is of the order 10^-3. The green horizontal line on the graph marks the threshold p-value. Any connection score with a p-value below that threshold (any data point above the green line on the graph) is considered statistically significant. The threshold p-value is set as alpha = N_falses / N_sets = 1 / 3738 ~= 0.00027, where N_falses=1 is the number of false connections the user is willing to tolerate for each gene signature, and N_sets=3738 is the number of Refsets being queried in this run, which is also the number of connection scores obtained for each gene signature. With the threshold set this way, we should expect on average N_falses=1 false connection among the significant connection scores. The expected number of false connections to tolerate (N_falses) can be viewed and/or changed in the "sscMap Settings" window, which opens when the "Setting Parameters" menu item is selected. Its default value is 1.
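The threshold and the plot transform described above can be checked with a few lines of Java. This is a minimal sketch under the definitions given in this step (alpha = N_falses / N_sets, Y = -log10(p-value)); the class and method names are made up for illustration and do not come from sscMap.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the sscMap code base.
class SignificanceThreshold {

    // Threshold p-value: expected number of tolerated false connections
    // divided by the number of Refsets queried.
    static double alpha(double nFalses, int nSets) {
        return nFalses / nSets;
    }

    // Vertical plot coordinate for a given p-value.
    static double plotY(double pValue) {
        return -Math.log10(pValue);
    }

    // A connection is called significant when its p-value is below alpha,
    // i.e. when its point lies above the green line.
    static boolean isSignificant(double pValue, double alpha) {
        return pValue < alpha;
    }

    public static void main(String[] args) {
        double a = alpha(1, 3738);
        System.out.printf(Locale.ROOT, "alpha = %.5f%n", a);               // alpha = 0.00027
        System.out.printf(Locale.ROOT, "green line at Y = %.2f%n", plotY(a)); // about 3.57
        System.out.println(isSignificant(1e-4, a));                         // true
        System.out.println(isSignificant(1e-3, a));                         // false
    }
}
```

With the default N_falses=1 and 3738 Refsets, the green line therefore sits at roughly Y = 3.57, between the 10^-3 and 10^-4 levels.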
In the graph shown below, you may notice that the Y-coordinates of the data points become flat ("saturate") at around Y=4.0, indicating that the smallest p-values are of the order 10^-4 for the data shown here. This is because the number of random gene signatures generated for each p-value estimation was 10000, so the lower limit of the p-values that can be estimated is of the order 10^-4. The true p-values of those saturated data points are likely to be much smaller. But for the purpose of identifying significant connections using the current threshold (alpha=0.00027, the green line), this limit (10^-4) for p-value estimation is adequate. Increasing the number of random gene signatures for each p-value estimation to 100000, for example, would require a much longer computation time, while the set of significant connections identified by the longer run is not much different. So overall, using 10000 random gene signatures to estimate each p-value is probably the right balance to strike.
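The saturation can be illustrated with a generic permutation-style estimate. The sketch below is an assumption, not the sscMap implementation: it takes the p-value to be the fraction of random scores at least as extreme (in absolute value) as the observed score, and it assumes a floor of 1/N when no random score is that extreme, which is one way to obtain the plateau at Y=4.0 with N=10000. The exact estimator is the one defined in [1].

```java
// Hypothetical sketch for illustration only; the estimator actually used by
// sscMap is defined in [1] and may differ in detail.
class PValueFloor {

    // Two-sided empirical p-value: fraction of random scores whose absolute
    // value is at least |observed|. ASSUMPTION: a count of zero is reported
    // as 1/N rather than 0, so -log10(p) stays finite.
    static double estimatePValue(double observed, double[] randomScores) {
        int asExtreme = 0;
        for (double r : randomScores) {
            if (Math.abs(r) >= Math.abs(observed)) {
                asExtreme++;
            }
        }
        return Math.max(asExtreme, 1) / (double) randomScores.length;
    }

    public static void main(String[] args) {
        // 10000 made-up random scores spread evenly over [-1, 1).
        double[] randomScores = new double[10000];
        for (int i = 0; i < randomScores.length; i++) {
            randomScores[i] = (i - 5000) / 5000.0;
        }
        // An observed score beyond every random score hits the floor 1/10000.
        double p = estimatePValue(2.0, randomScores);
        System.out.println(p); // 1.0E-4, i.e. Y = 4 on the plot
    }
}
```

Under this assumption, raising the number of random gene signatures to 100000 would lower the floor to 10^-5 (Y=5) at about ten times the computing cost.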
There are three buttons at the bottom of the graph. Pressing the "Show Raw Scores" button will show the original connection scores defined by Equation (6) in [1], and pressing the "Show Standardized Scores" button will show the standardized connection scores. The standardized connection score is simply the original connection score divided by the sample standard deviation of the random scores generated during the p-value estimation.
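As a rough illustration of the standardization just described, the sketch below divides a raw score by the sample standard deviation of a set of random scores. It assumes the usual n-1 denominator taken about the sample mean; the class name, method names, and numbers are made up and are not taken from sscMap.

```java
// Hypothetical sketch for illustration only; not part of the sscMap code base.
class StandardizedScore {

    // Sample standard deviation (n - 1 denominator) about the sample mean.
    static double sampleStdDev(double[] values) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSq / (values.length - 1));
    }

    // Raw connection score divided by the spread of the random scores.
    static double standardize(double rawScore, double[] randomScores) {
        return rawScore / sampleStdDev(randomScores);
    }

    public static void main(String[] args) {
        // Made-up random scores; a real run would use the scores of the
        // random gene signatures generated during p-value estimation.
        double[] randomScores = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.println(sampleStdDev(randomScores));     // about 1.581
        System.out.println(standardize(3.0, randomScores)); // about 1.897
    }
}
```

Because the divisor is positive, standardization rescales the horizontal axis without changing the sign of any score.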
The "Save as .tab file" button allows users to export the results and save them to a tab-delimited text file, which can be opened and viewed using MS-Excel or other similar spreadsheet software.
In fact, two files were already saved in the "Results" folder for each gene signature when the calculation was completed. One is a "*.sscmap.tab" file, in tab-delimited text format, which basically lists the connection results of the gene signature to all the Refsets queried. This "*.sscmap.tab" file can be opened and viewed with MS-Excel or other similar spreadsheet software. It is the main result file for users to take away and to work on.
The other file, "*.sscmap.plot", is in binary format and can only be read by the current sscMap program. The "Load Saved Results" item under the "Load Results" menu allows users to load previously saved "*.sscmap.plot" files and display them in a graph similar to the one we have seen above.

(7) Exit the program
Double click the "Exit" node to exit the program, or click "File" on the menu bar and select "Quit".

Tour 2: Querying a custom collection of reference profiles with 2 gene signatures

(1) Launch the program
Go to the program's main folder "sscMap"; double click the "sscmap-gui.bat" file to launch the program. If you are using Unix or Linux, you may first need to add execute permission to the file "sscmap-gui" and then start the program. The commands to type are:
    chmod +x sscmap-gui
    sscmap-gui
A Graphical User Interface (GUI) similar to the one below should appear.

(2) Clearing up for a fresh start
Click the "Gene Signatures" menu on the menu bar, and then select the "Remove All Gene Signatures" item to clear the Gene Signature list left over from a previous run.
In this example, we are going to use a custom collection of reference profiles stored in a subfolder under "custom-example".
Click the button "1. Change the Reffiles Folder by choosing a reffile", and browse to the folder "custom-example/custom-reffiles" under the program's main folder "sscMap".
The naming of the reffiles in this folder suggests that the string -- is to be used as a Field Separator to divide a reffile name into several Name Fields. In this example, these Name Fields are: drug name, dose, tissue type, and time point.
So we type the Field Separator String -- (two hyphens) into the text box.
Click the button "3. Refresh Checkboxes for Reffile Fields"; the checkboxes will be updated as below.
Now we must decide which name fields to use for defining a Refset. A Refset is defined as a set of reffiles with the same selected Name Fields. In this example, we are going to use the drug name and tissue type to define a refset. This means that all reference profiles with the same drug and tissue type will be taken as forming a refset, disregarding the dose and time point.
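The grouping rule can be sketched in a few lines of Java. The reffile names below are invented for illustration (the actual files in "custom-example/custom-reffiles" may be named differently), the class is hypothetical rather than taken from sscMap, and the code is written for a recent JDK (Java 8 or later) rather than the Java 1.6 minimum of the program itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not part of the sscMap code base.
class RefsetGrouping {

    // Build a Refset name from the selected Name Fields of one reffile name.
    static String refsetName(String reffileName, String separator, int... selectedFields) {
        // Pattern.quote makes the separator a literal, not a regular expression.
        String[] fields = reffileName.split(Pattern.quote(separator));
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < selectedFields.length; i++) {
            if (i > 0) {
                name.append(separator);
            }
            name.append(fields[selectedFields[i]]);
        }
        return name.toString();
    }

    // Reffiles that agree on all selected Name Fields end up in the same Refset.
    static Map<String, List<String>> groupIntoRefsets(List<String> reffileNames,
                                                      String separator,
                                                      int... selectedFields) {
        Map<String, List<String>> refsets = new LinkedHashMap<>();
        for (String file : reffileNames) {
            String key = refsetName(file, separator, selectedFields);
            refsets.computeIfAbsent(key, k -> new ArrayList<>()).add(file);
        }
        return refsets;
    }

    public static void main(String[] args) {
        // Made-up names following the pattern drug--dose--tissue--time.
        List<String> reffiles = Arrays.asList(
                "drugA--10uM--liver--6h",
                "drugA--1uM--liver--24h",
                "drugA--10uM--kidney--6h",
                "drugB--10uM--liver--6h");
        // Select field 0 (drug name) and field 2 (tissue type).
        Map<String, List<String>> refsets = groupIntoRefsets(reffiles, "--", 0, 2);
        for (Map.Entry<String, List<String>> e : refsets.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In this made-up example the two drugA liver profiles fall into one Refset even though their doses and time points differ, while the drugA kidney and drugB liver profiles each form a Refset of their own.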
So we check the two Name Fields (Drug and Tissue) then click the Button "5. Refresh Refset Name".
We leave the p-value-related settings at their default values, and click the "OK" button to close the "sscMap Settings" window.
The gene signatures will appear as two nodes under the "Gene Signatures" node, as shown below. Now we are ready to run the queries.

(5) Run the queries
Click the "Run Queries" menu on the menu bar, and select "Run queries with current settings".
A progress monitor window will pop up shortly indicating the progress made.
Once the calculation is completed, the "Results" node will be populated with two result nodes, one for each gene signature in the Gene Signature list. A graph showing the connection scores and p-values will be displayed on the right. The caption under the graph summarizes the results shown in the graph. Double clicking any result node will display the graph for that node.

(6) Exit the program
Double click the "Exit" node to exit the program, or click "File" on the menu bar and select "Quit".

Adding custom reference profiles to sscMap
As an example, we have included with the sscMap program a folder called "custom-example", which contains all the key components of a custom extension to the application. Following the example provided, users should be able to build their own extension.
This section describes the general contracts for a custom collection of reference profiles to be added to sscMap. A custom ref-files directory must contain the following items:

Instructions for running sscMap as a command line
To run the program using the example queries already in the folder "queries":
(1) Open an example query in MS Excel as a tab-delimited text file to get familiar with the format of query files.
(2) Go to the program's main folder; double click the file "run-sscmap.bat" to start the program. If you are using Unix or Linux, you may first need to add execute permission to the file "run-sscmap" and then start the program. The commands to type are:
    chmod +x run-sscmap
    run-sscmap
(3) After the program ends, go to the "results" folder and open the corresponding "*.sscmap.tab" files with Excel to view the results.
To query sscMap with your own gene signature(s), first prepare the query signature files in the same format as the examples in the folder "queries", one file for each gene signature. The files should be tab-delimited text, and gene IDs should be represented by Affymetrix HG-U133A probe-set IDs.