Exploratory and inferential analysis of gene cluster neighborhood graphs

Background: Many different cluster methods are frequently used in gene expression data analysis to find groups of co-expressed genes. However, cluster algorithms with the ability to visualize the resulting clusters are usually preferred. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results.

Results: In this paper recent extensions of the R package gcExplorer are presented. gcExplorer is an interactive visualization toolbox for the investigation of the overall cluster structure as well as single clusters. The different visualization options, including arbitrary node and panel functions, are described in detail. Finally, the toolbox can be used to investigate the quality of a given clustering graphically as well as theoretically by testing the association between a partition and a functional group under study.

Conclusion: It is shown that gcExplorer is a very helpful tool for a general exploration of microarray experiments. The identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased. Inferential analysis on a cluster solution is used to judge its ability to provide insight into the underlying mechanistic biology of the experiment.


Overview
This document is an additional file to the paper "Exploratory and Inferential Analysis of Gene Cluster Neighborhood Graphs" by Scharl, Voglhuber and Leisch submitted to BMC Bioinformatics where recent extensions of R package gcExplorer (Scharl and Leisch, 2009) are presented. This document will also be contained in the package as a vignette. Here we give the R code for the analysis described in the paper. Details about the different options and arguments can be found in the paper and in the help pages of the functions. gcExplorer depends on R package flexclust (Leisch, 2006) and Bioconductor package Rgraphviz (Carey et al., 2005).

Interactive exploration
First the E. coli PS19 data is clustered using the stochastic QT-Clust algorithm implemented in function qtclust of package flexclust.

Color coding
Further information can be added to the neighborhood graph by the use of color coding specified by argument node.function. Some examples of color coding are shown in Figure 3. The color theme can be modified using argument theme. In panel (a) cluster size is highlighted using function node.size, i.e., dark node symbols indicate large clusters and light node symbols indicate small clusters. A legend is added if the position of the legend is specified using argument legend.pos.
> gcExplorer(cl1, filt = 0, theme = "red",
+     node.function = node.tight,
+     legend.pos = "bottomright")
In panels (c) and (d) two functional groups are investigated. In panel (c), clusters with an accumulation of σ32-regulated genes, which are related to the heat shock response, are highlighted. The assignment of E. coli genes to sigma factors is given in data sigma. In this case node function node.go is used, and further arguments are passed using argument node.args: gonr is the name of the functional group under investigation, source.id and source.group contain the gene identifiers and their assigned groups for the organism, and id is the vector of identifiers for the clustered dataset.
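To make the idea of size-based color coding (dark symbols for large clusters, light symbols for small ones) concrete, the following base R sketch maps cluster sizes to a light-to-dark color ramp. It is an illustration only and not the implementation of node.size: the helper name size_to_colour, the two endpoint colors and the toy cluster sizes are all assumptions.

```r
# Illustrative sketch only: NOT the gcExplorer implementation of node.size.
# Maps cluster sizes to colours so that larger clusters get darker shades.
size_to_colour <- function(sizes, light = "mistyrose", dark = "darkred",
                           n = 100) {
  ramp <- grDevices::colorRampPalette(c(light, dark))(n)
  rng <- range(sizes)
  if (rng[1] == rng[2]) {
    # All clusters have the same size: use a single mid-ramp colour.
    idx <- rep(ceiling(n / 2), length(sizes))
  } else {
    # Rescale sizes linearly to ramp positions 1..n.
    idx <- 1 + round((sizes - rng[1]) / (rng[2] - rng[1]) * (n - 1))
  }
  stats::setNames(ramp[idx], names(sizes))
}

# Hypothetical cluster sizes (made-up values).
sizes <- c(cl_a = 5, cl_b = 40, cl_c = 12, cl_d = 80)
cols <- size_to_colour(sizes)
print(cols)
```

The returned vector holds one hex color per cluster and could, for example, be passed to any plotting routine that accepts a vector of fill colors.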

Node symbols
Another option for adding information to the display of the neighborhood graph is to use different graphical symbols for the representation of nodes. For that purpose gcExplorer makes use of R package symbols (http://r-forge.r-project.org/projects/symbols).
The most natural node symbols for time-course gene expression data are line plots showing the gene expression profiles of either the cluster centroids or the whole group of genes in a certain cluster.
> gcExplorer(cl1, filt = 0,
+     node.function = gmatplot,
+     doViewPort = TRUE)
Figure 4 gives a very good overview of the cluster solution and the single gene clusters, where similarities in gene expression profiles can be investigated directly. It can be seen that clusters containing down-regulated genes are located in the bottom left part of the graph, whereas clusters containing up-regulated genes are located in the right part of the graph. Further, there are no edges between clusters of up- and down-regulated genes.
Another example of node symbols is pie charts. Here is a user-defined grid pie function
+     radius = 1.1,
+     col = c("white", "skyblue"))
+ }
For demonstration purposes the F-statistic for differential expression of each gene is used here, where the proportion of genes with F-statistic ≤ 20 is given in white and the proportion of genes with F-statistic > 20 is given in skyblue (see Figure 5, left panel).
> f2 <- f < 20
> gcExplorer(cl1, filt = 0, theme = "blue",
+     node.function = gpie,
+     bgdata = as.data.frame(cbind(as.numeric(f2))),
+     doViewPort = TRUE)
> legend("topleft", inset = 0.05,
+     legend = c("F <= 20", "F > 20"),
+     fill = c("white", "skyblue"))
Grid-based boxplots can be used as node symbols using the following user-defined function.

Node modifications
In order to modify an existing graph the graph structure has to be saved.
> graph <- gcExplorer(cl1, filt = 0,
+     node.function = gmatplot,
+     doViewPort = TRUE)
Now the graph structure of object graph can be modified using function gcModify. In this example argument kpNodes is used to keep only the stated nodes.
> graph2 <- gcModify(graph1, zoom = "auto")
In the left panel of Figure 6 the subgraph is shown without a node function, setting argument doViewPort = FALSE. In the right panel the zoomed subgraph is shown.

Edge modifications
Filtering by cluster similarity can be used to simplify the original neighborhood graph. Edges between nodes are only drawn if the similarity of a cluster to another cluster is above a certain threshold, e.g., at least 10%. This prevents the graph from being too complex. Now the similarity matrix is modified.
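As a rough illustration of this filtering step, the base R sketch below sets all similarities under a threshold to zero. The 4 x 4 matrix is made up, the helper name filter_sim is hypothetical, and reading entry [i, j] as the similarity of cluster i to cluster j is an assumption. The names d1 and d2 merely mirror those used in the text; in the actual analysis d1 comes from the cluster object.

```r
# Illustrative sketch with a made-up, asymmetric similarity matrix.
# Assumption: entry [i, j] is the similarity of cluster i to cluster j.
d1 <- matrix(c(0.00, 0.25, 0.05, 0.00,
               0.30, 0.00, 0.12, 0.02,
               0.08, 0.15, 0.00, 0.40,
               0.00, 0.01, 0.35, 0.00),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("cl", 1:4), paste0("cl", 1:4)))

# Hypothetical helper: zero out all similarities below the threshold.
filter_sim <- function(sim, threshold) {
  sim[sim < threshold] <- 0
  sim
}

d2 <- filter_sim(d1, 0.1)  # keep edges with a similarity of at least 10%
d3 <- filter_sim(d1, 0.2)  # stricter filter, fewer edges remain

# Number of directed edges that would still be drawn at each level.
edge_counts <- c(d1 = sum(d1 > 0), d2 = sum(d2 > 0), d3 = sum(d3 > 0))
print(edge_counts)
```

Raising the threshold can only remove edges, so the edge counts are non-increasing from d1 to d3.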
In the analysis, d1 is the original cluster similarity matrix, which can be extracted from the cluster object using function clusterSim, d2 is the similarity matrix where all values smaller than 0.1 are set to 0, and so on. Again we save the original neighborhood graph to object graph. To modify the edges of an existing graph, function gcModify is used with argument clsim.

Inferential Analysis

Compare Cluster Solutions
Function comp_test is now used to test the goodness of the cluster solution obtained for the PS19 data when applied to the PS17 data, where the same set of genes was investigated under different experimental conditions.
> data(comp19)
> ct1 <- comp_test(comp17, clusters(cl1), N = 1000)
The result is shown in Table 1, which consists of the cluster size, the observed average within-cluster distance, the 5% quantile of the permuted average distances and the probability of observing a lower within-cluster distance ("p.val.lower") when genes are randomly assigned to clusters. In this case 10 out of 14 clusters have a significantly smaller within-cluster distance when using the cluster solution of the PS19 experiment compared to random assignment. These 10 groups of genes form tight clusters under both conditions and are therefore likely to be co-regulated.
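The idea behind such a permutation test can be sketched in base R as follows. This is not the code of comp_test; it is a minimal illustration under stated assumptions: Euclidean distances, the mean pairwise distance within a cluster as the test statistic, and synthetic data. The function names avg_within_dist and perm_within_dist and all object names are hypothetical.

```r
# Illustrative sketch only: NOT the implementation of comp_test.
# Assumes Euclidean distance and the mean pairwise distance within a cluster.
avg_within_dist <- function(dmat, members) {
  sub <- dmat[members, members, drop = FALSE]
  mean(sub[upper.tri(sub)])
}

perm_within_dist <- function(x, cluster, N = 1000) {
  dmat <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  observed <- sapply(ids, function(k) avg_within_dist(dmat, which(cluster == k)))
  # Each column holds the statistics for one random relabelling of the genes;
  # permuting the labels keeps all cluster sizes fixed.
  permuted <- replicate(N, {
    shuffled <- sample(cluster)
    sapply(ids, function(k) avg_within_dist(dmat, which(shuffled == k)))
  })
  data.frame(
    cluster = ids,
    size = as.vector(table(cluster)[as.character(ids)]),
    observed = observed,
    q05 = unname(apply(permuted, 1, quantile, probs = 0.05)),
    # Share of relabellings with a distance at most as large as observed.
    p.val.lower = rowMeans(permuted <= observed)
  )
}

set.seed(1)
# Synthetic "expression" data: two tight groups plus one group of pure noise.
x <- rbind(matrix(rnorm(20 * 5, mean = -2, sd = 0.3), ncol = 5),
           matrix(rnorm(20 * 5, mean =  2, sd = 0.3), ncol = 5),
           matrix(rnorm(20 * 5, mean =  0, sd = 2.0), ncol = 5))
cluster <- rep(1:3, each = 20)
res <- perm_within_dist(x, cluster, N = 200)
print(res)
```

In this toy setting the two tight groups should receive small values of p.val.lower, while the noisy group should not, which mirrors how the columns of Table 1 are read.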

Functional Relevance Test
Another possibility for external validation of a cluster solution is to test the functional relevance of single edges, i.e., to test the relationship between a functional grouping and a cluster solution. In this example the E. coli oxygen dataset (Covert et al., 2004) is used and the GO term GO:0009061 (anaerobic respiration) is investigated. The dataset is loaded and clustered into 43 clusters using qtclust.
> data(oxygen)
> set.seed(1111)
> cl2 <- qtclust(oxygen, radius = 3, save.data = TRUE,
+     control = list(min.size = 5))
> cl2
kccasimple object of family 'kmeans'
Function Group2Cluster is used to find the cluster membership of all genes involved in anaerobic respiration, and the functional relevance test is implemented in function edgeTest. An edge is only tested if the number of functionally related genes is above a predefined threshold given by argument min.size. Argument filt can be used to filter out edges whose similarity is below a predefined threshold.

> eT$quant
The accumulation of genes involved in anaerobic respiration is displayed in Figure 8, left panel. Here edge.method = "mean" is used to draw an undirected graph. In this case a different layout algorithm is selected using layout = "neato".
> clsim1 <- newclsim(eT = eT$res, object = cl2, p.filt = 0.05)
> gcModify(graph, clsim1)
In Figure 8, right panel, the modified neighborhood graph is displayed. It can be seen that clusters 32, 43, 36, 34, 21 and 22 contain most of the genes involved in anaerobic respiration and form a disconnected subgraph after testing the functional relevance of the edges.
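How strongly a functional group accumulates in individual clusters can be tabulated with a few lines of base R. The sketch below is an illustration under assumptions and does not reproduce Group2Cluster or edgeTest: the cluster labels and group membership are synthetic, the helper name group_per_cluster is hypothetical, and a one-sided Fisher exact test is swapped in as a simple per-cluster enrichment check.

```r
# Illustrative sketch only: synthetic data, hypothetical helper name.
# A one-sided Fisher exact test is used here as a stand-in enrichment check;
# it is NOT the test implemented in edgeTest.
group_per_cluster <- function(cluster, in_group) {
  ids <- sort(unique(cluster))
  out <- lapply(ids, function(k) {
    in_k <- cluster == k
    # 2 x 2 table: inside/outside cluster k versus inside/outside the group.
    tab <- table(factor(in_k, levels = c(TRUE, FALSE)),
                 factor(in_group, levels = c(TRUE, FALSE)))
    data.frame(cluster = k,
               size = sum(in_k),
               n.group = sum(in_k & in_group),
               p.enrich = fisher.test(tab, alternative = "greater")$p.value)
  })
  do.call(rbind, out)
}

# Four synthetic clusters of 30 genes each.
cluster <- rep(1:4, each = 30)
# 12 of the 15 group genes are placed in cluster 2, the rest elsewhere.
in_group <- rep(FALSE, 120)
in_group[c(31:42, 5, 70, 100)] <- TRUE

res <- group_per_cluster(cluster, in_group)
print(res)
```

Clusters with a high n.group and a small p.enrich are the ones in which the functional group accumulates; in the synthetic example this is cluster 2 by construction.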