DNA microarray data and contextual analysis of correlation graphs

Background DNA microarrays are used to produce large sets of expression measurements from which specific biological information is sought. Their analysis requires efficient and reliable algorithms for dimensional reduction, classification and annotation. Results We study networks of co-expressed genes obtained from DNA microarray experiments. The mathematical concept of curvature on graphs is used to group genes or samples into clusters to which relevant gene or sample annotations are automatically assigned. Application to publicly available yeast and human lymphoma data demonstrates the reliability of the method in spite of its simplicity, especially with respect to the small number of parameters involved. Conclusions We provide a method for automatically determining relevant gene clusters among the many genes monitored with microarrays. The automatic annotations and the graphical interface improve the readability of the data. A C++ implementation, called Trixy, is available from .


Background
Measurements of gene expression levels by microarray experiments create a high-throughput of data, the interpretation of which increasingly requires novel and efficient dimensionality reduction strategies. Many clustering methods have been proposed (see for example [1][2][3][4][5] and the more comprehensive reviews [6,7]) and are widely used. These algorithms group genes and/or samples into clusters of similar expression profiles, in order to suggest possible functional relationships between them. The importance of graphical representations and of automatic cluster annotations stands out from many recent publications [1,[8][9][10][11][12] devoted to gene functions prediction, prognosis or diagnosis of cancer subtypes for instance. Similar problems arise in the analysis of large interaction networks [13][14][15][16][17][18] where one tries to extract sub-networks satisfying some significance criteria. The problem of find-ing web pages dedicated to the same topic is an example that will appeal to the experience of every reader (in this case the network's nodes are the URLs, with HTML links).
We propose a new method which combines one of these network analysis techniques with the classical correlationbased clustering for studying DNA microarray data. It provides a novel graphical representation, a cluster forming rationale and cluster annotations through correlation with gene or sample keywords. The algorithm relies on only two user-controlled parameters, therefore sensitivity of the results to a particular choice of parameters can be checked effectively.
The algorithm is based on the notion of curvature introduced in [13] (this is the same as the clustering coefficient of [19]), which we apply to the network of co-expressed genes where nodes are genes (or samples) and links symbolize co-expression. We define clusters as connected regions of the graph with high curvature, which is the local density of triangular relations. The gene or sample clusters are the densest regions of the corresponding correlation graph, which we will show has biological relevance as intuitively expected. We must emphasize that curvature is typically extremely low in random graphs that have small average degree compared to the number of nodes (which is usually the case in biological networks [14,20]). Clusters of high curvature are thus highly nonrandom structures.
We have implemented these concepts in the freely available program Trixy. It is a graphical interface for visualising the graph, the clusters and the automatic annotations providing a straightforward tool for exploring microarray data. The C++ source code and sample Perl parsers are freely available from http://tagc.univ-mrs.fr/bioinformatics/ trixy.html. We also provide the data files adapted from the original yeast [1] and lymphoma [21] sets as examples. We have compiled and used the program on both Linux and Windows platforms. Compiling on other platforms has not been attempted but is theoretically possible.
On the performance side, clustering and display with Trixy requires CPU time and memory size comparable to hierarchical clustering as performed in [1].

Curvature on Graphs
The discussion below focuses on the problem of clustering genes. The symmetric question of clustering samples can be treated similarly.
A DNA microarray data set consists of expression levels of N genes in M different experimental conditions (M different RNA samples). This is organised in an N × M matrix X i,j , i = 1,...,N; j = 1,...,M each row of which contains the expression profile of a given gene across all samples. We are interested in patterns of co-expression, namely groups of genes with parallel or anti-parallel profiles. We measure co-expression of genes g k and gᐍ by the (Pearson) correlation cor(k, ᐍ) between their profiles: where µ i and σ i denote the mean and the standard deviation of row i. This creates a correlation matrix which is an N × N symmetric matrix (because cor(k, ᐍ) = cor(ᐍ, k)).
We construct a correlation graph as follows. We first make a node n for each gene. We then choose a threshold T cor ∈ [0,1] and draw a link between genes g k and gᐍ if cor(k, ᐍ) ≥ T cor . This can be understood as follows: a graph with N nodes is defined by its adjacency matrix A (the N × N matrix such that A i,j = 1 if i and j are joined, 0 otherwise [22]). We obtain A from the correlation matrix by binarisation: we replace cor(k, ᐍ) by 0 if it lies between -T cor and T cor and by 1 otherwise.
We next introduce the concept of curvature (or clustering coefficient) on a graph [13,19]. Each node n has a curvature which is a function of the number v of neighbours (nodes to which it is linked) and the number t of triangles (pairs of adjacent neighbours, see Figure 1), given by the formula Remark that v(v -1)/2 is the maximum number of triangles that can be drawn on v neighbours hence curv(n) lies between 0 and 1 if v > 1 and is undefined otherwise (see Figure 1 for examples of graphs and curvature).
There is a natural notion of distance between nodes in a graph [22]: it is the number of links in the shortest path connecting them (distance is infinite if there is no such path). Let d n (i, j) be the distance between the ith and jth neighbours of n: either d n (i, j) = 1 (these two neighbours are linked) or d n (i, j) = 2 (they are not, the shortest path goes through n). A simple computation shows that
(2). Bottom left: a complete graph, each node has curvature 1. Center: a tree, each node has curvature 0 or undefined. Right: the central node is a hub where <d n > is the average distance between pairs of neighbours of n. Hence one can picture high curvature as high local density (low average distances).
Given a curvature threshold T curv ∈ [0; 1] we select a subgraph by deleting all nodes with a curvature below T curv as well as their links. This splits the graph into several connected components. Each such connected component will be interpreted as a coherent cluster of co-expressed genes (see Figure 2). Varying T curv adds or removes nodes and links and thus modifies the clusters displayed (possibly merging or splitting some of the clusters). The reader can think of the graph as a sea bed, curvature being height. The curvature threshold T curv is the sea level and the clusters are the emerging islands. Changing the correlation threshold T cor changes the landscape while changing the curvature threshold T curv only moves the sea level up and down.
Consider the Internet analogy of a University web site: the index page has many links to all department's web pages. It is unlikely that, for instance, the biology department's page provides a link to the literature department's page. Therefore, the index page will surely have a small curvature (few of its neighbours have links between them). However, the home page of the biology department has external links to biology departments in other universities with which it has common interests. These other pages will certainly also have external links to many of the same pages, again because they share similar interests. Therefore, a cluster of high curvature will emerge, comprising all the biology departments web pages. This reasoning applies to virtually any communication network and we demonstrate below that it can also be usefully applied to correlation graphs of gene expression profiles.
The program Trixy implements the algorithm described above in a user-friendly graphical interface. It is written in C++ using the free Qt graphical library. It uses embedded Perl for parsing data inputs, which has the advantage that loading data saved under a new format only requires rewriting a Perl script which can be picked at run time. We have mostly used Trixy for clustering genes, but sample clustering can also be performed simply by using a modified parser which rotates the matrix. Similarly, an appropriate Perl script could simply fetch gene annotations from web servers such as http://www.geneontology.org/ rather than read them from a local file.

Normalisation and Parameters
A few simple data processing tools are provided in Trixy: log transform, samples centering (by subtracting the mean or the median) and samples reduction (division by the standard deviation). After these operations have been performed, the correlation matrix is computed and the curvature of each node is deduced from it. At this point, the user can view (using the "Eisengram" standard colour representation of the matrix, such as Figure 4) or save the resulting data set as a flat file. The graph is then built and displayed (as in Figure 2). Although the correlation threshold T cor is set before loading the data, the curvature threshold T curv can be varied as the graph is displayed.
Starting from an initially high value of T curv and lowering it progressively unveils new nodes and new clusters. It increases the size of existing clusters, sometimes merging several of them ( Figure 2). This gives a feeling for the robustness of the clustering and for the closeness of clusters.
Our advice for the choice of T cor is to set it to a value which retains only links significantly stronger than expected by pure chance (this depends on the particular data set and can be determined by bootstrapping, see e.g. [5]). The parameter T curv is different. We have observed that the best value is often cluster-dependent. We have a more dynamic view on this parameter: the way clusters change as T curv moves is informative. A good way of picking the best threshold is by maximising the annotation scores (see below).

Automatic Annotation
Trixy allows the user to provide annotation files for sample and genes. They consist of a list of keywords associated with each of the gene and/or sample names.
On the one hand, a cluster of genes can be associated with an over-represented gene keyword by giving a score to each annotation equal to its frequency in the cluster.
On the other hand, for sample annotations, a correlation score is computed. Suppose a cluster consists of genes g 1 ,..., g K . For each sample keyword W, we create a discriminating vector g 0 which takes the value 1 on each sample associated with W and -1 otherwise. The annotation score is the average absolute correlation with keyword .
Both scores yield numbers between 0 and 1, the closer to 1 the more significant the annotation. We discard annotations that were not present for at least 10% of the samples and 2 of the genes in each cluster. Each cluster can be selected and the corresponding data subset viewed (as a colour-coded table such as Figure 4). If annotations were provided, those with the highest scores are listed and the cluster can be saved as a data file, gene list or colour picture.

Yeast Gene Expression Data
We have applied our algorithm to the data set of gene expression of the budding yeast Saccharomyces cerevisiae available from the website http://rana.lbl.gov/EisenData.htm and described in [1]. We have used Gene Ontology gene annotations from the Saccharomyces Genome Database (SGD) [23]. The sample keywords were extracted from the original expression data file and in this case do not yield interpretable annotations (see the lymphoma section below for a more convincing example of the usefulness of sample annotations).
Even with threshold values as high as T cor = 0.90 and T curv = 0.70, we get clearly delineated clusters (see Figure 3). Note that only 263 out of a total of 6221 genes have positive curvature at this value of the correlation threshold, yielding 2075 links.
Most of the clusters obtained appear biologically coherent. For example the chromatin assembly cluster contains all the 9 histone genes for T cor = 0.80 and T curv = 0.64 ( Figure  3, cluster A: it only shows 7 of the genes at this level of correlation). It is disconnected from the rest of the graph and extremely robust with respect to changes in the parameters. The ubiquitin dependent protein catabolism (Figure 4) cluster appears at a much lower curvature but is extremely coherent with 17 out of 17 proteolysis genes. For the sake of comparison, a proteasome cluster of similar size obtained using hierarchical clustering contains 3 genes unrelated to proteolysis.
Much larger clusters are also visible, such as the protein biosynthesis cluster (Figure 3, cluster B and Figure 2). However, it is very sensitive to variations of the thresholds and can include over 900 genes.

Figure 3
The full graph based on Saccharomyces cerevisiae 6221 gene expression profiles across 80 experiments with thresholds T curv = 0.70 at T cor = 0.90

B-cell Lymphoma
We have also used the data set of the lymphoma study available from the website http://llmpp.nih.gov/lymphoma/ and published in [21]. Sample and gene annotations were extracted from the names included in the data file. We were in particular interested in the classification of the tumor subtypes called chronic lymphocytic leukaemia (CLL) and diffuse large B-cell lymphoma (DLCL).
Using a threshold T cor = 0.80 we obtain nicely discriminating clusters, for example Figure 5 which separates the CLL tumors and Figure 6 which sorts the DLCL tumors. The latter cluster eventually merges with a cluster of Interferon-induced genes (data not shown) as the curvature threshold decreases: they are like two hills on the same island.
We also have a good example of a property that is often observed with our graphical representation: negative correlations are much rarer than positive ones and are carried by just a few nodes, which are almost certainly represser genes. Figure 7 shows a small part of the graph where 7 nodes are mostly anti-correlated with the rest of a cluster of 307 genes. However, let us emphasize that the graphs shown here do not represent gene interaction networks per se, they are merely a means of clustering genes co-expressed within the selected samples.

Figure 4
The ubiquitin dependent protein catabolism cluster of yeast gene expression at T cor = 0.80 and T curv = 0.18

Figure 5
A cluster of genes under-expressed in the CLL tumors obtained from the lymphoma data with T cor = 0.80 and T curv = 0.40 This data set provides a good test bed for sample clustering. In this case, the graph's nodes are the samples and links denote a correlation between samples (columns of the expression level matrix X). As shown in Figure 8, the result is clear cut: all clusters are associated with a single sample subtype, be it B-cells, T-cells, FL (follicular lymphoma) or CLL.

Statistical Validation
As stated earlier, mathematical theory shows that random graphs have a very low number of triangles [14,20]. We can check this statement numerically by using random permutations of a real data set. We have randomly reordered the yeast data, gene by gene (a random permutation in each line), computed the curvature of each node, and repeated this operation 1000 times. At T cor = 0.6 we deduced a probability of positive curvature of 5 × 10 -5 , the probability of having a degree larger than 1 was 2 × 10 -4 (maximum degree observed: 6). At T cor = 0.7 the probability of positive curvature was 10 -6 . The comparison with the curvature distribution of the real biological data is displayed in Figure 9.
Statistical validation of an annotation by a particular keyword can be performed with similar methods. For example in the case of the clusters shown in Figure 5,6, we have performed 10,000 random permutations of the sample keywords. We have obtained a maximum annotation score of 0.315 for the CLL cluster and the mean score plus two standard deviations was equal to 0.177, whilst the annotation score of the original cluster was 0.510. Similarly for the DLCL cluster, the maximum score after 10, 000 permutations was 0.367, the mean score plus two standard deviations 0.172, whilst the original score was 0.729.
Permutations of gene keywords can be computed explicitely. For example 66 of the 6221 yeast genes have the GO annotation ubiquitin dependent protein catabolism. The probability of having 17 of them in the same cluster of size 17 (see Figure 4) is of the order of 10 -35 .

Discussion and Conclusions
We have described an algorithm for visualising and analysing large microarray data sets. It combines traditional correlation distances and new graph-theoretical ideas. We have implemented this algorithm in a convenient graphical interface and evaluated its performance on well established data sets.
Curvature thresholds split the graph into clusters which appear to be biologically meaningful. An automatic annotation procedure associates keywords with clusters, which are consistent with previous publications [1,21].

Figure 6
A cluster of genes over-expressed in the DLCL tumors obtained from the lymphoma data with T cor = 0.80 and T curv = 0.24 Our approach uses a local analysis of the correlation graph as opposed to global properties such as small-world [19] or scale-free properties [24]. Hubs [25] are not seen as relevant in our approach since they usually make little contribution to the information carried by the graph (like index pages in the worldwide web). The closest approach to ours is that of [14,15] in the sense that triangles are network motifs, but again, we only draw conclusions from their local density not their frequency at the level of the whole graph. This also emphasizes the difference between our method and the more commonly used hierarchical clustering [1]: because we focus on triangular relations (rare motif in graphs) rather than simple links (very common motif), we obtain a drastic dimensional reduction (see Figure 3) whereas hierarchical clustering retains all the data and does not in itself delineate clusters. Furthermore the stronger constraint offered by triangular links as opposed to single link methods ensure more coherent clusters.

Further Developments
Future development should include finer statistical analysis tools to validate the automatic annotations. In particular a bootstrap validation of a discriminant score [9] would be more accurate that the correlation score explained in the methods, which detects consistency with the annotation rather than actual discrimination. Also, more sophisticated methods for determining optimal annotations exist in the literature and could be applied to our clusters (see e.g. [11]).
A method for determining a natural correlation threshold T cor would be most welcome (such methods have been discussed in [26,27]). It would leave only one free parameter, the curvature threshold T curv Again a bootstrap calculation could provide an estimate of a significant deviation from average random correlation. It was also suggested to use a hierarchical construction of the graph: first use the strongest links (largest correlations) to build small clusters, then link clusters with weaker links and continue until all nodes belong to the same cluster. Varying T curv would subsequently split this unique cluster into significant parts. Memory and speed limitations may hamper these developments.

Figure 8
Clustering of the lymphoma samples with T cor = 0.60 and T curv = 0.30. Each cluster is associated with a cell type indicated in the grey boxes

Figure 9
Distribution of curvatures in the yeast (top) and randomised yeast (bottom) data with T cor = 0.6. Top: the mean curvature is 0.387 and 85% of nodes have positive curvature, bottom: cumulative distribution of curvatures after 1000 permutations in each of the 6221 genes. The proportion of nodes with non-zero curvature is 5 × 10 -5