DNA microarray data and contextual analysis of correlation graphs
© Rougemont and Hingamp; licensee BioMed Central Ltd. 2003
Received: 23 December 2002
Accepted: 29 April 2003
Published: 29 April 2003
DNA microarrays are used to produce large sets of expression measurements from which specific biological information is sought. Their analysis requires efficient and reliable algorithms for dimensional reduction, classification and annotation.
We study networks of co-expressed genes obtained from DNA microarray experiments. The mathematical concept of curvature on graphs is used to group genes or samples into clusters to which relevant gene or sample annotations are automatically assigned. Application to publicly available yeast and human lymphoma data demonstrates the reliability of the method in spite of its simplicity, especially with respect to the small number of parameters involved.
We provide a method for automatically determining relevant gene clusters among the many genes monitored with microarrays. The automatic annotations and the graphical interface improve the readability of the data. A C++ implementation, called Trixy, is available from http://tagc.univ-mrs.fr/bioinformatics/trixy.html.
Measurements of gene expression levels by microarray experiments generate data at high throughput, the interpretation of which increasingly requires novel and efficient dimensionality reduction strategies. Many clustering methods have been proposed (see for example [1–5] and the more comprehensive reviews [6, 7]) and are widely used. These algorithms group genes and/or samples into clusters of similar expression profiles, in order to suggest possible functional relationships between them. The importance of graphical representations and of automatic cluster annotations stands out from many recent publications [1, 8–12] devoted, for instance, to gene function prediction and to the prognosis or diagnosis of cancer subtypes.
Similar problems arise in the analysis of large interaction networks [13–18] where one tries to extract sub-networks satisfying some significance criteria. The problem of finding web pages dedicated to the same topic is an example that will appeal to the experience of every reader (in this case the network's nodes are the URLs, with HTML links).
We propose a new method which combines one of these network analysis techniques with the classical correlation-based clustering for studying DNA microarray data. It provides a novel graphical representation, a cluster forming rationale and cluster annotations through correlation with gene or sample keywords. The algorithm relies on only two user-controlled parameters, so the sensitivity of the results to a particular choice of parameters can be checked effectively.
The algorithm is based on the notion of curvature introduced in  (this is the same as the clustering coefficient of ), which we apply to the network of co-expressed genes where nodes are genes (or samples) and links symbolize co-expression. We define clusters as connected regions of the graph with high curvature, which is the local density of triangular relations. The gene or sample clusters are the densest regions of the corresponding correlation graph, which we will show has biological relevance as intuitively expected. We must emphasize that curvature is typically extremely low in random graphs that have small average degree compared to the number of nodes (which is usually the case in biological networks [14, 20]). Clusters of high curvature are thus highly non-random structures.
We have implemented these concepts in the freely available program Trixy. It is a graphical interface for visualising the graph, the clusters and the automatic annotations providing a straightforward tool for exploring microarray data. The C++ source code and sample Perl parsers are freely available from http://tagc.univ-mrs.fr/bioinformatics/trixy.html. We also provide the data files adapted from the original yeast  and lymphoma  sets as examples. We have compiled and used the program on both Linux and Windows platforms. Compiling on other platforms has not been attempted but is theoretically possible.
On the performance side, clustering and display with Trixy requires CPU time and memory size comparable to hierarchical clustering as performed in .
Curvature on Graphs
The discussion below focuses on the problem of clustering genes. The symmetric question of clustering samples can be treated similarly.
A DNA microarray data set consists of expression levels of N genes in M different experimental conditions (M different RNA samples). This is organised in an N × M matrix X_{i,j}, i = 1,...,N; j = 1,...,M, each row of which contains the expression profile of a given gene across all samples. We are interested in patterns of co-expression, namely groups of genes with parallel or anti-parallel profiles. We measure co-expression of genes g_k and g_ℓ by the (Pearson) correlation cor(k, ℓ) between their profiles:
\[ \mathrm{cor}(k,\ell) = \frac{1}{M}\sum_{j=1}^{M}\frac{(X_{k,j}-\mu_k)\,(X_{\ell,j}-\mu_\ell)}{\sigma_k\,\sigma_\ell}, \]
where μ_i and σ_i denote the mean and the standard deviation of row i. This creates a correlation matrix which is an N × N symmetric matrix (because cor(k, ℓ) = cor(ℓ, k)).
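As an illustration of this step, the following minimal Python sketch computes the Pearson correlation matrix of a small expression matrix. It is not taken from Trixy: the function names `pearson` and `correlation_matrix` and the toy data are assumptions made for the example, and the standard deviation is taken with a 1/M normalisation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Uses the population (1/M) standard deviation; a constant profile
    has zero standard deviation and is not handled here.
    """
    m = len(x)
    mu_x, mu_y = sum(x) / m, sum(y) / m
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / m)
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / m)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / m
    return cov / (sd_x * sd_y)

def correlation_matrix(X):
    """N x N matrix of correlations between the rows (genes) of X."""
    n = len(X)
    return [[pearson(X[k], X[l]) for l in range(n)] for k in range(n)]

# Hypothetical toy data: 3 genes measured in 4 samples.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # parallel to gene 0
    [4.0, 3.0, 2.0, 1.0],  # anti-parallel to gene 0
]
C = correlation_matrix(X)
for row in C:
    print([round(c, 3) for c in row])
```

Parallel profiles give a correlation close to 1 and anti-parallel profiles a correlation close to -1, which is why both kinds of pattern can be detected from the same matrix.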
We construct a correlation graph as follows. We first make a node n for each gene. We then choose a threshold Tcor ∈ [0,1] and draw a link between genes g_k and g_ℓ if |cor(k, ℓ)| ≥ Tcor. This can be understood as follows: a graph with N nodes is defined by its adjacency matrix A (the N × N matrix such that A_{i,j} = 1 if i and j are joined, 0 otherwise). We obtain A from the correlation matrix by binarisation: we replace cor(k, ℓ) by 0 if it lies strictly between -Tcor and Tcor and by 1 otherwise.
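The binarisation can be sketched in a few lines of Python. The function name `adjacency` and the toy correlation values are hypothetical, and leaving the diagonal at 0 (no link from a gene to itself) is an assumption of this sketch rather than something stated in the text.

```python
def adjacency(C, t_cor):
    """Binarise a correlation matrix C into an adjacency matrix.

    Two distinct genes are linked when the absolute value of their
    correlation reaches the threshold t_cor. Self-links are excluded
    (an assumption of this sketch).
    """
    n = len(C)
    return [
        [1 if i != j and abs(C[i][j]) >= t_cor else 0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical correlation matrix for 4 genes.
C = [
    [1.00, 0.90, -0.85, 0.10],
    [0.90, 1.00, -0.70, 0.20],
    [-0.85, -0.70, 1.00, 0.05],
    [0.10, 0.20, 0.05, 1.00],
]
A = adjacency(C, t_cor=0.8)
for row in A:
    print(row)
```

With a threshold of 0.8, gene 0 is linked to gene 1 (strong positive correlation) and to gene 2 (strong negative correlation), while gene 3 stays isolated.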
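Let n be a node with v neighbours and let t be the number of triangles that have n as a vertex, that is, the number of pairs of neighbours of n which are themselves linked. The curvature of n is defined as
\[ \mathrm{curv}(n) = \frac{t}{v(v-1)/2}. \]
Remark that v(v - 1)/2 is the maximum number of triangles that can be drawn on v neighbours, hence curv(n) lies between 0 and 1 if v > 1 and is undefined otherwise (see Figure 1 for examples of graphs and curvature).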
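A minimal Python sketch of this quantity follows. It assumes that the curvature of a node is the number of linked pairs among its neighbours divided by v(v - 1)/2, as the normalisation above suggests; the function name `curvature` and the toy graph are hypothetical.

```python
from itertools import combinations

def curvature(adj, n):
    """Curvature of node n in a graph given by its adjacency matrix.

    Returns the number of triangles through n divided by v(v - 1)/2,
    where v is the number of neighbours of n, or None when v < 2
    (the curvature is then undefined).
    """
    neighbours = [j for j, a in enumerate(adj[n]) if a and j != n]
    v = len(neighbours)
    if v < 2:
        return None
    t = sum(1 for i, j in combinations(neighbours, 2) if adj[i][j])
    return 2.0 * t / (v * (v - 1))

# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
for node in range(4):
    print(node, curvature(adj, node))
```

Node 1 has two neighbours that are linked to each other, so its curvature is 1; node 0 has three neighbours of which only one pair is linked, so its curvature is 1/3; node 3 has a single neighbour and its curvature is undefined.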
There is a natural notion of distance between nodes in a graph: it is the number of links in the shortest path connecting them (distance is infinite if there is no such path). Let d_n(i, j) be the distance between the i-th and j-th neighbours of n: either d_n(i, j) = 1 (these two neighbours are linked) or d_n(i, j) = 2 (they are not, the shortest path goes through n). A simple computation shows that
\[ \mathrm{curv}(n) = 2 - \langle d_n \rangle, \]
where ⟨d_n⟩ is the average distance between pairs of neighbours of n. Hence one can picture high curvature as high local density (low average distances).
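The computation can be spelled out as follows. This is a sketch under the assumption that curv(n) is the number t of linked pairs of neighbours of n divided by v(v - 1)/2, where v is the number of neighbours; the symbol t is introduced here for illustration. Among the v(v - 1)/2 pairs of neighbours, t pairs are at distance 1 and the remaining pairs are at distance 2, so that

```latex
\begin{align*}
\langle d_n \rangle
  &= \frac{1 \cdot t + 2\left(\frac{v(v-1)}{2} - t\right)}{\frac{v(v-1)}{2}}
   = 2 - \frac{2t}{v(v-1)}
   = 2 - \mathrm{curv}(n).
\end{align*}
```

For instance, a node with three neighbours of which a single pair is linked has distances 1, 2 and 2 between its neighbours, hence an average distance of 5/3 and a curvature of 1/3.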
Consider the Internet analogy of a University web site: the index page has many links to all departments' web pages. It is unlikely that, for instance, the biology department's page provides a link to the literature department's page. Therefore, the index page will surely have a small curvature (few of its neighbours have links between them). However, the home page of the biology department has external links to biology departments in other universities with which it has common interests. These other pages will certainly also have external links to many of the same pages, again because they share similar interests. Therefore, a cluster of high curvature will emerge, comprising all the biology departments' web pages. This reasoning applies to virtually any communication network and we demonstrate below that it can also be usefully applied to correlation graphs of gene expression profiles.
The program Trixy implements the algorithm described above in a user-friendly graphical interface. It is written in C++ using the free Qt graphical library. It uses embedded Perl for parsing data inputs, which has the advantage that loading data saved under a new format only requires rewriting a Perl script which can be picked at run time. We have mostly used Trixy for clustering genes, but sample clustering can also be performed simply by using a modified parser which rotates the matrix. Similarly, an appropriate Perl script could simply fetch gene annotations from web servers such as http://www.geneontology.org/ rather than read them from a local file.
Normalisation and Parameters
A few simple data processing tools are provided in Trixy: log transform, samples centering (by subtracting the mean or the median) and samples reduction (division by the standard deviation). After these operations have been performed, the correlation matrix is computed and the curvature of each node is deduced from it. At this point, the user can view (using the "Eisengram" standard colour representation of the matrix, such as Figure 4) or save the resulting data set as a flat file. The graph is then built and displayed (as in Figure 2). Although the correlation threshold Tcor is set before loading the data, the curvature threshold Tcurv can be varied as the graph is displayed. Starting from an initially high value of Tcurv and lowering it progressively unveils new nodes and new clusters. It increases the size of existing clusters, sometimes merging several of them (Figure 2). This gives a feeling for the robustness of the clustering and for the closeness of clusters.
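The effect of lowering Tcurv can be illustrated with a small Python sketch. It is not the Trixy implementation: it assumes that a cluster is a connected component of the subgraph formed by the nodes whose curvature is at least Tcurv, and the names `curvature`, `clusters` and `make_adjacency` as well as the toy graph are hypothetical.

```python
from itertools import combinations

def curvature(adj, n):
    """Triangles through n divided by v(v - 1)/2; None if v < 2."""
    neighbours = [j for j, a in enumerate(adj[n]) if a and j != n]
    v = len(neighbours)
    if v < 2:
        return None
    t = sum(1 for i, j in combinations(neighbours, 2) if adj[i][j])
    return 2.0 * t / (v * (v - 1))

def clusters(adj, t_curv):
    """Connected components among nodes with curvature >= t_curv."""
    kept = set()
    for n in range(len(adj)):
        c = curvature(adj, n)
        if c is not None and c >= t_curv:
            kept.add(n)
    found, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first search restricted to kept nodes
            node = stack.pop()
            component.append(node)
            for j in kept:
                if adj[node][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        found.append(sorted(component))
    return found

def make_adjacency(n_nodes, edges):
    """Symmetric adjacency matrix from a list of undirected edges."""
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

# Two triangles (0-1-2 and 3-4-5) joined through a hub (node 6).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 6), (3, 6)]
adj = make_adjacency(7, edges)
for t_curv in (0.5, 0.3, 0.0):
    print(t_curv, clusters(adj, t_curv))
```

In this toy graph, lowering the threshold from 0.5 to 0.3 enlarges the two clusters, and lowering it to 0 lets the hub (whose curvature is 0) through, which merges them into a single cluster.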
Our advice for the choice of Tcor is to set it to a value which retains only links significantly stronger than expected by pure chance (this depends on the particular data set and can be determined by bootstrapping, see e.g. ). The parameter Tcurv is different. We have observed that the best value is often cluster-dependent. We have a more dynamic view on this parameter: the way clusters change as Tcurv moves is informative. A good way of picking the best threshold is by maximising the annotation scores (see below).
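One possible way of estimating what "expected by pure chance" means is sketched below in Python. It is an assumption of this example, not the procedure used by the authors: instead of a bootstrap, each gene profile is shuffled independently across samples to build a null distribution of absolute correlations, and a high quantile of that distribution is proposed as Tcor. The function name `null_threshold`, its parameters and the random toy data are all hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    m = len(x)
    mu_x, mu_y = sum(x) / m, sum(y) / m
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / m)
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / m)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / m
    return cov / (sd_x * sd_y)

def null_threshold(X, quantile=0.99, n_rounds=20, seed=0):
    """Quantile of |cor| between independently shuffled gene profiles."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_rounds):
        # Shuffling each row separately destroys co-expression
        # while keeping the distribution of values of every gene.
        shuffled = [rng.sample(row, len(row)) for row in X]
        for k in range(len(shuffled)):
            for l in range(k + 1, len(shuffled)):
                null.append(abs(pearson(shuffled[k], shuffled[l])))
    null.sort()
    index = min(len(null) - 1, int(quantile * len(null)))
    return null[index]

# Hypothetical data: 15 genes, 12 samples, no real co-expression.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(15)]
t_cor = null_threshold(X)
print(round(t_cor, 3))
```

Links whose absolute correlation exceeds the returned value are then stronger than the chosen fraction of chance correlations for a data set of this shape; the quantile itself remains a user choice.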
Trixy allows the user to provide annotation files for samples and genes. They consist of a list of keywords associated with each of the gene and/or sample names.
On the one hand, a cluster of genes can be associated with an over-represented gene keyword by giving a score to each annotation equal to its frequency in the cluster.
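The frequency score for gene keywords can be sketched as follows in Python. The function name `keyword_scores`, the gene identifiers and the second keyword are hypothetical; the `min_genes` filter mirrors the rule, used in this work, of discarding keywords carried by fewer than 2 genes of the cluster.

```python
from collections import Counter

def keyword_scores(cluster, annotations, min_genes=2):
    """Frequency of each gene keyword within a cluster.

    cluster: list of gene names; annotations: dict mapping a gene name
    to its keywords. Keywords carried by fewer than min_genes genes of
    the cluster are discarded.
    """
    counts = Counter()
    for gene in cluster:
        counts.update(set(annotations.get(gene, ())))
    return {kw: c / len(cluster) for kw, c in counts.items() if c >= min_genes}

# Hypothetical gene names and annotations.
annotations = {
    "geneA": ["chromatin assembly"],
    "geneB": ["chromatin assembly"],
    "geneC": ["chromatin assembly", "hypothetical keyword"],
    "geneD": [],
}
cluster = ["geneA", "geneB", "geneC", "geneD"]
print(keyword_scores(cluster, annotations))
```

Three of the four genes carry the first keyword, which therefore scores 0.75; the second keyword appears only once in the cluster and is dropped.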
On the other hand, for sample annotations, a correlation score is computed. Suppose a cluster consists of genes g_1, ..., g_K. For each sample keyword W, we create a discriminating vector g_0 which takes the value 1 on each sample associated with W and -1 otherwise. The annotation score is the average absolute correlation with keyword W:
\[ \mathrm{score}(W) = \frac{1}{K}\sum_{k=1}^{K}\left|\mathrm{cor}(g_k, g_0)\right|. \]
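A minimal Python sketch of this score is given below, assuming that it is the mean over the genes of the cluster of the absolute Pearson correlation between each profile and the ±1 keyword vector. The function names and the toy profiles are hypothetical; the keywords CLL and DLCL are used only as labels for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    m = len(x)
    mu_x, mu_y = sum(x) / m, sum(y) / m
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / m)
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / m)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / m
    return cov / (sd_x * sd_y)

def keyword_vector(sample_keywords, keyword):
    """Discriminating vector: 1 where the sample has the keyword, else -1."""
    return [1.0 if keyword in kws else -1.0 for kws in sample_keywords]

def sample_annotation_score(cluster_rows, sample_keywords, keyword):
    """Average absolute correlation of the cluster profiles with the keyword.

    Undefined (division by zero) if the keyword is carried by all
    samples or by none, since the vector is then constant.
    """
    g0 = keyword_vector(sample_keywords, keyword)
    return sum(abs(pearson(row, g0)) for row in cluster_rows) / len(cluster_rows)

# Hypothetical toy data: 4 samples, 3 genes in the cluster.
sample_keywords = [{"CLL"}, {"CLL"}, {"DLCL"}, {"DLCL"}]
cluster_rows = [
    [2.0, 2.0, 0.0, 0.0],  # high in CLL samples
    [0.0, 0.0, 3.0, 3.0],  # low in CLL samples (anti-correlated)
    [1.0, 0.0, 1.0, 0.0],  # unrelated to the keyword
]
score = sample_annotation_score(cluster_rows, sample_keywords, "CLL")
print(round(score, 3))
```

Because the absolute value is taken, genes that are consistently high and genes that are consistently low in the annotated samples both contribute fully, while the unrelated third gene contributes nothing.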
Both scores yield numbers between 0 and 1, the closer to 1 the more significant the annotation. We discard annotations that were not present for at least 10% of the samples and 2 of the genes in each cluster.
Each cluster can be selected and the corresponding data subset viewed (as a colour-coded table such as Figure 4). If annotations were provided, those with the highest scores are listed and the cluster can be saved as a data file, gene list or colour picture.
Yeast Gene Expression Data
We have applied our algorithm to the data set of gene expression of the budding yeast Saccharomyces cerevisiae available from the website http://rana.lbl.gov/EisenData.htm and described in . We have used Gene Ontology gene annotations from the Saccharomyces Genome Database (SGD) . The sample keywords were extracted from the original expression data file and in this case do not yield interpretable annotations (see the lymphoma section below for a more convincing example of the usefulness of sample annotations).
Most of the clusters obtained appear biologically coherent. For example the chromatin assembly cluster contains all the 9 histone genes for Tcor = 0.80 and Tcurv = 0.64 (Figure 3, cluster A: it only shows 7 of the genes at this level of correlation). It is disconnected from the rest of the graph and extremely robust with respect to changes in the parameters. The ubiquitin dependent protein catabolism (Figure 4) cluster appears at a much lower curvature but is extremely coherent with 17 out of 17 proteolysis genes. For the sake of comparison, a proteasome cluster of similar size obtained using hierarchical clustering contains 3 genes unrelated to proteolysis.
We have also used the data set of the lymphoma study available from the website http://llmpp.nih.gov/lymphoma/ and published in . Sample and gene annotations were extracted from the names included in the data file. We were in particular interested in the classification of the tumor subtypes called chronic lymphocytic leukaemia (CLL) and diffuse large B-cell lymphoma (DLCL).
We also have a good example of a property that is often observed with our graphical representation: negative correlations are much rarer than positive ones and are carried by just a few nodes, which are almost certainly repressor genes. Figure 7 shows a small part of the graph where 7 nodes are mostly anti-correlated with the rest of a cluster of 307 genes. However, let us emphasize that the graphs shown here do not represent gene interaction networks per se; they are merely a means of clustering genes co-expressed within the selected samples.
Statistical validation of an annotation by a particular keyword can be performed with similar methods. For example, in the case of the clusters shown in Figures 5 and 6, we have performed 10,000 random permutations of the sample keywords. We have obtained a maximum annotation score of 0.315 for the CLL cluster and the mean score plus two standard deviations was equal to 0.177, whilst the annotation score of the original cluster was 0.510. Similarly for the DLCL cluster, the maximum score after 10,000 permutations was 0.367, the mean score plus two standard deviations 0.172, whilst the original score was 0.729.
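A permutation test of this kind can be sketched in Python as follows. The sketch assumes that the annotation score is the average absolute correlation between the cluster profiles and a ±1 keyword vector, and that a permutation simply reshuffles that vector across samples; the function names, the seed and the toy data are hypothetical and do not reproduce the figures quoted above.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    m = len(x)
    mu_x, mu_y = sum(x) / m, sum(y) / m
    sd_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / m)
    sd_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / m)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / m
    return cov / (sd_x * sd_y)

def sample_score(cluster_rows, labels):
    """Average absolute correlation of the profiles with a +1/-1 vector."""
    return sum(abs(pearson(row, labels)) for row in cluster_rows) / len(cluster_rows)

def permutation_null(cluster_rows, labels, n_perm=1000, seed=0):
    """Scores obtained after randomly permuting the sample labels."""
    rng = random.Random(seed)
    return [
        sample_score(cluster_rows, rng.sample(labels, len(labels)))
        for _ in range(n_perm)
    ]

# Hypothetical cluster of 2 genes over 8 samples; the keyword is
# carried by the first 4 samples.
cluster_rows = [
    [5.0, 6.0, 5.0, 6.0, 1.0, 2.0, 1.0, 2.0],
    [1.0, 1.0, 2.0, 2.0, 7.0, 8.0, 7.0, 8.0],
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

observed = sample_score(cluster_rows, labels)
null = permutation_null(cluster_rows, labels)
mean = sum(null) / len(null)
sd = math.sqrt(sum((s - mean) ** 2 for s in null) / len(null))
print("observed:", round(observed, 3))
print("null max:", round(max(null), 3), "mean + 2 sd:", round(mean + 2 * sd, 3))
```

With only 8 samples a random permutation can reproduce the original labelling, so the maximum of the null scores is not informative in this toy case; with realistic numbers of samples the observed score can be compared with both the maximum and the mean plus two standard deviations, as done in the text.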
Permutations of gene keywords can be computed explicitly. For example, 66 of the 6221 yeast genes have the GO annotation ubiquitin dependent protein catabolism. The probability of having 17 of them in the same cluster of size 17 (see Figure 4) is of the order of 10^-35.
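This order of magnitude can be checked with a short Python computation, assuming that the 17 genes of the cluster are drawn uniformly at random without replacement from the 6221 genes; the probability that all of them carry the keyword is then a ratio of two binomial coefficients. The function name `prob_all_annotated` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def prob_all_annotated(n_genes, n_annotated, cluster_size):
    """Probability that a random cluster contains only annotated genes.

    Equals C(n_annotated, cluster_size) / C(n_genes, cluster_size),
    i.e. the hypergeometric probability of drawing cluster_size
    annotated genes in cluster_size draws without replacement.
    """
    return comb(n_annotated, cluster_size) / comb(n_genes, cluster_size)

p = prob_all_annotated(n_genes=6221, n_annotated=66, cluster_size=17)
print(f"{p:.1e}")
```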
Discussion and Conclusions
We have described an algorithm for visualising and analysing large microarray data sets. It combines traditional correlation distances and new graph-theoretical ideas. We have implemented this algorithm in a convenient graphical interface and evaluated its performance on well established data sets.
Curvature thresholds split the graph into clusters which appear to be biologically meaningful. An automatic annotation procedure associates keywords with clusters, which are consistent with previous publications [1, 21].
Our approach uses a local analysis of the correlation graph as opposed to global properties such as small-world  or scale-free properties . Hubs  are not seen as relevant in our approach since they usually make little contribution to the information carried by the graph (like index pages in the worldwide web). The closest approach to ours is that of [14, 15] in the sense that triangles are network motifs, but again, we only draw conclusions from their local density, not their frequency at the level of the whole graph. This also emphasizes the difference between our method and the more commonly used hierarchical clustering : because we focus on triangular relations (a rare motif in graphs) rather than simple links (a very common motif), we obtain a drastic dimensional reduction (see Figure 3) whereas hierarchical clustering retains all the data and does not in itself delineate clusters. Furthermore, the stronger constraint offered by triangular links as opposed to single link methods ensures more coherent clusters.
Future development should include finer statistical analysis tools to validate the automatic annotations. In particular a bootstrap validation of a discriminant score  would be more accurate than the correlation score explained in the methods, which detects consistency with the annotation rather than actual discrimination. Also, more sophisticated methods for determining optimal annotations exist in the literature and could be applied to our clusters (see e.g. ).
A method for determining a natural correlation threshold Tcor would be most welcome (such methods have been discussed in [26, 27]). It would leave only one free parameter, the curvature threshold Tcurv. Again, a bootstrap calculation could provide an estimate of a significant deviation from average random correlation. It was also suggested to use a hierarchical construction of the graph: first use the strongest links (largest correlations) to build small clusters, then link clusters with weaker links and continue until all nodes belong to the same cluster. Varying Tcurv would subsequently split this unique cluster into significant parts. Memory and speed limitations may hamper these developments.
We thank D. Gautheret for critically reading a first version of the manuscript. J.R. is grateful to J.-P. Eckmann and E. Moses for useful comments on earlier versions of the program and for detailed explanations of their results. Part of this research was supported by the Temblor project EU grant QLRT-2001-00015.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96: 2907–2912. 10.1073/pnas.96.6.2907
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999, 22: 281–285. 10.1038/10343
- Gasch AP, Eisen MB: Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 2002, 3: Research0059. 10.1186/gb-2002-3-11-research0059
- Herrero J, Valencia A, Dopazo J: A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 2001, 17: 126–136. 10.1093/bioinformatics/17.2.126
- Shamir R, Sharan R: Algorithmic approaches to clustering gene expression data. In Current Topics In Computational Molecular Biology (Edited by: Jiang T, Xu Y, Smith T). 2002, 269–300.
- Tamames J, Clark D, Herrero J, Dopazo J, Blaschke C, Fernandez JM, Oliveros JC, Valencia A: Bioinformatics methods for the analysis of expression arrays: data clustering and information extraction. J Biotechnol 2002, 98: 269–283. 10.1016/S0168-1656(02)00137-2
- Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray data. Bioinformatics 2000, 16: 906–914. 10.1093/bioinformatics/16.10.906
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531
- Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics 2002, 31: 255–265. 10.1038/ng906
- Pe'er D, Regev A, Tanay A: Minreg: inferring an active regulator set. Bioinformatics 2002, 18: s258–s267.
- Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci USA 2002, 99: 12783–12788. 10.1073/pnas.192159399
- Eckmann JP, Moses E: Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Natl Acad Sci USA 2002, 99: 5825–5829. 10.1073/pnas.032093399
- Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 2002, 31: 64–68. 10.1038/ng881
- Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs: Simple building blocks of complex networks. Science 2002, 298: 824–827. 10.1126/science.298.5594.824
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 2001, 28: 21–28. 10.1038/88213
- Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000, 97: 12182–12186. 10.1073/pnas.220392197
- Yanai I, DeLisi C: The society of genes: networks of functional links between genes from comparative genomics. Genome Biology 2002, 3: Research0064.
- Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440–442. 10.1038/30918
- Collet P, Eckmann JP: The number of large graphs with a positive density of triangles. J Stat Phys 2002, 109: 923–943. 10.1023/A:1020489507547
- Alizadeh AA, Eisen MB, Davis RE, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–522. 10.1038/35000501
- Van Lint JH, Wilson RM: A course in combinatorics. Cambridge, Cambridge University Press 2001.
- Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al.: Saccharomyces genome database (SGD) provides secondary gene annotation using the gene ontology (GO). Nucleic Acids Res 2002, 30: 69–72. 10.1093/nar/30.1.69
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651–654. 10.1038/35036627
- Featherstone DE, Broadie K: Wrestling with pleiotropy: Genomic and topological analysis of the yeast gene expression network. BioEssays 2002, 24: 267–274. 10.1002/bies.10054
- Domany E: Cluster analysis of gene expression data. J Stat Phys 2003, 110: 1117–1139. 10.1023/A:1022148927580
- Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8: 93–103.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.