STARNET 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data
© Jupiter et al; licensee BioMed Central Ltd. 2009
Received: 20 April 2009
Accepted: 14 October 2009
Published: 14 October 2009
Although expression microarrays have become a standard tool used by biologists, analysis of data produced by microarray experiments may still present challenges. Comparison of data from different platforms, organisms, and labs may involve complicated data processing, and inferring relationships between genes remains difficult.
STARNET 2 is a new web-based tool that allows post hoc visual analysis of correlations that are derived from expression microarray data. STARNET 2 facilitates user discovery of putative gene regulatory networks in a variety of species (human, rat, mouse, chicken, zebrafish, Drosophila, C. elegans, S. cerevisiae, Arabidopsis and rice) by graphing networks of genes that are closely co-expressed across a large heterogeneous set of preselected microarray experiments. For each of the represented organisms, raw microarray data were retrieved from NCBI's Gene Expression Omnibus for a selected Affymetrix platform. All pairwise Pearson correlation coefficients were computed for the expression profiles measured on each platform. These precompiled results were stored in a MySQL database, and supplemented by additional data retrieved from NCBI. A web-based tool allows user-specified queries of the database, centered at a gene of interest. The result of a query includes graphs of correlation networks, graphs of known interactions involving genes and gene products that are present in the correlation networks, and initial statistical analyses. Two analyses may be performed in parallel to compare networks, which is facilitated by the new HEATSEEKER module.
STARNET 2 is a useful tool for developing new hypotheses about regulatory relationships between genes and gene products, and has coverage for 10 species. Interpretation of the correlation networks is supported with a database of previously documented interactions, a test for enrichment of Gene Ontology terms, and heat maps of correlation distances that may be used to compare two networks. The list of genes in a STARNET network may be useful in developing a list of candidate genes to use for the inference of causal networks. The tool is freely available at http://vanburenlab.medicine.tamhsc.edu/starnet2.html, and does not require user registration.
Expression microarrays have become a widely used platform for assaying the differences in the transcriptomes of two experimental settings. While the technology has gained wide acceptance, the analysis of the data produced from a microarray experiment may yet present challenges to experimentalists. This is the case both for array experiments performed in-house by individual labs, and for retrospective analysis of array experiments that have been conducted elsewhere. The problem is exacerbated when considering comparisons between different experiments, platforms, and model organisms. Basic analysis of microarray experiments typically produces lists of differentially expressed genes. The central challenge of basic microarray analysis is thus to ascribe biological meaning to the members of the list of differentially expressed genes by inferring the relationships between these genes and the relationships between the genes and the experimental milieu. These problems are of crucial importance given that experiments are costly and time consuming, and given that public-domain databases such as the Gene Expression Omnibus (GEO) [1, 2] contain thousands of array experiments with potential for exploration by post hoc analysis.
A central motivation for creating the STARNET application was to leverage this tremendous resource of microarray data for the discovery of putative gene regulatory relationships and other biological interactions, prior to conducting additional costly wet lab experiments. This tool provides insights that may guide experimentation by fostering new hypotheses, or may provide additional support for previously formed hypotheses. The results may also be used to develop a preliminary list of genes to use as input for other regulatory network discovery and validation tools, such as those involving Bayesian inference or probabilistic Boolean networks.
Given a gene of interest provided by the user, STARNET mines precomputed correlations from a collection of microarray expression data, which we refer to hereafter as a data cohort, and builds a correlation network centered at that gene. The visual data is also presented as text and is supplemented by annotations that were retrieved from NCBI database tables.
A previous murine-only version of STARNET, which included both a full and developmental cohort of arrays, has been online since July 2007. The current effort 1) expands the coverage to ten different species, 2) allows cross-species comparisons, and 3) introduces a new tool, HEATSEEKER, for drawing false color maps comparing two selected networks. Additionally, the user interfaces for both STARNET 2 and its predecessor have been improved for greater ease of use, and the responses to user queries have been improved for better visual organization and navigation of the displayed results.
In this report we describe the construction and use of STARNET 2, describe the new HEATSEEKER module, and discuss the output produced by user queries. STARNET uses an approach that is uncommon in several ways. First, while there are numerous tools for the analysis of microarray data [4–12], there are relatively few tools that facilitate retrospective analysis or data mining of microarrays. Second, rather than attempt to identify differential gene expression for a narrow range of experimental questions, STARNET identifies gene pairs with high magnitude correlation across a large number of experiments, thus providing strong statistics that include confidence intervals. Third, although we have pre-selected the data cohorts for retrospective analysis, STARNET allows user control over the general size and topology of the networks produced, and performs an on-the-fly test of GO term enrichment for those networks, along with a database search of known interactions involving genes and gene products from the prescribed networks. Thus, while tools such as STRING and YEASTNET provide a data integration approach to assessing likely functional protein interactions, STARNET better facilitates exploratory analysis of selected data cohorts with finer control over general network size and topology. Moreover, previous approaches that have performed large-scale retrospective analyses have not always supplied a database for searching and reviewing their results, apart from supplying large data files as supplementary materials. Finally, HEATSEEKER enhances the analysis provided by STARNET by allowing users to directly compare the networks produced by two different data cohorts, which includes a provision for comparing data from two different species. HEATSEEKER makes an unbiased comparison by combining the lists from both networks and then comparing only those genes that share orthologues on both platforms.
HEATSEEKER will thus provide insight into the differential wiring of gene regulatory networks among different species. This combination of uncommon attributes marks STARNET 2 as a unique and powerful tool for accelerating discovery of gene regulatory networks.
Table: Expression microarray data represented in STARNET 2 (columns: Full Cohort Arrays; Development Cohort Arrays; Genes on Array).
The set of correlation coefficients thus derived has a large memory footprint and contains a large amount of data that is of little interest from our perspective (i.e., low magnitude correlations). Thus, this collection was trimmed in a variety of ways. First, the 100,000 highest magnitude positive and negative correlations for each cohort were extracted. As highly correlated groups of genes in a correlation network exhibit a high amount of interconnectedness, or cliquishness, this distribution does not necessarily include all genes on an array. To guarantee full coverage, we constructed another sub-distribution by extracting, gene by gene, the ten highest magnitude positive and negative correlations for each gene. This guarantees that each gene on the array is available for user queries. As described previously, other specialty distributions were also created, for more focused study on genes related to transcription and signaling.
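The trimming steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pairwise_pearson`, `top_global` and `top_per_gene` are hypothetical, the cut-off `k` is a parameter (the text uses 100,000 pairs per cohort and ten partners per gene), and the whole correlation matrix is assumed to fit in memory, which may not hold for a full array.

```python
import numpy as np

def pairwise_pearson(expr):
    """Pearson correlation between all rows of a genes x arrays matrix."""
    return np.corrcoef(expr)

def top_global(corr, k):
    """Return the k most positive and k most negative gene pairs (i < j)."""
    i, j = np.triu_indices_from(corr, k=1)   # each unordered pair once
    r = corr[i, j]
    order = np.argsort(r)                    # ascending by correlation
    neg = [(int(i[t]), int(j[t]), float(r[t])) for t in order[:k] if r[t] < 0]
    pos = [(int(i[t]), int(j[t]), float(r[t])) for t in order[::-1][:k] if r[t] > 0]
    return pos, neg

def top_per_gene(corr, k):
    """For every gene, keep its k most positive and k most negative partners.

    Because every gene gets its own entry, every gene stays queryable even
    if it never appears among the global top pairs.
    """
    kept = {}
    for g in range(corr.shape[0]):
        order = [int(h) for h in np.argsort(corr[g]) if h != g]  # skip self
        kept[g] = {
            "neg": [h for h in order[:k] if corr[g, h] < 0],
            "pos": [h for h in order[::-1][:k] if corr[g, h] > 0],
        }
    return kept
```

The global list favours tightly interconnected groups of genes, while the per-gene list trades some correlation strength for complete coverage, which is the distinction drawn in the text.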
Network construction algorithms were implemented in Perl. The user interface was built using Perl-CGI, and graphs are created on demand using the GRAPHVIZ package available from AT&T (http://www.graphviz.org). HEATSEEKER false-color maps are created on demand using R/BioConductor.
On the STARNET 2 webpage (http://vanburenlab.medicine.tamhsc.edu/starnet2.html) the user enters a gene of interest as either an Entrez Gene ID or gene symbol, and selects either one or two data cohorts to examine. The user selects how many network levels to draw (l), and how many connections are to be made per level (n). Connections are then drawn between the gene of interest (Level 0) and the n genes with the highest magnitude correlations of co-expression with the gene of interest (Level 1 genes). Connections are then drawn from the Level 1 genes to Level 2 genes, and so on, until l levels have been built as the user specified. Further options for network topology specification and alternate sub-distributions of correlations are available, and are detailed in the documentation available on the webpage, http://www.vanburenlab.medicine.tamhsc.edu/starnet2.html.
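The level-by-level construction can be sketched in Python as follows. This is an illustrative reading of the description, not the Perl code behind the site: `build_network`, `max_network_size` and the `neighbours` lookup are hypothetical names. The sketch also assumes that each gene in a level contributes its own n strongest partners, and that partners already in the network gain an edge but are not added again. Neither detail is confirmed by the text.

```python
def build_network(neighbours, seed, n, levels):
    """Expand a correlation network outward from a seed gene.

    neighbours maps a gene to a list of (partner, r) pairs; n is the number
    of connections drawn per gene and levels is the number of levels (l).
    Returns the level assigned to each gene and the undirected edges.
    """
    level_of = {seed: 0}
    edges = {}                      # frozenset({a, b}) -> correlation
    frontier = [seed]
    for level in range(1, levels + 1):
        next_frontier = []
        for gene in frontier:
            # Rank partners by magnitude, so strong negative correlations count.
            ranked = sorted(neighbours.get(gene, []),
                            key=lambda pair: abs(pair[1]), reverse=True)
            for partner, r in ranked[:n]:
                edges[frozenset((gene, partner))] = r
                if partner not in level_of:
                    level_of[partner] = level
                    next_frontier.append(partner)
        frontier = next_frontier
    return level_of, edges

def max_network_size(n, levels):
    """Upper bound on gene count: 1 + n + n**2 + ... + n**levels."""
    return sum(n ** k for k in range(levels + 1))
```

Under these assumptions, n = 5 and l = 2 give an upper bound of 1 + 5 + 25 = 31 genes. That matches the 31-gene limit quoted for the default settings, although the defaults themselves are not stated here.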
To aid exploratory analysis of the networks, data is also presented in a tabular format. Lists of genes and correlations are provided, with links to the Entrez Gene entries for each gene. Genes common to both networks and those highlighted with the GO search term are also listed with appropriate hyperlinks to external sites.
Interpretation of the correlation networks is further facilitated by (a) drawing and listing networks of known interactions involving the genes in each correlation network, and by (b) performing a hypergeometric test of GO term enrichment for the genes within each network, relative to the entire complement of gene features on the array on which they were assayed. Enriched GO terms are provided together with lists of the genes annotated by the respective terms, and the terms are linked to AMIGO for detailed reference. As with the correlation networks described above, nodes in the documented interaction networks are linked to Entrez Gene.
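As a worked illustration of the enrichment test, the following Python sketch computes the upper-tail hypergeometric probability for a single GO term. The function name and argument names are hypothetical, and the sketch applies no multiple-testing correction, since the text does not say whether STARNET 2 applies one.

```python
from math import comb

def hypergeom_enrichment_p(array_genes, annotated_on_array,
                           network_genes, annotated_in_network):
    """P(X >= annotated_in_network) under the hypergeometric null.

    array_genes:          gene features on the array (population size)
    annotated_on_array:   array genes annotated with the GO term
    network_genes:        genes in the correlation network (sample size)
    annotated_in_network: network genes annotated with the GO term
    """
    upper = min(annotated_on_array, network_genes)
    total = comb(array_genes, network_genes)
    tail = sum(
        comb(annotated_on_array, i)
        * comb(array_genes - annotated_on_array, network_genes - i)
        for i in range(annotated_in_network, upper + 1)
    )
    return tail / total
```

For example, with 10 genes on the array, 4 of them annotated, and a network of 3 genes containing 2 annotated genes, the tail probability is (36 + 4) / 120 = 1/3, so that term would not be considered enriched.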
Users may select any two of the available data cohorts for comparison, including comparisons between the 'full' cohort for an organism and that organism's 'development' cohort, as well as cross-species comparisons. This allows side-by-side comparison of the networks derived from orthologous genes in different species.
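The gene-list handling that underlies a cross-species comparison can be sketched as below. This is only an approximation of the comparison performed by the heat map module: the names `shared_gene_list` and `distance_matrix` are hypothetical, orthologues are assumed to be one-to-one, and "correlation distance" is taken to mean 1 - r, which the text does not define.

```python
import numpy as np

def shared_gene_list(net_a, net_b, orthologue_of):
    """Combine two network gene lists, keeping only genes with orthologues.

    orthologue_of maps a species-A gene to its species-B orthologue.
    Returns sorted (gene_a, gene_b) pairs drawn from either network.
    """
    reverse = {b: a for a, b in orthologue_of.items()}
    pairs = set()
    for a in net_a:
        if a in orthologue_of:
            pairs.add((a, orthologue_of[a]))
    for b in net_b:
        if b in reverse:
            pairs.add((reverse[b], b))
    return sorted(pairs)

def distance_matrix(genes, corr_lookup):
    """Matrix of 1 - r for every pair of genes (assumed distance measure)."""
    size = len(genes)
    dist = np.zeros((size, size))
    for i, g in enumerate(genes):
        for j, h in enumerate(genes):
            if i != j:
                dist[i, j] = 1.0 - corr_lookup(g, h)
    return dist
```

Building one matrix per cohort over the same ordered gene list gives two heat maps with matching rows and columns, so differences between them can be read cell by cell.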
Full documentation for STARNET 2 is available at http://vanburenlab.medicine.tamhsc.edu/starnet2_doc.html.
STARNET is a useful tool for discovery of putative gene regulatory networks. Such efforts are facilitated by the graphs of known interactions of genes and gene products that are supplied together with the correlation networks produced by STARNET. Known interactions are sometimes reflected within the correlation networks produced by STARNET, which supports the biological relevance of these networks. STARNET may thus be used to suggest new lines of research. Graphical depictions of data are often more useful than the same data presented in a table.
The notion of using correlations between the expression profiles to foster insight into gene function is neither contentious nor novel. However, in future studies it will be useful to assess STARNET from a quantitative perspective to evaluate its ability to recapitulate segments of known biological networks. This is an important area of inquiry, as it will give some insight about the extent to which edges in STARNET correlation networks may be used to predict regulatory relationships.
Recent efforts have suggested the utility of measuring changes in correlation as an important complement to measuring differential expression in microarray experiments, as changes in correlation are indicative of differential wiring of regulatory networks [3, 23, 24]. In the first version of STARNET, differential wiring could be crudely assessed between a correlation network built from heterogeneous data sets, and a correlation network derived from a smaller subset of the data related to mouse heart development. With the cross-species capabilities introduced in STARNET 2, users may now consider using knowledge of one species to supplement knowledge of regulatory networks in other species, and may use STARNET 2 to develop new hypotheses regarding differential wiring between species, and for four of those species, between a large heterogeneous data set and a smaller data set related to development. Additionally, the HEATSEEKER module is a first step towards a more careful and unbiased comparison of the networks derived from two different data cohorts.
STARNET 2 presents an intuitive, fast, and free way to produce preliminary impressions of gene regulatory relationships. Other methods for similar types of analysis are available. For example, clustering methods [4, 25–27] offer a simple way to group genes into modules of (potentially) interacting and interrelated genes. These results are qualitative, and lack any indication of how interactions within a module occur. At the other extreme, methods involving ordinary differential equations offer a much higher resolution view of regulatory networks. However, these methods require some preliminary knowledge of the network being modeled. Lying between these extremes, Bayesian networks [28–33] provide both qualitative and quantitative data. This class of techniques is both theoretically and computationally expensive, and often employs heuristics to obtain the networks. These approaches also typically require time series data. STARNET 2 offers an attractive alternative: it produces both qualitative and quantitative data using a straightforward methodology that is highly accessible to experimental biologists. Furthermore, the default settings of STARNET 2 will generate a list of at most 31 correlated genes, and such lists may be a useful starting place for inferring causal networks using one of the other methods mentioned above, such as Bayesian inference.
Availability and Requirements
STARNET 2 and the associated HEATSEEKER module are freely available on the Web, and do not require user registration: http://vanburenlab.medicine.tamhsc.edu/starnet2.html
List of Abbreviations
GEO: Gene Expression Omnibus.
This work was supported by a National Scientist Development Grant from the American Heart Association (AHA SDG 0630263N, PI: VanBuren), an American Heart Association Postdoctoral Fellowship (AHA 0825110F, PI: Jupiter), and by start-up funds from the Dean of the College of Medicine and the Department of Systems Biology and Translational Medicine, Texas A&M Health Science Center (PI: VanBuren).
1. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 2007, 35: D760-D765. 10.1093/nar/gkl887
2. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. 10.1093/nar/30.1.207
3. Jupiter D, VanBuren V: A visual data mining tool that facilitates reconstruction of transcription regulatory networks. PLoS One 2008, 3: e1717. 10.1371/journal.pone.0001717
4. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
5. Grant GR, Manduchi E, Stoeckert CJJ: Analysis and management of microarray gene expression data. Curr Protoc Mol Biol 2007, Chapter 19: Unit 19.6.
6. Grewal A, Lambert P, Stockton J: Analysis of expression data: an overview. Curr Protoc Bioinformatics 2007, Chapter 7: Unit 7.1.
7. Hayden D, Lazar P, Schoenfeld D: Assessing statistical significance in microarray experiments using the distance between microarrays. PLoS One 2009, 4: e5838. 10.1371/journal.pone.0005838
8. Hedegaard J, Arce C, Bicciato S, Bonnet A, Buitenhuis B, Collado-Romero M, Conley LN, Sancristobal M, Ferrari F, Garrido JJ, Groenen MA, Hornshoj H, Hulsegge I, Jiang L, Jimenez-Marin A, Kommadath A, Lagarrigue S, Leunissen JA, Liaubet L, Neerincx PB, Nie H, Poel J, Prickett D, Ramirez-Boo M, Rebel JM, Robert-Granie C, Skarman A, Smits MA, Sorensen P, Tosser-Klopp G, Watson M: Methods for interpreting lists of affected genes obtained in a DNA microarray experiment. BMC Proc 2009, 3(Suppl 4):S5.
9. Hubble J, Demeter J, Jin H, Mao M, Nitzberg M, Reddy TB, Wymore F, Zachariah ZK, Sherlock G, Ball CA: Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 2009, 37: D898–901. 10.1093/nar/gkn786
10. Suarez E, Burguete A, Mclachlan GJ: Microarray data analysis for differential expression: a tutorial. P R Health Sci J 2009, 28: 89–104.
11. Xia XQ, McClelland M, Porwollik S, Song W, Cong X, Wang Y: WebArrayDB: Cross-platform microarray data analysis and public data repository. Bioinformatics 2009, 25(18):2425–2429. 10.1093/bioinformatics/btp430
12. Yi M, Mudunuri U, Che A, Stephens RM: Seeking unique and common biological themes in multiple gene lists or datasets: pathway pattern extraction pipeline for pathway-level comparative analysis. BMC Bioinformatics 2009, 10: 200. 10.1186/1471-2105-10-200
13. Bisognin A, Coppe A, Ferrari F, Risso D, Romualdi C, Bicciato S, Bortoluzzi S: A-MADMAN: annotation-based microarray data meta-analysis tool. BMC Bioinformatics 2009, 10: 201. 10.1186/1471-2105-10-201
14. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009, 37: D412–6. 10.1093/nar/gkn760
15. Lee I, Li Z, Marcotte EM: An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS One 2007, 2: e988. 10.1371/journal.pone.0000988
16. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302: 249–255. 10.1126/science.1087447
17. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2007, 35: D26-D31. 10.1093/nar/gkl993
18. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33: e175. 10.1093/nar/gni179
19. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249
20. Gentleman R, Carey VJ, Huber W, Irizarry R, Dudoit S: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer-Verlag; 2005.
21. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
22. Dougherty E: Validation of inference procedures for gene regulatory networks. Curr Genomics 2007, 8: 351–359. 10.2174/138920207783406505
23. Hu R, Qiu X, Glazko G, Klebanov L, Yakovlev A: Detecting intergene correlation changes in microarray analysis: a new approach to gene selection. BMC Bioinformatics 2009, 10: 20.
24. Hudson NJ, Reverter A, Dalrymple BP: A Differential Wiring Analysis of Expression Data Correctly Identifies the Gene Containing the Causal Mutation. PLoS Computational Biology 2009, 5(5):e1000382. 10.1371/journal.pcbi.1000382
25. Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P: 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 2000, 1: RESEARCH0003. 10.1186/gb-2000-1-2-research0003
26. Kaufman L, Rousseeuw PJ: Finding Groups in Data. New York: Wiley-Interscience; 1990.
27. Madeira SC, Oliveira AL: Biclustering Algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 2004, 1: 24–45. 10.1109/TCBB.2004.2
28. Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL: A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 2005, 21: 349–356. 10.1093/bioinformatics/bti014
29. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA: Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing 2002, 2002: 437–449.
30. Husmeier D: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 2003, 19: 2271–2282. 10.1093/bioinformatics/btg313
31. Nachman I, Regev A, Friedman N: Inferring quantitative models of regulatory networks from expression data. Bioinformatics 2004, 20(Suppl 1):i248-i256. 10.1093/bioinformatics/bth941
32. Rogers S, Khanin R, Girolami M: Bayesian model-based inference of transcription factor activity. BMC Bioinformatics 2007, 8(Suppl 2):S2. 10.1186/1471-2105-8-S2-S2
33. Sanguinetti G, Lawrence ND, Rattray M: Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. Bioinformatics 2006, 22: 2775–2781. 10.1093/bioinformatics/btl473
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.