Storing, linking, and mining microarray databases using SRS
© Veldhoven et al; licensee BioMed Central Ltd. 2005
Received: 06 May 2005
Accepted: 27 July 2005
Published: 27 July 2005
SRS (Sequence Retrieval System) has proven to be a valuable platform for storing, linking, and querying biological databases. Because a broad range of scientific databases is already available in SRS, it has become a useful platform for incorporating and mining microarray data, supporting both the analysis of specific biological questions and non-hypothesis-driven exploration. Here we report various solutions and tools for integrating and mining annotated expression data in SRS.
We devised an Auto-Upload Tool by which microarray data can be automatically imported into SRS. The dataset can be linked to other databases and user access can be set. We examined how comprehensively microarray platforms link to other platforms and biological databases within a network of scientific databases. The stored microarray data can also be made accessible to external programs for further processing. For example, we built an interface to a program called Venn Mapper, which collects its microarray data from SRS, processes the data by creating Venn diagrams, and saves the results for interpretation.
SRS is a useful database system to store, link and query various scientific datasets, including microarray data. The user-friendly Auto-Upload Tool makes SRS accessible to biologists for linking and mining user-owned databases.
The extraction of information from data generated by high-throughput experiments in genomics and proteomics has been likened to "attempting to drink from a fire hose". We are flooded with information on many levels such as whole genome DNA sequences, RNA expression, protein-protein interactions, protein modifications, and more. All this information is accessible in very different formats, ranging from well-organized curated gene sequences to unstructured free text in scientific literature. A system that can manage, link and query these heterogeneous types of datasets is therefore extremely valuable. The Sequence Retrieval System (SRS) is such a unified database system in which numerous different scientific databases have already been integrated.
Of special interest are data from high-throughput RNA expression microarrays [2, 3]. Many of these datasets are freely available and, like information stored in other scientific databases, come from different platforms [4, 5]. Integrating and mining these databases greatly facilitates the analysis of genes of interest and also supports the discovery of disease markers, drug targets and new knowledge in general [6–9]. One such platform is Oncomine, which has integrated many different microarray datasets, focussing on human cancer. Additionally, standardized microarray depositories such as GEO (Gene Expression Omnibus), ArrayExpress, and CIBEX do or will soon provide options to browse and query the datasets [14–17]. No doubt, other platforms will be developed focussing on the integration of microarray data. If started from scratch, these initiatives will likely be limited in their direct linkage to other heterogeneous biological databases due to the laborious task of making those connections and programming the single and batch-wise query options. The universality of SRS and the numerous scientific databases already integrated in it make it a useful platform for integrating microarray databases. Although the SRS interface for querying databases is quite user-friendly, other aspects of working with SRS are not. These include (i) uploading microarray datasets, (ii) database security including setting user access, (iii) linking databases, (iv) generating standard views, and (v) communication with other programs such as statistical and clustering software. The current SRS interface has a major disadvantage in that it is not designed to perform complex calculations on the fly. This means that any microarray dataset to be uploaded must have all ratio and statistical calculations performed upfront.
For example, once in SRS, one cannot change ratios from log10 to log2 or add an extra field per gene by dividing expression data of all "normal" by all "cancer" samples. However, software programs that perform calculations, statistical evaluations, clustering, protein domain predictions, homology searches, and more, can communicate with SRS. Interfaces can be generated that retrieve data from SRS, perform the required action and if desired, store the results in SRS. Alternatively, SRS allows direct integration of programs such as the BLAST and FASTA homology searches and the SRS-EMBOSS (European Molecular Biology Open Software Suite) tools [18, 19].
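Because a change of logarithm base cannot be made inside SRS, it has to be applied before upload. The following is a minimal sketch in Python of that conversion, assuming the ratios are held as plain floating-point numbers; the function name is illustrative and is not part of SRS or the Auto-Upload Tool.

```python
import math

def log10_to_log2(value: float) -> float:
    """Convert a log10 ratio to a log2 ratio using log2(x) = log10(x) / log10(2)."""
    return value / math.log10(2)

# A two-fold change equals log10(2) in base 10 and 1.0 in base 2.
two_fold_log10 = math.log10(2)
two_fold_log2 = log10_to_log2(two_fold_log10)
print(round(two_fold_log2, 6))
```

Applying such a conversion to every ratio column before import keeps all datasets in one common base.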
Generating a database in which heterogeneous datasets are integrated is a challenge in itself. However, retrieving statistically meaningful data by comparing datasets from different sources, platforms and designs is particularly difficult. There is a fast-growing body of publications on microarray cross-platform comparisons, mainly showing that this can be achieved in many different ways [8, 21–26]. Statistical evaluations of data within a single dataset with sufficient technical and biological replicates are better defined and can be implemented per dataset within a database system [27, 28]. The strategies and applications we discuss here to link, store and query scientific datasets in SRS do not go beyond processed individual datasets and do not include cross-platform dataset integrations. We assume that each uploaded dataset consists of high-quality data and has been processed correctly.
In this paper we describe strategies to incorporate microarray databases into SRS and provide a database upload tool. Using the program Venn Mapper as an example, we show that stored microarray data can be retrieved automatically from SRS for external statistical evaluation.
External programs accessing SRS: Venn Mapper for SRS
An important feature of storing and linking microarray data in SRS is the accessibility of the datasets to other programs. As an example, we generated a PHP web interface for the Venn Mapper program that retrieves microarray data from SRS to calculate the statistical significance of the number of co-occurring differentially expressed genes in any combination of two experiments. The functionality of the original Venn Mapper was enhanced by enabling the use of different ratio cut-offs for different microarray experiments. Upon login, the interface displays all microarray databases indexed in SRS to which the user has access. After the datasets are selected, a second screen shows all fields (such as individual array experiments or averaged group ratios) of the selected databases. The microarray experiments of interest can then be selected for Venn Mapper analysis, after which the requested data are linked and exported from SRS into the Venn Mapper program. The output of the program is available for viewing and downloading. Information requests from the interface to SRS are made through the SRS getz command. This powerful feature of SRS makes the integrated databases accessible to any external program.
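The idea of scoring the overlap between two experiments, each with its own ratio cut-off, can be sketched as follows. This is not the Venn Mapper implementation: a hypergeometric tail probability is used here as a stand-in statistic, and the gene names, ratios and cut-offs are invented for illustration.

```python
from math import comb

def differential(ratios: dict[str, float], cutoff: float) -> set[str]:
    """Genes whose absolute log2 ratio meets this experiment's own cut-off."""
    return {gene for gene, ratio in ratios.items() if abs(ratio) >= cutoff}

def overlap_p_value(n_total: int, n_a: int, n_b: int, n_both: int) -> float:
    """P(overlap >= n_both) if n_a and n_b genes were picked at random from n_total."""
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / comb(n_total, n_b)

# Hypothetical log2 ratios for two experiments (assumption, not real data).
exp_a = {"G1": 1.5, "G2": -1.2, "G3": 0.1, "G4": 0.2, "G5": 2.0, "G6": -0.1}
exp_b = {"G1": 0.9, "G2": -0.8, "G3": 0.0, "G4": 0.7, "G5": 0.1, "G6": 0.2}

# Only genes measured in both experiments can co-occur.
shared = exp_a.keys() & exp_b.keys()
diff_a = differential(exp_a, cutoff=1.0) & shared
diff_b = differential(exp_b, cutoff=0.6) & shared
both = diff_a & diff_b

p = overlap_p_value(len(shared), len(diff_a), len(diff_b), len(both))
print(sorted(both), p)  # ['G1', 'G2'] 0.5 for this toy data
```

With realistic gene counts the same calculation separates chance overlaps from overlaps that are unlikely to arise at random.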
Results and discussion
Preparing and linking of microarray databases in SRS
With respect to the microarray database set-up, there are two important considerations. First, in our experience, microarray data mining often starts with selecting genes based on their differential expression. Differential gene expression is best determined using statistical evaluation of the data based on sufficient technical and biological replicates [27, 28]. Depending on the statistical test and microarray platform, raw gene expression data and/or ratio calculations can be used. Second, making changes to a microarray dataset in SRS is impractical, so datasets should be fully built before they are imported. This means that raw expression data and ratio calculations should be normalised, flagged, and represented in a common format (such as log2). Importantly, the statistical evaluation should be included. For simplicity of representation, datasets can be summarised, for example as an average of all "normal" and all "cancer" samples, and additional fields such as log2 "normal/cancer" can be included.
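As an illustration of building such summary fields before import, the sketch below collapses replicate intensities for one gene into group averages, a log2 "normal/cancer" ratio and a test statistic. The field names and intensities are hypothetical, and Welch's t statistic is only one possible choice of statistical evaluation; positive, already normalised intensities are assumed.

```python
from math import log2, sqrt
from statistics import mean, variance

def summarise_gene(normal: list[float], cancer: list[float]) -> dict[str, float]:
    """Collapse replicates into group means, a log2 ratio and a Welch t statistic."""
    mean_n, mean_c = mean(normal), mean(cancer)
    # Welch's t: difference of the means over the unpooled standard error.
    std_err = sqrt(variance(normal) / len(normal) + variance(cancer) / len(cancer))
    return {
        "mean_normal": mean_n,
        "mean_cancer": mean_c,
        "log2_normal_over_cancer": log2(mean_n / mean_c),
        "welch_t": (mean_n - mean_c) / std_err,
    }

# Hypothetical normalised intensities for a single gene (assumption).
row = summarise_gene(normal=[100.0, 120.0, 110.0], cancer=[400.0, 440.0, 480.0])

# One tab-delimited line per gene, ready to be written to an upload file.
print("\t".join(f"{name}={value:.3f}" for name, value in row.items()))
```

Computing every derived field at this stage means nothing has to be recalculated once the dataset is inside SRS.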
Instead of or in combination with a coupling file, one can make use of the links provided in the various biological databases (Figure 2). For example, the GenBank accession code in the microarray dataset can be linked to the UniGene database. The LocusLink database is linked to the UniGene database through RefSeq accession numbers and also contains IDs that link to, for example, EMBL, OMIM, and SwissProt. In this way, almost all biological databases are linked through a network of direct and indirect connections. When multiple routes lead to the same database, SRS uses only one. Each link is assigned a value, and the route taken is the one with the lowest sum of link values, even when this results in a lower number of connected fields (genes). Although SRS can be forced to take a specified path, one should be wary of depending on many different databases: inconsistencies in and incompleteness of databases accumulate when links are followed in sequence.
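Choosing the route with the lowest sum of link values is a shortest-path problem. The sketch below uses Dijkstra's algorithm on a small, made-up link network; the link values and the directed layout are assumptions for illustration and do not reflect any actual SRS configuration or the way SRS computes routes internally.

```python
import heapq

def cheapest_route(links, start, goal):
    """Dijkstra's algorithm: return (total_value, path) with the lowest summed link value."""
    queue = [(0, start, [start])]
    settled = set()
    while queue:
        cost, db, path = heapq.heappop(queue)
        if db == goal:
            return cost, path
        if db in settled:
            continue
        settled.add(db)
        for neighbour, value in links.get(db, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + value, neighbour, path + [neighbour]))
    return None  # no route connects the two databases

# Hypothetical link values between databases (assumption).
LINKS = {
    "Microarray": {"UniGene": 1, "EMBL": 5},
    "UniGene": {"LocusLink": 1},
    "LocusLink": {"EMBL": 1, "OMIM": 1, "SwissProt": 1},
    "EMBL": {"SwissProt": 3},
}

print(cheapest_route(LINKS, "Microarray", "EMBL"))
# (3, ['Microarray', 'UniGene', 'LocusLink', 'EMBL'])
```

In this toy network the three-step route wins over the direct link because its summed value is lower, which mirrors the point that the selected route is not necessarily the one connecting the most genes.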
Linking to the network of biological databases through a coupling file has various advantages. One can establish directly validated links to each database, including databases outside the network. In addition, errors in links can easily be corrected. Platform-specific or overall coupling files can be retrieved from the Affymetrix website and from sources such as Resourcerer, KARMA, GeneHopper, ProbeMatchDB, DAVID, EnsMart, and Source. Using these resources, microarray datasets from different platforms can be linked. This includes connecting cross-species datasets using ortholog converters such as HOMGL and HOMOLOGENE.
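How a coupling file connects two platforms can be sketched as follows. The two-column tab-delimited layout (probe ID, gene identifier), the probe names and the UniGene-style identifiers are all assumptions made for this example; actual coupling files from the sources listed above differ in format.

```python
import csv
import io

def read_coupling(text: str) -> dict[str, str]:
    """Parse tab-delimited 'probe_id<TAB>gene_id' lines into a probe -> gene mapping."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Probes without a gene identifier cannot be linked and are skipped.
    return {row[0]: row[1] for row in reader if len(row) >= 2 and row[1]}

def link_platforms(coupling_a: dict[str, str], coupling_b: dict[str, str]) -> dict[str, list[str]]:
    """Pair each probe on platform A with the platform B probes sharing its gene identifier."""
    by_gene: dict[str, list[str]] = {}
    for probe, gene in coupling_b.items():
        by_gene.setdefault(gene, []).append(probe)
    return {probe: by_gene[gene] for probe, gene in coupling_a.items() if gene in by_gene}

# Hypothetical coupling files for two platforms (assumption).
PLATFORM_A = "A_001\tHs.1001\nA_002\tHs.1002\nA_003\t\n"
PLATFORM_B = "B_17\tHs.1002\nB_18\tHs.1003\nB_19\tHs.1002\n"

linked = link_platforms(read_coupling(PLATFORM_A), read_coupling(PLATFORM_B))
print(linked)  # {'A_002': ['B_17', 'B_19']}
```

Because the mapping lives in a separate file, a wrong link can be corrected by editing one line rather than by changing the dataset itself.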
Gene linking efficiency of different databases
Auto-Upload Tool and external programs accessing SRS
The availability of many scientific databases in SRS, the universality of the system and its free access for academic use make SRS an excellent mining system for heterogeneous microarray datasets. The Auto-Upload Tool facilitates the exchange of microarray datasets between separate SRS installations. Using a single data file and an optional description file, any user can upload the identical data and customize it to their own SRS environment. We would urge researchers and microarray data repositories to make their data available in an SRS format. In addition, microarray software programmers could make their software available in an SRS-compatible format or include SRS data export options. The commercial SRS GeneSpring® Connector and public EMBOSS are examples of such microarray-SRS integration ventures.
We plan to extend our efforts to integrate more microarray databases into SRS. In addition, software tools specific for microarray data analysis, such as Go Mapper and CoPub Mapper, will be rewritten for SRS [26, 41, 42]. The CoPub Mapper literature mining program contains databases that store, for each gene, all MEDLINE records mentioning the gene. This directly links microarray expression data to the published literature and allows for co-publication research of gene-gene and gene-keyword combinations.
The Sequence Retrieval System is a versatile and useful database system to store, link and query various scientific databases, including microarray datasets. Fully processed datasets can be incorporated and linked to other datasets using the Auto-Upload Tool. This user-friendly program makes SRS accessible to users who can themselves add, link and mine databases within minutes. Datasets stored in SRS can be interrogated by external programs to perform virtually any computation.
Availability and requirements
Project Name: Auto-Upload Tool and Venn Mapper for SRS
Project home page: http://www.erasmusmc.nl/gatcplatform
Operating system: Platform independent
Other requirements: Local SRS installation, DQS batch-queue, MySQL database server, PHP-enabled web server (such as Apache)
License: SRS (Lion Bioscience)
Any restrictions to use by non-academics: License needed
We would like to thank EMBL/EBI and Lion Bioscience for making SRS available and Peter Hendriksen for careful reading of the manuscript. This work was supported by Erasmus MC Breedtestrategie and the Urologic Research Foundation (SUWO) Erasmus MC.
- Zdobnov EM, Lopez R, Apweiler R, Etzold T: The EBI SRS server – recent developments. Bioinformatics 2002, 18: 368–373. doi:10.1093/bioinformatics/18.2.368
- Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet 1999, 21: 33–37. doi:10.1038/4462
- Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM: Expression profiling using cDNA microarrays. Nat Genet 1999, 21: 10–14. doi:10.1038/4434
- Heller MJ: DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng 2002, 4: 129–153. doi:10.1146/annurev.bioeng.4.020702.153438
- Schena M, Heller RA, Theriault TP, Konrad K, Lachenmeier E, Davis RW: Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 1998, 16: 301–306. doi:10.1016/S0167-7799(98)01219-0
- Moreau Y, Aerts S, De Moor B, De Strooper B, Dabrowski M: Comparison and meta-analysis of microarray data: from the bench to the computer desk. Trends Genet 2003, 19: 570–577. doi:10.1016/j.tig.2003.08.006
- Rhodes DR, Chinnaiyan AM: Bioinformatics strategies for translating genome-wide expression analyses into clinically useful cancer markers. Ann N Y Acad Sci 2004, 1020: 32–40. doi:10.1196/annals.1310.005
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A 2004, 101: 9309–9314. doi:10.1073/pnas.0401994101
- Welsh JB, Sapinoso LM, Kern SG, Brown DA, Liu T, Bauskin AR, Ward RL, Hawkins NJ, Quinn DI, Russell PJ, Sutherland RL, Breit SN, Moskaluk CA, Frierson HF Jr, Hampton GM: Large-scale delineation of secreted protein biomarkers overexpressed in cancer tissue and serum. Proc Natl Acad Sci U S A 2003, 100: 3410–3415. doi:10.1073/pnas.0530278100
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia 2004, 6: 1–6.
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. doi:10.1093/nar/30.1.207
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA: ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003, 31: 68–71. doi:10.1093/nar/gkg091
- Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y: CIBEX: center for information biology gene expression database. C R Biol 2003, 326: 1079–1082.
- Stoeckert CJ Jr, Causton HC, Ball CA: Microarray databases: standards and ontologies. Nat Genet 2002, 32: 469–473. doi:10.1038/ng1028
- Gardiner-Garden M, Littlejohn TG: A comparison of microarray databases. Brief Bioinform 2001, 2: 143–158.
- Quackenbush J: Data standards for 'omic' science. Nat Biotechnol 2004, 22: 613–614. doi:10.1038/nbt0504-613
- Penkett CJ, Bahler J: Getting the most from public microarray data. European Pharmaceutical Review 2004, 9: 8–17.
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. doi:10.1016/S0168-9525(00)02024-2
- Kulikova T, Aldebert P, Althorpe N, Baker W, Bates K, Browne P, van den BA, Cochrane G, Duggan K, Eberhardt R, Faruque N, Garcia-Pastor M, Harte N, Kanz C, Leinonen R, Lin Q, Lombard V, Lopez R, Mancuso R, McHale M, Nardone F, Silventoinen V, Stoehr P, Stoesser G, Tuli MA, Tzouvara K, Vaughan R, Wu D, Zhu W, Apweiler R: The EMBL Nucleotide Sequence Database. Nucleic Acids Res 2004, 32: D27-D30. doi:10.1093/nar/gkh120
- Marshall E: Getting the noise out of gene arrays. Science 2004, 306: 630–631. doi:10.1126/science.306.5696.630
- Zhou XJ, Kao MC, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH: Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat Biotechnol 2005, 23: 238–243. doi:10.1038/nbt1058
- Mitchell SA, Brown KM, Henry MM, Mintz M, Catchpoole D, LaFleur B, Stephan DA: Inter-platform comparability of microarrays in acute lymphoblastic leukemia. BMC Genomics 2004, 5: 71. doi:10.1186/1471-2164-5-71
- Chiorino G, Acquadro F, Mello GM, Viscomi S, Segir R, Gasparini M, Dotto P: Interpretation of expression-profiling results obtained from different platforms and tissue sources: examples using prostate cancer data. Eur J Cancer 2004, 40: 2592–2603. doi:10.1016/j.ejca.2004.07.029
- Culhane AC, Perriere G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003, 4: 59. doi:10.1186/1471-2105-4-59
- Shippy R, Sendera TJ, Lockner R, Palaniappan C, Kaysser-Kranich T, Watts G, Alsobrook J: Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Genomics 2004, 5: 61. doi:10.1186/1471-2164-5-61
- Smid M, Dorssers LC, Jenster G: Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics 2003, 19: 2065–2071. doi:10.1093/bioinformatics/btg282
- Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4: 210. doi:10.1186/gb-2003-4-4-210
- Draghici S: Statistical intelligence: effective analysis of high-density microarray data. Drug Discov Today 2002, 7: S55-S63. doi:10.1016/S1359-6446(02)02292-4
- Auto-Upload Tool Manual [http://www.erasmusmc.nl/gatcplatform/autouploadmanual.pdf]
- Schaftenaar G, Cuelenaere K, Noordik JH, Etzold T: A Tcl-based SRS v. 4 interface. Comput Appl Biosci 1996, 12: 151–155.
- Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biology 2001, 2: software0002. doi:10.1186/gb-2001-2-11-software0002
- Cheung KH, Hager J, Pan D, Srivastava R, Mane S, Li Y, Miller P, Williams KR: KARMA: a web server application for comparing and annotating heterogeneous microarray platforms. Nucleic Acids Res 2004, 32: W441-W444. doi:10.1093/nar/gkh661
- Svensson BA, Kreeft AJ, van Ommen GJ, den Dunnen JT, Boer JM: GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers. Genome Biol 2003, 4: R35. doi:10.1186/gb-2003-4-5-r35
- Wang P, Ding F, Chiang H, Thompson RC, Watson SJ, Meng F: ProbeMatchDB – a web database for finding equivalent probes across microarray platforms and species. Bioinformatics 2002, 18: 488–489. doi:10.1093/bioinformatics/18.3.488
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4: P3. doi:10.1186/gb-2003-4-5-p3
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. doi:10.1101/gr.1645104
- Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 2003, 31: 219–223. doi:10.1093/nar/gkg014
- Bluthgen N, Kielbasa SM, Cajavec B, Herzel H: HOMGL-comparing genelists across species and with different accession numbers. Bioinformatics 2004, 20: 125–126. doi:10.1093/bioinformatics/btg379
- Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004, 32: D35-D40. doi:10.1093/nar/gkh073
- Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005, 6: 51. doi:10.1186/1471-2105-6-51
- Smid M, Dorssers LC: GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms. Bioinformatics 2004, 20: 2618–2625. doi:10.1093/bioinformatics/bth293
- Public SRS servers [http://downloads.lionbio.co.uk/publicsrs.html]
- NKI Central Microarray Facility [http://microarrays.nki.nl/]
- 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der KK, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536. doi:10.1038/415530a
- Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412: 822–826. doi:10.1038/35090585
- Compugen Oligo Library [http://www.labonweb.com/chips/libraries.html]
- Lapointe J, Li C, Higgins JP, Van de RM, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci U S A 2004, 101: 811–816. doi:10.1073/pnas.0304146101
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.