Storing, linking, and mining microarray databases using SRS
BMC Bioinformatics volume 6, Article number: 192 (2005)
SRS (Sequence Retrieval System) has proven to be a valuable platform for storing, linking, and querying biological databases. Due to the availability of a broad range of different scientific databases in SRS, it has become a useful platform to incorporate and mine microarray data to facilitate the analyses of biological questions and non-hypothesis driven quests. Here we report various solutions and tools for integrating and mining annotated expression data in SRS.
We devised an Auto-Upload Tool by which microarray data can be automatically imported into SRS. The dataset can be linked to other databases and user access can be set. The linkage comprehensiveness of microarray platforms to other platforms and biological databases was examined in a network of scientific databases. The stored microarray data can also be made accessible to external programs for further processing. For example, we built an interface to a program called Venn Mapper, which collects its microarray data from SRS, processes the data by creating Venn diagrams, and saves the data for interpretation.
SRS is a useful database system to store, link and query various scientific datasets, including microarray data. The user-friendly Auto-Upload Tool makes SRS accessible to biologists for linking and mining user-owned databases.
The extraction of information from data generated by high-throughput experiments in genomics and proteomics has been likened to "attempting to drink from a fire hose". We are flooded with information on many levels such as whole genome DNA sequences, RNA expression, protein-protein interactions, protein modifications, and more. All this information is accessible in very different formats, ranging from well-organized curated gene sequences to unstructured free text in scientific literature. A system that can manage, link and query these heterogeneous types of datasets is therefore extremely valuable. The Sequence Retrieval System (SRS) is such a unified database system in which numerous different scientific databases have already been integrated.
Of special interest are data from high-throughput RNA expression microarrays [2, 3]. Many of these datasets are freely available and, like information stored in other scientific databases, are from different platforms [4, 5]. Integrating and mining these databases strongly facilitates the analysis of genes of interest and will also support discovery of disease markers, drug-targets and new knowledge in general [6–9]. One such platform is Oncomine, which has integrated many different microarray datasets, focussing on human cancer. Additionally, standardized microarray repositories such as GEO (Gene Expression Omnibus), ArrayExpress, and CIBEX do or will soon provide options to browse and query the datasets [14–17]. No doubt, other platforms will be developed focussing on the integration of microarray data. If started from scratch, these initiatives will likely be limited in their direct linkage to other heterogeneous biological databases due to the laborious task of making those connections and programming the single and batch-wise query options. The universality and the availability of numerous scientific databases that have already been integrated in SRS make it a useful platform for integrating microarray databases. Although the SRS interface to query databases is quite user friendly, other aspects of working with SRS are not. These include (i) uploading microarray datasets, (ii) database security including setting user access, (iii) linking databases, (iv) generating standard views, and (v) communication with other programs such as statistical and clustering software. The current SRS interface has a major disadvantage in that it is not designed to perform complex calculations on the fly. This means that any microarray dataset to be uploaded must have all ratio and statistical calculations performed upfront.
For example, once in SRS, one cannot change ratios from log10 to log2 or add an extra field per gene by dividing expression data of all "normal" by all "cancer" samples. However, software programs that perform calculations, statistical evaluations, clustering, protein domain predictions, homology searches, and more, can communicate with SRS. Interfaces can be generated that retrieve data from SRS, perform the required action and if desired, store the results in SRS. Alternatively, SRS allows direct integration of programs such as the BLAST and FASTA homology searches and the SRS-EMBOSS (European Molecular Biology Open Software Suite) tools [18, 19].
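The log10-to-log2 conversion mentioned above is an example of a calculation that has to be done before upload. A minimal sketch in Python, assuming the ratios are held as plain numbers keyed by a gene identifier (the function names and the data layout are illustrative and not part of SRS or the Auto-Upload Tool):

```python
import math

def log10_to_log2(value_log10: float) -> float:
    """Convert a log10 ratio to log2 using the change-of-base identity."""
    return value_log10 / math.log10(2)

def convert_dataset(ratios_log10: dict) -> dict:
    """Apply the conversion to every gene in a hypothetical {gene: ratio} mapping."""
    return {gene: log10_to_log2(value) for gene, value in ratios_log10.items()}

# Hypothetical input: a ten-fold change expressed as a log10 ratio of 1.0.
converted = convert_dataset({"gene_A": 1.0, "gene_B": 0.0})
```

Because the conversion is a constant rescaling, it can be applied once to the whole spreadsheet before the dataset is imported.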
Generating a database in which heterogeneous datasets are integrated is a challenge in itself. However, retrieving statistically meaningful data by comparing datasets from different sources, platforms and designs is particularly difficult. There is a fast growing body of publications on microarray cross-platform comparisons, mainly showing how this can be achieved in many different ways [8, 21–26]. Statistical evaluations of data within a dataset of sufficient technical and biological replicates are better defined and can be implemented per dataset within a database system [27, 28]. The strategies and applications we discuss here to link, store and query scientific datasets in SRS do not go beyond processed individual datasets and do not include cross-platform dataset integrations. We assume that each uploaded dataset consists of high-quality data and has been processed correctly.
In this paper we describe strategies to incorporate microarray databases into SRS and provide a database upload tool. Using the program Venn Mapper as an example, we show how stored microarray data can be retrieved automatically from SRS for external statistical evaluation.
In order to import microarray databases into SRS (version 7.1.3), an Auto-Upload Tool was built (Figure 1). This tool, written in PHP, stores databases of a predefined format in a user-owned, password-protected directory on a local SRS server [see Additional file 1]. In this directory, databases can be managed, viewed and uploaded into SRS. In the "edit-database" interface, links to other databases can be specified for each field. A standard view can be generated and the location of the dataset in the SRS directory determined. Finally, permissions can be set to control access to the various datasets in SRS. Upon uploading, the Auto-Upload Tool will generate the files required for SRS: (i) the SRS data file in which the spreadsheet input file is converted into a flat-file database, (ii) an Icarus syntax file (.is-file) which describes the layout of the flat-file database, (iii) a database index file (.i-file) which describes the way in which the different fields need to be indexed for the SRS system, (iv) a database view file (.view-file) in which a standard view is defined, and (v) an information file (.it-file) which can harbour a description of the dataset. These files are automatically placed into the SRS directory after which the Auto-Upload Tool updates the srsdb.i, user.i and site.i files. These files describe the name of the database, where the files are located (srsdb.i), user permissions (user.i), and configuration of the different database groups (site.i). The srssection command within the Auto-Upload Tool implements the changes in the configuration files after which srscheck and srsdo perform indexing of new databases and set links. Incorporation of a new dataset using this tool generally takes place within minutes. On our local SRS server, a DQS (Distributed Queuing System) batch-queue is installed to prevent data loss or corruption of datasets in case multiple users are editing datasets at the same time.
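Step (i) above, converting the spreadsheet input into a flat-file database, can be illustrated with a short Python sketch. The record layout used here (one "field: value" line per column and a "//" line closing each record) is an assumption chosen for illustration; the layout the Auto-Upload Tool actually writes, and the matching Icarus syntax file, are described in the tool's manual rather than here.

```python
import csv
import io

def spreadsheet_to_flatfile(tsv_text: str) -> str:
    """Turn tab-delimited spreadsheet text into a simple tagged flat file.

    The "field: value" lines and the "//" record terminator are a
    hypothetical layout, not the format produced by the Auto-Upload Tool.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        lines = [f"{field}: {value}" for field, value in row.items()]
        lines.append("//")  # end-of-record marker (assumed)
        records.append("\n".join(lines))
    return "\n".join(records) + "\n"

# Hypothetical two-gene dataset with a linking field and one ratio column.
example = "ProbeSet\tlog2_ratio\nA1\t1.5\nB2\t-0.7\n"
flatfile = spreadsheet_to_flatfile(example)
```

A one-record-per-gene layout of this kind is what allows each column to be described, indexed and linked as a separate field.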
External programs accessing SRS: Venn Mapper for SRS
An important feature of storing and linking microarray data in SRS is the accessibility of the datasets for other programs. As an example, we generated a PHP web interface for the Venn Mapper program that retrieves microarray data from SRS to calculate the statistical significance of the number of co-occurring differentially expressed genes in any combination of two experiments. The functionality of the original Venn Mapper was enhanced by enabling the use of different ratio cut-offs for different microarray experiments. Upon login, the interface displays all microarray databases indexed in SRS to which the user has access. After selection of the datasets a second screen shows all fields (such as individual array experiments or averaged group ratios) of the selected databases. The microarray experiments of interest can be selected for Venn Mapper analysis after which the requested data is linked and exported from SRS into the Venn Mapper program. The output of the program is available for viewing and downloading. Information requests from the interface to SRS are made through the SRS getz command. This powerful feature of SRS makes the integrated databases accessible to any external program.
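The idea of testing the overlap between two experiments, each with its own ratio cut-off, can be sketched as follows. This Python example uses a one-sided hypergeometric test as a stand-in; it is not the statistic implemented in Venn Mapper, which is described in the cited Venn Mapping paper. The function names, the cut-offs and the example ratios are all hypothetical.

```python
from math import comb

def select_genes(ratios: dict, cutoff: float) -> set:
    """Genes whose absolute log2 ratio reaches the experiment-specific cut-off."""
    return {gene for gene, ratio in ratios.items() if abs(ratio) >= cutoff}

def overlap_p_value(n_total: int, n_a: int, n_b: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_a and n_b genes are drawn from n_total.

    Hypergeometric upper tail, used here only as an illustrative substitute
    for the Venn Mapper statistic.
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_b)

# Two hypothetical experiments measured on the same four genes.
exp_a = {"g1": 2.0, "g2": -1.5, "g3": 0.1, "g4": 0.2}
exp_b = {"g1": 1.2, "g2": 0.3, "g3": -1.1, "g4": 0.0}
genes_a = select_genes(exp_a, cutoff=1.0)
genes_b = select_genes(exp_b, cutoff=1.0)
p = overlap_p_value(len(exp_a), len(genes_a), len(genes_b), len(genes_a & genes_b))
```

In practice the ratios would be exported from SRS rather than typed in, and the cut-off would be chosen per experiment as the interface allows.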
Results and discussion
Preparing and linking of microarray databases in SRS
With respect to the microarray database set-up, there are two important considerations. First, in our experience, microarray data mining often starts with selecting genes based on their differential expression. Differential gene expression is best determined using statistical evaluation of the data based on sufficient technical and biological replicates [27, 28]. Depending on the statistical test and microarray platform, raw gene expression data and/or ratio calculations can be utilized. Second, making changes to a microarray dataset in SRS is impractical and datasets should be fully built before they are imported. This means that raw expression data and ratio calculations should be normalised and flagged and represented in a common format (such as log2). Importantly, statistical evaluation should be included. For simplicity in representation, datasets can be summarised in, for example, an average of all "normal" and "cancer" samples and additional fields of log2 "normal/cancer" can be included.
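The summary fields described above can be computed per gene before upload. The following Python sketch assumes that the expression values are already normalised and on a log2 scale, so that the log2 "normal/cancer" ratio is the difference of the two group means. Welch's t statistic is included only as one possible statistical evaluation; the text does not prescribe a particular test, and the field names are hypothetical.

```python
import math
from statistics import mean, variance

def summarise_gene(normal_log2: list, cancer_log2: list) -> dict:
    """Group means, log2 normal/cancer ratio and Welch's t for one gene.

    Inputs are log2 expression values; each group needs at least two samples.
    """
    mean_normal = mean(normal_log2)
    mean_cancer = mean(cancer_log2)
    difference = mean_normal - mean_cancer  # log2(normal/cancer) on the log scale
    standard_error = math.sqrt(
        variance(normal_log2) / len(normal_log2)
        + variance(cancer_log2) / len(cancer_log2)
    )
    t_statistic = difference / standard_error if standard_error > 0 else math.nan
    return {
        "mean_normal": mean_normal,
        "mean_cancer": mean_cancer,
        "log2_normal_over_cancer": difference,
        "welch_t": t_statistic,
    }

# Hypothetical gene with three normal and three cancer samples.
summary = summarise_gene([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Each key of the returned dictionary would become one additional field of the dataset that is imported into SRS.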
Linking of databases should be based on invariable and unique indexes. Links based on identifiers that are regularly re-assigned, such as UniGene cluster identifiers, force one to repeatedly update all databases that include such a denominator. Invariable links based on DNA sequence assignments such as GenBank and ProbeSet identifiers (IDs) are therefore more appropriate linking indexes. We recommend including only one of those invariable and unique linking fields in the microarray dataset to avoid the need for a regular update. Since many biological databases do not use these hard links, connecting microarray datasets to other databases can be achieved through a coupling file and/or by making use of the web of links provided by the biological databases (Figure 2). A coupling file can be minimal, containing for example only a RefSeq ID with the appropriate array spot number, or can contain a variety of indexes such as GenBank, RefSeq, SwissProt, OMIM, LocusLink, KEGG, GO, GeneCards, UniGene identifiers, that directly link to the different databases. Since some of these links are variable and biological databases keep growing with information, coupling files must be regularly renewed to update the various links.
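How a coupling file connects a dataset to another database can be sketched in a few lines of Python. The example assumes that the dataset is keyed by its single invariable linking field (here an array spot number) and that the coupling file has been read into a mapping from that field to other identifiers. All identifiers shown are made up for illustration.

```python
def link_via_coupling(dataset: dict, coupling: dict, target: str) -> dict:
    """Attach the identifier of a target database to each dataset entry.

    dataset:  {spot_id: expression value}
    coupling: {spot_id: {database name: identifier}}
    Entries without an identifier for the target database are left out.
    """
    linked = {}
    for spot_id, value in dataset.items():
        target_id = coupling.get(spot_id, {}).get(target)
        if target_id:
            linked[spot_id] = (target_id, value)
    return linked

# Hypothetical dataset and coupling file contents.
dataset = {"spot_1": 1.8, "spot_2": -0.4, "spot_3": 0.9}
coupling = {
    "spot_1": {"RefSeq": "REFSEQ_0001", "UniGene": "CLUSTER_10"},
    "spot_2": {"UniGene": "CLUSTER_20"},
}
refseq_links = link_via_coupling(dataset, coupling, "RefSeq")
```

Because the dataset itself holds only the spot number, renewing the links means replacing the coupling mapping; the expression data stay untouched.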
Instead of or in combination with a coupling file, one can make use of the links provided in the various biological databases (Figure 2). For example, the GenBank accession code in the microarray dataset can be linked to the UniGene database. The LocusLink database is linked to the UniGene database through RefSeq accession numbers and also contains IDs that for example link to EMBL, OMIM, and SwissProt. In this way, almost all biological databases are linked through a network of direct and indirect connections. In case multiple roads lead to the same databases, SRS utilizes only one route. By assigning values to each link, the route taken is the one having the lowest sum of link values, even when this results in a lower number of connected fields (genes). Although SRS can be forced to take a specified path, one should be careful about depending on many different databases. Inconsistencies in and incompleteness of databases accumulate when linking occurs in sequence.
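Choosing the route with the lowest sum of link values is a shortest-path problem. The Python sketch below uses Dijkstra's algorithm on a small, made-up network; it illustrates the selection rule described above and is not a description of how SRS implements it. The link values are hypothetical and links are treated as directed.

```python
import heapq

def cheapest_route(links: dict, start: str, goal: str):
    """Return (total link value, path) for the lowest-cost route, or None.

    links: {database: {neighbouring database: link value}}
    """
    queue = [(0, start, [start])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # a cheaper route to this database was already expanded
        settled[node] = cost
        for neighbour, value in links.get(node, {}).items():
            heapq.heappush(queue, (cost + value, neighbour, path + [neighbour]))
    return None

# Hypothetical link values: the GenBank detour is more expensive.
links = {
    "Array": {"UniGene": 1, "GenBank": 5},
    "UniGene": {"LocusLink": 1},
    "GenBank": {"LocusLink": 1},
    "LocusLink": {"OMIM": 1},
}
route = cheapest_route(links, "Array", "OMIM")
```

The sketch also makes the caveat visible: the selected route is determined by the link values alone, not by how many genes each route would actually connect.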
Linking to the network of biological databases through a coupling file has various advantages. One can establish directly validated links to each database, including databases outside the network. In addition, errors in links can easily be corrected. Platform-specific or overall coupling files can be retrieved from the Affymetrix website and from sources such as Resourcerer, KARMA, GeneHopper, ProbeMatchDB, DAVID, EnsMart, and Source. Using these resources, microarray datasets from different platforms can be linked. This includes connecting cross-species datasets using ortholog converters such as HOMGL and HOMOLOGENE.
Gene linking efficiency of different databases
A high accuracy and comprehensiveness of linking are essential for a successful comparison of microarray data from different platforms. The extent of linkage of various databases in SRS was examined (Figure 3). The percentage of fields of scientific databases and microarray datasets that are linked to other databases was assessed. As shown in Figure 2, most microarray databases are directly linked to the scientific database network via a single connection. The U133A and U95 Affymetrix coupling files contain direct links to UniGene, SwissProt, LocusLink, and OMIM. The linkage of the various microarray platforms to UniGene varies between 84% and 96%. On average, 60% of the genes of microarray datasets can be linked to each other via UniGene.
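The linkage percentages reported above are, in essence, the fraction of entries in one database that have a counterpart in another. A minimal Python sketch of that calculation, assuming the entries and their link targets are available as lists of identifiers (the identifiers below are invented, and the actual figures in this section were obtained within SRS):

```python
def linkage_percentage(entry_ids, linked_ids) -> float:
    """Percentage of distinct entries that also occur among the linked identifiers."""
    entries = set(entry_ids)
    if not entries:
        return 0.0
    return 100.0 * len(entries & set(linked_ids)) / len(entries)

# Hypothetical platform with four probes, three of which resolve to a cluster.
platform_probes = ["p1", "p2", "p3", "p4"]
probes_with_cluster = ["p1", "p2", "p3"]
percentage = linkage_percentage(platform_probes, probes_with_cluster)
```

Applying the same function to the cluster identifiers of two platforms gives the share of genes that can be compared between them.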
Auto-Upload Tool and external programs accessing SRS
The availability of many scientific databases in SRS, the universality of the system and its free access for academic use, make SRS an excellent mining system for heterogeneous microarray datasets. The Auto-Upload Tool facilitates the exchange of microarray datasets between separate SRS installations. Using a single data file and optional description file, any user can upload the identical data and customize it to their own SRS environment. We would urge researchers and microarray data repositories to make their data available in an SRS format. In addition, microarray software programmers could make their software available in an SRS compatible format or include SRS data export options. The commercial SRS GeneSpring® Connector and public EMBOSS are examples of such microarray-SRS integration ventures.
We plan to extend our efforts to integrate more microarray databases into SRS. In addition, software tools specific for microarray data analysis, such as GO-Mapper and CoPub Mapper, will be rewritten for SRS [26, 41, 42]. The CoPub Mapper literature mining program contains databases that store, for each gene, all MEDLINE records mentioning the gene. This directly links microarray expression data to the published literature and allows for co-publication research of gene-gene and gene-keyword combinations.
The Sequence Retrieval System is a versatile and useful database system to store, link and query various scientific databases, including microarray datasets. Fully processed datasets can be incorporated and linked to other datasets using the Auto-Upload Tool. This user-friendly program makes SRS accessible to users who can themselves add, link and mine databases within minutes. Datasets stored in SRS can be interrogated by external programs to perform virtually any computation.
Availability and requirements
Project Name: Auto-Upload Tool and Venn Mapper for SRS
Project home page: http://www.erasmusmc.nl/gatcplatform
Operating system: Platform independent
Other requirements: Local SRS installation, DQS batch-queue, MySQL database server, PHP-enabled Webserver (like Apache)
License: SRS (Lion Bioscience)
Any restrictions to use by non academics: License needed
Zdobnov EM, Lopez R, Apweiler R, Etzold T: The EBI SRS server – recent developments. Bioinformatics 2002, 18: 368–373. 10.1093/bioinformatics/18.2.368
Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet 1999, 21: 33–37. 10.1038/4462
Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM: Expression profiling using cDNA microarrays. Nat Genet 1999, 21: 10–14. 10.1038/4434
Heller MJ: DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng 2002, 4: 129–153. 10.1146/annurev.bioeng.4.020702.153438
Schena M, Heller RA, Theriault TP, Konrad K, Lachenmeier E, Davis RW: Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 1998, 16: 301–306. 10.1016/S0167-7799(98)01219-0
Moreau Y, Aerts S, De Moor B, De Strooper B, Dabrowski M: Comparison and meta-analysis of microarray data: from the bench to the computer desk. Trends Genet 2003, 19: 570–577. 10.1016/j.tig.2003.08.006
Rhodes DR, Chinnaiyan AM: Bioinformatics strategies for translating genome-wide expression analyses into clinically useful cancer markers. Ann N Y Acad Sci 2004, 1020: 32–40. 10.1196/annals.1310.005
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A 2004, 101: 9309–9314. 10.1073/pnas.0401994101
Welsh JB, Sapinoso LM, Kern SG, Brown DA, Liu T, Bauskin AR, Ward RL, Hawkins NJ, Quinn DI, Russell PJ, Sutherland RL, Breit SN, Moskaluk CA, Frierson HF Jr, Hampton GM: Large-scale delineation of secreted protein biomarkers overexpressed in cancer tissue and serum. Proc Natl Acad Sci U S A 2003, 100: 3410–3415. 10.1073/pnas.0530278100
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia 2004, 6: 1–6.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. 10.1093/nar/30.1.207
Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA: ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003, 31: 68–71. 10.1093/nar/gkg091
Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y: CIBEX: center for information biology gene expression database. C R Biol 2003, 326: 1079–1082.
Stoeckert CJ Jr, Causton HC, Ball CA: Microarray databases: standards and ontologies. Nat Genet 2002, 32: 469–473. 10.1038/ng1028
Gardiner-Garden M, Littlejohn TG: A comparison of microarray databases. Brief Bioinform 2001, 2: 143–158.
Quackenbush J: Data standards for 'omic' science. Nat Biotechnol 2004, 22: 613–614. 10.1038/nbt0504-613
Penkett CJ, Bahler J: Getting the most from public microarray data. European Pharmaceutical Review 2004, 9: 8–17.
Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. 10.1016/S0168-9525(00)02024-2
Kulikova T, Aldebert P, Althorpe N, Baker W, Bates K, Browne P, van den BA, Cochrane G, Duggan K, Eberhardt R, Faruque N, Garcia-Pastor M, Harte N, Kanz C, Leinonen R, Lin Q, Lombard V, Lopez R, Mancuso R, McHale M, Nardone F, Silventoinen V, Stoehr P, Stoesser G, Tuli MA, Tzouvara K, Vaughan R, Wu D, Zhu W, Apweiler R: The EMBL Nucleotide Sequence Database. Nucleic Acids Res 2004, 32: D27-D30. 10.1093/nar/gkh120
Marshall E: Getting the noise out of gene arrays. Science 2004, 306: 630–631. 10.1126/science.306.5696.630
Zhou XJ, Kao MC, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH: Functional annotation and network reconstruction through cross-platform integration of microarray data. Nat Biotechnol 2005, 23: 238–243. 10.1038/nbt1058
Mitchell SA, Brown KM, Henry MM, Mintz M, Catchpoole D, LaFleur B, Stephan DA: Inter-platform comparability of microarrays in acute lymphoblastic leukemia. BMC Genomics 2004, 5: 71. 10.1186/1471-2164-5-71
Chiorino G, Acquadro F, Mello GM, Viscomi S, Segir R, Gasparini M, Dotto P: Interpretation of expression-profiling results obtained from different platforms and tissue sources: examples using prostate cancer data. Eur J Cancer 2004, 40: 2592–2603. 10.1016/j.ejca.2004.07.029
Culhane AC, Perriere G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003, 4: 59. 10.1186/1471-2105-4-59
Shippy R, Sendera TJ, Lockner R, Palaniappan C, Kaysser-Kranich T, Watts G, Alsobrook J: Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Genomics 2004, 5: 61. 10.1186/1471-2164-5-61
Smid M, Dorssers LC, Jenster G: Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics 2003, 19: 2065–2071. 10.1093/bioinformatics/btg282
Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4: 210. 10.1186/gb-2003-4-4-210
Draghici S: Statistical intelligence: effective analysis of high-density microarray data. Drug Discov Today 2002, 7: S55-S63. 10.1016/S1359-6446(02)02292-4
Auto-Upload Tool Manual [http://www.erasmusmc.nl/gatcplatform/autouploadmanual.pdf]
Schaftenaar G, Cuelenaere K, Noordik JH, Etzold T: A Tcl-based SRS v. 4 interface. Comput Appl Biosci 1996, 12: 151–155.
Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F, Quackenbush J: RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biology 2001, 2: software0002. 10.1186/gb-2001-2-11-software0002
Cheung KH, Hager J, Pan D, Srivastava R, Mane S, Li Y, Miller P, Williams KR: KARMA: a web server application for comparing and annotating heterogeneous microarray platforms. Nucleic Acids Res 2004, 32: W441-W444. 10.1093/nar/gkh661
Svensson BA, Kreeft AJ, van Ommen GJ, den Dunnen JT, Boer JM: GeneHopper: a web-based search engine to link gene-expression platforms through GenBank accession numbers. Genome Biol 2003, 4: R35. 10.1186/gb-2003-4-5-r35
Wang P, Ding F, Chiang H, Thompson RC, Watson SJ, Meng F: ProbeMatchDB – a web database for finding equivalent probes across microarray platforms and species. Bioinformatics 2002, 18: 488–489. 10.1093/bioinformatics/18.3.488
Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4: P3. 10.1186/gb-2003-4-5-p3
Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. 10.1101/gr.1645104
Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 2003, 31: 219–223. 10.1093/nar/gkg014
Bluthgen N, Kielbasa SM, Cajavec B, Herzel H: HOMGL-comparing genelists across species and with different accession numbers. Bioinformatics 2004, 20: 125–126. 10.1093/bioinformatics/btg379
Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004, 32: D35-D40. 10.1093/nar/gkh073
Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005, 6: 51. 10.1186/1471-2105-6-51
Smid M, Dorssers LC: GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms. Bioinformatics 2004, 20: 2618–2625. 10.1093/bioinformatics/bth293
Public SRS servers [http://downloads.lionbio.co.uk/publicsrs.html]
NKI Central Microarray Facility [http://microarrays.nki.nl/]
't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der KK, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536. 10.1038/415530a
Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412: 822–826. 10.1038/35090585
Compugen Oligo Library [http://www.labonweb.com/chips/libraries.html]
Lapointe J, Li C, Higgins JP, Van de RM, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci U S A 2004, 101: 811–816. 10.1073/pnas.0304146101
We would like to thank EMBL/EBI and Lion Bioscience for making SRS available and Peter Hendriksen for careful reading of the manuscript. This work was supported by Erasmus MC Breedtestrategie and the Urologic Research Foundation (SUWO) Erasmus MC.
AV and DdL generated the Auto-Upload Tool. DdL, AV and MS generated the Venn Mapper for SRS program. AV and VdJ installed and managed the servers for the various tools. Funding for the project was obtained by JK and GJ. AV, JK and GJ contributed to the intellectual content and GJ supervised the project.
Veldhoven, A., Lange, D.d., Smid, M. et al. Storing, linking, and mining microarray databases using SRS. BMC Bioinformatics 6, 192 (2005). https://doi.org/10.1186/1471-2105-6-192