Cross-species and cross-platform gene expression studies with the Bioconductor-compliant R package 'annotationTools'
© Kuhn et al. 2008
Received: 26 June 2007
Accepted: 17 January 2008
Published: 17 January 2008
Skip to main content
© Kuhn et al. 2008
Received: 26 June 2007
Accepted: 17 January 2008
Published: 17 January 2008
The variety of DNA microarray formats and datasets presently available offers an unprecedented opportunity to perform insightful comparisons of heterogeneous data. Cross-species studies, in particular, have the power of identifying conserved, functionally important molecular processes. Validation of discoveries can now often be performed in readily available public data which frequently requires cross-platform studies.
Cross-platform and cross-species analyses require matching probes on different microarray formats. This can be achieved using the information in microarray annotations and additional molecular biology databases, such as orthology databases. Although annotations and other biological information are stored using modern database models (e.g. relational), they are very often distributed and shared as tables in text files, i.e. flat file databases. This common flat database format thus provides a simple and robust solution to flexibly integrate various sources of information and a basis for the combined analysis of heterogeneous gene expression profiles.
We provide annotationTools, a Bioconductor-compliant R package to annotate microarray experiments and integrate heterogeneous gene expression profiles using annotation and other molecular biology information available as flat file databases. First, annotationTools contains a specialized set of functions for mining this widely used database format in a systematic manner. It thus offers a straightforward solution for annotating microarray experiments. Second, building on these basic functions and relying on the combination of information from several databases, it provides tools to easily perform cross-species analyses of gene expression data.
Here, we present two example applications of annotationTools that are of direct relevance for the analysis of heterogeneous gene expression profiles, namely a cross-platform mapping of probes and a cross-species mapping of orthologous probes using different orthology databases. We also show how to perform an explorative comparison of disease-related transcriptional changes in human patients and in a genetic mouse model.
The R package annotationTools provides a simple solution to handle microarray annotation and orthology tables, as well as other flat molecular biology databases. Thereby, it allows easy integration and analysis of heterogeneous microarray experiments across different technological platforms or species.
DNA microarrays allows reliable measures of the levels of several tens of thousands of different mRNAs simultaneously. The microarray 'annotation' contains information specifying the mRNAs and genes assayed by the probes present on a particular microarray. A good annotation not only lists the probed transcripts but also contains biological information about the corresponding genes and information about the reliability of each probe-transcript association for instance. The information needed to define an annotation is extracted from molecular biology databases, such as Entrez Gene.
In the case of commercial microarrays supported by a dedicated analysis software (e.g. GCOS for Affymetrix arrays), an annotation is provided by the manufacturer and integrated in the analysis environment. On the other hand, if custom analyses are performed with the help of a versatile computing environment like R, an annotation for the microarray format of interest can be chosen explicitly. Here, we present an R package to support the use of annotations and other molecular biology databases provided as flat file databases (i.e. tables in plain text files). Flat annotation databases may be provided by microarray manufacturers or generated anew on the basis of microarray probe sequences by researchers who aim at a better definition of targeted transcripts (in particular for short oligonucleotide arrays, see e.g.).
Some functions in annotationTools are tailored to mine flat file databases containing gene homology/orthology information. DNA microarrays are available for numerous different species allowing for the comparison of gene expression levels across different organisms. This opportunity is promising[6–8], in particular for the better understanding of human diseases, where the comparison of disease-related expression profiles in humans and in animal models might yield important insights into pathological molecular mechanisms (see e.g.). Furthermore, comparing a disease condition in humans and in animal models, we can assess whether experimental models recapitulate aspects of the human disease, a critical step in validating disease models with regard to their use in preclinical therapeutical trials for instance. The comparison of transcriptional profiles across different species can make use of orthologous genes (i.e. two genes that derive from a single gene in the last common ancestor of the two species), assuming that they have retained the same function and are thus involved in similar processes in the two species. Whole-genome sequences have allowed extensive mapping of orthologs across many species, and several databases store clusters of homologous/orthologous genes (e.g. HomoloGene).
Annotations and orthology information are sufficient to perform cross-species analyses but combining heterogeneous sources of information is often arduous (e.g.). Based on a set of functions with common input and output formats and mining flat databases, annotationTools implements a robust solution for integrating heterogeneous data. To illustrate the use and functionality of our software, we will present the following example analyses that are directly relevant to cross-species mRNA profiling studies. We mapped probes across two widely used microarray platforms (Affymetrix and Illumina) and also mapped probes on a mouse microarray to their orthologs on a human microarray using information supplied by HomoloGene. We further show how the latter mapping differed from mappings obtained using 3 alternative sources of orthology information (i.e. Affymetrix, Ensembl or EGO). Building on this example and using data from human patients and from a murine disease model, we then demonstrate how annotationTools can be used to easily perform cross-species comparisons of gene expression profiles.
Functions in annotationTools are coded in pure R and do not depend on other R packages. They are therefore easily usable on any R installation. The annotation functions accept a vector of identifiers to be annotated (e.g. probe identifiers) and an annotation table, created by loading a flat annotation file in a data.frame object (see also Results). In the annotation table, each column contains a piece of annotation information (like e.g. probe identifier, gene symbol or chromosomal location) and each row contains a probe record. The identifiers to be annotated are looked up in the annotation table. Since a single identifier can be annotated with multiple items (like a probe hybridizing to several transcripts and therefore annotated with several different gene symbols), the annotation retrieved for the whole vector of identifiers is output as a list where the i-th element is a vector containing all annotation items found for the i-th input identifier. Options can be set to output the cause of an annotation failure: empty or invalid input identifier, input identifier not found in the annotation table, input identifier with no annotation information provided in the annotation table.
The annotation functions can be grouped according to their look-up behavior. First, functions that find a single match in the annotation table for each input identifier (i.e. the first occurrence of the identifier in the table): getANNOTATION, getGENEID, getGENESYMBOL, getGENETITLE, getGO. Note that getANNOTATION is a flexible function that can be set to look up any fields (column) in an arbitrary annotation table. The four other functions are tailored to accept annotation tables provided by Affymetrix but they can be easily customized to other formats by setting options.
A second group of annotation functions find multiple matches (i.e. multiple records) for each input identifier: getMULTIANNOTATION, getPROBESET and getHOMOLOG. getMULTIANNOTATION is a flexible annotation function that can work with arbitrary annotation tables whereas getPROBESET works with Affymetrix annotation tables by default. getHOMOLOG is designed to look up homologs or orthologs and accepts an homology/orthology table and two inputs: a vector of gene identifiers and a target species identifier.
Finally, a third group of functions making use of the previous two groups allows to perform higher-level operations (like e.g. generating mapping tables of orthologous probes). In conclusion, our package provides the user with a set of customizable functions with consistent input and output formats that can be easily chained to perform more complex operations.
Let us consider a typical Affymetrix GeneChip experiment and assume that we chose to use the annotation provided by Affymetrix. After downloading the annotation from Affymetrix website (e.g. for array format HG-U133 Plus 2.0, the file 'HG-U133_Plus_2_.naXX.annot.csv', where 'XX' indicates the version number, see), the user loads it into R as a data.frame with
> annotation <- read.csv("HG-U133_Plus_2.naXX.annot.csv", colClasses="character")
Genes associated with particular probe sets can then be retrieved with the function getGENESYMBOL. To retrieve genes whose transcripts are probed by probe sets '117_at' and '1007_s_at', type
> getGENESYMBOL(c('117_at','1007_s_at'), annotation)
where the variable 'annotation' is the data.frame containing Affymetrix annotation. Being interested in human gene RFC2 (Entrez geneID 5982), we can find its mouse ortholog using the information in HomoloGene. Having downloaded the flat database to a file called 'homologene.data' for instance, we first load it into R and then mine it using the function getHOMOLOG
> homologene <- read.delim('homologene.data', header = FALSE)
> getHOMOLOG(5982, 10090, homologene)
where the variable 'homologene' is the data.frame containing the HomoloGene database and 10090 is the taxonomy ID ofMus musculus. Detailed function descriptions and further examples are given in help files and in the package's vignette.
Mapping of the first 10 probe sets on Affymetrix Mouse 430 2.0 to probes (also called targets) annotated with the same RefSeq IDs on Illumina Mouse-6v10. Gene symbols (GS) for Affymetrix and Illumina (column 3 and 6 respectively) are indicative and were not used to perform the mapping. In this example, we disregarded the version number of RefSeq identifiers (i.e. the suffix of RefSeq accession number) and considered accession number with the same prefix (i.e. NM_123456.1 and NM_123456.2 for instance) to be equivalent.
Affy probe set
scl011972.1_247-S, scl000737.1_18-S, scl000708.1_100-S
The use of gene symbols was not always consistent across both annotations, as illustrated by the last probe set pair annotated toPsenen(official symbol) by Affymetrix and to1700023M09Rik(synonym) by Illumina. This suggests that, in this example, matching probes via gene symbols instead of RefSeq identifiers would have yielded an incomplete mapping, which highlights the importance of the matching identifier when setting up a mapping strategy. Note that if annotations do not contain any suitable identifier to be used as matcher, an intermediate mapping step can be introduced that makes use of a molecular biology database providing a link between several identifiers (e.g. Entrez Gene, Unigene). In the mapping presented above, imagine for instance that one of two microarray annotations would have listed gene IDs and not RefSeq IDs, we could have used the database gene2accession (provided by Entrez Gene via FTP) to link these two identifier types through an additional, intermediate matching step.
To illustrate how annotationTools allows to combine various sources of information, we now consider the mapping of orthologous probe sets (i.e. probe sets probing transcripts from orthologous genes) across two Affymetrix GeneChip formats, namely from Mouse 430 2.0 to HG-U133 Plus 2.0. We used the following procedure, illustrated in Figure1B: annotate probe sets on Mouse 430 2.0 with their gene IDs using the corresponding Affymetrix annotation and use the gene IDs to mine HomoloGene and find orthologous gene IDs in the target species (Homo sapiens). Finally, use the Affymetrix annotation for HG-U133 Plus 2.0 to retrieve the corresponding (orthologous) probe sets. This method is easily implemented using getGENEID, getHOMOLOG and getPROBESET sequentially (see Figure1B and the example code in the package's vignette) or, alternatively through a single call to the wrapper function ps2ps
> ps2ps(annotation_HGU133Plus2, annotation_Mouse4302, homologene, 10090)
assuming that 'annotation_HGU133Plus2', 'annotation_Mouse4302' and 'homologene' are data.frame objects containing the corresponding annotations and the HomoloGene database. 10090 is the taxonomy ID ofMus musculus. The function ps2ps returns a complete mapping table as a data.frame. This operation lasts on the order of ten minutes on a desktop computer. We compared the result of this procedure to 3 alternative sources of orthologous Affymetrix probe sets: Resourcerer (an interface to the Eukaryotic Gene Ortholog database), Ensembl, and Affymetrix orthology file ('Mouse430_2_ortholog.csv'). The 3 corresponding mapping tables (obtained as text files) can be easily mined for orthologous probe sets with getHOMOLOG.
We now present a practical example of cross-species analysis using gene expression data from Huntington's disease patients and from a genetic mouse model of the disease. Huntington's disease (HD) is a neurological disorder caused by a trinucleotide (CAG) expansion in theHDgene. Animal models of HD have allowed the demonstration that mutant protein expression results in transcriptional dysregulation of many genes. More recently, many mRNA changes have also been detected in the brains of HD patients. How do these changes compare with those identified in mouse models?
Here we will consider the CHL2 mouse model of HD and investigate whether the most robust mRNA changes detected in striatal samples of these mutant mice parallel those measured in the corresponding brain region of HD patients. Thereby, we aim at assessing the faithfulness of the animal model with regard to transcriptional dysregulations in human. We used public data from a study that profiled striata of CHL2 and control mice and assessed differential expression between both genotypes. In short, Strand et al. extracted RNA from 3 transgenic and 3 control animals and hybridized it to Affymetrix MG-U74Av2 arrays. These were normalized using RMA and differential expression analysis was performed with the R package limma. The list of Affymetrix probe sets along with measures of expression change and associated statistics for differential expression is publicly available. In addition, we used data from a large transcriptomic study of human HD using Affymetrix HG-U133A arrays. Human striatal gene expression profiles from 44 HD patients and 36 controls were analyzed similarly as described for the mice and the list of probe sets with associated measures of expression change and differential expression statistics is available from the same source.
To perform this mouse-human comparison, we first needed to find orthologous probe sets on the two microarrays used in the aforementioned studies (namely MG-U74Av2 for the mouse and HG-U133A for humans). As presented above, a table of orthologous probe sets can easily be generated with the function ps2ps in annotationTools. This mapping table and results from differential gene expression analyses of the mouse and human data could then be used to look up the probe sets present in the human data that are orthologous to the selected top mouse probe sets (i.e. those detecting differential gene expression in the CHL2 model). This is implemented by the function getOrthologousProbesets
> getOrthologousProbesets(mouse_ps, human_diffExpr, mapping)
that takes a vector of probes to be matched ('mouse_ps'), a data.frame object containing the target probe sets and associated log2fold changes or statistics of interest ('human_diffExpr'), and the mapping table ('mapping'). This returns a list of target probe sets and associated values (log2fold changes, respectively statistics). In case of multiple human orthologous probe sets found to match a given mouse probe set, the function can be configured to either output a summary value (e.g. median log2fold change) or a single value corresponding to a selected probe set (e.g. the value associated with the most significant probe set, see the corresponding help file for a complete description of the function and its options). This information allowed us to readily compare gene expression changes in the CHL2 mouse model with those measured in HD patients.
Probe sets associated with the 5 most significant mRNA changes in mice and whose human ortholog showed significant same-sign regulation (i.e. concordant, A) or significant different-sign regulation (i.e. discordant, B). PS, GS, logFC and P stand for probe set, gene symbol, log2fold change and p-value respectively. P-values in the human dataset (last column) are Bonferroni-corrected to adjust for multiple testing (44'928 human probe sets in total).
human corrected P
human corrected P
As shown in a recent study, different microarray technologies can reliably and concordantly measure gene expression profiles. The combination of heterogeneous microarray datasets can subserve different purposes. On the one hand, one might want to combine several independent profiling experiments aimed at answering the same question but performed with different technologies. We have shown here how to use annotationTools to quickly map probes across different microarray platforms and thus 'align' datasets. Such parallel experiments can then be considered in the framework of meta-analysis and analyzed so as to reach robust conclusions that would have been difficult to obtain based on any single study (see e.g.[27, 28]). Note that different statistical methods have been proposed for combining multiple independent studies, based on the use of parameter estimates (e.g. differential expression), effect sizes (e.g. standardized differential expression), p-values, statistic ranks or test decisions (e.g. via Venn diagrams). Their extension and application in the field of microarray studies is an area of current active research.
Alternatively, heterogeneous datasets can be considered sequentially: a first study is considered as a reference and is used to derive a particular transcriptional signature (e.g. for a given cellular pathway, cell population or disease) whose presence is then tested for in a second dataset. In this context, we showed how to use annotationTools to map orthologous probes across species and explore the recapitulation of a human disease signature in a particular mouse model. Our example suggested that such comparisons can be made at various resolution levels (e.g. correlation of differential expression measures, concordance of regulation direction, see Figure4). Our package does not address the issue of significance testing in such comparisons. Over the last years, several groups have proposed solutions to this issue (e.g[29, 30]) and a general framework for statistical testing of global similarity is now emerging[31, 32]. We have recently proposed the concordance coefficient as a new measure of similarity between datasets (based on the concordance of gene expression, see Results and Figure4B), that is amenable to formal statistical testing and that we used to assess the extent to which transcriptomic changes in Huntington's disease were recapitulated by different genetic mouse models. In particular, we could show that the CHL2 model significantly recapitulated aspects of gene expression changes detected in HD patients.
An alternative Bioconductor solution for annotation and ortholog finding makes use of specialized data packages, which are compilations of biological information obtained from various databases for probe sets of particular microarray formats. These meta-data packages are prepared with the package annBuilder and can be obtained from the Bioconductor website. They can be subsequently queried with special functions available through the package annotate. Another Bioconductor solution is provided by the package biomaRt which allows the (online) query of biomart databases (like e.g. Ensembl). Annotation and homology/orthology information can thus be retrieved from the newest built of available databases without downloading the database. Many annotations and molecular biology databases, however, are readily (and sometimes exclusively) available as flat file databases. These include annotations provided by commercial microarray manufacturers, by academic facilities producing and annotating their own spotted arrays, as well as re-annotation efforts aimed at providing more faithful representation of the transcript species measured by a given microarray platform (e.g.). Moreover, most information present in very large databases such as Entrez Gene for instance is distributed as various flat file databases. The package annotationTools provides the R user with a simple solution to mine and combine data from flat file databases in a systematic way. In particular, the functions handling orthology databases allow for a straightforward use of publicly available orthology information (e.g. HomoloGene). As cross-species studies will become more frequent in the near future, it is of particular importance to develop user-friendly, flexible analysis tools that ease the comparison of gene expression profiles across microarray platforms.
The Bioconductor-compliant package annotationTools allows analysts to perform microarray annotation tasks, match orthologous probes across microarrays and, more generally, use and combine information from flat databases within R. In particular, it offers an easy solution for implementing cross-species analysis of gene expression, which is of timely relevance.
The R package annotationTools is freely available under the GPL license and can be downloaded from Bioconductor. The details for this package are provided below.
Project name: annotationTools
Operating systems: Linux, Windows
Programming language: R
Other requirements: none
License: GNU GPL
Restriction to use by non-academics: none
We thank Thierry Sengstag for helpful discussions in the course of this project.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.