GenomeGraphs: integrated genomic data visualization with R
- Steffen Durinck†1, 2,
- James Bullard†2,
- Paul T Spellman1 and
- Sandrine Dudoit2, 3
© Durinck et al; licensee BioMed Central Ltd. 2009
Received: 14 April 2008
Accepted: 06 January 2009
Published: 06 January 2009
Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses.
We developed GenomeGraphs as an add-on software package for the statistical programming environment R to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates the results into gene/transcript structures drawn in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also plot custom annotation tracks in combination with different experimental data types in a single plot that uses one genomic coordinate system.
GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.
Computational biologists are dealing with a growing range of genomic datasets, including microarray (e.g., mRNA, ChIP, SNP, CGH, and tiling-Chip) and ultra high-throughput sequencing (e.g., mRNA-Seq and ChIP-Seq) data. An increasing number of biological studies involve multiple, distinct, and high-throughput assays to characterize samples of interest. Novel and flexible visualization methods are needed to integrate these various data sources and combine them with annotation data from biological databases such as Ensembl.
Genome browsers such as the Ensembl Genome Browser, NCBI Entrez Map Viewer, and UCSC's Golden Path Genome Browser enable upload and visualization of experimental data but have limited plotting options, offer no analysis capabilities for the displayed data, and are too far removed from the environment used to conduct statistical analysis. Other tools linking genome annotation to experimental data are mostly limited to a specific data type or rely on the Genome Browser's viewers for visualization. Statistical Viewer, for example, facilitates interpretation of linkage and association data by providing a plug-in for data upload to the Ensembl Genome Browser.
The X:Map genome annotation database and its companion software package exonmap enable integrated visualization of experimental data and genome annotation, but they are specific to exon arrays and require a local installation of the Ensembl database. They do not currently support visualization of multiple datasets and do not represent alternative splicing structures.
The main drawback of the tools described above is that they are not programmatically accessible and cannot be integrated into an analysis pipeline requiring batch processing. In addition, the required data upload step does not scale well for large and complex datasets.
The statistical programming environment R (http://www.r-project.org), along with the Bioconductor Project (http://www.bioconductor.org), provides a plethora of methods and tools to analyze and visualize data. The software package described in this paper, GenomeGraphs, builds on this functionality by providing an integrated API for direct visualization of data from a variety of sources. GenomeGraphs allows complex customization to facilitate a more complete integration and representation of genomic datasets.
Genomic dataset objects
Overview of classes representing drawable genomic datasets
gdObject: the root class of the system, never directly instantiated
DisplayPars: class managing various plotting parameters
Gene: class representing a gene
GeneRegion: class defining a region of a chromosome, generally a set of genetic elements (genes)
Transcript: class defining a transcript
TranscriptRegion: class defining a region of a chromosome, generally a set of genetic elements (transcripts)
Ideogram: class representing an ideogram
Title: class to draw a title
Legend: class to draw a legend
GenomeAxis: class to draw an axis
AnnotationTrack: class used to represent custom annotation
Overlay: root class for overlays, never directly instantiated
RectangleOverlay: class to represent rectangular regions of interest
TextOverlay: class to draw text on plots
Segmentation: class to draw horizontal lines in various sets of data
GenericArray: class to draw data from microarrays
ExonArray: class to draw data from exon microarrays
GeneModel: class to draw custom gene models (intron-exon structures)
BaseTrack: class to draw arbitrary data at a given base
MappedRead: class to plot sequencing reads that are mapped to the genome
New technological developments to characterize cellular states may need novel representations. Classes representing these new data types can be easily added to GenomeGraphs and if the corresponding drawing methods are defined, the new data structures can be plotted using gdPlot along with data from existing classes.
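The extension mechanism described above follows the usual S4 pattern: a new class inherits from a root class and a drawing method is registered for it. The sketch below is self-contained and deliberately does not touch the GenomeGraphs classes. The names TrackBase, MethylationTrack, and drawTrack are hypothetical stand-ins for the package's root class, a new data type, and its drawing generic, and the data are invented.

```r
library(methods)
library(grid)

# Hypothetical root class (stand-in for the package's root class).
# It is virtual, so it can never be instantiated directly.
setClass("TrackBase", representation("VIRTUAL", name = "character"))

# Hypothetical new data type: one measurement per genomic position.
setClass("MethylationTrack", contains = "TrackBase",
         representation(position = "numeric", level = "numeric"))

# Hypothetical drawing generic; a plotting driver would call it per track.
setGeneric("drawTrack", function(track, minBase, maxBase, ...)
  standardGeneric("drawTrack"))

# Drawing method for the new class: plot the points that fall inside
# [minBase, maxBase] in a grid viewport scaled to genomic coordinates.
setMethod("drawTrack", "MethylationTrack",
          function(track, minBase, maxBase, ...) {
  keep <- track@position >= minBase & track@position <= maxBase
  pushViewport(dataViewport(xscale = c(minBase, maxBase), yscale = c(0, 1)))
  grid.points(track@position[keep], track@level[keep],
              pch = 16, size = unit(0.3, "char"))
  popViewport()
  invisible(sum(keep))  # number of points drawn
})

# Usage with invented data; pdf(NULL) draws without writing a file.
pdf(NULL)
grid.newpage()
trk <- new("MethylationTrack", name = "demo",
           position = c(100, 200, 300), level = c(0.1, 0.5, 0.9))
n_drawn <- drawTrack(trk, minBase = 150, maxBase = 350)
invisible(dev.off())
```

Because dispatch is by class, a plotting driver only needs to call the generic on each element of a list of tracks; it does not need to know which data types exist.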
Genome annotation retrieval from Ensembl using biomaRt
GenomeGraphs relies on the biomaRt package to retrieve genomic annotation information on-line from Ensembl using BioMart web services. The annotation information retrievable through biomaRt ranges from gene annotation and transcript isoforms to SNP data. This information can be retrieved from the most current release of Ensembl or from archived releases. Because biomaRt queries Ensembl remotely, no local installation of the Ensembl database is needed, which greatly simplifies software installation.
Custom genome annotation tracks
Ensembl contains annotation of a limited number of eukaryotic genomes. Any custom genome annotation can be visualized in GenomeGraphs by constructing instances of the AnnotationTrack class. For instance, genomic annotation encoded in GFF files can be easily used to create a custom AnnotationTrack object for visualization. To use the AnnotationTrack class, region start and end positions need to be given, as well as how these regions are to be grouped.
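As a minimal sketch of the GFF route, the code below parses a few GFF-like records into the start, end, and grouping vectors that an AnnotationTrack needs. The records, the parse_gff helper, and the choice of the ID attribute as the grouping key are illustrative assumptions, not part of the package. The final call is left as a comment; it only mirrors the makeAnnotationTrack arguments used in Example II of this paper.

```r
# Hypothetical GFF records (nine tab-separated columns); not real annotation.
gff_text <- paste(
  "chrI\tdemo\texon\t1000\t1200\t.\t+\t.\tID=geneA",
  "chrI\tdemo\texon\t1500\t1800\t.\t+\t.\tID=geneA",
  "chrI\tdemo\texon\t2500\t2900\t.\t-\t.\tID=geneB",
  sep = "\n")

# Read GFF text and derive a grouping key from the ID attribute.
parse_gff <- function(text) {
  gff <- read.delim(text = text, header = FALSE, comment.char = "#",
                    stringsAsFactors = FALSE,
                    col.names = c("seqid", "source", "type", "start", "end",
                                  "score", "strand", "phase", "attributes"))
  # Assumption: regions sharing an ID belong to the same group.
  gff$group <- sub(".*ID=([^;]+).*", "\\1", gff$attributes)
  gff[, c("seqid", "type", "start", "end", "strand", "group")]
}

regions <- parse_gff(gff_text)

# With GenomeGraphs loaded, these vectors could be passed on in the same
# way as in Example II (the colour choice here is an assumption):
# track <- makeAnnotationTrack(start = regions$start, end = regions$end,
#                              feature = regions$type, group = regions$group,
#                              dp = DisplayPars(exon = "darkblue"))
```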
Mapping of user data to genomic coordinates
GenomeGraphs is a visualization tool and as such does not provide mappings of user supplied data to the genome. Instances of the class gdObject take as input genomic coordinates provided by the user who is responsible for ensuring that these coordinates match the relevant genome annotation. To get the chromosomal coordinates of the data, users can either rely on the annotation provided by the platform which generated the data or on independently created mappings to the genome.
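The bookkeeping left to the user can be sketched in base R: join measurements to a platform annotation table by probe identifier, then keep the probes that fall in the region to be plotted. All data, column names, and the map_to_region helper below are hypothetical.

```r
# Hypothetical platform annotation: probe identifiers with coordinates.
probe_annotation <- data.frame(
  probe_id   = c("p1", "p2", "p3", "p4"),
  chromosome = c("3", "3", "3", "7"),
  start      = c(180300000, 180350000, 180400000, 5000000),
  stringsAsFactors = FALSE)

# Hypothetical measurements keyed by probe identifier ("p5" is unannotated).
measurements <- data.frame(
  probe_id  = c("p3", "p1", "p4", "p5"),
  intensity = c(2.1, 0.4, 1.7, 3.3),
  stringsAsFactors = FALSE)

# Attach coordinates to the measurements and restrict them to one region.
map_to_region <- function(values, annotation, chromosome, minbase, maxbase) {
  # Inner join: probes without annotation are dropped.
  mapped <- merge(values, annotation, by = "probe_id")
  inside <- mapped$chromosome == chromosome &
    mapped$start >= minbase & mapped$start <= maxbase
  mapped <- mapped[inside, ]
  mapped[order(mapped$start), ]  # sort by genomic position
}

region_data <- map_to_region(measurements, probe_annotation,
                             chromosome = "3",
                             minbase = 180292097, maxbase = 180492096)

# region_data$start and a one-column matrix of region_data$intensity are
# the kind of input the probeStart and intensity arguments expect.
```

The inner join silently discards probes that lack annotation, so it is worth comparing the number of rows before and after mapping.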
Example I: arrayCGH and exon array data
In this first example, we illustrate how different genomic datasets can be visualized together in an integrated GenomeGraphs graphic. We use arrayCGH and Affymetrix exon array data and plot these together with genomic annotation from Ensembl.
We first load the GenomeGraphs package and one of its example datasets. This dataset contains copy number data and segmented copy number data, as well as exon array data for a small genomic region. Once the data are loaded, a gdObject is created for each data type, namely a Segmentation object containing the copy number segments, a GenericArray object containing the raw copy number data, an Ideogram object representing the relevant chromosome we are plotting, a GenericArray object containing the exon array data, and a GenomeAxis object for the genomic coordinate axis.
> data('exampleData', package='GenomeGraphs')
> seg = makeSegmentation(value = segments,
    start = segStart, end = segEnd, dp = DisplayPars(color = 'dodgerblue2', lwd = 2, lty = 'dashed'))
> copyNumber = makeGenericArray(intensity = cn, probeStart = probestart,
    segmentation = seg, dp = DisplayPars(size = 3, color = 'seagreen', type = 'dot'))
> ideogram = makeIdeogram(chromosome = 3)
> expression = makeGenericArray(intensity = intensity, probeStart = exonProbePos,
    dp = DisplayPars(color = 'darkred', type = 'point'))
> genomeAxis = makeGenomeAxis(add53 = TRUE, add35 = TRUE)
In a next step, genomic annotation information is retrieved on-line from Ensembl using the biomaRt package. We first connect to the Ensembl BioMart database and select the human (hsapiens) dataset. Then, we retrieve gene structures on the forward and reverse strands of the region we want to visualize.
> minbase = 180292097
> maxbase = 180492096
> mart = useMart('ensembl', dataset='hsapiens_gene_ensembl')
> genesplus = makeGeneRegion(start = minbase, end = maxbase, strand = '+', chromosome = '3', biomart = mart)
> genesmin = makeGeneRegion(start = minbase, end = maxbase, strand = '-', chromosome = '3', biomart = mart)
> gdPlot(list(ideogram, expression, copyNumber, genesplus, genomeAxis, genesmin), minBase = minbase, maxBase = maxbase)
Example II: Transcript isoforms and exon array data
> data('unrData', package='GenomeGraphs')
> title = makeTitle(text ='ENSG00000009307', color = 'darkred')
> col = colorRampPalette(c('firebrick2','dodgerblue2'))(length(unrData[1,]))
> exon = makeExonArray(intensity = unrData, probeStart = unrPositions[,3], probeEnd = unrPositions[,4],
    probeId = as.character(unrPositions[,1]), nProbes = unrNProbes,
    dp = DisplayPars(color = col, mapColor = 'dodgerblue2'), displayProbesets = FALSE)
> affyModel <- makeAnnotationTrack(start = unrPositions[,3], end = unrPositions[,4],
    feature = "gene_model", group = "ENSG00000009307",
    dp = DisplayPars(gene_model = "darkblue"))
> gene = makeGene(id = 'ENSG00000009307', biomart = mart)
> transcript = makeTranscript(id = 'ENSG00000009307', biomart = mart)
> legend = makeLegend(text = c('affyModel','Ensembl Gene', 'Ensembl Transcript'),
    fill = c('darkgreen','orange','cornflowerblue'), cex = 0.5)
In a second step, we use the RectangleOverlay class to create a highlighted region followed by the gdPlot function to produce the integrated plot.
> rOverlay = makeRectangleOverlay(start = 115085100, end = 115086500, region = c(3,5),
    dp = DisplayPars(alpha = .2, fill = "olivedrab1"))
> gdPlot(list(title, exon, affyModel, gene, transcript, legend), minBase = 115061061, maxBase = 115102147, overlay = rOverlay)
The plot generated in this second example is shown in Figure 2.
Example III: Short read sequencing and tiling array data
In the final example, we show how complex and diverse datasets can be integrated to facilitate joint analysis and draw biological conclusions, using various published datasets on yeast. First, we construct a list in which each gdObject represents either annotation or a publicly available dataset. We plot data from Ensembl, an Illumina sequencing dataset, Affymetrix tiling array data, nucleosome position data, and conservation data across 7 related species.
> data("seqDataEx", package = "GenomeGraphs")
> str = seqDataEx$david [,"strand"] == 1
> biomart = useMart("ensembl", "scerevisiae_gene_ensembl")
> pList = list("-" = makeGeneRegion(chromosome = "IV", start = 1300000, end = 1310000,
    strand = "-", biomart = biomart,
    dp = DisplayPars(plotId = TRUE, idRotation = 0, cex = .5)),
  makeGenomeAxis(dp = DisplayPars(byValue = 1e3, size = 3)),
  "+" = makeGeneRegion(chromosome = "IV", start = 1300000, end = 1310000,
    strand = "+", biomart = biomart,
    dp = DisplayPars(plotId = TRUE, idRotation = 0, cex = .5)),
  "Nagalakshmi" = makeBaseTrack(base = seqDataEx$snyder[, "location"], value = seqDataEx$snyder[, "counts"],
    dp = DisplayPars(lwd = .3, color = "darkblue", ylim = c(0,300))),
  "David +" = makeGenericArray(probeStart = seqDataEx$david[str, "location"],
    intensity = seqDataEx$david[str, "expr", drop = FALSE],
    dp = DisplayPars(pointSize = .5)),
  "David -" = makeGenericArray(probeStart = seqDataEx$david[!str, "location"],
    intensity = seqDataEx$david[!str, "expr", drop = FALSE],
    dp = DisplayPars(color = "darkgreen", pointSize = .5)),
  "Lee" = makeBaseTrack(base = seqDataEx$nislow[, "location"],
    value = seqDataEx$nislow[, "evalue"], dp = DisplayPars(color = "grey", lwd = .25)),
  "Conservation" = makeBaseTrack(base = seqDataEx$conservation[, "location"],
    value = seqDataEx$conservation[, "score"],
    dp = DisplayPars(color = "gold4", lwd = .25)))
Having constructed the list of elements we wish to plot, we now set up an overlay, using the RectangleOverlay class, to highlight a region of interest. Finally, we plot the result using gdPlot. Although configuring and designing the initial plot may seem laborious, once we have this basic structure we can easily produce plots for all regions of interest.
> rOverlay = makeRectangleOverlay(start = 1302105, end = 1302190, region = c(4,8), dp = DisplayPars(alpha = .2))
> gdPlot(pList, minBase = 1301500, maxBase = 1302500, overlay = rOverlay)
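Reusing one track list for many regions can be sketched as a loop over a table of coordinates. The plot_regions helper and the regions table are hypothetical. The plotting function is passed in as an argument so that the sketch runs without GenomeGraphs or network access; in practice one would pass gdPlot and the pList built above, assuming each gdPlot call starts a new page on the open device.

```r
# Draw the same tracks for every region, one page per region.
plot_regions <- function(tracks, regions, plot_fun, file = NULL) {
  pdf(file)              # file = NULL draws without writing a file
  on.exit(dev.off())     # close the device even if a plot fails
  for (i in seq_len(nrow(regions))) {
    plot_fun(tracks, minBase = regions$start[i], maxBase = regions$end[i])
  }
  invisible(nrow(regions))
}

# Hypothetical regions of interest on the same chromosome.
regions <- data.frame(start = c(1301500, 1304000),
                      end   = c(1302500, 1305000))

# Stand-in for the real plotting function: draws an empty frame per region.
demo_plot <- function(tracks, minBase, maxBase) {
  plot(c(minBase, maxBase), c(0, 1), type = "n",
       xlab = "position", ylab = "")
}

n_pages <- plot_regions(list(), regions, demo_plot)

# With GenomeGraphs loaded (not run here):
# plot_regions(pList, regions, gdPlot, file = "regions.pdf")
```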
GenomeGraphs is a versatile and extensible visualization package in R, well suited to creating integrated displays of diverse experimental datasets and genomic annotation information. By using the biomaRt package, annotation information is retrieved directly from Ensembl and there is no need to install and maintain annotation databases locally. Custom annotation tracks can also be created by using the AnnotationTrack class. Finally, GenomeGraphs is tightly integrated with R, giving the user immediate access to a wealth of statistical methods.
The software package comes with a vignette, an executable document that demonstrates how to use the package. The examples described in this paper are included in the vignette and can be executed after installation of the package. More complex features are also demonstrated in the vignette. Future versions of the package will offer more flexible plotting parameters and support new features, such as visualization of SNP information as annotated by Ensembl and stacked representations of sequencing reads.
Availability and requirements
GenomeGraphs is an open source software package under the Artistic-2.0 license and has been contributed to the Bioconductor Project. The software and source code are available for download from http://www.bioconductor.org. This document was produced using R-2.8.0 and GenomeGraphs version 1.2.0, available at the following URL: http://bioconductor.org/packages/2.3/bioc/html/GenomeGraphs.html. The package has been tested and run on OS X, Windows, and a variety of Linux systems. GenomeGraphs depends on the following software packages: XML, RCurl, and biomaRt, which can be downloaded from Bioconductor or installed from R using the http://www.bioconductor.org/biocLite.R script. The versatility of GenomeGraphs visualization relies on the powerful R plotting package grid. Each gdObject is plotted in an individual viewport from the grid package. The grid package is typically installed together with the base installation of R.
We would like to acknowledge Elizabeth Purdom and Mark Robinson for beta testing early versions of the software and contributing the ExonArray example dataset. We thank the anonymous reviewers for their comments and suggestions to improve this work. Funding was provided by the U54 CA 112970 grant of the TCGA project (http://cancergenome.nih.gov/).
- Hubbard T, Aken B, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, et al.: Ensembl 2009. Nucleic Acids Res 2009, (37 Database):D690–697. 10.1093/nar/gkn828
- Wheeler D, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer L, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman D, Madden T, Maglott D, Miller V, Ostell J, Pruitt K, Schuler G, Shumway M, Sequeira E, Sherry S, Sirotkin K, Souvorov A, Starchenko G, Tatusov R, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 2008, (36 Database):D780-D786.
- Karolchik D, Kuhn R, Baertsch R, Barber G, Clawson H, Diekhans M, Giardine B, Harte R, Hinrichs A, Hsu F, Kober K, Miller W, Pedersen J, Pohl A, Raney B, Rhead B, Rosenbloom K, Smith K, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig A, Haussler D, Kent W: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Research 2008, (36 Database):D773-D779.
- Stenger J, Xu H, Haynes C, Hauser E, Pericak-Vance M, Goldschmidt-Clermont P, Vance J: Statistical Viewer: a tool to upload and integrate linkage and association data as plots displayed within the Ensembl genome browser. BMC Bioinformatics 2005, 6: 95. 10.1186/1471-2105-6-95
- Yates T, Okoniewski M, Miller C: X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis. Nucleic Acids Research 2008, (36 Database):D780-D786.
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. [http://www.R-project.org]
- Durinck S, Moreau Y, Kasprzyk A, Davis S, Moor BD, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 2005, 21(16):3439–3440. 10.1093/bioinformatics/bti525
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14(1):160–169. 10.1101/gr.1645104
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320(5881):1344–1349. 10.1126/science.1158441
- David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM: A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA 2006, 103(14):5320–5325. 10.1073/pnas.0601091103
- Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C: A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 2007, 39(10):1235–1244. 10.1038/ng2117
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034–1050. 10.1101/gr.3715005
- Murrell P: R Graphics. Boca Raton: CRC Press; 2005.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.