Genomic feature annotation and browsing
We have been investigating a proposed role for the L1 retrotransposon in facilitating X-chromosome inactivation, the process whereby a single X chromosome is epigenetically inactivated in cells of female mammals [11]. L1s (a major family of LINEs) are non-LTR retrotransposons 6–7 kb in length that collectively account for approximately a fifth of the mammalian genome [12]. To facilitate this study we have developed a novel software tool, the Genome Environment Browser (GEB), for visualising epigenetic and transcriptomic features in the context of the genome environment.
Currently GEB is configured to display the human and mouse genomes. While many public genome browsers are gene-oriented, GEB focuses on both genes and the non-genic elements in their vicinity. In the default version of the browser, CpG islands, LINEs (including families LINE-1, LINE-2 and LINE-3), L1s, SINEs and LTRs are annotated in detail in the context of known Ensembl genes and non-coding genes (mainly functional RNAs). Because of our interest in L1 elements we have included annotation of their functional components, the 5' UTR, ORF1, ORF2 and 3' UTR (Figure 1A), and additionally of full-length (FL) L1s. The distribution and density of all these features can be visualised at two levels: first, a histogram display for a panoramic, whole-chromosome view (Figure 1B), and second, a physical map display for more detailed analysis of regions between 500 bp and 25 Mb (Figure 1C). See Additional file 2 for the GEB user guide.
The histogram display is composed of a parallel set of histograms, each plotting the copy number or density of a genome feature across the entire chromosome (typically per Mb). It is effective at capturing the modularity of genome features across a chromosome. For example, the previously reported positive correlation between CpG islands and genes [13] and the reciprocal relationship between L1s and SINEs [14, 15] are reproduced graphically (Figure 1B). Specific regions with interesting patterns, e.g. a gene-rich region adjacent to a gene desert, can be easily identified and precisely selected on the histogram display to be examined in greater detail in the physical map display.
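The per-Mb densities plotted in the histogram display can be illustrated with a minimal Python sketch. This is not GEB's own code: the function name, the input format (a list of feature start coordinates) and the rule of assigning each feature to the window containing its start are assumptions made for illustration.

```python
from collections import Counter

def density_per_window(starts, chrom_length, window=1_000_000):
    """Count features per fixed-size window along one chromosome.

    starts: 0-based feature start coordinates (assumed input format).
    Each feature is assigned to the window that contains its start.
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = Counter(s // window for s in starts if 0 <= s < chrom_length)
    return [counts.get(i, 0) for i in range(n_windows)]

# Toy data: hypothetical CpG island starts on a 3.5 Mb chromosome.
cpg_starts = [120_000, 450_000, 980_000, 1_200_000, 3_400_000]
cpg_density = density_per_window(cpg_starts, chrom_length=3_500_000)
print(cpg_density)
```

Computing one such list per feature type and stacking the resulting histograms gives the kind of parallel view described above, in which correlated or reciprocal feature distributions become visible.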
In the physical map display, unlike some public genome browsers, all genome features (including interspersed repeats) are displayed on their respective strands. A key feature of GEB is the ability to display genomic features in two-dimensional multi-colour graphics reminiscent of a dot-plot, as illustrated by our custom annotation of L1 elements. The length of homology of an L1 element to each functional component in the FL-L1 consensus sequence is plotted on the vertical axis, whereas the physical length of the L1 element is indicated on the horizontal axis, as in public genome browsers (Figure 1C). Relatively rare FL-L1s appear as a long, continuous diagonal arrangement of UTRs and ORFs, and are clearly distinguished from the bulk of truncated L1s (mostly remnants of the 3' UTR), which are shown as shortened lines or bars.
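The two-dimensional L1 display described above can be sketched as a coordinate mapping, with the genomic position on the horizontal axis and the matched consensus position on the vertical axis. The sketch below assumes a single ungapped, forward-strand alignment per element, and the component boundaries in `COMPONENTS` are hypothetical placeholders rather than the real FL-L1 consensus coordinates.

```python
# Hypothetical component boundaries on the consensus (0-based, half-open).
# Illustrative values only; not the real FL-L1 consensus coordinates.
COMPONENTS = [
    ("5'UTR", 0, 900),
    ("ORF1", 900, 1900),
    ("ORF2", 1900, 5800),
    ("3'UTR", 5800, 6000),
]

def component_segments(genome_start, cons_start, cons_end):
    """Split one ungapped, forward-strand L1 alignment by component.

    Returns (name, x0, x1, y0, y1) tuples, where x is the genomic
    coordinate and y the matched consensus coordinate.
    """
    segments = []
    for name, c0, c1 in COMPONENTS:
        y0, y1 = max(cons_start, c0), min(cons_end, c1)
        if y0 < y1:  # the alignment overlaps this component
            x0 = genome_start + (y0 - cons_start)
            x1 = genome_start + (y1 - cons_start)
            segments.append((name, x0, x1, y0, y1))
    return segments

# A full-length element spans all four components (long diagonal) ...
full_length = component_segments(0, 0, 6000)
# ... whereas a 3'-end remnant yields only a short line.
truncated = component_segments(50_000, 5_500, 6_000)
```

Drawing each segment as a line from (x0, y0) to (x1, y1) in a component-specific colour would give a long diagonal for a full-length element and a short bar for a truncated one.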
Another key feature of GEB is the rapid transition between high and low resolutions in the physical map display. The "condensed" view of the selected region is initially displayed at 10–200 kb resolution (depending on the size of the region), and can be zoomed in flexibly on the same screen to a detailed 2 bp resolution (and back out again), removing the need to refresh or load a new page. This allows uninterrupted visual scanning of genome features and pattern searching across very large genomic distances, up to 25 Mb. In essence, the physical map display provides a zoomed-in view of the spatial relationship between features across the region of interest. Local enrichment or depletion of any type of feature can be readily visualised.
Data retrieval and preliminary quantitative data analysis tools
Detailed textual annotation of all features can be explored with a mouse click in the physical map display. For the retrieval of other types of annotation data (e.g. gene ontology, protein-related annotation, homology to genes in other species), we have incorporated a function in GEB that opens the relevant Ensembl page in the user's default web browser: ContigView if a genomic region has been selected in the GEB histogram or physical map display, and GeneView if a gene has been selected. Additionally, a simple quantitative sequence analysis tool is available, which calculates the copy number and/or percentage sequence representation of any type of annotated feature across a genome region of any size. For a more detailed analysis, the calculation can also be done in defined windows (e.g. every 100 kb) across a region (Figure 2).
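The windowed calculation of copy number and percentage sequence representation can be sketched as follows. This is an illustration under stated assumptions, not GEB's implementation: features are given as non-overlapping half-open `(start, end)` intervals, and a feature that straddles a window boundary is counted as a copy in every window it overlaps.

```python
def window_stats(features, region_start, region_end, window=100_000):
    """Copy number and percentage coverage of a feature type per window.

    features: (start, end) half-open intervals, assumed non-overlapping.
    Returns (win_start, win_end, copy_number, percent_covered) tuples.
    A feature counts as one copy in every window it overlaps (assumption).
    """
    results = []
    for w0 in range(region_start, region_end, window):
        w1 = min(w0 + window, region_end)  # the last window may be shorter
        copies = 0
        covered = 0
        for start, end in features:
            overlap = min(end, w1) - max(start, w0)
            if overlap > 0:
                copies += 1
                covered += overlap
        results.append((w0, w1, copies, 100.0 * covered / (w1 - w0)))
    return results

# Toy data: three hypothetical L1 elements in a 200 kb region.
l1_elements = [(10_000, 16_000), (95_000, 105_000), (150_000, 151_000)]
stats = window_stats(l1_elements, 0, 200_000)
for row in stats:
    print(row)
```

Setting `window` to the full region length reduces this to the single whole-region calculation.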
Displaying custom experimental data in GEB
The GEB histogram and physical map display framework can be applied to one or more sets of custom genome- or chromosome-wide experimental data in the context of the genome environment, for instance measurements of gene expression, locations of transcription factor binding sites or DNA sequence motifs, and ChIP-on-chip/ChIP-seq mapping of histone tail modifications and CpG methylation. Gene expression array data can be presented in two complementary ways in the histogram display (Figure 3A). First, the number of differentially expressed genes (DEGs) (up- or downregulated) per Mb interval can be plotted alongside other genome features of interest. Second, taking into account that some regions are inherently more gene-rich than others, GEB also allows plotting in each window the proportion of DEGs normalised to the underlying gene density. Specific regions of interest can be explored in detail, again in the physical map display, with an added function whereby genes are colour-coded to reflect their expression status (downregulated, upregulated, or no change/data unavailable). Additionally, the cut-offs defining differential expression can be altered in real time using an incorporated slide-bar, allowing the user to observe how the pattern of DEG distribution changes (Figure 3B).
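The two complementary histogram views of expression data, and the effect of moving the cut-off, can be illustrated with a short Python sketch. The input format (gene start plus log2 fold change, with `None` for unavailable data), the symmetric threshold and the choice to include genes without data in the denominator are all assumptions made here for illustration.

```python
def deg_summary(genes, threshold=1.0, window=1_000_000):
    """Summarise differentially expressed genes (DEGs) per window.

    genes: (start, log2_fold_change) pairs; None marks missing data.
    Returns {window_index: (n_up, n_down, n_genes, proportion_deg)}.
    Genes without data still count toward n_genes (assumption).
    """
    counts = {}
    for start, lfc in genes:
        idx = start // window
        up, down, total = counts.get(idx, (0, 0, 0))
        if lfc is not None and lfc >= threshold:
            up += 1
        elif lfc is not None and lfc <= -threshold:
            down += 1
        counts[idx] = (up, down, total + 1)
    return {i: (u, d, n, (u + d) / n) for i, (u, d, n) in counts.items()}

# Toy data: hypothetical genes on a 2 Mb stretch.
genes = [
    (100_000, 2.0), (300_000, -1.5), (700_000, 0.2), (900_000, None),
    (1_500_000, 1.2), (1_800_000, 0.1),
]
relaxed = deg_summary(genes, threshold=1.0)
strict = deg_summary(genes, threshold=1.5)  # analogous to moving the slide-bar
```

The first two values in each tuple correspond to the raw DEG counts per window and the last to the proportion normalised to gene density; recomputing with a different `threshold` shows how the DEG distribution shifts as the cut-off is changed.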
GEB's flexible and efficient browsing capabilities are particularly useful for visualising patterns embedded in densely tiled probes over long genomic distances, as in the case of ChIP-on-chip data. The interface for displaying data from tiling arrays shares many of the features described above for gene expression data, with two specific features in the physical map display that accommodate the higher probe density. A "glyphs" track has been designed for appreciating global patterns, where all or a subset of probes can be colour-coded according to their ChIP-enrichment level or status and displayed linearly (as if the probes were individual genes) (Figure 4A). To discern local patterns, the exact ChIP-enrichment level for each probe can also be plotted in the "graphs" track with an adjustable scale for the Y-axis (Figure 4B).
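The difference between the two tracks can be sketched in Python: the "glyphs" track reduces each probe to a colour class, whereas the "graphs" track keeps the exact value and maps it onto an adjustable Y-axis. The colour names, default thresholds and pixel-based track height below are hypothetical choices for illustration, not GEB's actual settings.

```python
def glyph_colour(enrichment, low=-1.0, high=1.0):
    """Map a probe's ChIP-enrichment value to a colour class.

    Colours and thresholds are hypothetical, for illustration only.
    """
    if enrichment is None:
        return "grey"   # no data for this probe
    if enrichment >= high:
        return "red"    # enriched
    if enrichment <= low:
        return "blue"   # depleted
    return "black"      # no change

def scale_to_track(value, y_min, y_max, track_height=100):
    """Clamp a value to [y_min, y_max] and convert it to a pixel height."""
    clamped = max(y_min, min(y_max, value))
    return round(track_height * (clamped - y_min) / (y_max - y_min))

# Toy data: hypothetical probe enrichment values (e.g. log2 ratios).
probes = [1.5, 0.3, -2.0, None]
colours = [glyph_colour(p) for p in probes]
heights = [scale_to_track(p, -2.0, 2.0) for p in probes if p is not None]
```

Changing `y_min` and `y_max` corresponds to adjusting the Y-axis scale: a narrower range exaggerates small local differences, at the cost of clamping extreme values.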
Comparison of GEB with other genome browsers
Relative to major public genome browsers (e.g. Ensembl Genome Browser [16], UCSC Genome Browser [17] and NCBI Map Viewer [18]), GEB provides novel visualisation tools to analyse patterns of genome features and experimental data along an entire chromosome, and additionally the capability to move rapidly to a scalable detailed view with a dynamic range between 25 Mb and 2 bp resolution. Thus, users can move seamlessly through data covering a large genomic region, for example zooming in and out between megabase-scale gene clusters and specific genes, without the need to change to a new page. Additionally, GEB has relatively detailed annotation of interspersed repeats, including directionality, and provides simple tools to quantify the density or frequency of a given genomic feature within user-defined limits. These tools are important for preliminary quantitative data analysis in cases where interesting patterns have been visualised.
GEB also has unique features relative to more recently developed track-based public browsers for integrating genome and experimental data, such as Argo Genome Browser [19] and Integrative Genomics Viewer [20]. These include the detailed annotation of dispersed repeats and GEB's unique display options; the use of the anti-parallel "top and bottom" strand representation (as used in Ensembl) for detecting potential strand bias in the distribution of selected features (e.g. gene clusters); the interactive visualisation tools for chromosome-wide feature density and distribution; and the quantitative data analysis tools.