ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

Song, Yabing; Wang, Jianbin

doi:10.1186/s12859-023-05438-2

Software
Open access
Published: 09 August 2023

BMC Bioinformatics volume 24, Article number: 309 (2023)

Abstract

Background

Visualizing genome coverage is vital for inspecting and interpreting next-generation sequencing (NGS) data. Besides genome coverage, genome annotations are also crucial in the visualization. Because different NGS data require different annotations, visualizing genome coverage and adding the appropriate annotations conveniently is challenging. Many tools have been developed to address this issue. However, existing tools are often inflexible or complicated, lack necessary preprocessing steps and annotations, and generate figures that support only limited customization.

Results

Here, we introduce ggcoverage, an R package to visualize and annotate genome coverage of multi-groups and multi-omics. The input files for ggcoverage can be in BAM, BigWig, BedGraph and TSV formats. For better usability, ggcoverage provides reliable and efficient ways to perform read normalization, consensus peak generation and track data loading with state-of-the-art tools. ggcoverage provides a range of annotations suited to different NGS data (e.g. WGS/WES, RNA-seq, ChIP-seq), and all of them can be easily superimposed with the ‘+’ operator. ggcoverage can generate publication-quality plots, and users can customize the plots with ggplot2. In addition, ggcoverage supports the visualization and annotation of protein coverage.

Conclusions

ggcoverage provides a flexible, programmable, efficient and user-friendly way to visualize and annotate genome coverage of multi-groups and multi-omics. The ggcoverage package is available at https://github.com/showteeth/ggcoverage under the MIT license, and the vignettes are available at https://showteeth.github.io/ggcoverage/.

Background

Visualizing genome coverage is vital for inspecting and interpreting next-generation sequencing (NGS) data. Besides genome coverage, genome annotations are also crucial in the visualization. When analyzing whole-genome sequencing (WGS) data to identify copy number variations (CNV), a genome coverage plot can be used to check for possible confounding factors, such as GC content bias and proximity to telomeres and centromeres [1]. When dealing with RNA-sequencing (RNA-seq) data, a genome coverage plot can be used to inspect gene or exon knockout efficiency and 5’ or 3’ bias, and to visualize the read counts of differentially expressed genes, transcripts or exons [2]. When processing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data, a genome coverage plot can help to obtain and verify peaks by comparing the signal of ChIP and input samples, and to visualize the relative distance between identified peaks and nearby genes [3].

Many tools have been developed to visualize genome coverage. However, existing tools are often inflexible or complicated, lack necessary preprocessing steps and annotations, and generate figures that support only limited customization. For example, the UCSC Genome Browser [4] and the IGV browser [5] require file upload or data transmission, which usually takes time, and are not accessible programmatically. Gviz [6] offers limited customization of plot aesthetics and themes. ggbio [7] and GenVisR [8] provide limited annotations. karyoploteR [9] focuses on visualizing chromosome ideograms, and creating a genome coverage plot with it is complicated. See Table 1 for a detailed comparison of ggcoverage to other visualization tools.

Table 1 Comparison of ggcoverage to other visualization tools

Here, we present ggcoverage, an R package providing a flexible, programmable, efficient and user-friendly way to visualize genome coverage of multi-groups and multi-omics. It supports multiple input file formats and provides several functions to perform data preprocessing, including parallel read normalization, consensus peak generation and track data loading. It also provides various available annotations, which can be superimposed conveniently to better inspect and interpret different NGS data. Furthermore, ggcoverage can generate publication-ready plots and users can customize the plots with ggplot2 [10]. In addition, ggcoverage supports the visualization of protein coverage based on peptides obtained by mass spectrometry and adds protein feature annotation to the coverage plot.

Implementation

Inputs

The input files for ggcoverage to visualize genome coverage can be in BAM, BigWig, BedGraph and TSV formats. A TSV file should contain columns that specify the chromosome, start, end, sample type and sample group. ggcoverage also requires additional files to generate annotations, such as a FASTA file for GC content annotation, a gene transfer format (GTF) file for gene and transcript annotations, and a peak file for peak annotation. For the visualization of protein coverage, the input files should be Excel spreadsheets exported from an analyzer such as Proteome Discoverer.
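As an illustration of the TSV layout described above, the following base R sketch writes and re-reads a small toy table and checks that the expected columns are present. The column names (seqnames, start, end, Type, Group) and the extra score column holding the coverage value are assumptions made for this example, not names confirmed by the text; the values are toy data.

```r
# Toy coverage table. Column names are illustrative assumptions:
# chromosome (seqnames), start, end, sample type (Type), sample group (Group),
# plus a hypothetical 'score' column holding the coverage value.
coverage_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr1"),
  start    = c(1L, 101L, 1L, 101L),
  end      = c(100L, 200L, 100L, 200L),
  score    = c(12.5, 30.0, 8.0, 15.5),
  Type     = c("KO_rep1", "KO_rep1", "WT_rep1", "WT_rep1"),
  Group    = c("KO", "KO", "WT", "WT"),
  stringsAsFactors = FALSE
)

# Write the table as a tab-separated file with a header row.
tsv_path <- tempfile(fileext = ".tsv")
write.table(coverage_df, file = tsv_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back and verify that the columns named in the text are present.
loaded_df <- read.delim(tsv_path, stringsAsFactors = FALSE)
required_cols <- c("seqnames", "start", "end", "Type", "Group")
missing_cols <- setdiff(required_cols, colnames(loaded_df))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Checking the header before plotting gives an early, readable error when a column is absent or misnamed.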

Data preprocessing

Read normalization, consensus peak generation and track data loading are usually prerequisites for visualization. However, these steps require users to learn different tools and possibly switch between platforms. To simplify this, ggcoverage provides functions to perform data preprocessing with state-of-the-art tools. For read normalization, ggcoverage provides multiple normalization methods to adapt to different NGS data using deepTools [11] and parallelizes this process with BiocParallel [12]. When peak files from replicates are provided, ggcoverage can generate consensus peaks with MSPC, which can run with more than two replicates and combines evidence (e.g. P-value) from multiple replicates to obtain more reliable peaks [13]. To load track data, ggcoverage extracts only the region specified by the user, instead of loading whole files and then extracting the region to be visualized.
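The two ideas in this paragraph, restricting data to the visualized region and normalizing reads per bin, can be sketched conceptually in base R. This is not ggcoverage's or deepTools' implementation: the helper names subset_region and cpm_normalize are hypothetical, the bin table is toy data, and counts per million (CPM) is used only as one common normalization. The sketch also filters an in-memory table, whereas the saving described above comes from reading only the requested region from the file.

```r
# Keep only bins that overlap the requested region (closed intervals).
# Hypothetical helper for illustration.
subset_region <- function(bins, chrom, region_start, region_end) {
  keep <- bins$seqnames == chrom &
    bins$end >= region_start &
    bins$start <= region_end
  bins[keep, , drop = FALSE]
}

# Counts per million mapped reads: bin count scaled by library size.
# Hypothetical helper for illustration.
cpm_normalize <- function(counts, total_mapped_reads) {
  stopifnot(total_mapped_reads > 0)
  counts / total_mapped_reads * 1e6
}

# Toy per-bin read counts.
bins <- data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr2"),
  start    = c(1, 101, 201, 1),
  end      = c(100, 200, 300, 100),
  count    = c(50, 200, 10, 75)
)

# Restrict to chr1:150-250, then normalize with an assumed library size
# of two million mapped reads.
region_bins <- subset_region(bins, "chr1", 150, 250)
region_bins$cpm <- cpm_normalize(region_bins$count, total_mapped_reads = 2e6)
print(region_bins)
```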

Visualization

ggcoverage introduces twelve layers to visualize and annotate coverage plots (Table 2), along with corresponding themes to polish the figures. geom_coverage generates a coverage plot of a specified region for different samples across different groups and provides ‘joint’ and ‘facet’ display styles (Additional file 1: Fig. S1); when mark regions are supplied, it also highlights them. geom_base shows the base frequency and reference base at each locus, as well as the amino acids of the given region in IGV style; when SNVs exist, it highlights them in one of three styles (Additional file 1: Fig. S2). geom_cnv shows the normalized bin count and estimated copy number. geom_gc calculates and visualizes the GC content of every bin and adds a line indicating the mean or a user-specified GC content. geom_gene obtains all genes in the given region and assigns them to different groups to avoid overlap when plotting. In the gene annotation, the arrow direction indicates the strand of a gene, the height of an element indicates the gene part it represents, and the line color indicates the gene strand or user-specified group information (e.g. gene type). geom_transcript is similar to geom_gene, but it shows all transcripts of a gene rather than the whole gene structure. geom_peak shows the identified peaks so that the peaks and nearby genes can be inspected together. geom_ideogram, based on ggbio [7], shows a chromosome ideogram to illustrate the relative position of the displayed region on the chromosome. geom_tad shows 3D chromatin contact maps based on HiCBricks [14]. geom_link creates peak-gene or DNA-DNA links. geom_protein generates a protein coverage plot based on peptides obtained by mass spectrometry (Additional file 1: Fig. S3). geom_feature shows characteristics of a protein or genome (Additional file 1: Fig. S3).
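To make the per-bin GC content behind a layer such as geom_gc concrete, here is a minimal base R sketch. It is not the package's code: the function name gc_content_per_bin, the toy sequence and the bin size are assumptions chosen for illustration.

```r
# Fraction of G/C bases in consecutive, non-overlapping bins of a sequence.
# Hypothetical helper; the last bin may be shorter than bin_size.
gc_content_per_bin <- function(sequence, bin_size) {
  bases <- strsplit(toupper(sequence), "")[[1]]
  bin_index <- ceiling(seq_along(bases) / bin_size)
  is_gc <- bases %in% c("G", "C")
  as.numeric(tapply(is_gc, bin_index, mean))
}

# Toy 16-base sequence split into four bins of four bases each.
toy_sequence <- "GGCCATATGCGCATAT"
gc_by_bin <- gc_content_per_bin(toy_sequence, bin_size = 4)

# Mean across bins, the kind of value a reference line could be drawn at.
mean_gc <- mean(gc_by_bin)
print(gc_by_bin)
print(mean_gc)
```

With equal-sized bins, the mean of the per-bin values equals the overall GC fraction of the sequence; with a shorter final bin the two can differ slightly.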

Table 2 ggcoverage layers

Similar to the grammar of graphics implemented in ggplot2, users can superimpose the above layers with the ‘+’ operator, with the help of patchwork [15]. For example, ggcoverage() + geom_gc() + geom_gene() + geom_ideogram() will create a genome coverage plot and add GC content, gene structure and chromosome ideogram annotations.
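The stacking idea can be illustrated with plain ggplot2 and patchwork, without ggcoverage itself. The sketch below is an assumption-laden stand-in: the data are synthetic, and the two tracks are ordinary ggplot objects rather than ggcoverage layers. It only shows how ‘+’ combined with patchwork's plot_layout() places an annotation track under a coverage track.

```r
library(ggplot2)
library(patchwork)

# Synthetic coverage signal along a 1 kb window (toy data).
coverage_df <- data.frame(position = seq(1, 1000, by = 10))
coverage_df$score <- 20 + 10 * sin(coverage_df$position / 100)

# Synthetic GC content per 50 bp bin (toy data).
gc_df <- data.frame(position = seq(1, 1000, by = 50))
gc_df$gc <- 0.4 + 0.1 * cos(gc_df$position / 200)

# Upper track: coverage drawn as a filled area.
coverage_track <- ggplot(coverage_df, aes(x = position, y = score)) +
  geom_area(fill = "steelblue") +
  labs(x = NULL, y = "Coverage")

# Lower track: GC content drawn as a line.
gc_track <- ggplot(gc_df, aes(x = position, y = gc)) +
  geom_line() +
  labs(x = "Position (bp)", y = "GC content")

# patchwork overloads '+' for ggplot objects; plot_layout() stacks the
# tracks in one column and gives the coverage track three times the height.
stacked_plot <- coverage_track + gc_track +
  plot_layout(ncol = 1, heights = c(3, 1))
```

Printing stacked_plot (or saving it with ggsave()) renders both tracks as a single figure.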

Customization

ggcoverage is based on ggplot2, so users can easily customize the generated figures with ggplot2. In general, customization consists of modifying the elements of existing figures and adding new layers. Additional file 1: Fig. S4 shows examples of these two kinds of customization.
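The two kinds of customization named above can be sketched on a generic ggplot object. This example does not use ggcoverage output: base_plot stands in for a generated figure, and the data, colors and the marked position are assumptions for illustration.

```r
library(ggplot2)

# Synthetic coverage signal standing in for a generated figure (toy data).
coverage_df <- data.frame(position = seq(1, 1000, by = 10))
coverage_df$score <- 20 + 10 * sin(coverage_df$position / 100)

base_plot <- ggplot(coverage_df, aes(x = position, y = score)) +
  geom_area(fill = "grey60")

customized_plot <- base_plot +
  # 1) Modify elements of the existing figure: theme and axis text.
  theme_classic() +
  theme(axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Position (bp)", y = "Coverage") +
  # 2) Add a new layer: a dashed line marking a position of interest.
  geom_vline(xintercept = 500, linetype = "dashed", colour = "red")
```

Because each step is added with ‘+’, the original base_plot object is left unchanged and can be reused with a different set of modifications.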

Results

Here, we show several practical use cases of applying ggcoverage to multi-omics data, including WGS, ChIP-seq and RNA-seq. The code used to generate the panels of Fig. 1 (before typesetting) is available in Additional file 2.

In CNV analysis, common confounding factors include GC content bias and proximity to telomeres and centromeres. Figure 1A shows genome coverage together with all these confounding factors so that the data can be inspected. When applying ggcoverage to WGS data to visualize SNVs (Fig. 1B), we can see a candidate single nucleotide variant (SNV) with a T to A transversion at coordinate hg19 chr4:62,474,264 (highlighted with twill); the variant allele frequency is 100%, and the variant may affect Y (Tyrosine), I (Isoleucine) and * (stop codon). When applying ggcoverage to ChIP-seq data (Fig. 1C), we can see that the ChIP sample has an enriched signal in the promoter region of the ATP9B gene compared to the input control, which is consistent with the called peaks. When applying ggcoverage to RNA-seq data with HNRNPC knockdown (Fig. 1D), we can see a significant reduction in the read coverage of HNRNPC.

Conclusions

We have developed ggcoverage, an R package dedicated to visualizing and annotating genome coverage of multi-groups and multi-omics. It allows users to visualize genome coverage from flexible input file formats and to annotate the coverage with various annotations to meet the needs of different NGS data. In addition to visualization, ggcoverage provides reliable and efficient ways to perform data preprocessing, including parallel read normalization per bin, consensus peak generation from replicates and track data loading by extracting subsets. Owing to the multi-platform support of R, users do not need to transmit data. Finally, ggcoverage makes it convenient to generate high-quality, publication-ready plots, which users can further customize with ggplot2.

Availability and requirements

Project name: ggcoverage
Project home page: https://github.com/showteeth/ggcoverage
Operating system(s): Unix, Linux, and Windows
Programming language: R
Other requirements: None
License: MIT License
Any restrictions to use by non-academics: None

Availability of data and materials

The datasets analyzed during the current study are available on GitHub (https://github.com/showteeth/ggcoverage).

Abbreviations

NGS: Next-generation sequencing
WGS: Whole-genome sequencing
CNV: Copy number variations
RNA-seq: RNA-sequencing
ChIP-seq: Chromatin immunoprecipitation followed by sequencing
GTF: Gene transfer format
SNV: Single nucleotide variant

References

1. Nguyen D-Q, Webber C, Ponting CP. Bias of selection on human copy-number variants. PLoS Genet. 2006;2(2):e20.
2. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17(1):13.
3. Park PJ. ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009;10(10):669–80.
4. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
5. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.
6. Hahne F, Ivanek R. Visualizing genomic data using Gviz and Bioconductor. In: Methods in molecular biology. 2016. pp. 335–51.
7. Yin T, Cook D, Lawrence M. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol. 2012;13(8):R77.
8. Skidmore ZL, Wagner AH, Lesurf R, Campbell KM, Kunisaki J, Griffith OL, et al. GenVisR: genomic visualizations in R. Bioinformatics. 2016;32(19):3012–4.
9. Gel B, Serra E. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics. 2017;33(19):3088–90.
10. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag; 2016.
11. Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44(W1):W160–5.
12. Morgan M, Obenchain V, Lang M, Thompson R, Turaga N. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.24.1. 2020.
13. Jalili V, Matteucci M, Masseroli M, Morelli MJ. Using combined evidence from replicates to evaluate ChIP-seq peaks. Bioinformatics. 2015;31(17):2761–9.
14. Pal K, Tagliaferri I, Livi CM, Ferrari F. HiCBricks: building blocks for efficient handling of large Hi-C datasets. Bioinformatics. 2019;36(6):1917–9.
15. Pedersen TL. patchwork: The composer of plots. R package version 1.0.0. 2020.

Acknowledgements

We appreciate the valuable feedback provided by Jiaxin Gao.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 22050004).

Author information

Authors and Affiliations

School of Life Sciences, Tsinghua University, Beijing, China
Yabing Song & Jianbin Wang

Contributions

YS developed ggcoverage and prepared the manuscript. JW provided supervision and secured funding. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yabing Song or Jianbin Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Additional figures. Fig. S1. Three styles of genome coverage plot. Fig. S2. Highlight SNV with three styles. Fig. S3. Protein coverage plot based on peptides obtained by mass spectrometry and annotation of protein characteristics. Fig. S4. Examples of figure customization.

Additional file 2. The code used to generate the figures without typesetting for Fig. 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Song, Y., Wang, J. ggcoverage: an R package to visualize and annotate genome coverage for various NGS data. BMC Bioinformatics 24, 309 (2023). https://doi.org/10.1186/s12859-023-05438-2

Received: 08 February 2023
Accepted: 03 August 2023
Published: 09 August 2023
DOI: https://doi.org/10.1186/s12859-023-05438-2
