MoMI-G: modular multi-scale integrated genome graph browser

Background: Genome graphs are an emerging approach for representing structural variants on genomes with branches. For example, representing structural variants of cancer genomes as a genome graph is more natural than representing such genomes as differences from the linear reference genome. While more and more structural variants are being identified by long-read sequencing, many of them are difficult to visualize using existing structural variant visualization tools. A visualization method for large genome graphs, such as human cancer genome graphs, is therefore needed.
Results: We developed the MOdular Multi-scale Integrated Genome graph browser, MoMI-G, a web-based genome graph browser that can visualize genome graphs with structural variants and supporting evidence such as read alignments, read depth, and annotations. This browser allows more intuitive recognition of large, nested, and potentially more complex structural variations. MoMI-G has view modules for different scales, which allow users to view the whole genome down to nucleotide-level alignments of long reads. Alignments spanning reference alleles and those spanning alternative alleles are shown in the same view. Users can customize the view if they are not satisfied with the preset views. In addition, MoMI-G has Interval Card Deck, a feature for rapid manual inspection of hundreds of structural variants. Herein, we describe the utility of MoMI-G by using representative examples of large and nested structural variations found in two cell lines, LC-2/ad and CHM1.
Conclusions: Users can inspect complex and large structural variations found by long-read analysis in large genomes such as human genomes more smoothly and more intuitively. In addition, users can easily filter out false positives by manually inspecting hundreds of identified structural variants with supporting long-read alignments and annotations in a short time.
Software availability: MoMI-G is freely available at https://github.com/MoMI-G/MoMI-G under the MIT license.


INTRODUCTION
Structural variants (SVs), which are often characterized as genomic rearrangements of chromosomal segments of 50 bp or larger, are associated with various human diseases [1-3]. For example, some fusion genes caused by SVs are known oncogenes [4]. Identifying SVs and interpreting their potential impacts are critical steps toward cataloguing the variations in the human genome and toward a mechanistic understanding of genetic diseases and cancers.
SV visualization is an important step in SV calling because it enables manual inspection of SVs toward two goals: the first is to better understand the relationships between SVs and other genomic features, and the second is to reduce the number of false positives.
For the first goal, SV visualization tools should be able to simultaneously display multiple intervals along with their relationships, even when the breakpoints are distant or when SVs are nested.
Older SV visualization tools focus on visualizing only canonical SVs (insertion, deletion, inversion, duplication, and translocation) [5,6], because these accounted for a significant portion of the SVs identified at that time. However, as long-read sequencing technologies reveal an increasing number of SVs, SV visualization with the existing tools becomes more challenging. For example, a large inversion is often identified as two separate translocations at the two breakpoints of the inversion; one might not be able to immediately recognize that the two translocation events are explained by a single large inversion. Another example is a nested SV. When a large inversion contains several smaller SVs, such as transposon insertions or deletions, the nested SVs often obscure the relationship between genomic regions that are distant in the reference genome but are actually close in the target genome. To this end, we employed genome graphs as a theoretical backbone for providing a more systematic way of presenting SVs with varying complexities, including nested and large SVs [7].
Genome graphs can represent SVs more naturally than formats that represent SVs as differences from a reference genome (e.g., VCF). However, there is no visualization method for large genome graphs, such as those of human cancer genomes.
For the second goal, manual inspection of SVs identified using SV calling tools is important, because these tools are not yet accurate enough; human experts are required to accurately and reliably distinguish true positive SVs from false positive ones. For example, SVs identified by using reads obtained by different sequencing platforms are often not concordant [8], suggesting that the algorithms of SV callers need further improvement. Therefore, SV candidates need to be manually inspected using read alignments and genomic annotations [9]. Developers of SV callers may wish to track down false positives so that they can improve the algorithms. Biologists may wish to filter out false positives through manual inspection to find genuine causal SVs. However, manual inspection using existing SV visualization tools can become very difficult in certain cases. For example, for nested SVs and long reads spanning multiple breakpoints, existing tools cannot show the read alignments in multiple intervals at a glance, making it unrealistic to manually judge the authenticity of candidate SVs. Another example is that tandem duplications identified by SV callers are often inaccurate. This is presumably because the accuracy of finding tandem duplications has not been optimized for real biological data [8,10,11].
To achieve these two goals, we developed MoMI-G (pronounced mo-me-jee), a genome graph browser that visualizes SVs using variation graphs (Fig. 1, Additional file 1: Supplemental Fig. 1), a variant of genome graphs [12]. Herein, we describe the use cases and features of MoMI-G using the LC-2/ad human lung adenocarcinoma cell line, which carries a CCDC6-RET fusion gene [13-16], and CHM1, a human hydatidiform mole cell line that originates from a single haploid [17]. MoMI-G helps in understanding the entire picture of SVs, even those that are nested or large. MoMI-G allows researchers to obtain novel biological knowledge by comparing a reference genome with an individual genome by using a variation graph.

RESULTS
MoMI-G is a web-based genome browser developed as a single-page application implemented in TypeScript with React. Because users need different types of views, even for the same data, MoMI-G provides three groups of view modules for the analysis of SVs at different scales, namely chromosome-scale, gene-scale, and nucleotide-scale view groups (Additional file 1: Supplemental Table 1). Users can use one or more view modules in a single window.
The input of MoMI-G is a variation graph, read alignments (optional), and annotations (optional). As the variation graph, MoMI-G accepts an XG file, which is a succinct representation of a vg variation graph. A script that converts a FASTA file of a reference genome and a variant call format (VCF) file into an XG file is included in the MoMI-G package, although the VCF format cannot represent some types of SVs that the XG format can represent, such as nested insertions. Read alignment data on the graph need to be represented as a graph alignment map (GAM) file; alternatively, users can convert a binary alignment map (BAM) file into a GAM file, although this is not recommended due to some limitations.
We show three examples from two samples to demonstrate the utility of MoMI-G. One of the examples is a large inversion, and another is nested SVs that are difficult to visualize using existing tools. We also show nested SVs from the CHM1 dataset [17]. For all the examples, we used the Amazon EC2 instance type t2.large with 8 GB of memory and a 2.4 GHz Intel Xeon processor as the MoMI-G server; the server requirement for this tool is minimal. MoMI-G supports common browsers, including Chrome, Safari, and Firefox.

Data model used in MoMI-G and MoMI-G tools
To our knowledge, no publicly available SV visualization tool handles large and nested SVs together with alignments of long reads. Thus, we aimed to develop a genome graph browser as quickly as possible so that users can obtain new biological knowledge from real data. We used an existing library, SequenceTubeMap (https://github.com/vgteam/sequenceTubeMap), for visualizing a variation subgraph, rather than developing our own library from scratch.
SequenceTubeMap is a JavaScript library that visualizes multiple related sequences such as haplotypes. A variation graph used in SequenceTubeMap is a set of nodes and paths, where a node represents part of a DNA sequence, and a path represents (part of) a haplotype. Edges are implicitly represented by adjacent nodes in paths.
MoMI-G accepts variation graphs in which SVs are represented by paths so that SequenceTubeMap can visualize them. A deletion is represented by a path that skips over a sequence that other paths pass through. Similarly, an insertion is represented by a path that passes through an extra sequence that other paths do not visit; an inversion is represented by a path in which part of the sequence in other paths is reversed; and a duplication is represented by a path that passes through the same sequence twice or more.
The MoMI-G package includes a set of scripts (MoMI-G tools) that convert a VCF file into the variation graph format. We used MoMI-G tools for generating the input variation graphs; alternatively, users can generate variation graphs on their own. See the Methods section and Additional file 1: Supplemental Figure 2 for details.
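The path-based encoding of SVs described above can be illustrated with a minimal sketch. This is not the vg or SequenceTubeMap data format; the node table, the `(node_id, is_reverse)` step encoding, and the helper names `path_sequence`, `implied_edges`, and `classify` are all assumptions made for illustration, and the sequences are toy data.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical toy graph: node id -> DNA sequence (not real data).
NODES = {1: "ACGT", 2: "GGA", 3: "TTC", 4: "CAT"}

# A path is a list of steps; each step is (node_id, is_reverse).
Step = Tuple[int, bool]
REFERENCE: List[Step] = [(1, False), (2, False), (3, False)]
PATHS = {
    "deletion": [(1, False), (3, False)],                        # skips node 2
    "insertion": [(1, False), (2, False), (4, False), (3, False)],  # extra node 4
    "inversion": [(1, False), (2, True), (3, False)],            # node 2 reversed
    "duplication": [(1, False), (2, False), (2, False), (3, False)],  # node 2 twice
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def path_sequence(path: List[Step]) -> str:
    """Spell out the haplotype sequence that a path represents."""
    return "".join(
        reverse_complement(NODES[node]) if rev else NODES[node] for node, rev in path
    )


def implied_edges(path: List[Step]) -> Set[Tuple[Step, Step]]:
    """Edges are implicit: every pair of adjacent steps in a path is an edge."""
    return set(zip(path, path[1:]))


def classify(path: List[Step], reference: List[Step] = REFERENCE) -> str:
    """Naive single-SV classification of a path relative to the reference path."""
    ref_ids = [node for node, _ in reference]
    ids = [node for node, _ in path]
    if any(rev for _, rev in path):
        return "inversion"
    if any(count > 1 for count in Counter(ids).values()):
        return "duplication"
    if any(node not in ref_ids for node in ids):
        return "insertion"
    if any(node not in ids for node in ref_ids):
        return "deletion"
    return "reference"


for name, path in PATHS.items():
    print(name, classify(path), path_sequence(path))
```

The classifier assumes at most one SV per path, so it would mislabel nested SVs; it is only meant to make the four path shapes concrete.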

Revealing a large SV: A large inversion and a subsequent short deletion
Using MoMI-G, we show an example of a complex SV that involves two SVs identified by Sniffles, each of which connects two different points on a chromosome. This complex SV can be considered a large inversion and two flanking deletions. Previous studies involving the use of whole genome sequencing or RNA-seq with the Illumina HiSeq or Nanopore MinION identified the CCDC6-RET fusion gene in LC-2/ad [14-16,18]. However, those studies focused only on the region around the CCDC6-RET fusion point, and the entire picture, including the other end of the inversion, was unclear.
To address this issue, we explored the wider region around CCDC6-RET with MoMI-G.
First, we sequenced the genome of LC-2/ad with Oxford Nanopore MinION R9.5 pore chemistry and merged the reads with those from a previous study (accession No. DRX143541-DRX143544) [18]. We generated 3.5 M reads at 12.8× coverage in total and then aligned them to GRCh38. The average length of the aligned reads was 16 kb (Additional file 1: Supplemental Table 2). We detected 11,316 SVs in the VCF format, including the previously known CCDC6-RET fusion gene, on the nuclear DNA of the LC-2/ad cell line (Additional file 1: Supplemental Table 3). See the Methods section for details.
The distance between RET (chr10: 43,075,069-43,132,349) and CCDC6 (chr10: 59,786,748-59,908,656) is about 17 Mbp in GRCh38. We confirmed that a CCDC6-RET fusion gene exists in LC-2/ad (Fig. 2A). This fusion gene is presumably caused by an inversion, although only one end of the inversion had been found. We found a previously unknown adjacency that explains the other end of the inversion well (Fig. 2B, Additional file 1: Supplemental Table 4). MoMI-G was able to display the relationships between the two breakends of the inversion, enabling users to understand large SVs.
We explored the read alignments around the fusion and found that the fusion was heterozygous (Fig. 2C). MoMI-G is the first stand-alone genome graph browser that can display long-read alignments over branching sequences that represent a heterozygous SV.
Further, we found that the large inversion was flanked by two small deletions. These deletions are explained by a single deletion event following the large inversion event (Fig. 2D). The loss of the RET-CCDC6 fusion gene corresponds to the two small deletions on GRCh38. The simplest explanation, favoring the smaller number of mutation events, is that a deletion occurred after the inversion event, and not vice versa.
Next, we attempted to estimate the breakpoints of the large inversion before the deletion occurred. There were two possible scenarios for the positions of the two breakends of the large inversion. The first is that RET-CCDC6 and CCDC6-RET were generated by a large inversion and then RET-CCDC6 was lost. The second is that CCDC6 was first broken by a large inversion, and a subsequent small deletion led to CCDC6-RET. Previous studies support the former scenario. First, the RET gene is often disrupted in thyroid cancer by paracentric inversion of the long arm of chromosome 10, or by chromosomal fusion [19]. Second, in a previous study, two clinical samples had both RET-CCDC6 and CCDC6-RET in the genome [20]. Both studies suggested that an inversion disrupted both CCDC6 and RET, and then a small deletion disrupted RET-CCDC6. We could not have recognized these two deletions flanking the large inversion without simultaneously observing both inversion records in the VCF.
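As a quick check of the "about 17 Mbp" figure, the gap between the two genes can be computed directly from the GRCh38 coordinates quoted above. The helper name `gap_bp` is hypothetical, and the sketch simply treats the quoted coordinates as interval endpoints.

```python
# Gene intervals on chr10 (GRCh38), as quoted in the text.
RET = (43_075_069, 43_132_349)
CCDC6 = (59_786_748, 59_908_656)


def gap_bp(a, b):
    """Distance between the facing ends of two non-overlapping intervals."""
    (_, first_end), (second_start, _) = sorted([a, b])
    return second_start - first_end


gap = gap_bp(RET, CCDC6)                       # end of RET to start of CCDC6
outer = max(RET + CCDC6) - min(RET + CCDC6)    # outermost span of both genes

print(f"gap: {gap:,} bp ({gap / 1e6:.1f} Mbp)")
print(f"outer span: {outer:,} bp ({outer / 1e6:.1f} Mbp)")
```

Both measures (roughly 16.7 Mbp between the genes and 16.8 Mbp across them) round to the 17 Mbp stated in the text.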

Nested SVs with alignment coverage
Visualizing nested SVs is necessary for evaluating the output of SV callers. However, most existing genome browsers cannot visualize nested SVs as well as the relationships between them. Genome browsers, including IGV [21], collapse SVs into intervals between breakpoints, and thus the topological relationships between nested SVs are not shown. MoMI-G can visualize nested SVs as a variation graph (Fig. 3).

Visualizing nested SVs in a pseudodiploid genome
We show an example of nested SVs in a pseudodiploid genome visualized using MoMI-G. We downloaded a CHM1 genome with an SV list previously generated in a whole-genome resequencing study of a human hydatidiform mole with PacBio sequencers [17]. The SV list includes insertions, deletions, and inversions for GRCh37/hg19. We converted the BED file of the SV list of CHM1 to a VCF file, and then filtered out deletions of less than 1,000 bp to focus on medium to large SVs. We found nested SVs for which existing genome browsers do not intuitively show the relationships between them (Fig. 4). This example indicates that four insertions and deletions occur in the large inversion.

Figure 2. ... do not support the inversion, suggesting that CCDC6-RET is heterozygous. (C) The entire picture of the inversion that caused CCDC6-RET. This inversion is too large to span by a single read; thus, it was identified as two independent fusion events at both the ends of the inversion, which would be difficult to understand if the two fusion events are visualized separately. The red line is a translocation that was not analyzed in this study. (D) Putative evolution process of LC-2/ad at the CCDC6-RET site. First, a long inversion generated two fusion genes, CCDC6-RET and RET-CCDC6. Second, a large deletion caused the loss of RET-CCDC6.
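The size filter applied to the CHM1 SV list (dropping deletions shorter than 1,000 bp while keeping other SV types) can be sketched as follows. This is not the script used in the study; the record layout, the `filter_svs` name, and the example coordinates are hypothetical and serve only to show the rule.

```python
from typing import Dict, List

Record = Dict[str, object]


def filter_svs(records: List[Record], min_deletion_bp: int = 1000) -> List[Record]:
    """Drop deletions shorter than min_deletion_bp; keep all other records."""
    kept = []
    for rec in records:
        length = int(rec["end"]) - int(rec["start"])
        if rec["svtype"] == "DEL" and length < min_deletion_bp:
            continue  # small deletion: filtered out
        kept.append(rec)
    return kept


# Hypothetical BED-like records (0-based, half-open); coordinates are made up.
example = [
    {"chrom": "chr5", "start": 1_000, "end": 1_300, "svtype": "DEL"},   # 300 bp
    {"chrom": "chr5", "start": 5_000, "end": 6_000, "svtype": "DEL"},   # 1,000 bp
    {"chrom": "chr5", "start": 7_000, "end": 7_001, "svtype": "INS"},
    {"chrom": "chr5", "start": 500, "end": 90_000, "svtype": "INV"},
]

for rec in filter_svs(example):
    print(rec["chrom"], rec["start"], rec["end"], rec["svtype"])
```

Because the text says deletions "of less than 1,000 bp" were removed, a deletion of exactly 1,000 bp is kept in this sketch.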

Figure 3. Nested SVs called by Sniffles in LC-2/ad
The thin black lines are repeat annotations. The brown and purple lines are gene annotations. The red and orange lines each mark an end of an inversion called by Sniffles. There are two possibilities for the genome structure. One is that MUC3A and its flanking region are a duplication, and the internal region of MUC12 is an inverted duplication. The other is that MUC3A and its flanking region are an inverted duplication, and the internal region of MUC12 is a duplication. Several read alignments support the former interpretation. Although SVs called from the Illumina reads did not include any of the SVs shown here, the alignment coverage by the Illumina reads is consistent with both duplications. Note that the y-axis of the thin blue line on the chromosome, which shows the alignment coverage, is logarithmic.

Figure 4. Nested SVs in CHM1
The black line represents a part of chromosome 5, where a large inversion is shown as the brown line. The other lines are smaller SVs included in the large inversion. Because CHM1 is a pseudodiploid genome, all the SVs shown in this figure must be on the same haplotype, although MoMI-G tools assume diploid (polyploid) genomes and show the inner SVs as heterozygous SVs.

User-interface Design
The optimal way of visualizing SVs might vary. To rapidly explore the distribution of SVs in a genome, users might wish to use Circos-like plots. Other users might intend to focus on local graph structures of SVs that contain a few genes. In another scenario, a user might want to explore individual nucleotides. To address this issue, MoMI-G provides a customizable view in which users can place any combination of view modules. Further, preset view layouts are available for users' convenience.

Enabling easy manual inspection of detected SVs
Manual inspection, which includes determining whether an SV is heterozygous or homozygous, confirming what part of a gene is affected by that SV, and determining why an SV was called based on read alignments, is an important part of validating called SVs. As the number of variants called by SV callers increases, the burden of manual inspection also increases, underscoring the importance of visualization both to inspect individual SV calls for filtering out false positives and to ensure that a filtered set of SVs is of high confidence [22,23]. MoMI-G helps with the efficient inspection of SVs by using (1) Feature Table, which is an SV list, (2) Interval Card Deck, which is a stack of genomic intervals, and (3) shortcut keys.
The usage is as follows: (1) one filters and selects SVs using Feature Table, and then (2) the selected variants are stacked on Interval Card Deck at the bottom of the window. In Interval Card Deck, intervals are displayed as cards, and the interval at the top (leftmost) card of the deck is shown on SequenceTubeMap. Each card can be dragged, and the order of cards can be changed. If one double-clicks on a card, the card moves to the top of the deck. A tag can be added to a card for later reference. Further, a card can be locked to avoid unintended modification or disposal, and the gene name can be input with autocompletion for specifying the interval of a card.
When the interval to view is changed, only the part of the view that needs an update is rerendered, whereas most web-based genome browsers rerender the entire view.
Interval Card Deck enables the rapid assessment of hundreds of intervals. Moreover, deciding whether an SV should be discarded or held becomes easier with shortcut keys. After all SVs are inspected, a set of SVs held on the Interval Card Deck is obtained, which might be a set of interesting SVs or a set of manually validated SVs. MoMI-G thus provides a tool for validating hundreds of SVs or for selecting interesting ones.
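The behavior of Interval Card Deck described above can be modeled with a small data structure. This is a conceptual sketch only and is not MoMI-G's TypeScript implementation; the class and method names are hypothetical, and the choice to rotate a held card to the back of the deck is an assumption about how "holding" advances to the next card.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Card:
    interval: str          # e.g. "chr10:43075069-43132349"
    tag: str = ""          # free-text note for later reference
    locked: bool = False   # locked cards cannot be discarded
    held: bool = False     # set once the reviewer decides to keep the card


class IntervalCardDeck:
    def __init__(self, intervals: List[str]):
        self.cards = [Card(interval) for interval in intervals]

    def top(self) -> Optional[Card]:
        """The top (leftmost) card is the interval currently displayed."""
        return self.cards[0] if self.cards else None

    def move_to_top(self, index: int) -> None:
        """Equivalent of double-clicking a card."""
        self.cards.insert(0, self.cards.pop(index))

    def discard_top(self) -> bool:
        """Remove the top card unless it is locked; return True if removed."""
        card = self.top()
        if card is None or card.locked:
            return False
        self.cards.pop(0)
        return True

    def hold_top(self, tag: str = "") -> bool:
        """Keep the top card and rotate it to the back of the deck."""
        if not self.cards:
            return False
        card = self.cards.pop(0)
        card.held = True
        card.tag = tag or card.tag
        self.cards.append(card)
        return True


# The first two intervals are the RET and CCDC6 loci from the text;
# the third is a made-up placeholder.
deck = IntervalCardDeck(
    ["chr10:43075069-43132349", "chr10:59786748-59908656", "chr7:1-1000"]
)
deck.move_to_top(2)
deck.top().locked = True
blocked = deck.discard_top()      # locked card is protected
deck.hold_top(tag="check later")
deck.discard_top()                # drops the RET interval
deck.hold_top()                   # keeps the CCDC6 interval
print([card.interval for card in deck.cards])
```

After every card has been either held or discarded, the deck contains only the held cards, which corresponds to the set of manually validated or interesting SVs mentioned in the text.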

Input requirements
MoMI-G takes an XG file as the variation graph. Users can specify an indexed GAM file that contains read alignments. They can convert a BAM file into GAM using MoMI-G tools or can generate the GAM file on their own. When a BED file of genes is provided, users can specify a genomic interval by gene name. The configuration file is written in YAML. MoMI-G also accepts the bigBed and bigWig formats [24] for visualizing annotations (e.g., repeats, genes, alignment depth, and GC content) on the reference genome. The bigBed and bigWig formats will need to be extended for genome graphs in the future. The list of formats accepted by MoMI-G is shown in Table 1.

File type | Extension | Description
A succinct index of variation graphs | .xg | Variation graphs displayed in MoMI-G.

DISCUSSION
We developed a genome graph browser, MoMI-G, that visualizes SVs on a variation graph. Existing visualization tools for SVs show either one SV at a time or all SVs together; the former does not allow the understanding of the relationships between SVs, whereas the latter is useless when the target genome is very large and the whole variation graph is too complicated to view on a single screen (i.e., the hairball problem). MoMI-G allows viewing only part of the genome, which resolves the hairball problem, while providing an intuitive view of multiple SVs, including large and nested SVs.
Further, MoMI-G enables the manual inspection of complex SVs by providing integrated multiple view modules; users can filter SVs, validate them with read alignments, and interpret them with genomic annotations.
We used vg as a server-side library and SequenceTubeMap as a client-side library for subgraph retrieval and visualization of genome graphs, because, to our knowledge, this is the only such combination that is publicly available. We found that a significant amount of engineering effort is still required for an even better user experience. For example, vg is a standalone command line application that exits immediately after the given query is processed; therefore, it has no function to keep the succinct index in memory for later use. Every time part of the genome is retrieved, the entire index of several gigabytes is loaded, which is unnecessary overhead. SequenceTubeMap displays inversions and duplications as loops; however, we found that new users occasionally find it difficult to recognize the connections between nodes. Visualizing SVs is still an open problem.
The currently available tools and formats for SV analysis have many problems. First, different SV callers output different VCF records even for the same SV. For example, depending on the SV caller, an inversion with both boundaries identified at base-pair resolution is represented by one of the following: (1) a single inversion record, (2) two inversion records at both ends, (3) two breakend records at both ends, or (4) four breakend records at both ends (a variant of (3), but the records are duplicated for both strands). Thus, developing a universal tool for variation graph construction is difficult. Second, certain types of nested SVs, such as an insertion within an insertion, are impossible to represent in a VCF file without tricks, although variation graphs can easily handle these SVs.
Therefore, generating a variation graph from a VCF file including SVs is not ideal.We need an SV caller that directly outputs variation graphs.
Fostering the ecosystem around variation graphs is important for delivering their benefits to end users, as seen in the ecosystem around the SAM/BAM formats, which spurred the development of production-ready tools for end users. MoMI-G is the first step toward such a goal, because the availability of tools ranging from upstream analysis, such as read alignment, to visualization is critical for the entire ecosystem. This is the one-million-genome era, which requires rapid and memory-efficient data structures that allow alignment and store haplotype information. Because most parts of the genome are shared between individuals, we need to focus on differences to reduce computation time and resources.
Therefore, genome graphs are considered promising, especially for human variation analysis. Moreover, new visualization methods as well as genome analysis methods are required. Genome graph browsers should be able to handle even thousands of genomes in the near future. MoMI-G is a step forward for visualizing genome graphs and could allow the development of new algorithms on genome graphs.

Figure 1. Overview of MoMI-G. A user typically selects one of the preset combinations of view

Briefly, MoMI-G tools convert a VCF record into a path in the output variation graph. A deletion is converted into a path that starts, at most, 1 Mbp before one breakend of the deletion, traverses to the breakend, jumps to the other breakend, and proceeds for a certain length (<1 Mbp). Note that the sequences flanking the deletion are added to indicate the edge representing the deletion because edges are implicitly represented in SequenceTubeMap. Insertions, inversions, and duplications are similarly represented by paths with flanking sequences.
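The conversion of a deletion record into a path with flanking sequences can be sketched as interval arithmetic on the reference. This is not the MoMI-G tools code; the function name `deletion_path`, the 0-based half-open coordinate convention, and the symmetric flank length are assumptions made for illustration.

```python
from typing import List, Tuple

MAX_FLANK = 1_000_000  # the text caps flanking sequence at 1 Mbp

Interval = Tuple[int, int]  # 0-based, half-open [start, end) on the reference


def deletion_path(
    start: int, end: int, chrom_length: int, flank: int = MAX_FLANK
) -> List[Interval]:
    """Reference intervals visited by a path that represents a deletion.

    The path walks the left flank up to the first breakend, jumps over
    the deleted interval [start, end), and continues along the right flank.
    """
    if not 0 <= start < end <= chrom_length:
        raise ValueError("deletion must lie within the chromosome")
    flank = min(flank, MAX_FLANK)
    left = (max(0, start - flank), start)
    right = (end, min(chrom_length, end + flank))
    return [left, right]


# Toy example: a 300 bp deletion on a 2 kbp sequence with 1 kbp flanks.
print(deletion_path(500, 800, 2_000, flank=1_000))
```

The jump between the two returned intervals is the implicit edge that encodes the deletion, which is why the flanks are needed for SequenceTubeMap to draw it.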