Coordinates and intervals in graph-based reference genomes

Background It has been proposed that future reference genomes should be graph structures in order to better represent the sequence diversity present in a species. However, there is currently no standard method to represent genomic intervals, such as the positions of genes or transcription factor binding sites, on graph-based reference genomes. Results We formalize offset-based coordinate systems on graph-based reference genomes and introduce methods for representing intervals on these reference structures. We show the advantage of our methods by representing genes on a graph-based representation of the newest assembly of the human genome (GRCh38) and its alternative loci for regions that are highly variable. Conclusion More complex reference genomes, containing alternative loci, require methods to represent genomic data on these structures. Our proposed notation for genomic intervals makes it possible to fully utilize the alternative loci of the GRCh38 assembly and potential future graph-based reference genomes. We have made a Python package for representing such intervals on offset-based coordinate systems, available at https://github.com/uio-cels/offsetbasedgraph. An interactive web-tool using this Python package to visualize genes on a graph created from GRCh38 is available at https://github.com/uio-cels/genomicgraphcoords. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1678-9) contains supplementary material, which is available to authorized users.


Sequence graphs
We look at graphs where each vertex represents a base pair. A sequence S of base pairs, represented by letters in the finite alphabet α = {A, G, C, T }, can be represented as a graph by letting the base pairs be vertices and have edges between consecutive base pairs. A set of sequences can also be represented as a graph in the same way, but such graphs might not be connected. By including information about how the sequences are connected, they can be connected in the graph by adding edges.
This can be done with a reference genome such as GRCh38, in which there is a sequence for each chromosome and each alternative loci, and there is information about where the alternative loci are connected with the main path sequences.

Region Path Partitioning
We let a region path partitioning P = {P 1 , P 2 , . . . , P N } of a sequence graph G = (V, E) be a partition such that: for each region path p ∈ P is an acyclic connected subgraph where each max v∈p outdegree(v) ≤ 1 and max v∈p indegree(v) ≤ 1.
We look at two region path partitionings, hierarchical and sequential. The hierarchical partitioning is obtained by, starting with the complete graph G 0 and level L = 0, recursively: • Choosing one path between two leaf nodes in the graph, and save this as a region path • Repeat for each subgraph SG i on layer L + 1 In order to enable unambiguous interval representation, care must be taken in step two of this process so that no two region paths have two edges in the same direction between them and that no region paths have an edge to itself. This can be avoided by either splitting region paths where necessary, or by adding empty region paths that split edges that leads to ambiguity. The latter alternative has the advantage of maintaining backwards compatibility when ambiguity is introduced in an update to the graph.
The sequential partitioning is obtained by dividing the graph near every vertex v where either indegree(v) or outdegree(v) is higher than 1. Vertices with indegree(v) > 1 are included in the following region path, while those with outdegree(v) > 1 are included in the preceding region path. Thus a vertex that has both indegree(v) > 1 and outdegree(v) > 1 is its region path's only vertex.
The sequential partitioning is unique for a graph, while there can be several possible hierarchical partitionings for a graph.

Offset-based Coordinate Systems
Let regpath(v) of a vertex v be the unique region path that v is a part of, and v 0 (R) of a region path R be its first vertex. Using a region path partitioning P , one can then define a coordinate system on the graph using the functions regpath(v) and and offset(v) = dist(v, v 0 (regpath(v))). The coordinates for a vertex is then (regpath(v), offset(v)), uniquely identifying each vertex in the graph.

Single-path Intervals
We define a single-path interval on a graph as a path p = (v 0 , e 1 , v 1 , e 2 , . . . e n , v n ) between two vertices v 0 , v n . In order to uniquely represent a path p unambiguously, information about which edge is followed is required wherever outdegree One can also use region path identifiers as information about the path.
, and if using an offset based coordinate system, one can use these coordinates for the start and end vertex.

Multipath Intervals
We define a multipath interval as a set of paths with common start and end nodes. Thus, a multipath interval between the vertices v s and v e is a subset of all the paths between v s and v e denoted S path (v s , v e ). We limit the focus to multipath intervals that can be determined uniquely by a subgraph G S = (V , E ) by the function: Explisit Representation Given a sequential partitioning P = P 1 , P 2 , ..., P N , such multipath intervals can be defined by a set of region paths given by R = {P i ∈ P : ∃v ∈ V s.t. v ∈ P i }. Multipath intervals can then be represented by the set of region paths with the function: Thus, a multipath interval can be represented by giving the start and end vertex and explicitly specifying the 'allowed' region paths.
Critical Subpaths Representation A less stringent representation of multipath intervals can be achieved by only specifying the critical subpaths that are required in order to uniquely represent the interval. Subpaths can for instance be represented by region paths or intervals. For instance, given a set of critical region paths R ⊂ P , a multipath interval can be determined by the function: Fuzzy Representation In order to define fuzzy representation, some concepts are needed. Let the divergents of path p = (u 0 , f 1 , . . . f m , v m ) from q = (v 0 , e 1 , . . . , e n , v n ) denoted divergents(p, q) be the set of subpaths (u i , f i , . . . , f j , u j ), 0 ≤ i < j ≤ m of p where only the start and end vertex is in q. Let the divergentdistance between p and q, denoted D(p, q) be the maximal length of a subpath in divergents(p, q) and let the symmetric distance L var (p, q) be defined as: L var (p, q) = max(D(p, q), (D(q, p)) We can then define the fuzzy multipath interval, given a single path p = (v 0 , e 1 , . . . , e n , v n ) and a variation threshold t, as:

Experiments
Data referred to can be found in the Github repository, where also instructions on how to run the experiments can be found: https://github.com/uiocels/genomicgraphcoords/

Relationship between transcripts on alternative loci and the main path of GRCh38
The alternative loci were connected to the main path using the locations given in grch38 alt loci.txt. We define flanks as the continuous regions at the start and end of alternative loci that have identical sequence as the main path. These flanks were merged with the main path. Transcript variants from the RefSeq gene table were then translated to this graph, and the transcript variants on alternative loci (alt-loci transcripts) were categorized in two ways, summary and full categorization: Summary categorization: • Category A: The alt-loci transcript is in a flank, and there exists a mainpath transcript with the same transcript ID in the same region and with almost equal transcript length 1 .
• Category B: The alt-loci transcript is not on a flank. There exists a main-path transcript on the parallel part of the main path with the same transcript ID, and almost the same transcript length.
• Category C: The alt-loci transcript is partly on a flank. There exists a main-path transcript with the same transcript ID on the main path with almost equal transcript length and parallel to the alt-loci transcript on the parts that are not on the flank • Category D: The alt-loci transcript is in a flank, but there does not exist a main-path transcript in the same region.
• Category E: Any of the first 3 categories above, but where the main-path transcript extends to parts that are not parallel to the alt-loci. The alt-loci transcript is then assumed to be cut at the end of the alternative loci.
Full categorisation: We first classified the alt-loci transcript based on whether there was a parallel main-path transcript and whether that transcript had the same transcription length. Table 1 shows the distribution of alt-loci transcripts over the classes for each type of positioning of alt-loci transcript: Flanking region (flank), varying region (var) or both (flank+var).
Of particular interest is the alt-loci transcripts on the flanking regions without a parallel main-path transcript (8) and the ones with only a partial parallel match on the main path (147). These show that the alt-loci transcripts do not necessarily match the main-path transcripts, even in regions with sequence identity.

Representing transcripts using multipath intervals
The following is a description of how transcripts were analysed using multipath intervals. A graph was created by connecting alternative loci to the main chromosomes, and merging each alternative loci with the main chrosomes according to the NCBI alternative loci to main path alignments (ftp://ftp.ncbi.nlm.nih. gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405. 15_GRCh38_assembly_structure/).

Critical subpaths multipath intervals
All RefSeq transcripts were represented as critical subpaths multipath intervals on this graph, using the exons of the transcript as critical subpaths (in this case, subpaths were represented as intervals). A match was defined as two transcripts having identical start and end position and identical critical subpaths (i.e. identical exons). This means that two transcripts can follow different paths through the graph, but will still be identical as long as they have the same start and end position and the same critical subpaths.

Fuzzy multipath intervals
All RefSeq transcripts were represented as fuzzy multipath intervals with threshold 10 (each exon represented as a fuzzy multipath interval and the transcription region represented as a fuzzy multipath interval). Two transcripts match if they can be represented with the same fuzzy multipath intervals.