Our data structure for colored de Bruijn graphs is the most important research object in this paper, so we start by describing its construction. We then give a detailed explanation of how we traverse, decompose and reorganize the graph because graph spatial structure analysis is the basis of further research. Using analytical results, on the one hand, we propose a variant detection method for small indels; on the other hand, we design a tri-tuple coordinate system with details of our implementation.
Graph construction
Let \(\{{s}_{1},{s}_{2},\dots ,{s}_{n}\}\) be a set of haplotype genomic sequences. Because degeneracy is intractable for VARI, we replaced degenerate bases with the most frequent base of other sequences and recorded the modification. In addition, to ensure that each sequence does not have repeat k-mers, we find the longest repeat segment length \({k}_{i}\) of each sequence \({s}_{i}\) and take \(K=max\{{k}_{1},{k}_{2},\dots ,{k}_{i}\}\). When the parameter k value is greater than \(K\), the colored de Bruijn graph is almost guaranteed to be acyclic. Additionally, we randomly generate two segments consisting of ACGT with a length of k and add them to the front and end of each sequence to guarantee that the graph has only one start node and end node. The segment’s function is only to anchor the sequences, which does not affect the spatial framework, so it cannot appear in any original sequences. The segments are not fixed and change with the samples.
Finally, the obtained colored de Bruijn graph has the following characteristics: directed, acyclic, nondegeneracy, and unique start and end nodes. Moreover, each sequence corresponds to a unique path from the start node to the end node in the graph (Additional file 1: property 1; all the proofs of the properties, theorems, and corollaries can be found in Additional file 1).
Decomposition and reorganization
This section includes four parts. The first part is the theoretical preparation where we define two important concepts: cSupB and offset value. The second part decomposes the graph and obtains all the cSupBs. The third part determines the relations between cSupBs, and the last part addresses cycles.
Colored superbubble (cSupB)
Before defining the cSupB, we need to know the definition of superbubble.
Definition 1
An assembly graph \(G=(V, E)\) is a directed graph. Denote \(V(G)\) and \(E(G)\) as the set of nodes and edges, respectively, of graph \(G\). For any two distinct nodes \(s\) and \(t\) in \(G\), \(<s, t>\) is called a superbubble if it satisfies the following four criteria [23, 39]:
-
(1)
reachability: there is a path from \(s\) to \(t\);
-
(2)
matching: the set of nodes reachable from \(s\) without passing through \(t\) is equal to the set of nodes from which \(t\) is reachable without passing through \(s\);
-
(3)
acyclicity: the subgraph induced by \(U\) is acyclic, where \(U\) is the set of nodes satisfying the matching criterion;
-
(4)
minimality: no node in \(U\) other than \(t\) forms a pair with \(s\) that satisfies the conditions above;
Here, \(s\) and \(t\) are called the source node and sink node, respectively.
Based on Definition 1, there are two superbubbles in Fig. 1b:\(<TAT,ACC >\); \(<TCA,GTA>\). Here, we only discuss the superbubbles that contain at least two supernodes.
Many superbubble search methods have been developed [23, 39,40,41]. These methods can be applied to different graphs, and all are based on the topological structure. However, a superbubble contains structural information but lacks sample information; therefore, based on the colored de Bruijn graph, we propose the cSupB, a generalization of the superbubble, and its relevant algorithm.
Definition 2
An assembly graph \(G = \left( {V,E,C} \right)\) is a colored de Bruijn graph. Denote \(V\left( G \right)\), \(E\left( G \right)\) and \(C\left( G \right)\) as the sets of nodes, edges, and colors, respectively, of graph \(G\). For a color set \(C_{1}\) where \(C_{1} \subset C\), \(G_{1} = \left( {V_{1} ,E_{1} ,C_{1} } \right)\) is the subgraph induced by the nodes \(u \in V\) whose color \(color\left( u \right)\) satisfies \(color\left( {u_{i} } \right) \cap C_{1} \ne \emptyset\). For any two distinct nodes \(s\) and \(t\) in \(G\),\(\left\langle {s,t,C_{1} } \right\rangle\) is called a cSupB if it satisfies the four criteria of a superbubble in graph \(G_{1} .\)
We can also define separation, intersection, and inclusion of cSupBs.
For any two different \(cSupBs\) named \(cSupB1\) and \(cSupB2\), \({G}_{1}=\left({V}_{1},{E}_{1},{C}_{1}\right)\) and \({G}_{2}=({V}_{2},{E}_{2},{C}_{2})\) are the subgraphs induced by the nodes in \(cSupB1\) and \(cSupB2\) respectively. Then we set \({V}_{0}={V}_{1}\cap {V}_{2}, {C}_{0}={C}_{1} \cap {C}_{2}\) and we can determine the relations of \(cSupBs\) according to five conditions as follows:
-
(1) Separation: a. \({V}_{0}\) = \(\varnothing\); b. \({V}_{0}\ne \varnothing\) and \({C}_{0}=\varnothing\);
-
(2) Intersection: c. \({V}_{0}\ne \varnothing\) and \({C}_{0}\) \(\ne\) \(\varnothing\) and \({V}_{0}\ne\) \({V}_{1}\) or \({V}_{2}\);
-
(3) Inclusion: d. \({V}_{0}={V}_{1}\) and \({C}_{0}={C}_{1}\); e. \({V}_{0}={V}_{2}\) and \({C}_{0}= {C}_{2}\).
In particular, if two cSupBs are inclusive relations, for example: \({V}_{0}={V}_{2}\) and \({C}_{0}= {C}_{2}\), we name \(cSupB1\) the parent bubble of \(cSupB2 \mathrm{and }cSupB2\) is the child bubble of \(cSupB1\).
Based on Definition 2, there are five cSupBs in Fig. 1b: bub1.\(<TAT,ACC,111 >\); bub2.\(<GGG,GTA,110>\); bub3.\(<CAC,GTA,011>\); bub4.\(<TCA,GGG,110>\); and bub5.\(<TCA,GTA,111>\). Bub5 is the parent of bub2, bub3 and bub4. Here, we can prove that a superbubble is only a special group of cSupBs whose source and sink node colors are the same. Each cSupB can be thoroughly decomposed to simple cSupBs that have only two paths from the source node to the sink node. Simple cSupB is the minimum bubble structure that can be used to study relationships and variants for any two samples, but superbubbles cannot.
Definition 3
In a directed acyclic graph (DAG) \(G=(V,E)\) with unique start node \({u}^{^{\prime}}\), for each node \({v}_{i}\in V\), let \({S}_{i}\) denote all the adjacent incoming nodes of \({v}_{i}\). If \(pos({v}_{i})=max\{pos({u}_{j})\}+1\), where \({u}_{j}\in {S}_{i}\), and \(pos({u}^{^{\prime}})\) is known, then \(pos({v}_{i})\) is called the offset value of \({v}_{i}\).
Indeed, in a DAG \(G=(V,E)\), for any two distinct nodes \(u,v\in V\) and \(pos\left(u\right)=a, pos(v)=b\), if there exists at least one path between \(u\) and \(v\), \(|a-b|\) must be the length of the longest path between \(u\) and \(v\) (Additional file 1: property 3).
Graph traversal and cSupB Matching
Before traversing and decomposing the graph, we need to label four visiting states for eachnode as unvisited, half-visited, to-be-visited and fully visited. See Additional file 1 for the details of the description.
We define a postorder-like traversal strategy: in a DAG, a node can be visited if all its parent nodes have been visited. In other words, for each node in the graph, we could visit it and its outgoing nodes if and only if all its incoming nodes have been visited.
Matching principle: At least two adjacent outgoing edges’ colors of the source node intersect with the color of the sink node. Similarly, at least two adjacent incoming edges’ colors of the sink node intersect with the color of the source node. Then, the source node and sink node are matched, and the cSupB color is the color intersection of the source node and the sink node.
If we encounter a node \(s\) whose outdegree is greater than one, we place it in a to-be-visited source node queue \(Q\); if we encounter a node \(t\) whose outdegree is greater than one, we start to find its matched source node \(s\) from \(Q\) reversely based on the matching principle. Specifically, if \(color(s)\subseteq color(t)\), we remove \(s\) from \(Q\) and continue; if \(color(s)\subseteq color(t)\), we stop.
We can prove that for each node \(t\) whose indegree is greater than one, at least one node \(s\) whose outdegree is greater than one can be used to construct a cSupB with \(t\) (Theorem 2, Corollary 5). Moreover, \(s\) is visited before \(t\), and \(s\) is the parent node of \(t\) (Corollaries 3 and 4).
cSupB subordination
After postorder-like traversal, we can obtain all the cSupBs and their information, including the source/sink node, cSupB color, and ordering. Ordering denotes the order of obtaining the cSupB, and we use the function \(order\) denoting it in the following study. Here, we reorganized the cSupBs by determining the relationship of cSupBs.
To organize the cSupBs, we adopt a hierarchical tree structure by exploiting the relations of cSupBs. Furthermore, after confirming the root cSupB (containing all samples), we assigned each cSupB a hierarchical level. Here, the root cSupB’s level is 1.
Method to determine the inclusion of cSupB
For \(cSupB1\), \(cSupB2\) and \(cSupB3\), \({G}_{1}=\left({V}_{1},{E}_{1},{C}_{1}\right)\), \({G}_{2}=({V}_{2},{E}_{2},{C}_{2})\) and \({G}_{3}=({V}_{3},{E}_{3},{C}_{3})\) are the subgraphs induced by the nodes in \(cSupB1,\) \(cSupB2\) and \(cSupB3\) respectively. If the following three conditions are satisfied simultaneously:
-
a.
\(\mathrm{a}.order({G}_{1}) < order({G}_{2})\);
-
b.
\({\mathrm{b}.C}_{1}\subseteq {C}_{2}\) and \({C}_{1}\ne {C}_{0}\)(\({C}_{0}\) is a color set containing all samples);
-
c.
there is no \(cSupB3\), s.t. \(order\left({G}_{1}\right)< order\left({G}_{3}\right)< order({G}_{2})\) and \({C}_{1}\subset {C}_{3}\);
we call \(cSupB1\) a child cSupB of \(cSupB2\) and call \(cSupB2\) the nearest parent cSupB of \(cSupB1\) (Additional file 1: Theorem 3).
An essential conclusion is that a child cSupB cannot be obtained before its parent cSupB (Additional file 1: Corollary 6), and the cSupB structure must be a tree (Additional file 1: Corollary 7). Figure 1\(D,E\) is an example of the final decomposition and representation of the colored de Bruijn graph.
To store various information about the colored de Bruijn graph, we define several file formats, such as cSupB topology (CST), cSupB detailed information (CSDI), node nearby information (NNI) and offset value information (OVI). See Additional file 1 for the details of the file formats.
Cycle
A cycle is the representation of repeat sequences on the graph and is sometimes common but inevitable. Cycles can be divided into two types: type I, formed by a single sample; and type II, formed by different samples. Our k value selection method can only avoid type I. Here, we propose a method to address cycles including whether cycles exist, where the cycles are and how to cut the sequences.
Cycle identification
There exists a depth-first-search strategy that can identify all the cycles simultaneously; however, the iterative recursion and high computational cost limit its application. In this paper, we propose a cycle identification method based on the offset value.
According to Theorem 2, if we traverse graph \(G\) from the start node using the postorder-like traversal strategy, we can visit all the nodes. In particular, if the traversal is finished but nodes still exist that are unvisited, we call the traversal finished in advance and there must be cycles in graph \(G\) (Theorem 2). Information about half-visited nodes is collected to prepare for cycle interval determination.
We divide half-visited nodes into two types: type I: on the cycle; type II: not on the cycle. We can prove that if the traversal is finished in advance, we must obtain at least one type I half-visited node (property 5) (Fig. 2).
Cycle interval location
At least one half-visited node must exist on the cycle when the traversal ends in advance (property 5). Therefore, we need to identify this node, mark it as fully visited, and restart the graph traversal. Due to the existence of the cycle, this node is revisited, and we record the offset value information again to help us determine the interval of the cycle.
Steps of interval determination
-
1. Let \({V}_{0}\) represent the collection of all half-visited state nodes.
-
2. Choose node \(u\) from \({V}_{0}\) whose offset value is minimal as the cycle start node.
-
3. Mark \(u\) as fully visited and restart the graph traversal from \(u\). If traversal is again finished in advance, we encounter a new cycle and return to step 1. If we meet a node \(v\), \(v\) is the adjacent parent node of \(u\). Then, we set \(cycle\_start\_pos= pos(u)\) \(cycle\_end\_pos=pos(v)\), \(cycle\_start\_node= u\), \(cycle\_end\_node= v\).
-
4. Then, [\(cycle\_start\_pos, cycle\_end\_pos\)] is the cycle interval.
-
We choose the node with the minimum offset value for simplicity. If the selected node is on the cycle, the traversal can proceed; otherwise, the traversal will terminate in advance, and the previous steps are repeated.
Sequence segmentation and graph reconstruction
After the interval [\(cycle\_start\_pos, cycle\_end\_pos\)] of the cycle is determined, we select a cutting point in this interval to divide the sequence into two parts and construct the graph.
Here, we divide the cutting points into three categories: type I: the bridge. This type is our ideal cutting point; type II: Not the bridge but involving all samples; type III: only involving a portion of the samples.
Based on the above ideas, we propose a cutting point selection model as follows.
-
1.
Position transformation. We establish a connection between the offset value and bases. Here, we define the offset value of the node according to its last base in the k-mer.
-
2.
Information obtainment. We take each position as the key and the color, bases, thickness, and other information as the values to establish hash tables in the interval [\(cycle\_start\_pos, cycle\_end\_pos\)].
-
3.
Evaluation strategy. We set an index \(R = L/t\), in which \(L\) denotes the number of continuous loci with the same position information and \(t\) denotes the thickness. In particular, if there are multiple choices with the same \(R\)-value, we prefer bridges; if the \(L\) value is 0, only the position with the smallest thickness can be selected as the cutting position. In other cases, we make the decision according to the size of the \(R\)-value.
-
4.
Cutting position determination. We select the optimal area as the reference cutting position. If there are multiple cycles, comprehensively considering the information of all cycles and their intersection should be the priority. The output format is cycle cut position (CCP) format (see Additional file 1 for details).
Sequence cutting and graph reconstruction
Although we know the cutting position of the graph, we must obtain the specific cutting position of each sample. We searched the cutting area sequence in each relevant sample. If there is only one search result, the middle position is the cutting position; if there is more than one search result, the nearest one is selected as the reference cutting position. Then, we cut each sequence, add the head and tail sequence again, divide them into several files, put them in multiple directories, and build graphs.
Variant detection
Variant information of an individual or population can be deduced from the structure of the colored de Bruijn graph. Here, we propose a node-based variant detection algorithm.
In practice, the traversal starts from the end node of the graph and against the direction of the edge. The final offset value can be obtained using simple subtraction from the preoffset value shown in Fig. 1c.
We choose reverse traversal because the appearance of each source node means that a variant occurs. If it is a base substitution, the offset value is not affected; however, if it is an indel, using traversal along the edge direction, the offset break node is the sink node (node \(GGG\) in Fig. 1b) instead of the source node (node \(TCA\)) we want. In this paper, we only discuss small indels (< 50 bp).
In the colored de Bruijn graph, we assume that the variant occurrence corresponds to the source node one-to-one, so we discuss variants near the source node. When traversing the graph, all the nodes' variant information is stored in the offset value; thus, the variant, source node, and offset value are closely linked.
Here, we propose a node-based method:
-
(1)
Save the maximum gap length and color near each source node to obtain gap information;
-
(2)
Analyze the two types of source nodes (with reference and without reference) in turn, determine the variant type of each source node (nested variant detection algorithm (NVD)), and modify the gap at each source node to determine the node variant;
-
(3)
Determine the relations between the nodes' position on the reference genome and graph (reference position determination (RPD)); then, transform the node variant to a locus variant (Fig. 3).
Nested variant detection (NVD) algorithm
Step 1: Detect variant types of source nodes involving the reference
We traversed the cSupB id from small to large and only discuss the cSupB containing the reference genome. A simple situation is shown in Fig. 4a. If gap1 > 0, \(var(a)=INS\); if gap2 > 0, \(var(a)=DEL\); and if gap1 = gap2 = 0, \(var(a)=SNP\). Here, function \(var\) denotes the variant type of the node. For the more general case, for example, in Fig. 4b, if the variant types at nodes \(a\) and \(b\) are the single nucleotide polymorphism (SNP) and insertion (INS), respectively, the actual gap conditions are gap2 > 0 and gap3 > 0 based on the traversal against the direction of the edge. Therefore, since the correct results cannot be obtained by considering the gap of a node such as \(a\) independently, we need to determine the node variant type more comprehensively. For detailed algorithm flows, see the NVD-1 flow chart in Additional file 1.
Step2: Detect variant types of other source nodes
This step traverses the cSupB id from small to large again and performs variant analysis on cSupB whose variant type has not been determined, that is, cSupB that does not contain the reference genome. The basic idea is to search the nearest parent source node containing the reference according to the cSupB tree structure when encountering a source node whose variant type has not yet been determined and record all the source nodes involved in turn. Then, by ordering from the parents to children, we can determine the variant types of all source nodes (see Additional file 1: father-children variant detection (FCVD)). For detailed algorithm flows, see the NVD-2 flow chart in Additional file 1.
Reference position determination (RPD)
All the source nodes can be divided into two types: source nodes contained in the reference genome or those that are not. For the former, we used a variable refpos to record its position in the reference genome. Therefore, the focus here is to infer the reference genome position of the second type of source nodes and finally obtain the approximate correspondence between the graph position nodepos and the refpos.
The basic idea here is similar to the NVD algorithm, and we still traverse the cSupB id from small to large. When encountering a source node in which the refpos has not been determined, we need to obtain its nearest parent source node that contains the reference according to the cSupB tree structure and record all the source nodes involved in turn. Then, following the ordering from parents to children, source nodes’ refpos can be determined, and the relationship between refpos and nodepos can also be obtained. Finally, we can simply obtain the variant at the reference genome site combined with the node variant information. For algorithm details, see the RPD flow chart in Additional file 1.
Construction of the coordinate system
In the linear coordinate system, a simple positive integer \(a\) and binary \((a, b)\) can be represented by the base and sequence information. The biological relationship between sequences can also be discussed. However, this representation is not suitable for genome graphs. Here, based on the cSupB tree model, we constructed a haplotype pangenome coordinate system.
We define a triple (position, bubid, basecolor) to represent each base location (BL) in a graph. Since every sample contained in any cSupB has only one path, this representation provides a one-to-one correspondence to nodes in the graph. The position represents the offset value of the bases, and bubid represents the smallest cSupB where the base is located. basecolor represents the samples that include this base at this position, which has two representations. One is to use a string consisting of ‘0’ and ‘1’, and its length is equal to the number of samples. This representation is same as the node or edge color. The other is to randomly select a sample id containing the base. The former is more comprehensive, and the latter is simpler: the selection can be made according to the application. In this triplet, position, bubid and basecolor provide numerical, topology and sample information, respectively.
Similar to the linear coordinate system, we discuss the sequence information. The sequence mapped to the graph is a path, which is represented by a six-tuple (path location (PL)), as follows:
$$\left( {startpos,startbub,endpos,endbub,pathbub,pathcolor} \right)$$
The startpos and endpos represent the offset values of the start and end nodes of the path, and startbub and endbub represent the smallest cSupB where the start and end nodes are located, respectively. pathbub represents the smallest cSupB that contains all the paths. startbub and endbub are both child bubbles of pathbub. If the path spans \(n\) root cSupBs, pathbub is recorded as \(-n\). The pathcolor represents the color of the path, which is represented as a 0–1 string. When the length of a path is equal to 1, the path is a base, startpos = endpos, startbub = endbub = pathbub, and pathcolor = basecolor. Then, the six-tuple of the path becomes a three-tuple of bases.
In the linear coordinate system, if two intervals are given, there are three kinds of relations: separation, intersection, and inclusion. Similarly, on the genome graph, we can provide the correlation between two paths.
Given two random paths, path1: (a1, bub1, b1, bub2, bub3, color1) and path2: (a2, bub4, b2, bub5, bub6, color2).
-
1.
If there is no intersection between [a1, b1] and [a2, b2], path1 and path2 are separated;
-
2.
If the relationship between [a1, b1] and [a2, b2] is inclusion and color1 and color2 also have an inclusion relationship, then path1 and path2 have an inclusion relationship;
-
3.
In other cases, path1 and path2 intersect.
Specifically, if [a1, b1], [a2, b2] have an intersection and color1 and color2 do not intersect, then path1 and path2 intersect. At this time, sequence similarity analysis can be performed. According to the cSupB tree, we can find the nearest parent bubble bub7 of bub3 and bub6 and its color color3. If bub7 does not exist, the path spans the root cSupB, and color3 includes all samples; if bub7 exists, similarity analysis can be performed in bub7, and the union of color1 and color2 is a subset of color3.
Taking Fig. 1c as an example, we know that there are five cSupBs bub1.\(<TAT,ACC,111 >\); bub2.\(<GGG,GTA,110>\); bub3.\(<CAC,GTA,011>\); bub4.\(<TCA,GGG,110>\); and bub5.\(<TCA,GTA,111>\). We randomly select a node such as TCA, and its base position is (3, 4, 111) in which 3, 4, and 111 denote TCA’s offset value, smallest cSupB id and base color, respectively. Similarly, we can also obtain other nodes’ base positions such as ATG: (13, 1, 100), CCC: (17, -1, 111) and GGG: (7, 2, 110). Then, we randomly select three paths: \(a\). CAGGGTGTA- > (5, 4, 11, 2, 5, 100); \(b\). GGGAGTA—> (7, 2, 11, 2, 2, 010); and \(c\). ATAACCC- > (13, 1, 17, 1, 1, 011). Taking path \(a\) as an example, we give a brief introduction to its components. The first node of path \(a\) is CAG, whose base position is (5, 4, 100), and the last node GTA’s base position is (11, 2, 111). All the nodes on path \(a\) are contained in cSupB5, and their intersection of node colors is 100. Here, path \(a\) and path \(b\) intersect because they have common nodes such as GGG, and path \(a\)(\(b\)) and path \(c\) are separated because there is no intersection between [5, 11] ([7, 11]) and [13, 17].
Based on the genome graph coordinate system and variant detection algorithm, we can achieve more functions. For example, given a genome annotation file (e.g., gene transfer format (gtf), general feature format (gff)), we can obtain the relations between the annotation information and the topological structure and predict all the variants; if given a variant file (e.g., variant call format (vcf)), we can not only find the topological information based on the variants but also predict the variant information and compare it to the variant file to explore further discoveries.