Rearranged cancer genomes and Interval Adjacency Graphs
A segment s=[st,sh] is a continuous part of the reference chromosome with extremities st and sh determining its beginning and end respectively. We assume that every segment appears exactly once in the reference, and segments do not overlap. An adjacency a={p,q} connects two segments’ extremities p and q determining a transition between adjacent segments along the (derived) chromosome. We assume that a derived cancer genome Q comes from the reference genome R via large scale rearrangements and can contain both linear and circular chromosomes. For a set S(R) of segments in the reference R and a set S(Q) of segments in genome Q we naturally have S(Q)⊆S(R) (i.e., “building blocks” that the derived genome is comprised of originate from the reference).
An Interval Adjacency Graph (IAG), or karyotype graph, G(S,A)=(V,E) is a graph built on a set S of segments and a set A of adjacencies. A set V={st,sh∣s∈S} of vertices represents extremities of segments in a set S. A set E of edges is comprised of two types of undirected edges: segment edges ES={{st,sh}∣s∈S}, encoding segments, and adjacency edges EA={{p,q}∣{p,q}∈A}, encoding adjacencies between segments extremities (Fig. 1a).
A linear chromosome corresponds to a segment/adjacency edge-alternating path in the IAG that starts with a segment edge and ends with a segment edge (Fig. 1b). A circular chromosome correspond to a segment/adjacency edge-alternating cycle in the IAG. The number of times every segment/adjacency edge is traversed (in either direction) across all chromosomes in the genome Q corresponds to the segment/adjacency copy number (i.e., the number of times segment/adjacency is present in the observed genome).
For an IAG G=(V,E) we define a multiplicity function \(\mu : E\rightarrow \mathbb {N}_{\geq 0}\) encoding respective segment/adjacency copy number, and we call respective IAG G=(V,E,μ)weighted. The weighted IAG G is positive if μ(e)≥1 for all e∈E. Further, since we can trivially derive a positive IAG G by removing all edges with multiplicity 0, we assume that IAG G is positive unless explicitly specified otherwise.
For a vertex v∈V we define eS(v)∈ES as a segment edge incident to v and EA(v)⊆EA as adjacency edges incident to v. We also define for all e∈EA an auxiliary function l(e) that outputs 2 if an adjacency edge e is a loop, and 1 otherwise.
We define copy number excess x(v) on a vertex v∈V as follows:
$$ x(v) = \mu(e_{S}(v)) - \sum_{e\in E_{A}(v)}l(e)\cdot\mu(e). $$
(1)
Given that both segment and adjacency edges can have positive multiplicities, for any set E′ of edges we define \(\mu (E')=\sum _{e\in E'}\mu (e)\).
Interval Adjacency Graph decompositions
For an IAG G=(V,E,μ) we define Eulerian decomposition (ED) as collection D=P∪C of segment/adjacency edge-alternating paths P and cycles C (i.e., linear and circular chromosomes in cancer genome) such that every segment/adjacency edge e is present in D exactly μ(e) times. Below we outline the known necessary and sufficient condition for an Eulerian decomposition to exist in an IAG:
Lemma 1
IAG G=(V,E,μ) has an Eulerian decomposition if and only if for all v∈V holds x(v)≥0. [18,19,23]
We call IAG Gdecomposable if there exist an Eulerian decomposition D of G and assume that the considered IAG is decomposable unless explicitly stated otherwise.
We define by C+(G) a set of connected components in G such that for every c∈C+(G) there exist a vertex v∈c:x(v)>0. We define by C0(G) a set of connected components in G such that for every c∈C0(G) for every vertex v∈c:x(v)=0.
As the number of chromosomes can play a crucial role in cancer development and progression we now observe the previously proven result on how the karyotype graph determines the quantity of chromosomes in the depicted cancer genome:
Lemma 2
For an IAG G=(V,E,μ) the number |P| of paths P in any Eulerian decomposition D of G is determined as follows [18]:
$$ |P| = \frac{1}{2}\sum\limits_{v\in V}x(v). $$
(2)
Below we observe the problem of finding the Eulerian decomposition of an IAG which in turn determines the collection of linear/circular rearranged cancer chromosomes:
Problem 1
(IAG Eulerian Decomposition Problem,) Given an IAG G find some Eulerian decomposition D of G.
Theorem 1
EDP can be solved in O(μ(E)) time.
Proof
We solve the problem for each connected component c∈G separately.
At first, we show how to find an ED of the component c from C0(G). In this case, x(v)=0 for all v∈c, so the numbers of adjacent segment and adjacency edges (with their multiplicities) are equal in each vertex. Thus, in each vertex we can match each segment edge with some adjacency edge. Since each edge has two matched edges of opposite type, one in each endpoint, these matches represent a set of alternating cycles, which is exactly the desired Eulerian decomposition.
Now, consider a component c from C+(G). We modify graph G so that c becomes the component c′∈C0(G′): for each vertex v with x(v)>0 we add supplemental adjacency loops l={v,v} with \(\mu (l)=\lfloor \frac {x(v)}{2} \rfloor \); and for the vertices v1,v2,…,v2k which are left with x(v)=1 we add supplemental adjacency edges ei={v2i−1,v2i} with μ(ei)=1. Indeed, there are an even number of vertices with excess 1 since the total copy number excess \(\sum _{v\in V}x(v)\) is even, and thus there may exist only an even number of vertices with an odd copy number excess. In this new graph G′ we have x(v)=0 for any v and, thus, we can build ED D′ using the algorithm from the previous paragraph. To get ED of G we just remove all supplemental edges from D′. We removed only adjacency edges, thus each cycle either remains the same or is split into several edge-alternating paths, which start and end with a segment edge. This satisfies the definition of Eulerian decomposition. □
With the proposed efficient algorithm for obtaining and Eulerian decomposition of an IAG and the respective collection of linear/circular cancer chromosomes, we now observe how many of said chromosomes can be present in the obtained decomposition (Fig. 2).
Rare circular chromosomes, or double minutes, have been previously observed in some cancers [24–26] and we first consider a case where the number of circular chromosomes in cancer genome is minimal. Below we show the previously known result that for IAG G every connected component c∈C+(G) can be decomposed into paths-only, and every connected components c∈C0(G) can be covered by a single cycle:
Lemma 3
For a decomposable IAG G=(V,E,μ) the size |D|=|P|+|C| of any of its minimal (cardinality-wise) Eulerian decompositions is equal to [18]:
$$ |D| = |P| + |C| = \frac{1}{2}\sum\limits_{v\in V}x(v) + |C_{0}(G)|. $$
(3)
We now described the problem of finding a minimal Eulerian decomposition of an IAG:
Problem 2
(IAG Minimal Eulerian Decomposition Problem,) Given an IAG G find a minimal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D′ of G, |D|≤|D′|).
Theorem 2
min-EDP can be solved in (μ(E)2) time.
Proof
At first, using the algorithm from Theorem 1 we obtain some ED D of graph G. Then, we iteratively merge cycles with paths or cycles with cycles via shared segment edges, until we are left with one cycle in c from C0(G), and until there are no cycles in c from C+(G). To analyze the overall complexity of the aforementioned workflow for solving the min-EDP we first observe that, according to Theorem 1 we can find some ED of G in O(μ(E)) time. We then iteratively identify pairs C,Q of a cycle C and a path/cycle Q that share some segment edge s and merge them into a new path/cycle Q′. At each step a pair C,Q can be identified and merged in O(μ(E)), and there can be at most O(μ(E)) such steps, thus bringing the total runtime complexity of the proposed workflow to O(μ(E)2). □
While the min-EDP describes the most parsimonious and biologically relevant scenario, we also observe the opposite case and pose a problem of finding the cancer genome described by a given IAG with the maximum number of circular chromosomes and subsequently prove that it is computationally harder than the min-EDP:
Problem 3
(IAG Maximal Eulerian Decomposition Problem,) Given an IAG G find a maximal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D′ of G, |D|≥|D′|).
Theorem 3
max-EDP is NP-hard.
Proof
The problem of checking if a simple undirected graph G=(V,E) has a K3 edge partition (i.e., a set E of edges can be partitioned into non-intersecting subsets E1,E2,…,En such that each Ei generates a subgraph Gi of G, where Gi is isomorphic to a complete graph K3) is NP-complete [27]. Let us show that max-EDP is NP-hard by reducing the problem of K3 edge partition to max-EDP.
Suppose that in the K3 edge partition problem we are given an graph G=(V,E), with deg(v) as a degree of a vertex v∈V. We note that ∀v∈V we have deg(v)≡0 (mod 2) as otherwise K3 edge partitioning of G does not exist. Now we build a new graph G′=(V′,E′) from G for the max-EDP instance as follows:
- (i)
add two vertices vh,vt∈V′ for each v∈V;
- (ii)
add an adjacency edge e={vh,uh} with μ(e)=1 for each edge {u,v}∈E;
- (iii)
add segment edges s={vh,vt} with μ(s)= deg(v);
- (iv)
add adjacency loops l={vt,vt} with μ(l)= deg(v)/2
The example of such transformation is shown in Fig. 3.
There exists a bijection between cycles in G and segment/adjacency edge-alternating cycles in G′. Since the smallest possible cycle in G′ corresponds to K3 in G, the solution in max-EDP corresponds to finding exactly the partition of G into K3, if it exists. Because the described reduction can be computed in linear time, the max-EDP is NP-hard, as is the K3 edge partition problem. □
Consistent contig covering problem
We can assume that minimal Eulerian decomposition(s) are the most likely ones, as they describe the simplest possible (composition-wise) representation of a cancer genome. When no additional information is available and there exist several minimal Eulerian decompositions of an observed IAG it is unclear how to select the true one (Fig. 4a). To address this challenge we consider the inference of sub-paths/cycles that are present across all possible minimal Eulerian decompositions of a considered IAG.
For IAG G=(V,E,μ) a collection T=P∪C of segment/adjacency edge-alternating paths/cycles is called a contig covering of G if
- (i)
every segment edge e∈ES is present exactly μ(e) times across all elements of T (i.e., μ(e)=μT(e)),
- (ii)
for every adjacency edge e∈EA we have μ(e)≥μT(e),
where μT is the copy number function on edges determined by T.
A contig covering of a given decomposable IAG represents (sub)parts of the full linear/circular chromosomes in Eulerian decomposition(s) and we note that an Eulerian decomposition is a special case of a contig covering.
As simple collection of individual segments with their multiplicities being taken into account represents a contig covering which we call primitive. Below we consider a problem of finding contig coverings that contain longer paths/cycles and thus represent more contiguous parts of rearranged cancer chromosomes. For IAG G=(V,E,μ) and its contig covering T we define contiguity discordance ∥G−T∥ as follows:
$$ {}\|G - T\| \!= \sum_{e\in E}\left(\mu(e) - \mu_{T}(e)\right) = \sum_{e\in E_{A}} \left(\mu(e) - \mu_{T}(e)\right). $$
(4)
We note, that for IAG G=(V,E,μ) and any Eulerian decomposition D of G we naturally have ∥G−D∥=0 (i.e., all adjacencies, including multiplicities, are present in D) and for a primitive decomposition \(\tilde {T}\) we have \(\|G-\tilde {T}\| = \mu (E_{A})\).
Since one IAG can have multiple minimal Eulerian decompositions we now formulate a problem of finding the “longest” consistent contigs shared across all minimal Eulerian decompositions (i.e., longest uniquely determined contigs across all possible linear/circular/mixed genomes that the given IAG determines) as follows:
Problem 4
(IAG Consistent Contig Covering Problem,) Given IAG G=(V,E,μ) find a contig covering T=P∪C of G such that for every minimal Eulerian decomposition D of G every path p∈P and cycle c∈C is present in D, and the contiguity discordance ∥G−T∥ is minimized.
Theorem 4
CCCP can be solved in O(μ(E)2) time.
Proof
At first, we assume that the graph is connected. Otherwise, we consider its connected components separately. Below we present the CCR algorithm capable of solving CCCP in O(μ(E)2). An overview of the CCR workflow is shown in Fig. 5.
We start constructing the solution T by setting it to the primitive contig covering \(T=\tilde {T}\) and then iteratively add adjacency edges to it as described below:
- 1.
Transform the initial graph G=(V,E,μ) into a graph G′ which minimal Eulerian decomposition is a single cycle, i.e., x(v)=0 for all vertices v∈V. For that we add the following elements into G: a supplemental segment edge i={it,ih}, a supplemental adjacency edge {v,it} for every v∈V:x(v)>0, and an adjacency edge-loop {ih,ih} (Fig. 4b). For every added supplemental adjacency edge e={v,it} we set the multiplicity μ(e)=x(v), for supplemental segment edge i we set multiplicity \(\mu (i) = \sum _{v} x(v)\), and for supplemental self-loop adjacency edge {ih,ih} we set its multiplicity \(\frac {\mu (i)}{2}\).
- 2.
Find some minimal Eulerian Decomposition D′ of G′ using the algorithm from Theorem 2;
- 3.
Since every (sub)path/cycle present in desired T must also be present in the obtained D′ (when considering only non-supplemental edges) we consider pairs b=(e,f) of non-supplemental consecutive adjacency edges in any path/cycle in D′. In the beginning every such pair b is marked as unprocessed. For every unprocessed pair b=(e,f) we check if there exist another minimal Eulerian decomposition of G′ where e and f are not consecutive. For space considerations we describe in detail how this check is performed in the Additional file 1: Methods.
If this pair b is not consecutive in some other minimal Eulerian decomposition of G′ we mark b as processed and never return to it again. Otherwise, it belongs to all minimal Eulerian decompositions of G′. Suppose that e={xh,yt}, f={yh,zt} and the segment edge y={yt,yh} between e and f in D′. We remove e, y, and f from G′ and insert a new non-supplemental contracted adjacency edge j={xt,zh} between endpoints xt and zh. We update D′ in the same manner: we replace e, y and f with j. We further update T by adding non-contracted adjacency edges from {e,f} by choosing copies of segments x, y, and z that do not have adjacency edges incident to them (at vertices xh, yt, yh, and zt), and connecting them via e and f. We update T with only non-contracted adjacency edges because all adjacency information from contracted adjacency edges has already been added to T.
If there are still unprocessed pairs of adjacency edges in G′ we reiterate Step 3.
- 4.
We end up with the terminal graph G′ in which no pair of adjacency edges belong to the same path/cycle in any consistent contig covering. Now we want to select the largest (cardinality-wise) collection L of non-contracted adjacency edges from G′ such that when updating T with edges from L the collection T remains consistent contig covering. We find such collection L by solving the maximum-matching problem in the auxiliary graph, see Additional file 1: Methods and Additional file 1: Figure S1 for details. After updating T with adjacency edges from L the collection T represents a solution to the CCCP instance. We note that the solution found to CCCP may not be unique.
Now we analyze complexity of the proposed CCR algorithm. In Step 2 min-ED can be found in O(μ(E)2) as per Theorem 2. There can be at most O(μ(E)) iterations of Step 3 of the algorithm: in each iteration we either mark some pair of edges as processed, or decrease the number of adjacency edges and increase the number of unprocessed pairs by 2. The check adjacency edge pair in Step 3 is performed in O(μ(E)). Complexity of the selection of non-contracted adjacency edges in the terminal graph is dominated by the Maximum-Matching search, which can be performed in O(μ(E)2). Thus, the total complexity of CCR algorithm is O(μ(E)2). □