Recovering rearranged cancer chromosomes from karyotype graphs

Background Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. Structural and copy number variations in cancer genomes can be determined via abnormal mapping of sequenced reads to the reference genome. Recently it became possible to reconcile both of these types of large-scale variations into a karyotype graph representation of the rearranged cancer genomes. Such a representation, however, does not directly describe the linear and/or circular structure of the underlying rearranged cancer chromosomes, thus limiting possible analysis of cancer genomes somatic evolutionary process as well as functional genomic changes brought by the large-scale genome rearrangements. Results Here we address the aforementioned limitation by introducing a novel methodological framework for recovering rearranged cancer chromosomes from karyotype graphs. For a cancer karyotype graph we formulate an Eulerian Decomposition Problem (EDP) of finding a collection of linear and/or circular rearranged cancer chromosomes that are determined by the graph. We derive and prove computational complexities for several variations of the EDP. We then demonstrate that Eulerian decomposition of the cancer karyotype graphs is not always unique and present the Consistent Contig Covering Problem (CCCP) of recovering unambiguous cancer contigs from the cancer karyotype graph, and describe a novel algorithm CCR capable of solving CCCP in polynomial time. We apply CCR on a prostate cancer dataset and demonstrate that it is capable of consistently recovering large cancer contigs even when underlying cancer genomes are highly rearranged. Conclusions CCR can recover rearranged cancer contigs from karyotype graphs thereby addressing existing limitation in inferring chromosomal structures of rearranged cancer genomes and advancing our understanding of both patient/cancer-specific as well as the overall genetic instability in cancer.


Background
Cancer is a deadly disease propagated by accumulation of genetic mutations that range across scales from single nucleotide polymorphisms to large-scale genome rearrangements. Advances in whole-genome sequencing of tumor samples have enabled the detection of somatic mutations using specialized algorithms to identify different *Correspondence: aganezov@cs.jhu.edu 1 Department of Computer Science, Johns Hopkins University, 3400 N. Charles st., 21210 Baltimore, MD, USA Full list of author information is available at the end of the article classes of mutations from short, long, and linked DNA sequence reads available by current technologies [1][2][3][4][5][6][7][8][9][10].
In this study we focus on large-scale genome rearrangements that accumulate throughout the somatic evolutionary process to alter the order, orientation, and quantity of genomic sequences, and ultimate producing a collection of highly rearranged chromosomes that constitute the cancer genome. Large-scale alterations of the chromosomal structure in cancer genomes have been previously identified as risk factors [11], survival prognosis predictors [12,13], and possible therapeutic targets [14,15], thus underscoring the importance of this type of mutation and prompting the development of novel methods to decipher the rearranged structures of cancer genomes.
Previously proposed algorithms enabled the inference of copy number aberrations (CNA) (e.g., deletion, amplification) of genomic segments [10,16,17] and novel adjacencies (NA) between genomic loci that are distant in the reference but are adjacent in cancer genomes [5][6][7]9], in sequenced tumor samples. In the more recent study [18] a novel methodological framework RCK was proposed for reconstruction of cancer genomes karyotype graphs, taking into account both the CNA and NA signals, the non-haploid nature of both the reference and the derived genomes, and, when applicable, possible sample heterogeneity. The karyotype graph provides a much more accurate and complete description of the rearranged cancer genome than either of CNA or NA profiles on their own, but it falls short of describing the actual linear/circular structure of the rearranged chromosomes, and instead provides an alignment of the rearranged chromosomes onto the reference.
Here we consider the problem of inferring sequential structure of rearranged linear/circular chromosomes in a cancer genome given its karyotype graph representation. In previous studies either general observations about necessary and sufficient conditions on the karyotype graphs structure for the existence of the underlying cancer genome [18,19], or attempts at extracting information about specific local rearranged structures (e.g., amplicons, complex rearrangements, etc) [8,20,21] were made. In the present study we formulate a Eulerian Decomposition Problem (EDP) for a cancer karyotype graph with the objective of finding a covering collection of linear and/or circular paths/cycles (i.e., chromosomes). We derive and prove computational complexities for both general, min, and max versions of the EDP. We further observe that for a cancer karyotype graph there are often multiple possible Eulerian decompositions which presents ambiguities for chromosomal sequential structures inference. We note that similar non-uniqueness of traversals that constitute genome graph decompositions has also been previously observed in the area of genome assembly [22]. We then formulate a Consistent (i.e., shared across all possible minimal Eulerian decompositions) Contig Covering Problem (CCCP) and subsequently present a novel algorithm CCR, for Consistent Contig Recovering, capable of solving the CCCP in polynomial time.
We then apply CCR method to the karyotype graphs of 17 primary and metastatic prostate cancer samples. We find that CCR effectively finds long, unambiguous paths shared across all possible graph decompositions thus reliably inferring cancer contigs. We observe that for these cancer contig inferences, the N50 sizes of the obtained results reaches chromosomal scale, and the total number of contigs is within an order of magnitude of the number of chromosomes in the underlying genomes, whereas a naive algorithm fragments the reconstruction into thousands of contigs and appreciably smaller contig N50 sizes. We thus demonstrate that CCR effectively advances the frontier for the analysis of rearranged cancer genomes, allowing for a more detailed and comprehensive studies of structurally altered cancer chromosomes. A proof-of-concept implementation of CCR is available at https://github.com/izban/contig-covering and will be later integrated into the RCK software package available at https://github.com/aganezov/RCK.

Rearranged cancer genomes and Interval Adjacency Graphs
A segment s =[ s t , s h ] is a continuous part of the reference chromosome with extremities s t and s h determining its beginning and end respectively. We assume that every segment appears exactly once in the reference, and segments do not overlap. An adjacency a = {p, q} connects two segments' extremities p and q determining a transition between adjacent segments along the (derived) chromosome. We assume that a derived cancer genome Q comes from the reference genome R via large scale rearrangements and can contain both linear and circular chromosomes. For a set S(R) of segments in the reference R and a set S(Q) of segments in genome Q we naturally have S(Q) ⊆ S(R) (i.e., "building blocks" that the derived genome is comprised of originate from the reference).
An Interval Adjacency Graph (IAG), or karyotype graph, G(S, A) = (V , E) is a graph built on a set S of segments and a set A of adjacencies. A set V = {s t , s h | s ∈ S} of vertices represents extremities of segments in a set S. A set E of edges is comprised of two types of undirected edges: segment edges E S = {{s t , s h } | s ∈ S}, encoding segments, and adjacency edges E A = {{p, q} | {p, q} ∈ A}, encoding adjacencies between segments extremities (Fig. 1a).
A linear chromosome corresponds to a segment/adjacency edge-alternating path in the IAG that starts with a segment edge and ends with a segment edge (Fig. 1b). A circular chromosome correspond to a segment/adjacency edge-alternating cycle in the IAG. The number of times every segment/adjacency edge is traversed (in either direction) across all chromosomes in the genome Q corresponds to the segment/adjacency copy number (i.e., the number of times segment/adjacency is present in the observed genome).
For an IAG G = (V , E) we define a multiplicity function μ : E → N ≥0 encoding respective segment/adjacency copy number, and we call respective IAG G = (V , E, μ) weighted. The weighted IAG G is positive if μ(e) ≥ 1 for all e ∈ E. Further, since we can trivially derive a positive IAG G by removing all edges with multiplicity 0, we Rearranged genomes and Interval Adjacency Graphs. a Reference and derived genomes R and Q both depicted as sequences of oriented segments and a weighted Interval Adjacency Graph determined by a derived genome Q. In the IAG segment edges are shown as solid, with colors corresponding to distinct chromosomes, and adjacency edges are shown as dashed, with reference adjacencies shown in black, and novel adjacencies depicted in red. Edges with copy number 0 are shown as faded, copy numbers of 1 are omitted for clarity. b Rearranged linear chromosomes in the mutated genome Q depicted as both sequences of oriented segments and as segment/adjacency edge-alternating paths in the IAG's connected components assume that IAG G is positive unless explicitly specified otherwise.
For a vertex v ∈ V we define e S (v) ∈ E S as a segment edge incident to v and E A (v) ⊆ E A as adjacency edges incident to v. We also define for all e ∈ E A an auxiliary function l(e) that outputs 2 if an adjacency edge e is a loop, and 1 otherwise.
We define copy number excess x(v) on a vertex v ∈ V as follows: Given that both segment and adjacency edges can have positive multiplicities, for any set E of edges we define μ(E ) = e∈E μ(e).

Interval Adjacency Graph decompositions
For an IAG G = (V , E, μ) we define Eulerian decomposition (ED) as collection D = P ∪ C of segment/adjacency edge-alternating paths P and cycles C (i.e., linear and circular chromosomes in cancer genome) such that every segment/adjacency edge e is present in D exactly μ(e) times. Below we outline the known necessary and sufficient condition for an Eulerian decomposition to exist in an IAG: [18,19,23] We call IAG G decomposable if there exist an Eulerian decomposition D of G and assume that the considered IAG is decomposable unless explicitly stated otherwise.
We define by C + (G) a set of connected components in G such that for every As the number of chromosomes can play a crucial role in cancer development and progression we now observe the previously proven result on how the karyotype graph determines the quantity of chromosomes in the depicted cancer genome:

Lemma 2
For an IAG G = (V , E, μ) the number |P| of paths P in any Eulerian decomposition D of G is determined as follows [18]: Below we observe the problem of finding the Eulerian decomposition of an IAG which in turn determines the collection of linear/circular rearranged cancer chromosomes:

Theorem 1 EDP can be solved in O (μ(E)) time.
Proof We solve the problem for each connected component c ∈ G separately.
At first, we show how to find an ED of the component c from C 0 (G). In this case, x(v) = 0 for all v ∈ c, so the numbers of adjacent segment and adjacency edges (with their multiplicities) are equal in each vertex. Thus, in each vertex we can match each segment edge with some adjacency edge. Since each edge has two matched edges of opposite type, one in each endpoint, these matches represent a set of alternating cycles, which is exactly the desired Eulerian decomposition. Now, consider a component c from C + (G). We modify graph G so that c becomes the component c ∈ C 0 (G ): Indeed, there are an even number of vertices with excess 1 since the total copy number excess v∈V x(v) is even, and thus there may exist only an even number of vertices with an odd copy number excess. In this new graph G we have x(v) = 0 for any v and, thus, we can build ED D using the algorithm from the previous paragraph. To get ED of G we just remove all supplemental edges from D . We removed only adjacency edges, thus each cycle either remains the same or is split into several edge-alternating paths, which start and end with a segment edge. This satisfies the definition of Eulerian decomposition.
With the proposed efficient algorithm for obtaining and Eulerian decomposition of an IAG and the respective collection of linear/circular cancer chromosomes, we now observe how many of said chromosomes can be present in the obtained decomposition (Fig. 2).
Rare circular chromosomes, or double minutes, have been previously observed in some cancers [24][25][26] and we first consider a case where the number of circular chromosomes in cancer genome is minimal. Below we show the previously known result that for IAG G every connected component c ∈ C + (G) can be decomposed into pathsonly, and every connected components c ∈ C 0 (G) can be covered by a single cycle: Lemma 3 For a decomposable IAG G = (V , E, μ) the size |D| = |P| + |C| of any of its minimal (cardinality-wise) Eulerian decompositions is equal to [18]: We now described the problem of finding a minimal Eulerian decomposition of an IAG: Problem 2 (IAG Minimal Eulerian Decomposition Problem, min-EDP) Given an IAG G find a minimal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D of G, |D| ≤ |D |).
Proof At first, using the algorithm from Theorem 1 we obtain some ED D of graph G. Then, we iteratively merge cycles with paths or cycles with cycles via shared segment edges, until we are left with one cycle in c from C 0 (G), and until there are no cycles in c from C + (G). To analyze the overall complexity of the aforementioned workflow for solving the min-EDP we first observe that, according to Theorem 1 we can find some ED of G in O(μ(E)) time. We then iteratively identify pairs C, Q of a cycle C and a path/cycle Q that share some segment edge s and merge them into a new path/cycle Q . At each step a pair C, Q can be identified and merged in O(μ(E)), and there can be at most O(μ(E)) such steps, thus bringing the total runtime complexity of the proposed workflow to O(μ(E) 2 ).
While the min-EDP describes the most parsimonious and biologically relevant scenario, we also observe the opposite case and pose a problem of finding the cancer genome described by a given IAG with the maximum number of circular chromosomes and subsequently prove that it is computationally harder than the min-EDP:

Problem 3 (IAG Maximal Eulerian Decomposition
Problem, max-EDP) Given an IAG G find a maximal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D of G, |D| ≥ |D |). Proof The problem of checking if a simple undirected graph G = (V , E) has a K 3 edge partition (i.e., a set E of edges can be partitioned into non-intersecting subsets E 1 , E 2 , . . . , E n such that each E i generates a subgraph G i of G, where G i is isomorphic to a complete graph K 3 ) is NPcomplete [27]. Let us show that max-EDP is NP-hard by reducing the problem of K 3 edge partition to max-EDP.
Suppose that in the K 3 edge partition problem we are given an graph G = (V , E), with deg(v) as a degree of a vertex v ∈ V . We note that ∀v ∈ V we have deg(v) ≡ 0 (mod 2) as otherwise K 3 edge partitioning of G does not exist. Now we build a new graph G = (V , E ) from G for the max-EDP instance as follows: The example of such transformation is shown in Fig. 3.
There exists a bijection between cycles in G and segment/adjacency edge-alternating cycles in G . Since the smallest possible cycle in G corresponds to K 3 in G, the solution in max-EDP corresponds to finding exactly the partition of G into K 3 , if it exists. Because the described reduction can be computed in linear time, the max-EDP is NP-hard, as is the K 3 edge partition problem.

Consistent contig covering problem
We can assume that minimal Eulerian decomposition(s) are the most likely ones, as they describe the simplest possible (composition-wise) representation of a cancer genome. When no additional information is available and there exist several minimal Eulerian decompositions of an observed IAG it is unclear how to select the true one (Fig. 4a). To address this challenge we consider the inference of sub-paths/cycles that are present across all possible minimal Eulerian decompositions of a considered IAG.  times across all elements of T (i.e., μ(e) = μ T (e)), (ii) for every adjacency edge e ∈ E A we have where μ T is the copy number function on edges determined by T.
A contig covering of a given decomposable IAG represents (sub)parts of the full linear/circular chromosomes in Eulerian decomposition(s) and we note that an Eulerian decomposition is a special case of a contig covering.
As simple collection of individual segments with their multiplicities being taken into account represents a contig covering which we call primitive. Below we consider a problem of finding contig coverings that contain longer paths/cycles and thus represent more contiguous parts of rearranged cancer chromosomes. For IAG G = (V , E, μ) and its contig covering T we define contiguity discordance G − T as follows: (4) We note, that for IAG G = (V , E, μ) and any Eulerian decomposition D of G we naturally have G − D = 0 (i.e., all adjacencies, including multiplicities, are present in D) and for a primitive decompositionT we have G −T = μ(E A ).
Since one IAG can have multiple minimal Eulerian decompositions we now formulate a problem of finding the "longest" consistent contigs shared across all minimal Eulerian decompositions (i.e., longest uniquely determined contigs across all possible linear/circular/mixed genomes that the given IAG determines) as follows: E, μ) find a contig covering T = P ∪ C of G such that for every minimal Eulerian decomposition D of G every path p ∈ P and cycle c ∈ C is present in D, and the contiguity discordance G − T is minimized.

Theorem 4 CCCP can be solved in O(μ(E) 2 ) time.
Proof At first, we assume that the graph is connected. Otherwise, we consider its connected components separately. Below we present the CCR algorithm capable of solving CCCP in O(μ(E) 2 ). An overview of the CCR workflow is shown in Fig. 5.
We start constructing the solution T by setting it to the primitive contig covering T =T and then iteratively add adjacency edges to it as described below: 1. Transform the initial graph G = (V , E, μ)  We remove e, y, and f from G and insert a new non-supplemental contracted adjacency edge j = {x t , z h } between endpoints x t and z h . We update D in the same manner: we replace e, y and f with j. We further update T by adding non-contracted adjacency edges from {e, f } by choosing copies of segments x, y, and z that do not have adjacency edges incident to them (at vertices x h , y t , y h , and z t ), and connecting them via e and f. We update T with only non-contracted adjacency edges because all adjacency information from contracted adjacency edges has already been added to T. If there are still unprocessed pairs of adjacency edges in G we reiterate Step 3.

4.
We end up with the terminal graph G in which no pair of adjacency edges belong to the same path/cycle in any consistent contig covering. Now we want to select the largest (cardinality-wise) collection L of non-contracted adjacency edges from G such that when updating T with edges from L the collection T remains consistent contig covering. We find such collection L by solving the maximum-matching problem in the auxiliary graph, see Additional file 1: Methods and Additional file 1: Figure S1 for details. After updating T with adjacency edges from L the collection T represents a solution to the CCCP instance. We note that the solution found to CCCP may not be unique.
Now we analyze complexity of the proposed CCR algorithm. In Step 2 min-ED can be found in O(μ(E) 2 ) as per Theorem 2. There can be at most O(μ(E)) iterations of Step 3 of the algorithm: in each iteration we either mark some pair of edges as processed, or decrease the number of adjacency edges and increase the number of unprocessed pairs by 2. The check adjacency edge pair in Step 3 is performed in O(μ(E)). Complexity of the selection of non-contracted adjacency edges in the terminal graph is dominated by the Maximum-Matching search, which can be performed in O(μ(E) 2 ). Thus, the total complexity of CCR algorithm is O(μ(E) 2 ).

Evaluation and applications
To evaluate the performance of CCR we applied it on cancer karyotype graphs originally produced with RCK on 17 prostate cancer samples [18]. We note that the karyotypes inferred in the previous study are both cloneand haplotype-specific, but for the purposes of this study we do not consider intra-sample relations between distinct clones that comprise it. We also underscore that the CCR algorithm and it's theoretical framework does not discriminate between haplotype-specific and haploid karyotype graphs, as both are Interval Adjacency Graphs at their core.
Out of the 17 considered cancer samples, 10 were heterogeneous with a 2-clone compositions, and 7 were homogeneous with a single cancer clone present in each one of them. In the original study the clone-and haplotype-specific graphs were inferred with RCK which was ran with sample-specific novel adjacencies obtained from the earlier study [17] and with both Battenberg [17] and HATCHet [16] copy number aberrations (CNA) profile inputs, producing a separate graph for each CNA input. Both Battenberg and HATCHet analyze readdepth ratios and B-allele frequencies of bulk-sequenced short reads that are aligned and grouped across large genomic fragments. Both methods take into account possible sample heterogeneity with HATCHet being capable of analyzing jointly multiple samples coming from the same patient. Both methods produce clone-and allelespecific CNA profiles, that RCK takes as part of the input along with novel adjacencies and infers the karyotype graph for underlying rearranged cancer genomes. Thus we run CCR on 54 karyotype graphs originally inferred for 27 cancer clones in 17 samples.
Information about the lengths (i.e., sum of lengths of segments with their multiplicities taken into account) of the cancer genomes represented as karyotype graphs, the number of chromosomes in their minimal Eulerian decompositions, and whether or not the considered genomes had a Whole Genome Duplication event in the somatic evolutionary history (as according to the original study [16]) is provided in Additional file 1: Table S1.
We observe that CCR was capable of recovering long contigs in the obtained consistent contig covering on all observed cancer karyotype graphs, with the total number of unambiguous contigs being within an order of magnitude from the overall number of chromosomes (Fig. 6a,b). We also computed the N50 size for the obtained contigs, and observe that on average CCR obtains contigs with N50 sizes of 50Mbp or greater, which indicates that the CCR is capable of recovering chromosomal scale contigs (Fig. 6c,d). We note that the contig inference results obtained from karyotype graphs produced with either Battenberg or HATCHet CNA input for RCK were very similar, with the inference on RCK+HATCHet graphs producing fewer and slightly longer contigs on most samples.
We also investigated on whether and if so to what extent the proposed comprehensive CCR contig inference algorithm is more efficient than the "naive" approach of recovering contigs, when a contig covering is produced by only including pairs (a, b) of consecutive adjacency edges involving a segment edge s when their are no other adjacency edges involving s. We note that the naive approach, while trivial to derive, has not been previously implemented or utilized and serves as a baseline for assessing the performance of the proposed CCR method. We found that CCR substantially outperforms the naive approach in both reducing the number of contigs in the produced contig coverings (Fig. 6a,b) and in the contiguity of the recovered contigs (Fig. 6c,d).

Discussion
Inferring rearranged cancer genomes remains a challenging task, but recent advances in inferring clone-and haplotype-specific cancer karyotype graphs genomes have improved our ability to depict and analyze genetic instability in cancer. Our formal graph-based framework and CCR algorithm for inferring the sequential structure of cancer genome's chromosomes from its karyotype graph represents the next logical step in bringing us closer to recovering complete rearranged cancer chromosomes for further analysis.
While our CCR algorithm has demonstrated excellent performance on real cancer karyotype graphs with the average N50 of recovered unambiguous contigs exceeding 50Mbp, there are still several avenues for improving and extending the proposed approach. The joint consideration of information from karyotype graphs of several cancer clones that either comprise the same sample or in general belong to the same patient may improve the quality of the obtained contigs, as distinct clones from the same patient have parts of the somatic evolutionary history in common, and thus may share some of the rearranged chromosomal structure. Furthermore, CCR would benefit from incorporating, when available, the long-range molecule-of-origin information coming from 3 rd -generation long/linked reads, that can help resolve some of the ambiguities encountered during the contig inference process. We also note that, while running-time Contigs obtained with both CCR (red) and the baseline "naive" contig inference approach (blue) and on haplotype-specific cancer karyotype graphs. a Number of contigs recovered with CCR and the "naive" approach from input karyotype graphs obtained with RCK on HATCHet CNA input. b Number of contigs recovered with CCR and the "naive" approach from input karyotype graphs obtained with RCK on Battenberg CNA input. c N50 metric for contigs recovered with CCR and "naive" approach from input karyotype graphs obtained with RCK on HATCHet CNA input. d N50 metric for contigs recovered with CCR and "naive" approach from input karyotype graphs obtained with RCK on Battenberg CNA input complexity for the minimal Eulerian Decomposition inference step in the CCR algorithm is not dominant, a more efficient algorithm for minimal Eulerian Decomposition may exist which would improve upon the complexity described in Theorem 2.
Overall we believe that CCR addresses many of the major limitations in cancer chromosomal structural analysis and can help advance our understanding of both patient/typespecific as well as the overall genetic instability in cancer.

Conclusions
In the presented work, we analyzed the problem of finding the sequential chromosomal structure of rearranged cancer genomes from their karyotype graph representation. We formulated several variations of the Eulerian decomposition problem (EDP) for a given cancer karyotype graph and provided a polynomial-time solution for the most biological relevant min-EDP version of it, and proved that the max-EDP version is NP-hard.
We observed that the Eulerian decompositions of a given cancer karyotype graph, while representing the sequential structure of the rearranged chromosomes, may not be unique. To combat resulting ambiguities we formulated a consistent contig covering problem (CCCP) of inferring a collection of the longest unambiguous (i.e., present in every decomposition) cancer contigs, and presented a novel algorithm CCR capable of solving it in polynomial time.
We then evaluated the performance of CCR on 54 cancer karyotype graphs obtained with RCK on diverse group of 17 prostate cancer samples. Our results consistently showed that CCR is capable of decomposing the input karyotype graphs into relatively small collections of long, unambiguous contigs with N50 exceeding 50Mbp in most cases and the total number of contigs within an order of magnitude from the overall number of chromosomes in the considered cancer genomes.