 Research
 Open Access
 Published:
Recovering rearranged cancer chromosomes from karyotype graphs
BMC Bioinformatics volume 20, Article number: 641 (2019)
Abstract
Background
Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. Structural and copy number variations in cancer genomes can be determined via abnormal mapping of sequenced reads to the reference genome. Recently it became possible to reconcile both of these types of largescale variations into a karyotype graph representation of the rearranged cancer genomes. Such a representation, however, does not directly describe the linear and/or circular structure of the underlying rearranged cancer chromosomes, thus limiting possible analysis of cancer genomes somatic evolutionary process as well as functional genomic changes brought by the largescale genome rearrangements.
Results
Here we address the aforementioned limitation by introducing a novel methodological framework for recovering rearranged cancer chromosomes from karyotype graphs. For a cancer karyotype graph we formulate an Eulerian Decomposition Problem (EDP) of finding a collection of linear and/or circular rearranged cancer chromosomes that are determined by the graph. We derive and prove computational complexities for several variations of the EDP. We then demonstrate that Eulerian decomposition of the cancer karyotype graphs is not always unique and present the Consistent Contig Covering Problem (CCCP) of recovering unambiguous cancer contigs from the cancer karyotype graph, and describe a novel algorithm CCR capable of solving CCCP in polynomial time. We apply CCR on a prostate cancer dataset and demonstrate that it is capable of consistently recovering large cancer contigs even when underlying cancer genomes are highly rearranged.
Conclusions
CCR can recover rearranged cancer contigs from karyotype graphs thereby addressing existing limitation in inferring chromosomal structures of rearranged cancer genomes and advancing our understanding of both patient/cancerspecific as well as the overall genetic instability in cancer.
Background
Cancer is a deadly disease propagated by accumulation of genetic mutations that range across scales from single nucleotide polymorphisms to largescale genome rearrangements. Advances in wholegenome sequencing of tumor samples have enabled the detection of somatic mutations using specialized algorithms to identify different classes of mutations from short, long, and linked DNA sequence reads available by current technologies [1–10].
In this study we focus on largescale genome rearrangements that accumulate throughout the somatic evolutionary process to alter the order, orientation, and quantity of genomic sequences, and ultimate producing a collection of highly rearranged chromosomes that constitute the cancer genome. Largescale alterations of the chromosomal structure in cancer genomes have been previously identified as risk factors [11], survival prognosis predictors [12, 13], and possible therapeutic targets [14, 15], thus underscoring the importance of this type of mutation and prompting the development of novel methods to decipher the rearranged structures of cancer genomes.
Previously proposed algorithms enabled the inference of copy number aberrations (CNA) (e.g., deletion, amplification) of genomic segments [10, 16, 17] and novel adjacencies (NA) between genomic loci that are distant in the reference but are adjacent in cancer genomes [5–7, 9], in sequenced tumor samples. In the more recent study[18] a novel methodological framework RCK was proposed for reconstruction of cancer genomes karyotype graphs, taking into account both the CNA and NA signals, the nonhaploid nature of both the reference and the derived genomes, and, when applicable, possible sample heterogeneity. The karyotype graph provides a much more accurate and complete description of the rearranged cancer genome than either of CNA or NA profiles on their own, but it falls short of describing the actual linear/circular structure of the rearranged chromosomes, and instead provides an alignment of the rearranged chromosomes onto the reference.
Here we consider the problem of inferring sequential structure of rearranged linear/circular chromosomes in a cancer genome given its karyotype graph representation. In previous studies either general observations about necessary and sufficient conditions on the karyotype graphs structure for the existence of the underlying cancer genome [18, 19], or attempts at extracting information about specific local rearranged structures (e.g., amplicons, complex rearrangements, etc) [8, 20, 21] were made. In the present study we formulate a Eulerian Decomposition Problem (EDP) for a cancer karyotype graph with the objective of finding a covering collection of linear and/or circular paths/cycles (i.e., chromosomes). We derive and prove computational complexities for both general, min, and max versions of the EDP. We further observe that for a cancer karyotype graph there are often multiple possible Eulerian decompositions which presents ambiguities for chromosomal sequential structures inference. We note that similar nonuniqueness of traversals that constitute genome graph decompositions has also been previously observed in the area of genome assembly [22]. We then formulate a Consistent (i.e., shared across all possible minimal Eulerian decompositions) Contig Covering Problem (CCCP) and subsequently present a novel algorithm CCR, for Consistent Contig Recovering, capable of solving the CCCP in polynomial time.
We then apply CCR method to the karyotype graphs of 17 primary and metastatic prostate cancer samples. We find that CCR effectively finds long, unambiguous paths shared across all possible graph decompositions thus reliably inferring cancer contigs. We observe that for these cancer contig inferences, the N50 sizes of the obtained results reaches chromosomal scale, and the total number of contigs is within an order of magnitude of the number of chromosomes in the underlying genomes, whereas a naive algorithm fragments the reconstruction into thousands of contigs and appreciably smaller contig N50 sizes. We thus demonstrate that CCR effectively advances the frontier for the analysis of rearranged cancer genomes, allowing for a more detailed and comprehensive studies of structurally altered cancer chromosomes. A proofofconcept implementation of CCR is available at https://github.com/izban/contigcovering and will be later integrated into the RCK software package available at https://github.com/aganezov/RCK.
Methods
Rearranged cancer genomes and Interval Adjacency Graphs
A segment s=[s^{t},s^{h}] is a continuous part of the reference chromosome with extremities s^{t} and s^{h} determining its beginning and end respectively. We assume that every segment appears exactly once in the reference, and segments do not overlap. An adjacency a={p,q} connects two segments’ extremities p and q determining a transition between adjacent segments along the (derived) chromosome. We assume that a derived cancer genome Q comes from the reference genome R via large scale rearrangements and can contain both linear and circular chromosomes. For a set S(R) of segments in the reference R and a set S(Q) of segments in genome Q we naturally have S(Q)⊆S(R) (i.e., “building blocks” that the derived genome is comprised of originate from the reference).
An Interval Adjacency Graph (IAG), or karyotype graph, G(S,A)=(V,E) is a graph built on a set S of segments and a set A of adjacencies. A set V={s^{t},s^{h}∣s∈S} of vertices represents extremities of segments in a set S. A set E of edges is comprised of two types of undirected edges: segment edges E_{S}={{s^{t},s^{h}}∣s∈S}, encoding segments, and adjacency edges E_{A}={{p,q}∣{p,q}∈A}, encoding adjacencies between segments extremities (Fig. 1a).
A linear chromosome corresponds to a segment/adjacency edgealternating path in the IAG that starts with a segment edge and ends with a segment edge (Fig. 1b). A circular chromosome correspond to a segment/adjacency edgealternating cycle in the IAG. The number of times every segment/adjacency edge is traversed (in either direction) across all chromosomes in the genome Q corresponds to the segment/adjacency copy number (i.e., the number of times segment/adjacency is present in the observed genome).
For an IAG G=(V,E) we define a multiplicity function \(\mu : E\rightarrow \mathbb {N}_{\geq 0}\) encoding respective segment/adjacency copy number, and we call respective IAG G=(V,E,μ)weighted. The weighted IAG G is positive if μ(e)≥1 for all e∈E. Further, since we can trivially derive a positive IAG G by removing all edges with multiplicity 0, we assume that IAG G is positive unless explicitly specified otherwise.
For a vertex v∈V we define e_{S}(v)∈E_{S} as a segment edge incident to v and E_{A}(v)⊆E_{A} as adjacency edges incident to v. We also define for all e∈E_{A} an auxiliary function l(e) that outputs 2 if an adjacency edge e is a loop, and 1 otherwise.
We define copy number excess x(v) on a vertex v∈V as follows:
Given that both segment and adjacency edges can have positive multiplicities, for any set E^{′} of edges we define \(\mu (E')=\sum _{e\in E'}\mu (e)\).
Interval Adjacency Graph decompositions
For an IAG G=(V,E,μ) we define Eulerian decomposition (ED) as collection D=P∪C of segment/adjacency edgealternating paths P and cycles C (i.e., linear and circular chromosomes in cancer genome) such that every segment/adjacency edge e is present in D exactly μ(e) times. Below we outline the known necessary and sufficient condition for an Eulerian decomposition to exist in an IAG:
Lemma 1
IAG G=(V,E,μ) has an Eulerian decomposition if and only if for all v∈V holds x(v)≥0. [18,19,23]
We call IAG Gdecomposable if there exist an Eulerian decomposition D of G and assume that the considered IAG is decomposable unless explicitly stated otherwise.
We define by C_{+}(G) a set of connected components in G such that for every c∈C_{+}(G) there exist a vertex v∈c:x(v)>0. We define by C_{0}(G) a set of connected components in G such that for every c∈C_{0}(G) for every vertex v∈c:x(v)=0.
As the number of chromosomes can play a crucial role in cancer development and progression we now observe the previously proven result on how the karyotype graph determines the quantity of chromosomes in the depicted cancer genome:
Lemma 2
For an IAG G=(V,E,μ) the number P of paths P in any Eulerian decomposition D of G is determined as follows [18]:
Below we observe the problem of finding the Eulerian decomposition of an IAG which in turn determines the collection of linear/circular rearranged cancer chromosomes:
Problem 1
(IAG Eulerian Decomposition Problem,) Given an IAG G find some Eulerian decomposition D of G.
Theorem 1
EDP can be solved in O(μ(E)) time.
Proof
We solve the problem for each connected component c∈G separately.
At first, we show how to find an ED of the component c from C_{0}(G). In this case, x(v)=0 for all v∈c, so the numbers of adjacent segment and adjacency edges (with their multiplicities) are equal in each vertex. Thus, in each vertex we can match each segment edge with some adjacency edge. Since each edge has two matched edges of opposite type, one in each endpoint, these matches represent a set of alternating cycles, which is exactly the desired Eulerian decomposition.
Now, consider a component c from C_{+}(G). We modify graph G so that c becomes the component c^{′}∈C_{0}(G^{′}): for each vertex v with x(v)>0 we add supplemental adjacency loops l={v,v} with \(\mu (l)=\lfloor \frac {x(v)}{2} \rfloor \); and for the vertices v_{1},v_{2},…,v_{2k} which are left with x(v)=1 we add supplemental adjacency edges e_{i}={v_{2i−1},v_{2i}} with μ(e_{i})=1. Indeed, there are an even number of vertices with excess 1 since the total copy number excess \(\sum _{v\in V}x(v)\) is even, and thus there may exist only an even number of vertices with an odd copy number excess. In this new graph G^{′} we have x(v)=0 for any v and, thus, we can build ED D^{′} using the algorithm from the previous paragraph. To get ED of G we just remove all supplemental edges from D^{′}. We removed only adjacency edges, thus each cycle either remains the same or is split into several edgealternating paths, which start and end with a segment edge. This satisfies the definition of Eulerian decomposition. □
With the proposed efficient algorithm for obtaining and Eulerian decomposition of an IAG and the respective collection of linear/circular cancer chromosomes, we now observe how many of said chromosomes can be present in the obtained decomposition (Fig. 2).
Rare circular chromosomes, or double minutes, have been previously observed in some cancers [24–26] and we first consider a case where the number of circular chromosomes in cancer genome is minimal. Below we show the previously known result that for IAG G every connected component c∈C_{+}(G) can be decomposed into pathsonly, and every connected components c∈C_{0}(G) can be covered by a single cycle:
Lemma 3
For a decomposable IAG G=(V,E,μ) the size D=P+C of any of its minimal (cardinalitywise) Eulerian decompositions is equal to [18]:
We now described the problem of finding a minimal Eulerian decomposition of an IAG:
Problem 2
(IAG Minimal Eulerian Decomposition Problem,) Given an IAG G find a minimal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D^{′} of G, D≤D^{′}).
Theorem 2
minEDP can be solved in (μ(E)^{2}) time.
Proof
At first, using the algorithm from Theorem 1 we obtain some ED D of graph G. Then, we iteratively merge cycles with paths or cycles with cycles via shared segment edges, until we are left with one cycle in c from C_{0}(G), and until there are no cycles in c from C_{+}(G). To analyze the overall complexity of the aforementioned workflow for solving the minEDP we first observe that, according to Theorem 1 we can find some ED of G in O(μ(E)) time. We then iteratively identify pairs C,Q of a cycle C and a path/cycle Q that share some segment edge s and merge them into a new path/cycle Q^{′}. At each step a pair C,Q can be identified and merged in O(μ(E)), and there can be at most O(μ(E)) such steps, thus bringing the total runtime complexity of the proposed workflow to O(μ(E)^{2}). □
While the minEDP describes the most parsimonious and biologically relevant scenario, we also observe the opposite case and pose a problem of finding the cancer genome described by a given IAG with the maximum number of circular chromosomes and subsequently prove that it is computationally harder than the minEDP:
Problem 3
(IAG Maximal Eulerian Decomposition Problem,) Given an IAG G find a maximal Eulerian decomposition D of G (i.e., for any Eulerian decomposition D^{′} of G, D≥D^{′}).
Theorem 3
maxEDP is NPhard.
Proof
The problem of checking if a simple undirected graph G=(V,E) has a K_{3} edge partition (i.e., a set E of edges can be partitioned into nonintersecting subsets E_{1},E_{2},…,E_{n} such that each E_{i} generates a subgraph G_{i} of G, where G_{i} is isomorphic to a complete graph K_{3}) is NPcomplete [27]. Let us show that maxEDP is NPhard by reducing the problem of K_{3} edge partition to maxEDP.
Suppose that in the K_{3} edge partition problem we are given an graph G=(V,E), with deg(v) as a degree of a vertex v∈V. We note that ∀v∈V we have deg(v)≡0 (mod 2) as otherwise K_{3} edge partitioning of G does not exist. Now we build a new graph G^{′}=(V^{′},E^{′}) from G for the maxEDP instance as follows:
 (i)
add two vertices v^{h},v^{t}∈V^{′} for each v∈V;
 (ii)
add an adjacency edge e={v^{h},u^{h}} with μ(e)=1 for each edge {u,v}∈E;
 (iii)
add segment edges s={v^{h},v^{t}} with μ(s)= deg(v);
 (iv)
add adjacency loops l={v^{t},v^{t}} with μ(l)= deg(v)/2
The example of such transformation is shown in Fig. 3.
There exists a bijection between cycles in G and segment/adjacency edgealternating cycles in G^{′}. Since the smallest possible cycle in G^{′} corresponds to K_{3} in G, the solution in maxEDP corresponds to finding exactly the partition of G into K_{3}, if it exists. Because the described reduction can be computed in linear time, the maxEDP is NPhard, as is the K_{3} edge partition problem. □
Consistent contig covering problem
We can assume that minimal Eulerian decomposition(s) are the most likely ones, as they describe the simplest possible (compositionwise) representation of a cancer genome. When no additional information is available and there exist several minimal Eulerian decompositions of an observed IAG it is unclear how to select the true one (Fig. 4a). To address this challenge we consider the inference of subpaths/cycles that are present across all possible minimal Eulerian decompositions of a considered IAG.
For IAG G=(V,E,μ) a collection T=P∪C of segment/adjacency edgealternating paths/cycles is called a contig covering of G if
 (i)
every segment edge e∈E_{S} is present exactly μ(e) times across all elements of T (i.e., μ(e)=μ_{T}(e)),
 (ii)
for every adjacency edge e∈E_{A} we have μ(e)≥μ_{T}(e),
where μ_{T} is the copy number function on edges determined by T.
A contig covering of a given decomposable IAG represents (sub)parts of the full linear/circular chromosomes in Eulerian decomposition(s) and we note that an Eulerian decomposition is a special case of a contig covering.
As simple collection of individual segments with their multiplicities being taken into account represents a contig covering which we call primitive. Below we consider a problem of finding contig coverings that contain longer paths/cycles and thus represent more contiguous parts of rearranged cancer chromosomes. For IAG G=(V,E,μ) and its contig covering T we define contiguity discordance ∥G−T∥ as follows:
We note, that for IAG G=(V,E,μ) and any Eulerian decomposition D of G we naturally have ∥G−D∥=0 (i.e., all adjacencies, including multiplicities, are present in D) and for a primitive decomposition \(\tilde {T}\) we have \(\G\tilde {T}\ = \mu (E_{A})\).
Since one IAG can have multiple minimal Eulerian decompositions we now formulate a problem of finding the “longest” consistent contigs shared across all minimal Eulerian decompositions (i.e., longest uniquely determined contigs across all possible linear/circular/mixed genomes that the given IAG determines) as follows:
Problem 4
(IAG Consistent Contig Covering Problem,) Given IAG G=(V,E,μ) find a contig covering T=P∪C of G such that for every minimal Eulerian decomposition D of G every path p∈P and cycle c∈C is present in D, and the contiguity discordance ∥G−T∥ is minimized.
Theorem 4
CCCP can be solved in O(μ(E)^{2}) time.
Proof
At first, we assume that the graph is connected. Otherwise, we consider its connected components separately. Below we present the CCR algorithm capable of solving CCCP in O(μ(E)^{2}). An overview of the CCR workflow is shown in Fig. 5.
We start constructing the solution T by setting it to the primitive contig covering \(T=\tilde {T}\) and then iteratively add adjacency edges to it as described below:
 1.
Transform the initial graph G=(V,E,μ) into a graph G^{′} which minimal Eulerian decomposition is a single cycle, i.e., x(v)=0 for all vertices v∈V. For that we add the following elements into G: a supplemental segment edge i={i^{t},i^{h}}, a supplemental adjacency edge {v,i^{t}} for every v∈V:x(v)>0, and an adjacency edgeloop {i^{h},i^{h}} (Fig. 4b). For every added supplemental adjacency edge e={v,i^{t}} we set the multiplicity μ(e)=x(v), for supplemental segment edge i we set multiplicity \(\mu (i) = \sum _{v} x(v)\), and for supplemental selfloop adjacency edge {i^{h},i^{h}} we set its multiplicity \(\frac {\mu (i)}{2}\).
 2.
Find some minimal Eulerian Decomposition D^{′} of G^{′} using the algorithm from Theorem 2;
 3.
Since every (sub)path/cycle present in desired T must also be present in the obtained D^{′} (when considering only nonsupplemental edges) we consider pairs b=(e,f) of nonsupplemental consecutive adjacency edges in any path/cycle in D^{′}. In the beginning every such pair b is marked as unprocessed. For every unprocessed pair b=(e,f) we check if there exist another minimal Eulerian decomposition of G^{′} where e and f are not consecutive. For space considerations we describe in detail how this check is performed in the Additional file 1: Methods.
If this pair b is not consecutive in some other minimal Eulerian decomposition of G^{′} we mark b as processed and never return to it again. Otherwise, it belongs to all minimal Eulerian decompositions of G^{′}. Suppose that e={x^{h},y^{t}}, f={y^{h},z^{t}} and the segment edge y={y^{t},y^{h}} between e and f in D^{′}. We remove e, y, and f from G^{′} and insert a new nonsupplemental contracted adjacency edge j={x^{t},z^{h}} between endpoints x^{t} and z^{h}. We update D^{′} in the same manner: we replace e, y and f with j. We further update T by adding noncontracted adjacency edges from {e,f} by choosing copies of segments x, y, and z that do not have adjacency edges incident to them (at vertices x^{h}, y^{t}, y^{h}, and z^{t}), and connecting them via e and f. We update T with only noncontracted adjacency edges because all adjacency information from contracted adjacency edges has already been added to T.
If there are still unprocessed pairs of adjacency edges in G^{′} we reiterate Step 3.
 4.
We end up with the terminal graph G^{′} in which no pair of adjacency edges belong to the same path/cycle in any consistent contig covering. Now we want to select the largest (cardinalitywise) collection L of noncontracted adjacency edges from G^{′} such that when updating T with edges from L the collection T remains consistent contig covering. We find such collection L by solving the maximummatching problem in the auxiliary graph, see Additional file 1: Methods and Additional file 1: Figure S1 for details. After updating T with adjacency edges from L the collection T represents a solution to the CCCP instance. We note that the solution found to CCCP may not be unique.
Now we analyze complexity of the proposed CCR algorithm. In Step 2 minED can be found in O(μ(E)^{2}) as per Theorem 2. There can be at most O(μ(E)) iterations of Step 3 of the algorithm: in each iteration we either mark some pair of edges as processed, or decrease the number of adjacency edges and increase the number of unprocessed pairs by 2. The check adjacency edge pair in Step 3 is performed in O(μ(E)). Complexity of the selection of noncontracted adjacency edges in the terminal graph is dominated by the MaximumMatching search, which can be performed in O(μ(E)^{2}). Thus, the total complexity of CCR algorithm is O(μ(E)^{2}). □
Results
Evaluation and applications
To evaluate the performance of CCR we applied it on cancer karyotype graphs originally produced with RCK on 17 prostate cancer samples [18]. We note that the karyotypes inferred in the previous study are both clone and haplotypespecific, but for the purposes of this study we do not consider intrasample relations between distinct clones that comprise it. We also underscore that the CCR algorithm and it’s theoretical framework does not discriminate between haplotypespecific and haploid karyotype graphs, as both are Interval Adjacency Graphs at their core.
Out of the 17 considered cancer samples, 10 were heterogeneous with a 2clone compositions, and 7 were homogeneous with a single cancer clone present in each one of them. In the original study the clone and haplotypespecific graphs were inferred with RCK which was ran with samplespecific novel adjacencies obtained from the earlier study [17] and with both Battenberg [17] and HATCHet [16] copy number aberrations (CNA) profile inputs, producing a separate graph for each CNA input. Both Battenberg and HATCHet analyze readdepth ratios and Ballele frequencies of bulksequenced short reads that are aligned and grouped across large genomic fragments. Both methods take into account possible sample heterogeneity with HATCHet being capable of analyzing jointly multiple samples coming from the same patient. Both methods produce clone and allelespecific CNA profiles, that RCK takes as part of the input along with novel adjacencies and infers the karyotype graph for underlying rearranged cancer genomes. Thus we run CCR on 54 karyotype graphs originally inferred for 27 cancer clones in 17 samples.
Information about the lengths (i.e., sum of lengths of segments with their multiplicities taken into account) of the cancer genomes represented as karyotype graphs, the number of chromosomes in their minimal Eulerian decompositions, and whether or not the considered genomes had a Whole Genome Duplication event in the somatic evolutionary history (as according to the original study [16]) is provided in Additional file 1: Table S1.
We observe that CCR was capable of recovering long contigs in the obtained consistent contig covering on all observed cancer karyotype graphs, with the total number of unambiguous contigs being within an order of magnitude from the overall number of chromosomes (Fig. 6a,b). We also computed the N50 size for the obtained contigs, and observe that on average CCR obtains contigs with N50 sizes of 50Mbp or greater, which indicates that the CCR is capable of recovering chromosomal scale contigs (Fig. 6c,d). We note that the contig inference results obtained from karyotype graphs produced with either Battenberg or HATCHet CNA input for RCK were very similar, with the inference on RCK+HATCHet graphs producing fewer and slightly longer contigs on most samples.
We also investigated on whether and if so to what extent the proposed comprehensive CCR contig inference algorithm is more efficient than the “naive” approach of recovering contigs, when a contig covering is produced by only including pairs (a,b) of consecutive adjacency edges involving a segment edge s when their are no other adjacency edges involving s. We note that the naive approach, while trivial to derive, has not been previously implemented or utilized and serves as a baseline for assessing the performance of the proposed CCR method. We found that CCR substantially outperforms the naive approach in both reducing the number of contigs in the produced contig coverings (Fig. 6a,b) and in the contiguity of the recovered contigs (Fig. 6c,d).
Discussion
Inferring rearranged cancer genomes remains a challenging task, but recent advances in inferring clone and haplotypespecific cancer karyotype graphs genomes have improved our ability to depict and analyze genetic instability in cancer. Our formal graphbased framework and CCR algorithm for inferring the sequential structure of cancer genome’s chromosomes from its karyotype graph represents the next logical step in bringing us closer to recovering complete rearranged cancer chromosomes for further analysis.
While our CCR algorithm has demonstrated excellent performance on real cancer karyotype graphs with the average N50 of recovered unambiguous contigs exceeding 50Mbp, there are still several avenues for improving and extending the proposed approach. The joint consideration of information from karyotype graphs of several cancer clones that either comprise the same sample or in general belong to the same patient may improve the quality of the obtained contigs, as distinct clones from the same patient have parts of the somatic evolutionary history in common, and thus may share some of the rearranged chromosomal structure. Furthermore, CCR would benefit from incorporating, when available, the longrange moleculeoforigin information coming from 3^{rd}generation long/linked reads, that can help resolve some of the ambiguities encountered during the contig inference process. We also note that, while runningtime complexity for the minimal Eulerian Decomposition inference step in the CCR algorithm is not dominant, a more efficient algorithm for minimal Eulerian Decomposition may exist which would improve upon the complexity described in Theorem 2.
Overall we believe that CCR addresses many of the major limitations in cancer chromosomal structural analysis and can help advance our understanding of both patient/typespecific as well as the overall genetic instability in cancer.
Conclusions
In the presented work, we analyzed the problem of finding the sequential chromosomal structure of rearranged cancer genomes from their karyotype graph representation. We formulated several variations of the Eulerian decomposition problem (EDP) for a given cancer karyotype graph and provided a polynomialtime solution for the most biological relevant minEDP version of it, and proved that the maxEDP version is NPhard.
We observed that the Eulerian decompositions of a given cancer karyotype graph, while representing the sequential structure of the rearranged chromosomes, may not be unique. To combat resulting ambiguities we formulated a consistent contig covering problem (CCCP) of inferring a collection of the longest unambiguous (i.e., present in every decomposition) cancer contigs, and presented a novel algorithm CCR capable of solving it in polynomial time.
We then evaluated the performance of CCR on 54 cancer karyotype graphs obtained with RCK on diverse group of 17 prostate cancer samples. Our results consistently showed that CCR is capable of decomposing the input karyotype graphs into relatively small collections of long, unambiguous contigs with N50 exceeding 50Mbp in most cases and the total number of contigs within an order of magnitude from the overall number of chromosomes in the considered cancer genomes.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the RCKpubdata repository, https://github.com/aganezov/RCKpubdata.
Abbreviations
 CCCP:

Consistent Contig Covering Problem
 CNA:

Copy Number Aberrations
 ED:

Eulerian Decomposition EDP: Eulerian Decomposition Problem
 IAG:

Interval Adjacency Graph
 NA:

Novel Adjacency
 WGD:

Whole Genome Duplication
References
 1
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9. https://doi.org/10.1093/bioinformatics/btp352.
 2
Metzker ML. Sequencing technologies — the next generation. Nat Rev Genet. 2010; 11(1):31–46. https://doi.org/10.1038/nrg2626.
 3
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009; 25(17):2283–5. https://doi.org/10.1093/bioinformatics/btp373.
 4
Ritz A, Bashir A, Sindi S, Hsu D, Hajirasouliha I, Raphael BJ. Characterization of Structural variants with single molecule and hybrid sequencing approaches. Bioinformatics. 2014; 30(24):3458–66. https://doi.org/10.1093/bioinformatics/btu714.
 5
Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014; 15(6):84. https://doi.org/10.1186/gb2014156r84.
 6
Wala JA, Bandopadhayay P, Greenwald NF, O’Rourke R, Sharpe T, Stewart C, Schumacher S, Li Y, Weischenfeldt J, Yao X, Nusbaum C, Campbell P, Getz G, Meyerson M, Zhang C. Z, Imielinski M, Beroukhim R. SvABA: genomewide detection of structural variants and indels by local assembly. Genome Res. 2018; 28(4):581–91. https://doi.org/10.1101/gr.221028.117.
 7
Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using singlemolecule sequencing. Nat Methods. 2018; 15(6):461–8. https://doi.org/10.1038/s4159201800017.
 8
Nattestad M, Goodwin S, Ng K, Baslan T, Sedlazeck F, Resheneder P, Garvin T, Fang H, Gurtowski J, Hutton E, Tseng E, Chin J, Beck T, Sundaravadanam Y, Kramer M, Antoniou E, McPherson J, Hicks J, McCombie WR, Schatz MC. Complex rearrangements and oncogene amplifications revealed by longread DNA and RNA sequencing of a highly rearranged cancer cell line. bioRxiv. 2017:1–12. https://doi.org/10.1101/174938.
 9
Elyanow R, Wu HT, Raphael BJ. Identifying structural variants using linkedread sequencing data. Bioinformatics. 2018; 34(2):353–60. https://doi.org/10.1093/bioinformatics/btx712.
 10
Ha G, Roth A, Khattra J, Ho J, Yap D, Prentice LM, Melnyk N, McPherson A, Bashashati A, Laks E, Biele J, Ding J, Le A, Rosner J, Shumansky K, Marra MA, Gilks CB, Huntsman DG, McAlpine JN, Aparicio S, Shah SP. TITAN: inference of copy number architectures in clonal cell populations from tumor wholegenome sequence data. Genome Res. 2014; 24(11):1881–93. https://doi.org/10.1101/gr.180281.114.
 11
Vasmatzis G, Kosari F, Murphy SJ, Terra S, Kovtun IV, Harris FR, Zarei S, Smadbeck JB, Johnson SH, Gaitatzes AG, Therneau TM, Rangel LJ, Knudson RA, Greipp P, Sukov WR, Knutson DL, KloftNelson SM, Karnes RJ, Cheville JC. Large Chromosomal Rearrangements Yield Biomarkers to Distinguish LowRisk From Intermediate and HighRisk Prostate Cancer. Mayo Clin Proc. 2019; 94(1):27–36. https://doi.org/10.1016/j.mayocp.2018.06.028.
 12
Paratala BS, Dolfi SC, Khiabanian H, RodriguezRodriguez L, Ganesan S, Hirshfield KM. Emerging Role of Genomic Rearrangements in Breast Cancer: Applying Knowledge from Other Cancers. Biomark Cancer. 2016; 8s1(Supple 1):34417. https://doi.org/10.4137/bic.s34417.
 13
Johung KL, Yeh N, Desai NB, Williams TM, Lautenschlaeger T, Arvold ND, Ning MS, Attia A, Lovly CM, Goldberg S, Beal K, Yu JB, Kavanagh BD, Chiang VL, Camidge DR, Contessa JN. Extended survival and prognostic factors for patients with ALKrearranged nonsmallcell lung cancer and brain metastasis. J Clin Oncol. 2016; 34(2):123–9. https://doi.org/10.1200/JCO.2015.62.0138.
 14
Shaw AT, Ou SHI, Bang YJ, Camidge DR, Solomon BJ, Salgia R, Riely GJ, VarellaGarcia M, Shapiro GI, Costa DB, Doebele RC, Le LP, Zheng Z, Tan W, Stephenson P, Shreeve SM, Tye LM, Christensen JG, Wilner KD, Clark JW, Iafrate AJ. Crizotinib in ROS1 Rearranged Non–SmallCell Lung Cancer. N Engl J Med. 2014; 371(21):1963–71. https://doi.org/10.1056/nejmoa1406766.
 15
Paratala BS, Chung JH, Williams CB, Yilmazel B, Petrosky W, Williams K, Schrock AB, Gay LM, Lee E, Dolfi SC, Pham K, Lin S, Yao M, Kulkarni A, DiClemente F, Liu C, RodriguezRodriguez L, Ganesan S, Ross JS, Ali SM, LeylandJones B, Hirshfield KM. RET rearrangements are actionable alterations in breast cancer. Nat Commun. 2018; 9(1):4821. https://doi.org/10.1038/s41467018073414.
 16
Zaccaria S, Raphael BJ. Accurate quantification of copynumber aberrations and wholegenome duplications in multisample tumor sequencing data. bioRxiv. 2018:496174. https://doi.org/10.1101/496174.
 17
Gundem G, Van Loo P, Kremeyer B, Alexandrov LB, Tubio JMC, Papaemmanuil E, Brewer DS, Kallio HML, Högnäs G, Annala M, Kivinummi K, Goody V, Latimer C, O’Meara S, Dawson KJ, Isaacs W, EmmertBuck MR, Nykter M, Foster C, KoteJarai Z, Easton D, Whitaker HC, Neal DE, Cooper CS, Eeles RA, Visakorpi T, Campbell PJ, McDermott U, Wedge DC, Bova GS, Bova GS. The evolutionary history of lethal metastatic prostate cancer. Nature. 2015; 520(7547):353–7. https://doi.org/10.1038/nature14347.
 18
Aganezov S, Raphael BJ. Reconstruction of clone and haplotypespecific cancer genome karyotypes from bulk tumor samples. 560839. 2019. https://doi.org/10.1101/560839.
 19
Oesper L, Ritz A, Aerni SJ, Drebin R, Raphael BJ. Reconstructing cancer genomes from pairedend sequencing data. BMC Bioinformatics. 2012; 13 Suppl 6(Suppl 6):10. https://doi.org/10.1186/1471210513S6S10.
 20
Deshpande V, Luebeck J, Nguyen NPD, Bakhtiari M, Turner KM, Schwab R, Carter H, Mischel PS, Bafna V. Exploring the landscape of focal amplifications in cancer using AmpliconArchitect. Nat Commun. 2019; 10(1):392. https://doi.org/10.1038/s4146701808200y.
 21
Dzamba M, Ramani AK, Buczkowicz P, Jiang Y, Yu M, Hawkins C, Brudno M. Identification of complex genomic rearrangements in cancers using CouGaR. Genome Res. 2017; 27(1):107–17. https://doi.org/10.1101/gr.211201.116.
 22
Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics. 2010; 11(1):21. https://doi.org/10.1186/147121051121.
 23
Pevzner PA. DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica. 1995; 13(1):77–105. https://doi.org/10.1007/BF01188582.
 24
Carroll SM, DeRose ML, Gaudray P, Moore CM, NeedhamVandevanter DR, Von Hoff DD, Wahl GM. Double minute chromosomes can be produced from precursors derived from a chromosomal deletion. Mol Cell Biol. 1988; 8(4):1525–33.
 25
Fan Y, Mao R, Lv H, Xu J, Yan L, Liu Y, Shi M, Ji G, Yu Y, Bai J, Jin Y, Fu S. Frequency of double minute chromosomes and combined cytogenetic abnormalities and their characteristics. J Appl Genet. 2011; 52(1):53–9. https://doi.org/10.1007/s133530100007z.
 26
Turner KM, Deshpande V, Beyter D, Koga T, Rusert J, Lee C, Li B, Arden K, Ren B, Nathanson DA, Kornblum HI, Taylor MD, Kaushal S, Cavenee WK, WechslerReya R, Furnari FB, Vandenberg SR, Rao PN, Wahl GM, Bafna V, Mischel PS. Extrachromosomal oncogene amplification drives tumour evolution and genetic heterogeneity. Nature. 2017; 543(7643):122–5. https://doi.org/10.1038/nature21356.
 27
Holyer I. The NPCompleteness of Some EdgePartition Problems. SIAM J Comput. 1981; 10(4):713–7. https://doi.org/10.1137/0210054.
Acknowledgements
Authors would like to think Prof. Benjamin J. Raphael for thoughtful and productive discussions that helped to shape the original RCK theoretical framework, the observed graph decomposition research problems and ways to address them.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 20, 2019: Proceedings of the 17th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement20.
Funding
SA and MS are supported by the US National Science Foundation (NSF) grant DBI1350041, US National Institute of Health (NIH) grant R01HG006677, and Bill and Melinda Gates Foundation. The work of NA is supported by the Government of the Russian Federation through the ITMO Fellowship and Professorship Program. Publication costs are funded via NSF grant DBI1350041.
Author information
Affiliations
Contributions
SA, NA, and MS conceived of the presented research. SA, NA, and MS formulated described graphtheoretical computational problems. All authors contributed to presented computational results, including the CCR method. IZ and VA have implemented the CCR algorithm and applied it on prostate cancer karyotype graphs dataset. All authors wrote and approved the manuscript.
Corresponding author
Correspondence to Sergey Aganezov.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Aganezov, S., Zban, I., Aksenov, V. et al. Recovering rearranged cancer chromosomes from karyotype graphs. BMC Bioinformatics 20, 641 (2019). https://doi.org/10.1186/s1285901932084
Published:
Keywords
 Computational biology
 Cancer genomics
 Contigs
 Genome rearrangements
 Structural variations
 Graph decomposition