Volume 13 Supplement 6
Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012)
Reconstructing cancer genomes from pairedend sequencing data
 Layla Oesper^{1}Email author,
 Anna Ritz^{1},
 Sarah J Aerni^{2},
 Ryan Drebin^{1} and
 Benjamin J Raphael^{1, 3}Email author
DOI: 10.1186/1471210513S6S10
© Oesper et al.; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
Abstract
Background
A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants  including duplications, deletions, inversions, translocations, and other rearrangements  result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from pairedend DNA sequencing data.
Results
By aligning paired reads from a cancer genome  and a matched germline genome, if available  to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called P airedend Re construction of G enome O rganization (PREGO), to solve this problem by reducing it to an optimization problem on an intervaladjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome that is consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. nonreciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.
Conclusions
We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.
Introduction
A cancer genome is derived from the germline genome through a series of somatic mutations that accumulate during the lifetime of an individual. These range in size from single nucleotide mutations through larger structural variants (SVs), that duplicate, delete, or rearrange segments of DNA sequence. These structural variants may amplify genes that promote cancer (oncogenes) or delete genes that inhibit cancer development (tumor suppressor genes). In addition, rearrangements such as translocations and inversions may change gene structure or regulation and create novel fusion genes, with or without concomitant changes in copy number [1]. Classic examples are the BCRABL fusion gene in chronic myeloid leukemia and the activation of the MYC oncogene in Burkitt's lymphoma via a translocation. Identification of other common structural aberrations is essential for understanding the molecular basis of cancer and for developing cancerspecific diagnostic markers or therapeutics such as Gleevec that targets BCRABL [2] or Herceptin that targets ERBB2 amplification [3]. However, many cancer genomes are aneuploid, containing extensive duplicated sequences, and are highly rearranged compared to the germline genomes from which they were derived. The organization of amplified regions in cancer genomes is often highly complex with many high copy amplicons from distant parts of the reference genome colocalized on the cancer genome [4, 5]. Estimating the number of copies of these amplicons is extremely difficult. Moreover, determining whether such extensive rearrangements occurred over many cell divisions or nearly simultaneously (e.g. chromothripsis) is difficult [6].
DNA sequencing technologies have improved dramatically over the past decade, and nextgeneration DNA sequencing technologies now enable the sequencing of large cohorts of cancer genomes [7, 8]. However, all present DNA sequencing technologies are limited in the length of DNA sequences they produce with the most affordable technologies producing reads less than 200bp in length. De novo assembly of human, or other mammalian genomes, from this data remains a difficult task [9]. This is primarily due to the presence of repeated sequences in these genomes. De novo assembly of cancer genomes is an even more daunting problem due to complications of aneuploidy and heterogeneity described above.
Because of these challenges, somatic mutations in cancer genomes are now typically analyzed through a resequencing approach that relies on alignment of DNA sequence reads to the human reference genome. Pairedend sequencing technologies that generate paired reads from a longer DNA fragment (or insert) allow the detection of all types of somatic structural variants. Paired end mapping [10, 11], or End Sequencing Profiling [12, 13], aligns paired reads from a cancer genome to the reference human genome. The distance between the aligned reads is computed. If this aligned distance is close to the length of end sequenced fragments, as determined by the distribution of fragment lengths, the aligned pair of reads is referred to as a concordant pair. If the aligned distance is far from the expected fragment length (either shorter or longer) or if the orientation of the aligned reads has changed, then the aligned pair is referred to as a discordant pair. Clusters of discordant pairs reveal novel adjacencies (or breakpoints) created by somatic structural aberrations [13]. Numerous methods have been developed in the past few years to identify structural variants by paired end mapping [14–18] and [19] review many of the recent techniques for accomplishing this goal. In addition, when the sequencing coverage is high, the number of aligned reads [20] or concordant pairs [21] provides an estimate of the number of copies of segments of the cancer genome.
In this paper we address the problem of reconstructing the organization of the cancer genome(s) present in a cancer DNA sample from the adjacencies and copy number information revealed by the concordant and discordant pairs from a pairedend resequencing approach. We define the Copy Number and Adjacency Genome Reconstruction Problem, a general formulation of the problem which we solve as a convex optimization problem. Our approach adapts and generalizes techniques that have been employed previously in genome assembly [22–24], ancestral genome reconstruction and genome rearrangement analysis in the presence of duplicated genes [25], and prediction of copy number variants [26]. In contrast to these works, we focus on the particular features and challenges of cancer genome reconstruction including a broad class of rearrangements, aneuploidy, heterogeneity, and the availability of an "ancestral" reference genome. We apply our algorithm, called P airedend Re construction of G enome O rganization (PREGO), for solving the Copy Number and Adjacency Genome Reconstruction Problem to simulated cancer genome data and to real sequencing data from 5 ovarian cancer genomes from The Cancer Genome Atlas (TCGA). We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. nonreciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.
Methods
Intervals, adjacencies, and cancer genome reconstruction
Suppose the cancer genome is derived from the germline genome through a series of somatic rearrangements. We perform pairedend DNA sequencing on a cancer DNA sample . We assume that the sample contains a genome sequence derived from the reference genome through some series of somatic structural rearrangements of blocks of DNA (we are not considering single nucleotide mutations). From the alignments of paired reads to the reference genome, we derive three pieces of information. First, we derive a partition of the reference genome into a sequence of intervals I = (I_{1}, I_{2}, ..., I_{ n }). Each interval I_{ j }= [s_{ j }, t_{ j }] is the DNA segment from the positive strand of the reference genome that starts at coordinate s_{ j }and ends at coordinate t_{ j }. Since intervals also appear in the opposite direction in a cancer genome (e.g. due to an inversion), we denote by I_{j}= [t_{ j }, s_{ j }] the inverted DNA segment. Second, concurrently with the definition of I, we derive a set of novel adjacencies in the cancer genome. Each adjacency (I_{ j }, I_{ k }) indicates that the end t_{ j }of interval I_{ j }is adjacent to the start s_{ k }of interval I_{ k }in the cancer genome. Thus $\mathcal{A}\subseteq \left\{\left({I}_{j},{I}_{k}\right)j,k\in \left\{\pm 1,\pm 2,\dots ,\pm n\right\}\right\}$. The partition I and associated set of adjacencies are obtained by clustering discordant paired reads whose distance or orientation suggest a rearrangement in the cancer genome [13]. Any existing algorithm can be used to create such input and therefore, the decision about what data to use (i.e. ambiguous reads, split reads, read mapping quality, etc) are part of upstream processing. Third, we derive a read depth vector r = (r_{1}, ..., r_{ n })^{ T }, where r_{ j }is the number of (paired) reads that align entirely within interval I_{ j }. The read depth vector r is obtained by counting concordant pairs in each interval [27].
Our goal is to reconstruct the block organization of the cancer genome(s) in the cancer DNA sample from the interval, adjacency, and copy number information. The block organization corresponds to a sequence I_{α(1)}I_{α(2)}... I_{α(M)}of M intervals where each α(j) ∈ {± 1, ..., ± n}. We formulate the following problem.
Copy number and adjacency genome reconstruction problem
Given an interval vector I , a set $\mathcal{A}$ of cancer adjacencies, and a read depth vector r derived from a cancer sample $\mathcal{S}$ , find the cancer genome(s) that are most consistent with these data.
The statement of this problem does not quantify "most consistent". Defining such a quantitative measure requires the consideration of several complicating factors. First, the measurements of adjacencies and the partition I that they determine may be incomplete or inaccurate. Second, many cancer genomes are aneuploid, meaning that the copy number of many intervals is above and below the diploid number of 2, and thus the read depth vector may not accurately represent the actual copy number of each interval in the cancer genome. Finally, a cancer sample consists of many tumor cells, and each of these may contain different somatic mutations. However, because most tumors are clonal originating from a single cell, a large fraction of the important somatic mutations will be found in all cells of the cancer sample . In this paper, we assume that the cancer sample is genetically homogenous so that we need only construct the organization of one rearranged cancer genome. Below, we formulate a specific instance of the Copy Number and Adjacency Genome Reconstruction Problem that considers the case of a single cancer genome with errors in the set of adjacencies, sequence I of intervals, and the copy numbers must be inferred from the read depth vector r. We defer the question of heterogeneity to future work. We first consider the case of perfect data.
Copy number and adjacency genome reconstruction problem: perfect data
We begin with the case that the data is complete and errorfree: thus, all cancer adjacencies are correctly measured, and we have correctly estimated the copy number of each interval from the read depth vector r. Also, for ease of exposition, we assume that the reference and cancer genomes each contain a single chromosome. Specifically, we define the interval count vector c = (c_{1}, c_{2}, ..., c_{ n })^{ T }, where c_{ j }indicates how many times the interval I_{ j }occurs in . Note that in general c is not directly measured, but rather must be inferred from the data, and we consider this extension in the next section. We have the following problem.
Single chromosome copy number and adjacency genome reconstruction problem
 1.
for j = 1, ..., M  1 either $\left({I}_{a\left(j\right)},{I}_{a\left(j+1\right)}\right)?\mathcal{A}$or $\left({I}_{a\left(j\right)},{I}_{a\left(j+1\right)}\right)?R$.
 2.
For k = 1, ..., n, the total number of indices j with a(j) = k or a(j) = k is equal to c_{ k }.
Now if the data I, , and c are generated from an unknown cancer genome generated by a series of rearrangements, duplications and deletions that do not alter the chromosome ends (telomeres) s_{1} and t_{ n }, then the block organization of this cancer genome corresponds to an alternating path through G beginning at s_{1} and ending at t_{ n }that alternately traverses interval edges and noninterval edges (i.e. reference/variant edges), and where the number of times that each interval I_{ j }is traversed (in either direction) on the path is equal to c_{ j }(Figure 1). We require an alternating path since traversal of an interval edge is equivalent to selection of a block from the reference genome, and traversal of a reference/variant edge corresponds to a transition between blocks. Therefore, such an alternating path spells out a sequence of blocks from the reference genome. Formally, if we transform the intervaladjacency graph into a multigraph where the multiplicity of each edge equals the number of times it is traversed, then the multigraph has an Eulerian tour, as in the repeat graph, or deBruijn graph, in genome assembly algorithms [22, 29].
The following theorem follows directly from (1) and Kotzig's Theorem for alternating Eulerian paths [30] (see also [31]).
Theorem 1. Given a connected intervaladjacency graph G = (V, E), there exists a function μ: E ↦ ℕ satisfying the copy number balance conditions (1) if and only if there exists a multigraph G_{ μ }= (V, E_{ μ }) with edge multiplicities μ containing an alternating Eulerian Tour beginning at s_{1} and ending at t_{ n }.
Finding such a function μ is the Eulerization problem and can be solved in polynomial time [24]. Applying the above result with the additional constraint μ(e_{ I }(j)) = c_{ j }for j = 1, ..., n provides an intervaladjacency multigraph that contains an alternating Eulerian tour, corresponding to a cancer genome consistent with the data I, , and c. In a later section, we extend Theorem 1 to the case of multiple chromosomes by finding a set of alternating tours.
In the case of perfect data, there is guaranteed to be a solution to the Eulerization problem: one such solution is the assignment of multiplicities that correspond to the cancer genome. However, there is no guarantee on the uniqueness of the solution, and other solutions  including solutions that do not use all variant edges  are possible. Figure 1 gives an example. In the case of perfect data we could require that all variant edges are assigned nonzero multiplicity, thus ensuring that all variant edges from the cancer genome are used. However, in the case of imperfect data addressed below, such constraints are not appropriate as we expect such data to contain missing and false adjacencies due to difficulties in inferring adjacencies (structural variants) from pairedend sequencing data.
Copy number and adjacency genome reconstruction problem: imperfect data
The previous section considered the case where the intervals I and adjacencies were derived from a cancer genome with no errors, and where the interval count vector c was known. Now we consider the situation that is presented by real data, where c is unknown and the adjacencies may be incorrect (with missing adjacencies and/or false adjacencies). Instead of c, we are given a (paired) read depth vector r = (r_{1},..., r_{ n }) derived by the alignment of concordant paired reads to the reference genome. Each entry rj is the number of concordant pairs of reads that when aligned to the reference genome lie entirely within the interval I_{ j }. We use a probabilistic model to derive the most likely edge multiplicities μ in the intervaladjacency graph.
Specifically, let L_{1}, L_{2}, ..., L_{ n }be the lengths of intervals I = (I_{1}, I_{ 2 }, ..., I_{ n }), and let ${L}_{R}={\sum}_{i=1}^{n}{L}_{i}$be the length of the reference genome. Let $N={\sum}_{i=1}^{n}{r}_{i}$ be the total number of concordant pairs that align within these intervals. Following the LanderWaterman model, we assume that the reads are distributed uniformly on the genome, so that the number of reads that align to each interval follows the Poisson distribution with mean λ_{ j }equal to the expected number of reads that align to an interval I_{ j }. Of course, the Poisson distribution is an idealized assumption, and it has been shown that read depth is more accurately fit by a overdispersed Poisson or negative binomial model [21, 32]. Nevertheless, the Poisson assumption has proven useful for copy number variant detection [26], and thus we use the Poisson model here, postponing consideration of other distributions to later work. We assume that the length of the cancer genome is approximately equal to the length L_{ r }of the reference genome and μ_{ j }= μ(e_{ I }(j)) is the integer multiplicity assigned to the interval edge I_{ j }. In a genome without any rearrangements, we expect $\frac{N{L}_{j}}{{L}_{R}}$ concordant paired reads to align within interval I_{ j }(ignoring end effects). Since humans are diploid, we need to rescale this value to indicate the presence of two copies of interval Ij. Therefore, we introduce a variable τ that represents the expected number of copies of each interval in a nonrearranged sample. Given τ, the expected number of reads that align to an interval I_{ j }appearing μ_{ j }times in the genome is ${\lambda}_{j}\left(\frac{{\mu}_{j}}{\tau}\right)=\frac{N{L}_{j}}{{L}_{R}}\times \frac{{\mu}_{j}}{\tau}$. In general we set τ = 2, but we defer discussion of handling multiple chromosomes until the next section.
Setting ${\u0109}_{j}={\mu}_{j}$ gives the most likely multiplicity for the interval I_{ j }in the cancer genome.
Note that [26] derives a similar formulation to predict germline copy number variants in human genomes, using a different construction based on bidirected graphs. Since human genomes are diploid, [26] add an additional source/sink vertex σ and add additional constraints that a flow of 2 be conserved across the graph. In contrast, most cancer genomes are aneuploid and might suffer deletions/duplications at the ends of chromosomes, this additional constraint is not applicable. We address this issue in the following section. [26] also show that their formulation reduces to a network flow problem that is solvable in polynomial time. The polynomial time result relies on two properties: (1) the objective function L_{r}(μ) is separably convex; (2) the constraints are totally unimodular [33].
The intervaladjacency graph has a corresponding bidirected graph, and assignment of edge multiplicities in the intervaladjacency graph is equivalent to assignment of flow to the corresponding edges in the bidirected graph. Thus, the problem formulation in (2) above also reduces to a network flow problem that is solvable in polynomial time. In particular, for an intervaladjacency graph, we obtain a corresponding bidirected graph by adding orientation information to both ends of all edges in the original intervaladjacency graph. Specifically, for all interval edges (s_{ j }, t_{ j }) we assign a positive direction to the end at vertex s_{ j }and a negative direction to the end at vertex t_{ j }. For all reference edges (t_{ j }, s_{j+1}) we assign a positive direction to the end at vertex t_{ j }and a negative direction to the end at vertex s_{ j }+1. For all the variant edges (v_{1},v_{2}) we assign a positive direction for all v ∈ {v_{1},v_{2}} such that v is a vertex of the form s_{ j }, and a negative direction if v is a vertex of the form t_{ j }. We directly transfer all constraints on edge multiplicities. The problem formulation in (2) can now be equivalently described as a network flow problem on the corresponding bidirected graph since edge multiplicity assignment can be viewed as equivalent to flow assignment. Due to how we orient the bidirected edges, the copy number balance conditions from (1) are also equivalent to requiring that the amount of flow going into each vertex is equal to the flow exiting the vertex.
The formulation above addresses the fact that sequencing data does not directly give copy numbers of intervals, but rather yields read depth, which we use along with adjacencies to estimate copy number simultaneously across all intervals. However, another source of error in the data are incorrect and missing adjacencies in the set . Incorrect adjacencies will subdivide intervals and alter the read depths in each of these intervals. Because our likelihood function considers both read depth and adjacencies when determining edge multiplicities, our algorithm is somewhat robust to the presence of incorrect adjacencies. Incorrect adjacencies that do not alter the estimated copy numbers of intervals are likely not to be used (i.e. the adjacency will be assigned multiplicity μ = 0). Missing adjacencies will also affect the local structure of the intervaladjacency graph near the missing variant. In particular, all interval edges incident to the missing variant will be concatenated, and the corresponding variant edge will not be present. In most cases, we expect that the resulting reconstruction will simply not contain the missing adjacency. However, in other cases the missing adjacency may lead to additional errors in the reconstruction: for example the cases where the missing adjacency leads to large differences in the estimated copy number of the merged interval, or where the missing adjacencies overlaps with other variants. Our objective function (2) does not attempt to maximize the usage of variant edges, instead allowing the copy number estimates to determine whether variant edges are used are not. Defining an appropriate objective function that includes both copy number balance and scoring of variant edges is left for future work.
Extensions: multiple chromosomes and telomere loss
We generalize the formulation above to handle two additional features of real data: (1) the reference and cancer genomes have multiple chromosomes, and (2) ends of chromosomes (telomeres) may be deleted in the generation of the cancer genome. First, to address the case of multiple chromosomes, we build a multichromosomal intervaladjacency graph G = (V, E) where the interval and reference edges are the union of interval and reference edges in the unichromosomal intervaladjacency graph, respectively. The variant edges ${E}_{\mathcal{A}}$ are derived from the set of adjacencies that connect intervals that are adjacent in the cancer genome, but not in the reference genome. These adjacencies are inferred from the discordant pairs, and now can include adjacencies between different chromosomes; e.g. those resulting from a translocation. The set of telomeric vertices is the union of telomeric vertices of each chromosome, and consequently $\left\mathcal{T}\right$ is even. We now revise Theorem 1 to multichromosomal genomes, where we now decompose the intervaladjacency graph into a set of alternating tours.
Theorem 2. Given an multichromosomal intervaladjacency graph G = (V, E) with telomeric vertices , there exists a function μ: E ↦ ℕ satisfying the copy number balance condition (1) for all $v\in V/\mathcal{T}$if and only if there exists a multigraph G_{ μ }= (V, Eμ) with edge multiplicities μ containing a set of edgedisjoint alternating tours that each begin and end at vertices in , and whose union is E_{ μ }.
A second feature of cancer genome data is that telomeres of the reference genome may be lost. In this case, the set of telomeric vertices contains vertices other than the starts and ends of each chromosome of the reference genome. De novo telomere loss does not produce novel adjacencies in the cancer genome, and thus requires examining the read depth along the genome to find changes in concordant coverage, as used in read depth methods for copy number variant prediction [21]. Additionally, nonreciprocal translocations or breakage/fusion/bridge cycles produce novel adjacencies in the cancer genome and thus the drop in concordant coverage will be apparent over adjacent intervals in I. We use a heuristic which determines the relative ratio of concordant reads to interval length between intervals to determine these drops in concordant coverage, and if at least one such case is found, we add an additional vertex σ to the intervaladjacency graph and to the set of telomeric vertices. We also add variant edges from σ to the incident interval edge of the loss.
Results
We ran our PREGO algorithm on both simulated data and real sequencing data. We solve the convex optimization formulation in Equation (2) with CPLEX 12.1, using a piecewise linear approximation of the log term in the objective function, thus transforming the problem into an Integer Linear Program (ILP). Note, we use CPLEX rather than the efficient network flow algorithm discussed in a previous section as there is no good implementation of the later for bidirected graphs.
Ovarian sequencing data
Ovarian dataset statistics
Dataset  ID  # Var Edges (Used) 

OV1  TCGA130890  771 (499) 
OV2  TCGA130723  562 (268) 
OV3  TCGA240980  311 (172) 
OV4  TCGA241103  340 (218) 
OV5  TCGA131411  389 (255) 
Reciprocal vs. nonreciprocal variants
Statistical tests for variant edges
Reciproc al vs. N on Reciprocal Variant Edges  

Dataset  VariantType  R (all)  $\overline{R}$(all)  R (nontriv)  $\overline{R}$(nontriv)  NR  $\overline{NR}$  pVal 
OV1  T  179  41  75  13  9  58  < 1E15 
OV1  I  46  20  16  12  2  29  3.46E5 
OV1  TO  210  46  70  16  9  38  2.79E12 
OV2  T  77  51  41  23  12  49  5.17E7 
OV2  I  21  15  9  5  10  21  0.057 
OV2  TO  96  64  46  18  15  44  2.63E7 
OV3  T  61  13  19  3  6  30  2.111E7 
OV3  I  19  13  5  5  2  13  0.075 
OV3  TO  58  26  22  8  7  28  1.92E5 
OV4  T  74  16  40  6  12  35  1.54E9 
OV4  I  10  0  2  0  3  12  0.073 
OV4  TO  48  22  22  10  12  26  0.0036 
OV5  T  93  19  29  7  8  37  2.30E8 
OV5  I  12  8  2  0  6  13  0.13 
OV5  TO  82  26  22  8  7  34  2.29E6 
We analyzed the output of our algorithm for reciprocal (nontrivial) edges and nonreciprocal variant edges. For each type of reciprocal variant (inversions, classical translocations and Robertsonian translocations) we tested whether there was an association between a variant edge being used vs. unused, and reciprocal vs. nonreciprocal, using Fisher's exact test. We find that in most cases there is a statistically significant association, with a larger fraction of (nontrivial) reciprocal variant edges being used than nonreciprocal variant edges (Table 2). We surmise that the observed significant association between reciprocal variants and their use in the solution obtained by our method is an indication that it may be easier to satisfy the copy number balance conditions for vertices associated with a reciprocal variant. In particular, we may only use a nonreciprocal variant if additionally the concordant coverage on the surrounding intervals is indicative of a possible change in copy number. In this respect, nonreciprocal variant edges that are used may represent structural variants whose signature is supported by both read depth and discordant read pairs.
Reconstructed variants
Simulated data
We tested our algorithm on simulated data to determine how robust the reconstructed intervaladjacency graphs are to various errors in the input data. Errors in the input data arise from a number of sources, and we studied the effect of two types of errors on the performance of a simulated sequence: sample contamination and read depth estimation error. We begin by constructing a cancer genome C = I_{α(1)}I_{α(2)}... I_{α(M)}consisting of 200 novel adjacencies: 100 homozygous deletions and 100 heterozygous deletions distributed over 22 autosomes (similar to the ovarian cancer genomes we analyzed in the previous section). The lengths of the deletions are sampled from a normal distribution with mean 10Kb and standard deviation 1Kb. From C we identify the sequence of intervals I. We introduce 50 additional "false" adjacencies, where each false adjacency simply partitions an interval in I into three subintervals and adds a corresponding false deletion adjacency to the set . We then simulate 30x physical coverage of pairedend sequencing by sampling uniformly from C the starting positions of intervals, called readintervals. We sample the length of these intervals from a normal distribution with mean 200 and standard deviation 10. We compute the resulting read depth r_{ j }for each interval I_{ j }.
Tumor samples are often a mixture of cells from the tumor itself and cells from noncancerous cells. To model this type of error, we sample some proportion ρ of the readintervals from the corresponding reference genome (i.e. the sequence of intervals I_{1}I_{2} ... I_{ n }), and sample (1  ρ) of the readintervals from the cancer genome C. Additional noise in the read depth estimation occurs due to experimental error (such as sequencing errors and alignment errors due to repetitive sequences in the reference genome) when estimating r_{ j }. Thus, we add Gaussian noise to each r_{ j }drawn from $\mathcal{N}\left(0,\varphi {r}_{j}\right)$. We use ϕr_{ j }rather than a single variance parameter to adjust the noise model for intervals with different read depths.
Discussion
The PREGO algorithm presented here combines copy number and adjacency information from pairedend sequencing data to infer cancer genome organization. However, the algorithm does not consider all the issues involved in real cancer sequencing data. In particular, we assume that structural variants can be identified by mapping of discordant paired reads, but this is difficult for structural variants in repetitive regions of the human genome [15, 17]. Thus, there may be missing or incorrect adjacencies in the data. Similarly, estimates of read depth are difficult to obtain in repetitive regions [21]. While some of these issues may be addressed computationally, the more difficult cases will require longer reads and/or longer fragments for paired reads.
Beyond the issues with data quality are limitations on the inferred organization. While we derive multiplicities on the edges using adjacency and copy number data, we do not resolve the resulting paths through the intervaladjacency graph, except in simple cases. In many datasets, there will be many such paths and therefore many reconstructions of the cancer genome that are consistent with the data. Even the solution for the estimated edge multiplicities may not be unique. Resolving such longer paths requires additional information about connections between consecutive adjacencies, and such information is generally not available unless the distance between consecutive adjacencies is within the length of a read/fragment. In addition, the intervaladjacency graph does not contain allelespecific information about copy number variants, as considered in other work [35]. Finally, we assume that a cancer sample contains a single genome, when in fact most cancer samples contain DNA from a mixture of tumor cells, each with potentially different somatic mutations. It is possible that some of this intratumor heterogeneity could be resolved computationally. Alternatively, DNA sequencing of single cells, or smaller pools of cells, will minimize these effects.
An additional area of investigation is to infer the temporal history of rearrangements. In the case of copyneutral rearrangements, inferences can be made using parsimony models such as HannenhalliPevzner theory [42]. This approach has previously been used in cancer genome analysis [13]. Models have also been introduced to infer orders of mutations in cases where there is interaction between duplications and rearrangements [43] and duplications and singlenucleotide mutations [35, 44].
Conclusions
We formulated the Copy Number and Adjacency Genome Reconstruction Problem of reconstructing a rearranged cancer genome and developed an efficient algorithm, called Pairedend Reconstruction of Genome Organization (PREGO), for a particular instance of this problem. We designed an optimization problem on the intervaladjacency graph, which is related to the breakpoint graph used in genome rearrangement studies. We applied our algorithm to 5 ovarian cancer genomes sequenced as part of The Cancer Genome Atlas (TCGA) and reconstruct structural variants in these genomes. We analyzed the patterns of reciprocal vs. nonreciprocal rearrangements, and identified rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge cycles.
List of abbreviations used
 TCGA:

The Cancer Genome Atlas
 B/F/B:

breakage/fusion/bridge
 ILP:

integer linear program
Declarations
Acknowledgements
LO is supported by a National Science Foundation Graduate Research Fellowship. BJR is supported by an National Science Foundation CAREER Award, a Career Award from the Scientific Interface from the Burroughs Wellcome Fund and an Alfred P. Sloan Research Fellowship.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012).
Authors’ Affiliations
References
 Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet. 2003, 34 (4): 369376. 10.1038/ng1215.View ArticlePubMedGoogle Scholar
 Druker BJ, Talpaz M, Resta DJ, Peng B, Buchdunger E, Ford JM, Lydon NB, Kantarjian H, Capdeville R, Ohno Jones S, Sawyers CL: Efficacy and Safety of a Specific Inhibitor of the BCRABL Tyrosine Kinase in Chronic Myeloid Leukemia. New England Journal of Medicine. 2001, 344 (14): 10311037. 10.1056/NEJM200104053441401.View ArticlePubMedGoogle Scholar
 Kauraniemi P, Hautaniemi S, Autio R, Astola J, Monni O, Elkahloun A, Kallioniemi A: Effects of Herceptin treatment on global gene expression patterns in HER2amplified and nonamplified breast cancer cell lines. Oncogene. 2004, 23 (4): 10103. 10.1038/sj.onc.1207200. [http://www.ncbi.nlm.nih.gov/pubmed/14647448]View ArticlePubMedGoogle Scholar
 Raphael B, Volik S, Yu P, Wu C, Huang G, Linardopoulou E, Trask B, Waldman F, Costello J, Pienta K, Mills G, Bajsarowicz K, Kobayashi Y, Sridharan S, Paris P, Tao Q, Aerni S, Brown R, Bashir A, Gray J, Cheng J, de Jong P, Nefedov M, Ried T, PadillaNash H, Collins C: A sequencebased survey of the complex structural organization of tumor genomes. Genome Biol. 2008, 9: R5910.1186/gb200893r59.PubMed CentralView ArticlePubMedGoogle Scholar
 Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S, O'Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G, Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K, Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D, Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R, Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A, Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F, Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR, Campbell P, Birney E, Easton DF, ChenevixTrench G, Tan MH, Khoo SK, Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR: Patterns of somatic mutation in human cancer genomes. Nature. 2007, 446 (7132): 1538. 10.1038/nature05610.PubMed CentralView ArticlePubMedGoogle Scholar
 Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, Pleasance ED, Lau KW, Beare D, Stebbings LA, McLaren S, Lin ML, McBride DJ, Varela I, NikZainal S, Leroy C, Jia M, Menzies A, Butler AP, Teague JW, Quail MA, Burton J, Swerdlow H, Carter NP, Morsberger LA, IacobuzioDonahue C, Follows GA, Green AR, Flanagan AM, Stratton MR, Futreal PA, Campbell PJ: Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011, 144: 2740. 10.1016/j.cell.2010.11.055.PubMed CentralView ArticlePubMedGoogle Scholar
 Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through secondgeneration sequencing. Nat Rev Genet. 2010, 11: 685696. 10.1038/nrg2841.View ArticlePubMedGoogle Scholar
 Mardis ER, Wilson RK: Cancer genome sequencing: a review. Hum Mol Genet. 2009, 18 (R2): R163R168. 10.1093/hmg/ddp396.PubMed CentralView ArticlePubMedGoogle Scholar
 Schatz MC, Delcher AL, Salzberg SL: Assembly of large genomes using secondgeneration sequencing. Genome Res. 2010, 20: 11651173. 10.1101/gr.101360.109.PubMed CentralView ArticlePubMedGoogle Scholar
 Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Finescale structural variation of the human genome. Nat Genet. 2005, 37 (7): 727732. 10.1038/ng1562.View ArticlePubMedGoogle Scholar
 Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders ACE, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Pairedend mapping reveals extensive structural variation in the human genome. Science. 2007, 318 (5849): 420426. 10.1126/science.1149504.PubMed CentralView ArticlePubMedGoogle Scholar
 Volik S, Zhao S, Chin K, Brebner JH, Herndon DR, Tao Q, Kowbel D, Huang G, Lapuk A, Kuo WL, Magrane G, De Jong P, Gray JW, Collins C: Endsequence profiling: sequencebased analysis of aberrant genomes. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (13): 7696701. 10.1073/pnas.1232418100. [http://www.pnas.org/cgi/content/abstract/100/13/7696]PubMed CentralView ArticlePubMedGoogle Scholar
 Raphael BJ, Volik S, Collins C, Pevzner PA: Reconstructing tumor genome architectures. Bioinformatics. 2003, 19 (Suppl 2): ii162ii171. 10.1093/bioinformatics/btg1074. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/suppl\_2/ii162]View ArticlePubMedGoogle Scholar
 Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics (Oxford, England). 2009, 25 (12): i22230. 10.1093/bioinformatics/btp208. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2687962\&tool=pmcentrez\&rendertype=abstract]View ArticleGoogle Scholar
 Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in highthroughput sequenced genomes. Genome Res. 2009, 19 (7): 12701278. 10.1101/gr.088633.108.PubMed CentralView ArticlePubMedGoogle Scholar
 Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.PubMed CentralView ArticlePubMedGoogle Scholar
 Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM: Genomewide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 2010, 20: 623635. 10.1101/gr.102970.109.PubMed CentralView ArticlePubMedGoogle Scholar
 Xi R, Kim TM, Park PJ: Detecting structural variations in the human genome using next generation sequencing. Briefings in functional genomics. 2010, 9 (56): 40515. 10.1093/bfgp/elq025. [http://bfg.oxfordjournals.org/content/9/56/405.full]PubMed CentralView ArticlePubMedGoogle Scholar
 Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with nextgeneration sequencing. Nature methods. 2009, 6 (11 Suppl): S1320.View ArticlePubMedGoogle Scholar
 Chiang DY, Getz G, Jaffe DB, O'Kelly MJT, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nature methods. 2009, 6: 99103. 10.1038/nmeth.1276.PubMed CentralView ArticlePubMedGoogle Scholar
 Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 15861592. 10.1101/gr.092981.109.PubMed CentralView ArticlePubMedGoogle Scholar
 Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98: 97489753. 10.1073/pnas.171285098.PubMed CentralView ArticlePubMedGoogle Scholar
 Pevzner PA, Tang H: Fragment assembly with doublebarreled data. Bioinformatics. 2001, 17 (Suppl 1): S225S233. 10.1093/bioinformatics/17.suppl_1.S225. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl\_1/S225]View ArticlePubMedGoogle Scholar
 Medvedev P, Brudno M: Maximum likelihood genome assembly. J Comput Biol. 2009, 16: 11011116. 10.1089/cmb.2009.0047.PubMed CentralView ArticlePubMedGoogle Scholar
 Alekseyev MA, Pevzner PA: Breakpoint graphs and ancestral genome reconstructions. Genome research. 2009, 19 (5): 94357. 10.1101/gr.082784.108. [http://genome.cshlp.org/cgi/content/abstract/19/5/943]PubMed CentralView ArticlePubMedGoogle Scholar
 Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads. Genome Res. 2010, 20 (11): 16131622. 10.1101/gr.106344.110.PubMed CentralView ArticlePubMedGoogle Scholar
 Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PAW, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genomewide massively parallel pairedend sequencing. Nature genetics. 2008, 40 (6): 7229. 10.1038/ng.128.PubMed CentralView ArticlePubMedGoogle Scholar
 Wittler R, Manňuch J, Patterson M, Stoye J: Consistency of sequencebased gene clusters. J Comput Biol. 2011, 18 (9): 10231039. 10.1089/cmb.2011.0083.View ArticlePubMedGoogle Scholar
 Pevzner PA, Tang H, Tesler G: De novo repeat classification and fragment assembly. Genome Res. 2004, 14: 17861796. 10.1101/gr.2395204.PubMed CentralView ArticlePubMedGoogle Scholar
 Kotzig A: Moves without forbidden transitions in a graph. Mathematica Slovaca. 1968, 18: 7680.Google Scholar
 Pevzner P: DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica. 1995, 13: 77105. 10.1007/BF01188582.View ArticleGoogle Scholar
 Bentley DR: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 5359. 10.1038/nature07517.PubMed CentralView ArticlePubMedGoogle Scholar
 Hochbaum D, Shanthikumar J: Convex separable optimization is not much harder than linear optimization. Journal of the ACM (JACM). 1990, 37 (4): 843862. 10.1145/96559.96597.View ArticleGoogle Scholar
 Mills RE: Mapping copy number variation by populationscale genome sequencing. Nature. 2011, 470: 5965. 10.1038/nature09708.PubMed CentralView ArticlePubMedGoogle Scholar
 Greenman CD, Pleasance ED, Newman S, Yang F, Fu B, NikZainal S, Jones D, Lau KW, Carter N, Edwards PAW, Futreal PA, Stratton MR, Campbell PJ: Estimation of rearrangement phylogeny for cancer genomes. Genome Res. 2012, 22 (2): 346361. 10.1101/gr.118414.110.PubMed CentralView ArticlePubMedGoogle Scholar
 Steinhardt AA, Gayyed MF, Klein AP, Dong J, Maitra A, Pan D, Montgomery EA, Anders RA: Expression of Yesassociated protein in common solid tumors. Hum Pathol. 2008, 39: 15821589. 10.1016/j.humpath.2008.04.012.PubMed CentralView ArticlePubMedGoogle Scholar
 Kelemen LE: Genetic variation in TYMS in the onecarbon transfer pathway is associated with ovarian carcinoma types in the Ovarian Cancer Association Consortium. Cancer Epidemiol Biomarkers Prev. 2010, 19: 18221830. 10.1158/10559965.EPI091317.PubMed CentralView ArticlePubMedGoogle Scholar
 Ng CK, Cooke SL, Howe K, Newman S, Xian J, Temple J, Batty EM, Pole JC, Langdon SP, Edwards PA, Brenton JD: The role of tandem duplicator phenotype in tumour evolution in highgrade serous ovarian cancer. J Pathol. 2011Google Scholar
 Moestue SA, Borgan E, Huuse EM, Lindholm EM, Sitter B, BørresenDale AL, Engebraaten O, Maelandsmo GM, Gribbestad IS: Distinct choline metabolic profiles are associated with differences in gene expression for basallike and luminallike breast cancer xenograft models. BMC Cancer. 2010, 10: 43310.1186/1471240710433.PubMed CentralView ArticlePubMedGoogle Scholar
 Takakura S, Kohno T, Manda R, Okamoto A, Tanaka T, Yokota J: Genetic alterations and expression of the protein phosphatase 1 genes in human cancers. Int J Oncol. 2001, 18 (4): 817824.PubMedGoogle Scholar
 Jung Y, Kim P, Jung Y, Keum J, Kim SN, Choi YS, Do IG, Lee J, Choi SJ, Kim S, Lee JE, Kim J, Lee S, Kim J: Discovery of ALKPTPN3 gene fusion from human nonsmall cell lung carcinoma cell line using next generation RNA sequencing. Genes Chromosomes Cancer. 2012Google Scholar
 Hannenhalli S, Pevzner PA: Transforming men into mice (polynomial algorithm for genomic distance problem). Proc th Annual Symp Foundations of Computer Science. 1995, 581592.Google Scholar
 Raphael BJ, Pevzner PA: Reconstructing tumor amplisomes. Bioinformatics. 2004, 20 (Suppl 1): i265i273. 10.1093/bioinformatics/bth931.View ArticlePubMedGoogle Scholar
 Durinck S, Ho C, Wang NJ, Liao W, Jakkula LR, Collisson EA, Pons J, Chan SW, Lam ET, Chu C, Park K, Hong Sw, Hur JS, Huh N, Neuhaus IM, Yu SS, Grekin RC, Mauro TM, Cleaver JE, Kwok PY, LeBoit PE, Getz G, Cibulskis K, Aster JC, Huang H, Purdom E, Li J, Bolund L, Arron ST, Gray JW, Spellman PT, Cho RJ: Temporal Dissection of Tumorigenesis in Primary Cancers. Cancer Discovery. 2011, [http://cancerdiscovery.aacrjournals.org/content/early/2011/06/23/21598290.CD110028.abstract]Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.