 Proceedings
 Open access
 Published:
The rise and fall of breakpoint reuse depending on genome resolution
BMC Bioinformatics volume 12, Article number: S1 (2011)
Abstract
Background
During evolution, largescale genome rearrangements of chromosomes shuffle the order of homologous genome sequences ("synteny blocks") across species. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrangements recur, causing breakpoints to be reused.
Methods
We investigate this controversial issue using the synteny block's for humanmouserat reported by Bourque et al. and a series of synteny blocks we generated using Mauve at resolutions ranging from coarse to very finescale. We conducted analyses to test how resolution affects the traditional measure of the breakpoint reuse rate.
Results
We found that the inversionbased breakpoint reuse rate is low at finescale synteny block resolution and that it rises and eventually falls as synteny block resolution decreases. By analyzing the cycle structure of the breakpoint graph of humanmouserat synteny blocks for humanmouse and comparing with theoretically derived distributions for random genome rearrangements, we showed that the implied genome rearrangements at each level of resolution become more “random” as synteny block resolution diminishes. At highest synteny block resolutions the HannenhalliPevzner inversion distance deviates from the Double Cut and Join distance, possibly due to smallscale transpositions or simply due to inclusion of erroneous synteny blocks. At synteny block resolutions as coarse as the Bourque et al. blocks, we show the breakpoint graph cycle structure has already converged to the pattern expected for a random distribution of synteny blocks.
Conclusions
The inferred breakpoint reuse rate depends on synteny block resolution in humanmouse genome comparisons. At finescale resolution, the cycle structure for the transformation appears less random compared to that for coarse resolution. Small synteny blocks may contain critical information for accurate reconstruction of genome rearrangement history and parameters.
Background
Genomes of related organisms have been shown to share long tracts of homologous DNA sequence ("synteny blocks") across species [1]. During the course of evolution, largescale genome rearrangements of chromosomes shuffle the order of such homologous segments. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrangements are likely to recur in regions known as rearrangement hotspots [2]. Pevzner and Tesler [3] inferred the existence of such “fragile sites”, from high values of their breakpoint reuse rate (BRR). An argument by Sankoff and Trinh [4] intended to show that loss of breakpoint usage information occurs via synteny block removal, and leads to the inference of an artificially inflated breakpoint reuse rate. That argument was criticized by Peng et al.[5] for its flawed synteny block generation method. Subsequently, by devising a new approach for computing the BRR, Bergeron et al.[6] observed that rearrangement scenarios maximizing operations involving rearrangements of telomeres using only single cuts could result in much lower rates of reuse. This approach resulted in bounds for the humanmouse transformation which not only demonstrated that much lower rates are possible, but also that the BRR is extremely sensitive to genome representation and rearrangement model.
In the current paper, we reconsider aspects of the original breakpoint reuse debate. We address a theme that David Sankoff suggested in his 2006 PLoS commentary [7] was not yet fully confronted. In that commentary, Sankoff argued that high values of inferred breakpoint reuse may result from noise introduced by imprecise synteny block construction rather than due to actual genome rearrangements in the course of evolution. Imprecise synteny block construction or other processing might effectively randomize the information needed to reconstruct rearrangement history. While others use a variety of methods to explore the level of this "randomization", such as distance in Xu et al.[8], the number of conserved adjacencies [9], or the structure of common intervals [10], we will use the cycle structure of the adjacency graph as an indicator of "randomization". We examine how the scale of synteny block resolution affects breakpoint reuse.
To this end, and to continue the discussion of whether elevated values of breakpoint reuse can be used to infer that genome rearrangements repeatedly strike the same “hot spots” during the course of mammalian evolution, we evaluated the traditional breakpoint reuse statistic using actual data, in the context of rearrangement model and resolution for two data sets: 3way synteny blocks for humanmouserat constructed by Bourque et al.[11], and a series of blocks generated by the Mauve genome alignment system [12] with resolutions ranging from finescale to coarse.
Genome rearrangment transformations
The rearrangement transformation between two genomes with no insertions, deletions or duplications can be specified by the connections between corresponding syntenic segments in the two genomes. We call such segments genes even if they don't actually consist of single genes. The adjacency graph introduced by Bergeron et al.[13] is an elegant representation of the genomic transformation. Vertices of the adjacency graph are either the connections or adjacencies between two gene ends in the initial and target genomes, or solo telomeric gene ends at the ends of chromosomes. The complete set of adjacencies {{a_{1},a_{2}}, {a_{3},a_{4}}, …, {a_{2N1},a_{2N}}} specifies a genome of N genes. A breakpoint represents a disruption of their order, an adjacency of two gene ends associated in one genome but not the other.
A number of methods have been used to evaluate the minimum number of genome rearrangement events, known as rearrangement distance, of a genome transformation; we will focus on two: the Hannenhalli and Pevzner "HP" formulation [14, 15] which involves generalized inversions, (inversions (Figure 1a), translocations, fissions and fusions), and the double cut and join (DCJ) [16] which includes simple transpositions (Figure 1b) and generalized transpositions (Figure 1c) in addition to the operations considered by HP.
In the adjacency graph (Figure 1), adjacencies of one genome are connected to those of another by lines representing their common gene ends. Followed continuously these lines resolve the adjacency graph into alternating closed paths called cycles and other continuously connected open paths. The ends of open paths belong to gene ends at the ends of chromosomes called telomeres. In our formulation of the DCJ paradigm all open paths are closed by a capping procedure described more extensively in [16–18] (for examples, see Additional File 1); essentially, adjacencies containing telomeric gene ends of chromosomes are "capped" by artificial gene ends called "endcaps". Paths that start and end on the same genome are closed by a double capped “null chromosome” on the opposite genome; paths starting and ending in different genomes are closed by connecting the two capped ends.
Semantics, conventions, and cycle nomenclature
Due to the addition of null chromosomes, when paths are closed into cycles, both genomes artificially contain the same number of chromosomes. Some DCJ operations involve "artificial cuts" between a telomeric gene end and an endcap. In other formulations [6, 13], paths are not closed hence such artificial operations do not occur. Another outcome of the device of introducing caps and closing paths is the consequent increase in breakpoints. Some formulations [3, 11] count only "internal breakpoints" which do not include "external breakpoints" such as capped telomeric gene ends, or null chromosomes. In our approach we include all breakpoints, internal and external; hence telomeric adjacencies are counted.
In a departure from the usual convention in the genome rearrangement literature, we call a cycle containing j adjacencies in only one genome a jcycle as this corresponds better to the analogous cycles for unsigned permutations. Usually it is called a 2jcycle. In our formulation, 1cycles are not counted in the DCJ distance (see below) as they are the identity transformation of a single adjacency transforming to itself; the cycle count "c" counts 2cycles (usually called 4cycles) or higher.
DCJ operation, distance, and comparable HP distance
The double cut and join (DCJ), is a universal operation that subsumes many biological rearrangement operations.To resolve the transformation, a DCJ is performed by breaking connections in two adjacencies in the current genome and swapping gene ends so that at least one of the resulting adjacencies exists in the target, decreasing the distance by 1. Performing a single DCJ decreases the distance by 1. Each cycle can be resolved independently by DCJ. For the jth cycle containing b_{ j } breakpoints, this results in d_{ j } = b_{ j } – 1 steps. As all quantities are additive over individual cycles, the total DCJ distance can thus be obtained from a cycle decomposition of the adjacency graph. That is, , , and , where the sums are over cycles. Hence, the resulting DCJ distance between any two genomes can be expressed as:
d_{ DCJ } = b – c
where b is the number of breakpoints and c the number of cycles [16]. In the Hannenhalli and Pevzner formulation which is based solely on generalized inversions, the distance is:
d_{ HP } = b – c + h + f
where b and c (using our conventions) are as above, and h and f are respectively hurdles and fortresses, positive permutations which don't contain any inversions by DCJ, or require "extra inversions" to resolve by HP, and are generally rare [19]. The difference in the distance between the two formulations, d_{ DCJ } – d_{ DCJ } = h + f, consists of these relatively rare obstructions. Figures 1b and 1c illustrate how hurdles increase distance in the HP over the DCJ paradigm.
Definition of the classical inversionbased breakpoint reuse rate (BRR)
In considering an “undisturbed” stretch of genome, performing a reversal creates at most two breakpoints at each step. Pevzner and Tesler [3] defined the inversion based breakpoint reuse rate (BRR) as r =2d/b, with d the genomic distance, and b the number of breakpoints. This "traditional" reuse statistic is 1 for reversals as long as no breakpoints are reused. The cycle distribution of such a transformation consists entirely of 2cycles in our nomenclature (or 4cycles, by the usual convention) such as in Figure 1a. If breakpoints are reused, the statistic can be as high as 2.
BRR for (generalized) transpositions is model dependent
To see the effect on the genome for these operations, Figures 1b and 1c show adjacency graphs superimposed on their genome graphs for a simple transposition and a block interchange (BI). Both are 2step DCJ operations which exchange two segments in a genome by creating an intermediate circular. The simple transposition reuses the same cut in the circular whereas the BI does not. A simple transposition is a special case of a block interchange. As both operations involve the creation and reabsorption of circular intermediates (CI) they are considered generalized transpositions
A breakpoint graph[20] (see Figure 1 for illustrative examples shown below their adjacency graphs) is a representation dual to the adjacency graph such that vertices in the adjacency graph, are lines or arcs in the breakpoint graph. Horizontal lines ("black lines") are typically adjacencies in the current genome, while arcs are "desired" adjacencies in the target. Vertices in the breakpoint graph correspond to gene ends, and are lines in the adjacency graph. Particularly for block interchanges, the creation of CI, is not easily discerned from the adjacency graphs as they look like graphs for a pair of overlapping 2cycles which seem deceptively like inversions. Breakpoint graphs maybe more informative for identifying transpositions.
To see why this is so, we note that in the breakpoint graphs for the simple and generalized transpositions, the arcs are "unoriented" unlike arcs composing the breakpoint graph of the inversion in Figure 1a, which are "oriented". Oriented arcs have their black lines pointing in the same direction relative to the arc, whereas unoriented arcs both point away or towards the central arc. An arc's orientation is only defined for arcs rooted in the same chromosome, otherwise we consider them nonoriented. DCJ performed on oriented arcs produce inversions, while DCJ performed on unoriented arcs produce CI. In order for CI to be reabsorbed, the arc about which we perform the DCJ must overlap others in the breakpoint or adjacency graph.
As shown in the legend for Figures 1b and 1c, cycle structures resulting in generalized transpositions in the DCJ paradigm do not result in the same distance in the HP paradigm, in which only generalized inversions can be performed and not the creation of a CI. As a result, not only do the distances differ, but also the inferred BRR. We note that while Figure 1b shows an unoriented 3cycle, 3cycles also exist which contain oriented arcs. Such 3cycles can be resolved by inversions. For oriented 3cycles, DCJ and HP distances agree as does breakpoint reuse.
Random permutations, cycle structure, and BRR for longer cycles
If we continue performing reversals indefinitely in the same stretch of genome, eventually we will start to reuse breakpoints (unless the simplifying assumptions of the “Infinite Sites Model” of Ma et al. [21] are used, which prevent breakpoint reuse). Consequently, the cycle structure will change. Each reuse of a breakpoint increases the length of the cycle in which that breakpoint appears. Longer and longer cycles appear which continue growing in number and length. The resulting permutation will exhibit increased breakpoint reuse. We see this by computing the inversion based breakpoint reuse for a longer cycle having n breakpoints. The DCJ distance for a cycle containing n breakpoints is n1 and, as we have seen, is identical to the HP distance unless there are hurdles and fortresses. The inversion based BRR statistic is therefore r_{ longcycle } = 2d/b = 2(n – 1)/n, which approaches 2 as n gets large. Since distance is additive over cycles, and so are breakpoints, for transformations containing N such cycles the total BRR becomes, r_{ longcycle } = 2N(n – 1)/N n = 2(n – 1)/n → 2 for large n.
As more and more random operations are performed between two genomes, they become increasingly divergent and randomized relative to one another. This affects the cycle structure. Considering results from a paper on random signed gene order permutations by Sankoff and Haque [22]which constructed cycles of the breakpoint graph of random permutations, Richard Friedberg communicated results on cycle structure to us which we discuss in the next section. In particular, he shared a result for the expected number of jcycles in the random edge graph for transformations by DCJ for genomes containing only circular chromosomes. As gene number per chromosome gets large, we surmise results for the unrestricted circular case (allowing intermediate genomes to contain circular chromosomes) coincide with those for the restricted linear case (circulars are immediately reabsorbed).
Cycle structure of random unsigned permutations
Before discussing the signed permutation case in the next section, we first consider the cycle structure of unsigned permutations (permutations of objects without consideration of orientation). The results for the unsigned permutation case have already been reported [23] . These formulas are analogous to those for signed permutations. In the unsigned case, the expected number of jcycles is:
To compare this later with the result for signed permutations, we write it as:
We can get the expected number of cycles directly by summing eq (1) over j:
Unfortunately the analogous formula for bipairings (ie signed permutations of genes with orientation) does not have the simplicity of eq (1), so we cannot accomplish this as easily for signed permutations. Nor can we compute c by summing an approximate formula because its error becomes large when j is near N, no matter how large N is. Fortunately, there is an indirect way of deriving eq (2) bypassing eq (1) so that c is found without finding the individual values of q_{ j } and the result agrees with eq (2). This indirect way has the merit that it also yields the variance of c, which cannot be inferred from eq (1). With a slight modification this method also works for bipairings. It yields:
<c> = 1 + 1/3 + 1/5 + … (2*)
Now these individual terms do not correspond to the formula for signed permutations, eq (1*) below. For small j the j th term of (2*) is larger than q_{ j }, but for j near N it is smaller. To solve for one can write:
<c_{3}> = <c> – (<q_{1}> + <q_{2}> + <q_{3}>) (3, 3*)
true for both (signed and unsigned) problems. In the unsigned permutation problem this simplifies to
<c_{3}> = 1/4 + 1/5 + 1/6 + …
which has no analogue for bipairings. To evaluate (3) for bipairings one must use (2*) for <c> and approximate formulas for the equivalent terms <q_{1}>, <q_{2}>, <q_{3}>.
The cycle structure of random signed genome permutations
In the case of signed genome permutations, 2N gene ends can be paired into N adjacencies in (2N – 1)!! different ways where K!! is a double factorial and is equal to K(K2)(K4)…k (with k=1 if K is odd, and 2 if K is even). The comparison of two such pairings permits a cycle structure to be constructed. For an adjacency graph with N adjacencies in each genome, let q_{ j } be the number of jcycles and the total number of cycles.

Then for such a random adjacency graph:
For large N and small j the asymptotic limit of equation (1*) is:
which can be compared to the analogous result for unsigned permutations, eq (1).
Thus for signed permutations:

The expected number of cycles, by the indirect method mentioned previously is:
where γ=Euler’s γ= 0.577. Hence, for large N we have that <c> is given approximately by:

The expected number of cycles of length > j is:
<cj>=<c>– <q_{1}> – <q_{1}> –…–<q_{ j }>
which can be evaluated from (1*) and (2*) and approximately from (1a*) and (2a*).
We wish to know <c_{3}>. Thus, for large N,
<c_{3}> = <c> – (<q_{1}> + <q_{2}> + <q_{3}>) (3*)
Methods
Summary of our approach
To explore whether randomization of genome transformations occurs beyond that due to actual evolutionary events recorded in the "signal in the genome" we have adopted the traditional breakpoint reuse statistic. We feel this statistic serves as an indicator of overall apparent breakpoint reuse and best reflects the approach followed by the pioneering paper of Pevzner and Tesler (2003). Towards this end, we consider the traditional inversionbased breakpoint reuse rate in three databased computational experiments discussed in the next three sections. We calculated BRR for the following [h=human, m=mouse, r=rat]:

1)
a series of Mauve synteny block sets at resolutions ranging from fine to coarse

2)
the 300 kb Bourque et al (2004) hm 3way hmr blocks with block removal

3)
the 141min LCB Mauve blocks (defined below) before and after block removal
Generating synteny blocks with mauveAligner
We applied the "mauveAligner" from the Mauve genome alignment system [12] to search for highresolution 3way locally collinear blocks (Synteny Blocks) among the genomes of Mouse, Rat, and Human. Mauve uses a seedandextend local multiple alignment method to identify highscoring local alignments, which it then clusters into locally collinear blocks—groups of matches in the same order and orientation among each genome. We assign each locally collinear block (LCB) a weight w(B) equal to the sum of lengths of the constituent ungapped local alignments. The weight of a block w(B) can be formally defined as , where a block B consists of one or more ungapped local alignments m, and length(m) is the number of nucleotides covered by m. In the context of mauveAligner, the matches in a block must be in the same order and orientation in all genomes. That is, blocks are internally free from rearrangement.
Because mauveAligner generates local alignments in unique regions of sequence only, it typically generates few LCBs containing only repetitive regions, and any LCBs generated in such regions tend to be small, having a low weight. mauveAligner filters out those spurious LCBs by iteratively removing lowweight LCBs in a process called greedy breakpoint elimination [12]. Greedy breakpoint elimination is analogous to iterative block removal using w(B) to give an ordering for blocks to be removed. At each step in the process, the lowest weight LCB is removed. If one or more rearrangement breakpoints have been eliminated, any surrounding LCBs may coalesce with each other. The weight of a coalesced LCB is equal to the sum of weights of LCBs that were joined.
In the present work, the RepeatMasked assemblies of human (NCBI 35), mouse (NCBI 33), and rat (RGSC 3.4) were searched for unique 3way seed matches on the forward and reverse strands using the palindromic seed pattern: 11111*111*11*1*11*111*11111 [24]. Initial seed matches were maximally extended in each direction until the seed pattern no longer matches at any overlapping position. A total of 922,081 ungapped 3way local alignments containing unique sequence resulted. These initial matches have a minimum length of 27 (dictated by the seed pattern above). The initial set of 3way matches gave rise to 567,782 LCBs, to which we applied greedy breakpoint elimination to remove all LCBs up to a minimum weight of 55, yielding a baseline set of 520,423 3way matches that compose 6483 LCBs. The minimum sequence alignment length among these blocks is 55 nucleotides. We further applied greedy breakpoint elimination to the baseline set of 6,483 LCBs, recording the observed genomic permutation at each successively higher LCB weight up to a minimum weight of 100,000. At minimum weight 97,673 (the last weight before 100,000), there are 75 3way LCBs among the mouse, rat, and human genomes. At weights larger than 500, the LCB weight is approximately proportional to the overall chromosomal span of an LCB, with chromosomal span 1001000x the LCB weight. The data set is included as Additional Files 2345678910
Calculating genomic distance by GRIMM and DCJ
The HP genomic distance for the Mauve synteny block sets constructed at each successive minimum LCB weight was computed using the GRIMM [25] server. We computed the DCJ [16] distance using a C++ program which inputs the genome as a signed permutation (exported from Mauve) as the GRIMM server does, and outputs the DCJ distance, the number of breakpoints, the number of cycles of each kind and their total, and the number of null chromososmes.
Results
Onset of transpositions for humanmouse (hm) rearrangements
We found that the HP and DCJ distances agree identically at low resolution but at high resolution, the GRIMM distance exceeds the DCJ distance. In Figure 2, we display these distances on a log scale so as to simultaneously display the difference between GRIMM (HP) and DCJ values. The maximum GRIMM (HP) value is 1223 while the maximum value of the DCJ distance is 1206. The maximum difference between these two distances is therefore relatively small, only 17 or 1.4% of the total (HP) distance at highest resolutions.
We assume the differences we found between GRIMM (HP) and DCJ distances are due to hurdles, since d_{ GRIMM }  d_{ DCJ } = d_{ HP }  d_{ DCJ } = h +f where fortresses, in breakpoint graphs containing an odd combination of hurdles, are extremely rare. The hurdles in our humanmouse (hm) transformations may be due to simple transpositions and generalized transpositions (GT) as in Figures 1b and 1c, which differ in HP and DCJ distances as shown. GT onset in Figure 2 occurs at min LCB weights 2153 for hm corresponding to a DCJ distance of 293 and GRIMM of 294
BRR for humanmouse Mauve blocks at different resolutions
The traditional inversion based breakpoint reuse rate, 2d/b, where d is the genomic distance and b the number of breakpoints [5] was computed for both GRIMM (HP) and DCJ distance measures on each set of Mauve synteny blocks. The two distances are identical for most of the min LCB weight range represented by Mauve block sets (Figure 2), and the calculated breakpoint reuse rate would also be the same in that range if the number of breakpoints agreed for both GRIMM and DCJ distances. However, to follow the 2003 paper of Pevzner and Tesler in PNAS [3], we used "internal" breakpoints to calculate the GRIMM breakpoint reuse rate. For the DCJ BRR calculation we used "internal + external" breakpoints.
Even though the DCJ and HP distances are largely identical over most of the range, the graphed values of BRR we present in (Figures 3, 4a, 4b) using the HP and DCJ distances we evaluated are not the same. They run essentially parallel to each other. Their difference is not due to the difference in distance for the most part (except at highest resolution where the effect due to the difference in distance is nondetectable), but rather is due to the manner of evaluating breakpoints.
Our maximum values for the BRR range from 1.67 (for the "DCJ" value) to 1.738 (for the "HP" value of BRR) depending on whether the number of breakpoints was evaluated using external breakpoints in addition to internal breakpoints (for the DCJ value) or only internal breakpoints (for the HP value). This variation is consistent with the range of values for BRR in [3] (which goes from 1.633 to 1.9).
The lowest BRR we observed is 1.376 (for the DCJ version), occurring in Mauve synteny blocks with min LCB weight 141. These blocks have an average chromosomal length of 2.66 Mb, a GRIMM distance of 813, and the total number of blocks is 1143. Increasing the min LCB weight from here (and hence decreasing the synteny block resolution), we observe a dramatic rise in both breakpoint reuse curves, which reach their maximum at min LCB weight 2386 (Figure 3). The HP BRR then plateaus and stays in the same range until min LCB weight 5518 (Figure 3). It remains over 1.7 for min LCB weights up to 40000, which have average block size 22.7 Mb, for a total of 133 blocks, and a GRIMM distance of 116.
Dependence of BRR and cycle structure on resolution
To understand the variation in the breakpoint reuse measures based on synteny block resolution, we calculated the cycle structure for the human mouse transformation on a series of progressively increasing resolution Mauve synteny blocks. Figure 4 shows the behavior of the breakpoint reuse measures along with the proportion of cycles for 2, 3 and >3 cycles for humanmouse as they vary with LCB weight.
To interpret how these fractional cycle curves affect the BRR, we wish to express the breakpoint reuse rate BRR r for the entire transformation as a weighted sum over the BRR of individual cycles r_{ j }, that is,
where we wish to determine the coefficient c_{ j } in such a way as to involve the fractional distributions over kinds of cycles. Since the total breakpoint reuse is r = 2d/b, where both the total distance and the total number of breakpoints are summed over cycles, that is and hence the coefficient for equation (4) is:
c_{ j } = b_{ j }/b.
However, the sum in equation (4) is over individual cycles, and not kinds of cycles. To accomplish this we group cycles of the same kind together. If n_{ k } is the number of kcycles the new sum over kinds of cycles, ie 2cycles, 3cycles, 4cycles, etc is:
where b_{ k } is the number of breakpoints in a kcycle. Expressing the coefficient in fractions of cycles, (2cycles, 3cycles etc) with the fraction of kcycles =f_{ k }=n_{ k }/Σn_{ k }_{.} and the sum over n_{ k }, that is, Σn_{ k }=c, the total number of cycles, we arrive at a weighted sum over different kinds of cycles:
where b_{ k } is the number of breakpoints in a kcycle, f_{ k } the fraction of kcycles and r_{ k } is the BRR of a kcycle, or r_{ k } = 2d_{ k }/b_{ k } = 2(k – 2)/k.
Resolution based cycle structure for humanmouse
In Figure 4 we represent the dependence of the cycle structure on min LCB weight. Figure 4a shows the dependence of the cycle structure over the full range of min LCB weight. Figure 4b, shows the highresolution end. At high resolution, the dependence of BRR bottoms at min LCB weight 141 approximately where the proportion of 2cycles (usually called 4cycles) is at a maximum and the proportion of >3cycles (usually called 6cycles) are at minimum. At the minimum of breakpoint reuse, 64% of the cycles are 2cycles, 23% are 3cycles and 12.5% are cycles of length greater than 3. At highest resolutions (below min LCB weight 140) a rise in the 3cycle and >3cycles fraction and corresponding drop in the 2cycles produces a rise in BRR.
In the plateau, nearly half (45% to 48%) of the cycles are 2cycles, 11.26 to 12.5% of the cycles are 3 cycles, and 39 to 43.66% of the cycles are of length greater than 3. The initial rise to the plateau from the minimum of breakpoint reuse corresponds to a rise in the percentage of 2cycles, a decline in the percentage of 3cycles and in the percentage of cycles of length greater than 3. The decline in breakpoint reuse as the min LCB weight grows is due to a rise in the percentage of 2cycles from 48.5% at minimum LCB weight 51235 to 67% at LCB weight 97673 while the percentage of greater than 3cycles declines from 36.4% to 22.2%.
The decline in breakpoint reuse as min LCB weight grows past 50000 is due to finite chromosome length. Entire chromosomes can be spanned by Mauve synteny blocks at their lowest resolution. The average length of blocks with highest min LCB weight (100,000) is 42.4 Mb, almost the size of the smallest (human) chromosome, 46.9 Mb. At such resolutions, the BRR diminishes, the percentage of 2cycles rises and that of >3 cycles declines.
Minimum of BRR and cycle structure
The region around the minimum of BRR is very interesting: near min LCB weight 200, the fraction of 2cycles peaks (Figure 4a and b) and the fraction of >3 cycles is at a minimum at a slightly lower min LCB weight than the min of BRR. It is in this overall region that the transformations are least complicated, and the "signal in the genome" is best preserved. There are mainly "simple generalized inversions" since we are simultaneously at the maximum of the 2cycles fraction and at the min of the >3 cycles, hence at the smallest fraction of long cycles, although there is a caveat in that longer cycles have correspondingly higher weights. We next compare our computed cycle structures for humanmouse at different resolutions with the previously derived predictions of cycle structure for random permutations.
Low resolution cycle structure approaches random permutation
We compared the total number of >3 cycles in the humanmouse transformation to the expected distribution for a random permutation having the same number of synteny blocks derived earlier. As resolution decreases, e.g. with increasing min LCB weight (Figure 4c), the number of >3 cycles of the transformation approaches the theoretical distribution derived earlier for this number in a random permutation having the same number of synteny blocks.
Systematic block removal for the 300 kb hm Bourque et al. blocks
Bourque et al. [11], constructed 3way synteny blocks for human, mouse and rat, using GRIMMSyntenywhich can be accessed online [3, 5]. GRIMMSynteny preserves information about microrearrangements within synteny blocks, dividing the blocks into “micro” and “macro” by a choice of parameters. Choosing 300 kb as their cutoff, they arrived at 394 blocks for humanmouse.
We applied the following protocol of successive block removal on these 300 kb “Bourque” blocks. Blocks were removed in a stepwise fashion, starting with the smallest blocks. Remaining blocks were concatenated if they appeared in the same orientation and order in both genomes. Although our approach follows the spirit of the SankoffTrinh [22] block removal procedure, in their approach, blocks were removed from simulated genomes. Ours used a set of blocks that were derived from real data. Theirs were generated so as to have no breakpoint reuse. After removing blocks, the SankoffTrinh procedure amalgamated the proximal blocks in a manner subsequently shown to be faulty [5]. Since their data was simulated as a permutation file, it had no actual size. Since we used real data, the blocks we used have size and orientation.
We calculated breakpoint reuse after each set of blocks was removed, and plotted the resulting curves versus the size of blocks removed (Figure 6). BRR was calculated using GRIMM [25] via the Hannenhalli and Pevzner algorithm [14, 15] and via DCJ [16] with breakpoints evaluated as previously discussed. As before, the differences in these values of breakpoint reuse are not significant, but are mainly due to the different methods of counting breakpoints (internal for GRIMM and internal+external for DCJ). The change in breakpoint reuse with block removal (for either BRR measure) for the Bourque et al. blocks is not significant. As the breakpoint reuse rates do not change significantly with block removal, we hypothesize that this situation is similar to the plateau region in Figures 3 and 4a and 4b where the genomic permutation is "randomized". In the next sections we perform the same block removal procedure on the Mauve blocks most closely corresponding to the 300 kb Bourque et al. blocks.
Corespondence between Mauve blocks and "Bourqueet al." blocks
No direct correspondence can be made between the Mauve synteny blocks and those of Bourque et al 2004, since Mauve blocks and the Bourque et al. 3way synteny blocks for human mouse rat were generated from different genome assemblies. By comparing various parameters including GRIMM and DCJ distance, breakpoints, and Breakpoint Reuse Rate (BRR), we derived a rough correspondence (Tables 1 and 2); the Mauve blocks with min LCB wt 141 most closely correspond to both the 300 kb and 1 Mb Bourque et al. blocks. We call these the Bourque et al.like Mauve blocks.
BRR for the Bourque et al.like Mauve blocks
We repeated our stepwise block removal protocol for the Bourque et al.like Mauve blocks (Mauve blocks at min LCB weight of 141), since after block removal these Mauve blocks most closely resemble the 300 kb Bourque et al. blocks. In addition, blocks at min LCB weight 141 have the minimum observed value of BRR (see Figure 4b). Although both GRIMM and DCJ BRR curves initially rise with block removal for the min LCB weight 141 Mauve blocks, after all blocks smaller than 300 kb are removed, the curves enter a plateau (Figure 6a).
Our prior experience with the sets of Mauve blocks at different resolutions suggests that when enough smaller size blocks are removed, the remaining blocks approach a more randomized cycle structure. The BRR curves in Figure 6a for the Mauve blocks after >300 kb blocks have been removed resemble those for the Bourque et al. blocks (Figure 5) although the BRR values do not match exactly. The existence of the plateau suggests the Bourque et al. blocks have lost rearrangement information encoded by true rearrangement breakpoints and are approaching a random permutation.
>3 cycles for Bourque et al.like Mauve blocks with block removal
In Figure 6b we note that the divergence between the >3 cycle numbers for actual vs expected number of cycles for a random permutation is greatest when no blocks are removed and decreases with the proportion deleted. Although the two curves in Figure 6b never merge, the number of cycles greater than 3 approaches the curve depicting the expected number of >3 cycles lending credence to the notion that the permutation becomes increasingly randomized as blocks are removed.
Discussion
Alternative measure for breakpoint reuse
The work presented here uses the classical definition of breakpoint reuse rate. In a different approach, Bergeron et al. (2008) devised a new way of calculating the breakpoint reuse rate, BRR, and showed that by this definition the BRR is intimately connected to particular rearrangement scenario and model. They redefined the breakpoint reuse rate as:
r = C/b
where C is the total number of cuts made by the operations of the scenario, and b the number of Bvertices (ie in the target genome) in long cycles or paths. In methods that force an artificial closure of paths ending in telomeric adjacencies, and an equalizing of chromosome number resorting to use of null chromosomes (including our formulation of the DCJ), there is no biological basis for cuts performed between caps and gene ends or between caps and caps in null chromosomes. The traditional breakpoint reuse measure can doublecount the number of actual cuts performed for each DCJ in specific scenarios, leading to a severe overestimate of the BRR. Bergeron et al. (2008) followed up on this insight by devising a number of ingenious manoeuvres to find scenarios that either maximize or minimize their statistic. For some long paths it is possible to decrease the number of cuts by a factor of two, thereby radically diminishing the effective breakpoint reuse. By the new definition, the value of the breakpoint reuse rate can become less than 1, achieving the following bounds for humanmouse:
0.89 ≤ r ≤ 1.51
Although the variation we achieved is not quite as dramatic, as that obtained by Bergeron et al. we did attain nearly half this variability just by changing the resolution of the synteny blocks.
In another work on this subject [26], Sinha and Meller investigated the relationship between BRR and synteny blocks using a simple approach to synteny block aggregation depending on two principle parameters: the maximum gap (max_gap) between adjacent blocks to be merged, and the minimum length (min_len) of synteny blocks. They found that the classical breakpoint reuse rate was almost constant for different data sets and a wide range of parameters, which roughly corresponds to our results for BRR in the plateau region (our Figures 3456). Their work did not investigate synteny blocks of the high resolution presented here. Although they do not report the actual chromosomal span of their smallest blocks, the block set they analyzed with the highest synteny block count is generated by GRIMM and, having 2000 blocks, has much fewer than our highest resolution dataset which has about 6500. Like us, they also found that BRR was strongly correlated with HP distance, increasing with more divergent genomes. To compare our results for BRR with theirs, we calculated the minimum, maximum, mean and standard deviation of the BRR over the range in Figure 3 as well as for regions corresponding to the rise in the graph (from min LCB weight 100 to 2717), the plateau (min LCB weight 2717 to 42336), and the fall (min LCB weight 42336 to 76134). The results are summarized in Table 3.
Our results largely agree with those of Sinha and Meller for the plateau region, however, Sinha and Meller did not arrive at values of breakpoint reuse varying as much as ours did overall. Our breakpoint reuse rate varied nearly 20 % of the maximum value.
Conclusions
High values of breakpoint reuse have been used to justify the existence of fragile sites in genome rearrangement scenarios. Although fragile sites may well exist in the course of mammalian genome evolution, we argue that computed high values of the "traditional" breakpoint reuse statistic do not yield conclusive evidence for the existence of such sites. Rather, as we have shown, the cycle structure for such high BRR transformations these genome transformations are more like random permutations. Small synteny blocks may contain critical information about rearrangement history. As small synteny blocks are lost, either by diminished resolution or block removal, the numbers of >3 cycles increasingly approach the expected values for a random permutation. For the Bourque et al. 300 kb blocks, block removal did not diminish BRR. In an experiment in which we best matched the Bourque blocks with Mauve (at min LCB weight 141 blocks) we showed that the behavior of BRR curves of the Mauve blocks upon block removal also had flat levels once all blocks spanning less than 300 kb were removed. We posit that blocks spanning less than 300 kb, which are missing from the Bourque et al. dataset, may encode vital information about the true rearrangement history and associated parameters. The distribution of the number of cycles >3 corroborates our suggestion that information is lost either with diminished resolution or with systematic block removal starting with the smallest blocks. Finally, because the GRIMM and DCJ distance was not significantly different for a great portion of the resolution range, the difference in rearrangement models did not come into play except at the highest resolutions. At these resolutions, evidence for transpositions exists but they comprise less than 2% of all rearrangements.
An implication of our study is that precise definition of synteny blocks both large and small is crucial for accurate inference of rearrangement history parameters. Small blocks matter. Although larger blocks can be predicted more reliably, homology can be confidently predicted even for small regions spanning less than 1000 nucleotides using BLAST statistics. Probabilistic methods for Synteny Block reconstruction [27] can be used to assign a confidence value (or posterior probability) to blocks large and small. Future work might investigate the relationship between filtering blocks using such confidence estimates and rearrangement parameters such as breakpoint reuse, cycle count distributions, and others.
Finally, even though we suggest that the breakpoint reuse rate may be lower than previous estimates, we note that our findings do not preclude the existence of chromosomal regions with an unusually large number of closely spaced rearrangement breakpoints or "fragile regions". Even if breakpoint reuse is low, breakpoints might cluster near each other on the chromosome. Indeed this could be a natural consequence of Nadeau and Taylor’s work: if breakpoints are selected uniformly at random along a genome, the interbreakpoint distances will be geometrically or exponentially distributed (as will synteny block lengths), and clusters of nearby breakpoints may exist purely by chance. However, Pevzner and Tesler [3] inferred an excess of short distances exist between breakpoints over the expected exponential distribution in the random model, concluding this implied the existence of fragile regions solely based on elevated values of their breakpoint reuse statistic. Even though we contend this argument may not hold as we have shown the breakpoint reuse statistic depends significantly on the resolution scale used in the analysis, other investigators have shown the existence of breakpoint clusters [28–30].
Abbreviations
 BI:

block interchange
 BRR:

breakpoint reuse rate
 CI:

circular intermediate
 DCJ:

double cut and join
 GT:

generalized transpositions
 hm:

humanmouse
 hmr:

humanmouserat
 HP:

HannenhalliPevzner
 kb:

kilobase
 LCB:

locally collinear block
 Mb:

megabase.
References
Nadeau JH, Taylor BA: Lengths of chromosomal segments conserved since divergence of man and mouse. Proc Natl Acad Sci USA 1984, 81: 814–818. 10.1073/pnas.81.3.814
Alekseyev MA, Pevzner PA: Are there rearrangement hotspots in the human genome? PLoS Comput Biol 2007, 3(11):e209. 10.1371/journal.pcbi.0030209
Pevzner PA, Tesler G: Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res 2003, 13(1):37–45. 10.1101/gr.757503
Sankoff D, Trinh P: Chromosomal breakpoint reuse in genome sequence rearrangement. J Comput Biol 2005, 12(6):812–821. 10.1089/cmb.2005.12.812
Peng Q, Pevzner PA, Tesler G: The fragile breakage versus random breakage models of chromosome evolution. PLoS Comput Biol 2006, 2(2):e14. 10.1371/journal.pcbi.0020014
Bergeron A, Mixtacki J, Stoye J: On Computing the Breakpoint Reuse Rate in Rearrangement Scenarios. RECOMBCG 2008, 226–240.
Sankoff D: The signal in the genomes. PLoS Comput Biol 2006, 2(4):e35. 10.1371/journal.pcbi.0020035
Xu W, Zheng C, Sankoff D: Paths and cycles in breakpoint graph of random multichromosomal genomes. J Comput Biol 2007, 14(4):423–435. 10.1089/cmb.2007.A004
Xu W, Alain B, Sankoff D: Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases. Bioinformatics 2008, 24(16):i146–152. 10.1093/bioinformatics/btn295
Bouvel M, Chauve C, Mishna M, Rossin D: Averagecase analysis of perfect sorting by reversals. In Combinatorial Pattern Matching: 2009; Lille. Volume 5577. France: LNCS; 2009:314–325. 10.1007/9783642024412_28
Bourque G, Pevzner PA, Tesler G: Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Res 2004, 14(4):507–516. 10.1101/gr.1975204
Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14(7):1394–1403. 10.1101/gr.2289704
Bergeron A, Mixtacki J, J S: A unifying view of genome rearrangements. In Algorithms in Bioinformatics Proceedings of WABI 2006 Edited by: Moret B. 2006.
Hannenhalli S, Pevzner PA: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Proceedings of the 27th Annual ACM Symposium on the Theory of Computing 1995, 178–189.
Hannenhalli S, Pevzner PA: Transforming men into mice (polynomial algorithm for genomic distance problem). Proceedings of the IEEE 36th Annual Symposium on Foundations of Computer Science: 1995, 581 592.
Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 2005, 21(16):3340–3346. 10.1093/bioinformatics/bti535
Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol 2009, 10: 1311–1338.
Friedberg R, Darling A, Yancopoulos S: Genome rearrangement by the double cut and join operation. In Methods Mol Biol Bioinformatics Vol I Data, Sequence Analysis and Evolution. Volume 1. Edited by: JM K. Humana Press; 2008:385–416.
Swenson KM, Lin Y, Rajan V, Moret BM: Hurdles Hardly Have to Be Heeded. In RECOMBCG '08: Paris. France: SpringerVerlag; 2008:241251.
Bafna V, Pevzner P: Genome rearrangements and sorting by reversals. In Proc 34th Ann IEEE Symp Found Comp Sci. IEEE Press; 1993:148–157.
Ma J, Ratan A, Raney BJ, Suh BB, Miller W, Haussler D: The infinite sites model of genome evolution. Proc Natl Acad Sci U S A 2008, 105(38):14254–14261. 10.1073/pnas.0805217105
Sankoff D, Haque L: The distribution of genomic distance between random genomes. J Comput Biol 2006, 13(5):1005–1012. 10.1089/cmb.2006.13.1005
Stanley R: Enumerative combinatorics, Volume I. Cambridge University Press; 2002:29–37.
Darling AC, Treangen TJ, Zhang L, Kuiken C, Messeguer X, Perna NT: Procrastination leads to efficient filtration for local multiple alignments. Lecture Notes in Bioinformatics 4175 2006, 126–137.
Tesler G: GRIMM: genome rearrangements web server. Bioinformatics 2002, 18(3):492–493. 10.1093/bioinformatics/18.3.492
Sinha AU, Meller J: Sensitivity analysis for reversal distance and breakpoint reuse in genome rearrangements. Pac Symp Biocomput 2008, 13: 37–48.
Mau B, Darling AC, Perna NT: Identifying evolutionarily conserved segments among multiple divergent and rearranged genomes. In RECOMB Comparative Genomics: October 16. Volume 2004. Bertinoro, Italy: SpringerVerlag; 2004:72–84.
Lemaitre C, Tannier E, Gautier C, Sagot MF: Precise detection of rearrangement breakpoints in mammalian chromosomes. BMC Bioinformatics 2008, 9: 286. 10.1186/147121059286
Trinh P, McLysaght M, Sankoff D: Genomic features in the breakpoint regions between syntenic blocks. Bioinformatics 2004, 20(Suppl 1):I318I325. 10.1093/bioinformatics/bth934
Hinsch H, Hannenhalli S: Recurring genomic breaks in independent lineages support genomic fragility. BMC Evolutionary Biology 2006, 6: 90. 10.1186/14712148690
Acknowledgements
We are grateful to Richard Friedberg for discussions illuminating breakpoint reuse and the cycle structure of genomes and for providing the theoretical predictions for cycle structure as well as the number of cycles > 3 for random permutations. We thank John Sikorski for writing the perl scripts to run GRIMM and for help with data analysis. We thank Glenn Tesler for clarifying issues around internal vs external breakpoints as well as the differences in breakpoint reuse between the DCJ and HP formulations. We also thank our reviewers for their constructive criticisms and suggestions. OA is supported by Simon Daefler; SY thanks Nicholas Chiorazzi, MD for support.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S9.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
AED served as coeditor for the supplement, but was not involved with the review of this paper. OA and SY declare that they have no competing interests.
Authors' contributions
SY conceived the idea of using Mauve and the DCJ algorithm to investigate the behavior of BRR based on resolution using real data. SY and AED designed the approach using Mauve for this purpose. AED generated the 3way locally collinear synteny blocks for humanmouserat genomes using Mauve. SY and OA designed and implemented the DCJ program. OA wrote perl and shell scripts, and R programs to process and analyze the data. SY, OA and AED analyzed the results. SY, OA and AED wrote and edited the manuscript. All authors read and approved the final manuscript.
Oliver Attie, Aaron E Darling and Sophia Yancopoulos contributed equally to this work.
Electronic supplementary material
12859_2011_4808_MOESM1_ESM.pdf
Additional file 1: Appendix on Capping Examples for the DCJ This appendix provides illustrative examples on how capping is performed in our version of the DCJ paradigm. (PDF 115 KB)
12859_2011_4808_MOESM2_ESM.tgz
Additional file 2: Permutation file These files contain the permutations for the humanmouserat collinear synteny blocks which are output by Mauve. (TGZ 2 MB)
12859_2011_4808_MOESM3_ESM.tgz
Additional files 3: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 4 MB)
12859_2011_4808_MOESM4_ESM.tgz
Additional files 4: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 7 MB)
12859_2011_4808_MOESM5_ESM.tgz
Additional files 5: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 3 MB)
12859_2011_4808_MOESM6_ESM.tgz
Additional files 6: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 7 MB)
12859_2011_4808_MOESM7_ESM.tgz
Additional files 7: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 1 MB)
12859_2011_4808_MOESM8_ESM.tgz
Additional files 8: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 1 MB)
12859_2011_4808_MOESM9_ESM.tgz
Additional files 9: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 2 MB)
12859_2011_4808_MOESM10_ESM.tgz
Additional files 10: LCB files These files contain the lcbs for the humanmouserat collinear synteny blocks files which are output by Mauve. In addition, there are three files giving the lengths of the chromosomes of each genome. (TGZ 832 KB)
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Attie, O., Darling, A.E. & Yancopoulos, S. The rise and fall of breakpoint reuse depending on genome resolution. BMC Bioinformatics 12 (Suppl 9), S1 (2011). https://doi.org/10.1186/1471210512S9S1
Published:
DOI: https://doi.org/10.1186/1471210512S9S1