Refining the breakpoints
We start by describing the core of the method, that is, the narrowing down of the breakpoints as precisely as possible, given a set of synteny blocks.
We are given two sequenced genomes and the synteny blocks between them. Since we wish to locate all the breakpoints in one genome, the method is not symmetric: one genome is the reference and is denoted by G_{r}, while the other genome, to which it is compared, is denoted by G_{o}. A synteny block is defined by a pair (A_{r}, A_{o}) of coordinates, one (A_{r}) in genome G_{r} and the other (A_{o}) in genome G_{o}, each indicating a chromosome, a start point and an end point. A breakpoint on G_{r} is a region between two synteny blocks that are consecutive on G_{r}, but not on G_{o}. Assuming that the synteny blocks are correct, it is certain that in this region, or in one of its orthologs on the other genome, there has been at least one break due to a rearrangement. The size of the region can vary from several base pairs to several million base pairs. As mentioned in the introduction, we do not know a priori whether this imprecision is due to a biological property of the region, or to the fact that orthology has not been detected beyond the extremities of the synteny blocks. We are interested in refining this region as much as possible in order to eliminate the latter possibility. The refinement of the region is done by aligning the region in between the two synteny blocks in G_{r} with the regions flanking the orthologs of the blocks in G_{o}. The results of the alignment are then coded into a numerical sequence, which is partitioned to identify the narrowed breakpoint.
Alignment
Given two synteny blocks (A_{r}, A_{o}) and (B_{r}, B_{o}) that are consecutive in G_{r}, three sequences of interest are defined (see Figure 1): S_{r} corresponds to the region in G_{r} between A_{r} and B_{r}, while S_{oA} and S_{oB} are the sequences flanking A_{o} and B_{o} in G_{o} (respecting the orientation) up to the extremity of the next synteny blocks on G_{o}. At least part of the sequences S_{oA} and S_{oB} is expected to be orthologous to sequence S_{r}. As an example, in Figure 1, the prefix of sequence S_{r} should be orthologous to the prefix of sequence S_{oA} and its suffix to the suffix of sequence S_{oB}. Depending on the nature of the markers, their extremities may be poorly conserved and the limits of the synteny blocks thus not very clear. The markers at the extremities of the blocks may then be added to the sequences S_{r}, S_{oA} and S_{oB}. For example, in the case of genes whose orthology relationship has been assigned based on similarity criteria at the amino-acid level, the orthologous genes may not be alignable over their whole length at the DNA level, and thus the extremities of the genes on the genomes may not be orthologous. This can also be due to errors in the prediction of gene boundaries. Including the genes at the extremities of the blocks in the sequences thus allows these problems to be overcome.
To identify the orthologous relationships between sequence S_{r} and the two sequences S_{oA} and S_{oB}, we perform two local alignments: one of sequence S_{r} against sequence S_{oA}, and another of sequence S_{r} against sequence S_{oB}. For this we use the algorithm Blastz [15], after having masked the sequences for repeats with RepeatMasker [16]. The choice of a local aligner like Blastz is motivated by the fact that the sequences are usually (for the most part) intergenic, and therefore in general not well conserved over their whole length; Blastz has been shown to be more sensitive on such sequences [15]. Large indels, small rearrangements and duplications are possible, and the sequences are also often long. These two characteristics call for an algorithm that is both sensitive and fast enough.
Two lists of local alignments, called hits, are obtained and mapped onto sequence S_{r}, regardless of their orientations and locations in the sequences S_{oA} and S_{oB}. We expect to have significantly more hits from S_{oA} in the part of sequence S_{r} close to block A_{r}, and more hits from sequence S_{oB} close to B_{r}. In the region in between, in general no clear difference can be made between the amount of hits from S_{oA} and S_{oB}. This defines the refined breakpoint.
Segmentation
To detect this region in a quantitative manner, the information provided by the hits is coded along the sequence S_{r}, giving as a result a sequence I of discrete values, of which only three are possible: 1, 0 and -1. The value is 1 (resp. -1) if position i of S_{r} is covered by at least one hit from sequence S_{oA} (resp. S_{oB}) and no hit from sequence S_{oB} (resp. S_{oA}). The value is zero if position i is covered by at least one hit from each of the sequences S_{oA} and S_{oB}. The positions covered by no hits are deleted from sequence I. Thus sequence I has length n, the number of positions covered by at least one hit.
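As a concrete illustration, the coding step described above can be sketched as follows (a minimal Python sketch; the interval representation of the hits, and all names, are our own assumptions, not part of the method's description):

```python
def code_sequence(length, hits_A, hits_B):
    """Code alignment hits along S_r into the sequence I.

    hits_A / hits_B: lists of (start, end) intervals (0-based, end
    exclusive) of hits from S_oA / S_oB mapped onto S_r. Returns the
    list I, keeping only positions covered by at least one hit:
    1 if covered only by hits from S_oA, -1 if only by hits from S_oB,
    0 if covered by hits from both."""
    cov_A = [False] * length
    cov_B = [False] * length
    for s, e in hits_A:
        for i in range(s, e):
            cov_A[i] = True
    for s, e in hits_B:
        for i in range(s, e):
            cov_B[i] = True
    I = []
    for a, b in zip(cov_A, cov_B):
        if a and b:
            I.append(0)       # covered by both S_oA and S_oB
        elif a:
            I.append(1)       # covered only by S_oA
        elif b:
            I.append(-1)      # covered only by S_oB
        # positions covered by no hit are dropped from I
    return I
```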
The problem we solve is then, given the sequence I of 1, 0 and -1's, to find the optimal segmentation of I into three segments, such that the first presents an orthology relationship with sequence S_{oA}, the third segment an orthology relationship with sequence S_{oB}, and the segment in between corresponds to the breakpoint. We define two change points u_{1} and u_{2} over the sequence I of length n, such that 1 ≤ u_{1} ≤ u_{2} ≤ n. The sequence is modelled by a piecewise constant function with the values μ_{1}, μ_{2} and μ_{3} respectively in the three segments.
We are looking for the two change points u_{1} and u_{2} that minimise the sum of the squares of the deviations of the data to the model (called the objective function):
f({u}_{1},{u}_{2})={\sum}_{j=1}^{3}{\sum}_{k={u}_{j-1}+1}^{{u}_{j}}{({I}_{k}-{\mu}_{j})}^{2},\quad \text{with } {u}_{0}=0 \text{ and } {u}_{3}=n.
(1)
The values of μ_{1}, μ_{2} and μ_{3} are defined as follows:

{\mu}_{1}=\begin{cases}\frac{{\sum}_{k=1}^{{u}_{1}}{I}_{k}}{{u}_{1}} & \text{if } {\sum}_{k=1}^{{u}_{1}}{I}_{k}>0\\ \infty & \text{otherwise}\end{cases}

{\mu}_{2}=0

{\mu}_{3}=\begin{cases}\frac{{\sum}_{k={u}_{2}+1}^{n}{I}_{k}}{n-{u}_{2}} & \text{if } {\sum}_{k={u}_{2}+1}^{n}{I}_{k}<0\\ \infty & \text{otherwise}\end{cases}
In the middle segment, μ_{2} equals zero, thus representing the breakpoint. In the two extreme segments, the value of the function is the observed mean of I over the segment if the latter has the "correct sign". In order for the first (resp. last) segment to represent an orthology relationship with sequence S_{oA} (resp. S_{oB}), the value of the function must be positive (resp. negative), meaning that the segment contains more hits with S_{oA} than with S_{oB} (resp. with S_{oB} than with S_{oA}). If the observed mean in the segment has the wrong sign, the value is set to infinity; this ensures that the segment will not be part of the optimal solution, since the contributions over it will be infinite.
Observe also that u_{1} can be equal to 0, or to u_{2}, and u_{2} can be equal to n. This provides the possibility for some segments to be empty, and thus to segment sequence I into fewer than three segments.
Since the objective function is additive over the contributions of the positions, a dynamic programming algorithm efficiently provides the optimal partition [17, 18]. Notice that, since the number of change points is two, a simple algorithm scanning all possible partitions would be as efficient: the execution time grows with the square of the length n of sequence I.
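For illustration, the quadratic scan over all possible partitions can be sketched as follows (a Python sketch using the developed form of the objective function, where the sum of squared deviations expands to a constant minus two segment terms; all names are ours):

```python
import math

def segment(I):
    """Exhaustive search for the change points (u_1, u_2) minimising the
    objective function f of Equation (1), scanning all O(n^2) partitions.

    Returns (u_1, u_2, f). Candidate partitions whose extreme-segment
    means would have the wrong sign (mu_1 not positive, mu_3 not
    negative) are discarded, since their cost would be infinite."""
    n = len(I)
    prefix = [0]
    for v in I:
        prefix.append(prefix[-1] + v)           # prefix[i] = I_1 + ... + I_i
    sq = sum(v * v for v in I)                  # constant term: sum of I_k^2
    best = (0, n, math.inf)
    for u1 in range(0, n + 1):
        left = prefix[u1]
        if u1 > 0 and left <= 0:
            continue                            # mu_1 would have the wrong sign
        left_term = left * left / u1 if u1 > 0 else 0.0
        for u2 in range(u1, n + 1):
            right = prefix[n] - prefix[u2]
            if u2 < n and right >= 0:
                continue                        # mu_3 would have the wrong sign
            right_term = right * right / (n - u2) if u2 < n else 0.0
            f = sq - left_term - right_term     # developed form of Equation (1)
            if f < best[2]:
                best = (u1, u2, f)
    return best
```

On the toy sequence [1, 1, 1, 0, -1, -1], the optimum is the intuitive one: the breakpoint segment is the single 0 between the runs of 1's and -1's.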
Speedup
The problem we solve is, however, more constrained than the classical change-point detection problem. We show that the two change points u_{1} and u_{2} can be found independently, in time linear in the length of sequence I, instead of using the classical dynamic programming algorithm in O(n^{2}).
Lemma 1. Given the sequence I of size n, such that for all k ∈ {1, ..., n}, I_{k} ∈ {-1, 0, 1}, the positions u_{1} and u_{2}, with u_{1} ≤ u_{2}, that minimise the function f(u_{1}, u_{2}) (see Formula (1)) are such that:

\frac{1}{\sqrt{{u}_{1}}}{\sum}_{k=1}^{{u}_{1}}{I}_{k} is maximal,

\frac{1}{\sqrt{n-{u}_{2}}}{\sum}_{k={u}_{2}+1}^{n}{I}_{k} is minimal.
Proof: First, by developing the square terms of each sum in function (1), we obtain:
f({u}_{1},{u}_{2})={\sum}_{k=1}^{n}{I}_{k}^{2}-\frac{1}{{u}_{1}}{({\sum}_{k=1}^{{u}_{1}}{I}_{k})}^{2}-\frac{1}{n-{u}_{2}}{({\sum}_{k={u}_{2}+1}^{n}{I}_{k})}^{2}
The first term is a constant and the other two are independent from each other. Thus f(u_{1}, u_{2}) is minimal when the two last terms are both maximal, that is when {S}_{1}({u}_{1})=\frac{1}{\sqrt{{u}_{1}}}{\sum}_{k=1}^{{u}_{1}}{I}_{k} is maximal (since it must be positive), and {S}_{2}({u}_{2})=\frac{1}{\sqrt{n-{u}_{2}}}{\sum}_{k={u}_{2}+1}^{n}{I}_{k} is minimal (since it must be negative). However the solution must respect the condition u_{1} ≤ u_{2}. We show next that this condition is always fulfilled.
Let x_{1} (resp. x_{2}) be the position on I that maximises S_{1}(u_{1}) (resp. minimises S_{2}(u_{2})). Suppose x_{1} > x_{2}.
Then:
{S}_{1}({x}_{1})=\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{1}}{I}_{k}=\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}+\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}
{S}_{2}({x}_{2})=\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{2}+1}^{n}{I}_{k}=\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}+\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}

if {\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}\ge 0, then {S}_{2}({x}_{2})\ge \frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}>\frac{1}{\sqrt{n-{x}_{1}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}={S}_{2}({x}_{1}), thus S_{2}(x_{2}) is not minimal.

else ({\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}<0), {S}_{1}({x}_{1})<\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}<\frac{1}{\sqrt{{x}_{2}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}={S}_{1}({x}_{2}), thus S_{1}(x_{1}) is not maximal.
We therefore have that x_{1} ≤ x_{2}. □
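By Lemma 1, the search therefore decomposes into two independent linear scans, which can be sketched as follows (a Python sketch; names are ours, and the empty extreme segments u_{1} = 0 or u_{2} = n allowed by the full method are not scanned here):

```python
import math

def refine(I):
    """Find the two change points independently, as Lemma 1 allows:
    u_1 maximises S_1(u) = (sum_{k<=u} I_k) / sqrt(u), and u_2 minimises
    S_2(u) = (sum_{k>u} I_k) / sqrt(n - u), each in one linear scan."""
    n = len(I)
    prefix = [0]
    for v in I:
        prefix.append(prefix[-1] + v)          # prefix[u] = I_1 + ... + I_u
    # u_1 in 1..n maximising S_1
    u1 = max(range(1, n + 1), key=lambda u: prefix[u] / math.sqrt(u))
    # u_2 in 0..n-1 minimising S_2
    u2 = min(range(0, n), key=lambda u: (prefix[n] - prefix[u]) / math.sqrt(n - u))
    return u1, u2
```

The lemma guarantees that the two optima satisfy u_{1} ≤ u_{2}, so no joint search is needed and the whole computation takes O(n) time.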
Statistical test
Whatever the structure of the signal, the method will output the best segmentation of the data into at most three segments. It is therefore important to test if the data are actually structured into three segments, respecting the constraints mentioned above, or if there is no such structure in sequence I. In the latter case, the obtained change points do not make statistical sense and we must conclude that we are not able to refine the breakpoint based on the alignments.
The more a given sequence I of length n is structured into three segments, the lower the value of the minimised objective function (that is, the sum of the squares of the deviations of the data from the model), and thus the better the quality of the fit. What we need to test is therefore whether this fit is significantly better than the one that could be obtained with a non-structured sequence.
The null model is obtained by shuffling sequence I and computing for each permutation the value of the objective function corresponding to an optimal segmentation. Since I represents the alignment hits, the positions are not independent from one another, and the values of 1 and -1 (corresponding to such hits) appear clustered. To take this structure into account in the shuffling operation, we do not shuffle individual positions, but instead blocks of consecutive identical values, given by the extremities of the hits. We accept the null hypothesis that I is not structured if more than five percent of the random permutations have a value that is lower than the value obtained by I.
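The block-shuffling scheme can be sketched as follows (a Python sketch; the `cost` callable stands for the minimised objective of the optimal segmentation, which is not reimplemented here, and all names are ours):

```python
import random

def null_accepted(blocks, cost, observed_f, n_perm=1000, alpha=0.05, seed=0):
    """Block-shuffling permutation test for the segmentation of I.

    `blocks` is sequence I cut into runs of consecutive identical values,
    as delimited by the hit extremities; whole blocks are shuffled so that
    the clustering of the 1's and -1's is preserved under the null model.
    Returns True if the null hypothesis (I not structured) is accepted,
    i.e. if more than `alpha` of the permutations reach a lower objective
    value than the one observed on I."""
    rng = random.Random(seed)
    beat = 0
    for _ in range(n_perm):
        shuffled = blocks[:]
        rng.shuffle(shuffled)
        permuted = [v for block in shuffled for v in block]
        if cost(permuted) < observed_f:
            beat += 1
    return beat / n_perm > alpha
```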
Building the synteny blocks
Motivation
We now describe our own method for finding the synteny blocks. The general goal of such a method is to detect unbroken chains of markers which appear in the same order and same orientation in both genomes. Depending on the nature of the markers however, the orthologous relationships can be more or less reliable, and some errors or misleading relationships may disrupt regions of conserved order and orientation. This is why in general more flexible blocks of synteny are constructed.
Methods for identifying orthologous markers and for constructing synteny blocks are numerous in the scientific literature, starting with blocks built from physical or genetic maps of the chromosomes [19], up to conserved segments of genes or blocks grouping genomic markers from whole-genome alignments [20–30]. The method to refine breakpoints described in the previous section requires two properties of the synteny blocks. First, they must not overlap on one or the other genome, because this would lead to non-existing sequences S_{r}, S_{oA} or S_{oB}, since the latter are defined as the sequences that stand between two consecutive blocks. Second, the extremities of the synteny blocks must correspond to an orthology, because they are used to define the sequences that will then be aligned. For example, in a block (A_{r}, A_{o}), the sequence at one extremity of A_{r} should be orthologous to the sequence at the corresponding extremity of A_{o}.
Few methods for finding blocks available in the literature satisfy these requirements, and those that do are either very computationally intensive or are incompletely described heuristics. For example, GRIMM-Synteny [28] builds blocks by clustering markers that are close together, and keeps, among the maximal clusters thus detected, only those that are bigger than a threshold. The synteny blocks built by GRIMM-Synteny may thus overlap. Furthermore, since markers are clustered based on a distance criterion regardless of their order and orientation, the boundaries of a synteny block on either of the two genomes may be defined by two markers which are not orthologous. For instance, in the example of Figure 2, the synteny block composed of the three markers a, b and c would end, according to GRIMM-Synteny, with marker c in the first genome and marker b in the second.
The synteny blocks defined by Sankoff et al. in [22] do not overlap, and orthologous markers inside a block appear in the same order in both genomes. However, since the problem they propose for computing synteny blocks is NP-hard, the authors introduce some constraints in order to reduce their original dataset of markers and thus address the complexity issue. This complexity comes from the fact that the authors do try to solve the conflicts that may appear, and from how they do it. Indeed, when two blocks overlap, the authors attempt to choose which one is the "best" according to some criterion; the one adopted corresponds to maximising the overall number of markers used.
Various other methods based on chaining techniques produce synteny blocks that may overlap [21, 24, 27, 29, 31]. They are based on the same principle: two markers are chained if they appear in the same order and orientation in both genomes, and if they stand close enough to one another, in terms either of a number of intervening (out-of-order) markers, or of the physical distance separating them. Only long enough chains are kept, the length corresponding to the number of markers in the chain or the number of nucleotides covered by the chain. None of these methods mentions, and therefore deals with, the problem of overlaps and of conflicts between distinct chains.
We therefore propose our own method for finding the synteny blocks. It is formally defined, which makes it possible to prove some properties of the blocks found, in particular the ones that are necessary for the refinement of the breakpoints. Like all other methods, it takes as input pairs of homologous markers described by their position and orientation on each genome. Though our approach is closely related to those of other published methods (all consist in chaining markers), working with formal definitions of the objects we are looking for guarantees that the synteny blocks we use satisfy the precisely characterised properties we need.
Description
We take as input an integer number k and a set of anchors between two genomes G_{r} and G_{o}. We call anchor a pair of markers, one in G_{r} and one in G_{o}, which are orthologous. We consider only markers that do not overlap in either genome and that are part of exactly one anchor (no marker is in more than one anchor). If chromosomes are arbitrarily oriented (with a starting point and a reading direction), a marker can be identified by the chromosome it lies in, its position on the chromosome and its orientation (with respect to the starting point of the chromosome). Since we are interested in the relative order of markers on chromosomes, we consider the rank of a marker on a chromosome rather than its physical position on it. An anchor is then identified by a pair of chromosomes, a pair of ranks (the ranks of both markers on each chromosome), and a relative orientation. For instance, let a be an anchor; then a is identified by (c_{r}, c_{o}, a_{r}, a_{o}, σ_{a}), with c_{r} and a_{r} the chromosome and rank, respectively, of the marker on G_{r}, c_{o} and a_{o} the chromosome and rank, respectively, of the orthologous marker on G_{o}, and σ_{a} equal to +1 if the two markers have the same orientation, and -1 otherwise.
If two distinct anchors a and b are located on the same chromosome in both species, the distance between a and b, denoted by d(a, b), is the maximum of the rank differences between a and b on each genome: if a_{r}, a_{o} (resp. b_{r}, b_{o}) are the ranks of anchor a (resp. b) on the genomes G_{r} and G_{o}, then d(a, b) = max(|b_{r} - a_{r}|, |b_{o} - a_{o}|).
Thus, if a and b are consecutive on each genome, the distance between them is one. If two anchors contain markers that are not on the same chromosome in at least one of the species, then the distance is ∞.
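In code, this distance can be written directly from the definition (a Python sketch over anchor tuples (c_r, c_o, a_r, a_o, σ), a representation we assume for illustration):

```python
import math

def distance(a, b):
    """Distance d(a, b) between two anchors: the maximum of the absolute
    rank differences on each genome, or infinity when the two anchors do
    not lie on the same pair of chromosomes in both species.

    Anchors are tuples (c_r, c_o, rank_r, rank_o, sigma)."""
    if a[0] != b[0] or a[1] != b[1]:
        return math.inf
    return max(abs(b[2] - a[2]), abs(b[3] - a[3]))
```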
Let then \mathcal{G}_{k} be a directed graph with the anchors as vertices, and an arc between two distinct anchors a and b with a_{r} < b_{r}, if d(a, b) ≤ k, and either (a_{o} < b_{o} and both anchors have a positive orientation) or (a_{o} > b_{o} and both anchors have a negative orientation). Arcs are identified by the labels of their start and end vertices (anchors). Thus the arc between anchors a and b, if it exists, is denoted by ab.
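The construction of the graph \mathcal{G}_{k} can be sketched as follows (a naive O(n^2) Python sketch over anchor tuples (c_r, c_o, a_r, a_o, σ), a representation we assume; in practice, sorting the anchors by rank yields the faster construction whose complexity is discussed below):

```python
def build_graph(anchors, k):
    """Build the arcs of the directed graph G_k: an arc a -> b exists
    when a_r < b_r, d(a, b) <= k, and the ranks on G_o progress
    according to the common orientation of the two anchors.

    Anchors are tuples (c_r, c_o, rank_r, rank_o, sigma)."""
    arcs = []
    for a in anchors:
        for b in anchors:
            if a is b or a[0] != b[0] or a[1] != b[1]:
                continue                      # not on the same chromosome pair
            if a[2] >= b[2]:
                continue                      # require a_r < b_r
            if max(b[2] - a[2], abs(b[3] - a[3])) > k:
                continue                      # distance d(a, b) must be <= k
            if (a[4] == b[4] == 1 and a[3] < b[3]) or \
               (a[4] == b[4] == -1 and a[3] > b[3]):
                arcs.append((a, b))
    return arcs
```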
At this step, if k is bigger than one, one anchor can be chained to several other anchors, possibly leading to connected components that overlap regarding genomic positions, or to boundaries of connected components that are not defined by a unique anchor. We say in this case that there is a conflict. We define two types of conflict:

conflict of type I: two arcs ab and cd belonging to the same connected component of \mathcal{G}_{k} are said to be conflicting if the markers of the anchors a, b, c, d do not appear in the same order in both genomes (an example where a = c is given in Figure 2);

conflict of type II: an arc ab in a component C is conflicting if there exists an anchor x whose markers appear between the markers of a and b in one of the two genomes, and x belongs to a connected component of at least k vertices, different from C (see Figure 3).
Let now \mathcal{H}_{k} be the subgraph of \mathcal{G}_{k} which contains all the non-conflicting arcs of \mathcal{G}_{k}. A k-block is a connected component of \mathcal{H}_{k} containing more than k vertices. The extremal genomic coordinates of the k-blocks define the synteny blocks.
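Given the non-conflicting arcs, extracting the k-blocks amounts to a connected-components computation, which can be sketched with a simple union-find (a Python sketch; the filtering of type I and type II conflicts is assumed to have been done beforehand, and all names are ours):

```python
def k_blocks(anchors, arcs, k):
    """Group anchors into k-blocks: connected components, with more than
    k vertices, of the graph whose arcs are the given non-conflicting
    arcs. Uses union-find with path compression."""
    parent = {a: a for a in anchors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x

    for a, b in arcs:
        parent[find(a)] = find(b)             # union the two components
    components = {}
    for a in anchors:
        components.setdefault(find(a), []).append(a)
    return [c for c in components.values() if len(c) > k]
```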
The absence of conflicting arcs of type II ensures that the synteny blocks never intersect. Moreover, the absence of conflicting arcs of type I ensures that in each component the markers can be totally ordered in both genomes. This implies in particular that a single anchor is unambiguously present at each extremity, so we have the required property that the extremities of the blocks in each genome are orthologous.
A polynomialtime computation of the blocks is straightforward from the definition: if n is the number of anchors, the computation of the ranks needs the application of a sorting procedure over all anchors in both genomes (in time O(n log n)). Then, the computation of the graph takes time O(n × k), and produces at most k × n arcs. For each arc, detecting conflicts requires the comparison to at most k other arcs or vertices, leading to a total time complexity of O(n × k^{2} + n log n), where k is a fixed parameter (we chose k = 2 in all our experiments).
Discussion on the method
Our method for finding synteny blocks is flexible and outputs totally ordered and non-intersecting k-blocks, which is exactly the input required by the refinement method described in the previous section. Indeed, we build blocks of synteny by chaining orthologous markers that appear in the same order and orientation in the two genomes, while allowing for a number of intervening markers. The maximum degree of flexibility allowed is controlled by a single parameter k.
This contrasts with most of the other methods, which use two parameters [27–29, 31]: one (denoted by d) for the maximum distance allowed between two anchors for them to be chained, and the other (denoted by S) for the minimum size of a block to be retained. However, one should fix d ≤ S to prevent a block from lying inside another one. On the other hand, one should fix d ≥ S to prevent a block of size less than S, which we thus consider as irrelevant, from breaking a bigger one. This is why we fix d = S and denote it by k. In practice, these two parameters d and S are often assigned the same value when the methods are applied (see for example [6, 28, 29, 31]).
Flexibility is necessary, at least when dealing with markers whose orthology has been inferred from alignment methods. Indeed, false positives are quite common in this case, particularly in the presence of paralogous sequences. Thus, some false orthology assignments can generate "false" breakpoints, that is, regions which have not been rearranged in either of the two lineages. The greater k is, the more reliable the synteny blocks are, since they are supported by more anchors. However, using a bigger k has the drawback of missing small blocks (in number of anchors). The outcome is not only fewer breakpoints, but also a decrease in the resolution of the remaining breakpoints. Indeed, if a block is missed inside a breakpoint, we may not be able to refine that breakpoint efficiently.
Another outcome of introducing flexibility is that it may produce conflicts. Conflicting arcs represent several chaining possibilities (conflict of type I) or overlapping ones (conflict of type II). Instead of introducing constraints that may not always have an obvious or universal biological meaning, we choose not to solve the conflicts, but instead to discard them. This may seem like a radical solution and, indeed, it produces blocks that are sometimes not as long as they could be if we attempted to solve the conflicts. However, finding the synteny blocks is just one step towards refining the breakpoints, and we find it preferable to use reliable blocks. The information lost in this initial step will in most cases be recovered in the second step. If removing conflicting arcs implies only the reduction of a block at one of its extremities, the removed extremities on the two genomes will be aligned during the refinement step. On the other hand, if removing conflicting arcs implies missing a whole block, this block will probably not be recovered.