Refining the breakpoints
We start by describing the core of the method, that is, the narrowing down of the breakpoints as precisely as possible, given a set of synteny blocks.
We are given two sequenced genomes and the synteny blocks between them. Since we wish to locate all the breakpoints in one genome, the method is not symmetric: one genome is the reference and is denoted by G_{r}, while the other genome, to which it is compared, is denoted by G_{o}. A synteny block is defined by a pair (A_{r}, A_{o}) of coordinates, one (A_{r}) in genome G_{r} and the other (A_{o}) in genome G_{o}, each indicating a chromosome, a start point and an end point. A breakpoint on G_{r} is a region between two synteny blocks that are consecutive on G_{r}, but not on G_{o}. Assuming that the synteny blocks are correct, it is certain that in this region, or in one of its orthologs on the other genome, there has been at least one break due to a rearrangement. The size of the region can vary from several base pairs to several million base pairs. As mentioned in the introduction, we do not know a priori whether this imprecision is due to a biological property of the region, or to the fact that orthology has not been detected beyond the extremities of the synteny blocks. We are interested in refining this region as much as possible in order to eliminate the latter possibility. The refinement of the region is done by aligning the region in between the two synteny blocks in G_{r} with the regions flanking the orthologs of the blocks in G_{o}. The results of the alignment are then coded into a numerical sequence, which is partitioned to identify the narrowed breakpoint.
Alignment
Given two synteny blocks (A_{r}, A_{o}) and (B_{r}, B_{o}) that are consecutive in G_{r}, three sequences of interest are defined (see Figure 1): S_{r} corresponds to the region in G_{r} between A_{r} and B_{r}, while S_{oA} and S_{oB} are the sequences flanking A_{o} and B_{o} in G_{o} (respecting the orientation) up to the extremity of the next synteny blocks on G_{o}. At least part of the sequences S_{oA} and S_{oB} is expected to be orthologous to sequence S_{r}. As an example, in Figure 1, the prefix of sequence S_{r} should be orthologous to the prefix of sequence S_{oA} and its suffix to the suffix of sequence S_{oB}. Depending on the nature of the markers, their extremities may be poorly conserved and the limits of the synteny blocks thus not very clear. The markers at the extremities of the blocks may then be added to the sequences S_{r}, S_{oA} and S_{oB}. For example, in the case of genes whose orthology relationship has been assigned based on similarity criteria at the amino-acid level, the orthologous genes may not be alignable over their whole length at the DNA level, and thus the extremities of the genes on the genomes may not be orthologous. This can also be due to errors in the prediction of gene boundaries. Including the genes at the extremities of the blocks in the sequences thus allows these problems to be overcome.
To identify the orthologous relationships between sequence S_{r} and the two sequences S_{oA} and S_{oB}, we perform two local alignments: one of sequence S_{r} against sequence S_{oA}, and another of sequence S_{r} against sequence S_{oB}. For this we use the algorithm Blastz [15], after having masked the sequences for repeats with RepeatMasker [16]. The choice of a local aligner like Blastz is motivated by the fact that the sequences are usually (for the most part) intergenic, and therefore in general not well conserved over their whole length; Blastz has been shown to be more sensitive on such sequences [15]. Large indels, small rearrangements and duplications are possible, and the sequences are also often long. These two characteristics call for an algorithm that is both sensitive and fast enough.
Two lists of local alignments, called hits, are obtained and mapped onto sequence S_{r}, regardless of their orientations and locations in the sequences S_{oA} and S_{oB}. We expect to have significantly more hits from S_{oA} in the part of sequence S_{r} close to block A_{r}, and more hits from sequence S_{oB} close to B_{r}. In the region in between, in general no clear difference can be made between the amount of hits from S_{oA} and S_{oB}. This defines the refined breakpoint.
Segmentation
To detect this region in a quantitative manner, the information provided by the hits is coded along the sequence S_{r}, giving as a result a sequence I of discrete values, of which only three are possible: 1, 0 and -1. The value is 1 (resp. -1) if position i of S_{r} is covered by at least one hit from sequence S_{oA} (resp. S_{oB}) and no hit from sequence S_{oB} (resp. S_{oA}). The value is zero if position i is covered by at least one hit from each of the sequences S_{oA} and S_{oB}. The positions covered by no hits are deleted from sequence I. Thus sequence I has length n, the number of positions covered by at least one hit.
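As a concrete illustration, the coding step described above can be sketched as follows (a minimal Python sketch; the interval representation of the hits, and all names, are our own assumptions, not part of the method's description):

```python
def code_sequence(length, hits_A, hits_B):
    """Code alignment hits along S_r into the sequence I.

    hits_A / hits_B: lists of (start, end) intervals (0-based, end
    exclusive) of hits from S_oA / S_oB mapped onto S_r. Returns the
    list I, keeping only positions covered by at least one hit:
    1 if covered only by hits from S_oA, -1 if only by hits from S_oB,
    0 if covered by hits from both."""
    cov_A = [False] * length
    cov_B = [False] * length
    for s, e in hits_A:
        for i in range(s, e):
            cov_A[i] = True
    for s, e in hits_B:
        for i in range(s, e):
            cov_B[i] = True
    I = []
    for a, b in zip(cov_A, cov_B):
        if a and b:
            I.append(0)       # covered by both S_oA and S_oB
        elif a:
            I.append(1)       # covered only by S_oA
        elif b:
            I.append(-1)      # covered only by S_oB
        # positions covered by no hit are dropped from I
    return I
```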
The problem we solve is then, given the sequence I of 1, 0 and -1's, to find the optimal segmentation of I into three segments, such that the first presents an orthology relationship with sequence S_{oA}, the third segment an orthology relationship with sequence S_{oB}, and the segment in between corresponds to the breakpoint. We define two change points u_{1} and u_{2} over the sequence I of length n, such that 1 ≤ u_{1} ≤ u_{2} ≤ n. The sequence is modelled by a piecewise constant function with the values μ_{1}, μ_{2} and μ_{3} respectively in the three segments.
We are looking for the two change points u_{1} and u_{2} that minimise the sum of the squares of the deviations of the data to the model (called the objective function):
f({u}_{1},{u}_{2})={\sum}_{j=1}^{3}{\sum}_{k={u}_{j-1}+1}^{{u}_{j}}{({I}_{k}-{\mu}_{j})}^{2},\quad \text{with } {u}_{0}=0 \text{ and } {u}_{3}=n.
(1)
The values of μ_{1}, μ_{2} and μ_{3} are defined as follows:

{\mu}_{1}=\begin{cases}\frac{{\sum}_{k=1}^{{u}_{1}}{I}_{k}}{{u}_{1}} & \text{if } {\sum}_{k=1}^{{u}_{1}}{I}_{k}>0\\ \infty & \text{otherwise}\end{cases}

{\mu}_{2}=0

{\mu}_{3}=\begin{cases}\frac{{\sum}_{k={u}_{2}+1}^{n}{I}_{k}}{n-{u}_{2}} & \text{if } {\sum}_{k={u}_{2}+1}^{n}{I}_{k}<0\\ \infty & \text{otherwise}\end{cases}
In the middle segment, μ_{2} equals zero, thus representing the breakpoint. In the two extreme segments, the value of the function is the observed mean of I over the segment if the latter has the "correct sign". In order for the first (resp. last) segment to represent an orthology relationship with sequence S_{oA} (resp. S_{oB}), the value of the function must be positive (resp. negative), meaning that the segment contains more hits with S_{oA} than with S_{oB} (resp. with S_{oB} than with S_{oA}). If the observed mean in the segment has the wrong sign, the value is set to infinity; this ensures that the segment will not be part of the optimal solution, since the contributions over it will be infinite.
Observe also that u_{1} can be equal to 0, or to u_{2}, and u_{2} can be equal to n. This provides the possibility for some segments to be empty, and thus to segment sequence I into fewer than three segments.
Since the objective function is additive over the contributions of the positions, a dynamic programming algorithm efficiently provides the optimal partition [17, 18]. Notice that, since the number of change points is two, a simple algorithm scanning all possible partitions would be as efficient: the execution time grows with the square of the length n of sequence I.
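For illustration, the quadratic scan over all possible partitions can be sketched as follows (a Python sketch using the developed form of the objective function, where the sum of squared deviations expands to a constant minus two segment terms; all names are ours):

```python
import math

def segment(I):
    """Exhaustive search for the change points (u_1, u_2) minimising the
    objective function f of Equation (1), scanning all O(n^2) partitions.

    Returns (u_1, u_2, f). Candidate partitions whose extreme-segment
    means would have the wrong sign (mu_1 not positive, mu_3 not
    negative) are discarded, since their cost would be infinite."""
    n = len(I)
    prefix = [0]
    for v in I:
        prefix.append(prefix[-1] + v)           # prefix[i] = I_1 + ... + I_i
    sq = sum(v * v for v in I)                  # constant term: sum of I_k^2
    best = (0, n, math.inf)
    for u1 in range(0, n + 1):
        left = prefix[u1]
        if u1 > 0 and left <= 0:
            continue                            # mu_1 would have the wrong sign
        left_term = left * left / u1 if u1 > 0 else 0.0
        for u2 in range(u1, n + 1):
            right = prefix[n] - prefix[u2]
            if u2 < n and right >= 0:
                continue                        # mu_3 would have the wrong sign
            right_term = right * right / (n - u2) if u2 < n else 0.0
            f = sq - left_term - right_term     # developed form of Equation (1)
            if f < best[2]:
                best = (u1, u2, f)
    return best
```

On the toy sequence [1, 1, 1, 0, -1, -1], the optimum is the intuitive one: the breakpoint segment is the single 0 between the runs of 1's and -1's.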
Speedup
The problem we solve is, however, more constrained than the classical change-point detection problem. We show that the two change points u_{1} and u_{2} can be found independently, in time linear in the length of sequence I, instead of using the classical dynamic programming algorithm in O(n^{2}).
Lemma 1. Given the sequence I of size n, such that for all k ∈ {1, ..., n}, I_{k} ∈ {-1, 0, 1}, the positions u_{1} and u_{2}, with u_{1} ≤ u_{2}, that minimise the function f(u_{1}, u_{2}) (see Formula (1)) are such that:

\frac{1}{\sqrt{{u}_{1}}}{\sum}_{k=1}^{{u}_{1}}{I}_{k} is maximal,

\frac{1}{\sqrt{n-{u}_{2}}}{\sum}_{k={u}_{2}+1}^{n}{I}_{k} is minimal.
Proof: First, by developing the square terms of each sum in function (1), we obtain:
f({u}_{1},{u}_{2})={\sum}_{k=1}^{n}{I}_{k}^{2}-\frac{1}{{u}_{1}}{({\sum}_{k=1}^{{u}_{1}}{I}_{k})}^{2}-\frac{1}{n-{u}_{2}}{({\sum}_{k={u}_{2}+1}^{n}{I}_{k})}^{2}
The first term is a constant and the other two are independent from each other. Thus f(u_{1}, u_{2}) is minimal when the two last terms are both maximal, that is when {S}_{1}({u}_{1})=\frac{1}{\sqrt{{u}_{1}}}{\sum}_{k=1}^{{u}_{1}}{I}_{k} is maximal (since it must be positive), and {S}_{2}({u}_{2})=\frac{1}{\sqrt{n-{u}_{2}}}{\sum}_{k={u}_{2}+1}^{n}{I}_{k} is minimal (since it must be negative). However the solution must respect the condition u_{1} ≤ u_{2}. We show next that this condition is always fulfilled.
Let x_{1} (resp. x_{2}) be the position on I that maximises S_{1}(u_{1}) (resp. minimises S_{2}(u_{2})). Suppose x_{1} > x_{2}.
Then:
{S}_{1}({x}_{1})=\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{1}}{I}_{k}=\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}+\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}
{S}_{2}({x}_{2})=\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{2}+1}^{n}{I}_{k}=\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}+\frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}

if {\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}\ge 0, then {S}_{2}({x}_{2})\ge \frac{1}{\sqrt{n-{x}_{2}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}>\frac{1}{\sqrt{n-{x}_{1}}}{\sum}_{k={x}_{1}+1}^{n}{I}_{k}={S}_{2}({x}_{1}), thus S_{2}(x_{2}) is not minimal.

else ({\sum}_{k={x}_{2}+1}^{{x}_{1}}{I}_{k}<0), {S}_{1}({x}_{1})<\frac{1}{\sqrt{{x}_{1}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}<\frac{1}{\sqrt{{x}_{2}}}{\sum}_{k=1}^{{x}_{2}}{I}_{k}={S}_{1}({x}_{2}), thus S_{1}(x_{1}) is not maximal.
We therefore have that x_{1} ≤ x_{2}. □
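By Lemma 1, the search therefore decomposes into two independent linear scans, which can be sketched as follows (a Python sketch; names are ours, and the empty extreme segments u_{1} = 0 or u_{2} = n allowed by the full method are not scanned here):

```python
import math

def refine(I):
    """Find the two change points independently, as Lemma 1 allows:
    u_1 maximises S_1(u) = (sum_{k<=u} I_k) / sqrt(u), and u_2 minimises
    S_2(u) = (sum_{k>u} I_k) / sqrt(n - u), each in one linear scan."""
    n = len(I)
    prefix = [0]
    for v in I:
        prefix.append(prefix[-1] + v)          # prefix[u] = I_1 + ... + I_u
    # u_1 in 1..n maximising S_1
    u1 = max(range(1, n + 1), key=lambda u: prefix[u] / math.sqrt(u))
    # u_2 in 0..n-1 minimising S_2
    u2 = min(range(0, n), key=lambda u: (prefix[n] - prefix[u]) / math.sqrt(n - u))
    return u1, u2
```

The lemma guarantees that the two optima satisfy u_{1} ≤ u_{2}, so no joint search is needed and the whole computation takes O(n) time.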
Statistical test
Whatever the structure of the signal, the method will output the best segmentation of the data into at most three segments. It is therefore important to test if the data are actually structured into three segments, respecting the constraints mentioned above, or if there is no such structure in sequence I. In the latter case, the obtained change points do not make statistical sense and we must conclude that we are not able to refine the breakpoint based on the alignments.
The more a given sequence I of length n is structured into three segments, the lower the value of the minimised objective function (that is, the sum of the squares of the deviations of the data from the model), and thus the better the quality of the fit. What we need to test is therefore whether this fit is significantly better than the one that could be obtained with a non-structured sequence.
The null model is obtained by shuffling sequence I and computing for each permutation the value of the objective function corresponding to an optimal segmentation. Since I represents the alignment hits, the positions are not independent from one another, and the values of 1 and -1 (corresponding to such hits) appear clustered. To take this structure into account in the shuffling operation, we do not shuffle individual positions, but instead blocks of consecutive identical values, given by the extremities of the hits. We accept the null hypothesis that I is not structured if more than five percent of the random permutations have a value that is lower than the value obtained by I.
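The block-shuffling scheme can be sketched as follows (a Python sketch; the `cost` callable stands for the minimised objective of the optimal segmentation, which is not reimplemented here, and all names are ours):

```python
import random

def null_accepted(blocks, cost, observed_f, n_perm=1000, alpha=0.05, seed=0):
    """Block-shuffling permutation test for the segmentation of I.

    `blocks` is sequence I cut into runs of consecutive identical values,
    as delimited by the hit extremities; whole blocks are shuffled so that
    the clustering of the 1's and -1's is preserved under the null model.
    Returns True if the null hypothesis (I not structured) is accepted,
    i.e. if more than `alpha` of the permutations reach a lower objective
    value than the one observed on I."""
    rng = random.Random(seed)
    beat = 0
    for _ in range(n_perm):
        shuffled = blocks[:]
        rng.shuffle(shuffled)
        permuted = [v for block in shuffled for v in block]
        if cost(permuted) < observed_f:
            beat += 1
    return beat / n_perm > alpha
```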
Building the synteny blocks
Motivation
We now describe our own method for finding the synteny blocks. The general goal of such a method is to detect unbroken chains of markers which appear in the same order and same orientation in both genomes. Depending on the nature of the markers however, the orthologous relationships can be more or less reliable, and some errors or misleading relationships may disrupt regions of conserved order and orientation. This is why in general more flexible blocks of synteny are constructed.
Methods for identifying orthologous markers and for constructing synteny blocks are numerous in the scientific literature, starting with blocks built from physical or genetic maps of the chromosomes [19], up to conserved segments of genes or blocks grouping genomic markers from whole-genome alignments [20–30]. The method to refine breakpoints described in the previous section requires two properties of the synteny blocks. First, they must not overlap on one or the other genome, because this would lead to non-existing sequences S_{r}, S_{oA} or S_{oB}, since the latter are defined as the sequences that stand between two consecutive blocks. Second, the extremities of the synteny blocks must correspond to an orthology, because they are used to define the sequences that will then be aligned. For example, in a block (A_{r}, A_{o}), the sequence at one extremity of A_{r} should be orthologous to the sequence at the corresponding extremity of A_{o}.
Few methods for finding blocks available in the literature satisfy these requirements, and those that do are either very computationally intensive or are incompletely described heuristics. For example, GRIMM-Synteny [28] builds blocks by clustering markers that are close together, and keeps, among the maximal clusters thus detected, only those that are bigger than a threshold. The synteny blocks built by GRIMM-Synteny may thus overlap. Furthermore, since markers are clustered based on a distance criterion regardless of their order and orientation, the boundaries of a synteny block on either of the two genomes may be defined by two markers which are not orthologous. For instance, in the example of Figure 2, the synteny block composed of the three markers a, b and c would end, according to GRIMM-Synteny, with marker c in the first genome and marker b in the second.
The synteny blocks defined by Sankoff et al. in [22] do not overlap, and orthologous markers inside a block appear in the same order in both genomes. However, since the problem they propose for computing synteny blocks is NP-hard, the authors introduce some constraints in order to reduce their original dataset of markers and thus address the complexity issue. This complexity comes from the fact that the authors do try to solve the conflicts that may appear, and from how they do it. Indeed, when two blocks overlap, the authors attempt to choose which one is the "best" according to some criterion; the one adopted corresponds to maximising the overall number of markers used.
Various other methods based on chaining techniques produce synteny blocks that may overlap [21, 24, 27, 29, 31]. They are based on the same principle: two markers are chained if they appear in the same order and orientation in both genomes, and if they stand close enough to one another, in terms either of a number of intervening (out-of-order) markers, or of the physical distance separating them. Only long enough chains are kept, the length corresponding to the number of markers in the chain or the number of nucleotides covered by the chain. None of these methods mentions, and therefore deals with, the problem of overlaps and of conflicts between distinct chains.
We therefore propose our own method for finding the synteny blocks. It is formally defined, which makes it possible to prove some properties of the blocks found, in particular the ones that are necessary for the refinement of the breakpoints. Like all other methods, it takes as input pairs of homologous markers described by their position and orientation on each genome. Though our approach is closely related to those of other published methods (all consist in chaining markers), working with formal definitions of the objects we are looking for guarantees that the synteny blocks we use satisfy the precisely characterised properties we need.
Description
We take as input an integer number k and a set of anchors between two genomes G_{r} and G_{o}. We call anchor a pair of markers, one in G_{r} and one in G_{o}, which are orthologous. We consider only markers that do not overlap in either genome and that are part of exactly one anchor (no marker is in more than one anchor). If chromosomes are arbitrarily oriented (with a starting point and a reading direction), a marker can be identified by the chromosome it lies in, its position on the chromosome and its orientation (with respect to the starting point of the chromosome). Since we are interested in the relative order of markers on chromosomes, we consider the rank of a marker on a chromosome rather than its physical position on it. An anchor is then identified by a pair of chromosomes, a pair of ranks (the ranks of both markers on each chromosome), and a relative orientation. For instance, let a be an anchor; then a is identified by (c_{r}, c_{o}, a_{r}, a_{o}, σ_{a}), with c_{r} and a_{r} the chromosome and rank, respectively, of the marker on G_{r}, c_{o} and a_{o} the chromosome and rank, respectively, of the orthologous marker on G_{o}, and σ_{a} equal to +1 if the two markers have the same orientation, and -1 otherwise.
If two distinct anchors a and b are located on the same chromosome in both species, the distance between a and b, denoted by d(a, b), is the maximum of the rank differences between a and b on each genome: if a_{r}, a_{o} (resp. b_{r}, b_{o}) are the ranks of anchor a (resp. b) on the genomes G_{r} and G_{o}, then d(a, b) = max(|b_{r} - a_{r}|, |b_{o} - a_{o}|).
Thus, if a and b are consecutive on each genome, the distance between them is one. If two anchors contain markers that are not on the same chromosome in at least one of the species, then the distance is ∞.
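In code, this distance can be written directly from the definition (a Python sketch over anchor tuples (c_r, c_o, a_r, a_o, σ), a representation we assume for illustration):

```python
import math

def distance(a, b):
    """Distance d(a, b) between two anchors: the maximum of the absolute
    rank differences on each genome, or infinity when the two anchors do
    not lie on the same pair of chromosomes in both species.

    Anchors are tuples (c_r, c_o, rank_r, rank_o, sigma)."""
    if a[0] != b[0] or a[1] != b[1]:
        return math.inf
    return max(abs(b[2] - a[2]), abs(b[3] - a[3]))
```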
Let then \mathcal{G}_{k} be a directed graph with the anchors as vertices, and an arc between two distinct anchors a and b with a_{r} < b_{r}, if d(a, b) ≤ k, and either (a_{o} < b_{o} and both anchors have a positive orientation) or (a_{o} > b_{o} and both anchors have a negative orientation). Arcs are identified by the labels of their start and end vertices (anchors). Thus the arc between anchors a and b, if it exists, is denoted by ab.
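The construction of the graph \mathcal{G}_{k} can be sketched as follows (a naive O(n^2) Python sketch over anchor tuples (c_r, c_o, a_r, a_o, σ), a representation we assume; in practice, sorting the anchors by rank yields the faster construction whose complexity is discussed below):

```python
def build_graph(anchors, k):
    """Build the arcs of the directed graph G_k: an arc a -> b exists
    when a_r < b_r, d(a, b) <= k, and the ranks on G_o progress
    according to the common orientation of the two anchors.

    Anchors are tuples (c_r, c_o, rank_r, rank_o, sigma)."""
    arcs = []
    for a in anchors:
        for b in anchors:
            if a is b or a[0] != b[0] or a[1] != b[1]:
                continue                      # not on the same chromosome pair
            if a[2] >= b[2]:
                continue                      # require a_r < b_r
            if max(b[2] - a[2], abs(b[3] - a[3])) > k:
                continue                      # distance d(a, b) must be <= k
            if (a[4] == b[4] == 1 and a[3] < b[3]) or \
               (a[4] == b[4] == -1 and a[3] > b[3]):
                arcs.append((a, b))
    return arcs
```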
At this step, if k is bigger than one, one anchor can be chained to several other anchors, possibly leading to connected components that overlap regarding genomic positions, or to boundaries of connected components that are not defined by a unique anchor. We say in this case that there is a conflict. We define two types of conflict:

conflict of type I: two arcs ab and cd belonging to the same connected component of \mathcal{G}_{k} are said to be conflicting if the markers of the anchors a, b, c, d do not appear in the same order in both genomes (an example where a = c is given in Figure 2);

conflict of type II: an arc ab in a component C is conflicting if there exists an anchor x whose markers appear between the markers of a and b in one of the two genomes, and x belongs to a connected component of at least k vertices, different from C (see Figure 3).
Let now \mathcal{H}_{k} be the subgraph of \mathcal{G}_{k} which contains all the non-conflicting arcs of \mathcal{G}_{k}. A k-block is a connected component of \mathcal{H}_{k} containing more than k vertices. The extremal genomic coordinates of the k-blocks define the synteny blocks.
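Given the non-conflicting arcs, extracting the k-blocks amounts to a connected-components computation, which can be sketched with a simple union-find (a Python sketch; the filtering of type I and type II conflicts is assumed to have been done beforehand, and all names are ours):

```python
def k_blocks(anchors, arcs, k):
    """Group anchors into k-blocks: connected components, with more than
    k vertices, of the graph whose arcs are the given non-conflicting
    arcs. Uses union-find with path compression."""
    parent = {a: a for a in anchors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x

    for a, b in arcs:
        parent[find(a)] = find(b)             # union the two components
    components = {}
    for a in anchors:
        components.setdefault(find(a), []).append(a)
    return [c for c in components.values() if len(c) > k]
```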
The absence of conflicting arcs of type II ensures that the synteny blocks never intersect. Moreover, the absence of conflicting arcs of type I ensures that in each component the markers can be totally ordered in both genomes. This implies in particular that a single anchor is unambiguously present at each extremity, so we have the required property that the extremities of the blocks in each genome are orthologous.
A polynomialtime computation of the blocks is straightforward from the definition: if n is the number of anchors, the computation of the ranks needs the application of a sorting procedure over all anchors in both genomes (in time O(n log n)). Then, the computation of the graph takes time O(n × k), and produces at most k × n arcs. For each arc, detecting conflicts requires the comparison to at most k other arcs or vertices, leading to a total time complexity of O(n × k^{2} + n log n), where k is a fixed parameter (we chose k = 2 in all our experiments).
Discussion on the method
Our method for finding synteny blocks is flexible and outputs totally ordered and non-intersecting k-blocks, which is exactly the input required by the refinement method described in the previous section. Indeed, we build blocks of synteny by chaining orthologous markers that appear in the same order and orientation in the two genomes, while allowing for a number of intervening markers. The maximum degree of flexibility allowed is controlled by a single parameter k.
This contrasts with most of the other methods, which use two parameters [27–29, 31]: one (denoted by d) for the maximum distance allowed between two anchors for them to be chained, and the other (denoted by S) for the minimum size of a block to be retained. However, one should fix d ≤ S to prevent a block from lying inside another one. On the other hand, one should fix d ≥ S to prevent a block of size less than S, which we thus consider as irrelevant, from breaking a bigger one. This is why we fix d = S and denote it by k. In practice, these two parameters d and S are often assigned the same value when the methods are applied (see for example [6, 28, 29, 31]).
Flexibility is necessary, at least when dealing with markers whose orthology has been inferred from alignment methods. Indeed, false positives are quite common in this case, particularly in the presence of paralogous sequences. Thus, some false orthology assignments can generate "false" breakpoints, that is, regions which have not been rearranged in either of the two lineages. The greater k is, the more reliable the synteny blocks are, since they are supported by more anchors. However, using a bigger k has the drawback of missing small blocks (in number of anchors). The outcome is not only fewer breakpoints, but also a decrease in the resolution of the remaining breakpoints. Indeed, if a block is missed inside a breakpoint, we may not be able to refine that breakpoint efficiently.
Another outcome of introducing flexibility is that it may produce conflicts. Conflicting arcs represent several chaining possibilities (conflict of type I) or overlapping ones (conflict of type II). Instead of introducing constraints that may not always have an obvious or universal biological meaning, we choose not to solve the conflicts, but instead to discard them. This may seem like a radical solution and, indeed, it produces blocks that are sometimes not as long as they could be if we attempted to solve the conflicts. However, finding the synteny blocks is just one step towards refining the breakpoints, and we find it preferable to use reliable blocks. The information lost in this initial step will in most cases be recovered in the second step. If removing conflicting arcs implies only the reduction of a block at one of its extremities, the removed extremities on the two genomes will be aligned during the refinement step. On the other hand, if removing conflicting arcs implies missing a whole block, this block will probably not be recovered.