PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees

Background Extant genomes share regions where genes have the same order and orientation, which are thought to arise from the conservation of an ancestral order of genes during evolution. Such regions of so-called conserved synteny, or synteny blocks, must be precisely identified and quantified, as a prerequisite to better understand the evolutionary history of genomes. Results Here we describe PhylDiag, a software that identifies statistically significant synteny blocks in pairwise comparisons of eukaryote genomes. Compared to previous methods, PhylDiag uses gene trees to define gene homologies, thus allowing gene deletions to be considered as events that may break the synteny. PhylDiag also accounts for gene orientations, blocks of tandem duplicates and lineage specific de novo gene births. Starting from two genomes and the corresponding gene trees, PhylDiag returns synteny blocks with gaps less than or equal to the maximum gap parameter gapmax. This parameter is theoretically estimated, and together with a utility to graphically display results, contributes to making PhylDiag a user friendly method. In addition, putative synteny blocks are subject to a statistical validation to verify that they are unlikely to be due to a random combination of genes. Conclusions We benchmark several known metrics to measure 2D-distances in a matrix of homologies and we compare PhylDiag to i-ADHoRe 3.0 on real and simulated data. We show that PhylDiag correctly identifies small synteny blocks even with insertions, deletions, incorrect annotations or micro-inversions. Finally, PhylDiag allowed us to identify the most relevant distance metric for 2D-distance calculation between homologies. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-268) contains supplementary material, which is available to authorized users.

: Different ways of defining gene families. Figure A represents a species tree with two extant species S a and S b . LCA(S a , S b ) is the last common ancestor of species S a and S b , CA 1 (S a , S b ) is a common ancestor of S a and S b , and CA 2 (S a , S b ) is another common ancestor of S a and S b that lived more recently than CA 1 (S a , S b ). Figure B represents a gene tree within the species tree. This gene tree is represented in simple 3D schema for better visualisation. In a gene tree, squares represent duplication events, circles represent speciation events and crosses represent deletion events. Figure  Filled cells represent homologies. Homologies of the same colour belong to the same sb. The symbol in a homology represents g a • g b , the sign of the homology, where g a and g b are the two homologous genes. For convenience, a sign +1 is denoted + and a sign −1 is denoted −. c a can be written [g a,k ] k∈ [1,8] , where g a,1 is the leftmost gene of c a and g a,8 is the rightmost gene of c a . Adjacent homologous genes are considered as tandem duplicates thus the second and third genes of c a , g a,2 and g a,3 , are tandem duplicates. g a,2 has an orientation equal to −1 whereas g a,3 has an orientation equal to +1. c a can be rewritten with 6 tbs, c a = [tb a,k ] k∈ [1,6] , where tb a,1 is the leftmost tb and tb a,6 is the rightmost tb. Starting from the left, tb a,2 , the second tb of c a , has an unknown orientation, all the other tbs have an orientation equal to +1.
In the MH, rectangles of non-0 values represent hps. In this example there are 4 hps. The first hp has a size 2 × 1, the second hp has a size 1 × 1, the third hp has a size 2 × 3 and the last hp has a size 1 × 1. tb a,4 = c a [5 → 6] is in a homology relation with tb b,4 = c b [4 → 6]. Thus the corresponding hp is the submatrix MH[5 → 6, 4 → 6]. This hp is said to have a size 2 × 3, with 2 the size of tb a,4 and 3 the size of tb b,4 . Figure S2b is the corresponding MHP of the MH of figure S2a after rewriting the chromosomes with tbs. Along the X-axis and the Y-axis, arrows represent oriented tbs. The rectangle on the X-Axis represents a tb with an unknown orientation. The values in the arrows are the sizes of each tbs. The symbol in a hp represents tb a • tb b , the sign of the hp, where tb a and tb b are the two homologous tbs. For convenience, a sign +1 is denoted + and a sign −1 is denoted − whereas an unknown sign is denoted ∅. The bottom-most and left-most hp corresponds to a tb of size 2 (on the X-axis) with an unknown orientation in a homology relation with a tb of size 1 with an orientation equal to +1 (on the Y-axis). Thus the sign of this hp is unknown, i.e. equal to ∅. Going top and right, the third hp corresponds to a tb of size 2 (on the X-axis) in a homology relation with a tb of size 3 (on the Y-axis). Both tbs have an orientation equal to +1. Thus the sign of the corresponding hp is +1. All hps have a sign equal to +1 except for the first hp which has a sign equal to ∅.

Distance metric formulas
If (x 0 , y 0 ) and (x 1 , y 1 ) are the coordinates of two positions in the MHP, depending on the metric used, the distances between these two positions are given by the formulas: Where [x] is the nearest integer of x. CD stands for Chebyshev Distance metric, ED for Euclidean Distance metric, MD stands for Manhattan Distance metric and DPD stands for Diagonal Pseudo Distance metric. It is easy to construct figure S3 with these formulas.

Strict consistent diagonals are putative strict synteny blocks
We will demonstrate that extracting putative strict sbs (containing no gaps between tbs) of c a and c b is equivalent to extracting strict and consistent diagonals of hps in the related MHP. Let [g i ] i∈[s,s+l−1] be an uninterrupted ancestral sequence of l tbs in LCA(S a , S b ), each tb is a unique gene. If this sequence of tbs is a strict sb, the ancestral sequence remains an uninterrupted sequence of tbs up to the two compared species, furthermore, within the sb, tbs order is conserved and tbs orientations either remain conserved or change from a known to an unknown orientation. Therefore the sb is present in a chromosome c a of S a and is also present in a chromosome c b of S b . Without loss of generality, we arbitrarily choose the reference order of the tbs of the sb to be the same as in c a , so g s+k Htb a,sa+k ∀k ∈ [0, l − 1]. Depending on the choice of the order of the tbs in c b , the order of the syntenic tbs in c b may thus be in the same order as the syntenic tbs in c a , or in the reverse order. Thus two cases must be treated (cf figure S4): • The order of the syntenic tbs in c b is conserved in the same order as the syntenic tbs in c a so there is a row [tb a,i ] i∈[sa,sa+l −1] in c a and a row [ tb a,sa+k • tb b,s b +k = +1 or ∅ and the diagonal is composed of hps with signs equal to either +1 or ∅, It is thus a strict and consistent slash diagonal.
• The order of the syntenic tbs in c b is conserved in the reverse order compared to the syntenic tbs in c a so there is a row [tb a,i ] i∈[sa,sa+l −1] in c a and a row [ This corresponds to a strict backslash diagonal in the MHP.
Furthermore, when the tbs [g i ] i∈[s,s+l−1] do not evolve from a known to an unknown orientation, the tbs conserve their orientations within the sb, i.e. relatively to the choice of the order of tbs in the sb, from g s to g s+l−1 . Therefore the orientation of tb b,s b +k relatively to sa tb a,sa+l−1 is the same orientation as the reference orientation of c a but now is the reverse orientation compared to the reference orientation of c b , thus, when we consider the orientations of tbs on their respective chromosomes, It is thus a strict and consistent backslash diagonal.
We demonstrated that a strict sb conserved from LCA(S a , S b ) to S a and S b generates a strict and consistent diagonal in the MHP of a chromosome c a in G a and a chromosome c b in G b , either a strict and consistent slash diagonal or a strict and consistent backslash diagonal.
It may be, in theory, that a strict and consistent diagonal in a MHP does not correspond to a sb if some tbs, brought in adjacent positions, generate strict and consistent diagonals by chance. However the statistical validation of PhylDiag ensures that when such a case is highly probable the strict and consistent diagonal is not considered as a signature of a strict synteny block. That is why we consider that a strict and consistent diagonal is only a putative strict synteny block. A strict and consistent diagonal is considered as a synteny block if it passes the statistical validation. ancestral synteny block Figure S4: Provenance of the distinction between slash and backslash diagonals. In the leftmost MHP, the order of the chromosome c b defines a reference orientation that is in the same orientation as the orientation of the synteny block. With this order of tbs in the chromosome c b , the synteny block yields a strict and consistent slash diagonal, that goes up according to a direction from bottom-left to top-right. In the rightmost MHP, the order of the chromosome c b defines a reference orientation that is in the reverse orientation compared to the orientation of the synteny block. With this order of tbs in the chromosome c b , the synteny block yields a strict and consistent backslash diagonal, that goes down according to a direction from top-left to bottom-right.

Algorithm f indDiagT ype
f indDiagT ype sets the diagonal type at the beginning of a strict and consistent diagonal extraction using the sign of the first hp if the sign is known or using the position of the second hp if their is a second hp. If two known diagonal types (either slash or backslash) are possible, the slash type is chosen by default. By convention the algorithm gives an orientation unknown to a single hp not involved in a strict diagonal. 6 Demonstration of the p d formula Using the reasoning of [2], in a MHP of two chromosomes c a and c b of n a and n b tbs without dispersed paralogy, involving n ab hps, the probability of obtaining exactly k hps in a window W ab of size l a × l b is: It is easy to demonstrate that p d (1, 1, 1, n ab , n a , n b ) = n ab na×n b , the density of the MHP. We also have the extreme case p d (n ab , n a , n b , n ab , n a , n b ) = 1. The numerator is the number of ways to fill the chromosomal windows W a (length l a tbs) and W b (length l b tbs) with tbs from c a and c b in order to obtain exactly k hps in W ab . The denominator is the total number of ways to fill W a and W b with tbs from c a and c b . This denominator is simply the total number of combinations of l a tbs among the n a tbs times the total number of combinations of l b tbs among the n b tbs. The first term of the numerator n ab k corresponds to the number of ways to choose the k hps of W ab among the total number of n ab hps. Once these hps are chosen, it remains l a − k tbs to choose in order to fill W a . Because there are many ways to choose them, we sum over all the possible combinations. Each of these may involve a different number of "coloured tbs", tbs that have a hp in the MHP (see figure S5). Considering that we choose to put i coloured tbs in W a , n ab − k i counts all the possible combinations of i coloured tbs among the n ab − k remaining coloured tbs. n a − n ab l a − (k + i) counts the number of combinations to finish to fill W a with "grey tbs", tbs that do not have hps in the MHP (see figure S5). Finally the last term, n b − (k + i) l b − k corresponds to the number of ways of choosing the remaining l b − k tbs in c b in order to fill W b , while avoiding choosing the i coloured tbs that would generate a hp in W ab because of the i coloured tbs that we have already placed in W a . We numerically verified that p d (k, l a , l b , n ab , n a , n b ) = p d (k, l b , l a , n ab , n b , n a ). The probability that k hps form a consistent diagonal is This probability is equal to the probability to form a consistent slash diagonal or a consistent backslash diagonal. The case k = 1 allows an extension of the formula to "diagonals" that contain 1 hp. In this case the intersection of the two probabilities is not null and they cannot be directly summed. In other words, the fact that a "diagonal" of 1 hp can be both a slash and a backslash diagonal if the hp sign is ∅ is a special case.
9 Explanation of the p w formula In a MHP of two chromosomes c a and c b of n a and n b tbs without dispersed paralogy, involving n ab hps, the probability that in a window W ab of size l a × l b there is at least one consistent diagonal containing at least m hps spaced by gaps ≤ g is p w (m, g, l a , l b , n ab , n a , n b ) = Only varying parameters are shown in the right-hand side of the equation in the preceding formula. Since p d (k) ∀k ∈ [m, min(n ab , l a , l b )] are the probabilities of having exactly k homologies in a window of size l a × l b , we can add these probabilities without removing the probabilities of the intersections. The second sum allows some hps in W ab to not be involved in a consistent diagonal with gaps ≤ g. If we already know that there is at least m hps in W ab , k i=m p g,2D (i, g)p o,o (i) is an upper bound for the probability that there are at least m hps forming a consistent diagonal with gaps ≤ g in W ab . To be exact we should remove the probabilities of the intersections while summing probabilities. Indeed, the probability of forming a consistent diagonal of 4 hps and the probability of forming a consistent diagonal of 3 hps are dependent since a consistent diagonal of 3 hps is a subset of a consistent diagonal of 4 hps. However removing the probability of the intersection is not trivial and we have therefore chosen an upper bound in order to retain the specificity of the statistical filtering. It is easy to verify that p w (m = 1, g = 0, l a = 1, l b = 1, n ab , n a , n b ) = n ab na×n b whatever the values of n ab , n a and n b .
10 Explanation of the passage from a window sampling probability to a whole genome comparison probability Relying on the reasoning of section 4.2 of [1] we adjust the probability p w , corresponding to a window sampling scenario, to compute the probability corresponding to a whole genome comparison. In a MHP of size n a × n b containing n ab hps without dispersed paralogy (see discussion), the probability of finding at least one window W ab of size l a × l b containing at least one consistent diagonal with gaps ≤ g of at least m hps can be approximated by: where n w = n a n b l a l b is the number of windows of width l a and height l b in the MHP such that no window overlap with any other window. Still following the reasoning of [1], this last equation is based on an unwarranted assumption that finding clusters in the various n w windows are independent events. It should be noted that a linearisation considering that p w << 1 highlights the missing O(n a n b ) term: pV al(m, g, l a , l b , n ab , n a , n b ) n w p w = n a n b l a l b p w = O(n a n b )p w (2) 11 Numerical applications of the p-value formula pV al(m = 4, g = 2, l a = 6, l b = 4, n ab = 6, n a = 11, n b = 8) = 3.9 × 10 −4 p g,2D (k = 4, g = 2, l a = 6, l b = 4) = 1.0 because any combination of 4 tbs in W a will create a chain of 4 hps with gaps ≤ g = 2. The same applies to tbs in W b . Thus any cluster is guaranteed to possess gaps lower or equal to 2 in W ab if there is at least 4 hps in W ab . If the cut-off probability α is set to 1 × 10 −3 , since the p-value of this consistent diagonal is lower, this consistent diagonal is validated as a significant synteny block. Here it is obvious that for such small diagonal and small chromosomes, accounting for tbs order and tbs orientations is important for the computation of the p-value.
An example of a more common case would be to compute the p-value of a consistent diagonal which has m = 3 hps, a width l a = 18 tbs, a height l b = 10 tbs and a maximum gap g = 10 tbs. Here results show that even a consistent diagonal with very long gaps may be considered as a relevant sb with a cut-off probability α set to 1 × 10 −3 . This is possible because tbs order and orientations are considered when assessing the statistical relevance of the sb.
Finally, in a real example, we compare human chromosome Y (hY ) to mouse chromosome Y (mY ), where PhylDiag, using the CD metric and a gap max ≥ 5, extracts a consistent diagonal of 3 hps. The maximum gap in this diagonal is g = 5 and the sb is contained within a window W ab of size 8 × 5. hY contains n hY = 25 tbs and mY contains n mY = 16 tbs. The corresponding MHP contains n hY,mY = 7 hps. Statistics on the orientations of tbs gives us P Since the p-value of this consistent diagonal is higher than the cut-off probability α = 1 × 10 −3 , this diagonal is removed during the statistical validation.

Estimation of a recommended maximum gap parameter
As in ColinearScan [3], under the null hypothesis, we assume that homologous tbs are uniformly distributed in chromosomes and we explore the possibility of finding consistent diagonals with gaps ≤ g containing m hps by chance. Although this assumption of a uniform distribution is not strictly correct, we consider it to be reasonable here for the purpose of finding a recommended gap max . We consider that the probability of finding consistent diagonals with gaps ≤ g containing m hps can be calculated from an average MHP. The average MHP has a width n a (respectively a height n b ) equal to the weighted mean of the distribution of chromosome lengths of G a (respectively G b ): where w ca = n ca ca∈Ga n ca is the weight given to the length n ca (in tbs) of c a . The average MHP is designed in order to have the same density as the the whole genome comparison of G a with G b . The density of the whole genome comparison is: where n GaG b is the total number of hps in the whole genome comparison of G a with G b and n Gx is the total number of tbs in G x , thus the number of hps in the average MHP is: For many gaps values (g), a numerical computation of pV al(m, g, l a , l b , n ab , n a , n b ) with l a = l b = (m−1)g +m are performed. The recommended gap max value is defined as the lowest gap g that returns a p-value higher than the target probability P target . m and P target are fixed by the user. Default values are m = 2 and P target = 0.01. For instance, when comparing the human (S h ) to the mouse (S m ) (Ensembl database v72) we have: n h = 18560 tbs, n m = 18934 tbs and n hm = 18236 hps. The average MHP is characterised by n a = 980, n b = 1052 and n ab = 53. With the default values of m and P target , the recommended maximum gap value is 5.
It should be noted that it is not because we use a gap max parameter of 5 that we will statistically validate all consistent diagonals with gaps up to 5. When a consistent diagonal is found it undergoes the statistical validation that depends on the value of the probability threshold α (default value is 1 × 10 −3 ) and also on the characteristics (density of hps, dimensions, ...) of the MHP of the current pairwise comparison of chromosomes that may differ from the characteristics of the average MHP. Here is an example. Still in the comparison of the human genome with the mouse genome, we consider the comparison of the human X chromosome (c hX ) with the mouse X chromosome (c mX ). In the MHP of this comparison, PhylDiag (using the recommended gap max = 5) finds a consistent diagonal of 2 hps characterized by a maximum gap g = 3 and a window W ab of size 3 × 5. Given that n hX = 735 is the number of tbs in c hX and n mX = 726 is the number of tbs in c mX and n hX,mX = 690 is the number of hps involved in the comparison of c hX with c mX , the p-value of the putative synteny block is pV al(2, 3, 3, 5, n hX,mX , n hX , n mX ) = 0.63. Since this p-value is > α = 1 × 10 −3 the putative synteny block is rejected. This is not surprising since n hX,mX n hX n mX = 1.3 × 10 −3 >> n h,m n h nm = 5.1 × 10 −5 .

Simulator
Our simulator first designs an ancestral genome G anc with a user defined number of genes and chromosomes. The length of chromosomes in G anc are expressed in number of genes, and are determined randomly. Simulated evolution gives rise to the two extant genomes G a and G b of two extant species. The simulator performs genic events, which include de novo gene births, deletions, duplications (either tandem or dispersed), and genomic rearrangements, which include chromosome fusions and fissions, segmental translocations or segmental inversions.
Inversions and translocations involve a chromosomal segment. Each time a translocation or an inversion occurs, a chromosome is chosen with a frequency that depends on its length (i.e. the longer a chromosome, the higher the chance that it will be chosen). The length of the rearranged segment is chosen as a proportion of the chromosome length in a density function represented on figure S6 obtained from a modification of the von Mises probability distribution. If it is a translocation, the insertion position is chosen with a uniform The evolutionary scenario is calibrated so as to fit the known evolution of the human and the mouse genome from the Euarchontoglire genome (G anc ). Based on phylogenetic gene tree reconstructions from Ensembl Compara version 72, the Euarchontoglire genome possessed at least 21806 genes that evolved during approximately 90 million years into the human genome on the one hand and into the mouse genome on the other hand. In each simulation, the ancestral Euarchontoglire genome is populated with the same 21806 genes distributed into 20 chromosomes, but in a different random order. The extant human genome contains 20172 genes and the mouse genome contains 22542 genes. According to the forest of gene trees stemming from Euarchontoglire, 3836 gene deletions, 821 de novo gene births, 1381 gene duplications with 791 tandem duplications (57%) and 590 dispersed duplications took place in the human lineage. Similarly, 4060 gene deletions, 1658 de novo gene births and 3138 gene duplications with 1950 tandem duplications (62%) and 1188 dispersed duplications took place in the mouse lineage. We calibrated the rates of rearrangements on each branch, starting from known rates [4] and optimised to visually reproduce the distribution of genes between the real mouse and human genomes (table S1 and figure S7). Of note, we aim here at simulating the evolution of the human and mouse genome in a reasonably realistic way. A proper modelling of this process is out of the scope of this study, so long as simulated data make it possible to compare different methods to identify synteny blocks. (a) Homology matrix of the whole-genome comparison between the real human genome (Y-axis) and the real mouse genome (X-axis) 14 Influence of genic events and chromosomal rearrangements on the MHP and comparison between the DPD and the MD Figure S8 shows how different events disturb or not the linearity of synteny blocks. Based on figure S8, since merging diagonals with the DPD metric attributes more importance to linearity, chosing the DPD metric will allow more small inversions within sbs gaps while considering that genic/segmental indels and wrong annotations break the synteny more easily than with the MD metric. Conversely, merging diagonals with the MD metric gives priority to lateral directions and this allows more small genic/segmental indels and annotation errors within sbs gaps and considers that inversions break the synteny more easily than with the DPD metric. (g) incorrect annotation, missing one gene Figure S8: Examples of evolutionary scenarios of a synteny block. Inversion events create a gap in the sb and do not change the linearity of the sb, whereas gene indels, segmental indels and annotation errors affect the linearity of the sb. Vocabulary synteny block (sb) an ordered sequence of close oriented genes that is conserved in several species gene specific to one lineage a gene that did not exist in the ancestor and that appeared along a lineage from the ancestor to one extant species tandem block (tb) a contiguous sequence of tandem duplicates homology pack (hp) a pack of homologies corresponding to a homology between two tandem blocks sign of a hp a value that indicates whether or not the two corresponding homologous tandem blocks are in the same orientation, the opposite orientation or if at least one tandem block has an unknown orientation gap between two tbs the number of tbs between them distance between two tbs the gap between them minus one distance metric a metric that is used to calculate a distance between two points in a 2D array distance between two hps given a distance metric, the distance between the two points corresponding to the two hps in the MHP gap between two hps the distance between them plus one chain a set of tbs spaced by gaps ≤ gap max cluster a set of hps spaced by gaps ≤ gap max diagonal a diagonal of hps in a MH or a MHP, it may be interpreted as a cluster with a constraint on gene order and gene orientations strict diagonal a diagonal of hps with no gaps between hps slash diagonal a type of diagonal that goes from bottom-left to top-right backslash diagonal a type of diagonal that goes from top-left to bottom-right consistent diagonal a diagonal of hps with signs consistent with the diagonal type putative synteny block a consistent diagonal that may be a synteny block if it passes the statistical validation