 Research article
 Open Access
 Published:
PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees
BMC Bioinformatics volume 15, Article number: 268 (2014)
Abstract
Background
Extant genomes share regions where genes have the same order and orientation, which are thought to arise from the conservation of an ancestral order of genes during evolution. Such regions of socalled conserved synteny, or synteny blocks, must be precisely identified and quantified, as a prerequisite to better understand the evolutionary history of genomes.
Results
Here we describe PhylDiag, a software that identifies statistically significant synteny blocks in pairwise comparisons of eukaryote genomes. Compared to previous methods, PhylDiag uses gene trees to define gene homologies, thus allowing gene deletions to be considered as events that may break the synteny. PhylDiag also accounts for gene orientations, blocks of tandem duplicates and lineage specific de novo gene births. Starting from two genomes and the corresponding gene trees, PhylDiag returns synteny blocks with gaps less than or equal to the maximum gap parameter g a p_{ m a x }. This parameter is theoretically estimated, and together with a utility to graphically display results, contributes to making PhylDiag a user friendly method. In addition, putative synteny blocks are subject to a statistical validation to verify that they are unlikely to be due to a random combination of genes.
Conclusions
We benchmark several known metrics to measure 2Ddistances in a matrix of homologies and we compare PhylDiag to iADHoRe 3.0 on real and simulated data. We show that PhylDiag correctly identifies small synteny blocks even with insertions, deletions, incorrect annotations or microinversions. Finally, PhylDiag allowed us to identify the most relevant distance metric for 2Ddistance calculation between homologies.
Background
Changes in the order of genes in a genome are caused by two categories of mutational events: genic events, which include de novo gene births, deletions, duplications, and genomic rearrangements, which include chromosome fusions and fissions, segmental translocations or segmental inversions. Synteny blocks are composed of those genes that retain an ancestral organisation despite these events, and one way to understand how genic events and genomic rearrangements affect genome evolution is to identify such synteny blocks. The extremities of synteny blocks also define the positions of breakpoints where rearrangements took place. Precisely defining synteny blocks thus allows, in turn, an accurate definition of breakpoints [1], which has important implications from ancestral genome reconstruction [2] to the understanding of genome mutational processes in healthy and disease states [3]. In addition, it has been shown in eukaryotes that some synteny blocks may be under negative selection due to longrange functional constraints between genes and regulatory elements [4, 5].
Several methods have been developed to identify synteny blocks from extant chromosomes comparisons. In the field of bacterial genome evolution, algorithms tend to focus on the notion of “gene team” [6], which denotes a set of genes that stay in the vicinity of each other with no constraint on gene order. Such methods include TEAM [7], HomologyTeams [8], CCCPart [9], CloseUp [10] and MCMuSeC [11].
However, because gene order conservation in eukaryotes is stronger [12] compared to bacteria, algorithms that infer synteny blocks in eukaryotes tend to account for this extra constraint. GRIMMsynteny [13], iADHoRe 3.0 (often just called ADHoRe later) [14–17], DiagHunter [18], LineUp [19], FISH [20], DAGchainer [21], SyMAP [22], ColinearScan [23], Cinteny [24], OrthoCluster [25], Syntenator [26] and Cyntenator [27], MCScan [28] and MCScanX [29], Enredo [30], and DRIMMSynteny [31] are the main algorithms developed to infer synteny blocks in eukaryotes. Many were applied to model species such as Arabidopsis thaliana and rice, among plants, and mammals such as human, mouse, dog and rat, among metazoans. These algorithms can be broadly classified according to their heuristic and features.
Four distinct heuristics are used to infer synteny blocks. The first builds twodimensional matrices filled with homologies [13, 17, 18, 20, 22, 24]. The algorithms analyse the matrices with procedures that resemble those developed in the field of image analysis.
A second heuristic uses optimisation techniques and dynamic programming [19, 21, 28]. Many of the methods that fall in this category are greedy, although with the benefit of often providing more flexibility. Indeed, the choice of the cost parameters in the objective function, allows the user to accurately account for different synteny block characteristics. A third heuristic is based on a modification of the SmithWaterman [32] approach [23, 26] while the last type of heuristic relies on graph editing [30, 31].
Some algorithms compare genomes by performing pairwise comparisons of genomes whereas others perform multigenomes comparisons. Combining pairwise comparisons does not capture the additional significance of genes that are conserved in more than two regions, resulting in underestimation of cluster significance [33]. Multigenomes comparisons are especially relevant for highly diverged synteny blocks and Whole Genome Duplication (WGD) analysis. However, multigenomes comparisons usually require genomes to be reduced to a set of markers shared between all genomes, thus limiting the resolution of the analysis.
The transcriptional orientations of genes on the chromosome are used by some algorithms and provide informations about microrearrangements and may contribute to making the correct choice when there are several possibilities to extend a synteny block. In addition, accounting for gene orientations increases the statistical relevance of small synteny blocks, see [Additional file 1: Section 11].
Gene duplications increase the complexity of identifying synteny blocks. Duplications can be dispersed, or in tandem when the two copies are adjacent. Tandem duplications create blocks of tandem duplicates that disrupt local gene adjacencies without strictly breaking the synteny. In order to overcome blocks of tandem duplicates, algorithms may propose to collapse tandem duplicates into one occurrence by remapping their coordinates [17, 20] or by performing ad hoc editions of the graph of adjacencies [30, 31]. WGDs complicate matters further when new genes copies have been randomly inactivated throughout the genome. Yet some algorithms identify highly diverged synteny blocks or double conserved syntenies caused by WGDs [17, 19].
Once an algorithm has returned putative synteny blocks, a statistical validation can assess their relevance given the input data. A putative synteny block is more likely to be found by chance in a comparison involving a large number of homologies than when few homologies are available. A putative synteny block is also less likely to have occurred by chance if it is composed of a large number of ordered adjacent homologies than if it is composed of a few unordered homologies separated by gaps. Statistical validation may involve either a pvalue, an evalue or a score. The analytical calculation is not a simple task [8, 33–36] and there is no standard pvalue yet established in the field. Simulations are often used to bypass this difficulty, although they are usually time consuming and not very realistic.
To infer a synteny block, each algorithm uses parameters such as the maximum gap g a p_{ m a x } to define the maximum allowed distance between two genes in a synteny block. The g a p_{ m a x } parameter value can be optimised through a theoretical exploration, saving the need to test numerous different values before finding the optimal value [23].
Another important variable is the metric used to allow gaps between genes within a synteny block. Some algorithms use the Diagonal Pseudo Distance [17, 18] whereas others use the Manhattan Distance [13, 20, 22, 24].
Finally, a useful feature is to represent synteny blocks graphically, such as diagonals in a matrix [18], circular views [29] or alignments [14, 29].
Here, we are interested in reconstructing synteny blocks to capture the signals of ancestral gene order and gene orientations in eukaryotic genomes. To this end, we developed PhylDiag, a userfriendly method to identify synteny blocks between two genomes using reconstructed phylogenetic gene trees. The full evolutionary history of each ancestral gene is taken into account in the form of those phylogenetic gene trees, which include in particular gene losses, duplications, 1:1 but also 1:many and many:many homology relationships. All PhylDiag parameters can either be set automatically or be specified by the user. A pvalue calculation provides a statistical basis to select significant blocks and a utility provides graphical representations of identified synteny blocks. Users may also chose among several metrics to allow gene gaps within a synteny block. PhylDiag accounts for tandem duplications and gene orientations, and is thus able to accurately identify small synteny blocks. Among algorithms that already account for gene order and gene orientations, only iADHoRe 3.0, FISH and Enredo also handle tandem duplications, although they do not use gene trees reconstructions. Here we compare PhylDiag to iADHoRe 3.0 [14] (version iADHoRe 3.0.2a) using both real data and simulations.
By introducing the concepts of “tandem blocks” and “homology packs”, PhylDiag overcomes the disruption of gene adjacencies caused by blocks of tandem duplicates. As in other existing methods, PhylDiag allows gaps between genes within synteny blocks up to a customizable maximum gap parameter, and thus bypasses small genic indels (insertions and deletions) and annotation errors. In this study, we also benchmark different metrics used to allow these gaps within a synteny block on simulated data, and show that the choice of the metric has a direct impact on performances.
Methods
After providing basic definitions, we describe the PhylDiag algorithm, which consists of four main parts. First, PhylDiag filters extant genomes. Second, PhylDiag rewrites the genomes from lists of genes to lists of tandem blocks. Third, PhylDiag extracts synteny blocks as diagonals with no gaps by considering the order and orientations of tandem blocks on the chromosomes and then merge these diagonals as long as merges do not generate gaps longer than g a p_{ m a x }. Finally, PhylDiag computes a pvalue to remove diagonals that are likely to be produced by chance rather than being a signature of an ancestral gene order. Before performing these tasks PhylDiag also calculates a recommended value for the maximum gap g a p_{ m a x } to free the user from testing multiple values before finding the appropriate one.
Basic notations and definitions
Genomic conventions
S is a species. Given two species S_{ a } and S_{ b }, LCA(S_{ a },S_{ b }) is the Last Common Ancestor of S_{ a } and S_{ b }. A species S_{ a } has a genome G_{ a } composed of chromosomes. ${c}_{a}=[\phantom{\rule{0.3em}{0ex}}{\mathsf{g}}_{a,1},\dots ,{\mathsf{g}}_{a,{N}_{a}}]=\phantom{\rule{0.3em}{0ex}}{\left[\phantom{\rule{0.3em}{0ex}}{\mathsf{g}}_{a,k}\right]}_{k\in [1,{N}_{a}]}$ is a chromosome of G_{ a } with N_{ a } oriented genes g_{a,k}. The chromosome is chosen to be ordered from g_{a,1} to g_{a,N} and not the reverse, thus defining a reference orientation. The orientation of a gene is determined by the orientation of transcription into RNA, and the orientation of g_{a,k}, denoted o(g_{a,k}), is equal to +1 if transcription is performed in the same direction as $\overrightarrow{{\mathsf{g}}_{a,1}{\mathsf{g}}_{a,N}}$ otherwise o(g_{a,k})= −1. A sublist of c_{ a } is often denoted c_{ a }[i_{ s }→ i_{ e }] where i_{ s } (respectively i_{ e }) is the index of the starting (respectively ending) gene in the sublist.
Synteny block, intuitive definition
Intuitively (a formal definition is given in ‘Synteny block, formal definition’) we define a Synteny Block (sb, plural sbs) between two species S_{ a } and S_{ b } as a set of neighbouring genes with gene content, gene order and gene orientations conserved during the evolution from LCA(S_{ a },S_{ b }) to S_{ a } and S_{ b }. Two genes are neighbours if they are separated by less than a userdefined parameter g a p_{ m a x }. During evolution we consider that a set of neighbouring genes remains a synteny block until:

a chromosomal rearrangement creates a breakpoint within the sb and changes the order or the orientations of genes

the gap between any two neighbouring genes, caused by gene insertions and/or gene deletions, exceeds g a p_{ m a x } genes (see the formal definition of g a p_{ m a x } in ‘Synteny block, formal definition’ for the choice of the type of gene insertions or gene deletions that may break the synteny)
An ancestral sequence of genes remains a sb even if tandem duplications occur within the synteny block.
Gene family and homology
The evolution of a gene can be represented by a rooted binary tree called a gene tree. The root of a gene tree is the first ancestral gene, the nodes correspond to events of speciations or duplications that occurred during the evolutionary history of the descending genes, and the leaves of the gene tree correspond to extant genes originating from the first gene.
Two genes are homologs if they are in the same gene tree. Two genes are orthologs if they are in the same gene tree and if their last common event is a speciation. Two genes are paralogs if they are in the same gene tree and if their last common event is a duplication. The homology relationship between two genes g_{ a } and g_{ b } is denoted ${\mathsf{g}}_{a}\phantom{\rule{0.3em}{0ex}}\mathcal{\mathscr{H}}\phantom{\rule{0.3em}{0ex}}{\mathsf{g}}_{b}$. A homology relation defines classes of homologs, called families. An issue in comparative genomics is to define gene families and gene trees. Sequence comparison algorithms provide measures (such as BLASTP [37] scores) that make it possible to quantify the similarity between two sequences which may, in turn, be used to cluster genes that show high similarity, thus defining gene families. Gene families can then be organised in phylogenetic gene trees using a vast choice of tree reconstruction methods. Here, we use gene trees from Ensembl [38], built using the TreeBest pipeline [39]. Since in this study we are interested in finding synteny blocks conserved from LCA(S_{ a },S_{ b }) to S_{ a } and S_{ b }, we pruned all gene trees to define a gene family as a set of genes that come from a unique gene of LCA(S_{ a },S_{ b }). Families are defined with these genes, so that two genes are in the same family if and only if they come from the same ancestral gene of LCA(S_{ a },S_{ b }). We note that, depending on the purpose of the analysis, PhylDiag offers the possibilty to prune gene trees at an ancestor that precedes LCA(S_{ a },S_{ b }), so that more paralogy relationships are included in the gene family, see [Additional file 1: Section 1].
Considering the species tree of Figure 1A and the original gene tree of Figure 1B, the Figure 1C describes how we pruned the original gene trees to define our families. Ultimately, the roots of the gene trees correspond to a unique gene of LCA(S_{ a },S_{ b }).
Step 1: Filter extant genomes
When comparing two species S_{ a } and S_{ b }, the first step of PhylDiag is to propose a filtering of extant genomes. There are two filters:

InBothSpecies removes genes that have no homolog in the other genome. This only retains genes that previous algorithms call “anchor genes” and it is the classical way of filtering extant genomes. This filter is well suited for finding functional clusters of genes.

InCommonAncestor removes genes that arose de novo specifically after LCA(S_{ a },S_{ b }). The removed genes are those that have no ancestral gene in LCA(S_{ a },S_{ b }) and they are called “lineage specific genes”. The selective removal is possible using the precomputed phylogenetic gene trees. This step is equivalent to retaining “anchor genes”, but here, using gene trees, the procedure also keeps genes that have lost their ortholog in the other species because of a deletion since LCA(S_{ a },S_{ b }). This filtering is well suited for reconstructing ancestral gene orders.
Both filtering get rid of the noise introduced by lineage specific genes. PhylDiag using the InBothSpecies filter does not consider ancestral gene deletions as events that break the synteny whereas PhylDiag using the InCommonAncestor filter does consider ancestral gene deletions as events that break the synteny.
It may also be advantageous in some specific cases of functional studies of synteny blocks to avoid filtering extant genomes thus considering de novo births of lineage specific genes as events that break the synteny.
Depending on the desired purpose, PhylDiag offers the possibility to easily choose between no filtering at all, the InBothSpecies filter or the InCommonAncestor filter. Since in this study we are interested in reconstructing the ancestral gene order, the InCommonAncestor filter is applied and extant genomes should now be considered to only be composed of genes that have an ancestral gene in LCA(S_{ a },S_{ b }).
Step 2: Build the matrix of homology packs
Extracting sbs conserved in G_{ a } and G_{ b } corresponds to extracting sbs for each comparison of chromosomes c_{ a } of G_{ a } and c_{ b } of G_{ b }. Indeed, genes in two different chromosomes, if they were in synteny before, have been separated by a chromosomal rearrangement and the synteny is broken anyway. Thus it is justified to limit the search to pairs of chromosomes rather than pairs of genomes.
Tandem blocks, an abstraction of genes
In a chromosome, under a parsimonious reasoning, homologous and adjacent genes are tandem duplicates. Here, we refer to such blocks as “tandem block”. Formally, a tandem block (tb, plural tbs) of a chromosome c is an uninterrupted sublist of c that contains paralogous genes. For instance, if the 3 paralogous genes g_{4},g_{5} and g_{6} are in an uninterrupted row in c, the corresponding tb is the subsequence c[ 4→6]=[ g_{4},g_{5},g_{6}]. The size of a tb is equal to the number of tandem gene copies that it contains, for instance the last tb has a size 3. A gene which has no tandem duplicate is in a tb of size 1. By convention tbs are always maximum, i.e. a given tb cannot be contained within another tb. Like genes, tbs also have an orientation. However, in a tb, tandem duplicates may or may not all have the same orientation. When they all share the same orientation, the tb itself is oriented with the same orientation as the orientation of the genes thus, either o(tb)=+1 or o(tb)=−1. When tandem duplicates have different orientations, the orientation of the tb is considered to be unknown, and $o\left(\mathsf{tb}\right)=\varnothing $.
It is possible to rewrite chromosomes as a unique ordered list of oriented tbs. For instance ${c}_{a}=\phantom{\rule{0.3em}{0ex}}[\phantom{\rule{0.3em}{0ex}}{\mathsf{g}}_{a,1},\dots ,{\mathsf{g}}_{a,{N}_{a}}]$ can be rewritten ${c}_{a}=\phantom{\rule{0.3em}{0ex}}[\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{a,1},\dots ,{\mathsf{tb}}_{a,{n}_{a}}]$ where n_{ a } is the number of tbs in c_{ a }. n_{ a }≤N_{ a } and n_{ a }=N_{ a } if and only if there is no tandem duplicate in c_{ a }.
A tandem block tb_{ a } of S_{ a } is said to be in a homology relation with a tandem block tb_{ b } of S_{ b } if the genes of the two tbs are in the same family. We will also say that in this case tb_{ a } and tb_{ b } are homologs or even that tb_{ a } and tb_{ b } are homologous tandem blocks. Using the same notation as for genes, ${\mathsf{tb}}_{a}\phantom{\rule{0.3em}{0ex}}\mathcal{\mathscr{H}}\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{b}$ means that tb_{ a } and tb_{ b } are homologs. If tb_{ a } and tb_{ b } are homologs, they share a Last Common Ancestral gene in LCA(S_{ a },S_{ b }) and we note LCAg(tb_{ a },tb_{ b }) the Last Common Ancestral gene of tb_{ a } and tb_{ b }. LCAg(tb_{ a },tb_{ b }) is defined as soon as it is observed that ${\mathsf{tb}}_{a}\phantom{\rule{0.3em}{0ex}}\mathcal{\mathscr{H}}\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{b}$. Of note, two homologous tandem blocks tb_{ a } and tb_{ b } are not necessarily of the same size if deletions or tandem duplications took place specifically in the branches of S_{ a } or S_{ b } after LCA(S_{ a },S_{ b }).
Matrix of homologies
The classic Matrix of Homologies $\mathsf{MH}\in {\mathfrak{M}}_{{N}_{a},{N}_{b}}$ of two chromosomes ${c}_{a}={\left[{g}_{a,k}\right]}_{k\in [1,{N}_{a}]}$ and ${c}_{b}={\left[{g}_{b,k}\right]}_{k\in [1,{N}_{b}]}$ is defined such that:
Where g_{ a }• g_{ b } is the “sign” of the homology of g_{ a } and g_{ b }
A MH can be represented as an array of values equal to +1,−1 or 0. Non0 values correspond to homologies.
Homology packs, an abstraction of homologies
A Homology Pack (hp, plural hps) is the set of homology relationships between the tandem duplicates of two homologous tandem blocks tb_{ a } (in c_{ a }) and tb_{ b } (in c_{ b }). A hp is always maximum, i.e. a hp cannot be contained within another hp. Graphically, a hp appears as a rectangle of non0 values in a MH. Each hp has a last common ancestral gene in LCA(S_{ a },S_{ b }) denoted LCAg(hp) and equal to LCAg(tb_{ a },tb_{ b }). Tandem duplications generate vertical, horizontal, or rectangular hps in a MH, making it difficult to identify sbs as diagonals. However, the rewriting of a chromosome in a way that collapses these hps to unique values in the MH, as described above, greatly simplifies this problem. Indeed, once c_{ a } and c_{ b } are rewritten as ordered lists of tbs, it becomes possible to define a matrix whose non0 values correspond to hps of the two chromosomes c_{ a } and c_{ b }.
Matrix of homology packs
Given that c_{ a } is rewritten in ${\left[\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{a,k}\right]}_{k\in [1,{n}_{a}]}$ and c_{ b } is rewritten in ${\left[\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{b,k}\right]}_{k\in [1,{n}_{b}]}$, we introduce the Matrix of Homology Packs $\mathsf{MHP}\in {\mathfrak{M}}_{{n}_{a},{n}_{b}}$ of the two chromosomes ${c}_{a}={\left[\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{a,k}\right]}_{k\in [1,{n}_{a}]}$ and ${c}_{b}={\left[\phantom{\rule{0.3em}{0ex}}{\mathsf{tb}}_{b,k}\right]}_{k\in [1,{n}_{b}]}$ defined such that:
Whith tb_{ a }∙tb_{ b } the “sign” of the hp of tb_{ a } and tb_{ b }
In other words, the matrix construction is the same as for the MH of c_{ a } and c_{ b }, with tbs instead of genes and hps instead of gene homologies. The only difference is that while genes always have a known orientation, tbs can have unknown orientations that generate hps with signs equal to $\varnothing $. Similarly, the MHP can be represented as an array of values equal to $+1,1,\varnothing $ or 0. Non0 values correspond to hps. The Xaxis corresponds to c_{ a } ordered from tb_{a,1} to ${\mathsf{tb}}_{a,{n}_{a}}$ and the Yaxis corresponds to c_{ b } ordered from tb_{b,1} to ${\mathsf{tb}}_{b,{n}_{b}}$. With this convention MHP[0,0] corresponds to the bottomleft corner, MHP[n_{ a },0] corresponds to the bottomright corner, MHP[0,n_{ b }] corresponds to the topleft corner and MHP[n_{ a },n_{ b }] corresponds to the top right corner of the array.
[Additional file 1: Section 2] gives a graphical representation of the transition between the MH and the MHP via rewriting chromosomes with tbs.
Distances and gaps
The “gap between two tbs” on the same chromosome is the number of tbs between them.
The “distance between two tbs” is equal to the gap between these two tbs plus one. Thus two adjacent tbs are at a distance one from each other.
As in definition 2.1 in [35], a set of tbs forms a “chain” with gaps ≤ g a p_{ m a x } if all consecutive tbs are separated by gaps ≤ g a p_{ m a x } tbs.
Now, given a MHP, we define the “distance between two hps” as the 2Ddistance between hps coordinates which depends on a distance metric. Several distance metrics can be used in PhylDiag: the Euclidean Distance (ED), the Chebyshev Distance (CD), the Manhattan Distance (MD), or the Diagonal Pseudo Distance (DPD) (Figure 2). Equations for each distance metric can be found in [Additional file 1: Section 3]. The CD yields the maximum of the distances on c_{ a } and c_{ b }, the ED yields the classical geometric distance, the DPD yields smaller distances between hps sitting close to the diagonal axis and therefore tends to provide a higher distance as the distance from the diagonal axis increases. In contrast, the MD tends to yield smaller distances between hps sitting close to the vertical and horizontal axis.
We define the “gap between two hps” as the distance between these hps minus one, thus a gap between two hps depends on the distance metric used. A gap of 0 between two hps means that there is no gap and this corresponds to a distance equal to 1. Given a maximum gap g a p_{ m a x }, a set of hps forms a “cluster” if no gap between them is longer than g a p_{ m a x }.
Step 3: Extract putative synteny blocks as consistent diagonals
In the following section, we define the notion of consistent diagonals in a MHP and we formally define synteny blocks. Then, we explain how synteny blocks generate consistent diagonals in MHPs, and we describe how PhylDiag extracts consistent diagonals. Because some consistent diagonals may be due to chance, we next describe how they are validated as synteny blocks after succeeding a statistical test.
Diagonals
In a MHP, a list of m hps [MHP[x_{ k },y_{ k }]]_{k∈[0,m−1]}forms a:
− “slash” diagonal if $\left\{\begin{array}{c}{x}_{k+1}\ge {x}_{k}\\ {y}_{k+1}\ge {y}_{k}\end{array}\right.\phantom{\rule{2em}{0ex}}\forall k\in [\phantom{\rule{0.3em}{0ex}}0,m2]$.
− “backslash” diagonal if $\left\{\begin{array}{c}{x}_{k+1}\ge {x}_{k}\\ {y}_{k+1}\le {y}_{k}\end{array}\right.\phantom{\rule{2em}{0ex}}\forall k\in [\phantom{\rule{0.3em}{0ex}}0,m2]$.
In both cases, x_{ k } (respectively y_{ k }) is the index of the homologous tb on c_{ a } (respectively c_{ b }) corresponding to the k^{th} hp. In a MHP, a “slash” diagonal is thus a list of non0 cells that goes up according to a direction from bottomleft to topright and a “backslash” diagonal is a list of non0 cells that goes down according to a direction from topleft to bottomright. A “diagonal” is either a slash diagonal or a backslash diagonal. A diagonal with gaps ≤g a p_{ m a x } is a diagonal where all consecutive hps are separated by gaps ≤g a p_{ m a x }.
We define a “strict” diagonal as a diagonal that has no gap between its hps. Thus, m hps form a strict slash diagonal if the list of m hps can be written [ MHP[ s_{ a }+k,s_{ b }+k]]_{k∈[0,m−1]}. Similarly, m hps form a strict backslash diagonal if the list of m hps can be written [ MHP[ s_{ a }+k,s_{ b }−k]]_{k∈[0,m−1]}. In both cases, (s_{ a },s_{ b }) is the position of the first hp of the diagonal.
We also define a “consistent” diagonal as a diagonal composed of hps with signs consistent with hps order:

either a slash diagonal only composed of hps with signs equal to either +1 or $\varnothing $

or a backslash diagonal only composed of hps with signs equal to either −1 or $\varnothing $
In addition, we consider that the distance between two diagonals corresponds to the distance between their closest extremities.
Synteny block, formal definition
We formally define a “synteny block” of m tbs with gaps ≤ g a p_{ m a x } of a comparison of two genomes G_{ a } and G_{ b }, as a chain of m tbs with gaps ≤ g a p_{ m a x } that, during the evolution from LCA(S_{ a },S_{ b }) to S_{ a } and S_{ b }, remains a chain of m tbs with gaps ≤ g a p_{ m a x }. Within a synteny block, tbs order is conserved and tbs orientations either remain conserved or change from a known to an unknown orientation. Synteny blocks are chosen maximal, i.e. not included in another synteny block.
In addition we define a “strict synteny block” as a synteny block with no gaps between tbs (g a p_{ m a x }=0). In [Additional file 1: Section 4], we show that a strict synteny block generates a strict and consistent diagonal in a MHP. Following the reasoning in [Additional file 1: Section 4], using the CD distance metric, it is also possible to show that a synteny block with gaps ≤ g a p_{ m a x } generates a consistent diagonal with gaps ≤ g a p_{ m a x }. However, using the ED, MD or the DPD distance metrics, a synteny block with gaps ≤ g a p_{ m a x } may generate a consistent diagonal with gaps >g a p_{ m a x }, although every consistent diagonal with gaps ≤ g a p_{ m a x } always represents a putative synteny block with gaps ≤ g a p_{ m a x }. It should be noted that, given the CD distance metric and a g a p_{ m a x }, our definition of a diagonal is similar to the definition 4.1 of a “maxgap cluster” in [35] with constraints on gene order and gene orientations.
Extract strict consistent diagonals
Algorithm 1 describes how PhylDiag finds strict and consistent diagonals of hps in the MHP. First, chromosomes are rewritten with tbs and the MHP is built. Then the MHP is scanned from left to right and from bottom to top. Algorithm findDiagType in [Additional file 1: Section 5], sets the diagonal type at the beginning of a strict and consistent diagonal extraction using the sign of the first hp if the sign is known or using the position of the second hp if there is a second hp.
Strict and consistent diagonals are recorded as chains of ordered and oriented (whenever it is possible) ancestral genes. By convention the orientation of an ancestral gene LCAg(tb_{ a },tb_{ b }) is chosen equal to the orientation of tb_{ a }. However, if the orientation of tb_{ a } is unknown, the orientation of LCAg(tb_{ a },tb_{ b }) may still be inferred using the diagonal type of the current diagonal and a known orientation of tb_{ b }, see Equation 1.
Algorithm 1 1 e x t r a c t S b s ( c_{ a }, c_{ b })
Merge strict consistent diagonals
Once strict diagonals have been returned, it is advantageous to merge diagonals which have the same diagonal type, as long as their extremities are in close proximity. Depending on the allowed gap size g a p_{ m a x }, a limited number of errors of annotation and indels are thus allowed, and longer sbs are found that still reflect an ancestral arrangement of genes. It should be noted that this step possibly introduces microinversions within gaps of a diagonal, which will however always remain shorter than g a p_{ m a x } tbs. As we will see, the choice of the distance metric used to merge diagonals is crucial to limit or allow such microinversions, see [Additional file 1: Section 14].
The merging process is simple: diagonals are merged iteratively, starting by those separated by the shortest gap to those separated by the longest gap, as long as the gap remains below g a p_{ m a x }. For a given diagonal extremity, more than one other extremity may be situated at exactly the same distance. In this case, PhyDiag chooses to fuse the diagonals that maximise the number of hps in the diagonal that results from the fusion.
As described in the introduction, the DPD is used in ADHoRe and DiagHunter whereas the MD is used in GRIMMSynteny, FISH, Cinteny and SyMAP. Although the CD and the ED have never been used to our knowledge in the context of synteny block inference we still included them in the benchmark presented in the ‘Results’ section.
Figure 3 shows an example of a merge between two strict backslash diagonals spaced by a distance 5 if the user chose the DPD, 4 if the user chose the MD and 3 if the user chose the CD or the ED.
Given a maximum gap g a p_{ m a x }, users should be aware that, with reference to the formal definition of sbs given in section ‘Synteny block, formal definition’, choosing another distance metric than the CD may return nonmaximum sbs in the MHP. Another reason that may lead to nonmaximum sbs may come from the fusion of diagonals. As mentioned before, if during the fusion process more than one diagonal extremity is available to extend the current diagonal, PhylDiag choses the extremity of the longest diagonal. However, it may be that fusing with a shorter one ultimately would lead to a longer diagonal once the iterative fusion process is complete.
In Algorithm 1, the merging process is encapsulated in the function mergeDiags that takes a list of strict and consistent diagonals and returns a list of consistent diagonals with gaps ≤ g a p_{ m a x }.
Step 4: Statistical validation of consistent diagonals as synteny blocks
We compare two chromosomes, c_{ a } and c_{ b }. c_{ a } has a length of n_{ a } tbs, c_{ b } has a length of n_{ b } tbs and the comparison involves n_{ a b } hps. During the comparison, PhylDiag returns many consistent diagonals that correspond to putative synteny blocks, each characterized by its number of hps, its window W_{ a b } and the maximum gap g between its tbs (note that g ≠ g a p_{ m a x }). Figure 3 shows an example of a consistent diagonal of 4 hps contained in the window W_{ a b } with a maximum gap g=2 tbs reached on c_{ a }. The window W_{ a b } has a size 6 × 4. The chromosomal windows W_{ a } and W_{ b } are the projections of W_{ a b } on each chromosome. W_{ a } has a length of l_{ a }= 6 tbs and W_{ b } has a length of l_{ b }= 4 tbs. As in previous works [33–35], here distances and gaps between hps are calculated with the Chebyshev Distance metric which allows the most relaxed and methodindependent sb definition.
A given consistent diagonal is a statistically significant signature of a sb if it cannot be obtained from a random distribution of tbs (nullhypothesis) up to a fixed probability threshold α. This is equivalent to selecting consistent diagonals that are unlikely to be the result of chance, which we wish to quantify here by a probability, a pvalue.
We calculate the pvalue of each consistent diagonal in five steps. Considering a consistent diagonal of m hps contained in a window W_{ a b } of size l_{ a }× l_{ b } with a maximum gap between hps equal to g, the probability that such a consistent diagonal (or an even more improbable consistent diagonal with gaps ≤ g) arises by chance is denoted p V a l(m,g,l_{ a },l_{ b },n_{ a b },n_{ a },n_{ b }). To compute this value we first compute p_{ d }(k,l_{ a },l_{ b },n_{ a b },n_{ a },n_{ b }) the probability of obtaining exactly k hps in the window W_{ a b }, knowing the MHP density in terms of non0 values. We next compute p_{g,2D}(k,g,l_{ a },l_{ b }), the probability that k hps in W_{ a b } are spaced with gaps ≤ g, knowing that there is at least k hps in W_{ a b }. We also calculate p_{o,o}(k) the probability that k hps have consistent order and signs. By summing and multiplying these probabilities in an appropriate manner we calculate p_{ w }(m,g,l_{ a },l_{ b },n_{ a b },n_{ a },n_{ b }), the probability corresponding to a window sampling search. Finally, we use the former probability to compute the pvalue p V a l(m,g,l_{ a },l_{ b },n_{ a b },n_{ a },n_{ b }) corresponding to a whole genome comparison. The formulas of the two first probabilities are based on [33, 35] respectively and the passage from p_{ w } to the pVal is based on [34]. Here we combine these probabilities and add a last probability, p_{o,o}(k), to account for tbs order and orientations.
Probability accounting for the density
Using the reasoning of [33], in a MHP of size n_{ a }× n_{ b } without dispersed paralogy (see Discussion), involving n_{ a b } hps, the probability of obtaining exactly k hps in a window W_{ a b } of size l_{ a }× l_{ b } is:
The subscript d stands for density because this probability takes into account the density of the MHP. The demonstration of this formula is in [Additional file 1: Section 6].
Probability accounting for the maximum gap between hps
Using the reasoning of [35], the probability that k marked tbs (in any order) form a chain with gaps ≤ g anywhere within a window composed of l tbs is:
Where w_{ k g }= k+(k−1)g is the maximum length of a chain containing k tbs with a maximum gap g, and
the number of ways of arranging k tbs so that they form a chain with gaps shorter or equal to g anywhere within a window of l tbs even if w_{ k g }>l+1, to address edge effects.
Thus, knowing that W_{ a b } contains at least k hps, the probability that W_{ a b } contains k marked hps spaced with gaps ≤ g is:
Probability accounting for hps order and signs
Then, if k hps are close enough, the probability that they form a consistent slash diagonal with gaps ≤ g is:
Where $P(\mathit{\text{sign}}=+1\text{or}\varnothing )=P(\mathit{\text{sign}}=+1)+P(\mathit{\text{sign}}=\varnothing )$ and P(s i g n = s) is the probability that one hp sign equals s, this probability calculation is explained in [Additional file 1: Section 7]. $\frac{1}{k!}$ is the probability that k homologous tbs of chromosome c_{ b } have the same order as the corresponding k homologous tbs of chromosome c_{ a } and ${\left[\phantom{\rule{0.3em}{0ex}}P\right(\mathit{\text{sign}}=+1\text{or}\varnothing \left)\right]}^{k}$ is the probability that the k signs of the hps are consistent with a slash diagonal. p_{ b a c k s l a s h }(k) is defined similarly.
Thus, if k hps are close enough, the probability that they form a consistent diagonal with gaps ≤ g is:
The subscript o,o stands for consistent tbs Order and tbs Orientations. The demonstration of the p_{o,o} formula can be found in [Additional file 1: Section 8].
Probability for a window sampling scenario
Now, in a MHP of size n_{ a }× n_{ b } without dispersed paralogy (see Discussion), involving n_{ a b } hps, the probability that in a window W_{ a b } of size l_{ a }× l_{ b } there is at least one consistent diagonal containing at least m hps spaced with gaps ≤ g is:
The subscript w stands for Window because this probability corresponds to a window sampling [34] scenario. Only varying parameters are shown in the righthand side of the equation in the preceding formula. This formula is explained in [Additional file 1: Section 9].
Probability for a whole chromosome comparison
Finally, since PhylDiag performs a whole chromosome comparison, it is not possible to use the probability of a window sampling method that would underestimate the probability to find a consistent diagonal by a factor of O(n_{ a }n_{ b }). Thus, relying on the reasoning of section 4.2 of [34] we adjust the former probability to compute the probability corresponding to a whole chromosome comparison.
In a MHP of size n_{ a }× n_{ b } containing n_{ a b } hps without dispersed paralogy (see Discussion), the probability of finding at least one window W_{ a b } of size l_{ a }× l_{ b } containing at least a consistent diagonal of at least m hps spaced by gaps ≤ g can be approximated by:
where ${n}_{w}=\frac{{n}_{a}{n}_{b}}{{l}_{a}{l}_{b}}$ is the number of windows of width l_{ a } and height l_{ b } in the MHP such that no window overlap with any other window. The underlying assumption of this formula is justified in [Additional file 1: Section 10] and examples of calculation are performed in [Additional file 1: Section 11].
In Algorithm 1, the statistical validation is encapsulated in the function statisticalValidation that takes a list of consistent diagonals as input and returns statistically validated sbs.
Estimation of a recommended maximum gap parameter
All algorithms designed to identify synteny blocks use a maximum gap parameter (g a p_{ m a x }) to allow gaps in sbs. However, the user may find it difficult to estimate the optimal value for this parameter. In order to avoid guessing or multiple trials before finding the optimal g a p_{ m a x } value, PhylDiag uses the dependency between the probability of finding a consistent diagonal of m hps spaced by gaps ≤ g a p_{ m a x } and the g a p_{ m a x } value. The complete reasoning used to calculate the recommended maximum gap parameter can be found in [Additional file 1: Section 12].
Viewer
PhylDiag includes a utility to visualise the MHP of a pairwise comparison of chromosomes with colours and surrounding black rectangles for sbs recognition. This viewer writes a vectorial image allowing an infinite zoom on details with no pixelisation. Figure 4A shows an example of the viewer during the comparison of the human X chromosome with the mouse X chromosome. If more information about a region of the MHP is required, a zoom can be performed by specifying the desired chromosomal regions. If these are small enough, more information is shown, such as hps signs, oriented tbs on each axis, the size of each tb and colours for homology recognition. Grey tbs represent tbs that do not have hps in the MHP, but they have hps elsewhere in the pairwise comparisons of genomes, in another pairwise comparison of chromosomes. Figure 3 was produced with the viewer and shows such informations. The user may also visualise the MH, for example to study the genic composition of a tb.
Implementation
The complete algorithm has been implemented in Python. Pairwise comparisons of chromosomes are performed in parallel since they are independent. In Algorithm 1, the MHP matrix is stored considering that it is a sparse matrix to reduce memory usage and the merging process is optimised. Combinations in probability formulas are computed using Pascal’s rule and dynamic programming. On a single 3,0 GHz processor with 32 Gb RAM, loading the data in memory requires 3 seconds, and the running time for the pairwise analysis of the Human and Mouse genomes requires less than 3 seconds. Without any optimisation of the memory allocations the peak of RAM consumption is 221 Mb, thus a standard personal computer can run PhylDiag.
Results
To evaluate the performances of PhylDiag, we performed a comparative analysis with iADHoRe 3.0 [14], a stateoftheart algorithm used in many recent studies. To make comparisons possible however, we used a version of the program provided by the authors. Indeed, iADHoRe 3.0 first rewrites genomes in tbs like PhylDiag, but allows a userdefined “tandem_gap” between genes in a tb. In version 3.0, the minimal tandem_gap is 2, and it is not possible to set the tandem_gap to 0, as in PhylDiag. In the version provided (iADHoRe 3.0.2a) this option is enabled.
When ADHoRe compares two chromosomes, it first generates “baseclusters” which correspond to PhylDiag’s sbs. ADHoRe uses the DPD metric to build baseclusters containing gaps ≤“gap_size” in the matrix of homologies. ADHoRe also uses the “prob_cutoff” parameter for the statistical filtering and a last parameter is the “q_value”, a real value between 0 and 1, indicating the minimum r^{2} (a measure for the linearity of baseclusters in the matrix of homologies) that a cluster should display.
Comparison with iADHoRe 3.0.2a on real data
In a first comparison, we provided the same input based on real genomic data to PhylDiag and ADHoRe. We used the human genome (G_{ h }) and the mouse genome (G_{ m }) of Ensembl v72. As explained in section ‘Gene family and homology’, families correspond to genes that are descended from a unique gene of LCA(S_{ h },S_{ m })= Euarchontoglire.
PhylDiag computes a recommended g a p_{ m a x } of 5 tbs for the humanmouse comparison. We therefore set a g a p_{ m a x } parameter of 5 and we chose a probability threshold α = 1 × 10^{−3} for PhylDiag. iADHoRe 3.0.2a was set with tandem_gap=0, gap_size=5, prob_cutoff=1×10^{−3} and q_value=0.9 (the default value). Figure 5 compares the distributions of synteny block lengths of ADHoRe and PhylDiag using the MD distance metric. The two distributions are not different from each other (MannWhitney U test: pval = 0.9791), and show that neither methods suffer from strong biases in over or under detection of synteny blocks in a given size range. Of note, PhylDiag returned 17 significant sbs of 2 hps out of 175 consistent diagonals of 2 hps. These are not shown in Figure 5 because ADHoRe does not report sbs of size 2. PhylDiag statistically validated all consistent diagonals containing more than 2 hps as significant synteny blocks.
Comparison with iADHoRe 3.0.2a on simulated data
Our simulator first designs an ancestral genome G_{ a n c } with a user defined number of genes and chromosomes. The lengths of chromosomes in G_{ a n c } are expressed in number of genes, and are determined randomly. Simulated evolution gives rise to the two extant genomes G_{ a } and G_{ b } of two extant species. The simulator performs genic events, which include de novo gene births, deletions, duplications (tandem and dispersed), and genomic rearrangements, which include chromosome fusions and fissions, segmental translocations or segmental inversions. The evolutionary scenario is calibrated so as to fit the known evolution of the human and the mouse genome from the Euarchontoglire genome using phylogenetic gene tree reconstructions from Ensembl Compara version 72. See [Additional file 1: Section 13] for a more detailed description of the Simulator.
We performed 100 simulations of the evolution of the human and the mouse genome, and analysed them with PhylDiag and ADHoRe to identify sbs. The PhylDiag merging process was performed with the 4 different distance metrics (ED,CD,DPD and MD). For ADHoRe the DPD is the only distance metric available. As in the comparison with real data, since the simulation is calibrated to fit real evolutionnary rates, the recommended g a p_{ m a x } found by PhylDiag is still 5. Results of PhylDiag with a g a p_{ m a x }=5 and ADHoRe with g a p_s i z e= 5 are shown in Table 1.
Coverage is the fraction of the number of gene families (each family corresponds to a single ancestral gene of Euarchontoglire) contained in sbs over the total number of ancestral genes conserved in both the simulated human genome and the simulated mouse genome. N50 is the length of the sb such that all sbs of greater lengths represent 50% of the ancestral genes contained in sbs. Sensitivity is the fraction of the number of correctly inferred ancestral adjacencies over the total number of ancestral genes conserved in both the simulated human genome and the simulated mouse genome. Specificity is the fraction of the number of correctly inferred ancestral adjacencies over the total number of inferred ancestral adjacencies, false inferences included.
Specificity and sensitivity are calculated twice: first by ignoring gene orientations (an inferred adjacency between two genes is considered correct if both genes are adjacent in G_{ a n c } even if their relative orientation is different compared to the ancestral relative orientation), and second by taking gene orientations into account (to be correct an inferred adjacency must contain genes with a relative orientation that is the same as in the G_{ a n c }).
Results show that PhylDiag with the DPD, and ADHoRe obtain similar results when we do not consider gene orientations during the analysis. Interestingly, simply using the ED, the CD or the MD metrics allows PhylDiag to achieve better sensitivity and specificity than ADHoRe (MannWhitney U test on sensitivity % and specificity % using the MD in PhylDiag and DPD in ADHoRe over 100 simulations: pval ≤ 2.2e16 and pval ≤ 2.2e16 respectively). In addition, as soon as gene orientations are considered in the analysis, PhylDiag improves substantially, in part because of Equation 1.
Discussion
We have compared PhylDiag to iADHoRe 3.0, a stateoftheart algorithm including advanced features which are not present in PhylDiag, including the possibility to identify sbs in the “twilight zone”, i.e. sbs highly diverged or separated by a WGD, where many gene deletions may have occurred. ADHoRe uses “profiles” across more than two genomes to identify poorly conserved sbs, for example due to long divergence times. These features were not exploited here because unlike ADHoRe, PhylDiag only performs pairwise comparisons of genomes since our primary interest is to identify sbs in closely related species.
We explored different distance metrics to measure distances in matrices of homology, and found that the DPD used in ADHoRe, which favours fusions of diagonals along ±45° axes in the MHP (Figures 2D, 2E and 2F), is not optimal. This has been discussed previously [23] and the simulations clearly show that exploring first laterally (i.e. vertically and horizontally), as with the MD (Figure 2C), improves results. Merging diagonals with the DPD distance metric allows more small inversions within sbs gaps while considering that genic/segmental indels and incorrect annotations break the synteny more easily than with the MD. Conversely, merging diagonals with the MD metric gives priority to lateral directions and this allows more small genic/segmental indels and annotation errors within sbs gaps and considers that inversions break the synteny more easily than with the DPD, see [Additional file 1: Section 14]. Interestingly the unusual ED or CD distance metrics also show improved results over the DPD (Table 1). It should be noted that a given distance may cover a different number of cells in the MHP depending on the metric chosen. For instance 9 cells are covered within a distance value of 3 with the MD whereas 7 cells are covered within the same distance value of 3 with the DPD (Figures 2C and 2D). Although this bias may play a role in the results, on chromosomes, gaps between tbs involved in pairs of chains corresponding to sbs are always smaller or equal to g a p_{ m a x } independently of the metric chosen. Thus comparing metrics is fair. Finally, contrary to ADHoRe, PhylDiag can return sbs containing 2 hps if their pvalue is under the pvalue threshold of the user.
PhylDiag includes a new statistical validation to estimate the probability that a putative sb may be due to chance. Unlike other tests, it accounts for gene orientations, thus providing increased sensitivity. It also accounts for tandem duplications but ignores the possibility that duplicate gene copies may be dispersed. Neglecting dispersed duplicates underestimates the pvalues of sbs and the significance of sbs are thus overestimated. However models considering gene families exist [8, 23] and in a future version it might be advantageous to implement the pvalue proposed in [36], even if the calculation is based on an unrealistic assumption that all gene families are of fixed size. Nevertheless the error in the pvalue calculation in PhylDiag is likely to be small for closely related species. For instance the analysis of phylogenetic trees described here shows that only 2.4% of tbs are dispersed duplicates in the human genome (3.2% in mouse) using our family definition (section ‘Gene family and homology’).
The pvalue used by PhylDiag is relative to a comparison of two chromosomes, and therefore assumes that random consistent diagonals might arise based on the number of tbs and hps relevant to the two chromosomes only. In contrast, a global (i.e. genome wide) threshold α is chosen to distinguish significant sbs from nonsignificant sbs. This inconsistency represents an area of further development, in order to better account for heterogeneous densities of hps depending on which chromosomes are being compared.
Conclusion
PhylDiag is designed around a heuristicindependent formal definition of synteny blocks. Its implementation and benchmarking using real and simulated data allowed us to rank 2Ddistance metrics in terms of sensitivity and specificity, and to evaluate its performance in comparison with ADHoRe. Results show that the DPD distance metric yields the poorest performances when identifying synteny blocks, both with ADHoRe and PhylDiag. In contrast, PhylDiag highlights the interesting sensitivityspecificity tradeoff achieved by the MD distance metric, closely followed by the CD and the ED distance metrics. Compared to ADHoRe and other algorithms that infer synteny blocks, the definition of gene families in PhylDiag is based on gene trees. Most notably, this feature offers the opportunity to precisely group extant genes into families that descend from a unique gene in the last common ancestor of the two species being compared. Furthermore, a meticulous attention to tandem duplicates and gene orientations allow PhylDiag to reach a high resolution in the analysis of rearrangements, down to single gene inversion. Finally, the statistical validation of putative synteny blocks filters out putative false positives due to randomly convergent gene order. PhylDiag is a software for synteny block inference that benefits from extensive parameters, including g a p_{ m a x }, distance metric, pvalue threshold, filtering of extant genomes and ancestor for the gene family definition. Their values can be set by PhylDiag (default values are based on previous benchmarks or set automatically based on the data) or set by the user. These features, together with postprocessing graphical analysing tools and printed statistics (number of tandem duplicates in extant genomes, number of dispersed duplicates, number of homologies involved in the pairwise comparison) contribute to making PhylDiag a userfriendly method to find synteny blocks.
Availability and requirements
Project name: PhylDiag Project home page: https://github.com/DyogenIBENS/PhylDiag Operating system(s): Platform independent Programming language: Python Other requirements: Python 2.7 or higher License: GNU GPL v3 or later, and the CeCiLL v2 license in France Any restrictions to use by nonacademics: No
References
 1.
Murphy WJ, Larkin DM, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L, Hitte C, Meyers SN, Milan D, Pape G, Parker HG, Raudsepp T, Rogatcheva MB, Schook LB, Skow LC, Welge M, Womack JE, O’brien SJ, Evertsvan der Wind A: Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005, 309 (5734): 6137. doi:10.1126/science.1111387,
 2.
Chauve C, Tannier E: A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput Biol. 2008, 4 (11): 100023410.1371/journal.pcbi.1000234. doi:10.1371/journal.pcbi.1000234. PMID:19043541,
 3.
DaraiRamqvist E, Sandlund A, Müller S, Klein G, Imreh S, KostAlimova M: Segmental duplications and evolutionary plasticity at tumor chromosome breakprone regions. Genome Res. 2008, 18 (3): 3709. doi:10.1101/gr.7010208,
 4.
Kikuta H, Laplante M, Navratilova P, Komisarczuk AZ, Engström PG, Fredman D, Akalin A, Caccamo M, Sealy I, Howe K, Ghislain J, Pezeron G, Mourrain P, Ellingsen S, Oates A. C, Thisse C, Thisse B, Foucher I, Adolf B, Geling A, Lenhard B, Becker TS: Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res. 2007, 17 (5): 54555. doi:10.1101/gr.6086307,
 5.
Irimia M, Tena JJ, Alexis M. S, FernandezMiñan A, Maeso I, Bogdanovic O, de la CalleMustienes E, Roy SW, GómezSkarmeta JL, Fraser HB: Extensive conservation of ancient microsynteny across metazoans due to cisregulatory constraints. Genome Res. 2012, 22 (12): 235667. doi:10.1101/gr.139725.112,
 6.
Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. Algorithms Bioinformatics. 2002, 2452: 464476. doi:10.1007/3540457844_36,
 7.
Luc N, Risler JL, Bergeron A, Raffinot M: Gene teams: a new formalization of gene clusters for comparative genomics. Comput Biol Chem. 2003, 27 (1): 5967. 10.1016/S14769271(02)00097X. PMID:12798040,
 8.
He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. J Comput Biology: J Comput Mol Cell Biol. 2005, 12 (6): 638656. doi:10.1089/cmb.2005.12.638. PMID:16108708,
 9.
Boyer F, Morgat A, Labarre L, Pothier J, Viari A: Syntons, metabolons and interactons: an exact graphtheoretical approach for exploring neighbourhood between genomic and functional data. Bioinformatics. 2005, 21 (23): 42094215. doi:10.1093/bioinformatics/bti711. PMID:16216829. Accessed 20131024,
 10.
Hampson SE, Gaut BS, Baldi P: Statistical detection of chromosomal homology using sharedgene density alone. Bioinformatics (Oxford, England). 2005, 21 (8): 13391348. doi:10.1093/bioinformatics/bti168. PMID:15585535,
 11.
Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics (Oxford, England). 2009, 25 (5): 571577. doi:10.1093/bioinformatics/btp027. PMID:19158161,
 12.
Ma J, Zhang L, Suh BB, Raney BJ, Burhans RC, Kent WJ, Blanchette M, Haussler D, Miller W: Reconstructing contiguous regions of an ancestral genome. Genome Res. 2006, 16 (12): 15571565. doi:10.1101/gr.5383506. Accessed 20130618,
 13.
Tesler G: GRIMM genome rearrangements web server. Bioinformatics. 2002, 18 (3): 492493. doi:10.1093/bioinformatics/18.3.492. PMID:11934753. Accessed 20130618,
 14.
Proost S, Fostier J, Dhoedt B, Demeester P, Vandepoele K, De Witte D: iADHoRe 3.0–fast and sensitive detection of genomic homology in extremely large data sets. Nucleic Acids Res. 2012, 40 (2): 1110.1093/nar/gkr955. doi:10.1093/nar/gkr955,
 15.
Simillion C, Vandepoele K, Saeys Y, Van de Peer Y: Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 2004, 14 (6): 10951106. doi:10.1101/gr.2179004. Accessed 20130708,
 16.
Van de Peer Y, Meyer A: Chapter 6  largescale gene and ancient genome duplications. The Evolution of the Genome. Edited by: Gregory TR. 2005, Burlington: Academic Press, 340344. http://www.sciencedirect.com/science/article/pii/B9780123014634500085. Accessed 20130928,
 17.
Vandepoele K, Saeys Y, Simillion C, Raes J, Van De Peer Y: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between arabidopsis and rice. Genome Res. 2002, 12 (11): 17921801. doi:10.1101/gr.400202. PMID:12421767,
 18.
Cannon SB, Kozik A, Chan B, Michelmore R, Young ND: DiagHunter and GenoPix2D: programs for genomic comparisons, largescale homology discovery and visualization. Genome Biol. 2003, 4 (10): 6810.1186/gb2003410r68. doi:10.1186/gb2003410r68. PMID:14519203,
 19.
Hampson S, Gaut B, Baldi P, McLysaght A: LineUp: statistical detection of chromosomal homology with application to plant comparative genomics. Genome Res. 2003, 13 (5): 9991010. doi:10.1101/gr.814403. PMID:12695327,
 20.
Calabrese PP, Chakravarty S, Vision TJ: Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics. 2003, 19 (suppl 1): 7480. doi:10.1093/bioinformatics/btg1008. PMID:12855440. Accessed 20130621,
 21.
Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 2004, 20 (18): 36433646. doi:10.1093/bioinformatics/bth397. PMID:15247098. Accessed 20130621,
 22.
Soderlund C, Nelson W, Shoemaker A, Paterson A: SyMAP: a system for discovering and viewing syntenic regions ofFPC maps. Genome Res. 2006, 16 (9): 11591168. doi:10.1101/gr.5396706. PMID:16951135. Accessed 20130621,
 23.
Wang X, Shi X, Li Z, Zhu Q, Kong L, Tang W, Ge S, Luo J: Statistical inference of chromosomal homology based on gene colinearity and applications to arabidopsis and rice. BMC Bioinformatics. 2006, 7: 44710.1186/147121057447. doi:10.1186/147121057447. Accessed 20130620,
 24.
Sinha AU, Meller J: Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics. 2007, 8 (1): 8210.1186/14712105882. doi:10.1186/14712105882. PMID:17343765. Accessed 20130619,
 25.
Zeng X, Nesbitt MJ, Pei J, Wang K, Vergara IA, Chen N: OrthoCluster. Proceedings of the 11th International Conference on Extending Database Technology Advances in Database Technology  EDBT ’08. 2008, New York, USA: ACM Press, 656656. doi:10.1145/1353343.1353423. [http://portal.acm.org/citation.cfm?doid=1353343.1353423],
 26.
Rödelsperger C, Dieterich C: Syntenator: Multiple gene order alignments with a genespecific scoring function. Algorithms Mol Biol. 2008, 3 (1): 1410.1186/17487188314. doi:10.1186/17487188314. PMID:18990215. Accessed 20130621,
 27.
Rödelsperger C, Dieterich C: CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PLoS ONE. 2010, 5 (1): 886110.1371/journal.pone.0008861. doi:10.1371/journal.pone.0008861. Accessed 20130621,
 28.
Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH: Unraveling ancient hexaploidy through multiplyaligned angiosperm gene maps. Genome Res. 2008, 18 (12): 19441954. doi:10.1101/gr.080978.108. PMID:18832442. Accessed 20131023,
 29.
Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, Jin H, Marler B, Guo H, Kissinger JC, Paterson AH, Lee Th: MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012, 40 (7): 4910.1093/nar/gkr1293. doi:10.1093/nar/gkr1293. PMID:22217600,
 30.
Paten B, Herrero J, Beal K, Fitzgerald S, Birney E: Enredo and pecan: Genomewide mammalian consistencybased multiple alignment with paralogs. Genome Res. 2008, 18 (11): 18141828. doi:10.1101/gr.076554.108. Accessed 20130910,
 31.
Pham SK, Pevzner PA: DRIMMSynteny: decomposing genomes into evolutionary conserved segments. Bioinformatics (Oxford, England). 2010, 26 (20): 250916. doi:10.1093/bioinformatics/btq465,
 32.
Smith TF, Waterman MS, Subsequences CM: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 1957. 10.1016/00222836(81)900875.
 33.
Raghupathy N, Hoberman R, Durand D: Two plus two does not equal three: statistical tests for multiple genome comparison. J Bioinform Comput Biol. 2008, 6 (1): 122. 10.1142/S0219720008003242.
 34.
Durand D, Sankoff D: Tests for gene clustering. J Comput Biol: J Comput Mol Cell Biol. 2003, 10 (34): 45382. doi:10.1089/10665270360688129,
 35.
Hoberman R, Sankoff D, Durand D: The statistical analysis of spatially clustered genes under the maximum gap criterion. J Comput Biol: J Comput Mol Cell Biol. 2005, 12 (8): 10831102. doi:10.1089/cmb.2005.12.1083. PMID:16241899,
 36.
Raghupathy N, Durand D: Gene cluster statistics with gene families. Mol Biol Evol. 2009, 26 (5): 95768. doi:10.1093/molbev/msp002,
 37.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 40310. doi:10.1016/S0022283605803602,
 38.
Kersey PJ, Allen JE, Christensen M, Davis P, Falin LJ, Grabmueller C, Hughes DST, Humphrey J, Kerhornou A, Khobova J, Langridge N, McDowall MD, Maheswari U, Maslen G, Nuhn M, Ong CK, Paulini M, Pedro H, Toneva I, Tuli MA, Walts B, Williams G, Wilson D, YouensClark K, Monaco MK, Stein J, Wei X, Ware D, Bolser DM, Howe KL, et al: Ensembl Genomes 2013 scaling up access to genomewide data. Nucleic Acids Res. 2014, 42 (Database issue): 54652. doi:10.1093/nar/gkt979,
 39.
Vilella AJ, Severin J, UretaVidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: complete, duplicationaware phylogenetic trees in vertebrates. Genome Res. 2009, 19 (2): 327335. doi:10.1101/gr.073585.107. Accessed 20130929,
Acknowledgements
We thank Alexandra Louis and Pierre Vincens for their help with computer facilities and Wassim AbouJaoudé for his assistance with mathematical problems. We also thank Yves Van de Peer, Klaas Vandepoele and Jan Fostier, the authors of ADHoRe who provided the version iADHoRe 3.0.2a. This work is funded by Centre National de la Recherche Scientifique (CNRS) and grants from the Agence Nationale de la Recherche (ANR) [Ancestrome Project ANR10BINF0103, ANR BlancPAGE ANR2011 BSV600801].
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JMEXL designed the core of PhylDiag. MM designed the library framework for genome data manipulation and gene tree editing. HRC supervised the work. JMEXL and HRC wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Lucas, J.M., Muffato, M. & Roest Crollius, H. PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees. BMC Bioinformatics 15, 268 (2014). https://doi.org/10.1186/1471210515268
Received:
Accepted:
Published:
Keywords
 Comparative genomics
 Synteny
 Synteny block
 Segmental homologies
 Homology
 Gene order
 Rearrangement
 Ancestral genome
 Gene tree