Volume 13 Supplement 19
Proceedings of the Tenth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics
Gene family assignment-free comparative genomics
 Daniel Doerr^{1, 2},
 Annelyse Thévenin^{1, 2} and
 Jens Stoye^{1, 2}
https://doi.org/10.1186/1471-2105-13-S19-S3
© Doerr et al.; licensee BioMed Central Ltd. 2012
Published: 19 December 2012
Abstract
Background
The comparison of relative gene orders between two genomes offers deep insights into functional correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene order analyses often require prior knowledge of homologies between all genes of the genomic dataset. Since such information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These hypothetical groups of homologous genes are called gene families.
Results
This manuscript promotes a new branch of gene order studies in which prior assignment of gene families is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our methods on a dataset comprising 12 γ-proteobacteria from the literature.
Conclusions
In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons.
Background
In the field of comparative genomics, studying the relative order of genes in genomes is a popular practice to gain information about organisms and their relationships. This information ranges from transcription and functional linkage of genes such as correlated expression, the phylogeny of organisms, to detailed evolutionary dynamics of their genomes. Gene order methods are also incorporated in genome alignment strategies to identify regions that are subsequently used to anchor the alignment [1].
Genes are the atomic elements in gene order studies. Although no precise, formal definition is generally agreed upon, from the biological point of view a gene represents a specific inheritable entity at a particular locus on a chromosomal sequence in a particular organism. It often features a protein coding region. Nevertheless, the notion of a "gene" can also represent more fine-grained genetic structures such as protein domains or other functional elements of the genome.
Gene families. Many gene order studies assume that evolutionary relationships are resolved between all pairs of genes. Resting on the biological concept of homology, such studies require information about orthology, paralogy and (potentially) xenology for each pair of genes in the dataset. This information is generally not given, hence it is common to cluster genes according to their sequence similarity. Such groups are sometimes called gene families, and we will stick to this notion in the following.
Various databases exist, such as COG [2], eggNOG [3], Inparanoid [4], TreeFam [5], and OrthologID [6] (to name only a few), that offer gene family information. These databases can be divided into two groups: databases that primarily use sequence similarity to cluster genes into groups of co-orthologs, and tree-based databases offering reconstructed gene family trees [7].
The former group of databases usually provides more gene family data while covering a larger set of species. However, the contained information should always be taken with a pinch of salt: even though high sequence similarity is a good indicator of homology, these gene families do not per se reflect an evolutionary relation. This is because they depend on arbitrary parameters of sequence comparison, similarity quantification, and clustering. Generally such parameters are user-controlled and influence the size and granularity of the computed gene families. Moreover, the vast majority of these databases are uncurated or offer only a negligible amount of curated data.
Lacking a gene tree, no differentiation can be made within these gene families between in- and out-paralogs when comparing a specific pair of genomes. As is well known, gene duplication and sub- or neofunctionalization occur frequently in evolution. Hence the number of co-orthologous genes in a genome that are pooled into the same gene family grows the higher one ascends in the evolutionary tree. With an increasing number of diverse genomes in the database, these gene families become less useful for gene order analyses if only a closely related subset of taxa is of interest. The blemish of disregarding the evolutionary tree needed for truly resolving evolutionary relationships between genes of a given set of genomes is often covered by offering varying levels of granularity. This means that for some subtrees (but generally not for all) of the genomes in the database, gene families are recomputed with tighter parameters. Moreover, the computed sequence-based similarity estimates are rarely based on models of DNA evolution, as these involve considerably more computational load. Consequently, differential evolutionary rates are disregarded, amplifying the dilemma of grouping genes based on sequence similarity: criteria that are too loose may assign two genes to the same gene family although they are not homologous, whereas criteria that are too strict can split gene families that belong together [7].
Tree-based databases such as TreeFam and OrthologID may provide the more accurate information desired for gene order studies. This is partly because the evolutionary relationships between genes in a gene family are considered in more detail. Furthermore, the species tree is taken into account while reconstructing the gene family trees. Also, tree-based databases tend to be manually curated more often than their sequence similarity based counterparts. In return, the provided gene family information is often sparse and does not cover all genes of a genome. Moreover, such databases usually comprise only a handful of species. As a result, they are of limited use in gene order studies.
Gene content variations. Apart from model-free comparison or well-defined rearrangements in genomes, gene order studies can allow for additional biologically motivated operations of evolution. That is, genes can duplicate, emerge or become lost in the genome. Similarly, a gene family can grow or shrink, or new gene families can arise.
Gene order studies. Based on the concept of gene families, many gene order studies share a common data structure where chromosomes are represented as words drawn from a finite alphabet of gene families. The strength of this data structure lies in its simplicity; it allows the corresponding gene order problems to be studied in an abstract form composed of permutations or sequences over a set of characters. Another important advantage is the fact that homology is a binary and transitive relation. This led to the emergence of a multitude of efficient algorithms which solve gene order problems combinatorially.
In the following we briefly review three different types of gene order studies. Dissimilarity measures such as the breakpoint distance [8] are used to calculate evolutionary distances between two or more genomes without explicitly drawing on rearrangement operations. The breakpoint distance is defined as the number of unconserved adjacencies between characters of two genomes. For gene cluster detection, several competing models exist. One of them is based on the notion of approximate common intervals [9]. There, a gene cluster is defined as a set of maximal intervals, on two or more genomes, that share the same character set. Small differences between the set of characters constituting the gene cluster and the set of characters within the intervals are allowed. The number of tolerated differences as well as the minimal size of an interval is determined by a user-controlled parameter. Finally, a group of popular rearrangement models is based on the so-called double-cut-and-join (DCJ) operation [10, 11]. By cutting the genome at two different positions and rejoining the resulting ends, one aims to transform one genome into another by a minimal sequence of DCJ operations. Such a sequence is called a sorting scenario.
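The breakpoint distance described above can be illustrated with a minimal sketch. It assumes the classical gene family setting with two linear, unichromosomal genomes given as signed permutations of the same genes; the encoding of telomeres as 0 and the function names are hypothetical choices made for this illustration only.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a signed linear genome.

    genome: list of non-zero signed integers, e.g. [1, -3, -2, 4].
    Telomeres are encoded as 0 (an assumption of this sketch).
    """
    extended = [0] + list(genome) + [0]
    result = set()
    for a, b in zip(extended, extended[1:]):
        # Reading (a, b) left to right is the same adjacency as
        # reading (-b, -a) right to left; store one canonical form.
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not conserved in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


# A single inversion of the segment (2, 3) creates two breakpoints.
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -3, -2, 4, 5]))
```

Reversing a whole chromosome changes no adjacency under this canonical form, so a genome and its complete reversal have distance 0.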
Limits of the gene family concept. The concept of gene families comes with much benefit, but also has its detriments. On the one hand, gene family information can be gained with comparatively low effort by accessing various public databases or by direct computation. On the other hand, comparative studies based on uncurated gene families are hampered since data can be incorrect.
There are many reasons why the exclusive, binary membership relation between genes and gene families is disputable in itself. For one, most gene families are uncurated, hence subsequent analyses would benefit from distinguishing between weak and strong assumptions of homology between genes when assigning them to one or more gene families. Moreover, the gene family concept disregards the facts that gene families may share conserved protein domains and that genes may fuse with others in the course of evolution.
In this paper we promote the idea that gene order studies can be performed without prior gene family assignment. We propose the direct use of similarity values because such information not only allows more substantiated choices in resolving gene order in subsequent analyses, but can sometimes better reflect the biological reality. In support of our case, we present a new approach to calculate the number of conserved adjacencies, which is a similarity measure related to the breakpoint distance, without the use of gene families. Our method is based on a weighted bipartite graph representing pairwise similarities between genes of two genomes. We show that this allows for stable adjacency analyses when similarities are calculated based on sequence similarity.
In the "Methods" section we introduce the problem setting formally and devise an exact algorithm as well as a heuristic for its solution. In the "Experiments and Discussion" section we discuss the performance of our method on a dataset of 12 γ-proteobacterial genomes and compare results with former work. The manuscript closes with concluding remarks and future prospects in the "Conclusions" section.
Methods
Formal problem description
Genome model. Let $\mathcal{G}$ be the universe of all genes. A chromosome is then defined as a sequence of genes (º, g_{1}, g_{2}, ..., g_{n−1}, º), with $g_{i}\in \mathcal{G}$ for all i = 1, ..., n − 1, flanked by telomeric ends represented by "º". Depending on the type of gene order study, chromosomes can be signed or unsigned. If signed, a gene g has a direction indicated by −g or +g (it is common to omit the "+"), which represents the relative orientation of the gene along the chromosome. A chromosome can also be circular, as is often observed in bacteria; in this case it has no telomeric ends, and the outermost genes are adjacent. For the time being, let us assume that a genome is unichromosomal and linear, since the general case of our model can be easily inferred. The size of a genome G with n − 1 genes is |G| = n. To refer to the i-th gene of G, we use the notation G[i]. Further, let $\sigma :\mathcal{G}\times \mathcal{G}\to [0,1]$ be a normalized similarity measure between all pairs of genes.
Given two genomes G_{1} and G_{2}, the pairwise similarities induce a weighted bipartite graph B = (G_{1}, G_{2}, E) in which two genes are connected by an edge whenever their similarity is greater than 0. Unconnected genes are omitted from the chromosomal sequences. The remaining genes form connected components of size two or larger. Let $\mathcal{C}$ denote the set of all such connected components of B; then for some $C\in \mathcal{C}$ and x ∈ {1,2}, C_{x} denotes the set of all genes of C that are part of G_{x}. Given B, we will be interested in finding a set of disjoint edges. Such a set, denoted by $\mathcal{M}$, is known as a matching.
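The decomposition into connected components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge list format `(i, k, weight)` and the function name are assumptions, with i a position in G_{1} and k a position in G_{2}.

```python
from collections import defaultdict


def connected_components(edges):
    """Group the genes of a bipartite similarity graph into components.

    edges: iterable of (i, k, weight); only edges with weight > 0 count.
    Returns a list of pairs (C1, C2) holding the positions of each
    component in G1 and G2. Genes without any edge never appear,
    which mirrors the omission of unconnected genes.
    """
    neighbours = defaultdict(set)
    for i, k, weight in edges:
        if weight > 0:
            neighbours[(1, i)].add((2, k))
            neighbours[(2, k)].add((1, i))

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        parts = {1: set(), 2: set()}
        while stack:  # depth-first traversal of one component
            genome, pos = stack.pop()
            parts[genome].add(pos)
            for nxt in neighbours[(genome, pos)]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append((parts[1], parts[2]))
    return components


example = [(1, 1, 0.9), (2, 1, 0.4), (3, 2, 1.0)]
print(connected_components(example))
```

In the example, genes 1 and 2 of G_{1} share a component with gene 1 of G_{2}, while gene 3 of G_{1} and gene 2 of G_{2} form a component of size two.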
Matchings. Let us assume for now that a matching $\mathcal{M}$ between G_{1} and G_{2} is given. $\#\text{edg}(\mathcal{M})$ denotes the number of edges in $\mathcal{M}$. We call a gene saturated if it is incident to an edge of the matching. A pair of genes (G_{x}[i], G_{x}[j]), with x ∈ {1,2} and 0 ≤ i < j ≤ n_{x}, is a consecutive pair if no saturated gene lies between them.
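Genes G_{1}[i] and G_{1}[j] forming a consecutive pair in G_{1} and genes G_{2}[k] and G_{2}[ℓ] forming a consecutive pair in G_{2}, matched by the edges e_{ik} and e_{jℓ} of $\mathcal{M}$, form a conserved adjacency if one of the following two conditions holds:
 1. for k < ℓ, sgn(G_{1}[i]) = sgn(G_{2}[k]) and sgn(G_{1}[j]) = sgn(G_{2}[ℓ]), or
 2. for k > ℓ, sgn(G_{1}[i]) ≠ sgn(G_{2}[k]) and sgn(G_{1}[j]) ≠ sgn(G_{2}[ℓ]).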
For example, in Figure 1(b) the consecutive gene pairs (2, −3) and (6, −7) represent a conserved adjacency. Telomeres located at the first and last position of the chromosomes are "unsigned" and thus can be used to form adjacencies in both directions. We denote the number of all conserved adjacencies in a matching $\mathcal{M}$ by $\#\text{adj}(\mathcal{M})$.
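Counting conserved adjacencies for a given matching can be sketched as below. The representation is an assumption made for illustration: each genome is a list of signs indexed by position (+1 or −1 for genes, 0 for the unsigned telomeres at the first and last position), the matching is a dictionary from positions in G_{1} to positions in G_{2}, and telomeres are assumed to be matched only to telomeres.

```python
def count_conserved_adjacencies(signs1, signs2, matching):
    """Count conserved adjacencies induced by a one-to-one matching.

    signs1, signs2: per-position signs (+1, -1, or 0 for telomeres).
    matching: dict mapping saturated positions of G1 to positions of G2.
    """
    sat1 = sorted(matching)            # saturated positions in G1
    sat2 = sorted(matching.values())   # saturated positions in G2
    rank2 = {k: r for r, k in enumerate(sat2)}

    def sign_ok(s1, s2, same_direction):
        if s1 == 0 or s2 == 0:         # telomeres are unsigned
            return True
        return s1 == s2 if same_direction else s1 != s2

    count = 0
    for i, j in zip(sat1, sat1[1:]):   # consecutive pairs in G1
        k, l = matching[i], matching[j]
        if abs(rank2[k] - rank2[l]) != 1:
            continue                   # images are not consecutive in G2
        same = k < l                   # k < l: same order, else reversed
        if sign_ok(signs1[i], signs2[k], same) and sign_ok(signs1[j], signs2[l], same):
            count += 1
    return count


# G2 equals G1 except that genes 2 and 3 are inverted.
signs1 = [0, 1, 1, 1, 1, 0]
signs2 = [0, 1, -1, -1, 1, 0]
matching = {0: 0, 1: 1, 2: 3, 3: 2, 4: 4, 5: 5}
print(count_conserved_adjacencies(signs1, signs2, matching))
```

In this example the two adjacencies at the borders of the inverted segment are lost, so three of the five adjacencies remain conserved.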
Notice that the edge weights in the sum of Equation 3 are squared to match the dimension of Equation 2. Optimizing a matching with respect to $\text{edg}(\mathcal{M})$ alone results in a maximum weighted matching in the graph model introduced above. As our overall objective function we propose a linear combination of Equations 2 and 3, in which the user balances the two quantities by a parameter α. Moreover, it is reasonable to add the constraint that at least one edge per connected component of the bipartite graph between G_{1} and G_{2} must be contained in the matching; a matching satisfying this constraint is called an intermediate matching.
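The role of α can be illustrated with a small sketch that follows the description of the objective given for the exact algorithm: the summed adjacency scores are weighted by α and the summed squared edge weights by (1 − α). The function name and the plain-list inputs are hypothetical; the adjacency scores are passed in by the caller because their definition is not restated here.

```python
def objective(adjacency_scores, edge_weights, alpha):
    """Value of the linear combination for one matching.

    adjacency_scores: scores of the conserved adjacencies of the matching.
    edge_weights: similarity values of the matching edges.
    alpha: trade-off parameter in ]0, 1].
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in ]0, 1]")
    adj = sum(adjacency_scores)
    edg = sum(w * w for w in edge_weights)  # squared edge weights
    return alpha * adj + (1 - alpha) * edg


# One conserved adjacency with score 0.81 and two edges of weight 0.9.
print(objective([0.81], [0.9, 0.9], alpha=0.5))
```

With α = 1 only the adjacencies contribute, whereas values of α close to 0 let the squared edge weights dominate.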
Problem FFAdjacencies generalizes two problems that were already addressed by Tang and Moret [12] and Angibaud et al. [13]. To see this, consider the conditions that hold if gene families are given: in the bipartite graph B = (G_{1}, G_{2}, E) between two genomes G_{1} and G_{2}, all edges have weight 1 and all connected components are cliques. Then finding a solution to Problem FFAdjacencies with α = 1 is equivalent to finding a matching that maximizes the number of adjacencies between two genomes with duplicate genes under the intermediate model [13]. If α is close enough to 0, we obtain a maximum matching that, among all maximum matchings, maximizes the number of adjacencies [12]. The case where the family conditions are met also reveals the difference between an arbitrary maximum matching and the maximum matching found by solving Problem FFAdjacencies for α → 0.
Since the two problems presented above are already NP-hard, Problem FFAdjacencies is NP-hard as well. In the next two subsections we first propose an exact algorithm, FFAdjInt, to solve Problem FFAdjacencies and then a fast heuristic approach.
Exact algorithm
Our algorithm FFAdjInt for solving Problem FFAdjacencies is based on previous work in [13]. The idea is to translate the problem into a 0-1 linear program. That is, we define a set of constraints (linear inequalities) whose variables are Boolean, together with an objective function (the maximization or minimization of a linear formula). We then use a solver to assign a value to each variable such that the constraints are satisfied and the objective is optimized.
Variables:

♦ Variables a(i, k), 0 ≤ i ≤ n_{1} and 0 ≤ k ≤ n_{2}, define a matching $\mathcal{M}$: a(i, k) = 1 if and only if the gene at position i in G_{1} is matched with the gene at position k in G_{2} in $\mathcal{M}$, i.e. e_{ik} ∈ $\mathcal{M}$.

♦ Variables b_{x}(i), x ∈ {1, 2} and 0 ≤ i ≤ n_{x}, represent the genes saturated by $\mathcal{M}$: b_{x}(i) = 1 if and only if the gene at position i in G_{x} is saturated by the matching $\mathcal{M}$. Clearly, Σ_{0 ≤ i ≤ n_{1}} b_{1}(i) = Σ_{0 ≤ k ≤ n_{2}} b_{2}(k), and this is precisely the size of the matching $\mathcal{M}$.

♦ Variables c_{x}(i, j), x ∈ {1, 2} and 0 ≤ i < j ≤ n_{x}, represent consecutive pairs according to the matching $\mathcal{M}$: c_{x}(i, j) = 1 if and only if the genes at positions i, j in G_{x} are saturated by $\mathcal{M}$ and no gene at position p, i < p < j, is saturated by $\mathcal{M}$.

♦ Variables d(i, j, k, ℓ), 0 ≤ i < j ≤ n_{1}, 0 ≤ k, ℓ ≤ n_{2}, represent conserved adjacencies according to the matching $\mathcal{M}$: d(i, j, k, ℓ) = 1 if and only if s(i, j, k, ℓ) > 0.
Because the matching is possible only between similar genes, the variables a(i, k) and d(i, j, k, ℓ) are not defined whenever σ(G_{1}[i], G_{2}[k]) = 0. Similarly, the variables d(i, j, k, ℓ) are not defined if σ(G_{1}[j],G_{2}[ℓ]) = 0.
Objective:
The goal of FFAdjInt is to find a matching $\mathcal{M}$ between the two considered genomes that maximizes $\mathcal{F}_{\alpha}$ (α ∈ ]0, 1]). Hence, the objective of FFAdjInt is to maximize the sum of all variables d(i, j, k, ℓ) multiplied by α · s(i, j, k, ℓ), plus the sum of all variables a(i, k) multiplied by (1 − α) · σ(i, k)^{2}.
Constraints:
Assume x ∈ {1, 2}, 0 ≤ i <j ≤ n_{1} and 0 ≤ k, ℓ ≤ n_{2}.

♦ Constraints in (C.01) ensure that each gene of G_{1} and of G_{2} is saturated at most once, i.e. b_{1}(i) = 1 (resp. b_{2}(k) = 1) if and only if there exists a unique k (resp. i) such that a(i, k) = 1, i.e. e_{ik} ∈ $\mathcal{M}$.

♦ Constraints in (C.02) ensure that the matching $\mathcal{M}$ is an intermediate matching, i.e. that each component contributes at least one edge to $\mathcal{M}$. For each component $C\in \mathcal{C}$, the sum of the variables b_{x}(i) for i ∈ C_{x} must be greater than or equal to 1.

♦ Constraints in (C.03) and (C.04) express the definition of consecutive pairs, thus fixing the values of the variables c_{x}. The variable c_{x}(i, j) (0 ≤ i < j ≤ n_{x}) is equal to 1 if and only if there exists no p such that i < p < j and b_{x}(p) = 1. It is worth noticing that the constraints do not force the variables c_{x}(i, j) to have exactly the values we intuitively wish according to the above-mentioned interpretation. Here, we accept that c_{x}(i, j) = 1 even if the gene at position i or j is not saturated. However, this will pose no problem in the sequel.

♦ Constraints in (C.05) and (C.06) define the variables d. Knowing that the variables d(i, j, k, ℓ) are defined only if σ(i, k) > 0 and σ(j, ℓ) > 0, constraints (C.05) and (C.06) ensure that d(i, j, k, ℓ) = 1 if and only if all variables a(i, k), a(j, ℓ), c_{1}(i, j) and c_{2}(k, ℓ) are equal to 1 and the signs and the order of G_{1}[i], G_{1}[j], G_{2}[k] and G_{2}[ℓ] are consistent with the definition of conserved adjacencies.
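A conjunction of binary variables such as the one described for d can be expressed by linear inequalities. The following is one standard linearization, given as an illustrative assumption; it need not coincide with the actual constraints (C.05) and (C.06) of FFAdjInt. Here c_{2}(k, ℓ) stands for the consecutive-pair variable of positions k and ℓ in G_{2}, in whichever order they occur.

```latex
\begin{align}
d(i,j,k,\ell) &\le a(i,k), & d(i,j,k,\ell) &\le a(j,\ell), \\
d(i,j,k,\ell) &\le c_{1}(i,j), & d(i,j,k,\ell) &\le c_{2}(k,\ell), \\
d(i,j,k,\ell) &\ge a(i,k) + a(j,\ell) + c_{1}(i,j) + c_{2}(k,\ell) - 3 .
\end{align}
```

Because each d(i, j, k, ℓ) enters a maximization objective with a non-negative coefficient, the four upper bounds alone would already suffice in this sketch; the lower bound only makes the equivalence explicit. Quadruples whose signs or order violate the definition of conserved adjacencies would simply receive no variable.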
The program FFAdjInt has O((n_{1}n_{2})^{2}) constraints and O((n_{1}n_{2})^{2}) variables, which could result in a time-consuming computation.
So far we have used only one simple rule in order to reduce the space complexity: By the definition of the intermediate model, for all components with only two genes, G_{1}[i] and G_{2}[k], the edge e_{ik} is in $\mathcal{M}$. By the constraints (C.01) and (C.03), we already enforce that the variables a(i, k), b_{1}(i) and b_{2}(k) are equal to 1. The rule is based on the fact that there is no possible consecutivity in $\mathcal{M}$ between G_{1}[s] and G_{1}[t] (resp. G_{2}[s] and G_{2}[t]) such that 0 ≤ s < i < t ≤ n_{1} (resp. 0 ≤ s < k < t ≤ n_{2}), i.e. c_{1}(s, t) (resp. c_{2}(s, t)) is equal to 0. The corresponding variables d(s, t, ., .) (resp. d(., ., s, t) and d(., ., t, s)) are also equal to 0.
Heuristic
Because of the combinatorial explosion, FFAdjInt cannot solve Problem FFAdjacencies for all pairs of complete, larger genomes. However, we will see in the "Experiments and Discussion" section that FFAdjInt yields enough results to evaluate the heuristic presented in this section. It is based on ideas similar to those of the heuristic IILCS in [13]. IILCS computes the number of adjacencies between two genomes when gene families are known, under three models: exemplar (only one match per gene family), intermediate, and maximum. IILCS solves Problem FFAdjacencies in the particular case where α = 1 and each component represents a gene family, i.e. each component is a clique in which every edge has weight 1.
The heuristic IILCS is a greedy algorithm based on the notion of a Longest Common Substring (LCS): given two genomes G_{1} and G_{2}, an LCS is a longest string S such that S is a (consecutive) substring of both G_{1} and G_{2}, up to a complete reversal (opposite signs and reverse order). The idea is to match, in each iteration, all the genes that are in an LCS. If there are several LCSs, one is chosen arbitrarily. In each iteration, we not only match an LCS but also remove from the genomes each unmatched gene for which there is no unmatched gene of the same component in the other genome. The process (determination of an LCS, matching, and deletion of genes) is iterated until a satisfying matching is obtained. Under the intermediate model, the iteration stops when there is at least one edge in $\mathcal{M}$ for each component.
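The search for a longest common substring up to a complete reversal, the core step of each iteration, can be sketched as follows. The sketch assumes the gene family setting, with genomes given as lists of signed family identifiers; it uses a simple quadratic dynamic program and is not the implementation of [13]. All function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of a longest common run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def lcs_up_to_reversal(g1, g2):
    """Longest substring of g1 occurring in g2 directly or fully reversed."""
    forward = longest_common_substring(g1, g2)
    # A complete reversal flips both the order and the signs.
    reversed_g2 = [-g for g in reversed(g2)]
    backward = longest_common_substring(g1, reversed_g2)
    if backward[0] > forward[0]:
        length, start, _ = backward
        return g1[start:start + length], "reversed"
    length, start, _ = forward
    return g1[start:start + length], "forward"


print(lcs_up_to_reversal([1, 2, 3, 4, 5], [5, -4, -3, -2, 1]))
```

In the example the segment (2, 3, 4) occurs in the second genome only in reversed form, so it is reported together with the label "reversed". Ties between several longest substrings are broken by position here, which is one way of choosing arbitrarily.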
For Problem FFAdjacencies, we adapt the IILCS heuristic with three modifications. The goal of the first change is to take our objective into account in the choice of common substrings. In each iteration we match the common substring that locally maximizes $\mathcal{F}_{\alpha}$ (α ∈ ]0,1]), i.e. the sum of weights of adjacencies and edges. We call this common substring a Maximum Common Substring (MCS). The second modification is an improvement that may also be applied to the original IILCS heuristic: after the deletion of an unsaturated gene g_{1} for which there is no unmatched gene g_{2} with σ(g_{1}, g_{2}) > 0, we attempt to increase the size of each previously matched MCS by extending it at both ends. The third and last change is related to the model. We have two options to increase our objective. The first is to stop the iteration only when we have at least one edge per component and the size of the MCS of the current iteration is below 2. Under gene family constraints, this criterion also improves the results of IILCS. The second is to stop the iteration only when there are no more edges between unmatched genes. Compared to the first option, this increases our objective $\mathcal{F}_{\alpha}$ (α ∈ ]0,1]) only if α ≠ 1, and hence not in the context of IILCS. We choose the second option because it yields a larger objective value, but it is important to understand that it also increases the number of breakpoints. We call this heuristic FFAdjMCS.
Experiments and Discussion
Data
Genomic dataset.
Species/strain name                     Short name  Accession No.  Size (bp)  #Genes
Buchnera aphidicola APS                 BAPHI       NC_002528      640681     564
Escherichia coli K12                    ECOLI       NC_000913      4639675    4320
Haemophilus influenzae Rd               HAEIN       NC_000907      1830138    1657
Pseudomonas aeruginosa PAO1             PAERU       NC_002516      6264404    5571
Pasteurella multocida Pm70              PMULT       NC_002663      2257487    2012
Salmonella typhimurium LT2              SALTY       NC_003197      4857432    4423
Wigglesworthia glossinidia brevipalpis  WGLOS       NC_004344      697724     611
Xanthomonas axonopodis pv. citri 306    XAXON       NC_003919      5175554    4312
Xanthomonas campestris                  XCAMP       NC_003902      5076188    4179
Xylella fastidiosa 9a5c                 XFAST       NC_002488      2679306    2766
Yersinia pestis CO_92                   YPESTCO92   NC_003143      4653728    3885
Yersinia pestis KIM5 P12                YPESTKIM    NC_004088      4600755    4048
All genomes comprise a single, circular chromosome. In support of simplified code but at the expense of accuracy, our implemented algorithms do not allow a chromosome to be circular, even though this is permitted by our presented model. However, the maximal error made by this inaccuracy in comparing two genomes is at most one adjacency, which is negligible in our analysis. The genomes were linearized in the order inherent to the NCBI data, and telomeres were added at the beginning and at the end of the resulting chromosomal sequences.
Pairwise normalized similarities were obtained using the relative reciprocal BLAST score (RRBS) [17]. Genes were compared on the basis of their encoded protein sequences using BLASTP with an e-value threshold of 0.1, disabled query sequence filtering, and disabled composition-based score adjustments. All computations were performed on a computer system with 32 gigabytes of main memory.
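The normalization can be illustrated with a short sketch. It assumes the usual form of the RRBS following [17], namely the sum of the two reciprocal BLAST scores divided by the sum of the two self-scores; whether any further processing was applied in this study is not stated here. The dictionary layout and the function name are hypothetical.

```python
def rrbs(scores, a, b):
    """Relative reciprocal BLAST score of genes a and b.

    scores: dict mapping (query, subject) to a BLAST score;
    missing pairs are treated as "no hit" (score 0).
    """
    self_total = scores.get((a, a), 0.0) + scores.get((b, b), 0.0)
    if self_total == 0:
        return 0.0
    cross_total = scores.get((a, b), 0.0) + scores.get((b, a), 0.0)
    # Cap at 1 so the value is a normalized similarity in [0, 1].
    return min(1.0, cross_total / self_total)


scores = {("a", "a"): 200, ("b", "b"): 100, ("a", "b"): 60, ("b", "a"): 60}
print(rrbs(scores, "a", "b"))
```

Gene pairs without a reciprocal hit receive similarity 0 under this sketch and would therefore not be connected in the similarity graph.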
Exact Algorithm vs Heuristic
Exact algorithm vs heuristic.
        Relative deviation                                                   RF distance
α       $\mathcal{F}_{\alpha}(\mathcal{M})$  #adj    #edg    #exact results  exact  heuristic
0.001   2.67%                                2.83%   0.23%   43              2      2
0.3     3.47%                                0.90%   0.31%   63              2      2
0.5     4.26%                                1.03%   0.84%   61              2      2
0.8     6.34%                                1.71%   1.14%   54              4      2
1       8.41%                                2.39%   17.7%   48              6      2
Evaluating phylogenies
A good indicator for the accuracy of a genome-based distance measure is the quality of the phylogenetic tree constructed from its distances.
Often, one cannot judge the tree-additivity of the underlying distances by investigating the fully resolved Neighbor-Joining tree. Thus, in Figure 4 we provide a Neighbor-Net [20] representation of some of our obtained phylogenies. In the plots, the internal edges that are hard to reconstruct are directly exposed, showing network-like rather than tree-like structures, in particular for the tree obtained from [13]. To conduct these phylogenetic analyses, we used the software packages PHYLIP [21] and SPLITSTREE [22].
Conclusions
In this work, we introduced the concept of comparative genomics by direct analysis of gene similarities without prior assignment of gene families. To illustrate this approach, we resorted specifically to one problem of gene order comparison: finding a matching that identifies similarities between two genomes by maximizing conserved adjacencies and similarities for each pair of genes simultaneously. This problem is NP-hard. We proposed to solve it with an exact algorithm (efficient for small genomes) and a good heuristic. In our experiments on 12 γ-proteobacterial genomes, we observed that the omission of gene families allowed for an increase in the number of adjacencies as well as in the size of the matching, while the resulting distances gain higher precision in reconstructing phylogenies.
Future work. This study is a preliminary work in a new field of comparative genomics in which the assignment of gene families is unnecessary. Many directions remain to be explored. With regard to the specific problem studied here, our exact algorithm can be improved by rules that reduce the required main memory. Moreover, we believe that a hybrid heuristic, which starts a pre-matching using the iterative heuristic until the size of the MCS is less than a parameter k and then finishes the matching with our exact algorithm, could find near-exact results for even larger genomes. On the other hand, a deeper study of the measure σ could increase the quality of the comparison; comparing genes by sequence similarity is only one of many methods that can be applied.
From a more general point of view, this study shows that it is conceivable to extend the direct analysis approach to other types of gene order studies such as the computation of DCJ distances or gene cluster prediction.
Declarations
Acknowledgements
DD receives a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. AT is a research fellow of the Alexander von Humboldt Foundation.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 19, 2012: Proceedings of the Tenth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S19
References
 1. Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D: Cactus: Algorithms for genome multiple sequence alignment. Genome Research. 2011, 21 (9): 1512-1528. 10.1101/gr.123356.111.
 2. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41. 10.1186/1471-2105-4-41.
 3. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, von Mering C, Bork P: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2011, 40 (D1): D284-D289.
 4. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314 (5): 1041-1052. 10.1006/jmbi.2000.5197.
 5. Li H: TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006, 34 (90001): D572-D580. 10.1093/nar/gkj118.
 6. Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R: OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics. 2006, 22 (6): 699-707. 10.1093/bioinformatics/btk040.
 7. Fu Z, Jiang T: Clustering of main orthologs for multiple genomes. J Bioinform Comput Biol. 2007, 6: 195-201.
 8. Watterson G, Ewens W, Hall T, Morgan A: The Chromosome Inversion Problem. J Theor Biol. 1982, 99: 1-7. 10.1016/0022-5193(82)90384-8.
 9. Stoye J: Computation of Median Gene Clusters. J Comput Biol. 2009, 16 (8): 1085-1099. 10.1089/cmb.2009.0098.
 10. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 3340-3346. 10.1093/bioinformatics/bti535.
 11. Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Proc of WABI. 2006, 163-173.
 12. Tang J, Moret BME: Phylogenetic Reconstruction from Gene-Rearrangement Data with Unequal Gene Content. Proc of WADS. 2003, 37-46.
 13. Angibaud S, Fertin G, Rusu I, Thévenin A, Vialette S: Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes. J Comput Biol. 2008, 15 (8): 1093-1115. 10.1089/cmb.2008.0061.
 14. Lerat E, Daubin V, Moran NA: From Gene Trees to Organismal Phylogeny in Prokaryotes: The Case of the γ-Proteobacteria. PLoS Biology. 2003, 1: e9.
 15. Blin G, Chauve C, Fertin G: Genes Order and Phylogenetic Reconstruction: Application to γ-Proteobacteria. Proc of RECOMB-CG. 2005, 11-20.
 16. Williams KP, Gillespie JJ, Sobral BWS, Nordberg EK, Snyder EE, Shallom JM, Dickerman AW: Phylogeny of Gammaproteobacteria. J Bacteriol. 2010, 192 (9): 2305-2314. 10.1128/JB.01480-09.
 17. Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 2008, 9 (Suppl 5): S4. 10.1186/1471-2105-9-S5-S4.
 18. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
 19. Robinson D, Foulds L: Comparison of Phylogenetic Trees. Math Biosci. 1981, 53: 131-147. 10.1016/0025-5564(81)90043-2.
 20. Bryant D: Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Mol Biol Evol. 2003, 21 (2): 255-265. 10.1093/molbev/msh018.
 21. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.
 22. Huson DH: Application of Phylogenetic Networks in Evolutionary Studies. Mol Biol Evol. 2005, 23 (2): 254-267. 10.1093/molbev/msj030.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.