On the distribution of cycles and paths in multichromosomal breakpoint graphs and the expected value of rearrangement distance
- Pedro Feijão^{1, 2}Email author,
- Fábio Viduani Martinez^{1, 2, 3} and
- Annelyse Thévenin^{1, 2}
https://doi.org/10.1186/1471-2105-16-S19-S1
© Feijão et al. 2015
Published: 16 December 2015
Abstract
Finding the smallest sequence of operations to transform one genome into another is an important problem in comparative genomics. The breakpoint graph is a discrete structure that has proven to be effective in solving distance problems, and the number of cycles in a cycle decomposition of this graph is one of the remarkable parameters to help in the solution of related problems. For a fixed k, the number of linear unichromosomal genomes (signed or unsigned) with n elements such that the induced breakpoint graphs have k disjoint cycles, known as the Hultman number, has been already determined. In this work we extend these results to multichromosomal genomes, providing formulas to compute the number of multichromosal genomes having a fixed number of cycles and/or paths. We obtain an explicit formula for circular multichromosomal genomes and recurrences for general multichromosomal genomes, and discuss how these series can be used to calculate the distribution and expected value of the rearrangement distance between random genomes.
Keywords
Background
In molecular biology and genetics, comparative genomics is a discipline interested in the comparison of genomic attributes of different organisms. These attributes may encompass DNA sequences, gene content, gene order, regulatory sequences, and other structural features. Several measures have been proposed to compute the (dis)similarity between genomes. The field called genome rearrangements is concerned with measures of dissimilarity involving large-scale mutations, such as reversals and transpositions, where a fundamental problem is to determine the smallest sequence of such rearrangement operations that transforms one given genome into another. This minimum number of operations is called the rearrangement distance between the two given genomes. These and other aspects of genome rearrangements are discussed in detail by Fertin et al. [1].
A remarkable characteristic of methods to compute distances is the systematic use of a graph, first introduced by Bafna and Pevzner [2], known as the breakpoint graph. It has proven, by its decomposition into disjoint cycles, a useful tool to efficiently compute rearrangement distances such as transposition or reversal, directly related to the number of cycles in this decomposition [1].
Since cycle decomposition of breakpoint graphs plays a central role in computing distances, it is useful to investigate the distribution of such cycles. Particularly, the distribution of genomes with a number of cycles c allows us to evaluate the probability to have a scenario of a distance d depending of c. Doignon and Labarre [3] enumerated the unsigned permutations of a given size such that the corresponding graph has a given number of cycles, and called it the Hultman number. Subsequently, Grusea and Labarre [4] extended this result for signed permutations, where the signs model gene orientation.
In this work we extend previous results providing formulas to compute the number of multichromosomal genomes with a given number of cycles and/or paths. We obtain an explicit formula for circular genomes and recurrences for more general cases.
Our paper is organized as follows. In the Preliminaries section we give some definitions and notations. The results for circular and general multichromosomal genomes are presented in the next section, called The Multichromosomal Hultman Number. The following section presents some discussion about the distribution of the rearrangement distance, derived from the multichromosomal Hultman numbers, and the Conclusion section presents final remarks and perspectives.
Preliminaries
We represent multichromosomal genomes using a similar notation as in [5]. A gene is a fragment of DNA on one of the two DNA strands in a chromosome, showing its orientation. A gene is represented by an integer and its orientation by a sign. The orientation of a gene g allows us to distinguish its two extremities, the tail (g^{ t }) and the head (g^{ h }). A chromosome is represented by a sequence of genes, flanked in the extremities by telomeres (∘) if the chromosome is linear; otherwise, it is circular. Genomes are represented as sets of chromosomes. An adjacency in a genome is either a pair of consecutive gene extremities in a chromosome, or a gene extremity adjacent to a telomere (a telomeric adjacency ). For instance, A = {(∘ 1 2 3 4 ∘)} is a genome with one linear chromosome and four genes, and has the adjacencies ∘1^{ t }, 1^{ h }2^{ t }, 2^{ h }3^{ t }, 3^{ h }4^{ t } and 4^{ h }∘, where the first and the last are telomeric adjacencies.
There is a one-to-one correspondence between genomes and matchings in the set of extremities. Adjacencies correspond to two matched (saturated) vertices, and telomeric adjacencies correspond to unmatched (unsaturated) vertices. Therefore, a perfect matching (i.e., matching which saturates all vertices of the graph) corresponds to a genome with only circular chromosomes. The matching corresponding to a genome A is denoted by M_{ A } . Because of this one-to-one relationship, in this text we use the terms genome and matching interchangeably.
The multichromosomal Hultman number
In this section, we extend the results of [3, 4] for multichromosomal genomes. There are two new aspects that must be considered. First, since the breakpoint graph can be decomposed in cycles and paths, we may have to count not only cycles, but also paths. The other question is about the identity genome. In the unichromosomal case, the identity genome is easily defined. In the multichromosomal case, it is not obvious which given genome is the identity. When working on multichromosomal circular genomes, the identity is defined as in the unichromosomal case. In the general case, working on genomes with linear and circular chromosomes, we analyze two types of identities for genomes: one with only one set of circular chromosomes and another with a set of circular chromosomes and a set of linear chromosomes.
In the next sections, we propose extensions of the Hultman number for multichromosomal genomes, first considering only circular genomes, and then extending the results to general genomes, with linear and circular chromosomes. The same strategy is used in all cases: first, start with a matching representing the identity, and then superimpose all other possible matchings, while counting recursively cycles and paths. To do that, we need to consider all possible operations to build such matchings. In Figure 4, all such operations are shown.
Multichromosomal circular genomes
A circular genome is a genome where all chromosomes are circular. Since there are no telomeric adjacencies, the matching M_{ A } of a circular genome A is a perfect matching on the extremities of A. Moreover, the breakpoint graph of two circular genomes is decomposed in disjoint alternating cycles, since each vertex has degree two.
where ${\mathcal{C}}_{n}$ is the set of all circular multichromosomal genomes with n genes and cyc(G) denotes the number of cycles in a graph G.
Starting with a perfect matching ${M}_{{I}_{\circ}}$ of the 2n vertices, we build all breakpoint graphs BG(A, I_{ ∘ }), for circular genomes A, which correspond to perfect matchings, adding one edge at a time, while counting the number of cycles, recursively.
(a) Create Cycle: If u and v belong to the same component, the edge e = (u, v) will create a cycle. There is only one possibility for this type of edge.
(b) Merge Paths: If u and v belong to different components, uv will merge both paths. There are 2n - 2 possibilities of adding such an edge.
The following result states an explicit formula to H_{ C } (n, c).
where Q = q_{2} + ... + q_{ k } and ${\sum}_{2}^{n-c}m{q}_{m}=n-c$ is a sum over all partitions of n - c.
Proof We know from [6] that unsigned Stirling numbers of first kind satisfy the following recurrence equation: $\left[\begin{array}{c}\hfill n\hfill \\ \hfill c\hfill \end{array}\right]=\left[\begin{array}{c}\hfill n-1\hfill \\ \hfill c-1\hfill \end{array}\right]+\left(n-1\right)\left[\begin{array}{c}\hfill n-1\hfill \\ \hfill c\hfill \end{array}\right]$. Multiplying both sides by 2^{ n-c } and using H_{ C } (n, c) recurrence equation we arrive at ${H}_{C}\left(n,\phantom{\rule{2.36043pt}{0ex}}c\right)={2}^{n-c}\left[\begin{array}{c}\hfill n\hfill \\ \hfill c\hfill \end{array}\right]$. Then, using the explicit formula for $\left[\begin{array}{c}\hfill n\hfill \\ \hfill c\hfill \end{array}\right]$ given in [7], we arrive at our result. □
Furthermore, the sequence of integers generated by H_{ C } (n, c) is the unsigned entry A039683 in the OEIS (On-Line Encyclopedia of Integer Sequences) [8].
General multichromosomal genomes
where ${\mathcal{G}}_{n}$ is the set of all multichromosomal genomes with n genes, and pt(·) denotes the number of paths in a graph. In this set, each genome corresponds to a matching, not necessarily perfect, since only circular genomes correspond to perfect matchings. Similarly as the previous case, we start with the matching ${M}_{{I}_{\circ}}$ on 2n vertices, and recursively build all possible matchings, while counting cycles and paths. Since a matching induced by an arbitrary genome A in ${\mathcal{G}}_{n}$ is not necessarily perfect, together with the create cycle and merge paths operations on a vertex u, we can also choose to not saturate a vertex u in the matching being built, thus creating a telomere, which we call a skip vertex operation.
Moreover, since we now have an operation that is applied on just one vertex, and not two at a time such as the operations presented in Section, we need to define a different recurrence, where n correspond to vertices in the breakpoint graph, and not to genes in the genomes. In a genome I_{∘} with n genes, there are 2n vertices (extremities) in ${M}_{{I}_{\circ}}$ and consequently in BG(A, I_{∘}). So, we need an auxiliary number ${H}_{G}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p\right)$, such that ${H}_{G}\left(n,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p\right)={H}_{G}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p\right)$, with e = 2n, and ${H}_{G}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p\right)\equiv \left|\left\{M\in {\mathcal{M}}_{e}:cyc\left(BG\left(M,\phantom{\rule{2.36043pt}{0ex}}{M}_{{I}_{\circ}}\right)\right)=c\phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.36043pt}{0ex}}pt\left(BG\left(M,\phantom{\rule{2.36043pt}{0ex}}{M}_{{I}_{\circ}}\right)\right)=p\right\}\right|$, where ${\mathcal{M}}_{e}$ is the set of all possible matchings on e vertices, and ${M}_{{I}_{\circ}}$ is a perfect matching with e/2 edges induced by I_{∘}.
(a) Create Cycle: There is one edge uv such that v(≠ u) is the unvisited vertex in the same component as u, and this edge (shown as a grey edge uv) will create a cycle. Vertices u and v are marked as visited (Figure 3(a)).
(b) Merge Paths: There are e - 2 edges uv such that v is an unvisited vertex in a different component as u, and this edge will merge these components, that are paths. Vertices u and v are marked as visited. (Figure 3(b)).
(c) Skip Vertex: Vertex u is not saturated; no edge is created and only u is marked as visited (Figure 3(c)).
If e is odd, it means that there is a vertex u that is connected to a visited vertex. For this vertex, there is no way to close a cycle, but the other two operations are possible:
(d) Merge Paths: There are e - 1 edges uv such that v is in a different component as u, merging these components. Vertices u and v are marked as visited (Figure 3(d)).
(e) Skip Vertex: Vertex u is not saturated; no edge is created, only u is marked as visited. A path where all vertices are visited is created (Figure 3(e)).
with (1) if any of e, c, p is negative, or e = 0 and any of c, p is positive; (2) if e = c = p = 0; (3) if e is even; and (4) if e is odd.
Multichromosomal genomes with a fixed number of linear chromosomes
In this section we generalize the previous approach for different identity genomes. Instead of fixing the identity as a circular genome, the identity I_{ ℓ } is a genome with a fixed number of ℓ linear chromosomes. As for the input genomes, first we consider all possible genomes, and in a second approach also fix the number of linear chromosomes.
Identity genome I_{ℓ} with ℓ linear chromosomes
(a) Create Cycle: There is one edge uv such that v(≠u) is an unvisited vertex in the same component as u, and this edge will create a cycle. Vertices u and v are marked as visited.
(b) Merge Paths: There are e - 2 - i edges uv such that v is saturated in I_{ i } and is in a different component as u, and uv will merge these components, that are paths. Vertices u and v are marked as visited.
(c) Skip Vertex: No edge is created and u is marked as visited.
(d) Connect with unsaturated: There are i possible edges from u to an unsaturated vertex v in I_{ i }. Vertices u and v are marked as visited.
Cases (a) and (b) visit two vertices that are saturated in I_{ i }, which means that the state remains balanced. Case (c) changes the state to unbalanced, since only one vertex is visited. Case (d) visits two vertices, but one is a unsaturated vertex in I_{ i }, which means that the parity of e + i changes and the state becomes unbalanced.
In the unbalanced state, focusing on a vertex u belonging to a component with all other vertices visited, there are three possibilities (Figure 4e-g):
(e) Merge Paths: There are e - 1 - i edges uv such that v is saturated in I_{ i } and is in a different component as u, and this edge will merge these components, that are paths. Vertices u and v are marked as visited.
(f) Skip Vertex: Vertex u is not saturated in M ; no edge is created and only u is marked as visited, and a path with all vertices visited is created.
(g) Connect with unsaturated: There are i possible edges from u to an unsaturated vertex v in I_{ i }. Vertices u and v are marked as visited, and a path with all vertices visited is created.
Cases (e), (f) and (g) are similar to cases (b), (c) and (d), respectively, which means that (e) keeps the state unbalanced, but (f) and (g) change it to balanced again. There are still two cases to consider, when e = i (Figure 4h, i).
(h) Connect two unsaturated: There are i - 1 possible edges from an unsaturated vertex u to an unsaturated vertex v in I_{ i }. Vertices u and v are marked as visited, and a path with all vertices visited is created.
(i) Skip Vertex: No edge is created and u is marked as visited. A path with all vertices visited is created.
For the base cases, as before when e = 0 we have ${H}_{L}^{\prime}\left(0,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i\right)=1$ if and only if c = p = i = 0, and ${H}_{L}^{\prime}\left(0,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i\right)=0$ if any of c, p, i is positive. Also, if any of e, c, p, i is negative, ${H}_{L}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i\right)=0$.
with (1) if any of e, c, p, i is negative, or e = 0 and any of c, p, i is positive; (2) if e = c = p = i = 0; (3) if e = i > 0, (4) if e + i is even, e > i, (5) if e + i is odd, e > i.
Identity genome ${I}_{{\ell}_{i}}$ and input genomes ${A}_{{\ell}_{a}}$ with ℓ_{i} and ℓ_{a} linear chromosomes
were ${\mathcal{G}}_{n,{\ell}_{a}}$ is the set of all multichromosomal genomes with n genes and exactly ℓ_{ a } linear chromosomes, and I_{ ℓ } is, as before, a genome with exactly ℓ linear chromosomes. By definition, we have that ${\sum}_{{\ell}_{a}=0}^{n}{H}_{\ell}\left(n,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}{\ell}_{i},\phantom{\rule{2.36043pt}{0ex}}{\ell}_{a}\right)={H}_{L}\left(n,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}{\ell}_{i}\right)$.
Again we define an auxiliary series, in this case ${H}_{\ell}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i,\phantom{\rule{2.36043pt}{0ex}}a\right)\equiv \left|\left\{M\in {\mathcal{M}}_{e,\phantom{\rule{2.36043pt}{0ex}}a}:cyc\left(BG\left(M,\phantom{\rule{2.36043pt}{0ex}}{M}_{e,\phantom{\rule{2.36043pt}{0ex}}i}\right)\right)=c\phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.36043pt}{0ex}}pt\left(BG\left(M,\phantom{\rule{2.36043pt}{0ex}}{M}_{e,\phantom{\rule{2.36043pt}{0ex}}i}\right)\right)=p\right\}\right|$, where ${\mathcal{M}}_{e,\phantom{\rule{2.36043pt}{0ex}}a}$ is the set of all possible matchings on e vertices that has exactly a unsaturated vertices, and ${M}_{{I}_{i}}$ is a matching on these vertices such that exactly i vertices are unsaturated. To build the breakpoint graph for this new series, we use exactly the same operations as in the previous, summarized in Figure 4. The only difference is that we have to track how many unsaturated vertices a the current matching being build has. The only operations that change this are the skip vertex operations (c), (i) and (f), decreasing a by 1. The other operations keep a the same, as they all create an edge and do not mark any vertex as unsaturated.
The base cases are also similar, only including a in the constraints. When e = 0 we have ${H}_{\ell}^{\prime}\left(0,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i,\phantom{\rule{2.36043pt}{0ex}}a\right)=1$ if and only if c = p = i = a = 0, and ${H}_{\ell}^{\prime}\left(0,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i,\phantom{\rule{2.36043pt}{0ex}}a\right)=0$ if any of c, p, i, a is positive. Also, if any of e, c, p, i, a is negative, ${H}_{\ell}^{\prime}\left(e,\phantom{\rule{2.36043pt}{0ex}}c,\phantom{\rule{2.36043pt}{0ex}}p,\phantom{\rule{2.36043pt}{0ex}}i,\phantom{\rule{2.36043pt}{0ex}}a\right)=0$.
with (1) if any of e, c, p, i is negative, or e = 0 and any of c, p, i is positive; (2) if e = c = p = i = 0; (3) if e = i > 0, (4) if e + i is even, e > i, (5) if e + i is odd, e > i.
Distribution of rearrangement distances
From the Hultman series that we introduced, it is possible to derive the distribution of rearrangement distances for each scenario.
and can therefore be calculated with the given recurrence equations. For instance, for n = 100 we have E[X_{100}] = 95.22. A closed formula for the expected value of a rearrangement distance, to the best of our knowledge, has only been found for the very simple breakpoint distance d_{ BP }, which counts how many adjacent genes in the identity are not adjacent in the other genome, and is given by $E\left[{d}_{BP}\left({A}_{n},\phantom{\rule{2.36043pt}{0ex}}{I}_{n}\right)\right]=n-\left(\frac{1}{2}+\frac{1}{2n}+O\left(\frac{1}{{n}^{2}}\right)\right)$ [12]. This converges to n - 1/2 when n goes to infinity, which is almost the diameter n for the breakpoint distance. Although we have no closed formula for E[X_{ n } ], we conjecture that it also converges to n - k for some constant k > 0, as n goes to infinity, and the experimental results point to k ≈ 5.
Conclusions
Summary of the results in this paper.
Hultman Number | Identity | Universe |
---|---|---|
$\mathcal{H}\left(n,\phantom{\rule{2.36043pt}{0ex}}k\right)$[3] | π = ⟨ ⋯ ⟩ | S_{ n } (unsigned permutations) |
${\mathcal{H}}^{\pm}\left(n,\phantom{\rule{2.36043pt}{0ex}}k\right)$[4] | π = ⟨ ⋯ ⟩ | ${S}_{n}^{\pm}\left(\mathsf{\text{signed}}\phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{permutations}}\right)$ |
H_{ C } (n, c) | Circular genome | Circular genomes |
H_{ G }(n, c, p) | Circular genome | General genomes |
H_{ L }(n, c, p, ℓ) | Genome with ℓ linear chr. | General genomes |
H_{ ℓ } (n, c, ℓ_{ i }, ℓ_{ a }) | Genome with ℓ_{ i } linear chr. | Genomes with ℓ_{ a } linear chr. |
For the Hultman number H_{ C } (n, c), in addition to the recursive equations we also provided an explicit formula, using the relationship between this series and the unsigned Stirling numbers of first kind. An interesting future direction is finding explicit formulas for the other proposed sequences H_{ G }(n, c, p) and H_{ L }(n, c, p, ℓ).
These equations might be useful for finding explicit equations for some of the numbers. We wrote a Python script with all recurrence relations proposed, and the above equations were useful to check the correctness of each series.
The Hultman number can also be used to find the expected value of the rearrangement distance between uniformly distributed genomes, in our case the algebraic distance between multichromosomal genomes. Future directions include finding explicit equations for the introduced recursive equations and the expected value of the rearrangement distance.
Declarations
Acknowledgements
FVM is funded from the Brazilian research agency CNPq grant Ciência sem Fronteiras Postdoctoral Scholarship 245267/2012-3. We acknowledge support of the publication fee by Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 19, 2015: Brazilian Symposium on Bioinformatics 2014. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S19
Authors’ Affiliations
References
- Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of Genome Rearrangements. 2009, MIT Press, Cambridge, MAView ArticleGoogle Scholar
- Bafna V, Pevzner PA: Genome rearrangements and sorting by reversals. SIAM Journal on Computing. 1996, 25 (2): 272-289.View ArticleGoogle Scholar
- Doignon J, Labarre A: On Hultman numbers. Journal of Integer Sequences. 2007, 10 (6): Article 07.6.2, 13 p.Google Scholar
- Grusea S, Labarre A: The distribution of cycles in breakpoint graphs of signed permutations. Discrete Applied Mathematics. 2013, 161 (10-11): 1448-1466.View ArticleGoogle Scholar
- Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Lecture Notes in Computer Science. 2006, 4175: 163-173.View ArticleGoogle Scholar
- Graham RL, Knuth DE, Patashnik O: Concrete Mathematics: A Foundation for Computer Science. 1994, Addison-Wesley, USAGoogle Scholar
- Malenfant J: Finite, closed-form expressions for the partition function and for Euler, Bernoulli, and Stirling numbers. ArXiv e-prints (2011). 1103.1585Google Scholar
- Sloane NJA: The On-Line Encyclopedia of Integer Sequences - OEIS. 2014, [http://oeis.org]Google Scholar
- Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 3340-6. doi:10.1093/bioinformatics/bti535View ArticlePubMedGoogle Scholar
- Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Lecture Notes in Computer Science. 2006, 4175: 163-173.View ArticleGoogle Scholar
- Feijao P, Meidanis J: Extending the algebraic formalism for genome rearrangements to include linear chromosomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013, 10 (4): 819-831. doi:10.1109/TCBB.2012.161View ArticlePubMedGoogle Scholar
- Xu W, Alain B, Sankoff D: Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases. Bioinformatics. 2008, 24 (16): 146-152. doi:10.1093/bioinformatics/btn295View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.