Multiple genome comparison based on overlap regions of pairwise local alignments
- Katharina Jahn^{1, 2}Email author,
- Henner Sudek^{2} and
- Jens Stoye^{2}
https://doi.org/10.1186/1471-2105-13-S19-S7
© Jahn et al.; licensee BioMed Central Ltd. 2012
Published: 19 December 2012
Abstract
Background
Mancheron, Uricaru and Rivals (Nucleic Acids Res. 39:e101, 2011) recently introduced a new approach in the context of multiple genome comparison that allows to detect regions of strong overlaps in a set of pairwise local alignments between several reference genomes and one target genome. Such overlap regions are an important source of information in genome annotation.
Results
In this paper we introduce a series of algorithms that improve over the approach of Mancheron et al., both in terms of computational complexity and in practical runtime. We also extend the problem definition such that overlaps to different reference genomes can be rated differently and regions overlapping only a subset of the reference genomes are detected.
Keywords
Background
Comparative approaches are an important source of information when it comes to the analysis of newly sequenced genomes. On the level of genes, the use of reciprocal BLAST hits is the most widely accepted approach suitable for tasks like gene annotation and the inference of homologies. However, it is a notoriously slow process, especially when it comes to all-against-all comparisons of several genomes, as commonly used in multiple genome comparisons. The alignment of whole genomes is known to be a computationally hard problem [1]. Heuristic approaches such as Shuffle-LAGAN [2] or MAUVE [3] involve a combination of chaining local alignments and inferring the history of large-scale genome rearrangement processes, which is an NP-hard problem of its own [1].
Recently, Mancheron et al. [4] pointed out that for many goals in multiple genome comparison, such as gene annotation, whole genome-based approaches as well as all-against-all BLAST comparisons are too involved. Instead they suggest to identify regions of strong overlaps in a set of pairwise local alignments between one target genome and any number of reference genomes. A strong overlap is a region that maps to at least one segment in every reference genome. If it is not contained in a bigger region that fulfills the same property, it is called a maximum common interval (MCI). In the same paper, Mancheron et al. present an algorithm to compute all MCIs for k genomes and a total number of n mappings in O(n log k) time and O(n) space. Note that the term "common interval" has been used previously in a different context, in order to describe a set of genes that occur as a consecutive block in two or more genomes [5, 6]. To avoid confusion of the two concepts, we use the term maximum overlapping interval (MOI) in this paper when referring to the concept of maximum common interval by Mancheron et al.
In this paper, we re-visit the above problem and introduce three new algorithms that improve the asymptotic time complexity as well as the practical performance over the approach of Mancheron et al. [4]. At first we present an algorithm that requires O(n) time and space. Then we show two variations of this basic algorithm: While the first modification reduces memory consumption but keeps the O(n) running time, the second one gives up on linear worst case runtime, but seems to have linear average runtime and is very fast in practice. We also introduce a generalization of the MOI problem, which includes the case where an MOI may map to only q out of k reference genomes. We show how the first two of our algorithms can be adapted to this variant.
Methods
Preliminaries
For the basic definitions of maximum overlapping intervals we follow closely the notation of Mancheron et al. [4]. Given a finite sequence S, let |S| denote its length. An interval I of a sequence S is a pair I = [s(I), e(I)] of start and end positions within S, i.e. 1 ≤ s(I) ≤ e(I) ≤ |S|. The length of I is defined as e(I) - s(I) + 1. The subset relation is defined as usual for intervals, i.e. I_{1} ⊆ I_{2} if and only if s(I_{2}) ≤ s(I_{1}) and e(I_{1}) ≤ e(I_{2}).
In the following a target genome is given as a sequence T together with k reference genomes G_{1}, . . ., G_{ k }. For each j, 1 ≤ j ≤ k, let ${C}_{j}=\left({I}_{1}^{j},{I}_{2}^{j},...,{I}_{{n}_{j}}^{j}\right)$ be the collection of base intervals representing mappings of G_{ j } to T.
An interval J is an overlapping interval of a set of collections C = {C_{1}, . . ., C_{ k }} if and only if there exists an interval ${I}_{i}^{j}$ in every C_{ j } with $J\subseteq {I}_{i}^{j}$. An overlapping interval J = [p, q] with p ≤ q is maximum if neither [p - 1, q] nor [p, q + 1] are overlapping intervals.
The computational problem we study in this paper is the following: given a set of collections of base intervals, find all its maximum overlapping intervals (MOIs).
Upper bound for the number of MOIs
Before presenting our algorithms, we derive a tight upper bound for the number of MOIs in a set of k collections with a total number of n base intervals.
As already shown in supplementary file 1 of [4], asymptotically there can only be O(n) MOIs: every MOI starts at the beginning of a base interval and no MOI can be included in another MOI, which means that no two MOIs can have the same starting point, so clearly there can not be more than n MOIs. In fact, the bound can be given more precisely:
Lemma 1. For k collections with a total number of n base intervals, there can be at most n-k + 1 maximum overlapping intervals. This is a tight bound.
Proof. Assume the MOIs to be ordered by their beginning from left to right. Clearly, the leftmost MOI must contain at least k base intervals. Moreover, moving to the right, the next MOI requires at least one new interval, otherwise it would be identical to the previous one. This argument can be repeated for all remaining MOIs. Since after the first MOI there are only n - k intervals left, the total number of MOIs can not be larger than 1 + n - k.
Algorithms for finding maximum overlapping intervals
In this section we present three new algorithms to find maximum overlapping intervals. Without loss of generality we assume that the set of base intervals is compact, meaning that at least one interval starts or ends at every position in T and thus l := |T | ∈ O(n). We also assume that the list of base intervals is already sorted by increasing start positions. This can be done in linear time using the radix sort algorithm.
Algorithm LinearMOI
The outline of our first algorithm, LinearMOI, is as follows: while going through the sorted list of base intervals, we track for each of the k collections the largest end point of any of its intervals processed so far. Once we have processed all intervals with the same start position, we test whether the smallest of the current end points, denoted as min, is smaller than the recent start position. If so, there is at least one collection that does not cover the start position, and there is no MOI. If min is greater than the recent start position, then the segment in between is covered by all collections, and is therefore part of an MOI. We only need to test if we have already used min as end point for an MOI starting further to the left. Otherwise the interval ranging from the recent start position to min is an MOI.
Obviously, min can change with each new base interval. To update it efficiently, we use a counter array c, indexed from 0 to l, where l is the length of the target genome. In this array we store at each position the number of collections that have their current end point at this position. Clearly, min equals the index of the smallest non-zero entry. The following observation helps us to track this position.
Observation 1. Values in c at indices smaller than the current leftmost non-zero entry do not change when a new base interval is considered.
Using Observation 1 we can find the current leftmost non-zero entry in c simply by starting from the former leftmost non-zero entry and going to the right until we encounter the first non-zero value. Although we may need several steps for one update of min, the total number of steps can not be larger than l, as min is never decreased. As all other operations for processing a base interval take constant time, and c and endPoint have at most n entries, we get a total runtime of O(l). By our assumption of compactness we thus have a runtime linear in the number of base intervals O(n).
Table 1
Algorithm 1 (LinearMOI) |
---|
Input: sorted list of all intervals interval[1..n]; number of collections k |
Variables: largest end point seen so far in each collection endPoint[1..k]; c[0..l] |
1: endPoint[1..k] ← 0 |
2: prevEnd ← 0 |
3: min ← 0 |
4: c[0] ← k; c[1..l] ← 0 |
5: for all $\left({I}_{i}^{j}=\left[start,\phantom{\rule{2.77695pt}{0ex}}end\right]\right)$ ∈ interval[1..n] do |
6: if end >endPoint[j] then |
7: c[endPoint[j]] ← c[endPoint[j]] - 1 |
8: c[end] ← c[end] + 1 |
9: endPoint[j] ← end |
10: end if |
11: if all intervals with recent start position processed then |
12: while c[min] = 0 do |
13: min ← min + 1 |
14: end while |
15: if prevEnd < min and min ≥ start then |
16: output MOI(start, min) |
17: prevEnd ← min |
18: end if |
19: end if |
20: end for |
Algorithm CircularMOI
The space needed to store the counter array c of algorithm LinearMOI may become problematic when large genomes are processed. Assuming the human genome as our target, an additional 4GB of memory would be required. However, we find that only a small part of c is informative in each step of the algorithm:
Observation 2. We only need to know if there is any non-zero entry in c left of the recent start position and not where or what this value is.
Observation 3. All positions in c that are further to the right of the recent start position than the length of the longest base interval have value 0.
Based on these observations we can replace array c by a circular memory structure that is just as long as the longest base interval as well as a single counter in which we store the sum of all values that would be on the left of our current position in the counter array.
To employ the new data structure, several modifications need to be made in algorithm LinearMOI. Obviously we need to use the modulo operation whenever accessing c. We also have to make sure that we do not update c at indices smaller than the recent start position as these positions are no longer represented in the array. For that purpose, we introduce a new variable open to count the number of collections that cover the recent start position.
To keep open updated, we change the role of variable min: when processing a new interval, we increase min until the recent start position is reached. Any non-zero value encountered in this process is subtracted from open and then set back to 0, so that the respective array positions can be re-used. We also need to update open whenever a collection gets a new endpoint and is not already counted in open.
Finally, we know that we can only have an MOI if open equals k. Therefore we add a test for this condition in line 11. Note that if this condition holds, we use min in its original role to find the endpoint of a potential MOI. The final test before reporting an MOI reduces to testing whether prevEnd is smaller than min. The complete pseudocode of this algorithm is shown in Algorithm 2 (CircularMOI).
In practice this approach is quite slow because of the extensive use of the modulo operation. However when extending the length of c to the next power of two, the modulo operation can be replaced by a very fast bitwise AND operation, and one gets about the same performance as in the first algorithm. This variant of algorithm CircularMOI was used in the benchmark tests described in the Results section.
Algorithm TestMOI
Table 2
Algorithm 2 (CircularMOI) |
---|
Input: sorted list of intervals interval [1 ..n ]; number of collections k ; length of the longest interval ℓ |
Variables: largest end point seen so far in each collection endPoint [1 ..k ]; c [0 ..l ] |
1: endPoint[1..k] ← 0 |
2: prevEnd ← 0 |
3: min ← 0 |
4: c[0..ℓ] ← 0 |
5: open ← 0 |
6: for all $\left({I}_{i}^{j}=\left[start,\phantom{\rule{2.77695pt}{0ex}}end\right]\right)$∈ interval[1..n] do |
7: while min < start do |
8: open ← open - c[min mod ℓ] |
9: c[min mod ℓ] ← 0 |
10: min ← min + 1 |
11: end while |
12: if end > endPoint[j] then |
13: if endPoint[j] ≥ min then |
14: c[endPoint[j] mod ℓ] ← c[endPoint[j] mod ℓ] - 1 |
15: else |
16: open ← open + 1 |
17: end if |
18: c[end mod ℓ] ← c[end mod ℓ] + 1 |
19: endPoint[j] ← end |
20: end if |
21: if all intervals with recent start position processed and open = k then |
22: while c[min mod ℓ] = 0 do |
23: min ← min + 1 |
24: end while |
25: if prevEnd < min then |
26: output MOI(start, min) |
27: prevEnd ← min |
28: end if |
29: end if |
30: end for |
We present a third algorithm, TestMOI, that works without a counter array. To find min, it just takes the minimum stored in endPoint. Finding this smallest value takes O(k) time and might be needed for every interval, so we get a total runtime of O(kn). Using the following observation as a runtime heuristic, we get back to near-linear performance.
Observation 4. The minimum value of a set of numbers can only increase if a value equal to the smallest value in the set is removed or increased.
Therefore a test for a new minimum needs to be performed only when the smallest of the current endpoints changes. This gives rise to the procedure shown in Algorithm 3 (TestMOI).
Table 3
Algorithm 3 (TestMOI) |
---|
Input: sorted list of all intervals interval [1 ..n ]; number of collections k |
Variables: largest end point seen so far in each collection endPoint [1 ..k ] |
1: endPoint[1..k] ← 0 |
2: prevEnd ← 0 |
3: min ← 0 |
4: newEndPoint ← false |
5: for $\left({I}_{i}^{j}=\left[start,\phantom{\rule{2.77695pt}{0ex}}end\right]\right)$∈ interval[1..n] do |
6: if end >endPoint[j] then |
7: if endPoint[j] = min then |
8: newEndPoint ← true |
9: end if |
10: endPoint[j] ← end |
11: end if |
12: if newEndPoint and all intervals with recent start position processed then |
13: min ← min_{i = 1..k}{endPoint[i]} |
14: newEndPoint ← false |
15: if prevEnd < min and min ≥ start then |
16: output MOI(start, min) |
17: prevEnd = min |
18: end if |
19: end if |
20: end for |
Weighted maximum overlapping intervals
We now introduce the concept of weighted maximum overlapping intervals. One constraint for MOIs was that there had to be at least one interval in every collection that covers the MOI. For weighted MOIs we replace this constraint by the following: we assign a positive integer weight to every collection, and require the sum of the weights of all collections covering a weighted MOI to be at least as high as a predefined threshold w. The formal definitions are as follows:
Given a positive integer weight w, an interval J is a w-overlapping interval of a set of collections C = {C_{1}, . . ., C_{ k }} if and only if the sum of weights of all collections C_{ j } that contain an interval ${I}_{i}^{j}$ with $J\subseteq {I}_{i}^{j}$ is at least w.
A w-overlapping interval J = [p, q] with p ≤ q is maximum if neither [p - 1, q] nor [p, q + 1] are w-overlapping intervals.
Algorithm LinearWeightedMOI
This algorithm follows the same ideas as Algorithm 1 (LinearMOI) but we have to make several small adjustments. First we need to change the counter array such that it stores the weights of the collections. The second modification is a bit more complicated. In the unweighted case all values in c left of min are 0, which is not true in the weighted case. To deal with this, we use a similar technique as we already used in Algorithm 2 (CircularMOI). But instead of counting how many collections cover the recent start position, we count the sum of their weights in a variable named openWeight.
Table 4
Algorithm 4 (LinearWeightedMOI) |
---|
Input: sorted list of intervals interval [1 ..n ]; number of collections k ; weight of the collections weight [1 ..k ]; length l of the target genome; minimum weight a weighted MOI must have minWeight |
Variables: largest end point seen so far in each collection endPoint [1 ..k ]; c [0 ..l ] |
1: endPoint[1..k] ← 0 |
2: prevEnd ← 0 |
3: min ← 0 |
4: openWeight ← 0 |
5: c[0..l] ← 0 |
6: for all $\left({I}_{i}^{j}=\left[start,\phantom{\rule{2.77695pt}{0ex}}end\right]\right)$∈ interval[1..n] do |
7: if end >endPoint[j] then |
8: c[endPoint[j]] ← c[endPoint[j]] - weight[j] |
9: c[end] ← c[end] + weight[j] |
10: if endPoint[j] < min and end ≥ min then |
11: openWeight ← openWeight + weight[j] |
12: end if |
13: endPoint[j] = end |
14: end if |
15: if all intervals with recent start position processed then |
16: while openWeight-c[min] ≥ minWeight do |
17: openWeight ← openWeight - c[min] |
18: min ← min + 1 |
19: end while |
20: if prevEnd < min and min ≥ start then |
21: output MOI(start, min) |
22: prevEnd ← min |
23: end if |
24: end if |
25: end for |
Algorithm CircularWeightedMOI
In order to adapt Algorithm 4 (LinearWeightedMOI) to using the circular memory structure, we basically follow the same strategy as we did in the unweighted case. However there is no need to introduce an additional variable as was done for Algorithm 2. This is because we already use variable openWeight which takes over the function of open in Algorithm 2. Apart from this, there is only one non-trivial change in comparison to the unweighted case: a second condition in the if statement in line 12 to ensure that the endpoint of the current base interval is not smaller than the current min. This is because in the weighted case the current base interval can end to the left of the current location of min, and we have to make sure not to update c at indices smaller than min. The details are left to the reader.
Results and discussion
To analyze the practical run times of our algorithms, we implemented them in C++ and compared them to the original implementation of the approach by Mancheron et al. [4] which is part of the QOD-1.0.0 software package and available under the CeCILL software license.
All benchmarks were performed using an Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz processor with 4GB RAM running Linux 3.3.4. We used the gcc version 4.7 with compiler flags -Ofast and -march = nativeset. If no other values are given, the length of the target genome is 10 Mb and the length of the base intervals is normally distributed around a mean of μ = 2 Kb with a standard deviation of σ = 0.5μ. A total of 5 million intervals are distributed randomly into 200 collections. In order to compare the weighted and unweighted algorithms we set all weights to 1 and the threshold minWeight to k. Please note that the time for sorting the input is not included in any of the reported timings. In any case, sorting the input will virtually take the same time for all algorithms, even though QOD sorts every collection separately, while we sort all base intervals in a single list.
Conclusions
In this paper we studied an algorithmic problem that was recently introduced by Mancheron et al. [4] in the context of multiple genome comparison. The goal is to find regions of strong overlaps in a set of pairwise local similarities between several reference genomes and one target genome.
We have presented efficient algorithms to solve this problem, two of which have asymptotically optimal, linear runtime. The third one excelled in terms of practical performance. All three algorithms were shown to outperform the approach introduced by Mancheron et al. [4]. We have also generalized the problem such that segments in overlap regions can be scored differently based on the reference genome they originate from. We have shown how this problem can still be solved in linear time.
For further work it may be interesting to assign individual weights to the base intervals. However we would then have to consider also intervals that are nested into other intervals of lower weight and therefore lose the strict-linear ordering for processing the intervals. Hence, we can not expect that the algorithms we presented here will be easily adaptable to this problem and still run in linear time.
Declarations
Acknowledgements
Part of this work was funded by a postdoctoral fellowship of the German Academic Exchange Service (DAAD).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 19, 2012: Proceedings of the Tenth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S19
Authors’ Affiliations
References
- Pevzner PA: Computational Molecular Biology - An Algorithmic Approach. 2000, MIT PressGoogle Scholar
- Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics. 2003, 19: i54-i62. 10.1093/bioinformatics/btg1005.View ArticlePubMedGoogle Scholar
- Darling A, Mau B, Blattner F, Perna N: Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Genome Res. 2004, 14: 1394-1403. 10.1101/gr.2289704.PubMed CentralView ArticlePubMedGoogle Scholar
- Mancheron A, Uricaru R, Rivals E: An alternative approach to multiple genome comparison. Nucleic Acids Res. 2011, 39: e101-10.1093/nar/gkr177.PubMed CentralView ArticlePubMedGoogle Scholar
- Heber S, Mayr E, Stoye J: Common Intervals of Multiple Permutations. Algorithmica. 2011, 60: 175-206. 10.1007/s00453-009-9332-1.View ArticleGoogle Scholar
- Uno T, Yagiura M: Fast Algorithms to Enumerate All Common Intervals of Two Permutations. Algorithmica. 2000, 26: 290-309. 10.1007/s004539910014.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.