There are two different aspects of the comparison of a completely assembled genome G
1
with a genome in scaffold form G2. One is scaffold filling, which predicts where in G2 to locate potential genes that have not been identified in the sequence but are present in G1. The second is contig fusion, which suggests how to piece G2 contigs together to form chromosomes. In Figure 1, only scaffold filling is necessary for scenario (d) and only contig fusion is required for scenario (b). Scenario (c) requires both.
We have shown how to handle the contig fusion problem in previous publications on papaya [2] and on Drosophila[3], and this will be reviewed in a separate section below. In the present paper we design and analyze an efficient exact algorithm for scaffold filling that simultaneously carries out contig fusion. We use this algorithm to analyze real and simulated data.
Filling in scaffolds
When G2 is only partially sequenced, and is missing some orthologs with G1 (cases (c) and (d) in Figure 1), we cannot complete the breakpoint graph since the red edges cannot be drawn to the two vertices corresponding to each missing gene, though these vertices are present in the graph and are incident to blue edges. At the same time, although we can draw a red edge between the last gene in one contig of a scaffold and the first gene in the next contig, we know that in reality there may be genes in the unsequenced gap between the contigs, and that once these genes are identified, the red edge will have to be "cut" and replaced by two or more gene vertices and two or more other red edges.
Statement of the combinatorial optimization problem
G1 consists of χ chromosomes, each of which is an ordered set of signed genes.
A contig in G2 is an ordered set of one or more signed genes, each orthologous to a gene in G1. A scaffold in G2 is an ordered set of contigs. Then G2 consists of a number of scaffolds, each of which is an ordered set of genes interrupted occasionally by a gap.
Then with reference to Figure 1(c) and (1d), the problem becomes: Find an assignment of the missing genes to the gaps in the scaffolds or at the ends of the scaffolds of G2, thus transforming the scaffolds into contigs, such that the resulting set of contigs
is at a minimum rearrangement distance from G1.
Implicit in our definitions is that between every pair of successive contigs in a scaffold is a gap large enough to contain genes. Where this is not the case, we can simply create a larger contig by disregarding the gap and concatenating the contigs on either side. We also disregard contigs without genes, so that they too may be subsumed in a gap. Note these are basically terminological conventions, rather than restrictions on the data.
A polynomial-time algorithm
The exact, linear-time, algorithm we have devised completes the breakpoint graph, only partially determined by G1 and by the scaffolds of G2, by means of insertions of missing genes into the gaps of G2.
Terminology
We have hitherto used the term path only to refer to alternating-colour sequences of edges connecting some of the bivalent vertices in the breakpoint graph, with telomeres at either end, that are eventually turned into cycles by joining or collapsing these two telomeres. In what follows, however, a path more generally may be any such connected set of edges, with or without telomeres, and may consist of only one (blue) edge. Paths with two telomeres will be called complete paths.
A free end is a vertex in the graph that has no incident red edges, only a blue one.
Thus when we say that that G1 and the scaffolds of G2partially determine a breakpoint graph, we mean that there are paths not ending in two T vertices, but in at least one free end.
A half path is a path ending in one telomere and one free end. A pseudopath is a structure consisting of two half paths where the two telomeres are deemed to be adjacent, though not by means of a red or blue edge. Pseudopaths will sometimes be treated as if they were paths, with the two free ends being the free ends from the two constituent half paths.
Initially, a cuttable edge is a red edge drawn between vertices of two successive genes in a scaffold that are not in the same contig, i.e., there is an unsequenced gap between the genes. Subsequently, if a red edge is disrupted during gene insertion, new red edges are created as will be specified in the algorithm presented below.
A bundle is a subset of the paths in the breakpoint graph of G1 and G2. Each bundle is associated with one or more of the missing genes. The vertices corresponding to each missing gene, its free ends, must be in the same bundle and must be endpoints of two paths, or the two ends of one path. An open bundle contains at least one cuttable edge; a closed one has no cuttable edges. As the breakpoint graph is completed by the algorithm, the bundles also change.
A sketch of the algorithm
We have divided the algorithm into three parts. The first, the main algorithm fillScaffolds, constructs the partial breakpoint graph determined by G1 and the scaffolds of G2, and then partitions the paths in this graph (except the complete paths, and not including the cycles) among a number of bundles, some open and some closed. Initially, a bundle can contain either zero or two telomeres. If they are present, the half-paths, which are the two paths ending in telomeres, are linked together to become a pseudopath.
Although the missing genes represented by the free ends in an open bundle will eventually be inserted in an optimal way by manipulating cuttable edges, this is not possible within closed bundles. fillScaffolds thus calls the second algorithm combineBundles, which subsumes all closed bundles within open ones, as in Figure 3, thus creating larger open bundles, including some which contain more than two telomeres. This is done in such a way as to minimize the eventual genomic distance between G1 and
. This step requires interchanging the half paths of the pseudopaths in the two bundles being combined, through changes in telomere adjacencies, to maximize the number of good paths according to the Tesler formulation in Equation (2).
Finally, fillScaffolds calls completeBundle, which makes the connections between the free ends and the cuttable edges within each of the open bundles.
The output of the algorithm includes cycles, each containing at most one pair of "adjacent" telomeres, which become the two endpoints of a complete path within the breakpoint graph.
After presenting the algorithm, we state and prove a theorem establishing its correctness:
Algorithm fillScaffolds
Input: A fully sequenced and assembled (without gaps) genome G1, and a genome G2 made up of scaffolds containing some of the genes in G1 and gaps.
Output: A completed form of G2, denoted
where the missing genes from G1 are inserted into the gaps in such a way as to minimize
, and the associated breakpoint graph.
-
1.
Construct the breakpoint graph based on genome G 1 (blue edges) and G 2 (red edges), including cuttable red edges between consecutive genes in G 2 scaffolds separated by a gap. We include T 1 vertices at the telomeres of G 1 chromosomes and T 2 vertices at the end of G 2 scaffolds. We do not complete the third step of Figure 2, so the graph may contain cycles, complete paths and other paths.
-
2.
We construct the initial bundles as follows. We choose any free end not already in any bundle as the seed of a new bundle. Then if a path containing free end g
t
is in a bundle B, then we also include the path with g
h
as a free end, and vice versa.
-
3.
There can be zero or two T vertices in an initial bundle. If there are two, we consider the two half paths as if they were one path where the two T are adjacent, even though there is no red or blue edge connecting them.
-
4.
We use combineBundles to remove all the closed bundles by merging them with open bundles, or with complete paths or cycles with cuttable edges, resulting in larger open bundles. We do this in such a way as to minimize
.
-
5.
Complete each bundle, using completeBundle.
Algorithm combineBundles
Input: The set of open and closed bundles as well as the set S of complete paths and cycles with cuttable edges.
Output: A set of open bundles, and a subset S' of the complete paths and cycles. The open bundles contain all the vertices in the input bundles plus those vertices in S\S', the paths and cycles not included in S'.
-
1.
while there is a closed bundle with a T 1 T 1 adjacency and a open bundle, or complete path with a cuttable edge, with a T 2 T 2 adjacency, combine them by switching the adjacencies between T vertices, i.e., by exchanging two half-paths. This results in a larger open bundle and also increases the number of good complete paths by one.
-
2.
while there is a closed bundle with a T 2 T 2 adjacency and a open bundle, or complete path with a cuttable edge, with a T 1 T 1 adjacency, combine them by switching adjacencies. This results in a larger open bundle and also increases the number of good complete paths by one.
-
3.
while there is a closed bundle with a T 2 T 2 adjacency and closed bundle with a T 1 T 1 adjacency, combine them by switching adjacencies. This results in a larger closed bundle and increases the number of good complete paths by one. The closed bundle eventually has to be combined with an open bundle or cycle or complete path.
-
4.
while there is a closed bundle with a TT adjacency and a open one with a TT adjacency, combine them by switching adjacencies. To maintain the number of good paths, if the adjacencies are T 1 T 2 , and
, then after the switching the adjacencies they should be
and
.
-
5.
while there is a closed bundle, combine it with an open bundle or cycle or complete path by adding a pair of cuttable edges, as in Figure 4:
-
i.
Find two free ends g
h
and g
t
in the closed bundle.
-
ii.
Choose a cuttable edge kl in some open bundle, or path or cycle.
-
iii.
Replace kl by two cuttable edges kg
h
and and g
t
l.
Algorithm completeBundle
Input: a good bundle.
Output: a number of cycles.
while there remain paths in the bundle as in Figure 5
-
1.
Choose a path containing a cuttable edge kl, with endpoint g
t
, where l is not on the subpath between k and g
t
.
-
2.
Find the path with endpoint g
h
, possibly the same path.
-
3.
Replace kl by kg
t
and g
h
l, which are red cuttable edges. This results in a cycle containing kg
t
and a path containing g
h
l, unless g
t
and g
h
are on the same path, in which case the operation produces two cycles.
Proving the algorithm
After the first three steps of fillScaffolds, suppose we have constructed γ open bundles with r1, ⋯, r
γ
paths, β closed bundles not containing T vertices with q1, ⋯, q
β
paths, and δ - β closed bundles containing T vertices with qβ + 1, ⋯, q
δ
paths. Let ϵ = 0 unless δ - β > 0 but there are no open bundles containing T vertices nor any complete paths with cuttable edges, in which case ϵ = 1. Suppose there were κ* cycles and p* complete paths in the original breakpoint graph of G1 and G2.
Theorem: There are
cycles and complete paths in the final breakpoint graph constructed by fillScaffolds. Moreover, not only is the number of cycles κ maximal over all ways of inserting the missing genes, but so is the number of good complete paths π ≤ p. Thus the algorithm also implicitly produces the value of D(G1, G2).
Proof: We first show that in completing an open bundle with r paths, we obtain r + 1 cycles. Later, we will show that each of these cycles has at most two T vertices.
Consider the case r = 1. Figure 5 shows that completing this bundle in the optimal way creates two cycles. It also shows that for r > 1, we obtain a open r - 1-bundle plus one cycle. Thus, completing an open bundle with r paths produces a total of r +1 cycles.
It is thus never advantageous to draw a pair of red edges between two open bundles with r and s edges, since this cannot create a cycle, only a bundle with r + s - 1 edges. When completed this will only give r + s cycles instead of the r + s + 2 if we had completed them separately.
On the other hand, to be processed toward completion, it is necessary for a each closed bundle to be combined with either an open bundle, or a cycle or a complete path with a cuttable edges, since a closed bundle has no cuttable edges by itself. The optimum ways to do this are illustrated in Figs. 3 and 4. In the former case, where both bundles have T vertices, switching adjacencies allows a closed bundle with r paths to contribute r paths to the open bundle, and eventually to be responsible for r cycles. If one of the bundles has no T vertices, on the other hand, the closed bundle can contribute only r - 1 of its r paths in combining with the open bundle (Figure 4).
Now the numbers of open bundles, closed bundles with T vertices, closed bundles without T are fixed at the outset, and we can also find out if there are open bundles with T (or complete paths with cuttable edges) or not at the initial stage. Counting the number of cycles given by each type we arrive at the first claim of the theorem. Since each combination and completion is done optimally in the algorithm, the result for κ is best possible. So is π, through the operations minimizing the number of T2T2 edges in combineBundles.
It remains to show that the cycles output by the algorithm have no T vertices, i.e., are the kind of cycles appearing in the breakpoint graph in the second to last stage of the construction of Figure 2, or exactly two adjacent T, i.e., are the kind of complete paths (upon dissolution of the TT adjacency) appearing breakpoint graphs. Otherwise, the values of κ and π that we obtain in this theorem would not be those required for Equation (2).
To prove this, we refer to Figure 6, which integrates aspects of Figure 3, 4 and 5. The case by case analysis illustrated there shows that if there are more than one TT adjacency in a path, these adjacencies will necessarily be incorporated at most one at a time into cycles. Cycles without TT adjacencies are also cycles in the breakpoint graph between G1 and the augmented genome
and the cycles with TT adjacencies become complete paths, either good or bad, in this breakpoint graph. This completes the proof.
The construction of the optimal breakpoint graph by fillScaffolds inserts the missing orthologs in the scaffold gaps and at the ends of scaffolds in a way that minimizes the number of rearrangements intervening between G1 and the optimal G2 thus constructed. Once the optimal breakpoint graph is known, these rearrangements can be recovered rapidly by standard manipulations on the graph [4], as mentioned in our discussion of Equation (1). The construction of breakpoint graphs is of linear complexity, and this extends to the identification of bundles and their manipulations in fillScaffolds. This includes the placement of missing genes. The recovery of minimizing rearrangements can be implemented in subquadratic time [4].
Contig fusion
The algorithm in the preceding section fills in the gaps between the scaffold whenever this is justified, so that by our definitions, the scaffolds become contigs. For unanchored scaffolds, as they are filled in by our algorithm described above, they are also being assembled into chromosomes. In doing this, our method based on the breakpoint graph treats the incorporation of each scaffold/contig as if it were a chromosomal fusion operation.
We previously found [3] through simulations that for ordinary genomes, i.e., complete gene orders, if there are τ rearrangements, but the genomic distance algorithm infers D rearrangements, then the expectation
An estimate of τ is
where λ1 and λ2 are parameters that depend on how the rearrangements are generated.
When one of the genomes consists of unanchored contigs (or filled-in scaffolds), we have to correct the output of the genomic distance algorithm D
S
before using (5) to take into account the number of "fusions" necessary to optimally piece together the contigs into chromosomes. The corrected distance is
where α(τ) is a decreasing function of the number of rearrangements τ, approximately paralleling the derivative of D, namely
.
Missing genes: absent or just unsequenced?
We will use G1 and G2 here to refer to the genomes that are the source of the gene order data. By definition in our method, unsequenced genes must be located in gaps between the contigs or at the ends of scaffolds. We assume any genes within contigs have been identified. However, many or even most genes that are in G1 but have no ortholog in the G2 data may actually be absent from the latter genome either because over time they have been deleted from G2 , or because they were acquired by G1 but not by G2 since the two lineages diverged.
The scaffold filling algorithm is designed to enhance sequence assembly, and cannot distinguish one type of missing gene from another. Indeed, where gene models are available from cDNA or EST data, we could simply discard the missing genes from G1 that are not reflected in the set of gene models for G2. In general, however, we do not have this information, and the best we can hope for is to be able to estimate quantitatively how many of the missing genes are present in the genome, but unsequenced.
Let
represent the genome G1 with all the genes missing from G2 deleted. The remaining genes are ordered in the chromosomes in the same way as in G1 . One way to estimate the proportions of the two types of missing genes is to compare the genomic distance
, where only the genes in common in the data from the two genomes are considered, with the distance
after G2 is augmented to
by the scaffold-filling procedure. As detailed below, we have found in extensive simulations that if all unsequenced genes were originally located in regions that are gaps after the (partial) sequencing and assembly are finished, the distances
and
are identical, or almost so, over a wide range of genome sizes, rearrangement distances and missing gene sets. If on the other hand, many of the missing genes are in reality absent from the G2 genome, a major proportion of these, approximately equal to the coverage of the genome sequencing, will have been in syntenic contexts in G1 that are in contigs in G2. Thus forcing these genes to be in gaps, as the scaffold-filling algorithm does, will tend to increase the rearrangement distance
. Then if m is the number of missing genes
is a measure of how the proportion of missing genes are not actually in the G2 genome.
The value of D' depends on how much the contigs are already rearranged in the independent evolution of the two genomes. If the contigs are highly rearranged compared to G1 , then there is no necessary increase in D' when the missing gene is forced into a gap. But if the syntenic context of a missing gene is intact in a contig, then forcing this gene into a gap remote from this context will necessarily increase D'.
Our strategy for evaluating this dependence requires us to manipulate the overall degree of syntenic context conservation while keeping D fixed. In the simulations in the Results section below, we accomplish this by using fixed length inversions. By generating the genomic divergence with very short inversions, we require more inversions to attain the same inferred D, but we also guarantee the existence of a good number of conserved segments (conserved syntenic contexts) and allow D' to increase. By fixing the inversion length at successively higher values, the scope of each inversion becomes longer and it is less likely a conserved segment will remain undisrupted, and D' will tend not to increase.