 Proceedings
 Open Access
 Published:
Evaluating genome architecture of a complex region via generalized bipartite matching
BMC Bioinformatics volume 14, Article number: S13 (2013)
Abstract
With the remarkable development in inexpensive sequencing technologies and supporting computational tools, we have the promise of medicine being personalized by knowledge of the individual genome. Current technologies provide high throughput, but short reads. Reconstruction of the donor genome is based either on de novo assembly of the (short) reads, or on mapping donor reads to a standard reference. While such techniques demonstrate high success rates for inferring 'simple' genomic segments, they are confounded by segments with complex duplication patterns, including regions of direct medical relevance, like the HLA and the KIR regions.
In this work, we address this problem with a method for assessing the quality of a predicted genome sequence for complex regions of the genome. This method combines two natural types of evidence: sequence similarity of the mapped reads to the predicted donor genome, and distribution of reads across the predicted genome. We define a new scoring function for readtogenome matchings, which penalizes for sequence dissimilarities and deviations from expected read location distribution, and present an efficient algorithm for finding matchings that minimize the penalty. The algorithm is based on a formal problem, first defined in this paper, called C overage S ensitive manytomany mincost bipartite M atching (CSM). This new problem variant generalizes the standard (onetoone) weighted bipartite matching problem, and can be solved using network flows. The resulting Javabased tool, called SAGE (S coring function for A ssembled GE nomes), is freely available upon request. We demonstrate over simulated data that SAGE can be used to infer correct haplotypes of the highly repetitive KIR region on the Human chromosome 19.
Introduction
The inexorable drop in costs and rise in throughput of DNA sequencing is driving a future in which every individual person will have their genome sequenced, perhaps multiple times in their lifetimes [1]. Current high throughput technologies produce sequenced read fragments from donor genomes, which are then used for inferring the complete genomic sequence. The main algorithmic approaches for inferring a donor genome from a set of its sequenced reads are either based on de novo assembly [2, 3], i.e. producing a parsimonious superstring that approximately contains most reads as its substrings, or based on mapping approaches [4–6], in which the algorithm takes the read set and a previously sequenced reference genome (or a set of reference genomes), maps the reads to the reference, and uses the identified similarities and variations in order to predict the donor genome.
While the accuracies of sequencing technologies keep improving and their usage costs keep decreasing, many of them still produce reads of relatively short lengths. Reconstruction of repetitive genomic regions using the mentioned approaches is considered more challenging, due to the fact that short reads may be denovo assembled, or mapped to the reference, in multiple ambiguous manners. The difficulty even increases for diploid genomes, limiting the investigation of many important genomic regions, such as the killer cell immunoglobulin like receptor (KIR) region (located in humans within the 1Mb Leucocyte Receptor Complex 19q13.4, see Figure 1b), the 3.6Mbp Human Leucocyte Antigen (HLA) region and others, which exhibit highly repetitive sequences and extensive polymorphisms.
Here, we address the problem of assessing the quality of a donor genome prediction given the set of its sequenced reads, confronting difficulties related to genomic regions of repetitive nature. We present a prediction quality measure a prediction quality measure which is independent of the approach used for generating the prediction. It combines scoring penalties related to both (a) imperfect alignments of the reads to the predicted region, and (b) deviations between the expected and actual read coverage of segments of the region. Our tool differs from previous ones which compare predictions to a known reference. For example, tools that evaluate the quality of denovo assemblies [7] rely on comparing assembled genomes to known references. Mapping tools [8, 9] can be used to provide a naive scoring function comparable to SAGE by summing up the best alignment score of each read. This naive scoring function only optimizes the alignment of the reads and does not take into account read coverage. Our results show the advantage of simultaneously optimizing the combined alignment and coverage score by comparing our tool to the naive approach.
In order to evaluate the new cost function, we applied it to the KIR, a hypervariable region known to be important for the immediate immune response in humans and higher mammals [10]. The KIR region is challenging to reconstruct from sequence read fragments due to its variable gene architecture (Figure 1a) and repetitive nature (Figure 1b). We show that our scoring function allows us to correctly identify KIR haplotype templates in diploid genomes, differentiating correct predictions form incorrect ones based on their computed score, while the naive approach fails in many cases to predict the correct template.
Our cost function for evaluating donor genome predictions is based on a new variant of a bipartite matching problem, entitled Coverage Sensitive manytomany mincost bipartite Matching (CSM), which is a manytomany generalization of the classical mincost (or maxweight) bipartite matching problem [11, 12]. The formal definition of the CSM problem is given in the next section. While in general CSM is NPHard (see Additional File 1), we show a special "convexed" case for which CSM can be efficiently solved by reducing it to a network flow problem, similar to many other variants of bipartite matching problems [12]. Optimal matching/flow algorithms were recently used by several related works to predict structural variations between genomes. Examples to such works include [13], in which mincost flow was used to call copy number variations between a reference and a donor genome, [14], which used maximumweight matching in order to reconstruct breakpoint sequences in long genomic insertions, and [15], which used maximumflow in order to apply a postprocess refinement of simultaneous detection of structural variations in multiple genomes.
Coverage Sensitive manytomany mincost bipartite Matching (CSM)
The CSM problem is a manytomany generalization of the classical mincost bipartite matching problem [12]. We describe the problem in an abstract setting, and cast it to a read alignment problem in the next section.
Consider arbitrary sets X and Y. A manytomany matching (henceforth a matching) between X and Y is a set M of pairs {(x, y) ∈ X × Y} (see Figure 2, (a), (b), (c). The coverage of an element x ∈ X with respect to a matching M is c_{ M } (x) = {y : (x, y) ∈ M}. Symmetrically, c_{ M } (y) = {x : (x, y) ∈ M} for y ∈ Y .
A coverage sensitive matching cost function (henceforth a cost function) w for X and Y assigns matching costs w_{ m } (x, y) for every pair (x, y) ∈ X × Y , and coverage costs w_{ c } (z, i) for every z ∈ X ∪ Y and every integer i ≥ 0. The cost of a matching M between X and Y with respect to w is given by
The CSM problem
Input: A Matching Instance (X, Y, w) consisting of sets X, Y, and cost function w.
Output: Compute $CSM\left(X,Y,w\right)=\underset{M\subseteq X\times Y}{\text{min}}w\left(M\right)$.
Note that CSM is a generalization of classical problems in combinatorics. For example, consider the problem of finding a maximum (partial onetoone) matching on a bipartite graph G with vertex shores X, Y, and an edge set E. This problem can be solved by solving CSM on the input X, Y using the following costs: set w_{ c } (z, 0) = w_{ c } (z, 1) = 0, and w_{ c } (z, i) = ∞ for all z ∈ X ∪ Y, i > 1; set w_{ m } (x, y) = 1 for (x, y) ∈ E and otherwise set w_{ m } (x, y) = ∞. Similarly, CSM can also be used for solving the minimum/maximum weight variants of the bipartite matching problem. However, CSM is NPhard in general (see Additional File 1), and therefore we do not expect to solve the general instance efficiently.
CSM with convex coverage costs
Let (X, Y, w) be a matching instance. We say that w has convex coverage costs if for every element z ∈ X ∪ Y and every integer i > 0, ${w}_{c}\left(z,i\right)\le \frac{{w}_{c}\left(z,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}i1\right)\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{w}_{c}\left(z,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}i+1\right)}{2}$. We show here that CSM with convex coverage costs can be reduced to the polytime solvable mincost integer flow problem [11].
For x ∈ X, denote d_{ x } = {y : w_{ m } (x, y) <∞}, and similarly d_{ y } = {x : w_{ m } (x, y) <∞} for y ∈ Y . Denote ${d}_{X}=\underset{x\in X}{\text{max}}{d}_{x}$ and ${d}_{Y}=\underset{y\in Y}{\text{max}}{d}_{y}$. The reduction builds the flow network N = (G, s, t, c, w'), where G is the network graph, s and t are the source and sink nodes respectively, and c and w' are the edge capacity and cost functions respectively. The graph G = (V, E) is defined as follows (Figure 2d).

♦ V = X ∪ Y ∪ C^{X}∪ C^{Y}∪ {s, t}, where the sets ${C}^{X}=\left\{{c}_{1}^{X},{c}_{2}^{X},...,{c}_{{d}_{X}}^{X}\right\}$, ${C}^{Y}=\left\{{c}_{1}^{Y},{c}_{2}^{Y},...,{c}_{{d}_{Y}}^{Y}\right\}$, and {s, t} contain unique nodes different from all nodes in X and Y . Note that we use the same notations for elements in X and Y and their corresponding nodes in V, where ambiguity can be resolved by the context.

♦ E = E_{1} ∪ E_{2} ∪ E_{3} ∪ E_{4} ∪ E_{5}, where

${E}_{1}=\left\{\left(s,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{X}\right):{c}_{i}^{X}\in {C}^{X}\right\}$,

${E}_{2}=\left\{\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{X}\in {C}^{X},\phantom{\rule{2.77695pt}{0ex}}x\in X,\phantom{\rule{2.77695pt}{0ex}}{d}_{x}\le i\right\}$,

${E}_{3}=\left\{\left(x,\phantom{\rule{2.77695pt}{0ex}}y\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}x\in X,\phantom{\rule{2.77695pt}{0ex}}y\in Y,\phantom{\rule{2.77695pt}{0ex}}{w}_{m}\left(x,\phantom{\rule{2.77695pt}{0ex}}y\right)<\infty \right\}$,

${E}_{4}=\left\{\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right):y\in Y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\in {C}^{Y},\phantom{\rule{2.77695pt}{0ex}}{d}_{y}\le i\right\}$,
and

${E}_{5}=\left\{\left({c}_{i}^{Y},\phantom{\rule{2.77695pt}{0ex}}t\right):{c}_{i}^{Y}\in {C}^{Y}\right\}$.

The capacity function c assigns infinity capacities to all edges in E_{1} and E_{5} and unit capacities to all edges in E_{2}, E_{3} and E_{4}. The cost function w' assigns zero costs to edges in E_{1} and E_{5}, costs w_{ c } (x, i)  w_{ c } (x, i  1) to edges $\left({c}_{i}^{X},x\right)\in {E}_{2}$, costs w_{ c } (y, i)  w_{ c } (y, i  1) to edges $\left(y,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{c}_{i}^{Y}\right)\in {E}_{4}$, and costs w_{ m } (x, y) to edges (x, y) ∈ E_{3}. For E' ⊆ E, denote $w\text{'}\left({E}^{\prime}\right)=\sum _{e\in {E}^{\prime}}w\text{'}\left(e\right)$. An integer flow in N is a function f : E → {0, 1, 2, . . .}, satisfying that f(e) ≤ c(e) for every e ∈ E (capacity constraints), and $\sum _{u:\left(u,v\right)\in E}f\left(u,v\right)=\sum _{u:\left(u,v\right)\in E}f\left(v,u\right)$ for every v ∈ V \ {s, t} (flow conservation constraints). The cost of a flow f in N is defined by $w\text{'}\left(f\right)=\sum _{e\in E}f\left(e\right)w\text{'}\left(e\right)$.
In what follows, let (X, Y, w) be a matching instance where w has convex coverage costs, and let N be its corresponding network. Due to the convexity requirement, for every x ∈ X and every integer i > 0, $\begin{array}{c}w\text{'}\left({c}_{i+1}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)w\text{'}\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)=\left({w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i+1\right){w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i\right)\right)\left({w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i\right){w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i1\right)\right)\\ ={w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i+1\right)+{w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i1\right)2{w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}i\right)\ge 0.\end{array}$ Similarly, for every y ∈ Y and every integer i > 0, $w\text{'}\left(y,{c}_{i+1}^{Y}\right)w\text{'}\left(y,{c}_{i}^{Y}\right)\ge 0$, and we get the following observation:
Observation 1. Series of the form $w\text{'}\left({c}_{1}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right),w\text{'}\left({c}_{2}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right),\phantom{\rule{0.3em}{0ex}}...$ and $w\text{'}\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{1}^{Y}\right),w\text{'}\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{2}^{Y}\right),\phantom{\rule{0.3em}{0ex}}...$are nondecreasing. Consequentially, for every ${E}^{\prime}\subseteq \left\{\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}x\in X,\phantom{\rule{2.77695pt}{0ex}}1\le i\le {d}_{x}\right\}$and ${E}^{\u2033}=\left\{\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}x\in X,\phantom{\rule{2.77695pt}{0ex}}1\le i\le \left{E}^{\prime}\right\right\},$$w\text{'}\left({E}^{\u2033}\right)\le w\text{'}\left({E}^{\prime}\right)$, and similarly for ${E}^{\prime}\subseteq \left\{\left(y,\phantom{\rule{0.3em}{0ex}}{c}_{i}^{Y}\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}y\in Y,\phantom{\rule{2.77695pt}{0ex}}1\le i\le {d}_{y}\right\}$and ${E}^{\u2033}=\left\{\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right)\phantom{\rule{2.77695pt}{0ex}}:\phantom{\rule{2.77695pt}{0ex}}y\in Y,\phantom{\rule{2.77695pt}{0ex}}1\le i\le \left{E}^{\prime}\right\right\}$.
Given a flow f in N, define the matching M_{ f } = {(x, y) : (x, y) ∈ E_{3}, f(x, y) = 1}. Denote ${E}_{x}^{f}=\left\{\left({c}_{i}^{X},x\right):f\left({c}_{i}^{X},x\right)=1\right\}$ and ${E}_{y}^{f}=\left\{\left(y,\phantom{\rule{0.3em}{0ex}}{c}_{i}^{Y}\right):f\left(y,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{c}_{i}^{Y}\right)=1\right\}$. Since for edges e ∈ E_{1} ∪ E_{5} we have that w'(e) = 0, and since for edges e ∈ E_{2} ∪ E_{3} ∪ E_{4} we have that f(e) ∈ {0, 1} (due to capacity constraints), we can write
Given a noninfinity cost matching M between X and Y, define the flow f_{ M } in N as follows:

♦ For every (x, y) ∈ E_{3}, f (x, y) = 1 if (x, y) ∈ M, and otherwise f(x, y) = 0;

♦ For every $\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)\in {E}_{2},\phantom{\rule{2.77695pt}{0ex}}f\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)=1$ if c_{ M }(x) ≤ i, and otherwise $f\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)=0$;

♦ For every $\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right)\in {E}_{4}$, $f\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right)=1$ if c_{ M }(y) ≤ i, and otherwise $f\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right)=0$;

♦ For every $\left(s,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{X}\right)\in {E}_{1},\phantom{\rule{2.77695pt}{0ex}}f\left(s,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{X}\right)=\left\left\{x:f\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)=1\right\}\right$;

♦ For every $\left({c}_{i}^{Y},\phantom{\rule{2.77695pt}{0ex}}t\right)\in {E}_{5},\phantom{\rule{2.77695pt}{0ex}}f\left({c}_{i}^{Y},\phantom{\rule{2.77695pt}{0ex}}t\right)=\phantom{\rule{0.3em}{0ex}}\left\left\{y:f\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{i}^{Y}\right)=1\right\}\right$.
It is simple to assert that f_{ M } is a valid flow in N (satisfying all capacity and flow conservation constraints), and that ${M}_{{f}_{M}}=M$.
Claim 1. For every flow f in N, $w\text{'}\left({f}_{{M}_{f}}\right)\le w\text{'}\left(f\right).$
Proof. From flow conservation constraints $\left{E}_{x}^{f}\right=\left{E}_{x}^{{f}_{{M}_{f}}}\right={c}_{{M}_{f}}\left(x\right)$ for every x ∈ X, where in particular by definition we have that ${E}_{x}^{{f}_{{M}_{f}}}=\left\{\left({c}_{i}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right):1\le i\le {c}_{{M}_{f}}\left(x\right)\right\}$ Therefore, it follows from Observation 1 that $w\text{'}\left({E}_{x}^{{f}_{{M}_{f}}}\right)\le w\text{'}\left({E}_{x}^{f}\right)$ for every x ∈ X, and similarly it may be shown that $w\text{'}\left({E}_{x}^{{f}_{{M}_{f}}}\right)\le w\text{'}\left({E}_{x}^{f}\right)$ for every y ∈ Y. Hence,
□
Denote $\mathrm{\Delta}=\mathrm{\Delta}\left(X,\phantom{\rule{2.77695pt}{0ex}}Y,\phantom{\rule{2.77695pt}{0ex}}w\right)=\sum _{z\in X\cup Y}{w}_{c}\left(z,\phantom{\rule{2.77695pt}{0ex}}0\right)$, and note that Δ depends only on the instance (X, Y, w) and not on any specific matching.
Claim 2. For every matching M between × and Y, w'(f_{ M }) = w(M)  Δ.
Proof. For x ∈ X, we have that $\begin{array}{c}w\text{'}\left({E}_{x}^{{f}_{M}}\right)=w\text{'}\left({c}_{1}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)+w\text{'}\left({c}_{2}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)+...+w\text{'}\left({c}_{{c}_{M}\left(x\right)}^{X},\phantom{\rule{2.77695pt}{0ex}}x\right)\\ =\left({w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}1\right){w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}0\right)\right)+\left({w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}2\right){w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}1\right)\right)+...+\left({w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}{c}_{M}\left(x\right)\right)\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}{w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}{c}_{M}\left(x\right)1\right)\right)\\ ={w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}{c}_{M}\left(x\right)\right){w}_{c}\left(x,\phantom{\rule{2.77695pt}{0ex}}0\right),\end{array}$ and similarly $w\text{'}\left({E}_{y}^{{f}_{M}}\right)={w}_{c}\left(y,\phantom{\rule{2.77695pt}{0ex}}{c}_{M}\left(y\right)\right){w}_{c}\left(y,\phantom{\rule{2.77695pt}{0ex}}0\right)$ for y ∈ Y. Therefore,
□
Claim 3. Let f* be a minimum cost flow in N. Then, M_{ f* } is a minimum cost matching between X and Y, and CSM(X, Y, w) = w'(f*) + Δ.
Proof. Since f* is a minimum cost flow in N, $\phantom{\rule{2.77695pt}{0ex}}w\text{'}\left({f}^{*}\right)\le w\text{'}\left({f}_{{M}_{{f}^{*}}}\right)\stackrel{\mathsf{\text{Clm}}.1}{\le}w\text{'}\left({f}^{*}\right)$, thus $w\text{'}\left({f}^{*}\right)=w\text{'}\left({f}_{{M}_{{f}^{*}}}\right)$. Let M be a matching between X and Y. Again, from the optimality of f*, w'(f*) ≤ w'(f_{ M }) and so $w\left({M}_{{f}^{*}}\right)\mathrm{\Delta}\phantom{\rule{0.3em}{0ex}}\stackrel{Clm.2}{=}w\text{'}\left({f}_{{M}_{f*}}\right)=w\text{'}\left({f}^{*}\right)\le w\text{'}\left({f}_{M}\right)\stackrel{Clm.2}{=}w\left(M\right)\mathrm{\Delta}$, and in particular $w\left({M}_{{f}^{*}}\right)\le w\left(M\right)$. Thus, ${M}_{{f}^{*}}$ is a minimum cost matching for (X, Y, w), and so $CSM\left(X,\phantom{\rule{2.77695pt}{0ex}}Y,\phantom{\rule{2.77695pt}{0ex}}w\right)=w\left({M}_{{f}^{*}}\right)\stackrel{Clm.2}{=}w\text{'}\left({f}^{*}\right)+\mathrm{\Delta}$.
□
Constructing CSM instance from read mapping data
Consider a set of reads and a prediction of the genomic sequence (henceforth, the "prediction") from which the reads were extracted. It is assumed that the sequencing procedure produces reads with some sequencing error probability, and that read extraction positions along the genome adhere to some expected distribution. The probability for extracting a read starting at a given position may depend on the sequential context at this position and its location along the genome. Given such probabilities, it is possible to compute for a given segment of the prediction an expected amount of extracted reads starting within this segment. Such an amount of expected reads will be referred to here as the expected coverage of the segment. Hence, we can argue that the reads well support the prediction in case it is possible to assign to each read a position within the prediction, from which it was presumably extracted, in a manner that (a) each read sequence approximately matches the substring of the prediction starting at the assigned position, and (b) for every segment of the prediction, the amount of reads assigned to positions within this segment does not deviate significantly from the expected coverage of the segment. On the other hand, when no such position assignment can be found, it is suggestive that the prediction exhibits some variation with respect to the true genome.
Given a predicted region, a mapping between the reads and the prediction is a function that assigns to each read a set of positions in the region from which it is possible to extract the read (with some allowed amount of sequencing errors). Software tools for producing such mappings exist (e.g. Bowtie [8]) and are widely used. Ideally, if the prediction is in fact the correct genomic sequence from which the reads were extracted, and this region is nonrepetitive, it is expected that a mapping would assign to each read a unique position that is the true position from which it was extracted. Nevertheless, when the sequence contains repeats, and sequencing errors are not negligible, it is expected that some of the reads will be mapped to multiple positions (due to the repeats), while others may not be mapped to any position (due to sequencing errors). Given a mapping between the reads and the region, we define a readtogenome matching as a function that selects for each read at most one corresponding position among its set of positions given by the mapping, from which it was presumably extracted. A readtogenome matching better supports the prediction the more reads it matches to the genome, the higher the similarity is between reads and their matching positions, and the smaller the deviation is between the expected coverage and the coverage implied by the matching positions.
The quality of a readtogenome matching can be naturally evaluated using the CSM formalism described in the previous section. A matching instance (X, Y, w) can be generated, choosing X to be the set of reads, and Y to be a partition of the prediction into segments (where each element in Y corresponds to a segment in the partition). For each read x ∈ X and each segment y ∈ Y, w_{ m } (x, y) is set to the best sequence similarity score between x and a substring of the prediction starting at y (such similarity scores may be generated by tools such as Bowtie [8]), or set to ∞ if no substring starting at y is similar to x. The coverage cost function for a read x ∈ X sets w_{ c } (x, 0) to some penalty added to the score in case x is unmatched, sets w_{ c } (x, 1) to 0 (no penalty is added when x participates in the matching), and w_{ c } (x, i) for i > 1 to ∞ (a matching in which a read is assigned to more than one position is illegal, and has an infinite cost). For a segment y ∈ Y, it is possible to compute the expected coverage c_{ y } of y, and generate a convex score function f(i) whose minimum point is at i = c_{ y }, and set w_{ c } (y, i) = f (i) for every nonnegative integer i. The cost of an optimal matching for this instance can then serve as a quality measure for the prediction.
Implementation
We implemented the CSM algorithm as a java based tool named SAGE, a S coring function for A ssembled GE nomes. The inputs to SAGE are a set of reads, R, mapped to a genomic template, G, in the BAM format [16] along with a parameter file containing alignment costs, unmatched read penalty, genome segmentation, expected segment coverage values, and a choice of coverage cost functions (currently linear and polynomial cost functions).
Results
We tested SAGE on the hypervariable KIR region. The KIR region, while variable, is tightly organized and contains between 8 and 14 genes, and 2 pseudogenes (Figure 1a) [17]. The genes are organized into two adjacent regions, each bordered by two anchoring genes/pseudogenes: KIR3DL3 and 3DP1 for the centromeric region; 2DL4 and 3DL2 for the telomeric region. Variability within KIR is expressed in the form of changing gene numbers, genecopy numbers, and gene polymorphisms. There are two broad types of KIR haplotypes Type A and Type B that are distinguished by their gene content. Type A haplotypes are characterized by the absence of the following genes: {KIR2DL5, 2DS1, 2DS2, 2DS3, 2DS5, 3DS1}, while Type B haplotypes contain one or more of these genes [18]. Type B haplotypes can be split further into different subtypes, characterized by the gene content on the centromericside and telomericside. The various (sub)types of KIR haplotypes are denoted by {A, AB, BA1, BA2, BA2X, Bdel, B}. However, the typing is incompletely developed, and is likely to change as more data is acquired.
To test the effectiveness of SAGE on a variety of haplotype types, we simulated reads from 27 known KIR haplotypes using GemSIM [19] with an error model learned from pairedend (100 × 2)bp reads generated by Illumina GA IIx with TrueSeq SBS Kit v5GA [19]. The 27 haplotype templates were taken from the IPDKIR database [20]. The sequences of these templates were obtained experimentally by first separating the two haplotypes of an individual using fosmidpools, determining the gene content and architecture of each haplotype using STS assays, and then finally sequencing the individual fosmids [21].
Before we ran SAGE, we mapped each read set, R, back to each template, G, using Bowtie. We ran Bowtie under the 'a' option with all other parameters set to the default, in order to obtain a set of all possible mapping locations and their corresponding alignment costs for each read, which was used as input into SAGE. The mapping position of a pairedend read was set to be the genomic index to which the first character of the first subread was aligned. The alignment cost for a complete (100 × 2)bp pairedend read varied between 0 and 180, with 0 corresponding to identity. When two pairedends mapped in a concordant manner, the total alignment cost for the read was calculated by adding the alignment cost of both pairedends. When a pairedend did not have a concordant mate, suggestive of incorrect architecture, the alignment cost was further penalized by adding a cost of 90, which is the maximum penalty for one pairedend. The unmatched read penalty was constant for all reads and set to 100. On the other side, the genome G was partitioned to segments of fixed length of 1000bp (except for the last segment which may be shorter), with expected coverage per segment given by $\lambda =1000\frac{\leftR\right}{\leftG\right}$ (with the appropriate adjustment for the last segment), where R and G denote the number of reads and the length of the genome, respectively. To allow for natural variation in coverage, the quadratic function, f(i) = (λ  i)^{2}, was chosen as the segment coverage cost function.
To the best of our knowledge, SAGE is the first tool that scores templates given a set of reads. As there is no competing tool, we compared SAGE results against a naive approach that ignores coverage and sums up the best alignment score for each read to obtain a total score for each read set and template. The scores obtained by this approach will be referred to as the Bowtie scores below.
Haploid templates
As a first pass, we tested SAGE's ability to score haploid templates. We scored each of the 27 read sets against each of the 27 templates using SAGE. A visualization of the scores are shown in Figure 3, where the templates are organized by sequence similarity so that templates of the same type/subtype are clustered together. Note that the matrix is not symmetric. Each row corresponds to the scores of a single read data set against a collection of haploid templates. As can be seen, SAGE always gets the topscore for the correct template. Moreover, the other templates from the same subtype get progressively weaker scores. Major haplotypes fall within distinct blocks, but the scores also suggest a hierarchy within the subtypes that can be studied further.
Dipolid templates
To test scoring on more realistic templates, we simulated reads from 9 diploid individuals whose pair of haploid templates were obtained experimetally in Pyo et al. [21] and are in the IPDKIR database [20]. The 9 diploid templates from this study fell into one of 6 combination of subtypes. We scored each of the 9 simulated read sets against each of the 9 diploid templates using SAGE. In all but one case, SAGE (Figure 4a) and Bowtie (Figure 4b) predicted the correct diploid template of the donor. Furthermore, SAGE is better at predicting the subtype of the donor template than Bowtie. When the donor template is not in database, as is usually the case in practice, SAGE will give a better score to templates that are more similar to the donor while Bowtie may not. For example, row 3 of Figure 4(a, b) show the scores when the donor template is of type A and BA1. Both SAGE and Bowtie correctly gave the best score to the diploid template G085A/BA1. However, the template with the next best SAGE score was also of subtype A/BA1, while the template with the next best Bowtie score was of subtype A/BA2.
In general, coverage plays an important role in determining the correct haplotype. Figure 5(be) show the coverage plots when reads from donor template G085A/BA1 are mapped to a template of the same subtype (F06A/BA1) and a template of a different subtype (FH13A/BA2) using SAGE and Bowtie. When mapped to templates of the same subtype (Figure 5(b, d)), the coverage plots for both SAGE and Bowtie show less variance when compared to the coverage plots of the other templates (Figure 5(c, e)). Bowtie does not take into account variance of coverage and scores the template of a different subtype (FH13A/BA2) higher than the template of the same subtype (F06A/BA1). On the contrary, SAGE penalizes for the variance in coverage, and correctly predicts the subtype of the donor. Furthermore, if several possible mappings of a read are given, SAGE can be used to determine the best mapping. In Figure 5(b, c), we see less variability in the coverage plots from SAGE's matching compared against those of Bowtie's matching (Figure 5(d, e)). Therefore, even if Bowtie is able to determine the correct donor template, it may not output the correct mapping.
Running time
For a dataset with n reads and a total of m read mapping locations, SAGE scales as O(nm + n^{2} log n). Thus, on our datasets with haploid genomes of average length 166Kbp (166 1000bpsegments), and ~ 24,900 reads, SAGE ran in 21 seconds. The running time increased to 210 seconds for the average diploid genome (~ 332 1000bpsegments, ~ 49,800 reads). Running times were recorded using a 4 core Intel 2.66GHz processor with 9Gb of RAM.
Discussion and conclusions
To the best of our knowledge, SAGE is the first tool that scores predicted donor templates given a set of sequenced reads. Our results on the KIR region show that SAGE can be used to predict the subtype of the donor KIR template, and can be directly used for haplotyping this region. Furthermore, SAGE scores the correct template higher than even templates of the same subtype.
While we focused our attention on the KIR region, SAGE is general enough to be applied to any complex region. It is also possible to implement many different scoring functions, which would allow the user to obtain optimal matchings according to his own custom scores. For example, read unmatching penalties may be constant for all reads, or may be readspecific. A motivation for read specific costs is in the case where the sequencing phase produces some sequencing qualities for reads, and it is possible to "pay" less when not matching reads of lower sequencing quality. Similarly, it is possible to choose a segmentation of the prediction in which all segments are of the same length, and uniform coverage is assumed, or one with variable segment lengths and possibly different coverage cost functions for each segment. A motivation to such complex segmentation is e.g. in the case where one tries to identify a specific structural variation, such as a deletion of a segment of specific length around a specific region of the prediction. Setting lengths of segments in the examined region to the expected deletion length can increase the likelihood that an optimal matching would not add artifact matchings of reads to a long segment spanning the deleted segment, in order to compensate for low coverage of the deleted segment. Lastly, by using different coverage cost functions, it is possible to decide the rate in which penalty increases due to deviations of expected coverage, which may grow linearly, polynomially, exponentially, or based on other probabilistic models, as long as the function satisfies the convexity requirement.
Future work would involve extending the use of SAGE on real data. Some challenges in dealing with real data include obtaining the set of reads extracted from the region of interest (especially when sequencing data is likely taken from the whole genome) and providing the expected coverage. If we know the parameters of the sequencing run, we could use the target read coverage as the expected coverage; however, if that is unknown, we may be able to estimate the expected coverage from the number of reads we need to map to the region. For example, if we assume a uniform distribution of coverage, then the expected coverage per segment is simply the total length of the segment multiplied by $\frac{\leftR\right}{\leftG\right}$.
Although haplotype analysis of the KIR region is medically relevant, the genomic complexity (i.e. repetitive nature and variable gene architecture) of this region makes it difficult to do a complete analysis. Indeed, the possible subtypes of this region have not been completely characterized. Thus, reconstruction of this region and other complex regions of the genome remain a worthwhile problem. SAGE takes the first step in reconstructing complex regions of the genome by providing a scoring function for predicted templates based on their similarity to the true donor. Therefore, it might be possible to obtain a complete reconstruction of the donor genome by iteratively refining predicted donor templates until SAGE scores are optimized. Furthermore, SAGE can also be applied for scoring denovo assemblies and for comparing the accuracies of different assemblers.
Abbreviations
 CSM:

Coverage Sensitive manytomany mincost Matching
 SAGE:

Scoring function for Assembled Genomes
 KIR:

Killer cell Immunoglobulinlike Receptor
 HLA:

Human Leucocyte Antigen
References
 1.
Hall Y: Coming Soon: Your Personal DNA Map?. National Geographic News. 2011
 2.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010, 20 (2): 265272. 10.1101/gr.097261.109.
 3.
Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98 (17): 97489753. 10.1073/pnas.171285098.
 4.
Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res. 2004, 14 (4): 721732. 10.1101/gr.2264004.
 5.
Kidd JM: Mapping and sequencing of structural variation from eight human genomes. Nature. 2008, 453: 5664. 10.1038/nature06862.
 6.
Mills RE: Mapping copy number variation by populationscale genome sequencing. Nature. 2011, 470: 5965. 10.1038/nature09708.
 7.
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012, 22 (3): 557567. 10.1101/gr.131383.111.
 8.
Langmead B, Salzberg S: Fast gappedread alignment with Bowtie 2. Nature methods. 2012, 9 (4): 357359. 10.1038/nmeth.1923.
 9.
Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform. Bioinformatics. 2010, 26 (5): 589595. 10.1093/bioinformatics/btp698.
 10.
Hsu KC, Chida S, Geraghty DE, Dupont B: The killer cell immunoglobulinlike receptor (KIR) genomic region: geneorder, haplotypes and allelic polymorphism. Immunol Rev. 2002, 190: 4052. 10.1034/j.1600065X.2002.19004.x.
 11.
Edmonds J, Karp R: Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM). 1972, 19 (2): 248264. 10.1145/321694.321699.
 12.
Lovász L, Plummer M: Matching Theory, Volume 29 of Annals of Discrete Mathematics. Amsterdam: NorthHolland. 1986
 13.
Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads. Genome research. 2010, 20 (11): 16131622. 10.1101/gr.106344.110.
 14.
Hajirasouliha I, Hormozdiari F, Alkan C, Kidd J, Birol I, Eichler E, Sahinalp S: Detection and characterization of novel sequence insertions using pairedend nextgeneration sequencing. Bioinformatics. 2010, 26 (10): 12771283. 10.1093/bioinformatics/btq152.
 15.
Hormozdiari F, Hajirasouliha I, McPherson A, Eichler E, Sahinalp S: Simultaneous structural variation discovery among multiple pairedend sequenced genomes. Genome research. 2011, 21 (12): 22032212. 10.1101/gr.120501.111.
 16.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 20782079. 10.1093/bioinformatics/btp352.
 17.
Middleton D, Gonzelez F: The extensive polymorphism of KIR genes. Immunology. 2010, 129: 819. 10.1111/j.13652567.2009.03208.x.
 18.
Marsh SG, Parham P, Dupont B, Geraghty DE, Trowsdale J, Middleton D, Vilches C, Carrington M, Witt C, Guethlein LA, Shilling H, Garcia CA, Hsu KC, Wain H: Killercell immunoglobulinlike receptor (KIR) nomenclature report, 2002. Tissue Antigens. 2003, 62: 7986. 10.1034/j.13990039.2003.00072.x.
 19.
McElroy KE, Luciani F, Thomas T: GemSIM: general, errormodel based simulator of nextgeneration sequencing data. BMC Genomics. 2012, 13: 7410.1186/147121641374.
 20.
Robinson J, Waller MJ, Stoehr P, Marsh SG: IPDthe Immuno Polymorphism Database. Nucleic Acids Res. 2005, 33: D523526.
 21.
Pyo CW, Guethlein LA, Vu Q, Wang R, AbiRached L, Norman PJ, Marsh SG, Miller JS, Parham P, Geraghty DE: Different patterns of evolution in the centromeric and telomeric regions of group A and B haplotypes of the human killer cell Iglike receptor locus. PLoS ONE. 2010, 5 (12): e1511510.1371/journal.pone.0015115.
 22.
Krumsiek J, Arnold R, Rattei T: Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics. 2007, 23 (8): 10261028. 10.1093/bioinformatics/btm039.
Acknowledgements
The research was supported by grants from the NIH (RO1HG004962, U54 HL108460), and the NSF (CCF1115206). CL was supported by an NSF graduate fellowship.
Declarations
Publication of this article was supported by NIH (RO1HG004962, U54 HL108460) and NSF (CCF1115206).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SZ and CL developed and implemented the method. CL and SK designed the experiments. CL performed the experiments. CL, SZ, and VB wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lo, C., Kim, S., Zakov, S. et al. Evaluating genome architecture of a complex region via generalized bipartite matching. BMC Bioinformatics 14, S13 (2013) doi:10.1186/1471210514S5S13
Published
DOI
Keywords
 Killer Cell Immunoglobulin Like Receptor
 Human Leucocyte Antigen
 Donor Genome
 Bipartite Match
 Minimum Cost Flow