Approximation properties of haplotype tagging

Background Single nucleotide polymorphisms (SNPs) are locations at which the genomic sequences of population members differ. Since these differences are known to follow patterns, disease association studies are facilitated by identifying SNPs that allow the unique identification of such patterns. This process, known as haplotype tagging, is formulated as a combinatorial optimization problem and analyzed in terms of complexity and approximation properties. Results It is shown that the tagging problem is NP-hard but approximable within 1 + ln((n2 - n)/2) for n haplotypes but not approximable within (1 - ε) ln(n/2) for any ε > 0 unless NP ⊂ DTIME(nlog log n). A simple, very easily implementable algorithm that exhibits the above upper bound on solution quality is presented. This algorithm has running time O((2m - p + 1)) ≤ O(m(n2 - n)/2) where p ≤ min(n, m) for n haplotypes of size m. As we show that the approximation bound is asymptotically tight, the algorithm presented is optimal with respect to this asymptotic bound. Conclusion The haplotype tagging problem is hard, but approachable with a fast, practical, and surprisingly simple algorithm that cannot be significantly improved upon on a single processor machine. Hence, significant improvement in computatational efforts expended can only be expected if the computational effort is distributed and done in parallel.


Background
Much of the population-wide variation of the human genome can be attributed to single nucleotide polymorphisms (SNPs), which are changes in single base pairs within the genome. SNPs are of specific interest because they allow disease association studies; this means that the involvement of genes in particular diseases can be studied by the analysis of SNP alleles within these genes [1]. For the study of population genomics, SNPs can be used to measure linkage disequilibrium, an indication of how much more (or less) likely, compared to chance, certain combinations of neighboring SNP alleles are [2,3].
After the completion of the Human Genome Project emphasized the importance of SNPs to study the location of disease genes, the SNP Consortium project produced a genome-wide map of more than 1.4 million SNPs [4]. Due to linkage disequilibrium, the distribution of possible alleles at SNPs is not uniformly random, and some combinations of neighboring alleles occur more often than others. Such a combination of SNP alleles is called a haplotype, and a given set of SNPs can give rise to a wide variety of haplotypes.
It is an important problem to identify a subset of SNPs within a haplotype that allows the unique identification of all possible allele variations within a haplotype. If such a subset can be found, a haplotype can be uniquely identified by knowing only the allele values at a few SNP positions. SNPs that satisfy this requirement are called haplotype tagging SNPs (htSNPs).
The problem of identifying a minimal set of htSNPs is known to be NP-hard. NP-hardness means that there currently does not exist a way of solving the problem optimally with reasonable computing resources. Even though it is strongly believed among theoretical computer scientists that this state of affairs will prevail, NP-hardness does not exclude the possibility of finding an adequate solution with a reasonable effort. The validity of this last statement hinges on the definitions of the adequacy of a solution and what constitutes reasonable effort. We will define reasonable effort to be polynomial time computability, while adequacy will be achieved by an approximation algorithm, the solution of which is provably no worse than a certain factor r times the optimal.
In recent years, a number of algorithms for calculating htSNPs were developed. Due to the high computational complexity of the problem, these algorithms were either stochastic or, when deterministic, only applicable to haplotypes with a few hundreds to thousands of SNPs. At present, this is not a limitation, since there are few samples that contain these many SNPs. Given the speed of progress and innovation in the biosciences it will, however, be only a matter of time until brute-force approaches will no longer be feasible. Stochastic algorithms do not have this limitation; due to their nature they can give only probabilistic bounds on the results achieved. We believe that approximation algorithms are a viable alternative to both stochastic and brute-force approaches. The advantage over the former is that the bounds on the results are guaranteed to hold; the advantage over the latter is that approximation algorithms can be executed in polynomial time.
The theoretical question of how many SNPs are required to tag a given number of haplotypes was investigated, using Boolean algebra, by Wiuf et al. [5]. The number of htSNPs required for association studies was investigated by Thompson et al. [6]. Previous work on exact algorithms for identifying htSNPs were based on search strategies [7,8]; a stochastic algorithm is given by Johnson et al. [9]. Wiuf et al. [5] and Bafna et al. [10] both present polynomial time algorithms that find a minimum cardinality set of SNPs for the special case that Wiuf et al. call pairwise compatible and Bafna et al. call complete LD, i.e., the four gamete test by Hudson and Kaplan [11] fails for any pair of SNPs. We will in this exposition focus on the general unrestricted case.

Results
The main results are summarized as follows.

Theorem 1
The haplotype tagging problem is NP-hard and approximable within 1+ln( ), but NP-hard to approximate within cln(n/2) for c > 0 and not approximable within (1 -ε)ln(n/2) for any ε > 0 unless NP ⊂ DTIME(n log log n ). If we bound n from above by a constant q, the problem is APX-complete.
We have that cln(n/2) = Ω(ln n) and 1 + ln((n 2 -n)/2) = O(ln n), by which we can see that the approximation bound is asymptotically tight. We present an algorithm with running time O( (2m -p + 1)) where p = min(n, m), for the MHT problem that exhibits the above approximation bound. This algorithm is easily implemented, the detailed pseudo code listing presented in this paper contains only 22 lines.
A second algorithm is presented that can be seen as an instance of a schema that lets us approach a family of haplotype tagging problems including the problem of tagging the haplotypes using a set of SNPs with minimal diameter. This latter result addresses the problem of selecting practical "window" size bounds in approaches where such are needed (Bafna et al. [10]).

Discussion
For 250 million samples (approximately the current population of the USA) the size of the set of tags returned would be at most 39 times larger than the optimum. If we assume that our computer is capable of doing 35 trillion operations per second (the world's currently fastest parallel supercomputer, NEC's Earth Simulator, is capable of this), and we wish to search among 1 million SNPs, the computation using the algorithm presented in this paper would only take about 41 days. If we happen to know that an optimal solution uses at most 1000 of these SNPs (28 is an absolute lower bound for 250 million samples), we would be done in about 2 hours. Hence, to approach truly large scale experiments, the authors believe that imple- Our analysis of the minimum haplotype tagging problem (MHT) relies on a relationship we establish to the minimum set cover problem (MSC). Exploiting this relationship, we can relate variants of the MSC problem to MHT. This we can do as we describe an algorithm that transforms a MHT instance into a MSC instance, solves this instance, and transforms this solution into a solution of the MHT problem. We can form variants of this algorithm by substituting particular variants of the MSC algorithm in this process. Below we list MSC variants, their known approximation properties, and if substituted into our algorithm schema, the corresponding resulting MHT problem that it solves.
• Minimum Exact Cover. This problem is approximable within 1 + ln m [12] and the associated tagging problem is to find a minimum set of SNPs such that the sets of haplotype pairs each of the SNPs distinguishes are as disjoint as possible.
• Maximum Coverage by at most k sets. This problem is approximable within e/(e -1) [13] and not approximable within e/(e -1) -O(1) [14] for e > 1. The associated tagging problem is the problem of discerning a maximum number of haplotypes with at most k SNPs.
• Minimum k-redundant Coverage is the problem of creating a minimum cover such that each element is covered at least k times. This problem is approximable within b -k + 1 [15] for a constant b. The associated tagging problem is the problem of requiring each SNP in a minimum tagging set to discern between at least k pairs of haplotypes.
• Minimum Diameter Coverage is the problem of finding a cover where the greatest distance between two elements in the cover is minimal. This problem is in general not approximable within a constant, but approximable within 2 if the distance measure observes the triangle inequality and no better approximation is possible [16]. The associated tagging problem is to find a tagging set of SNPs such that the maximal pairwise distance is minimal.
• Minimum Cluster Diameter Coverage is the problem where we assume that the cover can be partitioned into k clusters and the objective is to minimize the greatest distance between two elements in the same cluster. This problem is approximable within 2 for any fixed k and upper bound of cluster size l if the distance measure satisfies the triangle inequality [16]. The associated tagging problem is to find k disjoint sets of maximally l SNPs such that the greatest distance between two SNPs in the same set is minimized.
• Maximum Dispersion Cover is the problem of finding a cover that maximizes the min imum pairwise distance between elements in the cover. This problem is NP-hard and assuming P ≠ NP, no polynomial time approximation with bounded error guarantee exists [16]. The corresponding tagging problem is to find a set of SNPs that lets us discern between all haplotypes such that the minimum distance between two of these SNPs is maximized. We see that this problem is the most difficult to solve of all the problems considered in this exposition.
All these variations can be implemented by substituting the corresponding cover algorithm for the greedy set cover algorithm in the MHT algorithm schema. Approximation bounds and non-approximability results for the above variations can be used to establish the same for the resulting variation of the haplotype tagging problem analogously to what we do for the minimum set cover problem.

Conclusion
Although the haplotype tagging problem is hard, it can be approached with a simple, fast and practical algorithm. The contribution of this work is not only yet another fast and simple algorithm for the tagging problem, but also the proof that this algorithm delivers a solution with the asymptotically best error bound possible.
Furthermore, the algorithm schema we present via the connection to the minimum set cover problem is applicable to not only the original problem as it is defined in our analysis, but is applicable to a family of related problems, which address problems presented in published research.
As the algorithms presented are asymptotically optimal with respect to approximation bounds, meaning that solution quality cannot be significantly improved upon in polynomial time, efforts in tackling truly large scale problem instances should concentrate on distributing the computational efforts in parallel.

Methods
Let H = {h 1 , h 2 ,..., h n } be a set of haplotypes with associated SNP markers at positions S = {1, 2,..., m}, and let S' be a subset of S. We define h i [S'] to be the string consisting of marker values of haplotype h i found at the positions in S'. We can view H as an n × m matrix over the set of possible SNP alleles values. Assuming that they are bi-allelic, H becomes a Boolean matrix. The problem of discerning a set of haplotypes by a minimum cardinality set of SNPs, alternatively the problem of predicting SNPs for a set of haplotypes using a minimal set of SNPs, which we will call the minimum haplotype tagging problem (MHT), can be formulated as follows.

Problem 1 (MHT) Let H, S and h i be as above. Find a minimum cardinality set S' ⊆ S such that h i [S'] = h j [S'] implies i = j.
We now define some formal concepts and present results that we will use in the analyses that follow.

Formal concepts
We formally define a minimization problem as a 3tuple (p, s, m) where • p is the polynomial time computable characteristic function of the set of problem instances , • s(x) is the set of feasible solutions for instance x and its characteristic function is computable in polynomial time for any y such that |y| ≤ O(|x| q ) for some q N, • m is the polynomial time computable natural number measure we wish to minimize.
In other words, the problem is: given instance x such The problem of deciding whether a given tuple ( ≤ k} we will call the decision problem associated with , and, overloading notation slightly, denote by L( ).
Let f and g be two functions, and let and be two problems. If for all x we have that f(x) and for all y (f(x)) we have that g(x, y) (x) we say that the tuple (f, g) is a reduction from to . If both f and g are computable in polynomial time, we write ≤ p and call (f, g) a polynomial time reduction. Note that g is the identity on the solution for a reduction between decision problems.
In the following, we will rely on the result that an optimization problem as defined above is NP-hard if the associated decision problem is NP-complete [ [17], Theorem 1.2], and this can be established by finding a polynomial time reduction from a known NP-complete problem.
We now turn to the approximation properties of minimization problems. For instance x and solution y of minimi-zation problem , we define the performance ratio of y and the optimal to be and say that y s(x) is an (x, y)-approximate solution for instance x. Let be an algorithm that computes (x) s(x) for instance x of . We say that is an approximation algorithm and is an r(|x|)-approxima-  g(x, y)) ≤ 1 + α(r -1) for constant α ≥ 1 and constant rational r > 1.
If there exist functions f and g that fulfill requirements 1, 2, and 3 then the tuple (f, g, α) is an AP-reduction from to and we write ≤ AP .
Using AP-reductions, we can define completeness and hardness for classes of approximation problems analogously to how we do for problem complexity classes. In particular, a problem is said to be APX-hard if ≤ AP for all in APX. If itself is in APX, we say that is APX-complete. By transitivity of AP-reductions, given ≤ AP , we have that is APX-hard if is APXcomplete. We will rely on this result in the analyses that follow.

Analysis roadmap
We will proceed in the analysis with two goals in mind.
Our strategy is to "bracket" the problem of interest in a sequence of problems. As we choose the sequence endpoints such that we know the desired properties of these, we use reductions to propagate these results to the problems of interest.
We will prove NP-hardness of Problem 1 by reduction from the Minimum Set Cover problem (MSC). In fact, L(P MSC ) ≤ p L(P MHT ) ≤ p L(P MSC ). In order to prove the approximation properties we show that the reductions used in the NP-hardness proofs can be used to construct AP-reductions. We show that L(P MSC ) ≤ AP P MHT ≤ AP P MSC with α = 1. We further show that the existence of a bounded version of MSC that is in APX naturally leads to the existence of APX version of MHT. Since the bounded MSC problem is known to be APX-complete, we have that the bounded version of MHT is APX-complete. Requirement 3 in Definition 1 lets us conclude that a lower approximation bound for MSC can be used via the AP reductions to produce a lower approximation bound for MHT as well.

The optimization problems
The two problem formulations below are such that all instances can be seen as an n × m binary matrix B and the measure m for a feasible solution I is defined as m(I) = |I|. Hence, in the following, the problem definitions only consist of the specification of the function s for the instance B.
We can reformulate the "Minimum Haplotype Tagging" Problem 1 as follows.

Problem 2 (MHT) s(B)
The following problem is the "Minimum Hitting Set" problem which we will use as a known starting point in our analyses.

Problem 3 (MSC) s(B)
Johnson [12] presents a greedy 1 + ln m-approximation algorithm for MSC(B). Furthermore, if we bound m from above by a constant K, the MSC problem becomes APXcomplete and approximable by [18].

NP-hardness
As we know that L(MSC) is NP-complete we have by Lemma 1 that L(MHT) is NP-hard. We also have that any solution of MHT is of size at most the size of the instance it is a solution of. This lets us conclude that there exists a polynomial time verifier for L(MHT) languages, meaning that L(MHT) is in NP and hence NP-complete by the above lemma. This in turn lets us state the following theorem.
Proof: The theorem follows from that if B is an n × m instance of MHT, then f MSC (B) is the m × n', n' ≤ (n 2 -n)/2 instance of MSC. MSC is known to be approximable within 1 + ln m for n × m instances, and since we know from Lemma 3 that we have an AP-reduction from MHT to MSC with α = 1 we have that MHT is approximable within 1 + ln( ).

The bad news
Let MHT' be the version of MHT where we bound n from above by a constant.

Theorem 4 MHT' is APX-complete.
Proof: From Lemma 3 it follows that MSC' ≤ AP MHT' ≤ AP MSC', where MSC' is the version of MSC where we bound − m by a constant for n × m instances. From Theorem 3 it follows that MHT' is in APX, and as MSC' is known to be APX-complete [18], the theorem follows.
Proof: We know [19] that it is NP-hard to approximate MSC within cln m for any constant c > 0. We also know [14] that MSC is not approximable within (1 -ε)ln(n) for any constant ε > 0 unless NP ⊂ DTIME(n log log n ). From the proof of Lemma 3, we know that MSC ≤ AP MHT with α =

1, meaning that
If x is an m × n' instance of MSC, then f MHT (x) as defined in the proof of Lemma 3 is a 2n' × (m + [log 2 n]) instance of MHT. If we have a r(n)-approximation algorithm for MHT, we have a r(2n')-approximation algorithm for MSC, and the theorem follows.

Algorithms
We present here two polynomial time algorithms for solving the MHT problem that exhibit the approximation bound found above. The first algorithm is based on the transformation to the minimum set cover problem. This allows us to guarantee the performance bounds. The second algorithm does not compute a transformation explicitly, but accomplishes the same effects directly. The second algorithm is much simpler to implement, uses significantly less space and has the same asymptotic running time as an optimal implementation of the first.
The advantage of the algorithm with the explicit transformation to the minimum set cover problem is that it lets us produce algorithms for a family of haplotype tagging problems as presented in Section.
Let B be an n × m instance of the MHT problem such that each row is unique. The first algorithm is essentially applying an implementation of f MSC followed by the application of an algorithm for the set cover problem. To understand how this works, consider that each set in the collection represented in matrix form by f MSC (B) is associated with one column (representing a SNP) in B. We can form (n 2 -n)/2 pairs of row indices such that their respective rows in B are different. Let each of these pairs be associated with a unique identifier. Each set in f MSC (B) contains the unique identifiers of the row pairs that differ in the associated column in B. The objective of the MHT problem is to find a minimum number of columns (equivalently sets in f MSC (B)) such that they together dis-cern between all pairs of rows (cover all the unique row pair identifiers).
Johnson's 1 + ln m greedy approximation algorithm [12] iteratively selects the set that covers the most uncovered elements and adds it to the initially empty solution. This is done until all elements are covered. It is well known that a greedy algorithm for the unweighted set cover problem on the collection can be implemented to run in and exhibits the promised approximation bound. We will in the following refer to this algorithm as MHT.
The second algorithm is based on the following observation. Selecting the set that covers the most uncovered elements in the algorithm above is equivalent to selecting the column that discerns between the most pairs of previously undiscerned rows. Hence we can achieve the same effect as the first algorithm by recursively partitioning the set of rows in B by at each iteration selecting the column that refines the partition the most. This more direct algorithm we will call D-MHT. Pseudo code for it is given below.  Now assume that m > n > 1, and look at the ratio of the running times of D-MHT and MHT respectively. This ratio is given as Again using that m > n, we see that for n > 1. Indeed, as n → ∞ the above approaches 2.
Now assume that 1 <m ≤ n. Then we have that the same ratio is The ratio above is maximal when m = n, 1 when m = n -2, and we see a linear increase of the reciprocal of the ratio with the quantity n -m. Knowing that p ≤ min(n, m) yields We hence conclude that the direct algorithm has the same worst case asymptotic running time as the one based on the optimal implementation of the minimum set cover problem. The direct algorithm is simpler to implement, might run faster in short solution cases, and has significantly smaller space requirements as we do not need to construct the set cover instance. Also, it might be of interest to weight SNPs. While including weights is trivial in the D-MHT algorithm, implementing an O(mn) version of the minimum weighted set cover problem can prove to be a difficult task. This is because the unweighted algorithm depends on a data structure that allows remove-max, remove, and insert in constant time but depends on keys to lie in a known, finite, and small set. However, a O((mn) ln(mn)) implementation is readily available.
A third algorithm can be constructed by applying the best suited of the two algorithms presented above. The analysis suggests that one could choose MHT over D-MHT whenever m > n -2.

Authors' contributions
SAV analyzed the approximation properties of the MHT problem and suggested the D-MHT algorithm. SD participated in the analysis of the running time complexities and the placement of the paper in context. LOM related the work to the context it is presented in. All authors wrote parts of the manuscript.