We implemented two versions of the integer linear program both of which were competitive in practice. The first is a direct integer linear program implementation and the second is a branch-and-bound algorithm that wraps over a second integer linear program. We describe the direct implementation first followed by the branch-and-bound method. Both methods were implemented in C++ using the Concert Technology of CPLEX 10.0 for integer linear program (ILP) solutions. We found the branch-and-bound method to give generally lower run times in practice than the direct ILP method. We therefore used the branch-and-bound method exclusively in generating the empirical results presented here.

### Direct Integer Linear Programming Approach

This section introduces our ILP algorithm to solve the Genotype MP Phylogeny Problem. In the first subsection, we introduce pre-processing techniques that typically reduce the problem size after which we describe the ILP.

#### Preprocessing

Preprocessing techniques form an integral part of any solution method based on integer programming. We now describe the major preprocessing methods used.

Let *G* be the *n* × *m* input genotype matrix. Without loss of generality we can assume that all *m* sites are varying. We can also remove redundant rows (genotypes) of *G* until all rows are distinct, since this does not change the length of the optimal phylogeny. We now describe a method to remove redundant sites (columns) from *G*. Note that we are free to exchange labels 0 and 2 (homozygous major and minor alleles) independently at each site without change in the size of the phylogeny. Therefore two sites *i* and *j* are considered redundant if they are identical or become identical after relabeling one site.

For all sites *k*, let weight *w*
_{
k
}be initialized to 1. We then iteratively perform the following operation: for any pair of redundant sites *i*, *j*, set *w*
_{
i
}:= *w*
_{
i
}+ *w*
_{
j
}, and remove site *j* from the matrix. Let *G'* be the final matrix after this sequence of preprocessing steps. We now redefine the length of a phylogeny using a weighted Hamming distance as follows.

### Definition 3

*The* length *of phylogeny T*(*V*, *E*) *is* length(*T*) = ∑_{(u,v)∈E
}∑_{i∈D(u,v)}*w*_{
i
}, *where D*(*u*, *v*) *is the set of sites where u, v differ*.

The following lemma justifies the preprocessing step:

### Lemma 1

*The transformation from genotype matrix G to weighted genotype matrix G' does not change the length of the most parsimonious phylogeny*.

### Proof 1

*For any genotype matrix I, let T*_{
I
}*denote the optimal phylogeny on I. For a site i of I, let j be a redundant site and consider the matrix I* ∪ {*j*}. *The topology of phylogeny T*_{
I
}*also gives a phylogeny for I* ∪ {*j*}, *obtained by mutating j wherever i mutates. The length of T*_{I∪{j} }*is* length(*T*_{I∪{j}}) = length(*T*_{
I
}) + *μ*(*i*), *where μ*(*i*) *is the number of times site i mutates in T*_{
I
}. *Now, assume that the most parsimonious phylogeny T*_{
G
}*for G resolves redundant genotype sites i and j differently, i.e., there is a haplotype for which sites i and j differ. Without loss of generality, suppose μ*(*i*) ≤ *μ*(*j*) *in T*_{
G
}. *Then removing column j from T*_{
G
}*results in a phylogeny T*_{G\{j} }*with* length(*T*_{G\{j}}) = length(*T*_{
G
}) - *μ*(*j*). *Now, since j is identical to i, the argument above implies that adding site j back to the phylogeny gives a tree with length* length(*T*_{
G
}) - *μ*(*j*) + *μ*(*i*) ≤ length(*T*_{
G
}). *Therefore, there is an optimal phylogeny resolving sites i and j identically*.

Due to these preprocessing steps, we assume from now on that the input genotype matrix *G* has distinct rows, distinct sites, and weights *w*
_{
i
}≥ 1 associated to sites. For each genotype *g* ∈ *G*, we create the set *R*(*g*) consisting of all possible pairs of haplotypes explaining *g*. Note that if *p* is the number of heterozygous sites in *g*, then *R*(*g*) consists of *2*
^{p-1 }pairs of haplotypes.

Now, consider the matrix *H* = ⋃_{g∈G}
*R*(*g*), where the rows are all possible haplotypes that can explain input genotypes in *G*. *H* is a binary matrix, and for such instances, structural properties of the optimal phylogeny can be captured by a graph known as the *Buneman graph*
(*H*) [30]. We will explain the generalization of this graph due to Barthélemy [31].

For a binary input matrix *H* and a site *c* of *H*, the *split c*(0)|*c*(1) defined by *c* is a partition of the haplotypes into two sets, where *c*(0) is the set of haplotypes with value 0 in site *c* and *c*(1) is the set of haplotypes with value 1 in site *c*. Each of *c*(0) and *c*(1) is called a *block* of *c*. Each vertex of the Buneman graph is an *m*-tuple of blocks [*c*
_{1}(*i*
_{1}), *c*
_{2}(*i*
_{2}),...*c*
_{
m
}(*i*
_{
m
})] (*i*
_{
j
}= 0 or 1 for each 1 ≤ *j* ≤ *m*), with one block for each site and such that each pair of blocks *c*
_{
j
}(*i*
_{
j
}) ∩ *c*
_{
k
}(*i*
_{
k
}) has nonempty intersection. There is an edge between two vertices in
(*H*) if and only if they differ in exactly one block. Notice that vertices in the Buneman graph can be viewed simply as haplotypes. An *m*-tuple [*c*
_{1}(*i*
_{1}),...,*c*
_{
m
}(*i*
_{
m
})] translates to haplotype (*i*
_{1},...,*i*
_{
m
}). Buneman graphs are very useful due to the following theorem:

### Lemma 2

[11]*The Buneman graph*
(*H*) *for input H on m sites can be constructed in time O(km) where k is the number of vertices in*
(*H*).

The Buneman graph is simply a method to reduce the size of the underlying graph from an *m*-cube with 2^{
m
}vertices to a (typically significantly) smaller sub-graph. Putting together these methods, we can summarize our preprocessing steps as follows:

1. Create a weighted genotype matrix *G* where sites are pair-wise distinct.

2. Create a set *H* of all possible haplotypes explaining rows of *G*.

3. Construct the underlying graph *F*(*V*, *E*) =
(*H*) where *H* ⊆ *V* and (*u*, *v*) ∈ *E* connects two vertices (haplotypes) if and only if they differ in exactly one site. Edge weights *w*
_{u,v}= *w*
_{
i
}where *i* is the site at which *u* and *v* differ.

We apply some additional heuristic preprocessing steps that have proven very effective in practice. One of these steps identifies a subset of haplotypes that must occur in any optimal solution and then removes from the input any genotypes that can be produced from pairs of these obligatory haplotypes. As any optimal output can produce these genotypes, their absence will not change the final output. We can also eliminate certain possible haplotypes because they would imply high-weight edges and therefore cannot occur in any low-cost solution.

Once all preprocessing steps have been applied, we have a weighted Buneman graph *B*(*H*) that contains every node and edge that might be included in any optimal phylogeny for *G*. We now show an ILP formulation to simultaneously select the optimal subset *H'* ⊆ *H* such that all of *G* can be derived from *H'* and connect *H'* using a tree.

#### ILP Formulation

We now develop an ILP formulation for the problem based on multicommodity flows [32]. The formulation borrows from prior work on fast ILP solution of maximum parsimony phylogenies on haplotypes [11]. Although this formulation can use exponential numbers of variables and constraints in the worst case, it is fast in practice. It is possible to solve the maximum parsimony genotype problem using an ILP with polynomial numbers of variables and constraints, but all polynomial-size variants that we developed proved intractable in practice.

The high-level idea of the method is to send flow from a designated root to each haplotype that is used to explain an input genotype. Each of these haplotypes acts as a sink for one unit of flow. The program must select a subset of edges that accommodate all flow while minimizing the cost of the edges selected. This flow formulation guarantees that every haplotype is connected to the root and the minimization prevents formation of cycles. The formulation thus forces the output to be a tree. For the sake of simplicity, we assume that the all-zeros haplotype is present in all the solutions. We can treat this as the *root*.

Let *h*
_{
k
}be an indicator variable denoting the presence or absence of haplotype *k* ∈ *H*. If *h*
_{
k
}= 1, *k* is called a *present* haplotype. We have binary variables *p*
_{i,j}that denote the presence of both haplotypes *h*
_{
i
}and *h*
_{
j
}. All the present haplotypes act as a sink for one unit of flow from the root vertex. On the other hand, all non-present haplotype vertices and Steiner vertices satisfy perfect flow conservation. To enforce this, we use two types of binary variables
and *s*
_{i,j}for each edge (*i*, *j*) ∈ *E*. The variables
are real valued and represent the amount of flow along edge (*i*, *j*) whose final destination is haplotype *k*. Note that if *k* is a non-present haplotype, then
should be set to 0 for all edges (*i*, *j*). Variables *s*
_{i,j}are binary variables that denotes if edge (*i*, *j*) of the graph has been selected. We want to enforce that flow can be sent along edge (*i*, *j*) only if it has been selected.

We now have the following integer linear program:

In constraints (2) and (3), variable *p*
_{
ij
}indicates the presence of the haplotype pair (*h*
_{
i
}, *h*
_{
j
}). Constraint (4) guarantees that each genotype is explained by at least one pair of haplotypes. Constraint (5) imposes inflow/outflow constraints on haplotypes as well as enforcing the condition that there is positive flow to a haplotype *h*
_{
k
}only if *h*
_{
k
}is selected. Constraint (6) imposes flow conservation at all non-present haplotype vertices as well as Steiner vertices and constraint (7) imposes the condition that flow can only be sent along edges present in the solution. Note that all integer variables of the above linear program are binary. Finally, we observe that the solution of the ILP is the size of the most parsimonious phylogeny on *G*.

### Branch and Bound Algorithm

We developed an alternative method for the problem that uses a simpler integer linear program embedded in a branch-and-bound routine. The high-level idea behind the method is to first guess the set of haplotypes that would phase the given input genotypes and then construct a most parsimonious phylogeny on the haplotypes. Note that all the pre-processing techniques outlined in the previous sub-section still apply for this method.

We use *G* to refer to the input set of genotypes. For a given set of haplotypes
, we can construct the most parsimonious phylogeny
using the algorithm described by Sridhar et al. [33]. We will use hapMP to denote this algorithm, which will take a set of haplotypes and return the size of the most parsimonious phylogeny. We now have the following branch and bound method.

function genBB(genotypes *G*, haplotypes
, integer *u*)

- 1.
for all row vectors

∈

*G*
- (a)
if ∃*h*
_{1}, *h*
_{2} ∈
s.t.
then *G* ← *G*\{*g*}

- 2.
if (|*G*| = ∅) then return hapMP(
)

- 3.
if (hapMP(
) ≥ *u* - 1) then return ∞

- 4.
let
be a row vector of *G*

- 5.
for all

,

s.t.

- (a)
*G'* ← *G*\{
}

- (b)
*H'* ←
∪ {*h*_{1}, *h*_{2}}

- (c)

- (d)

- 6.

The *branch step* is performed by Step 5, where the algorithm attempts to phase genotype *g* using all possible pairs of haplotypes *h*
_{1}, *h*
_{2}. Integer *u* of the above pseudo-code refers to the current best upper-bound. The *bound step* is performed by Step 3 which just reconstructs a phylogeny over the current set
of haplotypes. Step la ensures that at least one more haplotype *h* ∉
is required to obtain the final set of haplotypes. Therefore, even if hapMP(
) = *u* - 1, this branch cannot yield a solution of smaller cost than current upper-bound *u*.

In the above method, the height of the branch-and-bound tree is at most *n*, the number of input genotypes. The branching factor at each internal node is at most 2^{
k
}where *k* is the number of heterozygous sites on the genotype *g*. This is always bounded by 2^{
m
}. Although the running-time of the final branch-and-bound method is super-exponential, we find that its run time is competitive with and often superior to the ILP described in the previous section.

### Data Generation and Analysis

In order to generate simulated data, coalescent trees were created using Hudson's ms program [34]. The only parameter required to generate tree topologies is the number of haploid chromosomes *n*
_{
h
}. The ms program can also use this tree to produce haplotype sequences, but does so under the infinite-sites model (without any recurrent mutations). We therefore instead used the seq-gen program of Rambaut and Grassly [35] to generate *n*
_{
h
}haplotypes using the ms coalescent tree. We varied the number of SNPs *m* and the mutation rate parameter *θ* = 4*N*
_{0}
*μ*, where *μ* is the probability of mutation of any of the simulated SNPs in one generation and *N*
_{0} is the effective population size. We relate the simulation parameter *μ* to the per-site mutation rate by assuming an effective population size *N*
_{0} = 10, 000 (a reasonable estimate for humans [36]). For instance, for 5 sites, we obtain a per-site mutation probability of 10^{-6} for *θ* = 0.2.

seq-gen was used under the GTR model, a generic time reversible Markov model. Mutation rates between *A* and *C* and between *G* and *T* were defined using the same parameter *θ*. For all the other four pairs we set the mutation rate to be 0 in order to produce biallelic data. The exact command line used to execute seq-gen for a given mutation rate parameter *θ* and SNP number *m* was the following:

seq-gen -mGTR -r *θ*, 0, 0, 0, 0, *θ* -l *m*

Each data point was generated from 200 independently generated simulated data sets, with the reported error rates summed over the 200 replicates. In our first set of simulations, designed to test the effect of mutation rate on accuracy, we varied *θ* over the range 0.2–0.6 in increments of 0.05 for windows of 5 and 10 SNPs and for sample sizes of 30 and 60 input haplotypes. Our second set of experiments, designed to test the effect of sample size on accuracy, fixed *θ* at 0.5 and varied the number of haplotypes from 30 to 120 in increments of 10 for windows of 5 and 10 SNPs. Data points plotted represent summed errors over the 200 replicates per parameter value.

Mitochondrial data was extracted from of a set 63 complete mitochondrial DNA sequences of 16,569 bases each produced by from Fraumene et al. [27]. We produced artificial diploids from the data by randomly selecting 60 of the sequences and randomly grouping them into 30 pairs. We computationally inferred haplotypes from all of the genotypes using fastPHASE and we constructed phylogenies for all sliding windows of 50 bases across the data set by each of three methods: maximum parsimony using true haplotypes, inferred haplotypes and from the genotypes.

Autosomal DNA was extracted from a lipoprotein lipase (LPL) data set due to Nickerson et al. [29]. Because the pairs of haplotypes into genotypes were not published, we duplicated the first haplotype to obtain 78 distinct sequences and then randomly paired them to produce 39 artificial genotypes from the true haplotypes. As in the previous case, we ran fastPHASE and haplotyper on all of the SNPs put together to obtain inferred haplotypes. In order to reduce the possibility of recombination events confounding our results, we used the HAP webserver [16] to break the 86 SNPs into blocks. HAP was also used to infer missing data. We then evaluated phylogeny sizes by our direct method, from the true haplotypes, and from the inferred haplotypes for each block.