# Haplotype reconstruction using perfect phylogeny and sequence data

- Anatoly Efros^{1}
- Eran Halperin^{1, 2, 3}

**13(Suppl 6)**:S3

https://doi.org/10.1186/1471-2105-13-S6-S3

© Efros and Halperin; licensee BioMed Central Ltd. 2012

**Published: **19 April 2012

## Abstract

Haplotype phasing is a well-studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing when the number of samples sequenced is low and when the sequencing coverage is low. High-throughput sequencing technologies enable new possibilities for the inference of haplotypes. Since each read originates from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields a much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach for haplotype phasing which leverages these two properties. Our suggested algorithm, called *Perfect Phylogeny Haplotypes from Sequencing* (*PPHS*), uses a perfect phylogeny model and models the sequencing errors explicitly. We evaluated our method on real and simulated data, and we demonstrate that the algorithm outperforms previous methods when the sequencing error rate is high or when coverage is low.

## Introduction

The etiology of complex diseases is composed of both environmental and genetic factors. In the last few decades, there has been a tremendous effort to discover the genetic component of the etiology of a large number of common traits, that is, characterizing the heritability of these traits. In recent years, much of this effort has been focused on genome-wide association studies (*GWAS*), in which the DNA of a population of cases (individuals carrying the studied condition), and a population of controls (general population) is being measured and compared. These studies have been focusing on measurements of single nucleotide polymorphisms (*SNPs*), which are positions in the genome in which at some point in history there has been a mutation that was fixed in the population.

In recent years, genotyping technology, or the extraction of the SNP information from the genome, has been advancing rapidly. Only a few years ago, genome-wide association studies were simply infeasible. On the other hand, even the most modern genotyping technologies only provide a partial picture of the genome, since the number of positions measured with these technologies is typically less than a million, while there are more than three billion positions in the genome. Additionally, there are many different types of genetic variations that are not captured well by genotyping technologies, particularly rare SNPs, and short deletions and insertions. For this reason the next generation of genetic studies of diseases will surely include the new high-throughput sequencing technologies, or next generation sequencing platforms (*NGS*). These technologies provide in one experiment hundreds of millions of short sequence reads of the sampled DNA.

The technical analysis of disease association studies encountered a few computational challenges, some of which will remain when considering NGS based studies. One of the major obstacles in these studies has been the inference of haplotypes from the genotype data (*phasing*). As opposed to genotypes, a haplotype is the sequence of alleles across a chromosome. Genotype technologies provide the information about the number of minor alleles occurring at each position, but not the relation between the positions. Mathematically, we can think of a haplotype as a binary string, and the genotype is simply the sum of the two haplotypes of the corresponding chromosomes. In the last decade, many phasing algorithms have been suggested [1–7]; these algorithms take as input a set of genotypes, and leverage the correlations between neighboring SNPs, or the linkage disequilibrium (*LD*), to infer the haplotypes.

The phasing problem is of a different nature when applied to sequencing studies. First, unlike genotyping technologies, sequence data allows us to consider both SNPs and short structural variations (e.g., short deletions). Second, in sequencing technologies the measured SNPs are closer to each other than in genotyping, resulting in a much higher LD, or correlations between neighboring SNPs. Third, each short read obtained from the sequencing platform is read from one chromosome, and some of these reads may contain more than a single SNP, meaning that partial haplotypes are provided. Finally, the noise produced by NGS technologies is inherently different from that of genotyping technologies; the sequence reads contain substantially more errors than genotypes, especially towards the end of the reads, and the final errors made by the algorithms are highly dependent on specific parameters such as the *coverage*, which is the expected number of reads overlapping with a specific position in the genome.

Unlike the case of genotyping, sequencing allows the possibility of phasing a single individual (genotype data requires a population). Very recently, a few methods were suggested for the problem [8, 9]; these methods leverage the fact that each sequenced read originates from a single chromosome, and that all variants covered by the read must originate from the same haplotype. They use these sets of partial haplotypes to reconstruct a full haplotype.

## Methods

The basic assumption of our algorithm is that within a short region, the history of the genetic variants (SNPs or deletions) follows the perfect phylogeny model, i.e., there are no recurrent mutations or recombinations. We denote by *T* = (*V*, *E*) the underlying directed perfect phylogeny tree, where each *v* ∈ *V* corresponds to a haplotype of length *m* (a string over {0, 1}^{m}) and every edge corresponds to a mutation. For every *v* ∈ *V*, let *p*_{v} be the haplotype frequency in the population.

### Modeling the sequencing procedure

The haplotypes themselves are not given by the sequencing procedure; instead, the sequencing procedure provides a large set of short reads, sampled from random positions in the genome and from a random copy of the chromosome (out of the two copies). The sequencing itself is a noisy procedure which depends on three parameters: the coverage *k*, the read length *l*, and the sequencing error rate *ε*. We assume each read is a copy of *l* bases extracted from the genome starting at a random position. The copy is not exact, and we assume that each base is read incorrectly with probability *ε*. As a simplifying assumption, the error rate does not depend on the genomic position.
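The generative process above can be sketched as a small simulation. This is our own illustration, not the authors' software: haplotypes are lists of 0/1 alleles, and `simulate_reads` is a hypothetical helper name.

```python
import random

def simulate_reads(hap1, hap2, coverage_k, read_len_l, error_eps, rng=None):
    """Simulate reads from a pair of haplotypes (lists of 0/1 alleles).

    Each read starts at a uniformly random position, is copied from a
    uniformly chosen chromosome, and every base is flipped independently
    with probability error_eps.
    """
    rng = rng or random.Random(0)
    m = len(hap1)
    # Number of reads chosen so each site is covered ~coverage_k times.
    n_reads = max(1, round(coverage_k * m / read_len_l))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, m - read_len_l + 1)
        source = hap1 if rng.random() < 0.5 else hap2
        bases = []
        for j in range(start, start + read_len_l):
            allele = source[j]
            if rng.random() < error_eps:
                allele = 1 - allele  # sequencing error flips the allele
            bases.append(allele)
        reads.append((start, bases))
    return reads
```

With `error_eps = 0` every read is an exact copy of one chromosome, which is the property the phasing algorithm exploits.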

### Problem statement

We assume the reads of each individual are sampled uniformly from the two chromosomes, with an error probability *ε* in each position of the read independently. We will denote the set of reads of individual *i* as *R*_{i}. Additionally, we will denote the two haplotypes of individual *i* as ${H}_{i}=\left({h}_{i}^{1},{h}_{i}^{2}\right),$ where ${h}_{ij}^{1},{h}_{ij}^{2}$ are the alleles of the haplotypes in SNP *j*, and ${h}_{i,S}^{1},{h}_{i,S}^{2}$ are the haplotypes restricted to the set of SNPs *S*. If *H*_{i} is known for each *i*, we can write the likelihood of the data as

$$L=\prod_{i}\prod_{r\in R_{i}}\Pr\left(r\mid H_{i}\right),$$

where Pr(*r* | *H*_{i}) is the probability of observing read *r* given the two haplotypes of individual *i*. Particularly, if *r* spans the positions of a set of SNPs *S*, and if *d*(*x*, *y*) is the Hamming distance between two sequences *x* and *y* (restricted to the set of SNPs), then

$$\Pr\left(r\mid H_{i}\right)=\frac{1}{2}\left[\varepsilon^{d\left(r,h_{i,S}^{1}\right)}\left(1-\varepsilon\right)^{\left|S\right|-d\left(r,h_{i,S}^{1}\right)}+\varepsilon^{d\left(r,h_{i,S}^{2}\right)}\left(1-\varepsilon\right)^{\left|S\right|-d\left(r,h_{i,S}^{2}\right)}\right].\qquad(1)$$
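The read likelihood Pr(*r* | *H*_{i}) of Equation (1) can be computed directly; a minimal sketch of our own, with alleles coded as 0/1 and a read represented by its observed alleles at the SNPs in *S*:

```python
def read_likelihood(read_alleles, h1, h2, eps):
    """Pr(r | H_i) as in Equation (1): a 50/50 mixture over the two
    haplotypes, with per-site error probability eps.

    read_alleles, h1, h2: equal-length lists of 0/1 alleles restricted
    to the SNP set S covered by the read.
    """
    def one_hap(h):
        d = sum(a != b for a, b in zip(read_alleles, h))  # Hamming distance
        return (eps ** d) * ((1 - eps) ** (len(h) - d))
    return 0.5 * (one_hap(h1) + one_hap(h2))
```

For example, a two-SNP read matching one haplotype exactly and mismatching the other at both sites, with *ε* = 0.1, has likelihood 0.5 · (0.9² + 0.1²) = 0.41.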

Our algorithm aims at finding a perfect phylogeny tree on the set of SNPs in a given window, and a corresponding haplotype assignment for each individual. The haplotypes assigned to each individual need to be taken from the tree, and the objective is to optimize the likelihood of the reads. We will explain later how the perfect phylogeny assumption can be relaxed.

### Tree reconstruction within a window

Within a window, the tree reconstruction algorithm works in the following way. We first search for a SNP *j** that is adjacent to the root of the tree. We then use a partitioning procedure to decide, for every other SNP *j*, whether it is a descendant of *j** in the tree. This partitioning procedure splits the set of SNPs into two, and we recursively compute the two subtrees.

Throughout, we assume that the alleles of each SNP are represented by the {0, 1} notation, where 0 is the more common allele. As shown in [10], the set of haplotypes corresponds to a perfect phylogeny if and only if it corresponds to a perfect phylogeny with the all-zeros vector as the root. We thus assume that the root of the tree is the haplotype with all 0 values. Under these assumptions, it is clear that for two SNPs *j* and *j'*, SNP *j* cannot be a descendant of SNP *j'* if their corresponding allele frequencies satisfy *f*_{j} > *f*_{j'}. Therefore, we first estimate the minor allele frequency of each SNP and we choose the SNP *j** as the SNP with the largest estimated allele frequency, breaking ties arbitrarily.

For a SNP *j* and an individual *i*, we denote by *R*_{ij} ⊆ *R*_{i} the set of reads that overlap with SNP *j*. We write the likelihood of the minor allele frequency *f*_{j} as follows:

$$L\left(f_{j}\right)=\prod_{i}\sum_{g\in\{0,1,2\}}f_{g}\prod_{r\in R_{ij}}\Pr\left(r\mid G_{ij}=g\right),$$

where *f*_{g} is the frequency of genotype *g* under Hardy-Weinberg equilibrium (*f*_{0} = (1 - *f*_{j})², *f*_{1} = 2*f*_{j}(1 - *f*_{j}), *f*_{2} = *f*_{j}²), and *G*_{ij} is the genotype of individual *i* at position *j*. Note that Pr(*r* | *G*_{ij}) is given in Equation (1), restricted to the case *S* = {*j*}. Now, ${A}_{i,j,g}={\prod}_{r\in {R}_{ij}}\text{Pr}\left(r\mid {G}_{ij}=g\right)$ can be computed in linear time, in a preprocessing procedure, regardless of *f*_{g}. Therefore, the log likelihood can be rewritten in the following way:

$$\log L\left(f_{j}\right)=\sum_{i}\log\sum_{g\in\{0,1,2\}}f_{g}A_{i,j,g}.$$

We use an expectation-maximization [11] (*EM*) algorithm to estimate *f*_{j}.
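The EM iteration for *f*_{j} can be sketched as follows. This is our own illustration, not the published implementation: `A[i][g]` stands for the precomputed *A*_{i,j,g}, and we make the Hardy-Weinberg genotype prior explicit.

```python
def em_allele_freq(A, n_iter=100, f=0.2):
    """Estimate the minor allele frequency f_j by EM.

    A: list over individuals of [A_{i,j,0}, A_{i,j,1}, A_{i,j,2}],
       the per-genotype read likelihoods for genotypes g = 0, 1, 2.
    """
    for _ in range(n_iter):
        # E-step: posterior over genotypes given current f (HWE prior).
        prior = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
        exp_alleles, n = 0.0, 0
        for Ai in A:
            w = [prior[g] * Ai[g] for g in range(3)]
            tot = sum(w)
            if tot == 0:
                continue
            post = [x / tot for x in w]
            # Expected number of minor alleles carried by this individual.
            exp_alleles += post[1] + 2 * post[2]
            n += 1
        # M-step: each individual carries two alleles.
        f = exp_alleles / (2 * n)
    return f
```

When the reads determine the genotypes unambiguously, the update converges in one step to the empirical allele frequency.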

### The partitioning procedure

Now that we have chosen *j**, the algorithm proceeds by partitioning the other SNPs into two sets, *T*_{1} and *T*_{2}, where *T*_{1} corresponds to the subtree of the root excluding the edge *j** and its descendants, and *T*_{2} corresponds to the subtree located below *j**. In order to find this partition, for each SNP *j*, we calculate the likelihood of *j* being in *T*_{1} versus the likelihood of *j* being in *T*_{2}, and we assign *j* to the tree for which the likelihood is higher. We later refer to this partitioning method as PPHS-2.

This procedure considers only the relations between *j** and all other SNPs. For this reason, we also consider an alternative partitioning approach (PPHS-3), which is based on the pairwise relations of pairs of SNPs with *j**. Formally, for each *j*_{1}, *j*_{2} ∈ {1,..., *m*} \ {*j**}, we consider the four possible configurations, that is, (1) *j*_{1}, *j*_{2} ∈ *T*_{1}; (2) *j*_{1}, *j*_{2} ∈ *T*_{2}; (3) *j*_{1} ∈ *T*_{1}, *j*_{2} ∈ *T*_{2}; and (4) *j*_{1} ∈ *T*_{2}, *j*_{2} ∈ *T*_{1}. Each of these configurations corresponds to a set of possible subtrees induced by *j**, *j*_{1}, *j*_{2}. Particularly, in cases (1) and (2) there are three possible subtrees, while cases (3) and (4) correspond to a unique subtree (see Figure 1).

For each of these subtrees, *T**, we estimate the most likely haplotype frequency distribution over the tree using EM. As before, *p*_{v} corresponds to the frequency of the haplotype represented by vertex *v*, and *p*^{t}(*v*) corresponds to the guess of the EM algorithm at step *t*.

In practice, we start the EM algorithm from a distribution ${p}_{v}^{0}$ defined as follows. Let *s*_{0} be the SNP corresponding to the parent edge of *v*, and let *S* be the set of SNPs corresponding to the children edges of *v*. We set ${p}_{v}^{0}={p}_{{s}_{0}}-{\sum}_{s\in S}{p}_{s},$ where *p*_{s} is the estimate of the allele frequency of SNP *s* using the EM as described above. It is easy to verify that if the allele frequencies are accurately estimated, then the haplotype frequencies are correctly estimated as well (restricted to the tree *T**).
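The initial distribution can be read directly off the tree; a one-line sketch of our own (`initial_hap_freq` is a hypothetical helper, with the root's "parent frequency" taken as 1):

```python
def initial_hap_freq(parent_freq, child_freqs):
    """p_v^0 = p_{s_0} - sum of children-edge frequencies, as in the text.

    parent_freq: estimated allele frequency of the SNP on the edge
                 entering v (use 1.0 for the root).
    child_freqs: estimated allele frequencies of the SNPs on the edges
                 leaving v.
    """
    return parent_freq - sum(child_freqs)
```

For instance, a vertex whose parent edge has frequency 0.4 and whose children edges have frequencies 0.1 and 0.2 gets initial frequency 0.1.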

Given the haplotype frequencies, we can compute the likelihood of each possible subtree *T**. The likelihood of the subtree can now be written as

$$L\left(T^{*}\right)=\prod_{i}\sum_{v_{1},v_{2}\in T^{*}}p_{v_{1}}p_{v_{2}}\prod_{r\in R_{i}}\Pr\left(r\mid h_{v_{1}},h_{v_{2}}\right),$$

where the reads and haplotypes are restricted to the SNPs *j**, *j*_{1}, *j*_{2}. For each *a*, *b* ∈ {1, 2} we construct a graph *G*_{ab} = (*V*, *E*_{ab}), where *V* is the set of SNPs. In graph *G*_{ab}, an edge ${e}_{{j}_{1},{j}_{2}}$ ∈ *E*_{ab} has an associated weight *w*_{ab}(*j*_{1}, *j*_{2}), the log-likelihood of the most likely subtree under the configuration *j*_{1} ∈ *T*_{a}, *j*_{2} ∈ *T*_{b}. Let (*S*_{1}, *S*_{2}) be a partition of the graphs *G*_{ab}. Let

$$w_{ab}\left(S_{1},S_{2}\right)=\sum_{j_{1}\in S_{a},\,j_{2}\in S_{b}}w_{ab}\left(j_{1},j_{2}\right).$$

The algorithm proceeds by searching for the partition that maximizes ∑_{a,b} *w*_{ab}(*S*_{1}, *S*_{2}). In order to find the best partition, for low values of *m* we enumerate over all possible partitions, and for larger values of *m* we randomly pick subsets of the vertices, compute the partition for these subsets, and use a majority vote to decide on the overall partition.
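The brute-force search for small *m* could look like the sketch below. This is our own illustration: the `weight` dict stands in for the edge weights of the graphs *G*_{ab}, keyed by the hypothetical tuple `(a, b, j1, j2)`.

```python
from itertools import combinations

def best_partition(snps, weight):
    """Enumerate all 2-partitions (S1, S2) of `snps` and return the one
    maximizing the summed pairwise configuration weights.

    weight: dict mapping (a, b, j1, j2) -> weight for the configuration
    j1 in T_a, j2 in T_b (a, b in {1, 2}); missing keys count as 0.
    """
    snps = list(snps)
    best, best_score = None, float("-inf")
    rest = snps[1:]  # fix snps[0] in S1 to avoid counting partitions twice
    for k in range(len(rest) + 1):
        for chosen in combinations(rest, k):
            S1 = [snps[0]] + list(chosen)
            S2 = [j for j in rest if j not in chosen]
            sides = {j: 1 for j in S1}
            sides.update({j: 2 for j in S2})
            score = 0.0
            for j1 in snps:
                for j2 in snps:
                    if j1 != j2:
                        score += weight.get((sides[j1], sides[j2], j1, j2), 0.0)
            if score > best_score:
                best, best_score = (S1, S2), score
    return best, best_score
```

The enumeration is exponential in *m*, which is why the text falls back to random subsets plus a majority vote for larger windows.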

### Generation of haplotypes for each window

For every window, the recursive process described above returns the perfect phylogeny tree *T* which best fits the data. Now the algorithm needs to select for every individual *i* two haplotypes *h*_{1}, *h*_{2}. Since the data may not perfectly fit the model, we add additional haplotypes to the pool of possible haplotypes. Using the tree *T* we create a list *L* of haplotypes, where each haplotype *h* corresponds to a specific node *v* ∈ *T*. From each haplotype in *L* we create *m* new haplotypes, where *m* is the number of SNPs in the window, by adding all possible one-mutations to the haplotype. We call these haplotypes the haplotypes derived from *L*.
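The derived-haplotype pool is a simple one-mutation neighborhood expansion; a sketch of our own, with haplotypes represented as tuples of 0/1 alleles:

```python
def derived_haplotypes(L):
    """From each haplotype in L, create all one-mutation neighbors
    (flip exactly one SNP), as described in the text."""
    derived = set()
    for h in L:
        for j in range(len(h)):
            neighbor = list(h)
            neighbor[j] = 1 - neighbor[j]  # flip the allele at SNP j
            derived.add(tuple(neighbor))
    return derived - set(L)  # keep only haplotypes not already in the pool
```

Each tree haplotype of length *m* thus contributes up to *m* new candidate haplotypes.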

For each individual *i* we select the two haplotypes *h*_{1}, *h*_{2} which maximize the a posteriori probability

$$\Pr\left(h_{1}\right)\Pr\left(h_{2}\right)\prod_{r\in R_{i}}\Pr\left(r\mid h_{1},h_{2}\right).$$

The prior probabilities Pr(*h*) are calculated based on the haplotypes assigned so far to other individuals, and based on the fact that the haplotypes in *L* are more likely than the haplotypes derived from *L*. Particularly, when we consider individual *i* + 1, let *H*_{i} be the set of haplotypes assigned to the first *i* individuals, and for each haplotype *h*, let *n*_{h} be the total number of occurrences of *h* in *H*_{i}. Then, we set the vector of priors to be $\overrightarrow{p}=\left({p}_{{h}_{1}},...,{p}_{{h}_{k}}\right)$, which maximizes the maximum a posteriori (MAP) under a Dirichlet prior with *α*_{1} weight on the *L* haplotypes and *α*_{2} weight on the haplotypes derived from *L*:

$$\Pr\left(\overrightarrow{p}\mid H_{i}\right)\propto\prod_{h}p_{h}^{n_{h}+\alpha_{h}-1},$$

which is maximized for *p*_{h} ∝ *n*_{h} + *α*_{h} - 1.
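The closed-form MAP update for the prior vector can be sketched directly. This is our own illustration; the parameter names `alpha1`, `alpha2` and their default values are hypothetical, with `alpha1` applied to tree haplotypes and `alpha2` to derived ones.

```python
def map_prior(counts, in_tree, alpha1=2.0, alpha2=1.5):
    """MAP haplotype priors under a Dirichlet prior: p_h ∝ n_h + α_h − 1.

    counts:  dict haplotype -> n_h (occurrences among already-phased
             individuals).
    in_tree: set of haplotypes belonging to L (the tree haplotypes);
             all others get the smaller pseudo-count alpha2.
    """
    raw = {h: n + (alpha1 if h in in_tree else alpha2) - 1.0
           for h, n in counts.items()}
    total = sum(raw.values())
    return {h: v / total for h, v in raw.items()}  # normalize to sum to 1
```

Setting *α*_{1} > *α*_{2} biases the assignment toward haplotypes that sit on the reconstructed tree, as the text describes.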

### Stitching of windows

The framework discussed so far assumes that the haplotypes are inferred within a window of *m* SNPs. The length of the window is mainly determined by the assumption of perfect phylogeny, and therefore we cannot apply our method to long windows. When working on a large number of SNPs we use the BEAGLE results as a guide that connects the different windows.

Let ${h}_{{i}_{1}}^{\prime},{h}_{{i}_{2}}^{\prime}$ be the haplotypes returned by BEAGLE for individual *i*. Consider a window *w* of length *m* spanned by SNPs *j* ∈ {*w*,..., *w* + *m* - 1}. Let ${h}_{{i}_{1}}^{w},{h}_{{i}_{2}}^{w}$ be the haplotypes returned by the above algorithm for window *w*, and let ${h}_{{i}_{1}}^{{w}^{\prime}},{h}_{{i}_{2}}^{{w}^{\prime}}$ be the haplotypes returned by BEAGLE restricted to the same window *w*.

We define two distances *d*_{1}, *d*_{2}:

$$d_{1}=d\left(h_{i_{1}}^{w},h_{i_{1}}^{w^{\prime}}\right)+d\left(h_{i_{2}}^{w},h_{i_{2}}^{w^{\prime}}\right),\qquad d_{2}=d\left(h_{i_{1}}^{w},h_{i_{2}}^{w^{\prime}}\right)+d\left(h_{i_{2}}^{w},h_{i_{1}}^{w^{\prime}}\right),$$

where *d*(·, ·) is the Hamming distance over the window. In case *d*_{1} < *d*_{2} and *d*_{1} ≤ *d**, we use the haplotypes ${h}_{{i}_{1}}^{w},{h}_{{i}_{2}}^{w}$ in this orientation as the haplotypes for window *w*; if *d*_{2} < *d*_{1} and *d*_{2} ≤ *d**, we use the haplotypes ${h}_{{i}_{2}}^{w},{h}_{{i}_{1}}^{w}$ in the swapped orientation. In all other cases we use the haplotypes ${h}_{{i}_{1}}^{\prime},{h}_{{i}_{2}}^{\prime}$ which were returned by BEAGLE as the haplotypes for window *w*. Put differently, we use the BEAGLE haplotype results in order to determine the orientation of our solution if our solution is relatively close to BEAGLE's solution. If our solution is far from BEAGLE's solution, we assume that the perfect phylogeny model does not hold, and therefore we simply use BEAGLE's solution in this case. We chose the threshold *d** to be *m*/2 based on empirical evaluation; with this parameter the algorithm chooses the PPHS answer over BEAGLE's in 60-70% of the windows when working on small populations.
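The stitching rule can be written compactly; a sketch of our own, with haplotypes as equal-length allele lists:

```python
def stitch_window(w_haps, beagle_haps, d_star):
    """Choose the orientation of the PPHS window haplotypes using the
    BEAGLE haplotypes as a guide, falling back to BEAGLE when the PPHS
    solution is far from it (perfect phylogeny likely violated)."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    h1, h2 = w_haps
    b1, b2 = beagle_haps
    d1 = hamming(h1, b1) + hamming(h2, b2)  # same orientation
    d2 = hamming(h1, b2) + hamming(h2, b1)  # swapped orientation
    if d1 < d2 and d1 <= d_star:
        return (h1, h2)
    if d2 < d1 and d2 <= d_star:
        return (h2, h1)
    return (b1, b2)  # fall back to BEAGLE's haplotypes
```

With the text's choice *d** = *m*/2, a PPHS window survives stitching only when at least half its alleles agree with BEAGLE in one orientation.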

## Results

We compared the phasing accuracy of the algorithms by counting the total number of errors *S* + *M*, where *S* is the number of switch errors in the data and *M* is the number of mismatch errors. We refer to this as the SM error metric. While running BEAGLE we used the default software parameters, which are 10 iterations. The likelihoods for each SNP *k* were calculated using the following equation:

$$\Pr\left(R_{i}\mid H_{i}\right)=\prod_{r\in R_{ik}}\Pr\left(r\mid H_{i}\right),$$

where all reads and haplotypes are restricted to SNP *k*, for all options of *H*_{i} ∈ {(0, 0), (1, 0), (1, 1)}.
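The SM error metric combines switch and mismatch errors. Below is one reasonable way to count them (our own implementation, not the authors' evaluation code): a site whose inferred genotype disagrees with the truth is a mismatch, and among the remaining heterozygous sites each change of phase orientation relative to the truth is a switch.

```python
def sm_errors(true_haps, pred_haps):
    """Count S + M: mismatch errors (wrong genotype at a site) plus
    switch errors (phase flips between consecutive correctly-genotyped
    heterozygous sites)."""
    (t1, t2), (p1, p2) = true_haps, pred_haps
    mismatches, orientations = 0, []
    for j in range(len(t1)):
        if {t1[j], t2[j]} != {p1[j], p2[j]}:
            mismatches += 1          # the genotype itself is wrong
        elif t1[j] != t2[j]:         # heterozygous site: phase is informative
            orientations.append(p1[j] == t1[j])
    switches = sum(1 for a, b in zip(orientations, orientations[1:]) if a != b)
    return switches + mismatches
```

Homozygous sites carry no phase information, so only heterozygous sites contribute to the switch count.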

### Evaluation under Simulated data

We used the MSMS software [12] to simulate data from a perfect phylogeny. MSMS uses the coalescent process to create haplotypes, and when the mutation rate is set low enough, the resulting haplotypes fit the perfect phylogeny model.

### Accuracy of tree reconstruction

We simulated a perfect phylogeny tree and a set of *N* genotypes, where *N* varies from 5 to 500. We then simulated the sequencing procedure on these genotypes (see Methods). For a given set of genotypes, we let PPHS reconstruct the tree, and the set of haplotypes obtained can be compared to the haplotypes of the original tree. In the true tree there are 30 SNPs and therefore there are exactly 31 haplotypes, denoted by *H*_{True}. We denote the haplotype frequencies of *H*_{True} as *p*_{1},..., *p*_{31}. The output of the algorithm results in another tree with a possibly different set of haplotypes, *H*_{PPHS}. Let *S* = *H*_{True} ∩ *H*_{PPHS}. Then, we measure the accuracy of the reconstruction as ∑_{i∈S} *p*_{i}. We observed (Figure 2) that the reconstruction of the tree by our algorithm is highly accurate, and it converges to the correct tree when the number of samples increases, even in the presence of high sequencing errors.
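The accuracy measure above is simply the total true-frequency mass of the recovered haplotypes; a minimal sketch of our own, with haplotypes as tuples:

```python
def reconstruction_accuracy(true_haps, true_freqs, pphs_haps):
    """Sum of true-haplotype frequencies p_i over the intersection
    S = H_True ∩ H_PPHS, as defined in the text."""
    recovered = set(pphs_haps)
    return sum(p for h, p in zip(true_haps, true_freqs) if h in recovered)
```

An accuracy of 1 means every true haplotype (weighted by frequency) appears in the reconstructed tree.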

### Evaluation of phasing accuracy

### Evaluation under real data

In the above section we evaluated the performance of the algorithms under the coalescent model without recombinations. In practice, there are cases in which this model does not capture the empirical behavior of the data, and it is therefore important to test the algorithm using real datasets. To do so, we used the EUR population data from the 1000 Genomes Project [13] (taken from the BEAGLE website, 2010-08), in which the full sequences of 283 unrelated individuals of European ancestry are given. For each test we used haplotypes from human chromosome 22, which contains 162,027 SNPs overall. We note that the genotype error rate in this data is quite high, since the data used for the SNP calling is of very low coverage (phase 1 of the 1000 Genomes Project). Therefore, in some cases the perfect phylogeny assumption may appear not to hold when in fact the inconsistency is caused by sequencing errors.

The genotype data provided by the 1000 Genomes Project provides us with a realistic setting of the haplotype distribution in the population. Our experiments involved taking these haplotypes and simulating the sequencing process, as described in the Methods. We first measured the accuracy of the algorithms as a function of the coverage and the sequencing errors with a set of five individuals. We also compared our algorithm to haplotype assembly algorithms (HapCut [9]), which work on a single individual. We observed that only in the cases where the coverage is high and the error rate is low do the haplotype assembly algorithms have a chance of performing better than PPHS and BEAGLE (assuming long reads).

The performance of HapCut versus PPHS.

| Algorithm | Test 1 | Test 2 | Test 3 | Test 4 |
|---|---|---|---|---|
| HapCut - No Calls | 20.85 | 20.63 | 18.67 | 18.03 |
| HapCut - Errors | 15.92 | 8.71 | 24.80 | 15.56 |
| BEAGLE | 2.76 | 1.81 | 2.20 | 1.09 |
| PPHS | 2.43 | 1.20 | 1.71 | 0.24 |

## Conclusions

This work presents a new algorithm for phasing. The algorithm works by reconstructing the perfect phylogeny tree in every short region. Unlike previous methods (e.g., HAP [5]), we use raw read data to build the tree. Since raw read data includes sequencing errors, we developed methods to overcome them. The algorithm also exploits the fact that a single read can contain more than one SNP. Some deviation from perfect phylogeny is allowed by permitting a single recurrent mutation event in each haplotype.

The results demonstrate that the proposed algorithm works well and is robust to sequencing errors and small population sizes. In addition to the solution for the phasing problem, the algorithm provides a new method to reconstruct a perfect phylogeny under an explicit error model. This approach, however, has a few limitations, which may be of interest for further research. The method is designed to work in windows. In order to stitch the windows, the BEAGLE results are used as a skeleton, and it is likely that more tailored methods (e.g., a variant of Eskin, Sharan and Halperin [14]) may perform better.

An additional limitation of the algorithm is its performance when handling large populations. As the results show, the performance of BEAGLE improves rapidly with the size of the population, whereas the performance of PPHS improves at a slower rate. However, given simulated data (based on the coalescent model), the algorithm's performance does improve considerably as the population size is increased. This might suggest that when using the 1000 Genomes data, the perfect phylogeny model breaks as the number of samples increases, due to subtle population structure or simply due to the large number of errors within that data. If the former is true, there may be an optimization procedure that selects the length of the windows as a function of the linkage disequilibrium structure in the region. Such optimization of the parameters may result in better phasing algorithms, and particularly in a better reconstruction of the trees.

## Declarations

### Acknowledgements

E.H. is a faculty fellow of the Edmond J. Safra Bioinformatics program at Tel-Aviv University. A.E. was supported by the Israel Science Foundation grant no. 04514831, and by the IBM Open Collaborative Research Program.

This article has been published as part of *BMC Bioinformatics* Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S6.

## References

- Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution. 1995, 12(5): 921.
- Kimmel G, Shamir R: GERBIL: Genotype resolution and block identification using likelihood. Proceedings of the National Academy of Sciences of the USA. 2005, 102: 158. 10.1073/pnas.0404730102.
- Stephens M, Smith N, Donnelly P: A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics. 2001, 68(4): 978-989. 10.1086/319501.
- Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics. 2006, 78(4): 629-644. 10.1086/502802.
- Halperin E, Eskin E: Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics. 2004, 20(12): 1842-1849. 10.1093/bioinformatics/bth149.
- Huang Y, Chao K, Chen T: An approximation algorithm for haplotype inference by maximum parsimony. Journal of Computational Biology. 2005, 12(10): 1261-1274. 10.1089/cmb.2005.12.1261.
- Browning S, Browning B: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics. 2007, 81(5): 1084-1097. 10.1086/521987.
- He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics. 2010, 26(12): i183. 10.1093/bioinformatics/btq215.
- Bansal V, Bafna V: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics. 2008, 24(16): i153. 10.1093/bioinformatics/btn298.
- Eskin E, Halperin E, Karp R: Efficient reconstruction of haplotype structure via perfect phylogeny. Journal of Bioinformatics and Computational Biology. 2003, 1: 1-20. 10.1142/S0219720003000174.
- Moon T: The expectation-maximization algorithm. IEEE Signal Processing Magazine. 1996, 13(6): 47-60. 10.1109/79.543975.
- Ewing G, Hermisson J: MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010, 26(16): 2064. 10.1093/bioinformatics/btq322.
- Siva N: 1000 Genomes project. Nature Biotechnology. 2008, 26(3): 256.
- Eskin E, Sharan R, Halperin E: A note on phasing long genomic regions using local haplotype predictions. Journal of Bioinformatics and Computational Biology. 2006, 4(3): 639-648. 10.1142/S0219720006002272.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.