- Research article
- Open Access
Maximum-parsimony haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA
- Guido H Jajamovich^{1},
- Alexandros Iliadis^{2, 3},
- Dimitris Anastassiou^{2, 3} and
- Xiaodong Wang^{2}Email author
https://doi.org/10.1186/1471-2105-14-270
© Jajamovich et al.; licensee BioMed Central Ltd. 2013
- Received: 12 April 2013
- Accepted: 27 August 2013
- Published: 8 September 2013
Abstract
Background
DNA pooling constitutes a cost effective alternative in genome wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample and the frequency of each allele in each position is observed in a single genotype experiment. The identification of haplotype frequencies from pooled data in addition to single locus analysis is of separate interest within these studies as haplotypes could increase statistical power and provide additional insight.
Results
We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data is noisy or even missing are also presented. The resulting method is first applied to simulated data based on the haplotypes and their associated frequencies of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7Mb of the HapMap CEU population. Noise and missing data were further introduced in the datasets in order to test the extensions of the proposed method. Both HIPPO and HAPLOPOOL were also applied to these datasets to compare performances.
Conclusions
We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.
Keywords
- DNA pools
- Haplotype frequency estimation
- Sparse representations
- ADM
Background
In recent years large genome wide association studies have been considered a promising approach to identify disease genes. In these studies, which typically include thousands of individuals, a potential allele frequency difference for a specific marker between cases and controls could indicate an association between the marker and the disease.
Allele frequencies for cases and controls can be obtained either through individual genotyping or through DNA pooling. Although individual genotyping provides more accurate estimates of individual allele frequencies, as well as haplotypes which enable the study of genetic interactions, DNA pooling has been widely used as it can be more cost effective in genome wide association studies [1-6]. In genotype pooling, equimolar amounts of DNA from different individuals are mixed into one sample prior to the amplification and sequencing steps and the frequency of each allele in each position is given. Therefore, for pools of size n, the cost of genotyping is reduced by a factor of n[5].
As evident, one of the main concerns with the use of genotype pooling is genotype error. For a given pooled DNA sample, the standard deviation (SD) of the estimated allele frequency is between 1% and 4% [6]. However, as was argued by Kirkpatrick et al. [7] pooling errors have a greater effect on pools that contain a large number of individuals. To illustrate this point assume that σ is the SD of allele frequencies. After a genotype experiment, the ability of the clustering algorithms to correctly identify the number of each distinct allele depends on whether 2σ is smaller than the difference of allowable frequency calls. For example, in pools of two individuals where the difference between allowable frequency calls is 0.25 (0,0.25, 0.5, 0.75, 1), an accuracy of σ<0.125 will ensure a low rate of incorrect calls (<1%). For the same experiment, if pools of four individuals are considered then the difference of allowable frequencies is cut into half (0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1). Then, it is obvious that to get the same percentage of incorrect calls, σ, should be correspondingly halved. The situation quickly deteriorates for larger pool sizes.
Even though the main purpose of pooling is to screen alleles for potential discrepancies between cases and controls, estimating haplotype frequencies across a number of markers is also of interest with the pooled data as it can improve the power of detecting associations with disease. To facilitate haplotype-based association analysis it is necessary to estimate haplotype frequencies from pooled DNA data.
It has been claimed in the literature that pooling DNA samples is efficient for estimating haplotype frequencies. However, the results presented within the context of haplotype frequency estimation algorithms are largely numerical and they do not address the statistical properties and efficiency of the estimates being computed. In a recent study, Kuk et al. [8] addressed this issue and provided a general guideline on scenarios where pooling would be more efficient relative to individual genotyping. Instead of resorting to simulations, this study was based on theoretical analysis. For a fixed genotype cost, the authors have compared the maximum likelihood estimate based on pooled and individual genotype data. Their findings suggest that for the case of linkage equilibrium and non-rare allele, pooling begins to loose efficiency relative to no pooling when the number of loci is larger than 3 (2^{3} haplotypes with appreciable frequency). Factors such as Linkage Disequilibrium (LD) and rare alleles reduce the number of non-rare haplotypes appearing in the population and pooling could still remain more efficient either for a larger number of loci or when the pool size is kept considerably small, as suggested by Barratt et al. [9].
A variety of haplotype estimation methods from pooled genomic data have been proposed in the literature that fall into two large categories. The first category consists of methods that focus on a small number of markers but allow for considerably larger pool sizes while the second category of methods allows for a larger number of markers but for a small number of individuals per pool.
As haplotype frequency estimation from pooled genomic data can be seen as a missing data problem, it comes to no surprise that the majority of methods focusing on small pool sizes mainly contains methods that use the expectation-maximization (EM) algorithm for maximizing the multinomial likelihood [10-12]. Kirkpatrick et al. [7] suggested a perfect phylogeny method, HAPLOPOOL, that was supplemented with the EM algorithm and linear regression in order to combine haplotype segments and was shown to outperform competing EM algorithms.
Haplotype frequency estimation from large genotype pools was first addressed by Zhang et al. [13] using Poool and was further modified by Kuk et al. [14] resulting in the AEM algorithm. As the EM algorithm presents limitations in speed and difficulties with large pool sizes or long haplotypes, Kuk et al. [15] developed a fast collapsed method that trades performance but can handle larger datasets. Gasbarra et al.[16] introduced a haplotyping method for pooled DNA based on a continuous approximation of the multinomial distribution and a set of constraints (LinEq). The goal of the method is to perform haplotype inference incorporating prior database knowledge from databases such as HapMap. Finally, Pirinen introduced HIPPO [17], a Bayesian model for estimating the pooled haplotypes. HIPPO uses a multinormal approximation of the likelihood and a reversible-jump Markov chain Monte Carlo (RJMCMC) algorithm to estimate the existing haplotypes in the population and their frequencies. The HIPPO framework is also able to accommodate prior database knowledge for the existing haplotypes in the population and has demonstrated improvements in the performance over the AEM and LinEq methods.
There is also an equivalence between the haplotype frequencies estimation and the inference of relative abundances of species in mutagenomics studies. Kessner et al. [18] proposed an EM-based method based on individual sequence reads that can be used to deal with both scenarios. The haplotypes present in the pools are assumed to be known and need to be input to the method. Another EM method was proposed by Eskin et al. [19], where some individual genotypes are required in addition to the pooled sequence data. Amir et al. [20] proposed a method to reconstruct the abundance of each bacterium in a bacteria community by looking at a database of known 16S rRNA sequences and a single Sanger-sequence of the unknown mixture, by assuming that only a small set of bacteria are present within the set of bacteria with known 16S rRNA sequence.
In this study we present an algorithm for haplotype frequency estimation based on the maximum parsimony principle. A mathematical framework is presented where this principle is translated in a joint sparsity requirement and the frequency inference is performed using the alternating direction method (ADM) of multipliers. Our method focuses on datasets that have a small number of individuals per pool and a considerably large number of markers. We compare our method with the best performing methods from the two pooling algorithm categories as presented above, namely HIPPO and HAPLOPOOL. We have performed comparisons on a variety of marker and dataset sizes. All our comparisons represent scenarios for which, based on Kuk et al. [8], pooling is more efficient than individual genotyping. We show that our method demonstrates superior performance in terms of accuracy compared with state-of-the-art competing methods for almost all scenarios examined with special emphasis on scenarios where the number of loci is large.
Results and discussions
In this section, first we describe the datasets and figures of merit used to evaluate the method. Then we present the results from comparing our method ADM to HIPPO and HAPLOPOOL.
All our comparisons were performed in scenarios were the use of pooling is potentially beneficial relative to no pooling according to the guidelines of Kuk et al. [8]. Our methodology specifically targets datasets that have a small number of individuals per pool and a large number of SNPs.
In real applications, it is very often the case that studies are performed in datasets for which partial knowledge of the existing haplotypes already exists (for example datasets from HAPMAP studied populations). This information could be used as a basis for an accurate definition of the haplotype dictionary matrix H, as will be defined in the Methods section, so that the number of possible haplotypes M is much smaller than the full set of allowable haplotypes. However, in order to evaluate the proposed method in the most general scenarios, no prior information is assumed and all possible haplotypes are considered.
The presented method is based on the augmented Lagrangian expansion of a constrained optimization problem which has an associated parameter ρ, as it will be shown in the Methods section. For all the results presented in this study, we have set $\rho =\frac{1}{\u0101}$ with $\u0101$ being the average of the observed relative frequency of the allele 1 of the considered SNPs and pools. We have found experimentally that this choice of ρ achieves a good performance. Moreover, the ADM is an iterative method, which finalizes once a stopping criterion is met. For the results presented here, the l_{2}-norm of the difference between the solution at step k and the solution at step k−1 over the l_{2}-norm of the solution at step k−1 is compared to a tolerance parameter of 10^{−20}. If the first term is smaller or k=8000, then the ADM stops and the solution at step k is presented.
Datasets
To examine the performance of our methodology we have considered in our experiments real datasets for which estimates of the haplotype frequencies were already available and which cover a variety of dataset sizes.
Haplotypes and frequencies for the AGT gene
Haplotype | Frequency |
---|---|
1 1 1 1 0 1 1 0 0 0 | 0.033 |
1 1 0 1 0 1 1 1 1 0 | 0.016 |
1 1 0 1 0 0 1 0 0 1 | 0.017 |
1 0 0 1 0 1 1 0 0 1 | 0.017 |
1 1 0 1 0 1 1 0 0 1 | 0.017 |
1 1 1 1 0 1 1 1 0 1 | 0.507 |
0 1 0 1 1 0 0 1 1 1 | 0.017 |
1 1 0 0 0 0 1 1 1 1 | 0.033 |
0 1 0 1 0 0 1 1 1 1 | 0.1 |
1 1 0 1 0 1 1 1 1 1 | 0.193 |
1 1 1 1 1 1 1 1 1 1 | 0.05 |
The second dataset consisted of SNPs from the first 7Mb (742 kb to 7124.8 kb) of the HapMap CEU population (HapMap 3 release 2- Phasing data (http://hapmap.ncbi.nlm.nih.gov/)). This chromosomal region was partitioned based on physical distance into disjoint blocks of 15 kb. The resulting blocks had a varying number of markers ranging from 2 to 28. For our purposes we have considered only the datasets that had more than 10 SNPs and less than 20 (which was the maximum number of loci so that HAPLOPOOL could produce estimates within a reasonable amount of time) which resulted in selecting a total of 80 blocks. On each block the parental haplotypes and their estimated frequencies were used as the true haplotype distribution. As in the previous cases in each block four different pool sizes were considered: O=50, 75, 100 and 150 pools.
Performance criteria
Assume first that g= [g_{1}⋯g_{ M }]^{ T } is the gold standard haplotype frequency vector in a given dataset observed in the population and f= [f_{1}⋯f_{ M }]^{ T } is the predicted haplotype frequency vector from a given method. To compare the performance of different methodologies we have considered two criteria:
χ^{2} distance: The χ^{2} distance between the two distributions g and f is defined as ${\chi}^{2}(\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}f,g)\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\sum _{i=1,{g}_{i}\ne 0}^{M}{(\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{f}_{i}-{g}_{i})}^{2}/{g}_{i}$ where only the terms with non-zero haplotype frequency vector g_{ i } are considered.
l_{1} distance: The l_{1} distance between the two distributions is defined as ${l}_{1}(\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}f,g)=\sum _{i=1}^{M}|\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{f}_{i}-{g}_{i}|$.
Frequency estimation
For the HapMap dataset (Figure 2) only ADM and HAPLOPOOL were evaluated since the maximum number of loci HIPPO can handle is 10. At the same time, even though HAPLOPOOL can in principle handle larger datasets, we restricted our comparisons to datasets between 10 and 20 loci due to excessive computational time.
From our experiments we can also see that the number of pools also affected accuracy. All algorithms demonstrated improved performance with increasing number of pools in the dataset.
Noise and missing data
We have further evaluated the performance of our method in the presence of measurement error. We have simulated genotyping error by adding a Gaussian error with SD σ to each called allele frequency. In particular, if we denote the correct allele frequency at SNP j in pool i as c_{ ij }, then the perturbed allele frequency is given by ${\u0109}_{\mathit{\text{ij}}}={c}_{\mathit{\text{ij}}}+x$ where x∼N(0,σ^{2}). To obtain the allele counts we discretize each allele frequency to the closest allowed frequency depending on the number of individuals per pool and obtain the allele counts accordingly.
We can see that ADM demonstrates superior performance compared to competing methods and, as expected, its performance deteriorates with increasing noise levels. The results also demonstrate the fact that pooling errors affect more pools that contain a large number of individuals. The reason is, as has been noted before, that in smaller pools the gap between allowable frequency calls is much larger resulting in a smaller percentage of miscalled allele counts and thus in better frequency estimates.
Conclusions
In this study we have presented a method for estimating haplotype frequencies from pooled data based on the maximum parsimony principle. A novel mathematical framework is introduced where this principle is translated to finding a sparse representation of the observed DNA pools in a dictionary of haplotypes. This leads to an optimization problem that is solved with the alternating direction method of multipliers. The proposed method is also extended to scenarios where noisy and missing data is present in the considered DNA pools, and is able to process pools with a large number of SNPs.
Numerical experiments using synthetic and real data have shown improved performance with respect to the best of the haplotype frequency inference methods. In particular, the proposed ADM method is an efficient method that performs better than other methods such as HIPPO and HAPLOPOOL in the considered datasets consisting of pools with a small number of individuals and a large number of markers.
Methods
Overview
This section provides a description of the proposed method for haplotype frequencies inference based on the maximum-parsimony principle. The method seeks to discover the frequencies of the haplotypes present in a population given the observed relative frequencies of each allele in each DNA pool. In order to obtain a biological meaningful estimation, the proposed method makes use of the maximum-parsimony principle which attempts to minimize the total number of haplotypes observed in the sample [21].
Each pool has an associated vector of observed relative frequencies that, with the proposed mathematical framework, can be expressed as the linear combination of haplotypes of a dictionary. This dictionary of haplotypes can be constructed using information from external databases [16] or, in the most general case where such information is not available, all possible haplotypes need to be considered. Each vector of observed relative frequencies should be reconstructed with the minimum number of distint haplotypes in the dictionary according to the maximum-parsimony principle. Moreover, as there are more than one pool available, the set of used haplotypes needs to be selected to explain all pools jointly.
This framework for haplotype frequencies estimation leads to a joint constrained sparse optimization problem. This kind of optimization problem has been studied in the compressed sensing literature, where the alternating direction method (ADM) of multipliers has been proposed to find the corresponding solution. The proposed method makes use of the ADM adapted to the haplotype frequencies estimation.
Maximum-parsimony haplotype frequencies inference framework
where $\parallel {\mathbf{p}}^{i}{\parallel}_{1}\triangleq \sum _{j=1}^{M}\left|{p}_{j}^{i}\right|$ is the l_{1}-norm of the vector p^{ i }. Since p^{ i }≽0, we have $\parallel {\mathbf{p}}^{i}{\parallel}_{1}=\sum _{j=1}^{M}{p}_{j}^{i}$; that is, the l_{1}-norm is the sum of the proportions which needs to be 1. Each proportion ${p}_{j}^{i}$ can only be discrete multiples of the basic unit of $\frac{1}{2{n}_{i}}$; that is, ${p}_{j}^{i}$ takes values in the set $\{0,\frac{1}{2{n}_{i}},\cdots \phantom{\rule{0.3em}{0ex}},1\}$, but as measurements contain noise, we relax this condition and allow each proportion to take any value in the interval [0,1] [16].
Then the haplotype frequency estimation problem can be stated as follows: Given the observed relative frequencies of the alleles a^{ i }, i=1,⋯,O, infer the proportions of the haplotypes p^{ i }, i=1,⋯,O, in every pool. The dimension of each relative frequency of the alleles a^{ i } is L, while the dimension of the unknown proportion vector p^{ i } is M, where generally M≫L; that is, the estimation task is an ill-posed inverse problem and side information is needed to complete this task. In particular, in this paper, we make use of the maximum parsimony principle. This principle states that the number of different haplotypes that explains all the observed relative frequency vectors a^{ i } should be as small as possible. Therefore, the maximum parsimony haplotype inference problem is stated as follows. Given the set {a_{ i },i=1,…,O} of observed relative frequency vectors of i pools with n_{ i } subjects and for L loci, we aim at inferring the vector of proportions {p_{ i },i=1,…,O} that is composed of the minimum number of distinct haplotypes. From the point of view of Eq. (1), the maximum parsimony principle can be translated as using as few columns of H as possible to explain all the observed frequency vectors a^{ i }.
Haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA
where ∥e(X)∥_{0} is the l_{0} norm of the vector e(X) and corresponds to the number of non-zero components of the vector. This means that the solution will have as many all-zero rows as possible.
This matrix problem lies within the convex optimization framework. In the most general case where there is no prior information regarding the possible haplotypes to be considered, the size of the matrix H grows exponentially with the number of SNPs. In what follows we present an efficient method to find the solution to (4) by means of the alternating direction method (ADM) of multipliers. The ADM proceeds to solve local small problems in order to uncover the global solution to the problem with proven convergence; that is, the ADM is guaranteed to find the optimal solution to (4) [22]. We first briefly describe the ADM in its general form and then we show how (4) can be solved with the ADM.
Alternating direction method of multipliers
with $\mathit{C}\in {\mathbb{R}}^{p\times {m}_{1}}$, $\mathit{D}\in {\mathbb{R}}^{p\times {m}_{2}}$, and $\mathit{e}\in {\mathbb{R}}^{p}$.
Alternating direction method of multipliers
1 | Set k = 0 |
2 | Set ρ > 0 |
3 | Initialize x^{0}, z^{0} and u^{0} |
4 | Repeat |
5 | k = k + 1 |
6 | ${\mathit{x}}^{k+1}\triangleq arg\underset{\mathit{x}}{min}\left(f\left(\mathit{x}\right)+\frac{\rho}{2}\underset{2}{\overset{2}{\u2225\mathit{C}\mathit{x}+\mathit{D}{\mathit{z}}^{k}-\mathit{e}+{\mathit{u}}^{k})\u2225}}\right)$ |
7 | ${\mathit{z}}^{k+1}\triangleq arg\underset{\mathit{z}}{min}\left(g\left(\mathit{z}\right)+\frac{\rho}{2}\underset{2}{\overset{2}{\u2225\mathit{C}{\mathit{x}}^{k+1}+\mathit{D}\mathit{z}-\mathit{e}+{\mathit{u}}^{k})\u2225}}\right)$ |
8 | ${\mathit{u}}^{k+1}\triangleq {\mathit{u}}^{k}+\left(\mathit{C}{\mathit{x}}^{k+1}+\mathit{D}{\mathit{z}}^{k+1}-\mathit{e}\right)$ |
9 | until convergence |
Joint constrained sparse haplotype frequency estimation algorithm
where $\mathit{E}\triangleq \left(\begin{array}{l}{1}^{T}\\ \ddots \\ {1}^{T}\end{array}\right)$, U_{ S } is the indicator function of the set S (that is, U_{ S }(x)=0 if x∈S and ∞ otherwise), and ${\mathbf{z}}_{{g}_{i}}$ is the vector of components in z that correspond to the i-th row of matrix Z.
Joint constrained sparse haplotype frequency estimation algorithm
1 | Set k = 0 |
2 | Set ρ > 0 |
3 | Set X^{0} = 0, Z^{0} = 0, ${\mathbf{U}}_{1}^{0}=0$, ${\mathbf{U}}_{2}^{0}=0$ |
4 | U_{4} = (I−H^{ T }(I+H H^{ T })^{−1}H) |
5 | Repeat |
6 | k = k + 1 |
7 | ${\mathbf{u}}_{3}^{T}=\frac{{1}^{T}{\mathbf{U}}_{4}\left({\mathbf{H}}^{H}\left({\mathbf{U}}_{2}^{k}-\mathbf{A}\right)+{\mathbf{U}}_{1}^{k}-{\mathbf{Z}}^{k}\right)-{1}^{T}}{{1}^{T}{\mathbf{U}}_{4}1}$ |
8 | ${\mathbf{X}}^{k+1}={\mathbf{U}}_{4}\left({\mathbf{U}}_{1}^{k}-{\mathbf{Z}}^{k}+{\mathbf{H}}^{T}({\mathbf{U}}_{2}^{k}-\mathbf{A})-1{\mathbf{u}}_{3}^{T}\right)$ |
9 | ${\mathbf{Z}}^{k+1}=max\left(\text{Shrink}\left({\mathbf{X}}^{k+1}+\frac{1}{\rho}{\mathbf{U}}_{1}^{k},\frac{1}{\rho}\right),0\right)$ |
10 | ${\mathbf{U}}_{1}^{k+1}={\mathbf{U}}_{1}^{k}+{\mathbf{X}}^{k+1}-{\mathbf{Z}}^{k+1}$ |
11 | ${\mathbf{U}}_{2}^{k+1}={\mathbf{U}}_{2}^{k}+\mathbf{H}{\mathbf{X}}^{k+1}-\mathbf{A}$ |
12 | until convergence |
the max operation of step 9 is component-wise, and $0\triangleq {[\phantom{\rule{0.3em}{0ex}}0\phantom{\rule{2.77626pt}{0ex}}\cdots 0]}^{T}$.
Extensions
Noisy data
where ${\mathbf{z}}_{2}^{i}$ is the i-th column of matrix Z_{2}. This simple transformation allows us to obtain closed-form expressions for the local minimization steps of the ADM.
Joint constrained sparse haplotype frequency estimation algorithm in the presence of noisy measurements
1 | Set k = 0 |
2 | Set ρ > 0 |
3 | Set X^{0} = 0, Z^{0} = 0, ${\mathbf{U}}_{1}^{0}=0$, ${\mathbf{U}}_{2}^{0}=0$ |
4 | U_{4} = (I−H^{ T }(I+H H^{ T })^{−1}H) |
5 | Repeat |
6 | k = k + 1 |
7 | ${\mathbf{u}}_{3}^{T}=\frac{{1}^{T}{\mathbf{U}}_{4}\left({\mathbf{H}}^{H}\left({\mathbf{U}}_{2}^{k}-{\mathbf{Z}}_{1}^{k}\right)+{\mathbf{U}}_{1}-{\mathbf{Z}}_{2}^{k}\right)-{1}^{T}}{{1}^{T}{\mathbf{U}}_{4}1}$ |
8 | ${\mathbf{X}}^{k+1}={\mathbf{U}}_{4}\left({\mathbf{U}}_{1}^{k}-{\mathbf{Z}}_{1}^{k}+{\mathbf{H}}^{T}({\mathbf{U}}_{2}^{k}-{\mathbf{Z}}_{2}^{k})-1{\mathbf{u}}_{3}^{T}\right)$ |
9 | ${\mathbf{Z}}_{1}^{k+1}=max\left(\text{Shrink}\left({\mathbf{X}}^{k+1}+\frac{1}{\rho}{\mathbf{U}}_{1}^{k},\frac{1}{\rho}\right),0\right)$ |
10 | ${\mathbf{Z}}_{2}^{k+1}=\text{proj}(\mathbf{H},{\mathbf{X}}^{k+1},{\mathbf{U}}_{2}^{k},\mathbf{A},\delta )$ |
11 | ${\mathbf{U}}_{1}^{k+1}={\mathbf{U}}_{1}^{k}+{\mathbf{X}}^{k+1}-{\mathbf{Z}}_{1}^{k+1}$ |
12 | ${\mathbf{U}}_{2}^{k+1}={\mathbf{U}}_{2}^{k}+\mathbf{H}{\mathbf{X}}^{k+1}-{\mathbf{Z}}_{2}^{k+1}$ |
13 | until convergence |
Missing data
Errors often occur during the genotyping process, and the data at some loci might not have been observed. We present modifications to the algorithms to perform haplotype inference in the presence of missing data. We assume that it is known a priori where the genotype information is missing for each genotype of each individual.
The presence of missing data in a genotype of a given pool imply a smaller number of constraints. Let ${\stackrel{~}{\mathbf{a}}}_{i}$ be the observed relative frequency vector where all the loci with missing information have been removed, and H_{ i } the matrix with all the rows corresponding to those loci removed. Notice that different pools present missing information in different loci, making the matrix dependant on the considered individual.
Joint constrained sparse haplotype frequency estimation algorithm with missing data
1 | Set k = 0 |
2 | Set ρ > 0 |
3 | Set X^{0} = 0, Z^{0} = 0, ${\mathbf{U}}_{1}^{0}=0$, ${\mathbf{U}}_{2}^{0}=0$ |
4: | U_{4} = (I−H^{ T }(I+H H^{ T })^{−1}H) |
5 | Repeat |
6 | k = k + 1 |
7 | For i = 1,⋯,O |
8 | ${u}_{3,i}=\frac{{1}^{T}{\mathbf{U}}_{4}\left({\mathbf{H}}_{i}^{T}\left({\mathbf{u}}_{2,i}^{k}-{\stackrel{~}{\mathbf{a}}}_{i}\right)+{\mathbf{u}}_{i}^{1}-{\mathbf{z}}_{i}^{k}\right)-1}{{1}^{T}{\mathbf{U}}_{4}1}$ |
9 | ${\mathbf{x}}_{i}^{k+1}={\mathbf{U}}_{4}\left({\mathbf{u}}_{1,i}^{k}-{\mathbf{z}}_{i}^{k}+{\mathbf{H}}_{i}^{T}({\mathbf{u}}_{2,i}^{k}-\stackrel{~}{{\mathbf{a}}_{i}})-{u}_{3,i}1\right)$ |
10 | end for; |
11 | ${\mathbf{Z}}^{k+1}=max\left(\text{Shrink}\left({\mathbf{X}}^{k+1}+\frac{1}{\rho}{\mathbf{U}}_{1}^{k},\frac{1}{\rho}\right),0\right)$ |
12 | ${\mathbf{U}}_{1}^{k+1}={\mathbf{U}}_{1}^{k}+{\mathbf{X}}^{k+1}-{\mathbf{Z}}^{k+1}$ |
13 | For i = 1,⋯,O |
14 | ${\mathbf{u}}_{2,i}^{k+1}={\mathbf{u}}_{2,i}^{k}+{\mathbf{H}}_{i}{\mathbf{x}}_{i}^{k+1}-{\stackrel{~}{\mathbf{a}}}_{i}$ |
15 | end for |
16 | until convergence |
Large number of SNPs
When the number of SNPs is large, the size of the matrix H increases dramatically. One approach for this case is to partition the data into blocks and process one block at a time. After all blocks are processed, a ligation process is performed to obtain the final result. We adopt the partition-ligation (PL) method [23] for haplotype frequency estimation.
The PL method starts with the partition phase. The vectors of observed relative frequencies a_{ i },i=1,⋯,O is divided into Q non-overlapping and non-empty sets that cover all of the vectors. Each set contains segments from the same SNP loci for all individuals. Let $\left\{{\mathbf{G}}_{{q}_{1}^{1}:{q}_{2}^{1}},{\mathbf{G}}_{{q}_{1}^{2}:{q}_{2}^{2}}\cdots \phantom{\rule{0.3em}{0ex}},{\mathbf{G}}_{{q}_{1}^{Q}:{q}_{2}^{Q}}\right\}$ be the partitioned sets of relative frequency vectors, where the i-th subset ${\mathbf{G}}_{{q}_{1}^{i}:{q}_{2}^{i}}$ contains the relative frequencies for SNP locus ${q}_{1}^{i}$ to ${q}_{2}^{i}$ for all N individuals. We impose that the first locus of the first set be the first locus of the complete genotype, i.e., ${q}_{1}^{1}=1$. Moreover, each set is adjacent to the previous one, i.e., ${q}_{1}^{i}={q}_{2}^{i-1}+1$ for i={2⋯Q}. Notice that as we need to cover all loci, the last locus for the last set is ${q}_{2}^{Q}=L$. For each set ${\mathbf{G}}_{{q}_{1}^{i}:\phantom{\rule{0.3em}{0ex}}{q}_{2}^{i}}$, the haplotypes frequencies are inferred using our algorithm, which outputs a small set of haplotypes frequencies.
Then, the PL proceeds to a ligation phase, where adjacent sets are merged to obtain a new partition of the data, with $\u2308\frac{Q}{2}\u2309$ sets, e.g., when merging the (2i)-th set with the (2i+1)-th set, the resulting set consists of the observed frequencies for all individuals between locus ${q}_{1}^{2i}$ and ${q}_{2}^{2i+1}$. For each merged set ${\mathbf{G}}_{{q}_{1}^{2i}:\phantom{\rule{0.3em}{0ex}}{q}_{2}^{2i+1}}$, we run the haplotype inference algorithm again, but restricting H to contain every possible concatenations of the haplotypes of the (2i)-th set with the haplotypes of the (2i+1)-th set that have non-zero estimated frequencies. The process continues until there is only one set of relative frequencies and the haplotype frequencies inference algorithm is finally applied to this set.
In order to use the PL method, we need to determine an initial partition of the data. Therefore, we need to specify the number of partitions Q and the length of each partition or equivalently, the initial locus of each partition, i.e., ${\left\{{q}_{1}^{i}\right\}}_{i=1\cdots Q}$. A simple and low-cost way of setting the initial loci ${\left\{{q}_{1}^{i}\right\}}_{i=1\cdots Q}$ is to fix each block to be of equal length. Then, given an upper bound W on the length for each initial block, the number of blocks is $Q=\lceil \frac{L}{W}\rceil $.
Availability of supporting data
Our method is available for download at http://www.ee.columbia.edu/~guido/ADM/.
Declarations
Acknowledgements
This work was supported by the research grant DBI-0850030 from the National Science Foundation.
Authors’ Affiliations
References
- Bansal A, van den Boom D, Kammerer S, Honisch C, Adam G, Cantor CR, Kleyn P, Braun A: Association testing by DNA pooling: an effective initial screen. Proc Nat Acad Sci. 2002, 99 (26): 16871-16874. 10.1073/pnas.262671399.PubMed CentralView ArticlePubMedGoogle Scholar
- Barcellos LF, Klitz W, Field LL, Tobias R, Bowcock AM, Wilson R, Nelson MP, Nagatomi J, Thomson G: Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet. 1997, 61 (3): 734-747. 10.1086/515512.PubMed CentralView ArticlePubMedGoogle Scholar
- Norton N, Williams M, O’Donovan C, Owen J: DNA pooling as a tool for large-scale association studies in complex traits. Annals Med. 2004, 36 (2): 146-152. 10.1080/07853890310021724.View ArticleGoogle Scholar
- Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, Homer N, Brun M, Szelinger S, Coon KD, Zismann VL, et al: Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Human Genet. 2007, 80: 126-139. 10.1086/510686.View ArticleGoogle Scholar
- Sham P, Bader JS, Craig I, O’Donovan M, Owen M: DNA pooling: a tool for large-scale association studies. Nat Rev Genet. 2002, 3 (11): 862-871.View ArticlePubMedGoogle Scholar
- Zuo Y, Zou G, Zhao H: Two-stage designs in case-control association analysis. Genetics. 2006, 173 (3): 1747-1760. 10.1534/genetics.105.042648.PubMed CentralView ArticlePubMedGoogle Scholar
- Kirkpatrick B, Armendariz CS, Karp RM, Halperin E: HAPLOPOOL: improving haplotype frequency estimation through DNA pools and phylogenetic modeling. Bioinformatics. 2007, 23 (22): 3048-3055. 10.1093/bioinformatics/btm435.View ArticlePubMedGoogle Scholar
- Kuk AY, Xu J, Yang Y: A study of the efficiency of pooling in haplotype estimation. Bioinformatics. 2010, 26 (20): 2556-2563. 10.1093/bioinformatics/btq492.View ArticlePubMedGoogle Scholar
- Barratt B, Payne F, Rance H, Nutland S, Todd J, Clayton D: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Annals Hum Genet. 2002, 66 (5-6): 393-405.View ArticleGoogle Scholar
- Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, Kamatani N: Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet. 2003, 72 (2): 384-10.1086/346116.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003, 24: 74-82. 10.1002/gepi.10195.View ArticlePubMedGoogle Scholar
- Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc Nat Acad Sci. 2003, 100 (12): 7225-7230. 10.1073/pnas.1237858100.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang H, Yang HC, Yang Y: PoooL: an efficient method for estimating haplotype frequencies from large DNA pools. Bioinformatics. 2008, 24 (17): 1942-1948. 10.1093/bioinformatics/btn324.View ArticlePubMedGoogle Scholar
- Kuk AY, Zhang H, Yang Y: Computationally feasible estimation of haplotype frequencies from pooled DNA with and without Hardy-Weinberg equilibrium. Bioinformatics. 2009, 25 (3): 379-386. 10.1093/bioinformatics/btn623.View ArticlePubMedGoogle Scholar
- Kuk AY, Li X, Xu J: A fast collapsed data method for estimating haplotype frequencies from pooled genotype data with applications to the study of rare variants. Stat Med. 2012, 32 (8): 1343-1360.View ArticlePubMedGoogle Scholar
- Gasbarra D, Kulathinal S, Pirinen M, Sillanpaa MJ: Estimating haplotype frequencies by combining data from large DNA pools with database information. Comput Biol Bioinform IEEE/ACM Trans. 2011, 8: 36-44.View ArticleGoogle Scholar
- Pirinen M: Estimating population haplotype frequencies from pooled SNP data using incomplete database information. Bioinformatics. 2009, 25 (24): 3296-3302. 10.1093/bioinformatics/btp584.View ArticlePubMedGoogle Scholar
- Kessner D, Turner TL, Novembre J: Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data. Mol Biol Evol. 2013, 30 (5): 1145-1158. 10.1093/molbev/mst016.PubMed CentralView ArticlePubMedGoogle Scholar
- Eskin I, Hormozdiari F, Conde L, Riby J, Skibola C, Eskin E, Halperin E: eALPS: estimating abundance levels in pooled sequencing using available genotyping data. Research in Computational Molecular Biology. 2013, Berlin, Germany: Springer Berlin Heidelberg, 32-44.View ArticleGoogle Scholar
- Amir A, Zuk O: Bacterial community reconstruction using compressed sensing. J Comput Biol. 2011, 18 (11): 1723-1741. 10.1089/cmb.2011.0189.View ArticlePubMedGoogle Scholar
- Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics. 2003, 19 (14): 1773-1780. 10.1093/bioinformatics/btg239.View ArticlePubMedGoogle Scholar
- Boyd S, Parikh N, Chu E, Peleato B, Eckstein J: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations Trends®; Mach Learn. 2011, 3: 1-122.View ArticleGoogle Scholar
- Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet. 2002, 70: 157-10.1086/338446.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.