- Methodology article
- Open Access

# Efficient haplotype block recognition of very long and dense genetic sequences

*BMC Bioinformatics*
**volume 15**, Article number: 10 (2014)

## Abstract

### Background

New sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers that are an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in the identification of haplotypes carrying the rare risk alleles. However, large-scale explorations of the linkage-disequilibrium (LD) pattern to identify haplotype blocks are not easy to perform, because traditional algorithms have at least *Θ*(*n*^{2}) time and memory complexity.

### Results

We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel *et al*. in 2002. Our most efficient solution, called MIG^{++}, has only *Θ*(*n*) memory complexity and, on a genome-wide scale, it omits >80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG^{++} analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. By replacing the standard likelihood-based *D*′ variance estimator with an approximate estimator, the runtime was further improved. While producing a coarser partition, the approximate method made it possible to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base-pairs in both HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG^{++} can support genome-wide haplotype association studies.

### Conclusions

MIG^{++} makes it possible to perform LD-based haplotype block recognition on genetic sequences of any length and density. In the new generation sequencing era, this can help identify haplotypes that carry rare variants of interest. The low computational requirements open the possibility of including the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces for online genome browsers.

## Background

Linkage disequilibrium (LD) is the non-random association of alleles at different loci and decays with increasing distance between loci [1]. High LD regions reflect the presence of chromosomal segments (haplotypes) that are transmitted from parents to offspring more often than expected by chance. LD is traditionally assessed by the normalized *D*′ coefficient [2]. The *r*^{2} coefficient [3] is currently more commonly used than *D*′ to identify independent signals in genome-wide association studies (GWAS). In this regard, *r*^{2} has also been considered in haplotype block recognition algorithms [4, 5]. Nevertheless, *D*′ should remain the statistic of choice for LD modeling because of its more direct biological interpretation. It reflects the history of recombination, mutation, and selection events that cause some chromosomal segments to be less diverse than others and, therefore, influence the haplotype distribution. Moreover, it has been shown that *r*^{2} is not significantly more precise, accurate or efficient than *D*′ [6]. The *D*′ and *r*^{2} coefficients capture similar information but their range of variation can be very different [6]. *D*′ ranges from -1 to +1 and is independent of the allelic frequencies of the two markers involved. When *D*′ = 0, the two markers are independent (perfect equilibrium), while |*D*′| = 1 indicates that no more than three of the four possible haplotypes are observed in the sample (complete disequilibrium). In contrast, the range of *r* greatly depends on the allele frequencies and equals -1 or +1 only when the two markers have the same allele frequency. In such cases, |*r*| = 1 indicates that knowing the allele at one marker determines the allele at the other marker (perfect disequilibrium). But when the two markers have very different allele frequencies, the interpretation of *r*^{2} becomes difficult.
This is especially relevant for the data generated by the new sequencing technologies, which allow genotyping markers over a very wide spectrum of allele frequencies. In such situations, *r*^{2} may fail to identify the correct relationship between nearby variants. In GWAS, this may lead to an incorrect definition of the identified loci.

Although, in the past, haplotype blocks have been mainly used to identify tag SNPs [7], a variety of other applications is possible with currently available data. Recently, analysis of exome-chip data has shown that within-gene LD-block distribution can be informative of the gene function and of the possible relationship between genes and specific groups of phenotypes [8]. Another application is the genome-wide haplotype association scan, which was successful in uncovering risk loci for coronary artery disease [9], Alzheimer’s disease [10] and breast cancer [11]. So far, genome-wide haplotype association scans have been mostly performed based on fixed- or variable-width sliding window methods, which systematically miss haplotypes that are longer than the specified maximal window width. An efficient genome-wide haplotype block recognition could help overcome such limitations, thus enhancing the biological interpretation of the results. In the study of rare variants, where collapsing methods (mostly based on gene boundaries) are becoming increasingly popular [12], the availability of haplotype blocks at genome-wide level would allow variants to be collapsed based on block boundaries, capturing inter-genic variants and avoiding the problem of defining gene boundaries. Additional applications include downstream analyses of GWAS, such as pathway-based approaches, where statistics for multiple SNPs are summarized into gene-specific P-values, which are then employed for gene ranking [13]. In pathway-based analyses, SNP-to-gene mapping is typically based on SNP proximity to the gene boundaries. With this method, when a region is gene-dense, it may be problematic to assign SNPs to a single, specific gene. An LD-based assignment would overcome this limitation and increase the power of downstream analyses [14]. In general, ignoring the LD structure in downstream analyses of GWAS can result in the misinterpretation of the findings [15].
Popular genome browsers, such as Ensembl [16] or UCSC [17], are suitable for visualizing the LD distribution over regions of interest. However, they only allow pairwise LD calculation between markers at <500 kb distance from each other and do not provide any LD-block partition. With no predefined block partition, the visual assessment of such LD patterns might be influenced by the investigator’s subjectivity. In addition, the 500 kb distance constraint may limit the investigation of larger strong-LD regions. The availability of pre-calculated, threshold-free LD blocks would overcome both these limitations.

There is extensive literature on haplotype block inference [18–21], including methods based on probabilistic graphical models [22]. The latter allow an accurate identification of SNP clusters, even in situations when SNPs are not necessarily contiguous. However, due to its simplicity, the most commonly used LD-based algorithm remains the one proposed by Gabriel *et al.* [23], which is implemented in Haploview [24]. The Haploview algorithm is widely used in genetic association studies and it is included in popular software, such as PLINK [25]. However, with a *Θ*(*n*^{2}) time and memory complexity, where *n* is the number of SNPs, the algorithm is applicable only to short genomic segments containing no more than a few thousand SNPs. Unless runtime and memory usage are artificially reduced by splitting large segments into smaller chunks, the algorithm cannot be applied to densely genotyped segments or genome-wide analyses.

In this paper, we describe how we improved efficiency and scalability of the Haploview algorithm (1) by adopting an incremental computation of the haplotype blocks based on iterative chromosome scans and (2) by estimating *D*′ confidence intervals (CIs) using the approximate variance estimator proposed by Zapata *et al.* [26]. The incremental computation strategy led to an algorithm, termed MIG^{++}, that has *Θ*(*n*) memory complexity and omits more than 80% of the pairwise LD computations, while obtaining exactly the same final haplotype block partition as Haploview. In contrast to Haploview, the new algorithm can consider pairwise LD between SNPs at any distance. With MIG^{++}, we performed the haplotype block recognition of the entire HapMap phase II dataset of CEPH haplotypes. By introducing the approximate variance estimator, the performance of MIG^{++} was further improved and allowed us to perform the block partition of the entire 1000 Genomes Project dataset of CEPH haplotypes. To show a practical application of the obtained genome-wide block partition, we finally compared SNP-based and haplotype block-based association tests in a GWAS context.

## Methods

### Haplotype block definition

The haplotype block recognition algorithm proposed by Gabriel *et al.* [23] is based on |*D*′| and its 90% CI, with CL and CU being the lower and upper bounds of the CI, respectively. SNP pairs are classified as follows: (1) in *strong LD* if CL ≥ 0.7 and CU ≥ 0.98; (2) *showing strong evidence of historical recombination* (*strong EHR*) if CU < 0.9; (3) *non-informative*, otherwise. *Informative* pairs are those satisfying conditions (1) or (2). A haplotype block is then defined as follows:
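The three-way classification can be written as a small function. This is a minimal sketch that assumes the bounds CL and CU of the 90% CI of |*D*′| are already available; the function name and the string labels are hypothetical and chosen only for illustration.

```python
STRONG_LD = "strong_ld"
STRONG_EHR = "strong_ehr"
NON_INFORMATIVE = "non_informative"


def classify_pair(cl: float, cu: float) -> str:
    """Classify a SNP pair from the lower (cl) and upper (cu) CI bounds of |D'|."""
    if cl >= 0.7 and cu >= 0.98:
        return STRONG_LD          # condition (1)
    if cu < 0.9:
        return STRONG_EHR         # condition (2)
    return NON_INFORMATIVE        # condition (3)


def is_informative(label: str) -> bool:
    """Informative pairs are those in strong LD or showing strong EHR."""
    return label in (STRONG_LD, STRONG_EHR)
```

For example, a pair with CL = 0.5 and CU = 0.95 meets neither threshold and is treated as non-informative.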

#### Definition 1.

*(Haplotype Block)*. *Let C* = ⟨*g*_{1}, …, *g*_{n}⟩ *be a chromosome of n SNPs*, *G* = ⟨*g*_{i}, …, *g*_{j}⟩ *a region of adjacent SNPs in C*, *l the number of strong LD SNP pairs in G*, *and r the number of strong EHR SNP pairs in G*. *Then*, *G is a haplotype block if*

*(a) the two outermost SNPs*, *g*_{i} *and g*_{j}, *are in strong LD*, *and*

*(b) there is at least a proportion d of informative pairs that are in strong LD*, *i.e.*: *l* / (*l* + *r*) ≥ *d*.

In their original work, Gabriel *et al.* [23] set *d* = 0.95 after investigating the fractions of *strong LD* SNP pairs in genomic regions of different length and in different populations.
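Once the pair counts of a region are known, Definition 1 reduces to two simple checks. The sketch below is illustrative only: the function name and arguments are hypothetical, and the counts *l* and *r* are assumed to have been obtained beforehand from the pair classification.

```python
def is_haplotype_block(outer_pair_strong_ld: bool, l: int, r: int,
                       d: float = 0.95) -> bool:
    """Check Definition 1 for a region, given its pair counts.

    outer_pair_strong_ld: True if the two outermost SNPs are in strong LD (a).
    l, r: numbers of strong LD and strong EHR SNP pairs inside the region.
    d: minimal proportion of informative pairs that must be in strong LD (b).
    """
    if not outer_pair_strong_ld:
        return False
    informative = l + r
    # When condition (a) holds, the outer pair itself is counted in l,
    # so informative is positive; the guard only protects against bad input.
    return informative > 0 and l / informative >= d
```

With *d* = 0.95, a region with 20 strong LD pairs tolerates one strong EHR pair (20/21 ≈ 0.952), whereas 18 strong LD pairs and 2 strong EHR pairs (0.90) do not qualify.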

The Haploview algorithm [24] performs a haplotype block partition in two steps: (1) all regions satisfying Definition 1 (a) are collected in a set of candidate haplotype blocks; (2) from this set of candidates, a subset of non-overlapping regions that satisfy Definition 1 (b) is selected. In the first step, the entire chromosome is scanned and, for every SNP pair, the |*D*′| CI is computed and stored in an *n* × *n* matrix. The matrix is then traversed to identify the pairs that satisfy Definition 1 (a). These pairs mark regions of different length that are candidates to become haplotype blocks. In the second step, the candidate regions are sorted by decreasing length and processed starting with the largest one. If a region satisfies Definition 1 (b), it is classified as a haplotype block, and all other overlapping candidate regions are discarded. Regions not satisfying Definition 1 (b) are skipped. This process continues with the next largest candidate region, until the candidate set is completely processed and the list of haplotype blocks is complete.
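The second step is a greedy selection and can be sketched as follows. This is not the Haploview source code: the function and parameter names are hypothetical, regions are represented as inclusive `(start, end)` index pairs, length is taken as the index span, ties between equally long candidates are broken by input order (the text does not specify a rule), and the test for Definition 1 (b) is passed in as a callable.

```python
def select_blocks(candidates, satisfies_b):
    """Greedy selection of non-overlapping blocks from candidate regions.

    candidates: iterable of (start, end) pairs with start <= end (inclusive).
    satisfies_b: callable (start, end) -> bool implementing Definition 1 (b).
    """
    selected = []
    # Process the longest candidates first; sorted() is stable, so equally
    # long candidates keep their input order.
    by_length = sorted(candidates, key=lambda reg: reg[1] - reg[0], reverse=True)
    for start, end in by_length:
        # A candidate overlapping an already accepted block is discarded.
        if any(start <= e and s <= end for s, e in selected):
            continue
        if satisfies_b(start, end):
            selected.append((start, end))
    return sorted(selected)
```

Checking the overlap before condition (b) gives the same partition as the order described in the text, because a region is accepted only if it passes both tests.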

The overall complexity of the algorithm is mainly determined by the first step. More specifically, the *Θ*(*n*^{2}) time and memory complexity is due to the computation and maintenance of the *n* × *n* CI matrix. For this reason, we concentrated our improvements on the first step.

### Incremental computation of haplotype blocks

The core ideas of our optimizations are to compute haplotype blocks incrementally and to omit, as soon as possible, regions that cannot be extended to larger blocks due to an insufficient proportion of *strong LD* SNP pairs. In this way, we avoid both unnecessary computations and the storage of an *n* × *n* CI matrix. The incremental haplotype block computation is based on the concepts of a *SNP-pair weight* and a *region weight* described below.

#### Definition 2.

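*(SNP-pair weight)*. *Let C and d be as defined in Definition 1. For a given pair of SNPs g*_{i} *and g*_{j}, *the SNP-pair weight*, *w*(*i*, *j*), *is defined as follows:*

$$w(i,j) = \begin{cases} 1-d & \text{if } g_i \text{ and } g_j \text{ are in strong LD} \\ -d & \text{if } g_i \text{ and } g_j \text{ show strong EHR} \\ 0 & \text{otherwise} \end{cases}$$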

#### Definition 3.

*(Region weight)*. *Let G be as defined in Definition 1*. *The region weight of G*, $\bar{w}(i,j)$, *is defined as the sum of all SNP-pair weights in G*:

$$\bar{w}(i,j) = \sum_{v=i+1}^{j} \sum_{u=i}^{v-1} w(u,v)$$

The following theorem defines a haplotype block based on the region weight.

#### Theorem 1.

*Let G be as defined in Definition 1. G is a haplotype block if w*(*i*, *j*) = 1 - *d and* $\bar{w}(i,j) \ge 0$.

#### Proof

By Definition 2, *w*(*i*, *j*) = 1 - *d* only when SNPs *g*_{i} and *g*_{j} are in *strong LD*. Therefore, Definition 1 (a) is satisfied. *G* contains $S = \sum_{v=i+1}^{j} \sum_{u=i}^{v-1} 1$ possible SNP pairs, of which *l* are in *strong LD*, *r* show *strong EHR*, and the remaining ones are non-informative. From Definitions 2 and 3, it follows that $\bar{w}(i,j) = \sum_{v=i+1}^{j} \sum_{u=i}^{v-1} w(u,v) = l(1-d) + r(-d) + (S-l-r)\cdot 0 = l - d(l+r)$. If $\bar{w}(i,j) \ge 0$, then *l* - *d*(*l* + *r*) ≥ 0 ⇒ *l* / (*l* + *r*) ≥ *d*. Therefore, Definition 1 (b) is also satisfied.
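The identity used in the proof can be checked numerically. The sketch below is an illustration under stated assumptions: the label strings and function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the boundary case of a region weight equal to zero is not blurred by floating-point rounding.

```python
from fractions import Fraction


def pair_weight(label: str, d: Fraction) -> Fraction:
    """SNP-pair weight: 1 - d for strong LD, -d for strong EHR, 0 otherwise."""
    if label == "strong_ld":
        return 1 - d
    if label == "strong_ehr":
        return -d
    return Fraction(0)


def region_weight(labels, d: Fraction) -> Fraction:
    """Region weight: sum of the weights of all SNP pairs in the region."""
    return sum((pair_weight(label, d) for label in labels), Fraction(0))


d = Fraction(19, 20)  # d = 0.95
labels = ["strong_ld"] * 19 + ["strong_ehr"] * 1 + ["non_informative"] * 5
l, r = 19, 1
# The region weight equals l - d(l + r), here exactly 0, so l / (l + r) >= d.
boundary_weight = region_weight(labels, d)
```

Adding one more strong EHR pair to the same region makes the weight negative, which corresponds to *l* / (*l* + *r*) dropping below *d*.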

Theorem 1 is the basis for the incremental haplotype block reconstruction, which is the core of our optimizations. In the following, we present three gradual improvements of the Haploview algorithm: a memory-efficient implementation based on the Gabriel *et al.* [23] definition (MIG); MIG with additional search space pruning (MIG^{+}); and MIG^{+} with iterative chromosomal processing (MIG^{++}). Theorem 1 ensures that all three algorithms produce block partitions that are identical to the original Haploview results.

#### The MIG algorithm

For a given chromosomal segment *C* containing *n* SNPs, the maintenance of an *n* × *n* matrix containing all the |*D*′| CIs can be avoided by storing *n* region weights in a unidimensional vector *W*_{n × 1}. In each element of *W*, *W*[*i*], we store the weight of a chromosomal region that starts at SNP *g*_{i}. When the region is enlarged by including additional SNPs to the right of *g*_{i}, the weight *W*[*i*] is updated accordingly. This procedure, illustrated in Figure 1, begins with setting all the weights to 0. At the initial stage, the vector *W* represents all one-SNP regions. Then, the region starting at SNP *g*_{1} is enlarged by including the next SNP, *g*_{2}. Therefore, starting from *g*_{2}, chromosome *C* is processed one SNP after the other, from left to right. For a SNP *g*_{j}, with *j* ≥ 2, all SNP-pair weights *w*(*i*, *j*), *i* = *j* - 1, …, 1, are computed and added up as *s* = *w*(*j* - 1, *j*) + ⋯ + *w*(*i*, *j*).

*s* and *W*[*i*] are updated for every computed weight *w*(*i*, *j*). Before the update, *s* = *w*(*j* - 1, *j*) + ⋯ + *w*(*i* + 1, *j*) and *W*[*i*] contains the region weight $\bar{w}(i,j-1)$, which was already computed for the previous SNP *g*_{j-1}. Then, *s* is incremented by *w*(*i*, *j*) and *W*[*i*] is incremented by the new value of *s*. *W*[*i*] now represents the region weight $\bar{w}(i,j)$, i.e., $\bar{w}(i,j) = \bar{w}(i,j-1) + w(j-1,j) + \cdots + w(i,j)$. Whenever *w*(*i*, *j*) = 1 - *d* and $\bar{w}(i,j) \ge 0$, Theorem 1 is satisfied and the region ⟨*g*_{i}, …, *g*_{j}⟩ is added to the set of candidate haplotype blocks. This procedure is repeated with the next SNP, *g*_{j+1}. An example of the first three computational steps is given in Figure 2. The pseudocode is provided in Algorithm A.1 (Additional file 1).

MIG reduces the memory complexity from *Θ*(*n*^{2}) to *Θ*(*n*). Moreover, instead of identifying candidate regions that satisfy only Definition 1 (a) (as in Haploview), MIG immediately checks both conditions (a) and (b). This yields a smaller set of candidate blocks and therefore also indirectly speeds up the second step of the Haploview algorithm.
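The update rule of the weight vector *W* can be sketched in a few lines. This is not Algorithm A.1 from the additional file, only an illustration written from the description above: the function name, the `pair_class` callback (returning a hypothetical label for each SNP pair) and the use of exact fractions are assumptions. SNP indices are 1-based, as in the text.

```python
from fractions import Fraction


def mig_candidates(pair_class, n, d=Fraction(19, 20)):
    """Collect candidate blocks (i, j) with linear memory.

    pair_class(i, j) returns "strong_ld", "strong_ehr" or "non_informative"
    for 1 <= i < j <= n. W[i] holds the weight of the region starting at SNP i.
    """
    weight_of = {"strong_ld": 1 - d, "strong_ehr": -d,
                 "non_informative": Fraction(0)}
    W = [Fraction(0)] * (n + 1)          # index 0 is unused
    candidates = []
    for j in range(2, n + 1):            # scan SNPs from left to right
        s = Fraction(0)                  # running sum w(j-1, j) + ... + w(i, j)
        for i in range(j - 1, 0, -1):
            label = pair_class(i, j)
            s += weight_of[label]
            W[i] += s                    # W[i] is now the region weight of <g_i..g_j>
            if label == "strong_ld" and W[i] >= 0:
                candidates.append((i, j))   # Theorem 1 holds for this region
    return candidates
```

Only the vector `W` and the running sum `s` are kept between steps, which is where the *Θ*(*n*) memory bound comes from; the number of pair evaluations is still *n*(*n* - 1) / 2.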

#### The MIG^{+} algorithm

While MIG drastically reduces the memory requirements by avoiding the maintenance of the CI matrix, it still computes weights for all SNP pairs, totaling *n*(*n* - 1) / 2 computations as in Haploview. To omit unnecessary computations, we apply a search space pruning to the MIG algorithm to identify regions that cannot be further extended to form a haplotype block. The pseudocode is shown in Algorithm A.2 (Additional file 1).

Instead of computing weights for all pairs of SNPs, only weights *w*(*j* - 1, *j*), …, *w*(*b*, *j*) are computed, where $b = \min\{\, i \mid 1 \le i < j \wedge \bar{w}_{\max}(i,j) \ge 0 \,\}$ and $\bar{w}_{\max}(i,j) = \max\{\, \bar{w}(i,k) \mid j < k \le n \,\}$. The function $\bar{w}_{\max}(i,j)$ is an upper bound for the weight of all regions ⟨*g*_{i}, …, *g*_{j}, …, *g*_{k}⟩ that start at *g*_{i} and end after *g*_{j}, i.e., those extending beyond the region ⟨*g*_{i}, …, *g*_{j}⟩. If $\bar{w}_{\max}(i,j) < 0$ for some *i*, none of the regions ⟨*g*_{i}, …, *g*_{k}⟩ can satisfy Definition 1 (b). The smallest *i* that can be a potential starting point of a region with a non-negative weight can therefore be set as breakpoint *b*. Regions starting left of *b* and stopping right of *j* receive negative weights and are discarded (Figure 3, left panel).

The upper bound, $\bar{w}_{\max}(i,j)$, is estimated assuming that all unprocessed SNPs to the right of *g*_{j} are in *strong LD* with each other and with all SNPs in the region ⟨*g*_{i}, …, *g*_{j}⟩. Then, $\bar{w}(i,k) \le \bar{w}(i,j) + (1-d)\cdot S$, where $\bar{w}(i,j)$ is already computed and *S* = ((*j* - *i* + 1) + (*k* - *i*))(*k* - *j*) / 2 is the number of unprocessed SNP pairs. Since *S* is largest for the longest region ⟨*g*_{i}, …, *g*_{n}⟩, we have $\max\{\bar{w}(i,k) \mid k > j\} \le \bar{w}(i,j) + (1-d)\cdot((j-i+1)+(n-i))(n-j)/2$, and the estimated upper bound $\bar{w}_{\max}(i,j)$ is defined as follows:

$$\bar{w}_{\max}(i,j) = \bar{w}(i,j) + (1-d)\cdot\frac{\big((j-i+1)+(n-i)\big)(n-j)}{2}$$

The MIG^{+} algorithm performs at most *λ* *n*(*n* - 1) / 2 computations, where *λ*, 1 - *d* ≤ *λ* ≤ 1, depends on the data. The worst case of *λ* = 1 occurs only in the unlikely situation when a few very large blocks span an entire chromosome.
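The optimistic bound used for pruning, and the resulting breakpoint, can be illustrated as follows. The sketch is not Algorithm A.2; function names are hypothetical, indices are 1-based, and `W[i]` is assumed to hold the already computed region weight of ⟨*g*_{i}, …, *g*_{j}⟩.

```python
from fractions import Fraction


def unprocessed_pairs(i, j, k):
    """Number of SNP pairs (u, v) with j < v <= k and i <= u < v."""
    return ((j - i + 1) + (k - i)) * (k - j) // 2


def mig_plus_upper_bound(w_bar_ij, i, j, n, d=Fraction(19, 20)):
    """Region weight bound if every unprocessed pair were in strong LD."""
    return w_bar_ij + (1 - d) * unprocessed_pairs(i, j, n)


def find_breakpoint(W, j, n, d=Fraction(19, 20)):
    """Smallest start i < j whose bound is non-negative, or None if none is."""
    for i in range(1, j):
        if mig_plus_upper_bound(W[i], i, j, n, d) >= 0:
            return i
    return None
```

For instance, with *n* = 5 and *j* = 3, a region starting at SNP 1 with weight -1 can gain at most 7 × 0.05 = 0.35 from the 7 unprocessed pairs, so it can never reach a non-negative weight and is pruned.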

#### The MIG^{++} algorithm

A limitation of the MIG^{+} algorithm is its blindness to the unprocessed area to the right of the current SNP *g*_{j}. Assuming *strong LD* for all SNP pairs in this area results in a conservative upper bound, $\bar{w}_{\max}(i,j)$, for the region weights. An additional optimization step yields a more precise estimate of $\bar{w}_{\max}(i,j)$ and further prunes unnecessary computations. The pseudocode of the modified algorithm, MIG^{++}, is given in Algorithm A.3 (Additional file 1).

The improved algorithm is an iterative procedure that, at each iteration, scans the chromosome from left to right and computes the weights only for a limited number of SNP pairs. For a SNP *g*_{j}, the SNP pairs considered in an iteration are restricted to a window of size *win*: only the weights *w*(*j* - 1, *j*), …, *w*(*t*, *j*) are computed, where *t* = max({*b*, *j* - *win*}) and 1 ≤ *win* ≤ *n* (Figure 3, right panel). At each new iteration, the window size is increased by a number of SNPs equal to *win*. Therefore, the number of computed SNP-pair weights increases proportionally. This allows a more precise estimation of the upper bounds for the region weights with every new iteration.

If all SNP-pair weights computed in all previous iterations were considered for the estimation of the upper bound, $\bar{w}_{\max}(i,j)$, the algorithm would require linear time for each individual SNP pair to sum up all weights inside the corresponding region. We use a computationally cheaper constant-time solution, though it may lead to a less accurate estimation. Since $\bar{w}(i,k) \le \bar{w}(i,j) + \bar{w}(1,k) - \bar{w}(1,j)$, we have $\max\{\bar{w}(i,k) \mid k>j\} \le \bar{w}(i,j) + \max\{\bar{w}(1,k) \mid k>j\} - \bar{w}(1,j)$. An upper bound $\bar{w}_{\max}(i,j)$ can then be computed as follows:

$$\bar{w}_{\max}(i,j) = \bar{w}(i,j) + \max\{\bar{w}(1,k) \mid k>j\} - \bar{w}(1,j)$$

The term $\max\{\bar{w}(1,k) \mid k>j\}$ is computed in linear time after every scan of the chromosome, whereas $\bar{w}(1,j)$ is computed in constant time. Thus, the computation of the upper bound $\bar{w}_{\max}(i,j)$ for each individual SNP pair requires only constant time.
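One way to obtain the maximum over all *k* > *j* for every *j* in a single linear pass is a suffix maximum. The sketch below is an assumption about how this could be organized, not the authors' Algorithm A.3: the names are hypothetical, and the list `w1` is taken as given, with `w1[k]` holding the current value of the weight of the region from SNP 1 to SNP *k* (index 0 unused).

```python
import math


def suffix_max_after(w1):
    """best_after[j] = max of w1[k] over k > j, or -inf when no such k exists."""
    n = len(w1) - 1
    best_after = [-math.inf] * (n + 1)
    running = -math.inf
    for k in range(n, 0, -1):            # one right-to-left pass, linear time
        best_after[k] = running          # maximum over indices strictly above k
        running = max(running, w1[k])
    return best_after


def mig_pp_upper_bound(w_bar_ij, j, w1, best_after):
    """Constant-time bound: w_bar(i, j) + max_{k>j} w_bar(1, k) - w_bar(1, j)."""
    return w_bar_ij + best_after[j] - w1[j]
```

After the pass, each bound needs only two table lookups, so the per-pair cost stays constant. For the last SNP the bound is minus infinity, which reflects that no region can extend beyond it.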

When *win* = *n*, MIG^{++} is identical to MIG^{+}. When *win* = 1, the number of iterations becomes too large, introducing a significant computational burden. We propose to set *win* = ⌈(*n* - 1)(1 - *d*) / 2⌉, which corresponds to a fraction 1 - *d* of all SNP pairs, the minimal fraction of SNP pairs that must be considered before one can be sure that an *n*-SNP segment is not a haplotype block.
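The proposed window size is a one-line computation. In this sketch (function name hypothetical), *d* is passed as an exact fraction because the ceiling is sensitive to rounding: with floating-point arithmetic, (1001 - 1)(1 - 0.95) / 2 evaluates to slightly more than 25 and would be rounded up to 26.

```python
import math
from fractions import Fraction


def default_window(n: int, d: Fraction = Fraction(19, 20)) -> int:
    """Window size win = ceil((n - 1)(1 - d) / 2) for an n-SNP segment."""
    return math.ceil((n - 1) * (1 - d) / 2)
```

For a segment of 30,000 SNPs and *d* = 0.95 this gives a window of 750 SNPs, and a window of that size over all *n* SNPs covers roughly 0.05 × *n*(*n* - 1) / 2 pairs.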

MIG^{++} performs at most *λ* *n*(*n* - 1) / 2 computations, where *λ*, 1 - *d* ≤ *λ* ≤ 1, depends on the data. However, the value of *λ* obtained with the MIG^{++} algorithm is expected to be smaller than that from the MIG^{+} algorithm, because of the more precise estimation of $\bar{w}_{\max}(i,j)$.

### Alternative methods to estimate the *D*′ CI

A critical step of the Gabriel *et al.* [23] approach is the estimation of the *D*′ CI. In a genome-wide context, this calculation can be repeated hundreds of millions of times. In Haploview, the CIs are obtained by means of the likelihood-based procedure proposed by Wall and Pritchard [27], which requires from 100 to 1,000 iterations. This method can be replaced with a computationally cheaper solution, based on an approximate estimator of the *D*′ variance, as proposed by Zapata *et al*. [26]. This solution makes the whole block recognition algorithm significantly faster.

#### The Wall and Pritchard (WP) method

The true allele frequencies of each SNP are assumed to be equal to the observed allele frequencies. The likelihood of the data in the four-fold table obtained by crossing any SNP pair, conditional on the |*D*′| value, can be expressed as *l* = *P*(*data* ∣ |*D*′|). *l* is evaluated at each value of |*D*′| = 0.001 × *p*, with *p* = 0, 1, …, 1000. CL is defined as the largest value of |*D*′| such that $\sum_{i=0}^{p-1} l(i) \big/ \sum_{i=0}^{1000} l(i) \le \alpha$, where *α* is the significance level. Similarly, CU is defined as the smallest value of |*D*′| such that $\sum_{i=p+1}^{1000} l(i) \big/ \sum_{i=0}^{1000} l(i) \le \alpha$.
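Given the likelihood values on the grid, the two bounds follow from cumulative sums. The sketch below covers only this last step and assumes the likelihoods have already been computed (the likelihood model itself is not reproduced here). The function name is hypothetical, `alpha` is interpreted as the mass allowed in each tail, and the grid size is taken from the length of the input so that small examples can be tried.

```python
def wp_confidence_bounds(likelihood, alpha=0.05):
    """Return (CL, CU) for |D'| from likelihoods on an evenly spaced grid.

    likelihood[p] is the likelihood at |D'| = p / (len(likelihood) - 1);
    the values must be non-negative with a positive sum.
    """
    steps = len(likelihood) - 1
    total = sum(likelihood)

    # CL: largest grid point whose mass strictly below it is <= alpha.
    p_low, lower_mass = 0, 0.0
    for p in range(steps + 1):
        if lower_mass / total > alpha:
            break
        p_low = p
        lower_mass += likelihood[p]

    # CU: smallest grid point whose mass strictly above it is <= alpha.
    p_high, upper_mass = steps, 0.0
    for p in range(steps, -1, -1):
        if upper_mass / total > alpha:
            break
        p_high = p
        upper_mass += likelihood[p]

    return p_low / steps, p_high / steps
```

With the 1,001-point grid of the text, `likelihood` would have 1,001 entries; each call then costs one pass over the grid in addition to the likelihood evaluations, which is the expense the approximate method avoids.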

#### The approximate variance (AV) method

Consider two SNPs, *u* and *v*, with alleles {*u*_{1}, *u*_{2}} and {*v*_{1}, *v*_{2}}, respectively. Let $n_{u_i v_j}$ and $f_{u_i v_j}$ denote, respectively, the absolute and relative frequencies of the four possible haplotypes, *u*_{i}*v*_{j} (*i*, *j* ∈ {1, 2}), with $f_{u_i}$ and $f_{v_j}$ being the marginal frequencies of the two SNPs. In total, $N = \sum n_{u_i v_j}$ haplotypes are observed. Zapata *et al.* [26] showed that the variance of *D*′ can be approximated as follows:

where $D' = D / D_{\max}$; $D = f_{u_1 v_1} - f_{u_1} f_{v_1}$; $D_{\max}$ is $\min\{f_{u_1}(1 - f_{v_1}),\ (1 - f_{u_1}) f_{v_1}\}$ when *D* > 0 or $\min\{f_{u_1} f_{v_1},\ (1 - f_{u_1})(1 - f_{v_1})\}$ when *D* < 0; *f*_{1} is $f_{v_1}$ when *D*′ > 0 or $f_{v_2}$ when *D*′ < 0; *f*_{2} is $f_{v_2}$ when *D*′ > 0 or $f_{v_1}$ when *D*′ < 0; *f*_{3} is $f_{u_1 v_1}$, $f_{u_1 v_2}$, $f_{u_2 v_1}$, and $f_{u_2 v_2}$ when $D_{\max}$ is $f_{u_1} f_{v_1}$, $f_{u_1} f_{v_2}$, $f_{u_2} f_{v_1}$, and $f_{u_2} f_{v_2}$, respectively; and

When *D*′ = ±1, then *V*(*D*′) = 0. The 1 - *α* CI of *D*′ is equal to $D' \pm Z_{\alpha/2}\sqrt{V(D')}$, where *Z*_{α/2} is the 1 - *α* / 2 quantile of the standard normal distribution.
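The quantities *D*, *D*_{max} and *D*′, and the normal-approximation interval, can be computed directly from the four haplotype counts. The sketch below is illustrative: function names are hypothetical, the variance is passed in as a number because the variance estimator of Zapata *et al.* is not reproduced here, and returning *D*′ = 0 when *D* = 0 is a convention adopted for this example.

```python
import math
from statistics import NormalDist


def d_prime(n11: int, n12: int, n21: int, n22: int) -> float:
    """D' from the counts of haplotypes u1v1, u1v2, u2v1, u2v2."""
    total = n11 + n12 + n21 + n22
    f_u1 = (n11 + n12) / total          # marginal frequency of allele u1
    f_v1 = (n11 + n21) / total          # marginal frequency of allele v1
    d = n11 / total - f_u1 * f_v1
    if d > 0:
        d_max = min(f_u1 * (1 - f_v1), (1 - f_u1) * f_v1)
    elif d < 0:
        d_max = min(f_u1 * f_v1, (1 - f_u1) * (1 - f_v1))
    else:
        return 0.0                      # convention used in this sketch
    return d / d_max


def av_confidence_interval(dp: float, variance: float, alpha: float = 0.10):
    """Interval D' +/- Z_{alpha/2} * sqrt(V(D')); alpha = 0.10 gives a 90% CI."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(variance)
    return dp - half_width, dp + half_width
```

Because the interval needs only a handful of arithmetic operations per SNP pair, it avoids the grid of likelihood evaluations required by the WP method.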

### Experimental evaluation

The experimental evaluation was based on the phased CEPH genotypes included in the HapMap phase II (HapMapII) [28] and the 1000 Genomes Project phase 1 release 3 (1000G) [29] databases. The HapMapII dataset included 2,543,857 SNPs from 120 haplotypes (60 individuals) and the 1000G dataset included 10,858,788 SNPs from 170 haplotypes (85 individuals).

To compare the new algorithms to the standard Haploview in terms of runtime and memory usage, the ideal solution would have been to randomly sample regions with different characteristics from the HapMapII or 1000G datasets. However, the Haploview algorithm was so computationally expensive that it was not feasible to consider a sufficiently large number of random regions and, therefore, to obtain a representative sample of all possible scenarios over the whole genome. For this reason, we selected the regions such that the most extreme scenarios, in terms of median SNP minor allele frequency (MAF) and median inter-SNP distance, were covered. To identify such representative regions, we performed a systematic scan of all SNPs in the genome using a sliding window of 1,000 SNPs, after removing chromosomal centromeres and the HLA region. For each sliding region, the median MAF and inter-SNP distance were recorded (Additional file 1, Figure A.1). All regions were then represented in a two-dimensional Euclidean space, where the normalized inter-SNP distance was plotted against the normalized median MAF (Additional file 1, Figure A.2). A total of nine regions were chosen for the experiments: the eight regions located on the outermost boundaries of the Euclidean space and the region closest to the center of the space. These regions represent scenarios with extreme and moderate median MAF and median inter-SNP distance. The procedure was repeated using larger sliding windows of 5,000 to 30,000 SNPs. If not stated otherwise, in the experimental results we report median values over the nine regions for every different window size.

The block partitions obtained with the WP and AV methods for *D*′ CI estimation were compared in terms of total number of blocks, median number of SNPs per block, proportion of SNPs clustered into blocks, and median within-block haplotype diversity. Haplotype diversity [19, 20] is defined as the ratio between the number of common haplotypes and the total number of haplotypes within a block. Common haplotypes are those occurring more than once. The haplotype diversity index ranges from 0 (complete diversity) to 1 (no diversity).

The three MIG algorithms were implemented in C++. To guarantee a fair comparison, the original Java implementation of the Haploview algorithm was rewritten in C++, too. By default, Haploview considers only SNP pairs within a maximal distance of 500 kbp. We removed this constraint because it could affect the block partitions of very wide regions. For the WP method, we set the number of likelihood estimation points to 1,000 (100 in the original Haploview implementation). We did not consider the population-specific two-, three-, and four-marker rules proposed by Gabriel *et al.* [23] for processing very short regions, because they have no impact on the computational efficiency of the algorithms. All experiments were run on a machine with an Opteron 8356 Quad Core (2.3 GHz) CPU.

### Genome-wide association study of rheumatoid arthritis

We applied our haplotype block partitioning algorithm to the genome-wide association study of the North American Rheumatoid Arthritis Consortium (NARAC) dataset. Data consisted of 868 cases and 1,194 controls. The samples were genotyped at 544,917 autosomal and sex chromosome SNPs. Quality control was performed with PLINK 1.07 [25]: we excluded 5,422 SNPs with a call rate of <90%, 11,327 SNPs with a minor allele frequency of <0.001, and 898 SNPs because of significant deviation from Hardy-Weinberg equilibrium in controls (*p-value* ≤ 10^{-6}). No samples were excluded because of low call rate (<90%); 2 cases and 5 controls were removed because of sex mismatch; 1 case and 8 controls were additionally excluded after a population stratification test based on principal component analysis performed with EIGENSOFT 5.0.1 [30]. After the quality control, 514,539 autosomal SNPs and 2,046 samples were available for analyses.

Haplotypes were phased using SHAPEIT version 2 [31]. To achieve good accuracy, we set 400 conditioning states per SNP. Recombination rates were taken from HapMap phase II build 36 and effective population size was set to 11,418 (as suggested for CEU populations). The estimated haplotypes were submitted to MIG^{++} and processed with the WP and AV methods. We obtained 98,979 WP blocks, covering 445,832 SNPs, with 68,707 singleton SNPs outside of any block. The AV method identified 97,816 blocks, covering 446,170 SNPs, and 68,369 singleton SNPs.

The genome-wide association scan was based on a logistic regression model adjusted for sex and the top 10 eigenvectors obtained from EIGENSOFT 5.0.1 [30]. The association between disease status and individual SNPs or haplotype blocks was tested with a likelihood ratio test using PLINK 1.07 [25] with the logistic-genotypic and omnibus options, respectively. Within each block, haplotypes with frequency of <0.01 were collapsed together to preserve power. Singleton SNPs outside blocks were treated as in the SNP-based analysis, therefore producing analogous results. Genomic control (GC) correction was applied to both SNP- and block-based GWAS results. Bonferroni-corrected significance thresholds were set to 2.98 × 10^{-7} for the analysis based on the WP block partition (i.e. 0.05 divided by the sum of 98,979 WP blocks and 68,707 singleton SNPs), 3.01 × 10^{-7} for the analysis based on the AV method (i.e. 0.05 divided by 97,816 AV blocks plus 68,369 singleton SNPs), and 9.72 × 10^{-8} (i.e. 0.05 / 514,539) for the individual SNP analysis.
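The Bonferroni thresholds are plain quotients of the family-wise level by the number of tests, and can be recomputed from the counts given in the text. The function name below is hypothetical.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold for n_tests independent tests."""
    return alpha / n_tests


wp_tests = 98_979 + 68_707      # WP blocks plus singleton SNPs
av_tests = 97_816 + 68_369      # AV blocks plus singleton SNPs
snp_tests = 514_539             # individual SNPs after quality control

for name, n_tests in [("WP", wp_tests), ("AV", av_tests), ("SNP", snp_tests)]:
    print(f"{name}: {bonferroni_threshold(n_tests):.2e}")
```

Both block partitions involve the same 514,539 SNPs but only about a third as many tests as the single-SNP scan, which is why their thresholds are roughly three times less stringent.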

## Results

### Runtime and memory with the WP method

Figure 4 shows the runtime and memory performance of Haploview and the three MIG algorithms based on the WP method, when applied to the 1000G dataset. Since both Haploview and MIG perform *n*(*n* − 1)/2 computations, identical runtimes were expected: both took 80 hours to process regions of 30,000 SNPs. However, MIG used three orders of magnitude less memory than Haploview (3 MB vs. 7 GB). The runtime was significantly reduced with MIG ^{+} (27 hours) and even further with MIG ^{++} (14 hours). The runtime difference between algorithms increased with the region size (number of SNPs). Memory usage was identical for MIG and MIG ^{+}, whereas MIG ^{++} required slightly more memory to store the computational status between iterative region scans. Similar results were obtained on the HapMapII dataset (results not shown).

The MIG ^{++} omitted more unnecessary computations than MIG ^{+}, which is reflected by the smaller *λ* coefficient in both HapMapII and 1000G datasets (Figure 5). The *λ* values decreased with increasing number of SNPs in the region. When increasing the region size, after a rapid decline for small regions, *λ* reached stable values with both MIG ^{+} and MIG ^{++} algorithms and in both datasets. This behavior relates to the LD decay with distance. In regions of 30,000 SNPs in the 1000G dataset, the MIG ^{++} algorithm was able to omit ∼80% of the calculations (*λ* ∼ 0.20), while MIG ^{+} could omit ∼60% of the calculations (*λ* ∼ 0.40). An example of the reduction of the number of calculations is given in Figure 6, where MIG ^{+} and MIG ^{++} are compared to Haploview, which is represented by the entire triangle.
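To give a sense of scale, the sketch below converts the approximate *λ* values quoted above into numbers of pairwise computations for a region of 30,000 SNPs. It assumes that *λ* can be read as the fraction of the *n*(*n* − 1)/2 SNP pairs that are actually evaluated, which is how the text uses it; the function names are illustrative and not part of any MIG implementation.

```python
def n_pairs(n_snps: int) -> int:
    """Number of distinct SNP pairs, n(n - 1) / 2."""
    return n_snps * (n_snps - 1) // 2


def pairs_evaluated(n_snps: int, lam: float) -> int:
    """Pairs evaluated when a fraction `lam` of all pairs is computed."""
    return round(lam * n_pairs(n_snps))


n = 30_000
total = n_pairs(n)
print(f"all pairs: {total:,}")

# Approximate lambda values quoted in the text for the 1000G dataset.
for name, lam in [("MIG", 1.00), ("MIG+", 0.40), ("MIG++", 0.20)]:
    done = pairs_evaluated(n, lam)
    print(f"{name}: {done:,} pairs evaluated, {total - done:,} omitted")
```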

### Runtime and memory with the AV method

When we introduced the AV method to estimate the *D*′ CI, we observed a drastic reduction of the computational time of the MIG algorithm. With the AV approach, the median runtime needed to analyze sequences of 10,000 SNPs in the 1000G dataset was 2 minutes. The same analysis took a median of 8.7 hours with the WP method (Figure 7, left panel). A proportional time reduction was observed for MIG ^{+} and MIG ^{++}. Similar results were obtained in the HapMapII dataset (results not shown).

We observed that the introduction of the AV method caused a slight increase of the *λ* coefficient (Figure 8). This is because, with the AV method, more SNP pairs are classified as being in *strong LD*. This increases the number of possible configurations to be checked and results in a larger set of candidate haplotype blocks. With the AV method, the MIG algorithms identified tens of millions of candidate haplotype blocks (Additional file 1, Figure A.3). The number of candidate blocks was even larger when the AV method was applied directly to Haploview, where candidate blocks need to satisfy only Definition 1 (a). This significantly larger number of candidate blocks explains the increase in runtime of Haploview when using the AV method: for regions of 5,000 SNPs in the 1000G dataset, the median runtime was 451 hours with the AV method against 2 hours with the WP method (Figure 7, right panel).

### Block Partitions with the WP and AV Methods

The characteristics of the different block partitions obtained with the WP and AV methods are summarized in Figure 9. The AV method produced a smaller number of blocks than the WP method (top panels). The median number of blocks per region increased along with the number of SNPs, and it increased faster for the WP than for the AV method. Considering the median number of SNPs per block, the AV method produced larger blocks than the WP method (middle panels). For very short regions (e.g., 1,000 SNPs) both methods generally induced larger blocks. This is because such small regions might be completely covered by a single or very few haplotype blocks. The median number of SNPs per block decreased as the length of the region considered increased. Overall, the AV method assigned a higher percentage of SNPs to blocks than the WP method, which left more singleton SNPs outside of any block (bottom panels). In the analysis of the HapMapII dataset, 98.4% of the SNPs were clustered within blocks with the AV method and 90.5% with the WP method. In the analysis of the 1000G dataset, the percentages were 99.7% and 86.8%, respectively.

We observed that 100% of the blocks identified by the WP method overlapped blocks identified by the AV method. More specifically, 80% to 90% of the blocks based on the WP method were completely included within blocks based on the AV method (Figure 10). The remaining 10% to 20% of WP blocks, whose borders crossed the borders of AV blocks, could be entirely attributed to the selection mechanism in step (2) of the algorithm, in which larger candidate blocks are prioritized over shorter ones. In fact, when we focused on the intermediate set of candidate blocks before the final pruning, rather than on the final block partition, we observed that 100% of the candidates from the WP method were entirely included within the candidate blocks from the AV method.
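The containment statistic discussed above can be computed from two block partitions as follows. This is a minimal sketch, assuming each block is stored as an inclusive `(start, end)` pair of SNP indices or positions and that the blocks of the outer partition do not overlap one another; the function name and the toy coordinates are illustrative, not taken from the study.

```python
from bisect import bisect_right
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), both inclusive


def fraction_contained(inner: List[Block], outer: List[Block]) -> float:
    """Fraction of `inner` blocks lying entirely within one `outer` block.

    Assumes the `outer` blocks are mutually non-overlapping.
    """
    outer_sorted = sorted(outer)
    starts = [s for s, _ in outer_sorted]
    contained = 0
    for start, end in inner:
        # Last outer block that starts at or before this inner block.
        i = bisect_right(starts, start) - 1
        if i >= 0 and end <= outer_sorted[i][1]:
            contained += 1
    return contained / len(inner) if inner else 0.0


# Toy partitions: the block (12, 20) crosses the border of (12, 18).
wp_blocks = [(0, 4), (6, 9), (12, 20), (22, 25), (30, 33)]
av_blocks = [(0, 10), (12, 18), (21, 35)]
print(fraction_contained(wp_blocks, av_blocks))  # 0.8
```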

Consistent with the finding of larger AV blocks, we observed generally higher haplotype diversity, i.e. lower values of the within-block haplotype diversity index, in the partitions obtained with the AV method than in those obtained with the WP method (Figure 11). For instance, when considering regions of 30,000 SNPs in the 1000G dataset, we observed median within-block haplotype diversity indices of 0.876 and 0.982 with the AV and WP methods, respectively. Slightly higher diversity indices were observed in the HapMapII dataset: 0.975 and 0.992 for the AV and WP methods, respectively. The within-block diversity was more variable in short than in long regions because, as observed above, when regions are too small it might be difficult to identify more than one block.

### Whole genome haplotype block recognition

The linear memory complexity and the significant reduction of the number of computations allowed us to run MIG ^{++} on a genome-wide scale. We could run MIG ^{++} on the full HapMapII dataset using both the WP and AV methods for *D*′ CI derivation. Using the more efficient AV method, we were also able to run a genome-wide haplotype block partition of the complete 1000G dataset. The runtime for the two datasets is shown in Figure 12. For HapMapII, the maximal runtime was 1 hour when using the AV method and 457 hours when using the WP method. In both cases, the maximal runtime was observed for chromosome 2, which contained 220,833 SNPs. The median *λ* value across all chromosomes was 0.129 (min = 0.125, max = 0.133) for the AV method and 0.103 (min = 0.099, max = 0.110) for the WP method. For the 1000G dataset, the maximal runtime using the AV method was 44 hours on chromosome 2, which contained 913,923 SNPs. The median *λ* value across all chromosomes was 0.216 (min = 0.206, max = 0.224). The maximal memory usage was very low and did not exceed 151 MB and 3.6 GB for the HapMapII and 1000G datasets, respectively.

Figure 13 shows the number of haplotype blocks per chromosome. In the HapMapII dataset, the largest number of blocks occurred in chromosome 2: 14,164 blocks with the WP method and 7,482 blocks with the AV method. The number of blocks detected with the WP method always exceeded the number of blocks detected with the AV method. For some chromosomes, the partitions obtained by the WP method contained almost twice as many blocks as the partitions obtained by the AV method. When using the AV method, we detected a very similar number of blocks in the HapMapII and 1000G datasets. Across all chromosomes, 100% of the blocks detected with the WP method overlapped blocks detected with the AV method. The median proportion of WP blocks completely covered by AV blocks was 0.797 (min = 0.773, max = 0.813).

The characteristics of the whole-genome block partitions obtained with the AV and WP methods are summarized in Table 1. The results were similar to the experiments on smaller regions. In the HapMapII dataset, fewer and larger blocks were detected with the AV method than with the WP method. With the AV method, a higher percentage of SNPs was assigned to blocks and the within-block haplotype diversity index was slightly smaller. However, the haplotype diversity was close to one for both methods, indicating that in both cases the number of possible haplotypes should be very limited. When applying the AV-based MIG ^{++} algorithm to the 1000G dataset, we observed a higher percentage of SNPs in blocks and a slightly smaller diversity index, which is explained by the higher number of SNPs per block.

For both HapMapII and 1000G datasets, the largest blocks were located over the chromosomal centromeres and spanned tens of millions of base-pairs (bp) (Additional file 1, Figure A.4). Some of these very large blocks were characterized by very low and irregular SNP density. After filtering out these exceptionally large blocks, the largest block identified by the WP-based MIG ^{++} algorithm in the HapMapII dataset was located in chromosome 1; it was 1,017,844 bp long and included 398 SNPs. When using the AV method, the largest block was located in chromosome 12; it was 1,190,412 bp long and included 335 SNPs. In the 1000G dataset, the largest block detected by the AV-based MIG ^{++} was located in chromosome 1; it was 1,361,781 bp long and included 2,896 SNPs.

### Genome-wide association study of rheumatoid arthritis

After the GWAS, we observed a genomic inflation factor *λ* of 1.015 for the SNP-based analysis, 1.082 for the AV block-based analysis, and 1.077 for the WP block-based analysis. After GC correction, in the SNP-based analysis, 116 SNPs were genome-wide significant. Of them, 106 were located inside 25 AV blocks and 110 inside 27 WP blocks. From the AV and WP block-based analyses, we observed 29 and 33 genome-wide significant blocks, respectively. Twenty-three of these blocks were the same between the two methods. The results from the SNP- and block-based analyses are compared in Table 2. The first part of the table shows the 20 genome-wide significant loci detected by both SNP- and block-based analyses. In most cases, the AV and WP methods led to identical results. One exception was the 4^{th} locus, where two adjacent AV blocks including 6 and 14 SNPs, respectively, corresponded to two adjacent WP blocks of 7 and 13 SNPs, respectively. That is, one SNP shifted from one block to another. In terms of significance, results were practically unchanged. A second exception was locus 13, where a block was detected with the WP but not with the AV method. The last two exceptions were loci number 15 and 19. In both cases, an AV block was split into two WP blocks. The second part of Table 2 shows a number of loci that would not have been detected with a SNP-based GWAS, but were uncovered by at least one of the two block partition methods. The AV and WP methods produced similar results. We did not observe any clear advantage of one method over the other. The last section of the table shows that a small number of loci were uncovered only by the SNP-based analysis. For these loci, the p-values from the block-based analyses were often close to the significance level, with the exception of the last two loci.

## Discussion

We propose an algorithm for haplotype block partitioning, termed MIG ^{++}, which represents a scalable implementation of the Haploview algorithm and produces the same results in a much shorter time and using a substantially smaller amount of main memory. MIG ^{++} can process large DNA regions using only a handful of megabytes of main memory. In such situations, Haploview would require gigabytes. In terms of runtime, the MIG ^{++} is several times faster than Haploview. We also demonstrated that more than 80% of calculations were not necessary for the purpose of block recognition and could be omitted, thus achieving a higher efficiency. The improved performance of the algorithm makes it possible to process very large chromosomal segments. When the approximated variance estimator proposed by Zapata *et al.* [26] is used to estimate the *D*′ CI, the MIG ^{++} can be applied genome-wide and process high density datasets, such as the 1000G, in a very short time.

With its very small memory requirements, the MIG ^{++} can process any number of SNPs. This allowed us to avoid Haploview's restrictions on the maximal haplotype block length (the default limitation is 500 kbp) and to consider the LD between SNPs at any distance. Our whole-genome experiments showed that the haplotype blocks, based on the Gabriel *et al.* [23] definition, can span more than 500 kbp and can extend over several millions of base pairs. This empirical result suggests that limiting the maximal block length may alter the block partition. The alteration can be substantial because the algorithm prioritizes the largest blocks. The smallest blocks are retained only when they do not overlap with the largest ones. For this reason, constraining the block length within pre-specified limits may induce a cascade of effects and may affect the final partition of very large segments. This is relevant, for example, when assessing the LD pattern of loci selected from GWAS, with the aim of identifying genes related to the lead SNP. In such cases, different partitions could imply different genes to be selected for follow-up.
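The cascade effect can be illustrated with a toy version of the selection step. The sketch below is a simplified stand-in, not the MIG ^{++} or Haploview code: it assumes that candidate blocks are given as inclusive `(start, end)` positions in bp, that the longest candidates are accepted first, and that a length cap acts simply by discarding candidates longer than the cap. The function name and coordinates are hypothetical.

```python
from typing import List, Optional, Tuple

Block = Tuple[int, int]  # (start_bp, end_bp), both inclusive


def select_blocks(candidates: List[Block],
                  max_len_bp: Optional[int] = None) -> List[Block]:
    """Greedy selection: longest candidates first, no overlaps allowed."""
    if max_len_bp is not None:
        # Simplified model of a length cap: drop over-long candidates.
        candidates = [c for c in candidates if c[1] - c[0] <= max_len_bp]
    chosen: List[Block] = []
    # Sort by decreasing length; ties are broken by start position.
    for start, end in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


candidates = [(0, 800_000), (100_000, 400_000),
              (350_000, 600_000), (700_000, 950_000)]

print(select_blocks(candidates))
# [(0, 800000)]
print(select_blocks(candidates, max_len_bp=500_000))
# [(100000, 400000), (700000, 950000)]
```

In this toy example, removing the single 800 kbp candidate does not merely shorten one block: two smaller candidates that it had suppressed enter the partition, and the block borders move.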

With the MIG ^{++} algorithm, we were able to run a haplotype block recognition of the entire HapMapII dataset. However, it still required an unacceptably long time to apply the algorithm to larger and denser genomes, such as the 1000G dataset. This limitation is due to the use of the Wall and Pritchard [27] method, which models the |*D*′| likelihood and derives the |*D*′| CIs using an iterative procedure. In contrast, if the *D*′ variance is estimated with the approximated formula suggested by Zapata *et al.* [26], it is possible to derive the *D*′ CI with a single mathematical calculation. Thanks to this computationally less demanding solution, we could perform a complete block recognition of the HapMapII dataset in 1 hour and process the entire 1000G dataset in 44 hours. To the best of our knowledge, this is the first time that such a marker-dense genome has been partitioned with a threshold-free approach. Previously, block partition of the whole genome could only be achieved by dividing chromosomes into small chunks or by restricting computations using sliding window approaches. Such choices may introduce artificial breaks into the real haplotype structure.

It is important to note that the block partition obtained with an algorithm based on the AV method is not fully equivalent to a partition obtained based on the WP method. For large sample sizes and for common variants, the estimated variance of the *D*′ statistic is going to be similar, whichever method is used. However, when crossing a common with a rare SNP, it often happens that one of the four possible haplotypes is not present in the sample. In such situations, it is very likely that the *D*′ CI shrinks to 1 because the approximated variance is zero. In this way, the SNP pair is systematically classified as a *strong LD* pair. As a result, SNPs with rare alleles are easily grouped together into very large blocks, boosting the region coverage and the median number of SNPs per block. The WP method is less sensitive to extreme *D*′ values, and the resulting blocks are generally shorter. However, we observed that most (80%) of the haplotype blocks obtained with the WP method were contained within the larger blocks obtained with the AV method. That is, the use of the AV method produced a coarser partition, where AV blocks entirely contained one or more WP blocks. For this reason, the AV blocks showed a higher haplotype diversity, in the terms described by Patil *et al.* [19] and Zhang *et al.* [20], than the WP blocks.
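The effect of a missing haplotype on the point estimate can be checked with the standard definition of *D*′. The sketch below computes *D*′ from the four haplotype counts of a SNP pair using exact rational arithmetic. It does not implement the Zapata *et al.* variance approximation or the Wall and Pritchard likelihood; the function name and the counts are illustrative assumptions.

```python
from fractions import Fraction


def d_prime(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Fraction:
    """Lewontin's D' from the four haplotype counts of a SNP pair."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = Fraction(n_AB + n_Ab, n)  # frequency of allele A
    p_B = Fraction(n_AB + n_aB, n)  # frequency of allele B
    d = Fraction(n_AB, n) - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    if d_max == 0:
        raise ValueError("D' is undefined when one of the SNPs is monomorphic")
    return d / d_max


# Common SNP crossed with a rare SNP: haplotype aB is absent from the sample.
print(d_prime(10, 1200, 0, 790))     # 1
# All four haplotypes observed: intermediate LD.
print(d_prime(600, 400, 400, 600))   # 1/5
```

Whenever exactly one of the four counts is zero and both SNPs are polymorphic, |*D*′| equals 1 regardless of how few copies of the rare allele were observed, which is the situation described above.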

To provide an application of a whole-genome haplotype block partition, we analyzed the data from the North American Rheumatoid Arthritis Consortium (NARAC) dataset using both block partitions: the one obtained with the standard WP method and the one obtained with the AV approach. As observed in previous studies [32, 33], the GWAS results were dominated by the HLA locus on chromosome 6. However, other loci were identified in other chromosomes. Regarding the two block partition methods, the results were very similar, suggesting that the AV approach might be a convenient way to run a fast recognition of the haplotype blocks. However, we recognize that ours was an empirical application based on half a million genotyped SNPs. Results might be different in a larger context, such as that of a GWAS based on the 1000 Genomes dataset, where the number of AV blocks is expected to be much smaller than the number of WP blocks, and the AV blocks are expected to be much larger than the WP ones. Our empirical analysis of the NARAC data also confirmed previous observations that SNP- and block-based analyses are complementary to each other [32, 34]. In fact, in our analysis some loci were identified only by the single-SNP analysis, other loci were identified only by the haplotype-block analysis, and others by both methods. Thus, genome-wide haplotype association scans are not in competition with standard GWAS. They should be considered as complementary tools that may help to identify loci that could be overlooked by methods based on single-SNP analysis. We also observed that haplotype blocks may simplify gene annotation. While only one gene reported by previous GWASs, HLA-DRA [35], was directly implicated by a genome-wide significant SNP, four additional previously reported genes were implicated by genome-wide significant blocks: the APOM, HLA-DQA1, HLA-DRB1, and HLA-DQA2 genes [35].

## Conclusions

We have provided an efficient and scalable haplotype block recognition algorithm, termed MIG ^{++}, which improves the well-known Haploview algorithm by reducing memory complexity from quadratic to linear and by omitting approximately 80% of unnecessary computations. The improved algorithm was able to efficiently process dense genomic segments of any size. When applied to individual-level data, where genotypes are available, the MIG ^{++} efficiency can be exploited to set up haplotype-based (genome-wide) association scans that could account for the correct underlying haplotype distribution. This seems to be especially relevant when rarer variants are involved. If run on summary results from GWAS, the MIG ^{++} could help identify biologically plausible scenarios for SNP-set analysis and could support a more accurate annotation of genes surrounding variants of interest. From a population-genetic point of view, the method could facilitate the comparison of human genomes across different ethnicities, helping to highlight structural differences. Finally, the algorithm opens up the possibility of integrating genome-wide LD-based haplotype block structure into visual assessment tools, thus improving the interpretation of already available, but incomplete, LD heatmaps (Figure 14).

The MIG algorithms are available in the *LDExplorer* R package at http://www.eurac.edu/LDExplorer together with usage instructions and examples. Further improvements will include application of parallel computation techniques to MIG ^{++} in order to further speed up the processing while keeping memory requirements low.

## References

Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES: Linkage disequilibrium in the human genome. Nature. 2001, 411 (6834): 199-204. 10.1038/35075590.

Lewontin RC: The interaction of selection and linkage. I. general considerations; heterotic models. Genetics. 1964, 49: 49-67.

Hill W, Robertson A: Linkage disequilibrium in finite populations. Theo Appl Genet. 1968, 38 (6): 226-231. 10.1007/BF01245622.

Gu S, Pakstis AJ, Kidd KK: HAPLOT: a graphical comparison of haplotype blocks, tagSNP sets and SNP variation for multiple populations. Bioinformatics. 2005, 21 (20): 3938-3939. 10.1093/bioinformatics/bti649.

Pattaro C, Ruczinski I, Fallin D, Parmigiani G: Haplotype block partitioning as a tool for dimensionality reduction in SNP association studies. BMC Genomics. 2008, 9: 405. 10.1186/1471-2164-9-405.

Zapata C: On the uses and applications of the most commonly used measures of linkage disequilibrium from the comparative analysis of their statistical properties. Hum Hered. 2011, 71 (3): 186-195. 10.1159/000327732.

Zhang K, Qin Z, Chen T, Liu JS, Waterman MS, Sun F: HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics. 2005, 21: 131-134. 10.1093/bioinformatics/bth482.

Gibson J, Tapper W, Ennis S, Collins A: Exome-based linkage disequilibrium maps of individual genes: functional clustering and relationship to disease. Hum Genet. 2013, 132 (2): 233-243. 10.1007/s00439-012-1243-6.

Tregouet DA, Konig IR, Erdmann J, Munteanu A, Braund PS, Hall AS, Groszhennig A, Linsel-Nitschke P, Perret C, DeSuremain M, Meitinger T, Wright BJ, Preuss M, Balmforth AJ, Ball SG, Meisinger C, Germain C, Evans A, Arveiler D, Luc G, Ruidavets JB, Morrison C, van der Harst P, Schreiber S, Neureuther K, Schafer A, Bugert P, El Mokhtari NE, Schrezenmeir J, Stark K, et al: Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nat Genet. 2009, 41 (3): 283-285. 10.1038/ng.314.

Lambert JC, Grenier-Boley B, Harold D, Zelenika D, Chouraki V, Kamatani Y, Sleegers K, Ikram MA, Hiltunen M, Reitz C, Mateo I, Feulner T, Bullido M, Galimberti D, Concari L, Alvarez V, Sims R, Gerrish A, Chapman J, Deniz-Naranjo C, Solfrizzi V, Sorbi S, Arosio B, Spalletta G, Siciliano G, Epelbaum J, Hannequin D, Dartigues JF, Tzourio C, Berr C, et al: Genome-wide haplotype association study identifies the FRMD4A gene as a risk locus for Alzheimer's disease. Mol Psychiatry. 2013, 18 (4): 461-470. 10.1038/mp.2012.14.

Song C, Chen GK, Millikan RC, Ambrosone CB, John EM, Bernstein L, Zheng W, Hu JJ, Ziegler RG, Nyante S, Bandera EV, Ingles SA, Press MF, Deming SL, Rodriguez-Gil JL, Chanock SJ, Wan P, Sheng X, Pooler LC, Van Den Berg DJ, Le Marchand L, Kolonel LN, Henderson BE, Haiman CA, Stram DO: A Genome-wide scan for breast cancer risk haplotypes among African American women. PLoS ONE. 2013, 8 (2): e57298. 10.1371/journal.pone.0057298.

Dering C, Hemmelmann C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011, 35 (S1): S12-S17. 10.1002/gepi.20643.

Wang K, Li M, Bucan M: Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet. 2007, 81 (6): 1278-1283. 10.1086/522374.

Petersen A, Alvarez C, DeClaire S, Tintle NL: Assessing methods for assigning SNPs to genes in gene-based tests of association using common variants. PLoS ONE. 2013, 8 (5): e62161. 10.1371/journal.pone.0062161.

Christoforou A, Dondrup M, Mattingsdal M, Mattheisen M, Giddaluru S, Nöthen MM, Rietschel M, Cichon S, Djurovic S, Andreassen OA, Jonassen I, Steen VM, Puntervoll P, Hellard SL: Linkage-disequilibrium-based binning affects the interpretation of GWASs. Am J Hum Genet. 2012, 90 (4): 727-733. 10.1016/j.ajhg.2012.02.025.

Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, et al: Ensembl 2013. Nucl Acids Res. 2013, 41 (D1): D48-D55. 10.1093/nar/gks1236.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet. 2001, 29 (2): 229-232. 10.1038/ng1001-229.

Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BTN, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SPA, Cox DR: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001, 294 (5547): 1719-1723. 10.1126/science.1065573.

Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA. 2002, 99 (11): 7335-7339. 10.1073/pnas.102186799.

Anderson EC, Novembre J: Finding haplotype block boundaries by using the minimum-description-length principle. Am J Human Genet. 2003, 73 (2): 336-354. 10.1086/377106.

Mourad R, Sinoquet C, Leray P: Probabilistic graphical models for genetic association studies. Brief Bioinformatics. 2012, 13: 20-33. 10.1093/bib/bbr015.

Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296 (5576): 2225-2229. 10.1126/science.1069424.

Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21 (2): 263-265. 10.1093/bioinformatics/bth457.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Human Genet. 2007, 81 (3): 559-575. 10.1086/519795.

Zapata C, Alvarez G, Carollo C: Approximate variance of the standardized measure of gametic disequilibrium D′. Am J Hum Genet. 1997, 61 (3): 771-774. 10.1016/S0002-9297(07)64342-0.

Wall JD, Pritchard JK: Assessing the performance of the haplotype block model of linkage disequilibrium. Am J Hum Genet. 2003, 73 (3): 502-515. 10.1086/378099.

The International HapMap Consortium: The international HapMap project. Nature. 2003, 426 (6968): 789-796. 10.1038/nature02168.

The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38: 904-909. 10.1038/ng1847.

Delaneau O, Marchini J, Zagury JF: A linear complexity phasing method for thousands of genomes. Nature Methods. 2011, 9 (2): 179-181. 10.1038/nmeth.1785.

Shim H, Chun H, Engelman C, Payseur B: Genome-wide association studies using single-nucleotide polymorphisms versus haplotypes: an empirical comparison with data from the North American Rheumatoid Arthritis Consortium. BMC Proceedings. 2009, 3 (Suppl 7): S35. 10.1186/1753-6561-3-s7-s35.

Park J, Namkung J, Jhun M, Park T: Genome-wide analysis of haplotype interaction for the data from the North American Rheumatoid Arthritis Consortium. BMC Proceedings. 2009, 3 (Suppl 7): S34. 10.1186/1753-6561-3-s7-s34.

Lorenz AJ, Hamblin MT, Jannink JL: Performance of single nucleotide polymorphisms versus haplotypes for genome-wide association analysis in barley. PLoS ONE. 2010, 5 (11): e14079. 10.1371/journal.pone.0014079.

Hindorff LA, MacArthur J, Morales J, Junkins HA, Hall PN, Klemm AK, Manolio TA: A catalog of published genome-wide association studies. http://www.genome.gov/gwastudies. [Accessed December 7, 2013]

## Acknowledgements

The Genetic Analysis Workshop was funded by NIH grant R01 GM031575. Our work was based on data that was gathered with the support of grants from the National Institutes of Health (NO1-AR-2-2263 and RO1-AR-44422), and the National Arthritis Foundation.

We are grateful to Christian Fuchsberger (University of Michigan), James F. Gusella (Massachusetts General Hospital), Richard H. Myers (Boston University School of Medicine), Francisco Domingues (European Academy of Bozen/Bolzano), Christian X. Weichenberger (European Academy of Bozen/Bolzano), and Cosetta Minelli (Imperial College London) for their valuable comments and helpful discussion.

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

DT developed the MIG algorithms, implemented all algorithms in C++, designed and performed all experiments, and drafted the manuscript. JG supervised the project and revised the manuscript. CP supervised the project, planned the experiments, and revised the manuscript. The authors read and approved the manuscript.

## Electronic supplementary material

### 12859_2013_6262_MOESM1_ESM.pdf

Additional file 1: Appendix. Includes pseudocode for the MIG algorithms, illustrations of chromosomal regions sampling procedure, and figures illustrating various haplotype block properties. (PDF 1 MB)

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Taliun, D., Gamper, J. & Pattaro, C. Efficient haplotype block recognition of very long and dense genetic sequences.
*BMC Bioinformatics* **15**, 10 (2014). https://doi.org/10.1186/1471-2105-15-10

DOI: https://doi.org/10.1186/1471-2105-15-10

### Keywords

- Linkage Disequilibrium
- Haplotype Block
- Strong Linkage Disequilibrium
- Memory Complexity
- Block Partition