Haplotype block definition
The haplotype block recognition algorithm proposed by Gabriel et al. [23] is based on |D′| and its 90% CI, with $C_L$ and $C_U$ being the lower and upper bounds of the CI, respectively. SNP pairs are classified as follows: (1) in strong LD if $C_L \geq 0.7$ and $C_U \geq 0.98$; (2) showing strong evidence of historical recombination (strong EHR) if $C_U < 0.9$; (3) non-informative, otherwise. Informative pairs are those satisfying conditions (1) or (2). A haplotype block was then defined as follows:
Definition 1.
(Haplotype Block). Let $C = \langle g_1, \ldots, g_n \rangle$ be a chromosome of $n$ SNPs, $G = \langle g_i, \ldots, g_j \rangle$ a region of adjacent SNPs in $C$, $l$ the number of strong LD SNP pairs in $G$, and $r$ the number of strong EHR SNP pairs in $G$. Then, $G$ is a haplotype block if
(a) the two outermost SNPs, $g_i$ and $g_j$, are in strong LD, and
(b) at least a proportion $d$ of the informative pairs is in strong LD, i.e., $l / (l + r) \geq d$.
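For illustration, the pair classification underlying this definition can be written as a small helper. This is a minimal sketch (our naming), assuming the CI bounds have already been estimated by some method:

```cpp
// Sketch of the Gabriel et al. [23] pair classification, assuming the
// |D'| 90% CI bounds (cl, cu) have already been estimated.
enum class PairClass { StrongLD, StrongEHR, NonInformative };

PairClass classify(double cl, double cu) {
    if (cl >= 0.7 && cu >= 0.98) return PairClass::StrongLD;   // condition (1)
    if (cu < 0.9)                return PairClass::StrongEHR;  // condition (2)
    return PairClass::NonInformative;                          // condition (3)
}
```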
In their original work, Gabriel et al. [23] set d = 0.95 after investigating the fraction of strong LD SNP pairs in genomic regions of different lengths and in different populations.
The Haploview algorithm [24] performs a haplotype block partition in two steps: (1) all regions satisfying Definition 1 (a) are collected in a set of candidate haplotype blocks; (2) from this set of candidates, a subset of non-overlapping regions that satisfy Definition 1 (b) is selected. In the first step, the entire chromosome is scanned and, for every SNP pair, the |D′| CI is computed and stored in an n × n matrix. The matrix is then traversed to identify the pairs that satisfy Definition 1 (a). These pairs mark regions of different length that are candidates to become haplotype blocks. In the second step, the candidate regions are sorted by decreasing length and processed starting with the largest one. If a region satisfies Definition 1 (b), it is classified as a haplotype block, and all other overlapping candidate regions are discarded. Regions not satisfying Definition 1 (b) are skipped. This process continues with the next largest candidate region, until the candidate set is completely processed and the list of haplotype blocks is complete.
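As an illustration of this second step, the following is a minimal sketch of the greedy selection (our naming; the Definition 1 (b) test is abstracted as a caller-supplied predicate, length is measured in SNPs, and ties are broken arbitrarily):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Region { int start, end; };  // indices of the outermost SNPs

// Greedy selection of non-overlapping blocks (step 2 of Haploview).
// `satisfies_b` is a caller-supplied check of Definition 1 (b).
std::vector<Region> select_blocks(
        std::vector<Region> candidates,
        const std::function<bool(const Region&)>& satisfies_b) {
    // Process candidates by decreasing length.
    std::sort(candidates.begin(), candidates.end(),
              [](const Region& a, const Region& b) {
                  return (a.end - a.start) > (b.end - b.start);
              });
    std::vector<Region> blocks;
    for (const Region& c : candidates) {
        bool overlaps = false;  // discard candidates overlapping a chosen block
        for (const Region& b : blocks)
            if (c.start <= b.end && b.start <= c.end) { overlaps = true; break; }
        if (!overlaps && satisfies_b(c))
            blocks.push_back(c);  // accepted as a haplotype block
    }
    return blocks;
}
```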
The overall complexity of the algorithm is mainly determined by the first step. More specifically, the $\Theta(n^2)$ time and memory complexity is due to the computation and maintenance of the $n \times n$ CI matrix. For this reason, we concentrated our improvements on the first step.
Incremental computation of haplotype blocks
The core ideas of our optimizations are to compute haplotype blocks incrementally and to omit, as soon as possible, regions that cannot be extended to larger blocks due to an insufficient proportion of strong LD SNP pairs. In this way, we avoid both unnecessary computations and the storage of an n × n CI matrix. The incremental haplotype block computation is based on the concepts of a SNP-pair weight and a region weight described below.
Definition 2.
(SNP-pair weight). Let $C$ and $d$ be as defined in Definition 1. For a given pair of SNPs $g_i$ and $g_j$, the SNP-pair weight, $w(i, j)$, is defined as follows:

$$w(i, j) = \begin{cases} 1 - d & \text{if } g_i \text{ and } g_j \text{ are in strong LD,} \\ -d & \text{if } g_i \text{ and } g_j \text{ show strong EHR,} \\ 0 & \text{if } g_i \text{ and } g_j \text{ are non-informative.} \end{cases}$$
Definition 3.
(Region weight). Let $G$ be as defined in Definition 1. The region weight of $G$, $W_{i,j}$, is defined as the sum of all SNP-pair weights in $G$:

$$W_{i,j} = \sum_{i \leq p < q \leq j} w(p, q).$$
The following theorem defines a haplotype block based on the region weight.
Theorem 1.
Let $G$ be as defined in Definition 1. $G$ is a haplotype block if $w(i, j) = 1 - d$ and $W_{i,j} \geq 0$.
Proof
From Definition 2, if SNPs $g_i$ and $g_j$ are in strong LD, then $w(i, j) = 1 - d$. Therefore, Definition 1 (a) is satisfied. $G$ contains $(j - i + 1)(j - i) / 2$ possible SNP pairs, of which $l$ are in strong LD, $r$ show strong EHR, and the remaining ones are non-informative. From Definitions 2 and 3, it follows that $W_{i,j} = l(1 - d) - rd = l - d(l + r)$. If $W_{i,j} \geq 0$, then $l - d(l + r) \geq 0 \Rightarrow l / (l + r) \geq d$. Therefore, Definition 1 (b) is also satisfied.
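For illustration (our numbers): with $d = 0.95$, a region whose informative pairs comprise $l = 19$ strong LD pairs and $r = 1$ strong EHR pair has $W_{i,j} = 19(1 - 0.95) - 1 \cdot 0.95 = 0 \geq 0$ and, equivalently, $l / (l + r) = 19/20 = 0.95 \geq d$, so condition (b) holds exactly at the threshold.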
Theorem 1 is the basis for the incremental haplotype block reconstruction, which is the core of our optimizations. In the following, we present three successive improvements of the Haploview algorithm: a memory-efficient implementation based on the Gabriel et al. [23] definition (MIG); MIG with additional search space pruning (MIG+); and MIG+ with iterative chromosomal processing (MIG++). Theorem 1 ensures that all three algorithms produce block partitions that are identical to the original Haploview results.
The MIG algorithm
For a given chromosomal segment $C$ containing $n$ SNPs, the maintenance of an $n \times n$ matrix containing all the |D′| CIs can be avoided by storing $n$ region weights in a one-dimensional vector $W_{n \times 1}$. In each element of $W$, $W[i]$, we store the weight of a chromosomal region that starts at SNP $g_i$. When the region is enlarged by including additional SNPs to the right of $g_i$, the weight $W[i]$ is updated accordingly. This procedure, illustrated in Figure 1, begins with setting all the weights to 0. At this initial stage, the vector $W$ represents all one-SNP regions. Then, the region starting at SNP $g_1$ is enlarged by including the next SNP, $g_2$. Thus, starting from $g_2$, chromosome $C$ is processed one SNP after the other, from left to right. For a SNP $g_j$, with $j \geq 2$, all SNP-pair weights $w(i, j)$, $i = j - 1, \ldots, 1$, are computed and added up as $s = w(j - 1, j) + \cdots + w(i, j)$.
Both $s$ and $W[i]$ are updated for every computed weight $w(i, j)$. Before the update, $s = w(j - 1, j) + \cdots + w(i + 1, j)$ and $W[i]$ contains the region weight $W_{i,j-1}$, which was already computed for the previous SNP $g_{j-1}$. Then, $s$ is incremented by $w(i, j)$ and $W[i]$ is incremented by the new value of $s$. $W[i]$ now represents the region weight $W_{i,j}$, i.e., $W_{i,j} = W_{i,j-1} + \sum_{p=i}^{j-1} w(p, j)$. Whenever $w(i, j) = 1 - d$ and $W[i] \geq 0$, Theorem 1 is satisfied and the region $\langle g_i, \ldots, g_j \rangle$ is added to the set of candidate haplotype blocks. This procedure is repeated with the next SNP, $g_{j+1}$. An example of the first three computational steps is given in Figure 2. The pseudocode is provided in Algorithm A.1 (Additional file 1).
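A minimal sketch of this scan follows (our naming, not Algorithm A.1 itself); the caller-supplied `classify` stands in for the |D′| CI estimation, and the enum mirrors Definition 2:

```cpp
#include <functional>
#include <utility>
#include <vector>

enum class PairClass { StrongLD, StrongEHR, NonInformative };  // as above

// Sketch of the MIG scan: O(n) memory via the region-weight vector W.
// `classify` stands in for the |D'| CI estimation (WP or AV method).
std::vector<std::pair<int, int>> mig(
        int n, double d,
        const std::function<PairClass(int, int)>& classify) {
    std::vector<double> W(n, 0.0);                // W[i] holds W_{i,j-1}
    std::vector<std::pair<int, int>> candidates;  // candidate blocks <i, j>
    for (int j = 1; j < n; ++j) {                 // left-to-right scan
        double s = 0.0;                           // s = w(j-1,j)+...+w(i,j)
        for (int i = j - 1; i >= 0; --i) {
            PairClass c = classify(i, j);
            double w = (c == PairClass::StrongLD)  ? 1.0 - d
                     : (c == PairClass::StrongEHR) ? -d
                                                   : 0.0;
            s += w;    // running sum of w(p, j) for p = i, ..., j-1
            W[i] += s; // W[i] becomes W_{i,j}
            if (c == PairClass::StrongLD && W[i] >= 0.0)
                candidates.emplace_back(i, j);    // Theorem 1 satisfied
        }
    }
    return candidates;
}
```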
MIG reduces the memory complexity from $\Theta(n^2)$ to $\Theta(n)$. Moreover, instead of identifying candidate regions that satisfy only Definition 1 (a) (as in Haploview), MIG immediately checks both conditions (a) and (b). This yields a smaller set of candidate blocks and therefore also indirectly speeds up the second step of the Haploview algorithm.
The MIG+ algorithm
While MIG drastically reduces the memory requirements by avoiding the maintenance of the CI matrix, it still computes weights for all SNP pairs, totaling $n(n - 1) / 2$ computations as in Haploview. To omit unnecessary computations, we apply search space pruning to the MIG algorithm to identify regions that cannot be further extended to form a haplotype block. The pseudocode is shown in Algorithm A.2 (Additional file 1).
Instead of computing weights for all pairs of SNPs, only the weights $w(j - 1, j), \ldots, w(b, j)$ are computed, where $b = \min(\{i : \hat{W}_{i,j} \geq 0\})$ and $1 \leq b \leq j - 1$. The function $\hat{W}_{i,j}$ is an upper bound for the weight of all regions $\langle g_i, \ldots, g_j, \ldots, g_k \rangle$ that start at $g_i$ and end after $g_j$, i.e., those extending beyond the region $\langle g_i, \ldots, g_j \rangle$. If $\hat{W}_{i,j} < 0$ for some $i$, none of the regions $\langle g_i, \ldots, g_k \rangle$ can satisfy Definition 1 (b). The smallest $i$ that can be a potential starting point of a region with a positive weight can therefore be set as breakpoint $b$. Regions starting left of $b$ and stopping right of $j$ receive negative weights and are discarded (Figure 3, left panel).
The upper bound, $\hat{W}_{i,j}$, is estimated assuming that all unprocessed SNPs to the right of $g_j$ are in strong LD with each other and with all SNPs in the region $\langle g_i, \ldots, g_j \rangle$. Then, $W_{i,k} \leq W_{i,j} + S(1 - d)$, where $W_{i,j}$ is already computed and $S = ((j - i + 1) + (k - i))(k - j) / 2$ is the number of unprocessed SNP pairs. Since $S$ is largest for the longest region $\langle g_i, \ldots, g_n \rangle$, we have $S \leq ((j - i + 1) + (n - i))(n - j) / 2$, and the estimated upper bound is defined as follows:

$$\hat{W}_{i,j} = W_{i,j} + \frac{((j - i + 1) + (n - i))(n - j)}{2}(1 - d).$$
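In code, this bound is a constant-time expression. The following is a minimal sketch under the text's 1-based indexing, assuming the accumulated region weight $W_{i,j}$ is passed in (our naming):

```cpp
// Sketch of the MIG+ pruning bound (1-based SNP indices, as in the text).
// Assumes all unprocessed pairs right of g_j are in strong LD (weight 1-d).
double upper_bound(double W_ij, long i, long j, long n, double d) {
    // Unprocessed pairs when extending <g_i,...,g_j> out to g_n.
    double S = ((j - i + 1) + (n - i)) * (n - j) / 2.0;
    return W_ij + S * (1.0 - d);  // W-hat_{i,j}
}

// During the scan over i = j-1, ..., 1, the breakpoint b is advanced to
// the smallest i with upper_bound(W[i], i, j, n, d) >= 0; smaller i are
// skipped, since they cannot start a block ending at or after g_j.
```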
The MIG+ algorithm performs at most $\lambda n(n - 1) / 2$ computations, where $\lambda$, $1 - d \leq \lambda \leq 1$, depends on the data. The worst case of $\lambda = 1$ occurs only in the unlikely situation in which a few very large blocks span an entire chromosome.
The MIG++ algorithm
A limitation of the MIG+ algorithm is that it is blind to the unprocessed area to the right of the current SNP $g_j$. Assuming strong LD for all SNP pairs in this area results in a conservative upper bound, $\hat{W}_{i,j}$, for the region weights. An additional optimization step yields a more precise estimate of $\hat{W}_{i,j}$ and further prunes unnecessary computations. The pseudocode of the modified algorithm, MIG++, is given in Algorithm A.3 (Additional file 1).
The improved algorithm is an iterative procedure that, at each iteration, scans the chromosome from left to right and computes the weights for only a limited number of SNP pairs. For a SNP $g_j$, the SNP pairs considered in an iteration are restricted to a window of size $win$: only the weights $w(j - 1, j), \ldots, w(t, j)$ are computed, where $t = \max(\{b, j - win\})$ and $1 \leq win \leq n$ (Figure 3, right panel). At each new iteration, the window size is increased by a number of SNPs equal to $win$. Therefore, the number of computed SNP-pair weights increases proportionally. This allows a more precise estimation of the upper bounds for the region weights with every new iteration.
By considering all SNP-pair weights computed in all previous iterations for the estimation of the upper bound, $\hat{W}_{i,j}$, the algorithm would require linear time for each individual SNP pair to sum up all weights inside the corresponding region. We use a computationally cheaper constant-time solution, though it may lead to a less accurate estimation. Since $W_{i,n} \leq W_{i,j} + (j - i + 1)(n - j)(1 - d) + W_{j+1,n}$, where $(j - i + 1)(n - j)$ is the number of pairs crossing position $j$, and since $W_{j+1,n} \leq \hat{W}_{j+1,n}$, an upper bound can be computed as follows:

$$\hat{W}_{i,j} = W_{i,j} + (j - i + 1)(n - j)(1 - d) + \hat{W}_{j+1,n}.$$

$\hat{W}_{j+1,n}$ is computed in linear time after every scan of the chromosome, whereas $\hat{W}_{i,j}$ is computed in constant time. Thus, the computation of the upper bound for each individual SNP pair requires only constant time.
When $win = n$, MIG++ is identical to MIG+. When $win = 1$, the number of iterations becomes too large, introducing a significant computational burden. We propose to set $win = \lceil (n - 1)(1 - d) / 2 \rceil$, which corresponds to a fraction $1 - d$ of all SNP pairs, the minimal fraction of SNP pairs that must be considered before one can be sure that an $n$-SNP segment is not a haplotype block.
MIG++ performs at most $\lambda n(n - 1) / 2$ computations, where $\lambda$, $1 - d \leq \lambda \leq 1$, depends on the data. However, the value of $\lambda$ obtained with the MIG++ algorithm is expected to be always smaller than that from the MIG+ algorithm, because of the more precise estimation of $\hat{W}_{i,j}$.
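For orientation, a small sketch of the proposed window size and the pass structure follows (our naming; the full procedure is Algorithm A.3 in Additional file 1):

```cpp
#include <cmath>

// Proposed window size: win = ceil((n - 1)(1 - d) / 2). Each full pass
// over the chromosome then adds roughly a fraction (1 - d) of all
// n(n - 1)/2 SNP-pair weights.
long window_size(long n, double d) {
    return static_cast<long>(std::ceil((n - 1) * (1.0 - d) / 2.0));
}

// Pass structure (details in Algorithm A.3): pass 1 computes
// w(j-1,j), ..., w(max(b, j-win), j) for every j; pass k widens the
// band to max(b, j - k*win); after every pass, the suffix bounds
// (W-hat_{j+1,n} above) are refreshed in linear time, which tightens
// the breakpoints b before the next pass.
```

For example, with $n = 10{,}000$ and $d = 0.95$, `window_size` returns $\lceil 9{,}999 \times 0.05 / 2 \rceil = 250$.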
Alternative methods to estimate the D′ CI
A critical step of the Gabriel et al. [23] approach is the estimation of the D′ CI. In a genome-wide context, this calculation can be repeated hundreds of millions of times. In Haploview, the CIs are obtained by means of the likelihood-based procedure proposed by Wall and Pritchard [27], which requires from 100 to 1,000 iterations. This method can be replaced with a computationally cheaper solution based on an approximate estimator of the D′ variance, as proposed by Zapata et al. [26]. This solution would make the whole block recognition algorithm significantly faster.
The Wall and Pritchard (WP) method
The true allele frequencies of each SNP are assumed to be equal to the observed allele frequencies. The likelihood of the data in the four-fold table obtained by crossing any SNP pair, conditional on the |D′| value, can be expressed as $l_p = P(\text{data} \mid |D'| = 0.001p)$, evaluated at each grid point $p = 0, 1, \ldots, 1000$. $C_L$ is defined as the largest value of |D′| such that $\sum_{p'=0}^{p} l_{p'} / \sum_{p'=0}^{1000} l_{p'} \leq \alpha / 2$, where $\alpha$ is the significance level. Similarly, $C_U$ is defined as the smallest value of |D′| such that $\sum_{p'=p}^{1000} l_{p'} / \sum_{p'=0}^{1000} l_{p'} \leq \alpha / 2$.
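A sketch of this grid search follows, assuming a caller-supplied likelihood function (our naming: `likelihood(p)` proportional to $P(\text{data} \mid |D'| = 0.001p)$; the tail-handling conventions at the grid boundaries are our simplification):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Sketch of the WP-style CI search on the |D'| grid 0.000, ..., 1.000.
// alpha = 0.1 yields the 90% CI used by Gabriel et al. [23].
std::pair<double, double> wp_ci(const std::function<double(int)>& likelihood,
                                double alpha = 0.1) {
    const int P = 1000;
    std::vector<double> l(P + 1);
    double total = 0.0;
    for (int p = 0; p <= P; ++p) total += (l[p] = likelihood(p));

    // C_L: largest |D'| with at most alpha/2 cumulative mass below it.
    double tail = 0.0;
    int pl = 0;
    for (int p = 0; p <= P && (tail + l[p]) / total <= alpha / 2; ++p) {
        tail += l[p];
        pl = p;
    }
    // C_U: smallest |D'| with at most alpha/2 cumulative mass above it.
    tail = 0.0;
    int pu = P;
    for (int p = P; p >= 0 && (tail + l[p]) / total <= alpha / 2; --p) {
        tail += l[p];
        pu = p;
    }
    return { pl / 1000.0, pu / 1000.0 };  // (C_L, C_U)
}
```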
The approximate variance (AV) method
Consider two SNPs, $u$ and $v$, with alleles $\{u_1, u_2\}$ and $\{v_1, v_2\}$, respectively. Let $n_{u_iv_j}$ and $p_{u_iv_j}$ denote, respectively, the absolute and relative frequencies of the four possible haplotypes $u_iv_j$ ($i, j \in \{1, 2\}$), with $p_{u_i}$ and $p_{v_j}$ being the marginal frequencies of the two SNPs. In total, $n = \sum_{i,j} n_{u_iv_j}$ haplotypes are observed. Zapata et al. [26] showed that the variance of D′, $V(D')$, can be approximated in closed form as a function of $n$, $D_{max}$, a correction term $\Delta$, and frequency terms $f_1$, $f_2$, and $f_3$. Here, $D_{max}$ is $\min(p_{u_1}p_{v_2}, p_{u_2}p_{v_1})$ when $D > 0$ or $\min(p_{u_1}p_{v_1}, p_{u_2}p_{v_2})$ when $D < 0$; the forms of $\Delta$, $f_1$, and $f_2$ depend on the sign of D′, and the form of $f_3$ depends on which of the four products attains $D_{max}$ (the exact expressions are given in Zapata et al. [26]).
When $D' = \pm 1$, then $V(D') = 0$. The $1 - \alpha$ CI of D′ is equal to $D' \pm Z_{\alpha/2}\sqrt{V(D')}$, where $Z_{\alpha/2}$ is the $1 - \alpha / 2$ percentile of the standard normal distribution.
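Given $V(D')$, the CI itself is a one-line computation. This is a minimal sketch (our naming); clamping the interval to $[-1, 1]$ is our assumption, since D′ is bounded:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Sketch of the AV-based CI, assuming V(D') has been computed with the
// Zapata et al. [26] approximation. For the 90% CI (alpha = 0.1),
// Z_{alpha/2} is approximately 1.6449.
std::pair<double, double> av_ci(double dprime, double var_dprime,
                                double z = 1.6449) {
    double half = z * std::sqrt(var_dprime);          // half-width of the CI
    return { std::max(-1.0, dprime - half),           // C_L, clamped
             std::min( 1.0, dprime + half) };         // C_U, clamped
}
```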
Experimental evaluation
The experimental evaluation was based on the phased CEPH genotypes included in the HapMap phase II (HapMapII) [28] and the 1000 Genomes Project phase 1 release 3 (1000G) [29] databases. The HapMapII dataset included 2,543,857 SNPs from 120 haplotypes (60 individuals) and the 1000G dataset included 10,858,788 SNPs from 170 haplotypes (85 individuals).
To compare the new algorithms to the standard Haploview, in terms of runtime and memory usage, the ideal solution would have been to randomly sample regions with different characteristics from the HapMapII or 1000G datasets. However, the Haploview algorithm was so computationally expensive that it prevented us from considering a sufficiently large number of random regions and, therefore, from obtaining a representative sample of all possible scenarios over the whole genome. For this reason, we selected the regions such that the most extreme scenarios, in terms of median SNP minor allele frequency (MAF) and median inter-SNP distance, were covered. To identify such representative regions, we performed a systematic scan of all SNPs in the genome using a sliding window of 1,000 SNPs, after removing chromosomal centromeres and the HLA region. For each window, the median MAF and the median inter-SNP distance were recorded (Additional file 1, Figure A.1). All regions were then represented in a two-dimensional Euclidean space, where the normalized median inter-SNP distance was plotted against the normalized median MAF (Additional file 1, Figure A.2). A total of nine regions were chosen for the experiments: the eight regions located on the outermost boundaries of the Euclidean space and the region closest to the center of the space. These regions represent scenarios with extreme and moderate median MAF and median inter-SNP distance. The procedure was repeated using larger sliding windows of 5,000 to 30,000 SNPs. If not stated otherwise, in the experimental results we report median values over the nine regions for every window size.
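For illustration, the per-window summaries used in this scan can be computed as follows; a minimal sketch (our naming), assuming per-SNP MAFs and base-pair positions are available:

```cpp
#include <algorithm>
#include <vector>

// Median of a copy of the values (sufficient for a sketch).
double median(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    size_t m = v.size() / 2;
    return v.size() % 2 ? v[m] : (v[m - 1] + v[m]) / 2.0;
}

// Summaries recorded for one sliding window of SNPs.
struct WindowSummary { double median_maf, median_gap; };

WindowSummary summarize(const std::vector<double>& maf,   // per-SNP MAF
                        const std::vector<long>& pos) {   // positions in bp
    std::vector<double> gaps;                             // inter-SNP distances
    for (size_t k = 1; k < pos.size(); ++k)
        gaps.push_back(static_cast<double>(pos[k] - pos[k - 1]));
    return { median(maf), median(gaps) };
}
```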
The block partitions obtained with the WP and AV methods for D′ CI estimation were compared in terms of the total number of blocks, the median number of SNPs per block, the proportion of SNPs clustered into blocks, and the median within-block haplotype diversity. Haplotype diversity [19, 20] is defined as the ratio between the number of common haplotypes and the total number of haplotypes within a block. Common haplotypes are those occurring more than once. The haplotype diversity index ranges from 0 (complete diversity) to 1 (no diversity).
The three MIG algorithms were implemented in C++. To guarantee a fair comparison, the original Java implementation of the Haploview algorithm was also rewritten in C++. By default, Haploview considers only SNP pairs within a maximal distance of 500 Kbp. We removed this constraint because it could affect the block partitions of very wide regions. For the WP method, we set the number of likelihood estimation points to 1,000 (100 in the original Haploview implementation). We did not consider the population-specific two-, three-, and four-marker rules proposed by Gabriel et al. [23] for very short regions, because they have no impact on the computational efficiency of the algorithms. All experiments were run on a machine with an Opteron 8356 Quad Core (2.3 GHz) CPU.
Genome-wide association study of rheumatoid arthritis
We applied our haplotype block partitioning algorithm to the genome-wide association study of the North American Rheumatoid Arthritis Consortium (NARAC) dataset. Data consisted of 868 cases and 1,194 controls. The samples were genotyped at 544,917 autosomal and sex chromosome SNPs. Quality control was performed with PLINK 1.07 [25]: we excluded 5,422 SNPs with a call rate of <90%, 11,327 SNPs with a minor allele frequency of <0.001, and 898 SNPs because of significant deviation from Hardy-Weinberg equilibrium in controls (p-value ≤ $10^{-6}$). No samples were excluded because of low call rate (<90%); 2 cases and 5 controls were removed because of sex mismatch; 1 case and 8 controls were additionally excluded after a population stratification test based on principal component analysis performed with EIGENSOFT 5.0.1 [30]. After quality control, 514,539 autosomal SNPs and 2,046 samples were available for analysis.
Haplotypes were phased using SHAPEIT version 2 [31]. To achieve good accuracy, we set 400 conditioning states per SNP. Recombination rates were taken from HapMap phase II build 36, and the effective population size was set to 11,418 (as suggested for CEU populations). The estimated haplotypes were submitted to MIG++ and processed with the WP and AV methods. We obtained 98,979 WP blocks, covering 445,832 SNPs, with 68,707 singleton SNPs outside of any block. The AV method identified 97,816 blocks, covering 446,170 SNPs, and 68,369 singleton SNPs.
The genome-wide association scan was based on a logistic regression model adjusted for sex and the top 10 eigenvectors obtained from EIGENSOFT 5.0.1 [30]. The association between disease status and individual SNPs or haplotype blocks was tested with a likelihood ratio test using PLINK 1.07 [25] with the logistic-genotypic and omnibus options, respectively. Within each block, haplotypes with a frequency of <0.01 were collapsed together to preserve power. Singleton SNPs outside blocks were treated as in the SNP-based analysis, therefore producing analogous results. Genomic control (GC) correction was applied to both the SNP- and block-based GWAS results. Bonferroni-corrected significance thresholds were set to $2.98 \times 10^{-7}$ for the analysis based on the WP block partition (i.e., 0.05 divided by the sum of 98,979 WP blocks and 68,707 singleton SNPs), $3.01 \times 10^{-7}$ for the analysis based on the AV method (i.e., 0.05 divided by 97,816 AV blocks plus 68,369 singleton SNPs), and $9.72 \times 10^{-8}$ (i.e., 0.05 / 514,539) for the individual SNP analysis.