Volume 10 Supplement 1

## Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

- Research
- Open Access

# A fast algorithm for genome-wide haplotype pattern mining

- Søren Besenbacher
^{1, 2}Email author, - Christian NS Pedersen
^{1, 2}and - Thomas Mailund
^{1}

**10 (Suppl 1)**:S74

https://doi.org/10.1186/1471-2105-10-S1-S74

© Besenbacher et al; licensee BioMed Central Ltd. 2009

**Published:**30 January 2009

## Abstract

### Background

Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a diseases. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The *Haplotype Pattern Mining* (HPM) method is a machine learning approach to do exactly this.

### Results

We present a new, faster algorithm for the HPM method. The new approach use patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method with a factor of 2 on a genome-wide dataset with 5009 individuals typed in 491208 markers using default parameters and more if the pattern length is increased.

### Conclusion

The new algorithm speeds up the HPM method and we show that it is feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.

## Keywords

- Association Mapping
- Haplotype Diversity
- Single Nucleotide Polymorphism Marker
- Pattern Length
- Distinct Haplotype

## Background

Identifying the genetic causes of common diseases has long been an important research area in genetics. Where early studies were limited to studying few genes at a time, due to economical and technological constraints, development in genotyping technology has revolutionised the field. It is now cost effective to obtain hundreds of thousands of genotype markers in thousands of individuals for a single study. This makes it possible to scan the entire genome for disease associated markers in a single analysis and such genome-wide association studies have recently lead to a virtual flood of newly discovered disease genes [1–7].

Most studies search for disease association through a marker-by-marker approach where each marker in turn is tested for association to the disease phenotype, e.g. using a simple Fisher's exact test or a *χ*^{2}-test. However, a marker by marker approach is limited in statistical power due to the indirect testing for association, where so called "tag SNPs" are used as proxies for unobserved markers, but by using multiple markers, this problem can be alleviated [8, 9]. A tradeoff must be made between method sophistication and computation efficiency when developing multi-marker approaches, however.

The Haplotype Pattern Mining method is a multi-marker approach introduced in 2000 by Toivonen *et al.* [10, 11]. It is based on the idea of extracting local haplotype similarities and locating areas where haplotypes are correlated with the disease phenotype. Compared to methods based on statistical sampling [12–17] HPM is computationally much more efficient, similar to other heuristic approaches [18–20] capable of analysing genome-wide datasets. In this paper, we develop a faster version of HPM and show that it scales to genome-wide association studies.

## Methods

The goal of association mapping is to find disease-predisposing regions of the genome. This can be done by looking for differences in the frequency of genetic variants between cases and controls. Since genome sequencing is expensive the whole genomes of the case and control individuals in a case-control study are usually not sequenced. Instead only single base pairs that are known to frequently differ between humans, called SNP markers, are sequenced.

### The association mapping problem

If *k* SNP markers are typed then we can represent a chromosome by a haplotype vector H of length *k*, where *H* = (*h*_{1},..., *h*_{
k
}) and *h*_{
i
}∈ *alleles*(*i*) for all *i*, 1 ≤ *i* ≤ *k*; *alleles*(*i*) is the domain of the *i* th marker. The input to an association mapping method then consists of a set *A* = {*A*_{1},..., *A*_{
p
}} of disease-associated haplotypes and a set *C* = {*C*_{1} ... *C*_{
q
}} of control haplotypes.

### Haplotype pattern

A haplotype pattern P over *k* markers is a vector (*p*_{1} ... *p*_{
k
}), where *p*_{
i
}∈ *alleles*(*i*) ∪ {*} for all *i*, 1 ≤ *i* ≤ *k*, where * is the "dont't care" symbol. The haplotype pattern occurs in a given haplotype vector (chromosome) *H* = (*h*_{1},... *h*_{
k
}) if either *p*_{
i
}= *h*_{
i
}or *p*_{
i
}= * for all *i*, 1 ≤ *i* ≤ *k*. The length of a pattern is defined as the maximum distance between two non-"*" characters in the pattern. Gaps are subsequences of "don't care" symbols in a pattern that are surrounded by non-"*" characters on both sides. Since long patterns are not likely to exist we only want look at a subset of the possible patterns. We call the patterns that we want to consider for legal patterns. A pattern is legal if the pattern length is less than the parameter *l*, it contains fewer than *g* gaps, and no gaps are longer than *s*.

### Strongly associated pattern

The signed *χ*^{2} measure ± *χ*^{2}(*P*) of a haplotype pattern P is the standard *χ*^{2} measure where the sign is positive if the relative frequency of *P* is higher in cases than in controls, and negative otherwise. Given a positive association threshold *x*, we say that *P* is strongly associated with the disease if ± *χ*^{2}(*P*) ≥ *x*.

### The HPM problem

Given a set of case haplotypes *A* = {*A*_{1},..., *A*_{
p
}} and control haplotypes *C* = {*C*_{1} ... *C*_{
q
}} the goal of the HPM algorithm is to find all strongly associated patterns that are legal.

### Localizing disease genes using HPM

#### Old algorithm

The algorithm presented in [11] recursively generates haplotype patterns using a depth-first-search strategy. To avoid looking at all possible patterns the algorithm prunes away parts of the search tree based on a lower bound on the number of disease-associated chromosomes that match a pattern.

*depthFirst*will take time

*O*(

*n*·

*l*) where

*n*is the number of individuals and

*l*is the length of the pattern. If we remember which individuals match the pattern at a given time then we only need to look through these when a new non-"*" symbol is inserted in the pattern. Pseudo code for the algorithm with this improvement is shown in fig. 2. The improvement greatly speeds up the algorithm.

#### New algorithm

The idea of the new algorithm is to exploit that LD structure means that you usually only see a handful of the 2^{
n
}possible haplotypes if you look at *n* neighboring SNPs. Instead of looking at all different haplotype patterns spanning a region we look at all combinations of haplotypes over the region. We search these haplotype sets in a depth-first-search but stop examining a branch if there is no legal haplotype pattern that could occur in all of the haplotypes in the current set.

### Induced pattern

Given a set of haplotypes *h*_{1} ... *h*_{
k
}the induced pattern of the set is the haplotype pattern that occurs in all of the haplotypes and contains fewest possible "*"("don't care") symbols.

An induced pattern over a set of haplotypes that is not legal can sometimes be made legal by inserting extra "*" symbols if *s* > 2. This happens if a pattern is illegal because it contains too many gaps but would become legal if two gaps were joined into one. If for example *l* = 5, *g* = 1 and *s* = 3 then "0 * 1 * 0" is an illegal pattern because it contains more than *g* gaps. The pattern can however be made legal by substituting the "1" for a "*" yielding the pattern "0 * * * 0".

### Valid pattern

An induced pattern over a set *S* of haplotypes is said to be valid with regard to *S*, if the pattern occurs in all of the haplotypes in *S* but not in any of the other haplotypes found in the input data.

### Equivalent pattern

A haplotype pattern will split the set of individuals into those that match the patterns and those that do not. We say that two patterns are equivalent if they result in the same bipartitions of the set of individuals.

### The algorithm

## Results and discussion

We have implemented both the old and the new algorithm in Python using the *SNPfile* library [21] to read and store the input data. To evaluate the algorithms, we have used the Crohn's disease data set from the Wellcome Trust Case-Control Consortium (WTCCC) [4]. This data set contains 491208 markers in 2005 disease affected individuals and 3004 unaffected control individuals. We used the *Beagle* [22] program to phase the haplotypes and infer missing genotypes.

### Time vs. number of individuals

### Time vs. pattern length

*l*. Figure 5 show the running time of the two algorithms as a function

*l*, when analysing the full chromosome 22 from the WTCCC data set. The running time of both algorithms clearly grows super-linear, but with the time for the new algorithm clearly growing slower.

### Time vs. haplotype diversity

*l*= 11) and the number of haplotypes along chromosome 22 of the WTCCC data: The plot on the left shows both the running time per marker (the time to test all patterns beginning in a given marker) together with the number of distinct haplotypes starting in a given marker. Figure 7 shows the running time for scoring a marker as a function of the number of unique haplotypes overlapping the marker.

The same dependency on haplotype diversity is not seen for the old algorithm (results not shown), nor is it expected to be as the old algorithm does not depend on the number of distinct haplotypes seen in the data. Instead, the running time could depend on the maximal score we see when scoring a marker, since this is the threshold used in the branch and bounds heuristic. From the data, however, we do not see a significant effect here.

### Genome-wide analysis

Time per chromosome. Table showing the time it took to analyze the different chromosomes of the WTCCC Crohn's disease data set. With the following parameters: l = 7, g = 2, s = 2, x = 20.

Crohn's disease running times | |||
---|---|---|---|

Chromosome | # markers | Original HPM (with optimisation) | New HPM |

1 | 40220 | 9 hours, 16 minutes | 4 hours, 51 minutes |

2 | 41400 | 9 hours, 37 minutes | 4 hours, 54 minutes |

3 | 33799 | 7 hours, 48 minutes | 4 hours, 1 minutes |

4 | 32334 | 7 hours, 31 minutes | 3 hours, 55 minutes |

5 | 32056 | 7 hours, 20 minutes | 3 hours, 59 minutes |

6 | 31470 | 7 hours, 16 minutes | 3 hours, 48 minutes |

7 | 25835 | 5 hours, 52 minutes | 3 hours, 09 minutes |

8 | 27457 | 6 hours, 11 minutes | 3 hours, 21 minutes |

9 | 22864 | 5 hours, 13 minutes | 2 hours, 50 minutes |

10 | 28501 | 6 hours, 35 minutes | 3 hours, 25 minutes |

11 | 26273 | 6 hours, 8 minutes | 3 hours, 11 minutes |

12 | 24954 | 5 hours, 38 minutes | 3 hours, 5 minutes |

13 | 19188 | 4 hours, 19 minutes | 2 hours, 19 minutes |

14 | 15721 | 3 hours, 37 minutes | 1 hour, 54 minutes |

15 | 14355 | 3 hours, 16 minutes | 1 hour, 47 minutes |

16 | 15308 | 3 hours, 35 minutes | 1 hour, 53 minutes |

17 | 11281 | 2 hours, 37 minutes | 1 hour, 23 minutes |

18 | 14881 | 3 hours, 28 minutes | 1 hour, 44 minutes |

19 | 6399 | 1 hours, 28 minutes | 43 minutes |

20 | 12400 | 2 hours, 53 minutes | 1 hour, 23 minutes |

21 | 7125 | 1 hour, 38 minutes | 51 minutes |

22 | 6207 | 1 hour, 27 minutes | 44 minutes |

## Conclusion

We have developed a new algorithm for the haplotype pattern mining method and shown that it outperforms the original algorithm on genome wide association data. As a function of the number of individuals or the maximal pattern length, both the new and old algorithm appears to have the same asymptotic running time, with the new algorithm having a significantly smaller time overhead.

The new algorithm is very sensitive to the haplotype diversity. The same is not the case for the old algorithm, but here the mean running time per marker is 8.8 ± 0.57 seconds (with pattern length *l* = 11) where for the new algorithm the mean running time per marker is 2.5 ± 9.6 seconds. It might therefore be worthwhile to use a hybrid algorithm where the new algorithm is used in areas with lower haplotype diversity and the old algorithm is used in areas with high haplotype diversity. If this would reduce the time usage on markers now taking more than 4 seconds to only 3, the hybrid algorithm would spend 1.4 ± 0.63 seconds per marker.

## Declarations

### Acknowledgements

TM is funded by the Danish Research Council, grant FNU-272-07-0380.

This article has been published as part of *BMC Bioinformatics* Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

## Authors’ Affiliations

## References

- Arking DE, Pfeufer A, Post W, Kao WH, Newton-Cheh C, Ikeda M, West K, Kashuk C, Akyol M, Perz S, Jalilzadeh S, Illig T, Gieger C, Guo CY, Larson MG, Wichmann HE, Marban E, O'donnell CJ, Hirschhorn JN, Kaab S, Spooner PM, Meitinger T, Chakravarti A: A common genetic variant in the NOS1 regulator NOS1AP modulates cardiac repolarization.
*Nat Genet*2006, 38(6):644–651.View ArticlePubMedGoogle Scholar - Smyth DJ, Cooper JD, Bailey R, Field S, Burren O, Smink LJ, Guja C, Ionescu-Tirgoviste C, Widmer B, Dunger DB, Savage DA, Walker NM, Clayton DG, Todd JA: A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region.
*Nat Genet*2006, 38(6):617–619.View ArticlePubMedGoogle Scholar - Amundadottir LT, Sulem P, Gudmundsson J, Helgason A, Baker A, Agnarsson BA, Sigurdsson A, Benediktsdottir KR, Cazier JB, Sainz J, Jakobsdottir M, Kostic J, Magnusdottir DN, Ghosh S, Agnarsson K, Birgisdottir B, Le Roux L, Olafsdottir A, Blondal T, Andresdottir M, Gretarsdottir OS, Bergthorsson JT, Gudbjartsson D, Gylfason A, Thorleifsson G, Manolescu A, Kristjansson K, Geirsson G, Isaksson H, Douglas J, Johansson JE, Balter K, Wiklund F, Montie JE, Yu X, Suarez BK, Ober C, Cooney KA, Gronberg H, Catalona WJ, Einarsson GV, Barkardottir RB, Gulcher JR, Kong A, Thorsteinsdottir U, Stefansson K: A common variant associated with prostate cancer in European and African populations.
*Nat Genet*2006, 38(6):652–658.View ArticlePubMedGoogle Scholar - Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls Nature 2007, 447(7145):661–78.Google Scholar
- Gudmundsson J, Sulem P, Manolescu A, Amundadottir LT, Gudbjartsson D, Helgason A, Rafnar T, Bergthorsson JT, Agnarsson BA, Baker A, Sigurdsson A, Benediktsdottir KR, Jakobsdottir M, Xu J, Blondal T, Kostic J, Sun J, Ghosh S, Stacey SN, Mouy M, Saemundsdottir J, Backman VM, Kristjansson K, Tres A, Partin AW, Albers-Akkers MT, Godino-Ivan Marcos J, Walsh PC, Swinkels DW, Navarrete S, Isaacs SD, Aben KK, Graif T, Cashy J, Ruiz-Echarri M, Wiley KE, Suarez BK, Witjes JA, Frigge M, Ober C, Jonsson E, Einarsson GV, Mayordomo JI, Kiemeney LA, Isaacs WB, Catalona WJ, Barkardottir RB, Gulcher JR, Thorsteinsdottir U, Kong A, Stefansson K: Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24.
*Nat Genet*2007, 39(5):631–7.View ArticlePubMedGoogle Scholar - Gudmundsson J, Sulem P, Steinthorsdottir V, Bergthorsson JT, Thorleifsson G, Manolescu A, Rafnar T, Gudbjartsson D, Agnarsson BA, Baker A, Sigurdsson A, Benediktsdottir KR, Jakobsdottir M, Blondal T, Stacey SN, Helgason A, Gunnarsdottir S, Olafsdottir A, Kristinsson KT, Birgisdottir B, Ghosh S, Thorlacius S, Magnusdottir D, Stefansdottir G, Kristjansson K, Bagger Y, Wilensky RL, Reilly MP, Morris AD, Kimber CH, Adeyemo A, Chen Y, Zhou J, So WY, Tong PC, Ng MC, Hansen T, Andersen G, Borch-Johnsen K, Jorgensen T, Tres A, Fuertes F, Ruiz-Echarri M, Asin L, Saez B, van Boven E, Klaver S, Swinkels DW, Aben KK, Graif T, Cashy J, Suarez BK, van Vierssen Trip O, Frigge ML, Ober C, Hofker MH, Wijmenga C, Christiansen C, Rader DJ, Palmer CN, Rotimi C, Chan JC, Pedersen O, Sigurdsson G, Benediktsson R, Jonsson E, Einarsson GV, Mayordomo JI, Catalona WJ, Kiemeney LA, Barkardottir RB, Gulcher JR, Thorsteinsdottir U, Kong A, Stefansson K: Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes.
*Nat Genet*2007, 39(8):977–83.View ArticlePubMedGoogle Scholar - Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PI, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, Hughes TE, Groop L, Altshuler D, Almgren P, Florez JC, Meyer J, Ardlie K, Bengtsson Bostrom K, Isomaa B, Lettre G, Lindblad U, Lyon HN, Melander O, Newton-Cheh C, Nilsson P, Orho-Melander M, Rastam L, Speliotes EK, Taskinen MR, Tuomi T, Guiducci C, Berglund A, Carlson J, Gianniny L, Hackett R, Hall L, Holmkvist J, Laurila E, Sjogren M, Sterner M, Surti A, Svensson M, Tewhey R, Blumenstiel B, Parkin M, Defelice M, Barry R, Brodeur W, Camarata J, Chia N, Fava M, Gibbons J, Handsaker B, Healy C, Nguyen K, Gates C, Sougnez C, Gage D, Nizzari M, Gabriel SB, Chirn GW, Ma Q, Parikh H, Richardson D, Ricke D, Purcell S: Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels.
*Science*2007, 316(5829):1331–6.View ArticlePubMedGoogle Scholar - Pe'er I, de Bakker PI, Maller J, Yelensky R, Altshuler D, Daly MJ: Evaluating and improving power in whole-genome association studies using fixed marker sets.
*Nat Genet*2006, 38(6):663–667.View ArticlePubMedGoogle Scholar - de Bakker PIW, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D: Efficiency and power in genetic association studies.
*Nat Genet*2005, 37(11):1217–1223.View ArticlePubMedGoogle Scholar - Toivonen HT, Onkamo P, Vasko K, Ollikainen V, Sevon P, Mannila H, Herr M, Kere J: Data mining applied to linkage disequilibrium mapping.
*Am J Hum Genet*2000, 67: 133–145.PubMed CentralView ArticlePubMedGoogle Scholar - Toivonen HTT, Onkamo P, Vasko K, Ollikainen V, Sevon P, Mannila H, Kere J: Gene Mapping by Haplotype Pattern Mining. In
*IEEE International Symposium on Bio-Informatics and Biomedical Engineering (BIBE'00)*. Los Alamitos, CA, USA: IEEE Computer Society; 2000:99.View ArticleGoogle Scholar - Liu JS, Sabatti C, Teng J, Keats BJ, Risch N: Bayesian analysis of haplotypes for linkage disequilibrium mapping.
*Genome Res*2001, 11(10):1716–1724.PubMed CentralView ArticlePubMedGoogle Scholar - Morris AP, Whittaker JC, Balding DJ: Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies.
*Am J Hum Genet*2002, 70(3):686–707.PubMed CentralView ArticlePubMedGoogle Scholar - Larribe F, Lessard S, Schork NJ: Gene mapping via the ancestral recombination graph.
*Theor Popul Biol*2002, 62(2):215–229.View ArticlePubMedGoogle Scholar - Molitor J, Marjoram P, Thomas D: Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques.
*Am J Hum Genet*2003, 73(6):1368–1384.PubMed CentralView ArticlePubMedGoogle Scholar - Zöllner S, Pritchard JK: Coalescent-based association mapping and fine mapping of complex trait loci.
*Genetics*2005, 169(2):1071–1092.PubMed CentralView ArticlePubMedGoogle Scholar - Minichiello MJ, Durbin R: Mapping Trait Loci by Use of Inferred Ancestral Recombination Graphs.
*American Journal of Human Genetics*2006, 79(5):910–922.PubMed CentralView ArticlePubMedGoogle Scholar - Mailund T, Besenbacher S, Schierup MH: Whole genome association mapping by incompatibilities and local perfect phylogenies.
*BMC Bioinformatics*2006, 7: 454.PubMed CentralView ArticlePubMedGoogle Scholar - Li J, Jiang T: Haplotype-based linkage disequilibrium mapping via direct data mining.
*Bioinformatics*2005, 21(24):4384–4393.View ArticlePubMedGoogle Scholar - Browning SR: Multilocus association mapping using variable-length Markov chains.
*Am J Hum Gen*2006, 78(6):903–913.View ArticleGoogle Scholar - Nielsen J, Mailund T: The SNPFile library.[http://www.birc.au.dk/~mailund/SNPFile/]
- Browning SR, Browning BL: Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering.
*Am J Hum Genet*2007, 81(5):1084–1097.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.