Skip to main content

NullHap – a versatile application to estimate haplotype frequencies from unphased genotypes in the presence of null alleles

Abstract

Background

Laboratory techniques used to determine haplotypes are often too expensive for large-scale studies and lack of phase information is commonly overcome using likelihood-based calculations. Whereas a number of programs are available for that purpose, none of them can handle loci with both multiple and null alleles.

Results

Here we present a description of a modified Expectation – Maximization algorithm as well as its implementation (NullHap) which allow to effectively overcome these limitations. As an example of application we used Nullhap to reanalyze published data on distribution of KIR genotypes in Polish psoriasis patients and controls showing that the KIR2DS4/1D locus may be a marker of KIR2DS1 haplotypes with different effects on disease susceptibility.

Conclusion

The developed application can estimate haplotype frequencies for every type of polymorphism and can effectively be used in genetic research as illustrated by a novel finding regarding the genetic susceptibility to psoriasis.

Background

Laboratory techniques used to determine haplotypes [1] are often too expensive for large-scale studies. The lack of phase information provided by the popular typing methods could be overcome using likelihood-based calculations [2], which estimate haplotype frequencies in a population, and reconstruct the haplotype pair in each individual. This approach is more cost-effective and powerful than linkage analysis [3], and gives more information than single marker-based methods [4].

Haplotype estimation procedures typically use maximum likelihood approach. The most popular algorithm implemented for example in Arlequin [5] is The Expectation – Maximization algorithm (EM) [6] but other methods were also proposed: Bayesian method using a pseudo-Gibbs sampler [7], partition-ligation [8], Monte Carlo [9] and Hidden Markov Model [10].

A frequent shortage of available software packages [5, 7] is the lack of possibility to analyze loci where null variants occur with an appreciable frequency. In a diploid organism, a null allele is a variant which is not detected in genotyping, because of a deletion of an entire locus or because of a mutation interfering with analysis. This makes it impossible to distinguish between some heterozygous and homozygous genotypes [11]. For example, if there is only one alternative allele A1 besides the null allele A0, then there are three possible haplotype pairs: A1/A1, A1/A0 and A0/A0, but only two kinds of experimental observations: A0 and A1. An example of a genetic system, which is at present intensely studied [11] and which contains null alleles, is the locus encoding killer immunoglobulin-like receptors (KIR) of natural killer (NK) cells.

To our knowledge, the only available computer program designed to handle null alleles is Haplo-IHP [12], which, however, has a shortcoming of being applicable only to biallelic loci. The purpose of our work was to design a versatile application for estimation of haplotypes from unphased population data useful for multiallelic polymorphism with and/or without null alleles.

Implementation

The null variants decrease the number of different genotypes G which can be observed, equation (1), when the k polymorphic loci are analyzed and each locus has l i different variants (optionally including a null variant) for i-th locus, δ i = 1 if i-th locus has null allele, otherwise δ i = 0. The number of haplotypes is H = i = 1 k l i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGibascqGH9aqpdaqeWaqaaiabdYgaSnaaBaaaleaacqWGPbqAaeqaaaqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHpis1aaaa@3792@ .

G = i = 1 k ( l i δ i ) ( l i + 1 δ i ) + 2 δ i 2 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGhbWrcqGH9aqpdaqeWbqcfayaamaalaaabaGaeiikaGIaemiBaW2aaSbaaeaacqWGPbqAaeqaaiabgkHiTiabes7aKnaaBaaabaGaemyAaKgabeaacqGGPaqkcqGGOaakcqWGSbaBdaWgaaqaaiabdMgaPbqabaGaey4kaSIaeGymaeJaeyOeI0IaeqiTdq2aaSbaaeaacqWGPbqAaeqaaiabcMcaPiabgUcaRiabikdaYiabes7aKnaaBaaabaGaemyAaKgabeaaaeaacqaIYaGmaaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGRbWAa0Gaey4dIunaaaa@4ED2@
(1)

The average number of haplotype resolutions which give genotype j (when phase information is lost) grows exponentially with the number of observed loci, thus full space search algorithm cannot be used to find the best haplotype frequencies. The equation (2) provides the number of haplotype resolutions r j which give genotype j, where s j is the number of observed heterozygous and t j is the number of observed (not null) alleles for loci with null allele(s).

r j = { 2 s j 1 3 t j for  s j > 0 3 t j + 1 2 for  s j = 0 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGYbGCpaWaaSbaaSqaa8qacqWGQbGAa8aabeaak8qacqGH9aqpdaGabaqaauaabaqaciaaaeaacqaIYaGmdaahaaWcbeqaaiabdohaZnaaBaaameaacqWGQbGAaeqaaSGaeyOeI0IaeGymaedaaOGaeG4mamZaaWbaaSqabeaacqWG0baDdaWgaaadbaGaemOAaOgabeaaaaaakeaacqqGMbGzcqqGVbWBcqqGYbGCcqqGGaaicqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabg6da+iabicdaWaqcfayaamaalaaabaGaeG4mamZaaWbaaeqabaGaemiDaq3aaSbaaeaacqWGQbGAaeqaaaaacqGHRaWkcqaIXaqmaeaacqaIYaGmaaaakeaacqqGMbGzcqqGVbWBcqqGYbGCcqqGGaaicqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabg2da9iabicdaWaaaaiaawUhaaaaa@5713@
(2)

Maximum likelihood approach to estimate haplotypes

In the maximum likelihood approach haplotype frequencies h i are estimated to maximize the probability of the given sample of genotyping data. The sample of genotyping data from n individuals is simplified to a vector S = (n1, n2,..., n G ), where G is the number of different genotyping data (with a lack of phase information, equation (1)), and n j is the number of individuals having j-th genotype, j = 0 G n j = n MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qadaaeWaqaaiabd6gaUnaaBaaaleaacqWGQbGAaeqaaOGaeyypa0JaemOBa4galeaacqWGQbGAcqGH9aqpcqaIWaamaeaacqWGhbWra0GaeyyeIuoaaaa@37C2@ .

The conditional probability of sample S, given each genotype probability g i , and assuming unrelatedness of individuals in the sample is provided in equation (3), where α does not depend on g j .

P ( S | g 1 , g 2 , ... , g G ) = n ! n 1 ! n 2 ! ... n G ! j = 1 G g j n j = α j = 1 G g j n j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGqbaucqGGOaakcqWGtbWucqGG8baFcqWGNbWzpaWaaSbaaSqaa8qacqaIXaqma8aabeaak8qacqGGSaalcqWGNbWzpaWaaSbaaSqaa8qacqaIYaGma8aabeaak8qacqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWGNbWzpaWaaSbaaSqaa8qacqWGhbWra8aabeaak8qacqGGPaqkcqGH9aqpjuaGdaWcaaWdaeaapeGaemOBa4Maeiyiaecapaqaa8qacqWGUbGBpaWaaSbaaeaapeGaeGymaedapaqabaWdbiabcgcaHiabd6gaU9aadaWgaaqaa8qacqaIYaGma8aabeaapeGaeiyiaeIaeiOla4IaeiOla4IaeiOla4IaemOBa42damaaBaaabaWdbiabdEeahbWdaeqaa8qacqGGHaqiaaGcdaqeWbWdaeaapeGaem4zaC2aa0baaSqaaiabdQgaQbqaaiabd6gaUnaaBaaameaacqWGQbGAaeqaaaaaaSWdaeaapeGaemOAaOMaeyypa0JaeGymaedapaqaa8qacqWGhbWra0Gaey4dIunakiabg2da9iabeg7aHnaarahapaqaa8qacqWGNbWzdaqhaaWcbaGaemOAaOgabaGaemOBa42aaSbaaWqaaiabdQgaQbqabaaaaaWcpaqaa8qacqWGQbGAcqGH9aqpcqaIXaqma8aabaWdbiabdEeahbqdcqGHpis1aaaa@6D1C@
(3)

The frequency of genotype g j is the sum of frequencies of respective haplotype pairs z mn , and with Hardy-Weinberg equilibrium (HWE) assumption, it is calculated from haplotype frequencies as shown in equation (4), where z mn is the frequency of haplotype pair m and n, r j is the number of haplotype pairs for the j-th genotype (equation 2), and h m , h n are the frequencies of haplotypes m and n respectively.

g j = i = 0 r j z m n , where z m n = { h m 2 for  m = n 2 h m h n for  m n MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qafaqabeqadaaabaGaem4zaC2damaaBaaaleaapeGaemOAaOgapaqabaGcpeGaeyypa0ZaaabCa8aabaWdbiabdQha69aadaWgaaWcbaWdbiabd2gaTjabd6gaUbWdaeqaaaqaa8qacqWGPbqAcqGH9aqpcqaIWaama8aabaWdbiabdkhaY9aadaWgaaadbaWdbiabdQgaQbWdaeqaaaqdpeGaeyyeIuoakiabcYcaSaqaaiabbEha3jabbIgaOjabbwgaLjabbkhaYjabbwgaLbqaaiabdQha69aadaWgaaWcbaWdbiabd2gaTjabd6gaUbWdaeqaaOGaeyypa0ZaaiqaaeaafaqaaeGacaaabaGaemiAaG2aa0baaSqaaiabd2gaTbqaaiabikdaYaaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabd2gaTjabg2da9iabd6gaUbqaaiabikdaYiabdIgaOnaaBaaaleaacqWGTbqBaeqaaOGaemiAaG2aaSbaaSqaaiabd6gaUbqabaaakeaacqqGMbGzcqqGVbWBcqqGYbGCcqqGGaaicqWGTbqBcqGHGjsUcqWGUbGBaaaacaGL7baaaaaaaa@69B1@
(4)

The estimation of haplotype frequencies to maximize the probability of the observed sample can be described as optimization, the equation (5) summarizes the considered approach.

arg max h 1 , h 2 , ... , h H P ( S | h 1 , h 2 , ... , h H ) = arg max h 1 , h 2 , ... , h H j = 1 G ( i = 0 r j z m n ) n j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeqaaaqaamaaxababaGagiyyaeMaeiOCaiNaei4zaCMagiyBa0MaeiyyaeMaeiiEaGhaleaaqaaaaaaaaaWdbiabdIgaO9aadaWgaaadbaWdbiabigdaXaWdaeqaaSWdbiabcYcaSiabdIgaO9aadaWgaaadbaWdbiabikdaYaWdaeqaaSWdbiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIgaO9aadaWgaaadbaWdbiabdIeaibWdaeqaaaWcbeaaaaGcpeGaemiuaaLaeiikaGIaem4uamLaeiiFaWNaemiAaG2damaaBaaaleaapeGaeGymaedapaqabaGcpeGaeiilaWIaemiAaG2damaaBaaaleaapeGaeGOmaidapaqabaGcpeGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemiAaG2damaaBaaaleaapeGaemisaGeapaqabaGcpeGaeiykaKIaeyypa0ZaaCbeaeaacyGGHbqycqGGYbGCcqGGNbWzcyGGTbqBcqGGHbqycqGG4baEaSqaaiabdIgaO9aadaWgaaadbaWdbiabigdaXaWdaeqaaSWdbiabcYcaSiabdIgaO9aadaWgaaadbaWdbiabikdaYaWdaeqaaSWdbiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIgaO9aadaWgaaadbaWdbiabdIeaibWdaeqaaaWcpeqabaGcdaqeWbWdaeaapeGaeiikaGcal8aabaWdbiabdQgaQjabg2da9iabigdaXaWdaeaapeGaem4raCeaniabg+GivdGcdaaeWbWdaeaapeGaemOEaO3damaaBaaaleaapeGaemyBa0MaemOBa4gapaqabaaabaWdbiabdMgaPjabg2da9iabicdaWaWdaeaapeGaemOCai3damaaBaaameaapeGaemOAaOgapaqabaaan8qacqGHris5aOGaeiykaKYdamaaCaaaleqabaWdbiabd6gaUnaaBaaameaacqWGQbGAaeqaaaaaaaa@8683@
(5)

Extended EM algorithm

The EM alternates between performing an expectation step E(t), which computes an expectation value of unknown parameters, here the frequencies of haplotype pairs, and a maximization step M(t), which computes the value of parameters by maximizing the probability of observed data. The parameters found on the M(t)step are then used to begin another E(t+1)step, and the process is repeated until the parameters are changed.

The description of algorithm details for the observed genotype data of k linked loci, l i variants for i-th locus, and the sample S = (n1, n2,..., n G ) is given below.

Initiation

The EM algorithm could be trapped into a local maximum, therefore multiple random starts are employed (any number determined by the user) in order to help the algorithm reach the global maximum. If n > 1 starting points are specified, for i-th point, the program calculates the mean error between the first and i-th estimate, and if this exceeds a predefined value (default = 0.05) a message is displayed about possible multiple local maxima. Since this feature increases computational time, it is optional.

If no random starts are used, the initial haplotype pair frequencies are set as described in equation (6) (the E0 step). For each haplotype resolution, the initial frequency depends only on the number of haplotype pairs for the given genotype. A similar initiation is described in [6].

z m n ( 0 ) = 1 r j where the  m n  gives the  j  genotype MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEpaWaa0baaSqaa8qacqWGTbqBcqWGUbGBa8aabaWdbiabcIcaOiabicdaWiabcMcaPaaakiabg2da9Kqbaoaalaaapaqaa8qacqaIXaqma8aabaWdbiabdkhaY9aadaWgaaqaa8qacqWGQbGAa8aabeaaaaWdbiabbEha3PGaeeiAaGMaeeyzauMaeeOCaiNaeeyzauMaeeiiaaIaeeiDaqNaeeiAaGMaeeyzauMaeeiiaaIaemyBa0MaemOBa4MaeeiiaaIaee4zaCMaeeyAaKMaeeODayNaeeyzauMaee4CamNaeeiiaaIaeeiDaqNaeeiAaGMaeeyzauMaeeiiaaIaemOAaOMaeeiiaaIaee4zaCMaeeyzauMaeeOBa4Maee4Ba8MaeeiDaqNaeeyEaKNaeeiCaaNaeeyzaugaaa@6365@
(6)

Maximization step

In this step, the typical EM algorithm was adopted, the only modification consisting of the fact that the genotype frequency calculation was performed as a sum of corresponding haplotype pair frequencies, equation (4), taking into account that the heterozygotes with null allele are genotyped identically as homozygotes without null allele.

Next, the haplotype pair frequencies are corrected, to maximize the probability of a given sample. Details are given in equation (7), where z m n ( t ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEpaWaa0baaSqaa8qacqWGTbqBcqWGUbGBa8aabaWdbiabcIcaOiabdsha0jabcMcaPaaaaaa@33C8@ is the input haplotype pair frequency, g j ( t ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGNbWzpaWaa0baaSqaa8qacqWGQbGAa8aabaWdbiabcIcaOiabdsha0jabcMcaPaaaaaa@3237@ is the calculated genotype frequency (inclusive of appropriate heterozygous genotypes with null variants), z m n ( t + 1 ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEpaWaa0baaSqaa8qacqWGTbqBcqWGUbGBa8aabaWdbiabcIcaOiabdsha0jabgUcaRiabigdaXiabcMcaPaaaaaa@359A@ is the output haplotype pair frequency, corrected to maximize the observed sample, n j is the number of observed genotypes g j in sample and n is the number individuals in the sample.

z m n ( t + 1 ) = n j n z m n ( t ) g j ( t ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEpaWaa0baaSqaa8qacqWGTbqBcqWGUbGBa8aabaWdbiabcIcaOiabdsha0jabgUcaRiabigdaXiabcMcaPaaakiabg2da9Kqbaoaalaaapaqaa8qacqWGUbGBpaWaaSbaaeaapeGaemOAaOgapaqabaaabaWdbiabd6gaUbaakiabgEHiQKqbaoaalaaapaqaa8qacqWG6bGEpaWaa0baaeaapeGaemyBa0MaemOBa4gapaqaa8qacqGGOaakcqWG0baDcqGGPaqkaaaapaqaa8qacqWGNbWzpaWaa0baaeaapeGaemOAaOgapaqaa8qacqGGOaakcqWG0baDcqGGPaqkaaaaaaaa@4C0F@
(7)

Expectation step

Haplotype frequencies h m s are calculated from the given haplotype pair frequencies z mn s, as a half of the sum of frequencies of all pairs of haplotypes in which given haplotype occurs. The next expected haplotype pair frequencies are calculated using haplotype frequencies as described in equation (8).

z m n ( t + 1 ) = { ( h m ( t ) ) 2 for  m = n 2 h m ( t ) h n ( t ) for  m n h m ( t ) = 1 2 ( i z i m ( t ) + j z m j ( t ) ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qafaqabeqacaaabaGaemOEaO3damaaDaaaleaapeGaemyBa0MaemOBa4gapaqaa8qacqGGOaakcqWG0baDcqGHRaWkcqaIXaqmcqGGPaqkaaGccqGH9aqpdaGabaqaauaabeqaciaaaeaacqGGOaakcqWGObaAdaqhaaWcbaGaemyBa0gabaGaeiikaGIaemiDaqNaeiykaKcaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaakeaacqqGMbGzcqqGVbWBcqqGYbGCcqqGGaaicqWGTbqBcqGH9aqpcqWGUbGBaeaacqaIYaGmcqWGObaAdaqhaaWcbaGaemyBa0gabaGaeiikaGIaemiDaqNaeiykaKcaaOGaemiAaG2aa0baaSqaaiabd6gaUbqaaiabcIcaOiabdsha0jabcMcaPaaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabd2gaTjabgcMi5kabd6gaUbaaaiaawUhaaaqaaiabdIgaO9aadaqhaaWcbaWdbiabd2gaTbWdaeaapeGaeiikaGIaemiDaqNaeiykaKcaaOGaeyypa0tcfa4aaSaaa8aabaWdbiabigdaXaWdaeaapeGaeGOmaidaaOGaeiikaGYaaabua8aabaWdbiabdQha6naaDaaaleaacqWGPbqAcqWGTbqBaeaacqGGOaakcqWG0baDcqGGPaqkaaaapaqaa8qacqWGPbqAaeqaniabggHiLdGccqGHRaWkdaaeqbWdaeaapeGaemOEaO3aa0baaSqaaiabd2gaTjabdQgaQbqaaiabcIcaOiabdsha0jabcMcaPaaaa8aabaWdbiabdQgaQbqab0GaeyyeIuoakiabcMcaPaaaaaa@83B7@
(8)

Stop conditions

The algorithm stops, when the stability of estimations between the following steps is obtained, i.e. the absolute difference between the calculated frequencies is less then ε (equation 9). The default threshold value for epsilon is 10-5, and can be changed by a program option.

i = 1 R | z i ( t + 1 ) z i ( t ) | < ε MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qadaaeWbWdaeaacqGG8baFpeGaemOEaO3damaaDaaaleaapeGaemyAaKgapaqaa8qacqGGOaakcqWG0baDcqGHRaWkcqaIXaqmcqGGPaqkaaGccqGHsislcqWG6bGEpaWaa0baaSqaa8qacqWGPbqAa8aabaWdbiabcIcaOiabdsha0jabcMcaPaaak8aacqGG8baFaSqaa8qacqWGPbqAcqGH9aqpcqaIXaqma8aabaWdbiabdkfasbqdcqGHris5aOGaeyipaWJaeqyTdugaaa@48B4@
(9)

The final step is calculation of the haplotype frequencies (another E step), and of the conditional probability of the haplotype pair, given genotype estimation (equation 10).

z m n | g j = z m n g j = z m n x r j z x MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEpaWaaSbaaSqaa8qacqWGTbqBcqWGUbGBa8aabeaak8qacqGG8baFcqWGNbWzpaWaaSbaaSqaa8qacqWGQbGAa8aabeaak8qacqGH9aqpjuaGdaWcaaWdaeaapeGaemOEaO3damaaBaaabaWdbiabd2gaTjabd6gaUbWdaeqaaaqaa8qacqWGNbWzpaWaaSbaaeaapeGaemOAaOgapaqabaaaaOWdbiabg2da9Kqbaoaalaaapaqaa8qacqWG6bGEpaWaaSbaaeaapeGaemyBa0MaemOBa4gapaqabaaabaWdbmaaqadabaGaemOEaO3aaSbaaeaacqWG4baEaeqaaaqaaiabdIha4bqaaiabdkhaYnaaBaaabaGaemOAaOgabeaaaiabggHiLdaaaaaa@4EF4@
(10)

Results and Discussion

The described algorithm was implemented using C++ and the Boost libraries [13] and called NullHap. The main advantage of our application is the ability to handle problems, when one or more multiallelic loci containing null variant occur.

NullHap was tested on simulated and real data sets and its performance was compared with those of previously described programs: Arlequin [5], PHASE [7] and Haplo-IHP [12].

Test on generated data sets

Firstly, the simulated data sets were obtained as the most probable samples generated for polymorphisms with varying locus characteristics, and accuracy of estimated frequencies for different computer programs was analyzed. An example of assumed and estimated frequencies used in one such simulation is shown in Table 1. In Table 2, results of six simulations are summarized by giving a mean absolute percentage error, calculated as shown in equation (11), where x is the assumed frequency, and x* is the calculated one.

Table 1 Assumed and estimated haplotype frequencies
Table 2 Haplotype estimation frequency error
e r r o r = 1 N i = 1 N | x x x | MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGLbqzcqWGYbGCcqWGYbGCcqWGVbWBcqWGYbGCcqGH9aqpjuaGdaWcaaWdaeaapeGaeGymaedapaqaa8qacqWGobGtaaGcdaaeWbWdaeaapeGaeiiFaWxcfa4aaSaaa8aabaWdbiabdIha4jabgkHiTiabdIha4naaCaaabeqaaiabgEHiQaaaa8aabaWdbiabdIha4baacqGG8baFaSWdaeaapeGaemyAaKMaeyypa0JaeGymaedapaqaa8qacqWGobGta0GaeyyeIuoaaaa@489F@
(11)

Since it may not be known beforehand, whether a locus has a null allele, we also checked performance of NullHap which was run assuming the presence of a null allele in each locus. Such an approach allows to screen the likelihood of the presence of a null allele in a given locus by evaluating the frequencies of haplotypes containing this allele. An appreciable frequency of any such haplotype in the output indicates the need to include a null allele in this particular locus. Otherwise, the conclusion is, that given locus most likely does not contain a null variant. Alternatively, genotypes of each locus could be analyzed for deviation from HWE by any of the available programs. When typing mistakes are excluded, deviation from HWE strongly indicates the presence of a null allele.

Secondly, the effect of sample size on the performance of the method was investigated. This was done by generating k random samples of 25, 50, 100, 200, 500 and 1000 individuals from an infinite population in HWE. The haplotype frequencies were estimated and median of k mean absolute errors (calculated as e r r o r = 1 N i = 1 N | x x | MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWGLbqzcqWGYbGCcqWGYbGCcqWGVbWBcqWGYbGCcqGH9aqpjuaGdaWcaaWdaeaapeGaeGymaedapaqaa8qacqWGobGtaaGcdaaeWaqaaiabcYha8jabdIha4jabgkHiTiabdIha4naaCaaaleqabaGaey4fIOcaaOGaeiiFaWhaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaaa@4574@ , where N is the number of individuals in the sample) was calculated. The results obtained for haplotype distributions such as those given in examples 5 and 6 in Table 2 are illustrated in Figure 1. As can be seen, with a sample size of 200 individuals, an error of approximately 2% can be expected in haplotype frequency estimation, whereas a lower sample size may lead to substantially higher errors.

Figure 1
figure 1

Effect of sample size on accuracy of estimation. Effect of sample size on accuracy of estimation of haplotype frequencies. Ten samples of 25, 50, 100, 200, 500, 1000 individuals were generated from population in HWE. The error in function of sample size is shown. The haplotype distribution is given in example 5 (red) and example 6 (blue) in Table 2, respectively.

Thirdly, tests of the effect of different levels of HWE violation on the accuracy of the algorithm were performed. The degree of HWE violation was modeled by increasing values of inbreeding coefficient f as defined by Weir [14, 15], equation (12).

z m n = { h m 2 ( 1 f ) + h m f for  m = n 2 h m h n ( 1 f ) for  m n MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaaeaaaaaaaaa8qacqWG6bGEdaWgaaWcbaGaemyBa0MaemOBa4gabeaakiabg2da9maaceaabaqbaeaabiGaaaqaaiabdIgaOnaaDaaaleaacqWGTbqBaeaacqaIYaGmaaGccqGGOaakcqaIXaqmcqGHsislcqWGMbGzcqGGPaqkcqGHRaWkcqWGObaAdaWgaaWcbaGaemyBa0gabeaakiabdAgaMbqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabd2gaTjabg2da9iabd6gaUbqaaiabikdaYiabdIgaOnaaBaaaleaacqWGTbqBaeqaaOGaemiAaG2aaSbaaSqaaiabd6gaUbqabaGccqGGOaakcqaIXaqmcqGHsislcqWGMbGzcqGGPaqkaeaacqqGMbGzcqqGVbWBcqqGYbGCcqqGGaaicqWGTbqBcqGHGjsUcqWGUbGBaaaacaGL7baaaaa@5ECC@
(12)

As can be seen from Figure 2, there was a linear correlation of inbreeding coefficient f with the accuracy of estimation of haplotype frequencies.

Figure 2
figure 2

Effect of HWE violation on accuracy of estimation. Effect of HWE violation on the accuracy of the algorithm. The figure shows the error in function of inbreeding coefficient f for two polymorphisms characterized in Table 2 (example 5 – red line, example 6 – blue line).

Finally, to evaluate the effect of haplotype frequency on the error of the estimation, 10 samples of 1000 individuals were generated from a population in HWE, for a simple two loci polymorphism: A with variants A0, A1, A2 and B with variants B0, B1. The frequencies of haplotypes A0B1, A1B0, A1B1, A2B0, A2B1 were fixed and equal to 0.19, 0.18, 0.16, 0.1, 0.04 and 0.02 respectively, whereas the frequency of haplotype A0B0 varied from 0.05 to 0.9. Results expressed as median of mean absolute percentage error (equation (11)) are shown in Figure 3. As can be seen, the lowest error occured with haplotype frequency close to 0.5.

Figure 3
figure 3

Effect of haplotype frequency on the error of the estimation. Effect of haplotype frequency on the error of the estimation. Ten samples of 1000 individuals were generated for population in HWE, for a 2 locus polymorphism: A with variants A0, A1, A2 and B with variants B0, B1. The graph shows the error of haplotype frequency estimation in function of assumed frequency of this haplotype.

Performance tests

We also performed analysis of computational time in different scenarios. Results presented for appropriate applications are shown in Table 3. All computations were achieved on Celeron M 1.6 GHz, 1 GB RAM, under Debian Linux or Windows XP.

Table 3 Computational time comparison

Because the number of haplotypes grows exponentially with the number of considered loci, there is a practical restriction to approximately 50,000 haplotypes, e.g. 15 biallelic loci. We noted with moderate number of loci the restriction is due to computational time, whereas for the very large number of loci (e.g. 100 loci) the memory becomes a limiting factor.

Tests on real data sets

To perform a test on real data, we first used HLA-DRB1 and HLA-DQB1 allele distributions among 99 Poles as supplied by [5]. Both loci are multiallelic (36 and 14 variants, respectively) without null variants. The difference between estimated frequencies among programs Arlequin, PHASE and NullHap (i.e. programs handling such loci) was less than 2%.

To test the application in the presence of biallelic loci with null variants, the KIR genotypes for 200 Irish subjects [12] were analyzed with NullHap and Haplo-IHP (the only available program suitable for such loci). The difference of estimated frequencies between programs was about 3%.

Reanalysis of published data indicates that the KIR2DS4/1D locus may be a marker of KIR2DS1 haplotypes with different effects on psoriasis susceptibility

In order to apply NullHap to real data from an association study we reanalyzed the results of Luszczek et al. on distribution of KIR genotypes in Polish psoriasis patients and controls [16]. In the original report these authors described an association between KIR2DS1 and psoriasis, which was also observed in two subsequent studies from Japan and the US [17, 18], but not in a study a of Chinese population [19]. Further analysis of genotype data of Luszczek et al. [16] indicated a role for KIR gene variants other than KIR2DS1 in conferring susceptibility to psoriasis, suggesting, that distinct KIR haplotypes could be responsible for observed associations [20].

The distributions of KIR haplotypes among patients and controls obtained with NullHap are given in Table 4. Because the structure of the KIR region is very complex, it is not fully known which genes are truly allelic, i.e. occupy precisely the same chromosomal locus. At first, in our analysis, the K2DL2/KIR2DL3, KIRDS4/KIR1D, and KIR2DS3/KIR2DS5 genes were treated as alleles. Since in the case of KIR2DS3 and KIR2DS5 this may be controversial due to some haplotypes which harbor both genes in cis [21], we also repeated the analysis after exclusion of these variants. In all loci a null allele was allowed [21].

Table 4 The distribution of KIR haplotypes

As can be seen from Table 4, two haplotypes (#1, #2) were strongly overrepresented among the patients. The fact that these haplotypes encoded KIR2DS1 is consistent with the association between this gene and psoriasis [1618] whereas the lower OR associated with haplotype #2 vs. #1 (27 vs. 52.5) supports the protective effect of KIR2DS3 suggested previously [20]. In contrast to haplotypes #1 and #2 other haplotypes encoding KIR2DS1 (#4, #6) were not overrepresented among the patients. Both haplotypes encoded KIR2DS5 which could be interpreted as the postulated protective effect of this variant [20]. However, whereas the presence of KIR2DS5 or KIR2D3 offers one explanation of the heterogeneity of the effects of KIR2DS1 haplotypes, the inspection of Table 4 shows that the risk – conferring and neutral KIR2DS1 haplotypes are also distinguished by the KIR2DS4/1D locus, which is a novel observation. As can be seen, the haplotypes #1, #2 share the 1D variant, whereas the haplotypes #4, #6 both have the KIR2DS4 null allele. These effects of KIR2DS4/1D locus were also apparent in analysis performed after the exclusion of KIR2DS3 and KIR2DS5 genes (haplotypes #8 and #9 vs. haplotypes #12 and #17, Table 4).

The fact that KIR2DS4/1D and KIR2DS1 loci are physically adjacent [21] suggests that the strong predictive effect of their haplotypic combinations may be caused by linkage disequilibrium with an unknown variant in the region, which is primarily associated with psoriasis. The indirect association is particularly plausible for KIR2DS4/1D because KIR 1D and KIRDS4 null (which mark KIR2DS1 haplotypes with distinct effects on disease susceptibility) are both non functional and thus should be equivalent physiologically [21]. In case of the KIR2DS1 it would be tempting to speculate that the susceptibility conferring effect is limited to a rare allele (absent in controls) being in strong linkage disequilibrium with 1D. Interestingly, such a theory could explain a lack of association between KIRDS1 and psoriasis recently reported in a Chinese population [19].

Conclusion

The developed application can effectively estimate haplotype frequencies with a performance that is similar or better than those of other available computer programs. It should be emphasized, that the main advantage of the created application is the ability to estimate haplotypes for every type of polymorphism, in particular polymorphisms with multiallelic loci with null variants.

The presented application is under development, and some improvements are planned, such as an additional step removing unimportant haplotypes or the partition-ligation algorithm [8] to speed-up computations for a large number of loci. Other planned improvements are a graphical user interface as well as an import/export module for popular data formats. The new versions will be available at project homepage.

Availability and requirements

Project name: NullHap

Project homepage: http://nullhap.sourceforge.net

Operating systems(s): OS Portable

Precompiled binaries: Windows NT/2000/XP, Debian Linux

Programming language: C++

License: GNU LGPL

References

  1. Douglas Jaa: Experimentally derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat Genet 2001, 28: 361–364. 10.1038/ng582

    Article  CAS  PubMed  Google Scholar 

  2. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society 1977, 39: 1–39.

    Google Scholar 

  3. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 2003, 33: 228–237. 10.1038/ng1090

    Article  CAS  PubMed  Google Scholar 

  4. Morris R, Kaplan N: On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol 2002, 23: 221–233. 10.1002/gepi.10200

    Article  PubMed  Google Scholar 

  5. Excoffier L, G L, S S: Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 2005, 1: 47–50.

    PubMed Central  CAS  Google Scholar 

  6. Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995, 12: 921–927.

    CAS  PubMed  Google Scholar 

  7. Stephens M, Smith N, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001, 68: 978–989. 10.1086/319501

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Niu T, Qin Z, Xu X, Liu J: Bayesian haplotype inference for multiple linked single nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–169. 10.1086/338446

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Boettcher P, Pagnacco G, Stella A: A Monte Carlo Approach for Estimation of Haplotype Probabilities in Half-Sib Families. J Dairy Sci 2004, 87: 4303–4310.

    Article  CAS  PubMed  Google Scholar 

  10. Shuying S, Greenwood M, Radford N: Haplotype inference using a Bayesian Hidden Markov model. Genetic Epidemiology 2007, 31: 937–948. 10.1002/gepi.20253

    Article  Google Scholar 

  11. Bashirova A, Martin M, McVicar D, M C: The killer immunoglobulin-like receptor gene cluster: tuning the genome for defense. Annu Rev Genomics Hum Genet 2006, 7: 277–300. 10.1146/annurev.genom.7.080505.115726

    Article  CAS  PubMed  Google Scholar 

  12. Yoo Y, Tang J, Kaslow R, Zhang K: Haplotype inference for present-absent genotype data using previously identified haplotypes and haplotype patterns. Bioinformatics 2007, 23: 2399–2406. 10.1093/bioinformatics/btm371

    Article  CAS  PubMed  Google Scholar 

  13. The boost libraries[http://www.boost.org]

  14. Shoemaker J, Painter I, Weir B: A Bayesian Characterization of Hardy-Weinberg Disequilibrium. Genetics 1998, 149: 2079–2088.

    PubMed Central  CAS  PubMed  Google Scholar 

  15. Weir B: Genetic Data Analysis 2. Sinauer Assocs. Inc., Sunderland, Mass; 1996.

    Google Scholar 

  16. Luszczek W, Manczak M, Cislo M, Nockowski P, Wisniewski A, Jasek M, Kusnierczyk P: Gene for the activating natural killer cell receptor, KIR2DS1, is associated with susceptibility to psoriasis vulgaris. Hum Immunol 2004, 65: 758–766. 10.1016/j.humimm.2004.05.008

    Article  CAS  PubMed  Google Scholar 

  17. Suzuki Y, Hamamoto Y, Ogasawara Y, Ishikawa K, Yoshikawa Y, Sasazuki T, Muto M: Genetic Polymorphisms of Killer Cell Immunoglobulin-Like Receptors Are Associated with Susceptibility to Psoriasis Vulgaris. Journal of Investigative Dermatology 2004, 122: 1133–1136. 10.1111/j.0022-202X.2004.22517.x

    Article  CAS  PubMed  Google Scholar 

  18. Holm S, Sakuraba K, Mallbris L, Wolk K, Stahle M, Sanchez F: Distinct HLA-C/KIR genotype profile associates with guttate psoriasis. Journal of Investigative Dermatology 2005, 125: 721–730. 10.1111/j.0022-202X.2005.23879.x

    Article  CAS  PubMed  Google Scholar 

  19. Chang Y, Chou C, Shiao Y, Lin M, Yu C, Chen C, Huang C, Lee D, Liu H, Wang W, Tsai S: The Killer Cell Immunoglobulin-Like Receptor Genes Do Not Confer Susceptibility to Psoriasis Vulgaris Independently in Chinese. J Invest Dermatol 2006, 126: 2335–2338. 10.1038/sj.jid.5700415

    Article  CAS  PubMed  Google Scholar 

  20. Ploski R, Luszczek W, Kusnierczyk P, Nockowski P, Cislo M, Krajewski P, Malejczyk J: A role for KIR gene variants other than KIR2DS1 in conferring susceptibility to psoriasis. Hum Immunol 2006, 67: 521–526. 10.1016/j.humimm.2006.04.001

    Article  CAS  PubMed  Google Scholar 

  21. Khakoo S, Carrington M: KIR and disease: a model system or system of models? Immunol Rev 2006, 214: 186–201. 10.1111/j.1600-065X.2006.00459.x

    Article  CAS  PubMed  Google Scholar 

  22. Haldane J: The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet 1956, 20: 309–311. 10.1111/j.1469-1809.1955.tb01285.x

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by the statutory research of Institute of Electronic Systems of Warsaw University of Technology and Medical University of Warsaw Grant 1WY/N/2008. We would like to thank the editor and anonymous referees for their insightful comments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Robert M Nowak.

Additional information

Authors' contributions

RN adopted algorithm, implemented application, performed calculations and drafted manuscript. RP proposed the idea to develop the software, outlined its main features, participated in validation process and provided biological interpretation of results of reanalysis of KIR haplotypes. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Nowak, R.M., Płoski, R. NullHap – a versatile application to estimate haplotype frequencies from unphased genotypes in the presence of null alleles. BMC Bioinformatics 9, 330 (2008). https://doi.org/10.1186/1471-2105-9-330

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-330

Keywords