A quantitatively-modeled homozygosity mapping algorithm, qHomozygosityMapping, utilizing whole genome single nucleotide polymorphism genotyping data
- Huqun*1, 2,
- Shun-ichiro Fukuyama†1,
- Hiroyuki Morino3,
- Hiroshi Miyazawa1,
- Tomoaki Tanaka1,
- Tomoko Suzuki1,
- Masakazu Kohda4,
- Hideshi Kawakami3,
- Yasushi Okazaki4,
- Kuniaki Seyama5 and
- Koichi Hagiwara†1
© Huqun et al; licensee BioMed Central Ltd. 2010
Published: 15 October 2010
Homozygosity mapping is a powerful procedure capable of detecting recessive disease-causing genes in a few patients from families with a history of inbreeding. We report here a homozygosity mapping algorithm for high-density single nucleotide polymorphism arrays that is able to (i) correct genotyping errors, (ii) search for autozygous segments genome-wide through regions with runs of homozygous SNPs, (iii) check the validity of the inbreeding history, and (iv) calculate the probability of the disease-causing gene being located in the regions identified. The genotyping error correction restored an average of 94.2% of the total length of all regions with runs of homozygous SNPs, and 99.9% of the total length of those longer than 2 cM. At the end of the analysis, we know the probability that the regions identified contain a disease-causing gene, and can determine how much effort should be devoted to scrutinizing those regions. We confirmed the power of this algorithm using 6 patients with Siiyama-type α1-antitrypsin deficiency, a rare autosomal recessive disease in Japan. Our procedure will accelerate the identification of disease-causing genes using high-density SNP array data.
Identification of the genetic factors underlying disease causation provides crucial information for disease prevention and treatment. Nevertheless, genetic factors have not yet been elucidated for many diseases [1, 2].
If a patient is from an inbred family (i.e., F is large) and the disease is rare (i.e., p is small), then P AS ≈ 1, indicating that the gene is located in an AS. There are implementations that utilize single-nucleotide polymorphism (SNP) genotyping data obtained by high-density arrays [5, 6]. A usable implementation should (i) correct genotyping errors, because thousands of SNPs are mistyped per high-density SNP array, adversely affecting the homozygosity mapping analysis; (ii) search for ASs genome-wide; (iii) check the validity of the inbreeding history, which is vital for homozygosity mapping but is often erroneous; and (iv) calculate the probability of the disease-causing gene being located in the regions identified. At the end of the analysis, we would then know the probability that the regions identified contain a disease-causing gene, and could determine how much effort should be devoted to scrutinizing those regions.
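The limiting behaviour of P AS can be checked numerically. The sketch below assumes the standard form given by Lander and Botstein [3], P AS = F / (F + (1 − F)p), with p read as the disease-allele frequency; the function name is illustrative and this is not necessarily the exact expression used by the program.

```python
def prob_gene_in_autozygous_segment(F, p):
    """Probability that the disease locus of an affected inbred patient
    lies in an autozygous segment.

    Assumption: the Lander-Botstein form F / (F + (1 - F) * p), where
    F is the inbreeding coefficient and p the disease-allele frequency.
    """
    if not 0.0 < F <= 1.0:
        raise ValueError("F must be in (0, 1]")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return F / (F + (1.0 - F) * p)


if __name__ == "__main__":
    # Offspring of first cousins: F = 1/16. A rare allele pushes P_AS toward 1.
    for p in (0.1, 0.01, 0.001):
        print(p, round(prob_gene_in_autozygous_segment(1.0 / 16.0, p), 4))
```

For a child of first cousins (F = 1/16), the value rises toward 1 as p decreases, which is the regime in which homozygosity mapping is informative.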
In the current study, we present an algorithm that implements the capabilities described in the above paragraph. We confirmed the power of this algorithm using 6 patients with Siiyama-type α1-antitrypsin deficiency, a rare autosomal recessive disease in Japan [7, 8]. The preliminary version of the algorithm described here has been used to prove that the SLC34A2 gene is responsible for pulmonary alveolar microlithiasis [9]; the current version has been used to show that the OPTN gene is responsible for amyotrophic lateral sclerosis [10].
We used Haldane's Poisson process model for the occurrence of crossovers and performed all calculations based on this model [11]. Information on the SNPs used by Affymetrix's Genome-Wide Human SNP Array 6.0 (hereafter referred to as SNP Array 6.0) was summarized in the annotation file [12], in which the genetic distance from the telomere of the short arm of a chromosome to each SNP was obtained by interpolation using the sex-averaged data published by deCODE Genetics [13]. We restricted our analysis to a total of 890,625 autosomal SNPs with assigned dbSNP refIDs [14].
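Two building blocks mentioned above can be sketched briefly: Haldane's map function, which follows from the Poisson crossover model, and linear interpolation of a SNP's genetic position from a reference map. The reference arrays and function names here are hypothetical; the actual annotation pipeline is not described in enough detail to reproduce.

```python
import math
from bisect import bisect_right


def haldane_recombination_fraction(d_morgans):
    """Recombination fraction for a map distance d (in Morgans) under
    Haldane's Poisson crossover model: r = (1 - exp(-2d)) / 2."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def interpolate_genetic_position(bp, ref_bp, ref_cm):
    """Linearly interpolate a genetic position (cM) for a physical
    position `bp`, given reference markers with strictly increasing
    physical positions `ref_bp` and genetic positions `ref_cm`.
    Positions outside the reference range are clamped to its ends."""
    if bp <= ref_bp[0]:
        return ref_cm[0]
    if bp >= ref_bp[-1]:
        return ref_cm[-1]
    i = bisect_right(ref_bp, bp)  # first reference marker strictly to the right
    x0, x1 = ref_bp[i - 1], ref_bp[i]
    y0, y1 = ref_cm[i - 1], ref_cm[i]
    return y0 + (y1 - y0) * (bp - x0) / (x1 - x0)


if __name__ == "__main__":
    print(haldane_recombination_fraction(0.1))
    print(interpolate_genetic_position(150, [100, 200, 400], [1.0, 3.0, 4.0]))
```

Clamping at the ends of the reference map is a simplification chosen for this sketch; extrapolation or exclusion of such SNPs would be equally defensible.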
Monte Carlo simulation
The length of AS
In actuality, the autosomes have finite length; however, equation 2 provides a good approximation when the length of an AS is much shorter than the length of an autosome.
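To illustrate the infinite-chromosome approximation, the following Monte Carlo sketch samples the length of the segment surrounding a given locus that survives a number of meioses without a crossover. It assumes only the Poisson crossover model (one expected crossover per Morgan per meiosis) and is not a reconstruction of equation 2; the number of meioses is a free parameter, with 6 used as the example for offspring of first cousins.

```python
import random


def sample_segment_length_cm(n_meioses, rng):
    """Sample the length (cM) of the crossover-free segment around a locus.

    Under the Poisson model, the distance to the nearest crossover on
    each side, pooled over n_meioses meioses, is exponential with rate
    n_meioses per Morgan. Chromosome ends are ignored (infinite length).
    """
    rate_per_cm = n_meioses / 100.0
    left = rng.expovariate(rate_per_cm)
    right = rng.expovariate(rate_per_cm)
    return left + right


def mean_segment_length_cm(n_meioses, n_samples=100000, seed=0):
    """Monte Carlo estimate of the mean segment length in cM."""
    rng = random.Random(seed)
    total = sum(sample_segment_length_cm(n_meioses, rng) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    # The analytic mean under these assumptions is 200 / n_meioses cM.
    print(mean_segment_length_cm(6), 200.0 / 6.0)
```

Because the simulated segment can extend indefinitely, the estimate is only meaningful when typical segment lengths are short relative to the chromosome, which is the condition stated above.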
RHS (run of homozygous SNPs), false negative, type A false positive and type B false positive
R false negative, the ratio of the total length of false negatives to the total length of the ASs
R Type A false positive, the ratio of the total length of type A false positives to the total length of the autosomes
R Type B false positive, the ratio of the total length of type B false positives to the total length of the autosomes
R false positive, the ratio of the total length of false positives to the total length of the autosomes
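These ratios can be computed directly when the true ASs are known, as in a simulation. The sketch below is a minimal illustration for one chromosome: intervals are (start, end) pairs in cM, each list is assumed to be internally non-overlapping, and type A and type B false positives are not distinguished because their definitions depend on the lost figure. All names are hypothetical.

```python
def total_length(intervals):
    """Sum of interval lengths."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Total length shared between two interval lists.
    Assumes intervals within each list do not overlap one another."""
    return sum(
        max(0.0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def error_ratios(true_as, detected_rhs, autosome_length):
    """Return (R_false_negative, R_false_positive).

    R_false_negative: undetected AS length / total AS length.
    R_false_positive: RHS length outside any AS / total autosome length.
    """
    shared = overlap_length(true_as, detected_rhs)
    r_fn = (total_length(true_as) - shared) / total_length(true_as)
    r_fp = (total_length(detected_rhs) - shared) / autosome_length
    return r_fn, r_fp


if __name__ == "__main__":
    print(error_ratios([(10.0, 30.0)], [(15.0, 30.0), (50.0, 55.0)], 100.0))
```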
Probability that a disease-causing gene is contained in RHSs, or the overlap of RHSs
Human Subjects and genotyping
This study was approved by the Institutional Review Boards of Saitama Medical University and Juntendo University. After written informed consent was obtained, DNA samples were purified from the peripheral blood of 6 patients with α1-antitrypsin deficiency. These patients were unrelated and lived in different areas of Japan. Patients 1-5 were from families with a history of inbreeding: their parents were first cousins. Patient 6 did not have any family history of inbreeding. All 6 patients were genotyped using the SNP Array 6.0. The genotyping data for 86 HapMap JPT subjects were available in the HapMap3 draft release 2 (http://www.hapmap.org) and were downloaded from the Wellcome Trust Sanger Institute web site (http://www.sanger.ac.uk/humgen/hapmap3/). The genotyping data for NA18987, a subject in HapMap JPT, were also distributed by Affymetrix and were used in the current study.
Genotyping error correction
Genotyping errors may convert homozygous SNPs to heterozygous SNPs and erroneously terminate an RHS, resulting in the failure to detect a portion of an RHS. According to Affymetrix, SNP Array 6.0 has an accuracy of > 0.997, implying that the genotyping error rate (P genotypingError) may be as high as 0.003. A mistyped heterozygous SNP occurring in an RHS is separated by a large distance from neighboring heterozygous SNPs (Figure 1C). Therefore, if a heterozygous SNP was separated from neighboring heterozygous SNPs by a distance that is rarely observed by chance, we inferred that the SNP was mistyped. Using equation 4, we calculated the probability of a heterozygous SNP being separated from neighboring SNPs at the observed distance (P distanceOccurredByChance). A SNP with P distanceOccurredByChance < 0.01 was considered mistyped and its data were removed. This algorithm may erroneously remove 20 correctly genotyped heterozygous SNPs (N homozygousSNP × P genotypingError × 0.01) from the data of a single SNP array analysis, which we considered acceptable.
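The idea of flagging isolated heterozygous SNPs can be sketched as follows. Since equation 4 is not reproduced here, the sketch substitutes a simple stand-in: heterozygous SNPs are assumed to occur along the chromosome as a Poisson process with a hypothetical density `het_density_per_cm`, so the probability that no other heterozygous SNP lies within distance d on either side is exp(−2·density·d). This is an illustrative assumption, not the authors' formula.

```python
import math


def isolation_probability(gap_left_cm, gap_right_cm, het_density_per_cm):
    """Probability, under a Poisson assumption, that a heterozygous SNP
    has no heterozygous neighbour closer than its nearer observed gap."""
    d = min(gap_left_cm, gap_right_cm)
    return math.exp(-2.0 * het_density_per_cm * d)


def flag_mistyped(het_positions_cm, het_density_per_cm, threshold=0.01):
    """Return positions of heterozygous SNPs whose isolation is unlikely
    to have arisen by chance (probability below `threshold`).

    `het_positions_cm` must be sorted positions on one chromosome.
    The first and last SNPs are skipped because one gap is unknown.
    """
    flagged = []
    for i in range(1, len(het_positions_cm) - 1):
        gap_left = het_positions_cm[i] - het_positions_cm[i - 1]
        gap_right = het_positions_cm[i + 1] - het_positions_cm[i]
        if isolation_probability(gap_left, gap_right, het_density_per_cm) < threshold:
            flagged.append(het_positions_cm[i])
    return flagged


if __name__ == "__main__":
    positions = [1.0, 1.2, 1.5, 20.0, 40.0, 40.1, 40.3]
    print(flag_mistyped(positions, het_density_per_cm=1.0))
```

Using the nearer of the two gaps keeps heterozygous SNPs at the border of a genuine RHS from being flagged, since such SNPs have a close neighbour on one side.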
The computer program was written in the ANSI standard C programming language. The program was compiled with the GNU C compiler 4.2 and run on a MacBook Pro computer (CPU: 2.53 GHz Intel Core 2 Duo, 4 GB RAM). Both the command-line programs and the programs equipped with a graphical user interface are available from our web site at http://www.hhanalysis.com.
Our aim was to establish an algorithm for homozygosity mapping that uses SNP genotyping data obtained by high-density arrays, is equipped with a powerful genotyping error correction algorithm, detects ASs genome-wide, allows investigation into the family inbreeding history, and is able to calculate the probability that the identified regions contain the target gene.
The algorithm searches for the ASs (Figure 1A, B(i)) through runs of homozygous SNPs, or RHSs, that are formed by consecutive homozygous SNPs and are longer than the RHS cutoff value (Figure 1B(ii)). RHSs are presumed to be autozygous segments (ASs). Three types of errors were defined: false negative, type A false positive, and type B false positive (Figure 1B(iii)). The main determinants of the false negative rate (R false negative), which is the ratio of the total length of false negatives to the total length of ASs, are the number of SNPs investigated and the genotyping error rate. The main determinants of the false positive rate (R false positive), which is the ratio of the total length of type A false positives plus type B false positives to the entire length of the autosomes, are the positioning of SNPs, local haplotype block structure [15], and population substructure [16].
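RHS detection as described above amounts to scanning each chromosome for maximal runs of consecutive homozygous SNPs and keeping those longer than the cutoff. The following is a minimal sketch under simplifying assumptions: SNPs are given as sorted (position in cM, is_homozygous) pairs for one chromosome, no-calls are not modelled, and the function name is illustrative.

```python
def find_rhs(snps, cutoff_cm):
    """Return (start_cm, end_cm) for every run of consecutive homozygous
    SNPs whose span exceeds cutoff_cm.

    `snps` is a list of (position_cm, is_homozygous) tuples sorted by
    position along a single chromosome.
    """
    runs = []
    start = None  # position of the first SNP in the current run
    last = None   # position of the most recent homozygous SNP
    for position, is_homozygous in snps:
        if is_homozygous:
            if start is None:
                start = position
            last = position
        else:
            # A heterozygous SNP terminates the current run.
            if start is not None and last - start > cutoff_cm:
                runs.append((start, last))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and last - start > cutoff_cm:
        runs.append((start, last))
    return runs


if __name__ == "__main__":
    snps = [(0.0, True), (1.0, True), (3.5, True), (4.0, False),
            (4.5, True), (5.0, True), (9.0, False),
            (10.0, True), (15.0, True)]
    print(find_rhs(snps, cutoff_cm=2.0))
```

The scan makes clear why a single mistyped heterozygous SNP matters: it splits one long run into two shorter ones, either of which may fall below the cutoff.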
To attain the aims stated above while avoiding the influence of these errors, our algorithm had the following steps: Step (a) determine an appropriate RHS cutoff value based on Haldane's recombination model; Step (b) perform genotyping error correction; Step (c) detect RHSs; Step (d) obtain the overlaps of RHSs among patients; and Step (e) correct false positives by a case-control approach. The validity of the family history is checked at Step (c). We used 6 patients with Siiyama-type α1-antitrypsin deficiency, a rare disease in Japan, to verify our strategy. The analyses performed in the Results section can be reproduced using the program contained in additional file 1, following the tutorial also contained in that file.
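Step (d), obtaining the overlap of RHSs among patients, is an interval intersection. The sketch below shows one way to do this for a single chromosome; intervals are (start, end) pairs in cM, each patient's list is assumed to be non-overlapping, and the function names are illustrative rather than taken from the program.

```python
def intersect_two(a, b):
    """Intersect two lists of non-overlapping intervals (two-pointer sweep)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def overlap_among_patients(rhs_per_patient):
    """Regions covered by an RHS in every patient."""
    if not rhs_per_patient:
        return []
    common = sorted(rhs_per_patient[0])
    for rhs in rhs_per_patient[1:]:
        common = intersect_two(common, rhs)
    return common


if __name__ == "__main__":
    patients = [
        [(10.0, 40.0), (60.0, 80.0)],
        [(20.0, 50.0), (70.0, 90.0)],
        [(0.0, 30.0), (75.0, 100.0)],
    ]
    print(overlap_among_patients(patients))
```

Each additional patient can only shrink the shared region, which is why a handful of patients carrying the same ancestral mutation can narrow the candidate interval considerably.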
Determination of the RHS cutoff
Genotyping error correction
RHSs in the patients
Statistics of AS
Size of the longest RHS for each patient
Length of the longest RHS (cM)
Overlap of RHSs
Genes present in the candidate RHS overlap
chromosome 14 open reading frame 48
OTU domain, ubiquitin aldehyde binding 2
DEAD (Asp-Glu-Ala-Asp) box polypeptide 24
interferon, alpha-inducible protein 27-like 1
interferon, alpha-inducible protein 27
interferon, alpha-inducible protein 27-like 2
protein phosphatase 4, regulatory subunit 4
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 10
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6
hypothetical protein LOC100287997
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 2
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 11
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 9
serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 12
A patient without family history of inbreeding
In the current report, we described a quantitatively modeled homozygosity mapping algorithm that uses high-density array SNP genotyping data.
Homozygosity mapping is simple in principle, but many pitfalls were discovered when it was actually applied. Problems including (i) unexpected allelic heterogeneity, (ii) identification of a homozygous identical-by-descent (IBD) region unrelated to the disease locus, and (iii) underestimation of the extent of inbreeding were pointed out in analyses using microsatellite markers [17] and are still observed in analyses using SNPs. Moreover, the use of high-density SNP arrays introduced a novel problem: (iv) a large number of mistyped SNPs. Although the genotyping error rate is low for high-density arrays, the huge number of SNPs in these arrays inevitably produces a large number of mistyped SNPs. Even a single mistyped SNP erroneously terminates an RHS, making the detection of large RHSs difficult. Our algorithm addresses all of these problems: problem (i) by using high-density SNP arrays, problem (ii) by the case-control approach, problem (iii) by identifying ASs as RHSs and calculating F as the total length of RHSs divided by the total length of the autosomes, and problem (iv) by applying the genotyping error correction algorithm.
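The calculation of F mentioned for problem (iii) is a single ratio, sketched here for concreteness. The RHS lengths and the autosomal map length in the example are made-up values, and comparing the estimate with the pedigree expectation of 1/16 for offspring of first cousins is one plausible way to check a reported inbreeding history, not a description of the program's exact test.

```python
EXPECTED_F_FIRST_COUSINS = 1.0 / 16.0  # pedigree expectation for a child of first cousins


def estimate_inbreeding_coefficient(rhs_lengths_cm, autosome_length_cm):
    """Estimate F as total RHS length divided by total autosome length."""
    if autosome_length_cm <= 0:
        raise ValueError("autosome_length_cm must be positive")
    return sum(rhs_lengths_cm) / autosome_length_cm


if __name__ == "__main__":
    # Hypothetical patient: three RHSs on an autosomal map of 3600 cM.
    f_observed = estimate_inbreeding_coefficient([120.0, 60.0, 45.0], 3600.0)
    print(f_observed, f_observed / EXPECTED_F_FIRST_COUSINS)
```

An estimate far below the pedigree expectation would suggest that the reported inbreeding history is wrong, while one far above it would suggest additional, unreported consanguinity.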
As noted under problem (ii) above, we observed that some autosomal regions had a high probability of containing RHSs. This may be caused by SNP positioning, local haplotype block structure, or population substructure. Their effect was eliminated by a case-control approach, performed in the following order: (a) obtain the overlap of RHSs among patients, and (b) perform a case-control analysis targeting the overlaps obtained.
Homozygosity mapping has the power to identify a disease-causing gene in as few as 3 patients, and we have indeed identified the SLC34A2 gene in pulmonary alveolar microlithiasis and the OPTN gene in amyotrophic lateral sclerosis, each in 3 patients [9, 10]. Amyotrophic lateral sclerosis has multiple causative genes. In the latter report, we were able to identify one of the genes by investigating each combination of 3 patients from 7 patients with a history of inbreeding, seeking 3 patients harboring the same disease-causing gene. Our algorithm worked well in this approach. During the process, it was quite helpful that the algorithm provided the probability that the identified regions contain the disease-causing gene, which determined how much further effort should be devoted. To our knowledge, the algorithm presented in the current study is the first to provide this information.
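The combination search described above (every group of 3 out of 7 patients, 35 groups in all) can be sketched as follows. The patient labels and intervals are hypothetical, the example uses 4 patients for brevity, and intervals are (start, end) pairs in cM on a single chromosome.

```python
from itertools import combinations


def shared_intervals(interval_lists):
    """Regions covered by an interval in every list (pairwise intersection)."""
    common = sorted(interval_lists[0])
    for other in interval_lists[1:]:
        common = sorted(
            (max(s1, s2), min(e1, e2))
            for s1, e1 in common
            for s2, e2 in other
            if max(s1, s2) < min(e1, e2)
        )
    return common


def candidate_groups(rhs_by_patient, size=3):
    """Map each group of `size` patients that shares at least one RHS
    overlap to the list of shared regions."""
    hits = {}
    for group in combinations(sorted(rhs_by_patient), size):
        shared = shared_intervals([rhs_by_patient[p] for p in group])
        if shared:
            hits[group] = shared
    return hits


if __name__ == "__main__":
    rhs_by_patient = {
        "A": [(10.0, 20.0)],
        "B": [(15.0, 25.0)],
        "C": [(18.0, 30.0)],
        "D": [(50.0, 60.0)],
    }
    print(candidate_groups(rhs_by_patient))
```

In a genetically heterogeneous disease, most groups are expected to share little or nothing, so the groups that do share a sizeable region are the ones worth following up.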
We described an algorithm that enables homozygosity mapping to be performed based on a quantitative model using SNP genotyping data. Our procedure will accelerate the identification of disease-causing genes using high-density SNP array data.
Availability and requirements
Project name: qHomozygosityMapping
Project home page: http://www.hhanalysis.com
Operating system(s): Mac, Linux and Windows.
Programming language: C
License: GNU GPL.
Any restrictions to use by non-academics: The software is for academic purpose only.
This work is supported in part by a grant-in-aid for scientific research (No. 18390242) from the Japan Society for the Promotion of Science, and in part by grants-in-aid for Health and Labor Science (Nos. H22-Nanchi-Ippan-005 and H20-Nanchi-Ippan-023) from the Ministry of Health, Labour and Welfare, Japan.
The authors thank Ms. Tomoko Hirata for her technical assistance.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 7, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S7.
- McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 2007, 80: 588–604. 10.1086/514346
- OMIM-Online Mendelian Inheritance in Man [http://www.ncbi.nlm.nih.gov/Omim/mimstats.html]
- Lander ES, Botstein D: Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 1987, 236: 1567–1570. 10.1126/science.2884728
- Clark AG: The size distribution of homozygous segments in the human genome. Am J Hum Genet 1999, 65: 1489–1492. 10.1086/302668
- Woods CG, Valente EM, Bond J, Roberts E: A new method for autozygosity mapping using single nucleotide polymorphisms (SNPs) and EXCLUDEAR. J Med Genet 2004, 41: e101. 10.1136/jmg.2003.016873
- Seelow D, Schuelke M, Hildebrandt F, Nurnberg P: HomozygosityMapper--an interactive approach to homozygosity mapping. Nucleic Acids Res 2009, 37: W593–599. 10.1093/nar/gkp369
- Seyama K: State of alpha1-antitrypsin deficiency in Japan. Respirology 2001, 6(Suppl): S35–38. 10.1046/j.1440-1843.2001.00310.x
- Seyama K, Nukiwa T, Souma S, Shimizu K, Kira S: Alpha 1-antitrypsin-deficient variant Siiyama (Ser53[TCC] to Phe53[TTC]) is prevalent in Japan. Status of alpha 1-antitrypsin deficiency in Japan. Am J Respir Crit Care Med 1995, 152: 2119–2126.
- Izumi S, Miyazawa H, Ishii K, Uchiyama B, Ishida T, Tanaka S, Tazawa R, Fukuyama S, Tanaka T, Nagai Y, Yokote A, Takahashi H, Fukushima T, Kobayashi K, Chiba H, Nagata M, Sakamoto S, Nakata K, Takebayashi Y, Shimizu Y, Kaneko K, Shimizu M, Kanazawa M, Abe S, Inoue Y, Takenoshita S, Yoshimura K, Kudo K, Tachibana T, Nukiwa T, Hagiwara K: Mutations in the SLC34A2 gene are associated with pulmonary alveolar microlithiasis. Am J Respir Crit Care Med 2007, 175: 263–268.
- Maruyama H, Morino H, Ito H, Izumi Y, Kato H, Watanabe Y, Kinoshita Y, Kamada M, Nodera H, Suzuki H, Komure O, Matsuura S, Kobatake K, Morimoto N, Abe K, Suzuki N, Aoki M, Kawata A, Hirai T, Kato T, Ogasawara K, Hirano A, Takemi T, Kusaka H, Hagiwara K, Kaji R, Kawakami H: Mutations of optineurin in amyotrophic lateral sclerosis. Nature 2010, 465: 223–226. 10.1038/nature08971
- Haldane J: The combination of linkage values, and the calculation of distances between the loci of linked factors. J Genet 1919, 8: 299–309. 10.1007/BF02983270
- Affymetrix - Home [http://www.affymetrix.com/index.affx]
- Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K: A high-resolution recombination map of the human genome. Nat Genet 2002, 31: 241–247.
- National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov]
- International HapMap Consortium: The International HapMap Project. Nature 2003, 426: 789–796. 10.1038/nature02168
- Overall AD, Nichols RA: A method for distinguishing consanguinity and population substructure using multilocus genotype data. Mol Biol Evol 2001, 18: 2048–2056.
- Miano MG, Jacobson SG, Carothers A, Hanson I, Teague P, Lovell J, Cideciyan AV, Haider N, Stone EM, Sheffield VC, Wright AF: Pitfalls in homozygosity mapping. Am J Hum Genet 2000, 67: 1348–1351.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.