Haplo2Ped: a tool using haplotypes as markers for linkage analysis
© Cheng et al; licensee BioMed Central Ltd. 2011
Received: 26 April 2011
Accepted: 22 August 2011
Published: 22 August 2011
SNPs are abundant in the genome, but they have low power in linkage analysis because of their limited heterozygosity. Haplotype markers, which are composed of many SNPs, greatly increase heterozygosity and perform better in linkage statistics.
Here we developed Haplo2Ped to automatically transform SNP data into haplotype markers and then to compute the logarithm (base 10) of odds (LOD) scores of regional haplotypes that are homozygous within the disease co-segregation haploid group. The results are reported as a hypertext file and a 3D figure to help users to obtain the candidate linkage regions. The hypertext file contains parameters of the disease linked regions, candidate genes, and their links to public databases. The 3D figure clearly displays the linkage signals in each chromosome. We tested Haplo2Ped in a simulated SNP dataset and also applied it to data from a real study. It successfully and accurately located the causative genomic regions. Comparison of Haplo2Ped with other existing software for linkage analysis further indicated the high effectiveness of this software.
Haplo2Ped uses haplotype fragments as mapping markers in whole genome linkage analysis. The advantages of Haplo2Ped over other existing software include straightforward output files, increased accuracy and superior ability to deal with pedigrees showing incomplete penetrance. Haplo2Ped is freely available at: http://bighapmap.big.ac.cn/software.html.
Linkage analysis plays an important role in mapping disease-causing genes. Compared to other methods, such as association research, not only are very limited samples needed in linkage study, but also the high disease homogeneity among pedigree members increases the possibility of locating causative genes [1, 2]. Furthermore, linkage mapping of complex traits was made feasible for experimental organisms, such as animals and plants, through the use of genetic mapping in large crosses [3, 4]. Linkage analysis has wide applications in both medical experiments and agricultural breeding.
With the advent of high-throughput SNP genotyping, using whole genome SNP data for linkage analysis has been shown to be an efficient strategy [5, 6]. However, because SNPs have only two alleles, the heterozygosity of SNP markers is usually lower than that of traditional genetic markers, such as short tandem repeats (STRs). Therefore, two-point linkage analysis with SNP data is often insufficiently powerful. Considering the abundance of SNPs in the human genome, the use of multi-point based methods, such as haplotype-disease co-transmission analysis, would largely overcome the low heterozygosity of individual SNPs, because haplotypes formed by multiple SNPs could easily achieve the maximum heterozygosity in pedigrees.
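The heterozygosity argument can be made concrete with the standard expected-heterozygosity formula, H = 1 - sum(p_i^2), where p_i are the allele (or haplotype) frequencies at a marker. The sketch below is only an illustration: the function name and the example frequencies are assumptions, not values from this study.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) for one marker."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Best case for a biallelic SNP: both alleles equally common.
snp_h = expected_heterozygosity([0.5, 0.5])

# Hypothetical haplotype marker with 8 equally frequent haplotypes.
hap_h = expected_heterozygosity([1 / 8] * 8)

print(snp_h, hap_h)
```

A biallelic marker can never exceed H = 0.5, whereas a marker with k equally frequent haplotypes reaches 1 - 1/k, which is why combining SNPs into haplotypes raises the information available per marker.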
Software packages have been developed to carry out multi-point analysis. The traditional linkage methods employ two basic algorithms: the Elston-Stewart algorithm, used in Allegro, and the Lander-Green algorithm, used in Merlin. SNPLINK, a Perl script that performs full genome linkage analysis, uses both of these algorithms. However, the application of these two algorithms is limited, either by the number of markers or by the size of the pedigrees. Another program, SNP4Linkage, is based on allele sharing determination and is better adapted to high-density SNP genotyping data. Nevertheless, a tool that treats haplotype fragments as genomic markers for linkage research is still lacking [7–10]. Therefore, Haplo2Ped was developed. It can perform whole genome linkage analysis with haplotypes and generate a corresponding report file that contains linkage regions, LOD scores, and the candidate genes. To help users to obtain further information, links from the candidate genes to databases of gene annotations and OMIM (Online Mendelian Inheritance in Man) are also provided. Meanwhile, an auto-generated 3D picture allows users to visualize the linkage signals clearly on a genomic scale.
First, we consider the example of a dominant disease model. In a trio, if the child and the father are both affected, the father's transmitted haploid is selected as an aHap. Conversely, if the child is healthy, the affected father's untransmitted haploid is deemed an aHap. When we cannot be sure of the child's affected status (the child is too young to show symptoms, or the disease has incomplete penetrance), both of the affected father's haploids are selected and treated with the rule that at least one of them is an aHap.
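The selection rule for a dominant model can be sketched as a small function. This is a minimal illustration, not Haplo2Ped's own code: the function name, the representation of a haploid as a string of alleles, and the use of `None` for an unknown affection status are all assumptions.

```python
from typing import List, Optional, Tuple


def select_ahaps(transmitted: str, untransmitted: str,
                 child_affected: Optional[bool]) -> Tuple[List[str], bool]:
    """Pick candidate aHaps from an affected parent in one trio.

    Returns (candidates, assured). `assured` is False when the child's
    status is unknown and only "at least one of the two" can be claimed.
    """
    if child_affected is True:
        # Affected child: the disease haploid was passed on.
        return [transmitted], True
    if child_affected is False:
        # Healthy child: the disease haploid stayed with the parent.
        return [untransmitted], True
    # Unknown status: keep both haploids as a pair.
    return [transmitted, untransmitted], False


print(select_ahaps("ACGT", "TGCA", True))
print(select_ahaps("ACGT", "TGCA", None))
```

Keeping the `assured` flag alongside the candidates lets later steps treat assured aHaps and paired aHaps differently.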
Once the set of aHaps is determined from all the trios, the haplotype sharing analysis is performed. A window length and step size are set to scan these aHaps to determine disease candidate segment(s) generated from recombination events (Figure 2B). If the haplotypes located within a scanning window are homozygous across all aHaps, the window is merged with the adjacent homozygous windows until the sliding window moves out of the area showing homozygosity. After the completion of aHap scanning, the family's haplotype fragments that are located in the homozygous aHap regions are determined and are consequently used as markers to calculate LOD scores.
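The window scan and merge step can be sketched as follows. This is a simplified illustration under stated assumptions: each aHap is a string of alleles of equal length, intervals are expressed as SNP indices rather than base pairs, and a window counts as shared only if all aHaps are exactly identical in it. The function and parameter names are hypothetical.

```python
def shared_regions(ahaps, window, step):
    """Scan aHaps with a sliding window and merge shared windows.

    Returns a list of (start, end) SNP index intervals, end exclusive,
    in which every aHap carries the same alleles.
    """
    n = len(ahaps[0])
    if any(len(h) != n for h in ahaps):
        raise ValueError("all aHaps must cover the same SNPs")
    regions = []
    for start in range(0, n - window + 1, step):
        end = start + window
        reference = ahaps[0][start:end]
        if all(h[start:end] == reference for h in ahaps[1:]):
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous shared window: extend it.
                regions[-1][1] = end
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]


# Three toy aHaps that differ only at SNP index 0 and SNP index 8.
ahaps = ["AACGTTAGCA", "AACGTTAGCA", "TACGTTAGGA"]
print(shared_regions(ahaps, window=3, step=1))
```

In this toy input the aHaps agree on SNP indices 1 to 7, so the shared windows merge into one candidate segment bounded by the two mismatching positions.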
For a disease with incomplete penetrance, we cannot determine whether the asymptomatic healthy child is really disease free or not. As referred to above, we treat both the transmitted and un-transmitted haploids of their affected parent as paired aHaps. The two assumed aHaps are then compared to the assured aHaps. A true disease co-segregation haplotype fragment should be found in at least one of the two assumed aHaps. Regarding determination of a candidate region by window sliding, although paired aHaps are not as informative as the assured aHaps, they may still contribute to shortening the linked regions and identifying whether or not the child carries the disease targeted haplotype.
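The "at least one of the two" rule for paired aHaps can be sketched as a comparison against the assured aHaps on a candidate interval. The names below and the string representation of haploids are assumptions made for illustration.

```python
def matching_members(assured_ahaps, pair, start, end):
    """Return the members of an assumed aHap pair that agree with the
    assured aHaps on the SNP index interval [start, end)."""
    reference = assured_ahaps[0][start:end]
    if any(h[start:end] != reference for h in assured_ahaps):
        raise ValueError("assured aHaps are not identical on this interval")
    return [h for h in pair if h[start:end] == reference]


def region_is_supported(assured_ahaps, pair, start, end):
    """A candidate region survives if at least one pair member matches."""
    return len(matching_members(assured_ahaps, pair, start, end)) >= 1


assured = ["ACGTAC", "ACGTAC"]
pair = ("ACGTTT", "GGGTAC")  # transmitted, untransmitted (toy data)

print(region_is_supported(assured, pair, 0, 6))
print(matching_members(assured, pair, 0, 4))
```

In this toy case neither member of the pair matches the full interval, but the first member matches SNP indices 0 to 3, so the pair shortens the candidate region. If the matching member is the transmitted haploid, the child is inferred to carry the disease-linked haplotype.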
Using the above method to analyze a disease caused by fragment deletion may result in two linked regions separated by a homozygous region caused by the deletion. For large deletions (> 500 Kb), such a result may lead to confusion or an incorrect conclusion. Therefore, Haplo2Ped provides an LOH (loss of heterozygosity) test to detect large fragment deletions.
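The details of the LOH test are not given here. As a rough illustration of the idea only, a hemizygous deletion appears in SNP genotype data as an unbroken run of homozygous calls, so one simple heuristic is to report runs longer than a length threshold. Everything in the sketch below (function name, genotype encoding as two-character strings, the run-length criterion) is an assumption, not Haplo2Ped's method; long homozygous runs can also arise without any deletion.

```python
def long_homozygous_runs(positions, genotypes, min_length_bp=500_000):
    """Report runs of consecutive homozygous calls longer than min_length_bp.

    positions: sorted SNP coordinates (bp) on one chromosome.
    genotypes: two-character calls such as "AA" or "AG", same order.
    Returns a list of (first_bp, last_bp) for each long run.
    """
    runs = []
    run_start = None
    run_end = None
    for pos, call in zip(positions, genotypes):
        if call[0] == call[1]:
            if run_start is None:
                run_start = pos
            run_end = pos
        else:
            # A heterozygous call closes the current run, if any.
            if run_start is not None and run_end - run_start > min_length_bp:
                runs.append((run_start, run_end))
            run_start = None
    if run_start is not None and run_end - run_start > min_length_bp:
        runs.append((run_start, run_end))
    return runs


positions = [0, 200_000, 400_000, 600_000, 800_000, 1_000_000]
genotypes = ["AG", "AA", "GG", "CC", "TT", "AG"]
print(long_homozygous_runs(positions, genotypes))
```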
Table 1. Simulated linkage regions and the regions detected by Haplo2Ped. Columns: Expected region (bp); Detected region (bp).
To compare the efficiency of Haplo2Ped with other existing software, we submitted the same simulated data to Merlin, SNPLINK, and SNP4Linkage. The output results are listed in Table 1. Merlin reported the six regions co-segregating with the disease with a LOD score of around 1.78, which was the maximum value across the whole genome. Four of these six reported regions were smaller than the expected regions, indicating that some regions that might harbor the disease-causal mutation were missed. Such a low LOD score could not reflect the real level of linkage between the disease-causal regions and the disease phenotype. In addition to the six simulated regions, Merlin detected three other regions with LOD scores of around 1.78 (Table 1). These false positive results could add to the difficulty in locating the disease-causing mutations in real studies. Moreover, Merlin reported the LOD score of every individual SNP. The LOD scores of SNPs on the border of a linked region usually rise gradually from the low values of unlinked regions to the high values of linked regions, or fall gradually in the opposite direction. Thus, another concern is that it is usually difficult for users to determine the borders of the regions detected by Merlin.
SNPLINK did not report LOD scores in the final output files although it showed good accuracy on four regions co-segregating with the disease. Furthermore, SNPLINK missed some regions on the left edge of two expected regions on chromosomes 1 and 13. The results from SNP4Linkage were the same as SNPLINK. There were no false positive regions detected by these two programs.
In a real study of a digital-anomaly family, we applied Haplo2Ped to SNP genotype data from 13 family members for the linkage analysis by haplotype, and successfully located the linkage region. Further study determined the mutation of the causative gene. Comparisons of Haplo2Ped and other existing software using the data from the real study are listed in additional file 2: Software comparisons using real data. All the software packages successfully located the disease-causing region, while Merlin reported more false positive regions.
To evaluate the false positive rate of Haplo2Ped, we simulated genotype data sets for thirty pedigrees with one causal mutation each using an in-house developed Perl script (packaged with Haplo2Ped). Each data set was analyzed by both Haplo2Ped and Merlin. Merlin reported significantly more false positive regions than Haplo2Ped (Figure S1 in additional file 3: Evaluation of false positive rate of Haplo2Ped with completely simulated genotype data), indicating that using highly heterozygous haplotypes as markers filters false positive regions more efficiently than using individual SNPs alone.
The haplotype-sharing scanning of aHaps is the most important step in Haplo2Ped. For dominant diseases, the main point is to confirm whether the disease haploid is transmitted or not. In the case of recessive diseases, two haplotypes of the affected individual are both aHaps. Additionally, for either a dominant or recessive model, Haplo2Ped is only suitable for one-disease-founder cases (i.e. a disease with complete homogeneity). Two or more disease founders would result in more than one type of disease haplotype for the family, which could lead to either loss of linkage signals, or generate false positives. Haplo2Ped analysis is based on deduced parental haplotypes; therefore, in cases where one parent is missing in a nuclear family, it is still applicable for linkage study.
Our simulation analysis showed that Haplo2Ped was consistently accurate in pinpointing the regions co-segregating with the disease. It did not miss any expected regions, while other software reported biased results, especially on the left edge of certain regions. Given the limited recombination events accumulated in a pedigree, both the disease-causing mutation and the neighboring SNPs in a shared haplotype co-segregate with the disease phenotype. However, when a disease-causing haplotype is transmitted to offspring, recombination occurs at random sites of this haplotype, indicating that the disease-causing mutation also probably locates at the edge of our assumed regions. If any expected regions are missed, the risk of not locating the final mutation is increased.
A gain of LOD score using haplotypes as markers in our tool demonstrated an advantage over Merlin, which is based on classical maximum-likelihood methods. Employing haplotypes with high heterozygosity as markers avoided the false positive results generated by Merlin, which is subject to the low heterozygosity of individual SNPs. Furthermore, the LOD scores of the SNPs reported by Merlin in the assumed regions usually vary over a wide range. Many SNPs even show lower LOD scores than those in the unlinked regions. This adds to the difficulty of locating the linked regions. Thus, we suggest that our method of combining the heterozygosity of multiple SNPs and the breakpoints of recombination (borders of co-segregating haplotypes) better reflects the stable strength of a linkage region compared to a method that only uses the heterozygosity of individual SNPs.
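For readers unfamiliar with the quantity being compared, the LOD score in the simplest textbook setting (phase-known, fully informative meioses) is log10 of the likelihood at recombination fraction theta divided by the likelihood at theta = 0.5. The sketch below implements only that textbook case; it is not the likelihood model of Haplo2Ped or Merlin, and the function name is an assumption.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD(theta) = log10[ theta^k (1-theta)^(n-k) / 0.5^n ]
    for k recombinants among n phase-known informative meioses."""
    if not 0 <= recombinants <= meioses:
        raise ValueError("need 0 <= recombinants <= meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant is impossible under complete linkage.
        return float("-inf")
    log_l_theta = (meioses - recombinants) * math.log10(1.0 - theta)
    if recombinants > 0:
        log_l_theta += recombinants * math.log10(theta)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Each additional non-recombinant informative meiosis adds log10(2) at theta = 0.
print(lod_score(0, 10, 0.0))
print(lod_score(1, 10, 0.1))
```

The sketch shows why marker informativeness matters: a meiosis contributes to the score only if the marker allows the transmitted haplotype to be distinguished, which is more often the case for a highly heterozygous haplotype marker than for a single SNP.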
Another advantage of Haplo2Ped is its capability of dealing with diseases that exhibit incomplete penetrance, a model that is not included in software such as SNPLINK. Using simulated data with incomplete penetrance, although Merlin reported expected linkage regions similar to those of Haplo2Ped (additional file 4: Software comparison with simulation data with incomplete penetrance), it generated three false positive regions, while Haplo2Ped reported none. Performance on the data from the real study with incomplete penetrance and the simulated genotype data of thirty pedigrees also showed that Merlin reported more false positive regions than Haplo2Ped. In addition, using the notion of shared affected haploids among affected individuals instead of traditional algorithms, such as the Elston-Stewart and the Lander-Green algorithms, means that Haplo2Ped is not restricted by the number of markers or family members. The successful application of Haplo2Ped to a real study demonstrated its power in detecting the regions harboring the disease-causing mutation.
The haplotype-sharing analysis is sensitive to mis-genotyped SNPs, which may generate false breakpoints in the haplotype fragments. To prevent such errors, we use a window sliding method to scan the genome. For the haplotype fragment in each window, we determine if it is homogeneous among all aHaps with a certain tolerance. For example, we set the level of inconsistent SNPs to less than 5% of the total in the above analysis. When the window steps into the linkage region, the ratio of inconsistent SNPs should largely decrease and when the window steps into the recombination free region, the ratio quickly increases to above 5%. As the real ratio of mis-genotyped SNPs is usually unknown or is different in different genomic regions, we suggest a threshold of 5% be set as the mis-genotype tolerance. Generally, a 5% typing error is much higher than the true ratio in experiments, and it would generate a candidate region slightly larger than the real linkage region as seen in our example. Despite a relatively conservative setting, the introduction of false breakpoints by mis-genotyped SNPs should be prevented.
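The tolerance check for a single window can be sketched as the fraction of SNP positions at which the aHaps disagree, compared against the 5% threshold mentioned above. The function names and the string representation of aHaps are assumptions for illustration.

```python
def inconsistent_ratio(ahaps, start, end):
    """Fraction of SNP positions in [start, end) where the aHaps
    do not all carry the same allele."""
    if end <= start:
        raise ValueError("window must contain at least one SNP")
    mismatches = sum(
        1 for i in range(start, end) if len({h[i] for h in ahaps}) > 1
    )
    return mismatches / (end - start)


def window_is_shared(ahaps, start, end, tolerance=0.05):
    """Treat a window as shared if fewer than `tolerance` of its SNPs
    are inconsistent (5% by default, as in the analysis above)."""
    return inconsistent_ratio(ahaps, start, end) < tolerance


clean = "A" * 40
one_error = "A" * 39 + "C"       # 1 of 40 SNPs inconsistent (2.5%)
three_errors = "C" * 3 + "A" * 37  # 3 of 40 SNPs inconsistent (7.5%)

print(window_is_shared([clean, one_error], 0, 40))
print(window_is_shared([clean, three_errors], 0, 40))
```

With a 40-SNP window, a single mis-genotyped SNP stays under the threshold and does not break the shared segment, while three inconsistent SNPs exceed it and end the segment.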
The new software, named Haplo2Ped, which uses haplotype fragments as mapping markers in whole genome linkage analysis, has been developed. Comparison with other programs by simulation tests and successful application in a real study demonstrated its high efficiency and reliability. Haplo2Ped is not restricted by the number of markers or family members. Moreover, it also provides LOH (loss of heterozygosity) detection for pedigrees in which fragment deletion causes the disease. We propose that haplotype fragments could be powerful genomic markers in linkage analysis.
Availability and requirements
Software name: Haplo2Ped
Software home page: http://bighapmap.big.ac.cn/software.html
Operating system(s): Windows or Linux
Programming language: Matlab platform
Other requirements: None
We thank Dr. Jurg Ott for his helpful suggestions. We are grateful to all laboratory members who provided advice for this work.
Funding: This work was supported by grants from the National Natural Science Foundation of China (30890031), the Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-YW-R-72), and the Ministry of Science and Technology (2006AA02Z19D) to CZ.
- Lander ES, Schork NJ: Genetic dissection of complex traits. Science 1994, 265(5181):2037–2048. doi:10.1126/science.8091226
- Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 1995, 11(3):241–247. doi:10.1038/ng1195-241
- Altshuler D, Daly MJ, Lander ES: Genetic mapping in human disease. Science 2008, 322(5903):881–888. doi:10.1126/science.1156409
- Paterson AH, Lander ES, Hewitt JD, Peterson S, Lincoln SE, Tanksley SD: Resolution of quantitative traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 1988, 335(6192):721–726. doi:10.1038/335721a0
- Ozcelik T, Akarsu N, Uz E, Caglayan S, Gulsuner S, Onat OE, Tan M, Tan U: Mutations in the very low-density lipoprotein receptor VLDLR cause cerebellar hypoplasia and quadrupedal locomotion in humans. Proc Natl Acad Sci USA 2008, 105(11):4232–4236. doi:10.1073/pnas.0710010105
- Sun M, Li N, Dong W, Chen Z, Liu Q, Xu Y, He G, Shi Y, Li X, Hao J, et al.: Copy-number mutations on chromosome 17q24.2-q24.3 in congenital generalized hypertrichosis terminalis with or without gingival hyperplasia. Am J Hum Genet 2009, 84(6):807–813. doi:10.1016/j.ajhg.2009.04.018
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002, 30(1):97–101. doi:10.1038/ng786
- Webb EL, Sellick GS, Houlston RS: SNPLINK: multipoint linkage analysis of densely distributed SNP data incorporating automated linkage disequilibrium removal. Bioinformatics 2005, 21(13):3060–3061. doi:10.1093/bioinformatics/bti449
- Lin G, Wang Z, Wang L, Lau YL, Yang W: Identification of linked regions using high-density SNP genotype data in linkage analysis. Bioinformatics 2008, 24(1):86–93. doi:10.1093/bioinformatics/btm552
- Gudbjartsson DF, Jonasson K, Frigge ML, Kong A: Allegro, a new computer program for multipoint linkage analysis. Nat Genet 2000, 25(1):12–13. doi:10.1038/75514
- Cheng F, Chen W, Richards E, Deng L, Zeng C: SNP@Evolution: a hierarchical database of positive selection on the human genome. BMC Evol Biol 2009, 9:221. doi:10.1186/1471-2148-9-221
- Ott J: Some statistical properties of the lod method and the method of scoring known recombination events in linkage analysis. Cytogenet Cell Genet 1978, 22(1–6):702–705. doi:10.1159/000131057
- Cheng F, Ke X, Lv M, Zhang F, Li C, Zhang X, Zhang Y, Zhao X, Wang X, Liu B, et al.: A novel frame-shift mutation of GLI3 causes non-syndromic and complex digital anomalies in a Chinese family. Clin Chim Acta 2011, 412(11–12):1012–1017. doi:10.1016/j.cca.2011.02.007
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.