Linked region detection using high-density SNP genotype data via the minimum recombinant model of pedigree haplotype inference
- Lusheng Wang^{1},
- Zhanyong Wang^{1} and
- Wanling Yang^{2}Email author
https://doi.org/10.1186/1471-2105-10-216
© Wang et al; licensee BioMed Central Ltd. 2009
Received: 16 December 2008
Accepted: 15 July 2009
Published: 15 July 2009
Abstract
Background
With the rapid development of high-throughput genotyping technologies, efficient methods for identifying linked regions using high-density SNP genotype data have become more and more important. Recently, a deterministic method that works very well on SNP genotyping data has been developed (Lin et al. Bioinformatics 2008, 24(1): 86–93). However, that program can only work on a limited number of family structures. In particular, the results (if any) will be poor when the genotype data for the whole chromosome of one of the parents in a nuclear family is missing.
Results
We have developed a software package (LIden) for identifying linked regions using high-density SNP genotype data. We focus on handling the case where the genotype data for the whole chromosome of one of the parents in a nuclear family is missing. We use the minimum recombinant model for haplotype inference in pedigrees. Several local optimization algorithms are used to infer the haplotype of each individual and determine the linked regions based on the inferred haplotype data. We have developed a more flexible method to combine nuclear families to further refine (reduce the length of) the linked regions.
Conclusion
Our new package (LIden) is efficient software for linked region detection using high-density SNP genotype data. LIden can handle some important cases where the existing programs do not work well. In particular, the new package can handle many cases where the genotype data of one of the two parents is missing for the entire chromosome. The running time of the program is O(mn), where m is the number of members in the family and n is the number of SNP sites in the chromosome. LIden is specifically suitable for handling big sized families. This research also demonstrates another practical use of the minimum recombinant model for haplotype inference in pedigrees.
The software package can be downloaded at http://www.cs.cityu.edu.hk/~lwang/software/Link.
Background
With the completion of the human genome sequencing project and the development of HapMap project [1], our understanding of human genomic sequences has been extended dramatically. Due to the development of SNP genotyping technology, genotyping of hundreds of thousands of single nucleotide polymorphism (SNP) markers in a high-throughput format has become a routine job in many labs.
Compared to classical genotyping methods mainly using microsatellite markers, SNP genotyping is faster and easier. It provides complete coverage of the genome and much more information on covered regions. Linkage analysis is a method to identify genomic regions that cosegregate with an inherited disease in a family and to facilitate the eventual identification of the mutation in that region causing the disease. Leykin et al. in [2] and Sellick et al. in [3] demonstrated that high-density SNP genotype data, such as that from microarrays, can be used for large-scale and cost-effective linkage analysis. The main reason is that there will be a sufficient number of informative markers between any two recombination points and thus the allele sharing status among the family members can be precisely determined. Therefore, efficient programs are highly demanded for allele sharing determination that work on a large number of markers and big sized families.
Classical linkage analysis methods are designed for sparse microsatellite markers. They are mainly based on two algorithms, the Elston-Steward algorithm that is limited by the number of total markers used [4] and the Lander-Green algorithm that is limited by the total number of individuals in a family [5]. As a result, they either cannot handle genotype data based on large number of SNPs at all or they cannot handle families of a large size, especially together with large numbers of genotyped SNPs, due to memory constraint.
Recently, a deterministic method that works very well on SNP genotyping data [6] has been developed. This was one of a series of efforts to develop software that is particularly suitable for SNP genotyping data and runs in time linear to both the number of SNP sites and the number of family members. However, the program in [6] can only work on a limited number of family structures. Here we use the minimum recombinant model for haplotype inference in pedigrees and develop a set of algorithms to minimize the total number of recombinants and produce a software package that works on a much wider range of family structures. Extensive simulations on Affymetrix Human Mapping 50 K/250 K GeneChips showed that the new package can correctly identify the linked regions on a wide range of family structures. In particular, the new package outperforms the old program in many important cases where the genotype data of one of the parents is missing on the entire chromosome. This research also demonstrates another practical use of the minimum recombinant model for haplotype inference in pedigrees.
Implementation
We use the minimum recombinant model to infer the haplotype configuration for all the family members. In 2002, Qian and Beckman [7] proposed a model to reconstruct haplotype configurations from genotype data in a pedigree under the Mendelian law of inheritance. In this model, the resulting haplotype configurations should have the minimum number of recombinants (i.e. recombination events).
Minimum Recombinant Haplotype Configuration
Given a pedigree and the genotype information for each member, the object is to find a haplotype configuration such that the total number of recombinants in the whole pedigree is minimized.
The problem is called Minimum Recombinant Haplotype Configuration (MRHC). The MRHC problem was proved to be NP-hard by Li and Jiang [8]. Lots of algorithms have been proposed. Some algorithms run in time exponential in terms of the number of SNP sites and some algorithms run in time exponential in terms of the number of family members [9]. An integer linear programming approach was proposed to handle incomplete genotype data [10].
Linkage analysis aims to identify regions whose allele is shared by all or most diseased members in a family but by none or few normal family members. In dominant inheritance situations, sharing of one mutation allele can cause a disease phenotype. In recessive cases, sharing of two disease alleles in that region is necessary for there to be a diseased status. We will first design algorithms to infer the allele sharing status with minimum recombinants and then use an algorithm to find the linked regions(regions shared by all or most of the diseased individuals but not shared by any normal individuals) by possibly changing the inferred allele sharing status.
Algorithm
Our package contains a set of (heuristic) algorithms to handle various kinds of situations and sometimes for one case, we use two (local optimization) algorithms to iteratively refine the output. Before we discuss the algorithms, we give the basic data structures used in the package.
The basic data structures
The possible values of each element in the five arrays.
Array Name | Possible Values of each element |
---|---|
genotype | {AA}, {AB}, {BB} |
paternal-h | A, B |
maternal-h | A, B |
which-p | 0, 1 |
which-m | 0, 1 |
I.which-p [i] = 0 indicates that the haplotype I.paternal-h [i] of individual I is from his/her father's 0-th haplotype and I.which-p [i] = 1 indicates that the haplotype I.paternal-h [i] of individual I is from his/her father's 1-th haplotype. Similarly, I.which-m [i] = 0 indicates that the haplotype I. maternal-h [i] of individual I is from his/her mother's 0-th haplotype and I.which-m [i] = 1 indicates that the haplotype I. maternal-h [i] of individual I is from his/her mother's 1-th haplotype. The main purpose here is to keep a record of which grandparent the allele came from.
The algorithms for nuclear families with data available for both parents
The basic algorithm
In our basic algorithm, we consider a site at a time. Suppose that paternal [i], maternal [i], which-p [i] and which-m [i] of each individual at site i have been fixed. For site i + 1, there are (at most) 4 different haplotype configurations of the two parents fitting the given genotype data. For each of the 4 haplotype configurations of the two parents, we can fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] for every child C_{ j }(j = 1, 2,..., n) such that the number of break points of child C_{ j }between site i and i + 1 is minimized. Note that, for each child C_{ j }, the number of breakpoints between site i and site i + 1 could be 0, 1 or 2. When two choices are equally good for child C_{ k }, we arbitrarily select one. We then choose one of the 4 haplotype configurations of the parents at site i + 1 such that the total number of break points for all the n children between site i and site i + 1 is minimized. It is worth pointing out that the basic algorithm here considers all the children site by site while the old algorithm in [6] considers all the sites for every child and then handles the children of the nuclear family one by one. Clearly, the quality of the solution at site i + 1 heavily depends on the quality of the solution at site i. Thus, we will first use a method to select a starting site that generates a good solution and then we use the algorithm to compute the solution to the left and right ends of the chromosome, respectively.
Finding a good starting site
Here we try to find a site, where the haplotypes for all individuals are uniquely determined according to the given genotype data. This can be done if both genotypes of the parents are "AB", and every child's genotype is "AA" or "BB". If there is no such site, we look for the "second best" site, where some of the individuals' haplotypes can be partially fixed. The second best site is a site where the genotype of one parent is "AB", and the genotype of each child is "AA" or "BB". In this case, we can uniquely determine the haplotypes for all children, but partially determine the inheritance information, i.e., one parent has genotype "AA" ("BB") and the inherited "A" ("B") of a child from this parent could be any one of "AA" ("BB"). For this case, we arbitrarily give a solution which fits the genotype data. The third best site is a site where both genotypes of father and mother are "AB", and the genotypes of some (but not all) children are "AA" or "BB". In this case, we can fix the haplotypes of those children with "AA" or "BB" correctly. But for a child with genotype "AB", we have a risk of making mistakes. Again, in this case, we arbitrarily give a solution which fits the genotype data. The worst case is that both genotypes of father and mother are "AA" or "BB". In this case, we cannot fix the inheritance source for any child. In practice, we can always find a site which is one of the first three types.
Note that there is no guarantee that we can always find a starting site with uniquely determined haplotypes. When the solution on the starting site is wrong, our algorithm may produce a short segment with many breakpoints. Whenever such a segment is found, our algorithm will re-calculate the solution of the segment using the reverse order and thus another starting point.
The horizontal local optimization algorithm
After we obtain a haplotype solution from the basic algorithm, we can look at three individuals, two parents and one of their children C_{ j }, at a time. Assuming that the haplotypes of the two parents are fixed, the number of break points in C_{ j }might be further reduced if we change the haplotypes of C_{ j }. This is due to the existence of multiple solutions at a site and the fact that the haplotype solution at site i heavily depends on that of its previous site. Thus, we use an algorithm that can give a haplotype of C_{ j }with minimum number of break points when the haplotypes of the two parents are fixed (by the basic algorithm). In this way, the total number of break points can be reduced. Let D^{ pq }(i) be the minimum number of breakpoints of C_{ j }for the first i sites such that at site i the paternal haplotype is from the p-th haplotype of the father and the maternal haplotype is from the q-th haplotype of the mother, where p = 0, 1 and q = 0, 1. Then D^{ pq }(i + 1) can be computed based on D^{00} (i), D^{01} (i), D^{10} (i) and D^{11} (i). For example, the value of D^{00} (i + 1) can be one of D^{00} (i), D^{01} (i) + 1, D^{10} (i) + 1 or D^{11} (i) + 2. We can check each of the cases and see if the genotype of C_{ j }at site i + 1 can fit each of the four configurations under the Mendelian law of inheritance. Among all the possible configurations, we choose the one corresponding to the minimum value. The running time of the algorithm is O(n), where n is the number of sites in the chromosome.
We apply the horizontal local optimization algorithm to each of the n children in the nuclear family one by one in an arbitrarily fixed order. (The order among the children does not affect the results.)
The whole algorithm for a nuclear family
- 1.
Find a good starting point as described.
- 2.
Use the basic algorithm to get a solution for all individuals in the nuclear family.
- 3.
Identify a short segment with a large number of breakpoints and apply the basic algorithm to this segment to re-calculate using the inverse order (thus a different starting point). If we can reduce the total number of break points then we use the new solution for this segment.
- 4.
Use the horizontal local optimization algorithm for each child to refine the solution.
The algorithms for nuclear families with data available for single parents
Now, we consider the case, where the genotype data of one of the parents in the nuclear family is unknown over the entire chromosome. To handle this case, the basic idea is similar to that for nuclear families with data available for both parents. For the basic algorithm, we guess the haplotype of the unknown parent whenever needed. Since each individual has two copies of haplotypes on each chromosome, there are four different haplotype configurations at each site. The two haplotypes for an individual can be AA, AB, BA, and BB. Thus, we can modify the basic algorithm to handle this case, where the genotype data for one parent is missing. In the basic algorithm, instead of considering at most 4 configurations of the two parents, we consider at most 8 configurations, where the unknown parent has 4 configurations, and the known parent has at most 2 configurations. Similarly, for each of the 8 configurations of the two parents, we can fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] for every child C_{ j }(j = 1, 2,..., n) such that the number of breakpoints of child C_{ j }between site i and i + 1 is minimized. The rest of the algorithm remains the same.
The algorithm for a family with three or more generations
To handle a family with three generations, we will view the set of all individuals in the first and second generations as a nuclear family which is referred to as the main nuclear family. For any child C_{ j }in the main nuclear family, if C_{ j }has his/her own children (third generation individuals), then we will view C_{ j }as a super child representing the second generation nuclear family including C_{ j }, C_{ j }'s spouse and all their children. The basic algorithm for a family with three generations is similar to the basic algorithm for a nuclear family. Here we focus on the main nuclear family and give some special treatment for super children.
The basic algorithm for a family with three generations
Again, we consider a site at a time. Suppose that paternal-h [i], maternal-h [i], which-p [i] and which-m [i] of each individual in the main nuclear family for site i are fixed. Let us consider site i + 1. If the genotype data is known for both parents, there are (at most) 4 different haplotype configurations of the two parents (first generation individuals) fitting the given genotype data. If the genotype data of a parent is missing, there are at most 8 different haplotype configurations of the two parents. Let C_{1}, C_{2},..., C_{ n }be the n children in the second generation and some of them might be super children. For each possible haplotype configurations of the two parents (first generation individuals), we try to fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] for every child C_{ j }(j = 1, 2,..., n) as follows:
A1: if C_{ j }is not a super child, we fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] such that the number of breakpoints of child C_{ j }between site i and i + 1 is minimized. Note that for each child C_{ j }, the number of breakpoints between site i and site i + 1 could be 0, 1 or 2.
A2: if C_{ j }is a super child, we fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] such that the number of breakpoints in the second generation nuclear family C_{ j }represented by C_{ j }between site i and i + 1 is minimized. Here if the genotype data for both C_{ j }and C_{ j }'s spouse is given, there are at most two choices for each of C_{ j }and C_{ j }'s spouse. We can call the basic algorithm to get . Note that could be greater than 2.
Again, when several choices are equally good for child C_{ k }, we arbitrarily select a choice.
Among all the possible haplotype configurations of the parents (first generation individuals) at site i + 1, we select one such that the total number of breakpoints for all the individuals in the three-generation family between site i and site i + 1 is minimized. This process can be used recursively to handle more than three generations. In fact, for our package, there is no limit on the number of generations in the family. Clearly, if the genotype data of both parents in all nuclear families is known, the running time of the basic algorithm is O(mn), where m is the total number of individuals in the whole family and n is the number of SNP sites in the chromosome.
Dealing with missing individuals in the second generation
In this subsection, we deal with the cases where the genotype data for one of the parents in a second generation nuclear family is missing. The algorithm is similar to the basic algorithm for a family with three generations. Let C_{ j }be a super child. Two cases arise.
The genotype data for C_{ j }'s spouse is missing
For this case, when we try to fix C_{ j }.paternal-h [i + 1], C_{ j }.maternal-h [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] such that the number of breakpoints in the second generation nuclear family represented by C_{ j }between site i and i + 1 is minimized as in A2, we have to guess the haplotyes of C_{ j }'s spouse by trying all possible haplotypes AA, AB, BA and BB. When the genotype data for C_{ j }'s spouse is given, there are two choices. This will slightly slow down the program. Moreover, if there are more than one second generation nuclear families missing the super child's spouse's genotype data, the speed will not be affected too much, since children in the second generation are processed one by one.
The genotype data for C_{ j }is missing
For this case, when we try to fix C_{ j }.paternal [i + 1], C_{ j }.maternal [i + 1], C_{ j }.which-p [i + 1] and C_{ j }.which-m [i + 1] such that the number of breakpoints in the second generation nuclear family represented by C_{ j }between site i and i + 1 is minimized as in A2, we have to guess the haplotyes of C_{ j }by trying all possible haplotypes AA, AB, BA and BB without genotype data. Again, the running time is still O(m × n) since children in the second generation are processed one by one.
Remarks
For the algorithm in [6], a top-down approach is used to deal with three-generation families. The algorithm processes nuclear families (with two generations) one by one from the top to the bottom. The old approach cannot give good solutions when the size of the main nuclear family is small. The reason is that the quality of solutions heavily depends on the sizes of nuclear families and if the size of the main nuclear family is small, the obtained haplotypes for the (super) children in the second generation could be wrong and this wrong information will be passed to the processing of the third generation. A better strategy is to work on big nuclear families first. However, even this strategy is not as good as our approach here since we do not fix the solution of super children in the second generation. We have observed that the inferred haplotypes for big sized second generation families such as two parents and 5 children could be wrong though the breakpoint positions are very accurate. If the wrong haplotyes are used to handle other nuclear families, the whole linked region could be missed.
The current version of our program works for any number of generations. It can handle the case, where the genotype data for one of a couple is missing.
Genotype data error correction
For large-scale SNP genotyping, a certain number of experimental errors is unavoidable. We observe that when genotype data errors occur, the inferred haplotypes contain many breakpoints that are close to each other. In order to get the correct allele sharing status, our algorithm will simply delete both breakpoints that are within x SNP sites. We suggest setting the value of x based on the error rate. When the error rate is smaller than 0.1%, x = 5. When the error rate is between 0.1% and 0.3%, x = 8. When the error rate is between 0.3% and 0.5%, x is set to be 10. When the error rate cannot be estimated, we simply use x = 8.
Identifying mutation regions
After obtaining the inferred haplotype data for all the individuals in the family, we can find possible mutation regions by looking at the SNP sites one by one. The possible mutation regions should be regions shared by all or most of the diseased family members (considering phenocopy) but none or few of the normal family members (considering penetrance). Those regions are reported as suspected mutation regions. Due to the existence of multiple solutions and haplotype inference error, it is possible that the reported regions do not completely overlap the true mutation region. In our algorithm, we use a subroutine to extend the suspected mutation regions site by site in both directions, where we can revise the haplotype allele such that the allele is shared by diseased family members, but not shared by the normal family members. Unlike the program in [6], here we have to extend in both directions.
The current version of the package reports all the possible regions. The users are asked to input the maximum number of normal individuals to be allowed to share the mutation allele (allowing for penetrance) and the maximum number of diseased individuals in the family to be allowed not to share the potential mutation region (allowing for phenocopy).
Results and Discussion
In this section, we will test our software package using simulated data. We have considered a wide range of pedigree structures.
Generating haplotype data using the Chi-square model
In order to test the program, we generated haplotype datasets based on the Chi-square model to see if our program can infer the haplotype data correctly from the corresponding genotype data. The founder haplotypes were obtained from real datasets (Affymetrix Human Mapping 50 K/250 K GeneChips [11]), and children haplotypes were generated through random inheritance of paternal/maternal alleles using the Chi-square model for recombination with m equals 4 [6, 12, 13] and according to male/female averaged genetic map for chromosome 1 downloaded from HapMap [14]. The simulation process is identical to that of [6, 15]. When disease status was considered, a mutation was randomly assigned to be close to a SNP site (called mutation site), and the diseased individuals were forced to inherit the mutation strand and the normal individuals were forced not to inherit the mutation strand. This process is done generation by generation.
Nuclear families with data available for both parents
The experimental results for nuclear families when the genotype data for both parents is available.
2 children | 3 children | 4 children | 5 children | 6 children | |
---|---|---|---|---|---|
No. of Breakpoints | 11.11 (10.59) | 15.63 (15.50) | 21.32 (21.28) | 26.61 (26.60) | 31.66 (31.60) |
precision | 0.47 (0.50) | 0.93 (0.98) | 0.95 (0.97) | 0.98 (0.99) | 0.98 (0.99) |
recall | 0.47 (0.48) | 0.91 (0.96) | 0.94 (0.97) | 0.96 (0.98) | 0.96 (0.98) |
length of real linked region (cM) | 76.66 | 59.51 | 40.82 | 35.97 | 29.90 |
length of found linked region (cM) | 78.70 (79.51) | 60.79 (60.77) | 42.15 (42.13) | 37.00 (37.01) | 30.97 (30.99) |
overlap/real | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
overlap/found | 0.97 (0.96) | 0.97 (0.97) | 0.96 (0.96) | 0.96 (0.96) | 0.95 (0.95) |
linked region recovery | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
Nuclear families with data available for single parents
The experimental results for nuclear families when the genotype data of only one parent is available and this parent is diseased.
2 children | 3 children | 4 children | 5 children | 6 children | |
---|---|---|---|---|---|
No. of Breakpoints | 3.68 (3.75) | 7.47 (14.59) | 10.32 (25.47) | 12.90 (34.83) | 15.48 (50.38) |
precision | 0.33 (0.39) | 0.78 (0.26) | 0.89 (0.23) | 0.94 (0.24) | 0.96 (0.21) |
recall | 0.23 (0.27) | 0.72 (0.48) | 0.86 (0.55) | 0.91 (0.63) | 0.92 (0.66) |
length of real linked region (cM) | 83.14 | 57.98 | 45.62 | 35.15 | 31.01 |
length of found linked region (cM) | 111.1 (91.45) | 65.08 (66.32) | 48.85 (49.38) | 37.40 (38.02) | 32.57 (32.80) |
overlap/real | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
overlap/found | 0.86 (0.94) | 0.93 (0.94) | 0.94 (0.93) | 0.93 (0.92) | 0.93 (0.92) |
linked region recovery | 1.00 (255/285) | 1.00 (229/285) | 1.00 (267/285) | 1.00 (276/285) | 1.00 (281/285) |
The experimental results for nuclear families when the genotype data of only one parent is available and this parent is normal.
3 children | 4 children | 5 children | 6 children | |
---|---|---|---|---|
No. of Breakpoints | 7.39 (13.47) | 10.29 (24.47) | 12.94 (35.69) | 15.23 (55.84) |
precision | 0.75 (0.31) | 0.89 (0.25) | 0.94 (0.24) | 0.96 (0.19) |
recall | 0.68 (0.53) | 0.86 (0.59) | 0.90 (0.65) | 0.92 (0.66) |
length of real linked region (cM) | 55.56 | 43.61 | 36.61 | 28.45 |
length of found linked region (cM) | 67.66 (59.25) | 43.90 (44.14) | 36.87 (35.83) | 29.23 (28.12) |
overlap/real | 0.99 (0.99) | 0.99 (0.98) | 0.99 (0.98) | 0.99 (0.97) |
overlap/found | 0.85 (0.94) | 0.99 (0.98) | 0.99 (0.99) | 0.98 (0.98) |
linked region recovery | 1.00 (275/285) | 1.00 (274/285) | 1.00 (274/285) | 1.00 (266/285) |
Complicated pedigrees
The experimental results for pedigrees P1 to P4.
P1 | P2 | P3 | P4 | |
---|---|---|---|---|
No. of Breakpoints | 20.49 | 20.20 | 14.20 | 20.38 |
precision | 0.58 | 0.77 | 0.49 | 0.57 |
recall | 0.63 | 0.72 | 0.43 | 0.53 |
length of real linked region (cM) | 34.69 | 36.42 | 43.89 | 41.28 |
length of found linked region (cM) | 45.84 | 47.28 | 62.88 | 43.25 |
overlap/real | 0.99 | 1 | 1 | 1 |
overlap/found | 0.83 | 0.80 | 0.75 | 0.95 |
linked region recovery | 1 | 1 | 1 | 1 |
The experimental results for pedigrees P5 to P8.
P5 | P6 | P7 | P8 | |
---|---|---|---|---|
No. of Breakpoints | 20.81 (31.60) | 26.37 (47.27) | 38.83 (108.4) | 44.65 (145.4) |
precision | 0.86 (0.45) | 0.96 (0.41) | 0.97 (0.26) | 0.90 (0.23) |
recall | 0.82 (0.66) | 0.94 (0.71) | 0.93 (0.71) | 0.92 (0.76) |
length of real linked region (cM) | 38.58 | 19.77 | 21.92 | 12.37 |
length of found linked region (cM) | 41.00 (43.28) | 22.98 (23.51) | 23.37 (24.57) | 16.04 (16.20) |
overlap/real | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
overlap/found | 0.93 (0.93) | 0.86 (0.84) | 0.91 (0.91) | 0.79 (0.78) |
linked region recovery | 1.00 (232/285) | 1.00 (266/285) | 1.00 (228/285) | 1.00 (265/285) |
The experimental results for pedigrees P9 to P12.
P9 | P10 | P11 | P12 | |
---|---|---|---|---|
No. of Breakpoints | 23.43 (41.07) | 18.46 (26.38) | 37.44 (92.49) | 20.68 (31.52) |
precision | 0.95 (0.40) | 0.92 (0.49) | 0.91 (0.29) | 0.94 (0.47) |
recall | 0.91 (0.67) | 0.89 (0.68) | 0.89 (0.70) | 0.91 (0.71) |
length of real linked region (cM) | 29.80 | 26.06 | 22.26 | 22.05 |
length of found linked region (cM) | 32.01 (36.72) | 28.26 (31.93) | 24.07 (24.53) | 29.78 (30.57) |
overlap/real | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
overlap/found | 0.90 (0.81) | 0.91 (0.83) | 0.90 (0.89) | 0.80 (0.78) |
linked region recovery | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (90/95) |
The experimental results for pedigrees P13 to P16.
P13 | P14 | P15 | P16 | |
---|---|---|---|---|
No. of Breakpoints | 23.56 (46.66) | 44.20 (124.0) | 31.37 (48.25) | 34.60 (69.26) |
precision | 0.97 (0.37) | 0.98 (0.27) | 0.96 (0.47) | 0.88 (0.38) |
recall | 0.94 (0.71) | 0.95 (0.75) | 0.93 (0.71) | 0.87 (0.75) |
length of real linked region (cM) | 21.36 | 18.22 | 21.41 | 18.40 |
length of found linked region (cM) | 25.61 (25.91) | 19.68 (20.42) | 25.12 (27.60) | 21.81 (22.39) |
overlap/real | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 0.99 (1.00) |
overlap/found | 0.86 (0.84) | 0.90 (0.89) | 0.88 (0.81) | 0.86 (0.85) |
linked region recovery | 1.00 (84/95) | 1.00 (75/95) | 1.00 (1.00) | 1.00 (1.00) |
A Case Study
Genotype data error correction
The experimental results for genotype data error correction using Affymetrix 50 K GeneChips data.
precision | recall | linked region recovery | overlap/real | overlap/found | |
---|---|---|---|---|---|
No error | 97.74% | 95.05% | 100% | 99.98% | 92.46% |
(69.62%) | (86.56%) | (100%) | (100%) | (91.78%) | |
0.1% error | 84.40% | 94.94% | 100% | 99.93% | 92.00% |
(67.19%) | (86.19%) | (100%) | (100%) | (91.13%) | |
0.5% error | 79.00% | 93.76% | 100% | 99.84% | 92.57% |
(61.81%) | (86.10%) | (100%) | (100%) | (91.94%) |
Comparison with Merlin
Comparison with Merlin using Affymetrix 50 K GeneChips.
Run time (s) | overlap/real | overlap/found | |
---|---|---|---|
0+3 | 16.717 (0.898) | 0.941 (0.805) | 0.605 (0.862) |
0+4 | 23.325 (1.126) | 0.928 (0.824) | 0.594 (0.930) |
0+5 | 30.634 (1.746) | 0.941 (0.901) | 0.587 (0.936) |
0+6 | 39.028 (3.901) | 0.987 (0.929) | 0.576 (0.969) |
2+3(4 bits) | 3.276 (1.340) | 1.000 (0.928) | 0.960 (1) |
2+4(6 bits) | 4.596 (1.735) | 0.9997 (0.937) | 0.967 (0.990) |
P5(7 bits) | 7.714 (2.325) | 0.999 (0.933) | 0.919 (0.979) |
P9(9 bits) | 7.365 (3.255) | 0.999 (0.879) | 0.926 (0.958) |
P10(11 bits) | 19.15 (8.829) | 0.999 (0.797) | 0.912 (0.968) |
P11(12 bits) | 11.37 (14.01) | 1.000 (0.938) | 0.924 (0.979) |
P12(13 bits) | 25.62 (31.57) | 1.000 (0.922) | 0.842 (0.958) |
P7(14 bits) | 13.58 (58.14) | 1.000 (0.949) | 0.933 (0.990) |
P15(14 bits) | 46.96 (43.15) | 1.000 (0.906) | 0.894 (0.979) |
P13(15 bits) | 23.72 (1282.05) | 1.000 (0.970) | 0.875 (1) |
P16(15 bits) | 37.93 (384.78) | 0.988 (0.950) | 0.877 (0.990) |
P14(16 bits) | 15.042 (NA) | 1.000 (NA) | 0.927 (NA) |
P6(17 bits) | 40.754 (NA) | 1.000 (NA) | 0.869 (NA) |
Comparison with Merlin using Affymetrix 250 K GeneChips.
Run time (s) | overlap/real | overlap/found | |
---|---|---|---|
0+3 | 67.680 (2.979) | 0.954 (0.975) | 0.590 (0.866) |
0+4 | 97.082 (4.067) | 0.965 (0.988) | 0.548 (0.955) |
0+5 | 124.43 (6.558) | 0.983 (0.994) | 0.607 (0.960) |
0+6 | 131.71 (12.50) | 0.994 (0.992) | 0.675 (0.986) |
2+3(4 bits) | 9.454 (7.122) | 1 (0.936) | 0.995 (1) |
2+4(6 bits) | 11.625 (8.133) | 1 (0.989) | 0.988 (1) |
P5(7 bits) | 16.662 (8.262) | 1 (0.892) | 0.983 (0.938) |
P9(9 bits) | 24.18 (16.65) | 1.000 (0.934) | 0.973 (1) |
P10(11 bits) | 38.54 (31.39) | 1.000 (0.797) | 0.970 (0.938) |
P11(12 bits) | 29.94 (415.18) | 1.000 (0.991) | 0.972 (1) |
P12(13 bits) | 49.28 (1237.7) | 1.000 (0.968) | 0.871 (1) |
P7(14 bits) | 31.611 (NA) | 1(NA) | 0.960(NA) |
P15(14 bits) | 103.269 (NA) | 1(NA) | 0.927(NA) |
P13(15 bits) | 54.114 (NA) | 1.000(NA) | 0.881(NA) |
P16(15 bits) | 141.266 (NA) | 0.991(NA) | 0.922(NA) |
P14(16 bits) | 52.911 (NA) | 1(NA) | 0.979(NA) |
P6(17 bits) | 63.984 (NA) | 1.000(NA) | 0.862(NA) |
Running time
For small sized families, both programs can generate results in a few seconds. When the sizes of the families and the number of the markers increase, the running time of our program increases linearly. For large families, Merlin requires really long running time. Most importantly, Merlin needs large memory space for big sized families and cannot successfully complete the computation for some pedigrees (see P14 and P6 in Table 10 and Table 11).
Output Quality
We again use "overlap/real" and "overlap/found" to indicate the quality of the computational results. Our program always clearly gives a computed candidate region for each input. Merlin calculates a LOD score for each marker. We evaluated the segment with the highest LOD score as the output linked region for Merlin. From Table 10 and Table 11, we can see that the results of our program are less than optimal when parental data is not available at all. When the family size becomes bigger, our program outperforms Merlin. Our program can quickly produce accurate linked regions when the family size is big while Merlin failed to execute on big sized families.
Conclusion
We have developed a software package that infers the haplotype allele sharing status for the members of a pedigree based on the minimum recombinants model. The running time of the program is linear in terms of the input size O(mn), where m is the total number of individuals in the whole family and n is the number of SNP sites in the chromosome. The new package can handle a wide range of pedigree structures. It works very well for cases where the genotype data of one parent is missing for the entire chromosome.
Availability and requirements
Project Name: CityU 121608
Project Homepage: http://www.cs.cityu.edu.hk/~lwang/software/Link
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.6.0 or higher
Licence: None
Any restrictions to use by non-academics: None
Declarations
Acknowledgements
We thank the referees for their helpful suggestions. Lusheng Wang is fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 121608].
Authors’ Affiliations
References
- The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449: 851–861. 10.1038/nature06258PubMed CentralView ArticleGoogle Scholar
- Leykin I, Hao K, Cheng J, Meyer N, Pollak M, Smith R, Wong W, Rosenow C, Li C: Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data. BMC Genet 2005, 6(1):7. 10.1186/1471-2156-6-7PubMed CentralView ArticlePubMedGoogle Scholar
- Sellick G, Longman C, Tolmie J, Newbury-Ecob R, Geenhalgh L, Hughes S, Whiteford M, Garrett C, Houlston R: Genomewide linkage searches for mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Acids Res 2004, 32: e164. 10.1093/nar/gnh163PubMed CentralView ArticlePubMedGoogle Scholar
- Elston R, Stewart J: A general model for the genetic analysis of pedigree data. Hum Hered 1971, 21(6):523–542. 10.1159/000152448View ArticlePubMedGoogle Scholar
- Lander E, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 1987, 84: 2363–2367. 10.1073/pnas.84.8.2363PubMed CentralView ArticlePubMedGoogle Scholar
- Lin G, Wang Z, Wang L, Lau YL, Yang W: Identification of linked regions using high-density SNP genotype data in linkage analysis. Bioinformatics 2008, 24(1):86–93. 10.1093/bioinformatics/btm552View ArticlePubMedGoogle Scholar
- Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees. Am J Hum Genet 2002, 70: 1434–1445. 10.1086/340610PubMed CentralView ArticlePubMedGoogle Scholar
- Li J, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree. J Bioinform Comput Bio 2003, 1: 41–69. 10.1142/S0219720003000204View ArticleGoogle Scholar
- Doi K, Li J, Jiang T: Minimum recombinant haplotype configuration on tree pedigrees. Proc. of the 3rd Annual Workshop on Algorithms in Bioinformatics (WABI'03) 2003, 339–353.View ArticleGoogle Scholar
- Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. Journal of Computational Biology 2005, 12: 719–739. 10.1089/cmb.2005.12.719View ArticlePubMedGoogle Scholar
- Affymetrix Human Mapping GeneChips[http://www.affymetrix.com/products_services/arrays/specific/100k.affx]
- Broman K, Weber J: Characterization of human crossover interference. Am J Hum Genet 2000, 66(6):1911–1926. 10.1086/302923PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao H, Speed T, McPeek M: Statistical analysis of crossover interference using the chi-square model. Genetics 1995, 139(2):1045–1056.PubMed CentralPubMedGoogle Scholar
- International HapMap Project[http://www.hapmap.org]
- Yang W, Wang Z, Wang L, Sham P, Huang P, Lau Y: Predicting the number and sizes of IBD regions among family members and evaluating the family size requirement for linkage studies. European Journal of Human Genetics 2008, 16: 1535–1543. 10.1038/ejhg.2008.116View ArticlePubMedGoogle Scholar
- Abecasis G, Cherny S, Cookson W, Cardon L: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 2002, 30: 97–101. 10.1038/ng786View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.