- Software
- Open access
- Published:
Identifying mutation regions for closely related individuals without a known pedigree
BMC Bioinformatics volume 13, Article number: 146 (2012)
Abstract
Background
Linkage analysis is the first step in the search for a disease gene. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations leading to a disease phenotype. In this paper, we study a very important case, where the sampled individuals are closely related, but the pedigree is not given. This situation happens very often when the individuals share a common ancestor 6 or more generations ago. To our knowledge, no algorithm can give good results for this case.
Results
To solve this problem, we first developed some heuristic algorithms for haplotype inference without any given pedigree. We propose a model using the parsimony principle that can be viewed as an extension of the model first proposed by Dan Gusfield. Our heuristic algorithm uses Clarkâs inference rule to infer haplotype segments.
Conclusions
We ran our program both on the simulated data and a set of real data from the phase II HapMap database. Experiments show that our program performs well. The recall value is from 90% to 99% in various cases. This implies that the program can report more than 90% of the true mutation regions. The value of precision varies from 29% to 90%. When the precision is 29%, the size of the reported regions is three times that of the true mutation region. This is still very useful for narrowing down the range of the disease gene location. Our program can complete the computation for all the tested cases, where there are about 110,000 SNPs on a chromosome, within 20 seconds.
Background
Linkage analysis is the first step in the search for a disease gene. The aim is to find the rough location (a region in which the disease gene is) of the gene in the chromosome. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations leading to a disease phenotype [1]. The principle of linkage analysis is simple. All our chromosomes come in pairs, one inherited from the mother and the other from the father. Each pair of chromosomes contains the same genes in the same order, but the sequences are not identical. Thus it is possible to find out whether a particular sequence comes from the mother or father. These sequence variants are called maternal and paternal alleles. The key problem for linkage analysis is to infer the pairs of alleles and identify regions whose allele is shared by all or most of the diseased individuals but by none or few of the normal individuals.
Linkage analysis has been extensively studied in recent years. Almost all the existing methods are for families with clearly given pedigrees. The pedigree information helps a lot in the design of computational methods. Early approaches to linkage analysis were based on sparse microsatellite markers. With the new development of microarray techniques, high-density SNP genotype data can be used for large-scale and cost-effective linkage analysis [2, 3]. With high-density SNP genotype data, there exists a sufficient number of informative markers between every pair of recombination points, and the allele-sharing status among the family members can be unambiguously determined. Lots of new computer programs have been developed for dealing with high-density SNP genotype data.
There are two categories of existing approaches to linkage analysis, the probabilistic approaches and the deterministic approaches. In probabilistic approaches, recombinant rates are estimated in a way to maximize the likelihood of the observed data [4â7]. Software tools based on this kind of approach include GeneHunter [5], LINKAGE [8], Allegro [6], Merlin [7], etc. According to [2], these tools have different performances and efficiencies. Some of them (such as those based on the Elston-Steward algorithm [9]) do not work well when the number of markers is large, while the others (such as those based on the Lander-Green algorithm [4]) do not work well with large number of family members. Though tremendous improvement has been made to them through subsequent modifications [6, 7], this still remains a problem in practice. On the other hand, these tools can give very accurate results when the size of the pedigree is small.
Some deterministic approaches have been developed recently. The main idea is to infer the haplotype segments based on the input genotype data so that all or most of the diseased individuals share a segment that is shared by none of the normal individuals [10, 11]. The mathematical model used here is to minimize the total number of recombinants among all the individuals in the pedigree. Lots of algorithms for haplotype inference with a pedigree have been developed. Qian and Beckmann [12] and Tapader et al.[13] proposed a method to minimize the number of recombinants with a given pedigree. Zhang et al.[14] developed a program for general pedigrees assuming that there is no recombinant on the segment. Doi et al.[15] designed two algorithms for haplotype inference with a given pedigree. One of their algorithms works well when the number of marker loci is a fixed constant, while the other works well when the number of family members is bounded by a small constant. Li and Jiang [16, 17] proposed to use an integer linear programming approach for minimum recombinant configuration. Xiao et al.[18] designed a faster algorithm for the case where there is no recombinant. The algorithm in [10] uses a set of heuristics for haplotype inference with a given pedigree and can give very accurate results when the number of family members is large enough and for each nuclear family the genotype data for both parents are available. An extended software package (called LIden) was developed in [19] and it focuses on handling the case where the genotype data for the whole chromosome of one of the parents in a nuclear family are missing. It also uses the minimum recombinant model for haplotype inference in pedigrees.
Throughout this paper, we study the dominant inheritance situation, where sharing of one mutation allele can cause a disease phenotype. We deal with a very important case, where the sampled individuals are closely related, but the pedigree is not given. This situation happens very often in lots of villages in China when the individuals share a common ancestor 6 or more generations ago. Handling this case will be very helpful to identify some local genetic diseases in China. The situation also happens when studying wild animals, where the pedigree can not be identified. To our knowledge, no algorithm can give good results for this case. To solve this problem, we first developed some heuristic algorithms for haplotype inference without any given pedigree. We propose a model using the parsimony principle that can be viewed as an extension of the model first proposed in [20, 21]. Our heuristic algorithm uses Clarkâs inference rule [22] to infer haplotype segments. Experiments show that our program performs well. The recall value is from 90% to 99% in various cases. This implies that the program can report more than 90% of the true mutation regions. The value of precision varies from 29% to 90%. When the precision is 29%, the size of the reported regions is three times that of the true mutation region. This is still very useful for narrowing down the range of the disease gene location. Our program can complete the computation for all the tested cases, where there are about 110,000 SNPs on a chromosome, within 20 seconds.
Implementation
Our software is implemented in Java. It takes the genotype data on a chromosome as well as the disease status for a set of input individuals without any pedigree information as input, and outputs the predicted mutation regions. The software is platform independent. In the following, we show the methods we used and how we implemented our algorithm.
The problem
Suppose that there is a (hidden) pedigree containing many (e.g., 5 to 7) generations, where we only have the genotype data on a chromosome for the individuals in the latest generation (or the latest two generations). Here those individuals with given genotype data are referred to as the input individuals. For each input individual, we also know if such an individual is diseased or normal. An example is given in Figure 1, where there are five generations in the pedigree and we only have the genotype data for the individuals in the dashed rectangle at the bottom of the pedigree. In such a figure, a square represents a male, while a circle represents a female. Moreover, a filled square (respectively, circle) represents a diseased male (respectively, female), while an unfilled square (respectively, circle) represents a normal male (respectively, female). Furthermore, if two squares (respectively, circles) enclose the same number in the figure, then they correspond to the same male (respectively, female) and their sides (respectively, circumferences) are dashed.
For a genotype segment g of length L, the value at each position of g can be 0,1, or 2. A position of g with 0 indicates that both haplotypes have 0 at this position, while a position of g with 1 indicates that both haplotypes have 1 at this position. If the value at a position is 2, then one of the haplotypes is 0 while the other is 1 at this position. A pair of haplotype segments (h,hâ²)is a haplotype pair for a genotype segment g if they satisfy the following conditions:
-
C1.
If the value of g is 0 (or 1) at a position, the values of h and hâ²at this position are both 0 (or 1).
-
C2.
If the value of g is 2 at a position, one of h and hâ²is 0 and the other is 1 at this position.
We also say that the pair of haplotype segments (h,hâ²)can explain g.
Throughout this paper, we study the dominant inheritance situation, where sharing of one mutation allele can cause a disease phenotype. The general problem is as follows: we are given two sets of genotypes on the whole chromosome Dâ=â{G1,G2,âĤ,G k } and Nâ=â{Gk + 1,Gk + 2,âĤ,G n }, where the k genotypes in D are from diseased individuals and the nâââkgenotypes in N are from normal individuals. The n individuals in D and N are closely related (in the same hidden pedigree). The objective is to detect the mutation regions on the chromosome, where all the diseased individuals share a common haplotype segment on the mutation region and none of the normal individuals has such a common haplotype segment on the mutation region. Note that each individual has two haplotype segments on each region. If we know the haplotypes of each input individual over the chromosome, the shared mutation regions can be computed by finding the haplotype segments which are shared by all the diseased individuals but by none of the normal ones. The true mutation region is a shared mutation region containing the disease gene. Therefore, to solve the problem, the key issue is to infer the haplotype segments.
The task of inferring the haplotypes of each individual over the whole chromosome is extremely hard. For our purpose, we divide the whole chromosome into a set of disjointed length L segments, where L is a parameter to be determined later. For each length L segment, we try to infer the two haplotype segments of each individual based on the following mathematical model.
Mutation Region Haplotype Inference Problem (MRHIP)
Given two sets and of genotype segments on a length L region R, where the first k genotype segments in DRare from diseased individuals and the nâââkgenotype segments in NRare from normal individuals, we want to compute a center haplotype segment hRand a pair of haplotype segments and for each in DRâŞNR, such that the following conditions hold:
(1) for any .
(2) for and râ=â1,2;
(3)the total number x R of distinct haplotype segments on R is minimized.
Without loss of generality, we assume that all the genotype segments on R in DR are distinct. Similarly, all the genotype segments on R in NRare also distinct. However, a genotype segment on R from a diseased individual and a genotype segment on R from a normal individual may be identical. In this case, such a genotype segment on R should be in both DR and NR.
Condition (1) makes sure that all the diseased individuals have a haplotype segment which is identical to the center haplotype segment hR. Condition (2) ensures that all the normal individuals do not have hR. Condition (3) uses the parsimony principle, i.e., we want the total number of distinct haplotype segments to be minimized. Due to condition (3), this mathematical model can be viewed as an extension of the parsimony model first proposed by Gusfield in [20] for haplotype inference. The parsimony model has been extensively studied in [20, 21, 23, 24]. MRHIP can be viewed as a simplified version of the Maximum Resolution (MR) Problem which is proved to be NP-hard in [20].
It should be emphasized that for some input of MRHIP on a region R, the solution of MRHIP may not exist. Even if the solutions exist, the values of x R may vary for different inputs. If R is the mutation region, the solution for MRHIP on R always exists and the value of x R should be small.
Our approach contains three steps. First, we decompose the whole chromosome into a set of disjointed length L (Lâ=â500) segments and try to give a solution for MRHIP on each length L segment. We then have an algorithm to merge length L segments based on the computational results to form longer segments and try to get solutions for MRHIP on those longer segments. After that, we have a method to further extend the longer segments to the left and right. Finally, our algorithm reports all the detected mutation regions.
The algorithm for MRHIP
For the Mutation Region Haplotype Inference Problem(MRHIP), we designed an algorithm to solve it. Given an instance of MRHIP, there may or may not exist a solution. If a solution does not exist, there are two cases:
-
1.
There does not exist a center haplotype hRwhich is shared by all the genotype segments in DR. This case is referred to as type I. Type I cases occur when one element in DRhas genotype value 0 and the other element in DRhas genotype value 1.
-
2.
We can find a center haplotype hR, but some genotype segments in NRmust be explained by a pair of haplotype segments and one of the haplotype segments is identical to hR. This case is referred to as type II.
An example of a type II case is the following: DRâ=â{g1â=â111,g2â=â121}, and NRâ=â{g3â=â112,g4â=â102}. Based on g1 and g2, the shared center haplotype hR must be 111. However, g3indicates that normal individuals also have a haplotype segment 111 on R which is identical to the shared center hR. Thus, condition (2) in MRHIP does not hold.
In our algorithm, we first compute the center haplotype hR based on the diseased genotype segments in DR. We look at the positions in R one by one. Based on C1 and C2, if one of the diseased individuals has genotype value 0, then the haplotype value of hRat this position should be 0; if one of the diseased individuals has genotype value 1 at a position, then the haplotype value of hRat this position should be 1. If there exists a position p at which one diseased individual has genotype value 0 and the other diseased individual has genotype value 1, then a conflict occurs and position p is called a conflicting position. Once a conflict occurs, we simply conclude that there is no MRHIP solution on this segment R. We say the Type I False occurs in this case. If all the diseased individuals have genotype value 2 at a position p in R, then the haplotype value of hRcannot be determined at this step. We call such a position the wild card position and put a â at the wild card position to indicate that the haplotype value of hRwill be determined later. The detailed procedure (Procedure P1) for computing the center haplotype segment hR is given as follows:
for each position p in R do
-
1.
if all the diseased individuals have genotype value 0 or 2, then set the haplotype value of hRat p to 0.
-
2.
if all the diseased individuals have genotype value 1 or 2, then set the haplotype value of hRat p to 1.
-
3.
if some diseased individuals have genotype value 0 and some other diseased individuals have genotype value 1, then return Type I False.
-
4.
if all the diseased individuals have genotype value 2, then set the haplotype value of hRto â(indicating that the haplotype value of hRwill be determined later).
Without loss of generality, for each diseased individual , we set for iââ¤âk. Then we can set in such a way that is a haplotype pair for with iââ¤âk. Note that, the values of at wild card positions in R are still not yet determined. Here we refer to as a partially inferred haplotype segment on R. Let , where if two haplotype segments and are identical, then we just keep one of them. Note that if any haplotype segment is undetermined at a position p, then all the haplotype segments in intQ are undetermined at p.
After partially determining hRand for every in DR, we use a heuristic method to infer the haplotype segments for . Since we want to minimize the total number of resulting distinct haplotype segments on R, our strategy is to let the inferred haplotype segments for share as many haplotype segments as possible. This is actually Clarkâs inference rule [22].
Let Q be a queue that contains a set of (partially) inferred haplotype segments on R. Initially, Qâ=âintQ. A partially inferred haplotype segment h in Q can solve if the following conditions hold:
-
1.
if h is 0 at a position p then is 0 or 2 at position p.
-
2.
if h is 1 at a position p then is 1 or 2 at position p.
We can use h to solve by constructing two haplotype segments as follows:
Using h to solve :
-
1.
if h is 0 at position p in R then we set at p and at p is set according to rules C1 and C2.
-
2.
if h is 1 at position p in R then we set at p and at p is set according to rules C1 and C2.
-
3.
if h is undetermined at position p in R, and is 0 (or 1) at p, then set (or ) at p. Here we also have to determine the value of h at position p accordingly. After the undetermined value of h at p is determined, if h is obtained from a , then we also have to determine the values of hRand other for at position p according to the haplotype value of h at p and rules C1 and C2.
-
4.
if h is undetermined at position p in R, and is 2 at p, then and remain undetermined at p.
A genotype segment is solved if the pair of haplotype segments for are (partially) determined. In our algorithm, we use P to store the set of genotypes in NR that have not been solved. Initially, Pâ=âNR. We then use the haplotype segments in Q one by one and try to solve each of the genotypes in P. After trying to use a hâââQ to solve all âs in P, we delete h from Q. Two haplotype segments on R in Q are compatible if there does not exist any position p in R such that one segment has value 0 and the other segment has value 1 at p. For two compatible haplotype segments h1 and h2 on R, we can merge them to form one haplotype segment h, where the value of h is determined as 0 or 1 if at least one of h1 and h2 is 0 or 1 and the value of h remains undetermined if both h1and h2 are undetermined. Again, for any position p, if the value of h1or h2 is not identical to that of h, then the value of h1or h2 is changed from undetermined to 0 or 1. Thus, we have to update some of the previously inferred haplotype segments accordingly.
When we use a hâââQ to solve a in P, we can obtain another new haplotype segment hâ²on R. If hâ²is compatible with a haplotype segment in Q, we then merge them. Note that, hâ²might be compatible with more than one haplotype segment in Q. In this case, we arbitrarily choose a compatible haplotype segment in Q and merge the two haplotype segments. If hâ² is not compatible with any haplotype segment in Q, we add hâ²into Q.
We give an example to illustrate the above process.
Example 1
DRâ=â{g1â=â10211,g2â=â12221,g3â=â10221}and NRâ=â{g4=10121}. After Procedure P1, hRâ=â10â11and consequently intQâ=â{h1,2â=â10â11,h2,2â=â11â01,h3,2â=â10â01}. After that , we can use h1,2â=â10â11in intQ to solve g4â=â10121in NR. Based on h1,2â=â10â11, g4â=â10121can be solved as h4,1â=â10111and h4,2â=â10101. Moreover, since we want h1,2and h4,1to be identical (to minimize the number of distinct haplotype segments), h1,2is updated as h1,2â=â10111. Correspondingly, we update hRâ=âh1,1â=â10011, h1,2â=â10111, h2,2â=â11101,and h3,2â=â10101. After that, the set of distinct haplotype segments we have obtained so far is {hRâ=â10011,h1,2â=âh4,1â=â10111,h2,2â=â11101,h3,2â=â10101,h4,2â=â10101}. Note that h3,2and h4,2are compatible (actually identical), and the set of distinct haplotype segments is {hRâ=â10011,h1,2â=âh4,1â=â10111,h2,2â=â11101,h3,2â=âh4,2â=â10101}.
After trying to use all the hâs in Q to solve all âs in P, Q will become empty. When Q is empty and P still contains at least two genotype segments, we consider all pairs of genotype segments and in P and use the following method to infer the haplotype segments. Inferring the haplotype segments from a pair and in P:
A position p in R is a conflicting position for and if one of and has the genotype value 0 and the other has genotype value 1 at p. The pair of and can share a common haplotype segment on R if there is no conflicting position in R for and . The shared haplotype segment can be computed as follows: (1) if one of the genotype values at position p is 0, then the haplotype value is 0 at p; (2) if one of the genotype values is 1 at p, then the haplotype value is 1 at p; (3 )if both genotype values are 2 at p, then the haplotype value at p is undetermined. Once the shared haplotype segment for and are computed, we can determine the other haplotype segments for and based on C1 and C2.
After inferring the haplotype segments from a pair and that can share a common haplotype segment, we delete and from P, merge compatible inferred haplotype segments, and insert the newly obtained haplotype segments into Q. Once Q is not empty, we can use haplotype segments in Q to solve the genotype segments in P again. The process is repeated until P is empty. The detailed algorithm is given as Algorithm 1.
Algorithm 1
Mutation Region Haplotype Inference
Input: Two sets of genotype segments and on R.
Output: True if there is a solution. Type I false if two diseased haplotype segments are conflict at a position in R; Type II false otherwise.
1: Compute the center haplotype hRas in Procedure P1.
2:if hRdoes not exist then
3:return Type I False
4:else
5:
set for ;
6: compute according to and C1 and C2 for each ; Set (removing identical segments) and Pâ=âNR.
7:end if
8:while Qââ ââ and Pââ ââ do
9: Delete a haplotype segment from Q;
10:if can solve then
11: use to solve . Add the newly obtained haplotype segments (after merging compatible segments) into Q and delete from P.
12:end if
13:end while
14:if there are at least 2 genotype segments in P then
15:if there exists a pair of genotype segments and that can share a haplotype segment on R then
16:
infer the haplotype segments of and and insert them (after merging) into Q, goto line 8.
17:end if
18:if any inferred haplotype segments for some on R is identical to hR then
19:return Type II False.
20:end if
21:
fix and for each so that and .
22:if Line line 21 fails then
23:return Type II False
24:else
25:return True.
26:end if
27:end if
The following is an example to illustrate the case when Q becomes empty.
Example 2
DRâ=â{g1â=â12110,g2â=â11210}and NRâ=â{g3â=â21111,g4â=â11121}. After Procedure P1, hRâ=â11110and intQâ=â{10110,11010}. After trying all the hâs in intQ to solve g i âs in NR, Q becomes empty and Pâ=âNRââ={21111,11121}. In this case, we look at both 21111 and 11121 in P and infer a shared haplotype h3,1â=âh4,1â=â11111and the other two haplotype segments h3,2â=â01111and h4,2â=â11101.
The algorithm for the whole chromosome
For an input segment R on a chromosome, if Algorithm 1 returns true, then R is a valid segment. In order to get the mutation regions, we decompose the whole chromosome into a set of disjointed length L segments. (In this paper, we performed experiments on chromosomes with about 110,000 SNP sites. In this case, we set Lâ=â500.) For each segment, we run Algorithm 1 to test if the segment is valid. After finding all the valid segments, we repeatedly merge two valid segments into a long valid segment if the two segments are within 2L SNPs and Algorithm 1 returns Type II false on all the segments in the gap.After the above merging process, we obtain several long valid segments. For each such long valid segment [sb,se), we run Algorithm 1 on the three segments [sbâ0.5L,se + 0.5L), [sbâ0.2L,se + 0.2L) and [sb,se) and select the longest one (denoted as [b,e)) which returns true. Since we impose that Algorithm 1 returns Type II false for the segments in gaps in the merging process, we can always ensure that Algorithm 1 returns true for [sb,se). Extending [b,e) to the left and right:
After we obtain Râ=â[b,e) as discussed above, we try to extend the segment [b,e) to the left and right. On the segment R=[b,e), we have inferred and for each . These and form a collection of disjointed sets H1,H2,âĤ,H m , where each H k (1ââ¤âkââ¤âm) is a set of identical haplotypes in on R.
We extend the segment [b,e)to the left and right by looking at each position p. We first try pâ=âbâââqfor qâ=â1,2,âĤ(to the left) and then pâ=âeâââ1 + qfor qâ=â1,2,âĤ(to the right). For each H k (1ââ¤âkââ¤âm),
-
1.
If there exist some hi,jin H k such that g i are 0 (or 1) and others are 2, then every hi,jin H k should be 0 (or 1). If there exists a hi,jin H k that has been set to the conflict value 1 (or 0) before, then we know that position p is a conflicting position and p should not be extended to be part of R and the extension process to the current direction (left or right) should stop. Otherwise, we set every hi,jin H k to be 0 (or 1). If p is not a conflicting position, after setting hi,jin H k to be 0 (or 1), we can determine the value of (jâ²â=â1if jâ=â2and jâ²â=â2if jâ=â1) according to the value of hi,jat p, the value of g i at p and rules C1and C 2. Again, we should test if such a value of is consistent with the value of determined before (if any). If conflict exists, then p is a conflicting position and the extension process to the current direction (left or right) should stop. Let . If there is no conflict, we should also update the value of all hâs in . This recursive process continues until no further change can be made.
-
2.
If all hi,jin H k are 2, we set hi,jas undetermined.
The extension process stops when we find conflicts in both directions. The extended region obtained from [b,e) is denoted as [rb,re]. After the extension process, our program reports all the mutation regions obtained in the algorithm. The complete algorithm to find the mutation regions on the whole chromosome is shown as Algorithm 2.
Algorithm 2
The algorithm for the whole chromosome
Input: Two sets of genotype on the whole chromosome Dâ=â{G1,G2,âĤ,G k }and Nâ=â{Gk + 1,Gk + 2,âĤ,G n }.
Output: The detected mutation regions
1:Decompose the whole chromosome into segments of length Lâ=â500.
2:for each segment do
3:
use Algorithm 1 to test if the segment is valid.
4:end for
5:Merge the valid segments (see The algorithm for the whole chromosome in Implementation, the first paragraph) to form longer segments.
6:for each segment [sb,se)obtained in line 5 do
7:
Select the longest segment of [sbâââ0.5L,se + 0.5L), [sbâââ0.2L,se + 0.2L)and [sb,se)which will return true by calling Algorithm 1. Denote it as [b,e).
8:
Extend [b,e)to get a candidate mutation region.
9:end for
10:Output all the mutation regions obtained in the for loop of lines 6-9.
Results
In this section, we first show some experiments on simulated data. We then give a real case study to show that our program can also handle real data (with errors). A discussion is given at the end of this section.
Experiments on simulated data
In order to evaluate the performance of our method and the feasibility of the mathematical model proposed in this paper, we write a program in C++ to produce simulated data. The program takes a pedigree (e.g., Figure 1) and the haplotype data for the whole chromosome of each founder in the pedigree as the input. It generates the haplotype data for the remaining individuals in the pedigree using the standard Ï2 model for recombination with m (the degree of freedom divided by 2) equal to 4 ([25]) and according to the male/female averaged genetic map for chromosome 1 downloaded from HapMap (http://hapmap.org). Also see [26]. The haplotype data of a non-founder in the pedigree are generated to randomly inherit one strand of the four-strand chromatid bundle from each parent of the non-founder. A mutation point is selected uniformly at random from the SNP sites of the chromosome. Each diseased offspring is forced to inherit (from each of its parents) the strand with the mutation point and the normal offspring are forced to inherit the strand without the mutation point. In this way, we can guarantee that there is exactly one true mutation region. Note that the true mutation region must be a shared mutation region. See Implementation for the definition. Moreover, since we know the haplotype data of all the individuals in the simulations, we can easily find the shared mutation regions. By definition, there may exist more than one shared mutation region.
To generate the simulated data, we randomly chose some of the haplotype data for chromosome 1 of 170 unrelated Japanese in Tokyo and Han Chinese in Beijing in the database of HapMap project (http://hapmap.ncbi.nlm.nih.gov/) as the haplotype data for each founder. There are about 110,000 SNP sites on this chromosome.
Recall that our program takes two sets of individuals D and N and their genotype data as input. After generating the haplotype data of each individual, we only use some of the individuals (in the dashed rectangle at the bottom of the pedigree, say, e.g., Figure 1) and their associated genotype (obtained from the simulated haplotype data) data as the input of our program.
To evaluate the performance of our method, we used different pedigrees to evaluate our algorithm. Figures 1, 2, 3 and 4 are pedigrees of 5 generations with 2 to 5 diseased individuals in the dashed rectangle at the bottom. Those individuals in the dashed rectangle at the bottom of each pedigree are the input of our program.
The correctly detected mutation regions are the intersection of the regions reported by the computer program and the true mutation region. Here, precision is defined as the number of SNPs in the correctly detected mutation regions divided by the total number of SNPs in the regions output by the program. The value of recall is defined as the number of SNPs in the correctly detected mutation regions divided by the total number of SNPs in the true mutation region. So, if the value of recall is 1, then all the SNPs in the true mutation region have been reported by the program. Similarly, if precision is 1, then all the reported SNPs are in the true mutation region.
We performed 200 experiments for each pedigree. Since there are about 110,000 SNP sites on the chromosome, we set Lâ=â500. For each region [rb,re]reported by our algorithm, we define a score as follows: Let DH be the number of distinct haplotypes on this region and LENGTHâ=â(reââârb + 1)the length of this region. Then the score of this region is defined as SCOREâ=â(2ânâââDH)âLENGTH, where n is the total number of input genotypes in DâŞN. This score can balance the length of the mutation region and the number of distinct haplotype segments on the region. With the longer region and smaller DH, the score becomes higher. To illustrate the quality of our program, we report the results when our program reports the region with the highest score and the first three regions with the highest scores, respectively. In fact, our program does not need this score in the computation. The program simply reports all the mutation regions. See the Genotype data error handling in Discussion.
The precision and recall on the experiments are shown in Table 1. Only the genotype of the individuals in the latest generation of Pedigree 1â4are known in this experiment. Several mutation regions may be detected by our algorithm. In Table 1, the results when our program reports the region with the highest score are shown in the columns under âone regionâ. The results when our program reports three regions with the highest scores are shown in the columns under âthree regionsâ. The precision and recall are calculated based on the true mutation region, the reported region(s), and the intersection of the reported region(s) and the true mutation region. The precisionâ and recallâ are calculated by replacing the true mutation region with shared mutation regions. The column âtimeâ indicates the average time of our program by running 200 experiments on each pedigree.
From Table 1, we can see that the values of recall are from 82.71% to 98.07% and the values of precision are from 23.07% to 39.94% in the four pedigrees. When the number of diseased individuals is increased, the values of both recall and precision are improved significantly. When there are 4 or 5 diseased individuals, the value of recall is more than 97%. That is, the program can report most of the SNPs in the true mutation region.
In practice, one can often get the genotype data for the individuals of the latest two generations. Thus, we study this case by looking at different input individuals based on the pedigrees in Figures 1, 2, 3 and 4.
Now, we study different sets of input individuals in the latest two generations of Pedigree 1. These different sets of input individuals in the latest two generations in the pedigree are given in Figure 5, a square/circle with a slash indicates that such individual is not included as part of the input though the individual is used in generating the simulated data. For the rest of test, we performed 200 experiments for each case and show the average values. Table 2 shows the results for the different sets of input individuals in Figure 5. The individuals in the latest and latest two generations are not distinguished in our algorithm. We just input the genotype for all the individuals without the slash. Again, the setting is similar to that of Table 1. From Table 2, we can see that the values of recall for different inputs are close to 99% except for 2d-3fam-3, 2d-2fam-2 and 2d-2fam-3, where the input contains only 2 or 3 diseased individuals. Comparing Table 1 and Table 2, we can see that more input individuals do help improve the values of precision and recall.
We also performed similar experiments for Pedigree 2-4 (see Figures 2, 3 and 4). The results are similar to that in Table 2 and are given in Additional file 1.
We also tested the program using pedigrees containing 6 and 7 generations and 2, 3, 4, 5 diseased individuals, respectively, in the latest generation. The four pedigrees containing 6 generations are shown in Additional file 1: Figure S4, Figure S5, Figure S6 and Figure S7 in the Additional file. The four pedigrees containing 7 generations are shown in Additional file 1: Figure S12, Figure S13, Figure S14 and Figure S15 in the Additional file. Again, the input individuals are the individuals in the dashed rectangle at the bottom of the pedigree. The experiment results for 6 generations and 7 generations are shown in Table 3 and Table 4, respectively. The settings of Table 3 and Table 4 are similar to that of Table 1. Table 3 and Table 4 show that the performance of our program for 6 and 7 generations is similar (but slightly worse than) to that for 5 generations.
Similar to the case of 5 generations, for pedigrees with 6 and 7 generations, we also tested various cases when some individuals of the latest two generations are available as input individuals. The results for 6 and 7 generations are similar to that of 5 generations. The detailed results are given in the Additional file.
A real case study
To illustrate the usefulness of our program, we applied our method to a set of real data originally from the phase II HapMap database and was studied in [27]. In [27], the authors studied two CEU (Utah residents with European ancestry from the CEPH collection) families (parent-offspring trios) CEPH 1341 and CEPH 1375 (see Figure 6). They identified a segment (107M to 110M) on chromosome 9 shared by the four individuals NA06991, NA10863, NA06985 and NA12264. For this set of data, there are totally 168,321 SNP sites on the chromosome after the unknown genotypes are eliminated from the database. There are totally 6,519 SNP sites between 107M and 110M on chromosome 9 starting at the 122348-th SNP site and ending at the 128866-th SNP site.
We applied our program with the six individuals in the two families CEPH 1341 and CEPH 1375 as input and set the four individuals NA06991, NA10863, NA06985 and NA12264 as diseased individuals. We set Lâ=â500. Our algorithm found several segments shared by the four diseased individuals. The lengths of all the reported segments are approximately 500 SNP sites except the longest ones. All these segments are shown in Table 5. Table 5 shows the starting and ending point of the segments, the number of distinct haplotypes on the segment (DN), and the score for each segment. As there is only one conflicting position 126786 between segments [124561,126785] and [126787,129451], we should consider such a conflicting position as a data error. Therefore, segment [124561,129451] should be the predicted mutation. This segment starts at the 124561-th SNP site and ends at the 129451-th SNP site. The details are shown in Figure 7. We can see that the shared segment found by PLINK in [27] (the blue line) starts at the 122348-th SNP site while the starting position of our reported segment (the red line) is 124561. For the subsegment [122348, 124560] that we did not report, we found 79 conflicting positions in this subsegment containing 2,213 SNP sites. (See the filled dots on the blue line.) However, on the segment [124561, 129451] containing 4,891 SNP sites reported by our program, there is only one conflicting position. This is strong evidence that the subsegment [122348, 124560] is not shared by all the four diseased individuals. We also looked at the segment [129452, 131664] with length of 2,213 SNP sites on the right of our reported segment [124561, 129451], and found 52 conflicting positions among the 2,213 SNP sites. We can see that the subsegments [122348, 124560] and [129452, 131664] on the left and right of our reported segment have approximately the same number of conflicting positions.
Discussion
Haplotype inference methods
As discussed before, if the haplotype data for each input individual are known, the problem of finding the true mutation region is straightforward. Currently, there are several population-based phasing methods that can give accurate haplotype segments [28â30]. However, these methods can only phase a small number of SNPs effectively and take an extremely long time to infer the haplotype for the whole chromosome. LRP in [31] can phase more than 1,000 SNPs simultaneously within a reasonable time. However, it is still very slow for phasing 110,000 SNPs of a whole chromosome (as our program does). Moreover, they cannot directly report the true mutation region for a set of input individuals. On the other hand, our program can complete the computation in less than 20 seconds for about 110,000 SNPs with about 10 to 20 individuals.
Related mutation region detection methods
To our knowledge, all the existing software packages (except PLINK in [27]) need a clearly given pedigree as part of the input. If the pedigree is not known, most of the software packages do not work. Our algorithm deals with the case where the input individuals are closely related but the pedigree is not given.
Merlin is widely used for linkage analysis, where a pedigree is required as part of the input. It works well on SNP data due to the use of sparse trees. However, it can only analyze pedigrees of moderate size. When the family size is big, a large memory space is needed and the computation cannot successfully be completed. As shown in [19], Merlin cannot report the results for some pedigrees, e.g., P14 and P16 in [19], where there are less than 16 input individuals. However, our program can deal with the cases where the number of input individuals is large. The first row in Table 3 of the Additional file shows the results for 20 (not including those without known genotype data) input individuals. We can see that our method can give very high precision and recall in this case (without taking the pedigree as part of the input). Therefore, our algorithm can handle some cases which cannot be handled by Merlin.
The rule-based algorithm in [10] uses a set of heuristics for haplotype inference with a given pedigree. It can give very accurate results when the number of family members is large enough and for each nuclear family the genotype data for both parents are available. However, it does not work well when the genotype data of one of the parents are missing in the nuclear family. If the data for both parents are missing, it does not work. LIden [19] is an extended software package of the algorithm in [10]. It focuses on handling the case where the genotype data for one of the parents in a nuclear family are missing for the entire chromosome. But it still does not work when the genotype data for both parents in a nuclear family are missing or the family pedigree is not given in the input.
PLINK in [27] and Beagle in [30, 32] can identify the shared haplotype segment between two individuals based on population-based linkage analysis. But it cannot automatically identify the mutation region taking a set of individuals as the input, which is expected to give more precise prediction. The above real case study has illustrated this.
Genotype data error handling
The real datasets often contain errors. Handling the genotype data errors is an important issue in practice. For our program, we have a pre-process step to delete all the SNPs containing missing data. That is, if the genotype data for an input individual at an SNP site are missing, we delete this SNP site from the input. Without this step, we cannot get reasonable results for the real case study. When the genotype data contain errors, it is hard to detect and correct them. The errors may affect our programâs results in two ways: (a) an SNP site in the true mutation region may become a conflicting position due to error; and (b) the number of distinct haplotype segments to explain the genotype data is increased. When (b) occurs, our score for the detected mutation regions becomes worse. When (a) occurs once, our program reports two detected regions with the conflicting position in between. See [124561, 126785] and [126787, 129451] in Table 5 of the real case study. When this kind of error occurs many times, our program reports many regions separated by a few SNPs in the middle. When the user looks at the results of our program, it is possible to realize that the few SNPs between two closely located reported regions are due to errors. This is similar to other linkage analysis programs such as Merlin, where each SNP site has a score, and the user decides a region (by ignoring fluctuations) with high scores as the true mutation region.
Our program may work for the situation where the input individuals are from multiple families. Our algorithm tries to find regions shared by all the diseased individuals. Thus, as long as the diseased individuals from multiple families share the same (or similar) haplotype segment on the true mutation region, our program should be able to find such region. Even if the haplotype segments from different families on the true mutation region are slightly different, the program should be able to report several smaller regions with a few missing SNPs in the middle. Again, it is possible for users to figure out the whole true mutation by ignoring the few missing SNPs in the middle.
Conclusion
We developed a software package for linkage analysis where the input individuals are closely related, but the pedigree is not known. We propose a model using the parsimony principle that can be viewed as an extension of the model first proposed by Dan Gusfield ([20, 21]). Our heuristic algorithm simply uses Clarkâs inference rule to infer haplotype segments. Experiments show that our program can give very high value (90%-99%) of recall in various cases. This implies that the program can report more than 90% of the true mutation region. The value of precision varies from 29% to 90%. When the precision is 29%, the size of the reported regions is three times that of the true mutation region. This is still very useful for narrowing down the range of the disease gene.
Availability and requirements
Project name: MRD
Project homepage:http://www.cs.cityu.edu.hk/~wenjuacui/software/mutationRegion/index.html. The source code is also available.
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.6.0 or higher
License: None
Any restrictions to use by non-academics: None
Authorâs contributions
LW proposed the topic and ideas for algorithms, WC implemented the program, and both authors devised and developed the method and prepared the manuscript. Both authors read and approved the final manuscript.
References
Emahazion T, Feuk L, Sawyer S, Fredman D, St Clair D, Prince J, Brookes A: SNP association studies in Alzheimerâs disease highlight problems for complex disease analysis. Trends Genet 2001, 17(7):407â413. 10.1016/S0168-9525(01)02342-3
Leykin I, Hao K, Cheng J, Meyer N, Pollak M, Smith R, Wong W, Rosenow C, Li C: Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data. BMC Genet 2005, 6: 7.
Sellick G, Longman C, Tolmie J, Newbury-Ecob R, Geenhalgh L, Hughes S, Whiteford M, Garrett C, Houlston R: Genomewide linkage searches for Mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Acids Res 2004, 32(20):e164. 10.1093/nar/gnh163
Lander E, Green P: Construction of multilocus genetic linkage maps in humans. Proc Nat Acad Sci USA 1987, 84(8):2363â2367. 10.1073/pnas.84.8.2363
Kruglyak L, Daly M, Reeve-Daly M, Lander E: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Human Genet 1996, 58(6):1347â1363.
Gudbjartsson D, Jonasson K, Frigge M, Kong A: Allegro, a new computer program for multipoint linkage analysis. Nat Genet 2000, 25: 12â13. 10.1038/75514
Abecasis G, Cherny S, Cookson W, Cardon L: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002, 30: 97â101. 10.1038/ng786
Lathrop G, Lalouel J, Julier C, Ott J: Strategies for multilocus linkage analysis in humans. Proc Nat Acad Sci USA 1984, 81(11):3443â3446. 10.1073/pnas.81.11.3443
Elston R, Stewart J: A general model for the genetic analysis of pedigree data. Human Heredity 1971, 21(6):523â542. 10.1159/000152448
Lin G, Wang Z, Wang L, Lau Y, Yang W: Identification of linked regions using high-density SNP genotype data in linkage analysis. Bioinformatics 2008, 24: 86â93. 10.1093/bioinformatics/btm552
Cai Z, Sabaa H, Wang Y, Goebel R, Wang Z, Xu J, Stothard P, Lin G: Most parsimonious haplotype allele sharing determination. BMC Bioinformatics 2009, 10: 115. 10.1186/1471-2105-10-115
Qian D, Beckmann L: Minimum-recombinant haplotyping in pedigrees. Am J Human Genet 2002, 70(6):1434â1445. 10.1086/340610
Tapadar P, Ghosh S, Majumder P: Haplotyping in pedigrees via a genetic algorithm. Human Heredity 2000, 50: 43â56. 10.1159/000022890
Zhang K, Sun F, Zhao H: HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 2005, 21: 90â103. 10.1093/bioinformatics/bth388
Doi K, Li J, Jiang T: Minimum recombinant haplotype configuration on tree pedigrees. In Proceedings of Workshop on Algorithms in Bioinformatics(WABI) 2003, 339â353.
Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. J Comput Biol 2005, 12(6):719â739. 10.1089/cmb.2005.12.719
Li J, Jiang T: An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming. In Proceedings of the eighth annual international conference on Resaerch in Computational Molecular Biology(RECOMB). SanDiego, California, USA: ACM; 2004:20â29.
Xiao J, Liu L, Xia L, Jiang T: Fast elimination of redundant linear equations and reconstruction of recombination-free mendelian inheritance on a pedigree. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete Algorithms. SIAM, New Orleans, Louisiana USA; 2007:655â664.
Wang L, Wang Z, Yang W: Linked region detection using high-density SNP genotype data via the minimum recombinant model of pedigree haplotype inference. BMC Bioinformatics 2009, 10: 216. 10.1186/1471-2105-10-216
Gusfield D: Inference of haplotypes from samples of diploid populations: complexity and algorithms. J Comput Biol 2001, 8(3):305â323. 10.1089/10665270152530863
Gusfield D: Haplotype inference by pure parsimony. In Combinatorial Pattern Matching. Morelia, Michocan Mexico: Springer; 2003:144â155.
Clark A: Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 1990, 7(2):111â122.
Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics 2003, 19(14):1773â1780. 10.1093/bioinformatics/btg239
Brown D, Harrower I: Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Trans Comput Biol Bioinformatics 2006, 30(2):141â154.
Broman K, Weber J: Characterization of human crossover interference. Am J Human Genet 2000, 66(6):1911â1926. 10.1086/302923
Yang W, Wang Z, Wang L, Sham P, Huang P, Lau Y: Predicting the number and sizes of IBD regions among family members and evaluating the family size requirement for linkage studies. Eur J Human Genet 2008, 16(12):1535â1543. 10.1038/ejhg.2008.116
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007, 81: 559â575. 10.1086/519795
Stephens M, Smith N, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Human Genet 2001, 68(4):978â989. 10.1086/319501
Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Human Genet 2006, 78(4):629â644. 10.1086/502802
Browning B, Browning S: A fast, powerful method for detecting identity by descent. Am J Human Genet 2011, 88(2):173â182. 10.1016/j.ajhg.2011.01.010
Kong A, Masson G, Frigge M, Gylfason A, Zusmanovich P, Thorleifsson G, Olason P, Ingason A, Steinberg S, Rafnar T, et al.: Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genet 2008, 40(9):1068â1075. 10.1038/ng.216
Browning S, Browning B: High-resolution detection of identity by descent in unrelated individuals. Am J Human Genet 2010, 86(4):526â539. 10.1016/j.ajhg.2010.02.021
Acknowledgements
This work is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 121608]. Lusheng Wang is the corresponding author.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests
Electronic supplementary material
12859_2011_5455_MOESM1_ESM.pdf
Additional file 1: Supplementary Material: This file includes several figures and additional experimental results mentioned in the paper. It contains the different set of input individuals on Pedigree 2-4 in the paper, the pedigrees containing 6 and 7 generations and 2,3,4,5 diseased individuals in the latest generation respectively. The tables show the results on the input of the above figures. (PDF 2 MB)
Authorsâ original submitted files for images
Below are the links to the authorsâ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Cui, W., Wang, L. Identifying mutation regions for closely related individuals without a known pedigree. BMC Bioinformatics 13, 146 (2012). https://doi.org/10.1186/1471-2105-13-146
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471-2105-13-146