Detection of identity by descent using next-generation whole genome sequencing data
© Su et al.; licensee BioMed Central Ltd. 2012
Received: 20 December 2011
Accepted: 6 June 2012
Published: 6 June 2012
Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data are becoming increasingly available, enabling the comprehensive analysis of genomes, including the identification of rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease-causing loci that were undetectable with sparser genetic data.
Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influence the resolution at which IBD can be detected. The data sets include microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. By comparison, microarray genotype data give similar power only for segments of length 1.0 cM or larger. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.
We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.
The concept of identity by descent (IBD), which is used to indicate when alleles at a given locus in two individuals are inherited from a common ancestor, has played a fundamental role in many genetic studies. Analyses of IBD are commonly used in pedigree data for linkage mapping. IBD also has many uses in population-based studies, including mapping disease genes [2, 3], estimating haplotypic phase and inferring evolutionary history (e.g., natural selection and inbreeding depression) [5, 6]. More recently, IBD has been applied to analyzing gene expression in related or unrelated individuals. Incorporating such information about shared genetic material between individuals in linkage/association analyses has been shown to improve statistical power for mapping disease genes in some studies [8–10].
The length of an IBD segment depends on the number of generations between the individuals under study and their common ancestor, as IBD tracts are broken down by recombination events over time. In family data, the common ancestor is fairly recent, and thus IBD segments are expected to be long. Long IBD segments are easily detected with low density genotype data; in fact, linkage analysis studies were typically conducted with 300–400 highly polymorphic microsatellite markers, prior to the widespread use of microarray-based genotyping. This approach takes advantage of the fact that tracts of IBD extend several centiMorgans (cM) across family members, covering many genetic variants. On the other hand, IBD segments shared between two putatively unrelated subjects in a large population are expected to be short. In this case, deeper coverage of genetic variants obtained through sequencing can increase the power to detect small IBD segments. As whole genome sequence data become increasingly available, quantifying the extent to which sequence data can improve the resolution of IBD detection is an important step toward enabling more powerful approaches to disease gene mapping and understanding population history.
Various methods for detecting IBD have been proposed for use with population genotype data, including methods based on observed long segments of allele sharing [4, 11] and on probabilities of IBD built into a hidden Markov model (HMM) (Purcell et al., 2007; [2, 12–15]). We focus our attention on two highly computationally efficient methods implemented in the software packages Germline and fastIBD. The former searches for IBD by directly matching portions of haplotypes between individuals, using phased genotype data. The latter works from unphased genotype data and detects IBD segments with an HMM that models shared haplotype frequencies while accounting for background levels of LD. In addition to being computationally efficient, fastIBD accounts for uncertainty in haplotype phase while inferring IBD states. Previous studies using high-density SNP genotype data from European ancestry samples in the HapMap project have shown that these two programs have good power to detect IBD segments greater than 2 cM in length.
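The idea of detecting IBD by directly matching portions of phased haplotypes can be illustrated with a toy sketch. This is not Germline's actual algorithm or interface; the function name `shared_segments`, its arguments, and the exact-match rule are assumptions made purely for illustration. It scans two haplotypes for maximal runs of identical alleles and keeps runs whose genetic length reaches a minimum threshold.

```python
from typing import List, Sequence, Tuple


def shared_segments(
    hap_a: Sequence[int],
    hap_b: Sequence[int],
    cm_pos: Sequence[float],
    min_cm: float = 0.1,
) -> List[Tuple[int, int, float]]:
    """Return (start_index, end_index, length_cM) for each maximal run of
    identical alleles between two phased haplotypes.

    Hypothetical helper for illustration only: real tools tolerate
    genotyping errors and phase uncertainty, which this sketch ignores.
    """
    n = len(hap_a)
    if len(hap_b) != n or len(cm_pos) != n:
        raise ValueError("haplotypes and map positions must have equal length")

    segments = []
    start = None  # index where the current run of matches began
    for i in range(n + 1):
        # Treat the position past the last marker as a mismatch so that
        # a run reaching the end of the chromosome is closed properly.
        match = i < n and hap_a[i] == hap_b[i]
        if match and start is None:
            start = i
        elif not match and start is not None:
            length = cm_pos[i - 1] - cm_pos[start]
            if length >= min_cm:
                segments.append((start, i - 1, length))
            start = None
    return segments


if __name__ == "__main__":
    a = [0, 1, 1, 0, 1, 0]
    b = [1, 1, 1, 0, 0, 0]
    positions_cm = [0.00, 0.05, 0.10, 0.20, 0.25, 0.30]
    print(shared_segments(a, b, positions_cm, min_cm=0.1))
```

With denser markers, a run of a given genetic length contains more matching sites, which is one intuition for why higher variant density can help resolve shorter segments.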
Here, we conduct a comprehensive evaluation of IBD detection using sequence data, and compare the resolution of detectable IBD segment lengths with that of microarray-based genotype data using these two software packages (Germline and fastIBD). We investigate the power and false positive rate of these two approaches for microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. The results of our analysis can help guide the design of future next generation sequencing studies that utilize IBD.
[Not reproduced here: The average power of fastIBD and Germline, by segment size (cM).]
[Not reproduced here: The average false positive rate of fastIBD and Germline, by segment size (cM).]
[Not reproduced here: The total length (cM) of IBD segments detected on Chromosome 1 using fastIBD; axes: number of individuals, total length (cM).]
In this study, we examined how the density of genetic variants in a dataset affects the power to detect IBD between individuals. We found that analysis of sequence data with high SNP density improves resolution and power for detecting IBD relative to microarray-based genotyping, particularly for small segments. In our simulation, there was good power (80%) to detect IBD segments of size 0.4 cM using high coverage sequence data with a low false positive rate, compared to a power of approximately 77% for segments of size 1 cM using microarray genotype data (WTCCC).
It is possible that the methods we examined in this study may be further refined to improve the power to detect even smaller IBD segments. We found that Germline has slightly higher power to detect IBD using sequence data compared to fastIBD, but it has a much higher false positive rate. That is, for high variant density data, Germline detects many small segments, around 25% of which are false positives. We set the minimum detectable length to 0.1 cM when running Germline, which allows it to detect small segments but also increases the false positive rate. Germline also provided lower power for detecting IBD segments using the microarray dataset (from WTCCC). Given these observations, the current implementation of fastIBD appears to provide more robust and reliable detection of IBD segments than the current implementation of Germline for both low and high variant density data. We note that fastIBD represents a recent update to the HMM approach to IBD detection [2, 3, 13, 14], and was previously tested on microarray data.
The results of this study have important implications for the design of genetic studies of human diseases. Identity by descent estimation can be used to conduct family-based association studies, association studies in admixed populations, and homozygosity mapping, and improved resolution and detection of IBD can enhance the power of these approaches to detect human disease genes. In general, the expected length of IBD is 1/(2n) Morgans for a common ancestor from n generations ago for a large population, where the ancestral haplotype is transmitted across 2n meioses. The variance of the length of an IBD tract, however, is large, and the expected IBD lengths in a relatively small population could be affected dramatically by some aspects of population history (e.g., growth type and internal subdivision). In fact, recent studies have shown that both the amount of the genome shared identical by descent and the proportion of the genome that is covered by long runs of homozygosity differ by population [5, 17, 18]. A more detailed assessment of IBD across populations could help determine to what extent whole genome sequence data can improve the power of these mapping approaches. Additionally, the continued improvement of IBD detection methods and the testing of those methods on dense genetic data can provide a foundation for future genetic studies.
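As a rough illustration of the 1/(2n) Morgans relationship stated above, the following sketch converts between the number of generations to a common ancestor and the expected segment length in cM. The function names are hypothetical, and the calculation covers only the expectation; it ignores the large variance and the population-history effects noted in the text.

```python
def expected_ibd_length_cm(generations: int) -> float:
    """Expected IBD segment length in cM for a common ancestor
    `generations` ago, using 1/(2n) Morgans (1 Morgan = 100 cM)."""
    if generations < 1:
        raise ValueError("generations must be a positive integer")
    return 100.0 / (2 * generations)


def generations_for_length_cm(length_cm: float) -> float:
    """Invert the same approximation: the number of generations whose
    expected segment length equals `length_cm`."""
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return 100.0 / (2 * length_cm)


if __name__ == "__main__":
    for n in (5, 25, 50, 125):
        print(f"n = {n:>3} generations: {expected_ibd_length_cm(n):.2f} cM")
    for length in (2.0, 1.0, 0.4):
        print(f"{length} cM corresponds to about "
              f"{generations_for_length_cm(length):.0f} generations")
```

Under this approximation, the 1 cM and 0.4 cM segment sizes discussed above correspond to common ancestors on the order of 50 and 125 generations ago, respectively.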
Materials and methods
To assess the statistical power to detect chromosomal segments that are shared identical by descent between two individuals, we conducted a simulation study. First, we collected genotype data from four sources that represent different levels of coverage (that is, the proportion of all variants in the genome that are assayed by a given platform), ranging from microarray genotype data to deep coverage whole genome sequence data. These empirical genotype data include:
- Microarray genotype data (WTCCC) from the Wellcome Trust Case Control Consortium (WTCCC) study. We obtained genotype data on 1000 controls from the UK National Blood Donors (NBS) cohort genotyped on the Illumina 1.2 M chip. We used the SNP set released from the WTCCC database, which represents a cleaned set of data from their default QC procedures.
- Denser genotype data (HapMap) from the HapMap phase II project. We obtained genotype data on 60 unrelated samples from the CEU population (Utah residents with ancestry from northern and western Europe).
- Low coverage sequence data (1000 g) from the 1000 Genomes Project. We obtained genotype data on 283 individuals that originate from Europe sequenced with 4X coverage (2010.08 release).
- Deep coverage sequence data (complete) from University of California at San Francisco Whole Genome Sequencing Consortium and Complete Genomics. We obtained genotype data on 54 samples of European origin sequenced by Complete Genomics with an average of 50X coverage. We used the Complete Genomics default cut offs for full genotype calls (excluding partial and no calls), which pass a strict quality score metric.
Construction of artificial IBD for assessing power
As in the first simulation, we investigated 100 regions on Chromosome 1 for each of the 5 different lengths of composite segments (0.2, 0.4, 0.6, 1, and 2 cM). A subset of 100 individuals from each of the WTCCC and 1000 Genomes datasets was randomly selected to create the 10 composite individuals. We included the other 900 individuals in the WTCCC data and 183 individuals in the 1000 Genomes data for IBD analysis. We did not investigate the error rate for the HapMap and Complete Genomics data due to the limited number of individuals available for study. For each IBD segment length, the false positive rate is calculated as the number of SNPs that are detected as IBD divided by the total number of SNPs within the simulated segment. The error rates were then averaged over the 100 regions and all pairs of these 10 individuals. Germline requires phased genotype data as input. Thus, before running Germline, we phased the data using fastIBD. Both fastIBD and Germline generate a list of all pairwise IBD segments.
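The per-segment false positive rate defined above (SNPs detected as IBD divided by all SNPs within the simulated segment) can be sketched as follows. The function names, the use of physical positions, and the interval representation of detected segments are assumptions for illustration; they do not reflect the output format of fastIBD or Germline.

```python
from typing import Iterable, List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) positions


def false_positive_rate(
    snp_positions: Sequence[int],
    simulated_segment: Interval,
    detected: Iterable[Interval],
) -> float:
    """Fraction of SNPs inside the simulated segment that fall within
    any detected IBD interval for one pair of individuals."""
    seg_start, seg_end = simulated_segment
    in_segment = [p for p in snp_positions if seg_start <= p <= seg_end]
    if not in_segment:
        raise ValueError("no SNPs fall inside the simulated segment")
    detected = list(detected)
    flagged = sum(
        1 for p in in_segment if any(s <= p <= e for s, e in detected)
    )
    return flagged / len(in_segment)


def average_rate(rates: List[float]) -> float:
    """Average the per-region, per-pair rates, as in the text."""
    if not rates:
        raise ValueError("no rates to average")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    snps = [100, 200, 300, 400, 500, 600]
    rate = false_positive_rate(snps, (200, 500), [(250, 420), (480, 700)])
    print(rate)
    print(average_rate([rate, 0.0, 0.25]))
```

In the study design described above, one such rate would be computed for every region and every pair among the 10 composite individuals before averaging.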
We ran the fastIBD function in Beagle V3.3.1 with default settings. fastIBD applies a score threshold when detecting IBD. The results of a previous study show that a threshold of 10^-10 gives good power to detect IBD and also keeps the false discovery rate close to zero. Here, we used the default threshold of 10^-8. We used default settings in Germline V1.5.0 except that we set the minimum length (-min_m) to 0.1 cM. This allows Germline to detect small segments.
We would like to thank Tom Hoffman, Iona Cheng, and John Witte for review of the manuscript. This study makes use of data generated by the Wellcome Trust Case Control Consortium, the International HapMap Project, and the 1000 Genomes Project. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk, hapmap.ncbi.nlm.nih.gov, and http://www.1000genomes.org. The deep sequence data were contributed by the UCSF Sequencing Consortium (sequencing.galloresearch.org) and Complete Genomics (http://www.completegenomics.com).
- Weir BS, Anderson AD, Hepler AB: Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006, 7 (10): 771-780. 10.1038/nrg1960.
- Albrechtsen A, Sand Korneliussen T, Moltke I, van Overseem Hansen T, Nielsen FC, Nielsen R: Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet Epidemiol. 2009, 33 (3): 266-274. 10.1002/gepi.20378.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.
- Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, Olason PI, Ingason A, Steinberg S, Rafnar T, Sulem P, Mouy M, Jonsson F, Thorsteinsdottir U, Gudbjartsson DF, Stefansson H, Stefansson K: Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet. 2008, 40 (9): 1068-1075. 10.1038/ng.216.
- Albrechtsen A, Moltke I, Nielsen R: Natural selection and the distribution of identity-by-descent in the human genome. Genetics. 2010, 186 (1): 295-308. 10.1534/genetics.110.113977.
- Charlesworth D, Willis JH: The genetics of inbreeding depression. Nat Rev Genet. 2009, 10 (11): 783-796. 10.1038/nrg2664.
- Price AL, Helgason A, Thorleifsson G, McCarroll SA, Kong A, Stefansson K: Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 2011, 7 (2): e1001317. 10.1371/journal.pgen.1001317.
- Almasy L, Blangero J: Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet. 1998, 62 (5): 1198-1211. 10.1086/301844.
- Thomas A, Camp NJ, Farnham JM, Allen-Brady K, Cannon-Albright LA: Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays. Ann Hum Genet. 2008, 72 (2): 279-287. 10.1111/j.1469-1809.2007.00406.x.
- Zhang Q, Wang S, Ott J: Combining identity by descent and association in genetic case–control studies. BMC Genet. 2008, 9 (1): 42.
- Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe’er I: Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009, 19 (2): 318-326.
- Bercovici S, Meek C, Wexler Y, Geiger D: Estimating genome-wide IBD sharing from SNP data via an efficient hidden Markov model of LD with application to gene mapping. Bioinformatics. 2010, 26 (12): i175-i182. 10.1093/bioinformatics/btq204.
- Browning SR, Browning BL: High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet. 2010, 86: 526-539. 10.1016/j.ajhg.2010.02.021.
- Browning BL, Browning SR: A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011, 88 (2): 173-182. 10.1016/j.ajhg.2011.01.010.
- Moltke I, Albrechtsen A, Hansen TV, Nielsen FC, Nielsen R: A method for detecting IBD regions simultaneously in multiple individuals, with applications to disease genetics. Genome Res. 2011, 21 (7): 1168-1180. 10.1101/gr.115360.110.
- Chapman NH, Thompson EA: A model for the length of tracts of identity by descent in finite random mating populations. Theor Popul Biol. 2003, 64 (2): 141-150. 10.1016/S0040-5809(03)00071-6.
- Auton A, Bryc K, Boyko AR, Lohmueller KE, Novembre J, Reynolds A, Indap A, Wright MH, Degenhardt JD, Gutenkunst RN, King KS, Nelson M, Bustamante CD: Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 2009, 19 (5): 795-798. 10.1101/gr.088898.108.
- Kirin M, McQuillan R, Franklin CS, Campbell H, McKeigue PM, Wilson JF: Genomic runs of homozygosity record population history and consanguinity. PLoS One. 2010, 5 (11): e13996. 10.1371/journal.pone.0013996.
- WTCCC: Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature. 2010, 464 (7289): 713-720. 10.1038/nature08979.
- Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, Staker B, Pant KP, Baccash J, Borcherding AP, Brownley A, Cedeno R, Chen L, Chernikoff D, Cheung A, Chirita R, Curson B, Ebert JC, Hacker CR, Hartlage R, Hauser B, Huang S, Jiang Y, Karpinchyk V, Koenig M, Kong C, Landers T, Le C, Liu J, McBride CE, Morenzoni M, Morey RE, Mutch K, Perazich H, Perry K, Peters BA, Peterson J, Pethiyagoda CL, Pothuraju K, Richter C, Rosenbaum AM, Roy S, Shafto J, Sharanhovich U, Shannon KW, Sheppy CG, Sun M, Thakuria JV, Tran A, Vu D, Zaranek AW, Wu X, Drmanac S, Oliphant AR, Banyai WC, Martin B, Ballinger DG, Church GM, Reid CA: Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010, 327 (5961): 78-81. 10.1126/science.1181498.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.