hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups
© Ferdosi et al.; licensee BioMed Central Ltd. 2014
Received: 11 October 2013
Accepted: 27 May 2014
Published: 7 June 2014
Identification of recombination events and which chromosomal segments contributed to an individual is useful for a number of applications in genomic analyses including haplotyping, imputation, signatures of selection, and improved estimates of relationship and probability of identity by descent. Genotypic data on half-sib family groups are widely available in livestock genomics. This structure makes it possible to identify recombination events accurately even with only a few individuals and it lends itself well to a range of applications such as parentage assignment and pedigree verification.
Here we present hsphase, an R package that exploits the genetic structure found in half-sib livestock data to identify and count recombination events, impute and phase un-genotyped sires and phase their offspring. The package also allows reconstruction of family groups (pedigree inference), identification of pedigree errors and parentage assignment. Additional functions in the package allow identification of genomic mapping errors, imputation of paternal high density genotypes from low density genotypes, and evaluation of phasing results either from hsphase or from other phasing programs. Various diagnostic plotting functions permit rapid visual inspection of results and evaluation of datasets.
The hsphase package provides a suite of functions for analysis and visualization of genomic structures in half-sib family groups implemented in the widely used R programming environment. Low level functions were implemented in C++ and parallelized to improve performance. hsphase was primarily designed for use with high density SNP array data but it is fast enough to run directly on sequence data once they become more widely available. The package is available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm.
Keywords: SNP; Phasing; Imputation; Recombination; Haplotypes; Linkage analysis; Genotyping; Parentage testing; Pedigree reconstruction
Identification of recombination events and which chromosomal segments contributed to an individual is useful for a number of applications in genomic analyses including haplotyping, imputation, linkage disequilibrium, signatures of selection, and improved estimates of relationship and probability of identity by descent. This is particularly true for genomic prediction which has become an important tool in modern livestock breeding programs to predict the merit of individuals by estimating the genome-wide effects of the alleles they inherited from their ancestors. It is expected that the accuracy of prediction will be even higher once it is based on causal variants identified through sequencing instead of the currently used linked markers. In livestock, individuals of high genetic merit, particularly males, are widely used which leads to an overrepresentation of their genetics across the population. This stratification can be problematic for population based phasing algorithms which rely on samples being unrelated to each other and reasonably representative of the spectrum of genetic diversity. On the other hand this high level of relatedness between individuals provides a structure of high linkage disequilibrium which can be used to track chromosomal segments (haplotypes) throughout the population. By sequencing these overrepresented individuals and genotyping their descendants with high density marker panels, their full sequence data can be imputed for around one tenth of current sequencing costs. Availability of sequence data for a large number of samples will increase the power to identify causal variants, which in turn can replace the currently used evenly spaced marker panels with a smaller subset of trait specific variants that are either causal or in perfect LD with the causal variants. This implies the ability to accurately identify and track haplotypes in the population.
Here we present hsphase, an R package that implements a fast, deterministic and robust method for half-sib family structures to identify recombination events, phase family groups, impute and phase un-genotyped sires and build a library of haplotypes. The package also makes use of this population structure to evaluate correctness of recorded pedigrees, identify and fix pedigree errors, i.e. reassign individuals with wrong pedigree records to their correct sires; or even reconstruct family groups without pedigree records. If genotypes from candidate parents are available the package can be used for parentage verification.
Additional functions allow identification of genomic mapping errors and evaluation of phasing results generated by hsphase or other phasing programs. hsphase will also generate a blocking structure of chromosomal segments that define which progeny carry segments identical by descent. This can be used to improve phasing of the paternal sequence data [8, 9] and allows precise sequence imputation in the offspring. Imputation is important in association studies and genomic prediction to increase accuracy and power since a large number of samples can be genotyped at lower density (and lower cost) and imputed up to sequence level or to denser marker panels, which increases the level of linkage disequilibrium between SNP and causal variants.
hsphase seamlessly integrates into the R environment for pipelined analyses and provides a range of diagnostic plotting functions that permit rapid visual inspection of results and evaluation of datasets. Functions for pedigree checking, reconstruction and parentage assignment can be used independently or as part of a phasing workflow. For phasing purposes, the main advantages of hsphase are that it is extremely fast in comparison to population based phasing methods, can be used with small datasets and is not affected by sampling stratification. It also builds blocks of chromosomal inheritance in the half-sibs which makes it simple to impute when paternal sequence or higher density marker haplotypes are available. The package is sufficiently fast to be used directly on sequence data.
The hsphase package exploits the linkage disequilibrium found within a half-sib family and the information content of opposing homozygous SNP markers. An opposing homozygote, for any given marker, is defined as one individual being homozygous for an allelic variant and the other individual homozygous for the alternative allele.
hsphase was implemented as a package for the widely used R statistical programming environment and wrapper functions make it easy to use and facilitate integration with other R/Bioconductor packages. Programs such as snpQC output files in a format that can be used by hsphase. Source code, compiled package, tutorial and example dataset are available from the project’s website (the package is also available directly from CRAN). In the following section, the main components of the package are briefly described.
Main functions in hsphase
hsphase requires a SNP map file (name, chromosome and map positions), a genotype data file (numerically coded as 0, 1, 2 for the three genotypes and 9 for missing data) and a pedigree file (individuals and paternal ancestor). The latter can be generated from the data itself if no pedigree information is available or the pedigree is unreliable.
Pedigree reconstruction and parentage assignment
Block structure and recombination events
The bmh function creates the blocking structure for the half-sibs and splits them into two groups based on the chromosomal segments they inherited from either one of the sire’s haplotypes. Blocks for each chromosome are constructed by selecting the first opposing homozygous SNP on the chromosome and partitioning all members of a half-sib family into two groups according to their genotypes (i.e. all individuals with genotype AA are placed in one group – group 1, and all with BB in the other group – group 2). Starting from this initial grouping the function steps through the SNP according to their map order to allocate individuals into one group or the other, until the end of the chromosome is reached. At the end of the process each individual at each SNP will have been assigned to one of the two groups; the function returns a matrix of individuals by SNP coded as 1 and 2. Recombination between two adjacent SNP is an unlikely event, so from the second SNP onwards individuals are assigned to a group by minimizing the number of individuals that have to change groups in relation to the previous grouping (i.e. minimum number of recombinations). Recombinations are identified when an individual moves from one group to the other based on its opposing homozygous status. The bmh function validates each recombination by checking that the individual does not return to the previous group over the next steps (SNP). Recombinations occurring on both sides of a single SNP in a single individual are interpreted as a genotyping error and ignored. Group assignment is based on family relationships, which makes bmh sensitive to pedigree errors. In addition, only a proportion of SNP will be homozygous in any given individual, so family sizes need to be sufficiently large to reliably assign individuals to groups and markers sufficiently dense to correctly detect recombination events.
As a rule of thumb, families with at least 8 individuals and 50 k panels should yield very accurate results.
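The grouping step described above can be sketched in Python as follows. This is a simplified illustration rather than the bmh implementation: it only labels homozygous individuals at SNP where opposing homozygotes occur, it leaves all other cells as 0 instead of filling them, and it omits the validation step and the genotyping-error rule. All function names are hypothetical.

```python
# Simplified sketch of the blocking idea; NOT the bmh implementation.
# Genotypes: one list per half-sib, coded 0/1/2 (9 = missing), SNP in map order.

def block_sketch(genotypes):
    """Return a matrix coded 1/2 (group) or 0 (not informative at that SNP)."""
    n_ind = len(genotypes)
    n_snp = len(genotypes[0]) if n_ind else 0
    blocks = [[0] * n_snp for _ in range(n_ind)]
    last = [0] * n_ind  # last group seen per individual (0 = none yet)
    for s in range(n_snp):
        col = [row[s] for row in genotypes]
        if 0 not in col or 2 not in col:
            continue  # no opposing homozygotes at this SNP
        # Two possible labelings; keep the one with fewer group changes.
        best, best_cost = None, None
        for label in ({0: 1, 2: 2}, {0: 2, 2: 1}):
            cost = sum(1 for i, g in enumerate(col)
                       if g in label and last[i] != 0 and label[g] != last[i])
            if best_cost is None or cost < best_cost:
                best, best_cost = label, cost
        for i, g in enumerate(col):
            if g in best:
                blocks[i][s] = best[g]
                last[i] = best[g]
    return blocks


def count_switches(row):
    """Count group changes between consecutive informative SNP."""
    calls = [b for b in row if b != 0]
    return sum(1 for a, b in zip(calls, calls[1:]) if a != b)


# Toy family of four half-sibs; the third carries one paternal recombination.
family = [
    [0, 2, 0, 2, 0],
    [2, 0, 2, 0, 2],
    [0, 2, 0, 0, 2],
    [1, 2, 1, 2, 0],
]
blocks = block_sketch(family)
for row in blocks:
    print(row, count_switches(row))
```

In the toy data the third individual changes group between the third and fourth SNP, which the sketch reports as a single switch. Choosing the labeling with the fewest changes is what keeps the two group identities stable along the chromosome.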
Phasing and imputation
The function ssp imputes and phases the paternal haplotypes. The function infers the sires’ haplotypes at each SNP by simply averaging the sum of the genotypes of the half-sibs in a blocking group (alleles coded as 0 and 1; genotypes as 0 – 0/0, 1 – 0/1 and 2 – 1/1). Averages are rounded to the nearest integer and assigned to the sire’s haplotypes.
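To show how a block matrix lets the sire's haplotypes be recovered from his offspring, the Python sketch below swaps in a simpler rule than the averaging used by ssp: within a blocking group, an offspring with genotype 0 must have received allele 0 from the sire and an offspring with genotype 2 must have received allele 1. Markers with no homozygous offspring in a group, or with conflicting evidence, are left undefined (9). The function name and the toy data are hypothetical.

```python
# Homozygote-based sketch of sire inference; this is NOT the averaging
# rule used by ssp, only a related illustration under stated assumptions.
MISSING = 9


def infer_sire_haplotypes(genotypes, blocks):
    """genotypes, blocks: individuals x SNP lists; blocks coded 1/2 (0 = unknown).
    Returns two haplotypes with alleles 0/1, or 9 where undefined."""
    n_snp = len(genotypes[0])
    haps = ([MISSING] * n_snp, [MISSING] * n_snp)
    for s in range(n_snp):
        for group in (1, 2):
            alleles = set()
            for geno_row, block_row in zip(genotypes, blocks):
                if block_row[s] != group:
                    continue
                if geno_row[s] == 0:
                    alleles.add(0)  # 0/0 offspring: paternal allele is 0
                elif geno_row[s] == 2:
                    alleles.add(1)  # 1/1 offspring: paternal allele is 1
            if len(alleles) == 1:   # unambiguous evidence only
                haps[group - 1][s] = alleles.pop()
    return haps


genotypes = [[0, 2, 1, 1], [2, 0, 1, 2], [0, 1, 2, 1]]
blocks = [[1, 1, 1, 1], [2, 2, 2, 2], [1, 1, 2, 2]]
hap1, hap2 = infer_sire_haplotypes(genotypes, blocks)
print(hap1, hap2)
```

With only three offspring, the last two markers of the first haplotype stay undefined because every offspring in that group is heterozygous there. Larger families leave fewer such gaps.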
The phf function phases the offspring and returns their paternal haplotypes. It uses the sire’s phased haplotypes as a reference and overlaps the block matrix to select which parts of the haplotypes each individual inherited. Once the paternal haplotypes of the offspring are created, the maternal ones are obtained by simply subtracting these haplotypes from the original genotypes.
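The selection and subtraction steps described for phf can be sketched in a few lines of Python. This is an illustration under the coding assumptions used in the paper (genotypes 0/1/2, alleles 0/1, 9 for missing, blocks coded 1 and 2), not the package code; the function name is hypothetical.

```python
# Sketch of offspring phasing from sire haplotypes and a block row.
# Not the phf implementation; names and toy data are illustrative.
MISSING = 9


def phase_offspring(genotype, block_row, sire_haps):
    """Return (paternal, maternal) haplotypes for one offspring."""
    paternal, maternal = [], []
    for s, (g, b) in enumerate(zip(genotype, block_row)):
        # Pick the sire haplotype indicated by the block (1 or 2).
        p = sire_haps[b - 1][s] if b in (1, 2) else MISSING
        # Maternal allele = genotype minus paternal allele, when both are known.
        m = g - p if (p != MISSING and g != MISSING) else MISSING
        if m not in (0, 1):
            m = MISSING  # inconsistent or unknown
        paternal.append(p)
        maternal.append(m)
    return paternal, maternal


sire_haps = ([0, 1, 0, 1], [1, 0, 1, 1])
print(phase_offspring([0, 2, 2, 1], [1, 1, 2, 2], sire_haps))
print(phase_offspring([2, 1, 9, 2], [2, 0, 2, 2], sire_haps))
```

The second call shows how an unknown block or a missing genotype propagates: the paternal allele is still taken from the sire where the block is known, but the maternal allele stays undefined.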
The function impute imputes the paternal strand of half-sib families from low density genotypes to high density by using the sire’s haplotypes as a scaffold. Similarly to the function phf it simply uses the blocks to match the haplotypes of the offspring with the correct haplotype of the sire and fills the missing markers with the haplotypes of the denser panel.
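A minimal Python sketch of the scaffold idea follows. It assumes block calls are available only at the low density SNP and extends them to the high density map using the flanking calls; markers between two flanking calls that disagree (a recombination somewhere in between) are left undefined. This interval rule, the function name and the toy data are assumptions made for illustration, not the behaviour of the impute function.

```python
# Sketch of low-to-high density imputation of the paternal strand.
# Not the hsphase impute function; the flanking rule is an assumption.
MISSING = 9


def impute_paternal(sire_haps, ld_index, ld_blocks):
    """sire_haps: two high density haplotypes (alleles 0/1).
    ld_index: positions of the low density SNP within the high density map.
    ld_blocks: block call (1/2, 0 = unknown) at each low density SNP."""
    n_hd = len(sire_haps[0])
    calls = {i: b for i, b in zip(ld_index, ld_blocks) if b in (1, 2)}
    left, cur = [0] * n_hd, 0
    for s in range(n_hd):            # nearest call at or before s
        cur = calls.get(s, cur)
        left[s] = cur
    right, cur = [0] * n_hd, 0
    for s in reversed(range(n_hd)):  # nearest call at or after s
        cur = calls.get(s, cur)
        right[s] = cur
    out = []
    for s in range(n_hd):
        l, r = left[s], right[s]
        if l == r or r == 0:
            group = l
        elif l == 0:
            group = r
        else:
            group = 0                # flanks disagree: leave undefined
        out.append(sire_haps[group - 1][s] if group else MISSING)
    return out


sire_haps = ([0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1, 1])
imputed = impute_paternal(sire_haps, ld_index=[1, 4, 6], ld_blocks=[1, 1, 2])
print(imputed)
```

In the example the offspring switches from the first to the second sire haplotype between low density SNP 4 and 6, so the single high density marker between them is not called.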
For large datasets the para function provides a parallelized wrapper to partition the job across multiple CPUs.
To discuss the use of the hsphase package, a dataset of 106 brown Hanwoo Korean cattle genotyped on the Illumina 700 k BovineHD BeadChip SNP array was used. Individuals belonged to 14 half-sib family groups with family sizes ranging from 6 to 8. Genotypes for the 14 sires were also available and pedigree records were accurate. For reference purposes the Korean Hanwoo are a pure-bred heavily selected population with a small effective population size (Ne ~100) and there is some ascertainment bias in the chip which was not specifically designed for the breed. Population differences among unrelated individuals are expected to be lower than in populations with large Ne.
Map errors due to errors in the reference assembly can also be identified by visual inspection of the block structures (Figure 9B). This is characterized by an individual SNP (or a few SNP in a region) that shows an excessive number of recombinations. Map errors are consistent across families, meaning that the same SNP show excessive recombination across all family groups. With the method used in hsphase, a map error leads to downstream blocking problems and individuals start showing patterns of recombination at the same SNP (Figures 7 and 9B). This can be corrected by deleting the region with the map error, provided it is not too long. The distinction between map errors, regions of high recombination and SNP genotyping problems is not entirely straightforward, particularly if the marker panel is not very dense.
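The visual check described above could be complemented by a simple count, sketched here in Python. The sketch tallies, for each family, how many individuals change group at each SNP and flags SNP where every family exceeds a threshold. The function names and the threshold `min_per_family` are hypothetical, and a flagged SNP is only a candidate that still needs inspection, for the reasons given in the text.

```python
# Sketch for flagging candidate map errors from block matrices.
# Names and the threshold are hypothetical; this is not an hsphase function.

def switch_counts(blocks):
    """Per-SNP number of individuals whose group changes at that SNP."""
    counts = [0] * len(blocks[0])
    for row in blocks:
        prev = 0
        for s, b in enumerate(row):
            if b == 0:
                continue  # uninformative cell
            if prev and b != prev:
                counts[s] += 1
            prev = b
    return counts


def suspect_snps(families, min_per_family=2):
    """SNP indices where every family shows at least min_per_family switches."""
    per_family = [switch_counts(b) for b in families]
    n_snp = len(per_family[0])
    return [s for s in range(n_snp)
            if all(c[s] >= min_per_family for c in per_family)]


fam1 = [[1, 1, 2, 2], [2, 2, 1, 1], [1, 1, 1, 1], [2, 2, 2, 2], [1, 1, 1, 2]]
fam2 = [[1, 1, 2, 2], [1, 1, 2, 2], [2, 2, 2, 2], [1, 1, 1, 1], [2, 2, 2, 2]]
print(switch_counts(fam1), switch_counts(fam2))
print(suspect_snps([fam1, fam2]))
```

Requiring the excess in every family reflects the observation that map errors are consistent across families, whereas a true recombination is specific to one individual.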
Accuracy of sire inference and imputation
To test the accuracy of imputation from low to high marker density we selected 46,174 SNP – the SNP in common with the 50 k bovine panel – in the Hanwoo offspring and excluded the others. We built the block structures for this subset of SNP and then used the impute function to fill the gaps using the sire’s phased genotypes as a scaffold. The average accuracy of imputation (proportion of paternal haplotypes correct out of total) for the 106 offspring was 0.981 (comparison of 50 k imputed to 700 k with the true 700 k haplotypes). The worst accuracy was 0.977 and the best 0.993. Note that the accuracies were high but they were probably biased upwards since the sires were phased using hsphase and there is some circularity in these values. In the absence of true phased sire data this issue cannot be resolved unambiguously. We also evaluated the accuracy of sire inference (comparison of inferred genotypes with the true genotypes of the sires). The average accuracy was 0.992, with the worst sire 0.985 and the best 0.997. Undefined regions were not called (average 16.5% of SNP). A comprehensive evaluation of the phasing method used in hsphase is given in .
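The accuracy measure used here (proportion of correct alleles) can be written as a short Python sketch. It assumes that undefined positions, coded 9, are excluded from the denominator and reported separately, which matches the statement that undefined regions were not called but is otherwise an assumption. The function name and the toy vectors are illustrative and are not the data from the paper.

```python
# Sketch of an accuracy measure for inferred haplotypes.
# Assumption: positions coded 9 (undefined) are excluded and reported separately.

def haplotype_accuracy(inferred, truth, missing=9):
    """Return (proportion correct among called positions, proportion uncalled)."""
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must have the same length")
    called = [(i, t) for i, t in zip(inferred, truth)
              if i != missing and t != missing]
    uncalled = 1 - len(called) / len(inferred)
    if not called:
        return float("nan"), uncalled
    correct = sum(1 for i, t in called if i == t)
    return correct / len(called), uncalled


inferred = [0, 1, 9, 1, 0, 1, 9, 0, 1, 1]
truth = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
accuracy, uncalled = haplotype_accuracy(inferred, truth)
print(accuracy, uncalled)
```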
hsphase is an R package for analysis and visualization of genomic structures in small half-sib groups. The package can be used to reconstruct pedigree, assign or verify parentage, impute and phase un-genotyped paternal ancestors, phase the half-sib groups and detect and quantify recombination events. Diagnostic plots assist identification of pedigree, mapping and phasing errors. Whilst designed for high density SNP arrays the algorithm is extremely fast and can be used directly on sequence data as it becomes available. Auxiliary functions to impute from low to high density markers and parse datasets are also included in the package.
Availability and requirements
The package is freely available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm. Source code, compiled package, a tutorial and example dataset are available from the project’s website.
Project name: hsphase
Project home page: http://www-personal.une.edu.au/~cgondro2/hsphase.htm
Operating system(s): platform independent
Programming language: R and C/C++
License: GNU GPL 3
CG and SHL were supported by a grant from the Next-Generation BioGreen 21 Program (No. PJ008196), Rural Development Administration (RDA), Republic of Korea. CG and BPK were supported by an Australian Research Council Discovery Project DP130100542. The authors wish to thank SheepGenomics, the Sheep Cooperative Research Centre, The National Institute of Animal Science, RDA and Livestock Improvement Corporation for sharing the genotypes used to test the method developed in this study.
- Edwards D: Modelling and visualizing fine-scale linkage disequilibrium structure. BMC Bioinformatics. 2013, 14: 179. 10.1186/1471-2105-14-179.
- Su SY, Kasberger J, Baranzini S, Byerley W, Liao W, Oksenberg J, Sherr E, Jorgenson E: Detection of identity by descent using next-generation whole genome sequencing data. BMC Bioinformatics. 2012, 13: 121. 10.1186/1471-2105-13-121.
- Gondro C, van der Werf J, Hayes B: Genome-Wide Association Studies and Genomic Prediction, Volume 1019. 2013, Springer: Humana Press.
- Meuwissen T, Goddard M: Accurate Prediction of Genetic Values for Complex Traits by Whole-Genome Resequencing. Genetics. 2010, 185 (2): 623-U338. 10.1534/genetics.110.116590.
- Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011, 12 (10): 703-714. 10.1038/nrg3054.
- Druet T, Macleod IM, Hayes BJ: Toward genomic prediction from whole-genome sequence data: impact of sequencing design on genotype imputation and accuracy of predictions. Heredity. 2014, 112 (1): 39-47. 10.1038/hdy.2013.13.
- Ferdosi MH, Kinghorn BP, van der Werf JH, Gondro C: Detection of recombination events, haplotype reconstruction and imputation of sires using half-sib SNP genotypes. Genet Sel Evol. 2014, 46: 11. 10.1186/1297-9686-46-11.
- Efros A, Halperin E: Haplotype reconstruction using perfect phylogeny and sequence data. BMC Bioinformatics. 2012, 13 (Suppl 6): S3. 10.1186/1471-2105-13-S6-S3.
- He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics. 2010, 26 (12): i183-i190. 10.1093/bioinformatics/btq215.
- Hoze C, Fouilloux MN, Venot E, Guillaume F, Dassonneville R, Fritz S, Ducrocq V, Phocas F, Boichard D, Croiseau P: High-density marker imputation accuracy in sixteen French cattle breeds. Genet Sel Evol. 2013, 45: 33. 10.1186/1297-9686-45-33.
- Hayes BJ: Efficient parentage assignment and pedigree reconstruction with dense single nucleotide polymorphism data. J Dairy Sci. 2011, 94 (4): 2114-2117. 10.3168/jds.2010-3896.
- Calus MPL, Mulder HA, Bastiaansen JWM: Identification of Mendelian inconsistencies between SNP and pedigree information of sibs. Genet Sel Evol. 2011, 43: 34. 10.1186/1297-9686-43-34.
- Gondro C, Lee SH, Lee HK, Porto-Neto LR: Quality control for genome-wide association studies. Methods Mol Biol. 2013, 1019: 129-147. 10.1007/978-1-62703-447-0_5.
- Ferdosi MH, Kinghorn B, van der Werf J, Gondro C: Effect of genotype and pedigree error on block partitioning, sire imputation and haplotype inference using the hsphase algorithm. AAABG Proceeding. 2013, Napier, New Zealand.
- Gondro C, Porto-Neto LR, Lee SH: R for genome-wide association studies. Methods Mol Biol. 2013, 1019: 1-18. 10.1007/978-1-62703-447-0_1.
- The R Development Core Team: R: A language and environment for statistical computing. 2014, Vienna: R Foundation for Statistical Computing.
- Knaus J: snowfall: Easier cluster computing (based on snow). R package version 1.84-6.
- Eddelbuettel D, Francois R: Rcpp: seamless R and C++ integration. J Stat Softw. 2011, 40 (8): 1-18.
- Eddelbuettel D: Seamless R and C++ integration with Rcpp, Volume 64. 2013, New York: Springer.
- Eddelbuettel D, Sanderson C: RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Comput Stat Data An. 2014, 71: 1054-1063.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.