PolyHaplotyper algorithm
Population structure
In principle, populations with any genetic structure can be haplotyped by our method. However, the algorithm takes advantage of the presence of full-sib (FS) families and their parents, where groups of FS families sharing common parents are jointly taken into account. FS families include any progenies of crosses between two parental plants and as such include F2 progenies derived from the selfing of one F1 plant, backcross progenies derived from the cross of one F1 plant and one recurrent parent plant, etc. All individuals not belonging to one of the specified FS families or their parents are considered to be unrelated. All individuals must be of the same, even ploidy level.
Input data
The input includes the population structure, the haploblock compositions and the biallelic marker dosage data. The population structure indicates parents and progeny belonging to each FS family. The same individual may be the parent of more than one FS family. All other individuals occurring in the marker data are haplotyped as if they were unrelated, even if a pedigree is available. The haploblock composition describes which biallelic markers belong to each haploblock. The marker data consist of a matrix that for each biallelic marker gives the dosages of one of its alleles for all individuals; missing data are allowed.
We use the symbols nmrk for the number of bi-allelic markers (SNPs) in the haploblock and nhap for the total number of possible haplotypes, where nhap = 2nmrk. A (phasing) “solution” for an individual is a combination of haplotypes that fits (results in) the observed SNP marker dosages. Finally Gmrk is the genotype of an individual expressed as the allele dosages of all markers in the haploblock (where the dosage of each marker is in 0—ploidy), and Ghap is the genotype of an individual expressed as the dosages of all haplotypes (these dosages sum to ploidy). Gmrk are the observed SNP marker dosages and are the input of the haplotyping process; Ghap are the inferred haplotype combinations, the result of the haplotyping.
The main function, inferHaplotypes, is essentially a wrapper that performs the haplotyping for each haploblock separately; linkage between haploblocks is not taken into account. In short, the algorithm applied to each haploblock (in function hapOneBlock) consists of three stages:
-
Stage 1: an inventory is made of haplotypes that are very likely present in the population, based on necessity (e.g. homozygous individuals) and a parsimony criterion
-
Stage 2: full-sib families and their parents are haplotyped based on observed segregation ratios; any new inferred haplotypes are added to the inventory of haplotypes; FS families that cannot be haplotyped as such are re-defined as unrelated material
-
Stage 3: the unrelated material (including any FS families added in stage 2) are haplotyped based on the inventory of known haplotypes and a parsimony criterion
These stages are elaborated below.
Stage 1: inventory of likely haplotypes
In this initial step, implemented in function inferHaps_noFS, we attempt to find a minimal set of haplotypes that together explain a large subset of the observed bi-allelic marker dosages. We apply an algorithm inspired by that of [25], developed for diploid populations, which aims to identify a parsimonious set of haplotypes. When there is a priori information available about haplotypes that are likely to be present in the population, this information is used.
This stage results in both a set of haplotypes known with some certainty to be present, and Ghap for many of the individuals. However, the purpose of this stage is only to obtain a set of haplotypes as input for the analysis of FS families (Stage 2) and the Ghap are ignored. If no FS families are present Stage 1 and 2 are skipped.
-
a.
Initially we identify haplotypes that are certain to be present in some minimum number of individuals. Individuals where all markers in the haploblock have a dosage equal to 0 or ploidy must be homozygous for a haplotype. Similarly, if only one of the markers has a dosage different from 0 or ploidy, two haplotypes are certain to be present. This can be generalized: if all combinations of haplotypes that explain the marker dosages of an individual contain a certain haplotype, that haplotype is known to be present. The use of a minimum threshold for the number of individuals (by default: 10%) reduces the chances to infer the presence of a haplotype due to dosage scoring errors.
This step results in a number of confirmed haplotypes, that are added to a priori known haplotypes (if provided).
-
b.
Next, for each individual that cannot be completely explained by the already known haplotypes, we find the solutions that involve the least number of additional haplotypes, and we determine which haplotypes are required in all of these solutions. Haplotypes that are required in some minimum number of individuals (also by default 10%), together with haplotypes defined in step a, become the new set of known haplotypes.
We repeat step b until the set of known haplotypes does not change anymore. It may happen that the repetitions of this step do not converge to a single set of known haplotypes but cycle through a few different sets. In that case we use the results of the first execution of this step, which are based on the haplotypes identified with (almost) complete certainty in step a.
Stage 2: haplotyping of full-sib families
If FS families are specified, perhaps along with unrelated material, we first group the FS families into groups where the families are linked through shared parents. For each group we find an optimal solution, using the procedure described below. All haplotypes present in the parents according to the optimal solutions for all groups are assumed to be confirmed and are used as input for the haplotyping of the unrelated material (Stage 3). Stage 2 is summarized in Additional file 3: Fig. S1.
For each FS family we rely on the parental marker dosages to be non-missing and correct. If one or both parents have missing dosages for one or more markers in the haploblock, then for this haploblock the FS family and its parents are considered to be unrelated material. This is also the case for FS families where no acceptable solution is found, as will usually be the case if there are errors in the parental genotypes.
For FS families where both parents have full marker dosage information we apply function solveOneFS: consider the possible combinations of parental Ghap, each of which leads to a prediction of the haplotype segregation in the FS progeny. In calculating the expected haplotype segregation a specified frequency of double reduction is taken into account (default 2.5%). We convert the expected haplotype segregation into an expected segregation of marker allele dosage genotypes, taking into account a small probability of errors in dosage scores (default: 2.5%). We test the observed Gmrk segregation in the FS against this expected segregation, using a chi-squared test. In order to limit the bias of this test caused by categories with small number of expected individuals we group all expected Gmrk with less than 1 expected individual, together with the observed but unexpected Gmrk, in a single group. After this step we select the best parental Ghap combinations: those whose chi-squared P value is not less than 0.001 * the P value of the best combination and larger than some minimum threshold (default 10–8).
Depending on the Gmrk of the parents there may be a large number of possible parental Ghap combinations. Checking one parental Ghap combination takes appreciable time (a standard desktop computer with Intel i5 processor at 1.60 GHz, Windows10 can check about 200,000 parental Ghap combinations per hour). Therefore we try to consider the most likely parental Ghap first, which are the ones that involve only the haplotypes determined to be present after Stage 1. If no good solution is obtained with these selected haplotype combinations we consider all other possible Ghap as well, although it is possible to specify a maximum number of parental Ghap combinations (default: 150,000), above which the FS family will be treated as unrelated material.
Once all FS families in a group with shared parents have been analyzed, some of the FS families may have more than one possible solution and/or the solutions of FS families with a shared parent may be incompatible (i.e. each specifying a different Ghap for the shared parent). The best solution over the group is selected in function resolveGroupConflicts as follows. First we get rid of the incompatible solutions by iterating the following steps.
-
a.
For each parent in the group of linked FS families we consider all Ghap that are a potential solution for any of the FS families of which it is a parent. For each of these Ghap we determine for which of the FS families it is NOT a solution and we sum the numbers of individuals in these FSs. In this way for each potential Ghap of each parent in the group we know how many FS individuals in the group are not explained by it.
-
b.
If no such Ghap have unexplained FS families, all remaining FS families in the group are compatible and we stop this process. Else we remove the parental Ghap with most incompatible FS progeny from the set of possible solutions of each of the FS families of which this parent is a parent. If this was the last remaining solution of an FS family, the entire FS family is considered as incompatible and is removed from the group. All FS progeny in that population and its other parent (if not shared by other FS families in the group) are removed from the group and the process is repeated. The removed FS family and parent will be treated as unrelated material for this haploblock.
After this iteration there are one or more sets of Ghap for all parents that together are solutions for the entire (remaining) group. If more than one such set exists, we select the optimal one. The optimal solution is found by multiplying the chi-squared P values over all (remaining) FS families in the group and selecting the one(s) with the maximum combined P value. If multiple overall solutions are equally optimal, we assign Ghap to all parents and FS individuals that have the same, unique Ghap under all these solutions and leave the others unassigned.
For FS families where a single solution is found all possible marker genotypes (Gmrk) are known. For FS individuals with one or more missing marker dosages these are imputed (function imputeFSindiv) if the remaining markers allow only one of these possible Gmrk. In order to avoid over-interpretation, the imputed marker genotypes are accepted only if less than half of the FS family is imputed, and if the chi-squared fit including the imputed progeny is not too much worse than without these individuals (the chi-squared P value with imputation must be > 0.1 * the P value without imputation).
Stage 3: haplotyping of remaining material
After all groups of FS families have been considered we may be left with FS families (and parents) for which no solution was found or that had a solution incompatible with the other FS families in their group. These individuals, together with any material that was not part of a FS family, are jointly analyzed as a group of unrelated material. This is also the case if no FS families were defined in the original population.
In this analysis all haplotypes present in the now haplotyped FS families and any haplotypes specified as known are taken as known haplotypes. The initial analysis of this material in stage 3 is identical to stage 1 and also performed by function inferHaps_noFS. After this analysis some of the selected Ghap may contain haplotypes that are not in the set of known haplotypes because they occur in less than the minimum number of individuals. If so, we run one extra cycle of step b (described under stage 1) with a lower threshold (by default 1%, but at least 2 individuals, instead of 10%) for the number of individuals in which a haplotype should occur to be considered “known”. If this results in more individuals with a unique solution and if no individuals that had a unique solution now don’t have one anymore, then we accept the result of this final cycle.
Implementation
The algorithm described above is implemented in an R package [26] named PolyHaplotyper which is available from CRAN [27]. The entire haplotyping process over multiple haploblocks in a single population (which may consist of multiple FS families, their parents and unrelated material) is performed through a single call to function inferHaplotypes. Apart from this function the package provides functions that help to format the data set, to merge replicate samples from the same individuals and to produce overviews and statistics.
Apart from the relations between FS families and their (possibly shared) parents, the haplotyping process does not make use of known pedigree relations. However, if a pedigree is available, PolyHaplotyper allows to check for each parents-offspring trio (or pair, if only one parent is available) whether the inferred haplotype composition of the offspring is compatible with that of the parents.
An important characteristic of our method is that it relies on considering all possible combinations of haplotypes that yield the observed marker dosages. The number of haplotype combinations quickly becomes astronomical with increasing ploidy level and increasing number of markers per haploblock. The implementation allows to generate the combinations as required, but the computation time then quickly becomes prohibitive. It is much more efficient to tabulate all haplotype combinations for all marker dosage combinations in advance, so that the table for the desired ploidy can be loaded at the start of a run. The package also provides functions to calculate these tables. In practice, these pre-calculated tables are currently limited to 8 markers per haploblock for tetraploids and 6 for hexaploids. Larger numbers would require a different access method as it would become impossible to store these tables in memory and the indices to the tables would overrun the capacity of the 32-bit integers in R.
Comparisons with other software
Our software is not the first that is aimed at short-range haplotyping in polyploids based on SNP dosage data, but to our knowledge it is the first that makes use of known FS families in the data. We compared the results of our software with those of SATlotyper [13] downloaded from [28], Happy-inf [19] downloaded from [29] and SHEsisPlus [17, 18] accessed online at [21]. All were used with their recommended or default settings. Functions are supplied in the PolyHaplotyper package to convert PolyHaplotyper-formatted input data to the formats required by these packages, and to reformat their output to PolyHaplotyper output format. Further, PolyHaplotyper contains functions to convert marker data simulated with PedigreeSim (Voorrips and Maliepaard, 2012) to the input format of PolyHaplotyper, and to compare haplotyping results with “true” simulated genotypes generated by PedigreeSim.
Data sets
Data set 1: hexaploid Chrysanthemum population
Data set 1 contains SNP dosage data obtained from chrysanthemum using a dedicated Axiom SNP array [20]. The hexaploid population consists of four FS families of 405, 53, 76 and 37 individuals, with 7 parents (families 1 and 4 share a common parent) and 53 unrelated individuals. For this population the true genotypes are unknown, but a pedigree is available that allows to check for inconsistent haplotype assignments. The data set contains 189 SNP markers, grouped in 43 haploblocks of 4 or 5 SNPs. Markers grouped in a haploblock mapped on the same sequence contig, based on sequence data generated prior to the array development.
Data set 2: simulated tetraploid population with 9 FS families
This tetraploid data set is the result of a simulation performed using PedigreeSim [30]. The population consists of 9 FS families of 50 individuals each, all sharing one common parent. The parents have been simulated so that they are genetically structured into three groups of related individuals. The number of founder alleles segregating per locus ranges between 14 and 26. However, the number of haploblock alleles is lower, due to the limited information content of small-size haploblocks. SNP-based heterozygosity per individual (i.e. the percentage of heterozygous calls over the total number of SNPs per individual) ranges from 73 to 85%. Results are presented for 300 biallelic (SNP) markers grouped in 60 haploblocks of 3—7 markers; all markers in a haploblock were simulated at the same genetic map position, i.e. no recombination could occur within haploblocks. The haplotyping results were compared to the true haplotype compositions, which are known from the simulation results.
Data set 3: simulated tetraploid population with 2 FS families and other material
This tetraploid data set is the result of a simulation performed using PedigreeSim [30]. The population consists of 2 FS families of 50 individuals each, sharing one common parent. The parents have been simulated as originating from two distinct random-mating populations. In addition there are two groups of 50 individuals, one group belonging to each of these two random-mating populations. Results are presented for 300 biallelic (SNP) markers grouped in 60 haploblocks of 3–7 markers; all markers in a haploblock were simulated at the same genetic map position, i.e. no recombination could occur within haploblocks. The number of haplotypes per haploblock ranges from 7 to 29, with an average of 16.2. The haplotyping results were compared to the true haplotype compositions, which are known from the simulation results.