A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes

Background Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs. Results We introduce a novel statistic, Hind/HE, that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of Hind/HE is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and identify individuals with unexpected ploidy and hybrid status. We demonstrate that the statistic is useful at read depths as low as five to 10, well below the depth needed for accurate genotype calling in polyploid and outcrossing species. Conclusions Our methodology for estimating Hind/HE across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04635-9.


Background
Highly duplicated genome sequences are common throughout the plant kingdom. These include recent allopolyploids such as wheat, cotton, canola, strawberry, and coffee, as well as species with evidence of ancient whole genome duplication such as maize and legumes [1]. This phenomenon is also present in the animal kingdom, for example allopolyploidy in the model frog Xenopus, as well as an ancient tetraploidization event followed by diploidization in salmonid fishes [2,3]. For species in which paralogous sequences no longer pair at meiosis, accurate separation of paralogs in DNA and RNA sequence analysis, including reference genome assembly, remains challenging [4]. This separation of paralogs is especially important in variant calling, because SNPs and indels will not behave in a Mendelian fashion if the reads originate from more than one locus yet are erroneously attributed to a single locus [5]. Accurate variant calling therefore impacts all downstream analysis that assumes Mendelian inheritance, including linkage and QTL mapping, association studies, genomic selection, population genetics, and parentage analysis. For example, failure to remove paralogs from downstream analysis has been demonstrated to bias estimates of allele frequency and inbreeding as well as population structure [4,[6][7][8].
Due in part to the difficulty of assembling highly duplicated reference genomes, several methods have been published for filtering collapsed paralogous loci from genotyping-by-sequencing (GBS, including restriction-site associated DNA sequencing (RAD) approaches) datasets without the need for a reference genome. The most straightforward approach is to call genotypes and then determine if observed heterozygosity exceeds expected heterozygosity [9][10][11]. However, sampling error at low read depth can confound this filtering step by causing heterozygotes to be miscalled as homozygotes, lowering the observed heterozygosity. Moreover, estimating observed heterozygosity becomes complicated when polysomic inheritance is expected, due to the challenge of estimating allele copy number. Bayesian genotype calling methods mitigate the underestimation of observed heterozygosity, but at substantial computational cost [12][13][14]. Another approach is to filter loci that have read depth above an arbitrary threshold [15], although due to differences in amplification efficiency based on fragment size and GC content, this method could fail to filter some paralogs while filtering other non-paralogous loci. Peterson et al. [16] developed a method, extended by Willis et al. [17], that involved counting the number of unique haplotypes per individual for a putative locus, with the idea that in a collapsed paralog, the number of haplotypes would exceed the ploidy. However, this method can be confounded by sequencing error and cross-contamination among samples, and its sensitivity depends on allele frequencies, inbreeding, and ploidy. Other approaches have examined read depth ratios within individual genotypes [18] as well as read depth ratios in combination with observed heterozygosity [19]. Lastly, multiple methods identify putative paralogs based on networks of similarity among sequence tags [6,20].
We present a novel statistic, H ind /H E , for evaluating marker quality, in particular for assessing whether a marker represents one Mendelian locus or multiple collapsed paralogous loci, based upon read depth distribution in a population. For a Mendelian locus, the statistic has the same expected value regardless of number of alleles, allele frequency, and total read depth. As a result, the distribution of the statistic can be visualized across loci in order to identify threshold values for filtering. The expected value can be calculated from ploidy (assuming disomic or polysomic inheritance) and the inbreeding coefficient, or the mode value of the statistic in a population can be used to estimate ploidy or inbreeding. Notably, because genotype calls are not needed in order to estimate this statistic, it can be used for filtering loci before any genotype calling is performed, saving computation time. Technical parameters such as sequencing error and overdispersion can influence estimates, but are explored here using simulated data so that they can be accounted for. We extend our Bayesian genotype calling software polyRAD [12] to implement the novel statistic and determine appropriate cutoffs.

The H ind statistic
Here we describe a novel statistic, H ind , that is based on sequence read depth across all alleles at a given locus and sample, and is agnostic of genotype calls, inheritance mode, and ploidy. It is related to observed heterozygosity, H O , which in a diploid can be thought of as a matrix of ones and zeros indicating whether the genotype at each sample*locus is heterozygous. H ind is instead a number ranging from zero to one, indicating the probability that if two sequencing reads were sampled without replacement at that sample*locus, they would represent different alleles. The abbreviation "ind" stands for "individual", as it is calculated for each individual before averaging across a population. It can be calculated for SNP loci or for multiallelic haplotype-or tag-based loci, as long as allelic read depth is available.
The expected value for H ind in a natural population of diploids or polysomic polyploids is: where k is the ploidy, H E is the expected heterozygosity at the same locus, and F is the inbreeding coefficient. H E is the probability that two alleles drawn at random from the population will be different, (1 -F) is the probability that two alleles randomly drawn from an individual will not be identical by descent, and (k -1)/k is the probability that two sequencing reads originated from different chromosome copies. Multiplied together, these three terms yield the probability that two sequence reads from one sample at one locus will be different from each other.
If we divide H ind by H E : we now have a statistic that is only dependent on ploidy and inbreeding, two parameters that we will assume to be consistent across samples and loci.
In a mapping population, the term H E * (1 -F) must be replaced by the probability, for a given locus, that two locus copies in a progeny will be different alleles. This requires knowledge of the ploidy, parental genotypes, and population design including number of generations of backcrossing and self-fertilization. This probability, which we will call H E. map , can be estimated by simulation of the cross. The expectation is then: For a population of n individuals with sequencing reads, allele frequencies are estimated from average within-individual read depth ratios: And expected heterozygosity is estimated as Averaged across n individuals with two or more reads at a given locus in a natural population, the expectation is then: In a mapping population, Ĥ ind,m is estimated in the same way. H E.map is estimated from parental genotypes and population design, and the expected average ratio within a locus is given in Eq. 3.

Utility of H ind /H E for detecting collapsed paralogs in a diversity panel
To compare the distribution of H ind /H E values for Mendelian loci versus collapsed paralogs, we aligned Miscanthus sacchariflorus tag sequences to the M. sinensis reference genome, in which they should align to the correct paralog most of the time, and to the Sorghum bicolor reference genome, in which two paralogs from Miscanthus correspond to one alignment location. We found that loci with a mean read depth less than five had very low estimates of H ind /H E , likely due to restriction site polymorphisms or other technical issues ( Fig. 1 and Additional file 1: Fig. S1). As mean read depth increased above 100 in our dataset, however, loci tended to have H ind /H E values above the expectation for a Mendelian locus, suggesting that most loci at this depth and higher were in fact collapsed paralogs (Fig. 1, and Additional file 1: Figs. S1 and S2).
When a mean depth of five was used as a cutoff and the M. sinensis genome was used as a reference, the peak H ind /H E value was slightly below the expected values of 0.5 for diploids and 0.75 for tetraploids ( Fig. 2), indicating some inbreeding, likely due to population structure [22]. A second peak was observed at a higher value of H ind /H E (Fig. 2), likely representing sets of tags that belonged to different Mendelian loci despite aligning to the same location (i.e., misalignments). When S. bicolor was used as the reference genome, the opposite trend was observed, where most loci had a H ind /H E above the expected value, indicating collapsed paralogs, but a second peak was observed closer to the expected value, indicating regions in the S. bicolor genome that may only have synteny with one region of the M. sinensis genome (Fig. 2). Although the peaks overlapped somewhat, they were distinct enough that a reasonable threshold for identifying putative collapsed paralogs could be visually determined (Fig. 2). Moreover, although the diploid and tetraploid datasets were processed separately, they were largely in agreement about which loci were Mendelian and which were collapsed paralogs (Additional file 1: Fig.  S2), suggesting that the filtering performed in one population can be applied to another population, which could be especially useful for populations that are too small for accu- In both the diploid and tetraploid datasets, the distribution and peak values of H ind /H E were similar regardless of whether biallelic SNPs or multiallelic, haplotypebased markers were used (Additional file 1: Fig. S3). However, the variance of H ind /H E was approximately 20% higher when SNPs were used, suggesting that the higher information content of multiallelic markers improves the precision of H ind /H E estimates.  The ExpectedHindHe function in polyRAD was used to set thresholds for filtering the diploid and tetraploid datasets. Based on results from the TestOverdispersion function, the overdispersion parameter was set to 11 for diploids and 10 for tetraploids. Based on the observed distribution of H ind /H E in the dataset, the inbreeding coefficient was set to 0.35 for diploids and 0.25 for tetraploids. Based on these parameters, as well as read depth and allele frequencies in the datasets, the ranges for retaining 95% of Mendelian loci were 0.175-0.584 in diploids and 0.356-0.716 in tetraploids as estimated by Expect-edHindHe, resulting in 40.2% and 42.3% of loci being filtered, respectively ( Table 2). Markers within genes were underrepresented among markers that were filtered for having H ind /H E below the lower threshold, and overrepresented among markers that were filtered for having H ind /H E above the upper threshold, significant in Fisher's Exact Test at P < 0.0005 ( Table 2). Markers that were filtered having H ind /H E above the upper threshold tended to have minor allele frequencies that were very low, consistent with the markers representing sequencing error rather than true alleles, or very high, consistent with the markers representing collapsed paralogs (Fig. 3).

By individual, H ind /H E reflects ploidy and hybrid status
In addition to evaluating the mean H ind /H E within loci, we also obtained the mean statistic within individuals in order to assess the utility of the statistic for determining ploidy. We found that H ind /H E increased with ploidy, largely independent of read depth (Fig. 4). Although the distributions overlapped too much for H ind /H E to be a conclusive indicator of ploidy, it could still potentially be used to identify outlier individuals whose ploidy should be confirmed by other means (e.g., flow cytometry). Additionally, because our empirical dataset included many natural interspecific (M. sacchariflorus × M. sinensis) F1 hybrid and backcross individuals, we were also able to observe that H ind /H E values were considerably higher in hybrids than in non-hybrids, reflecting higher heterozygosity (Fig. 4).

Variance and bias in the H ind /H E statistic using simulated data
Using simulated data resembling a diversity panel or natural population, the mean H ind /H E estimate decreased as inbreeding increased, with diploid and tetraploid loci being indistinguishable at an inbreeding coefficient of 0.8 or higher (Fig. 5). Sequencing error had little effect on the estimate at a minor allele frequency of 0.05, but caused an inflated estimate at a minor allele frequency of 0.01, particularly as inbreeding increased (Fig. 5). Variance and bias in the statistic were minimized if there were at least 500 samples, minor allele frequency was 0.05 or higher, and read depth was at least 5 ( Fig. 6). Ploidy had negligible impact on variance and bias (Fig. 6). Read depth and minor allele frequency influenced the estimates for collapsed paralogs, but not enough to interfere with distinguishing them from Mendelian markers (Fig. 6). As expected, overdispersion (deviation of read depth ratios from allelic dosage ratios) reduced the mean H ind /H E estimate, with the effect of overdispersion being greater at higher minor allele frequencies (Additional file 1: Fig. S4). The H ind /H E estimate also decreased linearly as null allele frequency increased (Additional file 1: Fig. S5).
In simulated F1 mapping populations, the standard deviation of the H ind /H E ranged from 0.012 to 0.076 depending on the marker type (Fig. 7). In tetraploids, marker types with high expected heterozygosity in the progeny, such as triplex × nulliplex and triplex × simplex, had lower variance in the estimate than marker types with lower expected heterozygosity in the progeny, such as simplex × nulliplex and simplex × simplex (Fig. 7). A few rare markers had H ind /H E estimates that deviated very far from the expected value, indicating that the parents were incorrectly genotyped (Fig. 7).

Comparison with other approaches
To compare effectiveness at filtering paralogs between H ind /H E and other approaches, 1000 Mendelian loci and 1000 collapsed paralogs were simulated in 200 diploid and 200 tetraploid individuals at three levels of inbreeding. The median allele frequency was 0.026 and median read depth per Mendelian locus was 21. For each statistic, the 95th percentile for Mendelian loci was determined, and the proportion of collapsed paralogs that would be filtered at that threshold was estimated.  in tetraploids at inbreeding levels of 0.1, 0.5, and 0.9, respectively. The haplotype counting approach [17] and allelic depth ratio Z-score approach [19] both performed reasonably well in diploids but were much less effective in tetraploids, with haplotype counting being useless in tetraploids at high inbreeding, while the Z-score approach additionally suffered in terms of computational time due to the need for genotyping. However, haplotype counting used two-to threefold less computational time than H ind /H E , and thus could be advantageous in diploids when millions of loci are being processed. Lastly, filtering on read depth alone was not very effective given the variation in read depth among loci.

Properties of the H ind /H E statistic
While H ind /H E can be used, in combination with other metrics, to assess locus quality, this should be performed with an understanding of what biological and technical phenomena can cause it to deviate from the expected value. Inbreeding from any source will lower the expected value below (k -1)/k, where k is the ploidy; this includes not only self-fertilization and preferential mating with relatives, but also population structure, which is why we observed values below (k -1)/k even in self-incompatible, wind-pollinated M. sacchariflorus (Figs. 1, 2, 3, 4). A benefit of this, however, is that as long as ploidy is known and overdispersion can be reasonably estimated (e.g., with the TestOverdispersion function in polyRAD), H ind /H E can be used to estimate inbreeding, either at the population or individual level, directly from sequence read depth. Given that we observed H ind /H E to be inflated at low minor allele frequencies, we recommend using the mode H ind /H E at markers with minor allele frequency of at least 0.05 for estimating inbreeding. Additionally, individuals that are hybrids between species or between highly diverged populations, as well as DNA samples that are an accidental mix of two or more individuals, may have H ind /H E above the expected value (Fig. 4). Strong selection for homozygotes or heterozygotes at particular loci would be expected to lower and raise H ind /H E , respectively. At the locus level, a H ind /H E that exceeds the expected value can be an indication that alleles are derived from paralogous loci rather than a true Mendelian locus. More broadly, if all alleles truly belong to a single locus, then the expected value is (1 -F) (k -1)/k. However, if a set of random, independent alleles were assigned to one putative locus, the expected value of H ind /H E would be one, because the probability of sampling reads from two different alleles within one individual would be the same as the probability of sampling reads from two different alleles in the general population. In the M. sacchariflorus dataset, markers within genes were overrepresented among markers that were filtered for having H ind /H E above the expected value, likely due to high sequence conservation between paralogs ( Table 2). A H ind /H E of zero could indicate a cytoplasmic marker, because while there may be variation in the population, each individual would only be expected to possess reads from one allele. Loci with highly overdispersed read depth distributions due to technical issues such as differential fragment size or variation in library preparation would also be expected to have H ind /H E below expectations; it may be advantageous to filter these from the dataset as they will tend to yield poor-quality genotype calls. Lastly, loci with common null alleles have lower than expected H ind /H E (Additional file 1: Fig. S3), resulting in a tendency to filter loci that are not within genes as these regions are less conserved (Table 2). Null alleles can be the result of restriction cut site polymorphism in RAD-based techniques, primer binding site mutations in amplicon sequencing, or deletion mutations using any genotyping method. Because they are a common problem, H ind /H E can be used to identify and filter loci with null alleles.
The expected value of H ind /H E is independent of read depth, number of individuals sampled, and the allele frequency. However, all of these factors influence the variance of the estimate, and low minor allele frequency especially can bias it upwards (Figs. 5, 6). As there is no generalized formula to estimate the variance of a ratio, the variance of H ind /H E cannot be estimated mathematically. Moreover, sequencing error inflates the estimate at low minor allele frequency (Fig. 5), and polyRAD cannot account for sequence quality scores or alignment quality scores since it only imports allelic read depth. We therefore recommend simulating data for Mendelian loci given the ploidy, inbreeding, sample size, sequencing error rate, and distribution of read depth and allele frequency observed in the dataset of interest. The distribution of H ind /H E across simulated loci then can be used to determine cutoff values for filtering loci in the empirical dataset. The ExpectedHindHe and ExpecteHindHeMapping functions are available in polyRAD for this purpose, and suggest cutoffs for filtering loci in order to retain 95% of Mendelian loci. Depending on the downstream application, we recommend considering the number of markers needed versus the importance of marker quality when determining thresholds for read depth, allele frequency, and H ind /H E .
H ind /H E is more useful for detecting paralogs when haplotypes are treated as alleles (i.e., loci can be multiallelic), as opposed to when all loci are treated as biallelic SNPs, simply due to the fact that multiallelic markers are more information-rich than biallelic markers for the same distribution of minor allele frequencies. We observed that, for the same set of SNPs in M. sacchariflorus, the median value of H ind /H E per locus was very similar regardless of whether they were phased into haplotypes within the span of a single RAD tag, but the variance in H ind /H E was about 20% higher for SNPs vs. haplotypes (Additional file 1: Fig. S3). This improved power and information content is why polyRAD generally imports multiallelic, haplotype-based genotypes rather than SNPs as the default. Other methods for marker calling in highly duplicated genomes have also benefitted from the use of haplotype information [11,23], and multiallelic markers have been found to be advantageous over biallelic SNPs for linkage mapping in polyploids [24]. It should be noted that in this study we only phased SNPs that were certain to have originated from the same sequencing reads based on physical linkage and read depth. The H ind /H E statistic cannot be estimated using haplotypes spanning longer distances, given that read depth will vary from locus to locus within haplotype.

Uses of the H ind /H E statistic
We anticipate locus-filtering to be the most common application of the H ind /H E statistic, with major advantages being that it is not biased by read depth or allele frequency and can be estimated prior to genotype calling. We demonstrate that it is similar to H O /H E in effectiveness for filtering paralogs, with substantial savings on computational time ( Table 2). We should note that our H O /H E estimates used Bayesian genotype calls from polyRAD, which mitigate the underestimation of observed heterozygosity as compared to naïve genotype calls [12]. Stringency of filtering should depend on the genotype quality needed for downstream analysis; for example, parentage analysis and QTL mapping are sensitive to genotyping errors, whereas genome-wide association studies and estimations of population structure from principal components analysis are less sensitive. Missing data rate, median read depth, and minor allele frequency are common criteria that should be used in combination with H ind /H E to determine which loci to retain for downstream analysis. In our empirical dataset, we found the loci ranging in depth from five to 100 had the best distribution of H ind /H E (Fig. 1), but a higher minimum depth may be required for applications that require accurate genotype calling, and the optimal maximum depth used in filtering depends on the overall depth of the dataset. The use of observed heterozygosity, read depth ratios within genotypes, and number of haplotypes per individual are redundant with H ind /H E and unnecessary if it has already been used for filtering. In addition to its use for detecting paralogs in highly duplicated genomes, H ind /H E can be used for marker filtering in less duplicated genomes where occasional paralogs are still an issue. Additionally, in any species, markers with low values of H ind /H E (e.g., below the 95% confidence interval generated by simulated data) are likely to have null alleles, high overdispersion, or other technical issues and should generally be removed from the dataset. We found that using H ind /H E to filter our M. sacchariflorus dataset impacted minor allele frequency and proportion of markers in genes in ways consistent with the removal of markers with null alleles, collapsed paralogs, or false alleles due to sequencing error (Table 2 and Fig. 3).
Although less accurate for determining ploidy than techniques such as flow cytometry, when averaged within individuals, H ind /H E can be used to identify individuals whose ploidy might deviate from expectations and should be confirmed. If flow cytometry is not an option, several other tools exist for the estimation of ploidy directly from nextgeneration sequencing data [25]. Lastly, H ind /H E could be potentially useful for improving reference genome assemblies, increasing the value of complementing a de novo assembly with a resequencing or genotyping-by-sequencing effort in a large population or diversity panel. Regions of the reference genome that contain collapsed paralogs are expected to have inflated H ind /H E values, which could be visualized in a smoothed plot of H ind /H E versus alignment position.
At a minor allele frequency of 0.05, a read depth of five or higher is sufficient to estimate H ind /H E with minimal variance (Fig. 6). It is notable that a read depth of five is too low to call genotypes with confidence, to some extent in diploids but especially in polyploids. However, using the H ind /H E statistic, such low depth data are useful for a variety of applications such as identification of outlier individuals in terms of ploidy and hybridity, estimation of inbreeding, identification of loci with technical issues, and assessment of reference genome quality. This in turn can enable researchers to reduce sequencing costs by generating preliminary, low-depth datasets to evaluate these issues before (or instead of ) sequencing more deeply.

Conclusions
Here we introduce the H ind /H E statistic, which can be used for evaluating marker and sample quality in genotyping-by-sequencing datasets for a variety of downstream applications. We demonstrate that reads from paralogous loci cause the statistic to be above the expected value, whereas technical issues such as overdispersion and null alleles cause the statistic to be below the expected value. In typical datasets (hundreds of individuals, read depth above five) the statistic has sufficiently low variance to be useful for filtering loci. The polyRAD R package can estimate H ind /H E , suggest filtering cutoffs based on simulated data, and perform genotyping after filtering.

Implementation in polyRAD
Functions for estimating H ind /H E and H ind /H E.map are available in polyRAD v1.2 and later, and are named HindHe and HindHeMapping, respectively. Both utilize an internal Rcpp function for fast calculation, take a RADdata object as input, and return a matrix of values, with samples in rows and loci in columns. The mean value across rows can then be used to get a per-sample estimate, for identifying individuals that are interspecies hybrids or unexpected ploidies. The mean value across columns can be used to get a per-locus estimate for filtering loci. Additionally, polyRAD v1.5 and later includes the ExpectedHindHe and ExpectedHindHeMapping functions, which simulate data to emulate the sample size, allele frequency distribution or parental genotypes, and read depth distribution of an empirical dataset, and return the distribution of H ind /H E as if all loci were Mendelian, giving the user reasonable thresholds to use for filtering loci.

Datasets for testing
Two types of datasets were used to test H ind /H E : (1) empirical data from a diversity panel of Miscanthus sacchariflorus, and (2) simulated datasets of diversity panels and of biparental F1 mapping populations. Previously published RAD-seq data for an M. sacchariflorus diversity panel [22] were used for the empirical tests. All species in the Miscanthus genus share an ancient genome duplication, increasing the chromosome number to 19 from the base of 10 in the Andropogoneae tribe [26][27][28]. Moreover, some populations of M. sacchariflorus display autotetraploidy in addition to this genome duplication (4x = 76) [22,29], allowing us to test our algorithm in situations where tetrasomic inheritance is expected, in addition to the more typical disomic inheritance. Miscanthus is also highly heterozygous due to being wind-pollinated and self-incompatible [30], thus heterozygosity cannot be used to identify paralogs as easily as it could in an inbred crop species. Together, these factors make M. sacchariflorus an ideal test case.
To compare values of H ind /H E in putatively Mendelian markers versus collapsed paralogs, markers were called from the same dataset using either Miscanthus sinensis or Sorghum bicolor as a reference because M. sinensis has a whole genome duplication with respect to S. bicolor. Raw sequence reads from M. sacchariflorus were processed by the TASSEL-GBSv2 pipeline [31] to identify unique tag sequences and their depths in all individuals. Tag sequences were then aligned to the Miscanthus sinensis v7.1 reference genome [32] and the Sorghum bicolor v3.1.1 reference genome [33] using Bowtie 2 [34]. The tag manager feature of TagDigger [35] was used to process the SAM files, recording the alignment location of each tag in both reference genomes. Tag alignment locations within the S. bicolor reference were retained for further analysis if they corresponded to two alignment locations in the M. sinensis reference matching the known synteny between chromosomes. Under this filtering, 239,501 tags were retained at 18,402 S. bicolor alignment locations corresponding to 36,804 M. sinensis alignment locations, in a set of 356 diploid and 268 tetraploid individuals. H ind /H E was then estimated per-locus in polyRAD for both the M. sinensis and S. bicolor alignments.
To compare the variance of H ind /H E when biallelic SNPs were used versus multiallelic, haplotype-based markers, the TASSEL-GBSv2 pipeline was used to call SNP variants from M. sacchariflorus and export them to VCF. Markers from chromosome 1 were imported to polyRAD using VCF2RADdata, with and without the option to phase SNPs into haplotypes, yielding 3710 and 10,458 loci, respectively. The phasing performed by VCF2RADdata only phases SNPs that are certain to have originated from the same reads based on allelic read depth and physical distance. H ind /H E was then estimated by locus in polyRAD separately for diploids and tetraploids.
Simulated diversity panel datasets were generated in order to assess the effect of minor allele frequency, sample size, read depth, sequencing error, overdispersion, inbreeding, ploidy, and null alleles on variance and bias of the H ind /H E statistic, using the SimGenotypes and SimAlleleDepth functions in polyRAD v1.6. See Clark et al. [12] (Eq. 2) for a definition of the overdispersion parameter; lower values result in allelic read depths that deviate further from the ratios expected based on allelic dosage. Three sets of data were simulated. (1) Minor allele frequencies of 0.01, 0.05, and 0.1; sample sizes of 100, 500, and 1000; and genotype read depths of 2, 5, 10, 20, 50, and 100 were simulated in all combinations under diploidy and tetraploidy, with no inbreeding, a sequencing error rate of 0.001, and an overdispersion parameter of 20. For each combination, 5000 biallelic loci were simulated, as well as 5000 collapsed paralogs that each consisted of two Mendelian loci combined. (2) Minor allele frequencies of 0.01 and 0.05, overdispersion spanning all integers from 5 to 20, sequencing error rates of 0 and 0.001, and inbreeding (F; the probability that two locus copies in an individual are identical by descent) spanning all intervals of 0.1 from 0 to 1 were simulated in all combinations under diploidy and tetraploidy, with a sample size of 500 and a read depth of 20. For each combination, 20,000 biallelic loci were simulated. (3) Minor non-null allele frequencies of 0.01 and 0.05 and null allele frequencies of 0.01, 0.05, 0.1, and 0.2 were simulated in all combinations under diploidy and tetraploidy, with a sample size of 500, a read depth of 20, a sequencing error rate of 0.001, overdispersion of 20, and no inbreeding. For each combination, 5000 triallelic (with one allele being null, i.e. having all of its reads discarded) loci were simulated.
To evaluate effectiveness of various approaches for filtering paralogs, 1000 Mendelian loci and 1000 collapsed paralogs were simulated in 200 diploid and 200 tetraploid individuals each at three levels of inbreeding. Number of alleles was evenly distributed from two to eight in Mendelian loci. Allele frequency was sampled from a gamma distribution with shape of 0.3 and scale of 1, divided by 10 and added to 0.01 to ensure a minimum minor allele frequency, given that allele frequency filtering is typically performed during variant calling and/or data import. One allele frequency at each locus was generated as one minus the sum of all other allele frequencies, to emulate the typical situation of one common allele and one or more rare alleles. Genotypes were simulated from the allele frequencies assuming an inbreeding coefficient (F) of 0.1, 0.5, or 0.9. Mean read depth per locus was drawn from a gamma distribution with a shape of 3.2 and scale of 8. Read depth at individual genotypes was then drawn from a gamma distribution with the locus depth/10 as the shape, and a scale of 10. Allelic read depth was simulated assuming an overdispersion parameter of 20 and a sequencing error rate of 0.001. Collapsed paralogs were simulated in the same way, but with number of alleles per locus ranging from one to eight, and two random loci being combined to form a collapsed paralog.

Comparison with other approaches
To call genotypes for the H O /H E and Z-score [19] approaches, the IterateHWE function in polyRAD was used with default parameters to obtain genotype probabilities, and then GetProbableGenotypes was used to get discrete genotypes, with genotypes set to missing if allele copy numbers did not add up to the ploidy. To extend its use to polyploids, H O was estimated as the probability that two alleles sampled from a genotype without replacement would be different from each other, averaged across individuals within a locus. The Z-score approach [19] was originally only defined for biallelic markers in diploids. To extend it for multiallelic markers and polyploid species, for each marker genotypes with ploidy -1 copies of the most common allele (i.e., the heterozygous genotype class expected to be most common) were identified, and allelic read depth summed across those samples. Deviation