nQuire: a statistical framework for ploidy estimation using next generation sequencing

Background Intraspecific variation in ploidy occurs in a wide range of species including pathogenic and nonpathogenic eukaryotes such as yeasts and oomycetes. Ploidy can be inferred indirectly - without measuring DNA content - from experiments using next-generation sequencing (NGS). We present nQuire, a statistical framework that distinguishes between diploids, triploids and tetraploids using NGS. The command-line tool models the distribution of base frequencies at variable sites using a Gaussian Mixture Model, and uses maximum likelihood to select the most plausible ploidy model. nQuire handles large genomes at high coverage efficiently and uses standard input file formats. Results We demonstrate the utility of nQuire analyzing individual samples of the pathogenic oomycete Phytophthora infestans and the Baker’s yeast Saccharomyces cerevisiae. Using these organisms we show the dependence between reliability of the ploidy assignment and sequencing depth. Additionally, we employ normalized maximized log- likelihoods generated by nQuire to ascertain ploidy level in a population of samples with ploidy heterogeneity. Using these normalized values we cluster samples in three dimensions using multivariate Gaussian mixtures. The cluster assignments retrieved from a S. cerevisiae population recovered the true ploidy level in over 96% of samples. Finally, we show that nQuire can be used regionally to identify chromosomal aneuploidies. Conclusions nQuire provides a statistical framework to study organisms with intraspecific variation in ploidy. nQuire is likely to be useful in epidemiological studies of pathogens, artificial selection experiments, and for historical or ancient samples where intact nuclei are not preserved. It is implemented as a stand-alone Linux command line tool in the C programming language and is available at https://github.com/clwgg/nQuireunder the MIT license. Electronic supplementary material The online version of this article (10.1186/s12859-018-2128-z) contains supplementary material, which is available to authorized users.


Background
Polyploidy, the presence of more than two complete sets of chromosomes, can under certain circumstances accelerate evolutionary adaptation by influencing the generation and maintenance of genetic diversity [1,2]. In addition, polyploidy also poses short-and long-term challenges to organismal fitness, which are associated with increased nuclear and cellular volume, propensity to aneuploidy, and disruption of gene expression regulation [3]. Interspecific comparisons between eukaryotic genomes can identify ancient polyploidization events. In contrast, more recent polyploidization events result in intraspecific variation of *Correspondence: hernan.burbano@tuebingen.mpg.de 1 Research Group for Ancient Genomics and Evolution, Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tuebingen, Germany Full list of author information is available at the end of the article ploidy and, in some cases, aneuploidy. The presence of individuals of different ploidy in a population can hinder mating. Therefore, intraspecific ploidy variation tends to occur -although not exclusively -in organisms that have the capacity to reproduce asexually [4][5][6], are selfcompatible or perennial [7].
Although ploidy traditionally has been investigated by measuring DNA content using flow cytometry [8], it can also be inferred from next generation sequencing (NGS) data either by examining k-mer distributions, or by assessing the distribution of allele frequencies at biallelic single nucleotide polymorphisms (SNPs) [4]. This methodology has been used to estimate ploidy in newly assembled genomes in order to identify the number of likely collapsed haplotypes on a per-contig basis [9], as well as to detect intraspecific variation of ploidy in the oomycete Phytophthora infestans [4,6] and in the Baker's yeast Saccharomyces cerevisiae [5]. It also was successfully used for ploidy estimation in P. infestans historic herbaria samples that are not suitable for flow cytometry, since they do not contain intact nuclei [4]. The method assumes that alleles present at biallelic SNPs occur at different ratios for different ploidy levels, that is, 0.5/0.5 in diploids, 0.33/0.67 in triploids, and a mixture of 0.25/0.75 and 0.5/0.5 in tetraploids. These ratios hold true for recent autopolyploids, as well as recent allopolyploids from heterozygous source genomes. To determine the ploidy level, the distribution of biallelic SNPs can be inspected visually [10], and/or qualitatively compared with simulated data [4]. However, this methodology does not provide summary statistics that permit quantifying how well the data fit the expected distributions, which is especially critical when dealing with noisy distributions typical for highly-repetitive genomes. An additional disadvantage of this approach is that it is preceded by the identification of variable sites ("SNP calling"), which is carried out using methodologies that benefit from a previously known ploidy level [11]. In a further development Gompert et al. [12] used biallelic SNPs in a Bayesian statistical approach to distinguish between ploidy levels from genotyping-bysequencing data. This method was primarily developed for resequencing studies, where typically multiple individuals from populations with ploidy variation are genotyped, as it benefits from preexisting knowledge about the ploidy levels that may be observed. Based on the posterior probabilities emitted by the Bayesian model, this approach separates samples into ploidy clusters, using dimensionality reduction methods such as PCA. Since it allows the inclusion of training data of known ploidy, test samples can be assigned a ploidy level if they belong to a cluster that includes samples of known ploidy.
Here we present a statistical model that aims to distinguish between the distribution of base frequencies at variable sites for diploids, triploids and tetraploids, directly from read mappings to a reference genome. It models base frequencies as a Gaussian Mixture Model (GMM), and uses maximum likelihood to assess empirical data under the assumptions of diploidy, triploidy and tetraploidy. We evaluated the performance of our method for different sequencing coverages using published genomes of S. cerevisiae [5], and high-coverage genomes of P. infestans produced for this study.

Model and Implementation
We used base frequencies at variable sites with only two bases segregating to distinguish between diploids, triploids and tetraploids (Fig. 1a). For that, we implemented a GMM that models the base frequency profiles as a mixture of three Gaussian distributions (Fig. 1b), which are scaled relative to each other. A log-likelihood can be calculated following: Here, n describes the numbers of data points and x i describes the value of each data point (i.e. the base frequencies). At sites with two bases segregating, we use the frequencies of both bases to achieve symmetry in the frequency distribution for triploids and tetraploids. μ j and σ j are the parameters of the j th of three Gaussian distributions N j that are scaled relative to each other through the parameter α j , with 3 j=1 α j = 1. This model allows for estimating the parameters of the Gaussian mixture components, as well as their mixture proportions, by maximizing the log-likelihood, either with or without constraints on the possible parameter space.
The likelihood maximization of the GMM is implemented through an Expectation-Maximization (EM) algorithm (Fig. 1b).
During the EM, we make use of the latent variables Z i . They represent the assignment of a data point to one of the mixture components. In the E-step, we use the current estimates for μ j , σ j and α j to calculate: This probability is calculated for all x i and j ∈ {1, 2, 3} to form a n × 3 matrix, the columns of which represent the probability of each data point belonging to either of the three mixture components. From this matrix we can calculate the column sum S j = n i=1 γ Z i ( j), which represents the size of each mixture component. In the following M-step, we update our estimates of μ j , σ j and α j : The log-likelihood is calculated after the M-step, and the next E-step is initiated unless the log-likelihood has changed by less then = 0.01 from the previous M-step.
As shown above, the algorithm allows the estimation of μ j , σ j and α j simultaneously. Henceforth, we call this setup the "free model". The log-likelihood of the free model upon convergence represents the optimum under the assumptions of the model.
We can also set certain parameters to fixed values and forgo their update in the M-step. We use this to maximize the log-likelihood under the expectation of diploidy (one Gaussian with mean 0.5), triploidy (two Gaussians with means 0.33 and 0.67) and tetraploidy (three Gaussians with means 0.25, 0.5 and 0.75), and call these the three "fixed models": In these three models, we only estimate σ j , while μ j and α j are fixed as shown above. Since all fixed models are nested within the free model, it is possible to directly compute the log-likelihood ratios, following: The logLs describe the distance between each fixed model and the best fit under the assumptions of the GMM. A substantially lower logL of one fixed model over the others supports the ploidy level described by this fixed model (Fig. 1c). Therefore, we used logL as summary statistics where the minimum value supports a given ploidy level.
Additionally, the GMM can be extended to a Gaussian Mixture Model with Uniform noise component (GMMU), by adding a uniform mixture component: The constraint on the mixture proportions then becomes 4 j=1 α j = 1. The uniform noise component is used in our implementation to allow base-line noise removal (Additional file 1: Figure S1). This can be useful when the Gaussian peaks are observable but embedded in a basal noise, which could be caused by highly repetitive genomes or low coverage.

Multivariate Gaussian clustering
To cluster samples in three dimensions based on the normalized maximized log-likelihoods of the three fixed models, we used the mclust5 package [13] of the R programming language [14]. This package utilizes mixtures of multivariate Gaussian distributions to detect clusters in an arbitrary number of dimensions. mclust5 allows to set constraints on the volume, shape and orientation of each mixture component, by varying features of their covariance matrix either within each sample, or for all samples at once. For the analysis displayed in Fig. 4, we used clusters of equal volume, but varying shape and orientation. This configuration represented the data the best, as assessed by the recovery of ploidy levels from cluster assignments.

Phytophthora infestans libraries
The two benchmarking libraries from P. infestans were generated according to the protocol by Meyer and Kircher [15] from DNA extracted from lab cultures [16]. These libraries were sequenced to high coverage on an Illumina HiSeq 3000 machine in paired end 150 bp mode. This

Results
Performance nQuire directly processes BAM files [17] and is designed to be efficient in memory usage and runtime. To process a 1GB S. cerevisiae BAM file (100x coverage), nQuire needs 70 s to build appropriate data structures, 6 s to run the models and calculate the maximum likelihood estimates, and uses a maximum of 8 Mb of RAM, whereas for processing a 10GB P. infestans BAM file (100x coverage) it needs 760 s, 100 s and 60 Mb of RAM, respectively. These benchmarks were performed on a single core of a Intel® Core™ i5-4670 CPU on a system with 16Gb of DDR3-1600 RAM and an SSD.

Analysis of individual samples
We evaluate nQuire's performance using three S. cerevisiae samples at 100x coverage, which represent each of the three ploidy levels evaluated by the model, as well as two P. infestans samples, one diploid and one triploid, at 210x and 368x coverage, respectively. The logL of each of the fixed models to the free model at full coverage is shown in Table 1. At those coverages, the logL of the best model is more than two times closer to the free model than the second best.  Table 1). The plots depict the change of logL of all fixed models as a function of genome coverage for the diploid (e) and triploid (f) sample a b Fig. 4 Clustering sets of samples into ploidy groups. Samples were clustered based on the normalized maximized log-likelihood of each fixed model representing each of the three ploidy levels. The clustering in three dimensions is shown with a set of three plots, one for each combination of ploidies. Shapes represent the manually assessed true ploidy levels, while colors show the assignments returned by the clustering algorithm. The agreement between the two is shown as a fraction. This analysis was conducted for samples before (a) and after denoising (b)

Coverage dependence
To investigate the impact of coverage on the performance of the GMM, we downsampled mapped reads from the S. cerevisiae (Fig. 2g-i) and P. infestans (Fig. 3e-f) strains shown in Table 1 to different coverage levels. In all cases the logLs of the two improper models increases with increasing coverage. In contrast, the logL of the best model stabilizes at low coverage and doesn't increase further as coverage increases. The coverage at which the logL of the best model is stable will be different for genomes of different size and complexity, as shown in the difference between the two organisms used for benchmarking.

Analysis of population samples
In cases where multiple samples are sequenced simultaneously, it might be impractical to assess ploidy in each sample individually. In these cases, we propose to use maximized log-likelihoods of the three fixed models, normalized by that of the free model, to cluster samples into ploidy groups. The rationale is that within one species, the relative likelihoods of the fixed models will be similar within each ploidy level. As a proof of concept, we applied this to all di-, tri-and tetraploids from the S. cerevisiae test set [5], and clustered the samples into three groups in three dimensions using multivariate Gaussian clustering (see "Methods" section). The sample set was  Table 1 were used to randomly sample 10 to 300 positions (left). The subsampling was extended to 20,000 positions (right) manually scored for ploidy, and the overlap between clusters and manually assessed ploidy level was calculated (Fig. 4). Running nQuire on raw data showed high recovery of ploidy level (93%, Fig. 4a), which was further improved through our denoising implementation utilizing the GMMU (96%, Fig. 4b).

Detection of Aneuploidies
Recent polyploidization is often associated with aneuploidies. To be able to detect those, nQuire allows to split the analysis of a sample by regions defined in BED format. We used this to reanalyze the sample YJM466 from the S. cerevisiae test set [5]. This sample had been shown to be triploid on whole genome level, but tetraploid for chromosome 6 and diploid for chromosome 9. The logLs for the three fixed models individually calculated for each of the 16 chromosomes of S. cerevisiae confirmed this observation (Fig. 5).
A natural extension of analyzing the genome by chromosomes is to use sliding windows to detect possible transitions between ploidies in aneuploid individuals. We used the three S. cerevisiae datasets shown in Table 1 at their full coverage to benchmark the number of randomly sampled positions needed to accurately assign ploidy (Fig. 6). For these test samples, 100-200 random sites at 100x coverage are enough to correctly assign ploidy based on the logL. However, this will vary for regions of the genome with different complexity and repetitiveness.

Discussion
In addition to nucleotide and structural variation, certain organisms can also vary intraspecifically in their ploidy level, which constitutes another source of variation that selection might act upon. Using NGS data our method permits assessment of ploidy variation from data that is usually generated for variant detection. In contrast to previous methods that visually analyze the distributions of SNPs at biallelic heterozygous sites [4,6,10], we quantitatively distinguish between different ploidy levels based on the distribution of base frequencies at variable sites, using relative differences in likelihoods. In comparison to the approach proposed by Gompert and Mock [12], nQuire avoids the requirement of high quality SNP calls. The higher level of noise in the data resulting from this is accounted for by using Gaussian distributions. They approximate a binomial process, but are impacted less by the effects of high coverage outliers, which arise for example from misalignments of paralogous sequences (Additional file 1: Figure S2). Additionally, nQuire is a Linux command line tool that uses standard file formats as input and handles large genomes at high coverage efficiently.
In all test cases, triploids were the easiest to distinguish, most likely caused by the lack of probability density around 0.5 compared to the other two models. While diploids and tetraploids are more difficult to tease apart, our results on coverage dependence show that at sufficient coverage, the data fits the true model much better than either of the two alternatives (Fig. 2g-i). For cases where the Gaussian peaks were largely overlapped by uniform noise, we extended our free model to include a uniform component, whose mixture proportion can be used -after likelihood maximization -for base-line removal. We show that this procedure improves the recovery of the true ploidy level when samples are clustered based on maximized likelihoods under the assumptions of the fixed models (Fig. 4b). We also show that few high quality positions are enough to estimate the correct ploidy level (Fig. 6). Such high quality positions can be selected by using stringent filtering criteria. Several filters are implemented in nQuire directly, thus no pre-processing of the BAM file (after duplicate removal) is necessary. They include minimum and maximum coverage, as well as mapping quality and base frequency filters. The default values of these filters have been configured to fit most applications. The exact coverage and number of positions needed for a reliable estimation of ploidy will however depend on the complexity and repetitiveness of the genome. Additionally, it is possible to obtain high quality positions by using BED files to define regions of low repetitiveness, where base frequencies can be more confidently assessed.

Conclusion
We present nQuire, a statistical approach to distinguish diploids, triploids and tetraploids of recent evolutionary origin based on the distribution of base frequencies at variable sites. The method facilitates analysis of ploidy in single samples, and we demonstrate how to apply it to population scale data, when available. nQuire can also interact with BED files, to limit the analysis to certain sequence features, or divide it by regions of the genome, for example to detect aneuploidies. Our approach will be useful to assess intraspecific variation in ploidy from both historic and modern samples, as well as in experimental evolution experiments.