Skip to main content
  • Methodology article
  • Open access
  • Published:

nQuire: a statistical framework for ploidy estimation using next generation sequencing

Abstract

Background

Intraspecific variation in ploidy occurs in a wide range of species including pathogenic and nonpathogenic eukaryotes such as yeasts and oomycetes. Ploidy can be inferred indirectly - without measuring DNA content - from experiments using next-generation sequencing (NGS). We present nQuire, a statistical framework that distinguishes between diploids, triploids and tetraploids using NGS. The command-line tool models the distribution of base frequencies at variable sites using a Gaussian Mixture Model, and uses maximum likelihood to select the most plausible ploidy model. nQuire handles large genomes at high coverage efficiently and uses standard input file formats.

Results

We demonstrate the utility of nQuire analyzing individual samples of the pathogenic oomycete Phytophthora infestans and the Baker’s yeast Saccharomyces cerevisiae. Using these organisms we show the dependence between reliability of the ploidy assignment and sequencing depth. Additionally, we employ normalized maximized log- likelihoods generated by nQuire to ascertain ploidy level in a population of samples with ploidy heterogeneity. Using these normalized values we cluster samples in three dimensions using multivariate Gaussian mixtures. The cluster assignments retrieved from a S. cerevisiae population recovered the true ploidy level in over 96% of samples. Finally, we show that nQuire can be used regionally to identify chromosomal aneuploidies.

Conclusions

nQuire provides a statistical framework to study organisms with intraspecific variation in ploidy. nQuire is likely to be useful in epidemiological studies of pathogens, artificial selection experiments, and for historical or ancient samples where intact nuclei are not preserved. It is implemented as a stand-alone Linux command line tool in the C programming language and is available at https://github.com/clwgg/nQuireunder the MIT license.

Background

Polyploidy, the presence of more than two complete sets of chromosomes, can under certain circumstances accelerate evolutionary adaptation by influencing the generation and maintenance of genetic diversity [1, 2]. In addition, polyploidy also poses short- and long-term challenges to organismal fitness, which are associated with increased nuclear and cellular volume, propensity to aneuploidy, and disruption of gene expression regulation [3]. Interspecific comparisons between eukaryotic genomes can identify ancient polyploidization events. In contrast, more recent polyploidization events result in intraspecific variation of ploidy and, in some cases, aneuploidy. The presence of individuals of different ploidy in a population can hinder mating. Therefore, intraspecific ploidy variation tends to occur - although not exclusively - in organisms that have the capacity to reproduce asexually [46], are self-compatible or perennial [7].

Although ploidy traditionally has been investigated by measuring DNA content using flow cytometry [8], it can also be inferred from next generation sequencing (NGS) data either by examining k-mer distributions, or by assessing the distribution of allele frequencies at biallelic single nucleotide polymorphisms (SNPs) [4]. This methodology has been used to estimate ploidy in newly assembled genomes in order to identify the number of likely collapsed haplotypes on a per-contig basis [9], as well as to detect intraspecific variation of ploidy in the oomycete Phytophthora infestans [4, 6] and in the Baker’s yeast Saccharomyces cerevisiae [5]. It also was successfully used for ploidy estimation in P. infestans historic herbaria samples that are not suitable for flow cytometry, since they do not contain intact nuclei [4]. The method assumes that alleles present at biallelic SNPs occur at different ratios for different ploidy levels, that is, 0.5/0.5 in diploids, 0.33/0.67 in triploids, and a mixture of 0.25/0.75 and 0.5/0.5 in tetraploids. These ratios hold true for recent autopolyploids, as well as recent allopolyploids from heterozygous source genomes. To determine the ploidy level, the distribution of biallelic SNPs can be inspected visually [10], and/or qualitatively compared with simulated data [4]. However, this methodology does not provide summary statistics that permit quantifying how well the data fit the expected distributions, which is especially critical when dealing with noisy distributions typical for highly-repetitive genomes. An additional disadvantage of this approach is that it is preceded by the identification of variable sites (“SNP calling”), which is carried out using methodologies that benefit from a previously known ploidy level [11]. In a further development Gompert et al. [12] used biallelic SNPs in a Bayesian statistical approach to distinguish between ploidy levels from genotyping-by-sequencing data. This method was primarily developed for resequencing studies, where typically multiple individuals from populations with ploidy variation are genotyped, as it benefits from preexisting knowledge about the ploidy levels that may be observed. Based on the posterior probabilities emitted by the Bayesian model, this approach separates samples into ploidy clusters, using dimensionality reduction methods such as PCA. Since it allows the inclusion of training data of known ploidy, test samples can be assigned a ploidy level if they belong to a cluster that includes samples of known ploidy.

Here we present a statistical model that aims to distinguish between the distribution of base frequencies at variable sites for diploids, triploids and tetraploids, directly from read mappings to a reference genome. It models base frequencies as a Gaussian Mixture Model (GMM), and uses maximum likelihood to assess empirical data under the assumptions of diploidy, triploidy and tetraploidy. We evaluated the performance of our method for different sequencing coverages using published genomes of S. cerevisiae [5], and high-coverage genomes of P. infestans produced for this study.

Methods

Model and Implementation

We used base frequencies at variable sites with only two bases segregating to distinguish between diploids, triploids and tetraploids (Fig. 1a). For that, we implemented a GMM that models the base frequency profiles as a mixture of three Gaussian distributions (Fig. 1b), which are scaled relative to each other. A log-likelihood can be calculated following:

$$\text{log}L=\sum_{i=1}^{n}\text{log}\sum_{j=1}^{3}\alpha_{j} N\left(x_{i};\mu_{j},\sigma_{j}\right) $$
Fig. 1
figure 1

Overview of the Gaussian Mixture Model (GMM) based method used by nQuire to estimate ploidy. We illustrate the workflow using a diploid individual as an example. a After sequenced reads are mapped to a reference genome, base frequencies are calculated at variable sites where only two bases are segregating. b The base frequencies are modeled using a GMM and the likelihood is maximized using an Expectation-Maximization (EM) algorithm for both the free and the three fixed models (diploid, triploid and tetraploid). The maximized log-likelihoods (logL) are extracted for subsequent model comparison. The curves show a possible final state of the GMM under the assumptions of each of the four models. c The ΔlogL is calculated between the free model and each of the three fixed models (here represented as barplots). The fixed model with the smallest ΔlogL is chosen as the true ploidy level (diploid in this example)

Here, n describes the numbers of data points and x i describes the value of each data point (i.e. the base frequencies). At sites with two bases segregating, we use the frequencies of both bases to achieve symmetry in the frequency distribution for triploids and tetraploids. μ j and σ j are the parameters of the jth of three Gaussian distributions N j that are scaled relative to each other through the parameter α j , with \(\sum _{j=1}^{3}\alpha _{j} = 1\).

This model allows for estimating the parameters of the Gaussian mixture components, as well as their mixture proportions, by maximizing the log-likelihood, either with or without constraints on the possible parameter space.

The likelihood maximization of the GMM is implemented through an Expectation-Maximization (EM) algorithm (Fig. 1b).

During the EM, we make use of the latent variables Z i . They represent the assignment of a data point to one of the mixture components. In the E-step, we use the current estimates for μ j , σ j and α j to calculate:

$$P\left(Z_{i} = j | x_{i}\right) = \frac{\alpha_{j} N\left(x_{i};\mu_{j},\sigma_{j}\right)}{\sum_{j=1}^{3}\alpha_{j} N\left(x_{i};\mu_{j},\sigma_{j}\right)} = \gamma_{Z_{i}}(j) $$

This probability is calculated for all x i and j{1,2,3} to form a n×3 matrix, the columns of which represent the probability of each data point belonging to either of the three mixture components. From this matrix we can calculate the column sum \(S_{j} = \sum _{i=1}^{n}\gamma _{Z_{i}}(\,j)\), which represents the size of each mixture component. In the following M-step, we update our estimates of μ j , σ j and α j :

$$\begin{array}{@{}rcl@{}} \hat{\mu_{j}} &=& \frac{1}{S_{j}}\sum_{i=1}^{n}\gamma_{Z_{i}}(j)x_{i}\\ \hat{\sigma_{j}^{2}} &=& \frac{1}{S_{j}}\sum_{i=1}^{n}\gamma_{Z_{i}}(j)\left(x_{i} - \mu_{j}\right)^{2}\\ \hat{\alpha_{j}} &=& \frac{S_{j}}{n} \end{array} $$

The log-likelihood is calculated after the M-step, and the next E-step is initiated unless the log-likelihood has changed by less then ε=0.01 from the previous M-step.

As shown above, the algorithm allows the estimation of μ j , σ j and α j simultaneously. Henceforth, we call this setup the “free model”. The log-likelihood of the free model upon convergence represents the optimum under the assumptions of the model.

We can also set certain parameters to fixed values and forgo their update in the M-step. We use this to maximize the log-likelihood under the expectation of diploidy (one Gaussian with mean 0.5), triploidy (two Gaussians with means 0.33 and 0.67) and tetraploidy (three Gaussians with means 0.25, 0.5 and 0.75), and call these the three “fixed models”:

$$\begin{array}{@{}rcl@{}} \text{log}L_{\text{diploid}}&=&\sum_{i=1}^{n}\text{log} N(x_{i};0.5,\sigma)\\ \text{log}L_{\text{triploid}}&=&\sum_{i=1}^{n}\text{log}\sum_{j=1}^{2}0.5\cdot N\left(x_{i};\mu_{j},\sigma_{j}\right),\\&& \quad\mu_{j} \in \{0.33, 0.67\}\\ \text{log}L_{\text{tetraploid}}&=&\sum_{i=1}^{n}\text{log}\sum_{j=1}^{3}0.33\cdot N\left(x_{i};\mu_{j},\sigma_{j}\right),\\ &&\quad\mu_{j} \in \{0.25, 0.5, 0.75\} \end{array} $$

In these three models, we only estimate σ j , while μ j and α j are fixed as shown above. Since all fixed models are nested within the free model, it is possible to directly compute the log-likelihood ratios, following:

$$\begin{array}{@{}rcl@{}} \Delta \text{log}L_{\text{diploid}} &=& \text{log}L_{\text{free}} - \text{log}L_{\text{diploid}}\\ \Delta \text{log}L_{\text{triploid}} &=& \text{log}L_{\text{free}} - \text{log}L_{\text{triploid}}\\ \Delta \text{log}L_{\text{tetraploid}} &=& \text{log}L_{\text{free}} - \text{log}L_{\text{tetraploid}} \end{array} $$

The ΔlogLs describe the distance between each fixed model and the best fit under the assumptions of the GMM. A substantially lower ΔlogL of one fixed model over the others supports the ploidy level described by this fixed model (Fig. 1c). Therefore, we used ΔlogL as summary statistics where the minimum value supports a given ploidy level.

Additionally, the GMM can be extended to a Gaussian Mixture Model with Uniform noise component (GMMU), by adding a uniform mixture component:

$$\text{log}L=\sum_{i=1}^{n}\text{log}\left(\alpha_{4} U(x_{i}) + \sum_{j=1}^{3}\alpha_{j} N\left(x_{i};\mu_{j},\sigma_{j}\right)\right) $$

The constraint on the mixture proportions then becomes \(\sum _{j=1}^{4}\alpha _{j} = 1\).

The uniform noise component is used in our implementation to allow base-line noise removal (Additional file 1: Figure S1). This can be useful when the Gaussian peaks are observable but embedded in a basal noise, which could be caused by highly repetitive genomes or low coverage.

Multivariate Gaussian clustering

To cluster samples in three dimensions based on the normalized maximized log-likelihoods of the three fixed models, we used the mclust5 package [13] of the R programming language [14]. This package utilizes mixtures of multivariate Gaussian distributions to detect clusters in an arbitrary number of dimensions. mclust5 allows to set constraints on the volume, shape and orientation of each mixture component, by varying features of their covariance matrix either within each sample, or for all samples at once. For the analysis displayed in Fig. 4, we used clusters of equal volume, but varying shape and orientation. This configuration represented the data the best, as assessed by the recovery of ploidy levels from cluster assignments.

Phytophthora infestans libraries

The two benchmarking libraries from P. infestans were generated according to the protocol by Meyer and Kircher [15] from DNA extracted from lab cultures [16]. These libraries were sequenced to high coverage on an Illumina HiSeq 3000 machine in paired end 150 bp mode. This sequencing data is available at the European Nucleotide Archive (ENA) under study number PRJEB20998.

Results

Performance

nQuire directly processes BAM files [17] and is designed to be efficient in memory usage and runtime. To process a 1GB S. cerevisiae BAM file (100x coverage), nQuire needs 70 s to build appropriate data structures, 6 s to run the models and calculate the maximum likelihood estimates, and uses a maximum of 8 Mb of RAM, whereas for processing a 10GB P. infestans BAM file (100x coverage) it needs 760 s, 100 s and 60 Mb of RAM, respectively. These benchmarks were performed on a single core of a Intel®; Core™ i5-4670 CPU on a system with 16Gb of DDR3-1600 RAM and an SSD.

Analysis of individual samples

We evaluate nQuire’s performance using three S. cerevisiae samples at 100x coverage, which represent each of the three ploidy levels evaluated by the model, as well as two P. infestans samples, one diploid and one triploid, at 210x and 368x coverage, respectively. The ΔlogL of each of the fixed models to the free model at full coverage is shown in Table 1. At those coverages, the ΔlogL of the best model is more than two times closer to the free model than the second best. Additionally, it coincides in all samples with the ploidy level inferred by visually inspecting the empirical distributions of base frequencies at full coverage (Figs. 2a-c and 3a-b).

Fig. 2
figure 2

Evaluation and benchmarking of nQuire using Saccharomyces cerevisiae. Distribution of base frequencies at variable sites where only two bases are segregating for a diploid (a), triploid (b) and tetraploid (c) sample. The barplots depict the ΔlogL of all fixed models for the diploid (d), triploid (e) and tetraploid (f) sample (also presented in Table 1). The plots depict the change of ΔlogL of all fixed models as a function of genome coverage for the diploid (g), triploid (h) and tetraploid (i) sample

Fig. 3
figure 3

Evaluation and benchmarking of nQuire using Phytophthora infestans. Distribution of base frequencies at variable sites where only two bases are segregating for a diploid (a) and a triploid (b) sample. The barplots depict the ΔlogL of all fixed models for the diploid (c) and triploid (d) sample (also presented in Table 1). The plots depict the change of ΔlogL of all fixed models as a function of genome coverage for the diploid (e) and triploid (f) sample

Table 1 Samples of Saccharomyces cerevisiae and Phytophthora infestans used to evaluate and benchmark nQuire

Coverage dependence

To investigate the impact of coverage on the performance of the GMM, we downsampled mapped reads from the S. cerevisiae (Fig. 2g-i) and P. infestans (Fig. 3e-f) strains shown in Table 1 to different coverage levels. In all cases the ΔlogLs of the two improper models increases with increasing coverage. In contrast, the ΔlogL of the best model stabilizes at low coverage and doesn’t increase further as coverage increases. The coverage at which the ΔlogL of the best model is stable will be different for genomes of different size and complexity, as shown in the difference between the two organisms used for benchmarking.

Analysis of population samples

In cases where multiple samples are sequenced simultaneously, it might be impractical to assess ploidy in each sample individually. In these cases, we propose to use maximized log-likelihoods of the three fixed models, normalized by that of the free model, to cluster samples into ploidy groups. The rationale is that within one species, the relative likelihoods of the fixed models will be similar within each ploidy level. As a proof of concept, we applied this to all di-, tri- and tetraploids from the S. cerevisiae test set [5], and clustered the samples into three groups in three dimensions using multivariate Gaussian clustering (see “Methods” section). The sample set was manually scored for ploidy, and the overlap between clusters and manually assessed ploidy level was calculated (Fig. 4). Running nQuire on raw data showed high recovery of ploidy level (93%, Fig. 4a), which was further improved through our denoising implementation utilizing the GMMU (96%, Fig. 4b).

Detection of Aneuploidies

Recent polyploidization is often associated with aneuploidies. To be able to detect those, nQuire allows to split the analysis of a sample by regions defined in BED format. We used this to reanalyze the sample YJM466 from the S. cerevisiae test set [5]. This sample had been shown to be triploid on whole genome level, but tetraploid for chromosome 6 and diploid for chromosome 9. The ΔlogLs for the three fixed models individually calculated for each of the 16 chromosomes of S. cerevisiae confirmed this observation (Fig. 5).

Fig. 4
figure 4

Clustering sets of samples into ploidy groups. Samples were clustered based on the normalized maximized log-likelihood of each fixed model representing each of the three ploidy levels. The clustering in three dimensions is shown with a set of three plots, one for each combination of ploidies. Shapes represent the manually assessed true ploidy levels, while colors show the assignments returned by the clustering algorithm. The agreement between the two is shown as a fraction. This analysis was conducted for samples before (a) and after denoising (b)

Fig. 5
figure 5

Detection of chromosome-wide aneuploidies. The ploidy estimation can be run for each chromosome independently, which enables detection of aneuploidies. The sample displayed here shows genome wide triploidy, but splitting the analysis by chromosome shows tetraploidy for chromosome 6 and diploidy for chromosome 9, as detected by ΔlogL

A natural extension of analyzing the genome by chromosomes is to use sliding windows to detect possible transitions between ploidies in aneuploid individuals. We used the three S. cerevisiae datasets shown in Table 1 at their full coverage to benchmark the number of randomly sampled positions needed to accurately assign ploidy (Fig. 6). For these test samples, 100-200 random sites at 100x coverage are enough to correctly assign ploidy based on the ΔlogL. However, this will vary for regions of the genome with different complexity and repetitiveness.

Fig. 6
figure 6

Number of positions required for reliable ploidy assignment. The diploid (top), triploid (middle) and tetraploid (bottom) S. cerevisiae datasets from Table 1 were used to randomly sample 10 to 300 positions (left). The subsampling was extended to 20,000 positions (right)

Discussion

In addition to nucleotide and structural variation, certain organisms can also vary intraspecifically in their ploidy level, which constitutes another source of variation that selection might act upon. Using NGS data our method permits assessment of ploidy variation from data that is usually generated for variant detection. In contrast to previous methods that visually analyze the distributions of SNPs at biallelic heterozygous sites [4, 6, 10], we quantitatively distinguish between different ploidy levels based on the distribution of base frequencies at variable sites, using relative differences in likelihoods. In comparison to the approach proposed by Gompert and Mock [12], nQuire avoids the requirement of high quality SNP calls. The higher level of noise in the data resulting from this is accounted for by using Gaussian distributions. They approximate a binomial process, but are impacted less by the effects of high coverage outliers, which arise for example from misalignments of paralogous sequences (Additional file 1: Figure S2). Additionally, nQuire is a Linux command line tool that uses standard file formats as input and handles large genomes at high coverage efficiently.

In all test cases, triploids were the easiest to distinguish, most likely caused by the lack of probability density around 0.5 compared to the other two models. While diploids and tetraploids are more difficult to tease apart, our results on coverage dependence show that at sufficient coverage, the data fits the true model much better than either of the two alternatives (Fig. 2g-i). For cases where the Gaussian peaks were largely overlapped by uniform noise, we extended our free model to include a uniform component, whose mixture proportion can be used - after likelihood maximization - for base-line removal. We show that this procedure improves the recovery of the true ploidy level when samples are clustered based on maximized likelihoods under the assumptions of the fixed models (Fig. 4b). We also show that few high quality positions are enough to estimate the correct ploidy level (Fig. 6). Such high quality positions can be selected by using stringent filtering criteria. Several filters are implemented in nQuire directly, thus no pre-processing of the BAM file (after duplicate removal) is necessary. They include minimum and maximum coverage, as well as mapping quality and base frequency filters. The default values of these filters have been configured to fit most applications. The exact coverage and number of positions needed for a reliable estimation of ploidy will however depend on the complexity and repetitiveness of the genome. Additionally, it is possible to obtain high quality positions by using BED files to define regions of low repetitiveness, where base frequencies can be more confidently assessed.

Conclusion

We present nQuire, a statistical approach to distinguish diploids, triploids and tetraploids of recent evolutionary origin based on the distribution of base frequencies at variable sites. The method facilitates analysis of ploidy in single samples, and we demonstrate how to apply it to population scale data, when available. nQuire can also interact with BED files, to limit the analysis to certain sequence features, or divide it by regions of the genome, for example to detect aneuploidies. Our approach will be useful to assess intraspecific variation in ploidy from both historic and modern samples, as well as in experimental evolution experiments.

Abbreviations

BAM:

Binary alignment map

BED:

Browser extensible data

CPU:

Central processing unit

EM:

Expectation maximization

ENA:

European nucleotide archive

GMM:

Gaussian mixture model

GMMU:

Gaussian mixture model with uniform noise component

NGS:

Next generation sequencing

RAM:

Random access memory

SNP:

Single nucleotide polymorphism

References

  1. Selmecki AM, Maruvka YE, Richmond PA, Guillet M, Shoresh N, Sorenson AL, De S, Kishony R, Michor F, Dowell R, Pellman D. Polyploidy can drive rapid adaptation in yeast. Nature. 2015; 519(7543):349–52. https://doi.org/10.1038/nature14187.

  2. Venkataram S, Dunn B, Li Y, Agarwala A, Chang J, Ebel E. R, Geiler-Samerotte K, Hérissant L, Blundell JR, Levy SF, Fisher DS, Sherlock G, Petrov DA. Development of a comprehensive Genotype-to-Fitness map of Adaptation-Driving mutations in yeast. Cell. 2016; 166(6):1585–9622. https://doi.org/10.1016/j.cell.2016.08.002.

  3. Comai L. The advantages and disadvantages of being polyploid. Nat Rev Genet. 2005; 6(11):836–46. https://doi.org/10.1038/nrg1711.

  4. Yoshida K, Schuenemann VJ, Cano LM, Pais M, Mishra B, Sharma R, Lanz C, Martin FN, Kamoun S, Krause J, Thines M, Weigel D, Burbano HA. The rise and fall of the phytophthora infestans lineage that triggered the irish potato famine. Elife. 2013; 2:00731. https://doi.org/10.7554/eLife.00731.

  5. Zhu YO, Sherlock G, Petrov DA. Whole genome analysis of 132 clinical saccharomyces cerevisiae strains reveals extensive ploidy variation. G3. 2016; 6(8):2421–34. https://doi.org/10.1534/g3.116.029397.

  6. Li Y, Shen H, Zhou Q, Qian K, van der Lee T, Huang S. Changing ploidy as a strategy: The irish potato famine pathogen shifts ploidy in relation to its sexuality. 2017; 30(1):45–52. https://doi.org/10.1094/MPMI-08-16-0156-R.

  7. Otto SP, Whitton J. Polyploid incidence and evolution. Annu Rev Genet. 2000; 34:401–37. https://doi.org/10.1146/annurev.genet.34.1.401.

  8. Dirihan S, Terho P, Helander M, Saikkonen K. Efficient analysis of ploidy levels in plant evolutionary ecology. Caryologia. 2013; 66(3):251–6. https://doi.org/10.1080/00087114.2013.849414.

  9. Margarido GRA, Heckerman D. ConPADE: genome assembly ploidy estimation from next-generation sequencing data. PLoS Comput Biol. 2015; 11(4):1004229. https://doi.org/10.1371/journal.pcbi.1004229.

  10. Augusto Corrêa Dos Santos R, Goldman GH, Riaño-Pachón DM. ploidyNGS: visually exploring ploidy with next generation sequencing data. Bioinformatics. 2017; 33(16):2575–6. https://doi.org/10.1093/bioinformatics/btx204.

  11. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013; 43:11–10133. https://doi.org/10.1002/0471250953.bi1110s43.

  12. Gompert Z, Mock KE. Detection of individual ploidy levels with genotyping-by-sequencing (GBS) analysis. Mol Ecol Resour. 2017. https://doi.org/10.1111/1755-0998.12657.

  13. Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: Clustering, classification and density estimation using gaussian finite mixture models. R J. 2016; 8(1):289–317.

    PubMed  PubMed Central  Google Scholar 

  14. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2008. http://www.R-project.org.

  15. Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010; 2010(6):5448. https://doi.org/10.1101/pdb.prot5448.

  16. Cooke DEL, Cano LM, Raffaele S, Bain RA, Cooke LR, Etherington GJ, Deahl KL, Farrer RA, Gilroy EM, Goss EM, Grünwald NJ, Hein I, MacLean D, McNicol JW, Randall E, Oliva RF, Pel MA, Shaw DS, Squires JN, Taylor MC, Vleeshouwers VGAA, Birch PRJ, Lees AK, Kamoun S. Genome analyses of an aggressive and invasive lineage of the irish potato famine pathogen. PLoS Pathog. 2012; 8(10):1002940. https://doi.org/10.1371/journal.ppat.1002940.

  17. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16):2078–9. https://doi.org/10.1093/bioinformatics/btp352.

Download references

Acknowledgments

We thank Michael Dannemann, Richard Neher, Thomas Mailund, Oliver Kohlbacher, Moises Exposito-Alonso, Kay Pruefer, and members of the Research Group for Ancient Genomics and Evolution (AGE) for useful discussions and input on model implementation; Patricia Lang, the AGE group and Michael Dannemann for comments on the manuscript.

Funding

This work was supported by the Presidential Innovation Fund of the Max Planck Society.

Author information

Authors and Affiliations

Authors

Contributions

CLW, SK and HAB conceived the study. MP, LMC and CLW generated sequencing data. CLW developed the software application and analyzed the data. CLW and HAB wrote the the manuscript with input from all authors. All authors read and approved the final manuscript.;

Corresponding author

Correspondence to Hernán A. Burbano.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional information

Availability of data and materials

The Phytophthora infestans sequencing data generated for this study is available at the European Nucleotide Archive (ENA) under study number PRJEB20998. The nQuire source code is available at https://github.com/clwgg/nQuireunder the MIT license.

Additional file

Additional file 1

Supplementary figures addressing the effect of denoising, as well as the effect of high coverage outliers on the likelihood of Gaussian and binomial mixtures. (PDF 144 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Weiß, C.L., Pais, M., Cano, L.M. et al. nQuire: a statistical framework for ploidy estimation using next generation sequencing. BMC Bioinformatics 19, 122 (2018). https://doi.org/10.1186/s12859-018-2128-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-018-2128-z

Keywords