 Methodology Article
 Open Access
 Published:
Variational inference for rare variant detection in deep, heterogeneous nextgeneration sequencing data
BMC Bioinformatics volume 18, Article number: 45 (2017)
Abstract
Background
The detection of rare single nucleotide variants (SNVs) is important for understanding genetic heterogeneity using nextgeneration sequencing (NGS) data. Various computational algorithms have been proposed to detect variants at the single nucleotide level in mixed samples. Yet, the noise inherent in the biological processes involved in NGS technology necessitates the development of statistically accurate methods to identify true rare variants.
Results
We propose a Bayesian statistical model and a variational expectation maximization (EM) algorithm to estimate nonreference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has comparable sensitivity and specificity compared with a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of relatively low coverage (27× and 298×) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many stateoftheart algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a timeseries trend in nonreference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants.
Conclusions
We developed a variational EM algorithm for a hierarchical Bayesian model to identify rare variants in heterogeneous nextgeneration sequencing data. Our algorithm is able to identify variants in a broad range of read depths and nonreference allele frequencies with high sensitivity and specificity.
Background
Massively parallel sequencing data generated by nextgeneration sequencing technologies is routinely used to interrogate single nucleotide variants (SNVs) in research samples [1]. For example, deep sequencing confirmed the degree of genetic heterogeneity of HIV and influenza [2, 3]. Intratumor heterogeneity has been revealed by nextgeneration sequencing [4]. Whole genome sequencing has revealed that many beneficial mutations of minor allele frequencies are essential to respond to dynamic environments [5]. However, rare SNV identification in heterogeneous cell populations is challenging, because of the intrinsic error rate of next generation sequencing [6]. Thus, there is a need for accurate and scalable statistical methods to uncover SNVs in heterogeneous samples.
A number of computational methods have been developed to detect SNVs in large scale genomic data sets. These methods can be roughly categorized as probabilistic or heuristic or some combination. Among all of the current probabilistic methods, the Bayesian probabilistic framework has been increasingly used to estimate unobserved quantities such as variant allele frequency given observed genomic sequencing data.
GATK [7] and SAMTools [8] use a naive Bayesian decision rule to call variants. EBCall models sequencing errors based on a BetaBinomial distribution, where the parameters and latent variables are estimated from a set of nonpaired normal sequencing samples [9]. However, the error rate of normal sequencing samples could be unmatched with the error rate of the target samples, which may cause a problem of making false negatives calls [10]. CRISP compares aligned reads across multiple pools to obtain sequencing errors, and then distinguishes true rare variants from the sequencing errors [11]. However, the bottleneck of CRISP is its low computational efficiency due to a calculation of a large number of contingency tables.
JointSNVMix introduces two Bayesian probabilistic models (JointSNVMix1 and JointSNVMix2) to jointly analyze a tumournormal paired allelic count of NGS data [12]. JointSNVMix derives an expectation maximization (EM) algorithm to calculate maximum aposteriori (MAP) estimate of latent variables in a particular probabilistic graphical model. Furthermore, they showed that the joint modeling method, JointSNVMix1, observes 80fold reduction of false positives compared with its independent analogue (SNVMix1) [12]. SomaticSniper models the joint diploid genotype likelihoods for both tumour and normal samples [13]. Strelka models the joint probabilistic distribution of allele frequencies for both tumour and normal samples, which is demonstrated to be more accurate compared with the methods based on the estimated allele frequency tests between tumour and normal samples [14]. SNVer focuses on a frequentist method that is able to calculate Pvalues, but [15] pointed out that this approach fails to model sampling bias that will reduce the power of detecting true rare variants. VarScan compares tumour and normal samples thresholding on variant allele frequency and a number of allele counts then uses Fisher’s exact test to estimate sample allele frequencies [16].
In previous work, we developed a BetaBinomial model to estimate a null hypothesis error rate distribution at each position. Using this rare variant detection (RVD) model, we call rare variants by comparing the error rate of the sample sequence data to a null distribution obtained from sequencing a known reference sample [2]. RVD can identify mutant positions at a 0.1% fraction in mixed samples using high read depth data.
An improvement of that work, RVD2, uses hierarchical priors to tie parameters across positions to detect variants in low read depth data [17]. We derived a Markov Chain Monte Carlo (MCMC) sampling algorithm for posterior inference. However, the main limitation of MCMC is that it is hard to diagnose convergence and may be slow to converge [18]. An alternative inference method, that we explore here, is to use variational inference, which is based on a proposed variational distribution over latent variables. By optimizing variational parameters, we fit an approximate distribution that is close to the true posterior distribution in the sense of the KullbackLiebler (KL) divergence. Variational inference can now handle nonconjugate distributions and tends to be more computationally efficient than MCMC sampling [19].
Here, we propose a variational EM algorithm for our Bayesian statistical model to detect rare SNVs in heterogeneous NGS data. We show that variational EM algorithm has comparable accuracy and efficiency compared with MCMC in a synthetic data set. First, we define the model structure, and derive our variational EM algorithm to approximate the posterior distribution over latent variables. Then, we call a variant by a posterior difference hypothesis test between the key model parameters of a pair of samples. As a result, we compare the performance of the variational EM inference algorithm to the MCMC sampling method and the stateoftheart methods using a synthetic data set. Finally, we show that our variational EM algorithm is able to detect rare variants and estimate nonreference allele frequency (NRAF) in a longitudinal directed evolution experimental data set.
Methods
Model structure
Our Bayesian statistical model is shown as a graphical model in Fig. 1 a. In the model, r _{ ji } is the number of reads with a nonreference base at location j in experimental replicate i; n _{ ji } is the total number of reads at location j in experimental replicate i. The model parameters are: −μ _{0} a global nonreference read rate that captures the error rate across all the positions, − M _{0} a global precision that captures the variation of the error rate across positions in a sequence, and − M _{ j } a local precision that captures the variation of the error rate at position j across different replicates.
The latent variables are: − μ _{ j }∼Beta(μ _{0},M _{0}) a positionspecific nonreference read rate for position j, and − θ _{ ji }∼Beta(μ _{ j },M _{ j }) the nonreference read rate for position j in replicate i.
In Fig. 1 b, γ is the parameter for the variational distribution for latent variable μ, and δ is the parameter for the variational distribution for latent variable θ. We describe q(μ) and q(θ) in detail in the following section.
The model generative process is as follows:

1.
For each location j∈ [ 1,…,J]:

(a)
Draw an error rate \(\mu _{j} \thicksim \text {Beta}(\mu _{0}, M_{0})\)

(b)
For each replicate i∈ [ 1,…,N]:

(i)
Draw \(\theta _{ji} \thicksim \text {Beta}(\mu _{j}, M_{j})\)

(ii)
Draw \(r_{ji}  n_{ji} \thicksim \text {Binomial}(\theta _{ji}, n_{ji})\)

(i)

(a)
The joint distribution p(r,μ,θn;ϕ) given the parameters can be factorized as
Variational expectation maximization (EM) inference
We developed a nonconjugate variational inference algorithm to approximate the posterior distribution,
where the parameters are \(\phi \triangleq \{\mu _{0}, M_{0}, M\}\).
Factorization
We propose the following factorized variational distribution to approximate the true posterior over latent variables μ _{ j } and θ _{ ji }. Here, q(μ _{ j }) approximates the variational posterior distribution of μ _{ j }, which represents the local error rate distribution at position j across different replicates; and q(θ _{ ji }) approximates the posterior distribution of θ _{ ji }, which is the error rate distribution at position j for replicate i.
Evidence lower bound (ELBO)
Given the variational distribution, q, the loglikelihood of the data is lowerbounded according to Jensen’s inequality,
The function \(\mathcal {L}(q, \phi)\) is the evidence of lower bound (ELBO) of the loglikelihood of the data, which is the sum of qexpected complete loglikelihood and the entropy of the variational distribution q. The goal of variational inference is to maximize the ELBO. Equivalently, q is chosen by minimizing the KL divergence between the variational distribution and the true posterior distribution.
Since θ and r are conjugate pairs, the posterior distribution of θ _{ ji } is a Beta distribution,
Therefore, we propose a Beta distribution with parameter vector δ _{ ji } as variational distribution,
The posterior distribution of μ _{ j } is given by its Markov blanket,
This is not in the form of any known distribution. But, since the support of μ _{ j } is [ 0,1], we propose a Beta distribution with parameter vector γ _{ j } as variational distribution,
Each component of ELBO is derived in Additional file 1.
Variational EM algorithm
Variational EM algorithm maximizes the ELBO of the likelihood by alternating between maximization over q (Estep) and maximization over ϕ={μ _{0},M _{0},M} (Mstep). We update the variational parameters and the model parameters iteratively by numerically optimizing each problem using Sequential Least SQuares Programming (SLSQP) [20] (see Additional file 2 for detail). There is no analytical representation for \(E_{q}\left [ \log \left (\frac {\Gamma (M_{j})} { \Gamma (\mu _{j} M_{j}) \Gamma (M_{j} (1\mu _{j})) }\right)\right ]\), which is required to update variational distribution for μ _{ j } and model parameter M. So, we must resort to numerical integration,
Unfortunately, this numerical integration step is computationally expensive. The variational EM algorithm is summarized using pseudocode in Algorithm 1.
Hypothesis testing
The posterior distribution over \(\mu _{j}^{\triangle } \mid r^{case}, r^{control} \triangleq \mu _{j}r^{case}  \mu _{j}r^{control}\) is the distribution over the change in the nonreference read rate at position j between a case and control sample. Since the variational approximate posterior distributions in the difference are Beta distributions, the distribution of the difference is not analytically known. In order to compute the statistic of interest, we approximate μ _{ j }r ^{case} and μ _{ j }r ^{control} with univariate Gaussian distributions by matching the first two moments of the variational Beta distributions. Then, the difference is a Gaussian distribution. As we show in the section of comparison of approximated posterior distribution, the Gaussian approximation is empirically reasonable.
Under the variational approximation,
for μ _{ j }r ^{case} and likewise for μ _{ j }r ^{control}. We approximate the posterior for the case sample as
and likewise for the control. Then,
Now, we can approximate the posterior probability of interest,
that is, the posterior probability that the difference in the nonreference read rate is greater than a fixed effect size τ (e.g. zero) for a one sided test. For a two sided test, we compute the approximate probability
A position is called a provisional variant if \(\Pr ( \mu _{j}^{\triangle }  \geq \tau \mid r^{case}, r^{control}) \geq 1\alpha /2\), where the probability is approximated as described.
It is possible that a position is called a variant due to a differential nonreference read count, but no particular alternative base is more frequently observed than the others. In this case, the likely cause is a sequencing error that indiscriminately incorporates a nonreference base at the position. To discriminate this nonbiological cause from the interesting true variants we use a χ ^{2} goodnessoffit test for nonuniform base distribution [17, 21]. For each provisional variant, if we reject the null hypothesis that the distribution is uniform, we promote the position to a called variant.
Results
Data sets
Synthetic DNA sequence data
The data set we use to assess sensitivity and specificity is described and made available elsewhere [2]. Briefly, we performed an invitro mixture of two DNA sequences to test the sensitivity and specificity of our approach. Two 400 bp DNA sequences were chemically synthesized. One sample has 14 variant loci and is taken as the case and the other without variants is taken as the control. Case and control DNA samples were mixed invitro to yield defined NRAF of 0.1%, 0.3%, 1.0%, 10.0%, and 100.0%. The synthetic DNA dataset was downsampled by 10×, 100×, 1,000×, and 10,000× using picard (v 1.96). The final data set contains read pairs for six replicates for the control and cases at different NRAF levels.
Longitudinal directed evolution data
The longitudinal yeast data comes from three strains of haploid S288c which were grown for 448 generations under limitedglucose (0.08%). The wildtype ancestral strain GSY1136 was sequenced as a reference. Aliquots were taken about every 70 generations and sequenced. The detail of library sequencing is described in [5, 11, 22]. The Illumina sequencing data is available on the NCBI Sequence Read Archive (SRA054922)[5]. For this study, we received the original BAM files from one of the authors. The aligned BAM files have 266 – 1,046× coverage. We used samtools (v 1.1) with mpileup C50 flags to convert BAM files to pileup files. Then, we generated depth chart files, which are tabdelimited text tables recording in each element of the table the count of a nucleotide at a genomic position. We ran our variational inference algorithm on the depth chart files to identify SNVs.
Performance on synthetic DNA data
Comparison of sensitivity and specificity
The performance of variational EM algorithm is shown in receiveroperating characteristic curves (ROCs) for a broad range of median read depths and NRAFs in Fig. 2. The results in the ROC curves are generated by varying parameter α in the posterior distribution test. It shows that the performance improved with read depth and true mutant mixtures. Furthermore, we evaluated the performance by using both the posterior distribution test with α=0.05 and the χ ^{2} test to detect variants, and compared the performance with the MCMC sampling algorithm in terms of sensitivity and specificity (Table 1).
The variational EM algorithm shows higher sensitivity and specificity than the MCMC algorithm in the events when NRAF is 0.1%. The variational EM algorithm has a higher specificity compared with the MCMC algorithm for a median read depth of 41,472× at 0.3% NRAF and 55,489× at 1.0% NRAF, but the sensitivity is slightly lower due to false negatives.
Comparison of approximated posterior distribution
Figure 3 shows the approximate posterior distribution of the variational EM algorithm and samples of the MCMC algorithm. One variant position, 85, is taken as an example to show the comparison of the approximated posteriors. The variational EM and MCMC algorithms both identify all the variants when NRAF is 10.0% and 100.0%. The variational EM algorithm calls 90 false positive positions without a χ ^{2} test when NRAFs are 0.1% and 0.3% for low median read depth (30× and 400×). This is to be expected because it is highly unlikely to correctly identify a variant base with a population frequency of 1 in 1,000 with less than a 1,000× read depth.
A false positive, a nonmutated position that is called by the variational EM algorithm but not called by the MCMC algorithm, is shown in Fig. 4. The variance of the MCMC posterior estimate is higher than that of the variational posterior estimate. We tested 10 random initial values variational inference algorithm and found the approximate posterior distributions from the variational EM algorithm are essentially equivalent for all random initializations. It is notable that the shape of the proposed Beta variational distribution is well approximated by a Gaussian.
Comparison to the stateoftheart methods
We compared the performance of our variational EM algorithm with the stateoftheart variant detection methods, SAMtools [8], GATK [7], CRISP [11], VarScan2 [16], Strelka [14], SNVer [15], MuTect [23], and RVD2 [17], using synthetic DNA data set (Table 2). Among all of the methods compared, our variational EM algorithm has a higher sensitivity and specificity for a broad range of read depths and NRAFs. Our variational EM algorithm shows higher specificity than all the other tested methods at a very low NRAF (0.1%) level. However, our algorithm has a slightly lower specificity than the MCMC algorithm when the median read depth is 4,156× at 0.3% NRAF, and a slightly lower sensitivity than the MCMC algorithm when the median read depth is 41,472× at 0.3% NRAF and a median read depth of 55,489× at 1.0% NRAF. The performance of other methods is stated in detail in [17].
Runtime assessment
The computational time for approximating the variational posterior distribution is increased by expanding the length of region and the median read depth (Fig. 5). Our variational EM algorithm is faster than the MCMC algorithm at the low median read depths of 27× and 298×, and slower for the high median read depths of 3,089× and 30,590×.
Table 3 shows the timing profile for each part of our variational EM algorithm when median read depth is 3,089×. Optimizing γ in the Estep and optimizing M _{ j } in the Mstep takes more than 95% of the time of one variational iteration in a test of a single processor, since the integration (7) is needed.
Variant detection on the longitudinal directed evolution data
Detected variants
We applied our variational EM algorithm to the MTH1 gene at Chr04:1,014,4011,015,702 (1,302 bp), which is the most frequently observed mutated gene by [5]. Our algorithm detected the same variants that were found by [5] (shown as highlighted in Additional file 2). Additionally, we detected 81 novel variants in 8 timepoints that the original publication did not detect. In Additional file 2, G7 is the baseline NRAF as the control sample when comparing with G70, G133, G266, G322, G385, and G448 in the respective hypotheses testing. The corresponding NRAFs of called variants at different time points are given by the estimate of the latent variable, \(\hat {\mu _{j}} = E_{q}[\mu _{j}r]\).
All of these variants, except the variant at position Chr04:1,014,740, decrease in NRAF following a maximum. The allele at position Chr04:1,014,740 is a beneficial variant that arises in NRAF to 99.6% at generation 448 within a constant glucoselimited environment. Moreover, we identified the first emergence of this beneficial variant as early as 0.5% in generation 133. We detected 22 variants (NRAF < 1.0%) early (at generation 70) in the evolutionary time course. Given that the median read depth is 1,649×, we have some confidence these are bonafide variants.
Concomitant variants detection
We identified a pair of variants, Chr04:1,014,740 in gene MTH1 and Chr12:200,286 in gene ADE16, that increase in NRAF together in time (Fig. 6). We hypotheses that the variants are concomitant in the same clone. In this pair of genes, gene MTH1 is a negative regulator of the glucosesensing signal transduction pathway, and gene ADE16 is an enzyme of d e n o v o purine biosynthesis. Glucose sensing induces gene expression changes to help yeast receive necessary nutrients, which could be a reason for this pair of genes to mutate together [24]. Further experimental validation of this hypothesis would be required to definitively show that the mutations are concomitant.
Discussion
Sensitivity analysis
The global precision hyperparameter M _{0} could influence the estimate of μ _{ j } due to its regularization effect. We show the influence of different \(\hat {M_{0}}\) on variant position Chr04:1,014,740, q(μ _{1,014,740}r) in Fig. 7. We see that as we decrease the prior precision parameter \(\hat {M_{0}}\), \(\hat {\mu }_{1,014,740}\) increases as expected. But the effect of changing \(\hat {M_{0}}\) over several orders of magnitude does not change \(\hat {\mu }_{j}\) greatly. Here \(\hat {M_{0}} = 1.752\) in this dataset.
Conclusions
In this article, we propose a variational EM algorithm to estimate the nonreference allele frequency in the RVD2 model to identify rare nucleotide variants in heterogeneous pools.
Our results show that the variational EM algorithm (i) is able to identify rare variants at a 0.1% NRAF level with comparable sensitivity and specificity to a MCMC sampling algorithm; (ii) has a higher specificity in comparison with many stateoftheart algorithms in a broad range of NRAFs; and (iii) detects SNVs early in the evolutionary time course, as well as tracks NRAF in a real longitudinal yeast data set.
We have chosen parametric forms for the variational distributions. This choice has left us with a complex integral in our variational optimization problem. In future work, we plan to explore other approximations of the variational distributions that render the integral easier to compute. One could use cubic splines to numerically approximate the function and then integrate that surrogate [25]. Another strategy is to consider a Laplace approximation for the variational distribution, as we and others have done previously [26, 27].
Improving the speed of the estimating algorithm enables us to interrogate wholegenome sequencing data. By doing this, we hope to reveal the dynamics of arising variants at the genomewide scale to show the genetic basis of clonal interference. Our method could be extended to study drug resistance by characterizing tumor heterogeneity in targeted anticancer chemotherapy samples, or to find the causative variants that lead to drug resistance and understand the causes of resistance at the single nucleotide level.
Abbreviations
 ELBO:

Evidence lower bound
 EM:

Expectation maximization
 KL:

KullbackLiebler
 MAP:

Maximum aposteriori
 MCMC:

Markov Chain Monte Carlo
 NGS:

Nextgeneration sequencing
 NRAF:

Nonreference allele frequency
 RVD:

Rare variant detection
 SNVs:

Single nucleotide variants
References
 1
Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The nextgeneration sequencing revolution and its impact on genomics. Cell. 2013; 155(1):27–38.
 2
Flaherty P, Natsoulis G, Muralidharan O, Winters M, Buenrostro J, Bell J, Brown S, Holodniy M, Zhang N, Ji HP. Ultrasensitive detection of rare mutations using nextgeneration targeted resequencing. Nucleic Acids Res. 2012:;40(1).
 3
Ghedin E, Laplante J, DePasse J, Wentworth DE, Santos RP, Lepow ML, Porter J, Stellrecht K, Lin X, Operario D, et al. Deep sequencing reveals mixed infection with 2009 pandemic influenza a (h1n1) virus strains and the emergence of oseltamivir resistance. J Infect Dis. 2011; 203(2):168–74.
 4
Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V, et al. Inferring tumor progression from genomic heterogeneity. Genome Res. 2010; 20(1):68–80.
 5
Kvitek DJ, Sherlock G. Whole genome, whole population sequencing reveals that loss of signaling networks is the major adaptive strategy in a constant environment. PLoS Genet. 2013; 9(11):1003972.
 6
Shendure J, Ji H. Nextgeneration dna sequencing. Nat Biotechnol. 2008; 26(10):1135–45.
 7
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: a mapreduce framework for analyzing nextgeneration dna sequencing data. Genome Res. 2010; 20(9):1297–303.
 8
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009; 25(16):2078–9.
 9
Shiraishi Y, Sato Y, Chiba K, Okuno Y, Nagata Y, Yoshida K, Shiba N, Hayashi Y, Kume H, Homma Y, et al. An empirical bayesian framework for somatic mutation detection from cancer genome sequencing data. Nucleic Acids Res. 2013; 41(7):89–9.
 10
Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, Dahlman KB, Pao W, Zhao Z, et al. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med. 2013; 5(10):91.
 11
Bansal V. A statistical method for the detection of variants from nextgeneration resequencing of dna pools. Bioinformatics. 2010; 26(12):318–24.
 12
Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A, et al. Jointsnvmix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired nextgeneration sequencing data. Bioinformatics. 2012; 28(7):907–13.
 13
Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, Ley TJ, Mardis ER, Wilson RK, Ding L. Somaticsniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012; 28(3):311–7.
 14
Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic smallvariant calling from sequenced tumor–normal sample pairs. Bioinformatics. 2012; 28(14):1811–7.
 15
Wei Z, Wang W, Hu P, Lyon GJ, Hakonarson H. Snver: a statistical tool for variant calling in analysis of pooled or individual nextgeneration sequencing data. Nucleic Acids Res. 2011; 39(19):132–2.
 16
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. Varscan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012; 22(3):568–76.
 17
He Y, Zhang F, Flaherty P. Rvd2: An ultrasensitive variant detection model for lowdepth heterogeneous nextgeneration sequencing data. Bioinformatics. 2015; 31(17):2785–93.
 18
Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Mach Learn. 1999; 37(2):183–233.
 19
Peterson C, Hartman E. Explorations of the mean field theory learning algorithm. Neural Netw. 1989; 2(6):475–94.
 20
Kraft D. A software package for sequential quadratic programming. Technical Report DFVLRFB 8828, Oberpfaffenhofen: Institut für Dynamik der Flugsysteme; 1988.
 21
Efron B. Largescale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction vol. 1. Cambridge: Cambridge University Press; 2010.
 22
Kao KC, Sherlock G. Molecular characterization of clonal interference during adaptive evolution in asexual populations of saccharomyces cerevisiae. Nat Genet. 2008; 40(12):1499–504.
 23
Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013; 31(3):213–9.
 24
Johnston M. Feasting, fasting and fermenting: glucose sensing in yeast and other cells. Trends Genet. 1999; 15(1):29–33.
 25
De Boor C, De Boor C, De Boor C, De Boor C. A Practical Guide to Splines vol. 27. New York: Springer; 1978.
 26
Saddiki H, McAuliffe J, Flaherty P. Glad: a mixedmembership model for heterogeneous tumor subtype classification. Bioinformatics. 2014; 31(2):225–32.
 27
Wang C, Blei DM. Variational inference in nonconjugate models. J Mach Learn Res. 2013; 14(1):1005–31.
Acknowledgements
We would like to thank Dan Kvitek for kindly sharing the data set used for the longitudinal yeast evolution analysis. We acknowledge Chuangqi Wang and Ross Lagoy for valuable comments on the manuscript.
Funding
FZ and PF were supported by PhRMA Foundation Informatics Grant 2013080079.
Availability of data and materials
The synthetic data set is available on http://dnadiscovery.stanford.edu/software/rvd/under the heading of Synthetic DNA Data. The Illumina sequencing data is available on the NCBI Sequence Read Archive SRA054922. The source code in Python is available in ’variational’ branch in rvd repository https://bitbucket.org/flahertylab/rvd/branches/.
Authors’ contributions
FZ developed and implemented the variational expectation maximization inference algorithm, and wrote the manuscript. PF conceived the study and helped to draft the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Additional files
Additional file 1
Derivation of the variational expectation maximization (EM) inference algorithm. Derivation of the variational EM algorithm is described in detail. (PDF 46.4 kb)
Additional file 2
Identified variants and corresponding nonreference allele frequencies in gene MTH1 on Chromosome 4. A blank cell indicates that the position of that time point is not called significantly different than G7. The positions highlighted as blue were also identified by Kvitek, 2013. The other 81 positions are novel identified variants in 8 timepoints. (XLSX 18.4 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, F., Flaherty, P. Variational inference for rare variant detection in deep, heterogeneous nextgeneration sequencing data. BMC Bioinformatics 18, 45 (2017) doi:10.1186/s1285901614515
Received:
Accepted:
Published:
Keywords
 Single nucleotide variant detection
 Nextgeneration sequencing
 Bayesian statistical method
 Variational inference