Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model
 Zhengdong D Zhang^{1} and
 Mark B Gerstein^{2, 3, 4}
https://doi.org/10.1186/1471-2105-11-539
© Zhang and Gerstein; licensee BioMed Central Ltd. 2010
Received: 17 May 2010
Accepted: 31 October 2010
Published: 31 October 2010
Abstract
Background
Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and the newly developed read-depth approach through ultra-high-throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.
Results
We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and the more recently developed high-throughput DNA sequencing. Treating parameters that define the underlying data-generating process (e.g., the number of CNVs, the position of each CNV, and the data noise level) as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we obtain not only the best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets, in comparison to other segmentation algorithms.
Conclusions
In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods, and as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.
Background
Stable but not static, the DNA of the human genome is subject to a variety of heritable changes of different types, which significantly contribute to the phenotypic differences of individuals in human populations. In addition to single nucleotide polymorphisms (SNPs), these genetic changes also include chromosomal structural variations, such as insertions, deletions, duplications, inversions, and translocations, on various genomic scales. Recent studies showed that insertions, deletions, and duplications of DNA segments of 1 kb or longer in the genome (collectively referred to as copy number variants, or CNVs) occur at a much higher frequency than previously expected [1–4]. A recent global study of CNVs in the human genome showed that the regions of CNVs covered more nucleotide content per genome than SNPs [1]. It is now widely believed that CNVs are as important as SNPs and other small genomic changes in their contribution to the phenotypic variation in human populations.
Currently, unbalanced structural variants can be experimentally identified by methods based on microarray technology, polymerase chain reaction, or DNA sequence comparison. The array-based method is a natural, high-throughput extension of comparative genomic hybridization (CGH) analysis, which was originally developed as a method to reveal regions of allele loss or aneuploidy by fluorescence microscopy [5]. High-density oligonucleotide microarrays, which offer high genomic resolution, have been used in several recent array-CGH studies [4, 6, 7]. The last several years have seen rapid advancement in sequencing technology. Novel methods [8–10] are being developed to reduce the cost and increase the throughput by generating massive amounts of sequence that can be aligned to the genomic reference. This development has made it possible to resequence whole genomes from multiple individuals.
Indeed, a major sequencing project, the 1000 Genomes Project, has been launched to resequence the genomes of at least a thousand people from around the world using the new sequencing technologies, to produce the most detailed map of human genetic variation for disease studies. As the technologies mature and their uses spread, new sequencing-based methods to detect structural variations have been developed to take advantage of massively parallel sequencing. In the read-depth approach, after DNA fragments are sequenced from one or both ends, the reads are mapped to the genome and then counted in consecutive non-overlapping windows. Both methods provide a rapid, robust, and comprehensive approach to identify CNVs on the whole-genome scale.
Both array-CGH and read-depth sequencing generate genomic copy number (GCN) data in a very similar format: genomic signal values indexed by genomic location. The signals are log-ratios of normalized intensities from the test sample to those from the reference sample for array-CGH, and sequence read counts after mean subtraction for read-depth sequencing. The goal of analyzing such data is to detect CNVs by identifying regions with signals that are consistently higher or lower than the normalized baseline. Implicitly, there are two distinct and yet closely related estimation problems: one is to estimate the number of CNVs, and the other is to determine the boundaries and the average signal strength of each of them. Many statistical and computational methods have been developed to identify CNVs in individual genomes. They include approaches built on hidden Markov models [11–13] or in a Bayesian framework [14–16]. Recently a method to identify recurrent CNVs within a group of individuals has also been proposed [17]. Based on their data analysis approaches, algorithms developed to analyze such data can be roughly grouped into three types: some only smooth the raw log-ratio data, and regions with log-ratios higher or lower than a preset threshold are identified as CNVs [18, 19]; others estimate the number of CNVs and their boundaries directly from the original log-ratio data [20–23]; and the rest use a combined approach [24–27]. The relative performance of these algorithms has been assessed [28].
Here we present a Bayesian statistical framework to analyze both array-CGH and read-depth data. Treating the parameters that define the underlying genomic copy number variation encoded in the data as random variables, our approach derives the posterior distribution of those parameters given the observed data. This statistical method models the location of regional changes and their associated copy numbers, and estimates the overall noise level in the data at the same time. Sampling from the posterior distribution using Markov chain Monte Carlo (MCMC) simulation gives both the best estimate and a corresponding Bayesian credible interval for each parameter in the model. We discuss how our model was derived and implemented, and the empirical results from applying our method to both microarray and sequencing data for CNV discovery.
Statistical model
In the life sciences we are often faced with the task of making inferences about some object of interest given incomplete and noisy experimental data. In the case of CNV study, we are primarily concerned with inferring the number of DNA copy number variations, their locations, sizes, and corresponding copy-number-associated amplitude measurements, given the genomic copy number data, which are log-ratios of sample and control intensities measured on microarrays or read depths generated by shotgun genomic sequencing. To demonstrate the application of our method to read-depth data, we take a set of sequence reads from the 1000 Genomes Project and construct a 'read-depth intensity signal' spectrum by first mapping the reads to the human genome reference sequence and then counting the number of reads in a sliding window, a procedure that transforms sequencing data into an array-like intensity signal. We capture these unknown quantities in a probability model that relates them to the observed data. Our model is Bayesian in essence, as we assign prior distributions to the parameters and use the posterior distribution to estimate the underlying data-generating process. Given the posterior distribution, we then use the Markov chain Monte Carlo method to fit the model. By doing so, we obtain not only the best estimates for these unknown parameters but also Bayesian credible intervals for the estimates.
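As a concrete illustration of the read-depth signal construction just described (mapping reads, counting them in non-overlapping windows, and mean-subtracting), here is a minimal Python sketch. The function name, window size, and toy read set are illustrative, not from the paper's pipeline:

```python
import random

def read_depth_signal(read_starts, window=100, genome_len=10_000):
    # count read start positions per non-overlapping window, then subtract
    # the genome-wide mean to get an array-like intensity signal
    n_win = genome_len // window
    counts = [0] * n_win
    for pos in read_starts:
        w = pos // window
        if 0 <= w < n_win:
            counts[w] += 1
    mean = sum(counts) / n_win
    return [c - mean for c in counts]

# toy example: uniform background plus a locally elevated read density
random.seed(0)
reads = [random.randrange(10_000) for _ in range(2000)]        # background
reads += [random.randrange(4000, 4500) for _ in range(400)]    # a duplicated region
signal = read_depth_signal(reads)
```

Windows overlapping the simulated duplication carry a consistently elevated signal, which is exactly the pattern the model below is designed to detect.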
Under this model, the 'ideal' (noise-free) signal value at the k th genomic location is

f_{ k } = Σ_{j = 1}^{N} a_{ j } · I(s_{ j } ≤ k &lt; s_{ j } + w_{ j }),

where N is the number of 'smoothed' CNVs detectable in the data set, s_{ j } , w_{ j } , a_{ j } are the start genomic location, the width, and the amplitude of the j th CNV respectively, and I(·) is the indicator function. Thus the 'ideal' data set corresponding to D based on this model is ℱ = {g_{ k } , f_{ k } }, k = 1, 2, ..., M. For simplicity, we assume that X_{ 1 } , X_{ 2 } , ..., X_{ M } measured in D are independent random variables, each subject to additive Gaussian noise around ℱ with a standard deviation σ.
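Assuming the piecewise-constant signal model just described, a short Python sketch (illustrative only; the paper's implementation is in R) generates a data set of the kind analyzed in the Results section:

```python
import random

def ideal_signal(M, cnvs):
    # f_k = sum over CNVs j of a_j for s_j <= k < s_j + w_j (1-based k);
    # `cnvs` is a list of (s_j, w_j, a_j) triples
    f = [0.0] * M
    for s, w, a in cnvs:
        for k in range(s - 1, min(s - 1 + w, M)):
            f[k] += a
    return f

def simulate_data(M, cnvs, sigma, seed=0):
    # X_k ~ N(f_k, sigma^2): additive Gaussian noise around the ideal signal
    random.seed(seed)
    f = ideal_signal(M, cnvs)
    return [fk + random.gauss(0.0, sigma) for fk in f]

# parameters mirroring the one-CNV simulation in the Results section
x = simulate_data(M=500, cnvs=[(200, 50, 1.5)], sigma=0.4)
```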
By Bayes' theorem, the posterior distribution of the model parameters θ given the data D is p(θ|D) ∝ p(D|θ) p(θ), where the normalization constant p(D) is omitted for simplicity.
Prior. Given the discrete nature of CNVs, it is reasonable to assume a priori independence among all the parameters in θ. We choose the following prior distributions:
▫ Uniform distributions for N, s_{ j } , and w_{ j } (j = 1, 2, ..., N):

p(N) = 1/N_{max}

p(s_{ j }) = 1/(s_{max} − s_{min}) = 1/M

p(w_{ j }) = 1/M
▫ Normal distribution for a_{ j } : p(a_{ j } ) ~ $\mathcal{N}$(τ_{ j } , κ_{ j }^{2})
▫ Inverse gamma distribution for σ^{2}: p(σ^{2}) = β^{α} (σ^{2})^{−(α + 1)} e^{−β/σ^{2}}/Γ(α)
where α, β, τ_{ j } and κ_{ j } are the hyperparameters that characterize the prior distribution. See the Implementation subsection below for their parameterization.
Informative prior. If we have information on certain parameters in θ, for example s_{ j } and w_{ j } , from an initial scan of data D, such information can be encoded in an informative prior to simplify subsequent parameter estimation. For example, suppose we know N = 1, that the CNV starts somewhere between genomic positions a and b, and that its length is between c and d bp. We encode such prior information as follows:
▫ Uniform distributions for s_{1} and w_{1}:

p(s_{1}) = 1/(b − a) for s_{1} ∈ [a, b], and 0 otherwise

p(w_{1}) = 1/(d − c) for w_{1} ∈ [c, d], and 0 otherwise
With this informative prior, the posterior is the same as (7), but nonzero only for s_{1} ∈ [a, b] and w_{1} ∈ [c, d]. This condition simplifies subsequent parameter estimation, as s_{1} and w_{1} only need to be sampled in these two intervals during the MCMC simulation.
When analyzing clean read-depth data, if the amplitude of the j th CNV, a_{ j } , occurs discretely at several different values (for example a_{ j } ∈ {−c, c, 2c}, where c is the genome-wide average haploid read depth), the prior distribution p(a_{ j } ) of a_{ j } can be modeled naturally by a multinomial distribution.
Algorithm and implementation
Parameter estimation by Markov chain Monte Carlo simulation
In principle, any posterior summary of interest can be written as an expectation, E(g(θ)|D) = ∫_{Θ} g(θ) p(θ|D) dθ, but the calculation is usually impracticable for two reasons. Firstly, p(θ|D) is only known up to some multiplicative constant due to the proportionality form of equation (8). Secondly, even if the exact form of p(θ|D) is known, given the number of parameters in θ (at least four in a nontrivial case), the high-dimensional integral is very difficult to carry out in practice and soon becomes intractable as the number of parameters increases. However, Markov chain Monte Carlo (MCMC) provides an alternative whereby the posterior can be directly sampled to obtain sample estimates of the quantities of interest. Thus, using a random sequence of K draws θ^{(1)}, θ^{(2)}, ..., θ^{(K)} from p(θ|D), E(g(θ)|D) can be approximated by simply taking the average of g over these draws. Similar methods can be used to compute the posterior standard deviation ${\zeta}_{\widehat{\theta}}$, quantiles, probabilities that parameters take particular values, and other quantities of interest.
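Given K posterior draws, these posterior summaries are plain sample statistics. A sketch with synthetic draws for a single scalar parameter (the numbers mimic the posterior of a_1 in the Results section; they are illustrative stand-ins for real MCMC output):

```python
import random
import statistics

random.seed(1)
# stand-in for K = 400 posterior draws of a scalar parameter theta
draws = [random.gauss(1.57, 0.057) for _ in range(400)]

post_mean = statistics.mean(draws)        # Monte Carlo estimate of E(theta | D)
post_sd = statistics.stdev(draws)         # posterior standard deviation estimate
q = statistics.quantiles(draws, n=40)     # cut points at 2.5%, 5%, ..., 97.5%
ci_95 = (q[0], q[-1])                     # approximate 95% credible interval
```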
The Gibbs sampling algorithm [29] was implemented to sample from the target distribution {p(θ|D,ℳ): θ ∈ Θ ⊆ ℝ^{3N+1}}. To do so, the Gibbs sampler first constructs an aperiodic and irreducible Markov chain whose stationary distribution is p(θ|D,ℳ) in the state space Θ, and then draws a sequence of random samples from conditional distributions to characterize the joint target distribution. More precisely, it was implemented by (i) taking some initial values θ^{(0)}; (ii) for each t = 1, 2, ..., T, where T is the preset number of iterations, generating θ_{ i }^{(t)} from p(θ_{ i } | θ_{1}^{(t)}, ..., θ_{i−1}^{(t)}, θ_{i+1}^{(t−1)}, ..., θ_{|θ|}^{(t−1)}, D, ℳ) for i = 1, 2, ..., |θ|; and (iii) continuing until the estimated target distribution $\widehat{p}(\theta |D,\mathcal{M})$ converges.
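Steps (i)–(iii) can be illustrated on a toy target whose full conditionals are known in closed form, a zero-mean bivariate normal with correlation ρ. This is a sketch of the Gibbs sampling pattern only, not the paper's CNV sampler:

```python
import math
import random

def gibbs_bivariate_normal(rho, T=5000, seed=0):
    # toy target: zero-mean bivariate normal with correlation rho, whose
    # full conditionals are x|y ~ N(rho*y, 1-rho^2) and y|x ~ N(rho*x, 1-rho^2)
    random.seed(seed)
    sd = math.sqrt(1 - rho ** 2)
    x, y = 0.0, 0.0                     # (i) initial values theta^(0)
    draws = []
    for _ in range(T):                  # (ii) sweep each coordinate in turn
        x = random.gauss(rho * y, sd)   #   draw x from p(x | y)
        y = random.gauss(rho * x, sd)   #   draw y from p(y | x)
        draws.append((x, y))
    return draws[T // 10:]              # (iii) discard burn-in draws

draws = gibbs_bivariate_normal(rho=0.8)
```

The retained draws reproduce the target's correlation structure, which is the convergence behavior step (iii) checks for.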
To calculate the conditional probabilities of s_{ j } and w_{ j } required by the second step of the Gibbs sampling stated above, all possible s ∈ [g_{1}, g_{ M } ] and w ∈ [w_{min}, w_{max}] are evaluated. Given the normality assumption about the data, conjugate prior distributions of a_{ j } and σ^{2} can be used to simplify the calculation of their conditional probabilities. If the prior distribution of a_{ j } takes the conjugate form p(a_{ j } ) ~ $\mathcal{N}$(τ_{ j } , κ_{ j }^{2}), the conditional distribution of a_{ j } given the other parameters, the data D, and the model ℳ is also a normal distribution,

$\mathcal{N}\left(\frac{(1/{\kappa}_{j}^{2})\,{\tau}_{j}+({w}_{j}/{\sigma}^{2})\,{\overline{x}}_{j}}{1/{\kappa}_{j}^{2}+{w}_{j}/{\sigma}^{2}},\ \frac{1}{1/{\kappa}_{j}^{2}+{w}_{j}/{\sigma}^{2}}\right)$,

where ${\overline{x}}_{j}$ is the average log-ratio of probe intensities in the j th CNV. Given the conjugate prior distribution of σ^{2}, p(σ^{2}) ~ $\mathcal{I}\text{nv}\mathcal{G}\text{amma}$(α, β), the conditional distribution of σ^{2} given the other parameters, the data D, and the model ℳ is also an inverse gamma distribution, $\mathcal{I}\text{nv}\mathcal{G}\text{amma}(\alpha +M/2,\ \beta +{\textstyle \sum}_{i=1}^{M}{({x}_{i}-{f}_{i})}^{2}/2)$, where f_{ i } is the model signal value at position i.
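The two conjugate full-conditional draws can be transcribed directly into code. In this sketch the hyperparameter defaults follow the Implementation subsection, the residuals are those from the current fit, and the function names are illustrative:

```python
import random

def draw_a_conditional(xbar_j, w_j, sigma2, tau_j=0.0, kappa_j=100.0):
    # draw a_j from its normal full conditional: a precision-weighted
    # combination of the prior mean tau_j and the segment average xbar_j
    prior_prec = 1.0 / kappa_j ** 2
    data_prec = w_j / sigma2
    mean = (prior_prec * tau_j + data_prec * xbar_j) / (prior_prec + data_prec)
    var = 1.0 / (prior_prec + data_prec)
    return random.gauss(mean, var ** 0.5)

def draw_sigma2_conditional(residuals, alpha=1.0, beta=1.0):
    # draw sigma^2 from InvGamma(alpha + M/2, beta + sum(residual^2)/2);
    # if G ~ Gamma(shape, 1) then rate/G ~ InvGamma(shape, rate)
    shape = alpha + len(residuals) / 2.0
    rate = beta + sum(r * r for r in residuals) / 2.0
    return rate / random.gammavariate(shape, 1.0)
```

With a wide prior (κ_j = 100) and a reasonably long segment, the conditional for a_j concentrates tightly around the segment average, as the formula above predicts.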
Model selection using Bayes factor
Model selection is required to determine N, the number of CNVs, as different values of N change the model parameterization θ. Suppose that the data D have arisen under one of two models, {ℳ_{1}: θ_{1} ∈ Θ_{1}} and {ℳ_{2}: θ_{2} ∈ Θ_{2}}, according to a probability density p(D|ℳ_{1}) or p(D|ℳ_{2}). The Bayes factor, B_{12} = p(D|ℳ_{1})/p(D|ℳ_{2}), then quantifies the evidence the data provide for ℳ_{1} over ℳ_{2}.
To estimate the marginal likelihood p(D|ℳ) from posterior samples, we used the stabilized estimator ${\widehat{p}}_{4}(D|\mathcal{M})$, which does not display any of the instability of the harmonic-mean estimator ${\widehat{p}}_{\text{HM}}(D|\mathcal{M})$, to calculate the Bayes factor for model comparison.
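As a sketch of the underlying Monte Carlo idea (not of the stabilized p̂_4 itself), the simpler harmonic-mean estimator of the marginal likelihood can be computed in log space from the log-likelihoods at posterior draws; the stabilized variant follows the same pattern with a modified weighting:

```python
import math

def log_marginal_hm(log_liks):
    # harmonic-mean estimator: p_HM = K / sum_k 1/p(D|theta_k),
    # computed via log-sum-exp for numerical stability
    K = len(log_liks)
    m = max(-ll for ll in log_liks)
    lse = m + math.log(sum(math.exp(-ll - m) for ll in log_liks))
    return math.log(K) - lse
```

When every draw has the same log-likelihood c the estimator returns c exactly; its notorious instability comes from draws with very small likelihood dominating the harmonic mean.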
Implementation
Our method, including the Markov chain Monte Carlo simulation and the model comparison, is currently implemented in R [31]. To use noninformative priors for our Bayesian inference, we set τ_{ j } = 0 and κ_{ j } = 100, which effectively makes the prior distribution of a_{ j } flat around 0. We also assign 1 to both α and β for the inverse gamma distribution of σ^{2}. We tested various values of the hyperparameters (τ_{ j } , κ_{ j } , α, and β), and the simulation results showed that the parameter inference was insensitive to the values assigned to these hyperparameters, which is expected given the large number of data points. A 500-iteration MCMC simulation of the posterior distribution (7) given a data set with M = 500 and N = 1 took 126 seconds of CPU time on a personal computer with one 1400-MHz x86 Intel processor and 500 MB of RAM. To assess the convergence of the Markov chain after 500 iterations, we started multiple chains from different values of θ^{(0)}. The simulations showed that after the first few dozen iterations all chains converged to the same solution. Based on this observation, we concluded that 500 iterations were sufficient for Bayesian inference in this case. We used the same convergence diagnostic for all inferences.
Because of the great computational intensity of the MCMC simulation, we use a 'divide and conquer' strategy to process large GCN data sets. We first sort array/sequencing features on each chromosome by genomic location and then group 1000 consecutive features into subsets for parallel processing on a computer cluster.
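The 'divide and conquer' preprocessing can be sketched as follows (illustrative Python; the actual implementation is in R and runs on a cluster):

```python
def chunk_features(features, size=1000):
    # sort (chrom, pos, value) features by position within each chromosome
    # and split them into consecutive subsets of `size` features per job
    by_chrom = {}
    for chrom, pos, value in features:
        by_chrom.setdefault(chrom, []).append((pos, value))
    jobs = []
    for chrom in sorted(by_chrom):
        feats = sorted(by_chrom[chrom])
        for i in range(0, len(feats), size):
            jobs.append((chrom, feats[i:i + size]))
    return jobs

# 2500 features on one chromosome -> three jobs of 1000, 1000, and 500
jobs = chunk_features([("chr1", p, 0.0) for p in range(2500)], size=1000)
```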
Results
Simulated array-CGH data
The simulated array data (M = 500) plotted in Figure 2A were generated with θ = {N = 1, (s_{1} = 200, w_{1} = 50, a_{1} = 1.5), σ^{2} = 0.4^{2}}, which was to be estimated. Taking N = 1, we started the Markov chain at a random θ^{(0)} = {(s_{1} = 100, w_{1} = 0, a_{1} = 0), σ^{2} = 0.1^{2}} and ran it for 500 iterations. The sampling results are shown in Figure 2C–F. As the parameter trace plots (Figure 2G–J) show, the Markov chain quickly converged to stationarity after approximately ten iterations (Figure 2B). To err on the side of caution, we discarded the samples from the first 100 iterations as 'burn-in' samples and estimated the parameter values from the remaining 400 samples, which gave $\widehat{\theta}=\{({\widehat{s}}_{1}=200,{\widehat{w}}_{1}=50,{\widehat{a}}_{1}=1.57),{\widehat{\sigma}}^{2}={0.38}^{2}\}$ given N = 1.
Remarkably, all these samples have very similar values of s_{1} and w_{1}, namely 200 and 50 respectively. Because of this small variation, the estimates of s_{1} and w_{1} from the data are of extremely high confidence. The distributions of a_{1} and σ in the 400 samples are approximately normal, $\mathcal{N}$(1.57, 0.057^{2}) and $\mathcal{N}$(0.38, 0.012^{2}) respectively. Based on their normal distributions, we can easily calculate a Bayesian credible interval for both a_{1} and σ. For example, a 95% Bayesian credible interval for a_{1} is [1.46, 1.68], which suggests that, after observing the data, there is a 95% chance that the average log-ratio of intensities in this CNV falls between 1.46 and 1.68.
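Under the normal approximation N(1.57, 0.057²), the reported 95% credible interval follows directly from the usual ±1.96σ bounds:

```python
# normal approximation to the posterior of a_1: N(1.57, 0.057^2)
a_hat, sd = 1.57, 0.057
lo = a_hat - 1.96 * sd   # lower bound of the 95% credible interval
hi = a_hat + 1.96 * sd   # upper bound
```

Rounding to two decimals recovers the interval [1.46, 1.68] quoted in the text.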
We also simulated array-CGH data (M = 1000) with multiple CNVs (Figure 3A) using θ = {N = 4, (s_{1} = 100, w_{1} = 30, a_{1} = 0.7), (s_{2} = 200, w_{2} = 20, a_{2} = 0.3), (s_{3} = 400, w_{3} = 80, a_{3} = 1.5), (s_{4} = 600, w_{4} = 90, a_{4} = 0.6), σ^{2} = 0.1^{2}}. To identify the CNVs encoded in this data set, the model-specific parameters {(s_{ j } , w_{ j } , a_{ j } ), σ^{2}}, j = 1, 2, ..., N, were first estimated under different models with N = 0, 1, ..., 5. In Figure 3A, the scatter plot of the multi-CNV array-CGH data is overlaid with the segmentation found by our algorithm using different models. The figure shows that the most prominent CNV was identified first when the number of CNVs, N, was set to 1, and that less prominent CNVs were progressively identified as the model became more permissive (i.e., as N was increased). To select the most plausible model from which the observed data were generated, each of the models with N = 1, 2, 3, 4, 5 was then compared with the basal, null model (N = 0). Quantification of these comparisons by the logarithm of the Bayes factor, which gives 691.02, 926.94, 1091.13, 1556.23, and 1173.67 respectively, clearly indicates that the model with N = 4 is the best model among the ones tested (Figure 3B). It is noteworthy that since the numbers above are logarithms of the Bayes factor, the actual increase in the marginal evidence p(D|ℳ) between neighboring models is very substantial. For example, the increase in the marginal evidence from N = 3 to N = 4 is e^{1556.23 − 1091.13} = e^{465.11} ≈ 9.86 × 10^{201} fold.
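Since the reported values are natural logarithms of Bayes factors, the fold change in marginal evidence between neighboring models can be recovered on a base-10 scale; the numbers below are the ones quoted in the text:

```python
import math

# log Bayes factors (vs. the N = 0 null model) reported in the text
log_bf = {1: 691.02, 2: 926.94, 3: 1091.13, 4: 1556.23, 5: 1173.67}
best_N = max(log_bf, key=log_bf.get)     # model with the largest evidence

# fold increase in marginal evidence from N = 3 to N = 4, on a base-10 scale
delta = log_bf[4] - log_bf[3]            # natural-log scale difference
exponent10 = delta / math.log(10)        # ~202, i.e., a ~10^202-fold increase
```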
Glioblastoma Multiforme array-CGH data
Lai et al. [28] compared 11 different array-CGH data analysis algorithms that are based on diverse statistical or computational techniques. In addition to testing those methods using simulated data, they also characterized their performance on chromosomal regions of interest in real data sets obtained from patients with Glioblastoma Multiforme (GBM) [32]. These cDNA microarray-based data sets were generated by CGH profiling of copy number alterations across 42,000 mapped human cDNA clones in a series of 54 gliomas of varying histogenesis and tumor grade.
Figures 5A and 5B show the array-CGH profiles of chromosomes 13 and 22 in Glioblastoma Multiforme samples GBM31 and GBM29, respectively, overlaid with the segmentation found by our algorithm. As seen in Figure 5A, our algorithm detected the single broad proximal deletion of part of chromosome 13 in GBM31, spanning from the 59^{th} to the 542^{nd} probe with a log-ratio intensity of −0.30: $\widehat{\theta}=\{({\widehat{s}}_{1}=59,\ {\widehat{e}}_{1}=542,\ {\widehat{a}}_{1}=-0.30),\ {\widehat{\sigma}}^{2}={0.38}^{2}\}$, with corresponding standard deviations ${\zeta}_{\widehat{\theta}}=\{({\zeta}_{{\widehat{s}}_{1}}=2.40,\ {\zeta}_{{\widehat{e}}_{1}}=3.19,\ {\zeta}_{{\widehat{a}}_{1}}=0.02),\ {\zeta}_{{\widehat{\sigma}}^{2}}=0.01\}$ for calculating each Bayesian credible interval. The breakpoint ${\widehat{e}}_{1}$ at probe genomic index 542 was also identified by all the programs that detected this deletion in the test conducted by Lai et al. The other breakpoint, ${\widehat{s}}_{1}$ at 59, was also found by CLAC and ACE evaluated in the same test [28]. The small sample standard deviations in ${\zeta}_{\widehat{\theta}}$ connote the reliability of the parameter estimation despite the rather low signal-to-noise ratio of the GBM31 data. Our algorithm also detected all three high-amplitude amplifications of parts of chromosome 22 in GBM29 (Figure 5B). Even though only four probes separate the first two amplifications, our method still segmented them clearly.
Moreover, our method also pinpointed the six breakpoints of these three CNVs (their sample standard deviations are all zero), which makes these predictions highly reliable.
β-globin high-density array-CGH data
One recent significant development in microarray technology is the emergence of tiling arrays, which can be used to cover large genomic regions or even an entire genome on one or several microarrays in an unbiased fashion by using oligonucleotides (a.k.a. tiles) uniformly sampled from the represented genomic sequences. The current trend is to migrate from PCR-based arrays to tiling arrays for much higher resolution and comprehensive genomic coverage.
In a recent study [7], to test the resolution limits of tiling arrays when used with CGH for CNV discovery, Urban et al. designed microarrays that tile through 100 kb of the β-globin locus with overlapping isothermal oligonucleotides spaced 9 bp apart along the tiling path. They compared test DNA from a patient with a known heterozygous deletion of 622 bp in the β-globin locus against reference DNA pooled from seven individuals without this chromosomal aberration. Figure 5C shows the array-CGH profile of the β-globin locus of the patient overlaid with the segmentation found by our algorithm, $\widehat{\theta}=\{(\widehat{s}=48,\ \widehat{e}=88,\ \widehat{a}=-0.36),\ {\widehat{\sigma}}^{2}={0.25}^{2}\}$. This deletion in the β-globin locus was detected, and the estimate of its length, $\widehat{w}$, corresponding to 641 bp in the genomic coordinate system, is highly accurate in comparison with the actual length of the deletion (622 bp).
Read-depth genome resequencing data
The genome of a Utah resident with Northern and Western European ancestry from the CEPH collection (NA12878) has been sequenced by the 1000 Genomes Project using both the 454 paired-end and the Illumina shotgun sequencing technologies, which produced long (120-bp) sequence reads at low coverage (0.5×) and short (50-bp) ones at high coverage (40×), respectively.
Our method detected this deletion in the read-depth data and estimated its parameters to be $\widehat{\theta}=\{(\widehat{s}=32670400,\ \widehat{e}=32672500,\ \widehat{a}=-51.20),\ {\widehat{\sigma}}^{2}={27.73}^{2}\}$. To investigate how the sequencing depth affects the estimation of the start and end positions of a CNV, we simulated a series of sequencing depths by randomly sampling (without replacement) different numbers of mapped Illumina reads and then applied our method to the simulated data. The standard deviation in the estimates of the start and end positions, s and e, reflects how well these two parameters can be estimated from the read-depth data. In Figure 6B we plot the averaged standard deviation in the estimates of s and e at different sequencing depths. It is clear that as the sequencing depth decreases from the original depth (37×), the estimates of the terminal positions become less accurate. In fact, when the coverage drops below 1×, it becomes very difficult to find the deletion at all.
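The depth-titration experiment can be sketched as random subsampling of mapped reads without replacement; the function name and the toy read list are illustrative:

```python
import random

def downsample_reads(read_starts, target_frac, seed=0):
    # sample a fraction of the mapped reads without replacement to
    # emulate a lower sequencing depth
    random.seed(seed)
    k = int(len(read_starts) * target_frac)
    return random.sample(read_starts, k)

reads = list(range(37_000))                 # stand-in for the mapped read set
half_depth = downsample_reads(reads, 0.5)   # emulate ~18.5x from ~37x
```

The subsampled reads are then rebinned into a read-depth signal and re-analyzed, and the spread of the resulting breakpoint estimates is recorded at each depth.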
Discussion and Conclusion
The Metropolis-Hastings and the Gibbs sampling algorithms, two Markov chain Monte Carlo simulation methods, have been widely used for Bayesian inference. Developed by Metropolis et al. [33] and subsequently generalized by Hastings [34], the Metropolis-Hastings algorithm generates, based on the current state and a preselected proposal density, candidate states that are accepted or rejected stochastically with a certain acceptance probability; the current state is retained when a rejection takes place. Gibbs sampling [29] draws a sequence of random samples from conditional distributions of the unknown parameters to characterize their joint target distribution. In fact, Gibbs sampling can be regarded as a special case of the Metropolis-Hastings algorithm in which the acceptance probability is always one, i.e., every proposal is automatically accepted.
For our Bayesian analysis of genomic copy number data, we implemented both the random-walk Metropolis-Hastings (RWMH) and the Gibbs sampling algorithms and observed that in this application Gibbs sampling is much more suitable for parameter inference. RWMH worked well for one-CNV data. However, if the data contain two widely separated CNVs, it can only identify one of them, but not both. To investigate this limitation, we plotted the landscape of the posterior probability distribution in a two-dimensional parameter space. A two-CNV data set D (M = 700) with θ = {N = 2, (s_{1} = 100, w_{1} = 20, a_{1} = 2), (s_{2} = 600, w_{2} = 20, a_{2} = 2), σ^{2} = 0.4^{2}} was first simulated, and then the posterior probability was evaluated for various combinations of s_{1} and s_{2} while all other parameters were kept fixed at their true values.
As the ROC curves in Figure 4 show, our Bayesian algorithm is the most sensitive method at low (< 10%) false positive rates. This means that at a given low FPR our method can identify more true positive probes inside CNVs than other methods. When the FPR is higher, it is less sensitive than several methods, most of which find CNVs through data smoothing. However, this is hardly a disadvantage, as at high false positive rates the list of identified CNVs is awash with false positives, rendering the whole list practically unusable.
In addition to its improved sensitivity, our method has several distinct advantages innate to its Bayesian approach. The confidence in an estimated parameter value can be assessed through its Bayesian credible interval. Akin to a confidence interval but with an intuitive probabilistic interpretation, a Bayesian credible interval is a range of values that contains the true parameter value with a certain probability. Through stochastic simulation, it is straightforward to summarize the otherwise analytically intractable joint posterior distribution of the unknown parameters and to compute both the best estimate and a corresponding Bayesian credible interval for each parameter in the model. The availability of the intervals for s_{ j } , e_{ j } , and a_{ j } (the start and end genomic locations and the copy number of each CNV) is unique to our Bayesian method, and these credible intervals can be especially useful.
Recent years have seen fast development of methodologies in different frameworks to detect CNVs in array-CGH data. For example, to detect CNV breakpoints, Shah et al. used a modified hidden Markov model (HMM) that is robust to outlier probes [13], while Rueda and Díaz-Uriarte used a non-homogeneous HMM fitted via reversible-jump MCMC [12]. Pique-Regi et al. used piecewise-constant (PWC) vectors to represent genome copy numbers and sparse Bayesian learning to detect CNV breakpoints [16]. Other methods for segmenting array-CGH data have also been implemented, including Bayesian change point analysis [15], a spatially correlated mixture model [14], a Bayes regression model [35], and wavelet decomposition and thresholding [36].
Due to the computational intensiveness of its MCMC simulation, the method that we present here can be most advantageously used to refine CNVs detected by fast point-estimate methods. It can also be seen as a basic genomic copy number data analysis framework that is amenable to several possible extensions. Firstly, due to the nature of genomic sequence duplications and deletions, the signal measurements of CNVs aggregate around certain expected values. Such information could be incorporated into the model for better signal detection against the background noise. Secondly, a more complicated likelihood function, such as a truncated Gaussian, could be used to handle outliers in genomic copy number data. Thirdly, informative priors could be used for better CNV detection. The formation of CNVs in a genome is potentially affected by many local genomic features, such as conservation and repeat content on the sequence level. Compared with the aforementioned methods for array-CGH data, our Bayesian approach has the advantage of readily incorporating such sequence information through the prior distributions, as it treats the start and the width of CNVs as parameters and thus directly models the genomic CNV state. For this initial Bayesian analysis of genomic copy number data, we used flat priors for both the CNV start site and width. However, instead of using such noninformative priors, we could assign a prior for the start site inversely proportional to the conservation level of the probe sequence. (This incorporates our belief that the more conserved a sequence is, the less likely it is to be duplicated or deleted.) For the width, we could assign the width distribution of known CNVs in the database as a prior. The incorporation of such knowledge through the priors does not need to be done only once: it can be sequential (order-insensitive) as more relevant information becomes available.
Using such informative priors, our method can be seen as a framework that enables the integration of genomic copy number data with CNV-related biological knowledge.
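As a sketch of such an informative prior, a start-site prior inversely proportional to per-probe conservation scores could be built as follows; the scores themselves are hypothetical inputs, and the small offset guarding against division by zero is an implementation choice:

```python
def start_site_prior(conservation):
    # prior for the CNV start site inversely proportional to conservation,
    # normalized to sum to one: more conserved sequence is assumed less
    # likely to be duplicated or deleted
    weights = [1.0 / (c + 1e-6) for c in conservation]
    total = sum(weights)
    return [w / total for w in weights]

# probe 3 is the least conserved, so it receives the largest prior mass
prior = start_site_prior([0.9, 0.9, 0.1, 0.9])
```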
Declarations
Acknowledgements
ZDZ was funded by an NIH grant (T15 LM07056) from the National Library of Medicine.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.