Because of the complexity of the biological model, we first describe a preliminary simplified model (called Model 1), which only estimates the copy number events exploiting the relationship between copy number and LOH data. Therefore, it does not identify copy-neutral LOH regions (called IBD/UPD regions), which are due to events like uniparental disomy, and it does not distinguish the normal regions from the gained one (because we suppose that the capability of detection of the homozygous status is the same in these two types of regions). In the subsequent subsections, we add to the model the detection of copy-neutral LOH regions (Model 2) and of gained ones (Model 3). Therefore, the explanation is structured in the following way:
-
Model 1: relationship between LOH and copy number data to detect copy number changes (apart from the gained regions);
-
Model 2: addition to Model 1 of the IBD/UPD region detection (i.e. determination of copy-neutral LOH regions);
-
Model 3: addition to Model 2 of the gained region detection.
Each of the three models is contained in the subsequent. The final model (Model 3) represents our algorithm for the estimation of both copy number changes and copy-neutral LOH regions and we call it genomic Bayesian Piecewise Constant Regression (gBPCR).
Model 1: relationship between LOH and copy number data
Although in nature the copy number is an integer, the raw copy number values detected by the microarray are usually continuous values, due to technical procedures. Moreover, the samples often contain also a percentage of normal cells.
It is common practice to treat copy number data in a log2ratio scale (where the ratio is defined with respect to a normal reference dataset) which makes the errors approximately normally distributed. Then, the copy number profile is estimated as a piecewise constant function (i.e. the genome is divided in regions of constant copy number), where the levels assume real values. For the purpose of our model, we estimate this profile by mBPCR, which is a Bayesian piecewise constant regression procedure [14]. It has been shown in [14] that this method outperformed well-known other methods on several datasets.
Commonly, in biomedical/cancer research, after estimating the log2ratio profile, the copy number aberrations are defined as those regions with values outside an interval around zero (notice that, in the log2ratio scale, zero represents CN = 2, i.e. a normal copy number). Often, the interval is a statistical confidence interval computed on the basis of the samples of the whole dataset.
In Model 1, our aim is to classify better the copy number changes, trying to reduce the number of false positives, by exploiting the relationship between copy number and LOH data.
Mathematical model of the biology mechanism
The aim of Model 1 is to obtain a better estimation of the true underlying copy number events, using both the information given by copy number and LOH data. In a genomic region, a copy number event is defined as a particular class of copy number values. The definition of the categories into which the copy number values are divided will follow from the description of the LOH data.
For the purpose of better identifying the copy number events, we can consider two classes of SNP values: Heterozygosity (Het) and Homozygosity (Hom). Thus, the LOH data are deduced from the genotyping data, by mapping the genotypes AA and BB into Hom and the genotype AB into Het, for all SNPs. The genotype calling algorithm (e.g. BRLMM [26]) is unable to distinguish between a homozygosity due to the presence of two equal nucleotides or the one due to the loss or high amplification of one of them. Hence, the presence of heterozygosity can ensure that the copy number is normal or gained with a high probability, while the homozygosity can be due to different events. It follows that there are only four relevant classes of copy number events that can be distinguished by looking at the LOH data. Therefore, if we call
the random variable which represents a copy number event at SNP i, it can assume only the following values:
-
= 2, when CN > 4 (amplification);
-
= 0, when 1 < CN ≤ 4 (normal or gain);
-
= -1, when CN = 1 (loss);
-
= -2, when CN = 0 (homozygous deletion).
The homozygous deletion corresponds to the loss of both copies of a genomic region. Ideally, the genotype calling algorithm should detect a NoCall genotype at the corresponding SNP position (i.e. it should not be able to identify the genotype of the SNP). Although not common, since cancer DNA samples usually contain a mixture of normal and tumor cells (with also different cancer cell subpopulations), the information given by the NoCall genotype can be used to better distinguish between a mono-allelic deletion and a bi-allelic (homozygous) deletion.
Therefore, three different LOH variables are present in the model: the true homozygous status in normal cells (XN), the homozygous status in abnormal cells (X), which is the consequence of copy number changes (in Model 1 we do not consider other biological events), and the homozygous status detected by the genotype calling algorithm (Y). The components of the first two random vectors can assume only values in
= {Het, Hom} and
* = {∅, Het, Hom}, respectively, and we suppose that they are independently distributed as Bernoulli random variable. The components of Y can assume values in
= {NoCall, Het, NHet} (NHet stands for "not heterozygous", since the genotype calling algorithm cannot distinguish between two equal nucleotides, i.e. homozygosity, and the loss of one copy).
A summary of the model can be found in Figure 1 and a summary of the notations is in Table 1. Ideally, at each SNP i, the homozygous status in abnormal cells X
i
is completely determined, given the corresponding value in normal cells
and the occurred copy number event
, by the following relations:
Nevertheless, the homozygous status of abnormal cells estimated by the genotype calling algorithm (Y
i
) is affected by several sources of errors.
Hypothesis of the model
The genome of cancer cells can be divided in subregions where the copy number is constant. Since we divided the copy number values in four classes (i.e. the copy number events), we can also consider regions with the same copy number event.
Let us consider a genomic region where the microarray measures the DNA copy number and the genotype at n SNP loci. Then, from the previous discussion, the vector of the copy number events at all positions
can be seen as a piecewise constant function. This function consists of k0 intervals with the same copy number event and with boundaries
so that
, for all p = 1, ..., k0. To estimate this function we use a Bayesian piecewise constant regression method, which determines the number of segments k0, the boundaries (
) and the copy number events
.
For any sample, we assume to have the homozygous status detected by the genotype calling algorithm (Y) and the profile of the log2ratio of the copy number estimated by mBPCR. The estimated log2ratio profile consists of
intervals with boundaries
and levels of the segments
(
is the estimated log2ratio in the pthinterval, for
). This estimated profile is used only to define the prior distribution of the random vector Z (see Subsubsection "Z prior definition"), while the LOH data Y are used to infer Z (the scheme of the algorithm is in Figure 2). Notice that we do not suppose to know XN, i.e. the homozygous status in normal cells. Moreover, we assume that, given the true value of the homozygous status in normal cells XNand the copy number event Z at each position, the LOH data points
are independent, since their values depend only on both noise and genotype detection errors.
The model implies that, given k0 and t0, the posterior distribution of
is
and thus, if we condition only with respect to the LOH data y, the posterior becomes
where
and
are the domains of k and t, respectively (they will be defined later).
To specify the model (see Figure 1), we need to define the likelihood, i.e. the conditional distribution of Y, given
and XN. To model it, we take into account all the variability that can affect the genotype detection, such as: the polymerase chain reaction (PCR) amplification, the presence of different cancer cell subpopulations or normal cells, and the amplification of only one copy. For example, the probabilities
and
are not zero, because of the error in the genotype detection even in case of a normal DNA sample. The probabilities
and
are related to the detection errors due to the presence of normal cells and/or different types of cancer cell subpopulations, or to PCR amplification errors, while
is related to the errors that can be due to the amplification of only one allele. Also
and
account for the errors that can be due to the presence of cell subpopulations.
The set of conditional probabilities
are considered as parameters of the model. To quantify them, we needed paired normal-cancer samples, since they are related to the probability of detecting a certain homozygous status in a cancer cell, given the corresponding one in a normal cell of the same patient and under some copy number event. Therefore, to compute maximum likelihood estimates of these parameters, we used 13 samples from an available cancer dataset consisting of breast cancer cell lines [27, 28] (see Section S.1 in Additional file 2, for further explanations).
To complete the Bayesian model, we need to define the prior distributions of the other random variables. For the parameters K and T, we consider distributions similar to the ones used in mBPCR [14]:
where
= {1, ..., kmax} and
is a subspace of
such that t0 = 0, t
k
= n and t
q
∈ {1, ..., n - 1} for all q = 1, ..., k - 1, in an ordered way and without repetitions.
The prior probabilities of heterozygosity of the SNPs
are the frequencies of heterozygosity computed on the samples of the matched race in the HapMap project [3, 29]. They are usually provided by the manufacturer in the documentation related to the microarray used. In Section "Results and Discussion", the microarray mostly employed is the GeneChip Human Mapping 250K NspI (Affymetrix, Santa Clara, CA, USA).
Z prior definition
The only prior that we have not yet defined is the one of Z. While the estimated levels of the log2ratio profile are continuous variables, Z classifies the copy number as discrete events. Then, the major problem consists in mapping the continuous values into the discrete values of Z, i.e. in defining a partition of the log2ratio values such that each interval corresponds to a particular copy number event.
In the literature, most methods determine a confidence interval around zero (which corresponds to CN = 2) and then consider all the log2ratio values above this interval as gains and all values below as losses (see, for example, [30, 31]). This method is not suitable in our case, since we want to classify also the events {CN = 0} and {CN > 4}. Looking at the histogram of the estimated log2ratio values (see, for example, in Figure 3 the histogram derived from the 14 HIV lymphoma cell lines in [32]), we can see that they have a multimodal density with peaks corresponding to CN = 1, CN = 2 and CN = {3, 4}. Sometimes, we can even separate the peaks of CN = 3 and CN = 4. Similarly to [33], we model this density as a mixture of normal distributions (a way to estimate this mixture density can be found in Section S.2 in Additional file 2). Once the parameters of the density are estimated, we can define a function to map the log2ratio values into the copy number event values:
where
are, respectively, the estimated mean and variance of the normal distribution corresponding to CN = cn.
From the definition of f
LOGtoZ
, for all p = 1, ...,
, we define the prior distribution of Z
p
as:
where cn represents all copy number information (both raw data and estimated profile by mBPCR) and M
p
is the random variable representing the log2ratio value in the pthsegment. From the mBPCR model, given cn, M
p
is normally distributed with mean
and variance
where (
,
) are the posterior mean and variance of M
p
estimated by mBPCR, respectively.
The estimation
To estimate the piecewise constant profile of the copy number events, we define the estimators of k0 (the number of segments) and t0 (the boundaries) similarly to the ones in the mBPCR method [14]:
Namely,
corresponds to the
positions which have the highest posterior probability to be a breakpoint. The main differences with respect to mBPCR are in the prior over K and in the estimation of K. Instead of using a uniform prior and an estimator which minimizes the posterior expected squared error, here we consider a prior similar to 1/k2 and an estimator which minimizes the 0-1 error, in order to reduce the false discovery rate (FDR) in case of few segments.
Another difference with respect to mBPCR consists in the level estimation. While in the copy number model the levels were continuous random variables, now they assume categorical values. Hence, they are estimated separately (as before) with the MAP estimator instead of the posterior expected value,
for p = 1, ...,
, where
and
are any estimate of t0 and k0, respectively. For the computation of all the posterior probabilities involved, we used dynamic programming as described in Section S.3 in Additional file 2.
Let us define y
ij
= (yi+1, ..., yj), representing the LOH data points in the interval [i + 1, j], and K
ij
as the random variable which represents the number of segments in the interval [i + 1, j]. Using Bayes' Theorem and the independence of the LOH data points belonging to different segments, the probability in Equation (4), given the LOH data y, can be written as,
Therefore, if the boundary estimator misses a clear boundary between
and
, then the probability at the denominator of Equation (5) could be zero and thus the level would not be estimated. The best way to prevent this event consists in using a good estimator for the boundaries.
Previously, in [14] we found that the boundary estimator
is an estimator with a high sensitivity, but medium FDR. The problem of this estimator is the following. The vector p of the posterior probabilities to be a breakpoint at each point of the sample usually represents a multimodal function with maxima at the breakpoint positions, but often in a neighborhood of each maximum there are other points with high probability because of the uncertainty (see Figure 4). Hence, if we take the first
points with the highest probability (according to the definition of
), we could take points in the neighborhood of the higher maxima and not some maxima with a lower probability (see Figure 4). As a consequence, if k0 was estimated with its exact value then the sensitivity of the
would be lower. In this case, we could lose important breakpoints so that the denominator in Equation (5) would become zero. In practice,
often slightly overestimates k0, because of the high noise of the data, and thus this phenomenon should not happen, but to prevent even this rare case we searched for a way to improve the estimation of the boundaries.
Since commonly the vector of the posterior probabilities shows clearly the position of the breakpoints in correspondence to the maxima, we estimate the number of the segments and the breakpoints with the number of peaks and the locations of their maxima, respectively (see Section S.4 in Additional file 2). Essentially, the algorithm for the determination of the peaks, after applying a kernel method to reduce the noise of the function, uses two thresholds: one for the determination of the peaks (thr1) and one for the definition of the values close to zero (thr2). Therefore, we will denote the corresponding estimators by
and
.
In Section "Results and Discussion", we will consider several pairs of thresholds and we will apply the corresponding estimators to simulated data, in order to determine the best paired thresholds and to compare their performance with
. We will also compare
with
, another boundary estimator described in [14].
Model 2: addition of the IBD/UPD region detection
LOH data are used in biology not only to better identify regions of loss and amplifications, but, especially, to detect regions of copy-neutral LOH, which can be identified by unusual long stretches of homozygous SNPs, with normal copy number. In Section "Background", we explained that this type of aberrations can be a consequence of UPD (either uniparental isodisomy or heterodisomy) or autozygosity (IBD regions). From the description of these genomic events, it follows that the uniparental isodisomy and the IBD regions can be detected because they appear as a long sequence of homozygous SNPs with a low probability to occur, while the uniparental heterodisomy consists in a sequence of both homozygous and heterozygous SNPs as in a normal condition. Therefore, without the genotypes of the parents, from SNP data we can only detect the uniparental isodisomy (iUPD) and the IBD segments. In the following, we will consider only these two events, referring to them as IBD/UPD events.
Since an IBD/UPD event, by definition, only exists in regions of normal copy number (CN = 2), the only probabilities which are affected by the presence of this event are those involving {Z = 0}. Therefore, we define the following sets of conditional probabilities
and
, where the variable
indicates if an IBD/UPD event occurred at SNP i (if it happened
= 1, otherwise
= 0). We can notice that, given {
= 0,
= 0}, the distribution of Y
i
is equal to the conditional distribution with respect to {
= 0} in Model 1, since the latter was modeled with no possibility of an IBD/UPD event. Instead, in case of an IBD/UPD event, we do not need to condition with respect to
, since, in case of a somatic iUPD event, the genotype of an iUPD region is independent of the homozygous status of the same region in a normal cell. Otherwise, in case of autozygosity or germ line iUPD, the genotypes of normal and abnormal cells are the same and it makes no sense to condition one to the other.
In the new framework, we define the vector of the aberration events at n SNP loci as
; here the aberrations regard both copy number changes and IBD/UPD regions. Each component
of the vector assumes values: -3 (
= 0 and
= 1, i.e. IBD/UPD event), -2 (
= -2, i.e. homozygous deletion), -1(
= -1, i.e. loss), 0 (
= 0 and
= 0, i.e. normal state or gain), 2 (
= 2, i.e. high amplification); a graphical representation of the model is given in Figure 5. As previously, we can divide the genome in intervals corresponding to the same aberration event, i.e the profile of the aberrations consists of k0 intervals, with boundaries
, so that
, for all p = 1, ..., k0. The estimation procedure is similar to the one of Model 1. The estimators of k0 and t0 are the same and, given
and
(any estimate of k0 and t0, respectively), we estimate the aberration events in each interval with their MAP estimators,
for p = 1, ...,
. Notice that, for w = -2, -1, 2, the posterior probability P(W
p
= w|Y,
,
, cn) is equal to P(Z
p
= w|Y,
,
, cn), while for w = -3, 0 we have,
and we assume that P(U
p
= 1) =: p
upd
, for all p = 1, ...,
.
Both
and p
upd
are parameters of the model. For the maximum likelihood estimation of
, we used 11 IBD/UPD regions previously found by us on 5 samples of patients with hairy cell leukemia [34] and on the B-cell lymphoma cell line KARPAS-422. All regions were detected by dChip [16] and their width was between 3 Mb and 100 Mb (covering from 300 to 9800 SNPs), so that they were large enough to be really considered IBD/UPD regions (for further explanations, see Section S.1 in Additional file 2).
Values for the parameter pupd
We expect the prior probability of an IBD/UPD event to be low. In order to estimate the order of magnitude of this parameter, we considered two studies on IBD regions: [6] and [3]. In the former, they considered as IBD regions only stretches of at least 50 homozygous SNPs (with at maximum 2% of heterozygous) longer than 4 Mb and the platform used was the Affymetrix GeneChip Human Mapping 50K Array. In the latter, a denser microarray was used and the stretches considered were longer than 1 Mb (with at least 50 probes) or longer than 3 Mb. Using the data of the former paper (only the normal samples), we estimated p
upd
≈ 1.7·10-3. Instead, with the data of the latter, we estimated p
upd
≈ 1.5·10-3 considering all regions greater than 1 Mb, while p
upd
≈ 1.46·10-4, considering only the regions greater than 3 Mb. The differences in the estimated values are due to the different resolution of the technologies used (in fact, in the former the number of SNPs used was 58,960, while in the latter it was 3,107,620). Moreover, the probability depends on the minimum length allowed for these regions. The wider the regions are, the higher is the probability that the regions represent "abnormalities" and the lower becomes the probability of their occurrence (so that p
upd
is lower). Therefore, in the following applications (see Section "Results and Discussion"), we will use two values: p
upd
= 10-3 and p
upd
= 10-4.
Another possible way to solve the problem could be to assign a prior distribution to p
upd
(for example, a uniform distribution over its range) and integrate it out in the equations of the model.
Model 3: addition of the gained region detection
In the description of Model 1, we explained our assumption that there is no difference in the genotype detection between a normal or gained region. Therefore, in Model 1 (and in Model 2), we defined a single class for the normal or gained regions. But, for the biological studies, it is relevant to distinguish these two copy number events and this distinction is based essentially on the estimated copy number (since there is no difference in the distribution of the detected genotypes, due to the previous discussion). As a consequence, the probability of Y
i
given a normal (i.e. {
= 0}) or gained copy number (i.e. {
= 1} = {
= 1}) is the same,
We also need to define two distinct prior probabilities for the normal copy number and the gain event. Similarly to its previous definition, for all
, the new prior of Z
p
is given by,
In the following, Model 3 (which is the complete model) will be called genomic Bayesian Piecewise Constant Regression (gBPCR).
Adjustment of the parameters related to NoCall
The probabilities {P(Y
i
= NoCall|
= x,
= w), x ∈
, w = -3, -2, -1, 0, 2} are related to the detection of NoCall s under some conditions. Generally, the presence of NoCall s is not only due to diffculties of the genotype calling algorithm in the detection of the genotype (technical noise) but also to the noise of the sample because of differences in the quality of extracted DNA. Therefore, we need to adjust the estimated values of these parameters on the basis of the sample noise.
Since usually the NoCall rate (i.e. percentage of NoCall s in the sample) increases with the noise of the sample, we assume that, given {
= x,
= z}, the probability of detecting a NoCall at SNP i in sample s is proportional to a parameter p
x,z
(which depends on the technical noise) by a factor θ
s
(which depends on the sample noise),
If we condition over the values of
and estimate P(
= Het) = 1/2 for a generic SNP i (by considering a uniform distribution over the four possible combinations of alleles AA, AB, BA, BB), we can compute the NoCall rate in regions with copy number event z in the following way,
Therefore, by applying Equations (9) and (11), for any pair of samples (Sample 1 and 2), we can write the conditional probability of NoCall, given {
= x,
= z}, in Sample 1 in terms of the corresponding probability in Sample 2,
In the following, we will denote the sample to estimate with s = 1 and the reference sample with s = 2.
Using Equation (12), the values of the parameters related to NoCall detection are adjusted for Sample 1,
for z = -2, -1, 0, 2, where r1(z) and r2(z) are an estimate of the NoCall rate in regions with copy number event z, for Sample 1 and 2, respectively. By applying Equation (10) with P(
= Het) = 1/2, r2(z) can be computed from the estimated values of P(Y
i
= NoCall|
= Het,
= z) and P(Y
i
= NoCall|
= Hom,
= z)
for z = -2, -1, 0, 2. r1(z) is the frequency of NoCall in regions with copy number event z of Sample 1, for z = -2, -1, 0, 2.
The estimated value of the probability P(Y
i
= NoCall|
= -3) is adjusted in a different way. On the reference samples, we found, as expected, that
that is, the NoCall rate in IBD/UPD regions is approximately equal to the NoCall rate in normal regions. Therefore,
In Section "Results and Discussion", we will compare the estimations resulting from gBPCR with and without the adjustment of these parameters.