Conditional random pattern model for copy number aberration detection
© Li et al; licensee BioMed Central Ltd. 2010
Received: 30 May 2009
Accepted: 22 April 2010
Published: 22 April 2010
Skip to main content
We’re sorry, something doesn't seem to be working properly. Please try refreshing the page. If that doesn't work, please contact us so we can address the problem.
© Li et al; licensee BioMed Central Ltd. 2010
Received: 30 May 2009
Accepted: 22 April 2010
Published: 22 April 2010
DNA copy number aberration (CNA) is very important in the pathogenesis of tumors and other diseases. For example, CNAs may result in suppression of anti-oncogenes and activation of oncogenes, which would cause certain types of cancers. High density single nucleotide polymorphism (SNP) array data is widely used for the CNA detection. However, it is nontrivial to detect the CNA automatically because the signals obtained from high density SNP arrays often have low signal-to-noise ratio (SNR), which might be caused by whole genome amplification, mixtures of normal and tumor cells, experimental noise or other technical limitations. With the reduction in SNR, many false CNA regions are often detected and the true CNA regions are missed. Thus, more sophisticated statistical models are needed to make the CNAs detection, using the low SNR signals, more robust and reliable.
This paper presents a conditional random pattern (CRP) model for CNA detection where much contextual cues are explored to suppress the noise and improve CNA detection accuracy. Both simulated and the real data are used to evaluate the proposed model, and the validation results show that the CRP model is more robust and reliable in the presence of noise for CNA detection using high density SNP array data, compared to a number of widely used software packages.
The proposed conditional random pattern (CRP) model could effectively detect the CNA regions in the presence of noise.
Detection of copy number aberrations (CNA) using single nucleotide polymorphism (SNP) array data or Array comparative genomic hybridization (CGH) data is becoming important in disease pathogenesis analysis [1–6]. For example, CNA may result in suppression of anti-oncogenes and activation of oncogenes, which would cause certain types of cancers [1, 7, 8]. Disease related CNAs not only indicate the molecular-level pathogenesis, but also can be used as biomarkers for diagnosis. For example, Myelodysplastic syndromes (MDS) are a group of clonal hematopoietic disorders, which are considered as clonal stem cell diseases characterized by peripheral cytopenias (anemia, neutropenia, and/or thrombocytopenia) with normocellular or hypercellular marrow and bilineage or trilineage dysplasia [9–13]. Early diagnosis with appropriate treatment may lead to improved prognosis, however, there is no accurate diagnosic method at the early stage of MDSs because the morphological appearances are highly variable and not specific to the MDSs [9–13]. Using the high density SNP arrays, the molecular-level biomarkers of MDSs may be detected, and are helpful for the MDS early diagnosis and treatment.
To detect the CNA regions using high density SNP arryas, automated SNP array analysis method is needed. However, it is nontrivial to detect the CNA automatically because the signals obtained from high density SNP arrays often have low SNR values, which may be caused by whole genome amplification, mixture of normal and tumor cells, experimental noise and other technical limitations. With the reduction in SNR, many false CNA regions are often detected and true CNA regions are missed. Thus, more sophisticated statistical models are needed urgently to make the CNAs detection robust and reliable using the signals with low SNR, although a number of software packages have been developed for the SNP array analysis. For Affymetrix SNP array, the widely used software packages are Genotyping Console , GEMCA , CNAG [16, 17] and dChip [18, 19] and Birdsuite [20, 21]. For Illumina SNP array, PennCNV [22, 23], QuantiSNP [24, 25], GenoCN [26, 27], SOMATICs [28, 29] and OverUnder [30, 31] have been developed. For array-CGH data, aCGH (Microarray-based comparative genomic hybridization) , CLAC (Clust Along Chromosomes) , CBS (Circular binary segmentation) [34, 35] and GLAD (Gain and loss analysis of DNA) , and many other CNA detection algorithms have been developed . Usually there are two types of copy number analysis: one is the CNA, the other is copy number variation (CNV) analysis. The CNVs naturally happens in normal tissue and are inheritable, while the CNA are acquired somatic alterations and often observed in disease tissues, which also tend to be longer and more densely occur in the genome [23, 27]. Most of the abovementioned software packages could detect both CNVs and CNAs, whereas some of them may incorporate more information to improve their performance specifically for CNV or CNA detection. For example, the SOMATICs and GenoCNA incorporate the normal tissue contamination information for better CNAs detection.
GEMCA is a good software package for copy number variants detection in HapMap data, which can combine both Nsp and Sty arrays to detect small copy number variant regions. It is stated in the website of this software that it is not suitable for cancer analysis that has much larger copy number changes, and the new software programs are being developed for cancer copy number analysis [3, 38, 39]. For Genotyping Console, CNAG  and dChip [19, 40], although different pre-processing techniques (e.g. normalization, scaling and feature extraction) are used, they all use HMM framework  in the second tier to infer the copy number. However, the major limitation of the HMM framework lies in the simple assumption that the current state is only determined by the immediate previous state and the current observation . Due to this assumption, the noisy SNP array data often results in inaccurate copy number inference. In Birdsuite software package, the Birdseye method, which is a HMM, is implemented to find CNV regions [20, 21]. The QuantiSNP, PennCNV and GenoCN all make use of the HMMs on the two dimensional data: Log R Ratio (LRR) and B allele frequency (BAF) [23, 24]. Compared to QuantiSNP, PennCNV improved the transition probabilities in HMM, BAF distribution, accuracy of likelihood of copy number genotype modelling, and added the family-based analysis . In the GenoCN software, the CNV and CNA detections are processed by two different modules: GenoCNV and GenoCNA . The GenoCNA incorporates the normal tissue contamination and genotype data from normal tissue to improve the CNA detection accuracy . SOMATICs uses the two-band pattern of BAF to detect the CNAs of samples that mix the normal and tumor cells . OverUnder detects the CNA by dividing the 2D square of the B allele frequency (BAF) and the Log R Ratio (LRR) into different regions that correspond to the loss, gain, amplification . aCGH uses the unsupervised HMM. CBS (circular binary segmentation) makes use of the maximum of a likelihood ratio statistic recursively to separate SNP sequence into small segments. In GLAD method, an adaptive weights smoothing procedure is used to estimate the means of sequence segments. In CLAC the hierarchical clustering method is used to detect the CNA regions. CBS, GLAD and CLAC only estimate the means of sequence segments, rather than give the exact copy number of each segment.
In this study, we present a novel conditional random pattern (CRP) model for CNA detection, in which more contextual information of neighboring SNP loci is considered, compared to HMM, to suppress the noise and improve the accuracy of CNA detection. Specifically, in the CRP model, the copy number of a SNP locus is not only determined by the copy numbers of its two immediate neighboring SNP loci, but also by a continuous segment of observations (log2-ratio features), thus allows us to employ more contextual cues. The rest of this paper is organized as follows. The details of the CRP model are described in Section 2. Section 3 provides the experimental validations, and the discussion and conclusions are presented in Section 4.
To validate the proposed CRP model, we compared the proposed CRP model with dChip, CNAG and PennCNV-Affy, and four widely used copy number inference software packages: aCGH, CBS, CLAC, and GLAD, using both simulated data and real data.
To validate the proposed CRP model, three real data sets were employed. The first one is from the MDSs samples in our laboratory. 12 cryopreserved bone marrow samples from MDS patients were analyzed. Please see the detailed samples' information in the additional file 1. The second one is the array-CGH data used in Lai's 2005 paper . We also downloaded the HapMap samples' 500 K SNP array data from the Affymetrix website http://www.affymetrix.com/support/technical/sample_data/500k_hapmap_genotype_data.affx. Two HapMap samples, NA10851 (as reference sample) and NA18515 (as test sample) were used to test the CRP model, in which some CNA regions were validated by quantitative PCR in [3, 4].
Since there is no ground truth of the CNA regions in these data, it is difficult to quantitatively measure the performance of different software packages. Therefore, two more simulated data sets were created. We simulated Affymetrix GeneChip Human Mapping 500 K Array data in the Affymetrix's CEL file format, which is based on the real 500 K SNP arrays of HapMap samples from the Affymetrix website, therefore both dChip and CNAG (and other software packages) can process them. The simulation process of these data is as follows. First, we randomly selected three arrays as the normal reference samples: "NA10851_FinNsp_vR1_579813_A1_1_SC2", "NA12812_FinNsp_vR1_579813_B2_1_SC6" and "NA18605_FinNsp_vR1_579548_D5_1_SC3". Secondly, we randomly set one CNA region in each chromosome, and a total of 22 CNA regions were obtained (Chromosome X was not set). The length of these CNA regions varies from 4 to 100 SNPs uniformly. Thirdly, for each reference sample and certain noise level, two simulated arrays are generated, one for copy number deletion (one copy), and the other for copy amplification (three copies). The mismatch probes are used as the background to estimate the simulated intensities of the corresponding perfect match probes in these CNA regions. The intensity of probes outside the CNA regions remains unchanged. Then the Gaussian noises are generated and added to all probes, which follow a Gaussian distribution, N(0, σ), where the standard deviation of noise σ is proportional to the probe intensity y. The signal to noise ratio (SNR) SNR = y/ σ varied from 5, 2 to 1.25, to simulate different noise levels. A total of 18 samples were simulated based on the three selected HapMap samples (2 simulations per SNR level per sample * 3 SNR levels * 3 HapMap samples).
We also simulated log2-ratio sequences. For each noise level, we simulated 100 log2-ratio sequences, and each sequence contains 300 log2-ratio data points. Four CNA regions with 5, 10, 20 and 40 length were created. In the CNA regions, the mean was set as 0.4, and outside the CNA regions the mean was zero. Then the Gaussian noise N(0, σ) was added. Three noise levels were considered, and the SNR (0.4/σ) varied from 2, 1.3 to 1.
To measure the performance of copy number inference software packages, seven metrics were used: snp-precision (sp), snp-recall (sr), region-precision (rp), region-recall (rr), hybrid-precision (hp), hybrid-recall (hr), and f-score (f). Given a contingency Table, as seen in Table 1, the seven metrics are defined as follows.
The Contingency Table
CNA SNP Loci
Normal SNP Loci
To evaluate the performance of the CRP model, we compared it with a number of widely used software packages: dChip, CNAG and PennCNV-Affy. In August 2009, PennCNV provided the PennCNV-Affy protocol to calculate the LRR and BAF signals from Affymetrix SNP arrays, and then make the CNA detection.
The proper comparisons between different algorithms are not trivial because of different possible parameter settings in each algorithm. In CNAG, users can manually set the means of the HMM for each copy number status. To make the comparison fair, we tried to find the best performance of the CNAG by testing a few different parameter settings. Then we compared the CRP model with the best performance of CNAG. Since we simulated the three-copy amplifications and one-copy deletions in the simulated 500 K SNP array data, we only need to set the means of three-copy, MCN3 = 0.38, and one-copy, MCN1 (the mean of two-copy is set as zero) in CNAG. We tested following five different parameter settings of HMM in CNAG: 1) 'Automatic' - the parameters are set by the CNAG automatically. 2) 'Ideal' - (MCN3 = 0.38, MCN1 = -0.45). We estimated the means in the 'Ideal' by calculating the means of the three-copy amplifications and one-copy deletions in the simulated data respectively. 3) 'Random_1' - (MCN3 = 0.38, MCN1 = -0.42). 4) 'Random_2' - (MCN3 = 0.4, MCN1 = -0.5). 5) 'Random_3' - (MCN3 = 0.45, MCN1 = -0.55). Figure 1 shows the performances of CNAG on the simulated 500 K SNP arrays with the five different parameter settings. As we can see, the CNAG generated 'best' results in the 'Random_2' settings. Then we compared the CRP performance with the 'best' results of the CNAG. The detailed CNA detection results of the CRP model and that of the CNAG, with the 'Random_2' parameter setting, are provided in Table 2 and Table 3 respectively. Figure 2 provides the performance comparison between the CRP model and CNAG with optimal parameter setting. Obviously, the CRP model outperforms the CNAG in all three SNR levels and all 18 simulated 500 K SNP samples significantly.
CNA detection results of CRP model on the simulated SNP arrays
CNA detection results of HMM (in CNAG) on simulated SNP arrays
A number of CNA detection algorithms have been developed for array CGH data . To further evaluate the performance of the CRP model, we compared the CRP model with four widely used CNA detection methods: aCGH , CLAC , CBS [34, 35] and GLAD .
In this study, we proposed a CRP model for CNA detection, which explores more contextual cues to generate more accurate results than existing methods. The experimental results using both simulated data and real MDS SNP array data show that the CRP model is more robust and reliable compared to a number of widely used CNA detection methods.
For different studies and different noise levels, different number of nodes in the CRP model may be required. More nodes mean more contextual information to be considered, and can suppress higher noise levels, while it is prone to missing of small CNA regions. Fewer nodes will be more sensitive to small CNA regions, and as such be more sensitive to noise. To make the model adjustment convenient, we implemented the algorithm with the number of nodes as a user-input parameter. Thus, it could automatically generate a CRP model with any user-specified number of nodes. The basic idea of generating the CRP model is as follows. First, assume that the shortest CNA region has m SNP markers. Then select m-1 nearby SNP markers respectively from both the left and right of the current SNP marker. As a result, totally 2m-1 (2(m-1)+1) SNP markers are selected. In these 2m-1SNP markers, there are m possible CNA regions with m consecutive SNP markers, i.e. drawing m consecutive SNP markers out of the total of (2m-1) SNP markers. The rest may be deduced by analogy. There are m-1 possible CNA regions with m+1 consecutive SNP markers, and to the end, there is only one CNA region with 2m-1 SNP markers. Therefore, there are totally possible CNAs patterns out of these 2m-1 SNP markers. In this study, we assume that m = 4, then there are 10 possible CNA patterns. In a word, it is convenient for users to test the CRP model with different parameters on their own data.
The parameters in the CRP model were estimated based on some pre-known CNA regions. The users can adjust these parameters according to their specific SNP array data. Also the CRP model gives the probability of each SNP locus staying in each copy number status. If too many CNA regions are detected (may be caused by the bias of the parameters), the users can remove some CNA regions with low probability. Currently, the Haldane's map function is employed as the transition potential in the CRP model. For different diseases and different populations, some prior information of the copy number change can be included into the transition potential. For example, if certain regions of the genome are known to the 'recombination hotspots', then we can set the transition possibility from one copy number to the other higher. Therefore, the future extension of CRP model will consider the SNP-specific copy number aberration rates, and automatically adjust the structure (number of nodes) of the CRP model in running the program.
In the second tier, the statistical model borrows the contextual information to suppress the noise. In this section, we propose a CRP model that extends the use of copy number dependency.
where , and d (unit: 1 Million base-pairs) is the genome distance between two consecutive SNP loci. This function is used to take into account of distances between markers, and indicates that the nearby SNPs are more likely to have the same copy number, whereas the distant SNPs do not . For different diseases, the prior information of the copy number change can be included into the transition potential. The genome distance information can be found at the Affymetrix website http://www.affymetrix.com/index.affx.
where , is the probability of copy number y t emitting the observation . The ten terms in the right of Equation (4) are derived from the above ten combinations. In combination A, for example, the observations, xt-3, xt-2, xt-1and x t , are emitted from the same copy number status, whereas observations, xt+1, xt+2and xt+3, are generated by another and different copy number status. Therefore we should use xt+3, xt+2, xt+1and x t to estimate the supporting evidence for y t as p(xt-3, xt-2, xt-1, x t | y t ). Under the assumption of independence of observations, we deduce that . The other terms can be explained similarly. Since these ten terms in Equation (4) are not in the same scale, we re-scale them using the indexical transformation. Suppose the real copy number status of seven continuous SNPs is the same as the combination A, then the likelihood functions, p(xt-3| y t ), p(xt-2| y t ), p(xt-1| y t ) and p(x t | y t ), will be much larger than p(xt-1| y t ), p(xt-2| y t ) and p(xt-3| y t ). In other words, combination A yields higher evidence score than combinations B, C and D. Combination E, F, G, H, I and J consider the case that the CNA regions contain more SNPs. We reason that the re-scaled term that corresponds to the real underlying copy number status yields the maximum support evidence (likelihood) value. This is why we choose the maximum value as the local evidence.
The copy number sequence can be obtained by tracing back from δ T (y) to δ1(y). The normalized probability of the best labeling is given by max y δ T (y)/Z(x). We implemented the CRP model using Matlab 7.0 based on the conditional random field toolbox, CRFall, written by Kevin Murphy, which can be found at: http://www.cs.ubc.ca/~murphyk/Software/CRF/crf.html. The CRP model has the same asymptotic computational complexity as HMM.
This work was partially supported by TMHRI scholarship award, IBIS and NIH R01LM010185-01 grants. The authors would like to thank Dr Lingyun Wu in the Bioinformatics Program, The Methodist Hospital Research Institute for their discussion and advice in this research.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.