A forwardbackward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using highdensity single nucleotide polymorphism (SNP) array
 Tianwei Yu^{1}Email author,
 Hui Ye^{2, 3},
 Wei Sun^{4},
 KerChau Li^{4},
 Zugen Chen^{5},
 Sharoni Jacobs^{6},
 Dione K Bailey^{6},
 David T Wong^{7} and
 Xiaofeng Zhou^{2, 8}Email author
DOI: 10.1186/147121058145
© Yu et al; licensee BioMed Central Ltd. 2007
Received: 16 October 2006
Accepted: 03 May 2007
Published: 03 May 2007
Abstract
Background
DNA copy number aberration (CNA) is one of the key characteristics of cancer cells. Recent studies demonstrated the feasibility of utilizing high density single nucleotide polymorphism (SNP) genotyping arrays to detect CNA. Compared with the twocolor arraybased comparative genomic hybridization (arrayCGH), the SNP arrays offer much higher probe density and lower signaltonoise ratio at the single SNP level. To accurately identify small segments of CNA from SNP array data, segmentation methods that are sensitive to CNA while resistant to noise are required.
Results
We have developed a highly sensitive algorithm for the edge detection of copy number data which is especially suitable for the SNP arraybased copy number data. The method consists of an oversensitive edgedetection step and a testbased forwardbackward edge selection step.
Conclusion
Using simulations constructed from real experimental data, the method shows high sensitivity and specificity in detecting small copy number changes in focused regions. The method is implemented in an R package FASeg, which includes data processing and visualization utilities, as well as libraries for processing Affymetrix SNP array data.
Background
Most human cancers are characterized by genomic instabilities. Indepth knowledge of genomic aberrations has important clinical values in diagnosis, treatment, and prognostics of cancer [1]. Genomic aberrations can be analyzed using a variety of highthroughput genetic and molecular technologies, such as arraybased comparative genomic hybridization (arrayCGH) [2] and SNP arraybased copy number analysis [3]. A number of methods have been developed to perform smoothing and/or to detect edges of segments containing one consistent copy number [4–23], some of which were compared and summarized by Lai et al. and Willenbrock et al. [24, 25].
Highdensity array platforms, e.g. SNP array, provide the opportunity to identify genomic aberrations that localize to small segments of the chromosome, which we refer to as focused CNA in this paper. To analyze the DNA copy number of a disease sample, the matched normal DNA can be used as a reference for the computation. While this approach yields relatively low noise, such a matched normal DNA sample is often unavailable. By using the existing SNP array data libraries derived from large numbers of normal samples, disease samples can now be analyzed without paired normal samples [14, 26]. However, proper handling of the data is necessary to lower the noise and avoid identifying large numbers of falsepositive CNA segments. One way to achieve this goal is to reduce noise at the probe level, by selecting probes based on dose response to copy number change [26] or sequence properties [27]. Another approach is to apply data smoothing and segmentation methods with high sensitivity and specificity. While most methods designed for arrayCGH data can potentially be applied, their parameters may need to be finetuned to adapt to the different characteristics of the SNP array data. Here we present a testbased data segmentation method. In our algorithm, each chromosome is first broken into small segments through an oversensitive edge detection mechanism. The consecutive segments are then iteratively merged by local testing, using a forwardbackward edge selection scheme, until all remaining edges pass a significance threshold. The data sets used in this study were generated with Affymetrix GeneChip^{®} Mapping 50 K Xba arrays on two model cell lines with known genomic alterations and two tumor DNA samples of oral squamous cell carcinoma.
Results and discussions
The SNP array results on two model cell lines were generated as described in the Methods section for the development and testing of our algorithm. The cell lines used here were GM03226 with a known trisomic aberration segment in chromosome 9 [9pter > q11], and GM00870 with a known single copy deletion segment in chromosome 9 [9pter > p21]. The data was first processed with Copy Number Analysis Tool (CNAT 3.0) from Affymetrix Inc, which utilizes Huang et al.'s method to estimate SNPlevel copy numbers based on libraries of normal samples [14]. We chose CNAT because of its widespread use for the analysis of SNP array data. Better data preprocessing methods [26, 27] may lead to better results than reported here. Following the CNAT process, the SNPlevel copy number values were log_{2} transformed to achieve nearnormal distributed copy numbers. The mean and standard deviation (SD) for the signals from one, two and three copies were defined based on knowledge of the cell lines. We found that normal twocopy DNA yielded a mean of 1.03 and SD of 0.77; singlecopy DNA yielded a mean of 0.25 and SD of 0.63; and threecopy DNA yielded a mean of 1.45 and SD of 0.91. Compared to singlecopy DNA, threecopy DNA has a mean that is closer to twocopy DNA, and a larger standard deviation. Hence threecopy aberrations are harder to detect than singlecopy aberrations.
Parameters tested for the seven R packages
Packages  Parameters tested  

Tuned packages  Tuning parameter  Values tested  Other parameters  
FASeg  Sig  0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0001, 0.00001  Default  
aCGH  Vr  10, 7, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, 0.001  Default  
DNAcopy  alpha  0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.005, 0.001,0.0005, 0.0001  * nperm = 1000  
GLAD  qlambda  0.75, 0.9, 0.925, 0.95, 0.975, 0.99, 0.9925, 0.995, 0.9975, 0.999  ** lambdabreak = 0.01 lambdacluster = 0.01 lambdaclusterGen = 0.01 param = c(d = 1)  
Packages examined at default setting  Picard  Maxk = max(true segment size) + 5, maxSeg= #(true segments) + 1  
RJaCGH  *** burnin = 50, *** TOT = 500, jump.parameters = NULL, k.max = #(true states) + 1  
BioHMM  Default 
Comparison of computing time*
CPU time (seconds)  

FASeg  181 
aCGH  107 
DNAcopy  18 
GLAD  98 
Picard  101 
RJaCGH  13778 
BioHMM  1619 
When the intent is not finding focused CNA, or there is a strong prior belief that the CNA segments are not focused, more stringent pvalue cutoffs should be used. On the other hand, if the intent is to identify focused CNA, less stringent cutoffs have to be used with the risk of identifying false CNA segments. However, the problem of finding falsepositive segments is not as severe as it seems in most applications, where multiple samples are analyzed to identify CNAs overrepresented in a subgroup of samples. In such applications, after data segmentation, crosssample testing is performed to find CNA segments consistent across many samples. False segments at random locations will most likely be inconsistent across samples, hence insignificant in the crosssample testing.
Conclusion
In summary, we presented an algorithm to find break points in copy number data. It consists of an oversensitive edge detection step and a testbased segment merging step. After the oversensitive edge detection step, the segmentation task becomes a modelselection task. In the forwardbackward model selection, by using the common segmentwise Gaussian assumption, the backward step is reduced to a manageable local search. The model makes no assumption about the number of CNA states in a chromosome. Thus it provides the flexibility to handle multiple CNA states in a single chromosome, which is important in the analysis of complex cancer samples. In the implementation of this algorithm into a userfriendly R package (FASeg), we optimized the parameters for identifying focused CNA in noisy data. In simulation studies based on real data, FASeg was sensitive to CNA segments that were small in size and low in signaltonoise ratio. It performed well when presented with complex samples with multiple CNA states per chromosome. From the users' stand point, FASeg is intuitive and easy to tune.
Methods
The forwardbackward fragmentassembling algorithm
The arraybased copy number data consists of a series of N observations {(X_{1}, Y_{1}), ..., (X_{ N }, Y_{ N })} for each chromosome, in which X_{ i }'s are positions along the chromosome and Y_{ i }'s are log_{2}ratios in the aCGH data or processed copy numbers from SNP array data. Between any twoconsecutive edges, which remain to be identified, we assume a local constant model with Gaussian error. We apply a two step method for the identification of edges. The first step aims to identify most true edges, at the price of identifying possible falsepositive edges. This step is a high sensitivity and low specificity step. In the second step, the goal is to remove the falsepositive edges, while retaining the true edges through statistical model selection. Figure 1 shows an illustration of the workflow.
Step 1. Oversensitive edge detection
To identify all possible edges, we apply an ad hoc method, which is based on onedimensional differential edge detection. At this step, the actual X values are ignored, because the main interest is to find copy number changes between data points. (1) To reduce noise, a locally weighted regression smoother (LOESS) with Gaussian kernel is fitted through the data Y_{1},....., Y_{ N }to generate fitted values Y_{1}',....., Y_{ N }'. (2) An edge is identified in every maximal monotone increasing/decreasing segment in the LOESS fitted curve. The edge position is assigned between the two observations that span the medium height of the segment (Figure 1). If the height of the segment is below a predefined threshold value, the edge is removed. The threshold value should be set such that copy number changes at or below that level is ignorable. The default value in the FASeg package is 0.1.
Step 2. Forwardbackward edge reduction
After step 1, the data is overly fragmented into small pieces. The next step is to merge the fragments by statistical testing. The task amounts to a model selection problem with a large number of candidate predictors (edges). The full model is the model that contains all the edges identified in Step 1. We resort to the forwardbackward scheme to quickly reduce the full model to a smaller model containing fewer edges. In the segmentwise constant model, the removal of an edge only affects the likelihood of the data points between the previous and the next edge. Thus a local ANOVA test, or unpaired ttest, is equivalent to the likelihood ratio test for model selection. Also, the removal of an edge only changes the significance of the two neighboring edges. Thus in the backward selection step, only previously removed edges within the segment confined by the two neighboring edges need to be reexamined (Figure 1, green dots).
We first define pvalues for all the edges. For each edge, we consider the observations between the previous and the next edges. These observations are spatially divided into two groups by the edge of interest. The unpaired Student's ttest is performed to find the significance of the division, and the pvalue is associated with the edge. Second, when the pvalues for all the edges are defined, we iteratively remove edges from the least significant one. With the removal of each edge, all previously removed edges around this edge are reexamined. For example, if edge i is being removed, with α edges immediately before edge i and b edges immediately after edge i having been previously removed, then for each edge j ∈ [i  a, i) ∪ (i, i + b], we recompute its pvalue after the removal of edge i. If the lowest of the pvalues is smaller than that of edge i, the corresponding edge is reinstated. This process is iterated until all remaining pvalues are smaller than a cutoff value. This pvalue threshold can be userdefined and may be finetuned based on each sample to get the best balance between sensitivity and specificity. After the edge identification, for the segment between two consecutive edges, the median value of Y is taken as the estimate.
DNA samples and the SNP array mapping assay
SNP array data was generated on 2 model cell lines (GM03226 and GM00870) with known genomic alterations and 2 previously uncharacterized oral squamous cell carcinoma samples (CZ T26 and CZ T322). Each sample was analyzed using one array. The model cell lines were obtained from Coriell Cell Repositories/NIGMS [30]. GM03226 are fibroblasts with a trisomic segment in chromosome 9 [9pter > q11], and GM00870 are fibroblasts that are known to have a single copy deletion segment in chromosome 9 [9pter > p21]. DNA labeling, hybridization, washing and staining of the Mapping 50 K Xba arrays were performed according to the standard GeneChip Mapping 100 K Assay protocol (Affymetrix). The arrays were scanned using a GeneChip Scanner 3000. The scanned array images were processed with GeneChip Operating software (GCOS) 1.3. The genotype calls and intensity of the SNP probes were generated by GeneChip DNA Analysis Software (GDAS) 1.4. The probelevel intensities were further converted to SNP level intensities using CNAT 3.0.
Simulation based on the real data
The simulation data was generated based on the model cell lines GM03226 and GM00870. We obtained pools of SNPlevel copy number values for single, two, and threecopy DNA. By resampling from these pools, we constructed copy number readings of the simulated chromosome. Each simulated chromosome contained 11 segments. Probesets were evenly spaced. Thus we use the number of probesets to represent the segment size. Starting from the normal segment, the chromosome construction alternated between normal segments and CNA segments. Six normal segments and five CNA segments were simulated for each chromosome. Within each simulated chromosome, a single normal segment size and a single CNA segment size were used. Two normal segment sizes (40 and 200 SNPs) and six CNA segment sizes (15, 20, 30, 40, 60 and 80 SNPs) were tested. Three settings of aberration levels were tested: (1) all five CNA segments in the chromosome represented singlecopy DNA, (2) all five CNA segments were threecopy, (3) the five CNA segments were a mixture of singlecopy, threecopy and segments of highermagnitude copy number changes. No real data was available for the highermagnitude CNA. Because such segments were easier to detect, and some deviation from the truth would not severely affect the results of performance comparison, we simulated them by adding constants to the singlecopy and threecopy pools. Three new pools were created. Pool L1 was created by moving the median of the singlecopy pool to log_{2}(0.5) to mimic copy numbers lower than one. Pools H1 and H2 were created by moving the median of the threecopy pool to log_{2}(4) and log_{2}(5) respectively, to mimic copy numbers higher than 3. In the simulated chromosome, the five CNA segments were drawn from the threecopy pool, the singlecopy pool, H1, L1, and H2 respectively. For each of the 2 × 6 × 3 settings, 100 chromosomes were simulated.
At each simulation parameter setting, to assess the ability of the algorithm to identify CNA segments while limiting the number of falsepositive edges, we plotted the sensitivity, which is the proportion of true edges identified, against FDR, which is the proportion of false edges among all identified edges. Because of the high noise level in the data, we allowed a tolerance distance when matching true edges with identified edges. The tolerance distance is defined based on the number of SNPs. If an identified edge is equal to or less than the tolerance distance away from the true edge, we considered the true edge to be correctly detected. The results reported in Figure 2 and 3 were obtained using the tolerance distance of 5 SNPs. The results from using the tolerance distances of 3 and 7 SNPs were reported in the Supplement figures 1 and 2 [see Additional file 1]. In the ideal case, the sensitivity should approach one and the FDR should approach zero.
Abbreviations
 SNP:

single nucleotide polymorphism
 CNA:

copy number aberration
 FASeg:

fragment assembling segmentation
 CGH:

comparative genomic hybridization
 aCGH:

array comparative genomic hybridization
 CNAT:

Copy Number Analysis Tool
 SD:

standard deviation
 GCOS:

GeneChip Operating System
 GDAS:

GeneChip DNA Analysis Software.
Declarations
Acknowledgements
This work was supported in part by NIH PHS grants DE015970 (to D. Wong), and DE014847, DE016569, CA114688 (to X. Zhou). We thank Ms. Katherine Long for excellent editorial assistance. The constructive comments from four anonymous referees helped improve the manuscript significantly.
Authors’ Affiliations
References
 Zhou X, Yu T, Cole SW, Wong DT: Advancement in characterization of genomic alterations for improved diagnosis, treatment and prognostics in cancer. Expert Rev Mol Diagn 2006, 6(1):39–50. 10.1586/14737159.6.1.39View ArticlePubMedGoogle Scholar
 Pinkel D, Albertson DG: Comparative genomic hybridization. Annu Rev Genomics Hum Genet 2005, 6: 331–354. 10.1146/annurev.genom.6.080604.162140View ArticlePubMedGoogle Scholar
 Zhou X, Mok SC, Chen Z, Li Y, Wong DT: Concurrent analysis of loss of heterozygosity (LOH) and copy number abnormality (CNA) for oral premalignancy progression using the Affymetrix 10 K SNP mapping array. Hum Genet 2004, 115(4):327–330. 10.1007/s0043900411631View ArticlePubMedGoogle Scholar
 Daruwala RS, Rudra A, Ostrer H, Lucito R, Wigler M, Mishra B: A versatile statistical analysis algorithm to detect genome copy number variation. Proc Natl Acad Sci USA 2004, 101(46):16292–16297. 10.1073/pnas.0407247101PubMed CentralView ArticlePubMedGoogle Scholar
 Eilers PH, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 2005, 21(7):1146–1153. 10.1093/bioinformatics/bti148View ArticlePubMedGoogle Scholar
 Khojasteh M, Lam WL, Ward RK, Macaulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics 2005, 6(1):274. 10.1186/147121056274PubMed CentralView ArticlePubMedGoogle Scholar
 Kim SY, Nam SW, Lee SH, Park WS, Yoo NJ, Lee JY, Chung YJ: ArrayCyGHt: a web application for analysis and visualization of arrayCGH data. Bioinformatics 2005, 21(10):2554–2555. 10.1093/bioinformatics/bti357View ArticlePubMedGoogle Scholar
 Lai Y, Zhao H: A statistical method to detect chromosomal regions with DNA copy number alterations using SNParraybased CGH data. Comput Biol Chem 2005, 29(1):47–54. 10.1016/j.compbiolchem.2004.12.004View ArticlePubMedGoogle Scholar
 Margolin AA, Greshock J, Naylor TL, Mosse Y, Maris JM, Bignell G, Saeed AI, Quackenbush J, Weber BL: CGHAnalyzer: a standalone software package for cancer genome analysis using arraybased DNA copy number data. Bioinformatics 2005, 21(15):3308–3311. 10.1093/bioinformatics/bti500View ArticlePubMedGoogle Scholar
 Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics 2004, 5(4):557–572. 10.1093/biostatistics/kxh008View ArticlePubMedGoogle Scholar
 Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6: 27. 10.1186/14712105627PubMed CentralView ArticlePubMedGoogle Scholar
 Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, et al.: SWARRAY: a dynamic programming solution for the identification of copynumber changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res 2005, 33(11):3455–3464. 10.1093/nar/gki643PubMed CentralView ArticlePubMedGoogle Scholar
 Broet P, Richardson S: Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics 2006, 22(8):911–918. 10.1093/bioinformatics/btl035View ArticlePubMedGoogle Scholar
 Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH: Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics 2004, 1(4):287–299.PubMed CentralView ArticlePubMedGoogle Scholar
 Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20(18):3413–3422. 10.1093/bioinformatics/bth418View ArticlePubMedGoogle Scholar
 Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 2004, 20(18):3636–3637. 10.1093/bioinformatics/bth355View ArticlePubMedGoogle Scholar
 Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNParraybased lossofheterozygosity data. Bioinformatics 2004, 20(8):1233–1240. 10.1093/bioinformatics/bth069View ArticlePubMedGoogle Scholar
 Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, BorresenDale AL: CGHExplorer: a program for analysis of arrayCGH data. Bioinformatics 2005, 21(6):821–822. 10.1093/bioinformatics/bti113View ArticlePubMedGoogle Scholar
 Liu J, Mohammed J, Carter J, Ranka S, Kahveci T, Baudis M: Distancebased clustering of CGH data. Bioinformatics 2006, 22(16):1971–1978. 10.1093/bioinformatics/btl185View ArticlePubMedGoogle Scholar
 Marioni JC, Thorne NP, Tavare S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 2006, 22(9):1144–1146. 10.1093/bioinformatics/btl089View ArticlePubMedGoogle Scholar
 Rueda OM, DiazUriarte R: A flexible, accurate and extensible statistical method for detecting genomic copynumber changes. COBRA Preprint Series 2006, Article 9.Google Scholar
 Shah SP, Xuan X, Deleeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 2006, 22(14):e431–439. 10.1093/bioinformatics/btl238View ArticlePubMedGoogle Scholar
 Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90(1):132–153. 10.1016/j.jmva.2004.02.008View ArticleGoogle Scholar
 Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21(19):3763–3770. 10.1093/bioinformatics/bti611PubMed CentralView ArticlePubMedGoogle Scholar
 Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 2005, 21(22):4084–4091. 10.1093/bioinformatics/bti677View ArticlePubMedGoogle Scholar
 Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, et al.: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7: 83. 10.1186/14712105783PubMed CentralView ArticlePubMedGoogle Scholar
 Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, et al.: A robust algorithm for copy number detection using highdensity oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071–6079. 10.1158/00085472.CAN050465View ArticlePubMedGoogle Scholar
 FASeg website[http://www.sph.emory.edu/bios/FASeg]
 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860–921. 10.1038/35057062View ArticlePubMedGoogle Scholar
 Coriell Cell Repositories/NIGMS[http://ccr.coriell.org/nigms/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.