ChopSticks: High-resolution analysis of homozygous deletions by exploiting concordant read pairs
© Yasuda et al.; licensee BioMed Central Ltd. 2012
Received: 7 April 2012
Accepted: 5 September 2012
Published: 30 October 2012
Structural variations (SVs) in genomes are commonly observed even in healthy individuals and play key roles in biological functions. To understand their functional impact or to infer molecular mechanisms of SVs, they have to be characterized with the maximum resolution. However, high-resolution analysis is a difficult task because it requires investigation of the complex structures involved in an enormous number of alignments of next-generation sequencing (NGS) reads and genome sequences that contain errors.
We propose a new method called ChopSticks that improves the resolution of SV detection for homozygous deletions even when the depth of coverage is low. Conventional methods based on read pairs use only discordant pairs to localize the positions of deletions, where a discordant pair is a read pair whose alignment has an aberrant strand or distance. In contrast, our method exploits concordant reads as well. We theoretically proved that when the depth of coverage approaches zero or infinity, the expected resolution of our method is asymptotically equal to that of methods based only on discordant pairs under double coverage. To confirm the effectiveness of ChopSticks, we conducted computational experiments against both simulated NGS reads and real NGS sequences. The resolution of deletion calls by other methods was significantly improved, thus demonstrating the usefulness of ChopSticks.
ChopSticks can generate high-resolution deletion calls of homozygous deletions using information independent of other methods, and it is therefore useful to examine the functional impact of SVs or to infer SV generation mechanisms.
Today, next-generation sequencing (NGS) technologies are essential tools in genome analysis, because they enable us to simultaneously obtain sequences of up to hundreds of billions of base pairs . These technologies enable the characterization of not only small variations such as single-nucleotide polymorphisms (SNPs) but also large-scale mutations such as insertions, deletions, tandem duplications, and inversions. Mutations of these types are collectively called structural variations (SVs) and are frequently observed even in healthy individuals [2–4]. Because SVs affect a much larger portion of genomes than small variations, including SNPs, they have a great impact on biological functions.
Current NGS methods can sequence paired reads, which are pairs of reads several hundred bases away from each other. This ability is useful for analyzing SVs because paired reads can be aligned with the reference genome more accurately than single reads, and because we can analyze structures of genomes larger than the size of each read. However, SV detection is still a difficult task, because it requires analysis of the complex structures involved in an enormous number of alignments of paired reads with the reference genome, and because read sequences and alignments include unavoidable errors. Therefore, for example, a false detection rate (FDR) up to 10% had to be tolerated even when determining just the existence of each SV in the 1000 Genomes Project . It is obviously more difficult to accurately detect the exact positions of SVs. Nevertheless, high-resolution SV calls are necessary to elucidate the functional impact of SVs and molecular mechanisms that generate SVs. Moreover, to conduct a large-scale analysis, SV detection methods for data with a low depth of coverage (hereafter simply referred to as coverage) are desirable, because whole genome sequencing is not easy even with NGS technologies.
Current methods for SV detection search for signatures that indicate SVs hidden in read sequences and their alignments with the genome sequences. The following are basic signatures used for SV detection [2–4].
Read pair (RP) [5–7]: If pairs of reads have aberrant strands or distances, they are likely to be caused by SVs. Such pairs are called discordant pairs, and normally mapped ones are called concordant pairs. If strands of a discordant pair are as expected, a larger distance than expected indicates a deletion, whereas a smaller distance indicates an insertion. There are several categories of methods that detect discordant pairs by using mapping distances.
Threshold-based: A pair with a mapped distance larger or smaller than a predefined threshold is defined as a discordant pair. The threshold is μ±3 σ or μ±4 σ for BreakDancer  and VariationHunter  where μ and σ are mean and standard deviation of mapping distances, or median fragment size ± 10 median absolute deviations for HYDRA .
Distribution-based: Although the mapped distance of a single pair might vary by tens or hundreds bases even without SVs, larger (smaller) mapping distances of many pairs in the same region indicate deletions (insertions). Such reads can be detected by statistical tests on the distribution of mapped distances [5, 8]. Pairs detected in this way might have mapping distances more similar to the expected distance than those of other methods. Nonetheless, we still call them discordant pairs in this paper to unify the word used to refer pairs that support SVs.
Graph-based: Recently Marshall et al.  proposed a new method CLEVER based on the graph theory. CLEVER constructs a graph where a node represents an alignment of a read pair and the genome, while an edge means that connected alignments potentially support the same allele. In this graph, a clique corresponds to a set of pairs supporting the same allele. CLEVER detects SVs by finding maximal cliques (max-cliques). CLEVER has an ability to find more than one max-clique overlaping each other, each of which supports a different allele. Therefore CLEVER can distinguish more than one SV located at the same locus, for example, two deletions of different sizes in a diploid genome.
Split read (SR) : If an alignment of a read and the genome includes only a part of the read, this indicates a position of a breakpoint. Here, a breakpoint is the boundary between a region affected by some SV and its unaffected flanking region.
The most popular signature used to detect SVs is threshold-based RP. Methods based on this signature can detect SVs from a small number of discordant read pairs; therefore threshold-based RP methods can be applied to low-coverage data. However, threshold-based RP methods localize SVs only to regions surrounded by discordant read pairs, thus causing some ambiguity. For RD methods, the problem of resolution is much bigger. Because RD methods involve calculation of coverage in windows of a fixed size, its resolution cannot be finer than the window size. Methods based on the SR signature can determine positions of breakpoints up to base-pair-level (bp-level) resolution if there are reads covering the breakpoints. However, such reads might not exist, in particular when coverage is low, because of unevenness of coverage or repeat elements to which reads cannot be aligned uniquely. Moreover, because such a split alignment is shorter than a read itself, careful analysis is required to avoid spurious matches. If coverage is sufficiently high, AS methods would ultimately reveal the exact positions of SVs at bp-level resolution. Although extremely deep sequencing can be conducted by targeted sequencing , it is still expensive to obtain paired reads of high coverage over the entire genome so that assembly can be performed. In fact, a previous study has indicated that the sensitivity of AS methods is rather low (Table S6B of Mills et al. ).
Because these signatures have their own advantages and disadvantages, it is desirable to combine more than one method . In fact, several methods that use more than one signature have been proposed recently [15, 16]. In combined approaches, we should integrate SV signatures that are independent of each other. In this paper, we propose a new method called ChopSticks that improves the resolution of deletion calls for homozygous deletions generated mainly by threshold-based RP methods. ChopSticks is especially valuable when target SVs are expected to be homozygous as those of inbred mice whose genomes are homozygous at virtually all loci . ChopSticks exploits positions of concordant read pairs in addition to those of discordant ones. Thus far, they have been ignored in threshold-based RP approaches, and therefore, our method can improve the resolution by using this new independent information. As explained below, ChopSticks is effective even for data whose coverage is low.
The organization of this paper is as follows. First, we theoretically analyze the improvement of the resolution achieved by exploiting concordant read pairs. Next, we present our computational method ChopSticks that improves the resolution of homozygous deletion calls. After that, we demonstrate the effectiveness of ChopSticks in computational experiments. Then, we present our conclusions. In addition, we illustrate details of our method and experiments in Methods section.
Results and discussion
Strategy for resolution improvement
Theoretical estimation of resolution
Here we present results of our theoretical analysis of improved resolution achieved by our method as compared to RP methods. We also present the necessary definitions to describe them. See Methods for details.
See Methods for the proof. When c→0, both E[Δ b |b,2c] and E[Δ b |b,c] approach d/2, which is the expected resolution when a deletion is detected with only one read pair. Therefore also approaches d/2 when c→0. On the other hand, when c approaches infinity, approaches E[Δ b |b,2c] because qd + 1→0. In summary,
Trimming of deletion calls to improve resolution
To evaluate the power of ChopSticks in improving the resolution of deletion calls, we conducted computational experiments. Let the upstream difference of a deletion call be x−y, where x is the position of the upstream end of the true deletion and y be that of the deletion call. Similarly, let the downstream difference of a deletion call be y′−x′, where x′ is the position of the downstream end of the true deletion and y′is that of the deletion call. By definition, the closer to zero a difference is, the better. A positive difference value indicates that the called breakpoint is outside the true deletion, whereas a negative value indicates that it is inside the true deletion. To evaluate ChopSticks, the results of ChopSticks have to be compared with the positions of true deletions. Therefore we need NGS reads of a genome whose SVs against the reference genome are known up to bp-level resolution. We conducted two experiments described below.
In the first experiment, we evaluated ChopSticks with simulated NGS reads for which all SVs were known up to bp-level resolution. To obtain data as realistic as possible, we generated a genome sequence with SVs and simulated NGS sequences by using SV annotations published by Quinlan et al. . The accession number of the SV annotations is [dbVar:nstd19]. First, we deleted regions of the reference genome sequence that were annotated as deletions by Quinlan et al. Next, we inserted random fragments whose number and distribution of lengths were the same as annotated deletions, assuming that deletions and insertions are symmetric. Then, we introduced single nucleotide substitutions into the simulated genome sequence and generated paired reads from it. We conducted this simulation and evaluation of ChopSticks for chromosome 1 of the reference mouse genome mm9. All paired reads were mapped to mm9 using Burrows-Wheeler aligner (BWA) . Then we conducted SV analysis by using SV detection tools from each of categories described in the Background section: BreakDancer  of threshold-based RP methods, MoDIL  of distribution-based RP methods, CLEVER  of graph-based RP methods, CNVnator  of RD methods, and Pindel  of SR methods. After that, we applied ChopSticks to their results.
Results of SV detection obtained by BreakDancer, MoDIL, CLEVER, CNVnator, and Pindel
Depth of coverage
Next, we applied ChopSticks to the results of SV detection tools. After that, we examined how well the resolution of deletion calls was improved. We tested ChopSticks for k=1,2,…,5 and f=0.1,0.2,…,1.0. We evaluated differences at both the upstream and downstream ends of deletions, and found that the results were similar. Therefore we only present the results at upstream ends.
Resolution improvements for BreakDancer deletion calls
Resolution improvements for MoDIL deletion calls
Resolution improvements for CLEVER deletion calls
Resolution improvements for CNVnator deletion calls
Results of ChopSticks applied to Pindel deletion calls
Real Illumina reads of DBA/2J
In the second experiment, we evaluated ChopSticks using the real NGS sequences of Quinlan et al. . The sample was taken from a female mouse of the DBA/2J strain, whose genome contains SVs against the reference genome of the C57BL/6J strain . The read sequences were available from the NCBI Sequence Read Archive (SRA) database . The accession number of the read sequences is [SRA:SRA010027]. To evaluate the results of ChopSticks, we need bp-level SV annotations of DBA/2J as well. Therefore we generated deletion calls at bp-level resolution using Sanger reads in a manner similar to that of Quinlan et al. See Methods for details. Our deletion calls are available at the dbVar database under accession no. [dbVar:nstd70].
We tried the five SV detection tools used in the previous experiment, and found that MoDIL, CNVnator and Pindel missed the most of deletions detected with Sanger reads. These methods seemed to suffer from the low depth of coverage and short read lengths. Therefore, we hereafter only describe results of ChopSticks applied to BreakDancer and CLEVER results.
Resolution improvements for BreakDancer deletion calls
Resolution improvements for CLEVER deletion calls
We have presented a new method called ChopSticks to improve the resolution of predicted positions of deletions. The key idea is to exploit both concordant read pairs and discordant ones. According to our theoretical analysis, the resolution of our method is quite similar to that of threshold-based RP methods but with double coverage. In an experiment on simulated NGS reads, ChopSticks clearly improved the results of BreakDancer, MoDIL, CLEVER, and CNVnator. Although the resolution of Pindel results is quite high, ChopSticks works well even for low-coverage data where recall of Pindel is not sufficient. The effectiveness of ChopSticks was also confirmed by performing an experiment on real Illumina reads. Despite a number of methods proposed for detecting SVs [2–4], there is no one-stop method that simultaneously achieves high sensitivity, high specificity, high resolution, and robustness for low-coverage data. Therefore a combination of SV detection methods is required, and ChopSticks can play an important role because it uses new independent information ignored in other methods.
As a future work, we consider to develop a method to distinguish homozygous deletions from heterozygous ones and to apply ChopSticks to the former. With this approach, ChopSticks will be available for more applications.
Derivation of theoretical estimation of resolution
Similarly, we define as an event wherein b is detected and the closest read upstream of b is exactly j bases apart. There are two mutually exclusive cases: (i) at least one of the closest reads is an upstream discordant read or (ii) all the closest reads are concordant reads. In the latter case, we have to consider the joint probability of the following events.
A concordant read exists at j bases upstream of b, the probability of which is 1−q.
No read nearer than the closest concordant read exists, the probability of which is q2j.
No discordant read exists at j bases upstream of b, the probability of which is q.
There must exist an upstream read of discordant pairs whose alignment ends in a region that is j + 1 to d bases upstream of b so that b is successfully included in a deletion call, the probability of which is 1−qd−j.
Proof of Theorem 1
Proof of Theorem 2
Therefore, all of , E[Δ b |b,2c], and E[Δ b |b,c] approach d/2 by Equation (4). On the other hand, when c→∞, qd + 1approaches 0. In consequence, the right hand side of Equation (4) approaches E[Δ b |b,2c] when c→0 or c→∞.
Mapping to the genome
We mapped paired reads to the mm9 reference genome sequences of Mus musculus using BWA version 0.5.9  with default parameters. The target genome sequences involved in our experiment included all chromosomes of mm9 except chromosome Y, assuming cases where a female mouse was analyzed [7, 21].
Simulated NGS sequences
Number of bases and number of reads of simulated data set
Depth of coverage
Total number of bases
Number of reads
Number of mapped reads
Real DBA/2J sequences
Number of bases and number of reads of DBA/2J data set
Total number of bases
Number of reads
Reads of uniquely mapped pairs
Reads of uniquely mapped pairs (chromosome 1)
Trimming algorithm of ChopSticks
Data for computational experiments
To evaluate our method, we need NGS sequences and reliable bp-level positions of breakpoints. There were six SV studies of inbred mice (nstd5, 7, 15, 18, 19, and 48) in the dbVar database  when we accessed it on April 1, 2012. However, none of them provides accurate bp-level positions of breakpoints. Therefore, we evaluated ChopSticks using the following two data sets.
Simulated NGS reads
NGS reads of Quinlan et al. and deletion calls based on Sanger reads
Parameters for SV detection tools and evaluation of their results
We executed BreakDancer with default parameters, and Pindel with an expected template size of 432 bp because the median fragment size was 432 bp according to Quinlan et al. . For CNVnator, we tested three window sizes: 50 bp, 100 bp, and 200 bp. Because the recall of window size 50 bp outperformed those of window sizes 100 bp and 200 bp for our simulated data when coverage was 2, we used results of window size 50 bp for evaluation. Because CLEVER tends to generate deletion calls duplicatedly with slightly different positions, we chose the best one for those overlapping with true deletions in order to estimate the upper limit of the accuracy of CLEVER. We divided the chromosome 1 of mm9 into 5.1 Mbp fragments in a manner such that franking fragments share 0.1Mbp, and applied MoDIL to each fragments, because MoDIL was quite slow as reported previously . We omitted evaluation of MoDIL for coverage=20.
To compare the positions of true and predicted deletions, we used BEDTools .
The super-computing resource was provided by Human Genome Center (University of Tokyo).
- Illumina Sequencing portfolio [http://www.illumina.com/systems/sequencing.ilmn]
- The 1000 genomes project consortium: A map of human genome variation from population-scale sequencing. Nature 2010, 467: 1061–1073. 10.1038/nature09534PubMed CentralView Article
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HYK, Leng J, Li R, Li Y, Lin CY, Luo R, et al.: 1000 genomes project: Mapping copy number variation by population-scale genome sequencing. Nature 2011, 470: 59–65. 10.1038/nature09708PubMed CentralView ArticlePubMed
- Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 2009, 6: S13—S20.View ArticlePubMed
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009, 6: 677–681. 10.1038/nmeth.1363PubMed CentralView ArticlePubMed
- Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res 2009, 19: 1270–1278. 10.1101/gr.088633.108PubMed CentralView ArticlePubMed
- Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM: Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 2010, 20: 623–635. 10.1101/gr.102970.109PubMed CentralView ArticlePubMed
- Lee S, Hormozdiari F, Alkan C, Brudno M: MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods 2009, 6: 473–474. 10.1038/nmeth.f.256View ArticlePubMed
- Marschall T, Costa I, Canzar S, Bauer M, Klau G, Schliep A, Schönhuth A: CLEVER: Clique-Enumerating Variant Finder. Bioinformatics 2012,, 28: 2875–2882. 10.1093/bioinformatics/bts566View ArticlePubMed
- Campbell PJ, Stephens PJ, Pleasance ED, O’Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Bignell PAWEGR, andP Andrew Futreal MRS: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40: 722–729. 10.1038/ng.128PubMed CentralView ArticlePubMed
- Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011, 21: 974–984. 10.1101/gr.114876.110PubMed CentralView ArticlePubMed
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009, 25(21):2865–2871. 10.1093/bioinformatics/btp394PubMed CentralView ArticlePubMed
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20: 265–272. 10.1101/gr.097261.109PubMed CentralView ArticlePubMed
- Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ: Target-enrichment strategies for next-generation sequencing. Nat Methods 2010, 7: 111–118. 10.1038/nmeth.1419View ArticlePubMed
- Handsaker R, Korn J, Nemesh J, McCarroll S: Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 2011, 43(3):269–276. 10.1038/ng.768View ArticlePubMed
- Zhang J, Wu Y: SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data. Bioinformatics 2011, 27(23):3228–3234. 10.1093/bioinformatics/btr563View ArticlePubMed
- Inbred mice - The Jaxon Laboratory [http://jaxmice.jax.org/type/inbred/index.html]
- Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988, 2(3):231–239. 10.1016/0888-7543(88)90007-9View ArticlePubMed
- ChopSticks [https://github.com/toyasuda/ChopSticks]
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25: 1754–1760. 10.1093/bioinformatics/btp324PubMed CentralView ArticlePubMed
- Mouse genome sequencing consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520–562. 10.1038/nature01262View Article
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, Sequeira E, et al.: Database resources of the National Center for Biotechnology Information. Nuc Acids Res 2011, 39(suppl 1):D38—D51.
- Bashir A, Volik S, Collins C, Bafna V, Raphael BJ: Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol 2008, 4: e1000051. 10.1371/journal.pcbi.1000051PubMed CentralView ArticlePubMed
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comp Biol 2000, 7: 203–214. 10.1089/10665270050081478View Article
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26(6):841–842. 10.1093/bioinformatics/btq033PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.