ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information
© Suzuki et al; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Structural variations (SVs) change the structure of the genome and can therefore cause various diseases. Next-generation sequencing allows us to obtain a multitude of sequence data, some of which can be used to infer the position of SVs.
We developed a new method and implementation named ClipCrop for detecting SVs with single-base resolution using soft-clipping information. A soft-clipped sequence is an unmatched fragment in a partially mapped read. To assess the performance of ClipCrop against other SV-detecting tools, we generated simulation data covering various patterns (SV lengths, read lengths and depths of coverage of short reads) with insertions, deletions, tandem duplications, inversions and single nucleotide alterations in a human chromosome. For comparison, we selected BreakDancer, CNVnator and Pindel, which adopt different approaches to detecting SVs: the discordant pair approach, the depth of coverage approach and the split read approach, respectively.
Our method outperformed BreakDancer and CNVnator in both discovery rate and call accuracy for every type of SV. Pindel offered performance similar to that of our method, but our method clearly outperformed it in detecting small duplications. In our experiments, ClipCrop inferred reliable SVs for data sets with read lengths of more than 50 bases and 20x depth of coverage, both of which are reasonable values for current NGS data sets.
ClipCrop can detect SVs with a higher discovery rate and call accuracy than any other tool in our simulation data set.
Structural variations (SVs) are polymorphisms that change the structure of the genome, e.g. deletions, insertions, translocations, inversions and tandem duplications. They induce functional changes in genes and regulatory regions, which can cause various diseases, e.g. autism, Parkinson's disease and schizophrenia. Not only inherited SVs but also somatic SVs can be responsible for various diseases, including cancer. However, until a few years ago, there were no efficient methods for detecting genome-wide SVs at high resolution. Array-CGH, a microarray-based analysis, can detect only a limited range of SVs, since it can neither detect small SVs nor resolve the sequence of the target sample at the single-nucleotide level. Recently, next-generation sequencing (NGS) has drastically changed this situation. NGS enables us to sequence a large number of short reads (around 50 to 120 bases) at once and in a short time. Additionally, sequenced reads can be aligned to the reference genome, which was impossible with the microarray approach. Thus, we can detect SVs with higher resolution.
Until now, three types of methods have been developed to detect SVs from NGS data: the discordant pair approach, the depth of coverage approach and the split read approach.
The first approach, discordant pair, uses paired-end reads of NGS data and calls an SV when the mapped distance between the two reads of a pair is discordant. When an SV occurs, paired-end reads generated from that location cannot be mapped to the reference at a concordant distance. BreakDancer, VariationHunter, MoDIL and ABI Tools fall into this category. This idea was developed early on, when the depth of coverage was low and read lengths were short. Thus, this method is appropriate for smaller datasets of short reads. However, it cannot detect short SVs, and it has difficulty determining the exact position of SV events.
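The discordance test described above can be illustrated with a minimal sketch. It assumes the expected insert size is summarized by a mean and a standard deviation, and that a pair counts as discordant when its mapped distance deviates by more than k standard deviations. The function name, the cutoff k = 3 and the example numbers are illustrative assumptions and are not taken from BreakDancer or the other tools named above.

```python
def is_discordant(template_length, mean_insert, sd_insert, k=3.0):
    """Return True when a read pair's mapped distance is far from the expected insert size.

    template_length: signed distance between the two mates on the reference
                     (the sign is ignored here).
    k: hypothetical cutoff in standard deviations (an assumption for illustration).
    """
    return abs(abs(template_length) - mean_insert) > k * sd_insert


# A pair spanning a deletion maps farther apart on the reference than expected.
pairs = [480, 515, -505, 1200]
flags = [is_discordant(t, mean_insert=500, sd_insert=50) for t in pairs]
```

Only the last pair is flagged here; a real caller would additionally cluster flagged pairs and consider mate orientation, which this sketch leaves out.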
The second approach, depth of coverage, is used in SegSeq, CNVnator and ABI Tools. It uses the frequency of short reads or bases mapped to each position on the reference genome. The main concept of this method is similar to that of array-CGH. When a deletion occurs, the number of reads mapped to the affected region of the reference genome decreases. In contrast, in the case of a duplication, the number of reads mapped to the affected region increases. Unlike the first approach, this approach does not require paired-end reads, but it requires high coverage and still has difficulty detecting shorter SV events.
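As a rough illustration of the depth of coverage idea, the sketch below counts reads per fixed-size window and labels windows whose count is far below or above the median. The function names and the ratio cutoffs (0.5 and 1.5) are hypothetical choices for this example; they do not reproduce the statistics used by SegSeq or CNVnator.

```python
from statistics import median


def window_counts(read_starts, chrom_length, window):
    """Count reads per fixed-size window; read_starts are 0-based positions."""
    counts = [0] * ((chrom_length + window - 1) // window)
    for start in read_starts:
        counts[start // window] += 1
    return counts


def classify_windows(counts, low=0.5, high=1.5):
    """Label each window by its depth relative to the median window depth.

    low/high are hypothetical ratio cutoffs chosen for illustration only.
    """
    baseline = median(counts)
    if baseline == 0:
        return ["normal"] * len(counts)
    labels = []
    for count in counts:
        ratio = count / baseline
        if ratio < low:
            labels.append("loss")    # fewer reads than expected: candidate deletion
        elif ratio > high:
            labels.append("gain")    # more reads than expected: candidate duplication
        else:
            labels.append("normal")
    return labels
```

Because each call covers a whole window, the resolution of such a scheme cannot be finer than the window size.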
The third approach, ‘split read’, detects SVs using unsuspected reads, i.e. reads that are not correctly mapped to the reference genome or remain unmapped. In general, the split read approach is applicable only to paired-end reads. Although it needs sufficient read lengths and depth of coverage, it can detect SVs with single-base resolution. Reads spanning an SV event contain a ‘breakpoint’, the boundary between the region affected by the SV and its flanking region, which is the same as in the reference genome. An SV is called when the same breakpoint is detected in unsuspected reads. The algorithm for detecting breakpoints varies among tools. Pindel and SLOPE use orphaned reads, i.e. unmapped reads whose mates were successfully mapped to the reference genome, as unsuspected reads. SLOPE attempts partial alignment between either end of each unmapped read and the reference genome to obtain breakpoints. Pindel takes substrings from two different regions around the mapped mate read: one region extends twice the average insert size from the 3’ end of the mapped mate read, and the other spans the sum of the maximum deletion size and the read length from the appropriate position. It then checks whether the unmapped read can be reconstructed by concatenating two substrings, one from each region.
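The reconstruction check described for Pindel can be pictured with a deliberately naive sketch: try every split point of the unmapped read and test whether the prefix occurs in one region and the suffix in the other. This uses plain exact substring search and is not Pindel's actual pattern growth algorithm; the function name, the `min_part` parameter and the toy sequences are assumptions made for illustration.

```python
def split_points(read, region_a, region_b, min_part=1):
    """Return every index i where read[:i] occurs in region_a and read[i:] in region_b.

    Naive exact matching only; mismatches and strand are ignored in this sketch.
    """
    return [
        i
        for i in range(min_part, len(read) - min_part + 1)
        if read[:i] in region_a and read[i:] in region_b
    ]


# Toy example: the read joins the end of region_a to the start of region_b.
region_a = "AAACCCGGG"
region_b = "TTTACGTAC"
read = "CCGGGTTTAC"
candidates = split_points(read, region_a, region_b)
```

A single consistent split point suggests one breakpoint inside the read; a read that would need three or more pieces is never reconstructed by a two-piece scheme like this one.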
Major mapping tools (e.g. BWA, the Burrows-Wheeler alignment tool) still try to map part of a short read if they fail to map it to the reference genome at full length. If the short read is mapped partially, the information about the partial mapping is stored in the widely used SAM mapping format as soft-clipping information. The number of soft-clipped reads is comparable to that of the orphaned reads used by Pindel and SLOPE. Thus, our new method, ClipCrop, employs soft-clipping information and advances the third, 'split read', approach. By using the boundary position between the mapped sequence and the soft-clipped sequence in a clipped read, we can obtain putative breakpoints. Ideally, the true breakpoint is contained among these putative breakpoints. We then remap the soft-clipped sequences around the detected putative breakpoints and infer which type of event has actually occurred in the region. The detailed method is described in Section 2. Section 3 presents the comparison of ClipCrop, Pindel, SLOPE and BreakDancer on various simulation data sets. Section 4 discusses the results of Section 3.
In the first process of ClipCrop, reads with soft-clipping information are chosen for further analysis. The soft-clipping information is written as a CIGAR string in the SAM format. For example, the CIGAR string “31S69M” means that 31 bases from the left end are clipped and the remaining 69 bases are matched.
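Reading soft-clip lengths off a CIGAR string can be sketched as follows. This is a minimal example, not ClipCrop's own parser: the function names are hypothetical, and the only behavior relied on is the SAM convention that a CIGAR string is a sequence of length/operation pairs with `S` marking soft-clipped bases and `H` hard-clipped bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    if cigar == "*":  # SAM uses "*" when no alignment is available
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than well-formed pairs.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def soft_clip_lengths(cigar):
    """Return (left_clip, right_clip) in bases; 0 means that end is not soft-clipped."""
    # Hard clips may flank soft clips, so drop them before looking at the ends.
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right
```

For the example in the text, `soft_clip_lengths("31S69M")` gives `(31, 0)`: 31 clipped bases on the left and none on the right.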
The SAM file must be generated by a paired-end mapping tool, and the mapping tool must write soft-clipping information to the SAM file. With some mapping tools (e.g. BLAST, BLAT), the mapping result does not contain the whole read sequence, but only the mapped part. In such cases, the SAM file generated from them contains hard-clipping, in which the unmapped part of the sequence is not included in the SEQ column of the SAM format. We can convert hard-clipping information to soft-clipping information by using the original FASTQ file to restore the original sequence of each read. As a result of partial alignment, some reads are soft-clipped at both ends (e.g. 14S54M36S). We ignored such reads because they do not carry relevant information.
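The selection rule above (keep reads clipped at exactly one end) and the boundary between the mapped and clipped parts can be sketched together. The sketch assumes the SAM convention that POS is the 1-based position of the leftmost aligned base, and it reports the reference coordinate of the aligned base adjacent to the clip. The function name and the "left"/"right" labels are hypothetical; how ClipCrop itself encodes breakpoints is not specified here.

```python
import re


def clip_boundary(pos, cigar):
    """Return (clipped_side, reference_position) for a read clipped at exactly one end.

    pos: 1-based leftmost aligned reference position (SAM POS).
    Returns None for unclipped reads and for reads soft-clipped at both ends.
    """
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    left = bool(ops) and ops[0][1] == "S"
    right = len(ops) > 1 and ops[-1][1] == "S"
    if left == right:  # neither end clipped, or both ends clipped
        return None
    if left:
        # The first aligned base sits at POS, right after the clipped prefix.
        return ("left", pos)
    # Reference bases consumed by the alignment: M, D, N, = and X operations.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return ("right", pos + ref_len - 1)
```

For instance, a read at position 1000 with CIGAR `69M31S` yields `("right", 1068)`, the last aligned base before the clipped suffix.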
In the next process, soft-clipped fragments longer than 10 bases are collected and remapped to the reference genome around every breakpoint. Before mapping, the reference genome is cut around each breakpoint with a 1000-base extension on both sides. This reduces the probability that clipped sequences are mapped to the wrong position. In our current implementation, BWA is used for this remapping process. By checking the mapping pattern of the clipped sequences, ClipCrop infers the SV type (deletion, inversion, tandem duplication, insertion or translocation) as follows.
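The two numeric rules in this step, the minimum fragment length and the 1000-base flanks, can be written down as a small sketch. The thresholds come from the text; the function names, the 1-based inclusive coordinate convention and the clamping at chromosome ends are assumptions made for this example.

```python
MIN_CLIP_LENGTH = 10  # fragments must be longer than this many bases (from the text)
FLANK = 1000          # reference bases kept on each side of a breakpoint (from the text)


def is_remappable(clipped_seq, min_length=MIN_CLIP_LENGTH):
    """Keep only soft-clipped fragments longer than the minimum length."""
    return len(clipped_seq) > min_length


def reference_window(breakpoint, chrom_length, flank=FLANK):
    """Return the 1-based inclusive (start, end) of the reference slice around a breakpoint.

    The window is clamped so that it never runs off either end of the chromosome.
    """
    start = max(1, breakpoint - flank)
    end = min(chrom_length, breakpoint + flank)
    return start, end
```

Restricting the remapping target to such a window keeps a short fragment from being placed at an unrelated, similar-looking position elsewhere in the genome.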
In deletion events, clipped sequences from an L-breakpoint are mapped to the left side of an R-breakpoint and vice versa (Figure 1). As Figure 1(B) shows, reads generated near the deleted region are soft-clipped and remapped.
In the reliability score (1), B_L and B_R are the numbers of clipped reads supporting the L- and R-breakpoints of the SV event, and C_L and C_R are the numbers of clipped reads remapped to the L- and R-breakpoints of the SV event. In this formula, the more clipped and remapped reads there are, the higher the score. The score also tends to be high when the numbers of left and right reads are balanced.
Parameters used in simulation data
Reference: Human chromosome 22 (Build 37 ref)
Distribution of SV length: N(50, 5), N(80, 8), N(100, 10), N(120, 12), N(150, 15), N(170, 17), N(200, 20), N(400, 40), N(600, 60), N(800, 80), N(1000, 100), N(2000, 200), N(4000, 400)
The rate of single nucleotide alterations
The number of tandem repeats: N(40, 20) (>1)
Mean depth of coverage: 5, 10, 15, 20, 40
Read lengths: 50, 75, 100, 108
Distribution of template lengths
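Drawing SV lengths and tandem repeat counts from the distributions listed above could look like the sketch below. Reading N(a, b) as a normal distribution with mean a and standard deviation b is an assumption (the notation could also denote a variance), and the function names, the rounding to whole bases and the fixed seed are illustrative choices rather than the simulator used in the study.

```python
import random

# (mean, sd) pairs copied from the parameter list; interpreting the second
# value as a standard deviation is an assumption.
SV_LENGTH_PARAMS = [(50, 5), (80, 8), (100, 10), (120, 12), (150, 15), (170, 17),
                    (200, 20), (400, 40), (600, 60), (800, 80), (1000, 100),
                    (2000, 200), (4000, 400)]


def sample_sv_lengths(mean, sd, n, rng):
    """Draw n SV lengths from N(mean, sd), rounded to whole bases (minimum 1)."""
    return [max(1, round(rng.gauss(mean, sd))) for _ in range(n)]


def sample_repeat_count(rng, mean=40, sd=20):
    """Draw a tandem repeat count from N(40, 20), redrawing until it exceeds 1."""
    while True:
        count = round(rng.gauss(mean, sd))
        if count > 1:
            return count


rng = random.Random(0)  # fixed seed so the sketch is reproducible
lengths = sample_sv_lengths(50, 5, 1000, rng)
```

Each (mean, sd) pair would define one simulated data set, so that detection performance can be compared across SV sizes.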
Because tools with high sensitivity achieve a high discovery rate, the discovery rate can be regarded as a concept similar to sensitivity. In the same way, the true call rate can be regarded as a concept similar to specificity.
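The two measures can be made concrete with a small sketch. The definitions used here are assumptions, since the formulas are not given in this text: the discovery rate is taken as the fraction of simulated SVs that were called, and the true call rate as the fraction of calls that match a simulated SV. Matching is by exact (type, position) equality, and all names and example values are hypothetical.

```python
def discovery_rate(true_svs, called_svs):
    """Assumed definition: fraction of simulated SVs that appear among the calls."""
    true_svs, called_svs = set(true_svs), set(called_svs)
    return len(true_svs & called_svs) / len(true_svs) if true_svs else 0.0


def true_call_rate(true_svs, called_svs):
    """Assumed definition: fraction of calls that match a simulated SV."""
    true_svs, called_svs = set(true_svs), set(called_svs)
    return len(true_svs & called_svs) / len(called_svs) if called_svs else 0.0


# Toy data: SVs are (type, position) tuples and must match exactly.
truth = {("DEL", 1000), ("INS", 2500), ("INV", 4000), ("DUP", 7000)}
calls = {("DEL", 1000), ("INS", 2500), ("DEL", 9000)}
dr = discovery_rate(truth, calls)   # 2 of 4 simulated SVs were found
tcr = true_call_rate(truth, calls)  # 2 of 3 calls were correct
```

Exact position matching mirrors the single-base resolution claim; a tool with coarser resolution would need a tolerance window to be scored fairly.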
For all types of SVs, ClipCrop and Pindel detected most SVs with high accuracy (Figure 6), because both tools use the split read approach, which can detect SVs of any size with single-base resolution. BreakDancer, which employs the discordant pair approach, cannot detect short SVs, and it cannot achieve single-base resolution. CNVnator, which adopts the depth of coverage approach, first splits the reference genome into windows of a fixed size, so it cannot detect SVs shorter than the window size. As we set the window size to 100 bases in our analyses, CNVnator could not detect SVs shorter than 100 bases. The resolution of CNVnator is also limited to the window size. Like ClipCrop, Pindel achieved a high discovery rate and true call rate, but it could not detect short duplications (<170 bases), for the following reason. Pindel tries to reconstruct split reads by concatenating two subsequences taken from two regions near the position of the mapped mate. In short duplications, reads from the duplicated region would contain more than two breakpoints, so more than three subsequences are required to reconstruct them. Thus, Pindel cannot reconstruct these reads and fails to detect short duplications. ClipCrop, on the other hand, uses only soft-clipped sequences. Some of the short soft-clipped sequences do not contain any breakpoints, so they can be remapped and support tandem duplication calls. ClipCrop also outperformed Pindel in the true call rate of insertions. As the reliability score formula (1) shows, ClipCrop assigns a score of zero to SVs called from clipping on only one side and only one breakpoint, i.e. (B_L, B_R, C_L, C_R) = (n, 0, m, 0) or (B_L, B_R, C_L, C_R) = (0, n, 0, m). Thus, by removing SVs with a score of zero, we obtain reliable SVs supported from both sides, which is thought to contribute to its higher accuracy.
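The one-sided patterns named above can be checked directly from the four counts, without evaluating the score itself. The sketch below only encodes the two patterns stated in the text, (n, 0, m, 0) and (0, n, 0, m); it does not reproduce the reliability score formula, and the function name and the example calls are hypothetical.

```python
def is_one_sided(b_l, b_r, c_l, c_r):
    """True when all support comes from one side: (n, 0, m, 0) or (0, n, 0, m)."""
    right_empty = b_r == 0 and c_r == 0
    left_empty = b_l == 0 and c_l == 0
    return right_empty or left_empty


# Hypothetical calls as (B_L, B_R, C_L, C_R) tuples.
calls = [(12, 0, 9, 0), (0, 7, 0, 5), (10, 11, 8, 9)]
kept = [c for c in calls if not is_one_sided(*c)]
```

Only the call supported at both breakpoints survives the filter, which corresponds to discarding calls with a score of zero.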
The results in Figure 7 show that ClipCrop could detect tandem duplications with a high discovery rate and true call rate even when the depth of coverage is 5. This is because the depth in tandem-duplicated regions is much higher than in their surroundings, so there are sufficient numbers of reads supporting tandem duplications. Also, as inversions can be supported by twice as many reads as deletions and insertions (reads mapped to the inverted region with soft-clipping also support breakpoints), their discovery rate and true call rate were higher than those of deletions and insertions. The discovery rate and true call rate saturated at a depth of 20; the sufficient depth for ClipCrop therefore turned out to be 20, which is not particularly high for current NGS data.
From the results in Figure 8, read lengths of more than 50 bases are sufficient for ClipCrop. Thus, ClipCrop can be applied to most current NGS data.
There is another recently published SV-detecting tool called CREST, which also uses soft-clipping information. Unlike ClipCrop, CREST cannot detect tandem duplications. CREST assembles soft-clipped sequences and remaps the assembled sequence. Thus, assembled reads from the region of a tandem duplication cannot be mapped to the original reference genome.
Currently, as ClipCrop focuses only on soft-clipping information, it does not calculate the length of insertions. However, as ClipCrop calls the position of insertions with high accuracy (Figure 6), we expect to be able to obtain this information easily by combining it with other methods. In the future, we will combine ClipCrop with other methods and run it on real data.
ClipCrop is a tool for detecting SVs with soft-clipping information. Soft-clipped sequences are partially unmatched fragments in a mapped read. ClipCrop remaps these sequences and infers which type of SV event exists from the mapping pattern. ClipCrop can detect SVs with a higher discovery rate and call accuracy than any other tool in our simulation data set, especially for short duplications and insertions. In addition, as ClipCrop does not require a large depth of coverage or long read lengths, it can handle most current NGS data. Currently, the implementation of ClipCrop is available only in our environment, and we are in the process of deploying it. We will provide the current implementation on request.
The super-computing resource was provided by Human Genome Center, Institute of Medical Science, University of Tokyo.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.
- Medvedev Paul, Stanciu Monica, Brudno Michael: Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods 2009, 6(11):S13-S20. 10.1038/nmeth.1374
- McCarroll Steven A, Altshuler David M: Copy-number variation and association studies of human disease. Nat. Genetics 2009, 39: S37-S42.
- Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science 2007, 316: 445–449. 10.1126/science.1138659
- Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M, Peuralinna T, Dutra A, Nussbaum R, Lincoln S, Crawley A, Hanson M, Maraganore D, Adler C, Cookson MR, Muenter M, Baptista M, Miller D, Blancato J, Hardy J, Gwinn-Hardy K: Alpha-synuclein locus triplication causes Parkinson’s disease. Science 2003, 302: 841. 10.1126/science.1090278
- Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M: Strong association of de novo copy number mutations with sporadic schizophrenia. Nat. Genetics 2008, 40: 880–885. 10.1038/ng.162
- Shlien Adam, Malkin David: Copy number variations and cancer. Genome Medicine 2009, 1: 62. 10.1186/gm62
- Hawkins R, Hon Gary C, Ren Bing: Next-generation genomics: an integrative approach. Nature Reviews Genetics 2010, 11: 476–486.
- Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Paired-end mapping reveals extensive structural variation in the human genome. Science 2007, 318: 420–426. 10.1126/science.1149504
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: An algorithm for high resolution mapping of genomic structural variation. Nat. Methods 2009, 6: 677–681. 10.1038/nmeth.1363
- Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res 2009, 19: 1527–1541. 10.1101/gr.091868.109
- Lee S, et al.: MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat. Methods 2009, 6: 473–474. 10.1038/nmeth.f.256
- Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 2009, 19: 1527–1541. 10.1101/gr.091868.109
- Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods 2009, 6: 99–103. 10.1038/nmeth.1276
- Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011, 21: 974–984. 10.1101/gr.114876.110
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect breakpoints of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009, 25(21):2865–2871. 10.1093/bioinformatics/btp394
- Abel HJ, Duncavage EJ, Becker N, Armstrong JR, Magrini VJ, Pfeifer JD: SLOPE: a quick and accurate method for locating non-SNP structural variation from targeted next-generation sequence data. Bioinformatics 2010, 26(21):2684–2688. 10.1093/bioinformatics/btq528
- Li Heng, Durbin Richard: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):2684–2688.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079. 10.1093/bioinformatics/btp352
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol 1990, 215: 403–410.
- Kent W: BLAT – The BLAST-Like Alignment Tool. Genome Res 2002, 12: 656–664.
- Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, Chen K, Harris CC, Ding L, Holmfeldt L, Payne-Turner D, Fan X, Wei L, Zhao D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J: CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat. Methods 2011, 8(8):652–654. 10.1038/nmeth.1628
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.