ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information
© Suzuki et al; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Structural variations (SVs) change the structure of the genome and are therefore the causes of various diseases. Next-generation sequencing allows us to obtain a multitude of sequence data, some of which can be used to infer the position of SVs.
We developed a new method and implementation named ClipCrop for detecting SVs with single-base resolution using soft-clipping information. A soft-clipped sequence is an unmatched fragment in a partially mapped read. To compare the performance of ClipCrop with that of other SV-detecting tools, we generated simulation data with various patterns (SV lengths, read lengths and depths of coverage of short reads) containing insertions, deletions, tandem duplications, inversions and single nucleotide alterations in a human chromosome. For comparison, we selected BreakDancer, CNVnator and Pindel, which adopt different approaches to detect SVs: the discordant pair approach, the depth of coverage approach and the split read approach, respectively.
Our method outperformed BreakDancer and CNVnator in both discovery rate and call accuracy for every type of SV. Pindel performed similarly to our method, but our method clearly outperformed it in detecting small duplications. In our experiments, ClipCrop inferred reliable SVs for data sets with read lengths of more than 50 bases and 20x depth of coverage, both of which are reasonable values for current NGS data sets.
ClipCrop can detect SVs with a higher discovery rate and call accuracy than any other tool in our simulation data set.
Structural variations (SVs) are polymorphisms that change the structure of the genome, e.g. deletions, insertions, translocations, inversions and tandem duplications. They induce functional changes in genes and regulatory regions, which can cause various diseases, e.g. autism, Parkinson's disease and schizophrenia. Not only inherited SVs but also somatic SVs can be responsible for various diseases, including cancer. However, until a few years ago, there were no efficient methods to detect genome-wide SVs at high resolution. One of the microarray analyses, array-CGH, can detect only a limited range of SVs, since it can neither detect small SVs nor determine the sequence of the target sample at the single-nucleotide level. Recently, next-generation sequencing (NGS) has drastically changed this situation. NGS enables us to obtain a large number of short sequence reads (short reads, around 50 to 120 bases) at once and in a short time. Additionally, sequenced reads can be aligned to the reference genome, which was impossible with the microarray approach. Thus, we can detect SVs with higher resolution.
Until now, three types of methods have been developed to detect SVs from NGS data: the discordant pair approach, the depth of coverage approach and the split read approach.
The first approach, discordant pair, uses paired-end reads of NGS data and calls an SV when the distance between the two reads of a pair is discordant. When an SV occurs, paired-end reads generated from that location cannot be mapped to the reference at a concordant distance. BreakDancer, VariationHunter, MoDIL and ABI Tools fall into this category. This idea was developed in the early days of NGS, when the depth of coverage was low and the read sequences (read lengths) were short. Thus, this method is appropriate for smaller datasets of short reads. However, it cannot detect short SVs, and it has difficulty determining the exact position of SV events.
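As a rough illustration of the discordant pair idea (not the algorithm of BreakDancer or of any other tool named above), the sketch below flags a read pair whose template length lies far from the library mean. The function name, the cut-off of three standard deviations and the example numbers are assumptions made only for illustration.

```python
def is_discordant(template_length, mean_insert, sd_insert, n_sd=3.0):
    """Return True when a pair's template length is far from the library mean.

    template_length may be negative (as in the SAM TLEN column), so its
    absolute value is used. n_sd is an assumed cut-off, not a published one.
    """
    deviation = abs(abs(template_length) - mean_insert)
    return deviation > n_sd * sd_insert


# Hypothetical library: mean insert size 500 bases, standard deviation 50.
pairs = [500, -480, 1200, 120]
discordant = [t for t in pairs if is_discordant(t, 500, 50)]
```

A pair spanning a deletion maps farther apart than expected, so it is flagged, but the exact boundary of the deletion inside the unsequenced part of the template remains unknown, which is why this approach does not reach single-base resolution.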
The second approach, depth of coverage, is used in SegSeq, CNVnator and ABI Tools. It uses the frequency of short reads or bases mapped to each position on the reference genome. The main concept of this method is similar to that of array-CGH. When a deletion occurs, the number of reads mapped to that region of the reference genome decreases. In contrast, when a duplication occurs, the number of reads mapped to that region increases. Unlike the first approach, this approach does not require paired-end reads, but it requires high coverage and still has difficulty detecting shorter SV events.
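The depth of coverage idea can be sketched as follows. This is a simplified illustration, not the statistical model of SegSeq or CNVnator: reads are counted per fixed-size window, and windows whose count is far below or above the mean are flagged. The function names and the ratio thresholds (0.5 and 1.5) are assumptions.

```python
from collections import Counter


def window_depth(read_starts, window_size, n_windows):
    """Count reads per window from 0-based read start positions."""
    counts = Counter(pos // window_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_windows)]


def flag_windows(depths, low_ratio=0.5, high_ratio=1.5):
    """Flag windows whose depth is far from the mean depth.

    Returns (window_index, "loss" | "gain") tuples. The ratios are
    illustrative thresholds only.
    """
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    calls = []
    for i, depth in enumerate(depths):
        if depth < low_ratio * mean:
            calls.append((i, "loss"))
        elif depth > high_ratio * mean:
            calls.append((i, "gain"))
    return calls
```

Because every call is made per window, an event shorter than the window cannot be resolved, which matches the limitation described above.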
The third approach, ‘split read’, detects SVs using unsuspected reads, i.e. reads that are not correctly mapped to the reference genome or that remain unmapped. In general, the split read approach is applicable only to paired-end reads. While it needs sufficient read lengths and depth of coverage, it can detect SVs with single-base resolution. Reads spanning an SV event contain a ‘breakpoint’, the boundary between a region affected by the SV and its flanking region, which is identical to the reference genome. An SV is called when the same breakpoint is detected in unsuspected reads. The algorithm for detecting breakpoints varies among tools. Pindel and SLOPE use orphaned reads, i.e. unmapped reads whose mates were successfully mapped to the reference genome, as unsuspected reads. SLOPE attempts a partial alignment between either end of each unmapped read and the reference genome to obtain breakpoints. Pindel takes substrings from two different regions around the mapped mate read: one region extends twice the average insert size from the 3’ end of the mapped mate read, and the other spans the sum of the maximum deletion size and the read length from the appropriate position. It then checks whether the unmapped read can be reconstructed by concatenating two substrings, one from each region.
Major mapping tools (e.g. BWA, the Burrows-Wheeler alignment tool) still try to map part of a short read when they fail to map its full length to the reference genome. If the short read is partially mapped, the information about the partial mapping is stored in SAM, a major mapping format, as soft-clipping information. The number of soft-clipped reads is comparable to that of the orphaned reads used by Pindel and SLOPE. Our new method, ClipCrop, therefore employs soft-clipping information and advances the third, 'split read', approach. From the boundary position between the mapped sequence and the soft-clipped sequence in a clipped read, we obtain putative breakpoints. Ideally, the true breakpoints are contained among these putative breakpoints. We then remap the soft-clipped sequences around the detected putative breakpoints and infer which type of event actually occurred in the region. The method is described in detail in Section 2. Section 3 compares ClipCrop, Pindel, SLOPE and BreakDancer on various simulation data sets. Section 4 discusses the results of Section 3.
In the first step of ClipCrop, reads with soft-clipping information are selected for further analysis. The soft-clipping information is written as a CIGAR string in SAM format. For example, the CIGAR string “31S69M” means that 31 bases from the left end are clipped and the remaining 69 bases are matched.
The SAM file must be generated by a paired-end mapping tool, and the mapping tool must write soft-clipping information to the SAM file. Some mapping tools (e.g. BLAST, BLAT) report only the mapped part of each read rather than the whole read sequence. A SAM file generated from such results contains hard-clipping, in which the unmapped part of the sequence is absent from the SEQ column. We can convert hard-clipping information to soft-clipping information by restoring the original sequence of each read from the original FASTQ file. As a result of partial alignment, some reads are soft-clipped at both ends (e.g. 14S54M36S). We ignored such reads because they do not carry relevant information.
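The read selection described above can be sketched in a few lines. This is an illustrative sketch, not the ClipCrop implementation: the function names are hypothetical, and the convention that the boundary is the 1-based reference coordinate of the aligned base next to the clip is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance on the reference


def parse_cigar(cigar):
    """Split a CIGAR string such as '31S69M' into (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def clip_breakpoint(pos, cigar):
    """Return (side, clip_length, boundary) for a read clipped on one side.

    pos is the SAM POS field (1-based leftmost aligned base). Reads without
    soft-clipping, or soft-clipped at both ends, return None.
    """
    # Hard clips (H) can only sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    if not ops:
        return None
    left = ops[0][1] == "S"
    right = ops[-1][1] == "S"
    if left == right:  # clipped at both ends, or not clipped at all
        return None
    if left:
        return ("left", ops[0][0], pos)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    return ("right", ops[-1][0], pos + ref_len - 1)
```

For the example in the text, `clip_breakpoint(1000, "31S69M")` reports a left clip of 31 bases whose boundary is the first aligned base, while `14S54M36S` is discarded.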
In the next step, soft-clipped fragments longer than 10 bases are collected and remapped to the reference genome around each breakpoint. Before mapping, the reference genome is cut around each breakpoint with a 1000-base extension on both sides. This reduces the probability that clipped sequences are mapped to a wrong position. In our current implementation, BWA is used for this remapping step. By checking the mapping pattern of the clipped sequences, ClipCrop infers the SV type (deletion, inversion, tandem duplication, insertion or translocation) as follows.
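A minimal sketch of the two preparation steps just described, assuming the reference is held as a Python string and breakpoints are 1-based coordinates. The function names are hypothetical, and the remapping itself (done with BWA in the text) is not shown.

```python
def clipped_fragment(seq, side, clip_length, min_length=10):
    """Return the soft-clipped part of a read, or None if it is too short.

    Only fragments longer than min_length bases are kept, following the
    threshold of 10 bases stated in the text.
    """
    if clip_length <= min_length:
        return None
    return seq[:clip_length] if side == "left" else seq[-clip_length:]


def reference_window(reference, breakpoint, flank=1000):
    """Cut the reference around a 1-based breakpoint with `flank` bases
    on both sides, truncated at the ends of the reference.

    Returns (window_start, window_sequence), window_start being 1-based.
    """
    start = max(0, breakpoint - 1 - flank)         # 0-based, inclusive
    end = min(len(reference), breakpoint + flank)  # 0-based, exclusive
    return start + 1, reference[start:end]
```

Restricting the search space to such a window is what lets short fragments, which would map ambiguously on a whole chromosome, be placed with fewer false hits.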
In deletion events, clipped sequences from an L-breakpoint are mapped to the left side of an R-breakpoint and vice versa (Figure 1). As shown in Figure 1(B), reads generated near the deleted region are soft-clipped and remapped.
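The deletion pattern can be illustrated on a toy sequence. In this sketch an exact string search stands in for the BWA remapping step, coordinates are 0-based, and only the case of a read clipped at its right end is shown (the opposite case is symmetric). The function name and the toy reference are assumptions for illustration.

```python
def deletion_from_right_clip(reference, boundary, clipped, flank=1000):
    """Infer a deletion from a read that is soft-clipped at its right end.

    boundary is the 0-based index of the first reference base after the
    aligned part of the read. If the clipped fragment is found downstream
    of the boundary, the bases in between are reported as a half-open
    interval (start, end) of the putative deletion; otherwise None.
    """
    window_end = min(len(reference), boundary + flank)
    hit = reference.find(clipped, boundary, window_end)
    if hit <= boundary:  # -1: not found; == boundary: no gap at all
        return None
    return boundary, hit


# Toy reference: 20-base left flank, 40-base deleted region, right flank.
reference = "ACGGACGGACGGACGGACGG" + "T" * 40 + "GATCCGATCAGGCATGCA"
sample = reference[:20] + reference[60:]   # sample genome carrying the deletion
read = sample[5:35]                        # read spanning the junction
aligned, clipped = read[:15], read[15:]    # what a 15M15S alignment would yield
call = deletion_from_right_clip(reference, 20, clipped)
```

Here the clipped 15 bases are found 40 bases downstream of the boundary, so the sketch reports the interval between the two breakpoints as deleted.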
In the reliability score (formula (1)), B_L and B_R are the numbers of clipped reads supporting the L- and R-breakpoints of the SV event, and C_L and C_R are the numbers of clipped reads remapped to the L- and R-breakpoints of the SV event, respectively. In this formula, the more clipped and remapped reads there are, the higher the score. The score also tends to be high when the numbers of left and right reads are balanced.
Parameters used in simulation data
Reference: Human chromosome 22 (Build 37 ref)
Distribution of SV length: N(50, 5), N(80, 8), N(100, 10), N(120, 12), N(150, 15), N(170, 17), N(200, 20), N(400, 40), N(600, 60), N(800, 80), N(1000, 100), N(2000, 200), N(4000, 400)
The rate of single nucleotide alterations
The number of tandem repeats: N(40, 20) (>1)
Mean depth of coverage: 5, 10, 15, 20, 40
Read lengths (bases): 50, 75, 100, 108
Distribution of template lengths
Because tools with high sensitivity achieve a high discovery rate, the discovery rate can be regarded as a concept similar to sensitivity. In the same way, the true call rate can be regarded as a concept similar to specificity.
For all types of SVs, ClipCrop and Pindel detected most SVs with high accuracy (Figure 6). This is because both tools use the split read approach, which can detect SVs of any size with single-base resolution. BreakDancer, which employs the discordant pair approach, cannot detect short SVs, and its calls do not reach single-base resolution. CNVnator, which adopts the depth of coverage approach, first splits the reference genome into windows of a certain size, so it cannot detect SVs shorter than the window size. As we set the window size to 100 bases in our analyses, CNVnator could not detect SVs shorter than 100 bases. The resolution of CNVnator is also limited to the window size. Like ClipCrop, Pindel achieved a high discovery rate and true call rate, but it could not detect short duplications (<170 bases), for the following reason. Pindel tries to reconstruct split reads by concatenating two subsequences taken from two regions near the position of the mapped mate. In short duplications, reads from the duplicated region would contain more than two breakpoints, which means that more than three subsequences are required to reconstruct them. Thus, Pindel cannot reconstruct these reads and fails to detect short duplications. ClipCrop, on the other hand, uses only soft-clipped sequences. Some of the short soft-clipped sequences do not contain any breakpoints, so they can be remapped and support tandem duplication calls. ClipCrop also surpassed Pindel in the true call rate of insertions. As the reliability score formula (1) shows, ClipCrop assigns a score of zero to SVs called from one-sided clipping and only one breakpoint, i.e. (B_L, B_R, C_L, C_R) = (n, 0, m, 0) or (B_L, B_R, C_L, C_R) = (0, n, 0, m). Thus, by removing SVs with a score of zero, we obtain reliable SVs supported from both sides, which is thought to contribute to the higher accuracy.
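The filtering of one-sided calls can be sketched as below. The sketch does not reproduce the reliability score formula (1); it only tests for the two support patterns named in the text, (n, 0, m, 0) and (0, n, 0, m), which receive a score of zero. The class and function names are hypothetical.

```python
from typing import NamedTuple


class Support(NamedTuple):
    """Read counts supporting one SV call (names follow the text)."""
    b_l: int  # clipped reads supporting the L-breakpoint
    b_r: int  # clipped reads supporting the R-breakpoint
    c_l: int  # clipped reads remapped to the L-breakpoint
    c_r: int  # clipped reads remapped to the R-breakpoint


def is_one_sided(s):
    """True for the patterns (n, 0, m, 0) and (0, n, 0, m) from the text."""
    no_right = s.b_r == 0 and s.c_r == 0
    no_left = s.b_l == 0 and s.c_l == 0
    return no_right or no_left


def keep_two_sided(calls):
    """Drop calls that are supported from one side only."""
    return [s for s in calls if not is_one_sided(s)]
```

Whether other patterns also score zero depends on formula (1) itself, so this predicate should be read as a lower bound on what the score-based filter removes.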
The results in Figure 7 show that ClipCrop could detect tandem duplications with a high discovery rate and true call rate even when the depth of coverage is 5. This is because the depth in tandem-duplicated regions is much higher than in the surrounding regions, so there are sufficient reads supporting tandem duplications. Also, as inversions can be supported by twice as many reads as deletions and insertions (reads mapped to the inverted region with soft-clipping also support breakpoints), their discovery rate and true call rate were higher than those of deletions and insertions. The discovery rate and true call rate saturated at depth 20; the sufficient depth for ClipCrop therefore turned out to be 20, which is not high for current NGS data.
The results in Figure 8 indicate that read lengths of more than 50 bases are sufficient for ClipCrop. Thus, ClipCrop can be applied to most current NGS data.
Another recently published SV-detecting tool, CREST, also uses soft-clipping information. Unlike ClipCrop, CREST cannot detect tandem duplications. CREST assembles soft-clipped sequences and remaps the assembled sequence; assembled reads from a tandem-duplicated region therefore cannot be mapped to the original reference genome.
Currently, as ClipCrop focuses only on soft-clipping information, it does not calculate the length of insertions. However, as ClipCrop calls the position of insertions with high accuracy (Figure 6), this information should be easy to obtain by combining ClipCrop with other methods. In the future, we will combine other methods and run ClipCrop on real data.
ClipCrop is a tool for detecting SVs with soft-clipping information. Soft-clipped sequences are unmatched fragments in partially mapped reads. ClipCrop remaps these sequences and infers from the mapping pattern which type of SV event exists. ClipCrop can detect SVs with a higher discovery rate and call accuracy than any other tool on our simulation data set, especially for short duplications and insertions. In addition, as ClipCrop does not require a large depth of coverage or long read lengths, it can handle most current NGS data. Currently, the implementation of ClipCrop is available only in our environment, and we are in the process of deploying it. We will provide the current implementation on request.
The super-computing resource was provided by Human Genome Center, Institute of Medical Science, University of Tokyo.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.