An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data
© Zhang et al; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
Recent advances in sequencing technologies make it possible to comprehensively study structural variations (SVs) using sequence data of large-scale populations. Currently, much effort is devoted to developing methods that call SVs with exact breakpoints. Among these approaches, split-read mapping methods can be applied to low-coverage sequence data. With the increasing amount of data generated, more efficient split-read mapping methods are needed. Also, since sequence errors cannot be avoided with current sequencing technologies, more accurate split-read mapping methods are needed to better handle them.
In this paper, we present a split-read mapping method implemented in the program SVseq2, which improves upon our previous work SVseq1. Like SVseq1, SVseq2 calls deletions (and insertions) with exact breakpoints. SVseq2 achieves more accurate calling through split-read mapping within focal regions. SVseq2 also has a much desired feature: there is no need to specify the maximum deletion size, whereas some existing split-read mapping methods need more memory and longer running time when a larger maximum deletion size is chosen. SVseq2 is also much faster because it only needs to examine a small number of ways of splitting the reads. Moreover, SVseq2 supports insertion calling from low-coverage sequence data, while SVseq1 only supports deletion finding. The program SVseq2 can be downloaded at http://www.engr.uconn.edu/~jiz08001/.
SVseq2 enables accurate and efficient SV calling through split-read mapping within focal regions using paired-end reads. For many simulated data and real sequence data, SVseq2 outperforms some other existing approaches in accuracy and efficiency, especially when sequence coverage is low.
Finding structural genomic variations (e.g. deletions and insertions) has recently become an active research subject. It is commonly believed that some structural variations may be linked to complex diseases. High-throughput sequencing (HTS) technologies (such as the Roche 454 FLX, Illumina Genome Analyzer, and ABI SOLiD) have become widely available. Sequence data can potentially reveal nearly all genetic variations, including structural variants. Thus, great efforts have been made to discover structural variations in populations using sequence data. For example, in its pilot studies the ongoing 1000 Genomes Project has released called structural variations for several human populations from hundreds of sequenced individuals.
Many current sequence datasets consist of pairs of reads. These pairs can be mapped to a reference genome using read mapping tools such as Bowtie and BWA. Usually both reads of a pair can be mapped to two different locations of the reference genome. The distance between them is called the insert size; its distribution is characterized by the library mean and standard deviation. An abnormal insert size (as suggested by the two mapped reads) may indicate the presence of some genomic structure not present in the reference genome. Such pairs are called discordant pairs, and they are useful in locating structural variations. Many methods detect SVs by analyzing the insert size of discordant pairs, such as PEMer, BreakDancer, GASV and VariationHunter. A drawback of these methods is that they only find approximate positions of the breakpoints of SVs, while high breakpoint resolution is useful in SV classification and annotation. Read depth methods (e.g. ) are another type of method that does not give exact breakpoints.
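The discordant-pair idea can be sketched in a few lines. This is an illustrative stand-in, not code from any of the cited tools; the 3σ threshold and all names are assumptions.

```python
# Hypothetical sketch of discordant-pair detection; names and the
# 3-sigma cutoff are illustrative, not taken from any cited tool.

def insert_size(s1, e2):
    """Outer distance of a pair mapped to [s1, e1] and [s2, e2]."""
    return e2 - s1 + 1

def is_discordant(s1, e2, lib_mean, lib_sd, n_sd=3):
    """Flag a pair whose insert size deviates from the library mean
    by more than n_sd standard deviations."""
    return abs(insert_size(s1, e2) - lib_mean) > n_sd * lib_sd

# A pair spanning a deletion appears stretched on the reference:
print(is_discordant(1000, 1200, lib_mean=200, lib_sd=30))  # concordant
print(is_discordant(1000, 1900, lib_mean=200, lib_sd=30))  # discordant
```

A pair whose outer distance greatly exceeds the library mean suggests the sample genome is missing sequence present in the reference, i.e. a deletion.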
Assembly and split-read mapping methods are alternative approaches that can find exact breakpoints of SVs. One representative split-read mapping method is the program Pindel. Sometimes a read mapping program cannot properly map a pair of reads. There are multiple causes for unmapped reads, e.g. errors in the sequence reads. The presence of SVs may also cause some reads to be unmappable. In the case of deletions, for example, when a read contains the breakpoints of a deletion site, the read consists of two parts: one from the region preceding the deletion site and one from the region following it. The read may be unmappable because it is a concatenation of the two parts and is not contained in the reference genome. Split-read mapping methods use the pairs with one read mapped and the other unmapped. The mapped read of the pair is used as an anchor. The other read is split in the middle, and the two parts are then mapped to the reference genome. If mapped correctly, the mapped split reads may reveal where deletions occur. Recently, more methods dedicated to finding exact breakpoints have appeared. SRiC is a split-read method that mainly works on longer single reads such as Sanger and 454 reads. AGE maps an assembled contig to a reference genome to detect the exact breakpoints of multiple SVs. There are also methods (e.g. CREST) that do not detect the breakpoints themselves but rely on the exact breakpoints provided by mapping tools (through soft-clip mapping). A disadvantage of split-read mapping is that mapping split reads across a large gap is usually less efficient. Moreover, split reads may be mapped to wrong locations due to noise in the reads, and mapping may fail for SVs with a breakpoint in a repetitive region.
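As a toy illustration of the split-read idea (not the implementation of Pindel or SVseq1/2), the following exhaustively tries split points of an unmappable read and reports prefix/suffix placements whose gap suggests a deletion; real tools use indexed search and tolerate sequencing errors.

```python
# Toy split-read mapping sketch: split an unmappable read at every
# position and look for a split whose prefix and suffix both occur
# exactly in the reference, with the suffix downstream of the prefix.
# The gap between the two placements suggests a deletion.

def split_map(read, ref, min_part=3):
    hits = []
    for cut in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = ref.find(prefix)
        if p < 0:
            continue
        q = ref.find(suffix, p + len(prefix))
        if q > p + len(prefix):  # a real gap => candidate deletion
            # (prefix position, suffix position, deletion length)
            hits.append((p, q, q - (p + len(prefix))))
    return hits

ref = "AAAACCCCGGGGTTTTAAAACCCC"
read = "CCCCTTTT"  # concatenation across a simulated deletion of "GGGG"
print(split_map(read, ref))  # -> [(4, 12, 4)]
```

The single hit places the prefix at position 4 and the suffix at position 12, revealing a 4-bp deletion of "GGGG".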
Despite the increasing number of developed methods, calling structural variations from real sequence data remains a challenging computational problem. The challenges include: (i) sequence reads tend to be short and noisy (i.e. containing sequence errors or artifacts caused by errors in read mapping), (ii) much current sequence data is at low coverage, and (iii) the volume of sequence data is often large. Therefore, much work is still needed to develop more accurate and efficient approaches for structural variation calling with low-coverage sequence data.
Recently, we developed a computational approach for calling deletions from low-coverage sequence data. This approach (implemented in the program SVseq1) integrates two existing deletion calling approaches (namely discordant insert size analysis and split-read mapping), and thus in principle utilizes more of the information contained in the reads than pure split-read mapping approaches. Since sequence data tends to be noisy, it is important to use as much of the information in the data as possible when calling deletions. Briefly, SVseq1 first tries to split a read (that cannot be mapped as a whole) and maps the prefix and suffix parts to two regions. The gap between the two mapped regions of the split read may correspond to a deletion. Since there may be more than one way of splitting some reads, and some mapped split reads may only be artifacts of sequence and/or mapping errors, we filter the candidate deletions (from the split-read mapping) using discordant insert size analysis. That is, we call a candidate deletion a true deletion only when it is supported by the discordant insert size analysis. Simulation results in  show that our method outperforms an existing method .
Our work in  makes progress toward improving deletion calling from sequence data. However, it has several disadvantages. The most severe issue is that it is difficult to determine the best way of splitting reads: due to noise in the reads, there may be many equally good ways to split them. This not only leads to longer running time (due to the need to examine more candidate deletions) but may also introduce false positives. Moreover, split-read mapping tends to be slow, especially for genome-scale data. Finally, only deletion calling is supported in , and other types of structural variations (e.g. insertions) may also be of interest to many downstream applications.
Like SVseq1, SVseq2 calls deletions (and insertions) with exact breakpoints.
SVseq2 achieves more accurate calling through split-read mapping on focal regions. SVseq2 also has a much desired feature: there is no need to specify the maximum deletion size, which is often needed by other methods (e.g. [10, 14]). SVseq2 is also much faster because it only needs to examine a small number of ways of splitting the reads.
SVseq2 utilizes new features of read mapping tools. Recent read mapping tools (e.g. BWA) provide partial read mapping (called soft-clips in BWA). These partially mapped reads are often included in the mapped sequence data. SVseq2 relies on the soft-clip mapping provided by read mapping tools for part of the split-read mapping. This makes SVseq2 faster than SVseq1 and some other similar deletion finding programs (e.g. ).
SVseq2 is also easier to use: it only needs mapped sequence data (stored in a BAM file) and reference genome (stored in FASTA format) as input.
SVseq2 supports insertion calling from low-coverage sequence data.
The mapped segment of a split read (from soft-clip mapping) is used as the starting point of split-read mapping. This utilizes new features of read mapping tools and speeds up the computation.
To locate the soft-clipped segment of the split read, we infer a focal region (i.e. the region where the soft-clipped segment is highly likely to be mapped) using discordant read analysis. We explain below how this step is performed.
The focal region is usually much shorter and thus there is less chance to introduce false positives. We then search for the occurrence of the second segment within the focal region using a semi-global alignment algorithm.
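A minimal semi-global alignment sketch is shown below: the segment must align end to end, while leading and trailing gaps in the focal-region window are free. The scoring values (match +1, mismatch/gap -1) are assumptions; SVseq2's actual parameters are not reproduced here.

```python
# Semi-global alignment sketch: global in the query (the soft-clipped
# segment), local in the reference window.  Assumed scoring: match +1,
# mismatch -1, gap -1; free end gaps in the reference only.

def semi_global(query, ref, match=1, mismatch=-1, gap=-1):
    n, m = len(query), len(ref)
    # prev[j]: best score aligning a query prefix ending at ref pos j
    prev = [0] * (m + 1)          # free leading gaps in the reference
    for i in range(1, n + 1):
        cur = [gap * i] + [0] * m  # gaps in the query are penalized
        for j in range(1, m + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # (mis)match
                         prev[j] + gap,      # query char unaligned
                         cur[j - 1] + gap)   # ref char skipped
        prev = cur
    return max(prev)              # free trailing gaps in the reference

print(semi_global("ACGT", "TTACGTTT"))  # exact hit inside the window -> 4
```

Because the dynamic program runs only over the short focal window rather than the whole search range, the cost is proportional to segment length times window width.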
For insertions, SVseq2 also uses soft-clip mapping in locating the likely insertions. We now give a more detailed description on how SVseq2 calls deletions and insertions.
SVseq2 relies on two types of patterns formed by split reads to detect deletions.
Type I pattern: the segment facing the anchor end is mapped (e.g. Read 1 in Figure 1).
Type II pattern: the segment away from the anchor is mapped (e.g. Read 4 in Figure 2).
For the type I pattern, the mapped segment of a split read (based on soft-clip mapping) faces the anchor. We denote the mapped location of the mapped segment as [a, b] (where a < b). To discover a deletion, the soft-clipped segment needs to be mapped to some region [c, d] (where c < d < a). We denote the length of the soft-clipped segment as l_s = d - c + 1. Because the length of the true deletion is not known, some existing split-read mapping methods (e.g. [10, 14]) have a parameter on the maximum distance to search for the second (i.e. the soft-clipped) segment. Instead of searching a large region, SVseq2 searches only a focal region, guided by spanning pairs. Our goal here is to infer where the soft-clipped segment is likely to start (i.e. the likely range of c). Our first observation is: even with low-coverage sequence data, a deletion is still likely to have at least one paired-end read whose two ends are located on different sides of the deletion (i.e. a spanning pair). Suppose a read pair whose two ends are mapped to [s_1, e_1] and [s_2, e_2] respectively on the reference genome (where s_1 < e_1 < s_2 < e_2) is a spanning pair for the deletion, whose location is determined by the mapping of the soft-clipped segment of the split read. Let l_i be the expected insert size and σ the standard deviation of the insert size. Note that l_i measures the outer distance of the pair (i.e. the distance between the two farthest points of the two reads). We denote the lengths of the two reads of the spanning pair as l_1 and l_2 respectively. Let m_d be the minimum deletion size to be detected; SVseq2 sets m_d to 50.
We first show where to find spanning pairs for a given split read.
Lemma 1 For the type I pattern, s_2 ≥ a, and with high probability, we have s_2 ≤ a - l_1 - l_2 + l_i + 3σ.
Proof 1. If s_2 < a, then a is not a breakpoint, which contradicts our underlying assumption that the mapped segment [a, b] corresponds to a deletion.
To give an upper bound on s_2, note that a is the position of the right breakpoint, so the rightmost position of e_1 on the reference is a - l_del, where l_del is the length of the deletion. With high probability, the distance between s_2 and e_1 on the reference is at most l_i - l_1 - l_2 + 3σ + l_del. So with high probability, s_2 ≤ (a - l_del) + (l_i - l_1 - l_2 + 3σ + l_del) = a - l_1 - l_2 + l_i + 3σ.
Lemma 1 states where the spanning pairs are very likely to be located. For a given split read, SVseq2 searches for reads mapped on the reverse strand within this region for spanning pairs.
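Lemma 1's search window can be written directly as a small helper (an illustrative sketch; variable names follow the text, not SVseq2's code):

```python
# Sketch of the Lemma 1 search window for spanning pairs (type I
# pattern).  a: left end of the mapped segment; l1, l2: read lengths;
# li: mean insert size; sigma: its standard deviation.

def spanning_pair_window(a, l1, l2, li, sigma):
    """Range of s2 (start of the right-hand mate) to scan: s2 >= a,
    and with high probability s2 <= a - l1 - l2 + li + 3*sigma."""
    return a, a - l1 - l2 + li + 3 * sigma

lo, hi = spanning_pair_window(a=10_000, l1=100, l2=100, li=200, sigma=50)
print(lo, hi)  # window width is li - l1 - l2 + 3*sigma = 150
```

With the example parameters used later in the text (l_1 = l_2 = 100, l_i = 200, σ = 50), the window is only 150 bp wide.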
Now suppose we find one spanning pair for the given split read. Recall that the spanning pair is mapped to [s_1, e_1] and [s_2, e_2]. The following lemma specifies the range of c (i.e. the starting point of the soft-clipped segment).
Lemma 2 For the type I pattern, e_1 - l_s ≤ c ≤ a - m_d - l_s. Moreover, with high probability, we have c ≤ e_1 + l_i - l_s - l_1 - l_2 + 3σ.
Proof 2. Note that the rightmost position at which a deletion can end is m_d bases to the left of a on the reference, because the minimum deletion size is m_d. So c ≤ a - m_d - l_s. Since the spanning pair ([s_1, e_1], [s_2, e_2]) spans the deletion, the deletion must occur to the right of [s_1, e_1]. The leftmost position of the deletion is thus at least e_1. Since the soft-clipped segment of length l_s is mapped to the left from the left end of the deletion, we have c + l_s ≥ e_1.
We now estimate how large c can be. Note that on the alternative chromosome (the chromosome carrying the deletion), the left and right breakpoints of the deletion coincide, and the left breakpoint must lie to the left of the starting position of the right end of the spanning pair. Thus, on the reference chromosome, with high probability, the left breakpoint is at most e_1 + l_i - l_1 - l_2 + 3σ.
Lemma 2 states that we only need to search for the second segment of the split read within the region [e_1 - l_s, min(e_1 + l_i - l_s - l_1 - l_2 + 3σ, a - m_d - l_s)]. We call this region the "focal" region for the split read being mapped and a spanning pair. For most current sequence data, the focal region is relatively small. For example, suppose l_s = 50 (taken from a read 100 bps long), l_1 = l_2 = 100, l_i = 200 and σ = 50. Then the width of the focal region is not larger than 200. This is much smaller than the region that plain split-read mapping would have searched (which can be as long as 1 Mbps). Also, from Lemma 1, the width of the region for spanning pairs is at most 150.
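The Lemma 2 focal region, including the worked example above, can be checked with a short sketch (names follow the text; this is not SVseq2's code):

```python
# Sketch of the Lemma 2 focal region for the soft-clipped segment
# (type I pattern).  e1: right end of the left-hand mate of the
# spanning pair; a: left end of the mapped segment; ls: soft-clipped
# length; md: minimum deletion size.

def focal_region_type1(e1, a, ls, l1, l2, li, sigma, md=50):
    lo = e1 - ls
    hi = min(e1 + li - ls - l1 - l2 + 3 * sigma, a - md - ls)
    return lo, hi

# Example parameters from the text: ls = 50, l1 = l2 = 100,
# li = 200, sigma = 50 (e1 and a are made-up positions).
lo, hi = focal_region_type1(e1=5_000, a=20_000, ls=50,
                            l1=100, l2=100, li=200, sigma=50)
print(hi - lo)  # -> 150, within the 200-bp bound given in the text
```

The width never exceeds l_i - l_1 - l_2 + 3σ, independent of the (unknown) deletion length, which is why no maximum deletion size parameter is needed.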
The processing of split reads with the type II pattern is similar in many respects to that of the type I pattern. A main difference is that the type II pattern does not need additional spanning pairs, because the paired-end read itself is a spanning pair. This imposes an additional constraint on the focal region. Suppose the mapped segment of the split read is located at [a, b] and the mapped anchor at [s, e] (where a < b < s < e, as shown in Figure 2). Let [c, d] be the location of the soft-clipped segment of the split read on the reference (where b < c < d), and let l_s = d - c + 1 be its length. Let l_1 and l_2 be the lengths of the two reads (i.e. l_2 = e - s + 1). m_d, l_i and σ are defined as before. The following lemma specifies where the soft-clipped segment is allowed to map.
Lemma 3 b + m_d ≤ c ≤ s. Moreover, with high probability, we have s - l_i - l_s + l_1 + l_2 - 3σ ≤ c ≤ s.
Proof 3. Note that the leftmost position at which a deletion can start is m_d bases to the right of b on the reference, because the minimum deletion size is m_d. Also, the breakpoint cannot lie to the right of s, because there is no split on the anchor.
Note that the position of soft-clipped segment is constrained by the anchor position and the insert size. So the second inequality follows the same reasoning as in Lemma 2.
Lemma 3 states that for the type II pattern we only need to search for the second segment of the split read within the focal region [max(b + m_d, s - l_i - l_s + l_1 + l_2 - 3σ), s]. For split reads on the reverse strand, the method is essentially the same as for the forward strand.
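The Lemma 3 focal region for the type II pattern can be sketched in the same way (illustrative only; names follow the text):

```python
# Sketch of the Lemma 3 focal region (type II pattern).  b: right end
# of the mapped segment of the split read; s: start of the anchor;
# ls: soft-clipped length; md: minimum deletion size.

def focal_region_type2(b, s, ls, l1, l2, li, sigma, md=50):
    lo = max(b + md, s - li - ls + l1 + l2 - 3 * sigma)
    return lo, s

# Made-up positions with the same library parameters as before.
lo, hi = focal_region_type2(b=4_000, s=4_200, ls=50,
                            l1=100, l2=100, li=200, sigma=50)
print(lo, hi)
```

Here the anchor position s itself caps the right end of the region, so again no maximum deletion size needs to be specified.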
Our experience indicates that the type I pattern is usually more reliable than the type II pattern, because fewer errors are expected at the head of Illumina reads. Thus SVseq2 gives the type I pattern a higher weight than the type II pattern when calling deletions. The weight of the type I pattern is set to 3, and the weight of the type II pattern to 1. SVseq2 uses a cutoff value on the total weight (i.e. the sum of the weights of the supporting reads for a deletion). The default cutoff value is 3, i.e. at least one type I pattern read is required, or at least three type II pattern reads when there is no type I pattern read.
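The weighting scheme can be sketched as follows (an illustrative stand-in; the pattern-type tags are an assumed representation):

```python
# Sketch of the weighted-support cutoff described above.  Each
# supporting read is tagged with its pattern type (assumed encoding).

WEIGHTS = {"type_I": 3, "type_II": 1}

def passes_cutoff(supporting_reads, cutoff=3):
    total = sum(WEIGHTS[p] for p in supporting_reads)
    return total >= cutoff

print(passes_cutoff(["type_I"]))                         # one type I read suffices
print(passes_cutoff(["type_II", "type_II"]))             # two type II reads do not
print(passes_cutoff(["type_II", "type_II", "type_II"]))  # three do
```

With the default cutoff of 3, a single type I read or three type II reads are enough to call a deletion.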
In practice, there may be more than one spanning pair for a candidate deletion (corresponding to a split read). When the deletion is heterozygous in a diploid genome, some spanning pairs may originate from the copy without the deletion and others from the copy with the deletion. Some spanning pairs may also be due to mapping errors. One possible scheme is to find a "consensus" focal region by combining the information provided by multiple spanning pairs. SVseq2 simply takes the union of the focal regions from all possible spanning pairs: because there could be mapping errors among the spanning pairs, SVseq2 takes a conservative estimate of the focal region. Our experience shows that the overall focal region is still relatively small, and searching for the split read can be performed efficiently.
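Taking the union of the focal regions amounts to a standard interval merge; the following is an illustrative sketch, not SVseq2's code:

```python
# Sketch: union of focal regions from multiple spanning pairs,
# merging overlapping or adjacent (start, end) intervals.

def union_regions(regions):
    merged = []
    for lo, hi in sorted(regions):
        if merged and lo <= merged[-1][1] + 1:
            # overlaps or touches the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

print(union_regions([(4950, 5100), (5050, 5200), (6000, 6100)]))
# -> [(4950, 5200), (6000, 6100)]
```

The merged result is searched once per split read, so even several spanning pairs add little extra work.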
SVseq2 relies on a pairwise sequence alignment algorithm to align two overlapping reads. The parameters are the same as in the deletion case. If the alignment score divided by the length of the overlap is less than 0.1, the pair is treated as evidence of a possible insertion. The default cutoff on the number of reads supporting an insertion is 3. That is, at least one other read in the region must have the same split and pass the alignment test with the read of this pair on the opposite strand.
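The score-per-overlap test can be sketched as follows. This is an illustrative reading of the criterion as stated; the stand-in scorer (+1 per match, -1 per mismatch) is an assumption, not SVseq2's aligner.

```python
# Sketch of the score-over-overlap test for insertion evidence.
# align_score is a trivial stand-in for the pairwise aligner.

def align_score(a, b):
    # assumed scoring: +1 per match, -1 per mismatch over the overlap
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def suggests_insertion(read_a, read_b, threshold=0.1):
    overlap = min(len(read_a), len(read_b))
    score = align_score(read_a[:overlap], read_b[:overlap])
    return score / overlap < threshold

print(suggests_insertion("ACGTACGT", "ACGTACGT"))  # well aligned: no evidence
print(suggests_insertion("ACGTACGT", "TGCATGCA"))  # poor overlap: evidence
```

A low normalized score means the overlapping portions disagree, which is consistent with novel inserted sequence rather than reference sequence.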
We apply SVseq2 to both simulated and real datasets, comparing accuracy and efficiency with SVseq1  and Pindel 0.2.4d . For deletion finding, the three methods are run on simulated population data and on real individual and pooled data. For insertion finding, simulated individual data is used. The real sequence datasets (20101123 Illumina data) consist of the alignment files of 18 individuals on chromosome 20. Nine of the individuals are from the CEU population and the others from the YRI population. These alignments were produced by BWA with soft-clips on NCBI human genome build 37. Accuracy is evaluated against the results released by the 1000 Genomes Project at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110719_merged_sv_calls/. The results contain assembled deletions and deletions found by five SV detection tools over more than 1000 individuals (including the ones used in this paper). The methods include BreakDancerMax1.1 , CNVnator , GenomeStrip v1.04 , EMBL/Delly and Pindel . Since not all of these methods provide exact breakpoints of deletions, accuracy is evaluated using both a strict criterion and a less strict criterion. A called deletion is considered correct under the strict criterion if its length is the same as that of a deletion in the benchmark. The less strict criterion only requires that the called deletion overlaps a deletion in the benchmark and that at least 50% of its bases are supported.
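The two evaluation criteria can be sketched as follows (an illustrative reading of the criteria as stated; intervals are inclusive (start, end) pairs):

```python
# Sketch of the two evaluation criteria.  Strict: the called deletion
# has the same length as some benchmark deletion.  Less strict: it
# overlaps a benchmark deletion and >= 50% of its bases are supported.

def strict_match(call, benchmark):
    clen = call[1] - call[0] + 1
    return any(b[1] - b[0] + 1 == clen for b in benchmark)

def loose_match(call, benchmark, frac=0.5):
    clen = call[1] - call[0] + 1
    for b in benchmark:
        overlap = min(call[1], b[1]) - max(call[0], b[0]) + 1
        if overlap > 0 and overlap >= frac * clen:
            return True
    return False

bench = [(100, 199)]                   # one 100-bp benchmark deletion
print(strict_match((100, 199), bench))  # exact length match
print(loose_match((150, 249), bench))   # 50 of 100 called bases supported
```

The loose criterion accommodates benchmark calls without base-pair breakpoint resolution, at the cost of counting imprecise calls as correct.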
The simulated datasets with read length 100 from  are used in this paper to compare SVseq2 with SVseq1 and Pindel in terms of accuracy and sensitivity. The datasets are simulated from the sequence of chromosome 15 (100,338,915 bps in length) of NCBI human genome build 36, the version of the genome on which the copy number variation release of the 1000 Genomes Project  is based. The deletions of the 45 individuals from the CEU population reported by  are introduced into the simulated data (union.2010_06.deletions.genotypes.vcf.gz). Since the haplotypes of the deletions are not inferred in the file, for heterozygous deletions we arbitrarily place each such deletion on one of the two haplotypes of an individual. Since the deletions are usually far apart, this should have little effect on the accuracy of the simulation. The tool wgsim https://github.com/lh3/wgsim is used with the "-h" option to generate paired-end reads from the two genome copies of an individual. Single nucleotide polymorphisms and small indels on each genome are simulated using the default parameters. All datasets are generated with a base error rate of 2%. Paired-end reads are simulated with read length 100 and outer distance 500. Three datasets with coverage 3.2×, 4.2× and 6.4× are used. BWA, which provides soft-clips, is used with default parameters to map the simulated paired-end reads to the entire NCBI human genome 36.
Comparison of SVseq2, SVseq1 and Pindel in simulation.
Comparison of SVseq2, SVseq1 and Pindel using real individual data.
Using individual sequence data, SVseq2 is able to utilize split reads to call more deletions than SVseq1 and Pindel even when the coverage is very low. With cutoff value 3, SVseq2 finds the largest number of deletions, a large portion of which is supported by the benchmark. With a higher cutoff value of 4, most of the called deletions are supported by the benchmark, and the number of calls is still larger than for SVseq1 and Pindel.
Comparison of SVseq2, SVseq1 and Pindel using real pooled data.
Using pooled data, all three methods find more deletions than with individual data. SVseq2 finds more deletions using cutoff value 3, but the false positive rate increases too. A sizable portion of the deletions found by SVseq2 with cutoff value 3 are missed with cutoff value 4. Even when pooling nine individuals, many less frequent deletions still belong to single individuals. Because the sequence coverage is low, often only one split read with soft-clip mapping covers such a deletion (recall that cutoff value 3 means one type I read suffices). The quality of the soft-clip mapping provided by the mapping tool matters for finding SVs: if a mapping tool fails to perform soft-clip mapping on a split read, that read is not used by SVseq2. By pooling data from more individuals, more deletions are likely to be found by SVseq2.
There are fewer insertion finding methods than deletion finding methods, and fewer insertions have been called and released than deletions. To simulate insertions, the release (CEU.trio.2010_06.novelsequences.sites.vcf) for the NA12878 individual is used in this paper. This individual has been sequenced at high coverage, and the 1000 Genomes Project has released some inserted sequences for it. Chromosome 4 of NCBI human genome 36 is used in the simulation, since it carries 13 inserted sequences in the release for this individual, the highest number of any chromosome. Each insertion is treated as heterozygous and added to an arbitrary haplotype. Illumina reads at 20× coverage (so that each inserted sequence has 10× coverage) are simulated using wgsim https://github.com/lh3/wgsim with insert size 230 and read length 100. The reads are mapped using BWA. Both SVseq2 and Pindel 0.2.4d are tested for insertion finding. SVseq2 finds 10 insertions with 9 true positives. Pindel finds 2 insertions, both of which are correct: one is reported as "LI" and the other as "SI". The large insertion reported by Pindel as "LI" has 6 supporting split reads, 3 from the forward strand and 3 from the reverse strand. Even at 20× coverage, split reads of the type III pattern are not common in this simulation. This simulation shows that SVseq2 is able to call insertions with fewer supporting reads.
Comparison of SVseq2, SVseq1 and Pindel on running time.
[Table: running-time comparison; only fragments survive extraction — a "Max Event Size" column (one entry: 1,000,000), running times of 1 m 48 s, 8 m 40 s and 7 m 40 s, and supported SV types (DEL, INS, TD). The full table is not recoverable from the extracted text.]
There are four types of methods that use high-throughput sequencing data to call SVs. Read depth and assembly methods usually need data with higher coverage. Read pair and read depth methods are not able to find exact breakpoints of SVs. Split-read mapping methods may find exact breakpoints of some SVs with low-coverage data, but split-read mapping alone usually leads to significant false positives. Combining split-read mapping with other types of methods may increase the power of finding SVs. In this paper we describe an improved split-read mapping method to call SVs using low-coverage sequence data. We show that, by using read pairs with discordant insert sizes, split-read mapping can be reduced to mapping a segment of a split read within a focal region. Using the lemmas in the Methods section, we show that the length of the focal region can be much smaller than the maximum deletion size. Mapping split reads within a small focal region reduces the chance that a segment is aligned to incorrect positions, and thus leads to both higher accuracy and shorter running time. Applied to several datasets, SVseq2 outperforms some other methods in both accuracy and efficiency, and it is especially powerful on very low coverage sequence data.
The split-read mapping approach in SVseq2 can still be improved, e.g. to better model the error patterns of high-throughput sequencing data. For situations where there are repeats in focal regions, insert size analysis might help in finding the correct mapping.
The program SVseq2 can be downloaded at http://www.engr.uconn.edu/~jiz08001/.
All the authors are supported by National Science Foundation grant IIS-0953563.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S6.