NextSV: a computational pipeline for structural variation analysis from low-coverage long-read sequencing

Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, users are often faced with issues such as what coverage is needed and how to optimally use the aligners and SV callers. Here, we developed NextSV, a meta SV caller and a computational pipeline to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive / stringent) for different analysis purpose. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 value) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5%~94.1% for deletions and 87.9%~93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.


Introduction
identifies genomic variants via two algorithms, long-read discordance (PBHoney-Spots) and 70 interrupted mapping (PBHoney-Tails). Sniffles is a SV caller written in C++ and it detects SVs 71 using evidence from split-read alignments, high-mismatch regions, and coverage analysis. 72 To address these challenges, we developed NextSV, an automated SV detection pipeline 82 integrating multiple tools. NextSV automatically execute these software tools with optimized 83 parameters for the specific coverage that user specified, then integrates results of each caller and 84 generates a sensitive call set and a stringent call set, for different analysis purpose. 85

Performance of SV calling on different coverages of the NA12878 Genome 113
To determine the optimal coverage for SV detection on PacBio data, we evaluated the performance 114 of NextSV under several different coverages. We downloaded a recently published PacBio data 115 set of NA12878 (Pendleton et al., 2015) and down-sampled the data set to 2X, 4X, 6X, 8X, 10X, 116 12X, and 15X. SV calling was performed using NextSV under each coverage. All supported 117 aligner/SV caller combinations were run. At least two supporting reads is required for all SV calls. 118 The resulting calls were compared with the gold standard SV set (including 2094 deletion calls 119 and 1114 insertion calls) described in method section. 120 121 First, we examined how many calls in the gold set can be discovered. As shown in Figure 1, the 122 recall increased rapidly before 10X coverage but the slope of increase slowed down after 10X. slightly. Therefore, 10X coverage might be an optimal coverage to use in practice, considering the 134 relatively high sequencing costs and the generally high recall rates. 135

Performance of SV calling on different coverages on the HX1 Genome 144
To verify the performance of SV detection on different individuals, we also performed evaluation 145 on a Chinese genome HX1, which was sequenced by us recently (Shi et al., 2016) at 103X PacBio 146 coverage. The genome was sequenced using a newer version of chemical reagents and thus the 147 mean read length of HX1 was 40% longer than that of NA12878 (Table 1). The total data set was 148 down-sampled to three representative coverages (6X, 10X and 15X). For each coverage, SVs were 149 called using the four pipelines described above and compared to the gold standard set. The results 150 were similar to those of the NA12878 data set (Figure 3). At 10X coverage, NextSV sensitive call 151 set had a recall of 94.1% for deletions and 93.2% for insertions, highest among all the call sets. of AJ son, AJ father and AJ mother was down-sampled to 10X coverage. SV detection was 165 performed using NextSV with all supported aligners and SV callers enabled. The calls from AJ 166 son were compared with calls from AJ father and AJ mother. The results showed that, NextSV sequencing data set of HX1 (10X coverage) for benchmarking. All aligners and SV callers in 174 NextSV were tested using a machine equipped with 12-core Intel Xeon 2.66 GHz CPU and 48 175 Gigabytes of memory. As shown in Table 5, mapping is the most time-consuming step. BLASR 176 takes about 80 hours to map the reads, whereas NGMLR needs 11.2 hours, which is the fastest 177 among the three aligners. The SV calling step is much faster. PBHoney-Spots and Sniffles take 178 about 1 hour, while PBHoney-Tails needs 0.27 hour. In total, the BLASR / PBHoney combination 179 takes 80.8 hours while the NGMLR / Sniffles combination takes 12.5 hours, 84.5% less than the 180 former one. Since BLASR/PBHoney-Spots and NGMLR / Sniffles have good performance on SV 181 calling and running PBHoney-Tails is very fast given the BLASR output, the NextSV pipeline will 182 execute the three methods by default for generating the final results. 183 184

185
Long-read sequencing such as PacBio sequencing has clear advantages over short-read sequencing 186 on SV discovery (English et al., 2015). However, its application in real-world setting is often 187 limited due to the relatively high sequencing cost and hence the relatively low sequencing coverage. 188 In this study, we developed NextSV, a computational pipeline integrating multiple aligners and 189 SV callers to improve SV discovery on low-coverage PacBio data sets. Our results showed that, In addition to test recalls and precisions, we examined the allele drop-in errors, which represent 214 the SV calls that in the offspring but not appear in either parent. The allele drop-in errors can come 215 from two sources: false positive calls of the offspring or false negatives in the parents, though in 216 very rare cases it could be due to de novo mutations. So this measure is related to both recall and 217 precision. Since we consider 10X coverage as a good choice, we did the evaluation on a family 218 trio data set with ~10X coverage. In our results, NextSV stringent call set has the lowest allele 219 drop-in error, which is consistent with the results that it has the highest F1 value. 220 221 NextSV currently supports four aligner / SV caller combinations: BLASR / PBHoney-Spots, 222 BLASR / PBHoney-Tails, BWA / Sniffles, NGMLR / Sniffles, but we expect to continuously 223 expand the support for other aligner / caller combinations. Users can choose to run any of them. 224 By default, NextSV will enable BLASR / PBHoney-Spots, BLASR / PBHoney-Tails and NGMLR 225 / Sniffles and integrate the results to generate the sensitive calls and stringent calls. We do not 226 enable BWA / Sniffles by default because Sniffles works better with NGMLR in our evaluation 227 and alignment is a time consuming step. SVs that are shorter than reads may result in intra-read 228 discordances while larger SVs may result in soft-clipped tails of long reads. We suggest running 229 both PBHoney-Spots and PBHoney-Tails because they are two complementary algorithms 230 designed to detect intra-read discordances and soft-clipped tails, respectively. Sniffles uses the gold standard calls of insertions and deletions. This is another limitation of the study. We will 235 evaluate the performance on other types of SVs in the future when more gold standard SV calls 236 are available. Nonetheless, NextSV generates SV calls of all types. The output of NextSV is in 237 ANNOVAR-compatible bed format. Users can easily perform downstream annotation using 238 ANNOVAR and disease gene discovery using Phenolyzer (Yang et al., 2015). NextSV is available 239 at http://github.com/Nextomics/NextSV and can be installed by one simple command. We believe 240 that NextSV will facilitate the detection of structural variants from low coverage long-read 241 sequencing data. 242 243

PacBio data sets used for this study 245
Five whole-genome PacBio sequencing data sets were used to test the performance of SV calling 246 pipelines (Table 1). Data sets of NA12878 and HX1 genome were downloaded from NCBI SRA 247 database. Data sets of the AJ family trio were downloaded from ftp site of NIST (ftp://ftp-248 trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/). After we obtained raw data, we extracted 249 subreads (reads that can be used for analysis) using the SMRT Portal software (Pacific Biosciences,  Table 1. to the availability of high-coverage (>100X) data, we used the SV calls from a previously validated detected SVs on 100X coverage PacBio data set of the HX1 genome using BLASR / PBHoney-296 Spots, BLASR / PBHoney-Tails, BWA / Sniffles and NGMLR / Sniffles (minimal read support=20 297 for each SV caller). The initial high-quality calls that overlapped with one of the four 103X call 298 sets were retained as final gold standard calls. SVs with length less than 200 bp were not considered. 299 Number of SVs in the gold standard sets is shown in Table 2. 300 301

Performance Evaluation of SV callers 302
The SV calls of each caller were compared with the gold standard SV set. Precision, recall, and F1 303 score were used to evaluate the performance of the callers. Precision, recall, and F1 were calculated The authors wish to thank the National Institute of Standards and Technology and Genome in a 315 Bottle Consortium for making the reference data on PacBio sequencing available to benchmark 316 bioinformatics software tools. We also thank members of Grandomics to test the software tools 317 and offering valuable feedback.