Overview of this study
We produced abundant simulated error-prone long-read datasets with diverse sequencing settings and comprehensively evaluated SV calling performance with a series of state-of-the-art long-read-based SV callers. The three major steps of this approach are as follows.
- Step 1: Generate simulated synthetic diploid long-read datasets with various sequencing attributes (i.e., sequencing coverage, read length, and error rate).
- Step 2: Discover structural variants using state-of-the-art SV callers and an ensemble method.
- Step 3: Comprehensively evaluate SV calling performance across the various sequencing datasets to determine the sequencing settings that achieve satisfactory performance.
The complete workflow of the study is shown in Fig. 1, and more details are provided in the Methods section.
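As a minimal sketch of the Step 1 design, the following pseudocode enumerates the simulated settings evaluated below; the `simulate_reads` wrapper and its interface are hypothetical stand-ins for the actual read simulator, not the study's own code.

```python
# Sequencing attribute grids evaluated in this study.
COVERAGES   = [3, 5, 10, 20, 30, 40, 50]                   # fold sequencing depth
LENGTHS_KBP = [1, 2.5, 5, 7.5, 10, 15, 20, 50, 100, 500]   # mean read length (kbp)
ERRORS_PCT  = [20, 15, 12.5, 10, 7.5, 5, 2.5, 1, 0.2]      # sequencing error rate (%)

def simulate_reads(coverage, length_kbp, error_pct, prefix):
    """Placeholder: invoke the long-read simulator with these settings."""
    print(f"simulate {prefix}: {coverage}x, {length_kbp} kbp, {error_pct}% error")

# Each attribute is varied one at a time while the other two are held fixed
# (coverage sweep at 20 kbp/10%; length sweep at 50x/10%; error sweep at 50x/20 kbp).
for cov in COVERAGES:
    simulate_reads(cov, 20, 10, f"cov_{cov}x")
for length in LENGTHS_KBP:
    simulate_reads(50, length, 10, f"len_{length}kbp")
for err in ERRORS_PCT:
    simulate_reads(50, 20, err, f"err_{err}pct")
```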
Evaluation of the sequencing coverage impact on SV calling
We first benchmarked SV calling performance on datasets with 3×, 5×, 10×, 20×, 30×, 40×, and 50× sequencing depths (20 kbp average read length and 10% error rate) to establish a baseline assessment. We selected six state-of-the-art SV callers, i.e., cuteSV [26], NanoSV [27], NanoVar [28], Sniffles [4], SVIM [29], and PBSV (https://github.com/PacificBiosciences/pbsv), to assess SV detection ability at different sequencing depths. The results in Fig. 2 and Additional file 1: Table S1 show that the average F1 score increased with sequencing coverage, growing by approximately 0.152 in total from 3× to 50×. MCC kept pace with the F1 score and increased by 0.143 over the same coverage range. Notably, at 20×, the three best SV callers achieved excellent performance, i.e., cuteSV (F1: 0.800, MCC: 0.620), SVIM (F1: 0.798, MCC: 0.597), and Sniffles (F1: 0.769, MCC: 0.578), which indicates that a relatively low depth of 20× enables SV calling approaches to achieve high performance (Additional file 1: Table S2). Similarly, the recall rate also increased with sequencing coverage, improving by 0.173 overall from 3× to 50×. The three best tools mentioned above also surpassed 0.65 recall on 20× sequencing data, i.e., cuteSV 0.719, SVIM 0.776, and Sniffles 0.664 (Additional file 1: Tables S1-2). However, precision decreased as sequencing depth rose, declining by 0.095 overall from 5× to 50×. On the whole, higher-coverage data enabled the SV callers to uncover substantially more SVs, at the cost of a slight loss of precision.
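For reference, the precision, recall, F1 score, and MCC reported throughout are the standard definitions, written here in our own notation (TP, FP, TN, and FN denote true positive, false positive, true negative, and false negative SV calls):

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad
\mathrm{MCC} = \frac{TP\cdot TN - FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```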
To obtain a reliable minimum estimate of sequencing coverage, we computed the trend curve between F1 scores and sequencing depths for each method. All fitted trend curves took the form of quadratic polynomials and are shown in Additional file 1: Table S3. Generally, simulated long-read data at 30× to 40× coverage represented the best setting, enabling the SV calling methods to achieve their best F1 scores; e.g., cuteSV, Sniffles, and SVIM obtained F1 scores of 0.80 at coverages of 34×, 41×, and 32×, respectively. In addition, correlation coefficients over 0.92 indicate the high reliability of these predictions. More importantly, 20× is evidently sufficient for the better-performing tools to complete SV calling; e.g., cuteSV, Sniffles, and SVIM each achieved an F1 score over 0.75.
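To illustrate how such a quadratic trend curve yields a recommended coverage, the sketch below fits a parabola to (coverage, F1) pairs and reads off its vertex; the data points are invented placeholders, not the study's measurements.

```python
import numpy as np

# Hypothetical (coverage, F1) observations for one caller; placeholder values only.
coverage = np.array([3, 5, 10, 20, 30, 40, 50], dtype=float)
f1       = np.array([0.62, 0.66, 0.72, 0.78, 0.80, 0.80, 0.79])

# Fit F1 ~ a*x^2 + b*x + c, as in the quadratic trend curves of Table S3.
a, b, c = np.polyfit(coverage, f1, deg=2)

# For a concave fit (a < 0), the F1 score peaks at the vertex x = -b / (2a).
best_cov = -b / (2 * a)
best_f1  = np.polyval([a, b, c], best_cov)

# Pearson correlation between fitted and observed values gauges fit quality.
r = np.corrcoef(np.polyval([a, b, c], coverage), f1)[0, 1]
print(f"best coverage ~ {best_cov:.0f}x, predicted F1 ~ {best_f1:.2f}, r = {r:.2f}")
```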
Then, we assessed the impact of sequencing coverage on calls of the various SV types; the results are shown in Additional file 1: Figures S1A to S1D. More deletions and insertions were accurately detected on deep-coverage datasets, with F1 scores increasing by 0.170 and 0.148, respectively, from 3× to 50×. The F1 scores for duplications and inversions also grew (by 0.041 and 0.016, respectively), although these increases were modest compared with those for deletions and insertions (Additional file 1: Table S1). This was mainly due to the greater complexity and larger size of the duplicated and inverted calls in our ground truth sets; discovering these SVs depends more on improvements in read length and sequencing accuracy. Furthermore, almost all SV types achieved satisfactory performance on the 20× dataset, especially duplications, for which a clear peak appears at approximately 20× (Additional file 1: Figure S1C).
We also evaluated the impact of sequencing coverage on six size scales of calls: 50–99 bp, 100–499 bp, 500 bp–1 kbp, 1–5 kbp, 5–10 kbp, and > 10 kbp. From the results shown in Additional file 1: Figures S1E to S1J, it is evident that the higher the sequencing depth, the greater the F1 score, and that the growth rate of F1 slowed significantly on datasets over 20× (Additional file 1: Table S2). On the other hand, large SVs depended more on higher-coverage datasets than small- and medium-sized SVs did. This is plausible because higher-coverage data contain more long reads and thus provide stronger signals for larger SVs.
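For clarity, the six size scales can be expressed as a simple binning function; the handling of bin edges (e.g., whether 1,000 bp falls in the third or fourth bin) is our assumption, not taken from the study.

```python
def sv_size_bin(svlen_bp: int) -> str:
    """Map an SV length in bp to one of the six size scales evaluated here."""
    bins = [
        (50, 99, "50-99 bp"),
        (100, 499, "100-499 bp"),
        (500, 999, "500 bp-1 kbp"),
        (1_000, 4_999, "1-5 kbp"),
        (5_000, 9_999, "5-10 kbp"),
    ]
    for lo, hi, label in bins:
        if lo <= svlen_bp <= hi:
            return label
    return ">10 kbp" if svlen_bp >= 10_000 else "<50 bp (not evaluated)"
```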
In addition, we further benchmarked SV calling performance involving genotypes to determine the impact of sequencing coverage in more detail (Fig. 2). The trend remained the same for each approach between genotyped SV calls (SVs with genotypes) and ungenotyped SV calls (SVs without genotypes); that is, higher-coverage simulated data also promoted growth of the F1 score in calling genotyped SVs. In addition, 20× is still a recommendable sequencing depth for the detection of genotyped SVs, since SV callers such as cuteSV achieved F1 scores over 0.761 at this coverage (Additional file 1: Table S2). Moreover, we calculated the difference in F1 score for each SV caller between its genotyped and ungenotyped SV calls. The F1 scores for SV calls involving genotypes were approximately 0.04–0.12 lower than those for presence-only calls (Additional file 1: Table S2). Hence, genotyping remains a major challenge for SV calling even at high sequencing coverage.
Evaluation of the read length impact on SV calling
We then applied 10 simulated long-read datasets with mean read lengths of 1 kbp, 2.5 kbp, 5 kbp, 7.5 kbp, 10 kbp, 15 kbp, 20 kbp, 50 kbp, 100 kbp, and 500 kbp (50× coverage and 10% sequencing error rate) to evaluate the SV detection ability of the callers. In this benchmark, we excluded NanoSV and NanoVar due to their poor performance. The results are shown in Fig. 3 and Additional file 1: Tables S4-5. The F1 score evidently increased with read length, by approximately 0.107 from 1 to 500 kbp. Almost all tools performed well on datasets with read lengths from 20 to 50 kbp. Considering the higher cost of longer reads, 20 kbp is a promising choice that is sufficient to detect SVs accurately and sensitively; e.g., cuteSV and Sniffles had F1 scores surpassing 0.79, SVIM followed closely at approximately 0.78, and the average F1 score of the four SV callers was approximately 0.770. The MCC was consistent with F1 and increased with read length; at 20 kbp, the average MCC was 0.554, indicating a moderately positive linear relationship. With increasing read length, the recall rate improved by approximately 0.146 from 1 to 500 kbp; at 20 kbp, cuteSV and Sniffles surpassed 0.71, and SVIM performed better at approximately 0.82. However, precision was almost stable with respect to read length, averaging 0.821 overall. To a certain extent, longer reads clearly increased the F1 score and recall and even slightly improved precision. However, compared with coverage, the improvement in SV detection across different read lengths was smaller, indicating that coverage has a greater impact than read length. When we fitted the trend curve for each tool across read lengths, the correlation coefficients were too low to support a reliable recommendation on our specific simulated datasets; more in-depth analysis with more introduced SVs would be needed to predict the best read length for each tool (Additional file 1: Table S3).
We then assessed the impact of read length on calling the various SV types. From Additional file 1: Figures S2A to S2C, the F1 scores for detecting deletions, insertions, and duplications increased almost equally, by approximately 0.1; on data with a 500 kbp mean read length, the F1 scores reached 0.839, 0.741, and 0.616, respectively (Additional file 1: Table S4). On 20 kbp read length data, detection performance for these three types was also satisfactory, with F1 scores of approximately 0.844, 0.738, and 0.530, respectively. However, the increase in F1 scores for inversions was not evident (Additional file 1: Figure S2D). In other words, read length greatly influenced deletion, insertion, and duplication calling but not inversion calling.
For different SV sizes, the results were similar to those for sequencing coverage: the longer the read length, the higher the F1 score. Specifically, from Additional file 1: Figures S2E to S2J, the F1 scores of the six size scales increased by 0.103, 0.103, 0.161, 0.256, 0.350, and 0.055 (Additional file 1: Table S5). Evidently, the larger the SVs, the greater the benefit from longer reads. These improvements arose mainly because longer reads had greater potential to completely span an SV region and thus provide clearer SV signatures for discovering larger SVs. However, the rate of increase declined, since at fixed coverage a longer read length reduces the number of reads and hence the absolute number of reads containing SV signatures. Although the improvement for SVs over 10 kbp was less apparent than for the other scales, their average F1 score surpassed 0.773, indicating that detection of longer SVs was already well resolved. Moreover, at a 20 kbp read length, detection at every SV size scale performed well, with F1 scores of 0.675, 0.770, 0.742, 0.789, 0.746, and 0.790 for the six scales (Additional file 1: Table S5). Overall, the 20 kbp dataset yielded satisfactory performance across the different size scales.
When genotypes were considered, simulated data with longer read lengths increased the F1 score significantly (Fig. 3). Similarly, a 20 kbp read length was still a satisfactory setting for detection; e.g., cuteSV reached a 0.765 F1 score at this read length (Additional file 1: Table S5). After calculating the difference in F1 scores between calls with and without genotypes, we found that the F1 scores for genotyped SV calls were approximately 0.03–0.14 lower than those for presence-only calls (Additional file 1: Table S4), and the difference narrowed as read length increased. These results show that longer reads can aid genotyping consistency.
Evaluation of the error rate impact on SV calling
In this section, we applied nine simulated datasets with 20%, 15%, 12.5%, 10%, 7.5%, 5%, 2.5%, 1%, and 0.2% error rates (50× sequencing coverage and 20 kbp read length) to evaluate the ability of the SV callers to detect various kinds of SVs with high performance. Note that the data with 1% and 0.2% sequencing errors were produced with the circular consensus sequencing (CCS, also known as 'HiFi') model to better match real highly accurate high-fidelity long reads; the 0.2% error rate dataset represents the latest HiFi reads [17]. From Fig. 4 and Additional file 1: Table S6, the F1 score increased by approximately 0.073 from a 20% to a 0.2% error rate. Notably, there was a sharp improvement around the 1% error rate (average F1 score of 0.796), demonstrating that datasets generated with the CCS/HiFi model allow the SV callers to perform much better than on the other datasets. Apart from CCS/HiFi reads, the F1 score and MCC increased as sequencing errors decreased from 20% to 2.5%. At 10–7.5% error rates, all four SV callers achieved excellent performance; e.g., Sniffles reached a 0.8 F1 score and cuteSV surpassed a 0.79 F1 score (Additional file 1: Table S7). The recall rate showed no significant increase, maintaining an average of 0.730. However, precision increased relatively noticeably as the error rate decreased; specifically, at a 20% error rate the precision was only 0.804, but it surpassed 0.87 at 1% and 0.2% error rates, indicating that the SV callers detected SVs more precisely on more accurate long reads. Reducing the error rate thus significantly improved precision while preserving recall, thereby also improving the F1 score. Although overall performance improved, the improvement was not nearly as large as that obtained by changing the sequencing coverage or read length.
The trend curves in Additional file 1: Table S3 predicted that sequencing accuracies of approximately 1.04 and 1.00 would allow cuteSV and Sniffles to achieve their best F1 scores of approximately 0.81. Despite the high correlation coefficients, such accuracies are unattainable in practice, both technically and in terms of cost. Fortunately, a 10–7.5% error rate enabled SV calling that was sufficiently sensitive and accurate; specifically, Sniffles and cuteSV obtained F1 scores of 0.79 and 0.80, respectively.
Next, we assessed the impact of the error rate on the four SV types. From the results in Additional file 1: Figure S3 and Table S6, the F1 scores for deletions were stable across data with diverse sequencing error rates, averaging 0.842. In contrast, decreasing the sequencing error rate significantly improved insertion detection; at 10–7.5% error rates, the F1 score was nearly 0.742. In particular, on data with error rates of 1% and below, the average F1 score for discovering insertions surpassed 0.784, a marked improvement over the other datasets. Unfortunately, the F1 scores for duplications and inversions decreased by 0.047 and 0.079, respectively, as the error rate fell from 20% to 2.5%. If funds allow, data with less than 1% sequencing error are a good choice for discovering duplications and inversions precisely and sensitively. Nevertheless, the detection of inversions and duplications remains a bottleneck, and sequencing errors have a limited impact on their discovery.
We then evaluated the impact of the error rate on SVs across the six size scales. The F1 scores of most size scales were satisfactory, i.e., 0.680, 0.779, 0.736, 0.787, 0.759, and 0.787 for the six scales (Additional file 1: Figures S3E to S3J and Table S7). To some extent, detection of large-scale SVs is easier than that of small-scale SVs. It is worth noting that detection of 1–5 kbp SVs performed best on the data with 10–7.5% error rates among almost all datasets. In other words, a 10–7.5% error rate was sufficient to detect adequate SVs across almost all size scales. Moreover, higher performance was achieved on data with error rates below 1% for most size scales, especially for SVs shorter than 500 bp.
When genotypes were considered, the average F1 score of the four SV callers also increased as the sequencing error rate decreased. Notably, a 10–7.5% error rate was still a sufficient setting for SV calling; i.e., cuteSV surpassed 0.76 and Sniffles reached nearly 0.75 (Additional file 1: Table S7), and datasets with error rates below 1% also enabled most SVs to be discovered with high performance.
Evaluation of the ensemble SV calling
By combining and integrating multiple SV callsets, ensemble SV calling can improve on the variant concordance of individual SV callers. Here, we assessed the performance of ensemble SV calling performed by SURVIVOR [30] across the three sequencing attributes (i.e., coverage, read length, and error rate). The ensemble calling method produced increasingly accurate and sensitive SV callsets as coverage, read length, and accuracy increased. Even under poor sequencing attribute settings, the ensemble method still generated much better callsets than any single caller.
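A minimal sketch of producing such an ensemble callset with SURVIVOR's merge subcommand is shown below; the parameter values (breakpoint distance, minimum supporting callers, type/strand agreement, minimum SV size) follow SURVIVOR's documented example and are illustrative, not necessarily the settings used in this study.

```python
import subprocess

# List the per-caller VCFs to merge (one path per line), e.g. from cuteSV,
# Sniffles, SVIM, and PBSV runs on the same dataset.
vcfs = ["cutesv.vcf", "sniffles.vcf", "svim.vcf", "pbsv.vcf"]
with open("caller_vcfs.txt", "w") as fh:
    fh.write("\n".join(vcfs) + "\n")

# SURVIVOR merge <file_list> <max_dist> <min_callers> <use_type> <use_strand>
#                <estimate_dist> <min_size> <output>
# Illustrative values: merge calls within 1 kbp, require support from at least
# 2 callers agreeing on SV type and strand, and keep SVs >= 50 bp.
subprocess.run(
    ["SURVIVOR", "merge", "caller_vcfs.txt",
     "1000", "2", "1", "1", "0", "50", "ensemble.vcf"],
    check=True,
)
```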
Figure 5a and Additional file 1: Table S8 show that coverage was still a vital attribute, enabling the ensemble method to achieve greater sensitivity and accuracy on high-coverage data. On the 20× dataset, the ensemble method obtained a 0.772 F1 score, indicating that 20× remained an appropriate coverage setting for discovering SVs with excellent resolution. In addition, the fitted trend curve shown in Additional file 1: Table S9 supported the estimate that 36× coverage would yield the ensemble method's best SV results.
With increasing read length, the F1 score increased steadily, as it did for the single callers, averaging approximately 0.766 across read lengths from 1 to 500 kbp (Fig. 5 and Additional file 1: Table S10). It is worth noting that the F1 score and MCC of ensemble calling surpassed 0.774 and 0.574, respectively, on the 20 kbp dataset. However, owing to the relatively poor correlation coefficient of approximately 0.591 for the trend curve, it was difficult to give a sound prediction of the best read length.
Regarding the sequencing error rate, the average F1 scores of the ensemble approach were 0.774 and 0.780 on the datasets with 10–7.5% and below 1% error rates, respectively (Fig. 5 and Additional file 1: Table S11). The average MCC likewise remained at 0.574 and 0.586 on these datasets. Unfortunately, the trend curve fitted for ensemble calling was inconsistent with those of the single callers. Although the recommended sequencing attributes could not reproduce the optimal result for ensemble calling, the performance achieved with the suggested settings still stood out significantly.
Evaluation of the joint effect of recommended settings
To illustrate that the recommended settings could also achieve satisfactory performance across all the datasets mentioned above, we generated another three simulated datasets with the recommended settings, i.e., 10%, 7.5%, and 1% error rates, 20× sequencing coverage, and 20 kbp average read length, to further determine their joint effect. From Fig. 6, the datasets in the orange circular area show great potential to discover more SVs than the others, all surpassing F1 scores of 0.77 and 0.70 in detecting SVs without and with genotyping, respectively, indicating satisfactory performance for downstream analysis. The recommended datasets were also located in the circular areas and had average F1 scores of 0.773 and 0.716, outperforming many other datasets (see Additional file 1: Table S12 for details). Although some datasets with higher coverage and longer read lengths performed the same as or slightly better than the recommended datasets, we still suggest long-read data with 10%, 7.5%, or 1% error rates, 20× sequencing coverage, and 20 kbp read length as the optimal settings, offering much lower cost with relatively better performance. Regarding the detection of SVs of various types and size scales, the recommended datasets had an evident advantage on each set in terms of both performance and cost (see Additional file 1: Figure S5 and Table S12 for details).
Which is the most influential sequencing setting in SV calling?
To further determine which sequencing attribute (i.e., coverage, read length, or error rate) most strongly influences SV calling performance, we drew a heatmap of F1 scores across the various sequencing settings and tools for all SV detection targets in this study. For the subplots labeled "Total" and "SV Types" in Fig. 7, coverage was crucial for SV calling, since F1 scores advanced significantly from low to high depth. In contrast, there was no noteworthy increase in F1 scores across the various settings of read length and error rate. For the subplots of small SVs (less than 1 kbp), coverage was still the most influential factor, followed by read length and error rate. However, for the subplots of middle-sized SVs (approximately 1–10 kbp), read length became the key to obtaining higher F1 scores, aided by the longer average read length. Regarding genotypes (subplots labeled "GT" in Fig. 7), read length provided slightly more support for computing accurate genotypes, even though the growth in F1 scores under each sequencing setting was limited.