Pysim-sv: a package for simulating structural variation data with GC-biases

Background Structural variations (SVs) are wide-spread in human genomes and may have important implications in disease-related and evolutionary studies. High-throughput sequencing (HTS) has become a major platform for SV detection and simulation serves as a powerful and cost-effective approach for benchmarking SV detection algorithms. Accurate performance assessment by simulation requires the simulator capable of generating simulation data with all important features of real data, such GC biases in HTS data and various complexities in tumor data. However, no available package has systematically addressed all issues in data simulation for SV benchmarking. Results Pysim-sv is a package for simulating HTS data to evaluate performance of SV detection algorithms. Pysim-sv can introduce a wide spectrum of germline and somatic genomic variations. The package contains functionalities to simulate tumor data with aneuploidy and heterogeneous subclones, which is very useful in assessing algorithm performance in tumor studies. Furthermore, Pysim-sv can introduce GC-bias, the most important and prevalent bias in HTS data, in the simulated HTS data. Conclusions Pysim-sv provides an unbiased toolkit for evaluating HTS-based SV detection algorithms.


Background
Structural variations (SVs) are genomic variations that lead to structure changes of a donor genome. Indels, copy number variations (CNV) and genomic rearrangements are all subclasses of SVs. Many researches revealed that SVs are wide-spread in normal human populations [1,2] as well as in cancer genomes [3][4][5]. High-throughput sequencing (HTS) has become a major platform for SV detection and a number of algorithms have been developed for SV detection with HTS data [6]. In genome studies based on HTS data, an important problem is to benchmark performances of various algorithms in different scenarios. The performance of an algorithm depends on its design, its implementation as well as quality and features of the sequencing data. An ideal method for *Correspondence: dengmh@math.pku.edu.cn; ruibinxi@math.pku.edu.cn School of Mathematics Science and Center for Statistical Science, Peking University, Yiheyuan Road 5, 100871 Beijing, China benchmarking is by sequencing and subsequent experimental validation, but this method is expensive, labor intensive and time-costing. Thus, simulation of HTS data becomes a powerful and cost-effective alternative way for benchmarking genomic variation detection algorithms.
Usually, benchmarking HTS-based SV detection algorithms by simulation involves (1) generation of genomes containing simulated SVs and (2) simulation of HTS short reads based on the genomes with simulated SVs. To approximate real HTS data, the generated genomes should be similar to real genomes and the simulated HTS data should contain various sequencing errors and biases. Since tumor samples often contain normal contaminations and heterogeneous subclonal tumor cells, simulation data of tumor genomes should contain normal contamination and/or multiple subclonses.
There are several available SV simulation packages including RSVsim [7], SCNVsim [8], VarSim [9], IntSIM [10] and SInC [11]. These packages provide great resources for the community. However, as far as we know, no available package has systematically addressed all issues in data simulation for SV benchmarking. For example, RSVsim can simulate a wide range of SVs, but it cannot generate CNVs and cannot generate tumor data with normal contamination and subclones. SCNVsim considered issues in simulating tumor data such as aneuploidy, normal contamination and multiple-subclones, but it can only generate tumor genomes with somatic SVs/CNVs but not other types of somatic events such as single nucleotide variations (SNVs). VarSim can simulate comprehensive classes of genomic variations, but it also cannot generate tumor data with aneuploidy, normal contamination and multiple-subclones. IntSim and SInC are able to simulate both germline and somatic variants, but they can only simulate SNVs and CNVs. In addition, it is well-known that GC-bias is wide-spread in HTS data. Although the read simulator pIRS [12] can introduce GC-bias in the simulation data, its GCbias profile was trained on one set of data and users can essentially only generate one type of GC-bias. Our analysis of hundreds of sequencing data from The Cancer Genome Atlas and the 1000 Genome project data revealed that GC-bias can take many different forms [13], and pIRS are not flexible enough to simulate these GC-biases.
Here, we present Pysim-sv for SV simulation. Compared with other HTS data simulation packages, Pysim has three main advantages: (1) It can simulate a full spectrum of SVs as well as SNVs; (2) It allows simulation of tumor data with aneuploidy, normal contamination and multiple subclones; (3) It can generate HTS data with GC-biases of any form.

Methods
Pysim-sv uses fasta format reference genome as input. This tool consists of three major components (Fig. 1). The first component generates a personal genome by

SNV/indel simulation
Germline SNVs and indels are sampled from existing database such as dbSNP [14] or a VCF file provided by users. Somatic SNVs and indels are randomly placed in the genome, or if a tumor mutation database is given (such as COSMIC [15,16]), they are randomly sampled from the given database. The SNVs can be heterozygous or homozygous. Users can control the parameters such as the number of SNVs/indels to be introduced and the heterozygous/homozygous ratio according to their simulation purpose.

SV simulation
Pysim-sv can simulate seven classes of SVs including deletions, insertions, tandem duplications, inversions, intra-chromosomal translocations, inter-chromosomal translocations and CNVs. A deletion is generated by removing segments from the genome. An insertion is placed by inserting a sequence into the genome. The inserted sequence can come from either a database of known human insertion sequence (e.g. the Venter genome insertion sequence) or a series of random nucleotides with a random length. An inversion is generated by replacing a segment by its reverse complement. We simulate translocations by taking segments from the genome and inserting them to the same chromosomes (intrachromosomal translocation) or different chromosomes (inter-chromosomal translocation). Translocations can be balanced (no gain/loss of genome segments) or unbalanced (gain/loss of genome segments). Hence, the original segments are either removed or kept in the original location, respectively. A copy number loss event is generated by removing a segment, and a copy number gain event is generated by randomly inserting a segment to several locations of the same chromosome.
Non-allelic homologous recombination (NAHR) and non-homologous recombination (NHR) are two major mechanisms of generating SVs [17]. Pysim-sv simulates NAHR by placing breakpoints in repeat regions in the RepeatMasker database [18] and simulates NHR by randomly placing breakpoints in the reference genome. As breakpoints often co -occurs with SNVs/indels [19], Pysim-sv also introduces SNVs/indels near SV breakpoints. Users can set the expected number of SNVs/indels near SV breakpoints. To make sure the reported SV breakpoints and SNV/indel positions are correct, the positions of SVs and SNVs/indels are first simulated and the SVs and SNVs/indels ordered according to their chromosome

Tumor aneuploidy simulation
Aneuploidy is the deviation of ploidy number from the normal ploidy number. It is very common in cancer genomes and is related with chromosomal instability [20]. Aneuploidy has important impact on tumor CNV detection since it often causes incorrect estimation of copy numbers. Pysim-sv allows generation of aneuploidy according to user-specified aneuploidy status (the copy number of each chromosome) of the simulated genome. Pysim-sv generates genomes with aneuploidy from normal diploid genomes and the resulting genomes provide the starting genomes for somatic SV simulation.

Tumor genome heterogeneity and purity simulation
Tumor cell populations often contain many heterogeneous subclones. Pysim-sv simulates new subclones from a progenitor genome by randomly placing new somatic variations. The number of new subclones from the progenitor genome can be specified by users. Iterative application of this procedure can be used to simulate the clone evolution model [3] as well as the cancer stem cell model [21]. After a specified number of subclones are simulated, Pysim-sv then simulates HTS short reads from these tumor/normal genomes, and mixes these reads according to user specified proportions of these genomes. By default, ART [22] is used to generate HTS data.

GC bias introduction
After the initial HTS data are generated, we further employ a biased subsampling method to introduce GCbiases. Specifically, Pysim-sv first calculates a subsampling probability for every read pair (or every read for single-end data). This subsampling probability depends on the local GC-content of the sequence from which the read pair is generated. We use the following procedure to introduce GC-biases. Given a read pair (R 1 , R 2 ), let (S 1 , S 2 ) (S 1 < S 2 ) be the positions in the simulated chromosome from which the read pair was sequenced. Note that these reads are Illunima platform type reads. Then, the sequence S from S 1 to S 2 + r in the simulated chromosome is the segment that generates the read pair (r is the read length). Let GC S be the GC proportion of this segment. Then, we will subsample this read pair (R 1 , R 2 ) with probability p S = f (GC S )/(1 + f (GC S )), where f is a user specified function. Specifically, Pysim-sv first generates a random number from the Bernoulli distribution with the success rate as the probability p S . Pysim-sv will or will not select this read pair depending on whether or not the random number is 1. Figure 2 show examples of GC-dependency in real sequencing data (top panel) and in simulated data by Pysim-sv (bottom panel). The GCdependency functions in Fig. 2d-f are chosen as f 1 (x) = Note that the purity column represents the proportion of "tumor" cells in the simulated data. Subcolone1 and Subcolone2 columns represent the proportions of subcolone 1 and 2 in the simulated data, respectively. When purity is 1 and subclone1 is 1, it means that all data are from subclone1 and it essentially like a sequencing data from a normal genome

Results and Discussions
Based on the human reference genome (hg19), we simulated six genomes with three levels of purity (1, 0.5, and 0.8) and two levels of GC-dependency (with and without GC-bias). Each genome contains 100,000 SNVs and 200 SVs. We generated 30× 100 bp paired-end read data. We used GATK [23] and Varscan [24] to detect SNVs in these simulated data, and used Delly [25], BreakDancer [6], GASV-Pro [26] and Meerkat [5] to detect SVs. As a comparison, we also ran these algorithms on a real data set NA12878 from the 1000 Genome Project. The NA12878 data was sequenced on the Illumina platform with a read length of 100 bp. The mean insert size is 320 bp with a standard deviation of 60 bp. The coverage of this data set is around 40×. Table 1 shows the sensitivity and false discovery rate (FDR) of GATK and Varscan on the six simulated data. We found that Varscan had a higher sensitivity and FDR rate than GATK. The sensitivities of GATK and Varscan tend to decrease and their FDR rates tend to increase when the purity decreases. Similarly, compared with no GC-bias data, their sensitivities are lower and their FDRs are higher when GC-bias is introduced. On the NA12878 data, by comparing with reported SNVs  The sensitivities of the SV detection algorithms are shown in Fig. 3. Compared with the SNV detection algorithms, the sensitivities of these algorithms are relatively low. All methods have sensitivities above 75%. Meerkat and Delly achieved higher sensitivity rates than Break-Dancer and GASV-Pro because Meerkat and Delly used both discordant reads and split reads to detect SVs. As the purity decreases, we also observe that the sensitivities of the SV detection algorithms tend to decrease. For NA12878, we compared the SV calls from the four SV detection algorithms with the golden standard SV set as reported in Mills et al. 2011 [1]. Since most of the reported SVs in Mills et al. 2011 are deletions, we only considered deletion predictions. The precision and recall rates of the four algorithms are calculated by comparing with the golden standard deletions ( Table 2). The precisions of the four algorithms are relatively low, but this low level of precision might be due to the possibility that many true deletions are not included in the golden standard set. For CNV detection, we generated another simulation data containing 40 CNVs with their sizes ranging from 100 bp to 10 kb and their copy numbers ranging from 0 to 6. SNVs and Indels were also introduced to this genome by randomly sampling from the dbSNP database. We used Pysim-sv coupled with ART to generate 10× data of 100 bp paired-end reads with GC-bias. BIC-seq2 [13,27] (Fig. 4).
To test Pysim-sv speed, we used a diploid human genome(hg19) as reference and evaluated the computational efficiency with different parameter settings. We simulated 2.4 million SNVs which were randomly selected in dbSNP. We simulated 1-3 subclones with 100-300 SVs ranging from 1 to 10 kb. The test was performed on a 32-core sever with Intel Xeon 2.40 GHz CPU, running a Linux operating system. The running time and the memory usage of Pysim-sv were summarized in Table 3.

Conclusion
In this paper, we present Pysim-sv to simulate HTS data for benchmarking SV detection algorithms. Pysim-sv can simulate a wide spectrum of germline and somatic variations and thus the simulated genomes are more similar to real genomes. Pysim-sv is the first HTS data simulation tool that can introduce the GC-bias in the simulated HTS data. These features make simulation data generated by Pysim-sv more similar to real HTS data. We believe that Pysim-sv is a useful toolkit for performance evaluation of SV detection and SNV detection algorithms based on HTS data.