Volume 18 Supplement 3
Pysim-sv: a package for simulating structural variation data with GC-biases
© The Author(s) 2017
Published: 14 March 2017
Structural variations (SVs) are wide-spread in human genomes and may have important implications in disease-related and evolutionary studies. High-throughput sequencing (HTS) has become a major platform for SV detection and simulation serves as a powerful and cost-effective approach for benchmarking SV detection algorithms. Accurate performance assessment by simulation requires the simulator capable of generating simulation data with all important features of real data, such GC biases in HTS data and various complexities in tumor data. However, no available package has systematically addressed all issues in data simulation for SV benchmarking.
Pysim-sv is a package for simulating HTS data to evaluate performance of SV detection algorithms. Pysim-sv can introduce a wide spectrum of germline and somatic genomic variations. The package contains functionalities to simulate tumor data with aneuploidy and heterogeneous subclones, which is very useful in assessing algorithm performance in tumor studies. Furthermore, Pysim-sv can introduce GC-bias, the most important and prevalent bias in HTS data, in the simulated HTS data.
Pysim-sv provides an unbiased toolkit for evaluating HTS-based SV detection algorithms.
KeywordsCopy number variation Translocation Breakpoints Next-generation sequencing
Structural variations (SVs) are genomic variations that lead to structure changes of a donor genome. Indels, copy number variations (CNV) and genomic rearrangements are all subclasses of SVs. Many researches revealed that SVs are wide-spread in normal human populations [1, 2] as well as in cancer genomes [3–5]. High-throughput sequencing (HTS) has become a major platform for SV detection and a number of algorithms have been developed for SV detection with HTS data . In genome studies based on HTS data, an important problem is to benchmark performances of various algorithms in different scenarios. The performance of an algorithm depends on its design, its implementation as well as quality and features of the sequencing data. An ideal method for benchmarking is by sequencing and subsequent experimental validation, but this method is expensive, labor intensive and time-costing. Thus, simulation of HTS data becomes a powerful and cost-effective alternative way for benchmarking genomic variation detection algorithms.
Usually, benchmarking HTS-based SV detection algorithms by simulation involves (1) generation of genomes containing simulated SVs and (2) simulation of HTS short reads based on the genomes with simulated SVs. To approximate real HTS data, the generated genomes should be similar to real genomes and the simulated HTS data should contain various sequencing errors and biases. Since tumor samples often contain normal contaminations and heterogeneous subclonal tumor cells, simulation data of tumor genomes should contain normal contamination and/or multiple subclonses.
There are several available SV simulation packages including RSVsim , SCNVsim , VarSim , IntSIM  and SInC . These packages provide great resources for the community. However, as far as we know, no available package has systematically addressed all issues in data simulation for SV benchmarking. For example, RSVsim can simulate a wide range of SVs, but it cannot generate CNVs and cannot generate tumor data with normal contamination and subclones. SCNVsim considered issues in simulating tumor data such as aneuploidy, normal contamination and multiple-subclones, but it can only generate tumor genomes with somatic SVs/CNVs but not other types of somatic events such as single nucleotide variations (SNVs). VarSim can simulate comprehensive classes of genomic variations, but it also cannot generate tumor data with aneuploidy, normal contamination and multiple-subclones. IntSim and SInC are able to simulate both germline and somatic variants, but they can only simulate SNVs and CNVs. In addition, it is well-known that GC-bias is wide-spread in HTS data. Although the read simulator pIRS  can introduce GC-bias in the simulation data, its GC-bias profile was trained on one set of data and users can essentially only generate one type of GC-bias. Our analysis of hundreds of sequencing data from The Cancer Genome Atlas and the 1000 Genome project data revealed that GC-bias can take many different forms , and pIRS are not flexible enough to simulate these GC-biases.
It can simulate a full spectrum of SVs as well as SNVs;
It allows simulation of tumor data with aneuploidy, normal contamination and multiple subclones;
It can generate HTS data with GC-biases of any form.
Germline SNVs and indels are sampled from existing database such as dbSNP  or a VCF file provided by users. Somatic SNVs and indels are randomly placed in the genome, or if a tumor mutation database is given (such as COSMIC [15, 16]), they are randomly sampled from the given database. The SNVs can be heterozygous or homozygous. Users can control the parameters such as the number of SNVs/indels to be introduced and the heterozygous/homozygous ratio according to their simulation purpose.
Pysim-sv can simulate seven classes of SVs including deletions, insertions, tandem duplications, inversions, intra-chromosomal translocations, inter-chromosomal translocations and CNVs. A deletion is generated by removing segments from the genome. An insertion is placed by inserting a sequence into the genome. The inserted sequence can come from either a database of known human insertion sequence (e.g. the Venter genome insertion sequence) or a series of random nucleotides with a random length. An inversion is generated by replacing a segment by its reverse complement. We simulate translocations by taking segments from the genome and inserting them to the same chromosomes (intra-chromosomal translocation) or different chromosomes (inter-chromosomal translocation). Translocations can be balanced (no gain/loss of genome segments) or unbalanced (gain/loss of genome segments). Hence, the original segments are either removed or kept in the original location, respectively. A copy number loss event is generated by removing a segment, and a copy number gain event is generated by randomly inserting a segment to several locations of the same chromosome.
Non-allelic homologous recombination (NAHR) and non-homologous recombination (NHR) are two major mechanisms of generating SVs . Pysim-sv simulates NAHR by placing breakpoints in repeat regions in the RepeatMasker database  and simulates NHR by randomly placing breakpoints in the reference genome. As breakpoints often co -occurs with SNVs/indels , Pysim-sv also introduces SNVs/indels near SV breakpoints. Users can set the expected number of SNVs/indels near SV breakpoints. To make sure the reported SV breakpoints and SNV/indel positions are correct, the positions of SVs and SNVs/indels are first simulated and the SVs and SNVs/indels ordered according to their chromosome positions. Then, they are generated backwardly from the last position to the first first position.
Tumor aneuploidy simulation
Aneuploidy is the deviation of ploidy number from the normal ploidy number. It is very common in cancer genomes and is related with chromosomal instability . Aneuploidy has important impact on tumor CNV detection since it often causes incorrect estimation of copy numbers. Pysim-sv allows generation of aneuploidy according to user-specified aneuploidy status (the copy number of each chromosome) of the simulated genome. Pysim-sv generates genomes with aneuploidy from normal diploid genomes and the resulting genomes provide the starting genomes for somatic SV simulation.
Tumor genome heterogeneity and purity simulation
Tumor cell populations often contain many heterogeneous subclones. Pysim-sv simulates new subclones from a progenitor genome by randomly placing new somatic variations. The number of new subclones from the progenitor genome can be specified by users. Iterative application of this procedure can be used to simulate the clone evolution model  as well as the cancer stem cell model . After a specified number of subclones are simulated, Pysim-sv then simulates HTS short reads from these tumor/normal genomes, and mixes these reads according to user specified proportions of these genomes. By default, ART  is used to generate HTS data.
GC bias introduction
Note that the GC-dependency functions of Fig. 2 d–f are not chosen to be corresponding to GC-dependency of Fig. 2 a–c. Pysim-sv is very flexible in generating GC-bias and users can easily specify any GC-dependency functions.
Results and Discussions
Based on the human reference genome (hg19), we simulated six genomes with three levels of purity (1, 0.5, and 0.8) and two levels of GC-dependency (with and without GC-bias). Each genome contains 100,000 SNVs and 200 SVs. We generated 30 × 100 bp paired-end read data. We used GATK  and Varscan  to detect SNVs in these simulated data, and used Delly , BreakDancer , GASV-Pro  and Meerkat  to detect SVs. As a comparison, we also ran these algorithms on a real data set NA12878 from the 1000 Genome Project. The NA12878 data was sequenced on the Illumina platform with a read length of 100 bp. The mean insert size is 320 bp with a standard deviation of 60 bp. The coverage of this data set is around 40 ×.
The sensitivity and false discovery rate (FDR) of the different SNV detection algorithms with different simulation setups
Overlaps of deletion predictions of the four SV detection algorithms with golden standard deletions in Mills el. al. 2011
Total reported deletion
Deletions in golden standard
The running time (hour) and memory usage (Gb) for Pysim-sv simulations with different parameter settings
Simulation set up
One subclone with 100 SVs
One subclone with 200 SVs
One subclone with 300 SVs
Two subclones with 100 SVs
Three subclones with 100 SVs
Mixing reads from 2 subclones and 1 normal genome (392 M reads)
GC-bias Introduction (130 M reads)
In this paper, we present Pysim-sv to simulate HTS data for benchmarking SV detection algorithms. Pysim-sv can simulate a wide spectrum of germline and somatic variations and thus the simulated genomes are more similar to real genomes. Pysim-sv is the first HTS data simulation tool that can introduce the GC-bias in the simulated HTS data. These features make simulation data generated by Pysim-sv more similar to real HTS data. We believe that Pysim-sv is a useful toolkit for performance evaluation of SV detection and SNV detection algorithms based on HTS data.
Availability and requirements
Project name: Pysim-svProject home page: https://github.com/xyc0813/pysim/ Operating system(s): Windows,Unix-like (Linux, Mac OSX)Programming language: python(>=2.7)Any restrictions to use by non-academics: None
Copy number variation
High throughput sequencing
Non-allelic homologous recombination
Single nucleotide variations
We thank Prof. Qian Minping for advice and discussions during program development. Part of the analysis was performed on the Computing Platform of the Center for Life Science.
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 3, 2017. Selected articles from the 15th Asia Pacific Bioinformatics Conference (APBC 2017): bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-3.
The National Key Basic Research Project of China (2015CB856000); The National Natural Science Foundation of China (11471022, 71532001); Recruitment Program of Global Youth Experts of China; Funding for publication cost: the National Key Basic Research Project of China.
Availability of data and materials
The simulated data supporting the conclusions of this article was created using Pysim-sv which is available at https://github.com/xyc0813/pysim/.
YX implemented the package. YX and RX wrote the paper. YX and YL performed the analysis. All authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011; 470(7332):59–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Sismani C, Koufaris C, Voskarides K. Copy number variation in human health, disease and evolution. In: Genomic Elements in Health, Disease and Evolution. New York: Springer: 2015. p. 129–54.Google Scholar
- Ding L, Wendl MC, Koboldt DC, et al. Analysis of next generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet. 2010; 19(R2):188–96.View ArticleGoogle Scholar
- Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, Pleasance ED, Lau KW, Beare D, Stebbings LA, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011; 144(1):27–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Yang L, Luquette LJ, Gehlenborg N, Xi R, Haseley PS, Hsieh CH, Zhang C, Ren X, Protopopov A, Chin L, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell. 2013; 153(4):919–29.View ArticlePubMedPubMed CentralGoogle Scholar
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009; 6(9):677–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Bartenhagen C, Dugas M. RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics. 2013; 29(13):1679–81.View ArticlePubMedGoogle Scholar
- Qin M, Liu B, Conroy JM, et al. SCNVSim: somatic copy number variation and structure variation simulator. BMC Bioinforma. 2015; 16(1):1–6.View ArticleGoogle Scholar
- Mu JC, Mohiyuddin M, Li J, et al. VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics. 2015; 31(9):1469–71.View ArticlePubMedGoogle Scholar
- Yuan X, Zhang J, Yang L. IntSIM: An integrated simulator of next-generation sequencing data. IEEE Trans Biomed Eng. 2016;:1–11.
- Pattnaik S, Gupta S, Rao AA, Panda B. SinC: an accurate and fast error-model based simulator for snps, indels and cnvs coupled with a read generator for short-read sequence data. Bmc Bioinformatic. 2013; 15(1):1–9.Google Scholar
- Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, Chen Y, Mu D, Zhang H, Li N. pIRS: Profile-based illumina pair-end reads simulator. Bioinformatics. 2012; 28(11):1533–5.View ArticlePubMedGoogle Scholar
- Xi R, Lee S, Xia Y, et al. Copy number analysis of whole-genome data using BIC-Seq2 and its application to detection of cancer susceptibility variants. Nucleic Acids Res. 2016; 44(13):6274–86.View ArticlePubMedGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the ncbi database of genetic variation. Nucleic Acids Res. 2001; 29(1):308–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal PA, Stratton M, et al. The COSMIC (catalogue of somatic mutations in cancer) database and website. Br J Cancer. 2004; 91(2):355–8.PubMedPubMed CentralGoogle Scholar
- Forbes SA, Bindal N, Bamford S, et al. COSMIC: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2011; 39(suppl_1):D945–D950.View ArticlePubMedGoogle Scholar
- Currall BB, Chiangmai C, Talkowski ME, Morton CC. Mechanisms for structural variation in the human genome. Curr Genet Med Rep. 2013; 1(2):81–90.View ArticlePubMedPubMed CentralGoogle Scholar
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, et al. The UCSC genome browser database: extensions and updates 2013. Nucleic Acids Res. 2013; 41(D1):64–9.View ArticleGoogle Scholar
- Kumar D. Disorders of the genome architecture: a review. Genomic Med. 2008; 2(3-4):69–76.View ArticlePubMedGoogle Scholar
- Hassold T, Hunt P. To err (meiotically) is human: the genesis of human aneuploidy. Nat Rev Genet. 2001; 2(4):280–91.View ArticlePubMedGoogle Scholar
- Shackleton M, Quintana E, Fearon ER, Morrison SJ. Heterogeneity in cancer: cancer stem cells versus clonal evolution. Cell. 2009; 138(5):822–9.View ArticlePubMedGoogle Scholar
- Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28(4):593–4.View ArticlePubMedGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome Res. 2010; 20(9):1297–303.View ArticlePubMedPubMed CentralGoogle Scholar
- Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009; 25(17):2283–285.View ArticlePubMedPubMed CentralGoogle Scholar
- Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. Delly: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012; 28(18):333–9.View ArticleGoogle Scholar
- Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012; 13(3):22.View ArticleGoogle Scholar
- Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA, et al. Copy number variation detection in whole-genome sequencing data using the bayesian information criterion. Proc Natl Acad Sci. 2011; 108(46):1128–36.View ArticleGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar