FUSIM: a software tool for simulating fusion transcripts
© Bruno et al.; licensee BioMed Central Ltd. 2013
Received: 29 August 2012
Accepted: 11 January 2013
Published: 16 January 2013
Gene fusions are the result of chromosomal aberrations and encode chimeric RNA (fusion transcripts) that play an important role in cancer genesis. Recent advances in high throughput transcriptome sequencing have given rise to computational methods for new fusion discovery. The ability to simulate fusion transcripts is essential for testing and improving those tools.
To address this need, we developed FUSIM (FUsion SIMulator), a software tool for simulating fusion transcripts. FUSIM supports the simulation of events known to create fusion genes and their resulting chimeric proteins, including inter-chromosome translocation, trans-splicing, complex chromosomal rearrangements, and transcriptional read-through events.
FUSIM provides the ability to assemble a dataset of fusion transcripts useful for testing and benchmarking applications in fusion gene discovery.
Chromosome aberrations and their corresponding gene fusions play an important role in carcinogenesis and cancer morbidity. The identification of fusion genes such as TMPRSS2-ERG, EML4-ALK, and BCR-ABL1 has led to successful diagnostic biomarkers and therapeutic targets. Thus, methods for detecting fusion genes and their corresponding chimeric proteins have major clinical significance. Recent advances in next-generation sequencing (NGS) and high-throughput transcriptome sequencing (RNA-Seq) have paved the way for new methods in fusion gene discovery. One of the major challenges in identifying novel fusion transcripts is controlling the high false positive rate. The majority of recently published methods that use RNA-Seq data [5-10] employ advanced filtering steps to eliminate false positives and nominate a set of potential fusion candidates. However, fusion validation involves substantial manual effort, including the design of complex PCR primers, which can significantly drive up costs. As a result, only a portion of predicted fusion events are subjected to experimental validation. Measuring the accuracy of these methods is therefore increasingly important for guiding future algorithm development. To address this need, we developed FUSIM, a software tool for simulating fusion transcripts from gene models. A set of advanced features is available for controlling fusion transcript simulation, modeled after characteristics of gene fusions in vivo. FUSIM enables comprehensive in silico testing of fusion discovery methods on transcriptome sequencing data. FUSIM is open source software written in Java and runs on any platform supporting Java version 1.6 or above.
As input, FUSIM requires the number of fusion transcripts to generate and a gene model file in UCSC GenePred table format. General Feature Format (GFF) and Gene Transfer Format (GTF) files are also supported through FUSIM's built-in GTF-to-genePred converter. FUSIM also requires an faidx-indexed reference genome file, which it uses to output raw fusion sequences. Reference genomes in FASTA format can be converted to faidx-indexed format using SAMtools. FUSIM can optionally simulate fusion transcripts based on the expression levels of genes found in experimental data. If this option is selected, a file of RNA-Seq read alignments in Binary Alignment/Map (BAM) format is required.
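To make the gene model input concrete, the following is a minimal Python sketch of reading one record in the basic 10-column UCSC genePred layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). It is an illustration only and not FUSIM's own parser, which is written in Java. The `GeneModel` class and `parse_genepred_line` function are hypothetical names, and the sketch assumes tab-separated input without the leading `bin` column found in some UCSC table dumps.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    """One transcript from a genePred file (hypothetical container)."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]  # (start, end) pairs as given in the file


def parse_genepred_line(line: str) -> GeneModel:
    """Parse a single tab-separated, 10-column genePred record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-separated fields")
    name, chrom, strand = fields[0:3]
    tx_start, tx_end, cds_start, cds_end, exon_count = map(int, fields[3:8])
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in fields[8].rstrip(",").split(",")]
    ends = [int(x) for x in fields[9].rstrip(",").split(",")]
    if not (len(starts) == len(ends) == exon_count):
        raise ValueError("exonCount does not match the exon coordinate lists")
    return GeneModel(name, chrom, strand, tx_start, tx_end,
                     cds_start, cds_end, list(zip(starts, ends)))
```

A record parsed this way exposes the exon coordinates and coding region bounds that a simulator needs in order to choose breakpoints.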
Gene selection options in FUSIM
Limit all fusions to specific geneId, transcriptId, or chrom
Filter for gene1
Filter for gene2
Filter for gene3
uniform empirical binned
Path to BAM file containing background reads. Genes will be selected for fusions according to the read profile of the background reads
RPKM cutoff when using background BAM file. Genes below the cutoff will be ignored
Method to use when selecting genes for fusions uniform|empirical|binned
Number of threads to spawn when processing background BAM file
Filters to select genes by gene ID, transcript ID, or chromosome are also supported. They can be set globally (i.e. applied to all genes within a fusion) or on a per-gene basis. For example, only the first gene in a fusion can be restricted to BCR, or the first and second genes can be restricted to BCR and ABL1, respectively. This makes it possible to simulate fusion transcripts on genes or chromosomes of interest.
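The expression-based selection described above can be sketched as follows. RPKM is computed as 10^9 × C / (N × L), where C is the number of reads mapped to a gene's exons, N is the total number of mapped reads, and L is the exon length in base pairs. The `select_genes` function is a hypothetical illustration: it assumes that "uniform" draws evenly from genes at or above the cutoff and that "empirical" draws in proportion to RPKM. The binned method is omitted because its details are not given in this text, and FUSIM's actual sampling may differ.

```python
import random
from typing import Dict, List, Optional


def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def select_genes(rpkm_by_gene: Dict[str, float], n: int,
                 method: str = "uniform", cutoff: float = 0.0,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Pick n genes (with replacement) from those at or above the RPKM cutoff.

    Assumed semantics, not taken from FUSIM's source code:
      uniform   - every eligible gene is equally likely
      empirical - genes are drawn in proportion to their RPKM
    """
    rng = rng or random.Random()
    # Genes below the cutoff are ignored; sorting keeps the result reproducible.
    candidates = sorted(g for g, v in rpkm_by_gene.items() if v >= cutoff)
    if not candidates:
        raise ValueError("no genes pass the RPKM cutoff")
    if method == "uniform":
        return [rng.choice(candidates) for _ in range(n)]
    if method == "empirical":
        # Requires at least one positive weight, so use a cutoff above zero.
        weights = [rpkm_by_gene[g] for g in candidates]
        return rng.choices(candidates, weights=weights, k=n)
    raise ValueError("unsupported method: " + method)
```

For example, a gene with 500 reads over 2,000 bp of exon sequence in a library of 10 million mapped reads has an RPKM of 25.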
Types of fusions
Options for controlling fusions
After a set of genes has been selected using the methods outlined in the previous section, fusion transcripts are created by randomly choosing a breakpoint in each gene and fusing the genes together. Breakpoints are created by randomly selecting a number n of consecutive exons from the start or end of each gene.
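The exon-based breakpoint selection can be illustrated with a short Python sketch for a two-gene fusion. The function names are hypothetical, exons are assumed to be listed in transcript order, and the choice of 1 to all exons as the allowed range for n is an assumption; the text does not state the bounds FUSIM uses.

```python
import random
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates of one exon


def pick_exons(exons: List[Exon], from_start: bool, rng: random.Random) -> List[Exon]:
    """Choose n consecutive exons from the start or the end of a gene."""
    n = rng.randint(1, len(exons))  # assumed range: at least one exon, at most all
    return exons[:n] if from_start else exons[-n:]


def fuse_two_genes(gene1_exons: List[Exon], gene2_exons: List[Exon],
                   rng: random.Random) -> Tuple[List[Exon], List[Exon]]:
    """Take leading exons of gene 1 and trailing exons of gene 2.

    The breakpoint lies after the last exon kept from gene 1 and
    before the first exon kept from gene 2.
    """
    head = pick_exons(gene1_exons, from_start=True, rng=rng)
    tail = pick_exons(gene2_exons, from_start=False, rng=rng)
    return head, tail
```

Passing a seeded `random.Random` instance makes a simulated dataset reproducible, which is useful when the same fusions must be regenerated for several discovery tools.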
FUSIM provides an advanced set of options to further control various aspects of fusion transcript simulation (Figure 1c). By default, genes are fused together by splitting the joined exons at random positions (split exons). The keep exon boundary option fuses genes exclusively on exon boundaries. The CDS only option creates fusions using only exons within the coding sequence region; by default, all exons are considered. The foreign insertion option inserts a randomly generated sequence at the fusion breakpoint. FUSIM can be set to auto-correct the orientation of the resulting fusion transcript if the genes are located on different strands. This is done by reverse complementing the selected exons to match the orientation of the first gene in the fusion. By default, FUSIM creates in-frame fusion transcripts that preserve the reading frame. Generating out-of-frame fusion transcripts that disrupt the reading frame is also supported.
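Two of these options, orientation correction and reading frame preservation, can be sketched in Python as below. Both helpers are hypothetical illustrations. The frame test uses a simplified model that is an assumption rather than FUSIM's documented rule: a fusion is treated as in-frame when the coding bases kept from the first gene, plus any inserted bases, leave the second gene's remaining coding sequence in its original codon phase.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_in_frame(cds_bases_from_gene1: int, cds_bases_skipped_in_gene2: int,
                insert_len: int = 0) -> bool:
    """Check whether the second gene keeps its codon phase after the fusion.

    cds_bases_from_gene1:       coding bases retained upstream of the breakpoint
    cds_bases_skipped_in_gene2: coding bases of gene 2 removed before the breakpoint
    insert_len:                 length of any foreign insertion at the breakpoint
    """
    upstream = cds_bases_from_gene1 + insert_len
    return upstream % 3 == cds_bases_skipped_in_gene2 % 3
```

Under this model, a simulator that wants in-frame output could resample breakpoints until the test passes, and could invert the test to produce out-of-frame fusions.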
Certain fusion discovery tools require sequencing read data in FASTQ format as input. FUSIM includes wrapper scripts that use ART to simulate next-generation sequencing reads from the generated fusion transcripts. The resulting FASTQ files can also be aligned back to a reference genome and optionally merged with existing alignment data, which is useful for injecting reads from simulated fusions into background datasets.
One of the main difficulties in testing fusion discovery methods is the lack of a gold standard dataset of fusion transcripts that can be used to compare performance accurately. FUSIM aims to provide a convenient way to rapidly generate datasets of simulated fusion transcripts for comprehensive comparison across fusion discovery methods. The advanced options in FUSIM allow for the construction of simulated fusion transcripts that model the origins of gene fusions in vivo.
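As one way such a simulated dataset might be used, the sketch below scores a fusion caller's predictions against the simulated truth set. It is not part of FUSIM. The `score_calls` function is hypothetical, and it assumes fusions are matched as ordered gene pairs, ignoring exact breakpoint coordinates.

```python
from typing import Dict, Iterable, Tuple

GenePair = Tuple[str, str]  # (5' partner, 3' partner)


def score_calls(simulated: Iterable[GenePair],
                predicted: Iterable[GenePair]) -> Dict[str, float]:
    """Compare predicted fusions with the simulated truth set."""
    truth, calls = set(simulated), set(predicted)
    tp = len(truth & calls)   # simulated fusions that were recovered
    fp = len(calls - truth)   # predictions with no simulated counterpart
    fn = len(truth - calls)   # simulated fusions that were missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if calls else 0.0,
    }
```

Because the simulated fusions are known exactly, the same truth set can be scored against several discovery tools to compare their false positive rates directly.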
Availability and requirements
Project name: FUSIM
Project home page: http://aebruno.github.com/fusim/
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.6 or higher
License: Apache License version 2.0
Any restrictions to use by non-academics: none
GenePred: Gene Prediction Track Format
GFF: General Feature Format
GTF: Gene Transfer Format
RPKM: Reads per kilobase of exon model per million mapped reads
We wish to thank L. Shepherd, Q. Hu, P. Colson, M. Zhu, Y. Yang, et al. for testing the FUSIM software and providing helpful comments.
- Mitelman F, Johansson B, Mertens F: The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 2007,7(4):233-245. doi:10.1038/nrc2091
- Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005,310(5748):644-648. doi:10.1126/science.1117679
- Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, ichiro Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, Bando M, Ohno S, Ishikawa Y, Aburatani H, Niki T, Sohara Y, Sugiyama Y, Mano H: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007,448(7153):561-566. doi:10.1038/nature05945
- Rowley JD: Chromosome translocations: dangerous liaisons revisited. Nat Rev Cancer 2001,1(3):245-250. doi:10.1038/35106108
- Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM: Transcriptome sequencing to detect gene fusions in cancer. Nature 2009,458(7234):97-101. doi:10.1038/nature07638
- Iyer MK, Chinnaiyan AM, Maher CA: ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics 2011,27(20):2903-2904. doi:10.1093/bioinformatics/btr467
- McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MGF, Griffith M, Moussavi AH, Senz J, Melnyk N, Pacheco M, Marra MA, Hirst M, Nielsen TO, Sahinalp SC, Huntsman D, Shah SP: deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 2011,7(5):e1001138. doi:10.1371/journal.pcbi.1001138
- Ge H, Liu K, Juan T, Fang F, Newman M, Hoeck W: FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics 2011,27(14):1922-1928. doi:10.1093/bioinformatics/btr310
- Asmann YW, Hossain A, Necela BM, Middha S, Kalari KR, Sun Z, Chai HS, Williamson DW, Radisky D, Schroth GP, Kocher JPA, Perez EA, Thompson EA: A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic Acids Res 2011,39(15):e100. doi:10.1093/nar/gkr362
- Li Y, Chien J, Smith DI, Ma J: FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq. Bioinformatics 2011,27(12):1708-1710. doi:10.1093/bioinformatics/btr265
- GenePred Table Format [http://genome.ucsc.edu/FAQ/FAQformat.html#format9]
- Gene Feature Format [http://www.sanger.ac.uk/resources/software/gff/]
- Gene Transfer Format [http://mblab.wustl.edu/GTF22.html]
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988,85(8):2444-2448. doi:10.1073/pnas.85.8.2444
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009,25(16):2078-2079. doi:10.1093/bioinformatics/btp352
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008,5(7):621-628. doi:10.1038/nmeth.1226
- Pellestor F, Anahory T, Lefort G, Puechberty J, Liehr T, Hédon B, Sarda P: Complex chromosomal rearrangements: origin and meiotic behavior. Hum Reprod Update 2011,17(4):476-494. doi:10.1093/humupd/dmr010
- Kim P, Yoon S, Kim N, Lee S, Ko M, Lee H, Kang H, Kim J, Lee S: ChimerDB 2.0-a knowledgebase for fusion genes updated. Nucleic Acids Res 2010,38(Database issue):D81-D85. doi:10.1093/nar/gkp982
- Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010,38(6):1767-1771. doi:10.1093/nar/gkp1137
- Huang W, Li L, Myers JR, Marth GT: ART: a next-generation sequencing read simulator. Bioinformatics 2012,28(4):593-594. doi:10.1093/bioinformatics/btr708
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.