State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues?
- Matteo Carrara†1,
- Marco Beccuti†2,
- Federica Cavallo3,
- Susanna Donatelli2,
- Fulvio Lazzarato3,
- Francesca Cordero2 and
- Raffaele A Calogero1
© Calogero et al.; licensee BioMed Central Ltd. 2013
Published: 22 April 2013
RNA-seq has the potential to discover genes created by chromosomal rearrangements. Fusion genes, also known as "chimeras", are formed by the breakage and re-joining of two different chromosomes. Chimeras have been implicated in the development of cancer. A few earlier publications also reported fusion events in normal tissue, but with very limited overlap among their results. More recently, two fusion genes in normal tissues were detected using both RNA-seq and protein data.
Given the heterogeneous results in identifying chimeras in normal tissue, we decided to evaluate the efficacy of state-of-the-art fusion finders in detecting chimeras in RNA-seq data from normal tissues.
We compared the performance of six fusion-finder tools: FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse and TopHat-fusion. To evaluate sensitivity we used a synthetic dataset of fusion products, called the positive dataset; in these experiments FusionMap, FusionFinder, MapSplice and TopHat-fusion detected more than 78% of the fusion genes. All tools were error-prone, with high variability among them, and identified some fusion genes not present in the synthetic dataset. To better investigate the rate of false chimera detection, we used synthetic datasets free of fusion products, called negative datasets. The negative datasets have different read lengths and quality scores, which allows the dependency of the tools on both features to be assessed. FusionMap, FusionFinder, MapSplice, deFuse and TopHat-fusion were error-prone. Only FusionHunter results were free of false positives. FusionMap gave the best compromise between specificity in the negative datasets and sensitivity in the positive dataset.
We have observed a dependency of the tools on read length, quality score and on the number of reads supporting each chimera. Thus, it is important to carefully select the software on the basis of the structure of the RNA-seq data under analysis. Furthermore, the sensitivity of chimera detection tools does not seem to be sufficient to provide results consistent with those obtained in normal tissues on the basis of fusion events extracted from published data.
Recently, Frenkel-Morgenstern et al. described a new approach to assess chimeras. We term this procedure the knowledge-based approach, since it is based on fusion events extracted from published data. The authors studied 7,424 putative human chimeric RNAs and detected the expression of 172 chimeric RNAs in 16 human tissues (Illumina Body Map 2.0, GSE30611) using high-throughput RNA sequencing, mass spectrometry experimental data, and functional annotations.
Fusion finder algorithms
In the last two years many chimera-detection tools have been developed and published. To the best of our knowledge, ChimeraScan, deFuse, FusionFinder, FusionHunter, FusionMap, MapSplice, ShortFuse and TopHat-Fusion are the most commonly used tools for chimera detection. ChimeraScan and ShortFuse were not considered here since their runs did not terminate properly during the preliminary testing phase. Before describing the fusion finder algorithms, we introduce the terms used in the rest of the paper.
RNA-seq experiments provide a set of short reads that can be in two forms: single-end or paired-end. In the latter case both the forward and reverse template strands of each DNA fragment are sequenced. Depending on where the fusion boundary (the nucleotide coordinates defining the breakpoint of both genes involved in the fusion) falls, two contexts can be observed: encompassing reads and spanning reads. In encompassing reads, the read pair harbors the fusion boundary and each mate maps to a different gene of the fused gene pair. In spanning reads, one mate overlaps the fusion boundary, while the other mate maps to one of the two genes involved in the chimera.
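The two contexts can be illustrated with a minimal Python sketch. It assumes that each mate has already been aligned and is summarized by the genes it maps to; the function name and the input representation are hypothetical and do not correspond to any of the tools discussed here.

```python
def classify_read_pair(mate1_genes, mate2_genes):
    """Classify a paired-end read relative to a candidate fusion.

    Each argument lists the genes one mate aligns to. A mate that
    overlaps the fusion boundary aligns partly to one gene and partly
    to the other, so it lists two genes.
    """
    genes1, genes2 = set(mate1_genes), set(mate2_genes)
    sizes = sorted((len(genes1), len(genes2)))
    if sizes == [1, 1]:
        # Both mates lie within a single gene each.
        return "concordant" if genes1 == genes2 else "encompassing"
    if sizes == [1, 2]:
        split, whole = (genes1, genes2) if len(genes1) == 2 else (genes2, genes1)
        # The unsplit mate must map to one of the two fused genes.
        if whole <= split:
            return "spanning"
    return "other"
```

For example, `classify_read_pair(["GENE_A"], ["GENE_B"])` returns `"encompassing"`, while `classify_read_pair(["GENE_A", "GENE_B"], ["GENE_B"])` returns `"spanning"`.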
We have categorized the fusion detection algorithms into two classes: the fragment-based approach and the pseudo-reference based approach.
In the fragment-based approach, input reads are split into fragments, which are aligned to a reference (whole genome or transcriptome). The mapped fragments are then used to build a list of putative chimeras, which undergo further selection by means of various types of filters. This category includes the following tools: FusionFinder, FusionMap, MapSplice and deFuse. Pseudo-reference based approaches use candidate chimeras, obtained from a previous mapping phase, to generate a new pseudo-reference for chimera detection. The fusion events resulting from the latter step are further filtered to reduce false positives. TopHat-Fusion and FusionHunter are the tools included in this category.
In this paper, we focus on fusion finder algorithms for ab-initio processes. Among those algorithms, FusionMap has shown the best compromise between sensitivity and specificity. Its results have also been compared with results obtained by the knowledge-based approach presented in Frenkel-Morgenstern's paper.
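The fragment-based idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the "aligner" is an exact substring search against a small dictionary of gene sequences, and all function names are hypothetical. None of the tools compared here works this simply.

```python
def split_read(read, fragment_length):
    """Split a read into consecutive, non-overlapping fragments.

    A trailing remainder shorter than fragment_length is dropped.
    """
    last_start = len(read) - fragment_length
    return [read[i:i + fragment_length]
            for i in range(0, last_start + 1, fragment_length)]


def map_fragment(fragment, reference):
    """Toy aligner: names of reference sequences containing the fragment."""
    return [name for name, seq in reference.items() if fragment in seq]


def candidate_fusion(read, reference, fragment_length):
    """Return (gene_5prime, gene_3prime) if the fragments of one read
    map uniquely to exactly two different genes, otherwise None."""
    genes = []
    for fragment in split_read(read, fragment_length):
        hits = map_fragment(fragment, reference)
        if len(hits) == 1 and (not genes or genes[-1] != hits[0]):
            genes.append(hits[0])
    return tuple(genes) if len(genes) == 2 else None
```

With `reference = {"geneA": "AAAACCCCGGGG", "geneB": "TTTTGATTACAG"}`, the read `"CCCCGGGGTTTTGATT"` split into 8-nt fragments yields the candidate `("geneA", "geneB")`, whereas a read lying entirely within `geneA` yields `None`. In a real pipeline such candidates would then pass through the filtering stage described above.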
Evaluating the sensitivity of fusion-finder algorithms
To compare the sensitivity of fusion-finder algorithms we used a synthetic dataset provided as part of the release of the FusionMap software, which served as the positive dataset.
Chimera detection performances on positive dataset encompassing 50 synthetic fusion events
False discovery rate
Evaluating the false discovery rate of fusion finder tools
False chimera detection
Searching for chimeras on real dataset with FusionMap
Table: Genomic locations of genes involved in chimeras detected in Body Map 2.0 (surviving column header: Li paper)
Table 4: Chimeras detection in Body Map 2.0 by FusionMap. Surviving column headers: # of genes involved in chimeras in Body Map 2.0; # of genes also detected in the negative dataset; # of genes also detected as chimeras in; Genes in chimeras; Chimeras detected by FusionMap. Surviving row label: White blood cells.
Table 4 also reports, among the genes involved in the chimeras detected in Body Map, the number that were also falsely detected by FusionMap in the experiments on the negative datasets.
The main goal of this paper was to understand if the main fusion detection software tools, available in the literature, are able to detect chimeras in normal tissue RNA-seq data. To reach our aim, it was essential to understand the behavior of fusion detection software tools. Thus, we evaluated the sensitivity and false discovery rate for six state-of-the-art fusion-finders: FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse and TopHat-fusion.
In our experiments, FusionHunter performed better than all the other tools on the basis of false discovery rate, but had the lowest sensitivity. The behavior of FusionHunter is consistent with two other observations: i) FusionHunter loses all the fusions in the positive dataset that are supported by fewer than 18 reads, and ii) the median number of reads supporting false positive chimeras is between 1 and 10 for all tools excluding FusionHunter. Thus, by weighting negatively fusions supported by a low number of reads in order to reduce the risk of false positive detection, FusionHunter clearly suffers from reduced sensitivity. At the same time FusionHunter implements some specific features that make it less prone to the discovery of false fusions supported by a high number of reads, which are frequently observed with the other fusion detection tools.
Quality scores associated with the datasets affected MapSplice and FusionFinder results. FusionFinder, however, was more sensitive to read length, with its false fusion detection rate decreasing as read length increased. Conversely, FusionMap and deFuse performed much better with short reads: the longer the read, the higher the number of false positive fusion genes. TopHat-fusion was insensitive to quality score, but it showed the highest false positive discovery rate of the tools tested. With respect to sensitivity, deFuse and FusionHunter were found to be the least sensitive. The best compromise between sensitivity and specificity was given by FusionMap, which seemed particularly suitable for the analysis of the Illumina normal tissue Body Map 2.0 RNA-seq dataset, since its false fusion detection rate was particularly low in the analysis of the negative datasets. Despite the good sensitivity of FusionMap in the test dataset, the analysis of the Body Map 2.0 paired-end reads revealed a low correlation between the fusions detected by FusionMap in this dataset and the fusions detected in the single-end dataset by Frenkel-Morgenstern. An important point to consider, when comparing the results obtained with the 75 bp single-end and the 50 bp paired-end Body Map 2.0 datasets, is the tissue source. The two datasets were generated, for each tissue, from the same donor; we therefore expect the results to be comparable. The lack of correspondence between the true positive fusions, namely the 22 fusion events validated in Body Map 2.0 in the Frenkel-Morgenstern paper, and the results obtained with FusionMap on the same dataset in this paper suggests that ab-initio chimera detection approaches are not sensitive enough to detect fusion genes in normal tissues.
However, since chimeras detected by Frenkel-Morgenstern have a quite low representation in normal tissues, it is also possible that they were not sampled in the paired-end dataset for stochastic reasons.
This paper highlights that the specificity of state-of-the-art tools for the identification of chimeras is affected to different degrees by the read length and read quality scores of the RNA-seq dataset under analysis. Thus, it is important to carefully select the software on the basis of RNA-seq data features. In the specific case of detection of chimeras in normal tissues, these fusion finder tools do not seem to provide results consistent with those obtained with a knowledge-based approach such as those reported by Frenkel-Morgenstern.
Fusion detection software
MapSplice splits each read into a set of consecutive elements, then exon alignment is performed. MapSplice aligns any element not mapped in the previous step using the knowledge gained from the other aligned elements. Splice junction quality is then assessed with two statistical measures: i) "anchor significance", given by an alignment that maximizes significance as a result of long anchors on the two sides of the splice junction, and ii) "entropy", calculated from the multiplicity of splice junction locations.
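To make the second measure more concrete, the sketch below computes a Shannon entropy over the positions at which supporting reads cross a junction. This is an assumption about how the "multiplicity of splice junction locations" could be quantified, not MapSplice's actual implementation; the function name and the input format are hypothetical.

```python
import math
from collections import Counter


def junction_entropy(crossing_offsets):
    """Shannon entropy (in bits) of the offsets at which reads cross a junction.

    crossing_offsets holds, for each supporting read, the number of read
    bases lying upstream of the junction. Reads piled up at a single
    offset give 0; reads spread over many offsets give a higher value.
    """
    counts = Counter(crossing_offsets)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

Under this reading, four reads crossing at the same offset give an entropy of 0, while four reads crossing at four distinct offsets give 2 bits, so a junction supported by diverse read placements scores higher than one supported by identical reads.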
FusionMap splits reads into smaller portions and finds putative chimeras by aligning these elements to genes annotated on the genomic reference. The read alignment is based on the GSPN algorithm, which tolerates at most two mismatched bases. Seeds located at each side of an unmapped read are aligned to the reference. Chimeras are reported only if both seeds align. All chimeras whose fusion boundaries lie less than 5 bp apart are combined and used to refine the position of the junction boundary. Canonical splicing patterns are also used to refine the site of the fusion boundary. False positives are then removed using four filters: (i) reads are removed on the basis of their break point score; (ii) read-through fusions are discarded; (iii) a chimera pseudo-reference is created, and fusions without reads aligned to the pseudo-reference are removed; (iv) PCR artifacts are removed.
FusionFinder divides reads into shorter elements and detects chimeras by aligning these fragments to an annotated genomic reference. The main differences with respect to FusionMap are related to alignment and filter implementation. Bowtie is used to align fragments to the coding reference transcriptome. Exons tagged as fusion elements go through some filtering steps to refine the results: (i) seeds mapping on the same gene are removed; (ii) pairs of reads mapping on the same chromosome but on opposite strands are discarded; (iii) pairs of reads mapped on genomic coordinates not associated with annotated genes are removed; and (iv) artifacts caused by sequence similarity are also discarded.
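The step that combines nearby fusion boundaries can be sketched as follows. This is a simplified greedy grouping written for illustration, not FusionMap's code: the candidate representation, the function name and the rule of comparing each candidate only with the previous one in sorted order are all assumptions.

```python
from collections import defaultdict


def merge_close_boundaries(candidates, max_distance=5):
    """Group candidate fusions whose boundaries lie less than max_distance bp apart.

    candidates: iterable of (gene_a, gene_b, pos_a, pos_b) tuples, where
    pos_a and pos_b are the boundary coordinates in the two genes.
    Returns a list of ((gene_a, gene_b), [(pos_a, pos_b), ...]) groups.
    """
    by_pair = defaultdict(list)
    for gene_a, gene_b, pos_a, pos_b in candidates:
        by_pair[(gene_a, gene_b)].append((pos_a, pos_b))

    groups = []
    for pair, positions in by_pair.items():
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            previous = current[-1]
            close = (abs(pos[0] - previous[0]) < max_distance
                     and abs(pos[1] - previous[1]) < max_distance)
            if close:
                current.append(pos)
            else:
                groups.append((pair, current))
                current = [pos]
        groups.append((pair, current))
    return groups
```

Each resulting group would then stand for a single chimera whose junction position can be refined from the pooled supporting reads.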
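Filters (i) to (iii) lend themselves to a short sketch. The record layout (a dictionary with `gene`, `chrom` and `strand` keys) and the function name are hypothetical, and filter (iv), which requires sequence comparison, is deliberately left out.

```python
def passes_basic_filters(seed_a, seed_b):
    """Apply simplified versions of filters (i)-(iii) to a pair of mapped seeds.

    Each seed is a dict with keys 'gene' (None when the coordinates are
    not associated with an annotated gene), 'chrom' and 'strand'.
    """
    # (iii) both seeds must fall on annotated genes
    if seed_a["gene"] is None or seed_b["gene"] is None:
        return False
    # (i) seeds mapping on the same gene are not fusion evidence
    if seed_a["gene"] == seed_b["gene"]:
        return False
    # (ii) same chromosome but opposite strands
    if seed_a["chrom"] == seed_b["chrom"] and seed_a["strand"] != seed_b["strand"]:
        return False
    return True
```

A candidate would be retained only when this function returns `True` for its two seeds; the order of the checks is an arbitrary choice made here for readability.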
deFuse uses read pairs showing discordant alignments to detect putative chimeras, essentially scoring putative fusions on the basis of fusion junction coverage and requiring that the shift between overlapping spanning reads be consistent with the fragment length.
For each putative fusion, chimera boundaries are used to identify encompassing reads and to define the fusion boundary at the nucleotide level. Paired-end reads whose alignment implies a fragment length that does not match the expected fragment length distribution are discarded.
FusionHunter aligns paired-end reads against a reference genome using Bowtie. The mapped reads are used to identify the fusion candidates, which are aggregated to generate a pseudo-reference to detect junction-spanning reads. Unmapped reads are fragmented and aligned on the pseudo-reference. If one fragment is correctly aligned, the nearest canonical splicing junction is searched and the other part of the original read is aligned to this region. Chimeras made of two genes sharing significant homology are removed. Chimeras not supported by at least two different paired-end reads at the fusion boundary are discarded. Furthermore, reads covering the break point with fewer than 6 bp are removed, as are PCR artifacts and read-through events.
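A fragment length consistency check of this kind might look like the sketch below. The cutoff of a fixed number of standard deviations around the mean is an assumption made for illustration; the text does not state how deFuse defines a mismatch with the expected distribution, and the function and parameter names are hypothetical.

```python
def fragment_length_is_plausible(implied_length, mean, sd, n_sd=3.0):
    """Return True if the fragment length implied by a read pair lies
    within n_sd standard deviations of the expected mean length."""
    return abs(implied_length - mean) <= n_sd * sd
```

With an expected fragment length of 300 bp and an assumed standard deviation of 30 bp, a pair implying a 310 bp fragment would be kept and a pair implying a 600 bp fragment would be discarded.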
TopHat-Fusion detects all reads mapping entirely within exons using Bowtie, and it creates a set of partial exons from these alignments. Pseudo-gene structures are then created, while unmapped reads are split into shorter elements and mapped on the genome. Chimeras are detected if read fragments map in a way consistent with fusions (using TopHat with relaxed parameters). Filtering is subsequently applied to eliminate (i) chimeras associated with multi-copy genes or repetitive sequences; (ii) reads mapping with fewer than 13 bp on either side of the fusion; and (iii) read-through events.
TopHat-Fusion also keeps track of contradicting reads, i.e. the reads mapping both on a single part of the fusion and on the fusion boundary.
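The minimum anchor requirement in filter (ii) can be expressed as a small check. The coordinate convention (0-based positions along the fused transcript) and the function name are assumptions made for this sketch; it is not taken from the TopHat-Fusion source.

```python
def has_sufficient_anchor(read_start, read_length, breakpoint, min_anchor=13):
    """Check that a read covers at least min_anchor bases on both sides of a fusion.

    read_start: 0-based position of the first read base on the fused transcript.
    breakpoint: 0-based position of the first base after the fusion boundary.
    """
    left_anchor = breakpoint - read_start
    right_anchor = read_start + read_length - breakpoint
    return left_anchor >= min_anchor and right_anchor >= min_anchor
```

A 50 nt read starting at position 0 passes when the breakpoint lies at position 25 and fails when it lies at position 10 or 38. Calling the same function with `min_anchor=6` would mirror the 6 bp threshold described for FusionHunter.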
FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse and TopHat-fusion were downloaded from the repositories indicated in their papers and installed according to the requirements indicated in their manuals. All software tools were run with their default configuration. The analyses were performed on a 48-core AMD server with 512 GB of RAM and 9 TB of disk, running SUSE Linux Enterprise 11. Statistics and data parsing were executed using R scripting, taking advantage of the gplots contributed R package (http://cran.r-project.org/web/packages/gplots/) and Bioconductor packages, i.e. Biostrings, org.Hs.eg.db, GenomicRanges and oneChannelGUI.
The negative dataset was generated using BEERS (http://www.cbil.upenn.edu/BEERS/) and consists of 70 million 100 nt paired-end reads (parameters: -readlength 100 -tlen 5 -tpercent 0.1). Since BEERS does not simulate Illumina quality scores, we attached to the 70 million reads the quality scores derived from 100 bp paired-end read experiments run in our laboratory, to generate the lib100_1 and lib100_2 fastq files. In addition, from the 100 nt paired-end reads we generated a set of 2 × 75 nt (lib75_1 and lib75_2) and 2 × 50 nt (lib50_1 and lib50_2) paired-end reads, by removing 25 or 50 nt, respectively, from the beginning of each read in the lib100_1 and lib100_2 fastq files. Negative datasets are available from the authors upon request.
FusionMap (http://www.omicsoft.com/fusionmap/#Home) developers provide a synthetic dataset of simulated paired-end RNA-Seq reads (~60,000 pairs of reads, 75 nt, fragment size = 158 bp). Fifty fusions are represented, with the number of supporting pairs ranging from 9 to 8852. The sensitivity of each tool was calculated by dividing the number of these chimeras detected by the tool by the total number of chimeras in the positive dataset. The "false positive" behavior is instead reported directly as the number of detected chimeras that do not match any of the 50 positive chimeras.
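The trimming used to derive the shorter libraries can be sketched as follows, assuming plain four-line FASTQ records. The function name is hypothetical and this is not the script used to build the datasets.

```python
def trim_fastq_start(lines, n_remove):
    """Yield FASTQ lines with the first n_remove bases, and the matching
    quality values, removed from every record.

    Assumes each record spans exactly four lines: header, sequence,
    separator, qualities.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 in (1, 3):  # sequence line and quality line
            line = line[n_remove:]
        yield line
```

Applied to a 100 nt record with `n_remove=25` the function yields a 75 nt read, and with `n_remove=50` a 50 nt read, with sequence and quality strings kept at equal length.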
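The two measures described above can be written down compactly. In this sketch a chimera is identified by its pair of gene names and matching ignores the order of the two genes; both choices, like the function name, are assumptions made for illustration.

```python
def evaluate_tool(detected, truth):
    """Return (sensitivity, false_positive_count) for one tool.

    detected, truth: iterables of (gene_a, gene_b) pairs. Pairs are
    compared irrespective of gene order.
    """
    detected_set = {frozenset(pair) for pair in detected}
    truth_set = {frozenset(pair) for pair in truth}
    true_positives = detected_set & truth_set
    sensitivity = len(true_positives) / len(truth_set)
    false_positives = len(detected_set - truth_set)
    return sensitivity, false_positives
```

For instance, a tool that reports two of four true chimeras plus one pair absent from the truth set obtains a sensitivity of 0.5 and a false positive count of 1.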
Fusion genes detected in the 75 bp Body map dataset
Frenkel-Morgenstern's paper provided, as additional information, the list of chimeras detectable in the Body Map dataset (75 bp single-end reads) and the tissue in which they were detected. Furthermore, the paper also provided the fasta files for all the 7,424 putative human chimeric RNAs analyzed. Using an R (http://cran.r-project.org/) script we extracted the subset of 172 fusion events detected by Frenkel-Morgenstern in Body Map 2.0. Each of Frenkel-Morgenstern's 172 chimeras was manually blasted (http://blast.ncbi.nlm.nih.gov/Blast.cgi) against the human reference genome, and we considered as putative chimeras only those characterized by a unique mapping on two different genomic locations. Accordingly, we discarded all fusion events in which: i) part of the sequence mapped on multiple genomic locations, ii) the whole sequence mapped on the same genomic location, or iii) the sequence mapped on more than two different chromosomal locations. After this filtering, 22 fusion genes were left as putative chimeras (Table 3).
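Although this inspection was done manually, the selection criteria can be encoded as a small predicate. The input layout (one list of genomic hits per aligned portion of the chimera sequence) and the interpretation of "same genomic location" as overlapping intervals on one chromosome are assumptions of this sketch, as is the function name.

```python
def is_putative_chimera(portion_hits):
    """Decide whether BLAST-like hits support a chimera between two loci.

    portion_hits: one entry per aligned portion of the chimera sequence,
    each a list of (chromosome, start, end) genomic hits.
    """
    # (iii) more than two locations, or no split at all
    if len(portion_hits) != 2:
        return False
    # (i) a portion mapping on multiple genomic locations (or on none)
    if any(len(hits) != 1 for hits in portion_hits):
        return False
    (chrom1, start1, end1) = portion_hits[0][0]
    (chrom2, start2, end2) = portion_hits[1][0]
    # (ii) both portions on the same genomic location
    same_location = chrom1 == chrom2 and start1 <= end2 and start2 <= end1
    return not same_location
```

A sequence whose two portions hit `("chr1", 100, 200)` and `("chr7", 500, 600)` would be retained, whereas one with a portion hitting two different loci would be discarded.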
Body Map 2.0
Illumina (http://www.illumina.com) has sequenced mRNAs derived from 16 normal tissues (Body Map 2.0: Adrenal gland, Adipose tissue, Brain, Breast, Colon, Heart, Kidney, Liver, Lung, Lymph Node, Ovary, Prostate, Skeletal Muscle, Testis, Thyroid, White blood cells). These data are publicly available on the GEO database (GSE30611). Approximately 80 million reads for each tissue were provided as 75 bp single-end (SE) or 50 nt paired-end (PE) read datasets. SE and PE refer to the sequencing of one and both ends of a DNA fragment, respectively. The libraries used for sequencing were derived from poly-A selected mRNAs and generated by random priming. In the case of PE, the average size of the sequenced fragment was approximately 300 bp. Due to the high number of reads provided, these datasets represent an ideal instrument for the identification of chimeras associated with normal tissue and for investigating the tissue specificity of chimeras.
This study was funded by grants from the Italian Association for Cancer Research; the Epigenomics Flagship Project EPIGEN, MIUR-CNR; the Italian Ministero dell'Università e della Ricerca; the University of Torino and Regione Piemonte. European Seventh framework program, Health.2012.1.2-1, NGS-PTL grant n. 306242. The work of Marco Beccuti has been partially supported by a project grant Nr. 10-15-1432/HICI from the King Abdulaziz University of Saudi Arabia.
We thank Michael Poidinger for critical reading of the manuscript and the reviewers for their insightful suggestions.
The publication costs for this article were funded by FP7 EU Health Project "Next Generation Sequencing Platform for Targeted Personalized Therapy of Leukemia" (NGS - PTL) Grant Agreement n. 306242
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.
- Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM: Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009, 458 (7234): 97-101. 10.1038/nature07638.
- Magrangeas F, Pitiot G, Dubois S, Bragado-Nilsson E, Cherel M, Jobert S, Lebeau B, Boisteau O, Lethe B, Mallet J: Cotranscription and intergenic splicing of human galactose-1-phosphate uridylyltransferase and interleukin-11 receptor alpha-chain genes generate a fusion mRNA in normal cells. Implication for the production of multidomain proteins during evolution. The Journal of biological chemistry. 1998, 273 (26): 16005-16010. 10.1074/jbc.273.26.16005.
- Akiva P, Toporik A, Edelheit S, Peretz Y, Diber A, Shemesh R, Novik A, Sorek R: Transcription-mediated gene fusion in the human genome. Genome research. 2006, 16 (1): 30-36.
- Parra G, Reymond A, Dabbouseh N, Dermitzakis ET, Castelo R, Thomson TM, Antonarakis SE, Guigo R: Tandem chimerism as a means to increase protein complexity in the human genome. Genome research. 2006, 16 (1): 37-44.
- Nacu S, Yuan W, Kan Z, Bhatt D, Rivers CS, Stinson J, Peters BA, Modrusan Z, Jung K, Seshagiri S: Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples. BMC medical genomics. 2011, 4: 11. 10.1186/1755-8794-4-11.
- Frenkel-Morgenstern M, Lacroix V, Ezkurdia I, Levin Y, Gabashvili A, Prilusky J, Del Pozo A, Tress M, Johnson R, Guigo R: Chimeras taking shape: Potential functions of proteins encoded by chimeric RNA transcripts. Genome research. 2012, 22 (7): 1231-1242. 10.1101/gr.130062.111.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic acids research. 2005, 33 (Database issue): D34-38.
- Iyer MK, Chinnaiyan AM, Maher CA: ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics. 2011, 27 (20): 2903-2904. 10.1093/bioinformatics/btr467.
- McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, Griffith M, Heravi Moussavi A, Senz J, Melnyk N: deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS computational biology. 2011, 7 (5): e1001138. 10.1371/journal.pcbi.1001138.
- Francis RW, Thompson-Wicking K, Carter KW, Anderson D, Kees UR, Beesley AH: FusionFinder: a software tool to identify expressed gene fusion candidates from RNA-Seq data. PloS one. 2012, 7 (6): e39987. 10.1371/journal.pone.0039987.
- Li Y, Chien J, Smith DI, Ma J: FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq. Bioinformatics. 2011, 27 (12): 1708-1710. 10.1093/bioinformatics/btr265.
- Ge H, Liu K, Juan T, Fang F, Newman M, Hoeck W: FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics. 2011, 27 (14): 1922-1928. 10.1093/bioinformatics/btr310.
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. 2010, 38 (18): e178. 10.1093/nar/gkq622.
- Kinsella M, Harismendy O, Nakano M, Frazer KA, Bafna V: Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs. Bioinformatics. 2011, 27 (8): 1068-1075. 10.1093/bioinformatics/btr085.
- Kim D, Salzberg SL: TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome biology. 2011, 12 (8): R72. 10.1186/gb-2011-12-8-r72.
- Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert CJ, Hogenesch JB, Pierce EA: Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics. 2011, 27 (18): 2518-2528.
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-seq. Bioinformatics. 2009, 25 (9). 10.1093/bioinformatics/btp120.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: open software development for computational biology and bioinformatics. Genome biology. 2004, 5 (10): R80. 10.1186/gb-2004-5-10-r80.
- Sanges R, Cordero F, Calogero RA: oneChannelGUI: a graphical interface to Bioconductor tools, designed for life scientists who are not familiar with R language. Bioinformatics. 2007, 23 (24): 3406-3408. 10.1093/bioinformatics/btm469.
- Li H, Wang J, Ma X, Sklar J: Gene fusions and RNA trans-splicing in normal and neoplastic human cells. Cell Cycle. 2009, 8 (2): 218-222. 10.4161/cc.8.2.7358.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10: R25. 10.1186/gb-2009-10-3-r25.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.