- Open Access
Detection of internal exon deletion with exon Del
- Yan Guo†1Email author,
- Shilin Zhao†1,
- Brian D Lehmann2,
- Quanhu Sheng1,
- Timothy M Shaver1,
- Thomas P Stricker3,
- Jennifer A Pietenpol2 and
- Yu Shyr1Email author
© Guo et al.; licensee BioMed Central Ltd. 2014
- Received: 4 March 2014
- Accepted: 20 August 2014
- Published: 16 October 2014
Exome sequencing allows researchers to study the human genome in unprecedented detail. Among the many types of variants detectable through exome sequencing, one of the most over looked types of mutation is internal deletion of exons. Internal exon deletions are the absence of consecutive exons in a gene. Such deletions have potentially significant biological meaning, and they are often too short to be considered copy number variation. Therefore, to the need for efficient detection of such deletions using exome sequencing data exists.
We present ExonDel, a tool specially designed to detect homozygous exon deletions efficiently. We tested ExonDel on exome sequencing data generated from 16 breast cancer cell lines and identified both novel and known IEDs. Subsequently, we verified our findings using RNAseq and PCR technologies. Further comparisons with multiple sequencing-based CNV tools showed that ExonDel is capable of detecting unique IEDs not found by other CNV tools.
ExonDel is an efficient way to screen for novel and known IEDs using exome sequencing data. ExonDel and its source code can be downloaded freely at https://github.com/slzhao/ExonDel.
- Exome Sequencing
- Median Depth
- Callable Gene
- Small Window Size
- Exome Sequencing Data
Exome sequencing is one of the most cost-efficient sequencing approaches for conducting genome research on coding regions. The primary applications of exome sequencing include detection of single nucleotide polymorphisms, somatic mutations, small and large structural variations, and copy number variations. There are also some less obvious data mining opportunities through exome sequencing data such as extraction of mitochondrial  and viral sequences . Another less explored genomic aberration that can be detected through exome sequencing is internal exon deletions (IEDs). Not to confuse with exon skipping, IEDs are the result of the deletion of one or more consecutive exons in a gene where exon skipping are artificial method used to encourage the cellular machinery to skip over an exon .
Functional IEDs were first described in murine T-cell acute lymphoblastic leukemia (T-ALL), in which constitutive ligand-independent activation of NOTCH1 occurs from a deletion of exons 3-27, preserving the transcriptional binding domain in exons 28-34 . A similar IED was recently reported in a breast cancer cell line, HCC1599 . The number of deleted exons range from a single exon to nearly the whole gene as in the example of the HCC1599 cell line. These IEDs are often too short to be considered copy number variation, thus only the very large ones have a chance to be picked up by sequencing-based CNV detectors. IEDs have biological importance in cancer, such as in the removal of important regulatory mechanisms or protein-protein interaction domains. Given the large amount of publically available exome sequencing data accumulated over the last few years, a method that can efficiently detect such deletions would benefit the medical research community greatly and provide means to rapidly identify new IED candidates. Thus, we have designed ExonDel, a tool aimed at detecting IEDs through exome sequencing data. ExonDel is written in a combination of Perl and R. ExonDel detects exon deletion at gene level rather than at global level, and it adjusts for GC content.
An IED can be homozygous or heterozygous. While homozygous means that exon is deleted in both allele and heterozygous means that the exon is deleted in one of the two alleles. Homozygous deletion is relatively easier and more accurately detectable than heterozygous exon deletion. ExonDel is currently designed to detect homozygous IEDs only. ExonDel differs from other sequencing-based CNV tools by detecting exon deletion on a per gene level instead of searching for large lengths of depth variation across the whole genome. To achieve this, ExonDel first computes callable genes based on different exome capture methodologies. There are three major exome sequencing capture kits currently in broad use: Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ. The target regions for these three exome capture kits vary and range from 37.6 to 62.1 million base pairs. The capture kits available can enrich the exome, and additional content includes exons plus 3’ and 5’ UTRs. The capture kits differ in their target regions, bait length, bait density, and molecule used for capture. To account for these differences, ExonDel computes the callable genes first. A callable gene has to satisfy the following two conditions: 1) all exons of this gene must be covered by the exome capture kit, and 2) each exon must have at least 90% of its base pairs covered by the exome capture kit. The first condition ensures that no false positive resulted from uncovered exons in the capture kit. The second condition ensures that no false positives resulted from partially covered exons in the capture kit. ExonDel will only attempt to detect exon deletions for the callable genes.
The important inputs of ExonDel include a non-optional Binary Alignment Map (BAM)  file, a non-optional Browser Extensible Data (BED) file of the capture kit, and an optional Gene Feature Format (GTF) file. The BED file provides the exact capture regions down to a single base-pair resolution. The GTF file provides detailed information about the starts and ends of exons. Both BED and GTF files are used to compute callable genes: if GTF is not provided, ExonDel will apply the latest gene annotation from RefSeq. Other input parameters of ExonDel include a maximum window size and a list of genes of interest. The maximum window size parameter determines the max length IED that ExonDel will search for. For example, if the maximum window size is 7, ExonDel will search for IEDs with length less than 7 exons. For the user-input list of genes of interest, instead of searching though the entire exome, ExonDel will only search IEDs in the genes of interest in order to save time.
ExonDel detects exon deletions by comparing each exon’s depth against its parent gene’s median depth after performing the depth adjustment by GC content described previously. To ensure high specificity, reads with poor mapping quality (MQ < 20 for BWA aligned BAMs) are removed. If non BWA aligned BAMs are used (such as BAMs from Bowtie 2 , where mapping quality definition is different), ExonDel will compute the average base quality per read as where l is the length of the read, and BQi is the base quality of ith nucleotide. All reads with average BQ r < 20 are removed.
DP e i < C1 % × DP all , C% × DP all presents the percentile of all exon depth, and C is a constant. By default, C is 2. The user can manually adjust C to change the sensitivity of ExonDel. Increasing C will result more IEDs detected.
DP ē > C2 % Î§ DPall, where ē denotes the exons that do not satisfy condition 1. By default C is 10.
Exon deletion candidates identified by window length, and comparison with other CNV tools using breast cancer cell lines
Deletion window length
Verified in RNAseq2
Found by CNV Tools3
To demonstrate the effectiveness of ExonDel, we used two independent datasets. The first dataset contains exome sequencing data from 16 breast cancer cell lines. The exomes were captured using Illumina’s TrueSeq capture kit. Seventy five nucleotide paired-end sequencing was performed using Illumina’s HiSeq 2000 platform at Vanderbilt Genomic Core. RNAseq data RT-PCR were used to validate the IEDs identified by ExonDel. Because ExonDel is also designed to work with tumors, which are heterogeneous (a mixture of tumor and normal tissues) compared to cell line, we downloaded exome sequencing data of 10 breast cancer tumor samples ("TCGA-A7-A0D9", "TCGA-BH-A0B3", "TCGA-BH-A0B8", "TCGA-BH-A0BJ", "TCGA-BH-A0BM", "TCGA-BH-A0C0", "TCA-BH-A0DK", "TCGA-BH-A0DP", "TCGA-BH-A0E0", "TCGA-BH-A0H7") from The Cancer Genome Atlas (TCGA). The corresponding RNAseq data of the same 10 samples were also downloaded for validation purpose.
Exon deletion candidates identified by window length, and comparison with other CNV tools using the 10 TCGA breast cancer tumor samples
Deletion window length
Verified in RNAseq
Found by CNV tools
IEDs have functional implications in cancer genomics and we have developed a tool, ExonDel, to screen for novel IED candidates efficiently. Using a combination of Perl and R, we provide a single package including all source codes and instructions which is freely available for download. While providing several important new features, ExonDel also contains a few limitations. For example, as the name indicates, ExonDel can only detect exon deletion not amplification. The window size plays a significant role in detection of IED. Large window size ensures more accurate detection at the cost of missing small IEDs. Small window size on the other hand, allows to detection high number of IEDs at the cost of higher false positive rate. Thus, we recommend running ExonDel at window size 1 to 7 in one setting, and scan for potential biological meaningful IED candidates from the results of larger window size to smaller window size.
ExonDel distinguish itself from other sequencing-based CNV tools in two aspects. First, it performs deletion detection at gene level and uses exon as unit. Other sequencing-based CNV tools usually consider CNVs as large deletion or duplication spanning large genomic regions. It is common to see that CNV contains many genes and the median length of CNV detected using sequencing-based CNV tool is around 105 and the average exon less is less than 200 base pairs . Thus, ExonDel is very efficient, and one exome can be screened in about 15 minutes. ExonDel also allows the user to define the deletion window size and can be configured to run multiple BAM files in parallel.
In theory, ExonDel is designed to work with both tumor and cell line samples. Tumor samples differ from cell lines samples because they are usually a mixture of tumor and normal tissues. Thus, tumor sample contains noises which can mask the true variant signal. This is a challenge all variant callers have to face. If the tumor purity is low, a deleted exon might have reads aligned to it due to the presents of normal tissue. In such cases, ExonDel would not able to identify such IEDs. As shown in our TCGA tumor dataset results (Table 2), ExonDel was able to identify many potential IEDs, and a significant portion of them were verifiable by RNAseq and other CNV tools. This indicates that either the purities of these tumors were good, or many true IEDs were not affected by tumor heterogeneity. A portion of IEDs might still be affected by tumor heterogeneity, and these IEDs were not detectable by ExonDel.
Given the large volume of exome sequencing data publically available in repositories such as TCGA, the 1000 Genomes Project, NHLBI Exome Sequencing Project, and The Sequence Reads Archive, ExonDel provides researchers with a powerful tool to mine for internal deletions that may contain novel biological findings.
Project name: e.g. ExonDel project
Project home page: e.g. https://github.com/slzhao/ExonDel
Operating system(s): Linux
Programming language: Perl, R
License: GPL v2
Any restrictions to use by non-academics: No
This study was supported by NIH grants CCSG (P30 CA068485), CA9513, RC2CA148375, P50CA098131, 1K08CA148912 and Komen for the Cure Foundation grant KG262005.
The TCGA data was downloaded with approved request 14085-3 from dbGAP of NIH. We would also like to thank Margot Bjoring for editorial support.
- Guo Y, Li J, Li CI, Shyr Y, Samuels DC: MitoSeek: extracting mitochondria information and performing high throughput mitochondria sequencing analysis. Bioinformatics. 2013, 29 (9): 1210-1211. 10.1093/bioinformatics/btt118.View ArticlePubMed CentralPubMedGoogle Scholar
- Samuels DC, Han L, Li J, Quanghu S, Clark TA, Shyr Y, Guo Y: Finding the lost treasures in exome sequencing data. Trends Genet. 2013, 29 (19): 593-599.View ArticlePubMed CentralPubMedGoogle Scholar
- Harding PL, Fall AM, Honeyman K, Fletcher S, Wilton SD: The influence of antisense oligonucleotide length on dystrophin exon skipping. Mol Ther. 2007, 15 (1): 157-166. 10.1038/sj.mt.6300006.View ArticlePubMedGoogle Scholar
- Ashworth TD, Pear WS, Chiang MY, Blacklow SC, Mastio J, Xu L, Kelliher M, Kastner P, Chan S, Aster JC: Deletion-based mechanisms of Notch1 activation in T-ALL: key roles for RAG recombinase and a conserved internal translational start site in Notch1. Blood. 2010, 116 (25): 5455-5464. 10.1182/blood-2010-05-286328.View ArticlePubMed CentralPubMedGoogle Scholar
- Robinson DR, Kalyana-Sundaram S, Wu YM, Shankar S, Cao X, Ateeq B, Asangani IA, Iyer M, Maher CA, Grasso CS, Lonigro RJ, Quist M, Siddiqui J, Mehra R, Jing X, Giordano TJ, Sabel MS, Kleer CG, Palanisamy N, Natrajan R, Lambros MB, Reis-Filho JS, Kumar-Sinha C, Chinnaiyan AM: Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nat Med. 2011, 17 (12): 1646-1651. 10.1038/nm.2580.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.View ArticlePubMed CentralPubMedGoogle Scholar
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.View ArticlePubMed CentralPubMedGoogle Scholar
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19 (9): 1586-1592. 10.1101/gr.092981.109.View ArticlePubMed CentralPubMedGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9 (4): 357-359. 10.1038/nmeth.1923.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.View ArticlePubMed CentralPubMedGoogle Scholar
- Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics. 2011, 27 (19): 2648-2654. 10.1093/bioinformatics/btr462.View ArticlePubMed CentralPubMedGoogle Scholar
- Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974-984. 10.1101/gr.114876.110.View ArticlePubMed CentralPubMedGoogle Scholar
- Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, Quinlan AR, Nickerson DA, Eichler EE: Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012, 22 (8): 1525-1532. 10.1101/gr.138115.112.View ArticlePubMed CentralPubMedGoogle Scholar
- Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Barillot E: Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012, 28 (3): 423-425. 10.1093/bioinformatics/btr670.View ArticlePubMed CentralPubMedGoogle Scholar
- Love MI, Mysickova A, Sun R, Kalscheuer V, Vingron M, Haas SA: Modeling read counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol. 2011, 10: 1-Google Scholar
- Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, Hochreiter S: cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012, 40 (9): e69-10.1093/nar/gks003.View ArticlePubMed CentralPubMedGoogle Scholar
- Guo Y, Sheng Q, Samuels DC, Lehmann B, Bauer JA, Pietenpol J, Shyr Y: Comparative study of exome copy number variation estimation tools using array comparative genomic hybridization as control. Biomed Res Int. 2013, 2013: 915636-PubMed CentralPubMedGoogle Scholar
- Guo Y, Long J, He J, Li CI, Cai Q, Shu XO, Zheng W, Li C: Exome sequencing generates high quality data in non-target regions. BMC Genomics. 2012, 13: 194-10.1186/1471-2164-13-194.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.