Volume 12 Supplement 1
UMARS: Un-MAppable Reads Solution
- Sung-Chou Li†1, 2, 3,
- Wen-Ching Chan†1, 2, 4,
- Chun-Hung Lai3,
- Kuo-Wang Tsai3,
- Chun-Nan Hsu4, 5,
- Yuh-Shan Jou3,
- Hua-Chien Chen6,
- Chun-Hong Chen7 and
- Wen-chang Lin1, 3
© Li et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Un-MAppable Reads Solution (UMARS) is a user-friendly web service focusing on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads are first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads fail to map back to the reference sequences. Such un-mappable reads are usually attributed to sequencing errors and discarded without further consideration.
We are investigating the possible biological relevance and possible sources of un-mappable reads. Therefore, we developed UMARS to scan un-mappable reads for virus genomic fragments and for exon-exon junctions of novel alternative splicing isoforms. To map un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. Then, we constructed the UMARS pipeline as an automatic alignment interface.
By demonstrating the results of two UMARS alignment cases, we show the applicability of UMARS. We first showed that the expected EBV genomic fragments can be detected by UMARS. Second, we also detected exon-exon junctions from un-mappable reads. Further experimental validation also ensured the authenticity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via http://musk.ibms.sinica.edu.tw/UMARS/.
In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.
Biomedical research, especially genomic research, has been greatly accelerated by advances in sequencing technologies. Recently, next-generation sequencing (NGS) technology, including the Roche 454, Illumina GA and ABI SOLiD platforms, has emerged as a powerful tool for generating high-throughput sequencing data. Systematic evaluation revealed that these three platforms achieve high sequencing sensitivity because of the large number of reads obtained. Therefore, NGS technology has been applied in many studies, including transcriptome profiling [2–4], SNP identification [5, 6], genome sequencing and re-sequencing [7, 8], biomarker detection, and metagenomics [10, 11]. NGS technology has also been applied in miRNA identification and profiling studies. Morin and colleagues identified 104 novel human miRNA genes and listed miRNAs differentially expressed between embryo cell libraries. Glazov and colleagues discovered 449 new chicken miRNAs and 39 mirtrons. In addition, Wheeler and colleagues not only sequenced miRNAs from several metazoan genomes but also studied the evolution of miRNAs.
In a typical analysis pipeline, the generated NGS sequence reads are first subjected to adaptor trimming and then mapped back to reference sequences, including genomes, scaffolds or transcripts. Several tools, including BLAST, RazerS, SeqMap, SOAP2, BWA, MAQ and Bowtie, have been used for such mapping. Following the mapping step, the NGS reads are further processed to address specific experimental questions. While it is essential to process the mappable reads in subsequent studies, a fraction of sequence reads cannot be mapped back to reference sequences. In many cases, these un-mappable reads are attributed to sequencing errors and discarded without further consideration. With the rapid increase of NGS reads, we intend to examine the possible biological relevance and possible sources of un-mappable reads. Therefore, we have developed the Un-MAppable Reads Solution (UMARS) pipeline in this study. Although un-mappable reads could originate from platform-specific technical errors, there have been reports demonstrating the presence of viral genomic sequences or cryptic splicing isoforms in NGS data [22, 23].
Eukaryotic organisms are often infected by different viruses, leading to stable symbiosis or parasitism. As parasites, the infecting viruses take over the infected cells to produce their own genetic materials. Therefore, the collected RNA samples could be contaminated by viral transcripts when tissue or cells are lysed, which produces un-mappable reads when only the host cell genome is used for mapping. Kreuze et al. detected virus infection by deep sequencing of viral small RNAs. They concluded that NGS technology can be a method for diagnosis and discovery of virus infections. Wu et al. reached a similar conclusion. In the study by Kreuze et al., in addition to the expected infecting viruses, unexpected novel virus reads and unidentified sequence reads also accounted for a large fraction of all reads. The results from these studies demonstrate that the genomic sequences from infecting viruses may contribute to un-mappable reads, and that NGS technology is useful for systematic examination of putative viral genomes.
Another possible source of un-mappable reads is cryptic splicing isoforms. During gene expression, eukaryotic genes usually undergo mRNA splicing, in which introns are removed and exons are joined. Sequence reads located at the exon-exon junctions of novel alternative splicing isoforms can be mapped back neither to the genome nor to reference mRNAs. For example, Trapnell et al. identified novel wobble splicing junctions from NGS reads. However, there are no specific tools for discovering cryptic alternative splicing exon-exon junctions from large numbers of NGS reads.
At present, no user-friendly bioinformatic tool or service is available that focuses on scanning un-mappable reads for viral genomic regions or novel alternative splicing exon-exon junctions. We believe that such a tool would be beneficial for biological science researchers.
Methods and materials
Collection of genomes and sequence reads
For mapping sequence reads back to viral genomes, we first downloaded 3602 viral genomic sequences from NCBI RefSeq 40. According to the categories of their hosts, these viral genomes were classified into five classes: animals, plants, fungi, protozoa plus algae, and bacteria plus archaea. We also downloaded the genomic sequences of several animal species from the UCSC genome browser database for extracting exon-exon junctions. The genome versions of these species are listed in Additional file 1. In this study, NGS sequence reads from several libraries were used. The sequencing platform, RNA source species, and RNA source tissue of these libraries are listed in Additional file 2.
Extraction of exon-exon junction sequences
Sequence reads processing and mapping
NGS technologies produce millions of reads, some of which occur with high frequency. Such high-occurrence reads cause a redundancy problem that should be solved before mapping. Therefore, we developed an in-house tool, called Non-redundant Reads Producer (NRP), to solve this redundancy problem. NRP identifies unique sequence reads from the input data, assigns each a new ID and tabulates the occurrence frequency (copy number) of each unique read. After NRP processing, non-redundant un-mappable reads can be mapped back to viral genomes or EEJs by UMARS. In studies that map sequence reads back to genomes, 100% identity is usually demanded [12, 13]. Because viral genomes usually have higher mutation rates than eukaryotic ones, we allowed one nucleotide variation, either a mismatch or a gap, when mapping back to viral genomes or to EEJ sequences. The mapping procedures in this study were done with BLAST.
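The two ideas in this paragraph, collapsing redundant reads and tolerating a single nucleotide variation, can be sketched in a few lines of Python. This is an illustration only, not the NRP tool or the BLAST-based mapping used in the study: the function names, the `NR` ID prefix and the end-to-end comparison in `within_one_edit` are assumptions made for the example.

```python
from collections import Counter


def collapse_reads(reads, prefix="NR"):
    """Collapse identical reads into (new_id, sequence, copy_number) tuples.

    The ID scheme (prefix plus a running number) is hypothetical.
    """
    counts = Counter(r.strip().upper() for r in reads if r.strip())
    # most_common() orders by decreasing copy number; ties keep first-seen order.
    return [
        (f"{prefix}{i:07d}", seq, n)
        for i, (seq, n) in enumerate(counts.most_common(), start=1)
    ]


def within_one_edit(read, target):
    """Return True if read and target differ by at most one mismatch or gap."""
    if abs(len(read) - len(target)) > 1:
        return False
    if len(read) > len(target):
        read, target = target, read  # make `read` the shorter (or equal) string
    i = 0
    while i < len(read) and read[i] == target[i]:
        i += 1
    if i == len(read):
        return True  # identical, or target has one extra trailing base
    if len(read) == len(target):
        return read[i + 1:] == target[i + 1:]  # a single mismatch at position i
    return read[i:] == target[i + 1:]  # a single gap at position i


if __name__ == "__main__":
    for row in collapse_reads(["ACGT", "acgt", "TTGA", "ACGT\n"]):
        print(*row, sep="\t")
```

Collapsing first means each distinct sequence is aligned once, and its copy number can later be used as an expression estimate.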
Prediction of viral miRNAs
After the mapping procedure, the viral genomic loci mapped by sequence reads are considered candidate miRNAs. These genomic loci and their flanking sequences were extracted and then processed with the miRNA identification pipeline. For each candidate miRNA, the pipeline first calculated the values of ten features, which serve as discrimination indices in a Support Vector Machine (SVM) algorithm. Then, the SVM was used to classify candidate miRNAs into positive or negative sets.
In this study, we used sequence reads from the L2 library (Additional file 2) to scan for the EEJs of novel alternative splicing isoforms, followed by experimental validation of the detected EEJs in 23 human tissues. Below we describe how cDNAs were prepared from these tissues. Human tissue poly(A) RNAs (5 μg) or total RNAs (40 μg) purchased from Clontech (Clontech, Palo Alto, CA) were reverse-transcribed by Transcriptor reverse transcriptase (Roche Applied Science), primed by oligo (dT)15 according to the supplier's instructions. After the reverse transcription reaction, the mixtures were phenol-extracted once, followed by chloroform extraction. Excess primers were removed by applying the mixtures to a Chroma Spin-200 (Clontech) gel filtration column. The purified cDNAs were diluted appropriately and used as amplification templates in the Polymerase Chain Reaction (PCR). In this study, cDNAs from 23 tissues were investigated and the gel lanes were labeled as follows: M: DNA marker, 1: blood, 2: bone marrow, 3: brain, 4: colon, 5: heart, 6: kidney, 7: liver, 8: lung, 9: ovary, 10: pancreas, 11: placenta, 12: skeletal muscle, 13: small intestine, 14: stomach, 15: testis, 16: whole fetus, 17: breast tumor, 18: cervix tumor, 19: colon tumor, 20: kidney tumor, 21: lung tumor, 22: ovary tumor, 23: gastric tumor, 24: PCR no-template control.
Experimental validation of discrete EEJ
The detected EEJs were verified by PCR amplification, followed by capillary sequencing confirmation. Primer pair sequences were picked from each pair of “discrete” EEJ-spanning exons and are listed in Additional file 3. The main PCR components were 1 mM dNTP, 1 μM of each primer, 0.1 U Takara Taq DNA polymerase (Takara) per 10 μL reaction volume, and the diluted cDNA. The thermal program was 94°C for 3 minutes, then 40 cycles (30 cycles for GAPDH) of denaturing at 94°C for 20 seconds, annealing at 58°C for 30 seconds, and extension at 72°C for 30 seconds, and finally 72°C for 10 minutes. PCR products were separated on a 3% NuSieve (Lonza, Rockland, ME) conventional TAE-agarose gel and visualized under an ultraviolet light source. The gel regions at the detected and the estimated target sizes were cut out and the nucleic acid contents were purified with Viogene Gel Purification reagents. Eluted minor bands were subjected to an additional 30 PCR cycles with the same pair of primers. The amplified nucleic acid fragments were directly sequenced on an ABI 3730xl DNA Analyzer (Applied Biosystems).
Results and discussion
UMARS pipeline and interface
The purpose of UMARS:EEJ is to identify novel alternative splicing exon-exon junctions (EEJs) from un-mappable reads. The sequences of all possible EEJs of 21 species were collected in advance. In UMARS:EEJ, uploaded reads are mapped to EEJs. To avoid random sequence matches, in addition to our mapping criteria (see Methods and materials), a match must overlap each of the two exons by at least five nucleotides, so that it is not skewed too far toward either exon. Following the mapping procedure, UMARS tabulates the detected EEJs and their expression levels. The detected EEJs are reported as either continuous or discrete EEJs. Continuous EEJs represent known mRNA transcripts, whereas discrete EEJs could represent novel splicing isoforms.
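The five-nucleotide overlap rule can be illustrated with a short Python sketch. It is not the UMARS implementation: the helper names, the 30-nt flank length and the 0-based, half-open coordinate convention are assumptions chosen for the example.

```python
def make_eej(upstream_exon, downstream_exon, flank=30):
    """Join the 3' end of one exon to the 5' start of another.

    Returns the junction sequence and the index of the first downstream-exon
    base within it. `flank` must be positive; its default is hypothetical.
    """
    left = upstream_exon[-flank:]
    right = downstream_exon[:flank]
    return left + right, len(left)


def spans_junction(match_start, match_end, junction, min_overlap=5):
    """Check that a match covers at least `min_overlap` bases on both exons.

    Coordinates are 0-based and half-open on the junction sequence.
    """
    upstream_bases = junction - match_start
    downstream_bases = match_end - junction
    return upstream_bases >= min_overlap and downstream_bases >= min_overlap


if __name__ == "__main__":
    seq, junction = make_eej("A" * 40, "C" * 40)
    print(len(seq), junction)
    print(spans_junction(20, 42, junction))  # 10 nt upstream, 12 nt downstream
    print(spans_junction(27, 49, junction))  # only 3 nt on the upstream exon
```

A match that sits almost entirely within one exon would also align to the genome or to a known transcript, which is why a minimum overlap on each side is needed to call a junction.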
The purpose of UMARS:Vir is to identify possible virus genomic regions from un-mappable NGS reads. In UMARS:Vir, the uploaded NGS reads are mapped to all 3,602 known virus genomes. Following the mapping procedure, UMARS tabulates the detected virus species and their expression levels. The detected viral genomic regions may be located in intergenic, protein-coding gene, pre-miRNA or other regions according to the annotations of RefSeq 40 and miRBase 15. This genomic annotation is also provided by the UMARS service. Several viruses are reported to encode viral miRNAs, which regulate the expression of host genes and play important roles in host immune dysfunction [28–30]. Therefore, UMARS:Vir also offers the option of detecting viral miRNAs from viral intergenic genomic regions with an additional miRNA identification pipeline.
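As a rough illustration of the tabulation step, the following Python sketch sums read copy numbers per virus. The input layout (a list of read-to-virus hits plus a copy-number table) and the choice to credit a multi-mapping read to every virus it hits are assumptions, not details taken from UMARS.

```python
from collections import defaultdict


def tabulate_viruses(hits, copy_number):
    """Sum copy numbers of mapped reads per virus.

    hits: iterable of (read_id, virus_name) pairs from the mapping step.
    copy_number: dict mapping read_id to its copy number.
    Returns (virus_name, total) pairs sorted by decreasing total, then name.
    """
    totals = defaultdict(int)
    seen = set()
    for read_id, virus in hits:
        if (read_id, virus) in seen:
            continue  # count each read once per virus, even with several loci
        seen.add((read_id, virus))
        totals[virus] += copy_number[read_id]
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    hits = [("r1", "Human herpesvirus 4 type 1"),
            ("r2", "Human herpesvirus 4 type 1"),
            ("r2", "Macacine herpesvirus 4"),
            ("r1", "Human herpesvirus 4 type 1")]
    for virus, total in tabulate_viruses(hits, {"r1": 10, "r2": 5}):
        print(virus, total, sep="\t")
```

Because closely related viruses share sequence, a read such as `r2` above can contribute to more than one species, so low totals for relatives of a dominant virus should be read with caution.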
Summary of viruses detected from L1 un-mappable reads.
- Human herpesvirus 4 type 1
- Macacine herpesvirus 4
- Human herpesvirus 2
- Bovine herpesvirus 5
- Papiine herpesvirus 2
- Human herpesvirus 1
- Cercopithecine herpesvirus 2
- Macacine herpesvirus 1
Case study and demonstration of UMARS:Vir
To demonstrate the utility of UMARS, we have analyzed NGS reads using UMARS:Vir. In the first case, we investigated the un-mappable reads from the human NPC cells (L1 library) infected with Epstein-Barr virus (EBV, also named human herpes virus 4 type 1). We examined whether the expected EBV genome could be detected by UMARS:Vir. As a result, eight viruses were detected under our mapping criteria. As shown in Table 1, the expected EBV matches dominated over other un-expected viruses in terms of expression level, which shows that UMARS:Vir can be used to detect infections by a specific virus from un-mappable reads. Besides EBV, there were seven un-expected viruses detected, most of which infect primates, and all of them belong to the herpes virus family.
EBV genomic regions mapped by reads.
Because of the high sequencing depth, additional mature miRNAs from the same precursors, including isomiRs on the same arm and minor forms of mature miRNA on the opposite arm, are usually detected from NGS reads [12, 13, 31]. After arranging miRNA reads in order within their corresponding pre-miRNAs, we observed many isomiRs at all of the 22 detected pre-miRNAs (Additional file 5). Compared with EBV pre-miRNAs in miRBase 15, the reference mature miRNAs do not always represent the most abundant reads. In addition, according to the miRBase 15 annotation, mir-BART12, mir-BART16 and mir-BART22 encode mature miRNAs only on their 3’, 5’ and 3’ arms, respectively. However, we detected additional mature miRNAs on the 5’ arm of mir-BART12, the 3’ arm of mir-BART16 and the 5’ arm of mir-BART22. Moreover, the 5’ arm of mir-BART12 and the 3’ arm of mir-BART16 yielded more reads than the originally annotated arms. This result is similar to that in Wheeler’s report and should be noted in future updates of miRBase.
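The arm comparison described above can be sketched as follows. This is a simplified illustration rather than the procedure used in the study: assigning a read to the 5’ or 3’ arm by comparing its centre with the midpoint of the precursor is an assumption (a real hairpin annotation would use the loop position), and the coordinates and counts in the usage example are invented.

```python
def arm_counts(precursor_length, reads):
    """Total read copies per arm of a pre-miRNA.

    reads: iterable of (start, end, copies), 0-based half-open positions
    on the precursor. A read belongs to the 5' arm if its centre lies
    before the precursor midpoint, otherwise to the 3' arm.
    """
    midpoint = precursor_length / 2
    totals = {"5p": 0, "3p": 0}
    for start, end, copies in reads:
        centre = (start + end) / 2
        arm = "5p" if centre < midpoint else "3p"
        totals[arm] += copies
    return totals


def dominant_arm(totals):
    """Return the arm with more reads ('5p' on a tie)."""
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Invented example: two isomiRs on the 5' arm and one read on the 3' arm.
    totals = arm_counts(80, [(5, 27, 100), (6, 27, 20), (50, 72, 300)])
    print(totals, dominant_arm(totals))
```

Comparing the dominant arm with the annotated arm is one way to flag precursors whose reference mature miRNA is not the most abundant product.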
Case study and demonstration of UMARS:EEJ
Summary of EEJs detected from L2 un-mappable reads.
Detected EEJs from CLSTN1.
MYL6 (myosin light polypeptide 6) encodes a myosin alkali light chain and is associated with cell migration. It was also reported that fibroblasts promote the growth of breast tumor cells by enhancing the expression of several genes, including MYL6. In this study, the c alternative transcript and the d transcript of MYL6 have similar expression levels in most normal tissues (Additional file 6). However, the d transcript dominates over the c alternative transcript in most tumor tissues, including breast tumor (17th lane in Additional file 6a). It is possible that these alternative splicing isoforms function differently from each other and are associated with tumorigenesis.
With the rapid increase of sequencing data, UMARS can detect more and more un-expected splicing isoforms, which may provide deeper insights into gene functions and their relations to disease. Although NGS technology has been considered a powerful sequencing tool in biological research, large-scale studies, such as those using microarrays, seem unavoidably to produce un-expected data. Such un-expected data could be background noise, which should be eliminated for data accuracy. In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage. Although we have proposed two possible sources of un-mappable reads, a fraction of un-mappable reads still could not be assigned by UMARS. More effort should be expended in investigating the biological relevance and possible sources of un-mappable reads.
This work was supported by grants from Academia Sinica and National Science Council of Taiwan.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
- Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, et al.: Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 2009, 10(3):R32. 10.1186/gb-2009-10-3-r32
- Peters LM, Belyantseva IA, Lagziel A, Battey JF, Friedman TB, Morell RJ: Signatures from tissue-specific MPSS libraries identify transcripts preferentially expressed in the mouse inner ear. Genomics 2007, 89(2):197–206. 10.1016/j.ygeno.2006.09.006
- Wang X, Sun Q, McGrath SD, Mardis ER, Soloway PD, Clark AG: Transcriptome-wide identification of novel imprinted genes in neonatal mouse brain. PLoS One 2008, 3(12):e3839. 10.1371/journal.pone.0003839
- Yassour M, Kaplan T, Fraser HB, Levin JZ, Pfiffner J, Adiconis X, Schroth G, Luo S, Khrebtukova I, Gnirke A, et al.: Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proc Natl Acad Sci USA 2009, 106(9):3264–3269. 10.1073/pnas.0812841106
- Qi W, Kaser M, Roltgen K, Yeboah-Manu D, Pluschke G: Genomic diversity and evolution of Mycobacterium ulcerans revealed by next-generation sequencing. PLoS Pathog 2009, 5(9):e1000580. 10.1371/journal.ppat.1000580
- Trick M, Long Y, Meng J, Bancroft I: Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnol J 2009, 7(4):334–346. 10.1111/j.1467-7652.2008.00396.x
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, et al.: Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 2008, 5(2):183–188. 10.1038/nmeth.1179
- Shen Y, Sarin S, Liu Y, Hobert O, Pe'er I: Comparing platforms for C. elegans mutant identification using high-throughput whole-genome sequencing. PLoS One 2008, 3(12):e4012. 10.1371/journal.pone.0004012
- Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, et al.: High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet 2008, 40(8):987–993. 10.1038/ng.195
- Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J: WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 2009, 10: 430. 10.1186/1471-2105-10-430
- Handelsman J: Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 2004, 68(4):669–685. 10.1128/MMBR.68.4.669-685.2004
- Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, et al.: Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 2008, 18(4):610–621. 10.1101/gr.7179508
- Glazov EA, Cottee PA, Barris WC, Moore RJ, Dalrymple BP, Tizard ML: A microRNA catalog of the developing chicken embryo identified by a deep sequencing approach. Genome Res 2008, 18(6):957–964. 10.1101/gr.074740.107
- Wheeler BM, Heimberg AM, Moy VN, Sperling EA, Holstein TW, Heber S, Peterson KJ: The deep evolution of metazoan microRNAs. Evol Dev 2009, 11(1):50–68. 10.1111/j.1525-142X.2008.00302.x
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
- Weese D, Emde AK, Rausch T, Doring A, Reinert K: RazerS--fast read mapping with sensitivity control. Genome Res 2009, 19(9):1646–1654. 10.1101/gr.088823.108
- Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008, 24(20):2395–2396. 10.1093/bioinformatics/btn429
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(15):1966–1967. 10.1093/bioinformatics/btp336
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754–1760. 10.1093/bioinformatics/btp324
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851–1858. 10.1101/gr.078212.108
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25. 10.1186/gb-2009-10-3-r25
- Kreuze JF, Perez A, Untiveros M, Quispe D, Fuentes S, Barker I, Simon R: Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 2009, 388(1):1–7. 10.1016/j.virol.2009.03.024
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25(9):1105–1111. 10.1093/bioinformatics/btp120
- Wu Q, Luo Y, Lu R, Lau N, Lai EC, Li WX, Ding SW: Virus discovery by deep sequencing and assembly of virus-derived small silencing RNAs. Proc Natl Acad Sci USA 2010, 107(4):1606–1611. 10.1073/pnas.0911353107
- Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res 2009, 37(Database issue):D32–36. 10.1093/nar/gkn721
- Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, et al.: The UCSC Genome Browser database: update 2010. Nucleic Acids Res 2010, 38(Database issue):D613–619. 10.1093/nar/gkp939
- Li SC, Chan WC, Hu LY, Lai CH, Hsu CN, Lin WC: Identification of homologous microRNAs in 56 animal genomes. Genomics 2010, 96(1):1–9. 10.1016/j.ygeno.2010.03.009
- Li SC, Shiau CK, Lin WC: Vir-Mir db: prediction of viral microRNA candidate hairpins. Nucleic Acids Res 2008, 36(Database issue):D184–189.
- Nair V, Zavolan M: Virus-encoded microRNAs: novel regulators of gene expression. Trends Microbiol 2006, 14(4):169–175. 10.1016/j.tim.2006.02.007
- Cullen BR: Viruses and microRNAs. Nat Genet 2006, 38 Suppl: S25–30. 10.1038/ng1793
- Chen X, Li Q, Wang J, Guo X, Jiang X, Ren Z, Weng C, Sun G, Wang X, Liu Y, et al.: Identification and characterization of novel amphioxus microRNAs by Solexa sequencing. Genome Biol 2009, 10(7):R78. 10.1186/gb-2009-10-7-r78
- Bora PS, Bora NS, Wu X, Kaplan HJ, Lange LG: Molecular cloning, sequencing, and characterization of smooth muscle myosin alkali light chain from human eye cDNA: homology with myocardial fatty acid ethyl ester synthase-III cDNA. Genomics 1994, 19(1):186–188. 10.1006/geno.1994.1041
- Samoszuk M, Tan J, Chorn G: Clonogenic growth of human breast cancer cells co-cultured in direct contact with serum-activated fibroblasts. Breast Cancer Res 2005, 7(3):R274–283. 10.1186/bcr995
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.