Volume 16 Supplement 17
SpliceJumper: a classification-based approach for calling splicing junctions from RNA-seq data
© Chu et al. 2015
Published: 7 December 2015
Next-generation RNA sequencing technologies have been widely applied in transcriptome profiling. This facilitates further studies of gene structure and expression on the genome wide scale. It is an important step to align reads to the reference genome and call out splicing junctions for the following analysis, such as the analysis of alternative splicing and isoform construction. However, because of the existence of introns, when RNA-seq reads are aligned to the reference genome, reads can not be fully mapped at splicing sites. Thus, it is challenging to align reads and call out splicing junctions accurately.
In this paper, we present a classification based approach for calling splicing junctions from RNA-seq data, which is implemented in the program SpliceJumper. SpliceJumper uses a machine learning approach which combines multiple features extracted from RNA-seq data. We compare SpliceJumper with two existing RNA-seq analysis approaches, TopHat2 and MapSplice2, on both simulated and real data. Our results show that SpliceJumper outperforms TopHat2 and MapSplice2 in accuracy. The program SpliceJumper can be downloaded at https://github.com/Reedwarbler/SpliceJumper.
Keywordssplicing junction calling RNA-seq sequence reads alignment support vector machine
With the development of high-throughput sequencing technologies, RNA-seq has been widely applied in transcriptome profiling. This facilitates the further studies of gene structure and expression on the genome wide scale. One of the opportunities provided by RNA-seq is detecting splicing junctions. In eukaryotic genomes, splicing is a process that exons join together and introns are excluded to form the mature mRNA. Recent studies show that variations in splicing patterns are associated with Alzheimer  and other complex diseases . Thus, detecting splicing junctions not only helps to profile transcriptomes, but also contributes to the understanding of the mechanism of some complex diseases. In this paper, we focus on calling splicing junctions from RNA-seq data of organisms that have released reference genomes.
Over the past several years, many sophisticated computational approaches for calling splicing junctions from RNA-seq data have been developed [3–6]. But it is still challenging to call out splicing junctions accurately. One difficulty is that because of the discrete nature of RNA-seq data, reads spanning splice sites cannot be fully mapped to the reference genome. Situation becomes worse when reads span three or more exons. A common strategy is to map reads that span two or more exons as split-mapped reads. Split-mapped read means that one wants to map segments of the read onto multiple disjoint genomic regions (which correspond to the exons). Most existing RNA-seq reads alignment tools, such as TopHat [3, 7], MapSplice , STAR , PALMapper , GSNAP , PASS , and GEM , are able to handle split-mapped reads with large or small gaps. But because of the repeats on genome, reads errors, and the short length of unmapped segments, it is still difficult to align the unmapped segments correctly. Another challenge is that the coverage of reads is uneven, and the expression level of many transcripts is low. Thus many exons are covered by only a few or even no reads. This can make it difficult to call out junctions especially for tools relying on coverage to form exon islands. Another issue is that many tools only use part of the information contained in the reads. For example, TopHat  only uses reads coverage to call splicing junctions. In principle, using more information, e.g. number of discordant encompassing pairs and number of split-mapped reads, contained in the reads may make junction calling more accurate. Moreover. the past study of RNA in organisms such as humans has accumulated substantial knowledge about RNA structure. Ideally, calling splice junctions can be assisted by these prior knowledge. Thus, it is useful to develop new tools that use more information contained in RNA-seq data and also the prior knowledge about the RNA structure to achieve higher sensitivity and specificity of the called splicing junctions.
In this paper, we introduce a new approach, SpliceJumper, which uses a machine learning approach that combines multiple features extracted from RNA-seq paired-end reads. There are three steps for SplicJumper. First, we align the raw reads to genome sequence using BWA  and call out all the candidate splicing sites. Then, we classify the true and false splicing sites using a machine learning approach by combining several features extracted from the reads. Finally, we call out the splicing junctions with the true splicing sites, and paired-end reads are used to filter out the false ones. The main idea behind our approach is that we treat the problem of calling splicing sites as a classification problem. Then, we use the called out splicing sites to guide the re-alignments of reads that are initially not fully mapped. This allows accurate calling of splicing junctions.
There are three main aspects for SpliceJumper:
1) SpliceJumper uses a machine learning approach to call out true splicing sites by combining more features. Also, information contained in paired-end reads is used by SpliceJumper. As in many situations, it is not easy to call out splicing junctions using only one or two features. So combining more features helps to improve sensitivity and specificity. Moreover, we use an efficient classification approach to combine all these features.
2) Similar to many existing approaches, we also re-align the initially unmapped reads. The difference is that we try to re-align the unmapped parts in focal regions with the help of the called out splicing sites. This way, it not only works efficiently, but also helps to filter out ambiguous alignments. We show that SpliceJumper outperforms TopHat2 and MapSplice2 in accuracy on both simulation and real data.
3) Unlike tools such as TopHat, MapSplice, or other tools that require user-provided threshold parameters to call out splicing junctions, SpliceJumper learns the parameters through the training procedure. Thus there is no need for users to set any thresholds.
Signatures of splicing junctions on reads
There are mainly three types of signatures that can be extracted from the RNA-seq reads around splicing sites.
(i) Discordant encompassing pair signature: the number of discordant encompassing pairs. A discordant pair has insert size from the two mapped reads outside the range [m-3v,m+3v], where m is the mean insert size, and v is the standard variation of the insert size. If the length of a splicing junction is large enough, then a pair-end reads encompassing the splicing junction will become discordant after aligned to the reference genome. Figure 1 shows two discordant pairs that encompass splicing junctions [C,D] and [A,D].
(ii) Split-mapped reads signature: the number of split-mapped reads. Reads spanning splicing sites may be clip-mapped at the splicing sites. Thus, each read will be split into two or more segments. If all these segments are aligned correctly, the split-mapped read may reveal the positions of the splicing sites. Figure 1 shows three split-mapped reads that span splicing sites A, B, C, and D. Take the read split-mapped at sites A and B as an example. The partially mapped segment (the left segment) clipped at site A indicates site A is a potential splicing site. And the clipped segment (the right segment) can be mapped at site B that indicates site B is also a potential splicing site.
(iii) Coverage changing signature: the coverage change between left and right neighboring regions. Coverage change will be apparent from left neighboring region to right neighbor region of a splicing site. This is because no reads will be mapped at the right region of a donnor site, and no reads aligned at the left region of an acceptor site. Figure 1 shows the coverage changing of the four splicing sites A, B, C, and D.
More detailed information of parsing features from RNA-seq data are explained in the Method section.
During the past few years, many tools have been developed to call splicing junctions. According to features used and strategies to combine features, there are two main types of approaches. The first is "exon inference" based approach, which uses coverage information to infer exons first, and then align initially unmapped reads and call splicing junctions. One tool is TopHat which first infers exon islands with the initially mapped reads aligned by Bowtie [15, 16]. Then TopHat concatenates the potential exons using the known splicing motifs, and finally re-aligns the initially unmapped reads to the jointed exons. Another similar tool is PASSION . PASSION also builds exon islands from initially mapped reads. Then it uses the pattern growth algorithm to split and align unmapped reads. And finally a filtering strategy is used to call out splicing junctions. When coverage is high, this kind of approaches is expected to work well. However, when coverage is low, these approaches may not work well. For example, because of the existence of errors (reads errors or alignment errors), variations, or the uneven coverage, there may be no reads covering some (small or large) regions of an exon. Thus the exon will be partitioned into two or even more "exons" in this step. This not only affects the alignments of reads in the following steps, but also introduces many false positives.
The second type of approach is "gap alignment" based. There are two main steps for this kind of approaches. The first step is a "seed-extend" process that splits reads into segments (called kmers), which are then aligned to the reference genome independently. If a segment cannot be initially mapped, sometimes its neighboring segments are mapped. If this segment can be reconstructed by extending its neighboring regions, the gap spanned by this segment is considered as a potential splicing junction and is called out. In the second step, true splicing junctions are called out according to some threshold parameters. MapSplice is a tool of this kind. In the second step, each potential splicing junction is scored by MapSplice according to anchor significance and entropy. Another similar approach is TopHat2, which is an improved version of TopHat. One difference between TopHat2 and MapSplice is that TopHat2 first aligns reads to transcriptome that are generated from provided annotations. It is believed that "gap alignment" based approaches perform better than "exon inference" based methods, especially when the expression level or coverage is low . However, this kind of approaches also has its own limitations. For example, the length of many splicing junctions is several thousands bases or even larger, and many kmers may be ambiguously mapped in such a long region. Thus it can be difficult for those tools to distinguish which is the true alignment, even using some anchor based strategies.
Besides these two main kinds of approaches, there are other approaches that combine different features, such as TrueSight . TrueSight combines RNA-seq read mapping quality and coding potential of genomic sequence into a model, which is trained by iterative logistic regression. Then TrueSight uses the model to de novo identify splicing junctions and filter out unreliable ones. One issue is that TrueSight views each candidate junction as a whole and this may introduce false positives. For example, if the donor site is of high confidence (have strong features) while the acceptor site is a false one, then it is quite possible for TrueSight to call this splicing junction as a true one. This can introduce false positives.
In this paper, we compare the performance of SpliceJumper with two tools. One is TopHat2. TopHat2 is widely used for splicing junction calling and RNA-seq reads alignment. The other is MapSplice. MapSplice performs well for both reads alignments and splicing junction calling according to two assessment papers [17, 18].
Our approach, SpliceJumper, aims to call splicing junctions from RNA-seq paired-end reads, and also align the reads. There are mainly three steps to call out the splicing junctions. First, SpliceJumper aligns the raw reads to the reference sequence using BWA, and calls out all the candidate splicing sites. As each candidate splicing site is either a true one or a false one, this problem can be viewed as a classification problem. So it is natural to use machine learning approaches to perform classification. Thus in the second step we use a supervised machine learning approach to call splicing sites. We choose supervised machine learning because for both real and simulation data, we have enough labeled data as training data. To train a model, choosing the appropriate features is important. SpliceJumper uses four features parsed from the three signatures explained in the Background Section. After parsing all the features, SpliceJumper trains a model using the training data. Then with the trained model, SpliceJumper calls out the splicing sites. Finally, in the third step, we call out the splicing junctions with the called out splicing sites. In this step, each called out splicing junction should satisfy three conditions: 1)Both the donor and acceptor sites are the true ones that are called out in the second step. 2) At least one read is split-mapped at the donor and acceptor sides. 3) The distance between the mapped segment and the clipped segment is concordant with the insert size between the split-mapped read and its mapped mate read. And also, the reads alignment is finished during this procedure.
Details of our approach
SpliceJumper calls splicing junctions in three steps. In the first step, BWA is used to align the raw reads to the reference genome and from the BAM file we call out all the candidate splicing sites. Then we collect all the features of each candidate splicing site, train a SVM model based on training data, and then call out the true ones using the trained model. Finally, we call out splicing junctions with the true splicing sites.
Preprocessing and candidate splicing sites calling
SpliceJumper requires BAM files (sorted and indexed) as input. So before calling candidate splicing sites, BWA (or other alignment tools reporting soft-clipped and hard-clipped alignments) is used to align the raw reads to the reference genome. Then reads are classified to three types according to the alignment type: fully mapped reads, clip-mapped reads including soft-clipped and hard-clipped reads, and other reads. Fully mapped reads refer to the reads that can be fully aligned to the reference. BWA only reports the primary alignment. So if only a part of a read is primarily aligned, the read will be reported as a clip-mapped read. Depending on whether the clipped part is ambiguously mapped or not, clip-mapped reads are classified as hard-clipped reads and soft-clipped reads. Other reads are mainly unmapped reads, which usually refer to those reads that are mapped to the junction of two adjacent reference sequences. In our approach, reads with low mapping quality or unmapped are discarded. During the process of classifying reads, reads coverage is calculated. If a read pair is discordant, we also mark the positions of the read pair.
Once the candidate splicing sites are called out, SpliceJumper collects features for each site. We extract four features from the three signatures mentioned in the Background Section. A clipped read contains two types of segments: partially mapped segments and clipped segments. Different types of segments provide information for different splicing sites. So we collect the two features from the split-mapped reads signature: (i) the number of reads clipped at candidate sites, and (ii) the number of clipped segments mapped at candidate sites. From the other two signatures we collect two features: (iii) the number of discordant encompassing paired-end reads, and (iv) the coverage difference between the left and right neighbor regions. Thus, each candidate site will be represented by four features and be viewed as a point in four dimensional space. In this way, the splicing sites calling problem is converted to a classification problem.
(i) Reads clipped at candidate sites: the number of reads clipped at candidate sites. Because of the existence of introns, reads spanning the splicing sites cannot be fully mapped and are reported as soft-clipped or hard-clipped alignments. This provides us not only the positions of splicing sites, but also strong evidence that the candidate splicing sites may be the true ones. Figure 1 shows three reads clipped at splicing sites.
(ii) Clipped segments mapped at candidate sites: the number of clipped segments, which initially cannot find primary alignments when aligned with BWA, but are mapped at candidate splicing sites in the re-alignment stage.
If no paired-end reads can be used as the boundary, a maximum junction size with default value 1,500,000 bp is used as the boundary. For reads clipped at donor (acceptor) sites, the searching region is [b0,b0+1500000] ([b0-1500000,b0]). We try to find an alignment around related splicing sites in the searching region. Here, "related" means for reads clipped at donor sites, we only check candidate acceptor sites; and conversely for reads clipped at acceptor sites, we only check candidate donor sites. If an alignment can still not be found, then we try to align the segment within the whole region [b1, b2] using local alignment. And if the segment finds an alignment at position p new , then p new will be viewed as a new candidate splicing site and is added into the candidate splicing sites list. If the clipped segment finds an alignment at candidate site c j , then the number of clipped segments mapped at c j is increased by one.
(i) Discordant encompassing pair: the number of discordant encompassing pairs.
Recall that a discordant pair has insert size from the two mapped reads outside the range [m-3v,m+3v], where m is the mean insert size, and v is the standard variation of the insert size. See Figure 1 for an illustration of discordant encompassing pair for splicing junctions. Note that not all the junctions can make encompassing read pairs to be discordant. This is because if the length of a splicing junction is smaller than 3v, then even the read pair encompasses the junction, it still could be concordant. Our approach relies on other features to call out these splicing junctions. According to our experiments, around nine tenths of the junctions can cause encompassing pairs to be discordant. Thus, discordant encompassing pair is still a useful feature for calling splicing junctions.
(ii) Coverage difference between left and right neighboring regions: for candidate donor (acceptor) sites, it is the average coverage of left (right) region minus the average coverage of right (left) region. The average coverage of a region is calculated by: , where L d is the length of the region and d i is the read depth at position i of the region. SpliceJumper provides a "-l" option for users to set the region length (with default value 25 bp). Figure 1 illustrates the coverage difference at splicing sites A, B, C and D.
Classification and call out junctions
After parsing all the features for each candidate splicing site, SpliceJumper uses a machine learning based approach to classify all the candidate splicing sites into the true ones and the false ones. Before training the model, for each candidate site with four features we perform normalization. This is based on the observation that coverage is quite uneven. And for a candidate site from a region with high coverage the quantity of the features will be large, while for a candidate site from a region with low coverage the quantity will be small. However, maybe both of the two sites are the true ones. Thus using the original quantity of the features may mislead the classifier. So normalization before training can improve the accuracy. For candidate donor sites, all the four features are normalized by the average coverage of the left neighboring regions. For candidate acceptor sites, all are normalized by the average coverage of the right neighboring regions.
Model training and classification We use support vector machine (SVM) to perform classification. In particular, we use LibSVM  to train a model, and then use the trained model to classify splicing sites. To train a model, training data that contains candidate sites and labeled with true or false should be provided. For simulation data, the true label of each candidate is known. So to prepare the training data we just label the candidate splicing sites that are ture as 1, and the rest are labeled as 0. For real data, users can train the model with released annotations, such as human annotations released by UCSC, Ensembl, Geneid, Genscan, RefSeq, SGP, AceView, Vega, etc. One problem is that not all transcripts will express in every sample. In other words although some splicing sites do exist in annotations, they may not express. So there will be no reads cover those splicing sites. This kind of splicing sites should not be considered as true ones. So before using labeled splicing sites released in annotations, first we check whether the coverage around the sites is zero or not. If larger than zero then the sites are labeled with true. In this way, we get all the positive cases of the training data. Then we randomly choose the same number of negative cases, which are sites that have reads covered but not in annotation. With the positive and negative cases, we have all the training data of real data. To train a model, 10-cross validation and grid search are used to find the optimal parameters. Then with the trained model, we classify the candidate sites in testing data into the true and false ones.
Junction calling After finishing classification, SpliceJumper has called out all the true splicing sites. There are two steps to call out the splicing junctions. First, we call out all the candidate splicing junctions. To call a splicing junction, first both the two sites (one is donor and the other is acceptor) of the junction should be true in classification. There should be a connection between the two sites. We say two sites are connected if there at least one read that is clipped at the donor (acceptor) site, and the clipped segment of the read can be aligned at the acceptor (donor) site. In practice, this step is finished in feature parsing step, and a graph with all the candidate sites as nodes and connecting edges is kept. In the second step, we check whether there is a conflict between the insert size and the mapping position of the clipped segment. If the mapped clipped segment leads the read pair to be discordant, then we say it is a conflict. If conflicts happen, it is quite possible that the connection between the two sites is false. This may be caused by wrong alignment of the clipped segment because the segment usually is short and there are repeats in genome. We discard this kind of splicing junctions. Figure 3 shows an example: A, B, C, and D are called out splicing sites. 1(1a and 1b), 2(2a and 2b), and 3(3a and 3b) are three pairs. 1a, 2a, and 3b are fully mapped, while 1b, 2b, and 3a are split-mapped. Splicing junction [C,D] is considered as a false one, because although read 2b is split-mapped at position C and D, 2a and 2b will form a discordant pair if the alignment of 2b is correct. Thus a conflict happens, and SpliceJumper will report [C,D] as false positive. In contrast, splicing junction [A,B] is considered as a true splicing junction.
We run SpliceJumper on both simulation and real data, and compare with TopHat 2.0.10 and MapSplice 2.1.6 on accuracy and efficiency. For simulation data, because we know the true junctions, we can calculate the number of true positive, false positive and false negative of each tool. Three metrics are used: 1) Precision=TP/(TP+FP), 2) Recall=TP/(TP+FN), and 3) F-value = 2*Precision*Recall/(Precission+Recall), where TP represents true positive, FP represents false positive, and FN represents false negative. For real data, because no true junctions are provided, we cannot directly evaluate the accuracy of called junctions of each tool. We rely on another metric: the ratio of mapped bases. RNA-seq reads alignment and splicing junction calling are related to each other. To call splicing junctions accurately the first step is to align reads correctly, and at the same time calling splicing junctions accurately can also help to guide the alignment of the unmapped reads. Thus, the ratio of mapped bases is an approximate estimate of the accuracy of called out splicing junctions. For real data, as the origin of each read is unknown, we give the ratio of mapped bases. For simulation data, because the true alignment position of each read is known, we compare the ratio of correctly mapped bases.
Calling RNA junctions with simulated data
We use the same simulated data released in . Two datasets (Test1 and Test2 of simulation one) are used. All the reads are simulated using a tool BEERS released in the same paper. Both of the datasets are generated from 30,000 mouse build mm9 transcript models. Indel, substitution and error rates for the Test1 dataset are 0.0005, 0.001 and 0.005 respectively, and 0.0025, 0.005 and 0.01 for Test2 dataset. For each dataset, 10 million pairs of reads with read length 100bp are simulated. Wetest the accuracy of the three tools on chromosome 11. It is unfair for all the tools if there are no reads covering the splicing sites. So when calculating the accuracy, only junctions with at least one read covered are counted. Thus there are 14,939 and 15,738 benchmarked junctions on chromosome 11 for Test1 and Test2 dataset respectively. Recall that SpliceJumper requires training data. We use the splicing sites of chomosome 1 as the training data, and test the performance of the trained model on chromosome 11.
Comparison of SpliceJumper, TopHat2 and MapSplice2 on simulation data.
Comparison of SpliceJumper, TopHat2 and MapSplice2 on simulated Test1 dataset when the slack value is 8.
Comparison of SpliceJumper, TopHat2 and MapSplice2 on simulated Test2 dataset when the slack value is 8.
We also calculate the ratio of correctly mapped bases. For Test1 dataset, Splice-Jumper is 95.10%, while TopHat2 and MapSplice2 are 93.49% and 94.11% respectively. For Test2 dataset, SpliceJumper is 92.09%, while TopHat2 and MapSplice2 are 89.76% and 91.25% respectively.
Calling RNA junctions with real data
The real data is released in , which is gathered across a time-course experiment (GEO accession number: GSM818582). There are 65,352,789 pairs of reads with read length 101 bp. We compare the performance of the three tools on chromosome 11. We train the model based on the annotation released by UCSC. 8,000 splicing sites of chromosome 20 in the released annotation data are used as positive cases, and each splicing site has at least one read covered. 8,000 randomly generated sites are used as negative cases, and each site has coverage larger than zero and is not included in released annotation. SpliceJumper calls out 9,902 junctions, while TopHat2 and MapSplice2 calls out 9,836 and 9,878 junctions respectively. Because the true junctions are unknown, we use the ratio of mapped bases to evaluate the performance of the three tools. The ratio of mapped bases of SpliceJumper is 92.69%, while 90.89% and 92.13% for TopHat2 and MapSplice2 respectively. One can see SpliceJumper has the highest ratio of mapped bases.
Comparison of SpliceJumper, TopHat2 and MapSplice2 on real data.
SpliceJumper with MapSplice2
SpliceJumper with TopHat2
TopHat2 with MapSplice2
Running time and memory usage
Comparison of running time and memory usage for SpliceJumper, TopHat2 and MapSplice2 on simulated Test1 dataset.
711 minutes and 28 seconds
679 minutes and 18 seconds
548 minutes and 5 seconds
One issue of using SpliceJumper is that SpliceJumper needs training data (i.e. data with known splicing sites). For well studied organisms, we can use the released annotation data to generate training data. In the case when no annotation data is available, we believe that simulated data may also be used as training data to train models for real data analysis. In , we show that models on simulated data can give reasonably accurate insertion and deletion genotype calling on real data.
For future work, we notice that some kinds of structural variations, like long deletions, can introduce false positives to our results. Many of the existing methods such as TopHat2 and MapSplcie2 can call indels while aligning reads. So in the future work, we plan to add the indels calling part, which is not only useful to profile the gene structure, and also to improve the accuracy of splicing junctions calling.
All the authors are partially supported by National Science Foundation grants IIS-0953563 and IIS-1447711.
Publication of this article was funded by National Science Foundation grant IIS-0953563.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S17.
- Twine NA, Janitz K, Wilkins MR, Janitz M: Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by alzheimer's disease. PLoS ONE. 2011, 6 (1): 16266-View ArticleGoogle Scholar
- Wang GS, Cooper TA: Splicing in disease: disruption of the splicing code and the decoding machinery. Nature Reviews Genetics. 2007, 8 (10): 749-761.View ArticlePubMedGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4): 36-View ArticleGoogle Scholar
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, et al: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. 2010, 622-Google Scholar
- Li Y, Li-Byarlay H, Burns P, Borodovsky M, Robinson GE, Ma J: TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic acids research. 2013, 41 (4): 51-51.View ArticleGoogle Scholar
- Zhang Y, Lameijer EW, AC't Hoen P, Ning Z, Slagboom PE, Ye K: PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-seq data. Bioinformatics. 2012, 28 (4): 479-486.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-seq. Bioinformatics. 2009, 25 (9): 1105-1111.PubMed CentralView ArticlePubMedGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013, 29 (1): 15-21.PubMed CentralView ArticlePubMedGoogle Scholar
- Jean G, Kahles A, Sreedharan VT, Bona FD, Rätsch G: RNA-Seq Read Alignments with PALMapper. Current protocols in bioinformatics. 2010, 11-6.Google Scholar
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010, 26 (7): 873-881.PubMed CentralView ArticlePubMedGoogle Scholar
- Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle G: PASS: a program to align short sequences. Bioinformatics. 2009, 25 (7): 967-968.View ArticlePubMedGoogle Scholar
- Marco-Sola S, Sammeth M, Guigó R, Ribeca P: The GEM mapper: fast, accurate and versatile alignment by filtration. Nature methods. 2012, 9 (12): 1185-1188.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.PubMed CentralView ArticlePubMedGoogle Scholar
- Black DL: Mechanisms of alternative pre-messenger RNA splicing. Annual review of biochemistry. 2003, 72 (1): 291-336.View ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL, et al: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): 25-View ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nature methods. 2012, 9 (4): 357-359.PubMed CentralView ArticlePubMedGoogle Scholar
- Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert CJ, Hogenesch JB, Pierce EA: Comparative analysis of RNA-seq alignment algorithms and the RNA-seq unified mapper (RUM). Bioinformatics. 2011, 27 (18): 2518-2528.PubMed CentralPubMedGoogle Scholar
- Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, Goldman N, Hubbard TJ, Harrow J, Guigó R, et al: Systematic evaluation of spliced alignment programs for RNA-seq data. Nature methods. 2013, 10 (12): 1185-1191.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2: 27-12727. (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvmView ArticleGoogle Scholar
- Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HY, Chen R, Miriami E, Karczewski KJ, Hariharan M, Dewey FE, et al: Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012, 148 (6): 1293-1307.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nature biotechnology. 2011, 29 (1): 24-26.PubMed CentralView ArticlePubMedGoogle Scholar
- Chu C, Zhang J, Wu Y: GINDEL: accurate genotype calling of insertions and deletions from low coverage population sequence reads. PLoS ONE. 2014, 9: 113324-View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.