CLASS: constrained transcript assembly of RNA-seq reads
© Song and Florea; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
RNA-seq has revolutionized our ability to survey the cellular transcriptome in great detail. However, while several approaches have been developed, the problem of assembling the short reads into full-length transcripts remains challenging.
We developed a novel algorithm and software tool, CLASS (C onstraint-based L ocal A ssembly and S election of S plice variants), for accurately assembling splice variants using local read coverage patterns of RNA-seq reads, contiguity constraints from read pairs and spliced reads, and optionally information about gene structure extracted from cDNA sequence databases. The algorithmic underpinnings of CLASS are: i) a linear program to infer exons, ii) a compact splice graph representation of a gene and its splice variants, and iii) a transcript selection scheme that takes into account contiguity constraints and, where available, knowledge about gene structure.
In comparisons against leading transcript assembly programs, CLASS is more accurate on both simulated and real reads and produces results that are easier to interpret when applied to large scale real data, and therefore is a promising analysis tool for next generation sequencing data.
CLASS is available from http://sourceforge.net/projects/splicebox.
Gene annotation is the first and most important step in analyzing a genome. More than 90% of human genes [1, 2] are alternatively spliced to produce multiple mRNA transcripts involving different combinations of exons. The number of splice variants of a gene varies, from two to possibly thousands . Recently, the RNA-seq technology has made it possible to survey the cellular transcriptome at unprecedented depth, within days and at a fraction of the cost of traditional methods. However, assembling the short reads into full-length transcripts is a difficult problem, complicated by artifacts in sample preparation, sequencing and read alignment.
Programs that assemble transcripts from short RNA-seq reads aligned to a reference genome largely follow two approaches . In the first approach, programs such as Cufflinks  and Scripture  use read alignments to predict the exon-intron structure of transcripts, then employ statistical models of fragment distributions to quantify their expression levels. To predict transcript models, Cufflinks represents all fragments at a locus as an overlap graph in which two reads are connected if they overlap and have compatible splice patterns, and then traverses the graph to produce the minimum number of transcripts that can explain all the input fragments. This minimization approach may result in under-prediction. In contrast, Scripture enumerates combinations of exons from spliced reads within windows of the gene, and then assembles them into whole transcripts. Thus, Scripture may produce many more isoforms than present in the sample. In the second approach, programs such as IsoLasso  and its recent implementation IsoCEM perform simultaneous assembly and quantification of transcripts, jointly modeling the two problems into a quadratic or an estimation maximization program.
There are several drawbacks to these approaches. First, programs that simultaneously predict transcript structure and estimate their abundance typically make unrealistic assumptions about the uniformity of read coverage along the length of the gene, which can lead to incorrect transcript models. Second, by reconstructing transcripts from RNA-seq data alone, programs overlook existing knowledge about the gene structure that can be used to more accurately infer isoforms .
We describe a novel algorithm, called CLASS (C onstraint-based L ocal A ssembly and S election of S plice variants), for transcript reconstruction from RNA-seq data, which takes advantage of local read coverage patterns and known transcript substructures from existing annotations. CLASS employs a linear program to locally reconstruct combinations of exons represented in the RNA-seq data, then connects the exons into a splice graph . A splice graph is a directed acyclic graph in which exons represent the nodes, introns derived from splice read alignments represent edges, and candidate transcripts are maximal paths in the graph . CLASS then selects a subset of transcripts that parsimoniously explain all contiguity constraints derived from mate pairs and optionally incorporates gene structure knowledge from existing databases. This step is modeled as a SET_COVER problem. CLASS does not estimate transcript abundance; rather, once a set of transcripts is produced, any of a number of programs (e.g., cuffdiff2  and RSEM ) can be used to rigorously quantify them.
We tested CLASS and three other popular programs, Cufflinks, Scripture and IsoCEM, an expectation-maximization variant of IsoLasso. CLASS was both more sensitive and more precise on simulated data. On a large real data set, namely the adrenal sample from Illumina's Human Body Map Project, CLASS had higher accuracy as measured by the F-value and produced results that were easier to interpret.
CLASS also has several ancillary advantages over current approaches, in particular single-stage transcript prediction and quantification methods such as those implemented in IsoLasso and SLIDE . When linear programming is applied to a portion of the gene, to find exons rather than full transcripts, it leads to smaller systems that can be solved more easily and more accurately. Lastly, by decoupling the exon prediction, transcript selection and transcript quantification stages, CLASS allows for a modular design where each step can be performed by a variety of methods, including approaches developed elsewere.
The rest of the material is organized as follows. Section 2 introduces the linear program model for predicting exons from RNA-seq data. Section 3 describes the splice graph construction, and the subsequent enumeration and selection of candidate transcripts. Lastly, section 4 presents the results of evaluating CLASS and other programs on simulated and real data.
Constraint-based local assembly of exons
Each 5' and 3' splice site must appear in at least one candidate exon.
Every read must be compatible with the exons in the current combination. For unspliced reads, the read must be included in some exon, whereas spliced reads must have splice junctions compatible with the group of exons.
For paired-end reads, the inner endpoints of the reads in a pair must either belong to the same exon, or must be connected by a path of non-overlapping exons from the current subset (i.e., the two mates must have compatible alignments).
- 1.Additivity: The average coverage in each interval should be roughly equal to the sum of average coverage values for all subexons within that interval, e.g.:
- 2.Continuity: The average coverage of adjacent subexons of the same exon should be approximately equal, e.g.:
- 3.Conservation: The total coverage of all subexons should be approximately equal to the total coverage of the region:
- 4.Non-negativity: The average read coverage for each subexon should be no less than 1, e.g.:
Read pair feasibility: Suppose there are N read pairs, and n i,j read pairs where the left read starts in interval i and the right read ends in interval j. Assuming the average length of a fragment is f l and the standard deviation is f σ , we compute the minimal and maximal distance between interval i and interval j given the current exon subset. If the range of feasible distances is outside of the interval f l ± 2f σ , then we add to the score , or the proportion of unsatisfied pairs.
We solve the problem for c i,j and ε n . Quantities and L j are known or can be calculated from the input alignments. Finally, we choose the combination with the minimal score as the set of exons for the region. If there are multiple such combinations, we select the one with the smallest number of exons. Note that the additivity condition is similar to those used by IsoLasso and SLIDE, albeit formulated in terms of exons rather than transcripts, whereas the rest of the conditions are specific to our method.
Candidate transcript enumeration and selection
Once the set of exons is identified, we generate a splice graph by connecting the exons (nodes) via introns (edges) extracted from spliced read alignments . Candidate transcripts are encoded in the graph as maximal paths from a node with no incoming edges (source) to a node with no outgoing edges (sink). However, not all variants encoded in the graph will be real, and therefore we use the following SET_COVER formulation to select a high-confidence set of candidate transcripts.
It is easy to see how this can be formulated as an instance of the SET_COVER problem: each candidate transcript t corresponds to the set of constraints it satisfies. We solve using the following ln(|C|/OPT) greedy approximation algorithm :
S ← ∅
- 2.While contains elements not covered by S:
Find transcript t containing the largest number of constraints not satisfied by transcripts in S
Add t to S
Incorporating gene structure knowledge into transcript selection. The above criterion simply minimizes the number of selected transcripts. However, the modular organization of biological sequences into protein domains and regulatory blocks implies that local exon-intron substructures are likely to recur among isoforms. We formulate the weighted SET_COVER problem for transcript selection by assigning a weight (cost) to each transcript and then solving with a similar greedy approximation algorithm . We implemented a simple weight function that first determines the number of pairs of consecutive introns in the candidate transcript that can be found in cDNA sequences , then assigns a weight to the transcript proportional to its fraction of intron pairs not covered by the measure. We are exploring more sophisticated weight functions that combine several evidence-based criteria, similar to those we used in , for future implementations. We call the weighted version of the program CLASS, and the unweighted version CLASS0.
Evaluation against a gold reference
We evaluated our methods and three other leading transcript assemblers (Cufflinks , IsoCEM  and Scripture ) on simulated data, which allows us to precisely assess their accuracy relative to a gold reference. We used default values for IsoCEM (version 0.9) and Scripture (version beta-2), and option '-F 0.01' for Cufflinks (version 2.0.2), since it was significantly more sensitive than the default version in previous testing . Using the program FluxSimulator , we generated 140 million 75 bp paired-end reads and, separately, 140 million 75 bp single-end reads, using the ENSEMBL 61 annotation as model (120,221 reference transcripts). FluxSimulator first assigns an expression level (possibly 0) to each transcript in the annotation, and then simulates all steps in the library preparation process in a typical RNA-seq experiment. Fragments generated are then sequenced in silico from one or both ends to generate single-end and paired-end reads, respectively. No sequencing errors were introduced, but reads were then mapped to the human genome hg19 using Tophat , a spliced read mapper, which can introduce mismapping artifacts in the assembly process.
For simplicity, we restricted our analysis to chromosome 12, which left 3,281,440 paired-end reads and 3,400,225 single-end reads. Reads were assembled with each of the four programs. Running times were roughly comparable among programs (paired-end reads: 1m45s IsoCEM, 9m14s Cufflinks, 1m17s Scripture, 2m24s CLASS and 2m10s CLASS0; single-end reads: 1m25s IsoCEM, 9m13s Cufflinks, 1m6s Scripture, 2m32s CLASS, 2m47s CLASS0), and therefore will not be discussed further.
We compared the results of each program against the gold reference at exon and transcript levels, using the following criteria. To assess the accuracy of exon reconstruction, we consider a match if: i) an internal predicted exon matches an internal exon in the annotation precisely at both ends; ii) a terminal predicted exon matches an annotation exon at the splice site end, and is included in the reference exon; and iii) an exon not bounded by splice sites is included in an annotation exon.
Similarly, a predicted transcript is said to match a reference isoform if all of its internal introns appear consecutively in the reference. Because transcripts may be only partially sampled and/or reconstructed, we also calculate an effective coverage value for both reference and predicted transcripts, defined as the fraction of the reference transcript's exons in the longest match for the transcript being assessed.
if K out of M reference transcripts match K' out of N candidate transcripts, as defined in . Since programs may produce partial isoforms, we use precision and recall curves to plot these values as we vary the effective coverage cutoff, as defined above.
Accuracy evaluation on simulated data.
Performance of CLASS with alternate exon data.
Evaluation on real data
To assess the practicality of using the program in large RNA-seq applications, we applied CLASS to the 160 million 50 bp paired-end reads from Illumina's Human Body Map adrenal tissue sample. Again, we restricted our analysis to chromosome 12, with 3,280 spliced genes. Because the ENSEMBL annotation is inherently incomplete and may also include genes and isoforms not present in adrenal tissue, it is not possible to determine the programs' true sensitivity and precision. Nevertheless, consistency with the reference annotation, in particular sensitivity, provides a good indication of a program's performance. Because now both the reference and the reconstructed transcripts may be incomplete, we relax the definition of a match to include all pairs of reference and candidate transcripts with compatible intron patterns, using the effective coverage defined above to more finely explore the extent of their agreement. In addition, we allow for a small margin of error at exon boundaries (V = 10) to account for potential inaccuracies in the reference annotation.
Performance of four programs on the adrenal data set
We present a novel algorithm and computer program for assembling transcripts from RNA-seq data, combining a linear program to infer exons with a transcript selection scheme that determines the final set of transcripts based on contiguity constraints derived from spliced and paired reads and on gene structure knowledge available from cDNA sequence databases. CLASS outperformed Cufflinks, IsoCEM and Scripture, three of the leading transcript assembly programs, in overall accuracy on both simulated and real data and, unlike other programs that report significant amounts of 'noise', it provided a robust and easy to interpret set of transcripts. Further improvements in the algorithm and implementation, including more sophisticated weight functions, will increase the program's accuracy and speed, and implicitly its usefulness for annotation.
We thank Illumina for making publicly avaialable the Human Body Map RNA-seq data. This work was supported in part by NSF award 1159078 to LF.
Funding for publication of this article was provided by NSF (ABI award 1159078).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
- Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.PubMed CentralView ArticlePubMedGoogle Scholar
- Pan Q, Shai O, Lee L, Frey B, Blencowe B: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40 (12): 1413-1415. 10.1038/ng.259.View ArticlePubMedGoogle Scholar
- Graveley B: Alternative splicing: Increasing diversity in the proteomic world. Trends in Genet. 2001, 17 (2): 100-107. 10.1016/S0168-9525(00)02176-4.View ArticleGoogle Scholar
- Martin J, Wang Z: Next-generation transcriptome assembly. Nature Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.View ArticlePubMedGoogle Scholar
- Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Guttman M, Garber M, Levin J, Donaghey J, Robison J, Adiconis X: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28 (5): 503-510. 10.1038/nbt.1633.PubMed CentralView ArticlePubMedGoogle Scholar
- Li W, Feng J, Jiang T: IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol. 2011, 18 (11): 1693-1707. 10.1089/cmb.2011.0171.PubMed CentralView ArticlePubMedGoogle Scholar
- Li J, Jiang C, Brown J, Huang H, Bickel P: Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci USA. 2011, 108 (50): 19867-19872. 10.1073/pnas.1113972108.PubMed CentralView ArticlePubMedGoogle Scholar
- Florea L, Di Francesco V, Miller J, Turner R, Yao A, Harris M: Gene and alternative splicing annotation with AIR. Genome Res. 2005, 15 (1): 54-66. 10.1101/gr.2889405.PubMed CentralView ArticlePubMedGoogle Scholar
- Heber S, Alekseyev M, Sze SH, Tang H, Pevzner P: Splicing graphs and EST assembly prroblem. Bioinformatics. 2002, 18 (Suppl 1): S181-188. 10.1093/bioinformatics/18.suppl_1.S181.View ArticlePubMedGoogle Scholar
- Trapnell C, Hendrickson DG, Savageau M, Goff L, Rinn JL, Pachter L: Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013, 31: 46-53.View ArticlePubMedGoogle Scholar
- Li B, Dewey C: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.PubMed CentralView ArticlePubMedGoogle Scholar
- Feige U: A threhshold of ln n for approximating Set Cover. J ACM. 1998, 45 (4): 634-652. 10.1145/285055.285059.View ArticleGoogle Scholar
- Wheeler D, Church D, Federhen S, Lash A, Madden T, Pontius J: Database resources of the National Center for Biotechnology. Nucl Acids Res. 2003, 31 (1): 28-33. 10.1093/nar/gkg033.PubMed CentralView ArticlePubMedGoogle Scholar
- Florea L, Song L, Salzberg SL: Thousands of novel exon skipping events discovered in transcripts from sixteen human tissues. BMC Genomics. 2013, under revisionGoogle Scholar
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigisó R: Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012, 40: 10073-10083. 10.1093/nar/gks666.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.