QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments

Background High-throughput next-generation RNA sequencing has matured into a viable and powerful method for detecting variations in transcript expression and regulation. Proactive quality control is of critical importance as unanticipated biases, artifacts, or errors can potentially drive false associations and lead to flawed results. Results We have developed the Quality of RNA-Seq Toolset, or QoRTs, a comprehensive, multifunction toolset that assists in quality control and data processing of high-throughput RNA sequencing data. Conclusions QoRTs generates an unmatched variety of quality control metrics, and can provide cross-comparisons of replicates contrasted by batch, biological sample, or experimental condition, revealing any outliers and/or systematic issues that could drive false associations or otherwise compromise downstream analyses. In addition, QoRTs simultaneously replaces the functionality of numerous other data-processing tools, and can quickly and efficiently generate quality control metrics, coverage counts (for genes, exons, and known/novel splice-junctions), and browser tracks. These functions can all be carried out as part of a single unified data-processing/quality control run, greatly reducing both the complexity and the total runtime of the analysis pipeline. The software, source code, and documentation are available online at http://hartleys.github.io/QoRTs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0670-5) contains supplementary material, which is available to authorized users.


Overview
While RSeQC ostensibly offers much of the same functionality as QoRTs, the results from the two tools consistently disagree on a number of metrics. In this supplement we demonstrate several of these inconsistencies, and show that in all of these cases the results provided by QoRTs are accurate while the RSeQC results are not. This supplement is not comprehensive: the reader should not assume that the RSeQC functions not listed here are accurate. The specific flaws demonstrated here are only a subset of the inconsistencies that we have detected between QoRTs and RSeQC. Isolating, demonstrating, and documenting every individual conflict would be arduous and excessive.
We have generated a set of extremely small and simple example sam/bam files. Each file contains 1-14 reads, and each simulated read is only 10 base-pairs long. We have also created a simulated genome annotation containing a single 10000-bp chromosome and few small "genes" (see the appendix in Section 4). These simplistic example files are designed such that the "true" values for the various quality control metrics can easily be confirmed simply by viewing the files manually as text or by loading them into viewing software. We recommend either the UCSC genome browser or the IGV utility.
These tests do not employ unusual edge-cases; the errors reproduced here were first identified in real datasets before being isolated via simulated data.
Our goal in this supplement is to demonstrate that in numerous test cases QoRTs produces results that are demonstrably accurate, and RSeQC produces results that are clearly erroneous. The errors outlined in this document reveal that numerous systematic flaws exist in the RSeQC package, and that the package has not undergone sufficient testing during development. Because these errors may conceal actual quality control issues, or may make nonexistent quality control issues appear where no such problems exist, we recommend that RSeQC not be used for quality control of RNA-Seq data.
All the tests included in this supplement were run using QoRTs v0.2.5 and RSeQC 2.6.1, which were both current as of March 12th, 2015. All the data used in this supplement including example sam/bam files, scripts, and raw plots, are available as a separate supplement, currently available online (https://dl.dropboxusercontent.com/u/103621176/QoRTs/supp/suppTestData.zip).

Utilities that return incorrect results
Several RSeQC utilities run to completion without throwing any errors, but return results that are misleading and/or erroneous. This section contains a subset of these inconsistencies.

Clipping Profile
The RSeQC appears to uniformly return incorrect results for the "clipping profile". The bug appears to be related to which strand the reads are aligned to. Note that the bug appears whether the data is strand-specific or not.
The complete sam file (test.clip.sam) follows:  RSeQC appears to have a bug where it does not properly account for the read being reversed when it is aligned to the reverse (genomic) strand (See Figure 2). This is why clipping profiles from RSeQC always appear uniformly symmetric, despite the fact that for most datasets and aligners soft clipping can be expected to be very asymmetric.
Figure 2: Clipping profile from QoRTs (left) and RSeQC (right). Note that QoRTs plots the clipping rate whereas RSeQC plots the inverse, thus the two plots should be identical when inverted.
As a minor aside: RSeQC does not have a paired-end mode that separates the 1st and 2nd read of each pair separately. Instead it always counts each read independently. This would only affect paired-end data, and is not demonstrated here.

Insertion Profile
RSeQC returns different results for the insertion profile. Unlike in the Section 2.1, it is very unclear exactly what RSeQC is doing wrong. However it is clear that the results are in fact incorrect.
The complete sam file (test.ins.sam) follows: Note: The correct insertion profile is calculated by hand in the table below. Insertion bases are marked with X's, and the bottom row lists the total number of insertions found at the given read position.
To run the RSeQC insertion_profile script, we used the command: insertion_profile.py -i test.ins.sam -o out -l 10 -n 10  While it's not clear precisely what is going wrong with the RSeQC utility, it is clear that the supplied answer is incorrect.

Insert Size
RSeQC calculates "inner distance" instead of insert size. The inner distance is the distance between the inner endpoints of the two paired reads. The insert size (ie, the distance between the outer endpoints of the two paired reads) will be equal to the inner distance plus the length of both reads. For the example dataset, this means the insert size is equal to the inner distance plus 20. Note that this test uses the simulated annotation data (see the appendix in Section 4).
RSeQC appears to mis-calculate the inner distance/insert size under certain common conditions. It's not entirely clear what specifically those conditions are.

Example Data
For this test we use example sam file test.paired.7.sam, a paired-ended sam file containing 7 read-pairs (14 reads total). read-pair 1: Insert size 13, read 1 bridges an 17-bp (novel) junction. read-pair 2: Insert size 11, no splices. read-pair 3: Insert size 12, both reads have a 4-bp splice read-pair 4: Insert size 26, the reads flank a 4-bp annotated splice junction, but are on the wrong strand read-pair 5: Insert size 22, the reads flank a 4-bp annotated splice junction. read-pair 6: Insert size 10, the reads overlap completely, and both bridge the same 9bp splice junction read-pair 7: Insert size 15, both reads bridge two 3-bp splice junctions.

Results
RSeQC systematically returns erroneous results under a variety of different circumstances (see Figure  6). There's no clear pattern to these errors, and the discrepancies may be caused by a number of independent bugs. Figure 6: Insert size plot from QoRTs (left), compared with inner distance plot from RSeQC (right). Note that RSeQC bins the frequencies into 5-bp bins so the plots are not directly comparable.
RSeQC also produces a table containing each read and the calculated inner distance. This makes it easy to discern specifically which inner distances were mis-calculated: Note that read-pairs 3, 6, and 7 are incorrect. In addition to calculating the inner distance incorrectly, RSeQC appears to count read-pair 6 twice, and the read-pair appears twice in the inner_distance.txt file. It is not immediately clear why this occurred, or why specifically read 6 was repeated.

Read ID
Also note that RSeQC does not have a strand-specific mode for calculating inner distances. Thus, RSeQC would incorrectly assume that read-pair 4 flanks an annotated splice junction that lies on the wrong strand. QoRTs, on the other hand, has a strand-specific mode and would correctly assign read-pair 4 an insert size of 26 when run in this mode.

Read Counts
Direct comparison of QoRTs and RSeQC's read counts is difficult, as the two utilities (by design) use fundamentally different methods. In particular, RSeQC cannot read gtf annotations, which allow the specification of multiple transcripts belonging to the same gene.
HTSeq, QoRTs, and the bioconductor GenomicRanges package all use identical methods to calculate gene-level read counts. Reads that intersect with any section of any known transcript of a gene are counted towards that gene. This is the method recommended for use with differential expression tools like edgeR and DESeq/DESeq2. RSeQC instead treats each transcript as a separate gene and counts reads intersecting with each transcript separately.
Even though the counts are not directly comparable, we can still visually identify cases in which the RSeQC counts are incorrect.

Example Data
For this test we use example sam file test.gene.sam, a single-ended sam file containing 5 reads, all on the forward strand on GENE 5. read 1: Covers exon 1. 5 bases clipped from the 5' end. read 2: Covers exon 1. 5 bases clipped from the 5' end. read 4: Covers exons 1 and 2, bridges a known splice site. read 4: Covers exons 1 and 2, bridges a known splice site. 1 base clipped from the 5' end. read 5: Covers exons 2 and 3, bridges a known splice site. 4 bases clipped from the 5' end, 2 bases from the 3' end.
The complete sam file (test.gene.sam) follows:  There are several unexplained errors here: The first exon (exon 1 of TX1 and TX2) shows 1 read rather than 4. The second exon (exon 2 of TX1 and TX2, exon 1 of TX3) shows 2 reads rather than 3. Feature "GENE05_TX2_intron_1" has a read-count of 7, despite there being only 5 reads in the file.
The RPKM values appear to be calculated based on a total read count of 4 rather than 5. Note: this does not appear to be a simple off-by-one error. In other datasets it appears to be off by more.
Note that while QoRTs does not calculate the exact same metrics as RSeQC, it does produce some similar metrics. QoRTs internally generates a "flattened" exon annotation composed of the set mutually-exclusive exon segments for all transcripts belonging to each gene. Each exon-segment is then assigned a unique identifier. This uses the same methods used by DEXSeq. QoRTs also generates coverage counts for all splice junctions (both known and novel).
The following is an excerpt from QC.annoSpliceJunctionAndExonCounts.txt.gz:

NVC
Both RSeQC and QoRTs produce a nucleotide-vs-cycle plot, which is supposed to plot the nucleotidebase composition as a function of sequencer cycle. However, RSeQC's NVC plot instead plots nucleotide vs read position. For single-ended read data this is equivalent, but for paired-end data the same position on two paired reads would come from completely different sequencer cycles. QoRTs counts the two reads separately thus generating separate base rates for each sequencer cycle.
Note that this issue is less a bug, and arguably nothing more than a difference in methodology. However: as many quality issues will be specific to sequencer cycles, merging the two reads may obscure artifacts or errors.
This problem only exists for paired-end data. For single-ended data the two utilities appear to consistently return identical results.

Example Data
The example dataset is the same sam file used for the inner distance test in Section 2.3.
To run RSeQC's read_NVC script, we used the command: Some of these errors may be related to the simplicity and small size of the dataset. Nevertheless, these flaws make it much more difficult to definitively assess the accuracy of these utilities. Numerous simulated datasets were created and tested, but we were unable to find test sets that were simple enough to verify results visually but that did not also cause RSeQC to fail.

Quality score distribution
To run the RSeQC read_quality script, we used the command: This produces the full set of quality plots. See Figure 9.

Gene-Body Coverage
To run the RSeQC geneBody_coverage script, we use the command: geneBody_coverage.py -i test.paired.7.sam -o out -r anno.bed While this did not print any errors, it also did not return any results. The output file "out.geneBodyCoverage.txt only had a title row, and the file "out.geneBodyCoverage.curves.pdf" would not open. Examining the R script file "out.geneBodyCoverage.r", it appears that no plots were generated.
In contrast, the QoRTs utility can be run using the command: java -jar~/UTILS/SCALA/QoRTs.jar QC \ --runFunctions writeGeneBody \ --coordSorted --generateSeparatePlots \ test.paired.7.sam anno.gtf ./geneBody/ For the purposes of calculating gene body coverage, QoRTs ignores any genes that overlap with other genes, as in these cases it will be impossible to determine the position in the gene body. Thus, in this particular example file the gene-body coverage plot will be based entirely on the coverage of GENE01. See Figure 10.

Appendix: Example Annotation
In additon to the example sam/bam files, we generated a set of simple simulated annotation files. These annotation files include 5 "genes". All genes except for GENE05 have only one single isoform. Both a ".gtf" file and a ".bed" file were generated, for use with QoRTs and RSeQC, respectively. For more information on these formats, see the UCSC documentation 1 See Figure 11 for a visualization of these annotation files. Figure 11: The file anno.gtf and anno.bed, viewed via the UCSC genome browser.
The simulated gtf file anno.gtf is: