Hybrid benchmarking study using both real and simulated data
For the simulated data we started with 11 real RNA-Seq samples: six liver and five hippocampus samples from the Mouse Genome Project [26]. Isoform expression distributions were estimated from these samples in [3] and were then used to generate simulated data for which the source isoform of every read is known. Two types of simulated datasets were generated with the BEERS simulator [15]. First, idealized simulated data were generated, with no SNPs, indels, or sequencing errors, no intron signal, and uniform coverage across each isoform [3]. Second, simulated data were generated with polymorphisms (SNPs and indels), sequencing errors, intron signal, and empirically inferred non-uniform coverage [3]. Relative performance on idealized data does not necessarily reflect relative performance on real data, but we do expect the accuracy of the methods on idealized data to provide upper bounds on their accuracy in practice. If a method's bound on idealized data is below what one would tolerate in practice, then the method cannot be expected to be viable in practice. The (more) realistic data provide insight into the effect of the various factors on method performance. The realistic data probably also give bounds on accuracy with real data, since they were designed to be no more complex than real data. For simplicity of exposition, we will refer to the data with these complexities as the “realistic” data, keeping in mind that they do not reflect every property of real data, just the five properties listed above (SNPs, indels, sequencing errors, intron signal, and non-uniform coverage). For both the idealized and realistic simulated data, we use three liver and three hippocampus samples to evaluate isoform quantification, and six liver and five hippocampus samples to evaluate DE analysis, as in [21]. All samples were obtained from independent animals raised as biological replicates.
Comparisons between tissues were employed to assess consistency and differential expression; brain has a more complex transcriptome than other tissues [27], and thus isoform-level analysis is expected to be more challenging for the algorithms.
We performed a comparative analysis of seven of the most commonly used full-length isoform quantification algorithms: kallisto [13], Salmon [14], RSEM [7], Cufflinks [5], HTSeq [4], featureCounts [6], and a naïve read proportioning approach similar to the method first described by Mortazavi et al. [25] (NRP; see “Methods”). Kallisto and Salmon are pseudo-aligners; RSEM, Cufflinks, HTSeq, and featureCounts are genome alignment-based approaches in which the alignments are guided by incorporating transcriptome information; and NRP is a transcriptome alignment-based approach. These methods were evaluated at the isoform expression level using idealized and realistic simulated data, with full and incomplete annotation, and also at the differential expression level using both realistic and real data.
Comparison of full-length quantification methods
Idealized data
The idealized data have no indels, SNPs, or sequencing errors, include no intron signal, and deviate from uniform coverage across each isoform only as much as may happen due to random sampling. Under such perfect conditions we expect all methods to achieve their best performance. The data were aligned to the reference genome or transcriptome with STAR [19] and quantified with the seven methods. In Fig. 2A, estimated expression is plotted against the true transcript counts for each method in liver. Each point represents the average of the three replicates of that tissue. A point on the diagonal indicates a perfect estimate. A point on the x-axis indicates an unexpressed transcript which was erroneously given positive expression. A point on the y-axis indicates an expressed transcript which was erroneously given zero expression.
In liver, we observe that kallisto, Salmon, RSEM, and Cufflinks are fairly concordant with the truth, while NRP performs moderately well considering its simplicity. HTSeq and featureCounts show high deviation from the truth and in particular tend to undercount. This is because both methods ignore any read that is ambiguous between isoforms.
Accuracy is somewhat lower in hippocampus, which has more complex transcription with more alternative splicing, but similar conclusions hold for all methods (Additional file 1: Fig. S1A).
In what follows we will regularly use the log (base 2) fold change of the inferred counts relative to the truth. This will be denoted logFC, and its absolute value |logFC|. Before taking ratios, all counts are adjusted by the addition of a pseudo-count of 1 (see “Methods”).
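As a concrete illustration, the transformation can be sketched as follows (a minimal sketch; the function name is ours):

```python
import math

def log_fc(inferred, truth, pseudo=1.0):
    """log2 fold change of an inferred count relative to the true count,
    after adding a pseudo-count of 1 to both so that zeros are defined."""
    return math.log2((inferred + pseudo) / (truth + pseudo))
```

For example, an inferred count of 0 for a transcript with a true count of 7 yields log2(1/8) = -3, so |logFC| = 3.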
Scatter plots are only so informative. Figure 2C and Additional file 1: Fig. S1C compare, for each method, the percent of transcripts that have |logFC| < c, for each c > 0. Specifically, a point (x, y) on the graph means x% of transcripts have |logFC| < y. The lines are monotonically non-decreasing, and the closer to the x-axis, the better. When the |logFC| for a transcript is close to zero, the method gives an accurate estimate for that transcript. When the p-th percentile of the cumulative distribution is close to zero, the top p percent of best-estimated transcripts are all very accurate. Genes whose isoforms all have zero expression will generally be easy for the algorithms to get right, since there will usually be no reads mapping anywhere to the gene locus. To avoid such genes exaggerating the average performance, they were removed from consideration before generating these graphs. The graphs were also generated with all genes removed that had three or fewer expressed isoforms (Additional file 1: Fig. S1E), to focus on the more difficult cases; this makes very little visual difference to the accuracy percentiles. Kallisto, Salmon, RSEM, and Cufflinks display high accuracy up to the 85th percentile. NRP diverges from the truth at the 60th percentile. We see minimal difference between replicates or between tissues (Additional file 1: Fig. S1C).
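The percentile curves just described can be computed directly: sort the |logFC| values and pair each with its percentile rank (a sketch under our own naming):

```python
def percentile_curve(abs_logfc):
    """Given |logFC| values for a set of transcripts, return (x, y) points
    meaning x% of transcripts have |logFC| <= y; sorting makes the curve
    monotonically non-decreasing."""
    vals = sorted(abs_logfc)
    n = len(vals)
    return [(100.0 * (i + 1) / n, v) for i, v in enumerate(vals)]
```

A curve that hugs the x-axis for longer indicates that a larger fraction of transcripts is quantified accurately.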
For each sample, the rank correlation was computed between the vectors of true and inferred quantifications (Fig. 2E, Additional file 2: S1 Table). The four methods kallisto, Salmon, RSEM, and Cufflinks perform comparably, with Salmon marginally better. One must keep in mind that rank correlation is a very general measure which can mean different things, so it is important to interpret it with caution. Nonetheless, the considerably lower correlation of NRP, HTSeq, and featureCounts is notable.
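The text specifies only “rank correlation”; Spearman's rank correlation (the Pearson correlation of the rank vectors, with ties receiving average ranks) is a standard choice and can be sketched as:

```python
def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

In practice a library routine (e.g. `scipy.stats.spearmanr`) would be used; this stdlib-only version makes the computation explicit.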
Realistic data
The “realistic” data include variants (SNPs and indels), sequencing errors, intron signal, and non-uniform coverage across each isoform. Non-uniform coverage mimics observed coverage in real data but does not attempt to model coverage-associated sequence features such as GC content. As such, the non-uniform coverage options of Salmon that involve GC content could not be utilized. The non-uniform coverage option of RSEM was used since it is sequence agnostic.
As with the idealized data, points in Fig. 2B and Additional file 1: Fig. S1B are the average of the three quantified counts versus the average of the three true counts, for each tissue. In both tissues, the isoform quantifications of kallisto, Salmon, RSEM, Cufflinks, and NRP are still largely concentrated around the diagonal (Fig. 2B, Additional file 1: Fig. S1B), but the variance from the truth is markedly increased compared to the idealized data. NRP's performance decreased less on the realistic data than that of the other methods, so the difference between methods is now less apparent. As with the idealized data, HTSeq and featureCounts consistently undercount across datasets.
The percentile plots of |logFC| are given in Fig. 2D and Additional file 1: Fig. S1D. These show that Salmon, RSEM, and Cufflinks continue to be highly concordant with the truth up to the 80th percentile. NRP and kallisto now perform comparably. HTSeq and featureCounts diverge from the truth the fastest. Comparing the percentile plots and scatter plots, it appears the error profile of the most accurate 80% of transcripts is roughly the same in the realistic data as in the idealized data, but the least accurate 20% are considerably less accurate in the realistic data. We investigate below in more detail which features of the isoforms drive accuracy; it is not just the number of isoforms. Length perhaps plays a larger role, and other features come into play as well, such as sequence compression complexity.
The correlation boxplots for the realistic data are given in Fig. 2F (Additional file 2: S2 Table). Here we see a greater separation between the methods. Salmon incurred relatively less of an increase in error on the realistic data than kallisto, indicating there is more difference between the two than there was in early versions, where the primary difference was in how GC bias was handled. These data have no GC bias, so that cannot explain the difference here. Surprisingly, the accuracy of NRP is barely reduced on the realistic data and is comparable to kallisto, indicating this approach may have some robustness that the other methods lack and could perhaps be modified to be competitive.
Salmon is marginally superior here, in both absolute comparisons and correlations.
Hierarchical relationships between the methods
Clustering was performed to investigate the hierarchical relationships between the methods. Here, the number of replicates was increased to all six liver and all five hippocampus realistic samples. The hierarchical clustering was based on the average expression, with a correlation-based distance metric. Hippocampus and liver give somewhat different results (Fig. 3A, B). The true counts cluster with the first group. Surprisingly, Cufflinks goes from a neighbor of the truth in hippocampus to an outlier in liver. Meanwhile, clustering on the logFC of hippocampus versus liver shows all methods reasonably close to the truth except HTSeq and featureCounts (Fig. 3C).
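The clustering step can be sketched as follows (a naive agglomerative implementation with correlation distance; average linkage is our assumption, as the text does not specify the linkage, and a production analysis would use a library routine):

```python
def corr_distance(x, y):
    """1 - Pearson correlation; near 0 for proportional expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return 1.0 - num / den

def average_linkage(profiles):
    """Agglomerative clustering of {name: expression_vector}; returns a
    nested-tuple dendrogram, merging the closest pair of clusters
    (by mean pairwise distance) at each step."""
    clusters = [(name, [vec]) for name, vec in profiles.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi, mj = clusters[i][1], clusters[j][1]
                d = (sum(corr_distance(u, v) for u in mi for v in mj)
                     / (len(mi) * len(mj)))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters[0][0]
```

A method whose expression vector correlates best with the true counts merges with the truth early in the tree.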
Features associated with quantification accuracy
We next investigate the covariates that affect quantification accuracy. For example, the more isoforms a gene has, the more difficult we expect the problem to be. Other obvious features that we expect to impact accuracy are length and number of exons. Less obvious features will be investigated below, but first we look at length. The logFC of estimated counts relative to true counts is plotted against transcript length in Fig. 4A, B and Additional file 1: Fig. S2A, B. Only transcripts of length 200 bp or longer were simulated, so the plots start there. Surprisingly, the variance from the truth does not increase with length as much as might be expected; all methods appear to have the most difficulty with moderate-length isoforms. A value of ± 10 on the vertical axis represents a deviation from the truth of roughly 1000-fold, so regardless of length there is considerable divergence from the truth for all methods.
Next, Fig. 4C, D show the |logFC| plotted against the number of expressed isoforms per gene (Additional file 2: S3–4 Table). The closer |logFC| is to zero, the better. Surprisingly, the mean accuracy of the viable methods kallisto, Salmon, RSEM, and Cufflinks does not decay rapidly with the number of expressed isoforms. In the realistic data, the effect of the number of isoforms is larger.
Additionally, to investigate whether the methods are more accurate on genes with a low number of isoforms, we filtered out the genes with one or two isoforms and saw that the performance of the most accurate methods (kallisto, Salmon, RSEM, Cufflinks, NRP) is largely unaffected (Additional file 1: Fig. S1E, F).
Another way to investigate the features associated with accuracy is to compare the distribution of a feature over all isoforms to its distribution over the set of isoforms most discordant from the true counts. For convenience these will be called the “background” and “foreground” distributions. Specifically, in each tissue in the realistic data, for each method, the 2000 transcripts with the greatest |logFC| relative to the truth (averaged across three of the replicates) were identified. This produces, for each method/tissue, a list of the most discordant isoforms. For a given structural property, such as length, the background distribution is the (empirical) distribution of the property over all isoforms (which is the same for both tissues), and the foreground distribution is the (empirical) distribution of the property over the 2000 discordant isoforms. These two distributions can be plotted together and interpreted visually, and can also be compared using the Kolmogorov–Smirnov test. The Teqila tool was used for this analysis [28].
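The foreground/background comparison rests on the two-sample Kolmogorov–Smirnov statistic, the maximum gap between the two empirical CDFs (the associated p value, omitted in this sketch, comes from the KS distribution):

```python
import bisect

def ks_statistic(foreground, background):
    """Two-sample KS statistic: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    fg, bg = sorted(foreground), sorted(background)

    def ecdf(sorted_xs, v):
        # fraction of sorted_xs that are <= v
        return bisect.bisect_right(sorted_xs, v) / len(sorted_xs)

    points = sorted(set(fg) | set(bg))
    return max(abs(ecdf(fg, v) - ecdf(bg, v)) for v in points)
```

A statistic near 0 means the discordant isoforms look like a random sample of all isoforms with respect to that property; a statistic near 1 means the property strongly separates them.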
A number of structural properties were investigated, such as number of isoforms, hexamer entropy, transcript length, transcript sequence compression complexity [29], and exon count. The p values were Bonferroni-corrected for testing multiple methods and multiple properties. The significant properties are shown in Fig. 5; the more significant, the darker the shade. In the idealized data, exon count and transcript length are comparably relevant. However, in the realistic data, length becomes much more relevant. Surprisingly, the number of sibling isoforms (other isoforms of the same gene) is far less relevant than the length or number of exons, except for NRP, where the number of sibling isoforms is strongly associated with accuracy.
It is worth looking at the distributions of some of the properties. A select few are shown here; others are in Additional file 1. Figure 6 shows the foreground and background distributions for transcript length and exon count, for Salmon. Only Salmon is shown because the distributions are similar for all methods. Figure 6A shows that there are proportionally far fewer foreground than background isoforms up to a length of around 1000 bases, while beyond a length of about 2000 bases there are proportionally far more foreground isoforms; however, the imbalance does not then grow progressively worse with length. Similarly, with the number of exons, performance seems to flip at around 10 exons. With sequence compression complexity, the foreground distribution is highly enriched for low compression complexity for all methods (see Additional file 1: Fig. S3).
The observations about length and exon count apply equally well to all methods. However, for number of isoforms, NRP is quite different from the others (see Fig. 7).
Incomplete annotation
Annotation-guided quantification is only as good as the annotation itself, and no annotation is perfect: in a given sample, there likely exist both annotated isoforms that are not present and unannotated isoforms that are present. It can quickly get complicated to sort out the effect of annotation issues on the performance of the algorithms. Sometimes missing annotation can be beneficial; for example, if a non-expressed isoform is missing from the annotation, then it cannot erroneously be assigned reads. On the other hand, if a highly expressed isoform is missing, then the method must decide what to do with the orphaned reads. Ideally it would recognize that they should be ignored; otherwise, they will be assigned to the wrong isoform. To investigate these two issues, we modified the annotation in two ways. First, all isoforms that were not expressed were removed. The extent to which their absence improves accuracy indicates how well the method handles unexpressed isoforms when they are present. Second, the highest-expressed isoform of each gene was removed. This should cause major problems for the algorithms, so it is informative to see which methods handle it better.
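The second perturbation can be sketched as follows (assuming the true counts are available as a nested mapping; the names are ours):

```python
def drop_top_isoforms(true_counts):
    """Given {gene: {isoform: true_count}}, return the set of isoform IDs
    retained after removing each gene's highest-expressed isoform, i.e.
    the reduced annotation used in the second perturbation."""
    kept = set()
    for gene, isoforms in true_counts.items():
        top = max(isoforms, key=isoforms.get)
        kept.update(iso for iso in isoforms if iso != top)
    return kept
```

The first perturbation is the mirror image: keep only the isoforms whose true count is positive.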
Figure 8 shows the percentile plots of the |logFC|. Hippocampus sample Hip1 is shown, but all hippocampus and liver samples look very similar. The first thing to note is that removing the maximally expressed isoform dramatically decreased the accuracy of all methods except HTSeq and featureCounts, while removing the non-expressed isoforms marginally increased accuracy for those methods. In contrast, for HTSeq and featureCounts we observe the opposite: removing the non-expressed isoforms dramatically decreased accuracy, and removing the highest expressed isoform made very little difference, particularly for featureCounts. This may be because HTSeq and featureCounts do not assign ambiguous reads, so removing annotations can allow more reads to be assigned, though not necessarily to the correct isoform.
Figure 9 compares, across methods, the percentile plots after removing the maximally expressed isoform. This eliminates the isoform responsible for the majority of the reads, so it should have a dramatic effect on accuracy. Here Salmon has the most difficulty, while HTSeq and featureCounts are the most robust, followed by NRP. We see a significant difference between Salmon and kallisto that goes in the opposite direction of the differences seen from the other perspectives.
Effects on differential expression
Next, we use differential expression to assess quantification accuracy. If differential expression analysis is the downstream goal for the quantified values, then it does not matter whether the absolute abundances differ from the truth, as long as the DE p values are unaffected. To investigate this, the two tissues were compared against each other; they are different enough that there is an abundance of differentially expressed genes. Five hippocampus samples and six liver samples of the realistic data were quantified with each of the seven methods, and the resulting quantified values were used as input for DE analyses with EBSeq [30], which is optimized for isoform differential expression. The p values generated from the true counts are compared to p values from the inferred counts, the assumption being that the closer a DE analysis on the inferred counts is to the corresponding DE analysis on the true counts, the more effectively the method has quantified the expression with respect to informing the DE analysis. Kallisto and Salmon are recommended to be run with Sleuth; however, since Sleuth cannot take true counts as input, that comparison would not be meaningful. Since we are comparing EBSeq (truth) to EBSeq (inferred) for all methods, it is meaningful to compare methods to each other with this metric.
Comparing two developmentally divergent tissues, we expect a large number of expressed transcripts to be differentially expressed. Figure 10A shows the overlap with the truth for the top n most significant transcripts, as n varies from 1 to 30,000. Since EBSeq reports many p values of exactly zero, rounded down at its limit of precision, ties were broken with the logFC. The vertical axis is the Jaccard index [31] of the top n DE transcripts determined using the true counts and the top n DE transcripts determined using the inferred counts. The Jaccard index of two subsets of a set is the size of their intersection divided by the size of their union. The higher the curve, the better. Salmon and Cufflinks perform best from this perspective, followed by RSEM. NRP and kallisto appear roughly equivalent. In Fig. 10B, the number of DE transcripts is plotted as a function of the q-value cutoff (Additional file 2: S5 Table). If a curve rises above the truth, then that method must be reporting more false positives than the q-value indicates. At varying places between 0.05 and 0.2, all methods become anti-conservative except featureCounts and HTSeq. Salmon, Cufflinks, and RSEM track the truth most closely at small cutoffs.
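The overlap metric is straightforward to compute; a sketch (transcript IDs assumed already sorted by significance, with ties broken by logFC):

```python
def jaccard_top_n(true_ranking, inferred_ranking, n):
    """Jaccard index of the top-n DE calls from true vs. inferred counts:
    |intersection| / |union| of the two top-n sets."""
    a, b = set(true_ranking[:n]), set(inferred_ranking[:n])
    return len(a & b) / len(a | b)
```

A value of 1 means the inferred counts produce exactly the same top-n DE list as the true counts.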
These data can also be used to evaluate the DE methods themselves: EBSeq, Sleuth, and DESeq2. DESeq2 is included for reference, though it was not specifically designed with transcript-level DE in mind [32]. In DE benchmarking it is notoriously difficult to determine a benchmark set of either differential or non-differential transcripts. However, if an isoform has zero expression in all replicates of both conditions, then it must necessarily be non-differential. A total of 43,872 isoforms have zero counts in all replicates of both conditions. Any transcript called DE in this set must be a false positive arising from mistakes in the quantification process. This gives a lower bound on the number of false positives, namely the number of these null isoforms called DE, and therefore a lower bound on the actual FDR. This lower bound on the FDR is plotted as a function of the q-value cutoff (Fig. 11A). Additionally, the number of null isoforms called DE is plotted as a function of the q-value cutoff in Fig. 11B. Figure 11A shows that in all cases the true FDR is much greater than reported.
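The lower bound just described can be sketched directly (names are ours; `null_isoforms` is the set with zero counts in every replicate of both conditions):

```python
def fdr_lower_bound(called_de, null_isoforms):
    """Lower bound on the realized FDR of a DE call set: every null
    isoform (zero in all replicates of both conditions) that is called
    DE is a guaranteed false positive."""
    if not called_de:
        return 0.0
    false_pos = sum(1 for t in called_de if t in null_isoforms)
    return false_pos / len(called_de)
```

This is only a lower bound because expressed isoforms called DE may also be false positives; those cannot be identified without further ground truth.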
Indeed, Fig. 11B shows that even at very small q-values EBSeq and DESeq2 report thousands of these false positives. At an FDR cutoff of 0.01 there are at least 1000 such false-positive isoforms with every method. These cannot simply be the 1% false positives allowed by an FDR of 0.01, since that would require an additional 99,000 true positives, which is more isoforms than are even annotated.
Why is this happening? When an isoform has zero true expression but another isoform of the same gene has positive expression, it is easy for reads of the expressed isoform to be misassigned to the unexpressed one. However, if none of the isoforms of a gene are expressed, it is far less likely that any of them are assigned spurious reads, since it is much less likely that any reads map anywhere to the gene's locus. Therefore, if a gene has no expressed isoforms in liver but has one or more expressed isoforms in hippocampus, in addition to one or more unexpressed isoforms, then the unexpressed isoforms will tend to have zero expression in liver and to incur spurious expression in hippocampus. Such isoforms are then easily mistaken as differential. An isoform-level DE method should account for this variability, but we see in Fig. 11 that both EBSeq and Sleuth are anti-conservative. The isoform-level DE methods do, however, outperform DESeq2, which is not intended for transcript-level analysis. For the quantification methods to which it is applicable, Sleuth shows the lowest false positive rate, reflecting the fact that it uses additional variance information from bootstrap samples.
Evaluation with real data
In all comparisons performed with the simulated datasets, HTSeq and featureCounts are very similar, and kallisto, Salmon, RSEM, Cufflinks, and NRP are also generally comparable. To explore whether the comparative analyses can be replicated with a real experiment, we used the real data that informed the simulations: five hippocampus and six liver samples. Hierarchical clustering was performed with correlation distance on the average expression of the samples. The results recapitulate these two groups in hippocampus (Fig. 12A), while in liver Cufflinks clusters apart on its own (Fig. 12B), as in the realistic simulated data (Fig. 3A, B). This suggests that Cufflinks is strongly influenced by a tissue-specific effect and confirms that the simulated data successfully capture properties of the real data.
Furthermore, we compared the seven quantification approaches on how well they inform a DE analysis using the real data. We quantified the samples from each tissue with the seven methods, followed by DE analysis between the two tissues using EBSeq. The methods cluster similarly for the realistic and real data (Figs. 3, 12). The number of DE transcripts identified at various q-value cutoffs differs substantially among the seven methods (Fig. 12D, Additional file 2: S6 Table).