Our approach to evaluating the performance of short paired-end reads vs. longer single-end ones is to compare expression estimates and differential expression results derived from these strategies to a truth set. Because we cannot know the “truth”, we assume that its closest approximation are results obtained from long, paired-end reads. Thus, we define our gold standard as the results obtained from 2 × 125 paired-end reads. We then trim these data down to two data sets with read lengths of 40 and 75 base pairs, and calculate Spearman’s rank correlations between transcripts-per-million (TPM) estimates based upon 2 × 40, 1 × 75 and 1 × 125 with the 2 × 125 gold standard. We present these correlations in two different ways. First, we examine the distributions of those correlation coefficients across alternative sequencing strategies. Second, we plot the correlations obtained using 2 × 40 over those obtained with 1 × 75 and 1 × 125.
Kallisto transcript-level TPM estimates made with 2 × 40 are always more highly correlated with those made with the 2 × 125 gold standard than those made with data trimmed to 1 × 75 reads (Fig. 1a,c), and perform better than estimates made with data trimmed to 1 × 125 for all but a few samples from one SRA accession (Fig. 1a, e). A similar performance advantage for the 2 × 40 strategy compared to the 1 × 75 strategy is observed at the gene level (Fig. 1b, d), although there is one accession where 1 × 125 performs better than 2 × 40 (Fig. 1f).
RSEM-based TPM estimates show an overall pattern generally consistent with that observed with kallisto (Fig. 2). A handful of samples for which 2 × 40 performed less well than both 1 × 75 and 1 × 125 had low bowtie2 alignment rates to the reference transcripts (for the 2 × 125 libraries, overall alignment rates of 37.1–51.0%), suggesting that library quality issues were responsible, but raising a broader issue concerning whether variation in alignment rates between strategies could explain the relative performance of 2 × 40 to 1 × 75 and 1 × 125.
Kallisto does not conduct formal sequence alignment, so for each sample-strategy combination we summed the kallisto transcript-level counts as a proxy for the number of informative reads used in estimating expression. The summed kallisto counts for 2 × 40 have a weak correlation with performance (with respect to TPM) relative to 1 × 75 the transcript level, and a moderate correlation at the gene level (Fig. 3a,b); the relative differences in counts between 2 × 40 and 1 × 75 appear uncorrelated with the relative performance of 2 × 40 and 1 × 75 (Fig. 3c,d). A similar pattern is observed with respect to 1 × 125 (Additional file 1: Figure S1). For RSEM-derived expression estimates, we found no impact of the percent of reads uniquely aligning on the relative performance of 2 × 40, but at the gene level found that relative differences in overall alignment rate had some effect on the robustness of TPM estimates of 2 × 40 relative to both 1 × 75 and 1 × 125 (Additional file 1: Figure S2, S3). For both kallisto and RSEM, while there are no obvious taxonomic effects there are clear experiment-specific effects, manifested as differences in relative 2 × 40 performance among accessions derived from the same species. Similar to the above results based upon expression correlations, looking across all accessions, for most samples root-mean-square error of expression estimates based upon 1 × 75 and 1 × 125 was greater than for 2 × 40 (Fig. 4).
We next evaluated the extent to which differences in performance with respect to estimating expression levels translated into downstream effects on tests of differential expression. Within each SRA accession and for each pair of conditions and for each sequencing strategy, we conducted Wald tests with, sleuth, limma-voom, and DESeq2. Similar to our evaluation of expression estimate robustness, we defined “true” differential expression signals as those recovered with 2 × 125, given a false discovery rate (FDR) of 0.01. We then calculated seven performance metrics on a per-accession basis for each differential expression method -sequencing strategy combination.
Sequencing strategy clearly impacts downstream differential expression analyses. Looking across all three differential expression methods, at both the transcript and gene level, for the majority of accessions false negative rates are lower for 2 × 40 compared to 1 × 75 (Fig. 5; Additional file 1: Figure S4, S5; Additional file 1: Table S2). The empirical false discovery rate—that is, for a strategy being evaluated, the proportion of putative differentially expressed features with a Benjamini-Hochberg adjusted p-value below our chosen FDR threshold of 0.01 for which the adjusted p-value is > 0.01 with 2 × 125—was also lower for 2 × 40 compared to both 1 × 75 and × 125, although it was consistently higher than 0.01, the threshold used to classify Wald tests as significant, including the 2 × 125 analyses with which we defined true positives and negatives (Fig. 5; Additional file 1: Figure S4, S5; Additional file 1: Table S2). FDR control was poorer at the transcript than the gene level, and, as previously reported, DESeq2 did not control FDR as well as limma-voom and sleuth [18]. At both the transcript and gene level, AUC is greater for 2 × 40 than 1 × 75 for all but a handful of method-accession combinations (Fig. 5, Additional file 1: Figure S4, S5; Additional file 1: Table S2). Additional metrics indicate a similar overall performance advantage of 2 × 40 compared to 1 × 75 (Additional file 1: Table S2). For all metrics, the short paired-end strategy outperforms 1 × 125 for the majority of method-accession combinations, although the magnitude of performance advantage of 2 × 40 compared to 1 × 125 is smaller than when compared with 1 × 75 smaller (Fig. 5, Additional file 1: Figure S4,S5; Additional file 1: Table S2).