Discrepancy between different gene annotations
The Ensembl and NCBI RefSeq annotations are among the most widely used gene annotations that have been utilized for RNA-seq gene expression quantification in the field. In this study, we examined a recent Ensembl annotation, a recent RefSeq annotation (‘RefSeq-NCBI’) and an older Refseq annotation (‘RefSeq-Rsubread’), to assess the impact of gene annotation choice on the accuracy of RNA-seq expression quantification. See Materials and Methods for more details of these annotations. The inclusion of an older RefSeq annotation allowed us to investigate the accuracy of new annotation data generated in recent years when the next-gen sequencing data have been used as a new data source for genome-wide annotation generation.
As RNA-seq gene-level expression quantification is typically performed for genes that contain exons [3,4,5], in this study we only focused on the genes that have annotated exons in each annotation. All the genes included in the downloaded Ensembl annotation contain at least one exon and they were all included in this study. The downloaded RefSeq-NCBI annotation contains 54,636 genes in total, however only 39,535 of them contain at least one exon. Genes that do not contain any annotated exons in this annotation were excluded from the study. Genes included in the RefSeq-Rsubread annotation all contain at least one exon and therefore they were all included in this study.
Figure 1A shows that the Ensembl annotation contains a lot more genes than the two RefSeq annotations. It is also worth noting that the RefSeq-NCBI annotation still has >12,000 genes absent from the Ensembl annotation even it contains less genes than Ensembl. Nearly 60% of the Ensembl genes are found to be absent from both of the two RefSeq annotations. In total, 25,496 common genes are found between the three annotations. Most of the genes included in the RefSeq-Rsubread annotation can be found in the RefSeq-NCBI or Ensembl annotations.
We also broke down genes according to their biotype and examined the overlap of genes between annotations for each biotype that is present in all annotations. Additional file 1: Fig. S1 shows that the three annotations are highly concordant in the categories of protein-coding genes and microRNAs. However, there are large number of pseudogenes and long non-coding RNAs (lncRNAs) found to be unique in both Ensembl and RefSeq annotations. Other non-coding RNAs, including small nucleolar RNAs, small nuclear RNAs, ribosomal RNAs and miscellaneous RNAs, also exhibited significant differences between Ensembl and RefSeq annotations. For the genes that are present in RefSeq-NCBI or RefSeq-Rsubread annotations but not in Ensembl annotation (13,173 genes), the majority of them are lncRNAs (Additional file 1: Fig. S2). For the genes that are unique to Ensembl (33,732 genes), most of them are lncRNAs and pseudogenes (Additional file 1: Fig. S3).
We then examined the effective gene lengths in each annotation. The effective length of a gene is the total number of unique bases included in all the exons belonging to the gene. Figure 1B shows the distributions of effective lengths of genes in the three annotations. Around half of the Ensembl genes have an effective length less than 1000 bases, whereas in the two RefSeq annotations only \(\sim 25\%\) of the genes are shorter than 1,000 bases in length. The median effective gene lengths in RefSeq-NCBI and RefSeq-Rsubread are \(\sim 3000\) bases, which is much larger than that in Ensembl (\(\sim 1000\) bases). Although the Ensembl annotation contains a lot more genes than the two RefSeq annotations, it also contains a much higher percentage of short genes.
We further performed gene-wise comparison of effective gene lengths using common genes between each pair of annotations. Although every annotation contains both longer and shorter genes in comparison to the corresponding genes from other annotations, the Ensembl genes were found to have a larger effective length than genes from the two RefSeq annotations overall (Fig. 1C). This is in contrast to the higher proportion of short genes observed in the Ensembl annotation (Fig. 1B), which indicates that the Ensembl genes that are also present in RefSeq-NCBI or RefSeq-Rsubread annotations tend to be longer than those Ensembl genes that can only be found in the Ensembl annotation. Although at least half of the genes were found to have a less than 2-fold (1-fold at \(log_2\) scale) length difference between annotations (Fig. 1C), the length differences could be as high as more than 64-folds (6-folds at \(log_2\) scale). The RefSeq-NCBI genes seem to be slightly longer than the corresponding RefSeq-Rsubread genes overall. Ensembl and RefSeq-Rsubread were found to be the least concordant annotations among the three annotations being compared.
Lastly, we compared the size of the transcriptome represented by each annotation. The transcriptome size of an annotation is computed as the sum of effective gene lengths from all the genes included in that annotation, which also represents the total number of exonic bases that were annotated in an annotation. Figure 1D shows that the Ensembl annotation has a larger transcriptome size than both RefSeq-NCBI and RefSeq-Rsubread annotations. This is not surprising because the Ensembl annotation contains more genes and also Ensembl genes common to other annotations are longer in general. RefSeq-Rsubread has a much smaller transcriptome size than RefSeq-NCBI, indicating a significant expansion of the RefSeq-NCBI annotation in the past five years. However, it is important to note that the RefSeq-Rsubread annotation is not a subset of the RefSeq-NCBI annotation, as demonstrated by the existence of RefSeq-RSubread genes that are absent in the RefSeq-NCBI annotation, the difference in gene length distribution and the length differences of the same genes between the two annotations (Fig. 1A–C). This indicates that not only were new genes added to the RefSeq annotation during the expansion, but existing genes have been modified.
It is against this background that we sought to understand how these differences in the annotations impact on the overall gene-level quantification results.
Fragments counted to annotated genes
We used a benchmark RNA-seq dataset generated by the SEQC project [1] to evaluate the impact of gene annotation on the accuracy of RNA-seq expression quantification. This dataset contains paired-end 100bp read data generated for four samples including a Universal Human Reference RNA sample (sample A), a Human Brain Reference RNA sample (sample B), a mixture sample with 75%A and 25%B (sample C) and a mixture sample with 25%A and 75%B (sample D).
We mapped the RNA-seq reads to the human genome GRCh38 using the Subread aligner [5, 19], and then counted the number of mapped fragments (read pairs) to each gene in each annotation using the featureCounts program [4, 5]. FeatureCounts assigns a mapped fragment to a gene if the fragment overlaps any of the exons in the gene. Figure 2 shows that across all the 16 libraries, the RefSeq-Rsubread annotation constantly has substantially more fragments assigned to it than the Ensembl and RefSeq-NCBI annotations. This is surprising because RefSeq-Rsubread contains much less annotated genes and also has a significantly smaller transcriptome, compared to Ensembl and RefSeq-NCBI (Fig. 1A, D). We then performed a detailed investigation into the mapping and counting results to find out what enabled RefSeq-Rsubread to achieve a higher percentage of successfully assigned fragments.
Although gene annotations were utilized in mapping reads to the human reference genome, the use of different annotations was not found to affect the number of successfully aligned fragments for each library (Additional file 1: Fig. S4). We found that when assigning fragments to genes using the Ensembl or RefSeq-NCBI annotation, more fragments were unable to be assigned because they did not overlap any genes (ie. failed to overlap any exon included in any gene), despite there are more genes included in these annotations compared to the RefSeq-Rsubread annotation (Additional file 1: Fig. S5). This is particularly the case for the fragment assignment in the human brain reference samples. Additional file 1: Fig. S6 shows the read mapping results for a few selected genes that are only present in the RefSeq-Rsubread annotation. The annotations of these genes were generally well supported by the read mapping results.
We also found that the use of Ensembl and RefSeq-NCBI annotations led to more fragments being unassigned due to the assignment ambiguity, ie. a fragment overlaps more than one gene (Additional file 1: Figure S7). This should be because there are more genes that overlap with each other (ie. exons from different genes overlap with each other) in the Ensembl and RefSeq-NCBI annotations compared to the RefSeq-Rsubread annotation. Our investigation revealed that less gene overlapping in the RefSeq-Rsubread annotation and better compatibility of this annotation with the mapped fragments have led to more fragments being successfully counted for each library in this dataset. Given that both the Universal Human Reference and Human Brain Reference samples used in this study are known to contain a very high number of expressed genes and the RNA-seq data generated from these samples are expected to cover most of the human transcriptome, our analysis suggests that the RefSeq-Rsubread annotation is likely to contain more transcribed region in the genome than the other two annotations in general.
We would also like to note that multi-mapping fragments, which can be best mapped to more than one location in the genome, were also included in the read mapping results and subsequently assigned to the genes they overlap (see “Materials and Methods” for more details). Around 3–9% of total fragments in each library were found to be multi-mapping fragments (Additional file 1: Fig. S8). We performed an in-depth analysis to examine all the mapping locations found for them. 20–55% of multi-mapping fragments were found to map to two or more genes (ie. mapping to exons included in two or more genes) across the three annotations, however they only accounted for around 0.5–2.5% of total fragments in each library (Additional file 1: Fig. S9). In particular, when the NCBI-Rsubread annotation was used in the mapping, the percentage of such fragments was only \(\sim\)0.8% on average. As this is the only group of multi-mapping fragments that can cause assignment ambiguity, the very low percentage of these fragments means that they should not have much effect on the quantification results. Around 10–50% of multi-mapping fragments were found to map to one or more exons within the same gene, and 30–60% of them did not map to any exon in any gene, across the three annotations. These corresponded to around 0.5–4.5% and 1–3% of total fragments in each library, respectively (Additional file 1: Figs. S10 and S11).
As it is known that many multi-mapping fragments originate from pseudogenes, we particularly looked at the fraction of fragments that were assigned to these genes. As expected, significantly more fragments were assigned to pseudogenes in Ensembl (\(\sim 6\)%) than in RefSeq-NCBI and RefSeq-Rsubread (\(<1\)%)(Additional file 1: Fig. S12), due to much larger number of pseudogenes present in Ensembl.
Intensity range of gene expression
We examined if the gene annotation choice has an impact on the range of gene expression levels in the RNA-seq data. Raw gene counts of the SEQC data were converted to \(\hbox {log}_2\)FPKM (\(\hbox {log}_2\) fragments per kilo exonic bases per million mapped fragments) values for all the genes included in each annotation. A prior count of 0.5 was added to the raw counts to avoid log-transformation of zero. Figure 3 shows that the two RefSeq annotations exhibit a desirable larger intensity range of gene expression than the Ensembl annotation, as shown by the larger boxes in the boxplots. It is surprising to see that the Ensembl genes have the smallest intensity ranges in all the libraries, give that the Ensembl annotation contains the largest number of genes in all the three annotations being examined. In addition to the large intensity range, the RefSeq-Rsubread genes were also found to have a markedly higher median expression level than genes in the RefSeq-NCBI and Ensembl annotations.
Gene annotation discrepancy after expression filtering
As it is a common practice to filter out genes that are deemed as lowly expressed, or are completely absent in an RNA-seq data analysis [2], we also set out to assess the differences between alternative annotations after excluding such genes. We excluded those genes that failed to have at least 0.5 CPM (counts per million) in at least four libraries (each sample has four replicates) in the analysis of the SEQC dataset. The expression-filtered data were also used for comparing the accuracy of quantification from using alternative annotations presented in the following sections.
The bar plot in Fig. 4A shows that Ensembl has significantly more genes (also higher proportion of genes) filtered out due to low or no expression, compared to RefSeq-NCBI and RefSeq-Rsubread. After expression filtering, the total numbers of remaining genes from the three annotations became more similar to each other. 16,472 genes were found to be common between the three annotations after filtering, accounting for 69%, 78% and 86% of the filtered genes in the Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations respectively (Fig. 4B). Almost all the filtered genes in the RefSeq-Rsubread annotation can be found in the other two annotations.
After expression filtering, the median effective gene length has increased to \(\sim 4000\) bases for all annotations (Fig. 4C), meaning that a higher proportion of short genes were removed due to low expression in every annotation. The median effective length of Ensembl genes now became comparable to, or slightly higher than those in the two RefSeq annotations, indicating that the Ensembl annotation contained a higher proportion of lowly expressed short genes than the two RefSeq annotations. When comparing the effective lengths of genes common to all three annotations after filtering, the Ensembl genes were found to have the largest median effective length and the RefSeq-Rsubread genes have the smallest median effective length (Fig. 4D). This is not surprising because the Ensembl annotation is known to be more aggressive than the RefSeq annotations and RefSeq-Rsubread is an old annotation that has not been updated in the last five year.
The expression filtering did not seem to affect the distribution of differences of effective gene lengths between each pair of annotations (using genes common to each pair of annotations), with Ensembl and RefSeq-Rsubread remaining to be the least concordant annotations (Fig. 4E and 1C). Using genes common to all three annotations after filtering exhibited similar distributions of gene length differences between each pair of annotations compared to using genes common to each pair of annotations (Fig. 4F). Similar to before filtering, the gene-wise length comparison performed after filtering also showed that overall the Ensembl genes had the largest gene lengths and the RefSeq-Rsubread genes had the shortest gene lengths.
Comparison of titration monotonicity preservation
To assess the impact of gene annotation choice on the accuracy of RNA-seq quantification result, we utilized as ground truth the inbuilt titration monotonicity in the SEQC data, the TaqMan RT-PCR data and the microarray data generated for the same samples, to evaluate which annotation gives rise to a better expression correlation of the RNA-seq quantification data with the truth.
In this section, we compared the ability of Ensembl and the two RefSeq annotations in retaining the inbuilt titration monotonicity in the RNA-seq dataset. In Fig. 5, the reference titration curve depicts the expected fold change that genes are expected to follow in sample C versus sample D based on the fold change in sample A versus sample B. This is computed using the Eq. (1) (see “Materials and methods”). We then calculated the Mean Squared Error (MSE) between the reference titration monotonicity and the titration monotonicity obtained from each annotation. A smaller MSE value means that the generated quantification data is closer to the truth. Figure 5 shows that the MSE computed for the RefSeq-Rsubread annotation is constantly lower than those computed for the Ensembl and RefSeq-NCBI annotations, regardless if filtering was applied or if only common genes were included for comparison. RefSeq-Rsubread was also found to yield comparable or lower MSE compared to the other two annotations when the data were TMM or quantile normalized (Additional file 1: Figs. S13 and S14), in addition to the library-size normalized data shown in Fig. 5. These results demonstrated that the use of RefSeq-Rsubread annotation led to better quantification accuracy for the RNA-seq data.
Validation against TaqMan RT-PCR data
The TaqMan RT-PCR dataset generated in the MAQC study [13, 14] was used to validate the gene-level quantification results from the RNA-seq dataset. This dataset contains measured expression levels for \(>1000\) genes in the four SEQC samples. The aim was to understand how well Ensembl and RefSeq annotated gene expression correlated with the TaqMan RT-PCR data.
The RNA-seq data generated from each annotation were filtered to remove lowly expressed genes before being compared to the RT-PCR data. Numbers of matched genes between the RT-PCR data and the RNA-seq data were 856, 901 and 901 for Ensembl, RefSeq-NCBI and RefSeq-Rsubread, respectively. 846 RT-PCR genes were found to be common to all the three annotations. The raw TaqMan RT-PCR data were \(\hbox {log}_2\)-transformed before comparing to the filtered RNA-seq data.
Pearson correlation analysis of the RNA-seq gene expression (\(\hbox {log}_2\)FPKM values) and RT-PCR gene expression (\(\hbox {log}_2\) values) from using the RT-PCR genes matched with each individual annotation showed that the RefSeq-Rsubread annotation constantly yielded a higher correlation than the Ensembl and RefSeq-NCBI annotations, across all the samples and the three different normalization methods (left panel in Fig. 6). The Ensembl annotation was found to produce the worst correlation in all these comparisons. When using the RT-PCR genes matched with all three annotations for comparison, RefSeq-Rsubread was again found to yield the highest correlation (right panel in Fig. 6). Ensembl and RefSeq-NCBI were found to produce similar correlation coefficients. Taken together, results from this evaluation showed that the use of RefSeq-Rsubread annotation led to a better concordance in gene expression between the RNA-seq data and the RT-PCR data, compared to the use of Ensembl and RefSeq-NCBI annotations.
Validation against microarray data
An Illumina BeadChip microarray dataset, which was generated by the MAQC-I project [14] for the same samples as in the RNA-seq data used in this study, was used to further validate the gene-level RNA-seq quantification results obtained from different annotations. The microarray dataset was background corrected and normalized using the ‘neqc’ function in the limma package [17, 25]. Microarray genes were then matched to the RNA-seq genes included in the filtered RNA-seq data. 14,405, 14,561 and 14,508 microarray genes were found to be matched with RNA-seq genes from Ensembl, RefSeq-NCBI and RefSeq-Rsubread annotations, respectively. 13,424 microarray genes were found to be present in all three annotations. For those microarray genes that contain more than one probe, a representative probe was selected for each of them. The representative probe selected for a gene had the highest mean expression value across the four samples among all the probes the gene has.
A Pearson correlation analysis was then performed between microarray data and RNA-seq data for each of the three annotations. Both RNA-seq and microarray data include \(\hbox {log}_2\) expression values of genes. Figure 7 shows that the use of RefSeq-Rsubread annotation consistently yielded the highest correlation between RNA-seq and microarray data in all the comparisons, no matter which RNA-seq normalization method was used and if all or common matched genes were included in the evaluation. On the other hand, the use of the Ensembl annotation resulted in the worst correlation between RNA-seq data and microarray data in all the comparisons.