Assessing the impact of human genome annotation choice on RNA-seq expression estimates
© Wu et al; licensee BioMed Central Ltd. 2013
Published: 4 November 2013
Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus.
We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.
Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
Next-generation sequencing (NGS) technology is a powerful tool for extracting and interpreting genetic information from a broad range of biological systems, e.g., miRNA regulatory networks , genome-wide association between single nucleotide polymorphisms (SNPs) and phenotypes , DNA-protein interactions [3, 4], and differentially expressed genes between treated and control samples [5, 6]. NGS is preferable over first-generation Sanger sequencing because of its high sequencing throughput and low cost per base pair. With NGS, sequencing an entire human genome becomes feasible, which enables a larger cohort of samples in genome-scale comparative studies. RNA-seq is a major branch of NGS technology that is useful for studying transcriptomes . One aspect of transcriptome research is quantification of expression levels for various genomic elements, e.g., genes, transcripts, and non-coding RNAs . Acquiring transcriptome expression profiles requires that genomic elements be defined in the context of the genome. Multiple human genome annotations exist, including the AceView database  and the RefSeq database . Thus, it is necessary to study the impact of genome annotation choice on transcriptome quantification.
Genome annotation is a dynamic process that defines coordinates for each genomic element with respect to the genome sequence. Such a process bridges the gap between DNA or RNA sequences and biological functions . Integration of a genome annotation with mapping information from RNA-seq short sequence reads enables quantification of genomic elements such as genes and transcripts. Each genome annotation project uses different annotation strategies and information sources. Thus, high variation exists among multiple available annotations in terms of the comprehensiveness of annotated genomic elements. Some annotation strategies rely on computer-based prediction, resulting in more complex gene models that contain more predictive or exploratory genomic elements. Other annotation strategies rely on evidence-based methods, i.e., methods that require more manual curation, leading to simpler gene models with fewer genes and isoforms (i.e., splice variants of a gene).
Properties of various human genome annotations.
Database Downloaded Date
Nov. 12, 2011
Apr. 20, 2012
May 1, 2012
June 26, 2012
Dec. 21, 2011
July 23, 2012
# of Genes
# of Isoforms
# of Exons
Average # of Isoforms per Gene
Maximum # of Isoforms per Gene
(with HGNC Gene Name)
Annotated Percentage (%) Gene
Any relationship between genome annotation complexity and gene expression quantification could be helpful in guiding the selection of a genome annotation for various expression studies using RNA-seq data. Currently no guidelines for selecting a genome annotation for RNA-seq are available, and the effect of genome annotation choice on downstream data analysis is still unclear. The focus of this study is to acquire some insights into the impact of human genome annotation choice on RNA-seq expression estimates.
Results and discussion
Complexity of human genome annotations
Effect of human genome annotation complexity on mapping
Effect of human genome annotation complexity on quantification
Effect of annotation complexity on differential expression calling
For thrombin study samples (SRA: SRP008482), we use OSA alignment with htseq-count quantification to prepare input read count data for the edgeR package. Most of the top 20 differentially expressed genes are detected in at least two of the six annotations, while several others are unique and annotation-specific . Even though the functions of these annotation-specific genes are unclear, more complex annotations are still preferable since they provide an opportunity to identify novel genomic elements that may be functionally important.
Fold-changes between thrombin-treated samples and control samples (SRA: SRP008482).
RNA-seq data with TPM Expression Estimates (OSA Pipeline)
Average Absolute Deviation from RT-qPCR
RNA-seq data with FPKM Expression Estimates (TopHat/Cufflinks Pipeline)
Average Absolute Deviation from RT-qPCR
The genome annotation is a necessary component of RNA-seq expression analysis. Multiple genome annotations are publicly available; however, it is not clear how different choices of genome annotation will affect downstream RNA-seq expression estimates. We defined the complexity of human genome annotation, and assessed the relationship between genome annotation complexity and several RNA-seq performance metrics. Based on our complexity measure, we ordered existing human genome annotations from most to least complex as follows: AceView, H-InvDB, Ensembl, Vega, UCSC, and RefSeq. In more complex annotations, a higher percentage of the entire genome is annotated. For RNA-seq sequence mapping, less complex annotations result in a higher percentage of uniquely mapped reads and uniquely mapped pairs for both single-end and paired-end sequencing samples. However, at the same time, the number of RNA-seq reads that map to annotated genomic sequences is smaller for less complex annotations. Genome annotation complexity also affects RNA-seq expression estimates. More complex annotations result in more ambiguous mappings, which increase the difficulty of RNA-seq quantification and thus, cause higher expression variation between RNA-seq technical replicates. Furthermore, more complex annotations lead to a lower percentage of present (i.e., detected) genes or isoforms, which suggests that the predictive or hypothetical genomic elements in these annotations tend to belong to non-expressors or low expressors. For RNA-seq differentially expressed gene detection, the concordance is high among the six annotations. More complex annotations are capable of identifying annotation-specific genes that may be functionally important. Deviations in RNA-seq expression estimates due to differences in genome annotation complexity can propagate to fold-change statistics and, subsequently, differential expression detection. Thus, when comparing RNA-seq fold-change statistics to ground truth RT-qPCR fold-change statistics, more complex annotations tend to have larger deviation and smaller correlation. In summary, the impact of genome annotation choice on RNA-seq expression estimates is significant, and the choice of annotation should depend on the study. Less complex genome annotations are preferable for studies that require more stable RNA-seq expression estimates. However, to discover and explain unknown biological mechanisms, more comprehensive and complex genome annotations are necessary.
Human genome annotation
Several organizations or institutions have spent more than a decade working on annotating the human genome. Various annotating techniques have been developed and a variety of information sources have been utilized to provide the most informative and correct human genome annotation [23, 24]. We use six well-known annotations, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Since each annotation emphasizes slightly different categories of genomic elements (e.g., some report more functionally important small RNAs), we need to manually examine these annotations and prepare them to be comparable with each other. We choose to use AceView annotation as a standard to determine which genomic elements need to be filtered out in other annotations. We also decided to use annotations that only fall into main chromosome contigs, unplaced contigs, and unlocalized contigs in the UCSC HG19 assembly.
The information sources and annotating strategies of each human genome annotation are summarized in . The procedures for improving cross-annotation comparability are briefly described below:
AceView Genes  - The AceView annotation was downloaded from its website. We choose this annotation as the standard definition for genomic element categories. To match genomic contigs for the sequence mapping stage, we remove the mitochondrial annotation from the original annotation file.
Ensembl Genes  - The Ensembl annotation was downloaded from its FTP site. It includes annotations that fall outside of the preselected genomic contigs as well as a considerable amount of small RNAs (e.g., tRNA, miscRNA scRNA, snRNA). These genomic elements are removed in the preparation process. Meanwhile, the chromosome names in this annotation are translated to the chromosome names in HG19 convention.
H-InvDB Genes  - The H-InvDB annotation was downloaded from its website. To prepare this annotation, we remove annotations located on haplotype and mitochondrial contigs. We also map the chromosome names in H-InvDB annotation to the chromosome names in HG19 convention.
RefSeq Genes  - The RefSeq annotation was downloaded from the UCSC Table Browser. We prepare the RefSeq annotation by removing haplotype annotations.
UCSC Known Genes  - The UCSC Known Genes annotation was downloaded from the UCSC Table Browser. To prepare this annotation, we remove annotations located on the haplotype and mitochondrial contigs. We also remove other small RNAs such as tRNA, miscRNA, and snRNA.
Vega Genes  - The Vega annotation was downloaded from its website. The preparation steps are similar to that of the Ensembl annotation. Genomic elements that fall outside of the preselected genomic contigs are removed. The chromosome names in Vega annotation are also translated to the chromosome names in HG19 convention.
We download two publicly available RNA-seq datasets from the NCBI Sequence Read Archive (SRA) repository. The first dataset (accession number: SRP008482) investigates how thrombin treatment affects endothelial function in terms of gene expression profiles. Generally, thrombin can stimulate endothelial cells and regulate the expression, release and activation of a number of biological mediators . The targeted biological samples are "human pulmonary microvascular endothelial cells (HMVEC-L)" with two conditions: either control (two technical replicates) or treated with thrombin for six hours (three technical replicates). The sequencing platform was Illumina HiScanSQ with sequencing depth around 50 million read pairs for each technical replicate, and read lengths of 2 × 101 base pairs. The study also validated the expression fold-change of three genes (CELF1, FANCD2, and TRAF1) between treated and control samples using RT-qPCR technology. Such RT-qPCR information is considered the ground truth and is useful for validating and evaluating RNA-seq expression estimates.
The second dataset (accession number: SRP000727) studies alternative isoform regulation in human tissue transcriptomes . It profiled 16 tissue transcriptomes, and two MAQC (MicroArray Quality Control) samples were included in the study. The two MAQC samples are Ambion Human Brain Reference RNA (HBRR) and Stratagene Universal Human Reference RNA (UHRR). The study includes four technical replicates for the HBRR sample and 3 technical replicates for the UHRR sample. This is an older sequencing dataset which used Illumina Genome Analyzer to generate single-end reads with read length of 36 base pairs. Each technical replicate has only 2.5 million reads. The merit of this dataset is that the RT-qPCR results are publicly available through the MAQC study. Fold-changes of 1,044 genes from the TaqMan RT-qPCR assay again provide an external ground truth for evaluating RNA-seq expression estimates.
Short sequence read mapping methods and evaluation metrics
We use the UCSC HG19 human genome assembly as a reference genome for spliced alignment. We include only 24 main chromosome contigs, 20 unplaced contigs, and 39 unlocalized contigs. The rest of the contigs, e.g., mitochondria and haplotypes, are excluded.
We define two evaluation metrics for the mapping stage. The first metric is based on the categorization of read mapping outcomes. For paired-end sequencing, the categories include uniquely paired reads, uniquely mapped singletons, non-uniquely paired reads, non-uniquely mapped singletons, and unmapped reads. For single-end sequencing, the categories are simpler, including only uniquely mapped reads, non-uniquely mapped reads, and unmapped reads. The second metric is the percentage of the number of reads that map to the annotated and un-annotated genomic sequences.
Gene/isoform expression quantification, normalization methods, and evaluation metrics
We use the htseq-count script from the HTSeq package to count the number of reads (or fragments in the paired-end sequencing case) that map to each gene as the gene expression estimate. For each mapped read or fragment, htseq-count determines the genes to which these reads or fragments associate. If a read or a fragment overlaps more than one gene, it provides three scenarios to resolve this ambiguous situation. We use the HTSeq package version 0.5.3p9 with "intersection-nonempty" overlapping resolution . All the other options were kept in the default setting. Read counts from htseq-count are used as the input for packages that detect differentially expressed genes. The OSA package includes functionality for quantifying and normalizing gene/isoform expressions. We choose to report gene/isoform expression estimates in terms of Transcripts Per Million (TPM) normalized values . For TopHat alignment outputs, we use Cufflinks to quantify gene and isoform expressions in terms of Fragments Per Kilobase exon model per Million mapped reads (FPKM) estimates. We use Cufflinks version 2.0.2, keeping most parameters in the default setting except enabling sequencing bias correction and multi-mapped reads correction .
The second evaluation metric is the percentage of present genes or isoforms that have nonzero expression in at least one replicate. Since some annotations (e.g., the AceView annotation) possess considerably more genes or isoforms compared to other annotations, through this metric, we aim to examine the practicability of this additional information for transcriptome profiling studies (assuming sequencing libraries are prepared by poly-A enrichment).
Differentially expressed gene calling methods and evaluation metrics
The edgeR package in R is applied to identify differentially expressed genes . It takes the raw read count as the input instead of the normalized expression estimates. For each gene, edgeR assumes that the raw read count across several replicates follows a negative binomial distribution. The Fisher's exact test is chosen to determine the significance level of each differentially expressed gene (DEG). To examine the repeatability of DEGs, we observe the concordance of the 20 most significant DEGs from each annotation.
For MAQC samples (SRA: SRP000727), we use the publicly available MAQC-I dataset that used a TaqMan RT-qPCR assay to profile 1,044 genes for both the Human Brain Reference RNA (HBRR) and Universal Human Reference RNA (UHRR) samples  (GEO accession number: GSE5350). There are four technical replicates for the HBRR sample and four technical replicates for the UHRR sample. We select 830 TaqMan genes that have present calls for all replicates of HBRR and UHRR samples, and then calculate (1) average absolute deviation, (2) root mean squared error, and (3) correlation coefficient between RNA-seq-based fold-changes and RT-qPCR-based fold-changes. We use Log2 transformation for the fold-change estimates for both technologies.
This work was supported in part by grants from the Parker H. Petit Institute for Bioengineering and Bioscience, National Institutes of Health (Bioengineering Research Partnership R01CA108468, Center for Cancer Nanotechnology Excellence U54CA119338), National Cancer Institute, Georgia Cancer Coalition (Distinguished Cancer Scholar Award to MDW), Georgia Research Alliance, Hewlett-Packard, and Microsoft Research.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 11, 2013: Selected articles from The Second Workshop on Data Mining of Next-Generation Sequencing in conjunction with the 2012 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S11.
- Liu N, Olson EN: MicroRNA regulatory networks in cardiovascular development. Developmental cell. 2010, 18 (4): 510-525. 10.1016/j.devcel.2010.03.010.PubMed CentralView ArticlePubMedGoogle Scholar
- Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.Google Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nature reviews Genetics. 2009, 10 (10): 669-680. 10.1038/nrg2641.PubMed CentralView ArticlePubMedGoogle Scholar
- Morozova O, Marra MA: Applications of next-generation sequencing technologies in functional genomics. Genomics. 2008, 92 (5): 255-264. 10.1016/j.ygeno.2008.07.001.View ArticlePubMedGoogle Scholar
- Li H, Zhou H, Wang D, Qiu J, Zhou Y, Li X, Rosenfeld MG, Ding S, Fu XD: Versatile pathway-centric approach based on high-throughput sequencing to anticancer drug discovery. Proceedings of the National Academy of Sciences of the United States of America. 2012, 109 (12): 4609-4614. 10.1073/pnas.1200305109.PubMed CentralView ArticlePubMedGoogle Scholar
- Kalari KR, Rossell D, Necela BM, Asmann YW, Nair A, Baheti S, Kachergus JM, Younkin CS, Baker T, Carr JM: Deep Sequence Analysis of Non-Small Cell Lung Cancer: Integrated Analysis of Gene Expression, Alternative Splicing, and Single Nucleotide Variations in Lung Adenocarcinomas with and without Oncogenic KRAS Mutations. Frontiers in oncology. 2012, 2: 12-PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews Genetics. 2009, 10 (1): 57-63. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Thierry-Mieg D, Thierry-Mieg J: AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006, 7 (Suppl 1): 1-14. 10.1186/gb-2006-7-s1-s1.View ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research. 2007, 35 (Database): D61-65. 10.1093/nar/gkl842.PubMed CentralView ArticlePubMedGoogle Scholar
- Stein L: Genome annotation: from sequence to biology. Nature reviews Genetics. 2001, 2 (7): 493-503. 10.1038/35080529.View ArticlePubMedGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S: Ensembl 2012. Nucleic acids research. 2012, 40 (Database): D84-90.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamasaki C, Murakami K, Fujii Y, Sato Y, Harada E, Takeda J, Taniya T, Sakate R, Kikugawa S, Shimada M: The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts. Nucleic acids research. 2008, 36 (Database): D793-799.PubMedGoogle Scholar
- Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics. 2006, 22 (9): 1036-1046. 10.1093/bioinformatics/btl048.View ArticlePubMedGoogle Scholar
- Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic acids research. 2008, 36 (Database): D753-760.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu P, Phan JH, Wang MD: The effect of human genome annotation complexity on RNA-Seq gene expression quantification. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 712-717. 10.1109/BIBMW.2012.6470224.View ArticleGoogle Scholar
- Hu J, Ge H, Newman M, Liu K: OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics. 2012, 28 (14): 1933-1934. 10.1093/bioinformatics/bts294.View ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. 2010, 38 (18): e178-e178. 10.1093/nar/gkq622.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139-140. 10.1093/bioinformatics/btp616.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey LC, Searls DB, Overton GC: Analysis of EST-driven gene annotation in human genomic sequence. Genome research. 1998, 8 (4): 362-376.PubMedGoogle Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticlePubMedGoogle Scholar
- Zhang LQ, Cheranova D, Gibson M, Ding S, Heruth DP, Fang D, Ye SQ: RNA-seq reveals novel transcriptome of genes and their isoforms in human pulmonary microvascular endothelial cells treated with thrombin. PloS one. 2012, 7 (2): e31229-10.1371/journal.pone.0031229.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature biotechnology. 2006, 24 (9): 1151-1161. 10.1038/nbt1239.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.