Study overview
This paper aims to provide insights into the effect of different choices of human genome annotation on the variation in RNA-seq expression estimates. We propose a complexity measure (refer to the Background section) to relate observations in downstream RNA-seq analyses to the trend of genome annotation characteristics. The typical pipeline for RNA-seq expression analysis includes mapping, quantification, normalization, and calling differentially expressed genes. As shown in Figure 7, in this study, we use two publicly available RNA-seq datasets that provide differentially expressed genes and RT-qPCR validation information. We align short sequence reads to the human genome with two spliced aligners, OSA [17] and TopHat [18]. Alignment outputs of both tools are quantified by htseq-count [19] to acquire gene expression estimates in terms of the read counts. Since OSA has embedded quantification and TPM normalization [20] in its package, we use Cufflinks [21] to quantify TopHat alignment outputs only and then obtain gene/isoform expressions in terms of FPKM-normalized values [8]. Given the read counts data from htseq-count, we apply the edgeR package in R [22] to call differentially expressed genes between treatment and control samples. For TPM or FPKM expression estimates, we calculate fold-change between treatment and control samples and then compare these numbers to external RT-qPCR validation results provided by the original studies. We propose several evaluation metrics for each analysis step to demonstrate performance variation induced by annotation complexity.
Human genome annotation
Several organizations or institutions have spent more than a decade working on annotating the human genome. Various annotating techniques have been developed and a variety of information sources have been utilized to provide the most informative and correct human genome annotation [23, 24]. We use six well-known annotations, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Since each annotation emphasizes slightly different categories of genomic elements (e.g., some report more functionally important small RNAs), we need to manually examine these annotations and prepare them to be comparable with each other. We choose to use AceView annotation as a standard to determine which genomic elements need to be filtered out in other annotations. We also decided to use annotations that only fall into main chromosome contigs, unplaced contigs, and unlocalized contigs in the UCSC HG19 assembly.
The information sources and annotating strategies of each human genome annotation are summarized in [16]. The procedures for improving cross-annotation comparability are briefly described below:
AceView Genes [9] - The AceView annotation was downloaded from its website. We choose this annotation as the standard definition for genomic element categories. To match genomic contigs for the sequence mapping stage, we remove the mitochondrial annotation from the original annotation file.
Ensembl Genes [12] - The Ensembl annotation was downloaded from its FTP site. It includes annotations that fall outside of the preselected genomic contigs as well as a considerable amount of small RNAs (e.g., tRNA, miscRNA scRNA, snRNA). These genomic elements are removed in the preparation process. Meanwhile, the chromosome names in this annotation are translated to the chromosome names in HG19 convention.
H-InvDB Genes [13] - The H-InvDB annotation was downloaded from its website. To prepare this annotation, we remove annotations located on haplotype and mitochondrial contigs. We also map the chromosome names in H-InvDB annotation to the chromosome names in HG19 convention.
RefSeq Genes [10] - The RefSeq annotation was downloaded from the UCSC Table Browser. We prepare the RefSeq annotation by removing haplotype annotations.
UCSC Known Genes [14] - The UCSC Known Genes annotation was downloaded from the UCSC Table Browser. To prepare this annotation, we remove annotations located on the haplotype and mitochondrial contigs. We also remove other small RNAs such as tRNA, miscRNA, and snRNA.
Vega Genes [15] - The Vega annotation was downloaded from its website. The preparation steps are similar to that of the Ensembl annotation. Genomic elements that fall outside of the preselected genomic contigs are removed. The chromosome names in Vega annotation are also translated to the chromosome names in HG19 convention.
RNA-seq datasets
We download two publicly available RNA-seq datasets from the NCBI Sequence Read Archive (SRA) repository. The first dataset (accession number: SRP008482) investigates how thrombin treatment affects endothelial function in terms of gene expression profiles. Generally, thrombin can stimulate endothelial cells and regulate the expression, release and activation of a number of biological mediators [25]. The targeted biological samples are "human pulmonary microvascular endothelial cells (HMVEC-L)" with two conditions: either control (two technical replicates) or treated with thrombin for six hours (three technical replicates). The sequencing platform was Illumina HiScanSQ with sequencing depth around 50 million read pairs for each technical replicate, and read lengths of 2 × 101 base pairs. The study also validated the expression fold-change of three genes (CELF1, FANCD2, and TRAF1) between treated and control samples using RT-qPCR technology. Such RT-qPCR information is considered the ground truth and is useful for validating and evaluating RNA-seq expression estimates.
The second dataset (accession number: SRP000727) studies alternative isoform regulation in human tissue transcriptomes [26]. It profiled 16 tissue transcriptomes, and two MAQC (MicroArray Quality Control) samples were included in the study. The two MAQC samples are Ambion Human Brain Reference RNA (HBRR) and Stratagene Universal Human Reference RNA (UHRR). The study includes four technical replicates for the HBRR sample and 3 technical replicates for the UHRR sample. This is an older sequencing dataset which used Illumina Genome Analyzer to generate single-end reads with read length of 36 base pairs. Each technical replicate has only 2.5 million reads. The merit of this dataset is that the RT-qPCR results are publicly available through the MAQC study. Fold-changes of 1,044 genes from the TaqMan RT-qPCR assay again provide an external ground truth for evaluating RNA-seq expression estimates.
Short sequence read mapping methods and evaluation metrics
We use two spliced alignment tools, TopHat and OSA, to map short sequencing reads to the human genome with the guidance of various genome annotations. The spliced alignment tools enable RNA-seq reads to directly map to the genome (the typical pipeline for spliced alignment is shown in Figure 8). TopHat is a spliced alignment tool that is widely used for mapping RNA-seq data to the genome or transcriptome [18]. It aligns short sequence reads to the human transcriptome first, and then attempts to remap the unmapped reads from the previous stage to the human genome. The alignment outputs from the two stages are merged into the final output. We use TopHat version 2.0.5 with no novel junction detection and the "-G" option, which allows us to provide a genome annotation GTF file (the standard genome annotation reporting format) and to force TopHat to use the mapping strategy described in Figure 8. OSA (Omicsoft Sequence Aligner) is a new spliced alignment tool that "improves mapping speed 4-10-fold with better sensitivity and less false positives" compared to the TopHat, SoapSplice, and RUM pipelines [17]. It implements a similar mapping strategy as TopHat. We use OSA version 1.8.2 with the default settings.
We use the UCSC HG19 human genome assembly as a reference genome for spliced alignment. We include only 24 main chromosome contigs, 20 unplaced contigs, and 39 unlocalized contigs. The rest of the contigs, e.g., mitochondria and haplotypes, are excluded.
We define two evaluation metrics for the mapping stage. The first metric is based on the categorization of read mapping outcomes. For paired-end sequencing, the categories include uniquely paired reads, uniquely mapped singletons, non-uniquely paired reads, non-uniquely mapped singletons, and unmapped reads. For single-end sequencing, the categories are simpler, including only uniquely mapped reads, non-uniquely mapped reads, and unmapped reads. The second metric is the percentage of the number of reads that map to the annotated and un-annotated genomic sequences.
Gene/isoform expression quantification, normalization methods, and evaluation metrics
We use the htseq-count script from the HTSeq package to count the number of reads (or fragments in the paired-end sequencing case) that map to each gene as the gene expression estimate. For each mapped read or fragment, htseq-count determines the genes to which these reads or fragments associate. If a read or a fragment overlaps more than one gene, it provides three scenarios to resolve this ambiguous situation. We use the HTSeq package version 0.5.3p9 with "intersection-nonempty" overlapping resolution [19]. All the other options were kept in the default setting. Read counts from htseq-count are used as the input for packages that detect differentially expressed genes. The OSA package includes functionality for quantifying and normalizing gene/isoform expressions. We choose to report gene/isoform expression estimates in terms of Transcripts Per Million (TPM) normalized values [20]. For TopHat alignment outputs, we use Cufflinks to quantify gene and isoform expressions in terms of Fragments Per Kilobase exon model per Million mapped reads (FPKM) estimates. We use Cufflinks version 2.0.2, keeping most parameters in the default setting except enabling sequencing bias correction and multi-mapped reads correction [21].
At the quantification and normalization stage, we propose two evaluation metrics to observe the impact of genome annotation complexity on RNA-seq expression estimates. We use the stability of normalized TPM or FPKM expression between technical replicates as the first evaluation metric. With the help of unique HGNC gene identifier for each gene, we identify 13,613 common genes from OSA alignment with TPM quantification and 13,810 common genes from TopHat alignment with Cufflinks FPKM quantification across six genome annotations. The number of common genes between the two pipelines is different since Cufflinks tend to collapse or not report certain low-expressing genes. For each genome annotation, we remove genes that are absent for all replicates and then compute the coefficient of variation (CV), averaged across all targeted genes or isoforms, as defined in equation (1), where Si is the sample standard deviation of expression estimates across replicates with the same biological condition, is the mean of expression estimates across replicates with the same biological condition, and n is the number of targeted genes or isoforms. We apply the same technique to the set of common genes, uncommon genes, and to all isoforms.
(1)
The second evaluation metric is the percentage of present genes or isoforms that have nonzero expression in at least one replicate. Since some annotations (e.g., the AceView annotation) possess considerably more genes or isoforms compared to other annotations, through this metric, we aim to examine the practicability of this additional information for transcriptome profiling studies (assuming sequencing libraries are prepared by poly-A enrichment).
Differentially expressed gene calling methods and evaluation metrics
The edgeR package in R is applied to identify differentially expressed genes [22]. It takes the raw read count as the input instead of the normalized expression estimates. For each gene, edgeR assumes that the raw read count across several replicates follows a negative binomial distribution. The Fisher's exact test is chosen to determine the significance level of each differentially expressed gene (DEG). To examine the repeatability of DEGs, we observe the concordance of the 20 most significant DEGs from each annotation.
The TPM and FPKM expression estimates are used for fold-change evaluation. For thrombin study samples (SRA: SRP008482), three genes were selected for RT-qPCR validation with fold-changes provided in the original study. For each of the six annotations, we consider the RT-qPCR result as a ground truth for estimating the error of RNA-seq-based fold-changes from RT-qPCR-based fold-changes. We use the average absolute deviation to measure the error as defined in equation (2), where FC stands for Fold-Change and n is the number of fold-changes being compared. We use Log2 transformation for the fold-change estimates for both technologies.
(2)
For MAQC samples (SRA: SRP000727), we use the publicly available MAQC-I dataset that used a TaqMan RT-qPCR assay to profile 1,044 genes for both the Human Brain Reference RNA (HBRR) and Universal Human Reference RNA (UHRR) samples [27] (GEO accession number: GSE5350). There are four technical replicates for the HBRR sample and four technical replicates for the UHRR sample. We select 830 TaqMan genes that have present calls for all replicates of HBRR and UHRR samples, and then calculate (1) average absolute deviation, (2) root mean squared error, and (3) correlation coefficient between RNA-seq-based fold-changes and RT-qPCR-based fold-changes. We use Log2 transformation for the fold-change estimates for both technologies.