Conventionally used reference genes are not outstanding for normalization of gene expression in human cancer research

Background The selection of reference genes is essential for quantifying gene expression. Theoretically they should be expressed stably and not regulated by experimental or pathological conditions. However, identification and validation of reference genes for human cancer research are still being regarded as a critical point, because cancerous tissues often represent genetic instability and heterogeneity. Recent pan-cancer studies have demonstrated the importance of the appropriate selection of reference genes for use as internal controls for the normalization of gene expression; however, no stably expressed, consensus reference genes valid for a range of different human cancers have yet been identified. Results In the present study, we used large-scale cancer gene expression datasets from The Cancer Genome Atlas (TCGA) database, which contains 10,028 (9,364 cancerous and 664 normal) samples from 32 different cancer types, to confirm that the expression of the most commonly used reference genes is not consistent across a range of cancer types. Furthermore, we identified 38 novel candidate reference genes for the normalization of gene expression, independent of cancer type. These genes were found to be highly expressed and highly connected to relevant gene networks, and to be enriched in transcription-translation regulation processes. The expression stability of the newly identified reference genes across 29 cancerous and matched normal tissues were validated via quantitative reverse transcription PCR (RT-qPCR). Conclusions We reveal that most commonly used reference genes in current cancer studies cannot be appropriate to serve as representative control genes for quantifying cancer-related gene expression levels, and propose in this study three potential reference genes (HNRNPL, PCBP1, and RER1) to be the most stably expressed across various cancerous and normal human tissues. Electronic supplementary material The online version of this article (10.1186/s12859-019-2809-2) contains supplementary material, which is available to authorized users.


Background
To understand how genetic alterations driving tumorigenesis lead to the formation of complex cellular networks and induce biological process variation, recent research into cancer genetics has focused on the identification of molecular differences between cancerous and normal tissues [1,2]. Recent high-throughput transcriptomic studies [3] have offered the opportunity to explore the molecular complexity of human cancer, and have provided evidence for classifying human cancer data into normal, benign, and malignant classes, based on their gene expression patterns. Nevertheless, the expression levels of transcriptionally identified candidate cancer genes require experimental verification via molecular methods such as quantitative reverse transcription PCR (RT-qPCR). One of the most important factors ensuring the accuracy of RT-qPCR analyses is the normalization of the identified target-gene expression level to that of a consistently expressed reference gene. To date, cancer researchers have predominantly used the GAPDH and β-actin reference genes as internal reference controls, because their mRNA expression levels are established to be high and constant in many different cells and tissues [4,5]. However, cancerous tissues often exhibit a higher level of gene expression variability than normal tissues, due to tumor heterogeneity, genetic instability, and the fact that genetic alterations in diverse cancer types may differentially affect cellular processes at the transcriptome level. Thus, it is a challenging to determine which reference genes would best serve as internal reference controls for a range of different human cancers. Indeed, an increasing number of researches have shown the striking expression variability of known reference genes in human cancers, and subsequently recommended novel reference genes for gene expression studies in each specific human cancer type [6,7]. These efforts with in silico analysis (e.g., geNorm, NormFinder, and Bestkeeper [8][9][10]) are ongoing; however, to date, no transcriptome-wide analysis for the identification of the most stably expressed consensus reference genes has been reported.
The primary objective of the present study was to conduct a screen for the most stable reference genes for the study of cancer gene expression. We exploited large-scale gene expression data from The Cancer Genome Atlas (TCGA) database, which contains 10,028 (9,364 cancerous and 664 normal) samples from 32 different cancer types. We identified novel reference genes that exhibited both a high expression and low expression-variation level across various cancerous and normal tissue types, and then demonstrated the effectiveness of these newly identified reference genes for use in RT-qPCR. Thus, the results of the present study promote a better understanding of gene expression changes in different cancer types, and will be of considerable use in facilitating the normalization of target-gene expression levels in future cancer research.

Data collection and bioinformatics analysis
The overall workflow of the present study is shown in Fig. 1. We downloaded RNA-sequence (RNA-seq) V2 data (level 3) of 34 different cancer types from the TCGA database (http://tcga-data.nci.nih.gov/tcga/). The TCGA RNA-seq pipeline has used two distinct measurement methods, comprising RPKM (Reads Per Kilobase per Million mapped reads) [11] and TPM (Transcripts Per Million) [12,13], to obtain expression levels from RNA-seq data. Given that TPM is established to produce more comparable results across various sample types than RPKM [13,14], we used TPM-generated data for 32 of the 34 cancer types for further analyses [esophageal carcinoma (ESCA) and stomach adenocarcinoma (STAD) were excluded, since only RPKM-generated data were available for these cancer types]. Unless otherwise stated, all gene expression levels used in our analyses represent the unit of transformed (multiplied by 10 6 ) normalized read counts (extracted from TCGA files with the extension "rsem.genes.normalized_results"). The human protein interaction network data were collected from the Human Protein Reference Database (HPRD release 9, http://www.hprd.org) [15], which includes 30,047 protein entries and 41,327 protein-protein interactions (PPIs). We extracted all binary PPIs from the HPRD, and counted the number of interactions for each protein without redundancy to estimate the size of the protein complex.

Human specimens
The validity of all matched human cancerous and normal tissues was confirmed via patient clinical diagnosis. In total, 58 matched sample pairs were obtained for analysis, of which the cancerous tissue sample in each was isolated from patient breast (n = 18), colon (n = 12), thyroid (n = 8), lung (n = 8), liver (n = 8), kidney (n = 2), or cervical (n = 2) cancer tissues. All human tissue was trimmed to 0.5 cm 2 immediately after removal from the patient and stored in 5 volumes of RNAlater solution (ThermoFisher Scientific, USA) at − 80°C. For the experiment, samples were used within 3 years of storage. These all utilized human specimens and data were provided by the Biobank of Chungnam University Hospital (Korea Biobank Network).

RNA preparation and RT-qPCR
Total RNA was extracted using a eCube Tissue RNA Mini Kit (PhileKorea, Korea) according to the manufacturer's instructions, and reverse-transcribed using M-MLV reverse transcriptase (Promega, USA) with random hexamers. RT-qPCR was performed with a SYBR-Green fluorescent dye (GENET BIO, Korea) and the AriaMx PCR System (Agilent, USA). All reactions occurred under identical cycling conditions, comprising 40 cycles of amplification with denaturation (95°C, 20 s), annealing (58°C, 20 s), and elongation (72°C, 20 s). The specificity of the products generated by each primer set was confirmed by both gel electrophoresis and a melting curve analysis (Additional file 1: Table S1 and Additional file 2: Figure S1).

Results and discussion
Commonly used reference genes exhibit a high level of expression variation in both tumorous and normal tissue samples To assess the gene expression variability within human cancerous and normal tissues, we collected gene expression data from the TCGA database, which contains 10,028 (9,364 cancerous and 664 normal) samples isolated from 32 different cancer types. We used TPM-generated data to calculate the coefficient of variation (CV, calculated as the standard deviation divided by the mean), for target gene expression levels across the analyzed samples. We initially evaluated the gene expression variability of commonly used reference genes (Table 1) [18], and found all 12 analyzed genes to exhibit a CV-value greater than 45% (Table 1). Most (23/31, 74%; Tables 2 and 3) of the experimentally selected reference genes expressed in cancer tissues were observed to exhibit a similar level of gene expression variability. We repeated this process to separately analyze cancerous and normal samples, so as to eliminate potential error caused by sample size bias (since 9,364 cancerous, but only 664 normal tissue samples were analyzed). The results of this second analysis showed the same trends in each cancer and normal group, whereby all 12 commonly used reference genes and 74% (23/31) of the experimentally selected reference genes were found to exhibit a CV value greater than 45% in both groups together (Additional file 3: Table S2). These results suggest that the reference genes most commonly used in current cancer studies may not be appropriate to serve as representative reference genes, and thus, their use may lead to erroneous quantification of cancer-related gene expression levels.

Selection of novel reference gene candidates from the TCGA database
Because genetic alterations in diverse cancer types may differentially affect cellular processes at the transcriptome level, we investigated whether reference genes defined by analysis of a single type of cancerous tissue could be applied to other cancer types. Thus, we calculated and compared the CV values of > 40 samples (and their matched normal tissue samples) from nine cancer types (BRCA, COAD, HNSC, LUAD, LUSC, LIHC, PRAD, THCA, and KIRC; Additional file 4: Figure S2), that were contained within the TCGA database. Among a total set of 20 top-ranked (by CV) genes from each cancer type, no genes (1) were included in the list of commonly used reference genes, and (2) were found in more than 50% (5 out of 9) of cancer types ( Fig. 2 and Additional file 5: Table S3), indicating the dependency of reference genes on cancer types. To newly determine suitable novel genes appropriate to act as internal controls for the normalization of target gene expression in cancer research, we selected a number of genes identified (1) to exhibit unvarying expression levels across both cancerous and normal tissue samples, (2) to have a CV value < 35%, (3) a minimum TPM > 0, (4) and an average of TPM value ≥1 across all tissue samples. Of the 10,028 analyzed samples from the 32 different cancer types, we identified 38 candidate novel cancer-research reference genes (Fig. 3a, Additional file 1: Table S4). We subsequently evaluated whether these newly identified reference genes had the same functional characteristics as the previously established, commonly used reference genes. We found the average expression level of the newly identified reference genes to be significantly higher than that of the others (115.06 versus 42.93; P < 0.0413, using an empirical permutation test with 10,000 replications). This result is consistent with previously reported expression levels for the established reference genes [4]. Next, we determined that, as expected [4,5,19], the newly identified reference genes were significantly enriched in functional categories associated with transcription-translation processes, such as polyA-RNA, ribonucleoprotein, and RNA-binding (FDR < 5%, Fig. 3b). The established reference genes have been previously demonstrated to act as the 'hubs' of the  highly connected protein-protein interaction (PPI) networks [20][21][22]. In the present study, we observed the newly identified reference genes to be characterized by a greater number of PPI network-interaction partners than the other genes (8.42 versus 3.67; P < 0.0185, using an empirical permutation test with 10,000 replications), indicating their functional importance for biological systems.

RT-qPCR validation of the newly identified reference genes in human cancer tissues
We next sought to confirm the validity of the newly identified candidates as reference genes for the normalization of RT-qPCR expression data in the context of human cancer. Therefore, we compared the RT-qPCR analysis results for two commonly used reference genes (GAPDH and β-actin) with those for the 11 most highly expressed of the newly identified reference genes (PCBP1, HNRNPC, HNRNPL, EMC4, SNX17, MRPL43, IST1, FAM32A, PFDN1, RNF10, and RER1) across 29 patient samples including breast, colon, liver, lung, and/or thyroid cancer types. Each human tissue was immersed in RNAlater solution immediately after extraction from the patient and stored at -80°C to minimize RNA degradation. In addition, 2 μg of total RNA extracted from tissues was electrophoresed on 1.5% denaturing agarose gel and only 28S/18S ratio of > 2 confirmed RNA was used in the experiment. The specificity of the products generated by each primer Newly identified reference genes whose ΔC T value was found to be lower than that of both β-actin and GAPDH in all samples are highlighted (red-colored box). * Samples from seven types of cancerous tissues, including breast (n = 9), colon (n = 6), liver (n = 4), lung (n = 4), thyroid (n = 4), kidney (n = 1), and cervical (n = 1) were combined. Note that kidney and cervical tissues have not been separately represented in the box plot set was confirmed by both gel electrophoresis and a melting curve analysis (Additional file 1: Table S1 and Additional file 2: Figure S1). Since optimal references genes for cancer-transcriptome analysis should exhibit a low level of expression variability between cancerous and normal tissue samples, we isolated total RNA from each cancerous and normal sample from a single patient and compared their C T values (where, C T is the "Cycle Threshold", defined as the number of cycles required for the fluorescence signal to exceed background level, and is inversely correlated with the amount of target nucleic acid in the sample). Of the 11 newly identified genes, HNRNPL (ΔC T = 0.37), PCBP1 (ΔC T = 0.42), PFDN1 (ΔC T = 0.46), and RER1 (ΔC T = 0.48) were found to have a lower average C T difference (ΔC T = C T [cancer] -C T [normal] ) between cancerous and normal tissue samples than β-actin (ΔC T = 0.58) and/or GAPDH (ΔC T = 0.60), suggesting their suitability for use as consensus reference genes for gene expression studies in human cancer (Fig. 4). To ensure the reliability and robustness of these results, we reconfirmed whether these reference genes had lower ΔC T values than β-actin and/or GAPDH in each cancer sample. HNRNPL was identified to have a ΔC T value lower than that of both β-actin and GAPDH in four (breast, colon, liver, and lung) of five cancer sample types. Similarly, PCBP1 and RER1 had lower ΔC T values than β-actin and GAPDH in all cancer sample types except liver cancer tissue, and PFDN1 exhibited a lower ΔC T value than β-actin and GAPDH in two cancer sample types (breast and lung, Fig. 4).

Conclusion
In summary, cancer is a disease characterized by complex molecular networks, in which highly heterogeneous and multifocal tumor cells cooperate with host cells within their microenvironment. Recent gene expression studies have been conducted to investigate the intricate interplay of gene expression patterns that regulate cancer invasion and metastasis at the transcriptional level; however, their accurate quantification of gene expression level is dependent upon the selection and use of reliable and appropriate reference genes for the normalization of target gene expression levels. Thus, in the present study, we performed in silico bioinformatics analyses and experimental validation to identify HNRNPL, PCBP1 and RER1 as novel candidate reference genes, whose expression is predominantly consistent, independent of cancer type, stage, and treatment status, and of patient age and gender. Although a larger sample size and more cancer types are needed for more reliable results, these novel reference genes will be invaluable for diagnosis and the prediction of patient prognosis, in a wide range of human cancers.

Availability of data and materials
All data for this study are downloaded from TCGA (https://cancergenome. nih.gov/) public database and included in Tables, Supplementary tables, and Additional files. Authors' contributions CP and JJ designed research. CP, KKK, and SYC contributed to the research coordination. JJ, SC, JO, and SGL performed research. CP, KKK, SYC, JJ, and SC analyzed data. CP, JJ, and KKK wrote the paper. All authors read and approved the final manuscript.

Ethics approval and consent to participate
The Institutional Review Board (IRB) of Chungnam National University approved the use of human tissues in the present study (IRB number 2016-08-032). All utilized human specimens and data were provided by the Biobank of Chungnam University Hospital (Korea Biobank Network).