Evaluation of Sirtuin-3 probe quality and co-expressed genes using literature cohesion

Background Gene co-expression studies can provide important insights into molecular and cellular signaling pathways. The GeneNetwork database is a unique resource for co-expression analysis using data from a variety of tissues across genetically distinct inbred mice. However, extraction of biologically meaningful co-expressed gene sets is challenging due to variability in microarray platforms, probe quality, normalization methods, and confounding biological factors. In this study, we tested whether literature derived functional cohesion could be used as an objective metric in lieu of ‘ground truth’ to evaluate the quality of probes and microarray datasets. Results We examined Sirtuin-3 (Sirt3) co-expressed gene sets extracted from either liver or brain tissues of BXD recombinant inbred mice in the GeneNetwork database. Depending on the microarray platform, there were as many as 26 probes that targeted different regions of Sirt3 primary transcript. Co-expressed gene sets (ranging from 100–1000 genes) associated with each Sirt3 probe were evaluated using the previously developed literature-derived cohesion p-value (LPv) and benchmarked against ‘gold standards’ derived from proteomic studies or Gene Ontology classifications. We found that the maximal F-measure was obtained at an average window size of 535 genes. Using set size of 500 genes, the Pearson correlations between LPv and F-measure as well as between LPv and mitochondrial gene enrichment p-values were 0.90 and 0.93, respectively. Importantly, we found that the LPv approach can distinguish high quality Sirt3 probes. Analysis of the most functionally cohesive Sirt3 co-expressed gene set revealed core metabolic pathways that were shared between hippocampus and liver as well as distinct pathways which were unique to each tissue. These results are consistent with other studies that suggest Sirt3 is a key metabolic regulator and has distinct functions in energy-producing vs. energy-demanding tissues. Conclusions Our results provide proof-of-concept that literature cohesion analysis is useful for evaluating the quality of probes and microarray datasets, particularly when experimentally derived gold standards are unavailable. Our approach would enable researchers to rapidly identify biologically meaningful co-expressed gene sets and facilitate discovery from high throughput genomic data. Electronic supplementary material The online version of this article (10.1186/s12859-019-2621-z) contains supplementary material, which is available to authorized users.


Background
The amount of genome-wide gene expression data available in public repositories is accumulating rapidly. Increasing evidence suggests that genes in related metabolic pathways and cellular processes are coordinately expressed [1,2]. Recent studies using co-expression analysis have *Correspondence: rhomayon@memphis.edu 1 Bioinformatics Program, University of Memphis, 38152 Memphis, USA 2 Center for Translational Informatics, University of Memphis, 38152 Memphis, USA Full list of author information is available at the end of the article provided important insights into complex traits and diseases and have enabled researchers to reverse engineer molecular pathways [3,4]. The GeneNetwork database is a web resource that contains a large amount of gene expression data from a variety of tissues across panels of recombinant inbred (RI) mice, which have been derived through inbreeding of progeny from distinct inbred parental lines [5][6][7]. Co-expression studies using RI mice have identified gene networks associated with alcohol and stress responses [8], fear conditioning [9], liver fibrosis [10], retinal ganglion cell function [11], and many other complex traits.
Although co-expression analysis can be quite useful, there are major challenges when using microarraybased gene expression data. Microarrays are highly sensitive to biological and technical variability, which often result in noisy data [12]. In addition, different microarray platforms have variable technical reproducibility [13]. Moreover, although microarray platforms often include several probes for each target transcript, the probes produce inconsistent results due to targeting error or differences in hybridization properties [14]. It is therefore necessary to develop scalable objective methods that can identify problematic probes and datasets.
Our group has developed various literature-based semantic approaches to derive implicit functional associations between genes, transcription factors, or microR-NAs [15][16][17][18][19]. We have demonstrated that the semantic similarity scores can be used to calculate a literature p-value (LPv) representing the functional cohesion of gene sets [20]. The method was shown to be both accurate and robust when evaluated against Gene Ontology (GO) classifications. Subsequently, we used LPv to compare different microarray normalization procedures [21]. More recently, an extension of this method, called Literature Based Functional Significance (LBFS), was used to evaluate statistical methods for determining differentially expressed genes from microarrays [22].
The goal of this study is to apply literature-based functional cohesion analysis to quantitatively evaluate co-expressed gene sets derived from the GeneNetwork database. In particular, we used this method to evaluate inconsistencies between probes that target the same transcript by benchmarking our method against three proteomic datasets focused on Sirtuin-3 signaling pathway. Sirtuin-3 (Sirt3) belongs to a family of NAD(+)-dependent deacetylases and plays an important role in regulation of cellular metabolism and aging [23,24]. Sirt3 is a mitochondrial protein that is highly expressed in energy-demanding tissues such as brain, heart, skeletal muscle and kidney, as well as in energy-producing tissue such as liver. Sirt3 deficiency in mice results in a reduction of ATP production through inhibition of oxidative phosphorylation [25]. In addition, Sirt3 regulates key enzymes in fatty acid oxidation, amino acid metabolism and antioxidant defenses [24]. In liver, acetyl proteomic studies have demonstrated that Sirt3 is involved in global metabolic reprogramming during calorie restriction [26]. In neurons, Sirt3 is required for adaptive responses to excitotoxicity as well as oxidative and mitochondrial stress [27].

Gene document collection and gene-gene similarity calculation
Medline citations for 21,027 mouse genes were collected based on the PubMed identifiers (PMIDs) in the gene2pubmed repository [28] available at NCBI, and concatenated to construct a gene-document for each gene. Gene-Gene similarity scores were calculated by Latent Semantic Indexing (LSI) as previously described [15][16][17]. Briefly, a term-by-gene matrix was created, where the entries of the matrix were the log-entropy weighted frequencies of terms in the document collection. Then, a truncated singular value decomposition (SVD) of that matrix was performed to produce a lower dimension (reduced rank) concept-by-gene matrix with 500 concepts. Genes were then represented as concept vectors in the reduced rank matrix and the similarity between genes was calculated as the cosine of the vector angles. A graphical representation of the procedure is shown in Additional file 1: Figure S1.

Literature Cohesion P-value (LPv) calculation
Literature cohesion p-values (LPv) for a given gene set were calculated based on LSI derived gene-gene similarities using a slightly modified version of the procedure previously described [20]. LPv is derived by using Fisher's exact test to determine whether the number of pair-wise literature similarity associations above a pre-calculated threshold in a given gene set are significantly higher than that expected by chance. The pre-calculated threshold was set at 95 th percentile of all pairwise similarities among the 21,027 genes. For clarity, the procedure is briefly described below.
Let S be the set of all pairwise cosines for the set of all 21,027 genes G, |S| = |G| 2 . Let T represent 95 th percentile of all pairwise cosine similarities in S . For a given subset G of G for which LPv needs to be calculated, let S be the set of all pairwise cosines for the set G , |S | = |G | 2 . Additionally, let: sample size k ≤ 50 and number of samples n, N ≈ 1000. R 1 , R 2 , . . . R n be n randomly selected sets of cosines from S , each the same size k.
A be the average of counts of cosines that meet the criteria |R 1 ≥ T|, |R 2 ≥ T|, . . . |R n ≥ T|.
B be the average of counts of cosines that meet the criteria |R 1 < T|, |R 2 < T|, . . . |R n < T|. R 1 , R 2 , . . . R N be N randomly selected sets of cosines from S, each the same size k.
C be the average of counts of cosines that meet the criteria |R 1 ≥ T|, |R 2 ≥ T|, . . . |R N ≥ T|.
D be the average of counts of cosines that meet the criteria |R 1 < T|, |R 2 < T|, . . . |R N < T|.
The literature cohesion p-value (LPv) is calculated using a right tailed Fisher's exact test [29]: We essentially calculate the probability of observing by random chance, cosines as high as observed in the input gene set. Lower p-value indicates higher cohesion. We transform the metric slightly by taking − log 10 (p-value) so that the higher value indicates higher cohesion. Additional file 1: Figure S2 demonstrates the LPv calculation procedure with a scaled down representative example.

Workflow
The workflow for our approach is shown in Fig. 1. For a given tissue and dataset in GeneNetwork database, all Sirt3 probes were evaluated. For each Sirt3 probe in the dataset, 10 different Sirt3 co-expressed gene sets, containing 100 to 1000 highest correlated genes, were selected. LPv was calculated for each Sirt3 co-expressed gene set. In addition, the F-measure (F-score) of the gold standard gene sets was calculated for each Sirt3 co-expressed gene set. Lastly, the Pearson correlation coefficient was calculated for the LPvs and F-scores of all Sirt3 co-expressed gene sets.

Microarray datasets
Three genome-wide gene expression datasets pertaining to liver (2) and brain (1) tissues across BXD recombinant inbred mice were identified from GeneNetwork. The datasets were derived from different microarray platforms and normalization methods. From each dataset, sets of 100-1000 genes whose expression patterns were maximally correlated with Sirt3 expression were extracted. The following datasets were used:

Gold standards gene sets used for benchmarking microarray datasets
Selection of an appropriate gold standard for evaluation of microarray data and gene co-expression networks is a challenging task due to a lack of 'ground truth' . Rather than using other gene expression data, which can have overlapping confounds, we used a more functional approach for selection of gold standards. Based on the premise that co-expressed genes function together in closely linked signaling pathways, we expect that coexpressed genes directly interact with one another and are localized in the same cellular compartment. In this study, we focused on Sirtuin-3 (Sirt3) which is a mitochondrial NAD-dependent deacetylase that broadly controls cellular metabolism and has been implicated in a variety of diseases and aging-related processes [23,24]. For benchmarking purposes, we used two different proteomic studies, which identified direct targets of Sirt3 by comparing the acetylomes in different tissues of Sirt3 knock-out and wild-type control mice. Rardin et al. [31] identified 248 proteins whose acetylation levels were significantly (p < 0.05) increased in Sirt3 knock-out livers (Gold Standard 1, GS1). Using a similar approach, Dittenhafer-Reed et al. [32] identified 203 Sirt3 targets (Golds Standard 2, GS2) in liver. Between the two studies, 140 Sirt3 targets were common in liver tissue (Fig. 2). To benchmark Sirt3 co-expression networks in the hippocampus, only one acetyl proteomic dataset could be found. Dittenhafer-Reed et al. [32] identified 171 proteins whose acetylation levels were significantly changed in Sirt3 KO brains. Interestingly, 93 and 79 out of the 171 Sirt3 target proteins in brain were also found in liver according to Rardin et al. and Dittenhafer-Reed et al., respectively (Fig. 2). As an alternative functional benchmarking approach, we examined mitochondrial enrichment of various Sirt3 coexpressed gene sets.

Literature cohesion of Sirt3 co-expressed gene sets in liver
We initially focused on two liver datasets in GeneNetwork that used Affymetrix Gene 1.0 ST arrays, which contained 10 different Sirt3 probes. While each of the probes appropriately targeted Sirt3 exons, we found that the correlation values among the 10 Sirt3 probes across the panel of BXD RI mice were highly variable (Additional file 1: Figure S3). Only 6 Sirt3 probes showed a Pearson correlation > 0.5 for GS1, and 4 probes showed a Pearson correlation > 0.5 for GS2. These results suggest that using probe-to-probe correlations may not necessarily be a good metric to evaluate probe quality as a whole, since the same probes showed different correlation structure in different datasets (Additional file 1: Figures S3A and S3B).

Fig. 2
Overlap between Sirt3 targets identified in proteomic studies by Rardin et al. [31] and Dittenhafer-Reed et al. [32], henceforth referred to as Gold Standard set 1 (GS1) and 2 (GS2), respectively We next examined the correlated gene sets produced by the different Sirt3 probes. We found that the co-expressed gene sets associated with highly inter-correlated Sirt3 probes (e.g., PID1056899) were dramatically different from the Sirt3 probes which were uncorrelated (PID 10569006; Additional file 1: Figure S4). Whereas the strongly inter-correlated Sirt3 probes identified robust gene sets with high connectivity (Pearson correlation > 0.8 or < −0.8) among themselves (Additional file 1: Figure S4A), the weakly correlated Sirt3 probes produced gene sets that were weakly connected (Additional file 1: Figure S4B).
To evaluate which Sirt3 probes produced biologically meaningful co-expressed gene sets, we benchmarked the top 100 to 1000 correlated genes associated with each Sirt3 probe using the two proteomic Gold Standard gene sets described above. As expected, the recall of gold standard genes increased with increasing window sizes (larger co-expressed gene sets), whereas the precision decreased (Additional file 1: Figures S5 and S6). Using F-measure (weighted harmonic mean of precision and recall), we found that some probes consistently performed better across all window sizes (Fig. 3).
To evaluate if the Sirt3 correlated gene sets represent functionally related genes, we compared the literature derived cohesion p-values (LPvs) as described previously by our group [20]. LPv was calculated for gene sets ranging from 100 to 1000 genes that were associated with each Sirt3 probe in the two datasets. In general, we found that the most highly correlated genes (smaller window size) to Sirt3 probes were more functionally cohesive and that − log 10 (LPv) decreased (less significant) with increasing window size. Interesting, in DS1, only five out of the 10 Sirt3 probes produced significant (LPv < 0.05) functional cohesion (Fig. 4). The Sirt3 co-expressed gene sets produced from four of these five probes exhibited the highest F-scores for the two gold standard gene sets (GS1 and DS2).
Next, we examined the correlation between Fmeasure and LPv across various window sizes of Sirt3 co-expressed gene sets. Whereas the correlation between F-measure and LPv improved with increasing a b c d window size for DS1, the correlation decreased for DS2 (Fig. 5).

Literature cohesion of Sirt3 co-expressed gene sets in brain hippocampus
Multi-tissue proteomic experiments performed by [32] suggested that although Sirt3 targets a set of core mitochondrial proteins, it targets different proteins in energy-producing tissues compared to energy-demanding tissues. Therefore, we compared Sirt3 co-expressed gene sets derived from brain hippocampus in addition to liver tissues. The hippocampal study (DS3) utilized Affymetrix Exon array platform to measure transcript levels across a panel of 93 inbred mouse strains. There is a total of 26 Sirt3 probes represented in the mouse Affymetrix Exon Array; 12 probes that target intronic regions and 14 probes that target exonic regions of Sirt3 as annotated by the manufacturer. This is useful for evaluation, enabling us to use the intronic probe correlated gene sets as negative controls. For this analysis, only one proteomic gold standard dataset specific to brain tissue could be obtained [32]. In general, the F-scores of the Sirt3 exonic probe correlated gene sets were much higher than those obtained from the intronic probes (Fig. 6). This result was consistent with the LPv for the exonic probe correlated gene sets, although a few exonic probes produced insignificant literature cohesion and several intronic probes produced significant literature cohesion (Fig. 7). The highest F-scores were obtained with an average window size of 535 genes across all Sirt3 exonic probes. Focusing on a window size of 500 transcripts, we found that the majority (78%) of Sirt3 co-expressed gene sets obtained from exonic probes were significantly (LPv< 0.05) cohesive (Table 1). Surprisingly, 6 out of 12 intronic Sirt3 probes produced significant literature cohesion although their F-scores were very low 0-0.042 (Table 1). Overall, the Pearson correlation between gold standard F-measure and − log 10 (LPv) was greater than 0.79 for all Sirt3 correlated gene set sizes ranging from 100 to 1000 (Fig. 8).
For example, the Pearson correlation between F-score and − log 10 (LPv) for 500 Sirt3 co-expressed genes was 0.90 (R 2 = 0.81). Examination of the scatter plot of the data showed that only the Sirt3 exonic probe correlated gene sets produced the highest F-scores and literature cohesion significance (Fig. 9a). Indeed the LPv of Sirt3 exonic probe correlated genes were significantly (p < 0.0145, Wilcoxon/Kruskal-Wallis Rank Sum test) higher than the LPv of Sirt3 intronic probes (Fig. 9b).
As an alternate functional benchmarking approach, we examined the correlation between LPv and enrichment p-value of mitochondrial genes annotated in GO database. Consistent with the proteomic benchmarking results, we found a very high correlation of 0.93 (R 2 = 0.80) between LPv and mitochondrial gene enrichment for Sirt3 co-expressed genes (Fig. 9c). Notably, not all Sirt3 exonic probe correlated genes and none of the intronic probe correlated genes were enriched for the Mitochondrion GO category (Fig. 9c). The LPv for the mitochondrial enriched Sirt3 correlated genes was significantly (p < 0.0003, Wilcoxon/Kruskal-Wallis Rank Sum test) higher than the LPv for Sirt3 correlated genes which were not enriched for mitochondrial genes (Fig. 9d). Moreover, the correlation between LPv and all mitochondrial GO categories for Sirt3 correlated gene sets was significantly (p < 10 −25 , Wilcoxon/Kruskal-Wallis Rank Sum test) higher than the correlation of LPv and all other GO categories (Additional file 1: Figure S7).

Functional analysis of Sirt3 co-expressed gene sets in brain and liver
Having identified the most functionally cohesive Sirt3 co-expressed gene sets in liver (using PID# 10569000 in Affymetrix Mouse Gene 1.0 ST array) and hippocampus (using PID# 5499567 in Affymetrix Mouse Exon array) tissues, we performed functional enrichment analysis on the top 500 Sirt3 correlated transcripts. The most enriched categories were the UniProt Keywords Acetylation and Mitochondrion cellular compartment ( Table 2). The Sirt3 co-expressed gene sets in both liver and Hippocampus were enriched for core metabolic processes such as oxidation-reduction process and oxidative phosphorylation. Interestingly, Sirt3 co-expressed genes in both tissues were associated with focal adhesion, myelin sheath, Huntington's disease, Parkinson's disease and Alzheimer's disease. However, in the liver, the Sirt3 co-expressed genes were involved in tricarboxylic acid cycle and nucleotide binding, as well as amino acid, lipid, and cholesterol metabolism. In contrast, the Sirt3 co-expressed gene sets in the hippocampus were enriched for ribosomal proteins, translation regulation, proteasome regulation as well as non-alcoholic fatty liver disease and vesicle-mediated transport pathways.

Discussion
In this study, we demonstrated that using a literaturebased method to determine functional cohesion of gene sets is an effective approach to evaluate the quality of specific microarray probes and their correlated gene expression. Using two different proteomic gold standard gene sets, we found that the literature derived p-values of Sirt3 co-expressed genes were highly correlated with F-scores across different window sizes (Figs. 5, 8 and 9). In addition, the LPv of Sirt3 co-expressed genes was highly correlated with enrichment of mitochondrial genes (Fig. 9). These results suggest that LPv may be used as a proxy benchmarking tool when gold standards are not available.
We found that not all probes targeting the same gene on a microarray platform produced meaningful co-expressed gene sets based on LPv, F-measure, or GO enrichment analysis (Figs. 3, 4, 6, 7, and 9). These results suggest that a considerable number of microarray probes may produce erroneous results. Our results are consistent with various other studies which have documented non-specific probe hybridization and other factors that affect the quality of expression analysis using Affymetrix GeneChip arrays [12,14,33]. Importantly, based on probe level analysis of the Hippocampus Affymetrix Exon Arrays, we found three Sirt3 exonic probes with non-significant LPv (Fig. 9, Table 1). Further analysis revealed that two of these probes (PID# 5553357 and PID# 5321074) actually target Sirt3 introns rather than exons, revealing annotation errors by the manufacturer. In addition, the third Sirt3 probe (PID# 4601230) targeted other regions in the genome in addition to Sirt3. In contrast, some Sirt3 intronic probe correlated gene sets showed significant LPv (Table 1). While the reasons for this result is unclear, it does indicate that a metric other than LPv cutoff of p < 0.05 is needed to identify high quality probes. Based on the distribution of LPv for exonic and intronic Sirt3 probes, we found that the mean − log 10 (LPv) (6.42) may be adequate to distinguish high quality probes from intronic controls (Fig. 9b). Applying this metric to DS1 and DS2, which used Affymetrix Gene 1.0 ST array, identified five Sirt3 probes with greater than average LPv (Additional file 1: Figure  S8). Four out of the five 'high' quality Sirt3 probes were consistent between the two datasets, suggesting that some differences in literature cohesion may be caused by experimental differences. Further study using a larger number of datasets for a given array platform will be needed to develop a global probe quality metric based on LPv.
In general, the F-scores across all probes and datasets were low (<0.04) although some co-expressed gene sets exhibited highly significant literature cohesion. The low F-measure is a result of both low recall and precision of the gold standard genes, which is likely due to a number of factors. First, low F-scores may simply be due to technical or experimental variation in the proteomic studies that we used to define the gold standards. Only 140 (56% of GS1 and 69% of GS2) Sirt3 target protein/genes in the liver were common between the two proteomic studies (Fig. 2). Second, low F-scores may be explained by the fact that not all Sirt3 co-expressed genes are expected to be direct deacetylation targets of Sirt3, yet they may play important roles in Sirt3 signaling pathways. Third, gene expression is not expected to tightly correlate with protein levels nor post-translational modification. Fourth, the liver samples used for our co-expression analysis were from treated animals, whereas the gold standard genes were identified from untreated Sirt3 knock-out and wildtype animals. Therefore, it is possible that many of the Sirt3 co-expressed genes in the liver datasets may be confounded by the treatment. Finally, since the gold standard experiments utilized Sirt3 KO animals, some of the acetylation changes may be a result of indirect affects or compensatory mechanisms caused by Sirt3 deficiency during development. Nevertheless, using two different sets of proteomic gold standards as well as benchmarking with mitochondrial enrichment analysis enabled us to validate LPv performance across multiple tissues and datasets. As noted above, we consistently found high correlation between LPv and the different benchmarks (Figs. 5, 8 and 9).
Large-scale acetyl proteomic studies have reported that Sirt3 regulates a wide range of metabolic enzymes and is responsible for global metabolic reprogramming  The gene count, % of input gene list, raw p-value, and Benjamini-Hochberg adjusted p-value are shown for each functional category. Categories that are unique to one dataset are bolded [26,31,32]. Consistent with these studies, our functional enrichment analysis of Sirt3 co-expressed genes in BXD recombinant inbred mice revealed that Sirt3 participate in core metabolic pathways involved in oxidative phosphorylation, however the specific targets in this biochemical pathway were different in the hippocampus compared to liver. For example, out of 31 liver genes and 19 brain genes involved in oxidative phosphorylation pathway, only 3 were common between liver and hippocampus. This suggest that although Sirt3 generally regulates oxidative phosphorylation, it may mediate its effects through distinct target proteins in different tissues. Also, consistent with previous proteomic studies, our analysis showed that Sirt3 regulates distinct metabolic pathways in an energy producing tissue such as liver compared to an energy demanding tissue such as brain hippocampus. In the liver, we found that Sirt3 co-expressed genes are involved in fatty acid and amino acid metabolism, whereas in the brain Sirt3 co-expressed genes are involved in regulation of protein synthesis and vesicle mediated transport. This result is consistent with recent studies that reported Sirt3 is critical for adaptive responses to exercise and metabolic challenges in neurons [27]. Using a literature-based approach such as LPv is important because gold standard sets (despite their limitations discussed above) are not readily available for the vast majority of genes. In the absence of gold standards, it is difficult to assess the quality of different probes for a given gene represented on various microarray platforms. Lack of appropriate quality benchmarking will lead to false discovery and hinder biological interpretation of the data. Thus, we posit that the LPv approach would enable researchers to focus on the best probes and datasets and ultimately facilitate genomic discovery. The Gene-Set Cohesion Analysis Tool (GCAT) [20] which calculates literature derived functional cohesion p-values is readily available in GeneNetwork.org.

Conclusions
Taken together, we have demonstrated that literature derived functional cohesion provides for a robust, automated and objective metric for evaluating the quality of probes and co-expressed genes. This makes the LPv metric a viable probe quality indicator substitute in the absence of 'ground truth' or experimentally derived gold standards, which is the case for most genes.