Measures of co-expression for improved function prediction of long non-coding RNAs

Background Almost 16,000 human long non-coding RNA (lncRNA) genes have been identified in the GENCODE project. However, the function of most of them remains to be discovered. The function of lncRNAs and other novel genes can be predicted by identifying significantly enriched annotation terms in already annotated genes that are co-expressed with the lncRNAs. However, such approaches are sensitive to the methods that are used to estimate the level of co-expression. Results We have tested and compared two well-known statistical metrics (Pearson and Spearman) and two geometrical metrics (Sobolev and Fisher) for identification of the co-expressed genes, using experimental expression data across 19 normal human tissues. We have also used a benchmarking approach based on semantic similarity to evaluate how well these methods are able to predict annotation terms, using a well-annotated set of protein-coding genes. Conclusion This work shows that geometrical metrics, in particular in combination with the statistical metrics, will predict annotation terms more efficiently than traditional approaches. Tests on selected lncRNAs confirm that it is possible to predict the function of these genes given a reliable set of expression data. The software used for this investigation is freely available. Electronic supplementary material The online version of this article (10.1186/s12859-018-2546-y) contains supplementary material, which is available to authorized users.


Background
Long non-coding RNAs (lncRNAs), defined as non-protein-coding transcripts longer than 200 nucleotides, are one of the most common RNA species, but they are in most cases poorly understood with respect to function [1]. It has been shown that lncRNAs play important roles in a wide range of biological process [2,3] and diseases [4][5][6]. Possible functions of lncRNAs can be characterized experimentally using gain-and loss-of-function approaches [7,8], but this is not a straightforward method, for example because lncRNAs can be expressed in multiple isoforms. Therefore, to apply computational methods and algorithms can be a good and accessible supplement to experimental methods for suggesting possible functions of lncRNAs and other un-annotated genes.
Currently, such computational approaches are still at an early stage of development, although its importance has been recognized [9][10][11]. There are considerable challenges in finding precise and reliable computational approaches due to a lack of suitable data. There is also a lack of databases with relevant features that are suitable for example for machine learning, there is a lack of lncRNAs with known function for training, and for most lncRNAs we are not aware of any common structural features that are important for function. For example, many lncRNA gene sequences are not conserved and do not contain clear motifs [12], which makes it difficult to find and predict the function of lncRNAs by relying on their sequences. A lack of any rich set of molecular interaction data for most lncRNAs is also a limitation with respect to computational annotation [13,14].
One possible and frequently used approach is based on "guilt by association", i.e., to identify well-annotated genes that seem to be involved in some of the same processes as a given un-annotated gene. This association is often based on co-expression, indicating potential co-regulation of a set of genes [15,16]. It is then possible to predict function of the un-annotated gene by using information from the well-annotated co-expressed genes, assuming that co-expressed genes may be involved in similar or at least related processes. This approach is illustrated in Fig. 1, and it has been tested in several implementations, with some success. Alam et al. and Gong et al. list several methods and databases for annotation of ncRNAs [17,18], and most of these use co-expression at some stage, mainly with standard statistical metrics like Pearson or Spearman. For example, Guttman et al. determined several sets of mouse lncRNAs to be related to sets of mRNA for protein-coding genes by Pearson correlation [19], and co-expressed lncRNA-mRNA networks for mouse have also been used by Guo et al. (with "lnc-GFP") and Liao et al. (using a coding-non-coding (CNC) network computed with Pearson correlation) to assigned function to 340 and 1625 mouse lncRNAs, respectively [20,21]. Jiang et al. used Pearson correlation to identify co-expressed genes for human lncRNAs (with "LncRNA2Function"), and they could annotate given lncRNAs with significantly enriched gene ontology (GO) terms among the set of co-expressed protein-coding genes, according to the hypergeometric test [22]. Park et al. built a database of lncRNAs ("lncRNAtor"), where they included co-expression by Spearman correlation [23], whereas Zhao et al. developed a web-based application ("Co-LncRNA") for exploring combinatorial effects of lncRNAs, using linear regression and Spearman correlation to map co-expression between lncRNAs and protein-coding genes [24]. Perron et al. combined co-expression by Pearson correlation with evolutionary conservation in "FuncPred" [25], whereas Zhou et al. used a combination of ChIP-seq, CLIP-seq and RNA-seq (in "lncFunNet") to predict lncRNA function, using Pearson correlation to estimate co-expression of gene pairs [26]. Zhou et al. also released a toolkit ("lncFunTK") for calculation of a Functional Information Score (FIS) [27], based on their previous work with lncFunNet. Recently Zhang et al. used a hierarchy of neural networks to predict GO terms for lncRNA genes, implemented as "NeuraNetL2GO", with Pearson correlation as the measure of co-expression [28]. Thus, most proposed methods rely on identification of co-expressed genes from experimental expression data. Therefore, estimation of co-expression is a crucial step, and relevant alternative metrics should be evaluated. Here we use several metrics for estimating co-expression, in particular statistical (Pearson, Spearman) and geometrical ones (Sobolev, Fisher), and a combination of those. This has been implemented as LNCRNA2GOA, which is available to users. The aim is to provide improved identification of true co-expression. We use an enrichment analysis to identify enriched GO-terms in the co-expressed gene sets, and use this to predict GO terms for the un-annotated gene. We have benchmarked the methods for co-expression on a subset of well-annotated proteincoding genes, using semantic similarity to compare real  Fig. 1 The "guilt by association" principle for prediction of annotation terms for a novel gene. It is based on comparison of gene expression profiles between the novel gene and a set of annotated genes, ranking of the annotated genes according to the similarity of the expression profiles relative to the novel gene, and enrichment analysis of annotation terms in the most highly ranked annotated genes. The significantly enriched terms can be used as an estimate of annotation for the novel gene and predicted GO terms, and also tested the performance on a small number of well-known human lncRNAs.

Data sources
For expression data we have used data from Jiang et al. [22]. They used the information from GENCODE V15 [29] for genomic coordinates and RNA-Seq data of 19 human normal tissues from the Human Body Map 2 project (ArrayExpress accession GSE0554), and read and computed expression values using tophat [30] and cufflinks [31] for all human lncRNA and protein-coding genes. Details are given in the paper by Jiang et al. [22]. GO annotations were downloaded from the Gene Ontology Project [32], and R [33]

Main workflow
The schematic workflow is shown in Additional file 1: Figure S1, where key aspects of this study (i.e., metrics for comparison of expression profiles, and similarity measures for comparing predicted and known annotations) are highlighted. A pseudocode representation of the LNCRNA2GOA algorithm is shown in Table 1. The goal is function prediction for an lncRNA or another poorly annotated gene (denoted as g). Let be a set of statistical and geometrical methods (see below) and Target m g be all protein-coding (or well-annotated) genes that are co-expressed with g as determined by method m. Now the gene g will be functionally annotated with significantly enriched annotation terms (here GO) among the set of co-expressed well-annotated proteincoding genes. We use the hypergeometric test to compute the p-value of each term T: Herein, N is the number of all protein-coding genes, M is the number of protein-coding genes that are annotated in the functional term T, n is the size of Target m g and t is the number of genes in Target m g that are annotated in the functional term T. Because the statistical analysis is not appropriate for problems of small size, we exclude GO terms with less than five annotated protein-coding genes from the enrichment analysis, as recommended by Jiang et al. [22]. We also use false discovery rate (FDR) for correction for multiple hypothesis tests. The significance cut-off of corrected p-value is set as 0.05.

Methods to identify Target m g
To identify co-expressed genes for a typical gene g, we used the geometrical metrics Sobolev and Fisher information, in addition to the statistical metrics Pearson and Spearman.

Sobolev metric
In this section, we use definitions and notations as in [36]. We start with the usual p-inner product. Let f, g be real-valued functions (in this case f and g values are the expression vectors of two genes f and g): By this notation, Sobolev inner product, norm and meter of degree k respectively can be defined by: where D k is the kth differential operator. For the special case p = 2 and α = 1 an interesting connection to the Fourier-transform of analysis can be made; letf be the where ω k ¼ 2πk Finally, the norm can be written as In this work we used metric (5) with norm (7) and k = 1.

Fisher information metric
In this section we use definitions and notations as in [37]. To define Fisher information metric we first introduce the n-simplex P n defined by The coordinates {x i } describe the probability of observing different outcomes in a single experiment (or expression value of a gene in ith cell type). The Fisher information metric on P n can be defined by We now define a well-known representation of the Fisher information as a pull-back metric from the positive n-sphere S þ n ; S þ n ¼ x∈R n ; ∀i; x n ≥ 0; The transformation T : P n →S þ n defined by pulls back the Euclidean metric on the surface of the sphere to the Fisher information on the multinomial simplex. Actually, the geodesic distance for x, y ∈ P n under the Fisher information metric may be defined by measuring the length of the great circle on S þ n between T(x) and T(y) Size of Target m g In the previous section we introduced four methods to measure similarity between genes based on their expression values, which can be used to rank genes. Now the challenge is to determine a threshold for identifying the most informative set of correlated genes as Target m g . Some studies, for example [22] used a threshold of 0.9 on the Pearson metric, but this does not identify an optimal cutoff; sometimes it returns thousands of co-regulated genes, and sometimes nothing. It is also difficult to set a threshold in a similar way when using a geometrical metric. Therefore, we tried to select a threshold based on a possible number of Target m g elements. We selected a range of different sizes for subsets of the most similar genes from the Target m g set, including {50 * x| x = 1, 2, …, 10}, and analysed how well the algorithm could predict some well-known lncRNAs. This analysis showed that a selection of the top 250 co-expressed protein-coding genes seemed to have optimal performance for prediction of GO terms for some well-known lncRNAs.

Combination of methods
Let SigGOs m;r g ¼ ½SigGOs m g r be the r most significant terms assigned to gene g by method m. Since the optimal method m is individually different we can assign SigGOs r;s g ¼ ½∪ m∈Μ SigGOs m;r g s to the gene, and this will increase the accuracy of predictions (see Results and discussion). If there are identical terms for some methods, we just consider the term with the minimum corrected p-value. Actually, in the combination method the algorithm first collects the r most significant terms for all m ∈ Μ, then selects the s most significant terms from this collection. When r = all and s = all then the algorithm returns all significant terms.

Evaluation on protein-coding genes
Before evaluating the performance of our method on some known lncRNA genes, we benchmarked it on a set of well-annotated genes. Since protein-coding genes in general are much more well-annotated than lncRNA genes, this set was based on protein-coding genes. We selected protein-coding genes annotated by 5 or 6 molecular function (MF) terms and 9 or 10 biological process (BP) terms in the GO database. This represents the average number of GO-terms in each set, as the genes were on average annotated by 4.67 MF and 9.25 BP terms. Therefore the benchmark set has a typical (average) level of annotation. It consists of 352 protein-coding genes (denoted as Test 352 ), and is available as part of the LNCRNA2GOA distribution [38]. We used a "leave-one-out" approach for the actual benchmarking. That is, for each gene in Test 352 we treated the gene as unannotated and predicted GO terms for the gene by each of five different methods (Pearson, Spearman, Sobolev, Fisher, and Combine, which is a combination of all four methods). We then used the TopoICSim [39] and GOSemSim [40] approaches to estimate similarity between real and predicted GO annotations for each gene in Test 352 . TopoICSim and GOSemSim are two algorithms for measuring semantic similarity between pairs of genes. To have a more realistic semantic measure for benchmarking, we restricted the size of the output set of enriched terms by selecting SigGOs ðm;2ÂLÞ g where L is the number of GO terms for gene g in Test 352 based on the gene ontology database. For the Combine case, where we find an optimal combination of predictions from each of the four different methods, we evaluated the performance first for SigGOs Μ;2ÂL;2ÂL g (Combine_2L) and then For SigGOs Μ;2ÂL;4ÂL g (Combine_4L).

Evaluation on Test 352
Before applying LNCRNA2GOA to human lncRNAs, we first benchmarked the performance on the wellannotated set of protein-coding genes, Test 352 . We have applied all metrics in for each gene in Test 352 and evaluated semantic similarity between actual and predicted annotation as measured by the GOSemSim and TopoICSim methods. We also included results for LncRNA2Function [22] and Co-LncRNA [24]. The results for Co-LncRNA were based on the published computational approach, but using the LncRNA2-Function database, to keep the results comparable. The average semantic similarities for each method and metric are shown in Fig. 2. In all cases the TopoICSim measure shows better performance than GOSemSim. It has previously been shown that TopoICSim seems to be a more correct measure for semantic similarity compared to other approaches [39], therefore this most likely reflects a real similarity in predicted annotation, and not a systematic bias in measurements. In particular for TopoICSim there is a good correlation between performance in MF and BP, indicating that the improved performance is not random, but due to the choice of better methods. The results also show that the geometrical metrics predict function more efficiently than statistical metrics, and in particular the combined measure shows quite good performance. It can be argued that the standard deviation (SD) of the averages is high, indicating a variation in performance. However, the SD is lower for the Combine measure. The overall performance seems to be better for BP compared to MF. It is possible that this reflects a bias in the dataset, as each gene in Test 352 has almost twice as many BP terms compared to MF terms. In summary, Fig. 2 shows that the combined approach has good performance with respect to prediction of GO terms, in particular for terms related to BP.

Evaluation on data from FuncPred
We also tested the performance of LNCRNA2GOA on a set of 37 manually annotated lncRNAs from FuncPred [25]. Figure 3 shows the number of successful predictions by LNCRNA2GOA, FuncPred and LncRNA2Function [22]. Here a prediction was counted as successful if at least one of the correct GO terms could be predicted. The same approach has previously been used for example in testing of the NeuraNetL2GO method [28]. Again we see that the best performance was achieved with the LNCRNA2GOA approach.

Comparison to functional predictions by lncFunTK
Finally, we have done a qualitative evaluation of predictions performed with LNCRNA2GOA, compared to predictions by lncFunTK with data from HeLa cells and Human Body Map, provided as Table S2 in Zhou et al. [27]. The lncFunTK predictions are sorted according to a b Fig. 2 Evaluation of the different similarity metrics for 352 well-annotated protein-coding genes (Test 352 ). For each gene in Test 352 all methods were applied for prediction of function, and similarity between real and predicted terms were measured with TopoICSim and GOSemSim. The table shows average similarity scores over the test set, with standard deviation. LncRNA2Function: Co-expressed protein-coding genes was obtained for each gene in Test 352 using [22] with Pearson correlation coefficient > 0.9. Co-LncRNA_Pearson or _Spearman: Co-expressed protein-coding genes was obtained for each gene in Test 352 using [24] with Pearson or Spearman correlation coefficient > 0. 8

and LncRNA2Function expression data
Functional Information Score (FIS), which is a predicted functional importance of a given lncRNA, based on a combination of several data types (ChIP-seq, CLIP-seq, RNA-seq). We have focussed on the 100 most significant cases from lncFunTK according to FIS, and used LNCRNA2GOA to predict GO terms for 50 of those genes where Ensembl IDs could reliably be assigned (see Additional file 2: Table S1). The predicted GO terms were then compared to the GO terms from lncFunTK, using experimental data from literature as a common reference. We were able to retrieve PubMed entries based on gene name for only 20 of the 50 genes, which limits the comparison. However, for more than half of these genes the predictions from LNCRNA2GOA showed at least some similarity to literature data (see Additional file 3: Table S2). For lncFunTK, on the other hand, the available predictions consist of only a single GO term per gene, and most of these terms are identical. For the set of 50 genes, 35 entries are classified as GO:0045944 (positive regulation of transcription from RNA polymerase II promoter) by lncFunTK, and this strong preference for GO:0045944 is the same for the full set of predictions.
It is difficult to do a direct comparison of predictions from these two methods. Both use RNA-seq data from multiple tissue types to estimate co-expression. However, the lncFunTK approach also ranks genes using additional data (in this case, on HeLa cells), which may give some preference for properties that are particularly enriched in this cell type (for example increased transcription). LNCRNA2GOA may to a larger extent display properties that are important across the full range of activities where each lncRNA is involved. This may explain some of the differences in the output between lncFunTK and LNCRNA2GOA. However, it is our general impression that the predictions from LNCRNA2-GOA contain more information, compared to lncFunTK, and that several of these predictions seem to be consistent with observations from literature.

Functional annotation of human lncRNAs
There is a lack of good "gold standard" datasets for human lncRNAs with known function that can be used for benchmarking. However, we have used a small set of five well-known lncRNAs from literature as examples to show the efficiency of LNCRNA2GOA. These lncRNAs have previously been used for documenting the performance of in particular LncRNA2Function [22]. The summary information describing these example lncRNAs, their functions and references are shown in Table 2.
To examine whether LNCRNA2GOA is able to functionally annotate the lncRNAs that are listed in Table 2, we applied the algorithm with defaults (i.e., ontology set as Biological Process and using the Combine method with ( SigGOs Μ;all;all g ), and prediction results are presented and discussed for each lncRNA separately. The top 10 predictions are in each case are given in Table 3.
HOTAIR HOTAIR (Hox transcript antisense RNA) is an lncRNA known to be involved in development, cancer and high risk metastases, at least partly through interaction with PRC2 and regulation of HOX genes [41,42]. To investigate whether LNCRNA2GOA is able to associate functionally relevant GO terms with HOTAIR, we applied the algorithm with defaults parameters. This identified 124 GO BP terms in total. As expected, most significantly enriched terms are associated with development and morphogenesis, and cell metastasis. The results in Table 3 indicate that LNCRNA2GOA successfully identifies functionally relevant GO terms for HOTAIR.

HCP5
HCP5 (HLA complex P5) is an lncRNA associated with immune response, and it is associated with for example AIDS [43] and virus-related cancers [44]. When testing whether HCP5 can be annotated by LNCRNA2GOA, we Fig. 3 GO terms for a manually annotated set of 37 lncRNAs from FuncPred has been predicted using LNCRNA2GOA, FuncPred and LncRNA2Function, and the number of successful predictions has been counted found that HCP5 was associated with 333 GO terms for biological processes. As expected, most of them involve the immune system and immune response, and functional terms that are associated with the development of AIDS (see Table 3). A relevant example is regulation of immune response (GO:0050776).
HULC HULC (Heptacellular carcinoma up-regulated long non-coding RNA) is known to be upregulated in liver cancer and associated with tumorigenesis [45,46]. We used LNCRNA2GOA to assess whether HULC can be correctly assigned to have liver-related functions. The enrichment output showed 177 GO terms for biological processes associated with HULC. Most of these terms are involved in functions related to liver and lipids, such as lipid hydroxylation (GO:0002933) and chylomicron remodelling (GO:0034371). Table 3 shows the top 10 GO functional terms enriched in protein-coding genes that are co-expressed with the liver-related lncRNA HULC.

H19
H19 (Imprinted maternally expressed transcript) is known to be important for fertility and several processes associated with female disease risk, including cancer [47,48]. LNCRNA2GOA identified 53 GO terms for biological processes as associated with H19. Several of these terms suggested strongly that H19 can play an important role in infertility or breast cancer, such as female pregnancy (GO:0007565), female gamete generation (GO:0007292), and adrenal gland development (GO:0030325). There were also other relevant GO terms, as for example, JAK-STAT cascade involved in growth hormone signalling pathway (GO:0007565), cell-cell signalling (GO:0007267), and cell proliferation (GO:0008283). This indicates that H19 can play a role in various cancers and other conditions where JAK-STAT signalling is important. The top 10 GO terms for biological processes are shown in Table 3.

PCA3
PCA3 (Prostate cancer associated 3) is strongly upregulated in prostate cancer [49,50]. LNCRNA2GOA identified 25 terms for GO biological processes as associated with PCA3. There were three terms directly involved in prostate cancer; urinary bladder development (GO:0060157), prostate gland development (GO:0030850), and prostate epithelial cord arborization involved in prostate glandular acinus morphogenesis (GO:0060527). PCA3 is a quite challenging case, but this prediction seems to be an improvement over a previous prediction by LncRNA2Function, where only a single pathway linked to androgen receptor was identified, and in some aspects also predictions by the more recent FARNA tool [17].

Additional tests
We also tested LNCRNA2GOA on a few cases not included in the original LncRNA2Function test set, and got similar results. The two most significant GO BP terms for MALAT1 (Metastasis associated lung adenocarcinoma transcript 1) were ion transport (GO:0006811) and excretion (GO:0007588), which seems to be consistent with the observation that MALAT1 often is associated with kidney function and with renal cell carcinoma [51]. For LINC00152, also known as CYTOR (Cytoskeleton regulator RNA), the two most significant terms were cell adhesion (GO:0007155) and extracellular matrix organisation (GO:0030198), which is consistent with the observation that this lncRNA influences the properties of breast cancer cells with respect to for example invasion and migration [52]. For LINC-ROR (Long intergenic non-protein coding RNA, regulator of reprogramming) the two most significant terms were chromatin silencing at rDNA  [49,50] (GO:0000183) and nucleosome assembly (GO:0006334), which may be consistent with observations that LINC-ROR is involved in regulation of differentiation of embryonic stem cells [53].
We also tested the performance on a more general data set, using a set of prostate cancer-associated lncRNAs with unknown molecular mechanism, published by Mitobe et al. [54]. Most of these lncRNAs are upregulated in prostate cancer, and it has been shown that RNA interference (RNAi) towards these lncRANs leads to a reduction in proliferation, making it reasonable to assume that their upregulation in prostate cancer contributes to increased proliferation in cancer tissue. However, function prediction on this gene set illustrates one of the main challenges. The lack of proper benchmark data, and the fact that the molecular mechanism for these lncRNAs is unknown, makes it difficult to assess the quality of the predictions. There are some cases where recent results seem to support some of the predictions. For example, for SNHG1 (Small nucleolar RNA host gene 1) a highly significant prediction is for positive regulation of histone H3-K27 methylation (GO: 0061087), which is consistent with recent results from Yu et al. [55] showing that SNHG1 may be involved in epigenetic silencing of the tumour suppressor CDKN1A. However, in most cases the predictions need further verification. The prediction output is therefore included here as supplementary material for future assessment (see Additional file 4: Table S3).

Discussion
In recent years, thousands of lncRNAs have been discovered that probably play important roles in many different biological processes and diseases, but unfortunately the vast majority of them still need to be functionally annotated. In this paper, we present an improved approach for estimating co-expression for computational function prediction. We compare several measures for estimating co-expression, covering both statistical and geometrical ones, and this gives improved identification of true co-expression. We use an enrichment analysis to identify enriched GO terms in the co-expressed gene set, and use this to predict GO terms for un-annotated genes. This can be any un-annotated gene, but here in particular human lncRNAs.
We have benchmarked the co-expression for enrichment analysis on a subset of well-annotated protein-coding genes. For each gene the GO terms were predicted (without using the known terms for the gene), and the fit between predicted and known GO terms was measured using semantic similarity measures. This showed good correlation between predicted and known GO terms, in particular for terms related to biological process (BP) and when using a combined similarity measure on gene expression. These score values are clearly better than the score values achieved using Pearson or Spearman, which previously has been the most common approach. The procedure was then tested on lncRNAs, in particular using a set of five well-described lncRNAs tested in previous publications [22]. The predicted GO-terms showed good correspondence with published functional descriptions of these lncRNAs. This shows that it is possible to predict the function of both protein-coding and ncRNA genes, given a reliable set of expression data.
There is still room for improvement. Although the prediction is successful in many cases (indicated by the high average similarity score in the benchmarking), the high SD indicates that there are specific cases where the prediction is less successful. It would be very useful if we were able to identify and focus on cases where prediction is most likely to be successful. The performance may also be sensitive to the quality, variation and annotation of the reference data. The approach used for enrichment analysis will also influence the result.

Conclusion
The results presented here show that approaches for computational gene annotation based on co-expressed genes can provide useful annotations, in particular when using improved estimates of co-expression based on a combination of geometrical and statistical metrics.

Additional files
Additional file 1: Figure S1. A flowchart illustrating the approach used for prediction and benchmarking in LNCRNA2GOA. (PDF 342 kb) Additional file 2: Table S1. Predicted annotation for a set of lncRNAs from HeLa, previously analysed with lncFunTK by Zhou et al. [27] and listed in their Table S2. (TXT 76 kb) Additional file 3: Table S2. Predicted annotations from Additional file 3: Table S1 for lncRNAs found in PubMed, illustrated with selected PubMed references. (TXT 435 kb)  Table S3. Predicted annotation for a set of lncRNAs with unknown mechanism as described by Mitobe et al. [54] in their Table 1 and discussed in the main paper. (PDF 475 kb) Abbreviations (l)ncRNA: (long) non-coding RNA; BP: Biological process; GO: Gene ontology; MF: Molecular function; SD: Standard deviation