NoRCE: non-coding RNA sets cis enrichment tool

Background: While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association.

Results: We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and cancer-specific expression data for coding RNAs and ncRNAs. Additionally, users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast.

Conclusions: NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool lets users design their own analysis pipeline and run their preferred set of analyses. NoRCE is available in Bioconductor and at https://github.com/guldenolgun/NoRCE.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04112-9.
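The idea of cis enrichment described in the abstract, collecting the coding genes that lie near each input ncRNA, can be illustrated with a minimal Python sketch. This is not NoRCE's R implementation: the `Gene` tuple, the `neighbouring_genes` helper, the 10 kb window, and the toy coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def neighbouring_genes(ncrnas, coding_genes, window=10_000):
    """Return sorted names of coding genes that overlap the +/- window
    (in base pairs) around any input ncRNA. The window size is an assumption."""
    hits = set()
    for nc in ncrnas:
        lo, hi = nc.start - window, nc.end + window
        for g in coding_genes:
            # Same chromosome and interval overlap with the extended region.
            if g.chrom == nc.chrom and g.start <= hi and g.end >= lo:
                hits.add(g.name)
    return sorted(hits)


# Toy coordinates, invented for the example.
ncrnas = [Gene("lnc1", "chr1", 50_000, 52_000)]
coding = [
    Gene("geneA", "chr1", 55_000, 60_000),  # 3 kb downstream: inside the window
    Gene("geneB", "chr1", 90_000, 95_000),  # 38 kb downstream: outside the window
    Gene("geneC", "chr2", 50_500, 51_000),  # different chromosome
]
neighbours = neighbouring_genes(ncrnas, coding)
print(neighbours)
```

The resulting gene list is what a subsequent enrichment test would be run on; a production implementation would typically use an interval index rather than the nested loop shown here.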

• breastmRNA : 667 differentially expressed mRNAs in human breast cancer. The analysis was conducted on patient expression data obtained from the TCGA project, collected on July 15th, 2017. Differentially expressed mRNAs were selected at p-value < 0.05 and FDR < 0.05 using the Bioconductor limma package [9].
• mirna & mrna : Subsets of brain cancer expression levels for mRNA and miRNA obtained from TCGA. The data contain 527 matched tumor patients for 150 mRNAs and 183 miRNAs. These datasets are subsets of the pre-processed miRNA and mRNA expression data used in Case Study 2 and are intended for use in examples.
• ncRegion : The relevant regions of the differentially expressed human ncRNA genes (except pseudogenes) for three psychiatric disorders. The diseases include autism spectrum disorder (ASD), schizophrenia (SCZ), and bipolar disorder (BD), and the data were generated by Gandal et al. [3]. The final set contains 930 gene regions.
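The double threshold used for the breastmRNA set (p-value < 0.05 and FDR < 0.05) can be sketched as follows. The sketch assumes the FDR is obtained by Benjamini-Hochberg adjustment, which the text does not state explicitly. It covers only the thresholding step, not limma's moderated statistics, and the gene names and p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
names = ["g1", "g2", "g3", "g4", "g5"]
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(pvals)

# Keep genes that pass both thresholds.
selected = [n for n, p, q in zip(names, pvals, fdr) if p < 0.05 and q < 0.05]
print(selected)
```

In this toy input, g3 and g4 have raw p-values below 0.05 but adjusted values above it, so only g1 and g2 are kept. This is why the FDR criterion is the stricter of the two.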

Functional enrichment results
Figure S1: The top 7 enriched GO terms and the ncRNA genes associated with them, depicted as a network. Node size represents the degree (the number of edges incident to the vertex), and each color represents a different cluster in the network.

Table S4: Top 10 enrichment results for brain-related biological process GO terms, obtained from the neighbouring coding genes of the ncRNA set from [3]. GeneRatio is the number of neighbouring coding genes that overlap the functional gene set divided by the number of all protein-coding genes in the neighbourhood of the input set. The BGRatio column represents the ratio of the number of genes found in the enriched GO term set to the size of the background gene set. EGNo refers to the size of the overlap between the corresponding GO term gene set and the neighbouring coding gene set. The ncGeneList column contains the ncRNA genes that are enriched for the corresponding GO term.

Figure S3: Interactions between the top 7 enriched pathways and the genes, depicted as a network. Node size represents the degree (the number of edges incident to the vertex), and each color represents a different cluster in the network.

Figure S5: Network of all enriched processes for CLC. Node size represents the degree (the number of edges incident to the vertex), and each color represents a different cluster in the network.
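The GeneRatio, BGRatio, and EGNo columns defined in the Table S4 caption can be made concrete with a small sketch. The gene sets below are invented. The p-value is computed with a one-sided hypergeometric test, which is an assumption, since the captions do not say which test produced the reported results.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n successes, N draws)."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


# Invented toy sets: 20 background genes, a 5-gene GO term, 4 neighbouring genes.
background = {f"g{i}" for i in range(20)}
go_term = {"g0", "g1", "g2", "g3", "g4"}
neighbourhood = {"g0", "g1", "g10", "g11"}

eg_no = len(go_term & neighbourhood)          # EGNo: size of the overlap
gene_ratio = eg_no / len(neighbourhood)       # GeneRatio: overlap / neighbourhood size
bg_ratio = len(go_term) / len(background)     # BGRatio: term size / background size
p_value = hypergeom_sf(eg_no, len(background), len(go_term), len(neighbourhood))

print(eg_no, gene_ratio, bg_ratio, round(p_value, 4))
```

Comparing GeneRatio with BGRatio gives a quick sense of the effect size: here half of the neighbouring genes belong to the term, against a quarter of the background.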

Case Study 4: Functional enrichment analysis of pan-cancer driver lncRNAs filtered with TAD boundaries
Table S11: All biological process GO term enrichment results when TAD filtering is used on the CLC data. The TAD information is obtained from the 3D Genome Browser [11]. GeneRatio is the number of neighbouring coding genes that overlap the functional gene set divided by the number of all protein-coding genes in the neighbourhood of the input set. The BGRatio column represents the ratio of the number of genes found in the enriched GO term set to the size of the background gene set. EGNo refers to the size of the overlap between the corresponding GO term gene set and the neighbouring coding gene set. The ncGeneList column contains the ncRNA genes that are enriched for the corresponding GO term.
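The TAD filtering step named in this case study can be sketched as keeping only those neighbouring coding genes that fall in the same TAD as the ncRNA. The sketch below is illustrative and is not NoRCE's implementation. The helper names, the toy coordinates, and the rule that a feature must lie entirely inside a TAD to be assigned to it are all assumptions.

```python
def tad_of(chrom, start, end, tads):
    """Return the index of the TAD that fully contains the interval, else None."""
    for idx, (t_chrom, t_start, t_end) in enumerate(tads):
        if chrom == t_chrom and t_start <= start and end <= t_end:
            return idx
    return None


def filter_by_tad(ncrna, candidate_genes, tads):
    """Keep candidate genes that share a TAD with the ncRNA."""
    nc_tad = tad_of(*ncrna[1:], tads)
    if nc_tad is None:
        return []
    return [g[0] for g in candidate_genes if tad_of(*g[1:], tads) == nc_tad]


# Toy data: (name, chromosome, start, end) and (chromosome, start, end) for TADs.
tads = [("chr1", 0, 100_000), ("chr1", 100_000, 250_000)]
lnc = ("lnc1", "chr1", 80_000, 82_000)
candidates = [
    ("geneA", "chr1", 90_000, 95_000),    # same TAD as lnc1
    ("geneB", "chr1", 105_000, 110_000),  # nearby, but across the TAD boundary
]
kept = filter_by_tad(lnc, candidates, tads)
print(kept)
```

In the example, geneB is close to the lncRNA in linear distance but is dropped because it sits in the adjacent domain, which is the effect TAD filtering is meant to have on the neighbourhood gene set.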