- Open Access
mirCoX: a database of miRNA-mRNA expression correlations derived from RNA-seq meta-analysis
BMC Bioinformaticsvolume 14, Article number: S17 (2013)
Experimentally validated co-expression correlations between miRNAs and genes are a valuable resource to corroborate observations about miRNA/mRNA changes after experimental perturbations, as well as compare miRNA target predictions with empirical observations. For example, when a given miRNA is transcribed, true targets of that miRNA should tend to have lower expression levels relative to when the miRNA is not expressed.
We processed publicly available human RNA-seq experiments obtained from NCBI's Sequence Read Archive (SRA) to identify miRNA-mRNA co-expression trends and summarized them in terms of their Pearson's Correlation Coefficient (PCC) and significance.
We found that sequence-derived parameters from TargetScan and miRanda were predictive of co-expression, and that TargetScan- and miRanda-derived gene-miRNA pairs tend to have anti-correlated expression patterns in RNA-seq data compared to controls. We provide this data for download and as a web application available at http://wrenlab.org/mirCoX/.
This database of empirically established miRNA-mRNA transcriptional correlations will help to corroborate experimental observations and could be used to help refine and validate miRNA target predictions.
MicroRNAs (miRNAs) are small (19-22 nt) non-coding RNAs that can interfere with mRNA translation by base-pairing with mRNAs to form double-stranded RNA or by promoting the loss of polyadenylation, leading to inhibition of translation or degradation of the mRNA by the cellular machinery [1, 2]. Less often, dsRNA formed by miRNA-target complexes can target gene promoters and actually enhance transcription of target genes, sometimes termed RNAa (RNA activation) . Through these mechanisms, a single miRNA can potentially alter the expression levels of hundreds, even thousands, of mRNA transcripts  and long non-coding RNAs  and therefore exert considerable regulatory control over cellular processes. The diverse regulatory behavior of microRNAs also has been found to play an important role in a wide variety of pathologies, including cancer and cardiovascular disease [6, 7].
Predicting miRNA target genes has been a topic of active research , and works by calculating sequence similarity metrics between miRNAs and their putative target sequences, often in the 3' UTR of genes [9–14], although other features have predictive value . However, because miRNA-target base pairing is rarely characterized by perfect complementarity, even in the "seed" region, sequence-based miRNA target prediction algorithms are prone to many false positives and have poor inter-algorithm agreement [16, 17]. Experimental validation of miRNA-target interactions can be conducted, but is prohibitively costly and time-consuming to do for large-scale assessments of miRNA target prediction efficacy. Because of the biological effects of miRNAs on their targets, expression data is something that can potentially assist sequence-based methods. Therefore, several methods have been developed to prioritize predicted miRNA-target interactions in silico.
Several groups [18–20] have used expression data from paired miRNA-mRNA microarrays to generate correlations between miRNAs and their putative targets, with the idea that miRNA-mRNA pairs with a negative correlation are more likely to be legitimate interactions . Similarly, Gennarino et al developed HOCTAR, a method that re-ranks sequence-based predictions for intragenic miRNAs by using expression of a host gene as a proxy for expression of the miRNA, thus avoiding the necessity of using specialized miRNA arrays but at the cost of being restricted to intragenic miRNAs [21, 22].
Although previous expression-based miRNA-target interaction prioritization approaches have been successful, they have several drawbacks. Typically, the miRNA and mRNA expression profiles are determined by different types of arrays, leading to the possibility of technology-specific artifacts or batch effects. Although microRNAs can target other non-coding RNAs and methods are being actively developed to discover such interactions [5, 23, 24], ncRNAs are poorly represented on most array platforms. Furthermore, microarrays are not as quantitative and require probes for each transcript of interest, as compared with next-generation sequencing technology, which is probe agnostic and can provide information about all expressed transcripts, including isoforms.
To overcome these drawbacks, we generated a database of expression correlations between microRNAs and mRNAs by using total RNA sequencing (RNA-seq) experiments from NCBI's Sequence Read Archive (SRA). We then integrated sequence-based miRNA target predictions from miRanda and TargetScan databases together with RNA-seq derived expression correlations to prioritize these predicted miRNA-target pairs by co-expression and created a publicly-available web server for users to query these expression correlations. We anticipate this will be potentially useful for two different types of users. First, biologists would be interested from the standpoint of interpreting correlations in their own experiments. For example, if they knocked out Gene X and observed miRNA Y was highly expressed in that experiment, a natural question to ask is whether or not X and Y are normally anti-correlated. If so, it lends itself to the hypothesis that Gene X might somehow repress miRNA Y, either directly or indirectly. Second, bioinformaticists would potentially be interested in downloading the data to either help train their sequence-based miRNA target prediction algorithms or to use the correlations as corroborating evidence of effect. Finally, we would like to note that, although the empirical correlations we are reporting here may be suggestive of effect, this resource itself is not intended to predict miRNA-mRNA target pairs.
Selection and pre-processing of RNA-seq experiments
RNA co-expression data was obtained from processing RNA-seq datasets available for download in NCBI's Sequence Read Archive (SRA). The Bioconductor package SRAdb was installed  and its companion database was downloaded, current as of January 2013, to obtain experiment information from SRA. The database was queried to find RNA-seq runs which met the following criteria: 1) Taxon ID 9606 (human), 2) "RANDOM" library selection, to ensure that certain varieties of transcripts were not artificially enriched or depleted by the selection method, as for example via poly(A) tail selection; 3) "TRANSCRIPTOMIC" library sources and "RNA-Seq" library strategy to eliminate specialized procedures such as cap analysis of gene expression (CAGE), and 4) paired-end reads. The selected run accessions were then downloaded from the SRA using the Aspera software and converted to FASTQ using the "fastq-dump" program from NCBI's SRA Toolkit .
Reads were trimmed first for Illumina adapters, then by quality, using the "fastq-mcf" program from the ea-utils package. The Bowtie2 aligner  was used to map each set of reads in FASTQ format to the GRCh37/hg19 reference genome using the "--sensitive" parameter. Samtools  was used to convert Bowtie2's SAM output to sorted BAM. Using the Kent source utilities , these mappings were then converted to BigWig, and transcript coverage was quantified using the "bigWigAverageOverBed" program, where transcript coordinates were obtained from the UCSC knownGenes (for genes and ncRNAs) and from mirBase (for microRNAs). To obtain runs with adequate sequencing depth and quality, runs which had fewer than 10 million mapped reads, fewer than 60% mapped reads, or which failed two or more FASTQC tests were discarded from further analysis. Overall, two expression matrices (of knownGene IDs and mirBase IDs) containing raw mapped counts were constructed using 141 runs (see Additional File 1 for a list of runs used).
Transcript-miRNA correlations and miRNA-target information
Both expression matrices (of knownGene IDs and mirBase IDs) were normalized using the implementation of quantile normalization provided by the limma package . As shown in Figure 1, both genes and miRs showed a typical log-normal distribution of average mapped read depth, although it is likely that many of the miRs detected were in their precursor (pre-miRNA) forms. Pearson correlation coefficients and quantile expression rank between transcripts (knownGene IDs) and miRNAs were obtained using these two matrices and the Numpy suite of programs. Transcripts or miRNAs which were not detected in any experiment were not assigned correlations and were removed from downstream analyses. In total, 58282 of the UCSC knownGenes and 597 miRNAs were detected in at least one experiment.
Experimentally validated target predictions for each miRNA were obtained from miRecord , and computational predictions were obtained from TargetScan version 6.2  and miRanda (August 2010 release) [9, 11]. miRanda divides its predicted targets into conserved and nonconserved targets, and we analyzed these two groups separately. For miRecord, only 215 of 1582 miR-target pairs were mappable to miR-knownGene pairs. The reason for this relatively low mapping rate is that frequently the RefSeq IDs for genes in miRecords did not correspond to a RefSeq ID for a knownGene transcript.
Web application implementation
The above data were assembled into a MySQL database, and a PHP front-end was implemented to serve user searches on particular transcripts, genes, or miRNAs. The application is hosted on a LAMP server. This web interface is accessible by Internet Explorer version 10 or greater and modern versions of Firefox and Chrome.
To test the hypothesis that RNA-seq derived miRNA-mRNA transcriptional correlations could be used as a different data type to corroborate sequence-based methods, we examined the transcriptional correlations between predicted or experimentally validated miR-target pairs. Because the general mode of action for microRNAs is to repress their targets, valid miR-target pairs should have lower Pearson correlations than randomly selected miR-target pairs. Table 1 shows that this is the case for experimentally validated miR-target pairs (from miRecord) as well as for computationally predicted pairs (TargetScan, miRanda). The effect is most pronounced in the experimentally validated pairs, suggesting that experimentally validated pairs have the highest quality.
The TargetScan and miRanda software provide several metrics that are used to predict the likelihood of a putative miRNA-mRNA interaction based on sequence and genomic location. To determine whether these scoring metrics were predictive of miRNA-mRNA pair expression correlations in our data, we determined the correlation between each parameter and the co-expression value for the corresponding miR-mRNA pair. For miRanda conserved and non-conserved targets, all parameters (alignment score, conservation score, free energy, and mirSVR score) were able to significantly predict co-expression values (Tables 2a, 2b). For TargetScan, all metrics except the number of non-conserved 8mer sites showed significant ability to predict co-expression values (Table 2c).
Next, we assessed the degree of agreement between mRNA-miRNA co-expression values and TargetScan predictions. Because miRNAs often repress expression of their targets, we hypothesized that gene-miR pairs with negative expression correlations should be enriched for target interactions predicted by miRanda and TargetScan. As expected, Figures 2a and 2b show a clear over-representation of miRanda conserved and non-conserved predictions among pairs with negative expression correlations. Figure 2c shows a more complex bimodal pattern, wherein TargetScan miR-target pairs tend to have moderately negative correlations, and to a lesser extent moderately positive correlations, but tend not to have extreme positive or negative correlations, or to be uncorrelated.
We created a web application (Figure 3) and tab-delimited source data files available for download. Gene/ncRNA or miRNA accessions can be entered to obtain a ranked list of positive and negative correlations with other transcripts. If a given gene-miR pair is predicted to interact by TargetScan or miRanda, the corresponding parameters, such as context score and numbers of conserved and non-conserved binding sites of each type, are also displayed alongside the expression values. The results of any query can be exported to CSV format, and bulk download of all the data used in the web server is also available at the same URL.
RNA-sequencing data contains information about expression patterns for the entire transcriptome at the time of measurement. Consequently, this data is an excellent means of exploring expression correlations among non-coding transcripts, microRNAs, and genes. We provide a method and interface to explore detected miRNA-mRNA correlations in light of the commonly accepted hypothesis that miRNA-mRNA pairing affects expression and/or translation of the mRNA, often as a means of repression but also as a means of activating transcription. Positive or negative correlations alone, of course, do not prove causation, as there can be a number of different factors involved in mRNA degradation. But strong correlations, particularly in the absence of other strong miRNA correlations with the same mRNA, can be considered strongly suggestive of an influence. And the more expression samples gathered for analysis, the more statistical confidence in trends that can be gained, and rare transcripts will become less of a problem.
The correlations we detected were statistically significant, yet much smaller in magnitude than we had anticipated prior to the study. There are a number of potential reasons this might be. First, since it has been observed that miRNAs, to achieve effective translational repression, frequently act in combination ., the transcription of one miRNA simply does not correlate strongly enough. This is analogous to the situation whereby multiple transcription factors frequently act in combination in order to achieve transcriptional activation . Second, it is possible that the repressive effect is more pronounced on the translational level. If so, this would suggest transcriptional approaches are not well-suited to address this problem. Third, it's possible that there is another layer of regulation not accounted for by simple correlations, for example the competing endogenous RNA hypothesis, whereby some transcripts may exist to "soak up" multiple miRNAs of the same type and keep them from repressing their targets . Finally, although probably least likely, if the bulk of transcripts being detected are pre-miRNAs that have not yet been processed (which we would expect to correlate with mature miRNA levels), then it's possible that there's another regulatory layer that keeps them from being processed until some further signal is given.
In summary, RNA-seq experiments provide us with a unique overview of the whole transcriptome in a single experiment, which can be used to detect transcript-transcript correlations and potentially detect whether or not one type of transcript with known regulatory effects is influencing the other. This work provides a resource to examine predicted correlations among two classes of RNA that are known to interact, with miRNAs able to degrade mRNA expression. It has the potential to help corroborate sequence-based predictions of miRNA-mRNA interactions, estimate the efficiency of using a miRNA with the specific intention of degrading a target mRNA, and for comparison of general co-expression trends to specific experimental observations.
mirCoX, a web application described in this article, is available at http://wrenlab.org/mirCoX
Aravin A, Tuschl T: Identification and characterization of small RNAs involved in RNA silencing. Febs Lett. 2005, 579: 5830-40. 10.1016/j.febslet.2005.08.009.
Giraldez AJ, Mishima Y, Rihel J, Grocock RJ, Van Dongen S, Inoue K, Enright AJ, Schier AF: Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science. 2006, 312: 75-9. 10.1126/science.1122689.
Huang V, Qin Y, Wang J, Wang X, Place RF, Lin G, Lue TF, Li L-C: RNAa is conserved in mammalian cells. Plos One. 2010, 5: e8848-10.1371/journal.pone.0008848.
Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM: Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005, 433: 769-73. 10.1038/nature03315.
Juan L, Wang G, Radovich M, Schneider BP, Clare SE, Wang Y, Liu Y: Potential roles of microRNAs in regulating long intergenic noncoding RNAs. Bmc Med Genomics. 2013, 6 (Suppl 1): S7-10.1186/1755-8794-6-S1-S7.
Majid S, Dar Aa, Saini S, Yamamura S, Hirata H, Tanaka Y, Deng G, Dahiya R: MicroRNA-205-directed transcriptional activation of tumor suppressor genes in prostate cancer. Cancer. 2010, 116: 5637-49. 10.1002/cncr.25488.
Soifer HS, Rossi JJ, Saetrom Pl: MicroRNAs in disease and potential therapeutic applications. Mol Ther J Am Soc Gene Ther. 2007, 15: 2070-9. 10.1038/sj.mt.6300311.
Witkos TM, Koscianska E, Krzyzosiak WJ: Practical Aspects of microRNA Target Prediction. Curr Mol Med. 2011, 11: 93-109. 10.2174/156652411794859250.
Betel D, Koppal A, Agius P, Sander C, Leslie C: Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol. 2010, 11: R90-10.1186/gb-2010-11-8-r90.
Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS: MicroRNA targets in Drosophila. Genome Biol. 2003, 5: R1-10.1186/gb-2003-5-1-r1.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS: Human MicroRNA targets. Plos Biol. 2004, 2: e363-10.1371/journal.pbio.0020363.
Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120: 15-20. 10.1016/j.cell.2004.12.035.
Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, Rajewsky N: Combinatorial microRNA target predictions. Nat Genet. 2005, 37: 495-500. 10.1038/ng1536.
Chandra V, Girijadevi R, Nair AS, Pillai SS, Pillai RM: MTar: a computational microRNA target prediction architecture for human transcriptome. BMC Bioinformatics. 2010, 11 (Suppl 1): S2-10.1186/1471-2105-11-S1-S2.
Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27: 91-105. 10.1016/j.molcel.2007.06.017.
Didiano D, Hobert O: Perfect seed pairing is not a generally reliable predictor for miRNA-target interactions. Nat Struct Mol Biol. 2006, 13: 849-51. 10.1038/nsmb1138.
Saito T, Saetrom P: MicroRNAs-targeting and target prediction. New Biotechnol. 2010, 27: 243-9. 10.1016/j.nbt.2010.02.016.
Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, Hughes TR, Blencowe BJ, Frey BJ, Morris QD: Using expression profiling data to identify human microRNA targets. 2007, 4: 1045-1049.
Hausser J, Berninger P, Rodak C, Jantscher Y, Wirth S, Zavolan M: MirZ: an integrated microRNA expression atlas and target prediction resource. Nucleic Acids Res. 2009, 37: W266-72. 10.1093/nar/gkp412.
Su W-L, Kleinhanz RR, Schadt EE: Characterizing the role of miRNAs within gene regulatory networks using integrative genomics techniques. Mol Syst Biol. 2011, 7: 490-
Gennarino VA, Sardiello M, Mutarelli M, Dharmalingam G, Maselli V, Lago G, Banfi S: HOCTAR database: a unique resource for microRNA target prediction. Gene. 2011, 480: 51-8. 10.1016/j.gene.2011.03.005.
Gennarino VA, Sardiello M, Avellino R, Meola N, Maselli V, Anand S, Cutillo L, Ballabio A, Banfi S: MicroRNA target prediction by expression analysis of host genes. Genome Res. 2009, 19: 481-90.
Jeggari A, Marks DS, Larsson E: miRcode: a map of putative microRNA target sites in the long non-coding transcriptome. Bioinforma Oxf Engl. 2012, 28: 2062-3. 10.1093/bioinformatics/bts344.
Paraskevopoulou MD, Georgakilas G, Kostoulas N, Reczko M, Maragkakis M, Dalamagas TM, Hatzigeorgiou AG: DIANA-LncBase: experimentally verified and computationally predicted microRNA targets on long non-coding RNAs. Nucleic Acids Res. 2013, 41: D239-45. 10.1093/nar/gks1246.
Zhu Y, Stephens RM, Meltzer PS, Davis SR: SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinformatics. 2013, 14: 19-10.1186/1471-2105-14-19.
NCBI SRA Toolkit. [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=std]
Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-9. 10.1038/nmeth.1923.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009, 25: 2078-9. 10.1093/bioinformatics/btp352.
Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D: BigWig and BigBed: enabling browsing of large distributed datasets. Bioinforma Oxf Engl. 2010, 26: 2204-2207. 10.1093/bioinformatics/btq351.
Smyth G: Limma: linear models for microarray data. Bioinforma Comput Biol Solutions Using R Bioconductor. 2005, New York: Springer, 397-420.
Xiao F, Zuo Z, Cai G, Kang S, Gao X, Li T: miRecords: an integrated resource for microRNA-target interactions. Nucleic Acids Res. 2009, 37: D105-10. 10.1093/nar/gkn851.
Valencia-Sanchez MA, Liu J, Hannon GJ, Parker R: Control of translation and mRNA degradation by miRNAs and siRNAs. Genes Dev. 2006, 20: 515-24. 10.1101/gad.1399806.
Yeang CH, Jaakkola T: Modeling the combinatorial functions of multiple transcription factors. J Comput Biol. 2006, 13: 463-80. 10.1089/cmb.2006.13.463.
Salmena L, Poliseno L, Tay Y, Kats L, Pandolfi PP: A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language?. Cell. 2011, 146: 353-8. 10.1016/j.cell.2011.07.014.
Publication of this work has been supported in part by the National Institutes of Health (NIH, grants # 1P20GM103636 and 8P20GM103456 to JDW) and by the Indian Council of Medical Research's Viral Disease Network Program (VIR/8/2011-ECD-1 to RGD).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14.
The authors declare that they have no competing interests.
CG generated the database of correlations and targets and performed the experiments, RG implemented the web interface, CG, RG, MD and JW participated in study design and helped write the manuscript.