- Open Access
iMir: An integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq
BMC Bioinformatics volume 14, Article number: 362 (2013)
Qualitative and quantitative analysis of small non-coding RNAs by next generation sequencing (smallRNA-Seq) represents a novel technology increasingly used to investigate with high sensitivity and specificity RNA population comprising microRNAs and other regulatory small transcripts. Analysis of smallRNA-Seq data to gather biologically relevant information, i.e. detection and differential expression analysis of known and novel non-coding RNAs, target prediction, etc., requires implementation of multiple statistical and bioinformatics tools from different sources, each focusing on a specific step of the analysis pipeline. As a consequence, the analytical workflow is slowed down by the need for continuous interventions by the operator, a critical factor when large numbers of datasets need to be analyzed at once.
We designed a novel modular pipeline (iMir) for comprehensive analysis of smallRNA-Seq data, comprising specific tools for adapter trimming, quality filtering, differential expression analysis, biological target prediction and other useful options by integrating multiple open source modules and resources in an automated workflow. As statistics is crucial in deep-sequencing data analysis, we devised and integrated in iMir tools based on different statistical approaches to allow the operator to analyze data rigorously. The pipeline created here proved to be efficient and time-saving than currently available methods and, in addition, flexible enough to allow the user to select the preferred combination of analytical steps. We present here the results obtained by applying this pipeline to analyze simultaneously 6 smallRNA-Seq datasets from either exponentially growing or growth-arrested human breast cancer MCF-7 cells, that led to the rapid and accurate identification, quantitation and differential expression analysis of ~450 miRNAs, including several novel miRNAs and isomiRs, as well as identification of the putative mRNA targets of differentially expressed miRNAs. In addition, iMir allowed also the identification of ~70 piRNAs (piwi-interacting RNAs), some of which differentially expressed in proliferating vs growth arrested cells.
The integrated data analysis pipeline described here is based on a reliable, flexible and fully automated workflow, useful to rapidly and efficiently analyze high-throughput smallRNA-Seq data, such as those produced by the most recent high-performance next generation sequencers. iMir is available at http://www.labmedmolge.unisa.it/inglese/research/imir.
Small RNA analysis by massively parallel sequencing (smallRNA-Seq) represents an increasingly popular method to address different questions concerning the biological role of miRNAs and other regulatory small transcripts, such as piwi-interacting (piRNAs), small inhibitory (siRNAs), transcription initiation (tiRNAs), transfer (tRNAs) and other small non-coding (sncRNAs) RNAs, including also extra-cellular small RNAs (exRNAs). Among sncRNAs, miRNAs and piRNAs are emerging as key regulators in multiple cellular functions and for this reason are widely studied by direct sequencing. miRNAs, the best know and studied class of sncRNAs, are interesting to investigate due to their ability to control gene expression in eukaryotes by fine tuning mRNA translation [1-3]. They represent a class of short (~ 22 nucleotides) RNA molecules that play pivotal roles in a variety of molecular processes, such as immune response , differentiation , development [6-8], infection [9, 10] and carcinogenesis [11-13]. miRNA genes are synthesized as long precursor RNA molecules (pri-miRNAs), usually by RNA polymerase II , that are rapidly processed in the nucleus by Drosha RNase III to release approximately 70 nucleotides long miRNA precursor stem loop (pre-miRNA)  that in turn are exported to the cytoplasm by Exportin 5 . In the cytoplasm, mature miRNAs are produced through the action of Dicer RNase . These small RNAs regulate gene expression by binding to targets sites generally in the 3′ untraslated region (3′ UTR) of target mRNAs, resulting in mRNA degradation or translation inhibition [1, 18]. miRNAs recognition of the 3′ UTR of their target mRNA is mediated by hybridization between nucleotides 2-8 at 5′ end of the small RNA (seed sequence) and the complementary sequences present in the 3′ UTR of the mRNA [1, 19, 20]. On the other hand, small non-coding RNAs that interact with Piwi proteins, called piRNAs, are emerging as regulatory transcripts able to control a broad range of biological processes. The main roles of these molecules has been investigated mainly in germline stem cells, where they are involved in: (i) regulation of transposone activity; (ii) modulation of genome epigenetic state, (iii) development and (iv) spermatogenesis . However piRNAs have been also identified in somatic cells, including human cancer cells , suggesting their possible involvement in tumors. This aspect highlights the need for sensitive and efficient bioinformatics tools to study these novel class of sncRNAs in smallRNA-Seq datasets. SmallRNA-Seq allows detection of RNAs with a high dynamic range and reliably measures small differences in RNA concentration between samples, enabling also to discover novel RNA molecules not annotated in databases. Generally, data analysis is performed by combining multiple statistical and bioinformatics tools available from different sources. Many useful programs for processing these data exist nowadays, such as RandA , Shortran , UEA sRNA Workbench , DSAP , miRTools 2.0  and miRExpress . Two main issues hamper diffusion and implementation of such programs: (i) web-based tools have some restriction on data upload; (ii) stand-alone programs often lack one or more analysis steps, such as for example prediction of novel sncRNAs. As main consequence, the analytical workflow is slowed down by the need for the continuous interventions by the operator, a critical factor when a large number of samples need to be analyzed at once. A main challenge in bioinformatics is thus to create comprehensive computational tools for handling and analyzing, in an automated manner, the huge amount of data generated by these experiments.
We describe here a modular analysis pipeline, iMir, for comprehensive analyses of smallRNA-Seq data integrating multiple open source modules and resources linked together in automated way. The pipeline allows identification of miRNAs and other sncRNAs, such as piRNAs, to perform differential expression analysis and, for miRNAs, to predict the corresponding mRNA targets. In addition, iMir provides the possibility to perform hierarchical clustering and to apply different statistical approaches to the analysis, improving discrimination of expressed sncRNAs and allows to identify those more likely to be biologically relevant. The pipeline output includes graphics and text files that are useful for a better interpretation of the results. iMir is well suited for the analysis of smallRNA-seq data obtained from animal samples. Moreover, it can be used to investigate the role of sncRNAs in plants adding the appropriate reference tracks in iMir database.
One of the main problems when dealing with the large datasets generated with the currently available Next Generation Sequencing (NGS) technologies are represented by the difficulties in their management and analysis. Analysis of smallRNA-Seq data, for example, requires implementation of different bioinformatics tools and the possibility to perform multiple, subsequent file format conversions that slows-down and makes cumbersome the analytical procedure. The need of a bioinformatics instrument that may help solve these problems in a user-friendly and handy way led us to devise a tool, called iMir, that integrates various open source modules and resources and, in addition, implementing different statistical approaches for sncRNAs expression analyses allows users to select the most performing and relevant method for analysis of their data. The analytical modules included in iMir, selected among the best available, have been made more performant thanks to home-made scripts that allows the user to create self-defined analytical flows. Indeed, the resources integrated in iMir were selected after careful comparison and throughout evaluation of the performance of software widely used in smallRNA-Seq data analysis, according also to what recently reported by Cordero et al.  and Williamson et al. . iMir was implemented using an object-oriented programming language, Python, and comprises also a Graphical User Interface (GUI, Figure 1A-C and Additional file 1) that makes it easier the use of command line tools, so that the pipeline is particularly suited for biologist and early stage bioinformaticians, also because it simplify the way how projects are created, parameters are specified for each of several steps of the analysis and the different algorithms are run on project data (Figure 1A). In addition, terminal display window (Figure 1C) can be used to follow the flow of analysis. iMir provides stepwise planning that allows user to select the desired combination of analytical tools in the workflow. Some modules are mandatory while others are, instead, optional. A schematic representation of the iMir workflow is shown in Figure 2A-F. The pipeline takes in input the deeply sequenced reads in FASTQ format. As mentioned above, iMir offers the possibility to run different modules independently and this can be considered one of its main advantages. Indeed, in some cases user can work using pre-analyzed datasets (e.g. reads clipped from adapters sequences, or table with read-counts for each sncRNAs detected in test vs control samples), or may need to perform only a specific analytical step, such as adapter cleavage from input reads, detection of known and/or novel miRNAs, or to map sequence reads against other sncRNA libraries and then to perform differential expression analysis. In all cases, it is possible to start the analytical flow at that step simply by using the input file specific for it. The initial analytical step described in Module 1 (Figure 2A) allows to perform a pre-process analysis of the input files by setting user-defined options for performing adapter cleavage with cutadapt tool , as well as quality filtering and analysis of the length distribution of reads. Cutadapt is used for adapter trimming and differs from other adapter trimming tools because it provides several useful option, e.g. error rate assessment in adapter cleavage or search and removal of multiple adapter sequences, essential to get rid of adapter duplications occurring during sequencing library preparation. Module 2 (Figure 2B) allows detection of known miRNAs. To this aim, iMir integrates in its pipeline miRanalyzer stand-alone tool , that in its last version (miRanalyzer version 0.3) was improved in speed and features, including a comprehensive analysis of sequences corresponding to isomiR . At this step it is possible to perform also cluster analyses, carrying out PCA analysis and/or applying different hierarchical clustering algorithms. This feature, in fact, is useful when dealing with a very large number of samples to assess similarities and differences among them, such as for example when analyzing results from large cohorts of tumor biopsies. One main advantage of small non-coding RNA sequencing is the possibility to predict novel miRNAs not annotated in databases. This procedure (Module 3, Figure 2C) is performed in iMir with miRanalyzer stand-alone tool  and miRDeep2 . With this process it is possible to achieve a dual purpose: (i) to obtain more accurate results on novel miRNAs, that can then be experimentally validated and (ii) to evaluate presence and concentration of reads relative to other sncRNAs in the same datasets. We included in iMir the possibility to implement an intermediate step (Module 4, Figure 2D), before proceeding to differential expression analysis step (Module 5, Figure 2E), to remove the noise and less informative reads, e.g. miRNAs or sncRNAs expressed with very low read counts, that is based on the following statistics: (i) cumulative distribution function and (ii) arbitrary value approaches. Furthermore, low read counts might not reveal a real biological information, being due to sequencing errors or inaccuracy during the procedure of read alignment to the reference genome, such as cross mapping artefacts. To account for this problem, a minimum read count value can be used to filter out reads detected below the cutoff (“Minimum Read Count”, Figure 1B). In addition, after known miRNAs detection (Module 2, Figure 2B), we included statistical approaches to evaluate the cumulative distribution function, such as quartile or percentile values that are computed considering the whole reads-counts to exclude from the list of expressed miRNAs in a given sample those showing read count below that value. This approach can be used also with any sncRNA libraries. The other statistical approach implemented in iMir addresses a common problem encountered when calculating fold-change values (test/control read counts ratios) for RNAs present in the samples at very low levels. Considering a fold-change threshold of ±1.5 and p-value <0.05, when read counts for a given sncRNA are very low this setting can generate biologically irrelevant results. For example, considering a case where 10 tags are assigned to a given RNA in the test sample and 5 in the control, both derived for large cell populations, the resulting fold-change (2.0) may be statistically significant but of doubtful biological relevance , contrary to what occurs for sncRNAs showing in the same conditions high read counts. To overcome this problem, we included in iMir the possibility to add in such cases a correction factor. This, computed automatically by the tool as the median of the read count distribution relative to the sncRNA datasets of interest, can be added to the actual read counts of all entries in the datasets. In the case described above, considering a calculated correction factor of 30 the fold-change value calculated for such low-expressed sncRNA will be 1.14, and thus below the threshold set, while for an RNA of the same dataset that shows 4,000 read counts in the sample and 2,000 in the control this adjustment will be irrelevant (see also: Results and discussion). Of note, the user can either disregard this function or use at will an alternative value for this parameter, calculated by any other means desired. These approaches are useful to reduce the number of false-positives detected by differential expression analysis, as these most likely occur among RNAs expressed at a very low level.
Identification and analysis of differentially expressed sncRNAs using digital data is implemented in iMir with two different methods (Module 5, Figures 1C and 2E). The first one, based on the DESeq bioconductor package , is particularly suited when biological or technical replicates are available . The second, based on quantile normalization and Fisher’s exact test to assess the statistical relevance, is specially designed for use when no replicates are available . The last iMir module (Module 6, Figure 2F) is designed to perform mRNA targets prediction of expressed, or differentially expressed, miRNAs. mRNA targets are predicted by using miRanda [38, 39], that includes current knowledge on target rules and uses a compendium of mammalian miRNAs, and TargetScan [40, 41], that computes mRNA targets by searching for the presence of 8mer and 7mer sites matching the seed region of each miRNA. iMir includes in its databases different sncRNAs, such as miRNAs, piRNAs, tRNAs, mRNAs and data from RFam for human, rat and mouse (Additional file 2: Table S1). Performance of iMir was compared with that of the individual bioinformatics tools considered by Williamson et al. , selected on the basis of their popularity highlighted by number of citations in the literature. Furthermore, the number of known (available in miRBase) and of novel (absent from the latest release of miRBase) miRNAs detected and the time required to carry to completion the whole analytical flow were evaluated on multiple datasets generated in our laboratory and available from public data repositories and then taken as indicators of iMir performance. The results obtained are in line with what previously reported , suggesting reliability of this new tool.
Results and discussion
As an example of the performance of the iMir pipeline, the tool was applied to analyze patterns of sncRNA expression and changes in human breast cancer MCF-7 cells maintained in two different culture conditions affecting cell cycle progression, e.g. growth-arrest and exponential growth [42, 43] (see: Additional file 3 for details). For each experimental condition, three sequencing replicates were analyzed to gather a correct estimation also of the technical variability occurring during differential expression analyses. A comprehensive smallRNA-Seq data analysis was performed running all iMir functions with default parameters and the performances of the tool are summarized in Figure 3. To detect classes of sncRNAs other than miRNAs, the raw reads not mapping to known mature miRNAs were aligned against tRNA and mRNA sequences from UCSC Genome Browser, rRNA and piRNA sequences from Nucleotide NCBI database and other sncRNA sequences from RFam  (see: Additional file 3 for details). One of the main advantages when applying this procedure is the possibility to reduce false-positive rate in novel miRNAs prediction, while at the same time allowing to search for and analyze other classes of sncRNAs in the datasets. For each sncRNA library sequenced, ~4,0 M raw reads/sample for exponentially growing and ~5.8 M reads/sample for growth arrested cells were obtained (Figure 3 and Table 1). After the pre-process analysis, a small percentage of reads, all <15nt-long, is discarded as the algorithm is unable at present to manage them. The read-length distribution after adapter cleavage in all samples is reported to the right of Module 1 in Figure 3, to show how the majority of reads obtained after this first step are ~22nt long, suggesting that they are mainly due to miRNAs. This observation is further confirmed by the number of reads that actually match known miRNAs (Table 1), computed to account for more than 50% of the entire dataset in each case. The remaining reads, showing a length distribution between 26 and 31nt could include piRNAs, while those 36nt-long could result from longer RNA degradation products. The heatmap reported in Figure 3 to the right of Module 5 highlights the high degree of similarity of miRNA profiles in the three replicates for each sample sequenced (A1-3 for the Case and B1-3 for the Control). A similar results was obtained for piRNAs (data not shown). After that, the iMir module that computes differential expression analysis (Module 5, Figure 2E) was run. miRNA analysis led to the identification of about 460-70 miRNAs per sample (Figure 3 and Table 2), some of which differentially expressed. The pie-charts in Figure 3 summarizes the results of miRNA differential expression analysis performed with DESeq , expressed as percentage of detected RNAs showing statistically significant differences in concentration between samples beyond a standard threshold (fold-change ≤ −1.5 or ≥1.5 in exponentially growing vs quiescent cells, p-value ≤ 0.05) or not (70.5%). These results, when compared with previously published data relative to miRNA modulation in the same cell line [13, 42, 43], confirm that iMir is useful to rapidly and efficiently perform differential analysis for these sncRNAs.
Recently, an increasing number of studies highlighted the role of piRNAs in breast cancer [45, 46]. Since the average length of these RNAs is ~30nt (see: Figure in Additional file 3: Figure S1), smallRNA-Seq represents an efficient analytical approach to assess also absolute and relative expression of these molecules. Based on this assumption, we searched for and analyzed piRNAs in the datasets selected to test iMir performance. To reduce cross-mapping artifacts, reads corresponding to other RNAs, in particular miRNAs, tRNAs, rRNAs and mRNAs, were first filtered out with iMir mapping them against the selected transcribed RNA libraries. This allowed at once to start from a set of more reliable data and to gather information concerning other small RNAs detected by sequencing (Table 1 and Table 2). This analysis led to the identification of 70 and 85 piRNAs expressed in growth-arrested and exponentially growing MCF-7 cells, respectively. Differential piRNA expression analysis and statistical significance testing performed with iMir revealed 12 downregulated and 25 upregulated piRNAs in growing cells, when compared to quiescent ones (p-value = 0.05, threshold = 1.5; Figure 3). We do not have a ready explanation for these relatively low numbers of piRNAs identified in breast cancer cells, except for the fact that piRNAs know to date have been identified in germ cells  and it is thus possible that the majority of them is expressed only in these cell types. Furthermore, most piRNAs identified so far associate with the piRNA biogenesis factor Piwil1 (Hiwi) [21, 47], that is not detectable in MCF-7 cells, where only Piwil 2 (Hili) and Piwil4 (Hiwi2) are detected [Hashim et al., manuscript in preparation]. As new validated piRNA datasets will become available, for example those identified by association to Piwil 2 and 4, the possibility built into iMir to customize its database will allow to include these in the analysis. The decision to focus here on individual piRNAs instead of considering their genomic organization in clusters is based on the observation that in somatic cells piRNAs deriving from a given cluster show different levels of steady-state expression, possibly due to a specific mechanism of precursor RNA maturation active in these cells or to differences in their half-life. In addition, recent results suggest that individual piRNAs could play important roles in tumor cells [48-50].
We then tested another function of iMir by performing differential miRNAs expression analysis in two different ways: (i) starting directly from the number of raw read-counts obtained with miRanalyzer  or (ii) by adding to each of these counts a correction factor (31), computed as the median of the whole read dataset (see above). Once compared, the results obtained with the two approaches showed slight but substantial differences, since ~10% of the miRNAs identified with the first method (pvalue ≤ 0.05) were excluded by the second one (Additional file 4: Table S2). This is explained by the fact that the RNAs expressed at a very low level under both experimental conditions, and thus of uncertain biological significance, were filtered out when using this correction. iMir offers the possibility to choose this method, when needed, also to other classes of sncRNAs.
With respect to the possibility to perform target prediction for selected miRNAs using miRanda and TargetScan databases, another useful function of the tool, it is worth mentioning the possibility for the user to update when required these and the other databases associated to the pipeline, such as those of miRNAs from miRBase, [51-53], of other sncRNAs from different sources and of mRNA targets from TargetScan [40, 41, 54] and miRanda [38, 39].
We designed, built and describe here iMir, a pipeline that integrates multiple open source modules/resources and implements statistical approaches, combined in an automated flow for high-throughput smallRNA-Seq data analysis. iMir is rapid, accurate and efficient, allowing to examine multiple samples at once and thereby addressing a critical factor for high-throughtput analysis of sncRNA sequencing data, represented by the need for continuous interventions by an operator skilled in informatics and programming. The graphical user interface of iMir, allows a simplified use of the many tools integrated in the pipeline and to customize data analysis according to different needs. In addition, the implementation of different statistical approaches provides the possibility to analyze data according to standard, widely used, as well as to specific needs. Finally, iMir works on Linux and Mac operative systems, user-friendly for biologists with limited skills in informatics. In the future, following the evolution of NGS technologies and recommendations by the scientific community, we plan to keep improving iMir features, including for example tools for sequence variants detection, evolutionary sncRNAs analysis across multiple species and adding specific functions for analysis of emerging classes of small RNAs (pi-, si-, sn-, sno-, ti-RNA, etc.).
Availability and requirements
Project name: iMir.
Project home page:http://www.labmedmolge.unisa.it/inglese/research/imir.
Operating System(s): Unix/Linux based.
Other requirements: Python, Java, Perl, R, DESeq, Bowtie, Vienna RNA Secondary Structure package.
License: GNU GPL v3.
Any restrictions to use by non-academics: specified by GNU GPL v3.
Small non-coding RNA
Next generation sequencing.
Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116 (2): 281-297. 10.1016/S0092-8674(04)00045-5.
He L, Hannon GJ: MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet. 2004, 5 (7): 522-531. 10.1038/nrg1379.
Flynt AS, Lai EC: Biological principles of microRNA-mediated regulation: shared themes amid diversity. Nat Rev Genet. 2008, 9 (11): 831-842. 10.1038/nrg2455.
Tili E, Michaille JJ, Cimino A, Costinean S, Dumitru CD, Adair B, Fabbri M, Alder H, Liu CG, Calin GA, et al: Modulation of miR-155 and miR-125b levels following lipopolysaccharide/TNF-alpha stimulation and their possible roles in regulating the response to endotoxin shock. J Immunol. 2007, 179 (8): 5082-5089.
Tay YM, Tam WL, Ang YS, Gaughwin PM, Yang H, Wang W, Liu R, George J, Ng HH, Perera RJ, et al: MicroRNA-134 modulates the differentiation of mouse embryonic stem cells, where it causes post-transcriptional attenuation of Nanog and LRH1. Stem Cells. 2008, 26 (1): 17-29. 10.1634/stemcells.2007-0295.
Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification of novel genes coding for small expressed RNAs. Science. 2001, 294 (5543): 853-858. 10.1126/science.1064921.
Lau NC, Lim LP, Weinstein EG, Bartel DP: An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science. 2001, 294 (5543): 858-862. 10.1126/science.1065062.
Lee RC, Ambros V: An extensive class of small RNAs in Caenorhabditis elegans. Science. 2001, 294 (5543): 862-864. 10.1126/science.1065329.
Gupta A, Gartner JJ, Sethupathy P, Hatzigeorgiou AG, Fraser NW: Anti-apoptotic function of a microRNA encoded by the HSV-1 latency-associated transcript. Nature. 2006, 442 (7098): 82-85.
Jopling CL, Yi M, Lancaster AM, Lemon SM, Sarnow P: Modulation of hepatitis C virus RNA abundance by a liver-specific MicroRNA. Science. 2005, 309 (5740): 1577-1581. 10.1126/science.1113329.
Huang Q, Gumireddy K, Schrier M, le Sage C, Nagel R, Nair S, Egan DA, Li A, Huang G, Klein-Szanto AJ, et al: The microRNAs miR-373 and miR-520c promote tumour invasion and metastasis. Nat Cell Biol. 2008, 10 (2): 202-210. 10.1038/ncb1681.
Silber J, Lim DA, Petritsch C, Persson AI, Maunakea AK, Yu M, Vandenberg SR, Ginzinger DG, James CD, Costello JF, et al: miR-124 and miR-137 inhibit proliferation of glioblastoma multiforme cells and induce differentiation of brain tumor stem cells. BMC Med. 2008, 6: 14-10.1186/1741-7015-6-14.
Paris O, Ferraro L, Grober OMV, Ravo M, De Filippo MR, Giurato G, Nassa G, Tarallo R, Cantarella C, Rizzo F, et al: Direct regulation of microRNA biogenesis and expression by estrogen receptor beta in hormone-responsive breast cancer. Oncogene. 2012, 31 (38): 4196-4206. 10.1038/onc.2011.583.
Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN: MicroRNA genes are transcribed by RNA polymerase II. EMBO J. 2004, 23 (20): 4051-4060. 10.1038/sj.emboj.7600385.
Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S, et al: The nuclear Rnase III Drosha initiates microRNA processing. Nature. 2003, 425 (6956): 415-419. 10.1038/nature01957.
Lund E, Guttinger S, Calado A, Dahlberg JE, Kutay U: Nuclear export of microRNA precursors. Science. 2004, 303 (5654): 95-98. 10.1126/science.1090599.
Hutvagner G, McLachlan J, Pasquinelli AE, Balint E, Tuschl T, Zamore PD: A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science. 2001, 293 (5531): 834-838. 10.1126/science.1062961.
Nilsen TW: Mechanisms of microRNA-mediated gene regulation in animal cells. Trends Genet. 2007, 23 (5): 243-249. 10.1016/j.tig.2007.02.011.
Ambros V: The functions of animal microRNAs. Nature. 2004, 431 (7006): 350-355. 10.1038/nature02871.
Zamore PD, Haley B: Ribo-gnome: the big world of small RNAs. Science. 2005, 309 (5740): 1519-1524. 10.1126/science.1111444.
Luteijn MJ, Ketting RF: PIWI-interacting RNAs: from generation to transgenerational epigenetics. Nat Rev Genet. 2013, 14 (8): 523-534. 10.1038/nrg3495.
Cheng J, Guo JM, Xiao BX, Miao Y, Jiang Z, Zhou H, Li QN: piRNA, the new non-coding RNA, is aberrantly expressed in human cancer cells. Clin Chim Acta. 2011, 412 (17-18): 1621-1625.
Isakov O, Ronen R, Kovarsky J, Gabay A, Gan I, Modai S, Shomron N: Novel nsight into the non-coding reperto ire through deep sequencing analysis. Nucleic Acids Res. 2012, 10: 1093-
Gupta V, Markmann K, Pedersen CN, Stougaard J, Andersen SU: shortran: a pipeline for small RNA-seq data analysis. Bioinformatics. 2012, 28 (20): 2698-2700. 10.1093/bioinformatics/bts496.
Stocks MB, Moxon S, Mapleson D, Woolfenden HC, Mohorianu I, Folkes L, Schwach F, Dalmay T, Moulton V: The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing microRNA and small RNA datasets. Bioinformatics. 2012, 28 (15): 2059-2061. 10.1093/bioinformatics/bts311.
Huang PJ, Liu YC, Lee CC, Lin WC, Gan RR, Lyu PC, Tang P: DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res. 2010, 38: W385-W391. 10.1093/nar/gkq392. Web Server issue
Wu J, Liu Q, Wang X, Zheng J, Wang T, You M, Sheng Sun Z, Shi Q: mirTools 2.0 for non-coding RNA discovery, profiling, and functional annotation based on high-throughput sequencing. RNA Biol. 2013, 10 (7): 1087-1092. 10.4161/rna.25193.
Wang WC, Lin FM, Chang WC, Lin KY, Huang HD, Lin NS: miRExpress: analyzing high-throughput sequencing data for profiling microRNA expression. BMC bioinformatics. 2009, 10: 328-10.1186/1471-2105-10-328.
Cordero F, Beccuti M, Arigoni M, Donatelli S, Calogero RA: Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis. PLoS One. 2012, 7 (2): e31630-10.1371/journal.pone.0031630.
Williamson V, Kim A, Xie B, McMichael GO, Gao Y, Vladimirov V: Detecting miRNAs in deep-sequencing data: a software performance comparison and evaluation. Brief Bioinform. 2013, 14 (1): 36-45. 10.1093/bib/bbs010.
Martin M: Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet Journal. 2011, 17: 1-
Hackenberg M, Rodriguez-Ezpeleta N, Aransay AM: miRanalyzer: an update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res. 2011, 39: W132-W138. 10.1093/nar/gkr247. Web Server issue
Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, et al: Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 2008, 18 (4): 610-621. 10.1101/gr.7179508.
Friedländer MR, Mackowiak SD, Li N, Chen W, Rajewsky N: miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 2012, 40 (1): 37-52. 10.1093/nar/gkr688.
Wee LM, Flores-Jasso CF, Salomon WE, Zamore PD: Argonaute divides its RNA guide into domains with distinct functions and RNA-binding properties. Cell. 2012, 151 (5): 1055-1067. 10.1016/j.cell.2012.10.036.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-10.1186/gb-2010-11-10-r106.
Garmire LX, Subramaniam S: Evaluation of normalization methods in mammalian microRNA-Seq data. RNA. 2012, 18 (6): 1279-1288. 10.1261/rna.030916.111.
Betel D, Koppal A, Agius P, Sander C, Leslie C: Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol. 2010, 11 (8): R90-10.1186/gb-2010-11-8-r90.
Betel D, Wilson M, Gabow A, Marks DS, Sander C: The microRNA.org resource: targets and expression. Nucleic Acids Res. 2008, 36: D149-D153. Database issue
Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120 (1): 15-20. 10.1016/j.cell.2004.12.035.
Grimson A, Farh KK, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27 (1): 91-105. 10.1016/j.molcel.2007.06.017.
Cicatiello L, Mutarelli M, Grober OM, Paris O, Ferraro L, Ravo M, Tarallo R, Luo S, Schroth GP, et al: Estrogen receptor alpha controls a gene network in luminal-like breast cancer cells comprising multiple transcription factors and microRNAs. Am J Pathol. 2010, 176 (5): 2113-2130. 10.2353/ajpath.2010.090837.
Ferraro L, Ravo M, Nassa G, Tarallo R, De Filippo MR, Giurato G, Cirillo F, Stellato C, Silvestro S, Cantarella C, et al: Effects of estrogen on microRNA expression in hormone-responsive breast cancer cells. Horm Cancer. 2011, 2 (5): 610-621.
Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013, 41: D226-D232. 10.1093/nar/gks1005. Database issue
Esposito T, Magliocca S, Formicola D, Gianfrancesco F: piR_015520 belongs to Piwi-associated RNAs regulates expression of the human melatonin receptor 1A gene. PLoS One. 2011, 6 (7): e22727-10.1371/journal.pone.0022727.
Huang G, Hu H, Xue X, Shen S, Gao E, Guo G, Shen X, Zhang X: Altered expression of piRNAs and their relation with clinicopathologic features of breast cancer. Clin Transl Oncol. 2012, [Epub ahead of print]
Aravin A, Gaidatzis D, Pfeffer S, Lagos-Quintana M, Landgraf P, Iovino N, Morris P, Brownstein MJ, Kuramochi-Miyagawa S, Nakano T, Chien M, Russo JJ, Ju J, Sheridan R, Sander C, Zavolan M, Tuschl T: A novel class of small RNAs bind to MILI protein in mouse testes. Nature. 2006, 442 (7099): 203-207.
Cheng J, Deng H, Xiao B, Zhou H, Zhou F, Shen Z, Guo J: piR-823, a novel non-coding small RNA, demonstrates in vitro and in vivo tumor suppressive activity in human gastric cancer cells. Cancer Lett. 2012, 315 (1): 12-17. 10.1016/j.canlet.2011.10.004.
Huang G, Hu H, Xue X, Shen S, Gao E, Guo G, Shen X, Zhang X: Altered expression of piRNAs and their relation with clinicopathologic features of breast cancer. Clin Transl Oncol. 2013, 15 (7): 563-568. 10.1007/s12094-012-0966-0.
Law PT, Qin H, Ching AK, Lai KP, Co NN, He M, Lung RW, Chan AW, Chan TF, Wong N: Deep sequencing of small RNA transcriptome reveals novel non-coding RNAs in hepatocellular carcinoma. J Hepatol. 2013, 58 (6): 1165-1173. 10.1016/j.jhep.2013.01.032.
Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, 39: D152-D157. 10.1093/nar/gkq1027. Database issue
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34: D140-D144. 10.1093/nar/gkj112. Database issue)
Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, 36: D154-D158. 10.1093/nar/gkn221. Database issue
Friedman RC, Farh KK, Burge CB, Bartel DP: Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 2009, 19 (1): 92-105.
Work supported by: Italian Ministry for Education, University and Research (Grant PRIN 2010LC747T_002 to AW and FIRB RBFR12W5V5_003 to RT), Italian Association for Cancer Research (Grant IG-13176), Fondazione Umberto Veronesi’ (Grant 2012-13), EU SeqAhead COST Action (BM1006), CNR Epigen Flagship Project (Grant 2013), University of Salerno (Fondi FARB 2011-2012). GN is supported by a ‘Mario e Valeria Rindi’ fellowship of the Italian Foundation for Cancer Research, MR is supported by a ‘Vladimir Ashkenazy’ fellowship of Italian Association for Cancer Research, FR is supported by a “Young Investigator Programme” fellowship of the Fondazione ‘Umberto Veronesi’, University of Napoli ‘Federico II’ and AH is a PhD student in “Experimental Physiopathology and Neurosciences”, Second University of Napoli.
The authors declare that they have no competing interests.
GG and MRDF lead in the conception and design of the software, performed the statistical analyses and coordinated manuscript drafting and revision. AR participated in conception and development of the software. AH and FR contributed to software conception and design and to manuscript drafting. GN, MR and RT carried out smallRNA-Seq analyses, manuscript drafting and contributed to software conception. AW coordinated the project, participated in conception of the software and participated in drafting and finalization of the manuscript. All authors read and approved the final manuscript.
Giorgio Giurato, Maria Rosaria De Filippo contributed equally to this work.
Electronic supplementary material
About this article
Cite this article
Giurato, G., De Filippo, M.R., Rinaldi, A. et al. iMir: An integrated pipeline for high-throughput analysis of small non-coding RNA data obtained by smallRNA-Seq. BMC Bioinformatics 14, 362 (2013). https://doi.org/10.1186/1471-2105-14-362
- Next generation sequencing
- Data analysis pipeline
- Breast cancer
- Small non-coding RNA
- Piwi-interacting RNA