A framework for automated enrichment of functionally significant inverted repeats in whole genomes
© Frank et al; licensee BioMed Central Ltd. 2010
Published: 7 October 2010
RNA transcripts from genomic sequences showing dyad symmetry typically adopt hairpin-like, cloverleaf, or similar structures that act as recognition sites for proteins. Such structures often are the precursors of non-coding RNA (ncRNA) sequences like microRNA (miRNA) and small-interfering RNA (siRNA) that have recently garnered more functional significance than in the past. Genomic DNA contains hundreds of thousands of such inverted repeats (IRs) with varying degrees of symmetry. But by collecting statistically significant information from a known set of ncRNA, we can sort these IRs into those that are likely to be functional.
A novel method was developed to scan genomic DNA for partially symmetric inverted repeats and the resulting set was further refined to match miRNA precursors (pre-miRNA) with respect to their density of symmetry, statistical probability of the symmetry, length of stems in the predicted hairpin secondary structure, and the GC content of the stems. This method was applied on the Arabidopsis thaliana genome and validated against the set of 190 known Arabidopsis pre-miRNA in the miRBase database. A preliminary scan for IRs identified 186 of the known pre-miRNA but with 714700 pre-miRNA candidates. This large number of IRs was further refined to 483908 candidates with 183 pre-miRNA identified and further still to 165371 candidates with 171 pre-miRNA identified (i.e. with 90% of the known pre-miRNA retained).
165371 candidates for potentially functional miRNA is still too large a set to warrant wet lab analyses, such as northern blotting, on all of them. Hence additional filters are needed to further refine the number of candidates while still retaining most of the known miRNA. These include detection of promoters and terminators, homology analyses, location of candidate relative to coding regions, and better secondary structure prediction algorithms. The software developed is designed to easily accommodate such additional filters with a minimal experience in Perl.
In the last decade, non-coding RNA (ncRNA) sequences have become more essential to our understanding of gene organization. They were once considered insignificant in comparison to protein-coding sequences. But since then, a variety of new types of ncRNA genes have been discovered, each of them revealing new biological roles and cellular mechanisms like gene silencing, replication, gene expression regulation, transcription, chromosome stability, and protein stability [1–3]. Therefore, the identification of ncRNA has significant importance to the biological and medical community. To date, the genomes of numerous organisms have been fully sequenced, making it possible to perform genome-wide computational analyses. Computational methods of ncRNA identification typically involve scanning genomic DNA or transcriptome data for candidate sequences, after which wet lab techniques like northern blotting are required to verify their cellular function .
The precursors of non-coding RNA sequences like transfer RNA, ribosomal RNA, microRNA, and small-interfering RNA, typically adopt hairpin-like, cloverleaf, or similar symmetric structures which are the result of dyad symmetry, i.e. inverted repeats (IRs) in the RNA sequences. But hundreds of thousands of IRs that can be found by a simple scan of genomic DNA. This makes it difficult to claim that any inverted repeat in a genome has functional significance, but it potentially raises the number of functional RNA sequences that have yet to be identified.
Transcription from DNA to RNA is typically guided by the presence of promoter and terminator sequences in the genome that usually lie in the vicinity of non-coding or protein coding genes. However, current methods can only detect certain classes of promoters and terminators, and the degrees of accuracy of such methods are insufficient for genome-wide scans . In addition to this, the starting points of the transcripts in the genome are not always known, even for commonly studied genes. It has been reported that some intergenic regions (DNA between protein coding genes) contain ncRNA that act to regulate the genes nearby. Hence, many RNA detection methods make the assumption that ncRNA is present in the vicinity of known genes and between coding regions within genes (introns). However, most intergenic DNA still have no known function and the basis for this assumption is anecdotal. Current estimates show that approximately 60% of miRNAs are expressed independently, 15% of miRNAs are expressed in clusters, and 25% are in introns .
If an RNA is functionally significant, then the structure and sequence are conserved over the generations. A method called miRNAminer  searches for such evolutionarily related miRNA sequences from different species (homologs). Given a query miRNA, candidate homologs from different species are tested for secondary structure, alignment and conservation, in order to assess their candidacy as miRNAs. By computationally identifying small sections of a genome that could form hairpin-like secondary structures, some researchers have been able to identify sets of potential miRNA sequences which include a subset of known miRNA. Two such methods, miRSeeker  and miRScan , first identify hairpin structures from intergenic regions using homology search and secondary structure prediction. To these candidates, miRSeeker applies mutation patterns that are typical of miRNAs, and miRScan identifies those structures having features such as symmetric bulges or a highly conserved stem near the terminal loop. miRRim  represents the evolutionary and secondary structural features of all known miRNA and their surrounding regions with a sequence of multidimensional vectors. It uses these to train hidden Markov models (HMM) for miRNA and non-miRNA sequences. These models are combined into a single HMM and used to search genomic sequences for miRNA. Current methods of secondary structure prediction involve a dynamic programming algorithm similar to those used for sequence alignment. These methods are promising, but cannot predict more complex secondary structures like pseudoknots (non-nested pairing). Clote et al  proposed that the secondary structures of functional ncRNA are more thermodynamically stable than random RNA. The Gibbs free energy (∆G°) is a popularly used measure of this thermodynamic potential energy, and some ncRNA detection methods incorporate it as a threshold for detection of miRNA .
The hairpin-like secondary structure of a microRNA is a result of the inverted repeats that it contains. It is believed that IRs are the result of inverted DNA duplication events that occurred during the course of evolution of most organisms . If this is the case, the asymmetries and bulges as seen in Figure 1 are formed later as a result of accumulation of mutations, insertions, and deletions. However, the inverted repeats remain highly conserved since the base-paired stem loops of the hairpin structures are relatively much longer than the asymmetries. The degree of dyad symmetry can therefore be used as a criterion for miRNA detection. We present a fast genome-wide scanning algorithm named irScan that first finds all sufficiently symmetric IRs in a given genomic DNA sequence (typically a whole chromosome). This large number of ncRNA candidates is then further reduced based on user-defined criteria for ncRNA detection. We demonstrate the capability of this algorithm using criteria for miRNA detection like the density of symmetric matches in the IR (density of base-pairs in the hairpin-like secondary structure), statistical probability of the symmetry, average length of contiguous symmetric matches in the IR (length of base-paired stems in the hairpin), and the GC content of the matches in the IRs. Detection of inverted repeats by itself is an insufficient criterion for ncRNA detection. Our preliminary scan using irScan's base thresholds on the fully sequenced Arabidopsis chromosomes revealed around 1.1 million mostly symmetric IRs. It is thus necessary to bring this number down to a small set of candidates that are most likely to be functionally significant ncRNA and hence warrant further wet lab analyses.
Detection of inverted repeats
Scoring matrix used by irScan's IR detector
MicroRNA precursor analysis
Our second criterion of detection is based on the probability of occurrence of an IR in a randomly generated RNA sequence. Let us denote this as P. Small values of P most likely indicate highly conserved dyad symmetries and hence potential functionality, while large values of P most likely indicate a random RNA sequence. But they could also indicate a potentially functional RNA that has lost most of its symmetry but retained its functionality. Using P and D values as filters excludes such ncRNAs, but from our understanding of pre-miRNA processing, sufficient symmetry between inverted repeats is a necessary condition for the formation of stable hairpins that can be processed by the ribonuclease Drosha. The calculation of P, like D, depends on the ratio of matches to mismatches in the IR generated. This is described below.
Consider an RNA sequence with 2k nucleotide bases. The left hand side (LHS) of length k bases is mostly inversely symmetric with the right hand side (RHS) of equal length resulting in the hairpin-like secondary structures shown in Figures 1 and 2. Let n be the number of bases that are inverted repeats (part of the base-pairing stem loop). The probability that n is exactly 1 is represented as P(1, k) = 0.25 × (0.75) k-1 × k, where 0.25 is the probability that one base in the LHS is the reverse complement of the corresponding base in the RHS, out of 4 possible bases A, C, G, or U. (0.75) k-1 is the probability that all other k-1 bases are mismatches. And k is the number of combinations in which this can occur. Similarly, if n is exactly 2, then P(2, k) = (0.25) 2 × (0.75) k-2 × k C 2 , where k C 2 is the number of combinations in which 2 matches can occur among k-2 mismatches. In general, we can use the calculation below.
P(n, k) = (0.25) n × (0.75) k-n × k C n
The irScan framework
irScan using base parameters
The base parameters for irScan were selected to identify as many of the known 190 At pre-miRNA as possible, while keeping the total number of IRs detected less than 1 million. These parameters were D min = 59%, P max = 9.99×10-9, Amin = 2.2 bp, and Gmin = 18 bp. In all runs of irScan, D max was set to 95% to exclude low complexity regions, and IRs had to be at least 50 bp long to qualify as potential miRNA precursors. The irScan program returned 714700 IR candidates with these base parameters which included 186 of the known pre-miRNA sequences. If duplicate IRs generated by overlapping windows of different sizes were removed, then 483908 IR candidates remained with 183 of the known pre-miRNA sequences. Three of the initial 186 pre-miRNA were skipped because the simpler duplicate removal process cannot always distinguish between nearby identical IRs and duplicate IRs generated by overlapping scanning windows.
Finding optimal parameters for irScan
Number of IRs and IIRs found using different irScan filters
G min =18
G min =19
G min =20
G min =21
G min =22
G min =23
G min =24
G min =25
G min =26
From Table 2, we can see that the optimal parameters were A min = 2.3 and G min = 24, with D min = 60% and P max = 9.99×10-11. This set of irScan parameters finds 165371 IR pre-miRNA candidates which include exactly 171 IIRs. This is still too large a number of candidates to warrant wet lab analyses on each, but it is a considerable reduction from the 1.5 million found by preliminary scans.
Recently, nine more Arabidopsis pre-miRNAs were added to miRBase. 6 of them matched one or more of the 165371 ncRNA candidates while the remaining 3 were just about insufficiently symmetric to pass the preliminary scans by the IR detector. They did not pass thresholds Dmin and Pmax, but they did pass the Amin and Gmin thresholds. This indicates that the selection of thresholds for symmetry were made slightly too stringent in an effort to keep the final set of candidates small. The user of the framework can tune these parameters to allow less symmetric sequences that compensate for more specific filters to find a smaller final set of candidates.
Initially, our study revealed that partially symmetric inverted repeats are abundant in genomic DNA. However, it was shown that most of these IRs are easily distinguishable from the IRs of known pre-miRNA and can be filtered out using generic criteria like density of symmetry, statistical probability of symmetry, average length of symmetric regions, and the GC content of sufficiently symmetric regions. It is then reasonable to assume that more accurate filters that are highly specific to certain kinds of ncRNA will retain a smaller final list of IRs that can then be further analysed using wet lab techniques such as northern blotting to identify novel ncRNA genes. The irScan software framework was designed to be easily expandable with such additional filtering criteria, by anyone with experience in the Perl programming language. The more computationally demanding IR detector algorithm was implemented in C++ and parallelized to be able to scan the whole Arabidopsis genome for IRs in less than a minute using a base set of filters. A user could then filter these IRs further by running various combinations of filters using Perl to find an optimal set of filters and parameters, that minimizes the number of IR candidates while maximizing the number of known ncRNAs identified.
Additional filters are required to further enrich the final set of IRs with those that are more likely to be functional ncRNA, while still retaining most of the known ncRNA. Some such filters include the detection of promoters and terminators, homology analyses, location of candidate relative to coding regions or relative to each other, and better secondary structure prediction algorithms. Statistical analyses of related organisms can lead to filters for organisms that are less studied. The software developed is designed to easily accommodate such additional filters by someone with minimal experience in Perl, while the computationally expensive underlying genome-wide scanning algorithms have been implemented in the more efficient C++ programming language.
Work was funded by the Department of Computer Science and the Department of Biological Sciences at Missouri University of Science and Technology. The journal reviewers contributed extremely helpful suggestions that improved this manuscript.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 6, 2010: Proceedings of the Seventh Annual MCBIOS Conference. Bioinformatics: Systems, Biology, Informatics and Computation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S6.
- Huttenhofer A, Schattner P, Polacek N: Non-coding RNAs: hope or hype. Trends Genet 2005, 21(5):289–297. 10.1016/j.tig.2005.03.007View ArticlePubMedGoogle Scholar
- Machado-Lima A, del Portillo HA, Durham AM: Computational methods in noncoding RNA research. J Math Biol 2008, 56: 15–49. 10.1007/s00285-007-0122-6View ArticlePubMedGoogle Scholar
- Liu C, Bai B, Skogerbo G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R: NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Res 2005, 33: D112-D115. 10.1093/nar/gki041PubMed CentralView ArticlePubMedGoogle Scholar
- Meng Y, Huang F, Shi Q, Cao J, Chen D, Zhang J, Ni J, Wu P, Chen M: Genome-wide survey of rice microRNAs and microRNA-target pairs in the root of a novel auxin-resistant mutant. Planta 2009, 230: 883–898. 10.1007/s00425-009-0994-3View ArticlePubMedGoogle Scholar
- Lee Y, Jeon K, Lee JT, Kim S, Kim VN: MiRNA maturation: stepwise processing and subcellular localization. EMBO J 2002, 21: 4663–4670. 10.1093/emboj/cdf476PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang B, Stellwaga EJ, Pan X: Large-scale genome analysis reveals unique features of microRNAs. Gene 2009, 443: 100–109. 10.1016/j.gene.2009.04.027View ArticlePubMedGoogle Scholar
- Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S, Kim VN: The nuclear RNase III Drosha initiates miRNA processing. Nature 2003, 425: 415–419. 10.1038/nature01957View ArticlePubMedGoogle Scholar
- Ruby JG, Jan CH, Bartel DP: Intronic microRNA precursors that bypass Drosha processing. Nature 2007, 448: 83–86. 10.1038/nature05983PubMed CentralView ArticlePubMedGoogle Scholar
- Zhou X, Ruan J, Wang G, Zhang W: Characterization and Identification of MicroRNA Core Promoters in Four Model Species. PLoS Comput Biol 2007, 3(3):e37. 10.1371/journal.pcbi.0030037PubMed CentralView ArticlePubMedGoogle Scholar
- Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP: The micro-RNAs of Caenorhabditis elegans. Genes & Dev 2003, 17: 991–1008. 10.1101/gad.1074403View ArticleGoogle Scholar
- Artzi S, Kiezun A, Shomron N: miRNAminer: a tool for homologous microRNA gene search. BMC Bioinformatics 2008, 9: 39. 10.1186/1471-2105-9-39PubMed CentralView ArticlePubMedGoogle Scholar
- Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol 2003, 4: R42. 10.1186/gb-2003-4-7-r42PubMed CentralView ArticlePubMedGoogle Scholar
- Terai G, Komori T, Asai K, Kin T: miRRim: A novel system to find conserved miRNAs with high sensitivity and specificity. RNA 2007, 13: 2081–2090. 10.1261/rna.655107PubMed CentralView ArticlePubMedGoogle Scholar
- Clote P, Ferré F, Kranakis E, Krizanc D: Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 2005, 11(5):578–591. 10.1261/rna.7220505PubMed CentralView ArticlePubMedGoogle Scholar
- Lin CT, Lin WH, Lyu YL, Whang-Peng J: Inverted repeats as genetic elements for promoting DNA inverted duplication: implications in gene amplification. Nucleic Acids Res 2001, 29: 3529–3538. 10.1093/nar/29.17.3529PubMed CentralView ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006, 34: D140-D144. 10.1093/nar/gkj112PubMed CentralView ArticlePubMedGoogle Scholar
- Bonnet E, Wuyts J, Rouze P, Van de Peer Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004, 20: 2911–2917. 10.1093/bioinformatics/bth374View ArticlePubMedGoogle Scholar
- Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL: The Vienna RNA Websuite. Nucleic Acids Res 2008, 36: W70-W74. 10.1093/nar/gkn188PubMed CentralView ArticlePubMedGoogle Scholar
- The Arabidopsis Information Resource[http://www.arabidopsis.org/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.