Finding sRNA generative locales from high-throughput sequencing data with NiBLS
© MacLean et al; licensee BioMed Central Ltd. 2010
Received: 4 June 2009
Accepted: 18 February 2010
Published: 18 February 2010
Next-generation sequencing technologies allow researchers to obtain millions of sequence reads in a single experiment. One important use of the technology is the sequencing of small non-coding regulatory RNAs and the identification of the genomic locales from which they originate. Currently, there is a paucity of methods for finding small RNA generative locales.
We describe and implement an algorithm that can determine small RNA generative locales from high-throughput sequencing data. The algorithm creates a network, or graph, of the small RNAs by creating links between them depending on their proximity on the target genome. For each of the sub-networks in the resulting graph the clustering coefficient, a measure of the interconnectedness of the subnetwork, is used to identify the generative locales. We test the algorithm over a wide range of parameters using RFAM sequences as positive controls and demonstrate that the algorithm has good sensitivity and specificity in a range of Arabidopsis and mouse small RNA sequence sets and that the locales it generates are robust to differences in the choice of parameters.
NiBLS is a fast, reliable and sensitive method for determining small RNA locales in high-throughput sequence data that is generally applicable to all classes of small RNA.
High-throughput sequencing technologies such as Illumina's Solexa, 454 Life Sciences' GS-FLX and ABI's SOLiD platforms allow researchers to generate gigabases of sequence data in a matter of hours . As such they are finding use in the analysis of many biological datasets, including the deep sequencing and cataloguing of non-coding small regulatory RNAs (sRNAs). These sRNAs have been described as the 'dark matter of genetics'  because they are highly abundant yet difficult to detect. They have roles in regulating gene expression via post-transcriptional and translational mechanisms in animals, fungi and plants. Single-stranded silencing RNAs of 21-25 nt in length, are created from a double stranded RNA by the protein Dicer. The RNAs are the guide for AGO nucleases that cleave the targeted RNA in a sequence specific manner. Cleaved RNAs are degraded further or become template for RNA-dependent polymerase to generate a dsRNA [3, 4]. The known number of classes of sRNAs is great and with the advent of high-throughput sequencing is getting greater. With these recent advances in sequencing technology we are in a position to find new classes of sRNA that have not previously been discovered. The first step in this is in the identification of parts of the genome that generate sRNAs. We call these regions "locales", choosing this word for the obvious similarity to the term locus from the genetic literature, which defines a distinct point or region on a genome. It is the detection of locales with which this paper is concerned. After generating the sequence the reads must be aligned to the genome. Alignment is a well studied problem and is handled by a range of programs such as SSAHA , MAQ  and SOAP  (see  for a review and other alternatives). Grouping the reads into locales that represent the place of origin of potential functional sRNAs is the next step.
There has been little discussion of what constitutes a sRNA-generating locale, with researchers sometimes relying on restrictive and arbitrary definitions [8–10]. Many existing tools rely on the detection of specific classes of sRNA. For example, mirCat  and mirDeep  are micro-RNA (miRNA) detectors. Chen et al. have created a tool for predicting trans-acting siRNA (ta-siRNA) . Other studies have used time-series data-mining algorithms to identify genomic locales from which sRNAs originate with disregard to sRNA class , but to date have relied on identifying only those that were statistically more 'unusual' than others according to their own measures. Such a method is not necessarily useful as it would lack the sensitivity to find the majority of locales. To avoid these problems, researchers have previously used simple but functional tools for generative region detection . Thus there is a need for generally applicable, sensitive methods for determining locales from sequencing data. Since the full range of different classes of sRNA is not yet known search strategies for potential functional locales must be general.
Results and Discussion
Definition and detection of locales
A locale is defined as a component of a graph G = (V, E) with vertices V and edges E that has clustering coefficient γ above a user-defined cutoff C. To create the graph we align sRNAs to the target genome such that s is a sRNA on chromosome c with start i and end j.
The extent of each locale is from the lowest start (i) to the highest end (j) for each sRNA in the component l.
Sensitivity and specificity of the algorithm
To test whether our algorithm is capable of detecting biologically meaningful locales from sRNA data, we examined its sensitivity and specificity on publicly available high-throughput sRNA pyrosequencing of sRNAs extracted from the flowers, rosettes or entire seedlings of the higher plant Arabidopsis thaliana  and mouse embryonic stem (ES) cells . Typically, sensitivity of an algorithm is assessed by comparison of some output against a pre-known result. However, there is no organism or tissue in which the full set of expressed sRNA and generative locales is known; thus it is difficult to establish a comprehensive set of true positive locales for comparison.
Number of RFAMs in each tissue.
Total number of RFAMs
RFAMs > 5 hits
Embryonic Stem Cells
The Caenorhabditis elegans sRNA complement includes a huge number of well known and well annotated sRNAs, such as the 21U-RNAs, a class of RNAs whose sequence begins with uracil and have length of 21 nt . It could be argued that this provides an excellent test case as many of the real locales are known. However, the know loci in this case are very easy to detect, having specific mapping points on the reference genome. We added 21U-RNA to our sample and carried out the analysis as described above in C.elegans. The sensitivity of the algorithm in this case was very high (Additional File 5) and never drops to be as low as that in the other tests. At 75% of parameter values we used over 40% of loci are recovered. In this case we believe that the large number of 21U-RNAs (>15000)  is skewing the result and giving a perhaps non-representative view of the efficacy of the algorithm for general use.
The specificity of the algorithm was high: greater than 90 in all tissues at all parameters (see Additional Files 6, 7). In part this is because it is not possible for the algorithm to detect locales where there are no sRNAs aligned and so it cannot spontaneously generate false positives. Furthermore, for a locale to exist the definition requires that a component l of the graph should have at least two vertices. This removes all sRNAs separated by more than M from others, since, in redundant sequence sets, the real locales would be expected to be represented by more than one sequence. Such a factor has the effect of greatly reducing the 'junk' that could be considered for inclusion in locales. Together these results show clearly that the algorithm can sensitively and specifically identify sRNA locales in sRNA sequence data from evolutionarily distantly related species. In the Arabidopsis and mouse sequence data tested here it seems that parameter settings for optimal sensitivity fall in the range 0 <M < 20 and 0 <C < 0.6.
It is important to note the necessary differences in interpretation of the value of the clustering coefficient in the context of co-overlapping sRNAs and the interpretation used in the network literature, in particular the primary article of Watts and Strogatz . Graphs created by randomly assigning edges between nodes typically have a lower clustering coefficient than real-world networks, biological networks such as the Caenorhabditis elegans neuronal network have clustering coefficients on the order of 0.3, random networks of around 0.05 . The high clustering coefficient implies that the nodes in the real-world networks share many neighbours with their neighbours and suggests the structure of the network is modular. In our algorithm we use the clustering coefficient simply as a measure of the co-overlapping of the sRNAs and if we find a sufficiently high co-overlapping pattern we have a candidate locale. The effective values are in the range 0 <C < 0.6 which shows that the reads from sequencing experiments and different types of sRNA co-overlap in a wide variety of patterns, thus the clustering co-efficient reflects the structure of the potential locale. Locales in which the sRNA reads overlap in a serial manner on the reference one after the other in a 'fallen domino' sort of pattern will have lower clustering coefficients, whereas locales in which sRNA reads are piled high on the reference, each overlapping many other sRNA reads more akin to the bricks in a wall will have higher clustering coefficients. The exact value of the clustering coefficient cut-off could conceivably be manipulated to narrow ranges to find locales with specific sRNA alignment patterns, although in this paper the aim is to retain as wide a selection as possible.
Reproducibility of results at different parameter settings
Genomic features with sRNA locales
Standalone Perl version
Our algorithm has been implemented in Perl  to provide an easy to run multi-platform package that can be incorporated easily into analysis pipelines. This implementation is limited only by local system resources. To gain optimal performance from graph analyses which can be computationally expensive, we have used the Boost Graph Library , implemented in C++ and available free to academic users under the Boost Graph License and the Perl interface Boost-Graph module , available under the GNU public license . Both of these pieces of software are pre-requisites for running the implementation. Our implementation is released under GPL3 . The Perl implementation requires as input a GFF format file  describing the alignment of sRNAs to the reference genome. As guide to performance, with the 213,799 mapped sRNAs in the Arabidopsis flower data , our Perl implementation ran in 37 minutes on an AMD64 IBM Intellistation Desktop with 2 Gb of RAM.The Perl implementation can be obtained from github .
We have created an algorithm that uses a graph theoretical approach to identify sRNA generative locales from high-throughput sequencing data. Despite the huge evolutionary distance between Arabidopsis and mouse the algorithm was capable of correctly identifying locales with very high sensitivity and with similar patterns of sensitivity for both of the species, suggesting that it has applicability across the plant and animal kingdoms. The sets of locales generated by the algorithm's user-definable parameters M and C are robust to small changes over the possible range whereas larger differences have greater effects indicating that the algorithm is both robust and responsive. With our stand-alone Perl implementation it is possible for a user carry out a parameter scan at the start of an analysis to identify the parameter values of greatest sensitivity and specificity for their sequence set if necessary.
One difficulty all sRNA locus finding algorithms must deal with is the fact that not all sRNAs from high-throughput sequencing experiments will be 'functional' and depending on the sequencing protocol used many of the sRNAs could be a result of degradation processes which a researcher may not have interest in. The literature does not yet contain a consensus on what such a degradation locus may look like, making it difficult for algorithms to distinguish such locales from those of functional interest in any generally useful way at present. Nonetheless in such situations our algorithm can be of use in filtering out potential non-functional locales in cases where the researcher has prior expectation of the pattern formed by degradation products. For example in the case where degradation products have a distinctive visual pattern, representative locales matching the pattern can be identified visually in a genome browser and comparing an initial run of the algorithm with positions of the pattern. The clustering coefficients of the locales can then be used as a band-filter whereby any locales lower or higher than this can be presumed not to be from the same sort of degradation process.
As our algorithm uses only positional data of aligned sRNAs and the clustering coefficient cut-off to identify locales it is naturally sRNA class agnostic which mean it can be used to identify locales of many different kinds at once as well as, potentially, previously unknown classes of locales. Typically the number of locales called is many times greater than the number of locales known as RFAMs for a given species, for example in the M = 10, C = 0.4 set discussed in Figure 4 10,000 locales are predicted. This indicates that there are a huge number of sRNA generative locales and sRNAs not yet known, fully justifying the description of them as the dark matter of genetics. Undoubtedly there is much scope for many different methods for detection of sRNA locales. Furthermore, the identification and cataloguing of sRNA generative locales could help the development of methods that can predict generative locales de novo from genomic sequence.
Alignment of sequences to reference genomes
Publicly available data from small RNA deep sequencing experiments were downloaded from the Gene Expression Omnibus  with accession numbers GSM118373 (Arabidopsis thaliana)  and GSM314558 (Mus musculus) . RFAMs and sequences for each species were obtained from RFAM . Sequences were aligned to either the TAIR 8 Arabidopsis sequence or the mm9 mouse assembly build 37 hosted at UCSC , using SSAHA 3.1 . For sRNA alignment redundant sequence sets were used and only sequences matching to the reference with 100% identity over 100% of the sequence length were retained. Sequences aligning to more than one position on the reference genome were not removed or normalised in any way, meaning a sRNA that belongs to one position may appear as if it comes from many. Parsing and collation was done with custom Perl scripts.
To systematically determine the sensitivity and specificity of the algorithm, we carried out 'parameter scans', a series of runs of the algorithm on each dataset changing the value of one of the paramaters at each run. The M parameter (minimum inclusion distance) was tested at values of 5, 10, 20, 30, 50, 75, and 100. Early runs with the Arabidopsis data showed that results changed little when M values exceeded 100. Values of C were 0.1, 0.25, 0.4, 0.5, 0.6, 0.75 and 0.9.
Calculation of Sensitivity and Specificity
For sensitivity and specificity analyses, the number of true positives (TP) was calculated as the number of nucleotides in the genome with an RFAM alignment and a putative locale alignment. True negatives (TN) were calculated as the number of nucleotides in the reference genome with neither a filtered RFAM alignment nor a putative locale alignment. False positives (FP) were calculated as a nucleotide in the genome that aligned to a putative locale but had no RFAM aligned. False negatives (FN) were calculated as nucleotides in the genome with no putative locale aligned and an RFAM aligned.
For calculation of numbers of overlapping genomic features in different locales sets and relative to genome annotations Perl scripts were used. Reference annotations were obtained as GFF from TAIR .
Visualisation of Results
Contour graphs were created by using the R package akima  to carry out bivariate interpolation of the irregularly spaced parameter scan data onto a regularly spaced grid with the interp and filled.contour functions. Heatmaps were generated using MeV 4 
Availability and Requirements
Project name: NiBLS
Project home page: http://github.com/danmaclean/NiBLS
Operating system(s): Platform independent
Programming language: Perl
Other requirements: Perl 5.6 or higher, Perl Boost::Graph module, also under GPL and available from http://search.cpan.org/~dburdick/Boost-Graph-1.2/Graph.pm
License: GPL 3
Restrictions to use by non-academics: none
The authors wish to thank Dr Frank Schwach of the UEA for invaluable philosophical and technical contributions during the development of this algorithm. We thank Mike Burell for technical support. DM and DJS are supported by the Gatsby Charitable Foundation.
- MacLean D, Jones JDG, Studholme DJ: Application of 'Next Generation' sequencing technologies to microbial genetics. Nat Revs Microbiol 2009, 7(4):287–296.Google Scholar
- Baulcombe DC: RNA silencing in plants. Nature 2004, 431: 356–363. 10.1038/nature02874View ArticlePubMedGoogle Scholar
- Brodersen P, Voinnet O: The diversity of RNA silencing pathways in plants. Trends Genet 2006, 22: 268–280. 10.1016/j.tig.2006.03.003View ArticlePubMedGoogle Scholar
- Lippman Z, Martienssen R: The role of RNA interference in heterochromatic silencing. Nature 2004, 431: 364–370. 10.1038/nature02875View ArticlePubMedGoogle Scholar
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res 2001, 11: 1725–1729. 10.1101/gr.194201View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851–1858. 10.1101/gr.078212.108View ArticlePubMedPubMed CentralGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5):713–714. 10.1093/bioinformatics/btn025View ArticlePubMedGoogle Scholar
- Rajagopalan R, Vaucheret H, Trejo J, Bartel DP: A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana . Genes Dev 2006, 20: 3407–3425. 10.1101/gad.1476406View ArticlePubMedPubMed CentralGoogle Scholar
- Molnar A, Schwach F, Studholme DJ, Thuenemann EC, Baulcombe DC: miRNAs control gene expression in the single-cell alga Chlamydomonas reinhardtii . Nature 2007, 447: 1126–1129. 10.1038/nature05903View ArticlePubMedGoogle Scholar
- Mosher RA, Schwach F, Studholme D, Baulcombe DC: PolIVb influences RNA-directed DNA methylation independently of its role in siRNA biogenesis. Proc Nat Acad Sci USA 2008, 105: 3145–3150. 10.1073/pnas.0709632105View ArticlePubMedPubMed CentralGoogle Scholar
- Moxon S, Schwach F, Dalmay T, MacLean D, Studholme DJ, Moulton V: A toolkit for the analysis of large-scale plant small RNA datasets. Bioinformatics 2008, 24(19):2252–2253. 10.1093/bioinformatics/btn428View ArticlePubMedGoogle Scholar
- FriedlÃnder MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N: Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008, 26: 407–415. 10.1038/nbt1394View ArticlePubMedGoogle Scholar
- Chen HM, Li YH, Wu SH: Bioinformatic prediction and experimental validation of a microRNA-directed tandem trans-acting siRNA cascade in Arabidopsis . Proc Natl Acad Sci USA 2007, 104: 3318–3323. 10.1073/pnas.0611119104View ArticlePubMedPubMed CentralGoogle Scholar
- Bagnall AJ, Moxon S, Studholme D: Time-series data-mining algorithms for identifying short RNA. Arabidopsis thaliana, UEA Technical Report CMP-C07–02" 2008.Google Scholar
- Huber W, Carey VJ, Long L, Falcon S, Gentleman R: Graphs in molecular biology. BMC Bioinformatics 2007., 8(S8):Google Scholar
- Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 409–410. 10.1038/30918View ArticleGoogle Scholar
- Babiarz JE, Ruby JG, Wang Y, Bartel DP, Blelloch R: Mouse ES cells express endogenous shRNAs, siRNAs, and other Microprocessor-independent, Dicer-dependent small RNAs. Genes Dev 2008, 22: 2773–2785. 10.1101/gad.1705308View ArticlePubMedPubMed CentralGoogle Scholar
- Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP: Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 2006, 127(6):1193–1207. 10.1016/j.cell.2006.10.040View ArticlePubMedGoogle Scholar
- The Perl Directory[http://www.perl.org]
- Boost Graph Library[http://www.boost.org/]
- Boost-Graph-1.2 Perl module[http://search.cpan.org/~dburdick/Boost-Graph-1.2/Graph.pm]
- The GNU Public License Version 3[http://www.gnu.org/licenses/gpl-3.0.txt]
- GFF file format[http://www.sanger.ac.uk/Software/formats/GFF/]
- The Sainsbury Laboratory ncRNA webserver[http://github.com/danmaclean/NiBLS]
- The Gene Expression Omnibus[http://www.ncbi.nlm.nih.gov/geo/]
- The Arabidopsis Information Resource[http://arabidopsis.org]
- The UCSC Genome Bioinformatics Website[http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/]
- CRAN - Comprehensive R Archive Network, akima package[http://cran.r-project.org/web/packages/akima/index.html]
- Dudoit S, Gentleman RC, Quackenbush J: Open source software for the analysis of microarray data. Biotechniques 2003, (Suppl):45–51.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.