proTRAC - a software for probabilistic piRNA cluster detection, visualization and analysis
© Rosenkranz and Zischler; licensee BioMed Central Ltd. 2012
Received: 5 October 2011
Accepted: 10 January 2012
Published: 10 January 2012
Throughout the metazoan lineage, typically gonadal expressed Piwi proteins and their guiding piRNAs (~26-32nt in length) form a protective mechanism of RNA interference directed against the propagation of transposable elements (TEs). Most piRNAs are generated from genomic piRNA clusters. Annotation of experimentally obtained piRNAs from small RNA/cDNA-libraries and detection of genomic piRNA clusters are crucial for a thorough understanding of the still enigmatic piRNA pathway, especially in an evolutionary context. Currently, detection of piRNA clusters relies on bioinformatics rather than detection and sequencing of primary piRNA cluster transcripts and the stringency of the methods applied in different studies differs considerably. Additionally, not all important piRNA cluster characteristics were taken into account during bioinformatic processing. Depending on the applied method this can lead to: i) an accidentally underrepresentation of TE related piRNAs, ii) overlook duplicated clusters harboring few or no single-copy loci and iii) false positive annotation of clusters that are in fact just accumulations of multi-copy loci corresponding to frequently mapped reads, but are not transcribed to piRNA precursors.
We developed a software which detects and analyses piRNA clusters (proTRAC, probabilistic TRacking and Analysis of Clusters) based on quantifiable deviations from a hypothetical uniform distribution regarding the decisive piRNA cluster characteristics. We used piRNA sequences from human, macaque, mouse and rat to identify piRNA clusters in the respective species with proTRAC and compared the obtained results with piRNA cluster annotation from piRNABank and the results generated by different hitherto applied methods.
proTRAC identified clusters not annotated at piRNABank and rejected annotated clusters based on the absence of important features like strand asymmetry. We further show, that proTRAC detects clusters that are passed over if a minimum number of single-copy piRNA loci are required and that proTRAC assigns more sequence reads per cluster since it does not preclude frequently mapped reads from the analysis.
With proTRAC we provide a reliable tool for detection, visualization and analysis of piRNA clusters. Detected clusters are well supported by comprehensible probabilistic parameters and retain a maximum amount of information, thus overcoming the present conflict of sensitivity and specificity in piRNA cluster detection.
In a wide variety of animals, mainly germline expressed small RNAs - named Piwi interacting (pi)RNAs because of their interaction with effector Piwi proteins - play an important role as guiding RNAs in safeguarding the genome from the detrimental effects of actively transposing elements . Most piRNAs are encoded in strand specific genomic clusters ranging from <1kb to >100kb. Beside mono-directional clusters encoding piRNAs on only one strand, there are also bi-directional clusters whose halves encode piRNAs on opposite strands and where transcription starts in opposite directions from a centrally located promoter. In general, piRNA clusters are assumed to be transcribed into long single stranded precursors that are subject to subsequent processing, leading to mature piRNAs. In a process referred to as ping pong cycle , piRNA guided Piwi proteins cleave TE transcripts thus producing a second population of TE derived piRNAs. Although piRNA genesis shows signs of a quasi-random mechanism with partially overlapping sequences, piRNAs exhibit typical sequence characteristics, e.g. position specific frequency patterns. In mice, the cluster derived piRNA population exhibits a strong bias for Uridine at the 5'-end, whereas the transposon derived population is biased for Adenine at position 10. In Drosophila, the situation is converse . However, many questions concerning this process, as well as the functional role of piRNAs beyond transposon silencing (only 17% of mouse piRNAs correspond to TE sequences with the majority mapping only once to the genome ) remain elusive.
Research on piRNA biogenesis and function, as well as the successful targeting of questions related to the possible coevolution of the Piwi/piRNA system, will involve comparative studies of homologous piRNA clusters [5, 6]. Therefore, a reliable bioinformatic piRNA cluster detection tool is imperative, especially in light of the ever exceeding amount of data obtained from next generation sequencing (NGS) that requires robust automated bioinformatic solutions.
Summary of piRNA cluster identification methods and results from previous studies
criteria for detection of clusters
Aravin et al. 2006 
at least 4 piRNA loci per cluster, maximum distance between two piRNA loci 15 kb
Girard et al. 2006 
at least 5 piRNA loci/5kb, at least 10 piRNA loci per cluster, only sequence reads that mapped 1-5 times to the genome were considered, sequence reads mapped to genome allowing up to 2 mismatches
Lau et al. 2006 
at least 20 single-copy loci, at least 1 piRNA locus/kb
Lakshmi and Agrawal 2007 
at least n** single-copy loci/20kb, at least 2 piRNA loci/kb
Grivna et al. 2006 
Watanabe et al. 2006 
p < = 0.05, where p = (s/S)n-1×NCn S: genome size (bp), s: cluster size (bp), N: total number of sequence hits, n: number of hits in cluster
Based on the imperfection of the currently available algorithms, and since the essential differences between them may hamper upcoming comparative studies in this field, we developed the user-friendly cluster detection software proTRAC, which uses SeqMap output files  for fast and customized detection, visualization and analysis of piRNA clusters in genomes, ensuring reproducibility and comparability. proTRAC analyzes the entirety of mapped sequence reads and identifies clusters based on significant deviations from the hypothetical uniform distribution regarding the density of mapped reads, strand asymmetry, frequency of putative piRNA loci with T at position 1 (1T) or A at position 10 (10A), as well as the amount of putative piRNA loci within the typical piRNA length range and the quantity of loci corresponding to infrequently mapped reads. The latter criterion represents a convenient benchmark since many piRNA loci sequences within one cluster exhibit low redundancy or are even unique within the genome. This causes an increased number of normalized hits within a piRNA cluster (normalized by the number of genomic hits produced by the respective sequence reads) as compared to the value that one would expect if a corresponding number of hits was randomly selected from the entirety of mapped reads. In the following we show, that the proTRAC algorithm provides considerable advantages compared to the currently applied methods.
In order to obtain hit density threshold values with a resolution more precise than in steps of 1 hit/kb, proTRAC calculates probabilities (P) for non-integer hit numbers (k∈ℝ) per kb with an increment of 0.1 by using factorials deduced from the gamma function: Γ(n) = (n-1)!. The stepwise calculation of P(x ≥ k) continues until the probability reaches the significance level. Then, the minimum hit density is defined as k hits/kb.
The minimum number of putative piRNA loci per piRNA cluster corresponds to the minimum number of loci needed to reach all stated score values (log10 of reciprocal probabilities) with the given dataset.
If necessary, since overlapping sections of proper hit density can result in one large cluster candidate which in sum falls below the minimum hit density, proTRAC performs a stepwise clipping of peripheral hits to find the largest possible cluster candidate. During a progressive process with an increasing number of hits, in each step all possible combinations of upstream and downstream hits are clipped from the cluster candidate ends and the effect on hit density is assessed for each combination. This process continues until a sufficient combination with the minimal number of hits is found and the hit density of the remaining cluster, comprising the highest possible number of putative piRNA loci, exceeds the required minimum. The removed hits, potentially forming a separate cluster, are assessed subsequently as a separate cluster candidate.
Since piRNA clusters are typically organized in a mono- or bidirectional manner proTRAC determines directionality by comparing the degree of mono- and bi-directionality. The degree of mono-directionality is simply given by the proportion of sequence reads encoded on the main strand (the strand which encodes the majority of mapped sequence reads). To determine the degree of bi-directionality, each cluster is split stepwise between every pair of hits that yields two fragments with each fragment comprising at least 25% of all hits or at least 10 hits respectively. Subsequently, the proportion of sequence reads encoded on the main strand is computed independently for each fragment and summed pro rata. The highest attainable value accounts for the degree of bi-directionality. If one or both values exceed the user defined directionality threshold, the cluster is assigned to the respective directionality category. Otherwise it is assessed to be non-directional.
Not to rely solely on tracing regions that exhibit a high hit density, and rather additionally consider the characteristics of the clustered putative piRNA loci, proTRAC now verifies each cluster candidate by examining its: i) number of normalized hits to total hits ratio, ii) extent of strand bias, iii) proportion of putative piRNA loci with 1T or 10A respectively and iv) proportion of putative piRNA loci within the typical length range. For each parameter proTRAC assigns score values based on a probabilistic scoring system which evaluates the probabilities to obtain the observed cluster characteristics in the light of the given dataset (e.g. a score value of 2.0 corresponds to a probability of 0.01). Regarding strand bias, we assume equal probabilities for one mapped sequence read to be encoded on either plus- or minus strand. Furthermore, we presume that the number of hits per cluster is very small compared to the total number of mapped sequence reads, so that proTRAC can calculate probabilities in a sampling-with-replacement fashion, which accelerates computation without a noticeable effect on the results.
In general, the calculation of score values implies calculation of factorials corresponding to the total number of loci within one cluster, which gets computationally more expensive with an increasing number of hits per cluster candidate. If one cluster candidate comprises more than 1000 putative piRNA loci, its total number of putative piRNA loci is set to 1000 and the other parameters are scaled down accordingly. This may lead to an underestimation of score values. However, this will not lead to rejection of true clusters, since the computation of probability by default aborts once the probability falls below 1/10100 (score = 100), which is often the case with clusters of this size. Moreover, the minimum set score values should reasonably not exceed 10.
Since it is possible, that real piRNA clusters are concealed by the presence of loci that correspond to frequently mapped sequence reads that do not originate from the cluster in question but distort its strand asymmetry, proTRAC optionally reevaluates rejected cluster candidates in that case, considering only loci that correspond to sequence reads that mapped not more than a stated maximum times to the genome, similar to the method applied by Girard et al. 2006 .
Once clusters are identified and validated, proTRAC calculates the probability for each cluster, whether any of its observed deviations from the hypothetical uniform distribution came off by chance and deduces the probability for 0, 1-or-more and 2-or-more mistakenly annotated clusters within the obtained set of detected clusters.
Beside a separate FASTA file and picture file for each cluster, proTRAC can optionally output a summary file which lists all detected clusters with corresponding cluster data, optionally sorted by the summed score values.
Results and Discussion
Initial testing of program performance
In order to test the performance of proTRAC and to determine suitable thresholds for score values, nonredundant human (32046), mouse (34520) and rat (31357) piRNA sequences downloaded from National Center for Biotechnology Information (NCBI) nucleotide database  were mapped to the respective genomes (human NCBI Build37, mouse NCBI Build 37, rat RGSC3.4) via SeqMap (allowing only perfect matches, using the option/output_all_matches) and the SeqMap output files were used as input files for proTRAC.
Detection of human, mouse and rat piRNA clusters using proTRAC with different thresholds
minimum piRNA density [hits/kb]**
tracked clusters (mono-, bi-, non-directional)***
p 0/≥2 mistakenly annotated clusters
0.1 - 1.4
368 (202, 120, 46)
0.6 - 2.4
187 (139, 40, 8)
1.5 - 3.5
119 (99, 19, 1)
1.5 - 2.3
242 (163, 63, 16)
2.5 - 3.4
171 (129, 38, 4)
3.7 - 4.7
151 (123, 25, 3)
1.2 - 1.7
186 (139, 43, 4)
1.8 - 2.7
168 (138, 28, 2)
2.8 - 3.7
162 (136, 26, 0)
The inquiry of up to four different score values, corresponding to a particular probability, as well as the applied significance-based minimum hit density for each cluster, lead on to the statistical problem of multiple testing. According to the Bonferroni correction, we obtain the significance level α = 0.05 for the whole family of n (n = 5) tests by applying a significance level of 0.01 for each individual test (α/n). On this basis, we assume α = 0.01 (for hit density) and score values ≥2 to be suitable thresholds to yield results of adequate reliability.
For further assessment, we compared proTRAC results obtained with these thresholds with the clusters annotated at piRNABank as well as the results from Lau et al. 2006 . Finally we repeated cluster detection with proTRAC taking only those sequence reads into account, that mapped 1-5 times to the genome, since only these sequence reads were considered by Girard et al. 2006 . The resulting piRNA clusters were checked for the absence of loci that correspond to the excluded sequence reads.
proTRAC results compared to piRNABank annotation
Comparison of proTRAC results with piRNABank annotation
proTRAC and piRNABank*
proTRAC results compared to results from Lau et al. 2006 
The piRNA cluster criteria applied by Lau and colleagues (cf. table 1) are very stringent and suitable to mostly avoid false positive piRNA cluster annotation. However, they may lead to being insensitive regarding piRNA clusters that arose from recent paralogization events thus harboring no or only a few single-copy piRNA loci.
We searched for piRNA clusters applying the same sequence datasets as used by Lau and colleagues (piRNA sequences from Lau et al. 2006  mapped to mouse build mm7 and rat build rn3). proTRAC confirmed all previously annotated clusters with the only exception of the X-chromosomal rat piRNA cluster 92 which undercuts the minimum required hit density of 2.7 hits/kb for the X chromosome (p ≤ 0.01). In addition, proTRAC detected 37 further mouse (total number of putative piRNA loci: 1255, normalized hits: 445, single-copy loci: 223) and 65 further rat (total number of putative piRNA loci: 6585, normalized hits: 1605, single-copy loci: 433) clusters. As an example, figure 4c shows two mouse piRNA clusters on chromosome 5, with the upper cluster arising from a 3.5kb duplication of the lower cluster. Although the duplicate harbors 175 putative piRNA loci (75.8 normalized hits), it is not annotated as piRNA cluster by Lau et al.  since it contains only 10 single-copy loci as a consequence of duplication.
proTRAC results compared to results from Girard et al. 2006 
In order to avoid false positive piRNA cluster annotation, Girard et al.  considered only sequence reads that mapped 1-5 times to the genome. We reproduced this method by running proTRAC taking only these infrequently mapped reads into account and applied a minimum hit density of 1 hits/kb and a minimum of 10 hits in total per piRNA cluster (cf. table 1) with the aim to assess the implications of rejecting those reads from the analysis. Excluding frequently mapped reads led to a decrease of the total number of detected clusters from 187 to 179, withal 47 of which were prior rejected because they lack a significant strand bias.
Regarding the assigned reads of the remaining 132 clusters identified with both methods, the total number of assigned reads that mapped once to the genome increased slightly from 12763 to 12836 which is caused by the lower hit density threshold (1 hits/kb) that led to an extension of piRNA cluster borders. However the number of assigned multiple mapping reads decreased drastically from 6560 to 3695. Excluding frequently mapped reads may also have a substantial influence on the resulting sequence composition of clustered piRNA loci since this will automatically exclude sequences that match to TEs with high copy number like LINEs or SINEs. In this way sequences that contribute to a major function of the Piwi-piRNA system, which is posttranscriptional silencing of actively transposing elements could be accidentally underrepresented.
Figure 4d shows a human piRNA cluster on chromosome 2, detected taking all sequence reads into account (top) and rejecting sequence reads that mapped more than 5 times to the genome (bottom). In the former case, 94 multiple mapping reads are assigned to this cluster, compared to 8 multiple mapping reads in the latter case.
proTRAC performance in de novo piRNA cluster detection
Four small putative piRNA clusters (comprising 10 to 13 loci) were detected with the method from Girard et al.  but were rejected by proTRAC. Three of them because they lacked a significant strand bias, the forth because the respective putative piRNA loci did not show a significant enrichment for 1T or 10A. Seven clusters (with the largest one comprising 82 putative piRNA loci) were solely detected with proTRAC since each of them harbours less than ten loci that correspond to sequence reads producing 1-5 genomic hits. Nonetheless they exhibit all typical piRNA cluster characteristics. Figures and FASTA sequence files as well as a summary file of the detected piRNA clusters in rhesus macaque are available as additional file 8 (proTRAC_results_macaca.zip).
Furthermore, the sequence composition regarding different TE classes is apparently biased. While Alu related sequence reads constitute the major fraction of mapped sequence reads that match perfectly to TEs (~44% if two mismatches are allowed) in clusters detected with proTRAC, Alu matching sequence reads are completely absent and LTR matching sequence reads make up the vast majority of TE matching sequence reads in clusters detected with the Girard et al.  method. Caused by the extremely high copy number of Alu elements in primate genomes, being the most abundant mobile element of all , every Alu related and clustered sequence read from our dataset mapped more than five times to the macaque genome and hence was excluded from the dataset for cluster detection with the Girard et al.  method. Obviously, drawing conclusions in relation to specific piRNA functions from piRNA sequence composition relies on an unbiased representation of all existing piRNA loci in detected clusters which is one of the advantages of the proTRAC software. In summary, proTRAC detected more clusters than could be detected using the method applied by Lau et al.  and assigned more sequence reads to clusters than the method applied by Girard et al.  were rejection of frequently mapped reads led to an inadvertently underrepresentation as well as compositional bias of TE matching sequence reads.
proTRAC provides a powerful tool for detection, visualization and analysis of piRNA clusters. Unlike hitherto applied methods for piRNA cluster identification, the proTRAC algorithm considers all hitherto described crucial piRNA cluster characteristics. Thus, piRNA clusters detected with proTRAC are well supported by comprehensible probabilistic parameters. In addition, proTRAC retains more information since it does not a priori preclude frequently mapped reads, which exclusively contribute to posttranscriptional transposon silencing, which was shown to lead to more assigned sequence reads per cluster in most cases and prevents accidentally underrepresentation and compositional bias of TE matching sequence reads. Moreover, proTRAC potentially allows clusters with only a few or even without single-copy loci, which leads to the detection of piRNA clusters arisen from segmental duplications that are passed over when using algorithms that require a fixed minimum number of single-copy loci.
Availability and Requirements
Project name: proTRAC
Project home pages:
Operating system(s): Platform independent (an executable file is available for Windows systems)
Programming language: Perl
Other requirements: Perl (no other requirements for executable file)
License: Academic Free License (AFL)
Any restrictions to use by non-academics: For commercial use please contact DR.
Thanks go to Nadir Kucun and Markus Pauly for discussion and helpful advice related to the mathematical issues. The staff of the working group Zischler is gratefully acknowledged for helpful advice. Sincere thanks are given to Michael Eichert for proofreading the manuscript and providing conducive improvement suggestions. Finally, our thanks go to the anonymous reviewers for their constructive criticism that contributed to the improvement of the manuscripts quality. This work was supported by the Schwerpunkt "Rechnergestuetzte Forschungsmethoden in den Naturwissenschaften (SRFN)", Johannes Gutenberg-University Mainz, Germany.
- Thomson T, Lin H: The biogenesis and function of PIWI proteins and piRNAs: progress and prospect. Annu Rev Cell Dev Biol 2009, 25: 355–376. 10.1146/annurev.cellbio.24.110707.175327PubMed CentralView ArticlePubMed
- Brennecke J, Aravin AA, Stark A, Dus M, Kellis M, Sachidanandam R, Hannon GJ: Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 2007, 128: 1089–1103. 10.1016/j.cell.2007.01.043View ArticlePubMed
- Aravin AA, Sachidanandam R, Bourc'his D, Schaefer C, Pezic D, Fejes Toth K, Bestor T, Hannon GJ: A piRNA pathway primed by individual transposons is linked to de novo DNA methylation in mice. Mol Cell 2008, 31(6):785–799. 10.1016/j.molcel.2008.09.003PubMed CentralView ArticlePubMed
- Girard A, Sachidanandam R, Hannon GJ, Carmell MA: A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature 2006, 442: 199–202.PubMed
- Lau NC, Seto AG, Kim J, Kuramochi-Miyagawa S, Nakano T, Bartel DP, Kingston RE: Characterization of the piRNA complex from rat testes. Science 2006, 313: 305–306. 10.1126/science.1131186View Article
- Assis R, Kondrashov AS: Rapid repetitive element-mediated expansion of piRNA clusters in mammalian evolution. Proc Natl Acad Sci USA 2009, 17: 7079–7082.View Article
- Lakshmi SS, Agrawal S: piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res 2008, 36(Database issue):D173-D177.
- Jiang H, Wong HW: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008, 24: 2395–2396. 10.1093/bioinformatics/btn429PubMed CentralView ArticlePubMed
- NCBI nucleotide database[http://www.ncbi.nlm.nih.gov/nuccore]
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110(1–4):462–467. 10.1159/000084979View ArticlePubMed
- Batzer M, Deininger P: Alu repeats and human genomic diversity. Nat Rev Genet 2002, 3(5):370–379. 10.1038/nrg798View ArticlePubMed
- Grivna ST, Pyhtila B, Lin H: MIWI associates with translational machinery and PIWI-interacting RNAs (piRNAs) in regulating spermatogenesis. Proc Natl Acad Sci USA 2006, 103: 13415–13420. 10.1073/pnas.0605506103PubMed CentralView ArticlePubMed
- Watanabe T, Takeda A, Tsukiyama T, Mise K, Okuno T, Sasaki H, Minami N, Imai H: Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev 2006, 20(13):1732–1743. 10.1101/gad.1425706PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.