 Software
 Open Access
 Published:
Sensitive and highly resolved identification of RNAprotein interaction sites in PARCLIP data
BMC Bioinformatics volume 16, Article number: 32 (2015)
Abstract
Background
PARCLIP is a recently developed Next Generation Sequencingbased method enabling transcriptomewide identification of interaction sites between RNA and RNAbinding proteins. The PARCLIP procedure induces specific base transitions that originate from sites of RNAprotein interactions and can therefore guide the identification of binding sites. However, additional sources of transitions, such as cell typespecific SNPs and sequencing errors, challenge the inference of binding sites and suitable statistical approaches are crucial to control false discovery rates. In addition, a highly resolved delineation of binding sites followed by an extensive downstream analysis is necessary for a comprehensive characterization of the protein binding preferences and the subsequent design of validation experiments.
Results
We present a statistical and computational framework for PARCLIP data analysis. We developed a sensitive transitioncentered algorithm specifically designed to resolve protein binding sites at high resolution in PARCLIP data. Our method employes a Bayesian network approach to associate posterior logodds with the observed transitions, providing an overall quantification of the confidence in RNAprotein interaction. We use published PARCLIP data to demonstrate the advantages of our approach, which compares favorably with alternative algorithms. Lastly, by integrating RNASeq data we compute conservative experimentallybased false discovery rates of our method and demonstrate the high precision of our strategy.
Conclusions
Our method is implemented in the R package wavClusteR 2.0. The package is distributed under the GPL2 license and is available from BioConductor at http://www.bioconductor.org/packages/devel/bioc/html/wavClusteR.html.
Background
RNAbinding proteins (RBPs) play a fundamental role in virtually all aspects of RNA metabolism, including the regulation of RNA localization, stability, translation or degradation [1]. These proteins extensively contribute to the control of gene expression by regulating the life cycle of microRNAs, where the RBPRNA interaction is mediated by specific RNA sequence motifs or secondary structures [2]. Interestingly, recent studies showed that deregulation of RBP expression or mutation of cognate binding sites are causally related to several human diseases including cancer [36]. Many of these studies have been made possible by the development of new methods mapping interaction sites in a comprehensive and systematic manner [7]. Particularly, the PhotoActivatable Ribonucleosideenhanced CrossLinking and ImmunoPrecipitation (PARCLIP) method made it possible to identify highly specific RBPRNA interactions by generating a distinct imprint in the bound RNA [810]. In this method, cells are cultured with a ribonucleoside analogue, e.g. 4thiouridine (4SU), which becomes incorporated into nascent RNA molecules. Then, in vivo UV crosslinking at a specific wavelength is performed to stabilize the RNARBP interaction, resulting in a covalently linked RNARBP complex. Next, the complex is isolated, the protein digested, the RNA molecules recovered and reverse transcribed to cDNA. NextGeneration Sequencing is then used to determine the identity of these molecules.
Importantly, the reverse transcription induces specific base transitions at the original crosslinked sites, which can be used to identify highconfidence RBPRNA binding interactions in PARCLIP data [11]. Based on the induction of transitions, different strategies have been developed for the identification RNARBP interactions in PARCLIP data. CLIPZ [12], a widely adopted method for PARCLIP data analysis, ranks protein binding sites, referred to as clusters, based on their total number of observed transitions. PARalyzer [13], in contrast, utilizes transitions to fit a Gaussian kernel density estimate classifier in order to discriminate signal from noise at interaction sites with the aim to infer the protein binding sites. The cluster boundaries are determined by extending the interaction sites using a fixed threshold on the coverage or by applying an arbitrary window size. PIPECLIP [14], a very recent tool designed for CLIPseq data analysis, employs a binomial model and performs comparably to PARalyzer in identifying binding sites in PARCLIP data. However, these methods fall short on important aspects of PARCLIP data analysis. (i) As experimental validation of RNARBP interactions is laborious and only feasible on small scale, statistically rigorous approaches are needed to rank clusters and identify highconfidence subsets amenable to experimental testing. (ii) PARCLIP data allows for a highly resolved identification of the RBP binding sites. However, to delineate cluster boundaries accurately, a sensitive peak caller tailored to this problem is needed. (iii) Not every observed transition is induced by crosslinking, i.e. by PARCLIP. Rather, sequencing errors, RNA contaminants and cell typespecific SNPs represent additional sources of transitions which can lead to the detection of a considerable number of false positives [11]. Attempts to limit the false discovery rate (FDR) by requiring a minimum number of interaction sites per cluster, as recommended by [13,15], can largely reduce sensitivity as it will inevitably miss all true clusters containing less interaction sites. In fact, given that the nucleotide composition of protein binding sites can greatly vary, clusters exhibiting a few PARCLIP induced transitions can still correspond to bona fide interaction sites.
In this work, we specifically address the three points outlined above. We introduce a Bayesian model to identify PARCLIP induced highconfidence transitions extending our recent work in [11]. We detail a new, coveragebased algorithm for the identification of cluster boundaries termed MiniRank Norm (MRN) and show that it substantially improves resolution of binding sites over other methods. We test our algorithm on published data and compare its performance with PARalyzer. We demonstrate that wavClusteR outperforms alternative algorithms both in detection and resolution of clusters. By using a transition frequencybased strategy our method overcomes the reduction in sensitivity and specificity which characterizes hard thresholding approaches such as PARalyzer. Lastly, we evaluate the performance of our algorithm by integrating matched RNASeq data to compute conservative FDR estimates, confirming that highconfidence transitions identified by our approach are PARCLIP specific.
Methods
Model
Let i be a genomic position spanned by a number of reads after the short read alignment. The relative substitution frequency (RSF) x at position i is the ratio between the number of base substitutions y within the reads (e.g. T → C) aligned at i relative to the total coverage z at the site, and can be interpreted as an estimate of the corresponding transition probability. We recently introduced a nonparametric, twocomponent mixture model to discriminate PARCLIPspecific from nonexperimentallyinduced transitions [11]. In our model, the first and second component represent nonexperimental and PARCLIPinduced transitions, respectively.
Here, we developed a model that integrates information over the entire RSF range. For this purpose, we consider a Bayesian network representation of our mixture model (Figure 1) corresponding to a chain of three random variables (Θ,X,Y). Here Θ∈{1,2} encodes the source of transition (nonexperimental [ Θ=1] or experimental [ Θ=2]), X∈(0,1] represents a relative substitution frequency (RSF) value and Y is the number of observed transitions at a given position. According to this model all observed substitutions are thought to be generated as follows. First, a binary random number Θ is drawn from a Bernoulli distribution Bern(λ) with p(Θ=1)=λ. The value of Θ determines the component used to sample the base substitution probability x. Second, the number of observed transitions Y is obtained from a binomial distribution Bin(z,x), where the sample size z corresponds to the total number of aligned reads at a given position. According to our model, p(Y,X,Θ) factorizes as:
Therefore, the posterior probability that a given number of transitions was induced by either source can be computed as:
The resulting posterior probability marginalizes out X and thereby integrates information over the entire RSF range. Using p(Θ=θY), we then compute the logodds ratio for each transition as log(p(Θ=2Y)/p(Θ=1Y) and define the relative logodds ratio for a cluster as the sum of all logodds within a cluster, normalized to the total number of bases susceptible to crosslinking.
Cluster boundaries identification
Let \({\mathcal {C}}(i)\) be the coverage at position i and \({\mathcal {C}}(i,j)\) be the sequence of coverage values \(({\mathcal {C}}(i), {\mathcal {C}}(i+1), \ldots, {\mathcal {C}}(j)), i\leq j\). Similarly, let \({\mathcal {S}}(i)\), \({\mathcal {E}}(i)\) be the positive and negative differences in the coverage function, i.e. the number of read alignments starting or ending at position i, respectively, and \({\mathcal {S}}(i,j) = ({\mathcal {S}}(i), {\mathcal {S}}(i+1), \ldots, {\mathcal {S}}(j))\), \({\mathcal {E}}(i,j) = ({\mathcal {E}}(i), {\mathcal {E}}(i+1), \ldots, {\mathcal {E}}(j))\), i≤j be the extended notation to intervals. We then consider the set of all genomic positions corresponding to highconfidence transitions (hcTs) of a given type (e.g. T →C). In the following paragraphs, we detail the steps performed by the MRN algorithm.
Estimate local background threshold
For each \(i_{t} \in {\mathcal {T}}\), we consider the largest nonzero coverage window w containing i _{ t } and compute all putative cluster start \(C_{s}=({\mathcal {S}}(i) \geq \delta)_{i \in w: i\leq i_{t}}\) and cluster end \(C_{e}=({\mathcal {E}}(i) \geq \delta)_{i \in w: i\geq i_{t}}\) positions therein, where δ is an integer background threshold (Figure 2A). To account for large variations in coverage between distinct genomic regions, we estimate noise levels in the coverage function at positions proximal to hcTs and use this estimate to compute a windowspecific threshold δ _{ w } as follows. We draw a random sample \({U} \subseteq {\mathcal {T}}\) of size N (here N=1000) and consider \({\tilde {W}}=((i_{t}n, i_{t}+n))_{t \in U}\), i.e. a sequence of genomic intervals centered on each i _{ t }. By default, n=25. Then, we compute normalized nonzero coverage differences D ^{+} within \({\tilde {W}}\). Let
be the sequence of all normalized coverage fluctuations observed within \(\tilde W\). We model the distribution of nonzero fluctuations D ^{+} as a mixture of two Gaussian components with unequal variance. The first component (k=1) models noisy fluctuations, while the second component (k=2) models sharp jumps in the coverage function. Model parameters are estimated using expectationmaximization and responsibilities are used to compute the coefficient c= min{x∈D ^{+}:p(k=2x)≥p(k=1x)}, which along with \(\text {max}({\mathcal S}(w))\) and \(\text {max}({\mathcal E}(w))\) determines δ _{ w } within each window. Alternatively, the user can define a global threshold, e.g. by selecting a fraction of the minimum coverage requirement m _{1} at hcTs or of the mode m _{2} of the coverage distribution at these sites. A choice δ=⌊0.1·max(m _{1},m _{2})⌋, where ⌊x⌋ is the largest integer not greater than x, empirically works well on all analyzed PARCLIP datasets.
Identify candidate cluster boundaries
The width \(W=(lk+1)_{(k,l) \in C_{s} \times C_{e}}\) of each candidate cluster and signal levels at C _{ s } and C _{ e }, namely \(n_{s}=({\mathcal S}(i))_{i \in C_{s}}\) and \(n_{e}=({\mathcal E}(i))_{i \in C_{e}}\), respectively, are computed (Figure 2A).
Represent candidate clusters as rank vectors
We represent each candidate cluster characterized by the vector \((n_{s_{k}}, n_{e_{l}}, w_{\textit {kl}})\) as a rank vector \(\phantom {\dot {i}\!}\mathbf r_{\textit {kl}}=(r_{s_{k}}, r_{e_{l}}, r_{w_{\textit {kl}}})\) (Figure 2A), where r _{ s } is the ranking of start positions (with ties resolved 5’ →3’, i.e. with increasing values at each index set of ties), r _{ e }=π(n _{ e }) is the ranking of end positions (with ties resolved 3’ →5’) and r _{ w }=π(−W) is the ranking of candidate cluster widths (with ties resolved by minimum ranking).
Identify the optimal solution
The expected coverage of a RBP binding site corresponds to a sharply peaked rectangle function [12], represented by the rank vector O=(0,0,0). We identify the optimal solution as the candidate cluster that is closest to O in terms of the euclidean norm of its rank vector r _{ kl }. Although multiple optimal solutions can occur, the choice of the euclidean norm strongly disfavors large clusters by construction. Therefore, in case of ties, the shortest cluster is reported, as it corresponds to the binding site with higher signal at the cluster boundaries as compared to any other optimal solution.
Comparison with PARalyzer
Data processing
All PARCLIP data sets were processed as previously described [11]. Briefly, adapter sequences were removed. Reads of length ≥15 passing the Illumina quality filter were aligned to the human reference assembly ‘hg19’ using Bowtie [16], allowing at most one mismatch. The following Bowtie parameters were specified: best chunkmbs 512 n 1 S M 100.
Parametrization
While our method depends on few parameters  essentially the minimum required coverage at transitions and the posterior probability cutoff  PARalyzer contains a more extensive parametrization. To allow for a fair comparison we selected mainly default and recommended parameter values. Both methods differ on what constitutes the minimum required evidence for a binding site. While wavClusteR poses a cutoff c on the strandspecific coverage at hcTs, PARalyzer applies a threshold on the number of reads forming a read group. To compare the performance of the two methods, we first learn the PARCLIP specific RSF interval by fitting the mixture model using c=20 and then exhaustively identify binding sites with wavClusteR starting from hcTs with RSF values within the Bayes classifier and c=1. We run PARalyzer using two different values for the minimum conversion locations for clusters n, namely n=1 (default value) or n=2 (recommended value), respectively. The choice of this parameter crucially determines sensitivity and recall of the algorithm. The full set of PARalyzer parameters used for the comparison is provided in Additional file 1, Section 1.1. Only clusters exhibiting at least one T to C transition with a strandspecific coverage of 10 are retained for the comparison, with no requirement on its RSF value to enable a fair comparison between algorithms.
microRNA seed mapping
We considered a set of microRNAs (miRNAs) previously shown to be expressed in HEK293 cells [17] and computed the enrichment of miRNA seeds within each set of cluster sequences relative to a random control. The latter was obtained by generating 10^{4} samples of dinucleotide shuffled microRNA sequences and the mean relative seed count was used as background estimate. To allow for a fair comparison PARalyzerspecific clusters were extended to the median length of wavClusteRspecific clusters.
Computing false discovery rates
To provide experimentallybased estimates of the False Discovery Rate (FDR) of our method, we analyze the MOV10 PARCLIP data set and a matched total RNASeq profile from the same HEK293 cells used to perform the PARCLIP experiment [11].
FDR of highconfidence interaction sites
We estimated a highly conservative FDR upper bound and a FDR lower bound as a function of the RSF as follows. Let be the set of genomic positions with a minimum coverage of 20 within the PARCLIP and the RNASeq data set and at least one transition within PARCLIP. Each element of is associated with a specific PARCLIP RSF value. We partition the RSF interval (0,1] into ten equally spaced intervals and for each range we identify the genomic positions \(\mathcal {P} \subseteq \mathcal {G}\) such that the associated RSF values fall into the RSF interval. We compute a conservative FDR upper bound by regarding as FPs all genomic positions \(\mathcal {U} \subseteq \mathcal {P}\) showing at least one transition in the RNASeq data, irrespective of their RSF values. The FDR upper bound is therefore \({\mathcal U}/{\mathcal P}\). Similarly, we compute the FDR lower bound by considering FPs all genomic positions \(\mathcal {L} \subseteq \mathcal {P}\) exhibiting an RNASeqbased RSF within the same interval, and compute the FDR lower bound as \({\mathcal L}/{\mathcal P}\).
FDR clusters
We rank clusters by decreasing values of relative logodds and consider the resulting top n clusters. For each cluster in the ranking, we identify the set of genomic positions with hcTs localizing therein and compute the RNASeqbased RSF \(x_{t},t\in \mathcal {T}\). To compute conservative FDR estimates, we regard a cluster as FP if there exist at least one \(t\in \mathcal {T}\) such that a≤x _{ t }≤b, where [ a,b] is the PARCLIPspecific RSF support resulting from applying a given posterior probability cutoff. This condition is highly conservative, as a single true hcT within a cluster with multiple detected hcT suffices to correctly identify the binding site. Similarly, we compute less conservative FDR values by regarding a binding site as FP if every \(t\in \mathcal {T}\) satisfies a≤x _{ t }≤b.
Implementation
The algorithms described above are implemented in version 2.0 of our R package wavClusteR [11]. The MRN algorithm is implemented using parallelization, as binding sites are independent of each other. A graphical outline of the data analysis workflow offered by wavClusteR is illustrated in Additional file 1, 2.1.
Results and discussion
First, we show that the MRN algorithm provides sensitive and highly resolved identification of clusters. We then apply our method to published PARCLIP data sets and demonstrate part of our postprocessing pipeline. We compare our algorithm to PARalyzer [13] using published AGO2 PARCLIP data. Finally, we report estimates of FDRs of highconfidence transitions (hcTs) and of inferred protein binding sites by integrating matched RNASeq data.
Sensitive delineation of clusters at high resolution
We previously proposed an algorithm to resolve cluster boundaries by computing the continuous wavelet transform (CWT) of the coverage function around hcTs. However, this method is prone to false negatives, i.e. hcTs that are not assigned to a cluster, when genomic regions with complex coverage geometry and high variance of local signaltonoise ratios are encountered. To address this issue and increase the sensitivity of our peak calling procedure, we developed a CWTindependent algorithm which we termed minirank norm (MRN). The MRN algorithm (see Methods and Figure 2) solves an optimization problem in which hcTs are first employed to reduce the search space. Signal and noise in the coverage function are locally separated by modeling coverage fluctuations and integrating knowledge of the geometric properties of RBP binding sites. By assuming that the expected coverage of a cluster corresponds to a sharply peaked rectangle function [12], all candidate cluster boundaries spanning a highconfidence PARCLIP signal are then exhaustively evaluated and ranked accounting for this prior knowledge. By design, our algorithm favors sharp boundaries and short cluster widths, and, thus, accurately resolves clusters even when multiple binding sites localize within close proximity (Figure 2B). In order to test whether the highly volatile coverage function of PARCLIP data reflects complex RBP binding profiles or is an artifact of the procedure, we analyzed published AGO2 PARCLIP data [18] for which we can readily evaluate identified binding sites by considering expressed microRNA sequences. Our sequence analysis of 3’UTRs exhibiting multiple clusters resulted in a large number of transcripts (n=928). Each cluster localized within the 3’UTR could be assigned to one or more microRNA seed sequences of microRNAs expressed in HEK293 cells, suggesting that these clusters correspond to biologically relevant AGO2 binding sites. Two exemplary regions are illustrated in Figure 2CD. In addition, our hcTcentered strategy resulted on average in a ∼10x speed up over the CWTbased peak calling on all tested PARCLIP data sets (Additional file 1, 1.2).
Application to published PARCLIP data sets
In order to place the binding preference of the RBP within the biological context, postprocessing of identified binding sites is required. PARalyzer returns all identified clusters and read groups, and optionally seedmatches for supplied microRNA sequences within the resulting clusters as text files. In contrast, wavClusteR makes use of the R environment to provide extensive postprocessing functions supporting i) export of the coverage function, hcTs and clusters for visualization in the UCSC genome browser; ii) export of cluster sequences in FASTA format for de novo motif discovery and motif analysis; iii) strandspecific cluster annotation across different functional transcriptome compartments in sense and antisense orientations, including normalization of observed frequencies to the overall compartment length and iv) generation of metagene profiles of clusters and their statistics to assess the proteinspecific distribution of binding sites across genes. Furthermore, most BioConductor packages can directly use R objects returned by wavClusteR as an input.
For illustration, we provide examples of cluster annotations and metagene profiles obtained from PARCLIP data sets of MOV10 and QKI, which are characterized by different binding preferences. Annotation of MOV10 clusters shows that MOV10 preferentially binds to 3’UTRs of transcripts [11] (Figure 3A), whereas binding sites of QKI, which regulates premRNA splicing, mRNA export and stability, and protein translation [19], are enriched in 3’UTRs, coding sequences and introns (Figure 3A). Notably, the distinct binding preferences of the two proteins are neatly reflected in their metagene profiles (Figure 3B).
Comparison with PARalyzer
Using published AGO2 PARCLIP data sets, we compared the performance of wavClusteR with PARalyzer [13]. Our comparison revealed that the largest fraction of clusters is similarly identified by both methods (Figure 4A, see Additional file 1, 2.2 for cluster size distributions). However, depending on the parameter settings, the clusters specifically called by either method can represent a substantial fraction. Therefore, we decided to analyze methodspecific clusters in more detail. The distribution of RSF values within these clusters revealed that PARalyzerspecific clusters contained almost exclusively extreme RSF values. These values are unlikely to be caused by experimental induction, as the PARCLIPspecific enrichment of T to C transitions is missing when compared with other substitutions exhibiting similar RSF values (Additional file 1, 2.3). In contrast, wavClusteRspecific clusters covered the entire RSF range (Figure 4B and Additional file 1, 2.4) and mostly localized within the highconfidence RSF support. In addition, analysis of the read count distribution of PARalyzerspecific clusters (Additional file 1, 2.5) ruled out that the observed extreme RSF values result from clusters with low read count, which could be otherwise filtered out using more stringent parameter cutoffs. Annotation of clusters to the transcriptome shows that PARalyzerspecific AGO2 clusters preferentially localize within intergenic regions or introns (Figure 4C). In contrast, wavClusteRspecific binding sites mainly fall into 3’UTRs, which agrees well with the known biological function of the AGO2 protein [20,21]. Furthermore, we integrated RNASeq data derived from the same cell line to independently assess expression of the identified clusters. PARalyzerspecific clusters show significantly reduced expression levels (Figure 4D). This analysis suggests that the largest proportion of PARalyzerspecific clusters corresponds to false positives, possibly caused by RNA contamination during the experimental procedure, as most of the clustercontaining transcripts show no detectable expression.
An additional criterion to determine whether an AGO2 cluster, identified by either PARalyzer or wavClusteR, corresponds to a bona fide binding site is whether the site can be assigned to any expressed miRNA. Therefore, we decided to evaluate the quality of the methodspecific clusters by considering the presence of seeds of miRNAs known to be expressed in HEK293 cells. Since miRNAs target AGO2 proteins by complementary base pairing [20], we searched for corresponding seed sequences within the identified AGO2 binding sites (see Methods). Our analysis revealed that wavClusteRspecific clusters were substantially more enriched (>2 folds) for miRNA seeds than PARalyzerspecific ones, suggesting that these clusters more accurately reflect AGO2 binding sites. In addition, we repeated the analyses using PARCLIP datasets from mir124 miRNA transfection experiments [8] to quantify the fraction of the PARalyzer and wavClusteRspecific clusters that could be assigned to the transfected miRNA. Figure 4F shows an enrichment of mir124 seeds within wavClusteRspecific clusters, which is missing in clusters exclusively called by PARalyzer. Finally, these results are further supported by the analysis of a previously published Pumilio2 (PUM2) PARCLIP data set [8]. This RNAbinding protein recognizes a well characterized UGUAHAUA motif [22], which we found to be strongly enriched in wavClusteRspecific clusters (20.2%, n=1777) with respect to PARalyzerspecific ones (0.9%, n=219 and standard parameters).
Experimentallybased estimation of false discovery rates
We assessed the FDR of our highconfidence transitions by integrating matched total RNASeq data from HEK293 cells [11]. We reason that no crosslinking induced transitions are present in RNASeq. Hence, if our model correctly identifies PARCLIP induced RSF value, a transition classified as PARCLIPspecific and equally found in RNASeq data is likely to correspond to a false positive (FP). We partitioned the entire RSF interval (0,1] into different subsets and used transitions identified in both PARCLIP and RNASeq data to compute a highly conservative FDR upper bound, treating all observed RNASeq transitions as FPs irrespective of their RSF values (see Methods). Our analysis shows that the RSF interval [0.2,0.7], which we previously reported as PARCLIPspecific [11], is bounded by the lowest FDRs values (Figure 5A), thus demonstrating the high precision of our approach. Furthermore, the distribution of RNASeq RSF values within the central partitions of the RSF interval (Figure 5B) are mainly dominated by low RSF values compatible with sequencing errors, indicating that our FDR estimates are highly conservative.
Next, we assessed the FDR of clusters, which potentially contain multiple interaction sites. We considered increasing posterior probability cutoffs δ and computed highly conservative FDR estimates of clusters (see Methods) obtained for each threshold (Table 1). At δ=0.9, our method identified 66.837 MOV10 clusters (of which 20% contained a single hcT) at a ≤3% FDR for the top 250 clusters ranked by relative logodds (Table 1). Notably, the FDR values dropped substantially from δ=0.7 to δ=0.9 without a major effect on the total number of reported clusters, thus showing that stringency of the analysis can be effectively tuned by this parameter. This property is desirable for experimental validation, which is generally performed on few top ranked candidates only.
Conclusion
We presented a sensitive and comprehensive framework for PARCLIP data analysis, which provides statistically grounded and biologically interpretable results. In our approach, not the total number of interaction sites or observed transitions are considered, but rather the frequency at which expected transitions occur. This transitionbased strategy outperforms hard thresholdingbased approaches and achieves higher sensitivity and specificity.
Availability and requirements

Project name: wavClusteR

Project home page: https://github.com/FedericoComoglio/wavClusteR

Operating system(s): Platform independent

Programming language: R

Other requirements: R?>?= 3.0.0

License: GPL2

Any restrictions to use by nonacademics: none
References
 1
van Kouwenhove M, Kedde M, Agami R.MicroRNA regulation by RNAbinding proteins and its implications for cancer. Nat Rev Cancer. 2011; 11:644–56.
 2
Yates LA, Norbury CJ, Gilbert RJ. The long and short of microRNA. Cell. 2013; 153:516–9.
 3
Kechavarzi B, Janga SC. Dissecting the expression landscape of RNAbinding proteins in human cancers. Genome Biol. 2014; 15:R14.
 4
Lukong KE, Chang KW, Khandjian EW, Richard S.RNAbinding proteins in human genetic disease. Trends Genet. 2008; 24:416–25.
 5
Castello A, Fischer B, Hentze MW, Preiss T.RNAbinding proteins in Mendelian disease. Trends Genet. 2013; 29:318–27.
 6
Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, et al.A compendium of RNAbinding motifs for decoding gene regulation. Nature. 2013; 499:172–177.
 7
König J, Zarnack K, Luscombe NM, Ule J.ProteinRNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012; 13:77–83.
 8
Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, et al.Transcriptomewide identification of RNAbinding protein and microRNA target sites by PARCLIP. Cell. 2010; 141:129–41.
 9
Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, et al.PARCliP–a method to identify transcriptomewide the binding sites of RNA binding proteins. J Vis Exp. 2; 2010:2034.
 10
Ascano M, Hafner M, Cekan P, Gerstberger S, Tuschl T.Identification of RNAprotein interaction networks using PARCLIP. Wiley Interdiscip Rev RNA. 2012; 3:159–77.
 11
Sievers C, Schlumpf T, Sawarkar R, Comoglio F, Paro R.Mixture models and wavelet transforms reveal high confidence RNAprotein interaction sites in MOV10 PARCLIP data.Nucleic Acids Res. 2012; e160:40.
 12
Khorshid M, Rodak C, Zavolan M.CLIPZ: a database and analysis environment for experimentally determined binding sites of RNAbinding proteins. Nucleic Acids Res. 2011; 39:D245—52.
 13
Corcoran DL, Georgiev S, Mukherjee N, Gottwein E, Skalsky RL, Keene JD, et al.PARalyzer: definition of RNA binding sites from PARCLIP shortread sequence data. Genome Biol. 2011; 12:R79.
 14
Chen B, Yun J, Kim MS, Mendell JT, Xie Y.PIPECLIP: a comprehensive online tool for CLIPseq data analysis. Genome Biol. 2014; 15:R18.
 15
Kaneko S, Son J, Shen SS, Reinberg D, Bonasio R.PRC2 binds active promoters and contacts nascent RNAs in embryonic stem cells. Nat Struct Mol Biol. 2013; 20:1258–64.
 16
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10:R25.
 17
Hausser J., Berninger P, Rodak C, Jantscher Y, Wirth S, Zavolan M. MirZ: an integrated microRNA expression atlas and target prediction resource. Nucleic Acids Res. 2009; 37:W266–72.
 18
Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M. A quantitative analysis of CLIP methods for identifying binding sites of RNAbinding proteins. Nat Methods. 2011; 8:559–64.
 19
Chénard CA, Richard S. New implications for the QUAKING RNA binding protein in human disease. J Neurosci Res. 2008; 86:233–42.
 20
Hutvagner G, Simard MJ. Argonaute proteins: key players in RNA silencing. Nat Rev Mol Cell Biol. 2008; 9:22–32.
 21
Meister G. Argonaute proteins: functional insights and emerging roles. Nat Rev Genet. 2013; 14:447–59.
 22
White EK, MooreJarrett T, Ruley HE. PUM2, a novel murine puf protein, and its consensus RNAbinding site. RNA. 2001; 7:1855–66.
Acknowledgements
This work was supported by the ETH Zurich. F.C. is a member of the Life Science Zurich Graduate School, PhD program in Systems Biology. We are thankful to Moritz Gerstung and Maurizio Rinaldi for helpful discussion, to Martin Morgan for support with the package implementation, and to Hind Hashwah and Tommy Schlumpf for feedback on the package.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
FC and CS developed, implemented and tested the method; RP guided this research. All authors participated in the design of the project. FC and CS wrote the manuscript. All authors read and approved the final manuscript.
Additional file
Additional file 1
Supplementary methods and results. PARalyzer parameters, graphical outline of wavClusteR 2.0 and analysis of publicly available PARCLIP data sets.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Comoglio, F., Sievers, C. & Paro, R. Sensitive and highly resolved identification of RNAprotein interaction sites in PARCLIP data. BMC Bioinformatics 16, 32 (2015). https://doi.org/10.1186/s128590150470y
Received:
Accepted:
Published:
Keywords
 PARCLIP
 RNA
 RNA binding proteins
 Bayesian statistics