Sensitive and highly resolved identification of RNAprotein interaction sites in PARCLIP data
 Federico Comoglio^{1}Email author,
 Cem Sievers^{1, 2, 3}Email author and
 Renato Paro^{1, 4}
https://doi.org/10.1186/s128590150470y
© Comoglio et al.; licensee BioMed Central. 2015
Received: 28 July 2014
Accepted: 15 January 2015
Published: 1 February 2015
Abstract
Background
PARCLIP is a recently developed Next Generation Sequencingbased method enabling transcriptomewide identification of interaction sites between RNA and RNAbinding proteins. The PARCLIP procedure induces specific base transitions that originate from sites of RNAprotein interactions and can therefore guide the identification of binding sites. However, additional sources of transitions, such as cell typespecific SNPs and sequencing errors, challenge the inference of binding sites and suitable statistical approaches are crucial to control false discovery rates. In addition, a highly resolved delineation of binding sites followed by an extensive downstream analysis is necessary for a comprehensive characterization of the protein binding preferences and the subsequent design of validation experiments.
Results
We present a statistical and computational framework for PARCLIP data analysis. We developed a sensitive transitioncentered algorithm specifically designed to resolve protein binding sites at high resolution in PARCLIP data. Our method employes a Bayesian network approach to associate posterior logodds with the observed transitions, providing an overall quantification of the confidence in RNAprotein interaction. We use published PARCLIP data to demonstrate the advantages of our approach, which compares favorably with alternative algorithms. Lastly, by integrating RNASeq data we compute conservative experimentallybased false discovery rates of our method and demonstrate the high precision of our strategy.
Conclusions
Our method is implemented in the R package wavClusteR 2.0. The package is distributed under the GPL2 license and is available from BioConductor at http://www.bioconductor.org/packages/devel/bioc/html/wavClusteR.html.
Keywords
Background
RNAbinding proteins (RBPs) play a fundamental role in virtually all aspects of RNA metabolism, including the regulation of RNA localization, stability, translation or degradation [1]. These proteins extensively contribute to the control of gene expression by regulating the life cycle of microRNAs, where the RBPRNA interaction is mediated by specific RNA sequence motifs or secondary structures [2]. Interestingly, recent studies showed that deregulation of RBP expression or mutation of cognate binding sites are causally related to several human diseases including cancer [36]. Many of these studies have been made possible by the development of new methods mapping interaction sites in a comprehensive and systematic manner [7]. Particularly, the PhotoActivatable Ribonucleosideenhanced CrossLinking and ImmunoPrecipitation (PARCLIP) method made it possible to identify highly specific RBPRNA interactions by generating a distinct imprint in the bound RNA [810]. In this method, cells are cultured with a ribonucleoside analogue, e.g. 4thiouridine (4SU), which becomes incorporated into nascent RNA molecules. Then, in vivo UV crosslinking at a specific wavelength is performed to stabilize the RNARBP interaction, resulting in a covalently linked RNARBP complex. Next, the complex is isolated, the protein digested, the RNA molecules recovered and reverse transcribed to cDNA. NextGeneration Sequencing is then used to determine the identity of these molecules.
Importantly, the reverse transcription induces specific base transitions at the original crosslinked sites, which can be used to identify highconfidence RBPRNA binding interactions in PARCLIP data [11]. Based on the induction of transitions, different strategies have been developed for the identification RNARBP interactions in PARCLIP data. CLIPZ [12], a widely adopted method for PARCLIP data analysis, ranks protein binding sites, referred to as clusters, based on their total number of observed transitions. PARalyzer [13], in contrast, utilizes transitions to fit a Gaussian kernel density estimate classifier in order to discriminate signal from noise at interaction sites with the aim to infer the protein binding sites. The cluster boundaries are determined by extending the interaction sites using a fixed threshold on the coverage or by applying an arbitrary window size. PIPECLIP [14], a very recent tool designed for CLIPseq data analysis, employs a binomial model and performs comparably to PARalyzer in identifying binding sites in PARCLIP data. However, these methods fall short on important aspects of PARCLIP data analysis. (i) As experimental validation of RNARBP interactions is laborious and only feasible on small scale, statistically rigorous approaches are needed to rank clusters and identify highconfidence subsets amenable to experimental testing. (ii) PARCLIP data allows for a highly resolved identification of the RBP binding sites. However, to delineate cluster boundaries accurately, a sensitive peak caller tailored to this problem is needed. (iii) Not every observed transition is induced by crosslinking, i.e. by PARCLIP. Rather, sequencing errors, RNA contaminants and cell typespecific SNPs represent additional sources of transitions which can lead to the detection of a considerable number of false positives [11]. Attempts to limit the false discovery rate (FDR) by requiring a minimum number of interaction sites per cluster, as recommended by [13,15], can largely reduce sensitivity as it will inevitably miss all true clusters containing less interaction sites. In fact, given that the nucleotide composition of protein binding sites can greatly vary, clusters exhibiting a few PARCLIP induced transitions can still correspond to bona fide interaction sites.
In this work, we specifically address the three points outlined above. We introduce a Bayesian model to identify PARCLIP induced highconfidence transitions extending our recent work in [11]. We detail a new, coveragebased algorithm for the identification of cluster boundaries termed MiniRank Norm (MRN) and show that it substantially improves resolution of binding sites over other methods. We test our algorithm on published data and compare its performance with PARalyzer. We demonstrate that wavClusteR outperforms alternative algorithms both in detection and resolution of clusters. By using a transition frequencybased strategy our method overcomes the reduction in sensitivity and specificity which characterizes hard thresholding approaches such as PARalyzer. Lastly, we evaluate the performance of our algorithm by integrating matched RNASeq data to compute conservative FDR estimates, confirming that highconfidence transitions identified by our approach are PARCLIP specific.
Methods
Model
Let i be a genomic position spanned by a number of reads after the short read alignment. The relative substitution frequency (RSF) x at position i is the ratio between the number of base substitutions y within the reads (e.g. T → C) aligned at i relative to the total coverage z at the site, and can be interpreted as an estimate of the corresponding transition probability. We recently introduced a nonparametric, twocomponent mixture model to discriminate PARCLIPspecific from nonexperimentallyinduced transitions [11]. In our model, the first and second component represent nonexperimental and PARCLIPinduced transitions, respectively.
Cluster boundaries identification
Let \({\mathcal {C}}(i)\) be the coverage at position i and \({\mathcal {C}}(i,j)\) be the sequence of coverage values \(({\mathcal {C}}(i), {\mathcal {C}}(i+1), \ldots, {\mathcal {C}}(j)), i\leq j\). Similarly, let \({\mathcal {S}}(i)\), \({\mathcal {E}}(i)\) be the positive and negative differences in the coverage function, i.e. the number of read alignments starting or ending at position i, respectively, and \({\mathcal {S}}(i,j) = ({\mathcal {S}}(i), {\mathcal {S}}(i+1), \ldots, {\mathcal {S}}(j))\), \({\mathcal {E}}(i,j) = ({\mathcal {E}}(i), {\mathcal {E}}(i+1), \ldots, {\mathcal {E}}(j))\), i≤j be the extended notation to intervals. We then consider the set of all genomic positions corresponding to highconfidence transitions (hcTs) of a given type (e.g. T →C). In the following paragraphs, we detail the steps performed by the MRN algorithm.
Estimate local background threshold
Identify candidate cluster boundaries
The width \(W=(lk+1)_{(k,l) \in C_{s} \times C_{e}}\) of each candidate cluster and signal levels at C _{ s } and C _{ e }, namely \(n_{s}=({\mathcal S}(i))_{i \in C_{s}}\) and \(n_{e}=({\mathcal E}(i))_{i \in C_{e}}\), respectively, are computed (Figure 2A).
Represent candidate clusters as rank vectors
We represent each candidate cluster characterized by the vector \((n_{s_{k}}, n_{e_{l}}, w_{\textit {kl}})\) as a rank vector \(\phantom {\dot {i}\!}\mathbf r_{\textit {kl}}=(r_{s_{k}}, r_{e_{l}}, r_{w_{\textit {kl}}})\) (Figure 2A), where r _{ s } is the ranking of start positions (with ties resolved 5’ →3’, i.e. with increasing values at each index set of ties), r _{ e }=π(n _{ e }) is the ranking of end positions (with ties resolved 3’ →5’) and r _{ w }=π(−W) is the ranking of candidate cluster widths (with ties resolved by minimum ranking).
Identify the optimal solution
The expected coverage of a RBP binding site corresponds to a sharply peaked rectangle function [12], represented by the rank vector O=(0,0,0). We identify the optimal solution as the candidate cluster that is closest to O in terms of the euclidean norm of its rank vector r _{ kl }. Although multiple optimal solutions can occur, the choice of the euclidean norm strongly disfavors large clusters by construction. Therefore, in case of ties, the shortest cluster is reported, as it corresponds to the binding site with higher signal at the cluster boundaries as compared to any other optimal solution.
Comparison with PARalyzer
Data processing
All PARCLIP data sets were processed as previously described [11]. Briefly, adapter sequences were removed. Reads of length ≥15 passing the Illumina quality filter were aligned to the human reference assembly ‘hg19’ using Bowtie [16], allowing at most one mismatch. The following Bowtie parameters were specified: best chunkmbs 512 n 1 S M 100.
Parametrization
While our method depends on few parameters  essentially the minimum required coverage at transitions and the posterior probability cutoff  PARalyzer contains a more extensive parametrization. To allow for a fair comparison we selected mainly default and recommended parameter values. Both methods differ on what constitutes the minimum required evidence for a binding site. While wavClusteR poses a cutoff c on the strandspecific coverage at hcTs, PARalyzer applies a threshold on the number of reads forming a read group. To compare the performance of the two methods, we first learn the PARCLIP specific RSF interval by fitting the mixture model using c=20 and then exhaustively identify binding sites with wavClusteR starting from hcTs with RSF values within the Bayes classifier and c=1. We run PARalyzer using two different values for the minimum conversion locations for clusters n, namely n=1 (default value) or n=2 (recommended value), respectively. The choice of this parameter crucially determines sensitivity and recall of the algorithm. The full set of PARalyzer parameters used for the comparison is provided in Additional file 1, Section 1.1. Only clusters exhibiting at least one T to C transition with a strandspecific coverage of 10 are retained for the comparison, with no requirement on its RSF value to enable a fair comparison between algorithms.
microRNA seed mapping
We considered a set of microRNAs (miRNAs) previously shown to be expressed in HEK293 cells [17] and computed the enrichment of miRNA seeds within each set of cluster sequences relative to a random control. The latter was obtained by generating 10^{4} samples of dinucleotide shuffled microRNA sequences and the mean relative seed count was used as background estimate. To allow for a fair comparison PARalyzerspecific clusters were extended to the median length of wavClusteRspecific clusters.
Computing false discovery rates
To provide experimentallybased estimates of the False Discovery Rate (FDR) of our method, we analyze the MOV10 PARCLIP data set and a matched total RNASeq profile from the same HEK293 cells used to perform the PARCLIP experiment [11].
FDR of highconfidence interaction sites
We estimated a highly conservative FDR upper bound and a FDR lower bound as a function of the RSF as follows. Let be the set of genomic positions with a minimum coverage of 20 within the PARCLIP and the RNASeq data set and at least one transition within PARCLIP. Each element of is associated with a specific PARCLIP RSF value. We partition the RSF interval (0,1] into ten equally spaced intervals and for each range we identify the genomic positions \(\mathcal {P} \subseteq \mathcal {G}\) such that the associated RSF values fall into the RSF interval. We compute a conservative FDR upper bound by regarding as FPs all genomic positions \(\mathcal {U} \subseteq \mathcal {P}\) showing at least one transition in the RNASeq data, irrespective of their RSF values. The FDR upper bound is therefore \({\mathcal U}/{\mathcal P}\). Similarly, we compute the FDR lower bound by considering FPs all genomic positions \(\mathcal {L} \subseteq \mathcal {P}\) exhibiting an RNASeqbased RSF within the same interval, and compute the FDR lower bound as \({\mathcal L}/{\mathcal P}\).
FDR clusters
We rank clusters by decreasing values of relative logodds and consider the resulting top n clusters. For each cluster in the ranking, we identify the set of genomic positions with hcTs localizing therein and compute the RNASeqbased RSF \(x_{t},t\in \mathcal {T}\). To compute conservative FDR estimates, we regard a cluster as FP if there exist at least one \(t\in \mathcal {T}\) such that a≤x _{ t }≤b, where [ a,b] is the PARCLIPspecific RSF support resulting from applying a given posterior probability cutoff. This condition is highly conservative, as a single true hcT within a cluster with multiple detected hcT suffices to correctly identify the binding site. Similarly, we compute less conservative FDR values by regarding a binding site as FP if every \(t\in \mathcal {T}\) satisfies a≤x _{ t }≤b.
Implementation
The algorithms described above are implemented in version 2.0 of our R package wavClusteR [11]. The MRN algorithm is implemented using parallelization, as binding sites are independent of each other. A graphical outline of the data analysis workflow offered by wavClusteR is illustrated in Additional file 1, 2.1.
Results and discussion
First, we show that the MRN algorithm provides sensitive and highly resolved identification of clusters. We then apply our method to published PARCLIP data sets and demonstrate part of our postprocessing pipeline. We compare our algorithm to PARalyzer [13] using published AGO2 PARCLIP data. Finally, we report estimates of FDRs of highconfidence transitions (hcTs) and of inferred protein binding sites by integrating matched RNASeq data.
Sensitive delineation of clusters at high resolution
We previously proposed an algorithm to resolve cluster boundaries by computing the continuous wavelet transform (CWT) of the coverage function around hcTs. However, this method is prone to false negatives, i.e. hcTs that are not assigned to a cluster, when genomic regions with complex coverage geometry and high variance of local signaltonoise ratios are encountered. To address this issue and increase the sensitivity of our peak calling procedure, we developed a CWTindependent algorithm which we termed minirank norm (MRN). The MRN algorithm (see Methods and Figure 2) solves an optimization problem in which hcTs are first employed to reduce the search space. Signal and noise in the coverage function are locally separated by modeling coverage fluctuations and integrating knowledge of the geometric properties of RBP binding sites. By assuming that the expected coverage of a cluster corresponds to a sharply peaked rectangle function [12], all candidate cluster boundaries spanning a highconfidence PARCLIP signal are then exhaustively evaluated and ranked accounting for this prior knowledge. By design, our algorithm favors sharp boundaries and short cluster widths, and, thus, accurately resolves clusters even when multiple binding sites localize within close proximity (Figure 2B). In order to test whether the highly volatile coverage function of PARCLIP data reflects complex RBP binding profiles or is an artifact of the procedure, we analyzed published AGO2 PARCLIP data [18] for which we can readily evaluate identified binding sites by considering expressed microRNA sequences. Our sequence analysis of 3’UTRs exhibiting multiple clusters resulted in a large number of transcripts (n=928). Each cluster localized within the 3’UTR could be assigned to one or more microRNA seed sequences of microRNAs expressed in HEK293 cells, suggesting that these clusters correspond to biologically relevant AGO2 binding sites. Two exemplary regions are illustrated in Figure 2CD. In addition, our hcTcentered strategy resulted on average in a ∼10x speed up over the CWTbased peak calling on all tested PARCLIP data sets (Additional file 1, 1.2).
Application to published PARCLIP data sets
In order to place the binding preference of the RBP within the biological context, postprocessing of identified binding sites is required. PARalyzer returns all identified clusters and read groups, and optionally seedmatches for supplied microRNA sequences within the resulting clusters as text files. In contrast, wavClusteR makes use of the R environment to provide extensive postprocessing functions supporting i) export of the coverage function, hcTs and clusters for visualization in the UCSC genome browser; ii) export of cluster sequences in FASTA format for de novo motif discovery and motif analysis; iii) strandspecific cluster annotation across different functional transcriptome compartments in sense and antisense orientations, including normalization of observed frequencies to the overall compartment length and iv) generation of metagene profiles of clusters and their statistics to assess the proteinspecific distribution of binding sites across genes. Furthermore, most BioConductor packages can directly use R objects returned by wavClusteR as an input.
Comparison with PARalyzer
An additional criterion to determine whether an AGO2 cluster, identified by either PARalyzer or wavClusteR, corresponds to a bona fide binding site is whether the site can be assigned to any expressed miRNA. Therefore, we decided to evaluate the quality of the methodspecific clusters by considering the presence of seeds of miRNAs known to be expressed in HEK293 cells. Since miRNAs target AGO2 proteins by complementary base pairing [20], we searched for corresponding seed sequences within the identified AGO2 binding sites (see Methods). Our analysis revealed that wavClusteRspecific clusters were substantially more enriched (>2 folds) for miRNA seeds than PARalyzerspecific ones, suggesting that these clusters more accurately reflect AGO2 binding sites. In addition, we repeated the analyses using PARCLIP datasets from mir124 miRNA transfection experiments [8] to quantify the fraction of the PARalyzer and wavClusteRspecific clusters that could be assigned to the transfected miRNA. Figure 4F shows an enrichment of mir124 seeds within wavClusteRspecific clusters, which is missing in clusters exclusively called by PARalyzer. Finally, these results are further supported by the analysis of a previously published Pumilio2 (PUM2) PARCLIP data set [8]. This RNAbinding protein recognizes a well characterized UGUAHAUA motif [22], which we found to be strongly enriched in wavClusteRspecific clusters (20.2%, n=1777) with respect to PARalyzerspecific ones (0.9%, n=219 and standard parameters).
Experimentallybased estimation of false discovery rates
Conservative FDR estimates of clusters as a function of different posterior probability cutoffs (see Methods )
Posterior cutoff ( δ )  >0.7  >0.8  >0.9 

RSF support  [0.014,0.808]  [0.021,0.779]  [0.044,0.713] 
No. hcTCs  268.771  265.085  246.619 
No. clusters  67.856  67.493  66.837 
with 1 hcTC  13.227 (19.5%)  13.351 (19.8%)  13.471 (20.1%) 
FDR top 75  0.066  0.066  0.024 
FDR top 125  0.096  0.085  0.0266 
FDR top 250  0.132 (0.02)  0.108 (0.02)  0.028 (0.012) 
Conclusion
We presented a sensitive and comprehensive framework for PARCLIP data analysis, which provides statistically grounded and biologically interpretable results. In our approach, not the total number of interaction sites or observed transitions are considered, but rather the frequency at which expected transitions occur. This transitionbased strategy outperforms hard thresholdingbased approaches and achieves higher sensitivity and specificity.
Availability and requirements

Project name: wavClusteR

Project home page: https://github.com/FedericoComoglio/wavClusteR

Operating system(s): Platform independent

Programming language: R

Other requirements: R?>?= 3.0.0

License: GPL2

Any restrictions to use by nonacademics: none
Declarations
Acknowledgements
This work was supported by the ETH Zurich. F.C. is a member of the Life Science Zurich Graduate School, PhD program in Systems Biology. We are thankful to Moritz Gerstung and Maurizio Rinaldi for helpful discussion, to Martin Morgan for support with the package implementation, and to Hind Hashwah and Tommy Schlumpf for feedback on the package.
Authors’ Affiliations
References
 van Kouwenhove M, Kedde M, Agami R.MicroRNA regulation by RNAbinding proteins and its implications for cancer. Nat Rev Cancer. 2011; 11:644–56.View ArticlePubMedGoogle Scholar
 Yates LA, Norbury CJ, Gilbert RJ. The long and short of microRNA. Cell. 2013; 153:516–9.View ArticlePubMedGoogle Scholar
 Kechavarzi B, Janga SC. Dissecting the expression landscape of RNAbinding proteins in human cancers. Genome Biol. 2014; 15:R14.Google Scholar
 Lukong KE, Chang KW, Khandjian EW, Richard S.RNAbinding proteins in human genetic disease. Trends Genet. 2008; 24:416–25.View ArticlePubMedGoogle Scholar
 Castello A, Fischer B, Hentze MW, Preiss T.RNAbinding proteins in Mendelian disease. Trends Genet. 2013; 29:318–27.View ArticlePubMedGoogle Scholar
 Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, et al.A compendium of RNAbinding motifs for decoding gene regulation. Nature. 2013; 499:172–177.View ArticlePubMedPubMed CentralGoogle Scholar
 König J, Zarnack K, Luscombe NM, Ule J.ProteinRNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012; 13:77–83.View ArticlePubMedGoogle Scholar
 Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, et al.Transcriptomewide identification of RNAbinding protein and microRNA target sites by PARCLIP. Cell. 2010; 141:129–41.View ArticlePubMedPubMed CentralGoogle Scholar
 Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, et al.PARCliP–a method to identify transcriptomewide the binding sites of RNA binding proteins. J Vis Exp. 2; 2010:2034.Google Scholar
 Ascano M, Hafner M, Cekan P, Gerstberger S, Tuschl T.Identification of RNAprotein interaction networks using PARCLIP. Wiley Interdiscip Rev RNA. 2012; 3:159–77.View ArticlePubMedGoogle Scholar
 Sievers C, Schlumpf T, Sawarkar R, Comoglio F, Paro R.Mixture models and wavelet transforms reveal high confidence RNAprotein interaction sites in MOV10 PARCLIP data.Nucleic Acids Res. 2012; e160:40.Google Scholar
 Khorshid M, Rodak C, Zavolan M.CLIPZ: a database and analysis environment for experimentally determined binding sites of RNAbinding proteins. Nucleic Acids Res. 2011; 39:D245—52.Google Scholar
 Corcoran DL, Georgiev S, Mukherjee N, Gottwein E, Skalsky RL, Keene JD, et al.PARalyzer: definition of RNA binding sites from PARCLIP shortread sequence data. Genome Biol. 2011; 12:R79.Google Scholar
 Chen B, Yun J, Kim MS, Mendell JT, Xie Y.PIPECLIP: a comprehensive online tool for CLIPseq data analysis. Genome Biol. 2014; 15:R18.Google Scholar
 Kaneko S, Son J, Shen SS, Reinberg D, Bonasio R.PRC2 binds active promoters and contacts nascent RNAs in embryonic stem cells. Nat Struct Mol Biol. 2013; 20:1258–64.View ArticlePubMedPubMed CentralGoogle Scholar
 Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10:R25.Google Scholar
 Hausser J., Berninger P, Rodak C, Jantscher Y, Wirth S, Zavolan M. MirZ: an integrated microRNA expression atlas and target prediction resource. Nucleic Acids Res. 2009; 37:W266–72.View ArticlePubMedPubMed CentralGoogle Scholar
 Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M. A quantitative analysis of CLIP methods for identifying binding sites of RNAbinding proteins. Nat Methods. 2011; 8:559–64.View ArticlePubMedGoogle Scholar
 Chénard CA, Richard S. New implications for the QUAKING RNA binding protein in human disease. J Neurosci Res. 2008; 86:233–42.View ArticlePubMedGoogle Scholar
 Hutvagner G, Simard MJ. Argonaute proteins: key players in RNA silencing. Nat Rev Mol Cell Biol. 2008; 9:22–32.View ArticlePubMedGoogle Scholar
 Meister G. Argonaute proteins: functional insights and emerging roles. Nat Rev Genet. 2013; 14:447–59.View ArticlePubMedGoogle Scholar
 White EK, MooreJarrett T, Ruley HE. PUM2, a novel murine puf protein, and its consensus RNAbinding site. RNA. 2001; 7:1855–66.PubMedPubMed CentralGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.