mrSNP: Software to detect SNP effects on microRNA binding

Background MicroRNAs (miRNAs) are short (19-23 nucleotides) non-coding RNAs that bind to sites in the 3’untranslated regions (3’UTR) of a targeted messenger RNA (mRNA). Binding leads to degradation of the transcript or blocked translation resulting in decreased expression of the targeted gene. Single nucleotide polymorphisms (SNPs) have been found in 3’UTRs that disrupt normal miRNA binding or introduce new binding sites and some of these have been associated with disease pathogenesis. This raises the importance of detecting miRNA targets and predicting the possible effects of SNPs on binding sites. In the last decade a number of studies have been conducted to predict the location of miRNA binding sites. However, there have been fewer algorithms published to analyze the effects of SNPs on miRNA binding. Moreover, the existing software has some shortcomings including the requirement for significant manual labor when working with huge lists of SNPs and that algorithms work only for SNPs present in databases such as dbSNP. These limitations become problematic as next-generation sequencing is leading to large numbers of novel variants in 3’UTRs. Result In order to overcome these issues, we developed a web-server named mrSNP which predicts the impact of a SNP in a 3’UTR on miRNA binding. The proposed tool reduces the manual labor requirements and allows users to input any SNP that has been identified by any SNP-calling program. In testing the performance of mrSNP on SNPs experimentally validated to affect miRNA binding, mrSNP correctly identified 69% (11/16) of the SNPs disrupting binding. Conclusions mrSNP is a highly adaptable and performing tool for predicting the effect a 3’UTR SNP will have on miRNA binding. This tool has advantages over existing algorithms because it can assess the effect of novel SNPs on miRNA binding without requiring significant hands on time.


Background
MicroRNAs (miRNAs) are predicted to regulate over 60% of all genes and as such have a significant impact on cell function and biology [1]. MiRNAs bind to the 3'UTR of an mRNA which results in decreased expression of the targeted gene. Thus, miRNA binding analysis is essential for any biological workflow that examines gene expression.
Processing of miRNAs is a multi-step process. First the miRNA transcript folds into a hairpin loop which is called the pri-miRNA. The hairpin loop is processed further into a pre-miRNA and is exported to the nucleus where it http://www.biomedcentral.com/1471-2105/ 15/73 Generally, plant miRNAs have perfect base-pairing with their target, causing its degradation. In animals, miRNAs can also form limited base-pairing, primarily between the 2nd and 7th bp from the 5' end of miRNA (seed of miRNA), which leads to translational repression. This imprecise sequence matching makes it more difficult to predict miRNA targets in animals with high accuracy. Different techniques have been proposed to predict mammalian miRNA-mRNA binding. These include the pattern of base pairing, thermodynamic stability of the miRNA-mRNA hybrid, comparative sequence analysis for conservation, and examination of multiple target sites [2].
Several software programs have been developed that utilize one or more of these methods to identify miRNA binding sites in the genome. TargetScan checks thermodynamic stability and conservation of the target sites in related species [3]. Miranda combines the pattern of base pairing, the thermodynamic stability of the miRNA-mRNA hybrid and comparative sequence analysis for conservation [4]. RNAhybrid determines the optimal and subobtimal binding energies between a given miRNA and its mRNA target [5]. MicroInspector detects binding sites according to complementarity using two sliding windows of 6 nucleotides in length [6] . Pictar requires base-pair matches in the seed region of miRNA, applies filtering by calculating thermodynamic binding energy, and assigns a likelihood score using a Hidden Markov Model for each binding [7]. Diana-microT considers principles of binding energy and conservation [8,9]. It also integrates biological pathways and analysis of interactions between predicted target genes.
Disease-associated functional SNPs may alter gene expression. Therefore, the relationship between SNPs and miRNAs becomes important for understanding the role of SNPs on disease [10]. Although there are many miRNA binding prediction tools that have been studied in the last decade, fewer studies to assess SNP effects on miRNA binding have been published [11][12][13][14][15]. Recently, the databases microSNiPer, Patrocles, Mirsnpscore, miRd-SNP, MirSNP, PolymiRTS have been released [16][17][18][19][20][21]. These databases follow similar algorithms as those utilized by the miRNA prediction tools in order to detect the effects of the SNPs on miRNA binding. These algorithms are run on the whole genome for all SNPs present in a genomic database like dbSNP, then results are stored. Users can query the results using SNP, gene or miRNA IDs. One of the deficiencies of these databases is that they only work for SNPs that already exist in databases and do not work for novel or unreported SNPs. Moreover, if the list of SNPs is large, the web interface of the tools may require an infeasible amount of manual labor. With the advent of next-generation sequencing technologies such as RNA-Seq, exome and whole genome sequencing, thousands of novel SNPs in 3'UTRs are being identified. RNA-Seq, which sequences all expressed genes in a sample, provides concordant gene expression and SNP data. Since a substantial number of the detected SNPs are previously undocumented, use of algorithms that require a SNP to be present in dbSNP may not meet the needs of researchers using RNA-Seq or other next-generation sequencing methods. Currently, when a novel SNP is encountered, a user can compare the location of the SNP against the predicted and validated miRNA target sites using the current prediction tools which is fairly labor intensive. The probability of the SNP disturbing a binding site can be considered to be proportional to the distance of the SNP to the seed of the target site. However, a SNP may not affect binding even when it is very close to the miRNA target seed region. Moreover, a SNP may introduce a totally new binding with a new miRNA, which is impossible to capture with the current databases. Thus, next-generation sequencing data demands new computational tools to relate the SNP and gene expression, which motivated us to develop a web-based tool, named mrSNP, to overcome the shortcomings of existing tools.

Implementation
The implementation of the mrSNP is presented in Figure 1. All the 3' UTR sequences and phastCons scores of the each nucleotides are downloaded from the UCSC Database using the Genome Table Browser [22]. Each chromosome is stored in a single file, where each sequence has information including gene name and 3'UTR sequence coordinates. All available miRNAs are downloaded from the mirBase database and clustered according to their conservation across species using the information obtained from the TargetScan prediction tool [3,23]. The software accepts input SNPs with the related information containing the organism, the assembly according to which the mapping is done, the chromosome on which the SNP is located, the position of the SNP in the given chromosome, and the SNP alleles. Once this information is provided, the software searches for the sequence where it is located. If SNP is not located in a denoted 3' UTR sequence, no further calculation is done and the software reports the SNP as, "not in 3'UTR". If the SNP is found in a 3'UTR, the 79 basepairs (bp) of sequence that contains the SNP at the center is returned at this step. This length (79bp) was chosen based on the observation that the typical maximum size of an miRNA is 25 bp and a maximum 15 bp loop is allowed in the binding. Therefore, we allow a miRNA binding site to have a maximum length of 40 bp. If a SNP is to affect miRNA binding, it will be located in the miRNA's binding site whose start/end nucleotide can be at most 39 bp apart from the SNP. Therefore, a 79 bp sequence (40 bp + 39 bp) is the minimal sequence to use for calculating potential miRNA binding differences. Once this sequence http://www.biomedcentral.com/1471-2105/15/73 is obtained, it is duplicated and each SNP allele is substituted in the correct position. After this point, for each miRNA stored we check the existence of a minimum of 6 consecutive Watson-Crick (W-C) matches starting from second position of the miRNA seed region.
The remainder of the approach is adapted from [9]. A sequence with 6 (7, 8, or 9) consecutive matches is called a 6mer (7mer etc.). We allow a single G:U wobble for 7, 8 and 9mer sequences. If no instances satisfy matching criteria, the miRNA and the sequence couple are not investigated further, and we conclude that the miRNA does not bind to this region. Moreover, if the sequence has at least 7 Watson-Crick matches in the seed region, it is considered as a miRNA binding site immediately. For weaker bindings such as the 6mers, or 7, 8 and 9mer sequences containing a single G-U wobble, we calculate the binding energy with RNAhybrid [5]. RNAhybrid runs a dynamic programming algorithm that finds the suboptimal binding energy between 2 sequences. For 6mers and 7mers (8mers and 9mers), we say that microRNA binds to a sequence if its binding energy is higher than 74% (60%) of the maximum binding energy. The numbers and methods used are adapted from [9]. For a given SNP-miRNA couple, the steps explained above are followed for both of the SNP sequences. If one of the them satisfies the binding criteria, while the other does not, we report this as a binding difference.
In the literature, many of the prediction tools apply a post-processing step to reduce the false positive rate of the binding predictions. This is performed using the conservation of the target site across different species. If the target site is conserved over different species, the binding possibility is considered to be higher. Although mrSNP does not filter out the results with this post-processing method, it calculates the conservation score (CS) of the seed region using the phastCons scores provided by UCSC database. For each prediction, CS is obtained as the average phastCons score of the nucleotides in the seed region. Then, it reports the probabilistic CS of the seed region as well as the conservation of the miRNA over the species.
Usage mrSNP software is publicly available from http://mrsnp. osu.edu. First, the user selects the organism and the assembly used in the analysis. mrSNP currently supports 11 organisms with the available assemblies.
Once organism and assembly are chosen, the user inputs the list of SNPs by either typing in the textbox or uploading a file. Each line should contain a single SNP with chromosome number, SNP position, and first and second allele. Each entry is separated by a space. An e-mail address can be provided for obtaining results. Also, the cut-off ratios to apply 6, 7, 8 and 9mers are parametrized for the option of using different ratios for different organisms. When a job is submitted, the user is directed to another page summarizing inputs and a link to the results page. Once the result is ready, it is displayed in a table containing the fields: chromosome, SNP position, target gene, the binding miRNA, the binding energy difference, the binding energies of each SNP, the cut-offs applied to each sequence, and the alignment of the bindings. If a SNP is not located in any 3'UTR region, or if it does not affect any miRNA bindings, the related information is reported at target gene field. A downloadable file is also provided. If there are any errors found, a link to the error page is given at the bottom of the page.

Results and discussion
Although mrSNP does not require its input to be validated SNPs, in order to evaluate the accuracy of mrSNP, we ran a series of experiments on multiple sets of experi-http://www.biomedcentral.com/1471-2105/15/73 mentally validated SNP-miRNA couples for human hg19 assembly.
In the first set of validation experiments, we ran experimentally validated disease associated SNPs used in the experiments of [20] in order to compare mrSNP with different databases. 16 SNPs which are both associated with disease and experimentally validated to disrupt miRNA binding were chosen. Table 1 gives the results of this experiment. For each SNP-miRNA pair, the table reports the SNP's rsID, the name of the miRNA, the location of the SNP, SNP alleles, the success/failure of mrSNP, and the explanation of the behavior. In this experiment, mrSNP is able to recover 11 disease associated SNPs out of 16. Among the 5 SNPs which are predicted not to affect miRNA bindings, the effect of rs13212041 on hsa-miR-96 is not captured as the SNP is not located in the 3'UTR. The effects of 3 SNPs (rs2735383, rs34764978, rs9341070) are not recovered as the sequences for both of the alleles do not satisfy the minimum matching criteria. On the other hand, although mrSNP recognizes the binding energy change of hsa-miR-148a for rs67384697; both of the sequences satisfy the matching criteria, and no effect on miRNA binding is detected. Among the 11 SNP-miRNA pairs that are successfully detected by mrSNP, 7 of them are captured because the SNPs break a matching in the seed region which causes the sequence not to meet the minimum matching criteria. The other 4 SNPs introduce GU wobbles in the seed region. The binding of these SNPs are predicted to be disrupted since the binding energies are calculated to be lower than the required threshold.
In comparing the results of mrSNP to other databases and algorithms described in [20], MirSNP, PolymiRTS, Mirsnpscore, and Patrocles are able to capture 12, 7, 7, and 5 of the disease-associated SNPs respectively. Thus, mrSNP outperforms all tools except MirSNP.
MirSNP detected similar binding differences as mrSNP with the exception of capturing the rs67384697hsa-miR-148a-3p pair. MirSNP reports this pair as "Enhance/Decrease" which means a binding energy difference between two sequences for each the allele of the SNP was measured, rather than a break in the binding. As explained in Table 1, a binding energy difference between the alleles is also captured by mrSNP, however, it is not reported because both of them satisfy the matching criteria.
In the next set of validation experiments, we tested mrSNP on SNPs obtained from the miRdSNP database [19]. We chose the SNPs that map to the miRNA targets predicted by TargetScan for the miRNAs and genes which are experimentally validated to bind. Note that the effects of these SNPs on binding itself was not specifically evaluated experimentally for all cases. There are 108 SNP-miRNA pairs reported in this database for which the SNP lies in the miRNA target. We filtered out the duplicated pairs and polymorphisms longer than a single nucleotide. After filtering, we obtained 64 SNP-miRNA pairs for study. The results of evaluating the 64 pairs are given in Tables 2, 3, and 4.
As Table 2 shows, mrSNP reports the binding effects of 43 SNPs out of 64 (67%) couples. For these miRNA-mRNA couples, the SNPs either disrupt a match in the seed region or introduce a new GU wobble. For 19 of these pairs given in Table 2(a), the SNPs break a matching in the seed region, therefore, the minimum matching criteria cannot be satisfied. On the other hand, 4 of SNPs given at the top of Table 2(b) break a matching at the end of the seed region, resulting a 6mer, for which the binding energies become lower than the threshold. Similarly, the other 20 SNPs in Table 2(b) introduce GU wobbles in the seed region, resulting to disturb the binding due to the minimum binding energy criteria. Table 3 lists the 8 pairs for which mrSNP does not report a binding difference as the sequences for both alleles are predicted to bind the miRNAs at similar levels. Note that for the first two pairs in the Table 3, mrSNP captures the disruption in binding. However, mrSNP does not report these SNPs to affect binding, as it identifies another seed region for the miRNA in a location very close the original target. mrSNP identifies changes in the binding energies of the pairs in Table 3, which are ignored as the sequences for both alleles satisfy the minimum matching criteria. Table 4 lists another category of pairs that were not predicted by mrSNP. For these 13 pairs, the sequences for neither allele were calculated to bind the given miR-NAs. The binding of these miRNAs are not predicted because the minimum matching criteria is not satisfied, as explained in more detail in the table.
We queried the 64 SNPs (Tables 2, 3, and 4) on mirSNP, PolymiRTS, and mirsnpscore. MirSNP reports 44 of these pairs as binding difference, and the results of MirSNP are very consistent with mrSNP. 57 of the 64 pairs queried are present in the PolymiRTS database, which includes miRNA-mRNA pairs identified through methods that include experimental data such as gene expression profiles and cross-linked immunoprecipitation sequencing data as well as pairs identified from the literature (rather than purely prediction methods). Because of the inclusion of experimental data and results from the literature, it is difficult to compare the results of PolymiRTS to mrSNP. Only 9 of the validated miRNA-mRNA pairs are found using mirsnpscore. Note that although the miRNAs-gene bindings are experimentally validated in this experiment, (Tables 2, 3, and 4), the actual effects of the SNPs on miRNA bindings are unknown. Therefore, it is not possible to determine the biological accuracy of the tools in this experiment. However, one result we can conclude from this experiment is that, mrSNP captures 51 (43 + 8) of the 64 (80%) experimentally validated miRNA bindings.
When comparing mirSNP to mrSNP across both experiments, 56 of 80 SNPs (70%) were predicted by mirSNP to disrupt miRNA binding. mrSNP compares favorably by predicting that SNPs will disrupt the binding 54 of the 80 (68%) miRNA target sites across these two experiments.

Conclusion
We developed a new tool, mrSNP, that predicts the effects of SNPs on miRNA binding. There are several advantages to this tool over existing tools. The proposed tool not only works on existing SNPs in databases such as dbSNP but also on novel SNPs which will be of great utility for researchers identifying new SNPs or somatic mutations in their samples. Secondly, our tool decreases the manual labor currently required for running prediction algorithms for novel SNPs. We present the results of mrSNP for various 3'UTR SNPs that were experimentally http://www.biomedcentral.com/1471-2105/15/73   validated to disturb miRNA binding. We also compare the performance of mrSNP with other miRNA binding prediction tools, for which mrSNP performed better than all but one other platform, MirSNP, that had a success rate of 75% (Table 1). mrSNP correctly predicted 11 of 16 (69%) disease-associated and/or experimentally validated SNPs that are reported in the literature or other databases. We observed that the recovery rate of mrSNP can be adjusted by using different set of parameters, but this may alter the false-positive rate. The major limitation of mrSNP is  that it did not capture all of the SNPs experimentally predicted to disrupt miRNA binding. In future experiments, we will study additional larger sets of experimentally validated SNPs to improve the sensitivity and specificity of our binding predictions. As the literature is beginning to note miRNA binding to other regions of mRNAs and the potential for an influence on the 3'UTR location on binding, we will strive to incorporate these into our algorithms. http://www.biomedcentral.com/1471-2105/15/73 In summary, mrSNP is a highly adaptable and performing tool for predicting the effect a SNP will have on miRNA binding.

Availability and requirements
Project name: mrSNP; Project home page: http://mrsnp.osu.edu; Operating system(s): Platform independent; Programming language: Python, PHP and JavaScript; Other requirements: JavaScript compatible browser; License: Free for commercial and academic use; Any restrictions to use by non-academics: No specific restrictions.