TargetSpy: a supervised machine learning approach for microRNA target prediction
© Sturm et al; licensee BioMed Central Ltd. 2010
Received: 9 October 2009
Accepted: 28 May 2010
Published: 28 May 2010
Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved) seed match to the 5' end of the microRNA. Recently however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites.
We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences.
In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I) no seed match requirement, II) seed match requirement, and III) conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs, our method shows superior performance in all classes. In Drosophila melanogaster, not only are our class II and III predictions on par with other algorithms, but notably the class I (no-seed) predictions are just marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms.
Only a few algorithms can predict target sites without demanding a seed match, and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms. TargetSpy was trained on mouse and performs well in human and Drosophila, suggesting that it may be applicable to a broad range of species. Moreover, we have demonstrated that the application of machine learning techniques in combination with upcoming deep sequencing data results in a powerful microRNA target site prediction tool (http://www.targetspy.org).
The discovery of microRNAs in 1993  introduced a totally new dimension in our understanding of how gene expression is regulated. Animal and plant genomes contain hundreds of microRNA genes [2, 3] that control fundamental cellular processes and are implicated in severe diseases. Incorporated into a protein complex named RISC, microRNAs perform posttranscriptional gene regulation either through perfect binding to a cis-regulatory target site in the 3'UTR that is subsequently cleaved, leading to mRNA degradation, or by imprecise binding preferably of the microRNA 5' end to a target site, leading to possibly reversible repression of protein production. While posttranscriptional cleavage is prevalent in plants, translational repression is the predominant type of regulation in animals. Our current knowledge about the function of specific microRNAs, their targeted messenger RNAs, and the exact location of binding sites is limited.
Experimental detection of microRNA target sites is a costly and time-consuming process. While recent estimates suggest that more than 50% of human protein-coding genes may be regulated by microRNAs and that each microRNA may bind to 300-400 target genes, the latest release of the TarBase database contains information on only 995 human in vivo microRNA-gene interactions involving 103 distinct microRNAs and 825 distinct genes, a far cry from the actual extent of microRNA targeting [2, 4]. Computational prediction of microRNA/gene interactions is a valuable tool for guiding wet-lab experiments, and it remains the only option for systematic genome-wide reconstruction of the complex combinatorial picture of microRNA-mediated target binding. It is also a challenging task because of the daunting difficulty of distinguishing true microRNA-mRNA hybrids against the noisy background of millions of possible microRNA-gene combinations and, more generally, because the basic mechanisms of microRNA target recognition remain largely unknown. Over the recent years many target prediction algorithms have been developed based on different principles [see  for a review]. However, the two recurring parameters used by the available methods are i) the existence of a seed match (continuous base pairing between a 3'UTR and the first 6-8 bases of a microRNA 5' end), and ii) evolutionary conservation of the target site across multiple species. Utilization of these powerful constraints in prediction algorithms leads to more reliable detection of those functional duplexes containing them, but at the same time limits our ability to identify biologically relevant microRNA target sites that do not fulfill these requirements. By definition, organism-specific or simply poorly conserved sites cannot be predicted at all if a conservation filter is applied. 
It has also been suggested that the seed match requirement may be too stringent, and that at least a second "type" of target sites, the so-called 3' compensatory target sites, exists that cannot be detected by seed-match-based methods. On the other hand, many potential microRNA-target interactions that do involve conserved seed regions may be non-functional in a physiological context. Furthermore, new biological insights into the mechanisms of target binding have been obtained which could be used for predicting target sites. For example, target site accessibility to the RISC complex has been suggested as an important determinant of functional interactions.
We sought to develop a computational technique free from both the seed requirement and the conservation filter and thus capable of predicting species-specific and 3' compensatory target sites. Our method, TargetSpy, incorporates current biological knowledge in the form of multiple sequence and structure features evaluated in the framework of the objective machine-learning prediction scheme MultiBoost with decision stumps as base learner. However, since the (conserved) seed match is a strong determinant of target site detection, even though not the only one, we additionally generate predictions for sites with conserved and unconserved seed matches by post-filtering TargetSpy results. We carried out extensive benchmark tests of the TargetSpy performance in human and Drosophila melanogaster. Our results suggest that TargetSpy, although trained on mouse, achieves the same performance as the best state-of-the-art methods in D. melanogaster, implying that the method can be applied to a broad taxonomic range of species for which no experimentally validated target sites are known. Furthermore, on the recently published experimental human dataset, describing changes in protein synthesis mediated by microRNAs, our method shows the highest accuracy among all tested prediction algorithms.
Results and Discussion
Classification of prediction approaches
Classification of microRNA target site prediction tools
Columns: Seed match not required | Seed match required | Seed match required and conservation considered
Entries (column assignment not recoverable): PITA All 3/15; MiRBase Targets; PITA TOP; TargetSpy cons. seed
Some methods cannot be perfectly fitted into this scheme. For example, while miRanda  does not require a perfect seed match it weights the seed region so high that on average just around 7% of all predicted target sites show mismatches to the 7-mer seeds (microRNA nucleotides 1-7 or 2-8). miRBase Targets  uses miRanda for candidate generation, and permits a single mismatch to the seed region. Since both approaches additionally require the target site to be conserved, we consider them as members of class III.
Note also that TargetSpy generally belongs to class I, since our model does not impose a strict seed match requirement and does not consider conservation of target sites. However, we can easily build subsets of our predictions that satisfy the criteria of class II and III. Throughout this work we refer to the subset of TargetSpy predictions containing a perfect 7-mer seed match as TargetSpy seed. Likewise, TargetSpy conserved seed denotes a set of predicted target sites containing a conserved seed match (see Methods for details).
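The post-filtering step that yields the TargetSpy seed subset can be sketched as follows. This is a minimal illustration, assuming that a "perfect 7-mer seed match" means a Watson-Crick match of the target site to microRNA nucleotides 1-7 or 2-8, as stated above for miRanda; the exact criterion is given in the Methods. The function names and the dictionary layout of a prediction are hypothetical.

```python
# Hypothetical helper names; the 7-mer seed windows (nt 1-7 or 2-8) follow the text above.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the Watson-Crick reverse complement of an RNA sequence."""
    rna = rna.upper().replace("T", "U")
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def has_perfect_7mer_seed(mirna: str, site: str) -> bool:
    """True if the site pairs perfectly with microRNA nt 1-7 or nt 2-8."""
    mirna = mirna.upper().replace("T", "U")
    site = site.upper().replace("T", "U")
    seeds = (mirna[0:7], mirna[1:8])
    return any(reverse_complement(seed) in site for seed in seeds)


def filter_seed(predictions):
    """Keep only predictions whose target site contains a perfect 7-mer seed match.

    Each prediction is assumed to be a dict with 'mirna' and 'site' sequences.
    """
    return [p for p in predictions if has_perfect_7mer_seed(p["mirna"], p["site"])]
```

Applied to a list of class I predictions, `filter_seed` returns the class II subset; a conserved-seed filter would additionally need the aligned sequences of the other species.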
Computational pipeline for predicting microRNA target sites
Our intention here is to build a pipeline for predicting microRNA target sites based on the multiple features described in the Methods section. At run time TargetSpy takes two multiple FASTA files as input: one with the 3' UTR sequences and the other with the mature microRNA sequences. Note that no other extrinsic information, such as evolutionary conservation, needs to be provided.
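To make the input format concrete, the following sketch reads a multiple FASTA file into a dictionary of sequences. It is not the TargetSpy parser; the function name and the choice to key records by the first word of the header line are assumptions made for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-FASTA text into {record name: sequence}.

    The record name is taken as the first whitespace-delimited token
    of the header line (an assumption of this sketch).
    """
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

Because the function accepts any iterable of lines, it can be called on an open file handle for each of the two input files (3' UTRs and mature microRNAs).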
Number of target sites predicted in each species by different versions of the TargetSpy method (see Methods for more detail). Table fields: number of predicted target sites, number of 3'UTRs, average 3'UTR length, number of microRNAs. Prediction sets: TargetSpy no-seed sens, TargetSpy no-seed spec, TargetSpy seed sens, TargetSpy seed spec.
Target site candidates
A usual starting point of a prediction workflow is the search for perfect seed matches in the 3'UTR of transcripts of interest. Since our goal is to develop a model that does not rely on the presence of a seed match we had to redefine the rules for selecting initial candidate target sites. Following the reasoning that a functional site is more attracted by the loaded RISC complex than its surrounding area, we identify candidates by searching for areas in the target sequence where the predicted Gibbs free energy of the microRNA-target duplex is below a certain microRNA-specific energy threshold (see Methods for detail). To ensure a high coverage of functional binding sites we have chosen a conservative cut-off. With this candidate definition at hand, we identified about 150 million target site candidates for all microRNAs in human.
Selection of informative features and classifier evaluation
A ranked list of all features used in this work. The score is calculated by the ReliefF method
Number of base pairings to the microRNA 8-mer seed
G+C content of target site
Number of base pairings to the first 8 nucleotides of the microRNA 3' end
Number of consecutive base-pairings to the microRNA 3' end with two allowed non-pairing positions
Occurrence of CpG in target site
G+C content ratio between the microRNA and the target site
Longest stretch of consecutive base-pairings in the hybrid
Number of bulges in the microRNA of size three
T1 S/W anchor
Total number of base-pairings
Number of bulges on the target site of size seven or greater
Number of bulges in the microRNA of size two
Occurrence of CpG in the upstream flanking area
Number of bulges in the target site of size one
Total bulge length of the target site
Length of the target site
Total bulge length of the microRNA
Target site position within the 3'UTR
Number of symmetric bulges
G + C content upstream of the target site
Number of bulges on the target site
Length of the second largest bulge on the target site
Mean length of bulges on the target site
T9 S/W anchor
Number of bulges in the target site of size two
Total number of G:U wobble base pairs
Local RISC accessibility 30/30
Local RISC accessibility 3/15
Number of bulges in the target site of size four
Difference in G+C content between the first and the last nt of the target site
Occurrence of CpG in downstream flanking area
Number of bulges in the microRNA of size one
Length of the second largest bulge on the microRNA
Number of bulges on the microRNA
Number of bulges in the microRNA of size five
Number of bulges in the target site of size three
Difference in G + C content between the target site and the 20 nt upstream and downstream flanking region
Number of bulges in the target site of size five
Number of bulges in the microRNA of size four
G+C content downstream of the target site
Number of bulges in the target site of size six
Applied thresholds and limitations on the prediction subsets. Table fields: prediction dataset name, seed match required, false-positive rate threshold. Prediction sets: TargetSpy no-seed sens, TargetSpy no-seed spec, TargetSpy seed sens, TargetSpy seed spec, TargetSpy cons. seed sens, TargetSpy cons. seed spec.
Evaluation on experimentally verified data
We next set out to evaluate the quality of the learning scheme implemented in TargetSpy on experimentally verified data and to benchmark TargetSpy against commonly used methods. This task is challenging as published methods are based on different principles, which makes it hard to compare them in a fair fashion. Our current knowledge about microRNA target sites is almost exclusively drawn from a handful of experiments exploring the targeting of a minority of the most highly expressed microRNAs . These experiments may have a strong selection bias in that they usually analyze the impact of microRNA overexpression or depletion on conserved molecular mechanisms. In addition, experimentally identified targets are often biased towards computational prediction approaches used to identify the initial pool of candidates .
Recently the impact of microRNA overexpression and knockdown was analyzed in large-scale proteomic studies [16, 18] that do not suffer from the selection bias discussed above. A further advantage is that none of the prediction approaches were trained on these data. Despite the low number of measured microRNAs (five) and the fact that pSILAC does not determine the precise location of the target site in the respective transcript, these data constitute the current gold standard until high quality deep sequencing data such as HITS-CLIP become available in large amounts. Hence we use the fly dataset [9, 17] and the human pSILAC dataset for evaluation.
Performance comparison in Drosophila melanogaster
In 2005 Stark et al.  conducted a broad comparison of widely used target prediction approaches. A set of 133 experimentally tested functional and non-functional microRNA-gene interactions was compiled, from which 120 (functional: 61, non-functional: 59) were used for the actual comparison . This dataset served as the standard of truth to evaluate the evolutionary approach to microRNA target prediction published by Gaidatzis et al.  and was later extended to 190 interactions by Kertesz et al. . Note that this latter set also includes the 13 interactions that were excluded by Stark and colleagues since their respective 3'UTRs were not annotated. To assess the predictive power of our method and compare it with other methods, we applied it first to the original set and then to the extended set of targets. Since we assign each candidate zone a score, we are able to quantify the performance by a receiver operating characteristic (ROC) curve, making the comparison to other approaches more transparent.
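As an illustration of how a ROC curve and its area (AUC) follow from scored predictions, consider the sketch below. It assumes that a higher score means a more confident prediction and that each tested interaction carries a binary label (1 for functional, 0 for non-functional); the function names are hypothetical and this is not the evaluation code used in the study.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the score threshold.

    Predictions with identical scores are processed together so that ties
    do not create an arbitrary ordering.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

On the fly benchmark the labels would be the functional and non-functional interactions, and the scores the classifier output for the best candidate of each microRNA-gene pair (the latter being an assumption of this sketch).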
Although comparing all methods by a single value is convenient, the progression of the ROC curve is more informative, especially for methods that are clustered closely together by AUC value. The behavior of the curves at low false-positive rates is particularly interesting, as for many experiments the number of samples may be strongly limited. TargetSpy conserved seed shows the lowest false-positive rate (FPR) in the test up to a true-positive rate (TPR) of 48%. TargetSpy seed shows the second lowest FPR, but offers slightly better TPR, comparable to that of PicTar. Note that TargetSpy no-seed shows a performance that is close to class II and III methods, especially for its top predictions that cover more than 50% in TPR.
Benchmarking on the extended set of 190 experimentally verified microRNA-target interactions (see Methods) produces several interesting observations (Figure 3B,D). First, the AUC values are generally lower compared to the original set. Second, PITA, specifically fitted to this set, is far ahead of all other approaches. Third, the ranking of the other approaches has not changed except that i) TargetSpy seed performs ahead of PicTar and EIMMo, TargetSpy conserved seed outperforms TargetScanS, and miRBase Targets performs better than PITA TOP 3/15, and ii) the relative distance between TargetSpy no-seed and TargetScanS is reduced. Finally, the specificity, in particular that of EIMMo and TargetScanS, suffered strongly, especially for the top predictions.
In summary, the evaluation on experimental fly data suggests that TargetSpy, which was trained on mouse data, performs as well as current state-of-the-art algorithms when enforcing the seed match criterion. Furthermore, the no-seed prediction is notably better than RNA22, the other tested algorithm that does not require a perfect seed match.
Evaluation on pSILAC data
Selbach et al. performed a comparison of the most widely used approaches by measuring the fraction of predicted target sites associated with proteins that are more strongly down-regulated than -0.1 log2 fold change. They generated two background (random) sets: i) a set where all mRNAs present are considered as targets, and ii) a set of all mRNAs that have a 6-mer seed match in their sequence, further referred to as the Selbach background. We further generated a 7-mer seed background for class II, as PicTar and TargetSpy require perfect 7-mer seed matches. In order to compare class III predictors to the random expectation, we additionally introduced a background dataset for conserved 7-mer seed matches, as previously proposed. Specifically, we searched for 6-mer (positions 2-7) seed matches that are perfectly conserved in human, chimp, mouse, rat and dog and that additionally show a match to either base 1 or 8. This way we have background sets produced by trivial prediction strategies for each of the three classes of prediction tools.
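The accuracy measure described above can be written down in a few lines. The sketch assumes that predictions and measurements are both keyed by (microRNA, gene) pairs and that predictions without a pSILAC measurement are ignored; both are assumptions of this illustration, as are the function and variable names.

```python
def accuracy(predicted, log2fc, threshold=-0.1):
    """Fraction of measured predictions down-regulated beyond the threshold.

    predicted: iterable of (microRNA, gene) pairs reported by a tool.
    log2fc:    dict mapping (microRNA, gene) to the measured log2 fold change.
    Returns None if no prediction has a measurement.
    """
    measured = [pair for pair in predicted if pair in log2fc]
    if not measured:
        return None
    down = sum(1 for pair in measured if log2fc[pair] < threshold)
    return down / len(measured)
```

A background accuracy is obtained with the same function by passing a trivial prediction set, for example all measured pairs or all pairs whose 3'UTR contains a seed match.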
The background accuracy for class II is at 42.6% when using 6-mer seeds (Selbach background). As PITA covers all target site candidates with seed matches beginning at the size of 6 nt and subsequently ranks them according to their accessibility, it is necessary to consider only its top ranking predictions. Following Selbach et al. 2008  we took the top 1000 predictions per microRNA and found a 42.3% overlap with down-regulated proteins demonstrating an accuracy below the background level. TargetScanS, predicting non-conserved target sites with at least a 6-mer seed match, shows a higher accuracy (47.9%) than PITA and noticeably out-performs the 6-mer seed background, although it does not pass the accuracy of the 7-mer seed background. For 7-mer seeds, which are used by PicTar and TargetSpy seed, the corresponding background accuracy was 48.4%. Both TargetSpy seed sens and TargetSpy seed spec perform clearly better than this trivial prediction, showing accuracies of 52.8% and 55.3%, respectively, and thus perform best in class II.
The final class of target prediction approaches (seed/conservation) shows an overlap of 62.5% with random predictions. Both miRanda and miRBase Targets predictions include a small fraction of conserved target sites with imperfect seed regions, and the corresponding accuracies (45.5% and 42.1%, respectively) are below the background and close to that of the trivial seed predictor of class II. PicTar 4-way (those predictions where the seed match is conserved among 4 species) reaches an accuracy of 60.2%, which is also below the background. Interestingly, the more stringent PicTar 5-way, which additionally requires conservation in chicken, performs worse than PicTar 4-way. DIANA-microT, TargetScanS, PITA TOP (an official subset of PITA requiring conserved 8-mer seed matches) and EIMMo with the high confidence setting (score ≥ 0.5) also perform slightly below background. The first approach performing above the trivial predictor of class III is TargetRank (62.9%), followed by mirTarget2 (67.4%) and the sensitive subset of TargetSpy conserved seed (68.1%). Finally, TargetSpy conserved seed spec achieves the highest accuracy of all methods (75.3%). It should be noted, however, that although TargetSpy achieves superior performance in terms of accuracy and sensitivity in classes I and II, the sensitivity of TargetSpy in class III is lower compared to other approaches. The higher sensitivity of some approaches might be attributed to the choice of seed match that is enforced. EIMMo, for example, integrates several different seed match definitions (including short 6-base ones) and shows a sensitivity that is 2.7 times higher than our approach at the sensitive threshold and more than 6 times higher when compared to our specific setting.
In general our results on the pSILAC data suggest that TargetSpy performs best in each class, showing furthermore a constant gain in accuracy from the sensitive to the specific threshold. This finding implies that the prediction quality increases with the score and therefore the ranking of target sites imposed by the score of our model seems to have biological relevance. Under the assumption that all pSILAC interactions that are not down-regulated are true negatives, ROC plots (see Additional File 1, one for each class) confirm the performance evaluations based on accuracy.
Finally, we determined the number of functional target sites (down-regulated proteins) not possessing seed regions, which have been correctly predicted solely by TargetSpy but not by any other algorithm. To avoid putatively spurious seedless target sites, we excluded from consideration those gene-microRNA interactions that showed at least one perfect seed match. After removing such gene-microRNA interactions we obtained 564 unique target sites without seed region in the sensitive set and 134 in the specific set. This means that TargetSpy reports on average between 26 (spec) and 112 (sens) functional target sites showing no seed match per microRNA that could not be detected by any other tool.
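The per-microRNA figures can be checked with simple arithmetic. The sketch below assumes that the totals are divided evenly over the five microRNAs of the pSILAC dataset and that the reported values are the integer parts of the resulting averages; the variable names are illustrative.

```python
# Totals taken from the text; five microRNAs were measured in the pSILAC dataset.
N_MIRNAS = 5
unique_seedless_sites = {"sens": 564, "spec": 134}

# Average number of TargetSpy-only seedless target sites per microRNA.
per_mirna = {name: total / N_MIRNAS for name, total in unique_seedless_sites.items()}

for name, value in per_mirna.items():
    print(f"{name}: {value:.1f} sites per microRNA")
```

The averages of 112.8 (sens) and 26.8 (spec) are consistent with the stated range of 26 to 112 when truncated to whole sites.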
In summary, we have developed a novel computational approach for predicting microRNA target sites that neither requires the existence of a seed match nor utilizes phylogenetic footprinting. Instead of using rigid rules and/or arbitrarily selected target site features, we objectively derived a set of discriminative features to be used for machine learning. Due to these advantages TargetSpy i) is able to predict species-specific (i.e. unconserved) target sites, ii) is suitable for processing poorly conserved/low quality genomic sequences for which methods that rely on conservation and species-specific information will not work, and iii) allows analyzing differences in microRNA targets between various species.
We grouped computational prediction approaches into three classes, depending on their usage of a seed match criterion, in order to compare their performance with each other and against the background level. On an experimentally derived microRNA-target interaction set in Drosophila, our method is on par with the best available approaches. In a further benchmark on human microRNA-target data generated by the pSILAC technology, TargetSpy reported the highest accuracies not only in class I, but also in the other two classes, for which our predictions were post-filtered according to the class definition. Given that TargetSpy, trained on experimentally derived Ago binding sites in the 3'UTR of mouse transcripts, showed very good performance when evaluated in fly and human, we suggest that our algorithm can be applied to a broad taxonomic range of organisms.
Finally, we have shown that even on a small, high-quality data set of microRNA binding sites, derived by a deep sequencing experiment , machine learning techniques show high potential in the prediction of microRNA target sites. We assume that advances in this direction will become even more pronounced as more data of this kind become available.
Dataset of 3' UTR sequences
We retrieved 3'UTR sequences from the UCSC Genome Database using the UCSC Table Browser. For human (hg18, March 2006), mouse (mm8, July 2007), rat (rn4, November 2004) and chicken (galGal2, May 2006) we used the RefSeq Genes Track; for fly (dme, April 2006) we took the FlyBase annotations. For generating target site predictions considering conserved seeds, we used Galaxy to extract 3'UTR alignments for human, chimp, mouse, rat and dog from the 17-way human whole genome alignment and for D. melanogaster, D. yakuba, D. ananassae, and D. pseudoobscura from the 15-way D. melanogaster whole genome alignment.
Dataset of MicroRNA sequences
All mature microRNA sequences originate from the miRBase, release 12 . In total we retrieved 692 microRNAs for human, 513 for mouse, 443 for chicken and 147 for fly.
Target site predictions by previously published methods
The target predictions of PicTar were downloaded from the UCSC database using the Table Browser and were migrated from hg17 to hg18 by applying the UCSC command line tool liftOver. We used the predictions conserved in human, mouse, rat, chimp and dog (4-way) as well as the predictions additionally conserved in chicken (5-way). For fly we downloaded the sensitive prediction set (S1) of PicTar that is composed of predictions conserved in D. melanogaster, D. yakuba, D. ananassae, and D. pseudoobscura, also via the UCSC Table Browser. Predictions for the human genome made by miRanda, release September 2008, were downloaded from http://microRNA.org. Only predictions for transcripts contained in the RefSeq database were considered. Human and fly predictions made by miRBase Targets, version 5, were downloaded from http://microrna.sanger.ac.uk/targets/v5/. RNA22 predictions for human 3'UTR sequences were downloaded from http://cbcsrv.watson.ibm.com/rna22.html. Since these predictions were made using Ensembl transcripts, we mapped the predictions to RefSeq genes by applying mapping tables provided by Ensembl and UCSC. Predictions of PITA were downloaded from http://genie.weizmann.ac.il/pubs/mir07/mir07_data.html. We utilized the "TOP" and the "ALL" set with 3/15 flankings. TargetScanS predictions and the corresponding microRNA family mapping table were downloaded from http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_50. Predictions made by Gaidatzis et al. were downloaded from the EIMMo server http://www.mirz.unibas.ch/. Targets predicted by mirTarget2 (version 3) were downloaded from http://mirdb.org/miRDB. Human target site predictions of DIANA-microT v3.0 were retrieved via the web server at http://diana.cslab.ece.ntua.gr/microT/ for the thresholds loose (score = 7.3) and strict (score = 19). Finally, we downloaded the human target site predictions of TargetRank from http://hollywood.mit.edu/targetrank/.
Two sets of experimentally verified target sites were used to benchmark target prediction algorithms. For evaluation on Drosophila melanogaster, we used the 120 experimentally tested microRNA - gene interactions compiled by Stark et al.  (see Additional File 2) and the 190 interactions published by Kertesz et al.  (see Additional File 3). The former set is composed of 61 functional and 59 non-functional interactions; the latter set consists of 102 functional and 88 non-functional interactions. The appropriate 3'UTR sequences were derived from the FlyBase annotations provided by UCSC. Transcripts for which no 3' UTR was available were discarded. For evaluation on human, we used an experimental dataset (see Additional File 4) that is based on the pSILAC technique and reveals fold changes in protein production caused by five selected microRNAs , downloaded from http://psilac.mdc-berlin.de.
Free energy estimates
All duplex structures and energy estimates were calculated by the RNAduplex and RNAcofold programs from the Vienna package version 1.6.1. We applied the option -noLP to exclude base pairs that can only occur as lonely pairs and the option -e to retrieve all suboptimal structures instead of just the one with the minimum free energy. The minimum free energy that can be observed for a microRNA is defined as the energy value calculated for the duplex of the microRNA and its perfect reverse complement.
Generation of candidate zones
The microRNA - mRNA interaction is typically characterized as an interval within the mRNA sequence that is almost perfectly reverse complementary to the microRNA sequence over a substantial fraction of the microRNA's length, or at least over a seed region of 6-8 bases. In this work we investigate the possibility of abandoning the strict requirement for the presence of a seed region and attempt to find zones of high attraction between the microRNA and its target mRNA independent of seed occurrence. Such candidate zones cover not just a particular binding site, but a larger stretch of sequence including several potential adjacent binding sites.
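The idea of a candidate zone can be illustrated with a simple scan over an energy profile. The sketch assumes that a duplex free energy has already been estimated for every start position along the 3'UTR (for example with an external folding tool) and that a microRNA-specific threshold is given; how the profile and the threshold are actually derived is not specified here, so both inputs and the function name are hypothetical.

```python
def candidate_zones(energies, threshold):
    """Return half-open (start, end) intervals where the energy is below threshold.

    energies:  list of duplex free energies (kcal/mol), one per UTR position.
    threshold: microRNA-specific cut-off; lower (more negative) is more stable.
    """
    zones = []
    start = None
    for pos, energy in enumerate(energies):
        if energy < threshold:
            if start is None:
                start = pos  # a new zone opens here
        elif start is not None:
            zones.append((start, pos))  # the zone closes before this position
            start = None
    if start is not None:
        zones.append((start, len(energies)))
    return zones
```

Adjacent positions below the threshold are merged into a single zone, which mirrors the statement that a zone may contain several neighboring binding sites.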
Duplex stacking and anchor choice
Since we do not choose duplex structures based on significant free energy values, we obtain for each microRNA-mRNA pair a vast number of overlapping predicted secondary structures, typically on the order of 5000-20000. Therefore we need to group the resulting structures according to their location on the mRNA. Structures with a G:U match in the first 8 base pairings, as well as those that show five or more G:U pairings in the entire duplex, are discarded. Given that duplexes will heavily overlap on the mRNA, we need to define anchor points in order to map the duplexes to a specific position on the mRNA sequence.
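The G:U filter can be expressed compactly. In this sketch a duplex is represented as a list of base pairs (microRNA base, target base), ordered from the first pairing onwards; counting the "first 8 base pairings" from the microRNA 5' end is an assumption, and the function names are hypothetical.

```python
def is_gu(pair):
    """True for a G:U wobble pair, in either orientation."""
    return set(pair) == {"G", "U"}


def keep_duplex(pairs, first_n=8, max_gu=4):
    """Apply the two G:U rules from the text to one predicted duplex.

    pairs: list of (mirna_base, target_base) tuples in pairing order.
    A duplex is discarded if any of its first `first_n` pairs is G:U
    or if it contains more than `max_gu` G:U pairs overall (five or more).
    """
    if any(is_gu(pair) for pair in pairs[:first_n]):
        return False
    return sum(1 for pair in pairs if is_gu(pair)) <= max_gu
```

Only duplexes for which `keep_duplex` returns True would be passed on to the anchor-based grouping.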
Different types of anchors can be considered including, for example, the mRNA position where the first nucleotide pairing occurs, counting either from the 3' end or from the 5' end of the corresponding microRNA, or some middle point of the duplex. In view of the findings suggesting that base pairing at the 5' end of the microRNA contributes particularly strongly to target site recognition [33–37], we chose to define the anchor point as the position on the mRNA sequence where the first base-pairing with the microRNA 5' end occurs (Additional File 5). If duplexes with a perfect microRNA 5' end pairing of at least 7 consecutive nucleotides (seed region) are present, all other structures with the same anchor point and no seed region are discarded. Subsequently, the energetically most favorable structure for each anchor is selected as the best candidate for this specific anchor position (Figure 6A,B).
In order to develop a classifier of high quality, it is essential to obtain a training set that is truly representative of both the positive (actual target sites) and negative (non-target sites) class. Recently a set of argonaute (Ago) - mRNA binding sites, identified by a novel technique that isolates RNA by crosslinking immunoprecipitation in high-throughput experiments (HITS-CLIP), was published for the 20 most abundant microRNAs present in the P13 mouse brain. Argonautes are proteins that upon association with microRNAs form the RNA-induced silencing complex (RISC), which is responsible for the repression of target mRNA expression. To our knowledge this is the first experimental data set that directly reports microRNA target sites in a large-scale fashion, and TargetSpy is the first algorithm to use it for training. We retrieved the data from http://ago.rockefeller.edu/ and removed all sites that did not map to 3'UTRs or had no RefSeq accession number associated. Since only the microRNA family is specified in this publication, we identified the candidates for all microRNAs belonging to that family. Those candidates that overlapped the experimentally derived sites were retained as positive instances. In cases where several candidates of the same microRNA family overlapped, we took the energetically most favorable one. Target site candidates having no equivalent in the set of experimentally derived Ago binding sites are unlikely to be biologically relevant. We therefore identified, for each reported Ago-mRNA interaction, the energetically most stable candidate that does not overlap the validated Ago-binding site. Those candidates served as negative instances. In total we obtained 3872 positive and 4540 negative instances.
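The anchor-based selection described above may be sketched as follows. Each predicted duplex is assumed to be a dictionary holding its anchor position on the mRNA, its free energy, and a flag indicating whether it contains a seed region of at least 7 consecutive pairings; this record layout and the function name are hypothetical.

```python
from collections import defaultdict


def best_per_anchor(duplexes):
    """Select one duplex per anchor position.

    duplexes: iterable of dicts with keys 'anchor' (int), 'energy' (float,
    kcal/mol) and 'has_seed' (bool).
    If any duplex at an anchor has a seed region, seedless duplexes at that
    anchor are dropped; the lowest-energy duplex of the remainder is kept.
    """
    by_anchor = defaultdict(list)
    for duplex in duplexes:
        by_anchor[duplex["anchor"]].append(duplex)

    best = {}
    for anchor, group in by_anchor.items():
        seeded = [d for d in group if d["has_seed"]]
        pool = seeded if seeded else group
        best[anchor] = min(pool, key=lambda d: d["energy"])
    return best
```

Note that a seeded duplex is preferred even when a seedless duplex at the same anchor is energetically more favorable, which follows the order of the two steps in the text.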
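The labelling of training instances can be illustrated for a single reported Ago binding site. The sketch assumes half-open coordinates on the 3'UTR and a list of candidates for one microRNA family on the same transcript; the record layout and function names are hypothetical, and details such as handling several Ago sites per transcript are left out.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def label_candidates(candidates, ago_site):
    """Pick one positive and one negative instance for an Ago binding site.

    candidates: list of dicts with 'start', 'end' and 'energy' (kcal/mol).
    ago_site:   (start, end) of the experimentally derived binding site.
    Positive: the most favorable candidate overlapping the site.
    Negative: the most favorable candidate not overlapping the site.
    Either element is None if no such candidate exists.
    """
    site_start, site_end = ago_site
    inside = [c for c in candidates
              if overlaps(c["start"], c["end"], site_start, site_end)]
    outside = [c for c in candidates
               if not overlaps(c["start"], c["end"], site_start, site_end)]
    positive = min(inside, key=lambda c: c["energy"]) if inside else None
    negative = min(outside, key=lambda c: c["energy"]) if outside else None
    return positive, negative
```

Choosing the most stable non-overlapping candidate as the negative instance makes the negative class deliberately hard, since energy alone cannot separate it from the positives.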
Features of microRNA - mRNA duplexes
In order to build an accurate microRNA target predictor, it is of paramount importance to define a set of characteristics that effectively distinguish real microRNA-mRNA interactions from any other types of hybrids. Numerous properties of such duplexes have been reported in the biological literature in recent years, and some of them have been incorporated into earlier target prediction methods. Here, instead of relying on a limited number of empirically selected features, we chose to objectively evaluate the performance of as broad a spectrum of pairing requirements as possible within the framework of a machine learning approach. The features used in this study are listed below.
General extent of microRNA-mRNA binding
Number of base-pairings to the microRNA 8-mer seed.
Number of base-pairings to the first eight nucleotides of the microRNA 3' end.
Number of consecutive base-pairings at the microRNA 3' end with two allowed non-pairing positions, beginning at the first base pairing position.
Length of the longest stretch of consecutive base-pairings anywhere in the hybrid.
Length of the target site.
Binding asymmetry. Here we measured the ratio between the numbers of paired bases in the 3' versus the 5' region of the microRNA. We considered 8 nucleotides on each side.
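Several of the features in this group can be derived from a simple per-nucleotide pairing vector, as in the sketch below. The representation (one boolean per microRNA nucleotide, ordered 5' to 3') is an assumption made for illustration; it ignores bulges on the target side, and the handling of a zero denominator in the asymmetry ratio is likewise our own choice.

```python
def binding_extent_features(paired):
    """Compute pairing-extent features from a list of booleans.

    paired[i] is True if microRNA nucleotide i (5' to 3') is base-paired.
    """
    five_prime = sum(paired[:8])    # pairings within the 8-mer seed region
    three_prime = sum(paired[-8:])  # pairings within the last 8 nucleotides

    # Longest run of consecutive paired microRNA nucleotides.
    longest = run = 0
    for is_paired in paired:
        run = run + 1 if is_paired else 0
        longest = max(longest, run)

    # Ratio of 3' to 5' pairings; infinite if the 5' region is unpaired.
    asymmetry = three_prime / five_prime if five_prime else float("inf")
    return {
        "seed_pairs": five_prime,
        "three_prime_pairs": three_prime,
        "longest_stretch": longest,
        "asymmetry": asymmetry,
    }
```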
Extent of G:U base pairing
Total number of G:U wobble base pairs in the microRNA - mRNA hybrid.
Bulge-related features of duplexes
Number of bulges on the microRNA.
Number of bulges on the target site.
Total bulge length on the microRNA.
Total bulge length on the target site.
Number of bulges of a given length on the microRNA. We tested bulge lengths of 1, 2, 3, 4 and 5 bases.
Number of bulges of a given length on the target site. In this case we tested bulge lengths of 1, 2, 3, 4, 5, 6 and 7 or more bases, as bulges on the mRNA sequence tend to be larger than those on microRNAs.
Length of the second largest bulge on the microRNA.
Length of the second largest bulge on the target site.
Mean length of bulges on the target site.
Number of symmetric bulges.
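As an illustration of how these bulge features might be computed, the sketch below starts from a list of interior gaps between paired stretches, each given as a pair of unpaired-nucleotide counts (microRNA side, target side). This input format, the feature names, and the definition of a symmetric bulge as a gap with equal non-zero lengths on both strands are assumptions made here.

```python
from collections import Counter


def second_largest(lengths):
    """Second largest value of a list, or 0 if fewer than two values."""
    ordered = sorted(lengths, reverse=True)
    return ordered[1] if len(ordered) > 1 else 0


def bulge_features(loops):
    """loops: list of (mir_len, tgt_len) unpaired counts per interior gap."""
    mir = [m for m, _ in loops if m > 0]
    tgt = [t for _, t in loops if t > 0]
    return {
        "n_mir": len(mir),
        "n_tgt": len(tgt),
        "total_mir": sum(mir),
        "total_tgt": sum(tgt),
        # Per-length counts: 1..5 on the microRNA; 1..6 and ">= 7" (key 7) on the target.
        "mir_by_length": Counter(m for m in mir if m <= 5),
        "tgt_by_length": Counter(min(t, 7) for t in tgt),
        "second_largest_mir": second_largest(mir),
        "second_largest_tgt": second_largest(tgt),
        "mean_tgt": sum(tgt) / len(tgt) if tgt else 0.0,
        "n_symmetric": sum(1 for m, t in loops if m == t and m > 0),
    }
```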
Position specific features
Position of the target site in the 3'UTR. We split 3'UTRs into 100 bins and returned the index of the bin containing the anchor of the candidate zone's representative as the position of the target site.
Following the reasoning of Lewis et al., we calculated four features related to the base occurrence at given positions. Specifically, we recorded the nucleotides in the target site at microRNA positions 1 (t1 anchor) and 9 (t9 anchor) and the existence of a W (weak: A or U) or S (strong: G or C) base at the same positions (t1 S/W anchor and t9 S/W anchor).
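The binning of the target site position can be written in a few lines. The sketch assumes 0-based anchor coordinates and 0-based bin indices; neither convention is specified in the text.

```python
def utr_position_bin(anchor, utr_length, n_bins=100):
    """Return the index (0 .. n_bins - 1) of the bin containing the anchor.

    anchor is a 0-based position on a 3'UTR of length utr_length.
    """
    if not 0 <= anchor < utr_length:
        raise ValueError("anchor must lie within the 3'UTR")
    # Integer arithmetic avoids floating point rounding at bin boundaries.
    return anchor * n_bins // utr_length
```

Expressing the position as a bin index makes the feature comparable between 3'UTRs of different lengths.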
Base composition of both 3' UTRs and microRNAs plays an important role in mRNA-microRNA recognition. Here we employ the following compositional features:
G + C content of the target site.
G + C content of the 50 nucleotide long region upstream of the target site.
G + C content of the 50 nucleotide long region downstream of the target site.
G + C content ratio between the microRNA and the target site.
Difference in G + C content between the target site and the upstream flanking region.
Difference in G + C content between the first and the last eight nucleotides of the target site.
Occurrence of CpG di-nucleotide in the target site sequence as well as in its 3' and 5' flanking regions.
The length of 3' and 5' flanking regions was taken to be 20 nucleotides, unless otherwise stated.
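A subset of the compositional features can be sketched as follows, assuming the 3'UTR is given as a string and the target site as a half-open interval `[start, end)`. The function and key names are hypothetical, and flanking regions are simply truncated at the ends of the 3'UTR, which is an assumption on our part.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def compositional_features(utr, start, end, flank=50):
    """G + C based features for the target site utr[start:end]."""
    site = utr[start:end]
    upstream = utr[max(0, start - flank):start]
    downstream = utr[end:end + flank]
    gc_site = gc_content(site)
    gc_up = gc_content(upstream)
    return {
        "gc_site": gc_site,
        "gc_upstream": gc_up,
        "gc_downstream": gc_content(downstream),
        "gc_diff_site_upstream": gc_site - gc_up,
        # Difference between the first and the last eight site nucleotides.
        "gc_diff_first_last8": gc_content(site[:8]) - gc_content(site[-8:]),
        "cpg_in_site": "CG" in site.upper(),
    }
```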
In the equation for the compactness above, #pair is the number of basepairings, tsLen the target site length and miRLen the length of the microRNA.
Accessibility of the target site to RISC
Recent literature [9, 39] suggests that target site accessibility to RISC is a critical factor in microRNA target recognition. Examples of approaches that have been developed to approximate accessibility are RNAup and IntaRNA. Following a previously published definition, we calculated accessibility as the difference between the free energy of the microRNA hybrid and the energy of the local secondary structure of the target site including 3 nt upstream and 15 nt downstream flanking sequences. We further tested all combinations of upstream and downstream flanking lengths from 0 nt to 30 nt in 5 nt steps.
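The scan over flanking lengths can be sketched as below. The folding step is passed in as a callable (`fold_energy`, a hypothetical placeholder for whatever secondary structure predictor is used), so the sketch makes no claim about a particular library interface; the target site is assumed to be the half-open interval `[start, end)` on the 3'UTR.

```python
from itertools import product


def accessibility_features(utr, start, end, hybrid_energy, fold_energy,
                           step=5, max_flank=30):
    """Accessibility for every (upstream, downstream) flank combination.

    fold_energy: callable returning the free energy of the local secondary
    structure of a sequence (placeholder for an external folding tool).
    """
    flanks = range(0, max_flank + 1, step)  # 0, 5, ..., 30
    features = {}
    for up, down in product(flanks, flanks):
        region = utr[max(0, start - up):end + down]
        # Hybrid free energy minus the energy of the local target structure.
        features[(up, down)] = hybrid_energy - fold_energy(region)
    return features
```

With the defaults this yields 7 x 7 = 49 accessibility values per candidate; the 3 nt / 15 nt setting mentioned in the text lies off this grid and would be computed separately.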
Target site prediction
For each organism considered, we predicted microRNA target sites and ranked them according to their score. As explained above, TargetSpy initially considers every potential candidate zone and assigns a score to it. Overlapping candidate zones are merged, and the highest-scoring representative among them becomes the representative of the entire merged zone. This permissive approach generates vast numbers of candidates with very low scores. Additional criteria are subsequently imposed in various combinations to narrow down the set of predicted targets (Table 2). The naming of the prediction datasets is based on whether or not the presence of a seed region is required, and whether a permissive (sens) or strict (spec) threshold is applied.
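The merging of overlapping candidate zones can be illustrated with a standard interval merge that carries the best-scoring representative along. The `Zone` record and the use of inclusive coordinates are assumptions made for this sketch.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    """Hypothetical scored candidate zone (inclusive coordinates)."""
    start: int
    end: int
    score: float


def merge_zones(zones):
    """Merge overlapping zones; return (start, end, representative) tuples."""
    merged = []
    for zone in sorted(zones, key=lambda z: z.start):
        if merged and zone.start <= merged[-1][1]:
            start, end, rep = merged[-1]
            # Extend the merged zone and keep the higher-scoring representative.
            best = max(rep, zone, key=lambda z: z.score)
            merged[-1] = (start, max(end, zone.end), best)
        else:
            merged.append((zone.start, zone.end, zone))
    return merged
```

Merging is transitive here: two zones that do not overlap each other end up in the same merged zone if a third zone bridges them.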
Evaluation of prediction performance
The evaluation on pSILAC data was performed as in Selbach et al. 2008. The performance measure (accuracy) was defined as the fraction of predicted mRNA targets with reduced protein production (log2 fold change < -0.1), which matches the definition of the positive predictive value (PPV).
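The accuracy measure amounts to a simple fraction, sketched below for a list of log2 fold changes of the predicted targets. The function name is hypothetical, and the input is assumed to contain only targets for which a pSILAC measurement is available.

```python
def accuracy_ppv(log2_fold_changes, threshold=-0.1):
    """Fraction of predicted targets with log2 fold change below threshold."""
    if not log2_fold_changes:
        raise ValueError("no predicted targets with measurements")
    # A prediction counts as correct if protein production is reduced.
    hits = sum(1 for fc in log2_fold_changes if fc < threshold)
    return hits / len(log2_fold_changes)
```

The comparison is strict, so a target with a log2 fold change of exactly -0.1 is not counted as down-regulated.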
Implementation and availability
We implemented our method as a stand-alone Java program called TargetSpy. The program relies on the Java Virtual Machine version 1.5 and two freely available third party software packages - the Vienna package for RNA secondary structure prediction (version 1.6.1) and the data mining software WEKA (version 3.5.3). TargetSpy is available from our web site http://www.targetspy.org along with installation instructions and links to all required third party software packages.
We are grateful to Hans-Werner Mewes, Florian Büttner and Thorsten Schmidt for careful reading of the manuscript and many useful comments. The work of MH was supported by the Spanish Government (Grant number: BIO2008.01353) and by the Junta de Andalucia (Grant number P07-FQM-03613). The work of the German authors was supported by the Helmholtz Association and the Technical University of Munich.
1. Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993, 75(5):843–854. 10.1016/0092-8674(93)90529-Y
2. Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell 2009, 136(2):215–233. 10.1016/j.cell.2009.01.002
3. Bentwich I, Avniel A, Karov Y, Aharonov R, Gilad S, Barad O, Barzilai A, Einat P, Einav U, Meiri E, et al.: Identification of hundreds of conserved and nonconserved human microRNAs. Nature genetics 2005, 37(7):766–770. 10.1038/ng1590
4. Sethupathy P, Corda B, Hatzigeorgiou AG: TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA (New York, NY) 2006, 12(2):192–197.
5. Didiano D, Hobert O: Perfect seed pairing is not a generally reliable predictor for miRNA-target interactions. Nature structural & molecular biology 2006, 13(9):849–851.
6. John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS: Human MicroRNA targets. PLoS biology 2004, 2(11):e363. 10.1371/journal.pbio.0020363
7. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS: MicroRNA targets in Drosophila. Genome biology 2003, 5(1):R1. 10.1186/gb-2003-5-1-r1
8. Kononenko I: Estimating Attributes: Analysis and Extensions of RELIEF. Springer Verlag; 1994:171–182.
9. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nature genetics 2007, 39(10):1278–1284. 10.1038/ng2135
10. Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. Morgan Kaufmann, San Francisco; 2005.
11. Hall MA, Smith LA: Feature Subset Selection: A Correlation Based Filter Approach. Thesis submitted in partial fulfilment of the requirements of the degree of Doctor of Philosophy at the University of Waikato 1998.
12. Yousef M, Jung S, Kossenkov AV, Showe LC, Showe MK: Naive Bayes for microRNA target predictions--machine learning for microRNA targets. Bioinformatics (Oxford, England) 2007, 23(22):2987–2992. 10.1093/bioinformatics/btm484
13. Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005, 120(1):15–20. 10.1016/j.cell.2004.12.035
14. Kim SK, Nam JW, Rhee JK, Lee WJ, Zhang BT: miTarget: microRNA target gene prediction using a support vector machine. BMC bioinformatics 2006, 7: 411. 10.1186/1471-2105-7-411
15. Bandyopadhyay S, Mitra R: TargetMiner: microRNA target prediction with systematic identification of tissue-specific negative examples. Bioinformatics (Oxford, England) 2009, 25(20):2625–2631. 10.1093/bioinformatics/btp503
16. Baek D, Villen J, Shin C, Camargo FD, Gygi SP, Bartel DP: The impact of microRNAs on protein output. Nature 2008, 455(7209):64–71. 10.1038/nature07242
17. Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM: Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3'UTR evolution. Cell 2005, 123(6):1133–1146. 10.1016/j.cell.2005.11.023
18. Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N: Widespread changes in protein synthesis induced by microRNAs. Nature 2008, 455(7209):58–63. 10.1038/nature07228
19. Chi SW, Zang JB, Mele A, Darnell RB: Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 2009, 460(7254):479–486.
20. Gaidatzis D, van Nimwegen E, Hausser J, Zavolan M: Inference of miRNA targets using evolutionary conservation and pathway analysis. BMC bioinformatics 2007, 8: 69. 10.1186/1471-2105-8-69
21. Wang X, El Naqa IM: Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics (Oxford, England) 2008, 24(3):325–332. 10.1093/bioinformatics/btm595
22. Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, et al.: Combinatorial microRNA target predictions. Nature genetics 2005, 37(5):495–500. 10.1038/ng1536
23. Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, Lim B, Rigoutsos I: A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 2006, 126(6):1203–1217. 10.1016/j.cell.2006.07.031
24. Griffiths-Jones S: The microRNA Registry. Nucleic acids research 2004, (32 Database):D109–111. 10.1093/nar/gkh023
25. Chen K, Rajewsky N: Natural selection on human microRNA binding sites inferred from SNP data. Nature genetics 2006, 38(12):1452–1456. 10.1038/ng1910
26. Nielsen CB, Shomron N, Sandberg R, Hornstein E, Kitzman J, Burge CB: Determinants of targeting by endogenous and exogenous microRNAs and siRNAs. RNA (New York, NY) 2007, 13(11):1894–1910.
27. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al.: The UCSC Genome Browser Database: 2008 update. Nucleic acids research 2008, (36 Database):D773–779.
28. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, et al.: Galaxy: a platform for interactive large-scale genome analysis. Genome research 2005, 15(10):1451–1455. 10.1101/gr.4086505
29. Betel D, Wilson M, Gabow A, Marks DS, Sander C: The microRNA.org resource: targets and expression. Nucleic acids research 2008, (36 Database):D149–153.
30. Maragkakis M, Reczko M, Simossis VA, Alexiou P, Papadopoulos GL, Dalamagas T, Giannopoulos G, Goumas G, Koukis E, Kourtis K, et al.: DIANA-microT web server: elucidating microRNA functions through target prediction. Nucleic acids research 2009, (37 Web Server):W273–276. 10.1093/nar/gkp292
31. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly 1994, 125(2):167–188. 10.1007/BF00818163
32. Ambros V: The functions of animal microRNAs. Nature 2004, 431(7006):350–355. 10.1038/nature02871
33. Brennecke J, Stark A, Russell RB, Cohen SM: Principles of microRNA-target recognition. PLoS biology 2005, 3(3):e85. 10.1371/journal.pbio.0030085
34. Doench JG, Sharp PA: Specificity of microRNA target selection in translational repression. Genes & development 2004, 18(5):504–511.
35. Lai EC: Micro RNAs are complementary to 3' UTR sequence motifs that mediate negative post-transcriptional regulation. Nature genetics 2002, 30(4):363–364. 10.1038/ng865
36. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB: Prediction of mammalian microRNA targets. Cell 2003, 115(7):787–798. 10.1016/S0092-8674(03)01018-3
37. Rajewsky N, Socci ND: Computational identification of microRNA targets. Developmental biology 2004, 267(2):529–535. 10.1016/j.ydbio.2003.12.003
38. Robins H, Press WH: Human microRNAs target a functionally distinct population of genes with AT-rich 3' UTRs. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15557–15562. 10.1073/pnas.0507443102
39. Zhao Y, Samal E, Srivastava D: Serum response factor regulates a muscle-specific microRNA that targets Hand2 during cardiogenesis. Nature 2005, 436(7048):214–220. 10.1038/nature03817
40. Muckstein U, Tafer H, Hackermuller J, Bernhart SH, Stadler PF, Hofacker IL: Thermodynamics of RNA-RNA binding. Bioinformatics (Oxford, England) 2006, 22(10):1177–1182. 10.1093/bioinformatics/btl024
41. Busch A, Richter AS, Backofen R: IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics (Oxford, England) 2008, 24(24):2849–2856. 10.1093/bioinformatics/btn544
42. Webb GI: MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning 2000, 40(2):159–196. 10.1023/A:1007659514849
43. Xu Z, Dai M, Meng D: Fast and efficient strategies for model selection of Gaussian support vector machine. IEEE Trans Syst Man Cybern B Cybern 2009, 39(5):1292–1307. 10.1109/TSMCB.2009.2015672
44. John GH, Langley P: Estimating Continuous Distributions in Bayesian Classifiers. San Mateo: Morgan Kaufmann; 1995.
45. Quinlan JR: C4.5 Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann; 1993.
46. Freund Y, Schapire RE: Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning. Bari, Italy: Morgan Kaufmann; 1996:148–156.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.