An accurate and interpretable model for siRNA efficacy prediction
BMC Bioinformatics volume 7, Article number: 520 (2006)
The use of exogenous small interfering RNAs (siRNAs) for gene silencing has quickly become a widespread molecular tool providing a powerful means for gene functional study and new drug target identification. Although considerable progress has been made recently in understanding how the RNAi pathway mediates gene silencing, the design of potent siRNAs remains challenging.
We propose a simple linear model combining basic features of siRNA sequences for siRNA efficacy prediction. Trained and tested on a large dataset of siRNA sequences made recently available, it performs as well as more complex state-of-the-art models in terms of potency prediction accuracy, with the advantage of being directly interpretable. The analysis of this linear model allows us to detect and quantify the effect of nucleotide preferences at particular positions, including previously known and new observations. We also detect and quantify a strong propensity of potent siRNAs to contain short asymmetric motifs in their sequence, and show that, surprisingly, these motifs alone contain at least as much relevant information for potency prediction as the nucleotide preferences for particular positions.
The model proposed for prediction of siRNA potency is as accurate as a state-of-the-art nonlinear model and is easily interpretable in terms of biological features. It is freely available on the web at http://cbio.ensmp.fr/dsir
RNA interference (RNAi) is the process through which a double-stranded RNA (dsRNA) induces gene expression silencing, either by degradation of sequence-specific complementary messenger RNA (mRNA) or by repression of translation . The RNAi pathway was firstly identified in lower organisms (plants, fungi and invertebrates) and led to many successful applications such as genome-wide RNAi screens [2–5]. In mammalian systems, chemically synthesized dsRNA reagents shorter than 30 nt were found to trigger sequence-specific RNAi response without inducing the cell's immune mechanism [6, 7]. The use of exogenous small interfering RNAs (siRNAs) for abolishing gene expression has quickly become a widespread molecular tool providing a powerful means for gene functional study and new drug target identification [8, 9]. Moreover, RNAi represents a promising technology for therapeutic applications against HIV , neurodegenerative disorders  and cancer . Its popularity stems in particular from its simplicity and low cost compared to other methods, e.g., involving knockout mice.
Considerable progress has been made recently in understanding how the RNAi pathway mediates gene silencing. Two main types of sequence-specific cleavage triggers have been identified: siRNAs and micro RNAs (miRNAs), chemically synthesized respectively from long dsRNA and miRNA precursor (pre-miRNA) by Dicer, a multidomain enzyme of the RNase III family. Once synthesized the siRNA/miRNA is incorporated into a ribonucleoprotein complex (RNP) called RISC loading complex (RLC). Duplex is unwounded and one strand is selectively incorporated into the RNA-induced silencing complex (RISC). Then this complex triggers either mRNA degradation or translational repression of the mRNA depending on the degree of complementarity between the RISC-associated RNA strand (the guide strand) and the target. Although miRNAs differ from siRNAs in their biogenesis, their functions are highly similar if not identical .
siRNA design is one of the most crucial steps in reliable use of RNAi, since it must ensure the efficacy and the specificity of the selected sequence for a target gene [9, 14]. Tuschl et al.  provided a set of guidelines (commonly known as the MPI principles) on how to design effective siRNA. These empirical rules, based for example on GC content and symmetric 3' TT overhangs, are however not discriminative enough since significant proportion of ineffective siRNAs following these rules were reported . Recent advances in the understanding of the biochemical mechanism of RNAi and statistical analyses of experimentally verified siRNAs have highlighted new biochemical and biophysical features of the siRNA reagents. It has been shown that thermodynamic profiles of the siRNA duplex determine which strand enters RISC as the guide strand and that the antisense strand can only direct cleavage of the sense mRNA targets [17, 18]. These functionally asymmetric siRNA duplexes exhibit lower base-pairing stabilities at the 5' end of the antisense strand and at the cleavage site. It was also suggested by either experimental or computational means that mRNA target accessibility could contribute to silencing activity [19–21], although the extent of this contribution remains controversial [22, 23]. Moreover, off-targets effects of siRNAs to unrelated mRNA targets were observed in several studies, which partially explain the loss of potency in silencing effect [24, 25]. Several studies combining cellular assays and statistical analysis reported features that consistently correlate with functionality across data sets of experimentally validated siRNAs. For instance, Ui-Tei et al.  proposed several criteria that correlate with highly effective silencing after an analysis of 62 siRNAs targeting six genes, including for example the presence of A/U at the 5' end and of G/C at the 3' end of the antisense strand. Amarzguioui and Prydz , analyzing a dataset of 80 siRNAs duplexes targeting four genes, corroborated these findings and expanded them by identifying sequence motifs on the siRNA sense strand that correlated positively (S1, A6, W19) and negatively (U1, G19) with potency at 70% knockdown level. Finally, on the basis of a systematic analysis of 180 siRNAs that target two mRNA regions, Reynolds et al.  proposed eight parameters that taken together increase the probability of selecting an effective siRNA and further incorporated them into a widely-used rational design algorithm.
More recently several programs and websites have been developed for automatic siRNA design, implementing design rules based on sequence features, hairpin potential formation, stability profiles, energy features, weighted patterns and mRNA secondary structure [23, 29]. A comparison study of these methods on a larger collection of siRNAs has however shown that many of them give close to random classification on unseen data . According to the current best rational design approach, less than 35% of siRNAs experimentally tested produce more than 90% of gene silencing, and almost 20% of siRNAs result in less than 50% efficacy . Therefore, despite considerable efforts, the design of high efficacy siRNA is far from conclusive. The variability in performance of potency prediction algorithms could be explained by the limited amount of siRNA sequences used during their design, together with the fact that these limited datasets are often biased since many published siRNAs have been designed following the MPI or Reynolds' rules .
In this context the publication in Huesken et al.  of an unbiased set of 2431 randomly selected siRNAs targeting 34 mRNA species assayed through a high-throughput fluorescent reporter gene system, together with the potency prediction system BIOPREDsi based on an artificial neural network trained on this dataset, represents a landmark. With a reported Pearson correlation coefficient of 0.66 between the measured and predicted efficacy on a set of 249 siRNAs not used during training, BIOPREDsi was shown to be more accurate than previous simpler models. The biological interpretation of the function encoded by BIOPREDsi is however challenging due to the use of nonlinear terms, and was not discussed in Huesken et al. .
In this study we investigate the possibility to design both an accurate and easily interpretable model to predict siRNA potency. To ensure direct interpretability of the model we restrict ourselves to linear models over two simple sets of features of siRNA sequences: the nucleotides present at each position in the siRNA sequence and the global content of the siRNA in short motifs. In order to allow a potentially large number of features and focus only on the most informative of these features, we train the linear model on the dataset of Huesken et al.  with the LASSO regression method  which leads to sparse linear models, automatically discarding irrelevant features if any. We show that both representations contain relevant and partially complementary information to predict the siRNA potency, and that the combination of both representations leads to a simple linear model as accurate as the BIOPREDsi neural network, with a Pearson correlation coefficient of 0.67 on the dataset used to evaluate the performance of BIOPREDsi. The analysis of the weights of the linear model allows a comprehensive description of the influence of various features in the siRNA sequence related to its efficacy. We show that only few features are in fact irrelevant, detect and quantify known and new preferences of nucleotides at several positions, and observe the importance of the presence of asymmetric short motifs in the siRNA sequence with A/U at the 5' end and C/G at the 3' end of the guide strand. We notice particularly that the inclusion of the 2 overhang nucleotides at the 3' end of the guide strand in the model improves prediction accuracy. We also show that models based on the thermodynamic features are significantly less accurate than those based on the sequence features only, highlighting the limitations of these features.
Representation of siRNA sequences
In order to develop an interpretable linear model for siRNA efficacy prediction we must chose a set of features to represent siRNA sequences. By convention we will always refer to the nucleotides in the antisense guide strand below, with positions in the sequence ordered in the 5'→3' direction on the guide strand. The siRNA sequences can therefore formally be represented as strings of length 21 in this study (including two 3' overhang nucleotides at positions 20 and 21). Previous studies have highlighted the importance of at least two families of string features that largely determine siRNA potency. First, several authors have observed that preferred nucleotides at specific positions are strong indicators of siRNA efficacy: for example, A and U are overrepresented in the first position of the guide strand of potent siRNAs, while C is overrepresented in positions 7 and 11 of siRNAs with poor efficacy [18, 28, 33]. We therefore define the sparse-21 representation of a siRNA sequence as the binary vector of dimension 84 that indicates the presence (1) or absence (0) of each nucleotide at each of the 21 positions. The influence of the two 3' overhang nucleotides being controversial, we also consider the sparse representation restricted to the first 19 nucleotides, called sparse-19 below. It is a binary vector of dimension 76 that encodes the presence or absence of each nucleotide at each of the first 19 positions. Second, Teramoto et al.  has shown that the composition of siRNAs in short motifs of length 1 to 3, without positional information, also contains relevant information for efficacy prediction. We therefore define the spectral representation of a siRNA sequence as the vector of count of occurrences of each nucleotide motif of length 1 to 3. There are respectively 4, 16 and 64 possible nucleotide motifs of length 1, 2 and 3, so the spectral representation of a siRNA guide sequence is an integer-valued vector of dimension 4 + 16 + 64 = 84. As for the sparse representation, we call spectrum-19 and spectrum-21 the spectral representation for the 19- and 21-length siRNA sequence, respectively. Finally, we define the composite representation of a siRNA sequence as the concatenation of the sparse and spectral representations. The composite representation is a vector of dimension 160 or 168, depending on whether it is based on the sparse-19 or sparse-21 representation. It provides the simplest way to integrate the potentially complementary information provided by the presence of specific nucleotides at specific positions, on the one hand, and the global content of particular short motifs, on the other hand.
Accuracy of siRNA potency prediction
We estimated a linear model based on the various sparse, spectral and composite representations from the siRNA datasets provided by Huesken et al. . This dataset comprises a total of 2431 randomly selected siRNAs targeting 34 mRNAs, split into a training set of 2182 sequences used to train a predictive model, and a test set of 249 sequences to evaluate the performance of the predictive model. Given the large dimensions of the vectors compared to the limited number of sequences available for training, we estimated the linear model with the LASSO procedure which is known to be resistant to large dimensions . If only a few variables are relevant the LASSO has also the good property of shrinking the weights of irrelevant features to zero, therefore providing simpler interpretation of relevant features in the final model. Table 1 shows the performance of the predictive linear models trained on the various sparse, spectral and composite representations. The performance is assessed in terms of Pearson correlation coefficient between the predicted and the true efficacy for the 249 siRNAs in the test set, in order to provide a fair comparison with Huesken et al.  where the neural-network-based BIOPREDsi predictive engine (based on the sparse-21 representation) has a reported performance of 0.66.
As observed in Huesken et al. , the inclusion of the two 3' overhang nucleotides of the guide strand in the sparse representation leads to more accurate models (0.65 vs. 0.62 for the sparse-21 vs. sparse-19 representations), suggesting that complementariness over the full oligoribonucleotide guide length provides an improved gene knock-down. Although less intuitive, the spectral-19 representation outperforms the sparse-19 representation (0.64 vs. 0.62), confirming the good results claimed by Teramoto et al.  on smaller datasets. This shows that, although much previous work has focused on position-dependent features of siRNA (in particular nucleotide preferences and local thermodynamic features), position-independent short motif contents can provide at least as much predictive power. Surprisingly, the spectral representation of the 21-length siRNA (including antisense 3' overhangs) is clearly worse than the spectral representation of the 19-length siRNA (without antisense 3' overhang). This suggests that mixing the motif content of the first 19 nucleotides with that of the antisense 3' overhang tends to blur the information contained in the spectral representation for efficacy prediction, a point to be clarified below when we examine the weights associated to the different features. Interestingly, the composite representations integrating the 19 or 21 sparse and spectral representations always perform at least as well as (when antisense 3' overhangs are discarded) or better than (when antisense 3' overhangs are included) the best among the spectral and sparse representations alone. This confirms that the sparse and spectral representations contain slightly complementary relevant information, and that the LASSO procedure is able to extract this information from the composite representation in spite of the large dimension of the vectors. The discrepancy between the actual and predicted efficacy for the 249 siRNAs in the test set for the linear model trained on the composite-21/19 representation (i.e., concatenation of the sparse-21 with the spectral-19 representations) is further illustrated in Figure 1.
In terms of absolute precision, we observe that the performance of the simple linear models is competitive with the nonlinear BIOPREDsi prediction method. Based on the similar sparse-21 representation, the linear model has a correlation coefficient of 0.65, to be compared with the 0.66 correlation reported by Huesken et al. . This suggests that the gain resulting from the possible nonlinearity of the neural network is at most marginal in this application, and that the relationship between the siRNA features we and other have considered and the experimental efficacy of the siRNA can accurately be described by an affine model. Furthermore we observe that the composite representations that add the spectral (19 or 21) to the sparse-21 representation result in slightly better performance than the BIOPREDsi model (0.67 vs. 0.66), therefore confirming that this simple approach is state-of-the-art in terms of performance on this dataset. In order to further assess the performance of the linear model based on the composite-21/19 representation, we repeated three times a full 10-fold cross-validation experiment on the 2431 siRNA. The average correlation coefficients over the 30 train/test splits was 0.657, with a standard deviation of 0.039, confirming that the particular train/test split defined by Huesken et al.  provides a fair estimate of the performance of the methods. We also trained and tested the LASSO approach on the various training and test sets described by Huesken et al. . Table 2 shows the results of this experiments, together with the performance of BIOPREDsi in the same context, and confirms the good performance of the linear model.
On most tasks, in particular when the test set is made of human sequences ("All human") or of the human E2 genes ("hE2"), the linear model outperforms the neural network model by a few percents. The picture is less clear on test set made of rodent sequences ("Rodent"), where the linear and nonlinear models are closer to each other. Interestingly, with very few training examples (218 sequences), the performance of the LASSO linear model with the composite-21/19 model clearly outperforms the nonlinear neural network (by 0.09 to 0.14 on the four test sets) in spite of the large dimension of the vectors (168), confirming the relevance of the regularized linear LASSO procedure in this context.
Interpretability of the prediction
The model built with the LASSO procedure being linear, it is possible to understand how the predicted accuracy is computed and to assess the importance of the different features by examining the weights of the model associated to each feature. For example, for the sparse representation a weight is attributed to each nucleotide at each position, and the efficacy predicted for a siRNA is simply the sum of the weights of the nucleotides of the siRNA. Table 3 presents the weights attributed to the features of the sparse-21 representation when the LASSO is trained on the training set of 2182 siRNA sequences, and Figure 2 displays them graphically. We recognize in this picture several well-known features mentioned in previous studies, as well as new ones. The two main differences between previously reported nucleotide features and the features highlighted in Table 3 and Figure 2 are that (i) the features obtained by the LASSO result from a global analysis of the complete dataset, as opposed to statistical analysis of subsets of siRNAs with extreme potencies (e.g., the top and bottom 8% siRNAs in terms of accuracy in the case of Huesken et al. ), and (ii) we provide a precise quantitative assessment of the importance of each feature, the weight of a feature being its contribution in the final predicted silencing efficacy. For example, the respective weights of A, U, C and G in the first guide strand position are 10.94, 18.39, 0 and -2.30, showing that the most preferred nucleotide in the first position is U, and that replacing a G by a U has an expected effect of increasing the efficacy by 21% (when the efficacy is measured in percentage of inhibition of eYFP fluorescence compared to a control experiment, as described in Huesken et al. ). The weights reflect the required asymmetry already detected and discussed [17, 18, 33], with a strong propensity for active siRNAs to contain a A or a U at the 5'-guide terminus (positions 1 and 2), a C or a G at position 19, and a C at position 18. Using a different methodology to compare the 8% most active with the 8% least potent siRNAs, Huesken et al.  detected the overrepresentation of U at position 2, but not that of A at the same position, concluding that this feature does not reflect the need for a weak bond close to the 5'-end. Our findings based on the analysis of all data simultaneously however detect the importance of U as well as A, suggesting that this feature can simply be related to the need for weak affinity at this position. More generally, a global propensity to contain more A and U toward the 5'-end and more C and G towards the 3'-end of the guide strand emerges from this picture. On top of this general trend, several particular positions have large weights for particular nucleotides. As already observed by Huesken et al. , the requirement for an A at position 10 corresponds to a previously characterized U-cleavage site .
Conversely, several nucleotides tend to have a strong negative effect on the potency of siRNA, including C at positions 7 and 21, and G at position 14. The pattern observed at positions 20 and 21 suggests a strong negative effect of C at these positions, and explains why more accurate models are obtained from the sparse representation of 21 nucleotides than from that of 19 nucleotides. Interestingly there is a big difference between the weights of C and G in the antisense 3' overhang positions, confirming that this feature is not merely related to the need for weak or strong hybridization in this position (as in the 5' end) but rather to individual properties of nucleotides. Finally, more subtle motifs appear in this picture and might suggest biological interpretations, such as the positive effect of A in position 3, 4 and 16, and of U in positions 5 to 9, 11, 14 and 17. In order to check the robustness of these motifs with respect to the data used to train the algorithm, we randomly split the training set into two non-overlapping subsets, and trained a model with the LASSO on both subsets independently. The resulting weights are graphically shown in Figure 3. The comparison of the two models learned to each other highlights the conservation of most motifs discussed above, suggesting that they are not just an artifact of the training set but might be related to some biological function.
In fact, as observed in Table 3, very few positions seem to be without influence on the efficacy of the siRNA. A question worth investigating is whether all the features appearing in Table 3 and Figure 2 really help predict efficacy, or whether some of them may be discarded. The fact that the LASSO model tries to find parsimonious models based on as few features as possible to predict accuracy suggests that all detected motifs indeed play a role. As an illustration, Figure 4 shows the accuracy obtained by the LASSO in a cross-validation experiment as a function of the L1 norm of the weight vector, which is closely related to the number of non-zero weights. The fact that the performance of the LASSO increases almost until the weights reach the ordinary least square solution shows that increasing the number of features used in prediction indeed improves the accuracy of the prediction. Of course, only the most evident predictive features are detected and used when few siRNA sequences are available for training, but the conclusion so far is that if enough siRNA sequences are available for training, then most if not all features are likely to contribute either positively or negatively to the efficacy of siRNA. This suggests in particular that as more siRNA data become available, the most precise efficacy prediction models for siRNA design will have to rely on most if not all features simultaneously. We note that this conclusion is only valid in the framework of the method we use for feature selection (constraint of the L1 norm of the weight vector), and that other feature selection procedures might lead to different conclusion; only biological validation could confirm the hypothesis that most positions influence the efficacy of siRNA.
A similar analysis can be performed on the features of the spectral representation. In brief, Table 4 summarizes the weights associated to the spectral features when the LASSO is trained on the spectral features of the first 19 nucleotides. This table shows that, more than a general enrichment in particular nucleotides, the LASSO on the spectral representation detects the asymmetry of the guide sequences. Indeed, the main positive weights involve motifs with A or U at their 5' end, C or G at their 3' end (e.g., AAC, UC, UG, AAG, AGC), while the main negative weights also involve asymmetric motifs but with opposite direction (e.g., CUU, CUA, GUU, GU, GAU). For example, the motif UG has the third-largest positive weight, while GU has the fifth most negative weight. A notable exception to this strong trend are the motifs CUC and CUG that have strong negative weights in spite of no obvious asymmetry. The comparison of these weights with the motifs on Figure 2 highlights the reason why, contrary to the sparse representation, the spectral representation should be restricted to positions 1 to 19 of the antisense sequence. Indeed, positions 20 and 21 contain relevant information used by the sparse representation.
However they break the general trend of having a decreasing frequency of A-U and increasing frequency of C-G towards the 3'-end, with for example a C at position 21 appearing as a strong handicap for siRNA potency. This in turns implies that the motifs detecting the asymmetry of the sequence are less likely to be good predictors. For example, the motif AAC is the strongest positive predictor of potency when restricted to the first 19 nucleotides, but the presence of AAC at positions 18–20 or 19–21 is a strong inhibitor of siRNA potency. The fact that the spectral-19 representation even outperforms the sparse-19 representation suggests that the presence of asymmetric motifs along the siRNA sequence is at least as important as the presence of particular nucleotides at given positions for efficacy prediction. Because these two types of features contain independent information (nucleotide preferences at particular positions and presence of short asymmetric motifs with A/Us followed by C/Gs in the 5' to 3' guide strand direction), it is no surprise that the composite representation, that integrates both types of features, performs at least as well as each type of features taken independently. For sake of completeness the weights of the model trained on the composite-19/21 representation are presented in Table 5.
Previous studies have insisted on the correlation between the stability asymmetry of the siRNA sequence and the efficacy of gene silencing. It has been observed that thermodynamic stability profile of the siRNA duplex determines which strand enters the RISC complex as guide strand, and that the most potent siRNA duplexes therefore present lower base-pairing stabilities at the 5' end of the guide strand, as well as at the cleavage site [17, 18]. This suggests that local stability profiles and base-stacking energy profiles might be relevant features to help predict the efficacy of siRNA. Table 1 shows the result of siRNA efficacy prediction when these features are used. In both cases the correlation obtained is in the range 0.54, confirming that these features are indeed simple and good predictors. However we note that they provide less precise models than when larger sets of features are used (e.g., the composite representation). We also observed that the combination of these profiles with the composite representations does not improve over the composite representation alone, suggesting that the asymmetry is already summarized in the sparse and spectral representations. Intuitively this latter point is not surprising given the nature of the spectral representation, which can be interpreted as the counts of hydrogen bounds and of the types of the nearest and the next-nearest stacking interactions in the siRNA oligomer, and of the sparse representation which gives information about the localization of the different nucleotides in the oligomer. Common models for the global Gibbs free energy of the oligomer are for example particular linear functions in the spectral representation. By allowing the LASSO to estimate any linear model, we give it the freedom to estimate a function that could be related to thermodynamic properties of the siRNA, although it could also be related to other properties. The performance we obtain suggest that a large part of the correlation captured by the LASSO model can in fact be related to thermodynamic properties, but that other features also contribute to siRNA efficacy. We note that recently Shabalina et al.  proposed a simple model for siRNA potency based only on a few compositional and thermodynamic features; its performance on the test set of Huesken et al.  reaches 0.52, slightly below the performance of our model based on thermodynamic features only (A. Ogurtsov, private communication).
The weights of the linear model trained on the profile of pair stacking energies are shown in Figure 5. The local energies are particularly relevant at two positions of the siRNA for efficacy prediction : the first pair of nucleotide at the 5' end of the guide strand must have high stack energy, that is, be thermodynamically unstable for the siRNA to be potent, with an estimated average contribution of 10% gain in efficacy for each increase of 1 kcal/mol, while the last base pair at the 3' extremity must have a low free energy, that is be stable, with an estimated average contribution of -5% per increase of 1 kcal/mol in free energy. Smaller contributions of other features suggest that potent siRNAs should have a tendency for weak double-strand stability around positions 3–4, 6–7 and 13–14 of the guide strand.
Performance on independent datasets
Several datasets published previous to and independently from the dataset of Huesken et al.  are available and provide a useful benchmark to assess the validity of our approach beyond the particularities of the dataset used to train the model. In this section we report results aiming at validating (i) the predictive accuracy of our model on independent datasets and (ii) the stability of the model interpretation with respect to the data used to train the model.
In order to assess the predictive accuracy of our model beyond Huesken's data we tested it on four independent datasets: the Reynolds dataset  of 240 siRNAs, the Vickers dataset  of 76 siRNAs, the Harborth dataset  of 44 siRNAs, and the compilation of 653 siRNA gathered by Shabalina et al. . The first three datasets were used in  to assess the accuracy of BIOPREDsi on independent datasets, and therefore provide a useful comparison with BIOPREDsi. The fourth dataset of Shabalina et al. is a compilation of siRNA with experimentally measured activities gathered from 13 previous publications , and represents to our knowledge the most complete public compilation of siRNA data to date, apart from Huesken's data. In all cases only the first 19 nucleotides of the siRNA are available, so we tested the performance of the sparse-19, spectrum-19 and composite-19 models trained on the Huesken training set of 2182 siRNA. The results are presented in Table 6, and compared to the results of the BIOPREDsi algorithm in a similar setting for the first three datasets. Overall these results show no clear winner between the linear and neural network-based model, confirming that the models produced by the LASSO are robust to new datasets and competitive compared to BIOPREDsi. In terms of absolute values we observe a clear drop in the performance between the test set of 249 siRNA of Huesken (where the sparse-19, spectrum-19 and composite-19 have respective correlations of 0.62, 0.64 and 0.64) and the independent test sets (where the correlations range between 0.45 and 0.58). This drop in performance illustrates and quantifies the differences between Huesken's dataset and previous publications, both in terms of siRNA selection and in terms of efficacy measurements. We note in particular that Huesken et al. measure the knock-down efficacy by looking at protein levels , while in some other datasets the mRNA degradation is directly evaluated. The relatively high levels of correlation obtained nevertheless suggest that, beyond the many differences between various siRNA datasets, a coherent biological effect is quantified by the model.
A second question of importance concerns the validity of the interpretation of the models trained on Huesken's dataset. In order to check whether the main features detected and analyzed in the previous sections are not merely a bias of this dataset, we trained the LASSO models based on the sparse-19, spectrum-19 and composite-19 representations on the Shabalina dataset of 653 siRNA. These models were tested on the Huesken test set of 249 siRNA, resulting in correlations of 0.60, 0.44 and 0.58, respectively. Compared to the correlations of 0.62, 0.64 and 0.64, respectively, obtained when the models are trained on the Huesken's training set of 2182 siRNA, we note a slight drop in performance for the sparse representation, and a more important one for the spectrum and composite ones. The important decrease in performance of the spectrum representation might be due in part to the limited number of sequences available for training (653 vs 2182), as we observed that the spectrum representation requires more examples than the spectral representation to be competitive. The good performance of the sparse representation (0.60 vs 0.62) suggests that the model trained on the Shabalina dataset has found biologically relevant features that are valid on Huesken's dataset. Figure 6 displays graphically the weights of the sparse-19 model trained on Shabalina's dataset. Compared to the weights of the spectrum-21 model trained on Huesken's training set (Figure 2), we observe numerous features shared between the two pictures, such as the importance of the features A/U in positions 1, 2 and 7 in potent siRNA. We also note a few differences, such as the presence of a U at position 13 which is a strong indicator of potency in Shabalina's data, but has a limited contribution to the model trained on Huesken's data. Many factors can explain the differences observed, such as the fact that many siRNA were designed following certain rules in several datasets compiled by Shabalina et al., and the differences in siRNA efficacy assessment and quantification. Overall this analysis of how the LASSO model performs and looks like on independent datasets confirms that differences exist between datasets, but that a fair level of generality is reached both in terms of predictive accuracy and in terms of model interpretations with the model trained on Huesken's data.
Despite considerable progress made recently siRNA design is still a subject of intense research. Ideally an efficient design should aim at defining a synthetic siRNA that mimics the natural RNA interference pathway considering different steps such as siRNA-RISC binding, duplex unwinding and strand selection, target specificity and accessibility, target cleavage and release. As each step likely involves multiple protein and nucleic acid interactions, key sequence and biophysical features are expected to be essential to ensure optimal functionality. In our study we focused only on intrinsic properties of siRNA, and it is likely that more precise prediction might be obtained by integrating information about the target specificity of the siRNA and the accessibility of the mRNA target, for example. As a first step towards assessing the importance of target accessibility, we considered the 20 siRNA sequences in our test set for which the discrepancies between predicted and actual efficacy were the largest (Table 7). In order to check whether there is any obvious structural basis for the errors of the model we computed the local secondary structure of the target mRNA with the Mfold program by submitting the corresponding fourteen target sequences provided by Huesken et al.  and subsequently focusing on the region of the mRNA containing the region targeted by the siRNA (Figure 7). Even though this dataset was produced using reporter constructs with fusion transcripts, a high correlation between potency profiles of siRNA against the reporter fusion mRNA and the corresponding endogenous gene was demonstrated suggesting that common sequence fragments in endogenous and fusion RNAs have similar secondary structures. For each predicted structure, the local free energy ΔG loc was calculated (see Methods) but conversely to Schubert et al. , no clear correlation between the actual silencing efficacy and the local energy of the target site structure was observed (Table 7). It is known that structure predictions have strong limitations when long RNAs are concerned, rendering only approximations of the actual folding at this site. Nevertheless, by a systematic inspection of the overall set of predicted local structures at the target site with a same range of ΔG loc , we observed that most target regions of siRNA with under-predicted efficacy have a 3' end in a stem loop (Figure 7A). This apparent target site accessibility could be related to the prominent role of nucleotides 2–8 on the 5' end of the siRNA guide strand that provide most of the binding energy that leads RISC to the target mRNA .
Conversely we observed a general trend for the mRNA targets of the siRNA with the most over-predicted efficacy to have a 3' extremity with a limited accessibility (Figure 7B). Although an effect due to the dataset used herein can not be completely excluded (see above), these preliminary observations suggest that a systematic and extended analysis of the local structures of target mRNA is required to decipher in more details the actual contribution of the local structural target accessibility to the silencing activity.
Regarding the importance of siRNA sequence features, the fact that the composite representation generally outperforms the sparse and spectral representation highlights a slight complementariness between them. Although the propensity of potent siRNAs to have particular nucleotides at particular positions has been recognized and biologically justified for some time, the importance of asymmetric short motifs along the sequence has been largely overlooked. The picture that emerges from these results is a superposition of different constraints to favor a good efficacy of siRNA, probably resulting from the necessity to follow different steps in the RNA interference pathway. The different constraints include in particular the preference for particular nucleotides at some positions, the presence of asymmetric motifs in the sequence, and a strong contrast between the free energies of the base pairs at the 5' and 3' ends of the double strand siRNA. Of particular interest is the important role played by the 3' overhang of the guide strand, first observed by Huesken et al. , who noticed an excess of G in position 21 in the 3' overhang of potent siRNAs. The analysis of the weights of the linear models based on the sparse-21 representation suggests that the main contribution of the overhang nucleotides is the negative effect of a C in position 21 of the guide strand, which has an average effect of decreasing the efficacy by 7% compared to a G in the same position, and by 6% compared to an A or a U. The pattern in position 20 is less important in quantitative terms, with a 3% improvement expected when a C is replaced by a U in this position. These findings suggest including 3' antisense nucleotide information in the siRNA design, and raises questions about the common inclusion of a TT 3' overhang in siRNA design. The role played by the 3' overhang nucleotides could be explained by recent findings about the human Argonaute 2 (Ago2) protein, a core constituent of the RISC complex characterized by two unique domains, PAZ and PIWI . Structural studies of the PAZ domain suggest that it binds the two nucleotides at the 3' overhang of the siRNA duplex and is essential to guide the interaction between the siRNA and the target mRNA for cleavage and translational repression .
In conclusion, we have developed an accurate and interpretable model for siRNA efficacy prediction that performs at least as well as the current state-of-art. We have shown how the weights estimated by the linear model provide a direct quantitative estimate of the effect of the features, and that most sequence features contribute either positively or negatively to the siRNA potency. Efforts towards publicly available siRNA datasets with confident annotation, albeit costly, is therefore highly desirable to improve current learning models. Finally, while considering steps involved in RNAi pathway, it clearly appears that other phenomena such as mRNA target accessibility and off-targets have to be considered into a single-one strategy for a fine tuned design and are currently under investigation.
The main source of siRNAs used in this study is the dataset provided by Huesken et al. , comprising a total of 2431 randomly selected siRNAs targeting 34 different mRNAs and split into a training set of 2182 sequences and a test set of 249 sequences. Each siRNA sequence was converted to a vector of features using PYTHON scripts available from the authors.
Given a representation of siRNAs as vectors the LASSO procedure estimates a linear model with offset to explain a variable of interest (the siRNA efficacy in our case) from a set of siRNAs with known efficacy by minimizing the mean square error over the set of linear model with L1 norm bounded by a control parameter . We used the LASSO implementation available in the Lars package  for the R statistical software. In each experiment the constraint on the L1 norm of the vector of weight was optimized by 5-fold cross-validation on the training set.
We considered two siRNA duplex thermodynamics indexes, a local stability and a base-stacking energy profiles , to check how features derived from prior biochemical knowledge perform compared to the sparse and spectral representation.
These profiles are based on free energy computation, a measure of the base pairing force. We used modified Freier tables [42–44] which provide an estimate of the base pair stack free energy ΔGXi,Xi+1for the nucleotide pair at positions i and i + 1 of the sense (X = S) or antisense (X = A) strand, for i = 1, ..., 18. First, the local free energy profile is defined as the 18-dimensional vector whose i-th feature is precisely the free entropy ΔGAi,Ai+1. Second, we consider the local stability profile, a 19-dimensional profile computed by taking sums of free energies over pentamers as described by the following equation for i = 1, ..., 19  (ΔG penalty represents the penalty for A or U at 5' end, and the free energy for the stack composed by the last position of X strand, and the dangling T):
RNA secondary structure prediction
To assess mRNA target accessibility, we computed local RNA secondary structures using the program mfold version 3.1 [43, 46] with the following parameters W = 0, MAXBP = 100, P = 100 and MAX = 10 . For each target sequence, the ten first predicted structures with the lowest free energies were recorded. For each predicted, local free energies ΔG loc were calculated for the structural motives in which nucleotides of the target site were involved. Each local target sequence was carefully inspected and those that occurred at high abundance (from 7 to 10/10) with associated ΔG loc values in the same range (+/- 0.2 kcal/mol) was selected and subsequently considered.
Availability and requirements
The algorithm is implemented in the platform-independent web server DSIR (Design of SIRna), freely available without restriction for academic and non-academic users at http://cbio.ensmp.fr/dsir.
Huppi K, Martin SE, Caplen NJ: Defining and assaying RNAi in mammalian cells. Mol Cell 2005, 17: 1–10. 10.1016/j.molcel.2004.12.017
Baulcombe DC: RNA as a target and an initiator of post-transcriptional gene silencing in transgenic plants. Plant Mol Biol 1996, 32(1–2):79–88. 10.1007/BF00039378
Cogoni C, Irelan JT, Schumacher M, Schmidhauser TJ, Selker EU, Macino G: Transgene silencing of the al-1 gene in vegetative cells of Neurospora is mediated by a cytoplasmic effector and does not depend on DNA-DNA interactions or DNA methylation. EMBO J 1996, 15(12):3153–3163.
Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC: Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998, 391(6669):806–811. 10.1038/35888
Jorgensen RA: Sense cosuppression in plants: Past, present, and future. In RNAi: A guide to gene silencing. Edited by: Hannon GJ. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 2003.
Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T: Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 2001, 411(6836):494–498. 10.1038/35078107
Caplen NJ, Parrish S, Imani F, Fire A, Morgan RA: Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci USA 2001, 98(17):9742–9747. 10.1073/pnas.171251798
McManus MT, Sharp PA: Gene silencing in mammals by small interfering RNAs. Nat Rev Genet 2002, 3(10):737–747. 10.1038/nrg908
Hannon GJ, Rossi JJ: Unlocking the potential of the human genome with RNA interference. Nature 2004, 431(7006):371–8. 10.1038/nature02870
Surabhi RM, Gaynor RB: RNA interference directed against viral and cellular targets inhibits human immunodeficiency Virus Type 1 replication. J Virol 2002, 76(24):12963–12973. 10.1128/JVI.76.24.12963-12973.2002
Xia H, Mao Q, Eliason SL, Harper SQ, Martins IH, Orr HT, Paulson HL, Yang L, Kotin RM, Davidson BL: RNAi suppresses polyglutamine-induced neurodegeneration in a model of spinocerebellar ataxia. Nat Med 2004, 10(8):816–820. 10.1038/nm1076
Pai SI, Lin YY, Macaes B, Meneshian A, Hung C-F, Wu TC: Prospects of RNA interference therapy for cancer. Gene Ther 2006, 13(6):464–477. 10.1038/sj.gt.3302694
Hutvágner G, Zamore P: A microRNA in a multiple-turnover RNAi enzyme complex. Science 2002, 297(5589):2056–2060. 10.1126/science.1073827
Meister G, Tuschl T: Mechanisms of gene silencing by double-stranded RNA. Nature 2004, 431(7006):343–9. 10.1038/nature02873
Tuschl T, Zamore P, Lehmann R, Bartel D, Sharp P: Targeted mRNA degradation by double-stranded RNA in vitro. Genes Dev 1999, 13(24):3191–7. 10.1101/gad.13.24.3191
Yiu SM, Wong PWH, Lam T, Mui Y, Kung HF, Lin M, Cheung YT: Filtering of Ineffective siRNAs and Improved siRNA Design Tool. Bioinformatics 2005, 21(2):144–151. 10.1093/bioinformatics/bth498
Khvorova A, Reynolds A, Jayasena S: Functional siRNAs and miRNAs exhibit strand bias. Cell 2003, 115(2):209–216. 10.1016/S0092-8674(03)00801-8
Schwarz DS, Hutvagner G, Du T, Xu Z, Aronin N, Zamore PD: Asymmetry in the assembly of the RNAi enzyme complex. Cell 2003, 115(2):199–208. 10.1016/S0092-8674(03)00759-1
Vickers TA, Koo S, Bennett CF, Crooke ST, Dean NM, Baker BF: Efficient reduction of target RNAs by small interfering RNA and RNase H-dependent antisense agents. A comparative analysis. J Biol Chem 2003, 278(9):7108–18. 10.1074/jbc.M210326200
Schubert S, Grünweller A, Erdmann VA, Kurreck J: Local RNA target structure influences siRNA efficacy: systematic analysis of intentionally designed binding regions. J Mol Biol 2005, 348(4):883–893. 10.1016/j.jmb.2005.03.011
Overhoff M, Alken M, Far RK, Lemaitre M, Lebleu B, Sczakiel G, Robbins I: Local RNA target structure influences siRNA efficacy: a systematic global analysis. J Mol Biol 2005, 348(4):871–881. 10.1016/j.jmb.2005.03.012
Holen T, Amarzguioui M, Wiiger MT, Babaie E, Prydz H: Positional effects of short interfering RNAs targeting the human coagulation trigger Tissue Factor. Nucleic Acids Res 2002, 30(8):1757–1766. 10.1093/nar/30.8.1757
Boese Q, Leake D, Reynolds A, Read S, Scaringe SA, Marshall WS, Khvorova A: Mechanistic insights aid computational short interfering RNA design. Methods Enzymol 2005, 392: 73–96.
Jackson AL, Bartz SR, Schelter J, Kobayashi SV, Burchard J, Mao M, Li B, Cavet G, Linsley PS: Expression profiling reveals off-target gene regulation by RNAi. Nat Biotechnol 2003, 21(6):635–7. 10.1038/nbt831
Semizarov D, Frost L, Sarthy A, Kroeger P, Halbert DN, Fesik SW: Specificity of short interfering RNA determined through gene expression signatures. Proc Natl Acad Sci USA 2003, 100(11):6347–52. 10.1073/pnas.1131959100
Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K: Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res 2004, 32(3):936–948. 10.1093/nar/gkh247
Amarzguioui M, Prydz H: An algorithm for selection of functional siRNA sequences. Biochem Biophys Res Commun 2004, 316(4):1050–8. 10.1016/j.bbrc.2004.02.157
Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A: Rational siRNA design for RNA interference. Nat Biotechnol 2004, 22(3):326–330. 10.1038/nbt936
Saetrom P: Predicting the efficacy of short oligonucleotides in antisense and RNAi experiments with boosted genetic programming. Bioinformatics 2004, 20(17):3055–3063. 10.1093/bioinformatics/bth364
Saetrom P, Snøve O: A comparison of siRNA efficacy predictors. Biochem Biophys Res Commun 2004, 321: 247–53. 10.1016/j.bbrc.2004.06.116
Ren Y, Gong W, Xu Q, Zheng X, Lin D, Wang Y, Li T: siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics 2006.
Truss M, Swat M, Kielbasa SM, Schäfer R, Herzel H, Hagemeier C: HuSiDa – the human siRNA database: an open-access database for published functional siRNA sequences and technical details of efficient transfer into recipient cells. Nucleic Acids Res 2005, (33 Database):D108–11.
Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J: Design of a genome-wide siRNA library using an artificial neural network. Nat Biotechnol 2005, 23(8):995–1001. 10.1038/nbt1118
Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Statist Soc B 1996, 58: 267–288.
Teramoto R, Aoki M, Kimura T, Kanaoka M: Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS Lett 2005, 579(13):2878–82. 10.1016/j.febslet.2005.04.045
Zamore P, Tuschl T, Sharp P, Bartel D: RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell 2000, 101: 25–33. 10.1016/S0092-8674(00)80620-0
Shabalina S, Spiridonov A, Ogurtsov A: Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinformatics 2006, 7: 65. 10.1186/1471-2105-7-65
Harborth J, Elbashir SM, Vandenburgh K, Manninga H, Scaringe SA, Weber K, Tuschl T: Sequence, chemical, and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense Nucleic Acid Drug Dev 2003, 13(2):83–105. 10.1089/108729003321629638
Haley B, Zamore PD: Kinetic analysis of the RNAi enzyme complex. Nat Struct Mol Biol 2004, 11(7):599–606. 10.1038/nsmb780
Ma J, Ye K, Patel D: Structural basis for overhang-specific small interfering RNA recognition by the PAZ domain. Nature 2004, 429(6989):318–322. 10.1038/nature02519
Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat 2004, 32(2):407–499. 10.1214/009053604000000067
Freier SM, Kierzek R, Jaeger JA, Sugimoto N, Caruthers MH, Neilson T, Turner DH: Improved free-energy parameters for predictions of RNA duplex stability. Proc Natl Acad Sci USA 1986, 83(24):9373–7. 10.1073/pnas.83.24.9373
Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999, 288(5):911–40. 10.1006/jmbi.1999.2700
Xia T, SantaLucia J, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 1998, 37(42):14719–35. 10.1021/bi9809425
Ding Y, Yu CC, Lawrence CE: Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 2004, (32 Web Server):W135-W141.
Zuker M: On finding all suboptimal foldings of an RNA molecule. Science 1989, 244(4900):48–52. 10.1126/science.2468181
Heale B, Soifer H, Bowers C, Rossi J: siRNA target site secondary structure predictions using local stable substructures. Nucleic Acids Res 2005, 33(3):e30. 10.1093/nar/gni026
This work was supported in part by the "Cartes d'Identité des Tumeurs" (CIT) program of the Ligue Nationale Contre le Cancer. We are grateful to Bernard Billoud for comments and critical reading of the manuscript.
All authors conceived the experiments. JPV and NF implemented the algorithms and performed the experiments. YV and NF studied the structures of siRNA. JPV and YV drafted the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Vert, JP., Foveau, N., Lajaunie, C. et al. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinformatics 7, 520 (2006). https://doi.org/10.1186/1471-2105-7-520