### Data and sources

The siRNA set was compiled from published experiments targeting 52 distinct mRNAs. The effectiveness of each siRNA sequence was judged by its gene silencing activity as shown in the Results section of the publication [see Additional file 4]. Most of the reported activity values are averages of duplicate measurements at a constant oligonucleotide dose, expressed as percent of untreated control. Reported activity values range from 0 (complete knock-out) to ≥ 100% (no effect). Based on reported activity, siRNAs were then divided into three groups: (i) siRNAs that induced more than 70% gene silencing (a total of 295), (ii) siRNAs that induced less than 30% gene silencing (130 in total), and (iii) the remaining sequences in the collection. Our database of siRNAs is presented in Supplementary Materials [see Additional file 4]. The database consists of two subsets of siRNAs: those produced by Dharmacon and Amgen Incorporated (298 siRNA duplexes) and those from other sources (355 siRNAs). The total number of analyzed siRNAs with experimentally determined silencing efficiency is 653. Every entry in the database consists of an oligonucleotide sequence, the target mRNA GenBank sequence accession number, the sequence and coordinates of the intended mRNA target region, the siRNA silencing efficiency for that mRNA, and the publication source. The resulting models were tested on the 2431 siRNA dataset recently published by Huesken et al. [24].

The siRNAs in the database were screened against the following criteria. (i) Gene expression was measured after siRNA application relative to untreated control. (ii) At least 5 siRNAs were assayed for a given target mRNA. (iii) The nucleotide sequence of the antisense strand has a perfectly complementary target site in the mRNA sequence. Analysis was performed on 19-nucleotide oligonucleotides and did not include the overhangs at the 3'-ends. The important technical details of the corresponding gene silencing experiments are given in the Materials and Methods sections of each publication. We also analyzed sets of miRNAs available at http://www.sanger.ac.uk/cgi-bin/Rfam/mirna/browse.pl for *Homo sapiens*, *Mus musculus* and *Rattus norvegicus*.

We calculated a number of thermodynamic features: ΔG values per nucleotide pair relevant to the stabilities of sense-antisense siRNA duplexes, siRNA antisense strand intra-molecular structure stability, siRNA antisense strand inter-molecular dimer stability, local target mRNA stabilities, and the stabilities of each pair of neighboring base pairs in the siRNA sense-antisense duplexes. These characteristics were calculated using a nearest neighbor model (see Software). The following sections describe the modeling methods we used.
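
The nearest-neighbor calculation can be illustrated with a short Python sketch. The free-energy table below holds illustrative placeholder values only, not the actual Mfold/Turner parameters used by our programs; it is meant to show the bookkeeping, not to reproduce the published ΔG values.

```python
# Sketch of a nearest-neighbor duplex stability calculation.
# NN_DELTA_G maps a dinucleotide step on the sense strand to the free-energy
# contribution (kcal/mol) of the corresponding pair of neighboring base pairs.
# Values here are ILLUSTRATIVE placeholders, not the real thermodynamic table.
NN_DELTA_G = {
    "AA": -0.9, "AU": -1.1, "UA": -1.3, "AG": -2.1, "GA": -2.4,
    "AC": -2.2, "CA": -2.1, "UG": -2.1, "GU": -2.2, "UC": -2.4,
    "CU": -2.1, "GG": -3.3, "GC": -3.4, "CG": -2.4, "CC": -3.3,
    "UU": -0.9,
}

def positional_delta_g(sense):
    """Per-position stabilities of the neighboring base-pair steps in a
    perfectly matched sense-antisense duplex (18 steps for a 19-nt siRNA)."""
    return [NN_DELTA_G[sense[i:i + 2]] for i in range(len(sense) - 1)]

def duplex_delta_g(sense):
    """Total duplex stability: the sum over all neighboring base-pair steps."""
    return sum(positional_delta_g(sense))
```

A 19-nt guide sequence thus yields 18 positional stability features plus one total duplex ΔG, the kind of quantities listed above.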

### Training and validation sets

We used only 350 siRNAs from the 653-siRNA training set when creating composite parameters such as the summarized position-dependent consensus. Parameter selection and model optimization were also restricted to these 350 siRNAs initially and then applied to the complete training set (653 siRNAs).

On the 2431 siRNA validation set, we made just two efficiency predictions (using models obtained from the training set): one with linear regression, and one with the best neural network model. We also used these validation set predictions to generate ROC curves for the corresponding classification models.

### Parameter selection

We used a number of parameters described in the literature [23] and supplemented them with new ones. In all, we had a list of about 150 parameters, which is too large for effective analysis. We used two criteria to reduce this number: significant correlation with activity, and stability of that correlation. Both criteria were evaluated on the training set only. An overall block scheme illustrating the siRNA feature-selection approach is presented in Supplementary Materials [see Additional file 2].

We required that parameters have a correlation of at least 0.1 with siRNA efficiency, and that this correlation be significant at the 0.05 level. This left 18 parameters, as detailed in Results.
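This filtering step can be sketched in Python. The function names are our own for illustration; the significance test here uses the large-sample t approximation for Pearson's r rather than whatever exact test a given statistics package applies, so it is a sketch of the criterion, not our actual selection script.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def select_features(features, efficiency, r_min=0.1, t_crit=1.96):
    """Keep features whose |r| with efficiency is at least r_min and whose
    t statistic exceeds the large-sample two-sided 0.05 critical value."""
    n = len(efficiency)
    kept = []
    for name, values in features.items():
        r = pearson_r(values, efficiency)
        t = abs(r) * np.sqrt((n - 2) / max(1.0 - r * r, 1e-12))
        if abs(r) >= r_min and t >= t_crit:
            kept.append(name)
    return kept
```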

Since our training set is very heterogeneous, combining many experiments and slightly different protocols, we had the opportunity to select those parameters which are most universal. To do this, we split our data set into *n* parts (*n* = 4 and 10), and for every part and parameter computed the correlation coefficient *R*. Taking 1000 such splits, we computed *S*_{n}, the standard deviation of *R* for every parameter. We used this stability *S*_{n} value as an indicator of how much a parameter's predictive power depends on the choice of the particular subset of the data.
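
The stability estimate can be sketched as follows. This is a Python illustration of the procedure described above (our actual scripts were written in Octave); the function name is hypothetical.

```python
import numpy as np

def split_stability(feature, efficiency, n_parts=4, n_splits=1000, seed=0):
    """Estimate S_n: the standard deviation, over many random n-part splits,
    of the per-part correlation between one feature and siRNA efficiency."""
    rng = np.random.default_rng(seed)
    feature = np.asarray(feature, float)
    efficiency = np.asarray(efficiency, float)
    rs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(feature))          # one random split
        for part in np.array_split(idx, n_parts):    # n roughly equal parts
            rs.append(np.corrcoef(feature[part], efficiency[part])[0, 1])
    return float(np.std(rs))
```

A parameter whose correlation with efficiency is genuinely universal yields a small *S*_{n}; a parameter that works only on some subsets yields a large one.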

### Linear regression model

In our analysis, we used regression, rather than classification models, since they provide more information, are more flexible, and are easier to evaluate. We performed multiple linear regression analysis on our sets of 18 parameters, with cross-validation as described below. We used scripts written in the Octave (MATLAB work-alike) language. The scripts are available by request.

### Neural network model

In all neural network experiments, we used the Stuttgart Neural Network Simulator (SNNS) V4.2, available at the SNNS website http://www-ra.informatik.uni-tuebingen.de/SNNS/. We chose fully-connected multi-layer feed-forward networks (created using the BIGNET->GENERAL tool in SNNS). Such a network starts with a layer of input neurons (one per feature), which is connected to a hidden layer, with a connection going from every input to every hidden neuron. The first hidden layer may be fully connected to another hidden layer, and so on. The last hidden layer is fully connected to the output neuron. Each neuron integrates information from the incoming connections, and outputs it to the next layer. The integration is a two-step process. First, the inputs are summed with neuron-specific coefficients (which are chosen adaptively during learning). Then, the sum is transformed with an activation function (which has its own adaptable coefficient), and the result is output. We used logistic activation functions (default SNNS setting) on all neurons.
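
A forward pass through such a network can be written compactly. This Python fragment is an illustration of the architecture described above, not part of the SNNS toolchain we actually used; the weights here are random placeholders.

```python
import numpy as np

def logistic(x):
    """Logistic activation, the default SNNS setting used on all neurons."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """One pass through a fully connected feed-forward network.
    `layers` is a list of (weights, biases) pairs; each neuron sums its
    weighted inputs and then applies the logistic activation."""
    a = np.asarray(x, float)
    for w, b in layers:
        a = logistic(w @ a + b)
    return a

# An 18 x 2 x 1 geometry: 18 inputs, one 2-neuron hidden layer, 1 output.
rng = np.random.default_rng(0)
net = [(rng.normal(size=(2, 18)), np.zeros(2)),
       (rng.normal(size=(1, 2)), np.zeros(1))]
out = forward(np.zeros(18), net)
```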

We achieved the best results with resilient propagation (RProp). The algorithm uses the following parameters: (i) the initial learning step size, to which generalization performance is insensitive, and which we set to 0.1, (ii) the maximum learning step size, also unimportant for generalization, and set to 10, and (iii) the weight decay parameter ALPHA, which was determined by a simple search (See the subsection "Training parameter selection"). We were unable to get better performance with other algorithms or a linear, rather than logistic, output function (results not shown).

The training process consists of epochs. In each epoch, SNNS trains the network on all points of the training data set (as given by the cross-validation procedure) in random order (the SHUFFLE option in SNNS). In every epoch, we also modified each coefficient in the network by ± 0.1% (the SNNS JOG weights option with setting 0.001), since we found that this helps reduce the variation in performance by pushing the network out of local minima.

An important decision during training is when to stop training the network. An undertrained network is a poor fit to both the training and the validation data set. Excessive training is time-consuming, and the network might overfit the training data set and get worse on the validation set. Although a correct setting of the ALPHA parameter in RProp reduces overfitting, it is still important to choose a good stopping time. A traditional approach called "early stopping" is to split the training set into a training subset and a small "test" subset, and stop when the error on the test subset reaches a minimum. We did not do this, for two reasons. First, it reduces the training set and increases the variation in the results (from the choice of the split into training/test sets). Second, when ALPHA is set properly, the validation error tends to be nearly monotonically decreasing, making early stopping unnecessary. Instead, we searched for an optimum fixed number of epochs after which to stop (see "Training parameter selection").

We tested the cross-validation performance of the following network geometries (input neurons = 18 × hidden layer × hidden layer × ... × output neurons = 1): 18 × 1 (logistic regression), 18 × 2 × 1, 18 × 3 × 1, 18 × 4 × 1, 18 × 6 × 1, 18 × 8 × 1, 18 × 2 × 2 × 1, 18 × 3 × 2 × 1, 18 × 3 × 3 × 1, 18 × 4 × 3 × 1, 18 × 6 × 2 × 1, 18 × 2 × 2 × 2 × 1. A comparison of the number of trainable coefficients to the number of training points suggests that some overfitting might be possible for an 18 × 8 × 1 network. However, given the behavior of RProp, this is unlikely: the algorithm slows down the learning process gradually in a way that helps to avoid overfitting. For network selection, we used the reasonable parameters ALPHA = 1.8 and 250 epochs of training, which were determined by hand experimentation in SNNS on a 200-siRNA sample of the data. We found that all 18-parameter 1- and 2-hidden-layer networks performed very similarly, and so only tested the 4 × 2 × 2 × 1 network in the 4-parameter case. Scripts for many of the above tasks are available by request.

### Data scaling

In order to keep features with large numerical scales (e.g. varying from -30 to -5) from overshadowing features with small numerical scales (e.g. 0 to 0.2), we linearly rescaled all features to the range [0, 1]. Because our output neuron uses a logistic activation function, it is biased against producing outputs close to 0 or 1. To help with this problem, the efficiency values were linearly rescaled to [0.25, 0.75].
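
The two rescalings can be expressed with a single helper. This Python sketch illustrates the scaling described above; the function name is our own.

```python
import numpy as np

def rescale(values, lo=0.0, hi=1.0):
    """Linearly map values onto [lo, hi] using their observed min and max."""
    v = np.asarray(values, float)
    span = v.max() - v.min()
    return lo + (hi - lo) * (v - v.min()) / span

# Features are mapped to [0, 1]; efficiencies to [0.25, 0.75] to avoid
# the saturated tails of the logistic output neuron.
features_scaled = rescale([-30.0, -20.0, -5.0])
targets_scaled = rescale([0.0, 50.0, 100.0], 0.25, 0.75)
```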

### Classification and ROC computation

Neural networks originated as a classification technique. A classification decision in such a network is made by putting a threshold function in the output neuron. The network would then be trained to optimize some measure of performance (% error, or perhaps another measure, favoring higher sensitivity or higher specificity). However, for different purposes, one may wish to define an active siRNA as one with 5%, 10%, 20%, or even 30% residual activity. For every such choice, one would have to re-train the network.

A threshold on regression output can produce a classification decision too. A regression model enables us to avoid retraining – we only need to choose a threshold for the predicted value, a much quicker computation than network training. We computed ROC (receiver operating characteristic) curves for classifiers of this type, as follows.

First, we separated the validation set predictions into "active" and "inactive" based on the actual silencing efficiency. For every predicted activity threshold in this list: 3%, 6%, 9%, ..., 96%, 99%, we repeated the following steps. We marked each prediction "predicted active" or "predicted inactive", based on the current threshold. The total number of "active, predicted active" gave us the true positive (TP) count. Similarly, we obtained FP, TN, FN counts, and from them calculated Sensitivity = TP/(TP+FN) and Specificity = TN/(TN+FP). The result is 33 pairs of the form (1 – Specificity, Sensitivity), which give the ROC curve.
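
The procedure can be sketched in Python. This is an illustration of the threshold sweep described above, assuming predictions are expressed as residual activity (lower means a more active siRNA); the function name is hypothetical.

```python
def roc_points(actual_active, predicted):
    """Build ROC points by thresholding regression predictions.
    `actual_active` holds booleans (True = active siRNA); `predicted`
    holds predicted residual activities, lower = more active."""
    thresholds = [t / 100.0 for t in range(3, 100, 3)]  # 3%, 6%, ..., 99%
    points = []
    for thr in thresholds:
        tp = fp = tn = fn = 0
        for active, pred in zip(actual_active, predicted):
            pred_active = pred <= thr
            if active and pred_active:
                tp += 1
            elif not active and pred_active:
                fp += 1
            elif not active:
                tn += 1
            else:
                fn += 1
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        points.append((1.0 - spec, sens))
    return points
```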

### Cross-validation and performance

When choosing or optimizing models on the training set, we used *n*-fold cross-validation, a standard method for evaluating model generalization. Cross-validation randomly splits the data set into *n* equally-sized subsets. Then, one trains, in turn, on each set of *n* - 1 subsets, validating on the remaining one. The validation predictions from the *n* models combine to make a prediction for every data point. Using these predictions, we compute the coefficient of determination *R*^{2} = (*Actual Variation* – *Error*)/(*Actual Variation*), where the actual variation is ∑_{i}(*Actual efficiency*_{i} – *Average efficiency*)^{2}, and the error is ∑_{i}(*Actual efficiency*_{i} – *Predicted efficiency*_{i})^{2}. Thus, *R*^{2} reflects the percentage of variation in efficiency explained by our model. If the predictions came from a non-cross-validated linear regression, this *R*^{2} would exactly match the square of Pearson's correlation coefficient.
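
The definition translates directly into code. This Python fragment restates the formula above; the function name is our own.

```python
def coefficient_of_determination(actual, predicted):
    """R^2 = (actual variation - error) / actual variation, where both
    quantities are sums of squared deviations as defined in the text."""
    mean = sum(actual) / len(actual)
    variation = sum((a - mean) ** 2 for a in actual)
    error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return (variation - error) / variation
```

Perfect predictions give *R*^{2} = 1; predicting the mean for every point gives *R*^{2} = 0, and a model worse than the mean gives a negative value.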

Standard cross-validation can overestimate the generalization performance because substantial numbers of the siRNAs in our data set overlap. If a training siRNA overlaps with a validation one, a good prediction for the validation siRNA does not demonstrate generalization.

To address this issue, we designed the following cross-validation scheme. Instead of subdividing the data into parts uniformly at random, we developed a tool that splits the database using the following algorithm: (i) allocate *n* empty buckets; then, while there are unallocated siRNAs, repeat steps (ii)-(iv). (ii) Find the bucket *i* with the smallest number of elements. (iii) Pick an unallocated siRNA *s* uniformly at random and put it into *i*. (iv) Place all siRNAs overlapping *s* into *i*, as well as siRNAs overlapping those siRNAs, and so on. The result is *n* parts of nearly equal size, with no pair of overlapping siRNAs split between two parts. This algorithm is preferable to the standard minus-mRNA approach in our case because some mRNAs have many more siRNAs than others (100 siRNAs versus 5). The subsets stay roughly equal only up to *n* = 7, because some overlapping chains are about 653/7 ≈ 93 siRNAs long.
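
Steps (i)-(iv) can be sketched as follows. This Python illustration stands in for our actual tool; `overlaps` is a hypothetical precomputed map from each siRNA to the siRNAs whose target sites overlap it.

```python
import random

def overlap_split(sirnas, overlaps, n=7, seed=0):
    """Split siRNAs into n buckets so that no two overlapping siRNAs
    end up in different buckets."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n)]            # (i) n empty buckets
    unallocated = set(sirnas)
    while unallocated:
        smallest = min(buckets, key=len)        # (ii) smallest bucket
        s = rng.choice(sorted(unallocated))     # (iii) random unallocated siRNA
        stack = [s]                             # (iv) pull in the whole chain
        while stack:
            cur = stack.pop()
            if cur in unallocated:
                unallocated.remove(cur)
                smallest.append(cur)
                stack.extend(overlaps.get(cur, ()))
    return buckets
```

Because step (iv) follows overlaps transitively, an entire chain of mutually overlapping siRNAs always lands in a single bucket.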

For *n* > 7, we tried a similar version of the algorithm, which discarded a fraction of the data in order to break up long overlapping chains. Using this method with *n* = 20 or 30 reduced the variability in *R*^{2} rather little, but slowed computations considerably (results not shown). Hence, we did all cross-validation using the simpler algorithm above with *n* = 7. To reduce variation, cross-validation results were typically averaged over 50 such random splits. We also compared the results of our non-overlapping cross-validation algorithm with standard random cross-validation.

### Training parameter selection

Our neural network training procedure had two parameters with the potential to affect generalization performance. We selected the best pair using the following procedure. We picked a set of 50 random cross-validation splits; this set remained fixed for the entire procedure. We then varied ALPHA from 1.5 to 2.7 in steps of 0.1, and tried stopping times from 100 to 300 epochs in steps of 50. For every pair of parameters, we output the 50-split average *R*^{2} and its standard deviation.
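
The grid search has the following shape. This Python sketch assumes a hypothetical `evaluate(alpha, epochs, split)` function that trains on one cross-validation split and returns its *R*^{2}; in our work that role was played by SNNS Batchman runs driven by shell scripts.

```python
import statistics

def grid_search(evaluate, splits):
    """Scan ALPHA over 1.5-2.7 in steps of 0.1 and stopping times over
    100-300 epochs in steps of 50; for each pair, report the mean R^2
    over all splits and its standard deviation."""
    alphas = [round(1.5 + 0.1 * i, 1) for i in range(13)]  # 1.5 ... 2.7
    epoch_choices = range(100, 301, 50)                    # 100 ... 300
    results = {}
    for alpha in alphas:
        for epochs in epoch_choices:
            scores = [evaluate(alpha, epochs, s) for s in splits]
            results[(alpha, epochs)] = (statistics.mean(scores),
                                        statistics.stdev(scores))
    return results
```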

### Software

The tools and scripts used in producing and evaluating these models are available by request. We use a combination of the C programming language, the **Bash** shell scripting language http://www.gnu.org/software/bash/bash.html, GNU **Octave** scripting http://www.octave.org/, and SNNS v4.2 Batchman (see Neural network model). The tool set was developed on a **Gentoo GNU/Linux** system http://www.gentoo.org.

The OligoTherm program calculated the thermodynamic values listed earlier: ΔG values relevant to the stabilities of sense-antisense siRNA duplexes, siRNA antisense strand intra-molecular structure stability, siRNA antisense strand inter-molecular dimer stability, local target mRNA stabilities, and the stabilities of each pair of neighboring base pairs in the siRNA sense-antisense duplexes. OligoHybrid is a tool for calculating potential targets of complementary interactions between two RNA molecules. The RNApack program was used to calculate potential secondary structures of an RNA molecule, estimate the free energy of local secondary structure, and predict oligonucleotide affinity to nucleic acid targets. These programs use the same nearest-neighbor thermodynamic parameters as the Mfold program [43–46]. The programs are available by request.