Demonstration of two novel methods for predicting functional siRNA efficiency

Background siRNAs are small RNAs that serve as sequence determinants during the gene silencing process called RNA interference (RNAi). It is well know that siRNA efficiency is crucial in the RNAi pathway, and the siRNA efficiency for targeting different sites of a specific gene varies greatly. Therefore, there is high demand for reliable siRNAs prediction tools and for the design methods able to pick up high silencing potential siRNAs. Results In this paper, two systems have been established for the prediction of functional siRNAs: (1) a statistical model based on sequence information and (2) a machine learning model based on three features of siRNA sequences, namely binary description, thermodynamic profile and nucleotide composition. Both of the two methods show high performance on the two datasets we have constructed for training the model. Conclusion Both of the two methods studied in this paper emphasize the importance of sequence information for the prediction of functional siRNAs. The way of denoting a bio-sequence by binary system in mathematical language might be helpful in other analysis work associated with fixed-length bio-sequence.


Background
RNA interference (RNAi) is a biological mechanism by which double stranded molecules inhibit gene expression by mediating sequence-specific mRNA degradation [1]. The process starts when dsRNA molecules are degraded into short interfering RNA (siRNA) molecules, about 21-23 nucleotides in length, by the RNase enzyme Dicer. These siRNAs are subsequently incorporated into a silencing complex called RISC (RNA-induced silencing complex), which identifies and destroys complementary RNAs [2]. RNAi is an evolutionally conserved mechanism for targeted repression of gene expression that has been developed into an experimental tool for silencing specific genes across systems [1,2]. Gene silencing efficiency has varied greatly among siRNAs targeting at different positions of a specific gene or at different genes. Therefore, efficient predictive tools are in high demand for siRNA design.
Many efforts have been made trying to develop computational methods which can provide improved prediction of potentially functional siRNA [2][3][4][5][6]. These methods have been based on various parameters including sequence features, energy features, RNA secondary structure features, and so on [7]. Each of these characteristics may affect RNAi efficiency. Scoring algorithms have been widely utilized since statistically significant nucleotide base preferences can easily be applied in the construction of scoring algorithms [4]. Data mining methods have also been employed in siRNA prediction and shown promising performance [8].
In the field of machine learning, siRNA prediction can be considered as a typical pattern recognition or classification problem. Based on this consideration we developed two kinds of algorithms, both of which achieved high performence. The first one is a sequence based statistical model that has been successfully used in signal peptide prediction [9]. In this paper, we applied this method to siRNA analysis and also obtained satisfactory prediction quality. In addition, we employed Vapnik's Support Vector Machine (SVM) as an alternative solution to this problem [10]. SVM has many attractive characteristics, including over fitting avoidance, large feature spaces handling and key information extracting from a given data set. This approach has provided satisfactory performance for a wide variety of classification problems in bioinformatics areas including microarray data analysis [11], protein structure classification [12], signal peptide prediction [13] and protein subcellular localization identification [14] and so on. SVM has already been applied to predict the efficacy of short oligonucleotides in antisense and siR-NAs by Satron and got good results [8,15]. Unlike above methods, we use the SVM algorithm in a novel way by introducing a binary system to denote sequences of fixedlength. Besides the binary system, thermodynamic profile and nucleotide composition were also introduced to construct the vector space of SVM.

Results
As we have known, to objectively assess a prediction method, a homogeneous and sufficiently large dataset is of high importance. It should also be very careful to combine datasets from different resources, because the efficiency of a siRNA changes variously under different biological and experimental conditions. Fortunately, the recently published Dieter's dataset from a high-throughput assay makes it possible to break the bottleneck of directly comparing different source datasets [16]. Besides, in order to compare with the former work, we also use the dataset from Satron, which combines published siRNAs from several researches and has been used in Satron's research [8]. To guarantee the unification of the dataset and to make the training processes of our algorithms objectively, we trained those two datasets separately and took the results from Dieter's dataset as the main evaluation measurement. Moreover, the Satron's dataset provides siRNAs with 19 nt in length while the Dieter's dataset with 21 nt in length, which also makes it impossible to train them together.
Whether a siRNA can be denoted as functional or nonfunctional depends on its ability of silencing a target gene, which is often measured by the value of siRNA inhibitory activity. Thus we generated the positive and negative subsets according to the level of inhibitory activity. During our research, we took three cut-off values, 0.5, 0.6 and 0.7 to generate six combinations of positive and negative subsets for the two datasets (table 1). That is, siRNAs in the positive dataset have their inhibitory activity greater than the cutoff value and siRNAs in the negative dataset have their inhibitory activity less then the cutoff value.
For each of the six combinations, both of the self-consistency and the jackknife test were performed. Especially, during the training processes of SVM algorithm, to compare the contributions of each of the three attributes, all seven combinations have been performed independently, which are "binary, thermodynamic and composition", "binary and thermodynamic", "thermodynamic and composition", "binary and composition", "binary only", "thermodynamic only", "composition only". All the results from the two methods have been listed from table For the three kinds of attributes, namely binary, thermodynamic and composition, the highest accuracy achieved when only the binary attribute is used for SVM training processes against Dieter's dataset during the jackknife test with any of the cut-off value. For Satron's dataset, the highest accuracy appeared when "binary and composition", or "thermodynamic and composition", or "binary, composition and thermodynamic" are used during the jackknife process.
From table 3 and table 4 we can also see that the values of sensitivity and specificity differ from each other greatly during SVM training processes tested on Satron's dataset with cut-off value of 0.5, 0.6 or 0.7 and on Dieter's dataset with cut-off value of 0.5 or 0.7. For example, for Satron's dataset when the cut-off value 0.7 was used, we got a positive subset containing 141 siRNAs and a negative subset containing 420 siRNAs. We use these datasets as the input for jackknife test by SVM and got the sensitivity of 21.99% and the specificity of 96.09%. For the six combinations listed in table 1, only the one generated by Dieter's dataset using cut-off value of 0.6 has no this problem.
To compare with Dieter's work [16], we constructed 15 sub-datasets according to Dieter's description to perform the training and testing test by the sequence-based model and SVM model we have constructed in this work. Each of the cut-off value of 0.5, 0.6 or 0.7 was performed separately. Table 5 lists the results of the two methods with cut-off value of 0.6. The other results are detailed in supplemental file (see additional file 4). From table 5, we can see that both of the sequence-based model and SVM methods show great improvement than Dieter's work. The highest pearson correlation coefficient reaches 0.9771 by SVM and 0.8562 by sequence-based model when the training dataset is "All (2182)" and the testing dataset is "All (249)", while the corresponding coefficient by Dieter et al is 0.66. Whatever the cut-off value is, both of the two methods got high correlation coefficient, especially the SVM method.

Comparing the three attributes in the vector space for SVM training
During the training processes executed by SVM, we constructed three kinds of attributes which are the binary representation, the thermodynamic profile and the nucleotide composition of the sequence. All the seven combinations of the three attributes have been chosen as the input of SVM training machine to find their contributions and all the results are listed in supplemental file (see additional file 3). From the results in table 3 and table 4, we can definitely come to the conclusion that the binary representation system plays the most important role. For both of the datasets, the accuracy of the prediction will be improved greatly whenever the binary system has been added. Take Dieter's dataset for example, the four attribute combinations, all of which contain the binary system, have their accuracy higher than 90%, with about 10~25% improvement comparing with that of the rest three combinations. Neither the thermodynamic profile nor the nucleotide composition can provide such an obvious enhancement. We refer this phenomenon to the fact that the binary representation, though not indicate any biological or chemistry property of the sequence, might carry the sequence speciality such as sequence order, base preferences at certain sites, etc. Previous studies have proved that effective siRNAs show base preferences at positions 3, 10, 13 and 19 of the sense strand [4]. Other sequence characteristics have been also noted by Kumiko Ui-Tei et al [5]. The high correlation coefficients in our research emphasized the fact that the sequence of a potential siRNA oligo is intimately correlated with its function.
Actually, what we discussed here is ubiquitous in the field of bio-sequence based function prediction -the problem is how to describe a bio-sequence in a suitable mathematical language. The binary system performs well in this problem of siRNA prediction. It also works well in other machine learning areas in bioinformatics, such as prediction work about protein signal sequences and their cleavage sites [13]. This kind of binary system can be used to represent qualitative concepts such as season, blood group, etc. Generally speaking, when a variable has n types, the binary system use n-dimension vector to denote each one of the n type, with the value of the ith dimension equals 1 and all other dimensions equals 0 for the ith type. We suggest this binary model might be used in other sequence based prediction works.
The nucleotide composition, including single nucleotide and di-nucleotide compositions, indicating sequence profile at a certain level thus also has its special role in train-  " in the table), thermodynamic profile ("B" in the table) and composition ("C" in the table). Seven combinations of the attributes are put forward, which are A+B+C (means "binary, thermodynamic and composition"), A+B (means "binary and thermodynamic"), B+C (means "thermodynamic and composition"), A+C (means "thermodynamic and composition"), A (means "binary only"), B (means "thermodynamic only") and C (means "composition"). The self-consistency and jackknife test are executed in all the seven vector space respectively to compare the contribution from each of the three attributes. ing process. Also take Dieter's dataset for example, when the input attributes are "Thermodynamic and Composition", "Thermodynamic only" or "Composition only", the accuracy are 85.97%, 78.69% or 81.65% respectively for cut-off value as 0.7, 83.83%, 76.31% or 80.09% for 0.6, and 83.67%, 77.46% or 79.47% for 0.5 (see table 3). Briefly speaking, the attribute of composition profile provides about 6~7% enhancement for the prediction, without considering the possibly weakening or enhancing interaction between two attributes. However, it should be noted that composition profile is not sufficient enough since two oligos having the same nucleotide composition might differ greatly in sequence order. Thus we proposed to use this attribute together with other attributes to provide enough information for mapping a siRNA oligo onto the vector space.
As has been proved by many experimental researches, the thermodynamic profile of a siRNA plays important role in the RNA interference mechanism [17]. That is why we take the thermodynamic profile as the input of SVM machine. The results from our work are consistent with the previous work. When the "thermodynamic only" attribute was provided as the input for SVM training processes, the accuracy during jackknife test achieved 74.87%, 68.27% and 63.99% for Straon's dataset with cut-off value as 0.7, 0.6 and 0.5, while the corresponding value are 78.69%, 76.31% and 77.46% for Dieter's dataset. This sufficiently showed the importance of the thermodynamic character during the RNA interference process. However, considering all the seven combinations of the three attributes, we suggest it is better to put them together as the input of SVM machine.
To avoid redundancy between the three attributes, we calculated the Pearson correlation coefficiency. For Dieter's dataset, the correlation coefficiency between the nucleotide composition and the thermodynamic profile is 1.381E-4, the nucleotide composition and binary system is 1.132E-5, and the thermodynamic profile and binary system is 3.092E-4. The low correlation between these attributes indicates that it is proper to combine them together for prediction. From table 3 and table 4, we can see that when the five subsets, which are Satron's dataset with cut-off value of 0.5, 0.6 and 0.7, and Dieter's dataset with cut-off value of 0.5 and 0.7, are taken as the input datasets for SVM training, the sensitivity and specificity apart from each other abnormally. On the one hand, this disparity between sensitivity and specificity appears to be much greater when the number of records in positive dataset departs further from the number in the negative dataset or when the dimension of the vector space turns lower. For example, the value of sensitivity and specificity present the greatest disparity with 0.00% and 100.00% respectively under the following conditions: the vector space are constructed with only the thermodynamic or composition attribute, and the dataset is that from Straon's data with cut-off value as 0.7, in which the number of records in the positive dataset is almost three times of the number in the negative dataset. Even when the vector space is expanded by all of the three attributes, and we got the smallest difference in the record number between the positive and negative datasets, this disparity is more than 20% (see table 5, vector space expanded by the attributes "binary, thermodynamic and composition", training set as from Satron's dataset with cut-off value of 0.5, during jackknife test). On the other hand, the disparity is not obvious for some certain combinations of attributes when the dataset is large enough. This can be seen from Dieter's datasets with cutoff value of 0.5 or 0.7 in the vector space of "binary, thermodynamic and composition", "binary and thermodynamic", "binary and composition", or "binary only". Nevertheless, when the vector space is constructed by "thermodynamic and composition", "composition only", or "thermodynamic only", there are still more than 20% difference between sensitivity and specificity, even if the dataset is pretty large. Based on these situations, we come to the hypothesis that when there are much difference in record numbers between positive and negative datasets, especially when the dataset is not sufficiently large, the SVM learning machine is inclined to make a biased prediction toward the class with the larger dataset, which results in high false positive or false negative prediction.

Balancing the biased dataset in SVM training
To validate the hypothesis, we take the following procedure to improve our algorithms: 1. Randomly choose a subset from the larger dataset until the subset has the same number of records as the smaller dataset; 2. Repeat step 1 for ten times to construct ten combinations of this "sub-larger dataset + whole smaller dataset". Make sure that these combinations cover at least 99% of the larger dataset.
3. Training the ten combinations by SVM in the seven vector spaces one by one. Take Satron's dataset for example, when cut-off value of 0.7 is used, the positive dataset has 141 records while the negative has 420 records. We randomly choose 141 records from the negative dataset for ten times to construct ten subsets. Each of these ten subsets will be trained with the whole positive dataset by SVM in the seven vector spaces. The work of randomly chosen is executed by JAVA program. Only the case when the dataset is Dieter's and cut-off value is 0.6 did not need this randomly chosen scheme. The randomly chosen data of the other five subsets has been supplied in the supplemental material (see additional file 1).
The average results of the randomly process have been supplied in table 6. For Satron's data, the disparity between sensitivity and specificity has been repressed with any of the cut-off values and vector space expanded by any one of the seven attribute combinations. As for Dieter's dataset, the weaken effect on the disparity is not obvious when the dimension of the vector space is high (in this case, the disparity is neglectable), but when the vector space is "thermodynamic and composition", "thermodynamic only" or "composition only" the disparity between sensitivity and specificity is also repressed. These results proved that our strategy to lessen the discrepancy between sensitivity and specificity is workable and can efficiently reduce false positive and false negative during the training processes of machine learning methods.

The methods are robust for different cut-off values
The cut-off value for siRNA inhibitory activity might be various according to the requests from different experimenter or different experimental intention. Thus we applied three kinds of cut-off values to construct the positive and negative dataset for separate training. For the method of sequence-based statistical model, the prediction results various little under different cases of cut-off values. The accuracy of the sequence-based statistical model is 70.94%, 70.59% and 70.23% for Satron's dataset by cut-off value of 0.7, 0.6 and 0.5 while for Dieter's dataset the corresponding accuracy is 89.35%, 89.47% and 89.43% (table 2). For the method of SVM executed in the space of "binary, thermodynamic and composition", the accuracy in jackknife test on Dieter's dataset is 94.78%, 94.65% and 94.65% for cut-off value of 0.7, 0.6 and 0.5, while the corresponding value on Dieter's dataset is 78.07%, 71.66% and 72.55%, respectively. The little differences from these results show that the three cut-off values affect little on the performance of the two methods we presented in this paper. From the results listed in table 2, 3 and 4, we can also see that the SVM model trained in the vector space of "binary, thermodynamic and composition" performs better than the sequence-based model, without discounting the latter (see additional file 2 and 3). From table 5, the high correlation coefficient serves as a strong demonstration of the utility of the sequence-based model and the ability of providing high accuracy than the artificial neural network constructed by Dieter et al [16].

Conclusion
We applied the sequence-based statistical model and support vector machine to the identification of functional siRNA. We constructed three kinds of attribute, namely the binary representation, the thermodynamic profile and the nucleotide composition of a sequence to build the vector space of SVM training machine. Both of the two methods achieved high performance and showed their potential ability to predict efficient siRNAs. We also put forward a procedure to reduce high false positive or false negative values in the situation when the number of records differs greatly between the positive dataset and the negative dataset.

Data set
We chose two datasets for the training of the prediction system. One is Satron's dataset which has been collected from several published experimental reports [8]. The original dataset contains 581 records related with 40 target genes. 8 of the 581 siRNA oligo targeting at least two different genes are deleted from the training dataset. Another four records containing mismatched nucleotides are also deleted. Therefore, the "Satron's dataset" in this research contains 561 records correlated with 40 target genes. Another dataset originated from a recently published high-throughput screening work conducted by Dieter et al [16]. This dataset contains 2431 siRNAs covering 34 target genes. Since the high-throughput dataset is sufficiently large and homogeneous for training, we took the Dieter's dataset as the main assessing dataset. Three cut-off values, 0.5, 0.6 and 0.7, are used to separate the whole dataset into a positive one and a negative one (see additional file 1).

The sequence-based model [18]
A siRNA oligo nucleotide with 19 or 21 bases can be present as: where R i is the nucleotide in the ith site. We will take the 19-nt oligos for convenience. All the formulas below will be applicable to the 21-nt oligos. In our research, we suppose that the nucleotide at each site can be treated as an independent element, such that there is no coupling among these sub-sites, then the attribute of the functional and non-functional siRNAs can be formulated, respectively, as:

Support Vector Machine
The support vector machine has been widely used in many areas for its attractive features. It is a kind of machine learning methodology based on statistical learning theory. Briefly speaking, the SVM defines a feature space (often with a higher dimension) and map the input vectors into this space. In the feature space, SVM tries to classify the data points into two classes by seeking an optimized classifier called hyperplane. It has also been applied in the field of multiple classification problems recently. In this paper, we use it to classify functional siR-NAs from non-functional siRNAs. To learn more about the mechanism of SVM, readers can consult a series of previous studies for further explanation on the procedure of how to use the support vector machine [10,[12][13][14]19,20].
To construct the vector space, we introduced three kinds of attributes, representing three aspects of a siRNA oligo. The first one is the binary system by which the 5 nucleotides of as, ts, us, cs and gs are coded by 4-D vectors composed of only 0 and 1 (adenine = 0001, uracil or thymine = 0010, cytosine = 0100, guanine = 1000). Then a 19-nt ψ 0 Satron's dataset, cut-off value = 0.6. Record in the positive part before randomly chosen: 178; Records in the negative part before randomly chosen: 383. Each of the 10 randomly chosen subsets has 178 records as positive and 178 out of 383 as negative part. The pseudorandom numbers were generated by the java class of java.lang.Random.
siRNA is represented by a 76-dimension vector in the SVM input or a 21-nt siRNA can be represented by an 84dimension vector. The second attribute describes the thermodynamic profile of the siRNAs. The nearest neighbour model was introduced to calculate the pentamer subsequences of the siRNA oligos under examination to reflect its internal stability values [21]. For the terminal four bases of the 3'end of the oligos, the targeting sequences would be extended for calculation. If extension is impossible (for example when the targeting sites on the targeted gene is before the third base of the gene), the thermodynamic value of the terminal four bases are set to be zero. The third attribute defines the composition features of siRNA sequences including 4 single nucleotide and 16 di-nucleotide compositions.
The parameters for the training process by SVM were set to be default values. The software used to implement SVM was SVM_light by Joachims [22]. The training process was carried out on the computer we used (Dell OptiPlex GX270 computer with an Intel Pentium4 2.80 GHz CPU).
To objectively assess our prediction system, we employed the following measurements for sensitivity, specificity, accuracy and receiver operating characteristics (ROC):