Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study
© Liu et al. 2010
Received: 21 October 2009
Accepted: 10 April 2010
Published: 10 April 2010
Gene silencing using exogenous small interfering RNAs (siRNAs) is now a widespread molecular tool for gene functional study and new-drug target identification. The key to this technique is the design of efficient siRNAs that are incorporated into the RNA-induced silencing complex (RISC), where they bind and interact with their mRNA targets to repress translation into protein. Although considerable progress has been made in the computational analysis of siRNA binding efficacy, few joint analyses of RNAi experiments conducted under different experimental scenarios have been reported so far, even though such joint analysis is an important issue in cross-platform siRNA efficacy prediction. A collective analysis of RNAi mechanisms for different datasets and experimental conditions can often provide new clues on the design of potent siRNAs.
A multi-task learning paradigm for cross-platform siRNA efficacy prediction is proposed. Experimental studies were performed on a large dataset of siRNA sequences that encompasses several RNAi experiments recently conducted by different research groups. By using our multi-task learning method, the synergy among different experiments is exploited and an efficient multi-task predictor for siRNA efficacy prediction is obtained. The 19 most popular biological features for siRNA were ranked according to their joint importance in multi-task learning. Furthermore, the hypothesis is validated that siRNA binding efficacies on different messenger RNAs (mRNAs) have different conditional distributions, so that multi-task learning can be conducted by viewing tasks at the "mRNA" level rather than at the "experiment" level. Such distribution diversity among siRNAs bound to different mRNAs indicates that the properties of the target mRNA have important implications for siRNA binding efficacy.
The knowledge gained from our study provides useful insights on how to analyze cross-platform RNAi data in order to uncover the complex mechanisms behind them.
RNA interference (RNAi) is the process through which a double-stranded RNA (dsRNA) induces gene expression silencing, either by degradation of sequence-specific complementary mRNA or by repression of translation. RNAi has become an effective tool to inhibit gene expression, serving as a potential therapeutic strategy in viral diseases, drug target discovery and cancer therapy. The key inhibition mechanism of RNAi is triggered by introducing a short interfering double-stranded RNA (siRNA, 19-27 bp) into the cytoplasm, where the guide strand of the siRNA (usually the antisense strand) is incorporated into the RNA-induced silencing complex (RISC), which binds to its target mRNA and blocks the expression of the target gene. How to design siRNAs with high efficacy and high specificity for their target genes is one of the critical research issues [3–7].
So far, considerable progress has been made in studying the silencing capacity of siRNAs (the siRNA binding efficacy). Some fundamental empirical guidelines for designing efficient siRNA molecules have been presented [8, 9]. Further investigations include the study of the RNAi mechanism itself as well as characteristics of siRNAs with either high or low silencing capacity [10–16]. In total, these studies have led to several advanced algorithms and tools that allow the selection of potent siRNAs or the prediction of the efficacy of siRNA for gene silencing [13, 17–26].
Computational models for siRNA efficacy prediction are typically constructed in a training phase. The training data consist of a collection of siRNA sequences and their inhibiting efficacy vis-a-vis their target genes. In the testing phase, trained models are applied to new instances: characteristics potentially related to siRNA efficacy are extracted from the siRNA sequences or target mRNAs and used to predict siRNA efficacy for new targets. This procedure is generally formulated as a classification or regression model. Although various statistical and machine learning methods have been proposed in the last few years [24, 27, 28], there has been limited success in predicting siRNA efficacy due to the diversity of the data and the limited sizes of available siRNA datasets. The differences among the training data pose difficulties for in-silico siRNA design. Typically, RNAi data are provided by different research groups under different platforms/protocols in different experimental scenarios. Such data are referred to as "cross-platform" to emphasize their considerable diversity. We observed that the observations (siRNA efficacy) from multiple platforms often do not have an identical conditional distribution (i.e. the same residual variance), for several reasons. First, a variety of assays/platforms/scales exist for measuring siRNA efficacy, such as different cell types (HeLa, fibroblasts), test methods (Western blotting, real-time PCR) or siRNA delivery methods (vector-based, synthetic oligos). Second, very different siRNA concentrations may be used in different experiments. Finally, large differences can be found in the sub-optimal time intervals between transfection and down-regulation measurement, etc. [24, 29].
As we show later in the experimental part, a naive integration of the data for siRNA efficacy prediction results in poor performance. This data distribution diversity problem has largely been ignored in many previous studies, including those based on the Pål Sætrom dataset, a classical dataset for siRNA efficacy prediction. This dataset has been used as a benchmark for training and testing in several computational studies of siRNA efficacy prediction, but the issue of non-identical conditional distributions has not received sufficient attention [30, 31].
Since different RNAi experiments encompass siRNAs that target partially different sets of mRNAs, how to jointly utilize different experimental datasets becomes a critical issue for large-scale RNAi screening analysis. Solutions to this problem are expected to provide new insights into the RNAi mechanism at a large scale. In our study, although cross-platform siRNA datasets may have different conditional distributions of their efficacy, they are related to a common biological problem and can be viewed as different prediction tasks under the same latent variables. This observation inspires us to exploit the possible synergies between different datasets, rather than combining them directly, and to learn a multi-task predictor jointly and simultaneously for siRNA efficacy prediction. This predictor allows the different prediction tasks to enhance each other during training, which eventually makes the efficacy prediction better than when the datasets are naively combined or used separately.
In this paper, the cross-platform model construction issue is addressed by applying a simple yet effective linear regression model based on the multi-task learning paradigm. This model was applied to multiple datasets for siRNA efficacy prediction. In related work, a multi-task learning approach was recently presented for learning drug combinations in drug design. A multi-task classification approach has also been applied across multiple platforms to find a small number of highly significant marker genes to aid biological studies, with the emphasis on feature selection across platforms. In addition, a transfer learning technique has been applied to the cross-platform siRNA efficacy prediction problem, with the focus on using auxiliary domains to improve the regression performance on a target class. To the best of our knowledge, our work is one of the first to apply the multi-task learning model to siRNA efficacy analysis for learning regression models.
To test our multi-task regression learning framework, extensive experiments were conducted to show that multi-task learning is naturally suited to cross-platform siRNA efficacy prediction. The biological features were ranked under this model to derive the most important common features for siRNA design across different experiments. Furthermore, our experiments also validate the observation that siRNA efficacy depends on the properties of the targeted mRNA, and not merely on the properties of the siRNA sequence. We also conjecture that continued computational study of siRNA efficacy can benefit greatly from the multi-task learning framework by focusing on a much smaller task level, where, for example, each mRNA and its binding siRNAs are taken as a task, rather than an entire experiment.
Description of the 14 cross-platform RNAi experiments as well as another 2 independent experiments performed at low siRNA concentrations.
Platform label scale (min-max)
Feature weights for siRNA design derived from multi-task learning
position-dependent nucleotide consensus: sum
Δ G difference between positions 1 and 18
Δ G of sense-antisense siRNA duplexes
position-dependent nucleotide consensus: preferred
preferred dinucleotide content index
local target mRNA stabilities (Δ G)
position-dependent nucleotide consensus: avoided
nucleotide content: U
stability (Δ G) of dimers of siRNAs antisense strands
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 1
siRNA antisense strand intra-molecular structure stability (Δ G)
avoid dinucleotide content index
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 13
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 18
nucleotide content: G
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 2
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 6
stability profile for each two neighboring base pairs in the siRNA sense-antisense in position 14
frequency of potential targets for siRNA
This particular data source was chosen for two reasons. First, it contains nearly all the RNAi experiments with numerical siRNA efficacy values reported in recent studies, and is therefore a comprehensive dataset for training regression models for siRNA efficacy prediction. Second, it is a mixed dataset containing the cross-platform experiments of the Pål Sætrom dataset, a dataset misused by several computational siRNA efficacy prediction models in which its data diversity is not considered [30, 31]. We want to use the multi-task learning paradigm to address this cross-platform issue by comparing our test results with those of traditional studies. In the current study we focus only on regression rather than on general classification models, since siRNA efficacy values are by nature continuous under the different experimental platforms and we do not want to discard any information in the data. Although our model is designed for regression, it is also suitable for classification with categorical data as input. To support this argument, we applied our model to multi-task classification with the siRecords dataset, in which siRNAs are standardized with consistent efficacy ratings across different platforms. The results are listed in the supplementary materials [Additional file 1], and they also indicate that our multi-task classification model is significantly better than the single-task classification models.
It should be noted that this model could be generalized to kernel ridge regression by using the kernel trick. However, model selection is not our main focus here. Various regression models can be applied, but we choose linear ridge regression for the following reasons: (1) The performance of the linear ridge regression model is comparable to that of most state-of-the-art regression models for siRNA efficacy prediction, and its representation is simple. We applied the more sophisticated support vector regression (SVR), with both a linear kernel and a radial basis function kernel, to siRNA efficacy prediction and obtained nearly the same, or even worse, prediction results compared to linear ridge regression (see Results and Discussion). (2) We also want to exploit feature importance across the platforms for better siRNA design. This goal cannot be achieved with a kernel regression model, since it maps the input features to a high-dimensional representation that has no direct interpretation.
In our experimental study, 5-fold cross-validation was applied to find the optimal regularization parameter that minimizes the cross-validation errors. For all the 14 experiments, 5-fold cross-validation is performed on 5 regularization parameter regions respectively, i.e. [0.001,0.1] with interval 0.001, [0.01,0.1] with interval 0.01, [0.1,1] with interval 0.1, [1, 10] with interval 1 and [10,100] with interval 10. Finally λ = 10 was obtained by evaluation of the total cross-validation errors in the 14 experiments. This parameter was kept the same throughout our study for consistent comparison.
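The parameter search described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes the closed-form ridge solution, uses the search regions exactly as quoted in the text (overlapping values are merged), and sums the 5-fold cross-validation RMSE over all tasks. The function names (`ridge_fit`, `candidate_lambdas`, `cv_error`, `select_lambda`) are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form linear ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def candidate_lambdas():
    """Union of the five (low, high, step) regions quoted in the text."""
    regions = [(0.001, 0.1, 0.001), (0.01, 0.1, 0.01), (0.1, 1, 0.1),
               (1, 10, 1), (10, 100, 10)]
    values = set()
    for lo, hi, step in regions:
        n = int(round((hi - lo) / step)) + 1
        # Rounding removes floating-point noise so duplicates merge.
        values.update(round(lo + k * step, 6) for k in range(n))
    return sorted(values)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Mean RMSE over n_folds cross-validation folds for one task."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(rmse(y[test], X[test] @ w))
    return float(np.mean(errors))

def select_lambda(tasks, lambdas):
    """tasks: list of (X, y). Pick the value with the lowest total CV error."""
    totals = [sum(cv_error(X, y, lam) for X, y in tasks) for lam in lambdas]
    return lambdas[int(np.argmin(totals))]
```

Calling `select_lambda(tasks, candidate_lambdas())` on a list of per-experiment `(X, y)` arrays returns a single regularization value shared by all tasks, mirroring the choice of one λ for the whole study.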
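Prediction performance is measured by the root mean square error (RMSE),
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \]
where n is the number of predicted siRNA sequences, and y_i and \hat{y}_i denote the measured and predicted efficacy of the i-th siRNA. The smaller the RMSE, the better the prediction performance.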
In our study, the paired t-test and F-test are performed to compare multi-task learning with single-task learning in siRNA efficacy prediction. The paired t-test is widely used in the machine learning community to measure the significance with which one model outperforms another, and it is suitable under the most common distributional assumption (a normal distribution, rather than, for example, a specific chi-squared distribution) when the exact data distribution is unknown. Briefly, the test determines whether the mean of one set of samples, i.e., the cross-validation estimates for the various datasets (tasks), is significantly greater or less than the mean of another, under the assumptions that the observations are matched pairs and are drawn from a population with an approximately normal distribution.
More specifically, given two paired sets X_i and Y_i of n measured values, which in our study are the RMSE values for each experiment under the single-task and multi-task learning models, the paired t-test determines whether the two models differ significantly, under the assumption that the paired differences in prediction error for each experiment are independent and identically normally distributed.
The test statistic is t = \bar{d} / (s_d / \sqrt{n}), where \bar{d} and s_d are the mean and sample standard deviation of the paired differences d_i = X_i - Y_i, and n - 1 is the number of degrees of freedom. Once a t value is determined, a p-value can be found from a table of Student's t-distribution to determine the significance level at which the two models differ.
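As an illustration of the paired t-test described above, the statistic can be computed from per-experiment error values as in the following sketch. It uses the standard definition of the paired t statistic (mean of the paired differences divided by its standard error); the function name `paired_t_statistic` is hypothetical, and the p-value lookup is left to a t-table or to a statistics library such as `scipy.stats.ttest_rel`.

```python
import math

def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for two paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("all paired differences are identical")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1
```

Here `x` would hold the RMSE of single-task learning for each experiment and `y` the RMSE of multi-task learning on the same experiments; a large positive t then indicates that multi-task learning has the lower error.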
Multi-task learning has been developed in machine learning research to address situations where multiple related learning tasks are accomplished together [38–46]. It has been shown to be more effective than learning each task independently when there are explicit or hidden inter-relationships among the tasks that can be exploited. The intuition underlying the framework is that multiple related tasks can benefit each other by sharing data and features across the tasks, which can often boost the learning performance of each single task. This advantage is especially evident when the number of labeled data in each task is limited, so that training on each single task alone may not work well. Recently, researchers have begun to use multi-task learning models to solve biological problems such as medical diagnosis, tumor classification and drug screening [48–50]. However, applications of multi-task learning in bioinformatics are still at an early stage.
In this section, we demonstrate how to formulate the cross-platform siRNA efficacy prediction problem as a multi-task learning problem. A critical issue is to learn a set of sparse (regression) functions across the tasks. In particular, l1-norm regularization is used to control the number of learned features common to all the tasks, and the whole multi-task learning problem is equivalent to a convex optimization problem. The problem is solved iteratively until convergence by alternately performing an unsupervised step and a supervised step. In the unsupervised step, the common representations shared by the tasks are learned; in the supervised step, these representations are used to learn the regression function for each task. Detailed algorithm derivations can be found in the supplementary file [Additional file 1]. A MATLAB script package for multi-task learning in siRNA efficacy prediction is provided and is freely accessible on our website.
If λ_i ≠ 0, the i-th feature is a common feature; otherwise, the i-th feature is not useful in regression learning across the different tasks, since its regression weights are zero for all the tasks. The value of λ_i indicates the weight of the corresponding feature, which gives us a quantitative way to evaluate the importance of various features for siRNA design.
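The alternating procedure and the role of the feature weights λ_i can be illustrated with the following sketch. It is not the authors' implementation (their derivation is in Additional file 1). It assumes a squared loss and the feature-selection form of alternating minimization, in which each task solves a ridge problem whose penalty on feature i is scaled by 1/λ_i, and λ is then updated from the row norms of the weight matrix. The names `multitask_feature_selection`, `gamma` and `eps` are hypothetical; `eps` is a small smoothing constant that keeps every λ_i strictly positive.

```python
import numpy as np

def multitask_feature_selection(tasks, gamma=1.0, eps=1e-8, n_iter=50):
    """tasks: list of (X, y) sharing the same d features.

    Returns (A, lam): A[:, t] holds the regression weights of task t and
    lam[i] is the shared weight of feature i (non-negative, sums to 1).
    """
    d = tasks[0][0].shape[1]
    lam = np.full(d, 1.0 / d)        # start with equal feature weights
    A = np.zeros((d, len(tasks)))
    for _ in range(n_iter):
        # Supervised step: with lam fixed, each task is a ridge problem
        # whose penalty on feature i is gamma / lam[i].
        penalty = gamma * np.diag(1.0 / lam)
        for t, (X, y) in enumerate(tasks):
            A[:, t] = np.linalg.solve(X.T @ X + penalty, X.T @ y)
        # Unsupervised step: with A fixed, lam[i] is proportional to the
        # (smoothed) norm of feature i's weights across all tasks.
        row_norms = np.sqrt(np.sum(A ** 2, axis=1) + eps)
        lam = row_norms / row_norms.sum()
    return A, lam
```

Sorting the features by `lam` in descending order gives a ranking of the kind discussed in the text: features whose weights are driven towards zero in every task receive a value of `lam[i]` close to zero.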
In this section, a number of experiments on multi-task learning for cross-platform siRNA efficacy prediction are performed. The siRNA efficacy prediction problem is formulated as a linear ridge regression model, and the parameters of this model are tuned with 5-fold cross-validation. The root mean square error (RMSE) is adopted as the performance measure for the different tests. To verify the statistical significance of our model over the baseline algorithms, the paired t-test is also conducted on the experimental results.
Comparison between linear ridge regression and support vector regression for single task siRNA efficacy prediction.
Linear ridge regression
SVR with linear kernel
SVR with radial basis function kernel
Linear ridge regression
SVR with linear kernel
SVR with radial basis function kernel
Single task learning with direct combination and label scaling for siRNA efficacy prediction.
From Table 4, we can clearly see that even if the training data labels are scaled to the same level and the training data are pooled to train a general model for individual task prediction, the prediction results do not consistently improve. In fact, we observe worse results in half of the experiments under this general model. A statistical test on these two models shows no statistically significant difference between the two sets of prediction results (p-value = 0.7043). This indicates that directly scaling the labels and increasing the amount of training data by combining cross-platform experiments provides only limited help in improving prediction performance; in many cases the performance is degraded. All tests so far reveal a high level of diversity across these 14 experiments, which motivates us to apply the more sophisticated multi-task learning approach in this study.
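The "direct combination and label scaling" baseline discussed above can be sketched as follows, assuming min-max scaling of each experiment's labels into [0, 1] before the training data are stacked into one pool. The helper names `minmax_scale` and `pool_tasks` are hypothetical.

```python
import numpy as np

def minmax_scale(y):
    """Map labels linearly onto [0, 1]; constant labels map to 0."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()
    if hi == lo:
        return np.zeros_like(y)
    return (y - lo) / (hi - lo)

def pool_tasks(tasks):
    """tasks: list of (X, y). Scale each task's labels, then stack all tasks."""
    X_all = np.vstack([X for X, _ in tasks])
    y_all = np.concatenate([minmax_scale(y) for _, y in tasks])
    return X_all, y_all
```

A single regression model trained on the pooled arrays then plays the role of the general model; scaling aligns the label ranges but cannot remove platform-specific differences in the conditional distribution, which is consistent with the limited gains reported above.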
Comparison between multi-task learning and single task learning for siRNA efficacy prediction.
Single task learning
Tests on two independent experiments.
Test 4 (50% training data)
Single task learning
Test 5 (with added tasks, 50% training data)
We make three observations from the results in Table 6: (1) multi-task learning gives better performance than single-task learning for the two independent experiments; (2) multi-task learning with more tasks is more helpful for siRNA efficacy prediction, as shown in Test 5; and (3) the multi-task regression generalizes well to the new experimental conditions (and new mRNAs) of the two independent experiments. These results indicate that multi-task learning provides an effective way to alleviate the data insufficiency problem of single-task domains by exploiting the available synergy between different tasks. More tasks are expected to provide more help in a joint learning procedure. Furthermore, with more tasks, multi-task learning can better improve in-silico siRNA design for new mRNAs.
Using our multi-task learning model, we compute the weight of each selected feature in siRNA efficacy prediction across the 14 cross-platform experiments by considering the learned diagonal matrix D calculated in Equation (11). Multi-task learning in this case is also trained with 50% of the data for each experiment, and the random split is repeated 10 times. The features, ranked by their weights, are listed in Table 2. It can be seen that the position-dependent nucleotide consensus features and the ΔG difference between positions 1 and 18 contribute greatly to the design of efficient siRNAs. This conclusion is consistent with recent studies on siRNA design [51, 52]. In addition, the local target mRNA stability feature has a relatively high weight (0.07) in determining siRNA efficacy, which indicates that the properties of the mRNA cannot be ignored in the design of potent siRNAs. We discuss this issue further in the following section.
The impact of mRNA properties (especially the secondary structure of mRNA) on the siRNA binding efficacy has long been a controversial issue [24, 52–54]. Traditional studies suggested that it may not be critical to consider the target site's secondary structure in siRNA efficacy prediction. Several models have been presented based on the features merely derived from siRNA sequences to predict their efficacies [18, 24]. They show that the mRNA characteristics seem to offer little to the predictive strength of their models. On the other hand, several studies have shown that the properties of mRNA may play an important role in determining the binding efficacy of a siRNA [25, 55–57]. These reports motivate us to study the impact of mRNA properties on siRNA binding efficacy from a multi-task learning perspective.
We examine siRNA efficacy prediction at a smaller multi-task level, i.e., we define tasks at the "mRNA" level instead of the "experiment" level. If the properties of the mRNA influence siRNA efficacy, siRNAs that bind to the same mRNA should be related and can thus be viewed as one task in the multi-task learning model. For example, it has been reported that the sequence length of the target mRNA is positively correlated with the activity of its binding siRNAs. We speculate that there is diversity in the efficacy distribution across siRNAs binding to different mRNAs, while this diversity is weak among siRNAs binding to the same mRNA. As in the tests performed on multiple experiments, combining siRNAs targeting different mRNAs may then not benefit the final prediction results. If this is the case, it computationally validates that the properties of the mRNA indeed have an important impact on siRNA design.
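Redefining tasks at the "mRNA" level amounts to regrouping the same siRNA records by their target mRNA instead of by experiment. A minimal sketch is given below; the record layout `(mrna_id, features, efficacy)` and the function name `group_by_mrna` are assumptions made for illustration, not the format of the authors' dataset.

```python
from collections import defaultdict

def group_by_mrna(records):
    """records: iterable of (mrna_id, features, efficacy) tuples.

    Returns a dict mapping each mRNA identifier to one task, stored as
    {"X": list of feature vectors, "y": list of efficacy values}.
    """
    tasks = defaultdict(lambda: {"X": [], "y": []})
    for mrna_id, features, efficacy in records:
        tasks[mrna_id]["X"].append(features)
        tasks[mrna_id]["y"].append(efficacy)
    return dict(tasks)
```

Each entry of the returned dictionary can then be passed to a single-task or multi-task learner in the same way as an experiment-level task.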
Description of the RNAi dataset when viewing each mRNA and its binding siRNAs as a task.
Comparison between multi-task learning and single task learning at the "mRNA" task level.
Test on the efficacy prediction with siRNAs binding to single mRNA.
STL with combination and scaling
STL with combination and scaling
In conclusion, in siRNA efficacy prediction, there indeed exists certain efficacy distribution diversity across the siRNAs binding to different mRNAs, and this distribution diversity seems to be weak within the siRNAs binding to the same mRNAs. This result helps validate the observation that the properties of mRNA indeed have influence on potent siRNA design, since certain data heterogeneity has been detected across the siRNAs binding to different mRNAs.
In this study, a multi-task learning paradigm for cross-platform siRNA efficacy prediction is presented. Extensive empirical tests have been conducted to demonstrate that multi-task learning provides an efficient way to alleviate data heterogeneity and insufficiency across multiple tasks. Our method was shown to achieve better prediction performance than traditional regression models trained on each individual task independently. The paradigm allows different tasks to learn hidden data patterns through a common feature representation. In addition, our experiments validated that siRNA efficacy depends not only on the properties of the siRNA, but also on the properties of its targeted mRNA.
Future research on siRNA design could address the data heterogeneity issue further under the multi-task learning scheme. One approach is to take each mRNA and its binding siRNAs as a task, rather than each experiment. Another important consideration is to find the major causes of such heterogeneity across different experimental conditions or mRNAs; our multi-task learning paradigm can only reveal that the heterogeneity exists. For experimental conditions, we wish to determine which factors, such as the siRNA concentration or the knockdown assay, matter most in siRNA design. Similarly, and more importantly, we wish to identify the most important characteristics that determine siRNA binding efficacy. Addressing these issues would help to shed new light on why certain genes seem to be easier to knock down by RNAi than others. We believe that a better understanding of such problems can be achieved as the amount of available data increases and more features that influence siRNA-mediated RNA interference are identified.
A package of MATLAB scripts for cross-platform siRNA efficacy prediction under the proposed multi-task learning paradigm is provided. This package, together with the datasets used in our manuscript, is freely accessible at http://lifecenter.sgst.cn/RNAi/.
Test 1: For the 14 cross-platform experiments as 14 individual tasks, 50% of the data from each experiment were selected to train a regression model, and the model was tested on the remaining 50% of the data of each experiment.
Test 2: For the 14 cross-platform experiments as 14 individual tasks, all the experimental labels were scaled into [0,1] and 50% of the data from each experiment were pooled to train a general model, which was tested on the remaining 50% of the data of each experiment.
Test 3: For the 14 cross-platform experiments as 14 individual tasks, multi-task learning and single task learning were compared for siRNA efficacy prediction, both trained with 10%, 30%, 50%, 70% and 90% of the data from each experiment, respectively.
Test 4: For the 2 independent experiments, single task learning and multi-task learning were compared, both trained with 50% of the data from each experiment.
Test 5: Multi-task learning was performed on the two independent experiments together with the former 14 experiments (16 experiments in total), trained with 50% of the data from each experiment.
Test 6: For the 20 tasks at the "mRNA" level, 50% of the data from each experiment were selected to train a regression model, and the model was tested on the remaining 50% of the data of each experiment.
Test 7: For the 20 tasks at the "mRNA" level, all the experimental labels were scaled into [0,1] and 50% of the data from each experiment were pooled to train a general model, which was tested on the remaining 50% of the data of each experiment.
Test 8: For the 20 tasks at the "mRNA" level, multi-task learning was performed for siRNA efficacy prediction, trained with 50% of the data from each experiment.
Test 9: Two datasets (D1 and D2), each consisting of siRNAs binding to a single mRNA, were randomly split into 5 sub-tasks, and studies similar to Test 1 and Test 2 were performed on each of them.
This work was supported in part by Project HKUST-RPC06/07.EG09, Hong Kong University of Science and Technology. The authors would like to thank other members of Prof. Qiang Yang's research group at the Hong Kong University of Science and Technology for their helpful discussions and support. We also thank Prof. A. Argyriou of University College London for sharing the multi-task learning scripts.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.