 Methodology article
 Open Access
Constructing the boundary between potent and ineffective siRNAs by MGalgorithm with Cfeatures
BMC Bioinformatics volume 23, Article number: 337 (2022)
Abstract
Background
In siRNA-based antiviral therapeutics, selecting potent siRNAs is an indispensable step, but the commonly used features are unable to construct the boundary between potent and ineffective siRNAs.
Results
Here, we select potent siRNAs by removing ineffective ones. The conditions for removal are constructed from Cfeatures of siRNAs, which are generated by MGalgorithm, Icccluster and different combinations of some commonly used features; MGalgorithm and Icccluster are two different algorithms for searching the nearest siRNA neighbors. The ineffective siRNAs are removed from test data by Iiteration, which continually updates training data by adding the successively removed siRNAs. Furthermore, the efficacy of siRNAs of test data is predicted from their nearest neighbors in training data.
Conclusions
On siRNAs of Hencken dataset, results show that our algorithm removes almost all ineffective siRNAs from test data, gives a clear boundary between potent and ineffective siRNAs, and also accurately predicts the efficacy of siRNAs. We suggest that our algorithm can provide new insights for selecting potent siRNAs.
Background
In the past decades, many RNAi therapeutic programs focusing on cancer, metabolic diseases, respiratory disorders, retinal degeneration, dominantly inherited brain and skin diseases, and infectious diseases have entered clinical practice [1, 2], and several RNAi-based antiviral therapeutic projects have reached the clinical trial stage [3, 4]. More recently, some researchers reported the identification of a group of endogenous siRNAs that play a part in enhancing environmental stress responses by repressing translation [5, 6]. However, the gene silencing effectiveness of RNAi relies on the efficacy of the siRNA in targeting a specific gene, so efficacy prediction remains a major challenge in selecting potent siRNAs [7]. In general, researchers have mainly used machine learning algorithms to design potent siRNAs [8,9,10], and have focused on features that include empirical rules [11, 12], nucleotide frequency [10], binary pattern [13, 14], thermal stability [13], and many hybridized approaches [10].
However, for these commonly used features [10,11,12,13,14], there was no direct experimental evidence showing that they influence siRNA activity [7], so their reliability needed to be validated when they were used to define the similarity of siRNAs. Here, MGalgorithm and Icccluster were used to verify their reliability, where MGalgorithm generated minigroups whose samples were the nearest neighbors of each other [15], and Icccluster put distant samples into different miniclusters [16]. Results showed that most potent siRNAs of test data were unable to find their nearest neighbors among the potent siRNAs of training data.
Moreover, the commonly used algorithms for selecting potent siRNAs tried to construct the overall difference between potent and ineffective siRNAs. Examples include ThermoComposition21 [17], DSIR11 [18] and iscore [19], which worked in both classification and regression modes; Biopredsi [20], which combined the features with the rules as input; ANN [20], which used two kinds of siRNA sequence features as its feature set; Linear [21], a linear regression model constructed from nucleotide preference scores; and SVM [7, 14, 17], which was based on a deep learning algorithm. However, potent and ineffective siRNAs belonged to a chaotic system when their similarity was defined by these commonly used features. Thus, any of these algorithms might misidentify many ineffective siRNAs when it tried to find the majority of potent ones.
Here, we first constructed Cfeatures of siRNAs by MGalgorithm, Iccalgorithm and hybridized features derived from the commonly used features, where the hybridized features were different combinations of the frequencies of multinucleotides and the binary codings of the sequences. Then, the ineffective siRNAs of test data were continually removed from test data and put into training data by Iiteration, which continually updated training data with the successively removed siRNAs. In this study, for any removed siRNA of test data, its overall similarity with ineffective siRNAs of training data exceeded that with all potent siRNAs of training data. Moreover, we used Hencken dataset [7] to validate the reliability of our algorithm. For siRNAs of test data, results showed that our algorithm was able to remove the ineffective siRNAs from test data, gave a clear boundary between potent and ineffective ones, and also accurately predicted their efficacy. We hope our algorithm can help researchers select the most effective siRNAs for use as potential therapeutics against important human viruses.
Results
Constructing training and test data
Hencken dataset contained over 1358 siRNA sequences targeting different human viruses and HIV siRNA database [5], where the experimental indicators of siRNAs were provided, the lengths of siRNAs were 19 bp, and 70% targeted gene knockdown was considered as the threshold separating potent from ineffective siRNAs.
In this paper, siRNAs of the data set were reordered by their observed inhibitions, and then the 20% of siRNAs whose new serial numbers were multiples of 5 were selected to construct test data. That is, we selected 103 potent and 242 ineffective siRNAs to construct test data, and the other 1380 siRNAs to construct training data.
Moreover, 20 test sets were also randomly generated from Hencken dataset, where we used testi to denote the ith test set, and each testi also contained 103 potent and 242 ineffective siRNAs. Here, the average identification results of these 20 testi sets were used to compare the different algorithms, and the results of test data were used to show the details of our algorithm.
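The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_every_fifth` is hypothetical, the sort direction (ascending inhibition) is assumed because the text does not specify it, and the toy inhibition values are invented.

```python
def split_every_fifth(inhibitions):
    """Reorder siRNAs by observed inhibition; every 5th serial number goes to test.

    Returns (train_indices, test_indices) into the original list.
    Ascending order is an assumption; the text only says "reordered".
    """
    order = sorted(range(len(inhibitions)), key=lambda i: inhibitions[i])
    train, test = [], []
    for serial, idx in enumerate(order, start=1):  # serial numbers start at 1
        if serial % 5 == 0:
            test.append(idx)
        else:
            train.append(idx)
    return train, test


# Toy example with 10 invented inhibition values (percent knockdown).
toy = [12, 95, 40, 71, 8, 66, 83, 55, 30, 77]
train_idx, test_idx = split_every_fifth(toy)

# With 1725 siRNAs the same rule yields the 345/1380 split reported in the text.
big_train, big_test = split_every_fifth(list(range(1725)))
```

Because the siRNAs are sorted before every fifth one is taken, each test siRNA has training siRNAs of very similar efficacy on both sides of it in the ordering.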
Comparison of different \(C_{k}\)features
Here, for siRNAs of training and test data, their \(C_{15}\)features and \(C_{31}\)features were displayed on tSNE maps (Fig. 1), where tSNE (t-distributed stochastic neighbor embedding) is a nonlinear dimension reduction method that preserves local structure in the data [22], \(C_{15}\)feature was the combination of 4 \(F_{m}\)features, and \(C_{31}\)feature was the combination of 4 \(F_{m}\)features and Bfeature. In Fig. 1, the potent and ineffective siRNAs were heavily intermixed for both \(C_{k}\)features. That is, potent and ineffective siRNAs belonged to a chaotic system when their similarity was defined by \(C_{k}\)features.
In fact, none of the commonly used features was able to give a clear boundary between potent and ineffective siRNAs, including empirical rules [11, 12], nucleotide frequency [10], binary pattern [13, 14], thermal stability [13], and many hybridized approaches [10]. However, when we enlarged Fig. 1, we found that some ineffective siRNAs were the nearest neighbors of each other. Thus, MGalgorithm (or Icccluster) with \(C_{k}\)features was able to generate minigroups (or miniclusters) that did not contain potent siRNAs of training data.
Comparison of \(C_{k}^{\alpha _{s},t}\)features and \(D_{k}^{\alpha _{s},t}\)features
In this study, \(C_{k}^{\alpha _{s},t}\)features and \(D_{k}^{\alpha _{s},t}\)features were not used to remove ineffective siRNAs from test data. In fact, at most one of the two elements of any \(C_{k}^{\alpha _{s},t}\)feature was equal to 1, and the three elements of any \(D_{k}^{\alpha _{s},t}\)feature summed to 1.
However, for fixed k and \(\alpha _{s}\), the four \(D_{k}^{\alpha _{s},t}(t=1,2,3,4)\)features of R differed significantly. The reason was that \(C_{k}\)features did not follow the normal distribution, so the minigroups of \(MG_{1}\)algorithm and \(MG_{2}\)algorithm and the miniclusters of \(Icc_{1}\)cluster and \(Icc_{2}\)cluster differed significantly.
Furthermore, since the goal of Iiteration was to continually remove ineffective siRNAs by \(\alpha _{s}\)parameters, \(C_{k}^{\alpha _{s},t}\)features were constructed from the first and third elements of \(D_{k}^{\alpha _{s},t}\)features only.
Comparison of different \(C^{\alpha _{s},t}\)features
Here, for siRNAs of training and test data, their \(C^{20,1}\)features and \(C^{20,3}\)features were plotted in Fig. 2 by their two elements, respectively. Figure 2 showed that \(C^{20,1}\)features and \(C^{20,3}\)features were not able to give a clear boundary between potent and ineffective siRNAs, but they had a tendency to separate them. Importantly, for the second elements of \(C^{20,1}\)features and \(C^{20,3}\)features of siRNAs, Fig. 2 showed that the largest values came from ineffective siRNAs of both training and test data. Thus, \(C^{\alpha _{s},t}\)features could be used to remove some ineffective siRNAs from test data. Moreover, Fig. 2 showed that \(C^{20,1}\)features and \(C^{20,3}\)features also differed significantly from each other.
The reliability of Iiteration
Here, for Iiteration with \(\alpha _{s}\)parameters, the cumulative numbers of removed ineffective and potent siRNAs were plotted in Fig. 3a and b, respectively. From Fig. 3a and b, when the \(\alpha _{s}\)parameter was less than 70%, Iiteration removed few potent siRNAs and almost all ineffective ones from test data. However, when the \(\alpha _{s}\)parameter was equal to 70%, Iiteration removed 10 potent siRNAs from test data. In fact, for siRNAs whose efficacy was between 65 and 75%, the \(C^{\alpha _{s},t}\)features had no significant difference. To prevent Iiteration from falsely removing potent siRNAs from test data, 65% was selected as the largest \(\alpha _{s}\)parameter.
Moreover, in all removals of Iiteration, we found that all \(\beta ^{s,1}_{1}(1)\)parameters and \(\beta ^{s,2}_{1}(1)\)parameters were equal to zero, where we only showed the \(\beta ^{s,t}(1)\)parameters that were first constructed for each \(\alpha _{s}\)parameter. That is, the siRNAs removed from test data were those whose \(c^{\alpha ,1}(1)\)features (or \(c^{\alpha ,2}(1)\)features) were all zero.
Furthermore, all \(\beta ^{s,3}_{1}(1)\)parameters and \(\beta ^{s,4}_{1}(1)\)parameters were plotted in Fig. 3c, which showed that they were all greater than 10. That is, siRNAs of test data whose \(c^{\alpha ,3}(1)\)features (or \(c^{\alpha ,4}(1)\)features) were zero and whose \(c^{\alpha ,3}(2)\)features (or \(c^{\alpha ,4}(2)\)features) were greater than 10 might be removed from test data.
Finally, Fig. 3d showed that \(\beta ^{s,t}_{2}(1)(t=1,2)\)parameters were greater than 27, while \(\beta ^{s,t}_{2}(1)(t=3,4)\)parameters were relatively small.
That is, for any removed siRNA of test data, its overall similarity with ineffective siRNAs of training data exceeded that with all potent siRNAs of training data.
The boundary between potent and ineffective siRNAs
Here, for potent and ineffective siRNAs of test data, their boundary was constructed by \(C^{{\alpha _{10}}}\)features (Eq. 7), where the \(C^{{\alpha _{10}}}\)features were displayed on a tSNE map (Fig. 4a) and were generated from the updated training data of Iiteration. Fig. 4a showed that \(C^{{\alpha _{10}}}\)features gave a relatively clear boundary between potent and ineffective siRNAs of test data.
The distinguishing results of Pcluster and Icluster
For siRNAs of test data, the distinguishing results of Pcluster and Icluster were summarized in Tables 1 and 2. From Table 2, TN, FN, TP and FP of test data were 228, 14, 98 and 5, respectively. That is, only 5 potent and 14 ineffective siRNAs of test data were misidentified. Moreover, the distinguishing result of test data was displayed on a tSNE map by \(C^{{\alpha _{10}}}\)features of siRNAs (Fig. 4a), which also helped us to locate the misidentified siRNAs.
Moreover, for TN, FN, TP and FP of the 20 testi sets, the average values were 210.3, 21.7, 95.5 and 7.5, respectively, and the details were summarized in the second row of Table 2. That is, the average distinguishing results of the 20 testi sets were slightly worse than those of test data. The reason was that, in test data, some ineffective siRNAs more easily found their neighbors among siRNAs with similar efficacy. These results demonstrated that Iiteration was able to correctly remove ineffective siRNAs from test data.
Predicting efficacy of siRNAs
Here, for siRNAs of test data, their efficacy was predicted by Eq. (10), and the prediction results were summarized in Fig. 4b and Table 2. Table 2 showed that the PCC of the predicted efficacy, calculated by Eq. (12), was equal to 0.76. Moreover, for the 20 testi sets, the average value of their PCCs was equal to 0.73 (Table 2). That is, the average PCC of the 20 testi sets was slightly less than that of test data.
More importantly, Fig. 4b showed that the predicted efficacy of siRNAs in Pcluster (or Icluster) was greater (or less) than 70%. This was because Iiteration gave a relatively clear boundary between Pcluster and Icluster. That is, for almost all potent (or ineffective) siRNAs of test data, the efficacy predicted by Eq. (10) was also potent (or ineffective).
Comparison to existing design algorithms
The distinguishing results of ScoreLevel [7], ThermoComposition21 [17], DSIR11 [18], iscore [19] and Biopredsi [20] were summarized in Tables 1 and 2, where these results were the average values over the 20 testi sets. ScoreLevel used F-score to investigate the contribution of each feature and removed the weakly relevant features before SVM [7], ThermoComposition21 combined position features and thermodynamic features in an artificial neural network model [17], DSIR11 used basic sequence information and a simple linear model, LASSO [18], iscore utilized linear regression models to reach state-of-the-art accuracy rates [19], and Biopredsi applied artificial neural networks to predict siRNA efficacy [20]. Table 1 showed that the highest sensitivity of those servers came from ScoreLevel [7] and was only 63.1%. Moreover, Table 2 showed that the poor sensitivity of those servers was caused by large FN. For instance, for DSIR11, iscore and Biopredsi, FN was greater than TP. That is, the numbers of their misidentified ineffective siRNAs were greater than those of their correctly identified potent ones. In fact, these algorithms tried to construct the overall difference between potent and ineffective siRNAs, but siRNAs belonged to a chaotic system when their similarity was defined by these commonly used features. Thus, any of these algorithms might misidentify many ineffective siRNAs when it tried to find the majority of potent ones. Furthermore, Tables 1 and 2 showed that the distinguishing results of hybridized features (ScoreLevel and ThermoComposition21) were superior to those of relatively simple features (DSIR11), and the nonlinear results (ScoreLevel [7] and ThermoComposition21) were also superior to the linear ones (DSIR11 and iscore). In total, these results verified that these algorithms were unable to construct a clear boundary between potent and ineffective siRNAs.
Compared to the above algorithms, the sensitivity of Iiteration (81.5%) was far higher than that of any of them. The reason was that FN of Pcluster and Icluster was far less than that of the other algorithms. In fact, Iiteration was used to remove ineffective siRNAs from test data, and only those whose overall similarity with ineffective siRNAs of training data exceeded that with all potent siRNAs of training data were removed. More importantly, Iiteration did not construct the overall difference between potent and ineffective siRNAs; it only continually updated training data with the successively removed siRNAs.
Here, the efficacy prediction results of ANN [20], Linear [21] and SVM [7, 14, 17] were summarized in Table 3, where ANN used an artificial neural network trained on a complementary 21-nucleotide guide sequence [20], Linear used support vector machine regression by combining and filtering features [21], SVM [14] used various characteristic methods, and SVM [17] used thermodynamic and composition features. From Table 3, the highest PCC of these servers also came from ScoreLevel and ThermoComposition21(SVM [14]). That is, the better efficacy prediction was generated from the better classification.
Compared to the above algorithms, the efficacy prediction of Eq. (10) was nearly equal to that of ScoreLevel, but the classification of Iiteration was far better than that of ScoreLevel. The reason was that \(C^{\alpha _{10}}\)features only constructed the overall difference between potent and ineffective siRNAs.
The sensitivity analysis of our algorithm
In the process of constructing features, \(\alpha _{s}\)parameters began at 20% and ended at 65%. Naturally, this raised a follow-up question, that is, whether similar distinguishing results could be obtained with other initial and final values. In fact, for \(\alpha _{s}\)parameters that began at 10% and ended at 65%, the distinguishing results had no difference compared to those of the values we used. That is, Pcluster and Icluster were not sensitive to the initial values of \(\alpha _{s}\)parameters.
The cross-validation of our algorithm
Here, we also used the siRNAs whose new serial numbers left a remainder of 1 (or 2, or 3, or 4) when divided by 5 to construct testa (or testb, or testc, or testd) data, and the other siRNAs to construct training data. Then, their Pcluster and Icluster were constructed by Eq. (10), and the distinguishing results were summarized in Table 2. Table 2 showed that the distinguishing results of the 5 groups of Pclusters and Iclusters had no difference compared to those of our test data. These results demonstrated that ineffective siRNAs more easily found their neighbors among siRNAs with similar efficacy, and that the performance of our algorithm in these experiments was not a random effect.
Discussion
In fact, MGalgorithm (or Icccluster) with \(C_{k}\)features is able to produce some ineffective minigroups (or miniclusters) that do not contain potent siRNAs of training data, where these ineffective minigroups (or miniclusters) contain about 20% of the siRNAs of test data. That is, some ineffective siRNAs of test data find their nearest neighbors among ineffective siRNAs of training data relatively easily. In other words, some ineffective siRNAs exhibit local similarity under \(C_{k}\)features. More importantly, for different \(C_{k}\)features, the ineffective minigroups (or miniclusters) differ significantly. Thus, if we construct enough ineffective minigroups (or miniclusters) that differ significantly, we are able to remove ineffective siRNAs from test data.
Moreover, most \(C^{65,t}\)features constructed from raw training data can remove more than 70% of ineffective siRNAs from test data, but the penalty is that about 20% of potent siRNAs of test data are falsely removed as well. To retain potent siRNAs in test data, Iiteration uses \(\beta ^{s,t}\)parameters that only remove ineffective siRNAs from test data. In fact, the conditions that \(\beta ^{s,t}\)parameters impose on these removals are very strict. That is, for any removed siRNA of test data, its overall similarity with ineffective siRNAs of training data exceeds that with all potent siRNAs of training data. As a result, we can only remove about 35% of the ineffective siRNAs of test data at a time, so we use Iiteration to construct 10 removals.
Since 70% targeted gene knockdown is considered as the threshold separating potent from ineffective siRNAs, 65% is selected as the largest \(\alpha _{s}\)parameter. This prevents potent siRNAs from being falsely removed from test data by Iiteration. Furthermore, since the number of siRNAs in training data is selected as the clustering number of Icccluster, \(C^{\alpha _{s},3}\)features and \(C^{\alpha _{s},4}\)features of potent siRNAs in training data have a significant advantage compared to those in test data. This also ensures that Iiteration does not remove potent siRNAs from test data.
Conclusion
In fact, the key to the success of our algorithm is MGalgorithm, which does not focus on searching for the overall difference between potent and ineffective siRNAs, but constructs the local similarity of ineffective ones. That is, if some ineffective siRNAs are highly correlated with some specific features, MGalgorithm can extract their similarity through minigroups. In total, we hope our algorithm can be useful in predicting highly potent siRNAs to aid therapeutic development.
Methods
Here, for siRNAs of training data, we use X(i) and Y(j) to denote the ith potent and the jth ineffective siRNA, respectively. Moreover, we use Z(l) to denote the lth siRNA of test data, where the efficacy of Z(l) is treated as unknown.
\(C_{k}\)features of siRNAs
For a random siRNA R, its \(F_{1}\)feature, \(F_{2}\)feature, \(F_{3}\)feature and \(F_{4}\)feature are constructed from the frequencies of the mononucleotides, dinucleotides, trinucleotides and tetranucleotides of its sequence, respectively (Eq. 1).
Moreover, the Bfeature of R is constructed from the binary coding of its nucleotides (Eq. 2).
Furthermore, 31 \(C_{k}\)features of R are constructed by the different combinations of the 4 \(F_{m}\)features and 1 Bfeature (\(2^{5}-1=31\) non-empty combinations). That is, any \(C_{k}\)feature contains one or more of the four \(F_{m}\)features and the Bfeature.
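To make the feature construction concrete, the following Python sketch builds frequency vectors, a binary coding, and the list of 31 combinations. It is a sketch under assumptions, not the authors' implementation: the exact forms of Eqs. (1) and (2) are not reproduced in this text, so the normalisation of the frequencies, the one-hot layout of the binary coding, the RNA alphabet `ACGU`, and all function names are hypothetical choices made for illustration.

```python
from itertools import combinations, product

NUCLEOTIDES = "ACGU"  # assumed alphabet; sequences written with T would need mapping


def f_feature(seq, m):
    """Frequencies of all 4**m m-nucleotide words in seq (assumed form of F_m)."""
    words = ["".join(p) for p in product(NUCLEOTIDES, repeat=m)]
    counts = {w: 0 for w in words}
    total = len(seq) - m + 1              # number of overlapping windows
    for i in range(total):
        counts[seq[i:i + m]] += 1
    return [counts[w] / total for w in words]


def b_feature(seq):
    """One-hot binary coding, 4 bits per position (assumed form of the Bfeature)."""
    return [1 if base == n else 0 for base in seq for n in NUCLEOTIDES]


BLOCKS = ["F1", "F2", "F3", "F4", "B"]


def all_combinations():
    """Every non-empty subset of the five feature blocks."""
    return [c for r in range(1, len(BLOCKS) + 1) for c in combinations(BLOCKS, r)]


seq = "AUGGCUAGCUAGGCUUAAC"              # invented 19-nt example sequence
f1, f2 = f_feature(seq, 1), f_feature(seq, 2)
b = b_feature(seq)
combos = all_combinations()
```

The ordering of the 31 combinations produced here is arbitrary; it is not claimed to match the paper's indexing of \(C_{k}\) beyond the fact that one combination holds the four \(F_{m}\)features alone and one holds all five blocks.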
MGalgorithm and Icccluster
MGalgorithm directly puts nearest-neighbor siRNAs in the same minigroup [15]. That is, when a siRNA belongs to a minigroup, its nearest neighbor is also in that minigroup, where \(MG_{1}\)algorithm and \(MG_{2}\)algorithm use Euclidean distance and PCC (Pearson Correlation Coefficient) to define the similarity of siRNAs, respectively. For Icccluster, in contrast, the clustering centers are generated from the siRNAs that are most distant from each other, and the other siRNAs are put into miniclusters by searching for their nearest centers [16], where \(Icc_{1}\)cluster and \(Icc_{2}\)cluster use Euclidean distance and PCC to define the similarity of siRNAs, respectively. Moreover, freely available MATLAB implementations of MGalgorithm and Icccluster are provided in Additional file 1.
In fact, for many potent siRNAs of test data, their nearest neighbors come from ineffective siRNAs of training data. To separate nearest neighbors that come from different efficacy categories, the number of siRNAs of training data is selected as the clustering number of Icccluster. Results show that some of these nearest-neighbor siRNAs are put into different miniclusters by Icccluster with this clustering number.
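The minigroup property stated above (every siRNA shares a group with its nearest neighbor) can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the published MGalgorithm [15] or the MATLAB code of Additional file 1: it simply links each point to its Euclidean nearest neighbor and takes the connected components with a union-find structure. The function names and toy points are hypothetical.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i] under Euclidean distance."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def minigroups(points):
    """Group points so that each point is in the same group as its nearest neighbor."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for i in range(len(points)):
        # Merge the group of i with the group of its nearest neighbor.
        parent[find(i)] = find(nearest_neighbor(i, points))

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy 2-D feature vectors: two well separated clumps.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (5, 7.5)]
toy_groups = minigroups(toy_points)
```

Replacing `math.dist` with one minus the Pearson correlation would give the analogue of the PCC-based variant, again only as an assumption about how the similarity is plugged in.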
\(\alpha _{s}\)parameters
Here, siRNAs of training data are assigned to 3 Egroups by an \(\alpha _{s}\)parameter (Eq. 3), where \(Y_{e}(j)\) is the experimental efficacy of Y(j), and the \(\alpha _{s}\)parameter is an artificial efficacy boundary among ineffective siRNAs.
\(D_{k}^{\alpha _{s},t}\)features of siRNAs
For siRNAs of training and test data, \(MG_{1}\)algorithm with their \(C_{k}\)features divides them into minigroups, where \(G_{k}^{u}\)group is used to denote the uth minigroup. For R, if it is put into \(G_{k}^{p}\)group, its \(D_{k}^{\alpha _{s},1}\)feature is constructed by Eq. (4), where \(N\{G_{k}^{p}\bigcap E_{l}\}\) is the number of siRNAs in the intersection of \(G_{k}^{p}\) and \(E_{l}(l=1,2,3)\), and \(N\{G_{k}^{p}\}\) is the number of siRNAs in \(G_{k}^{p}\)group.
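A three-element feature built from the counts \(N\{G_{k}^{p}\bigcap E_{l}\}\) and \(N\{G_{k}^{p}\}\) can be sketched as the fraction of a minigroup that falls in each Egroup. The Python below is a hedged reading, not Eq. (4) itself: it assumes the feature is the three ratios, and it assumes that only siRNAs carrying an Egroup label are counted in the denominator so that the three elements sum to 1 as stated in the Results. The function name, identifiers and labels are hypothetical.

```python
def d_feature(group_members, egroup_of):
    """Fractions of a minigroup's labeled members that fall in E1, E2 and E3.

    group_members: identifiers of the siRNAs in one minigroup
    egroup_of:     dict mapping labeled siRNA identifiers to 1, 2 or 3
                   (members without a label, e.g. test siRNAs, are skipped;
                   this is an assumption, not something the text states)
    """
    labeled = [r for r in group_members if r in egroup_of]
    n = len(labeled)
    if n == 0:
        return (0.0, 0.0, 0.0)
    return tuple(sum(1 for r in labeled if egroup_of[r] == l) / n
                 for l in (1, 2, 3))


# Toy minigroup: four labeled training siRNAs and one unlabeled test siRNA.
toy_group = ["X1", "Y4", "Y9", "Y2", "Z3"]
toy_labels = {"X1": 1, "Y4": 3, "Y9": 3, "Y2": 2}
toy_d = d_feature(toy_group, toy_labels)
```

Which efficacy range corresponds to each of \(E_{1}\), \(E_{2}\) and \(E_{3}\) is fixed by Eq. (3) and is deliberately left as an input here.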
\(C_{k}^{\alpha _{s},t}\)features of siRNAs
Here, \(C_{k}^{\alpha _{s},1}\)feature of R is constructed from its \(D_{k}^{\alpha _{s},1}\)feature (Eq. 5).
Moreover, \(C_{k}^{\alpha _{s},2}\)features, \(C_{k}^{\alpha _{s},3}\)features and \(C_{k}^{\alpha _{s},4}\)features of siRNAs are constructed by \(MG_{2}\)algorithm, \(Icc_{1}\)cluster and \(Icc_{2}\)cluster respectively, where k is from 1 to 31, s is from 1 to 11, and t is from 1 to 4.
\(C^{\alpha _{s},t}\)features of siRNAs
For R, its \(C^{\alpha _{s},t}\)feature is constructed from its 31 \(C_{k}^{\alpha _{s},t}\)features (Eq. 6).
\(C^{\alpha _{s}}\)features of siRNAs
For R, its \(C^{\alpha _{s}}\)feature is the combination of its four types of \(C^{{\alpha _{s}},t}\)features (Eq. 7).
Iiteration
Here, for ineffective siRNAs of test data, Iiteration uses \(\beta ^{s,t}\)parameters to remove them from test data, where \(\beta ^{s,t}\)parameters are constructed from the \(C^{\alpha _{s},t}\)features (Eq. 6) of X(i) by Eq. (8).
Then, any Z(l) of test data is removed from test data if its \(C^{\alpha _{s},t}\)feature satisfies any of the conditions of Eq. (9).
In detail, the iteration process is as follows:
Step 1 Based on the \(\alpha _{1}\)parameter and the \(\beta ^{1,t}\)parameters constructed by Eq. (8), siRNAs of test data are removed from test data if their \(C^{\alpha _{1},t}\)features satisfy any of the conditions of Eq. (9), where \(\alpha _{s}\) of Eq. (9) is \(\alpha _{1}\). Moreover, copies of the removed siRNAs are put into Icluster and \(E_{3}\)group (Eq. 3) simultaneously. That is, the training data is updated.
Step 2 Based on the updated training data of Step 1, repeat Step 1 until no Z(l) can be removed from test data by the new \(\beta ^{1,t}\)parameters, which are obtained from the updated training data of Step 1.
Step 3 Repeat Steps 1 and 2 with each subsequent \(\alpha _{s}\)parameter until the \(\beta ^{11,t}\)parameters do not remove any Z(l) from test data.
That is, the updating of training data stops when \(\alpha _{11}=70\). Finally, the remaining siRNAs of test data are put into Pcluster. Here, siRNAs of test data are distinguished as potent (or ineffective) ones when they belong to Pcluster (or Icluster).
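The control flow of Steps 1 to 3 can be summarised as a nested loop. The Python skeleton below only illustrates that flow under assumptions: the removal test of Eqs. (8) and (9) is not reproduced in this text, so it is passed in as a hypothetical callback `should_remove`; the schedule of \(\alpha _{s}\) values (20% to 65% in steps of 5%) is inferred from the stated start, end and the 10 removals, not quoted from the paper; and the toy rule at the end is invented purely to show the chained removals.

```python
def i_iteration(train, test, alphas, should_remove):
    """Nested removal loop; returns (pcluster, icluster).

    should_remove(z, train, alpha) is a hypothetical stand-in for the
    conditions of Eqs. (8)-(9), re-evaluated on the updated training data.
    """
    train, test = list(train), list(test)
    icluster = []
    for alpha in alphas:                       # Step 3: next alpha_s parameter
        while True:                            # Step 2: repeat Step 1
            removed = [z for z in test if should_remove(z, train, alpha)]  # Step 1
            if not removed:
                break                          # nothing more to remove at this alpha
            test = [z for z in test if z not in removed]
            icluster.extend(removed)           # removed siRNAs go to Icluster
            train.extend(removed)              # and update the training data
    return test, icluster                      # remaining test siRNAs form Pcluster


# Assumed schedule: 20, 25, ..., 65 (ten values).
ALPHAS = list(range(20, 70, 5))


# Invented toy rule on 1-D "features": remove z if some training item lies within 1.
def toy_rule(z, train, alpha):
    return z < alpha and min(abs(z - y) for y in train) <= 1


pcluster, icluster = i_iteration([0], [1, 2, 3, 50, 51], ALPHAS, toy_rule)
```

In the toy run, item 1 is removed first, which makes item 2 removable on the next pass, and so on; this mirrors how each removal changes what the next pass can remove.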
Predicting efficacy of siRNAs
Here, siRNAs of Pcluster and potent siRNAs of training data (or siRNAs of Icluster and ineffective siRNAs of training data) are divided into minigroups by \(MG_{1}\)algorithm with their \(C^{\alpha _{10}}\)features (Eq. 7), where \(C^{{\alpha _{10}}}\)features are generated from the updated training data of Iiteration. The efficacy of Z(l) is then predicted by Eq. (10), where \(\widehat{Z_{e}(l)}\) is the predicted efficacy of Z(l), \(X_{e}(i)\) (or \(Y_{e}(j)\)) is the experimental indicator of X(i) (or Y(j)), and u (or v) is the number of potent (or ineffective) siRNAs of the minigroup that contains Z(l).
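Given the symbols defined above, one natural reading of the prediction is the mean experimental efficacy of the u (or v) training siRNAs that share a minigroup with Z(l). The Python sketch below implements that reading only; it is an assumption, since Eq. (10) is not reproduced in this text, and the function name and toy values are hypothetical.

```python
def predict_efficacy(group_members, train_efficacy):
    """Mean experimental efficacy of the training siRNAs in Z(l)'s minigroup.

    group_members:  identifiers in the minigroup that contains Z(l)
    train_efficacy: dict mapping training siRNA identifiers to efficacy (%)
    Returns None when the minigroup holds no training siRNA.
    """
    known = [train_efficacy[r] for r in group_members if r in train_efficacy]
    if not known:
        return None
    return sum(known) / len(known)


# Toy minigroup: test siRNA Z1 grouped with two potent training siRNAs (u = 2).
toy_prediction = predict_efficacy(["Z1", "X3", "X7"], {"X3": 80.0, "X7": 90.0})
```

Because siRNAs of Pcluster are grouped only with potent training siRNAs, and siRNAs of Icluster only with ineffective ones, a mean of this kind stays on the same side of the 70% threshold as the cluster assignment.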
Sensitivity and specificity
Here, we use Se (sensitivity) and Sp (specificity) to evaluate the consistency between the experimental indicators and the clustering results, where the experimental indicators are treated as the gold standard. Se and Sp are defined from TN, FN, TP and FP, which are the numbers of true negatives, false negatives, true positives and false positives, respectively.
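As a worked check, the Python sketch below assumes the standard definitions Se = TP / (TP + FN) and Sp = TN / (TN + FP); the paper's own equation is not reproduced in this text, so these forms are an assumption. With the average counts reported for the 20 testi sets (TP = 95.5, FN = 21.7), this definition of Se gives the 81.5% sensitivity quoted in the Results.

```python
def sensitivity(tp, fn):
    """Se = TP / (TP + FN), the standard definition (assumed here)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Sp = TN / (TN + FP), the standard definition (assumed here)."""
    return tn / (tn + fp)


# Average counts of the 20 testi sets as reported in the Results.
se_avg = sensitivity(tp=95.5, fn=21.7)
sp_avg = specificity(tn=210.3, fp=7.5)

print(f"Se = {se_avg:.1%}")   # Se = 81.5%
```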
PCC model
In general, the PCC model is used to measure the correlation between the predicted efficacy and the observed inhibition (Eq. 12), where b is the number of siRNAs of test data, \(\widehat{Z_{e}(l)}\) and \({Z_{e}(l)}\) are the predicted value and observed label of Z(l), \(\overline{Z_{p}}\) and \(\sigma _{Z_{p}}\) are the mean and standard deviation of all \(\widehat{Z_{e}(l)}\), and \(\overline{Z_{e}}\) and \(\sigma _{Z_{e}}\) are the mean and standard deviation of all \(Z_{e}(l)\), respectively.
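The Pearson correlation coefficient between predicted and observed values can be computed directly from the means and deviations named above. The Python sketch below uses the textbook formula; whether Eq. (12) normalises by b or b − 1 is not visible in this text, but the coefficient is the same either way as long as both standard deviations use the same convention. The function name is hypothetical.

```python
import math


def pcc(predicted, observed):
    """Pearson correlation coefficient of two equally long sequences."""
    b = len(predicted)
    mean_p = sum(predicted) / b
    mean_o = sum(observed) / b
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    dev_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    return cov / (dev_p * dev_o)   # undefined if either sequence is constant


# Perfectly correlated and perfectly anti-correlated toy inputs.
r_pos = pcc([1, 2, 3], [2, 4, 6])
r_neg = pcc([1, 2, 3], [3, 2, 1])
```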
Availability of data and materials
Hencken dataset: http://crdd.osdd.net/servers/virsirnapred/dataset.php?dataset=1723, MGalgorithm: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s1285901824955, Iccalgorithm: https://febs.onlinelibrary.wiley.com/doi/10.1002/22115463.12327
Abbreviations
\(F_{m}(m=1,2,3,4)\)feature: They are defined by Eq. (1)
Bfeature: It is defined by Eq. (2)
\(C_{k}\)features: They are constructed by the different combinations of its 4 \(F_{m}\)features and 1 Bfeature
\(D_{k}^{\alpha _{s},t}\)feature of R: It is defined by Eq. (4)
\(C_{k}^{\alpha _{s},t}\)feature of R: It is defined by Eq. (5)
\(C^{\alpha _{s},t}\)feature of R: It is defined by Eq. (6)
\(C^{\alpha _{s}}\)feature of R: It is defined by Eq. (7)
\(\alpha _{s}\)parameter: It is defined by Eq. (3)
\(\beta\)parameter: It is defined by Eq. (8)
\(\{ \beta ^{s,t}_{1}, \beta ^{s,t}_{2}\}\)parameter: It is defined by Eq. (8)
Pcluster: Containing these siRNAs of test data that are distinguished as potent ones
PCC: Pearson Correlation Coefficient
References
Angaji SA, Hedayati SS, Poor RH, Madani S, Poor SS, Panahi S. Application of RNA interference in treating human diseases. J Genet. 2010;89(4):527–37.
Davidson BL, McCray PJ. Current prospects for RNA interference-based therapies. Nat Rev Genet. 2011;12(5):329–40.
Haasnoot J, Westerhout EM, Berkhout B. RNA interference against viruses: strike and counterstrike. Nat Biotechnol. 2007;25(12):1435–43.
Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 2008;9(4):267–76.
Baumann K. How plants silence stress. Nat Rev Mol Cell Biol. 2020;21:303.
Wu H, Li B, Iwakawa HO, Pan Y, Tang X, LingHu Q, Liu Y, Sheng S, Feng L, Zhang H, Zhang X. Plant 22-nt siRNAs mediate translational repression and stress adaptation. Nature. 2020;581:89–93.
He F, Han Y, Gong J, Song J, Wang H, Li Y. Predicting siRNA efficacy based on multiple selective siRNA representations and their combination at score level. Sci Rep. 2017;7:44836.
Mysara M, Elhefnawi M, Garibaldi JM. MysiRNA: improving siRNA efficacy prediction using a machine-learning model combining multi-tools and whole stacking energy (DeltaG). J Biomed Inform. 2012;45(3):528–34.
Han Y, He F, Chen Y, Liu Y, Yu H. SiRNA silencing efficacy prediction based on a deep architecture. BMC Genomics. 2018;19(Suppl 7):669.
Qureshi A, Thakur N, Kumar M. VIRsiRNApred: a web server for predicting inhibition efficacy of siRNAs targeting human viruses. J Transl Med. 2013;11:305.
Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A. Rational siRNA design for RNA interference. Nat Biotechnol. 2004;22(3):326–30.
Ui-Tei K, Naito Y, Saigo K. Guidelines for the selection of effective short-interfering RNA sequences for functional genomics. Methods Mol Biol. 2007;361:201–16.
Liu L, Li QZ, Lin H, Zuo YC. The effect of regions flanking target site on siRNA potency. Genomics. 2013;102(4):215–22.
Pan WJ, Chen CW, Chu YW. siPRED: predicting siRNA efficacy using various characteristic methods. PLoS ONE. 2011;6(11):e27602.
Jia X, Han Q, Lu Z. Analyzing the similarity of samples and genes by MG-PCC algorithm, t-SNE-SS and t-SNE-SG maps. BMC Bioinform. 2018;19(1):512.
Jia X, Liu Y, Han Q, Lu Z. Multiple-cumulative probabilities used to cluster and visualize transcriptomes. FEBS Open Bio. 2017;7(12):2008–20.
Shabalina SA, Spiridonov AN, Ogurtsov AY. Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinform. 2006;7:65.
Vert JP, Foveau N, Lajaunie C, Vandenbrouck Y. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinform. 2006;7:520.
Ichihara M, Murakumo Y, Masuda A, Matsuura T, Asai N, Jijiwa M, Ishida M, Shinmi J, Yatsuya H, Qiao S, et al. Thermodynamic instability of siRNA duplex is a prerequisite for dependable prediction of siRNA activities. Nucleic Acids Res. 2007;35(18):e123.
Huesken D, et al. Design of a genome-wide siRNA library using an artificial neural network. Nature Biotechnol. 2005;23:995–1001.
Peek AS. Improving model predictions for RNA interference activities that use support vector machine regression by combining and filtering features. BMC Bioinform. 2007;8:182.
Bushati N, Smith J, Briscoe J, Watkins C. An intuitive graphical visualization technique for the interrogation of transcriptome data. Nucleic Acids Res. 2011;39(17):7380–9.
Acknowledgements
This work rests almost entirely on open data. Contributors are gratefully acknowledged. Furthermore, we deeply thank Mr Quanming Luo (Linyi No. 18 Middle School, PR China) for carefully reviewing our manuscript.
Funding
This project was financially supported by Major Program of National Natural Science Foundation of China(2016YFA0501600). The funders did not play any role in this study.
Author information
Contributions
All coauthors contributed equally to this work. XJ and QH designed the work, ZL acquired data, XJ and ZL analyzed data, XJ and QH created software used in this work, and ZL supervised the study. All coauthors actively commented on and improved the manuscript, and read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declared that they had no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: MATLAB algorithm. A freely available MATLAB implementation of MGalgorithm and Icccluster for a data set.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jia, X., Han, Q. & Lu, Z. Constructing the boundary between potent and ineffective siRNAs by MGalgorithm with Cfeatures. BMC Bioinformatics 23, 337 (2022). https://doi.org/10.1186/s12859022048679
DOI: https://doi.org/10.1186/s12859022048679
Keywords
 MGalgorithm
 Icccluster
 Cfeature
 Iiteration