Improving model predictions for RNA interference activities that use support vector machine regression by combining and filtering features

Peek, Andrew S

doi:10.1186/1471-2105-8-182

Research article
Open access
Published: 06 June 2007

Andrew S Peek¹

BMC Bioinformatics volume 8, Article number: 182 (2007)

Abstract

Background

RNA interference (RNAi) is a naturally occurring phenomenon that results in the suppression of a target RNA sequence utilizing a variety of possible methods and pathways. To dissect the factors that result in effective siRNA sequences, a regression kernel Support Vector Machine (SVM) approach was used to quantitatively model RNA interference activities.

Results

Eight overall feature mapping methods were compared in their abilities to build SVM regression models that predict published siRNA activities. The primary factors in predictive SVM models are position specific nucleotide compositions. The secondary factors are position independent sequence motifs (N-grams) and guide strand to passenger strand sequence thermodynamics. Finally, the factors that are least contributory but are still predictive of efficacy are measures of intramolecular guide strand secondary structure and target strand secondary structure. Of these, the site of the 5' most base of the guide strand is the most informative.

Conclusion

The capacity of specific feature mapping methods and their ability to build predictive models of RNAi activity suggests a relative biological importance of these features. Some feature mapping methods are more informative in building predictive models, and overall t-test filtering provides a method to remove some noisy features or make comparisons among datasets. Together, these features can yield SVM regression models with increased accuracy between predicted and observed activities, both within datasets by cross validation and between independently collected RNAi activity datasets. Feature filtering should be approached carefully: it is possible to reduce feature set size without substantially reducing the predictive performance of the models, but the features retained in the candidate models become increasingly distinct. Software to perform feature prediction and SVM training and testing on nucleic acid sequences can be found at the following site: ftp://scitoolsftp.idtdna.com/SEQ2SVM/.

Background

RNA interference (RNAi) describes the property of short (21 to 23 base) RNA molecules, or short interfering RNA (siRNA), to associate with naturally occurring cellular machinery, the RNA-Induced Silencing Complex (RISC) and reduce the quantity of a second RNA molecule, or the target gene RNA [1, 2]. In the relationship between the siRNA and the target RNA, the siRNA must be able to Watson-Crick base pair with some segment of the target RNA using standard base pairing rules. The RISC then catalytically cleaves the target RNA.

In addition to the RISC mediated silencing mechanism, the siRNA can reduce target gene levels utilizing two other methods. First, siRNA can inhibit transcription of the target gene's DNA [2–4]. Second, it can utilize a mechanism similar to an endogenous and highly conserved class of small RNAs known as microRNAs (miRNAs). MicroRNAs mediate the reduction of target gene protein level by repressing target RNA translation through imperfect base pairing to the target gene transcript [5]. All of these various methods and mechanisms result in target gene knockdown [6]. In addition to the epigenetic gene knockdown, siRNA sequences can cause sequence expulsion from the genome [7] and small dsRNAs are implicated in the induction of transcription [8].

SiRNA molecules are not all equally effective in their ability to knockdown target genes [9–14]. Some combination of the properties of the siRNA, the target RNA sequence and their interacting components are thought to account for the differential effectiveness. Furthermore, it is not known whether specific characteristics of an siRNA molecule contribute differently to the 3 gene knockdown mechanisms of RISC mediated, transcription inhibition and translation repression, since presumably each mechanism interacts with distinct subsets of cellular components and possibly different optimality criteria [15, 16]. In addition to the mechanism of knockdown, there is also possible variation among transcripts [17], organisms, cell type, developmental time course, transfection methods [18] and environmental treatment in gene knockdown, and many of these properties are not accounted for in siRNA effectiveness. Although several rules describing properties of functional siRNA sequences have been proposed and proven to work with variable effectiveness, the fundamental questions of what properties comprise an effective siRNA for gene knockdown, by any mechanism, are unsettled. More realistic models will be needed for further dissecting siRNA mechanism or mechanisms [19]. Once appropriate experiments are derived for taking each of the complex series of variables into account, researchers will need to identify the critical components to model RNA interference activities and then use those models to develop reagents with the desired properties.

Several methods for identifying the properties of effective versus ineffective siRNA molecules from empirical data have included the following:

a. classification by statistical grouping [9–14, 20, 21]
b. classification and regression by neural networks [22–24]
c. classification by boosted genetic programming [25]
d. classification by decision trees [26, 27]
e. classification and regression by support vector machines (SVMs) [25, 28, 29].

Many of the classification approaches have taken empirically derived continuously distributed data, and used it to map "effective" versus "ineffective" siRNA sequences and their associated properties by cutoffs and binning. A comparison of various algorithms in predicting siRNA efficacy by classification [30] suggests a large variance in performance. Furthermore, several features have been shown to associate with predictive models of activity including the following:

a. position specific base composition [11–14, 20, 29, 30]
b. guide strand thermodynamics [9, 10, 24, 25, 29]
c. guide strand secondary structure [30, 31]
d. structure features that discriminate microRNAs [32]
e. N-grams [25, 28, 29]
f. target strand secondary structure [21, 24, 33–40]
g. the energetics of multiple guide strand binding sites within the target [24].

Support Vector Algorithms or Support Vector Machines (SVMs) are a group of machine learning methods that build a maximum margin hyperplane through n-dimensional space to separate the m elements in a discrete classification problem [41]. The n-dimensional space is comprised of some set of factors that describe the m elements being classified. In addition to discrete classification, SVMs can also be used to build regression models in n-dimensional space. Generally this can be done by describing the regression as a set of 2m classification support vectors that separate the m-elements in the dataset. In fact, the single hyperplane SVM classification problem is a special case solution of the more general multi-hyperplane SVM regression problem [41]. Finally, SVM methods can extend beyond linear models to describe the maximum margin hyperplane(s) of the support vector solution space by non-linearly mapping the initial vector into higher dimensional feature space [42].

SVM regression kernel methods produce varied results depending on the application, and kernel performance needs to be determined empirically [43]. Also, feature-mapping methods have an effect on SVM performance [42]. Given the observation that SVM kernel methods are effective at defining maximum margin hyperplanes and the knowledge that results can depend on feature mapping to vector space, this study investigates several feature mapping methods and examines their utility in creating predictive regression models for siRNA activity.

Given that several types of sequence based features can be used to build predictive models of RNAi, one of the main intentions of this study is first to ask what features individually correlate with RNAi efficacy, to help identify additional siRNA properties that may have structural or functional importance not previously seen. A second intention is to ask whether there is a consensus as to the feature mapping methods that can be used either alone or together, and whether they contribute to developing models that are generally predictive of activity on data not seen during model training. A further question is whether feature selection methods, such as feature filtering, on large feature sets actually improve predictive models, and whether feature subsets are found in common. Two datasets are used in the present study. The first is a set of 2431 siRNA sequences of 21 nucleotides in length from [23], specifically from the corrigendum [44], referred to as dataset₂₄₃₁. The second is a compiled set of 579 siRNA sequences of 19 nucleotides in length from [25] referred to as dataset₅₇₉.

Methods

RNA interference and target sequence data

Dataset₂₄₃₁ was from [23], the 21-mer sequence and activity data used was from the corrigendum [44]. Dataset₅₇₉ was from the compiled 581 19-mer sequences and activities dataset used by [25], with the exception of five sequences that did not precisely correspond to their target gene DNA sequence. Of these five sequences, two were discarded due to ambiguity of matching to their target and three were changed at one or two positions to correctly correspond to the target mRNA sequence. The target mRNA sequences were either from [23] or downloaded from the NCBI [45].

data mapping methods for SVM

The following eight general approaches, in Roman numerals, were used to map a sequence to a vector space, to result in 14 methods, labeled in Arabic numerals:

I. position specific base composition (method 1)
II. thermodynamics (method 2)
III. entropy (method 3)
IV. guide strand structure (method 4)
V. guide strand structure features (method 5)
VI. N-grams (methods 6–11)
    a. N-grams N = 2 (method 6)
    b. N-grams N = 3 (method 7)
    c. N-grams N = 4 (method 8)
    d. N-grams N = 5 (method 9)
    e. N-grams N = 6 (method 10)
    f. N-grams N = 2 through 5 (method 11)
VII. target strand structure (methods 12–13)
    a. target strand structure – nondirectional (method 12)
    b. target strand structure – directional (method 13)
VIII. target imprecise thermodynamics (method 14)

method 1: position specific base composition

Each position in the siRNA sequence was mapped to four dimensions in vector space, where each dimension corresponded to one of the bases in the nucleic acid alphabet. The relationship between the length of the sequence (L) and the number of dimensions of vector space (M) was then M = S × L, where S is the size of the alphabet, in this case 4 for nucleic acids. For example, the coding system between base and vector results in the following mapping:

A = < 1,0,0,0 >

C = < 0,1,0,0 >

G = < 0,0,1,0 >

U/T = < 0,0,0,1 >
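The mapping above can be sketched in a few lines of Python. This is a minimal illustration rather than the released C++ implementation; the function name `one_hot` and the example guide sequence are hypothetical, and the column order (A, C, G, U/T) simply follows the coding listed above.

```python
# Column order follows the coding given in the text: A, C, G, U/T.
BASES = "ACGU"


def one_hot(seq):
    """Map a nucleotide sequence to a flat vector of length 4 * len(seq)."""
    vec = []
    # Treat T and U as the same symbol, as the text does ("U/T").
    for base in seq.upper().replace("T", "U"):
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


# Hypothetical 21-mer used only for illustration; M = S x L = 4 x 21 = 84.
guide = "UGCAUGCAUGCAUGCAUGCAA"
features = one_hot(guide)
```

For a 21-nucleotide guide strand this yields the 84 features referred to in the results for method 1.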

method 2: thermodynamics

The thermodynamics mapping method has 23 dimensions, with 20 of the dimensions corresponding to the Gibbs free energy stabilities of the nucleotide pairs of the 21-nucleotide RNA molecule. An additional two dimensions were for the stability energetics of the terminal 5' and 3' ends, encompassing 4 nucleotide sites. The final dimension is the Gibbs free energy stability of the entire sequence. The nearest neighbor model predicted Gibbs free energies with the RNA parameters of Xia [46].

method 3: Shannon entropy

The Shannon entropy mapping method is similar in dimensionality and implementation to the thermodynamics method, but the 23 dimensions of the 20 nucleotide pairs, the 5' and 3' terminal ends and the final dimension of the entire 23 nucleotide sequence were populated with Shannon's measure of bitwise information content [47] by formula (1).

H (X) = - \sum_{i = 1}^{l} p (x_{i}) \log_{2} (p (x_{i}))

(1)

Where l is the length of the sequence, p(x_i) is the frequency of the character at position i.
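As a hedged illustration of formula (1), the sketch below computes Shannon entropy from the character frequencies of a sequence or sequence window. It assumes the common reading in which the sum runs over the distinct characters observed in the window; how the windows for the 23 dimensions were delimited is not reproduced here, and the function name `shannon_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy in bits, from the character frequencies of seq."""
    n = len(seq)
    counts = Counter(seq)
    # Sum -p * log2(p) over the distinct characters present in the window.
    return -sum((c / n) * log2(c / n) for c in counts.values())


# A window using all four bases equally carries 2 bits per character;
# a homopolymer window carries none.
even = shannon_entropy("ACGU")
repeat = shannon_entropy("AAAA")
```

Under this reading, sequences with a more even distribution of bases score higher, which matches the interpretation of information content given later in the results.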

method 4: guide strand secondary structure

Nucleic acid secondary structure describes the ability of a single molecule of nucleic acid sequence to form one or more intramolecular bonds, thereby stabilizing some sequence segments as double stranded. siRNA sequence secondary structures were predicted with RNAfold as implemented in the Vienna package [48]. Energetics were predicted by partition function and by minimal free energy algorithms for evaluation purposes. Partition function energetics produced models with higher predictive accuracy and were used in this study. First, a 21-length feature vector was produced with one dimension for each base position in the siRNA sequence corresponding to whether the position was involved in an intramolecular secondary structure. Second, a single dimension was added corresponding to the overall intramolecular stability as measured by the Gibbs free energy of folding. Finally, two additional dimensions were numerical counts of the number of bases in the 7 most 5' and 7 most 3' bases of siRNA sequence involved in a predicted secondary structure [31].

method 5: guide strand secondary structure features

The guide strand secondary structure features mapping method is an implementation of the sequence feature method described by Xue et al. [32] for discriminating real and pseudo miRNAs. Briefly, a 32-length feature vector is comprised of the occurrence frequencies of three nucleotide sequence-structure features. The middle base of the 3 base triplet has one of 4 possibilities (A, C, G or T/U) and each position could be in either a bonded or non-bonded state, resulting in a 32 (4 × 2³) dimensional feature space. The nomenclature used is the base at the middle position followed by 3 binary symbols. For example, 'U000' indicates the middle position is 'U' and this 3 base triplet is not within a secondary structure, whereas 'C111' indicates the middle base position is a 'C' and this triplet is completely paired within a structure. See Xue et al. [32] for complete details.
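The triplet nomenclature can be illustrated with a short Python sketch. It assumes the predicted structure is available as a dot-bracket string (for example from a folding program), treats any bracket as "bonded", and normalizes counts by the number of triplets; the function name, the key ordering and the example sequence and structure are hypothetical and are not taken from the released software.

```python
from itertools import product

BASES = "ACGU"
# 32 keys: middle base followed by three bonded (1) / non-bonded (0) symbols.
KEYS = [b + "".join(bits) for b in BASES for bits in product("01", repeat=3)]


def triplet_features(seq, dot_bracket):
    """Occurrence frequencies of the 32 sequence-structure triplets."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    paired = ["0" if c == "." else "1" for c in dot_bracket]
    counts = dict.fromkeys(KEYS, 0)
    # Slide over every internal position; i indexes the middle base.
    for i in range(1, len(seq) - 1):
        counts[seq[i] + "".join(paired[i - 1:i + 2])] += 1
    total = max(len(seq) - 2, 1)
    return [counts[k] / total for k in KEYS]


# Hypothetical hairpin used only to show the bookkeeping.
vec = triplet_features("GGGAAACCC", "(((...)))")
```

In this toy hairpin the seven triplets are all distinct, so seven of the 32 dimensions each receive a frequency of 1/7.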

methods 6–11: N-gram

The N-Gram approach mapped the presence or absence of each possible sub word of a given length and character composition from the original siRNA sequence [25]. For example, there are 4² = 16 possible 2-grams from the 4 base DNA alphabet (generally, A^N, where A is the number of characters in the alphabet and N is the length of the word). The 16 length 2-gram vector for the DNA 'ACGT' alphabet would then be:

< AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT >

and mapping an example sequence of "ATGCATG" onto this vector space by presence or absence would yield:

< 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0 >

The N-Gram method is therefore position independent, and vector space can be adjusted to account for frequency and position in addition to simply presence or absence.
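The 2-gram example above can be reproduced with the following Python sketch. The function names are hypothetical, and treating method 11 as a simple concatenation of the N = 2 through 5 vectors is an assumption made here for illustration.

```python
from itertools import product


def ngram_vocabulary(n, alphabet="ACGT"):
    """All alphabet**n words, in the lexicographic order used in the text."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def ngram_presence(seq, n, alphabet="ACGT"):
    """Presence (1) or absence (0) of each possible n-gram in seq."""
    present = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    return [1 if g in present else 0 for g in ngram_vocabulary(n, alphabet)]


two_grams = ngram_presence("ATGCATG", 2)

# Assumed reading of method 11: concatenate the N = 2..5 vectors.
combined = [bit for n in range(2, 6) for bit in ngram_presence("ATGCATG", n)]
```

Replacing the set with a counter would give the frequency-based variant mentioned in the text, at the cost of no longer being a pure presence or absence encoding.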

methods 12–13: target strand secondary structure

The predicted secondary structures for the mRNA target sequences were determined in the same manner as for the siRNA guide sequences. Regions of structure prediction were limited to the guide strand binding region plus 100 bases up and down stream. However, only the guide strand binding region was used to map structure to vector space. In the case of direction independent structure, this resulted in a total of 22 dimensions: 21 dimensions with one for each nucleotide position plus an additional dimension as the Gibbs free energy of structural stability. For directional binding in the target structure, the dimensions were 42 plus the Gibbs free energy, totaling 43.
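A rough sketch of how the 22- and 43-dimensional vectors could be assembled is given below. It assumes the folded target window is available as a dot-bracket string written 5' to 3', so that "(" marks a base paired with a partner further 3' and ")" marks a base paired with a partner further 5'. The function name, the argument names and the ordering of the dimensions are assumptions, not details taken from the released software.

```python
def target_structure_features(dot_bracket, start, energy,
                              length=21, directional=False):
    """Structure vector for the guide strand binding region of a target.

    dot_bracket: folded target window, 5' to 3' (assumed input format)
    start:       0-based offset of the binding region within the window
    energy:      Gibbs free energy of the predicted structure
    """
    if start < 0 or start + length > len(dot_bracket):
        raise ValueError("binding region falls outside the folded window")
    region = dot_bracket[start:start + length]
    if directional:
        # 21 flags for a partner lying 5', then 21 flags for a partner 3'.
        vec = [1.0 if c == ")" else 0.0 for c in region]
        vec += [1.0 if c == "(" else 0.0 for c in region]
    else:
        # One flag per position: paired in any direction or not.
        vec = [0.0 if c == "." else 1.0 for c in region]
    vec.append(energy)
    return vec


# Hypothetical 30-base window and energy, for illustration only.
window = "((((....))))" + "." * 18
m12 = target_structure_features(window, 4, -3.2)
m13 = target_structure_features(window, 4, -3.2, directional=True)
```

The nondirectional call returns 22 values and the directional call 43, matching the dimension counts stated above.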

method 14: target strand multiple binding patches

The guide strand of the siRNA sequence could imperfectly pair with multiple regions of the target strand. The 22 most stable imperfect sites of guide strand to mRNA pairing were predicted by RNA thermodynamics [46] and their thermodynamic stabilities populated the dimensions of the feature vector.

SVM regression kernel methods

Four regression kernel functions were tested:

1. Linear kernel
2. Polynomial kernel
3. Radial Basis Function (RBF) kernel – This is similar in implementation to the radial basis function neural network.
4. Sigmoid kernel – This is similar to another type of neural network, a multilayer perceptron with no hidden layers.

SVM kernels were implemented with the libsvm library [49].

SV regression was used rather than SV classification, since the activity data were continuously distributed on the interval [0, 1]. Classification models were also tested for predicting RNAi activities, but choosing arbitrary division points in the outcome classes resulted in highly variable model performance. This observation suggests that data categorization has a substantial impact on model building and that the optimization of data categorization is important.

N-fold cross validation within a dataset

Cross validation (CV) was performed by dividing the original data into N equally sized (or as nearly as possible) partitions, training on (N-1) partitions and testing on the N^th partition. This was performed for all N partitions, and the Pearson correlation coefficient (R) and mean squared error (MSE) between predicted and observed activities on the testing partition were averaged over all N tests. Specifically, 10-fold cross validation on dataset₂₄₃₁ divided the dataset into 10 partitions of approximately 243 sequences each. A model was then trained on approximately 2187 sequences, and then tested on the remaining partition of approximately 243. This procedure was repeated 9 more times on the remaining partitions. Values of R and MSE are comparable within tables from cross validation in that the same pseudo-random number seed was used to produce the dataset divisions. Cross validations involving feature selection were performed by using the feature selection method only on the training set and applying this feature subset to the testing set. Performing feature selection within the cross validation reduces the bias in CV model estimates, but can result in different feature sets being used among the partitions of cross validation. The average number of features used among partitions, and the similarities among the CV feature subsets, are reported where appropriate.
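The partitioning scheme can be sketched as follows. This is an illustrative Python version, not the released code: the function names and the seed value are hypothetical, and the interleaved assignment of shuffled indices to folds is one of several reasonable ways to obtain nearly equal partitions.

```python
import random


def kfold_indices(n_items, n_folds, seed=0):
    """Shuffle item indices with a fixed seed and deal them into folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Fold sizes differ by at most one item.
    return [idx[i::n_folds] for i in range(n_folds)]


def train_test_splits(n_items, n_folds, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    folds = kfold_indices(n_items, n_folds, seed)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(train_test_splits(2431, 10, seed=0))
fold_sizes = sorted(len(test) for _, test in splits)
```

Because 2431 is not divisible by 10, nine folds hold 243 items and one holds 244. Reusing the same seed reproduces the same divisions, which is what makes R and MSE values comparable across feature mapping methods; any feature selection step would be run on each `train` list only.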

individual feature correlation to RNAi activity and feature filtering

Individual features were tested for their significance of correlation to activity by correlation and the t-test of significance, calculated by formula (2).

t = \left| R \times \sqrt{\frac{o - 2}{1 - R^{2}}} \right|

(2)

where

R = Pearson correlation coefficient

o = number of observations

R² = Pearson correlation coefficient squared (coefficient of determination)
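Formula (2) and the filtering step it supports can be illustrated with a short Python sketch. The function names and the default threshold argument are hypothetical; the threshold of 2.0 mirrors the value used in the results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation signal
    return sxy / sqrt(sxx * syy)


def correlation_t(r, n_obs):
    """t statistic of formula (2) for correlation r over n_obs observations."""
    if n_obs <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1.0:
        return float("inf")
    return abs(r) * sqrt((n_obs - 2) / (1 - r * r))


def t_filter(columns, activity, threshold=2.0):
    """Indices of feature columns whose t value reaches the threshold."""
    n = len(activity)
    return [j for j, col in enumerate(columns)
            if correlation_t(pearson_r(col, activity), n) >= threshold]


t_landmark = correlation_t(0.05, 2431)
```

With 2431 observations, a correlation of magnitude 0.05 corresponds to a t value of roughly 2.47, so fairly small correlations already pass a threshold of 2.0 on a dataset of this size.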

Feature filtering used only the training portion of the dataset to perform feature subset selection, along with appropriate calculation metrics on the training dataset, and this feature subset was then applied to the naive testing dataset. Evaluating feature selection within cross validation reduces bias in assessing model performance metrics, because the same data are not used for both model training and model testing. By contrast, when the entire dataset is used for both training and testing, the results are optimistically biased due to model overfitting. When the training and testing are performed alternately between dataset₂₄₃₁ and dataset₅₇₉, the results are likely to be pessimistically biased, principally due to the dissimilarities between the datasets.

The feature selection method of Correlation based Feature Selection (CFS) [50] was used to select feature subsets with presumed high effectiveness. CFS is a maximum-relevance minimum-redundancy method that greedily adds features to a feature subset by maximizing a scoring metric. CFS used equation (3) to maximize G_s in selecting features for the subset.

G_{s} = \frac{k r_{c i}}{\sqrt{k + k (k - 1) r_{i i}}}

(3)

where k is the number of features in the subset, r_ci is the mean correlation of the feature to the outcome and r_ii is the mean feature intercorrelation or feature to feature cross correlation.
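Equation (3) and a greedy forward search over it can be sketched as below. This is an illustration under stated assumptions, not the CFS implementation used in the study: it takes precomputed absolute correlations as input, adds one feature at a time, and stops when the merit no longer improves. The function names and the toy correlation values are hypothetical.

```python
from math import sqrt


def cfs_merit(k, mean_r_ci, mean_r_ii):
    """Merit G_s of equation (3) for a subset of k features."""
    return k * mean_r_ci / sqrt(k + k * (k - 1) * mean_r_ii)


def greedy_cfs(r_ci, r_ii):
    """Forward selection maximizing G_s.

    r_ci[i]:    absolute correlation of feature i with the outcome
    r_ii[i][j]: absolute correlation between features i and j
    """
    selected, best = [], 0.0
    remaining = set(range(len(r_ci)))
    while remaining:
        scored = []
        for f in remaining:
            subset = selected + [f]
            k = len(subset)
            mean_ci = sum(r_ci[i] for i in subset) / k
            pairs = [(a, b) for n, a in enumerate(subset)
                     for b in subset[n + 1:]]
            mean_ii = (sum(r_ii[a][b] for a, b in pairs) / len(pairs)
                       if pairs else 0.0)
            scored.append((cfs_merit(k, mean_ci, mean_ii), f))
        merit, f = max(scored)
        if merit <= best:
            break  # no candidate improves the subset
        selected.append(f)
        remaining.remove(f)
        best = merit
    return selected, best


# Toy values: features 0 and 1 are redundant with each other,
# feature 2 is weaker but independent of both.
r_ci = [0.5, 0.45, 0.4]
r_ii = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
subset, merit = greedy_cfs(r_ci, r_ii)
```

In the toy case the search keeps features 0 and 2 and leaves out feature 1, showing how the intercorrelation term in the denominator penalizes a redundant feature even when its own correlation with the outcome is higher.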

Multicollinearity exists within and between some of the feature mapping methods. For example, the base composition at positions 1 and 2 (method 1) correspond to the thermodynamics measurement for this area (method 2) and these share significant cross correlations.

software architecture

A group of C++ classes is made available to the research community that perform the following functions:

1. SVM model construction, given a feature set and RNAi sequence dataset
2. Perform N-fold cross validation given a model, feature set and RNAi sequence dataset
3. Predict RNAi activities given an SVM model, feature set and a candidate RNAi sequence set
4. Predict siRNA sequences given a feature set, candidate gene sequence and a SVM model
5. Predict various types of feature filters, feature comparisons as well as feature cross-correlation

The most recent library classes and associated main functions can be downloaded [51].

Software was developed with C++ under Linux kernel 2.6.9-5, with the gcc compiler 3.4.3. The classes for manipulating and modeling siRNA sequences and their activities compile without warnings with the -Wall -ansi -pedantic-errors compilation flags, including wrapper classes for libsvm-2.71 and libRNAfold-2.4 libraries. Additional platforms and compilers have not been systematically tested, but the package is distributed with the GNU autotools and should compile on supported architectures. Further development of additional functionality for this library is intended and the resulting code will also be released. Areas of development include interfaces to other machine learning techniques including ANNs, additional feature mapping methods and implementing wrapper methods for model construction and optimization. Contact the author if you intend to develop functionality, primarily to minimize duplication of effort in case the method has already been constructed but not released.

Results

The results section is divided into three major sections with the following structure. The first investigates individual feature correlation with RNAi activity involving only dataset₂₄₃₁. This section specifically examines the methods of site-specific base composition (method 1), guide strand thermodynamics (method 2), guide strand entropy (method 3), guide strand secondary structure (method 4), guide strand secondary structure feature (method 5), target sequence secondary structure (methods 12 and 13) and finally N-Grams (methods 6 to 11).

The second section investigates these single feature mapping methods and their abilities to train and test SVM models on two datasets: dataset₂₄₃₁ and dataset₅₇₉. The second section also introduces feature filtering by t-test, in which features are removed with increasing stringency of the t-test of individual feature correlation to RNAi activity.

The final section investigates the effectiveness of both combining individual feature mapping methods and feature filtering by Correlation based Feature Selection (CFS) to produce feature subsets in the training and testing of SVM models on dataset₂₄₃₁ and dataset₅₇₉. Also feature subset comparisons are made, investigating the commonality between predictive feature subsets derived from either within the same dataset between cross validations or between different datasets.

I a. site specific base composition

The correlation of position specific base composition to RNAi activity was calculated for each of the 84 features in the position specific base composition vector. Overall, there are 45 features that have a correlation with RNAi activity with a t-test value of 2.0 (P < 0.05) or greater (Figure 1, horizontal lines at correlation R = +/-0.05 have t-test values of ~2.4 and simply provide visual landmarks). Statistical tests have not been corrected for multiple comparisons and there are several kinds of non-independence within the data, features, models and tests presented. Many of these bases and positions are consistent with previous observations of site-specific base composition (see Suppl1_comparison_position_specific_base_composition.xls), but several have not been previously identified as statistically significant. Previous analyses even from the same dataset yield inconsistencies in features found to be or not be significant.

Briefly, the method for identifying position specific biases in base composition from this data previously used the 200 most potent and 200 least potent siRNA sequences rather than the entire dataset [23], so differences are not unexpected. For example, among sites that have not previously been shown as significantly associated with RNAi efficacy, C3 (namely a "C" base at the 3^rd position in the guide strand, starting from the 5' end of the guide strand), C5, C10, G11 and G17 are overly associated with lower potency, and U6, U8, A16 and T20 are overly associated with higher potency, numbering from the 5' end of the guide strand. In general, of the 45 features that have values of t greater than 2.0, the features are relatively evenly distributed across bases (11 A's, 12 C's, 9 G's and 13 U/T's), but not in their association with lower potency (2 A's, 10 C's, 7 G's, 2 U/T's) versus higher potency (9 A's, 2 C's, 2 G's, 11 U/T's), and their distribution across positions is irregular (Figure 1).

In addition to the guide strand of the siRNA, site-specific base composition biases might exist in the target mRNA as well. Investigating this possibility in the target mRNA surrounding the guide strand-binding region resulted in 3 overall patterns. First, the guide strand binding area on the target strand has the largest magnitude of site-specific base composition biases, when compared to the surrounding 100 bases (Supplementary figure 1). Second, the magnitude of the positive correlation drops with distance from the guide strand whereas the magnitude of the negative correlation appears reasonably constant. Third, the overwhelming trend for positive correlations with activity relates to the bases A and T/U. The trend for negative correlations with activity relates to the bases G and C (Supplementary Figure 2). Despite these suggestive patterns, no dominant features of site-specific base composition were obvious outside of the guide strand binding area, and further study of site-specific base composition was limited to the guide strand region.

I b. guide strand thermodynamics, entropy, secondary structure

Guide strand thermodynamics (R = 0.283), guide strand sequence entropy (R = 0.074), guide strand secondary structure stability (R = 0.227) and overall target strand secondary structure stability (R = 0.248) all have correlations with RNAi activity that have high t-values. In addition, these features have position specific distributions from within the guide strand (Figure 2). Correlations between activity and guide strand thermodynamics, guide strand secondary structure and target secondary structure have been shown before and we see overall correlations between these features and RNAi activity as well. Position dependence of guide strand thermodynamics has also been shown previously and is seen in the present data (Figure 2). Additionally, there is a general positive association between the entire guide sequence's information content (Shannon entropy) and activity, where guide sequences with higher information content (lower repeat structure, a more even distribution of bases, etc.) have higher potency. There is also a weak indication that this pattern is seen in positions 3 through 9 of the guide strand (Figure 2).

I c. sequence structure features

Recently, a sequence structure mapping method was proposed that allowed the discrimination of real versus pseudo microRNAs [32] by combining sequence and secondary structure. Applying this method on the guide strand sequence, several sequence-structure features were observed that had positive or negative correlations with activity. Using the nomenclature described in the methods section, features such as U/T000 (R = 0.152) and A110 (R = 0.099) had a positive correlation with activity, whereas features such as C111 (R = -0.160) and G111 (R = -0.129) had a negative correlation. Generally, open structures are preferred to bonded structures and the bases A and U/T are preferred to C and G (see Suppl2_all_features_corr_descr_tval.txt for a list of individual feature to activity correlates for thermodynamics, structure, entropy, etc.).

I d. target secondary structure

Investigating the target strand secondary structure more fully, the target strand secondary structure was predicted and the positions surrounding the guide strand binding area were interrogated to see whether they form pairs in an intramolecular target strand structure. Intramolecular interactions that were limited to 100 nucleotide sites upstream and downstream of the guide strand binding area were used in the presented data. Folding areas of 20, 50, 75, 80, 125, 150 and the entire target strand were investigated and were, on the whole, consistent. However, 100 sites resulted in the highest correlation between target strand structure stability and RNAi activity, similar to the observations of [37]. Graphing the correlations between each position in the target strand that is within an intramolecular structure and the RNAi activity resulted in two overall patterns (Figure 3). First, there is an overall negative correlation between any site within the local target area being paired and RNAi activity (with a few potentially positively correlating areas or anomalous regions near or within the guide strand binding area) that is consistent with the observation that there is a correlation between target strand structure stability and activity. Second, the most dominant negatively correlative position that results in lower potency siRNA sequences occurs where the 5' most site of the guide strand would pair to the target strand within an intramolecular Watson-Crick pair.

Target secondary structure was further investigated by asking whether there are any structural patterns in the overall orientation of the Watson-Crick pairing within the immediate region of the guide strand. Intramolecular bonds were categorized into those occurring to a base more 5' on the target strand and those occurring to a base more 3' of itself (respectively yellow and blue in Figure 4) on the target strand. There are two patterns that emerge from this analysis. The first pattern is the highly deleterious position where the guide strand's 5' most base would pair. It is fairly equally comprised of structures that involve sites that are both 5' and 3' of itself, suggesting guide strand access is not asymmetric. Second, there appears to be a weak symmetry of sites immediate to the 3' of the guide strand binding area (positions 2 through 7 on the area 3' of the guide strand binding region, Figure 4) on the target strand to be positively correlated with activity if bonding with a 5' more site and negatively correlated with activity if bonding with a 3' more base. This weak symmetry is reflected within the guide strand binding area (positions 13 through 17 in the guide strand, Figure 4) where these positions are weakly positively correlated with activity if bonding with a 3' more site and negatively correlated with activity if bonding to a 5' more base. The overall suggestion might be that structures that hold the 5' most site of the guide strand's pair in a target secondary structure are deleterious whereas nearby target secondary structure stems that hold this position in an unstructured loop are more (weakly) positive for RNAi activity. Since this is an analysis that comprises several thousand guide strand regions, it is necessarily a population average. Therefore, individual cases where this is not observed would not be surprising.

I e. N-grams

Sequence motifs, or N-grams, simply a subsequence of N items from a given sequence, were then investigated for motif specific correlation with RNAi activity (see supplementary table 3 for complete table of feature N-gram correlations with activity). Overall, 10 of the 16 possible 2-grams had t-values greater than 2.0: 6 with positive correlations, tending to be A and U/T rich ("AA" R = 0.090, "AT" R = 0.118, "TA" R = 0.174, "TC" R = 0.047, "TG" R = 0.053 and "TT" R = 0.153), and 4 with negative correlations, being the four possible combinations of the C and G bases ("CC" R = -0.088, "CG" R = -0.089, "GC" R = -0.120, "GG" R = -0.114). This overall pattern holds true for the 3 through 6 length N-grams with a general preference for A and U/T and aversion for C and G. Higher order patterns are seen in the preference or aversion to specific longer motifs as well. For example, there are 64 possible guide strand 3-grams and 39 of these 64 have t-values greater than 2.0. Furthermore, there are 114 of the 256 4-grams with t-values for their correlations greater than 2.0. One striking observation is that overall for 3-grams, their individual 3-nucleotide motif associations with RNAi activity negatively correlate with their corresponding codon usage frequency (R = -0.221), after reverse complementing the guide strand 3-gram into the target strand codon sequence. Also, the magnitude of deviation for each 3-gram, as measured by t, negatively correlates with both the codon usage frequency (R = -0.127) and with synonymous codon usage frequency (R = -0.156).

Table 3 Guide strand position specific base composition (Method 1) for training RBF-epsilon regression SVM model
