Selection of antisense oligonucleotides based on multiple predicted target mRNA structures

Background Local structures of target mRNAs play a significant role in determining the efficacies of antisense oligonucleotides (ODNs), but some structure-based target site selection methods are limited by uncertainties in RNA secondary structure prediction. If all the predicted structures of a given mRNA within a certain energy limit could be used simultaneously, target site selection would obviously be improved in both reliability and efficiency. In this study, some key problems in ODN target selection on the basis of multiple predicted target mRNA structures are systematically discussed. Results Two methods were considered for merging topologically different RNA structures into integrated representations. Several parameters were derived to characterize local target site structures. Statistical analysis on a dataset with 448 ODNs against 28 different mRNAs revealed 9 features quantitatively associated with efficacy. Features of structural consistency seemed to be more highly correlated with efficacy than indices of the proportion of bases in single-stranded or double-stranded regions. The local structures of the target site 5' and 3' termini were also shown to be important in target selection. Neural network efficacy predictors using these features, defined on integrated structures as inputs, performed well in "minus-one-gene" cross-validation experiments. Conclusion Topologically different target mRNA structures can be merged into integrated representations and then used in computer-aided ODN design. The results of this paper imply that some features characterizing multiple predicted target site structures can be used to predict ODN efficacy.


Background
Antisense oligonucleotides (ODNs) have served as powerful tools during the post-genome era. They provide an important approach to sequence-specific knockdown of gene expression, offering significant advantages over gene knockout techniques in respect of cost, time and resource requirements, and have therefore been widely used for determining gene function, validating drug targets and elucidating pathways [1,2]. ODNs also have potential as novel therapeutic agents for various diseases; several anti-sense compounds have been evaluated in clinical trials with promising results [3].
However, even with careful design, only a small proportion of ODNs against a given RNA effectively suppress the target gene in living cells [4]. It is commonly accepted that the identification of accessible sites in the target RNA is of great importance in designing ODNs. Various experimental approaches to the identification of promising local target sites have been described in recent years [5][6][7][8][9][10]. There has also been much interest in computational approaches to ODN design, which have advantages over experimental methods in terms of throughput, cost and efficiency. Several approaches to efficacy prediction have been proposed for rational selection of ODN target sites [11][12][13][14].
Among the factors that influence the activity of a given ODN, the local secondary structures of the target mRNA are very significant in determining in vitro efficiency [5,[15][16][17] and are therefore particularly important in current ODN design strategies [18][19][20]. Local target site structures have also been used as the basis of rational design for other kinds of nucleic acids drugs such as antisense RNAs [21], catalytic RNAs [22] and ribozymes [23]. However, the term "structure" in these studies refers to "single computational predicted structure", not the real structure of the target mRNA; RNA secondary structure is difficult to determine experimentally.
Many RNA secondary structure prediction algorithms have been proposed during the past 20 years. Since the thermodynamically most stable structure of a molecule is generally the one with the minimum free energy (MFE), the initial aim of these prediction methods is to determine the MFE structure [24]. Several MFE structure searching algorithms have been described and are widely used in related research [25,26], especially in ODN target selection. However, partly because of the relatively low reliability of individual target mRNA structure predictions, researchers have often drawn inconsistent conclusions about favorable local structure motifs. The results obtained by Lima et al. [18] and Thierry et al. [19] indicated that single-stranded hairpin loops in RNA were the best target sites, whereas the studies by Laptev et al. [20] suggested that ODNs targeted to sequences predicted to form clustered double-stranded structures in RNA transcripts had the best potential.
It is also possible to consider conformations close to the energy minimum, and algorithms for calculating suboptimal structures within certain energy limits have been proposed [27,28]. The popular RNA secondary structure prediction program MFold now provides results over a range of free energies, mitigating the uncertainty of MFE prediction. Although multiple predicted structures are apparently more reliable, the MFE structure of the target mRNA is still used as the only structural basis in some ODN research. The main difficulty may lie in how to use these foldings simultaneously, since they can be topologically very different.
Studies on ensembles of target structures in ODNs design date back to Jaroszewski et al. [29], who considered the 30 lowest-energy computer-simulated structures of rabbit βglobin mRNA qualitatively. In some thermodynamic models, multiple predicted target structures have been merged into the form of free energy [30,31]. The earliest work on computational ODN design based on the original forms of multiple predicted target mRNA structures was perhaps that of Patzel et al. [17]. Five structures with low energy were predicted and aligned for a given sequence stretch, and ODN sequences were chosen if potentially favourable local structural elements occurred in all five. In vitro experiments showed that this theoretical protocol increased the statistical probability of identifying local target sites accessible to ODN sequences [17,32]. Another way to explore the original forms of optimal and suboptimal mRNA structures simultaneously, which is probably more straightforward, is to merge them into a single-stranded probability profile (SSPP), P = {p i }, 1 ≤ i ≤ n, where p i is the probability that base i is singlestranded. Actually, algorithms for predicting singlestranded regions in RNA secondary structures have long been of interest, since such regions play many important roles in RNA-RNA, RNA -DNA and RNA-protein interactions [33]. The SFold web server [34] can now directly output the SSPP of an RNA molecule instead of definite individual structures. Ding and Lawrence [33] presented a method for predicting accessible sites in the SSPP of rabbit β-globin mRNA, obtained by summing statistical samples of probable secondary structures. Their results showed a significant correlation between the predicted hybridization potential and the degree of inhibition of in vitro translation. Some researchers regard this method as the most successful [11,12].
The original RNA structural information is used in essentially different ways in the two methods described above. In the method based on structure alignment, favorable structural elements are identified by base pairing patterns, which can be illustrated as graphs. The role of secondary structures in this method is similar to its role in earlier studies of ODN design based on the target mRNA MFE structure. The success of this method relies mainly on the greatly increased reliability of structural elements. However, in the method based on SSPP, the RNA structures resemble a special time series rather than molecular "structures" in the usual sense. Base pairing patterns, or topological features, can hardly be explored in SSPP. The common ground between these two methods is the Co-reporter vector pRL-TK, complete sequence 50 4 emphasis on the role of single-stranded regions in determining target accessibility. In the SSPP of rabbit β-globin mRNA, Ding and Lawrence found a significant correlation between the peak value of SSPP and the degree of inhibition of translation. The "well-chacterized" single-stranded regions were revealed by high probability peaks in the profile [33], while in the systematic alignment of multiple predicted target mRNA secondary structures, large (>10 nt) consecutive sequence stretches not involved in base pairing were regarded as favorable structural motifs [17]. Since these two methods were only evaluated on a single target mRNA, further research is needed on a broad range of target genes.
The purpose of this article is to systematically explore the methods for computational selection of ODN target sites based on features defined in multiple predicted structures of the target mRNA. In our approach, the predicted mRNA structures were first merged into integrated representations. Efficacy-associated features were then screened from a set of features defined on these representations. The potential of neural networks for predicting efficacy on the basis of these features was also validated.

Dataset
Three ODN databases have been reported: ODNBase [35], AOdb [12] and an unnamed database with experimental data from Isis Pharmaceuticals [36]. We have also developed a database named AOBase [37] (NAR molecular biology database collection entry number 781) for both the selection and design of ODNs. Currently, it stores 705 ODNs from the published literature tested against transcripts of 54 different target genes. Since no homogeneous database is publicly available, we perforce used a heterogeneous collection of measurements made by different researchers using different experimental techniques as our dataset. Four hundred and forty-eight ODNs against 28 different mRNAs were collected from AOBase to construct this dataset; 54.2% of them had been tested at protein level and the others at mRNA level. The data selection criteria were similar to those used in other ODN efficacy prediction studies [11][12][13]: (a) at least 4 ODNs were tested under the same experimental conditions; (b) ODN efficacies were presented as percentages of the control target gene expression level; (c) virus targets were excluded; (d) ODNs targeting to the translational initiation site were excluded, since regions surrounding the initiation codon are generally considered to be free of secondary structure [38]. To keep in line with most of the research on drug design, the ODN efficacies in our dataset were transformed into [100%-(% of control expression)].
RNA folding calculation times have been greatly reduced in recent years because of faster computers and improved algorithms. The MFold web server [39] can now fold 6000 bases for a batch job, which meets the need of full-length mRNA structure prediction in most cases and is therefore used in this study. Because the number of predicted suboptimal RNA secondary structures increases exponentially as the folding energy increases [40], only structures within 5 percent of the computed minimum free energy were taken into consideration. The upper bound on the number of simultaneously predicted structures was set to 50 to avoid the high computational cost of long RNA sequences. These settings were the default settings of the MFold web server. Table 1 is a brief summary of the dataset.

Integrating multiple predicted target mRNA secondary structures
In this study, two methods were used to represent the multiple predicted local structures of target sites synthetically. All the predicted local structures were first merged into an SSPP, which is easily calculated from the ss-count file in the MFold output. For a more illustrative representation of the multiple predicted structures, the SSPP was further transformed to a "single-stranded/pair/uncertain"  110 120 130 ---------| ---------| ---------| P P P P S S P P P P P S U U S S P P U P P P P P S S P U P P SUP representation loses a lot of structural information in comparison to graphical illustration or dot-parenthesis notation of RNA secondary structure and therefore cannot be used to explore the whole RNA structure. However, for RNA local structural analysis, especially of very RNA short regions, SUP gives a competent simplified representation. Figure 1 illustrates part of these two representations (101-130 nt) of rabbit β-globin mRNA structure.

Selection of efficacy-associated features
The first important step in computational design based on multiple predicted mRNA structures is to find the efficacyassociated features in the SSPP and SUP representations of the target sites. Since the data structures of these two linear representations of multiple predicted structures are very different from graphical illustrations of RNA molecules, the topological features known to be correlated with efficacy must be redefined. However, new representations also afford opportunities to discover novel efficacy-associated features.
A set of features characterizing the local multiply-predicted target mRNA secondary structures was derived. Seven of these features were defined on the SSPP representation (listed in Table 2) while the other eleven were defined on the SUP sequence representation (listed in Table 3). The size of the local target, n, in the definition of features is equal to the length of the ODN.
The mean of all single stranded probabilities within a given target site, f mean , indicates the probability that the target site is single-stranded. The maximum value, f max , has also been used for this purpose [33]. f impulse , can be viewed as a relative peak value compared to the mean. The other statistics, f rms , f peak , f wave , and f difference , describe the structural consistency of the target site.
Numerical features defined on the SUP sequence are directly derived from research results and from empirical rules about target site selection based on local structure.
Features f NS , f NP , f PS , and f PP , give an overall description of target structure, while f 5S , f 5P , f 3S and f 3P emphasize the local structure of the target site termini. Factors f CS and f CP are derived to confirm whether the occurrence of consecutive subsequences in single-stranded or helical regions is correlated with efficacy, as explored by Patzel et al. [17].     Figure 2(b), which indicates no obvious relationships between these features and target size.
Two types of indices, efficiency prediction potential and classification potency, were used to measure the suitability of these parameters for rational ODN design. The efficacy prediction potential was evaluated by calculating the correlation between the features and efficacy, using Pearson linear correlation, Spearman rank correlation and Kendall rank correlation. The classification potency was evaluated by exploring the performance of Fisher linear discriminators, using the feature as the single independent variable. The performance was measured as specificity and sensitivity . Two different efficacy threshold values, 50% and 75%, were used to distinguish between positive and negative cases in our dataset, since these indices depend on threshold. Features matching at least one of the following two criteria were selected as efficacy-associated: (a) statistically significant correlation (p < 0.05) with efficacy; and (b) high specificity (≥0.7) or high sensitivity (≥0.7) in distinguishing between active and inactive ODNs.
The correlation between parameters and efficacy is presented in Table 4. Only four features defined on SSPP, i.e. f rms , f max , f peak and f difference , correlated strongly with efficacy. The most noteworthy finding is that ODN efficacy seems not to rely greatly on the degree of single-strandedness in its target site, as suggested in previous publications [18][19][20], since f mean , f NS and f PS show neither sufficient correlation with efficacy nor good performance in identifying active ODNs. The lengths of consecutive single-stranded regions in the target site, which are characterized by f CS , prove useful for identifying active ODNs. This result is partly consistent with the conclusion drawn by Patzel et al [17]. In contrast to the conclusion of Ding and Lawrence [33], although f max is revealed to be efficacy-associated, the peak value of the target site SSPP correlates negatively with efficacy.
The helical region in the target site appears to be more important, as suggested by Laptev [20],  ODN efficacy may be closely associated with the local structures of the 5' and 3' termini of the target sites. Fisher classifiers using factors f 5S and f 3S gave high specificity or sensitivity in ODN discrimination.
Although some features are efficacy-associated, the relationship between structural factors and efficacy is highly complex. No single feature has been found to correlate highly with efficacy, and no feature is reliable on its own for distinguishing active from inactive ODNs. Two feature sets defined on the SSPP and SUP representations of the target site are selected as inputs of efficacy-predicting neu- The distribution of ODN length and length-limited features ral networks:

Efficacy predicting using neural networks
To assess the ability of selected features to predict efficacy, two neural network models were constructed, one for features defined on the SSPP and the other for features derived from the SUP sequence representation of the target structure.
Previous studies have shown that cross-validation is important for estimating accuracy [11][12][13][14]. Since ODNs always have similar properties if they are near each other on the same gene or are measured in the same study, the network training process should be completely independent of the test data [12,13]. In this research, cross-validation was done by the "minus-one-gene" (-gene) [13] approach. ODNs targeting to 8 mRNAs (listed in Table 6) were selected alternately from the dataset for testing, while the remainder, assayed in the same studies, were used as the training set. The test mRNA selection criteria were: (a) more than 15 different target sites were tested; (b) the efficacy of at least one ODN was greater than 75%.
Sixteen neural networks for efficacy prediction were tested in our cross-validation experiments. The network group N SSPP (N SSPP 1~N SSPP 8) took F SSPP as inputs, and the N SUP group (N SUP 1~N SUP 8) took F SUP as the input parameter set. The outputs of all these networks met the condition of convergence within 100 training cycles.
Several methods have been used to measure the accuracy of ODN predictors [11][12][13][14]. To obtain rounded assessments for the aforementioned neural networks, two different types of indices were computed: (1) specificity S P , sensitivity S e and accuracy calculated using fixed threshold values, as mentioned above in the account of feature selection; (2) the receiver operating characteristics (ROC) curve [41], which is a plot of S e versus 1 -S P at different thresholds. The ROC area was calculated as a quantitative indicator of the ability of the network to classify. The cutoff efficacy value used to distinguish positive from negative ODNs in the cross-validation test was 75%.
The performances of the neural networks are listed in Table 7. The specificities, S P , of all the networks in these two groups are greater than the related sensitivities, S e . This performance is beneficial for ODN design, since users will only be interested in candidates with high predicted efficacy in practical applications [14]. The ROC curves of the 16 networks tested on ODNs targeting to 8 different mRNAs are shown in Figure 3. The best ROC curve areas were obtained in cross-validation experiment 7 (network N SSPP 7 and N SUP 7), which used the data from Matveeva et al. [6] as test set. The average ROC area for N SUP is 0.77. The average for N SUP is 0.73, which is little lower.

Discussion
Compared with most other bioinformatics research problems, studies on computer-aided ODN design are far from "data rich". Moreover, the data collected from the published literature are variable owing to the diversity of experimental methods. To provide a more reliable basis for feature-mining and predictor development, one focus of future work will be on enlargement of the dataset. A large dataset with quality control will make the analysis and cross-validation of grouped homogeneous subsets possible, and therefore make the ODN design systems more reliable.
Another "data poor" limitation in our study and related research [6,17,29] is that not all possible target RNA structures are taken into account. As pointed out by Mathews, an ideal way to integrate the predicted RNA structures would be to compute a partition function, which sums the contributions of all structures weighted by their Boltzmann probabilities [44]. However, the determination of a partition function has O(N 3 ) computational complexity [45], so this method is practicable only for short RNA sequences. Several studies have been done on the estimation of partition function with lower computational cost [44,[46][47][48]. The Vienna RNA secondary structure prediction server [49] can now compute the partition function of RNA up to 5000 bases for batch jobs. One implication of this study that warrants further investigation is ODN design using the partition function of the target mRNA, which is based on more reliable structural information.
The factors influencing the potential of an ODN are complex and so far poorly understood. Although this paper focuses on the relationship between ODN efficacy and target site structure, we do not ignore other factors that have been shown to influence efficacy, such as chemical properties, DNA-RNA duplex stability, sequence motifs, metabolic properties of target mRNA, etc. [4]. We do believe that as more factors are considered in ODN efficacy prediction, the more reliable the target site selection becomes.

Conclusion
This paper presents a method, based on multiple predicted target mRNA structures, for reducing the uncertainty of structure prediction in ODN design. Several efficacy-associated features characterizing the integrated structure of the target site have been discovered. The structural consistency features of the target seem to be correlated with efficacy. In contrast, some features of favorable ODN targets reported in previous research, which emphasized single-stranded regions, were found to correlate weakly with efficacy. In addition, the local structures of the 5' and 3' termini were shown to be important in target site selection.
Neural network efficacy predictors using features defined on integrated structures as inputs have been shown to perform well, implying that these features can also be used for other forms of efficacy prediction such as Bayesian statistics (BS), multiple linear regression (MLR), decision tree (DT) and support vector machine (SVM).

Methods
After preliminary experiments, feed-forward network architecture with a hidden layer containing 20 nodes was applied to each network. The input neurons used a logarithmic sigmoid (tan-sigmoid) activation function; the output neurons used a hyperbolic tangent sigmoid (logsigmoid) activation function. The weights and bias values of the networks were updated according to the Levenberg-Marquardt optimization algorithm [42], which appears to be the fastest method for training a moderate-size feedforward neural network [43]. Matlab ® Neural Network Toolbox 4.0.3 was used for all neural network implementation.

Authors' contributions
SW guided the project. XB and SW conceived of the study. XB wrote program, analyzed the results and drafted the manuscript. SL and DS helped in dataset construction. WS and JY helped in analysis and discussion, gave useful comments. ROC curves for efficacy-predicting neural networks