Automatic detection of exonic splicing enhancers (ESEs) using SVMs

Background Exonic splicing enhancers (ESEs) activate nearby splice sites and promote the inclusion (vs. exclusion) of exons in which they reside, while being a binding site for SR proteins. To study the impact of ESEs on alternative splicing it would be useful to have a possibility to detect them in exons. Identifying SR protein-binding sites in human DNA sequences by machine learning techniques is a formidable task, since the exon sequences are also constrained by their functional role in coding for proteins. Results The choice of training examples needed for machine learning approaches is difficult since there are only few exact locations of human ESEs described in the literature which could be considered as positive examples. Additionally, it is unclear which sequences are suitable as negative examples. Therefore, we developed a motif-oriented data-extraction method that extracts exon sequences around experimentally or theoretically determined ESE patterns. Positive examples are restricted by heuristics based on known properties of ESEs, e.g. location in the vicinity of a splice site, whereas negative examples are taken in the same way from the middle of long exons. We show that a suitably chosen SVM using optimized sequence kernels (e.g., combined oligo kernel) can extract meaningful properties from these training examples. Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and well interpretable parameters. Conclusion The motif-oriented data-extraction method seems to produce consistent training and test data leading to good classification rates and thus allows verification of potential ESE motifs. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels with oligomers of a certain length could be used to extract relevant features.


Background
In eukaryotes, after transcription from DNA to messenger RNA (mRNA), the mRNA is initially present as a precursor messenger RNA (pre-mRNA). This pre-mRNA still com-prises the exons and introns of the gene. At this stage it is not known which exons will eventually be included into the mature mRNA. This decision is taken during a process called splicing. Then, the introns are cut out and the exons are reconnected in different ways to yield various mRNAs. The signals which are necessary to define the splice sites are located at the exon/intron boundaries and in the introns, see Figure 1. Especially the AG-and GU-dinucleotides at the ends of introns are highly conserved. In some proteins, the exons are joined in the order in which they appear on the pre-mRNA. However, mostly this is not the case. Different variants of a protein can occur when single exons are skipped or if two splice sites are present in an exon from which only one is used. This is called alternative splicing. Some of the exons are present in every mRNA (constitutive exons), others are alternatively spliced, and thus only present in some of the transcripts. The resulting mRNAs are then translated into proteins which can differ extremely in their function.
The splicing process involves a series of biochemical reactions which are catalyzed by the spliceosome, a complex of small nuclear ribonucleo-proteins (snRNPs). Additionally, there are both cis-acting sequence elements as well as trans-acting elements involved. It is known that splicing is partly promoted by additional sequences which are embedded in exons. Depending on whether the sequence elements assist or inhibit the splicing process, they are called exonic splicing enhancers or exonic splicing silencers. Exonic splicing enhancers (ESEs) are usually purinerich sequences that bind members of the SR protein family (see Figure 1). SR proteins are Serine/Arginine-rich proteins which can bind to the RNA and are required during alternative splicing to select which splice site should be used [1][2][3]. Their binding to these elements enhances or prevents the steric association of the 5'-and 3'-splice sites and consequently has an essential function in exon definition.
Exonic splicing enhancers are not easy to identify because known ESEs consist of quite common motifs of only [6][7][8] bases which occur at many positions in the genome. Therefore, not every such sequence indicates a binding site. A well-known group of ESEs are the purine-rich enhancers containing repeated GAR (GAG and GAA) motifs. Several studies have shown that many sequences can function as ESEs. The employment of functional systematic evolution of ligands by exponential enrichment (SELEX) is an important tool for ESE identification [4][5][6][7][8][9]. The identified motifs are short (6-8 nt) and degenerate.
In [10], ab initio computational approaches for the identification of ESE motifs were employed. RESCUE-ESE identified motifs of length six which are over-represented in exons with non-consensus splice sites ("weak exons") as well as in exons versus introns. In [11], a similar approach to identify octamers over-represented in internal non-coding exons versus unspliced pseudo-exons and the 5' untranslated regions of intronless genes was utilized. Both methodologies resulted in motifs that function as ESEs, as the authors showed experimentally in several cases. Until now, the bioinformatics tools for detecting exonic splicing enhancers are often based on positionweight matrices (PWMs) which were constructed from experimentally verified motifs of ESE sequences. Because of the shortness of ESEs, the use of PWMs leads to a high amount of false positive predictions. Of course, a high amount of false positive predictions is not desirable as it makes the classification less useful for biologists. Every potential positive prediction needs experimental verification. However, a high number of experiments examining all predictions will waste time, money and material. In order to use the predictions to guide experiments, the false positive prediction rate has to be reduced. Examples for pattern matching based programs are ESEfinder [12] and SEE ESE [13].
In order to analyze alternative splicing it is necessary to have an accurate detection method for ESEs. Loss or disruption of ESEs can result in a changed transcript, potentially even be responsible for diseases [2].
Here, we report on a new approach using support vector machines (SVMs, [14][15][16]) with special sequence based Recognition signals and proteins for splicing  kernels to perform the classification of exonic splicing enhancer sequences and our own strategy for the generation of the data sets for training and testing the classifiers.

Results and discussion
Design of positive and negative data sets A problem for detecting ESEs is the selection of suitable sequences for training and testing the SVMs. There are only few exact locations and motifs of ESEs described in the literature. Therefore, we tested two different data generation methods that are presented below.

Neutralized data
The idea behind the data set was obtained from [17]. The data set contained 1000 randomly chosen protein-coding exons with length ranging from 100 to 300 bases from the Vega database [18], which we assume to contain exonic splicing enhancers. From these exons, a set of 1000 negative sequences was created using a mechanism called neutralization [17]. The negative training sequences were generated randomly, but still coded for the same amino acid sequence and maintained the overall composition of the original exons. That is, the codon usage should be preserved as well as the frequencies of dinucleotide occurrences. According to the authors, using this training data resulted in features which performed some function independent of the protein-coding function of exons and can thus be used to discriminate between the original and the neutralized data set. 200 cycles of neutralization were used leading to a mean difference of 73% between the exons and the neutralized counterparts. As described in [17], we examined the dinucleotide composition before and after neutralization and found the frequencies changed only minimally. For a more detailed description of the neutralization procedure see the Methods section or the original literature [17].

Motif-oriented data
For generating a second data set, we developed a motiforiented data-extraction method. Sequences were extracted locally around experimentally or theoretically determined ESE patterns, where we used the 238 hexamers identified by RESCUE-ESE [10]. We assumed that the sequence surrounding an ESE pattern can help to detect them in exons. This is reasonable because of the fact that ESEs are located in the vicinity of other binding sites for splicing proteins or ESEs itself as well as in the vicinity of splice sites [19,20]. As previously mentioned, the ESE patterns are quite short and not every such sequence indicates a binding site. Therefore, positive examples were restricted by heuristics based on known properties of ESEs: • Location in the vicinity of splice sites [19,20] • Presence in an exon with a non-consensus splice site (''weak exon'') [3,10] • Location in a single-stranded region [21] In summary, these criteria led to consistent positive examples, from which local features could be extracted. We obtained 1835 sequences which met the above-mentioned criteria and could thus be used as positive training examples. An advantage of this method is that the classification problem is simplified by the introduction of biological a priori knowledge. A disadvantage is that the new method can only be used for ESEs with known consensus sequences.
The negative examples were extracted from longer exons using the same extraction method as used for positive examples, positioning the "ESE motif" in the center and extracting the surrounding sequence. An advantage is that these sequences have

SVM scenario
An L 1 -norm soft margin support vector machine (SVM) was applied using special sequence based kernels, the combined oligo kernel [22] and the locality improved kernel [15,23]. The combined oligo kernel counts matching oligomers up to a certain length with an adjustable degree of positional uncertainty. This uncertainty is realized using the smoothing parameters σ 1 ,...,σ κ of the Gaussian in the combined oligo kernel function [see Methods]. The locality improved kernel counts matching nucleotides and considers local correlations within local windows of length 2l + 1 [see Methods]. For comparison, a Markov chain model was implemented.

Results for neutralized data
The data for training and testing the SVM classifier consisted of 1000 positive examples, the exons, and 1000 negative examples, their neutralized counterparts. We performed 50 trials with different random partitions of the data into training and test sets.

Adaptation of the parameters of the SVM kernel
In the training phase the parameters of the used kernel had to be adapted. In this case, the oligo kernel [see Methods] was employed. A grid search was used in a 5-fold cross-validation scenario for determining the optimal values of the smoothing parameter σ ∈ {i | 1 ≤ i ≤ 10} of the oligo kernel as well as the regularization parameter C ∈ {0.002·i | 1 ≤ i ≤ 10} of the SVM.

Classification performance
Training an SVM classifier using the oligo kernel resulted in an accuracy of about 95%. This was quite high and unexpected. To analyze the classification performance further, a number of different data sets were used for testing. These consisted of coding exons which were not in the training set, non-coding exons, introns and intergenic regions. From each of these sets, the negative examples were either generated using the neutralization procedure (even if the sequences were not protein-coding) or a randomization process. Randomization generates a random counterpart of the original sequence while maintaining mononucleotide and dinucleotide composition [17]. We expected that for the coding exons as well as for the noncoding exons the classification rates would be good for both types of negative examples. This would have shown that the classifier had extracted exon-specific signals from the original training data from which some were general to both coding and non-coding exons. In contrast, for introns and intergenic regions, we expected that the accuracy would be poor due to the fact that these sequences do not contain exon-specific signals. Using the new data sets as test data for the trained classifier, we obtained accuracies as shown in Table 1. The classification performance for the sets with randomized negative examples were poor, approximately at chance level. In contrast, the performance for the neutralized negative examples was good for all additional data sets mentioned above. This suggested that the classifier learns neutralization-specific features from the data, but not exon-specific features. Thus, it seemed as if the neutralization procedure produced artifacts which could be exploited by the classifier. We can conclude this because the classification performance for introns and intergenic regions should have been poor as well since none of the underlying features contained in exons occur in these sequences. Therefore, at least for exons which were not in the original data set the classifier should have been able to distinguish between them and the randomized counterparts.

Results for motif-oriented data
Since the results with the neutralized negative examples were not very promising, we developed the motif-oriented data-extraction scheme as described before. For the exact mechanism of generating the data sets, please refer to the Methods section. The data for training and testing the SVM classifier consisted of 1835 positive examples and 3000 negative examples. We performed 50 trials with different random partitions of the data into training and test sets.

Adaptation of the parameters of the SVM kernels
In the training phase the parameters of the kernels had to be adapted to the given task of classifying ESEs. For the adaptation of the combined oligo kernel, we used the recently proposed gradient-based optimization of the kernel-target alignment [24] in a 5-fold cross-validation scenario for the parameters σ 1 ,..., σ κ . In our experiments, we tested several values of κ, and obtained the best results with κ = 8. Small oligomers of length one and two could be omitted. This resulted in an equal classification rate while the computational time was significantly reduced. Therefore, we adapted σ 3 ,...,σ 8 . The regularization parameter C of the SVM was adapted using one-dimensional grid-search. We considered grid points {0.1·i | 1 ≤ i ≤ 50}. For the locality improved kernel, a three-dimensional grid-search and 5-fold cross-validation was used for the three parameters C (regularization parameter of the SVM), l and d. We considered C ∈ {0.002·i | 1 ≤ i ≤ 10} and l, d For the Markov chain model, the order n and the value of the pseudocount c pseudo had to be adapted. We used gridsearch over the values c pseudo ∈ {0.2·i | 1 ≤ i ≤ 10} and n ∈ {i | 0 ≤ i ≤ 5} in a five-fold cross-validation scenario.
The final values for the smoothing parameters σ 3 ,...,σ 8 of the combined oligo kernel and the regularization parameter C of the SVM are given in Table 2. The smoothing parameters show that the positional uncertainty increases with oligomer length. An exception is the parameter σ 4 which is very small and thus, on the level of tetramers, the optimized kernels used Gaussians that were narrow peaks implies that a subsequence shifted by three nucleotides still has 70 percent of the contribution of an exact match in the kernel function (2). Table 3 shows the statistics of the final hyperparameters for the locality improved kernel and the Markov chain model. The order of the Markov chain was about two. One reason for the low order is the limited amount of training data which does not allow for estimation of too many model parameters.

Identification of relevant features for classification
To shed light on relevant features which were used by the SVM classifier, visualization techniques as described in [22] were employed [see Methods]. For this purpose, SVMs with oligo kernels using oligomers of length three, four, five and six were employed with the SVMs. These kernels resulted in an inferior classification rate while providing well-interpretable parameters.
In order to extract the most important oligomers for the kernel-based ESE prediction, the oligomer-specific weight functions of the discriminant were calculated. The ten most important K-mers, K = {3,4,5,6}, were identified and displayed in a bar graph in Figure 2. The height of each bar correlates to the average norm of the corresponding K-mer weight function and was scaled to yield an unit maximum. For the oligomers shown in Figure 2, one can identify a group of motifs which is most prominent. These are the motifs which occur in the purine-rich enhancers, as for example GAGGAG or GAAGAA. These motifs are represented by several of the important oligomers shown in Figure 2. showing that other features must play an important role for classification as well.

Classification performance
The classification performances of the different methods are shown in Table 4. The table gives the mean values as well as 25, 50, and 75 percent quantiles over the 50 partitions of the classification rate on the test set (accuracy), specificity, sensitivity, and Matthews correlation coefficient [25]. Specificity is defined by TN/(TN + FP), sensitivity by TP/(TP + FN), and Matthews correlation coefficient by where TP, TN, FP, and FN denote the true positive, true negative, false positive and false negative rates, respectively. For clarification, the notation "true negative" denotes the fraction of negative examples that are classified as negatives, whereas "false negative" indicates the fraction of positive examples that are incorrectly classified as negatives. Using SVMs with the combined oligo kernel, the best classification rates could be achieved. The accuracy of the SVM with optimized combined oligo kernel was significantly better than the accuracy of the SVM with  Due to this it would not have made sense to test with intronic or intergenic data sets. In contrast, for the neutralized data the testing with external data sets was necessary, because the idea behind the data is that a classifier can extract exon-specific features. This needs to be tested using external data such as, for example, introns or intergenic regions.

TP TN FP FN TP FP TP FN TN FP TN FN
In Figure 4, the receiver operating characteristics (ROCs) of the classifiers are shown. For the SVMs, the curves were obtained by simply varying the threshold parameter b [26]. For the Markov chain model, a threshold parameter b was introduced and adjusted, that is, a sequence was classified based on the sign of ln (s) -ln (s) + b. Each curve in Figure 4 corresponds to the median of the 50 trials (similar to the attainment surfaces described in [27]). The superior performance of the SVM with combined oligo kernel was also supported by the receiver operating characteristics in Figure 4, while the Markov chain model showed the worst performance. The SVM with 6-mer oligo kernel performed only slightly worse than the SVM with combined oligo kernel indicating that the hexamers are important for this classification problem.

Consistency of results
As the ESE pattern in the middle of the motif-oriented data is in the vicinity of the splice site, the intronic part Figure 2 Oligomer ranking. The ten most important oligomers for discrimination based on trimers, tetramers, pentamers and hexamers are shown. The heights of the bars correlate to the average norm of the corresponding K-mer weight function and was scaled to yield an unit maximum. . Training an SVM with these data resulted in classification rates of about 80% which demonstrates that the classifier uses more than exon/intron distinctions for its decision.

Oligomer ranking
In order to investigate the influence of the central motif relative to its surround on classification performance, another set of training examples was extracted in order to train an SVM. In these sets of positive and negative examples, any of the 4096 possible hexamers was accepted as a middle motif for a positive or negative example. Predictably, the classification rate dropped by 6% although this was not as strong a drop as might be expected.

General discussion
The best-case scenario for the proposed SVM approach would be the exclusive usage of biologically verified training examples. For the required number of training exam- ples, this is impractical and is likely to remain impractical in the near future. We have shown that, using unverified training data, a meaningful decision function can be learned. Furthermore, arguments for the correlation of this decision function to ESE activity were presented. Based on the experiments described in this paper, we see the value of our method in its fundamental suitability for this classification problem and its ability to incorporate expert knowledge into the training data generation process. This can be done by choosing appropriate biologically verified heuristics for selecting training data. As a consequence, not every positive example will correspond to a "true" ESE (the inverse of course holds for the negative examples). However, by virtue of the used heuristics a significant over-representation of ESEs in the positive training examples as well as a corresponding underrepresentation in the negative training examples is reasonable to assume. Therefore, we do not expect the SVM to perform perfectly but to have a classification rate signifi-cantly above chance. Just as other approaches [12], the SVM will produce incorrect predictions, although we are confident that new insights into the splicing process can be used in a straightforward way to improve the already favorable results still further.

Conclusion
We successfully trained and used SVMs with special sequence based kernels for the detection of exonic splicing enhancers. The main problem was the choice of training examples due to the small amount of annotated exonic splicing enhancers in the literature. As we did not obtain good results using our first approach, the neutralized data, Initial tests showed that these sequences were useful for training an SVM classifier, leading to good results. From the different tested kernels, the best results were obtained using the combined oligo kernel with 90.74% accuracy, a specificity of 96.04% and a sensitivity of 82.09%. From a machine-learning point of view, an SVM is a linear classifier in a feature space and the quality of the SVM is to nearly 100 percent based on the used kernel function realizing a scalar product or similarity measure in that feature space [28]. Thus, for obtaining such favorable results, an appropriate kernel in the form of the combined oligo kernel was a necessary prerequisite for successful classification. As can be seen from the results with locality-improved kernel, using another kernel leads to inferior results. To check the benefit from using SVMs we applied a Markov model to the data which resulted in a significantly lower classification rate (68.42% accuracy). The parameters of the oligo kernel were well interpretable and gave information that longer oligomers can shift in the sequence by several bases. Additionally, the oligo kernels can be visualized, presenting important oligomers for ESE classification. We showed that our SVM approach compares favorably to a wellknown state of the art method (ESEfinder).
In the future, we would like to create a web-based version of the program in order to make it usable for the research community. Additionally, it may be useful to integrate the enhancer prediction into a splice site prediction program, as it was already done for Arabidopsis thaliana in [29].

Methods
In this section, oligo kernels for the analysis of biological sequence data are described. Furthermore, the locality improved kernel which we considered for comparison is described and Markov chain models are introduced as an alternative classification method. Additionally, the methods for choosing the training and test data are presented.

Classification with SVMs
We consider L 1 -norm soft margin support vector machines (SVMs) for binary classification [14][15][16]. Let (x i , y i ), 1 ≤ i ≤ l, be consistent training examples, where y i ∈ {-1, 1} is the label associated with input pattern x i ∈ X. The main idea of SVMs is to map the input patterns to a feature space F and to separate the transformed data linearly in F. The transformation Φ : X → F is implicitly done by a kernel k : X × X → ‫,ޒ‬ which computes a scalar (inner) product in the feature space efficiently, that is,

Oligo Kernels
For oligo kernels [22,24,30], the feature space can be described using oligo functions. These code for occurrences of oligomers in sequences with an adjustable degree of positional uncertainty. In existing methods, they provide either position-dependent [31] or completely positionindependent representations [32]. For an alphabet and a sequence s, which contains K-mer ω ∈ gomers which occur in both sequences contribute equally to the sum, regardless of their distance and the oligo kernel becomes identical to the spectrum kernel [32].

Combined Oligo Kernel
Meinicke et al. already showed that it is beneficial to employ combinations of oligo kernels that consider oligomers of different lengths [22]. The κ-combined oligo kernel was introduced, where the subscript i indicates that the normalized oligo kernel is defined on the oligomers of length i. The level of position-dependency can be controlled for each oligomer length individually using κ parameters σ 1 ,..., σ κ .

Visualization of Oligo Kernels
The oligo kernel can easily be visualized when m is the number of oligomers. For noise reduction all values between 0.25 and -0.25 were set to zero and rows which were totally zero were excluded.
In order to reduce the complexity of the interpretation, the analysis can be restricted to the most important oligomers. Therefore, the component weight functions of w(t) = [w 1 (t), w 2 (t),...,w m (t)] T can be ranked according to their L 2 -norm The norm was approximated using the Euclidean norm of discretized oligo functions. Higher norms indicate a more important role in discrimination and the selection of corresponding weight functions helps to focus on important oligomers.

Locality improved kernel
For comparison, we consider the locality improved kernel [15,23] which counts matching nucleotides and considers local correlations within windows of length 2l + 1. For two sequences s i , s j of length L the locality improved kernel is given by  Effect of the smoothing parameter Figure 5 Effect of the smoothing parameter. Example of two sequences s i and s j and the corresponding oligo functions for ω = TAG for small (left) and large (right) smoothing parameter σ 3 . On the left-hand side, it can be seen that the larger σ 3 results in Gaussians that are still overlapping although the motif TAG is shifted in the two sequences. On the right-hand side, the shifted TAG motifs do not increase the kernel function due to the strongly peaked Gaussian. Here, match t (s i , s j ) = 1, if s i and s j have the same nucleotide at position t and zero otherwise. The weights v t give us the possibility to emphasize regions of the window which are of special importance. In our experiments they are fixed to v t = 0.5 -0.4|l -t|/l. The hyperparameter d determines the order to which local correlations are considered. The locality improved kernel can be considered as a special form of a polynomial kernel, where only a weighted subset of monomers is considered [15].

Markov chain model
As a baseline classifier, we look at simple Markov models of the positive and negative sequences, see [33] for an introduction. We apply inhomogeneous Markov chains, also referred to as weight array matrix models. Given a Markov chain M of order n over an alphabet for strings of a fixed length l (cf. [ [33], Section 4.4.2] and [34]), the likelihood of a sequence is given by

Motif-oriented classification
The classification performance considering only the frequencies of the motifs in the middle of the sequences was calculated using the same data partitionings in training and test data as in the classification using SVM or Markov model. For each partition of the data the frequencies of the different motifs in the middle of the training sequences were counted. Now, for each test sequence, we extracted the middle-motif and decided whether the test sequence was positive or negative with the previously determined motif numbers in the training set. Therefore, if a certain motif is overrepresented in the positive training examples the test sequence is classified as being positive, otherwise it is classified negative.

Data sets Neutralization
Neutralization is a strategy to generate transformed sequences from exons which still code for the same amino acid sequence and maintain the overall composition of the original exons. Three criteria have to be met while the exons are transformed. Firstly, the neutralized sequence codes for the same protein. Secondly, a codon should not be used more frequently to represent a particular amino acid than in the original set. Thirdly, the frequencies of the dinucleotide occurrence should be retained. For the detailed algorithm of the neutralization method, we refer to the original literature [17].

Motif-oriented data-extraction method
The basic problem with ESE classification is the small amount of verified data from the literature or databases which can be used for training and testing machine learning approaches. Because the motifs of the ESEs are known, the positive examples can be extracted from the exons but not every motif found in this way is a real ESE. This leads to unreliable positive examples. We developed a new data-extraction scheme (see Figure 6) where the sequence located around a potential ESE is extracted. A surrounding of 200 bases was considered as sufficient.
To deal with the drawback of unreliable positive examples, each extracted sequence has to meet several criteria which increase the possibility of a motif being an ESE. First of all, only potential ESE motifs in the vicinity of the splice sites are used because it is stated in the literature that ESE sequences are not active far away from the splice sites [19,20]. Therefore, only motifs with distances of less than 100 nucleotides from the splice sites are considered as potential ESE sequences. Furthermore, as claimed in [3,10], ESE motifs can compensate for the presence of "weak" (non-consensus) 3' or 5' splice sites of exons. These exons are under a much higher selective pressure to retain ESE motifs and therefore they often contain a higher amount of exonic splicing enhancers. To include this fact into our training data, we generated position-specific weight matrices (PWMs) for both the 3' splice site and the 5' splice site. We extracted all annotated splice sites from the Vega database [18]. For the 3' splice site, we took 20 bases of the intron and 3 bases of the exon to take the pyrimidine-rich sequence into account. For the 5' splice site, 3 bases of the exon and 6 bases of the intron were considered as splice site. These differences result from the known design of the splice sites, including the pyrimidine-rich sequence into the 3' splice site (see Figure  1). Creating the PWMs, we obtained a 4 × 23-matrix for the 3'splice site and a 4 × 9-matrix for the 5' splice site, which are shown in Figure 7.
Using these PWMs, a score was assigned to every splice site. The score assigned by a PWM to a substring is defined as , where p ij is the probability of observing symbol i at position j of the motif and bi is the probability b i of observing that symbol in the background model. For the background model, we considered all bases as equally represented. Those splice sites with a score among the lower 25% of scores were classified as a weak splice site and those among the upper 25% were classified as strong. Then, only motifs in the vicinity of "weak" exons were considered as being reliable training examples.
Third, RNA binding proteins recognize RNA in a sequence-specific manner where the secondary structure of the RNA plays a role [21]. Binding sites as ESE sequences are often located in single stranded regions. A motif in a double-stranded region has been shown experimentally to have a strong negative correlation with the binding affinity [35] or even abolishes protein-binding [36,37]. Therefore, we calculated energy parameters to characterize the single-strandedness of a substring in an Schematic presentation of the data-extraction method Figure 6 Schematic presentation of the data-extraction method. Sequences are extracted locally around potential ESE motifs and are then declined or accepted as positive examples depending on whether they fulfill certain criteria. First of all, a potential positive example has to be close to a splice site. Secondly, the exon from which the positive example is extracted has to be a weak exon and thirdly, the region in which the positive example is located has to be single-stranded. Each training sequence has a length of 200 bases. Position weight matrix for the 3' and the 5' splice site RNA sequence. For characterization of single-stranded regions, we used a parameter EF a,b described in [21] giving the expected fraction of bases in the substring from position a to position b that do not form base pairs. EF a,b is calculated as with L being the length of the RNA sequence and p ij giving the possibility that base i and j are paired. This parameter can be calculated with the help of RNAfold [38]. Using Publish with Bio Med Central and every scientist can read your work free of charge