Skip to main content

Accurate splice site prediction using support vector machines



For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.


In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder.


Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at


With the generation of whole genome sequences, important insight into gene functions and genetic variation has been gained over the last decades. As novel sequencing technologies are rapidly evolving, the way will be paved for cost efficient, high-throughput whole genome sequencing which is going to provide the community with massive amounts of sequences. It is self-evident that the handling of this wealth of data will require efficient and accurate computational methods for sequence analysis. Among the various tools in computational genetic research, gene prediction remains one of the most prominent tasks, as recent competitions have further emphasised (e.g. [1, 2]). Accurate gene prediction is of prime importance for the creation and improvement of annotations of recently sequenced genomes [3, 4]. In the light of new data related to natural variation (e.g. [57]), the importance of accurate computational gene finding gains increasing importance since it helps to understand the effects of polymorphisms on the gene products.

Ab initio gene prediction from sequence is a highly sophisticated procedure as it mimics – in its result – the labour of several complex cellular machineries at a time: identification of the beginning and the end of a gene, as is accomplished by RNA polymerases; splicing of the nascent RNA, in the cell performed by the spliceosome; and eventually the detection of an open reading frame, as does the ribosome. The success of a gene prediction method therefore relies on the accuracy of each of these components. In this paper we will focus on the improvement of signal sensors for the detection of splice sites, as this sub-problem is a core element of any gene finder. A comprehensive understanding of splice sites is not only a prerequisite for splice form prediction but can also be of great value in localizing genes [812].

In eukaryotic genes, splice sites mark the boundaries between exons and introns. The latter are excised from premature mRNAs in a post-processing step after transcription. Both the donor sites at the exon-intron junctions, and the acceptor sites at the intron-exon boundaries, have quite strong consensus sequences which can, however, vary significantly from one organism to another. The vast majority of all splice sites are so called canonical splice sites which are characterised by the presence of the dimers GT and AG for donor and acceptor sites, respectively. The occurrence of the dimer is not sufficient for the splice site. Indeed, it occurs very frequently at non splice site positions. For example in human DNA, which is ≈6·109 nucleotides in size, GT can be found about 400 million times (summed over both strands). For some crude estimate of say 2·104 genes with 20 exons each, only 0.1% of the consensus sites are true splice sites. We therefore face two extremely unbalanced classification tasks, namely the discrimination between true donor sites and decoy positions with the consensus dimer GT or GC (the only non-canonical splice site that we will consider) and the discrimination between true acceptor sites and decoy positions with the consensus dimer AG.

Relation to previous work

Although present-day splice site detectors (e.g. based on Support Vector Machines, neural networks, hidden Markov models) are reported to perform at a fairly good level [9, 1315], several of the reported performance numbers should be interpreted with caution, for a number of reasons. First of all, these results are based on small and potentially biased data sets. Now that many genomes have been fully sequenced, these results will need to be re-evaluated. Second, issues in generating negative examples (decoys) were, if recognized, often not sufficiently documented. The choice of data sets, in particular the decoys, can make a tremendous difference in the measured performance. Third, often only the single site prediction of acceptor and donor sites is considered, whereas the higher goal is to use the splice site predictor within a gene finder. It is uncertain how good the predictors perform in this setting. Keeping these in mind, we provide unbiased genome-wide splice site prediction which enables further evaluation in gene finders.

In this paper, we will apply Support Vector Machines (SVMs) to the recognition of splice sites. SVMs are known to be excellent algorithms for solving classification tasks [1619], and have also been successfully applied to several bioinformatics problems [3, 2023] including splice site detection, cf. e.g. [2432]. Our work builds upon our previous work: In [24, 25] we demonstrated that SVMs using kernels from probabilistic hidden Markov models (cf. [20, 23]) outperform hidden Markov models alone. As this approach did not scale to many training examples, we performed a comparison of different faster methods for splice site recognition [28], where we considered Markov models and SVMs with different kernels: the so-called locality improved kernel, originally proposed for recognition of translation initiation sites [21]; the SVM-pairwise kernel, using alignment scores [33]; the TOP kernel, making use of a probabilistic model (cf. [20, 23]); the standard polynomial kernel[16]; and the so-called weighted degree kernel[28, 34]. A predictor based on the latter kernel has been successfully used in combination with other information for predicting the splice form of a gene, while outperforming other HMM based approaches [3]. This indicates that the improved accuracy of splice site recognition indeed leads to a higher accuracy in ab initio transcript prediction.

Other groups also reported successful SVM based splice site detectors. In [27] it was proposed to use linear SVMs on binary features computed from di-nucleotides, an approach which also outperformed previous Markov models. Even more accurate, the authors of SpliceMachine [29] not only used positional information (one- to trimers) around the splice site, but also explicitly modelled compositional context using tri- to hexamers. To the best of our knowledge, this approach is the current state-of-the art, outperforming previous SVM based approaches as well as GeneSplicer [12] and GeneSplicerESE [35]. In [31] linear SVMs were used on positional features that were extracted from empirical estimates of unconditional positional probabilities. Note that this approach is similar to our TOP kernel method on zeroth-order Markov chains [28]. Recently, [32] reported improved accuracies for splice site prediction also by using SVMs. The method employed in [32] is very similar to a kernel initially proposed in [21] (Salzberg kernel). The idea of this kernel is to use empirical estimates of conditional positional probabilities of the nucleotides around splice sites (estimated by Markov models of first order) which are then used as input for classification by an SVM.

Many other methods have been proposed for splice site recognition. For instance multilayer neural networks with Markovian probabilities as inputs [15]. They train three Markov models on three segments of the input sequence, the upstream, signal and downstream segments. Although they outperform [32] on small datasets, the authors themselves write that the training of the neural networks is especially slow when the number of true and decoy examples are imbalanced and that they have to downsample the number of negatives for training even on small and short sequence sets. Therefore, their method does not seem suitable for large-scale genome-wide computations. Finally, [36] proposed a method based on Bayesian Networks which models statistical dependencies between nucleotide positions.

In this work we will compare a few of our previously considered methods against these approaches and show that the engineering of the kernel, the careful choice of features and a sound model selection procedure are important for obtaining accurate predictions of splice sites.

Our previous comparison in [28] was performed on a relatively small data set derived from the C. elegans genome. Also, the data sets considered in [32] are relatively small (around 300,000 examples, whereas more than 50,000,000 examples are nowadays readily available). In this study we therefore reevaluate the previous results on much larger data sets derived from the genomes of five model organisms, namely Caenorhabditis elegans ("worm"), Arabidopsis thaliana ("cress"), Drosophila melanogaster ("fly"), Danio rerio ("fish"), and Homo sapiens ("human"). Building on our recent work on large scale kernel learning [3740], we now are able to train and evaluate Support Vector Machines on such large data sets as is necessary for analyzing the whole human genome. In particular, we are able to show that increasing the number of training examples indeed helps to obtain a significantly improved performance, and thus will help to improve existing annotation (see, e.g. [3]). We train and evaluate SVMs on newly generated data sets using nested cross-validation and provide genome-wide splice site predictions for any occurring GT, GC and AG dimers, which will be furnished with posterior probability estimates for being true splice sites. We will show that the methods in some cases exhibit dramatic performance differences for the different data sets.

Organization of the paper

The paper is organized as follows: In the next section we present the outcomes of (a) the comparison with the methods proposed in [12, 29, 32, 36], (b) an assessment which window length should be used for classification and, finally, (c) a comparison of the large scale methods on the genome-wide data sets for the five considered genomes. After discussing our results, we also address the question about the interpretability of SVMs. Finally, in the Methods section, we describe the generation of our data sets, the details of cross-validation and model selection, different kernels, and visualizations method that we used in this study.

Results and discussions

In this section we discuss experimental results we obtained with our methods for acceptor and donor splice site predictions for the five considered organisms.

Throughout the paper we measure our prediction accuracy in terms of area under the Receiver Operator Characteristic Curve (auROC) [41, 42] and area under the Precision Recall Curve (auPRC) (e.g., [43]). (We omit to show the classification accuracy, as often more than 99% of the examples are negatively labeled. Thus, the simplest classifier, predicting -1 for all examples, already achieves 99% rendering the accuracy measure meaningless.) Note that for unbalanced data sets the area under the auROC can also be rather meaningless, since this measure is independent of class ratios and large auROC values may not necessarily indicate a good detection performance. The auPRC is a better measure for performance, if the class distribution is very unbalanced. However, it does depend on the class priors on the test set and hence is affected by sub-sampling the decoys, as happened with the data sets used in previous studies (NN269 in [32] contains about 4 times more decoy than true sites, DGSplicer in [32, 36] about 140 times more; in contrast, in the genome scenario the ratio is one to 300 – 1000). In order to compare the results among the different data sets with different class sizes, we therefore also provide the auROC score which is not affected by sub-sampling.

Pilot studies on small datasets

Performance on the NN269 and DGSplicer data sets

For the comparison of our SVM classifiers to the approaches proposed in [32, 36], we first measure the performance of our methods on the four tasks used in [32] (see Methods for details). The approach in [32] is outperformed by a neural network approach proposed in [15]. However, we do not compare our methods to the latter method, since it already reaches its computational limits for the small datasets with only a few thousand short sequences (cf. [15], page 138) and hence is not suitable for large-scale genome-wide computations. On each task we trained SVMs with the weighted degree kernel (WD) [28], and the weighted degree kernel with shifts (WDS) [34]. On the NN269 Acceptor and Donor sets we additionally trained an SVM using the locality improved kernel (LIK) [21]; as it gives the weakest prediction performance and is computationally most expensive we exclude this model from the following investigations. As a benchmark method we also train higher order Markov Chains (MCs) (e.g. [44]) of "linear" structure and predict with the posterior log-odds ratio (cf. Methods section). Note that Position Specific Scoring Matrices (PSSM) are recovered as the special case of zeroth-order MCs. A summary of our results showing the auROC and auPRC scores is displayed in Table 1.

Table 1 Performance evaluation (auROC and auPRC scores) of six different methods on the NN269 and DGSplicer Acceptor and Donor test sets. MC denotes prediction with a Markov Chain, EBN the method proposed in [36], and MC-SVM the SVM based method described in [32] (similar to [21]).

We first note that the simple MCs perform already fairly well in comparison to the SVM methods. Surprisingly, we find that the MC-SVM proposed in [32] performs worse than the MCs. (We have reevaluated the results in [32] with the code provided by the authors and found that the stated false positive rate of their method is wrong by a factor of 10. We have contacted the authors for clarification and they published an erratum [45]. The results for MC-SVMs given in Table 1 are based on the corrected performance measurement.) As anticipated, for the two acceptor recognition tasks, EBN and MCs are outperformed by all kernel models which are performing all at a similar level. However, we were intrigued to observe that for the DGSplicer Donor recognition task, the MC based predictions outperform the kernel methods. For NN269 Donor recognition their performance is similar to the performance of the kernel methods.

There are at least two possible explanations for the strong performance of the MCs. First, the DGSplicer data set has been derived from the genome annotation, which in turn might have been obtained using a MC based gene finder. Hence, the test set may contain false predictions easier reproduced by a MC. Second, the window size for the DGSplicer Donor recognition task is very short and has been tuned in [36] to maximize the performance of their method (EBN) and might be suboptimal for SVMs. We investigated these hypotheses with two experiments:

  • In the first experiment, we shortened the length of the sequences in DGSplicer Acceptor from 36 to 18 (with consensus AG at 8,9). We retrained the MC and WD models doing a full model selection on the shortened training data. We observe that on the shortened data set the prediction performance drops drastically for both MC and WD (by 60% relative) and that, indeed, the MC outperforms the WD method (to 12.9% and 9% auPRC, respectively).

  • In a second experiment, we started with a subset of our new data sets generated from the genomes of worm and human which only uses EST or cDNA confirmed splice sites (see methods section). In the training data we used the same number of true and decoy donor sites as in the DGSplicer data set. For the test data we used the original class ratios (in order to allow a direct comparison to following experiments; cf. Table 2). Training and testing sequences were shortened from 218 nt in steps of 10 nt down to 18 nt (same as in the DGSplicer donor data set). We then trained and tested MCs and WD-SVMs for the sets of sequences of different length. Figure 1 shows the resulting values for the auPRC on the test data for different sequence lengths. For the short sequences, the prediction accuracies of MCs and SVMs are close for both organisms. For human donor sequences of length 18 MCs indeed outperform SVMs. With increasing sequence length, however, the auPRC of SVMs rapidly improves while it degrades for MCs. Recall that the short sequence length in the DGSplicer data was tuned through model selection for EBN, and thus the performance of the EBN method will degrade for longer sequences [36], so that we can safely infer that our methods would also outperform EBN for longer training sequences.

Table 2 Characteristics of the genome-wide data sets containing true and decoy acceptor and donor splice sites for our five model organisms.
Figure 1
figure 1

Comparison of classification performance of the weighted degree kernel based SVM classifier (WD) with the Markov chain based classifier (MC) on a subset of our C. elegans Donor and Human Donor data sets for sequences of varying length. For each length, we performed a full model selection on the training data in order to choose the best model. The performance on the test sets, measured through area under the Precision Recall Curve (auPRC), is displayed in percent.

The results do not support our first hypothesis that the test data sets are enriched with MC predictions. However, the results confirm our second hypothesis that the poor performance of the kernel methods on the NN269 and DGSplicer donor tasks is due to the shortness of sequences. We also conclude that discriminative information between true and decoy donor sequences lies not only in the close vicinity of the splice site but also further away (see also the illustrations using k-mer scoring matrices below). Therefore, the careful choice of features is crucial for building accurate splice site detectors and if an appropriate window size is chosen, the WD kernel based SVM classifiers easily outperform previously proposed methods.

Comparison with SpliceMachine for cress and human

In this section we compare SpliceMachine [29] with the WD kernel based SVMs. SpliceMachine [46] is the current state-of-the art splice site detector. It is based on a linear SVM and outperforms the freely available GeneSplicer [47, 12] by a large margin [29]. We therefore perform an extended comparison of our methods to SpliceMachine on subsets of the genome-wide datasets (cf. the results and methods sections). One fifth and one twenty-fifth of the data set was used each for training and for independent testing for cress and human, respectively. We downloaded the SpliceMachine feature extractor [48] to generate train and test data sets. Similar to the WD kernel, SpliceMachine makes use of positional information around the splice site. As it explicitly extracts these features it is however limited to a low order context (small d). In addition, SpliceMachine explicitly models coding-frame specific compositional context using tri- to hexamers. Note that this compositional information is also available to a gene finding system for which we are targeting our splicing detector. Therefore, in order to avoid redundancy, compositional information should ideally not be used to detect the splicing signal. Nevertheless, for comparative evaluation of the potential of our method, we augment our WD kernel based methods with 6 spectrum kernels [49] (order 3, 4, 5, each up- and downstream of splice site) and use the same large window sizes as were found out to be optimal in [29]. For cress acceptor [-85, +86], donor [-87, +124], and for human acceptor [-105, +146], donor [-107, +104]. For the WD kernel based SVMs, we fixed the model parameters C = 1 and d = 22, and for WDS we additionaly fixed the shift parameter σ = 0.5. For the SpliceMachine we performed an extensive model selection and found C = 10-3 to be consistently optimal. We trained with C {100, 10-1, 10-2, 10-3, 5·10-4, 10-4, 10-5, 10-6, 10-7, 10-8}. Using these parameter settings we trained SVMs a) on the SpliceMachine features (SM), b) using the WD kernel (WD) c) using the WD kernel augmented by the 6 spectrum kernels (WDSP) d) using the WDS kernel (WDS) and e) using the WDS and spectrum kernels (WDSSP). Table 3 shows the area under the ROC and precision recall curve obtained in this comparison. Note that SpliceMachine always outperforms the WD kernel, but is in most cases inferior to the WDS kernel. Furthermore, complementing the WD kernels with spectrum kernels (methods WDSP and WDSSP) always improves precision beyond that of SpliceMachine. As this work is targeted at producing a splicing signal detector to be used in a gene finder, we will omit compositional information in the following genome-wide evaluations. To be fair, one can note that a WDS kernel using a very large shift is able to capture compositional information, and the same holds to some extend for the WD kernel when it has seen many training examples. It is therefore impossible to draw strong conclusions on whether window size and (ab)use of compositional features will prove beneficial when the splice site predictor is used as a module in a gene finder, which we hope is enabled by our work providing genome wide predictions.

Table 3 Performance evaluation (auROC and auPRC scores) of four different methods on a subset of the genome-wide cress and human datasets.

Performance for varying data size

Figure 2 shows the prediction performance in terms of the auROC and auPRC of SVMs using the MC and the WD kernel on the human acceptor and donor splice data that we generated for this work (see the methods section) for varying training set sizes. For training we use up to 80% of all examples and the remaining examples for testing. MCs and SVMs were trained on sets of size varying between 1000 and 8.5 million examples. Here we sub-sampled the negative examples by a factor of five. We observe that the performance steadily increases when using more data for training. For SVMs, over a wide range, the auPRC increases by about 5% (absolute) when the amount of data is multiplied by a factor of 2.7. In the last step, when increasing from 3.3 million to 8.5 million examples, the gain is slightly smaller (3.2 – 3.5%), indicating the start of a plateau. Similarly MCs improve with growing training set sizes. As MCs are computationally a lot less demanding, we performed a full model selection over the model order and pseudo counts for each training set size. For the WD-SVM the parameters were fixed to the ones found optimal in the results section. Nevertheless MCs did constantly perform inferior to WD-SVMs. We may conclude that one should train using all available data to obtain the best results. If this is infeasible, then we suggest to only sub-sample the negatives examples in the training set, until training becomes computationally tractable. The class distribution in the test set, however, should never be changed unless explicitly taken into account in evaluation.

Figure 2
figure 2

Comparison of the classification performance of the weighted degree kernel based SVM classifier (SVM) with the Markov chain based classifier (MC) for different training set sizes. The area under the Precision Recall Curve (auPRC; left) and the area under the Receiver Operator Curve (auROC; middle) are displayed in percent. On the right the CPU time in seconds needed to train the models is shown.

Results on genome-wide data sets

Based on our preliminary studies, we now proceeded to design and train the genome-wide predictors. We first generated new genome-wide data sets for our five model organisms: worm, fly, cress, fish, and human. As our large-scale learning methods allow us to use millions of training examples, we included all available EST information from the commonly used databases. Since the reliability of the true and decoy splice sequences is crucial for a successful training and tuning, these data sets were produced with particular care; the details can be found in the Methods section. We arrived at training data sets of considerable size containing sequences of sufficient length (see Table 2). For fish and human the training datasets were sub-sampled to include only 1/3 and 1/5 of the negative examples, leading to a maximal training set size of 9 million sequences for human donor sites.

For a subsequent use in a gene finder system we aimed at producing unbiased predictions for all candidate splice sites, i.e. for all occurrences of the GT/GC and AG consensus dimer. For a proper model selection and in order to obtain unbiased predictions on the whole genome we employed nested five-fold cross-validation. We additionally estimated posterior probabilities in order to obtain interpretable and comparable scores for the outputs of the different SVM classifiers (see Methods for details). The results summarized in Table 4 are averaged values with standard deviation over the five different test partitions.

Table 4 Performance evaluation on the genome-wide data sets for worm, fly cress, fish, and human.

Confirming our evaluations in the pilot studies, kernel methods outperform the MC methods in all eight classification tasks. Figure 3 displays the precision recall curves for all five organisms comparatively, Table 4 the corresponding auPRC scores. For worm, fly and cress the improvement in the performance accuracy for the SVM in comparison to MC lies in a similar range of 4–10% (absolute), both for donor and for acceptor tasks. However, for fish and especially for human the performance gain is considerable higher. For human, MCs only achieve 16% and 25% auPRC scores, whereas WDS reaches 54% and 57% for acceptor and donor recognition, respectively. The severe decrease in performance from worm to human for all classification methods in the auPRC score can partially be explained by the different fractions of positive examples observed in the test set. However, a weaker decline can also be observed in the auROC scores (also Table 4) which are independent of the class skew (e.g. for acceptor sites from 99.6% on worm to 96.0% on human for MC, and from 99.8% to 97.9% for WDS). The classification task on the human genome seems to be a considerably more difficult problem than the same one on the worm genome. We may speculate that this can be partially explained by a higher incidence of alternative splicing in the human genome. These sites usually exhibit weaker consensus sequences and are therefore more difficult to detect. Additionally, they often lead to mislabeled examples in the training and testing sets. Finally, it might also be due to the used protocol for aligning the sequences which may generate more false splice sites in human than in other organisms. This hypothesis is supported by the fact that the performance significantly increases, if one only considers cDNA confirmed genes (data not shown).

Figure 3
figure 3

Precision Recall Curve for the three methods MC, WD, WDS estimated on the genome-wide data sets for worm, fly, cress, fish, and human in a nested cross-validation scheme. In contrast to the ROC the random guess in this plot corresponds to a horizontal line, that depends on the fraction of positive examples in the test set (e.g. 2% and 3% in the case of the worm acceptor and donor data sets, respectively).

Analysis of the learning result

One of the problems with kernel methods compared to probabilistic methods, such as Position Specific Scoring Matrices [50] or Interpolated Markov Models [11], is that the resulting decision function is hard to interpret and, hence, difficult to use in order to extract relevant biological knowledge from it (see also [5153]). Here, we propose to use k-mer scoring matrices [3, 54] to visualize the contribution of all (k-mer, sequence position) pairs to the final decision function of the SVM with WD-Kernel (cf. Methods section). We obtain a graphical representation from which it is possible to judge where in the sequence which substring lengths are of importance.

We plotted the k-mer scoring matrices corresponding to our trained models for the organisms comparatively in Figure 4, which shows the relative importance of substrings of a certain length for each position in the classified sequences. We can make a few interesting observations: For worm, fly, and potentially also cress there is a rather strong signal about 40–60 nt downstream of the donor and 40–60 nt upstream of the acceptor splice sites. These two signals are related to each other, since introns in these organisms are often only 50 nt long. Additionally, we find the region 20–30 nt upstream of the acceptor splice site of importance, which is very likely related to the branch point. In human it is typically located 20–50 nt upstream and exhibits the consensus CU(A/G)A(C/U), which matches the lengths of important k-mers in that region for human [55]. In worms, the branch point consensus seems shorter (3–4 nt) – confirming previous reports that the branch point is much weaker in worms. In fly and cress the branch point seems rather long (5–6 nt) and important for recognition of the splice site. Finally, note that the exon sequence carries a lot of discriminative information. The k-mers of most importance are of length three, relating to the coding potential of exons. Additionally, the periodicity observed for instance in cress is due to the reading frame. On the supplementary website we also provide a list of most discriminative k-mers for the two splice site recognition tasks.

Figure 4
figure 4

k-mer scoring matrices comparatively for worm, fly, cress, fish, and human. They depict the maximal position-wise contribution of all k-mers up to order 8 to the decision of the trained kernel classifiers, transformed into percentile values (cf. the section on interpreting the SVM classifier). Red values are highest contributions, blue lowest. Position 1 denotes the splice site and the start of the consensus dimer.


In this work we have evaluated several approaches for the recognition of splice sites in worm, fly, cress, fish, and human. In a first step we compared MCs, a Bayesian method (EBN) and SVM based methods using several kernels on existing data sets generated from the human genome. We considered the kernel used in [32] based on MCs, the locality improved kernel [21] and two variants of the weighted degree kernel [28, 34]. We found that these existing data sets have limitations in that the sequences used for training and evaluation turn out to be too short for optimal discrimination performance. For SVMs we showed that they are able to exploit – albeit presumably weak – features as far as 80 nt away from the splice sites. In a comparison to SpliceMachine we could show that our approach perform favorably when complemented with compositional information. Using the protocol proposed in [3], we therefore generated new data sets for the five organisms. These data sets contain sufficiently long sequences and for human as many as 9 million training examples. Based on our previous work on large scale kernel learning [40], we were able to train SVM classifiers also on these rather big data sets. Moreover, we illustrated that the large amount of training data is indeed beneficial for significantly improving the SVM prediction performance, while MCs do not significantly improve when using much more training examples. We therefore encourage using as many examples for training as feasible to obtain the best generalization results.

For worm, fly and cress we were able to improve the performance by 4%–10% (absolute) compared to MCs. The biggest difference between the methods is observed for the most difficult task: acceptor and donor recognition on human DNA. The MCs reach only 16% and 25% auPRC, while SVMs achieve 54% and 57%, respectively. The drastic differences between organisms in the prediction performance scores can be understood as a consequence of the smaller fraction of positive examples and a higher incidence of alternative splicing in the human genome compared to the other genomes. For further comparative studies we provide and discuss k-mer scoring matrices elucidating which features are important for discrimination.

In order to facilitate the use of our classifiers for other studies, we provide whole genome predictions for the five organisms. Additionally, we offer an open-source stand-alone prediction tool allowing, for instance, the integration in other gene finder systems. The predictions, data sets and the stand-alone prediction tool are available for download on the supplementary website


Data sets

NN269 and DGSplicer data sets

For the pilot study we use the NN269 and the DGSplicer data sets originating from [9] and [32], respectively. The data originates from [56] and the training and test splits can be downloaded from [46]. The data sets only include sequences with the canonical splice site dimers AG and GT. We use the same split for training and test sets as used in [32]. A description of the properties of the data set is given in Table 5.

Table 5 Characteristics of the NN269 and DGSplicer data sets containing true and decoy acceptor and donor splice sites derived from the human genome.

Worm, fly, cress, fish, and human

We collected all known ESTs from dbEST [57] (as of February 28, 2007; 346,064 sequences for worm, 514,613 sequences for fly, 1,276,130 sequences for cress, 1,168,572 sequences for fish, and 7,915,689 sequences for human). We additionally used EST and cDNA sequences available from wormbase [58] for worm, (file confirmed_genes.WS170) [59] for fly, (files na_EST.dros and na_dbEST.same.dmel) [60] for cress, (files cDNA_flanking_050524.txt and cDNA_full_reading_050524.txt) [61] for fish, (file Danio_rerio.ZFISH6.43.cdna.known.?? and [62] for fish and human (file dr_mgc_mrna.fasta for fish and hs_mgc_mrna.fasta for human). Using blat[63] we aligned ESTs and cDNA sequences against the genomic DNA (releases WS170, dm5, ath1, zv6, and hg18, respectively). If the sequence could not be unambiguously matched, we only considered the best hit. The alignment was used to confirm exons and introns. We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the consensus GT/AG or GC/AG at the 5' and 3' ends, we tried to achieve this by shifting the boundaries up to two base pairs (bp). If this still did not lead to the consensus, then we split the sequence into two parts and considered each subsequence separately. Then, we merged alignments if they did not disagree and if they shared at least one complete exon or intron.

In a next step, we clustered the alignments: In the beginning, each of the above EST and cDNA alignments were in a separate cluster. We iteratively joined clusters, if any two sequences from distinct clusters match to the same genomic location (this includes many forms of alternative splicing).

From the clustered alignments we obtained a compact splicing graph representation [64], which can be easily used to generate a list of positions of true acceptor and donor splice sites. Within the boundaries of the alignments (we cut out 10 nt at both ends of the alignments to exclude potentially undetected splice sites), we identified all positions exhibiting the AG, GT or GC dimer and which were not in the list of confirmed splice sites. The lists of true and decoy splice site positions were used to extract the disjoint training, validation and test sets consisting of sequences in a window around these positions. Additionally, we divided the whole genome into regions, which are disjoint contiguous sequences containing at least two complete genes; if an adjacent gene is less than 250 base pairs away, we merge the adjacent genes into the region. Genes in the same region are also assigned to the same cross-validation split. The splitting was implemented by defining a linkage graph over the regions and by using single linkage clustering. The splits were defined by randomly assigning clusters of regions to the split.

Model selection and evaluation

To be able to apply SVMs, we have to find the optimal soft margin parameter C[18] and the kernel parameters. These are: For the LI-kernel, the degree d and window size l; for the WD kernel, the degree d; and for the WDS kernel, the degree d and the shift parameter σ (see the section on SVMs and kernels for details). For MCs we have to determine the order d of the Markov chain and the pseudocounts for the models of positive and the negative examples (see the posterior log-odds section). In order to tune these parameters we perform the cross-validation procedures described below.

NN269 and DGSplicer

The training and model selection of our methods for each of the four tasks was done separately by partial 10-fold cross-validation on the training data. For this, the training sets for each task are divided into 10 equally sized data splits, each containing the same number of splice sequences and the same proportion of true versus decoy sequences. For each parameter combination, we use only 3 out of the 10 folds, that is we train 3 times by using 9 out of the 10 training data splits and evaluate on the remaining training data split. Since the data is highly unbalanced, we choose the model with the highest average auPRC score on the three evaluation sets. This best model is then trained on the complete training data set. The final evaluation is done on the corresponding independent test sets (same as in [32]). The supplementary website includes tables with all parameter combinations used in model selection for each task and the chosen parameters.

Worm, fly, cress, fish, and human

The training and model selection of our methods for the five organisms on the acceptor and donor recognition tasks was done separately by 5-fold cross-validation. The optimal parameter was chosen by selecting the parameter combination that maximized the auPRC score. This model selection method was nested within 5-fold cross-validation for final evaluation of the performance. The reported auROC and auPRC are averaged scores over the five cross-validation splits. The supplementary website includes tables with all considered parameter combinations and the chosen parameters for each task. All splits were based on the basis of the clusters derived from EST and cDNA alignments, such that different splits come from random draws of the genome.

Performance measures

The sensitivity is defined as the fraction of correctly classified positive examples among the total number of positive examples, i.e. it equals the true positive rate TPR = TP/(TP + FN). Analogously, the fraction FPR = FP/(TN + FP) of negative examples wrongly classified as positive is called the false positive rate. Plotting TPR against FPR results in the Receiver Operator Characteristic Curve (ROC) [41, 42]. Plotting the positive predictive value PPV = TP/(FP + TP), i.e. the fraction of correct positive predictions among all positively predicted examples, against the TPR, one obtains the Precision Recall Curve (PRC) (see e.g., [43]). The area under the ROC and PRC are denoted by auROC and auPRC respectively.

Estimation of posterior probabilities

In order to provide an interpretable and comparable confidence score of the SVM predictions, we estimated the conditional likelihood P(y = 1|f(x)) of the true label y being positive for a given SVM output value f(x). To do this, we applied a piecewise linear function which was determined on the validation set (the same used for the classifier model selection). We used the N = 50 quantiles taken on the SVM output values as supporting points φ i , i = 1,...,N. For convenience, denote φ0 = -∞. For each point φ i the corresponding π ^ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaBaaaleaacqWGPbqAaeqaaaaa@3007@ -value, which represents the empirical probability of being a true positive, was computed as π ^ i = n i T P n i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaBaaaleaacqWGPbqAaeqaaOGaeyypa0ZaaSaaaeaacqWGUbGBdaqhaaWcbaGaemyAaKgabaGaemivaqLaemiuaafaaaGcbaGaemOBa42aaSbaaSqaaiabdMgaPbqabaaaaaaa@3964@ , where n i (i = 1,...,N) is the number of examples with output values φi-1f(x) <φ i and n i T P MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaemyAaKgabaGaemivaqLaemiuaafaaaaa@31F3@ is the number of true splice sites in the same output range. Additionally, we determined the empirical cumulative probability as follows π ^ i c = ( j = i N n j T P ) / ( j = i N n j ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaDaaaleaacqWGPbqAaeaacqWGJbWyaaGccqGH9aqpdaqadaqaamaaqadabaGaemOBa42aa0baaSqaaiabdQgaQbqaaiabdsfaujabdcfaqbaaaeaacqWGQbGAcqGH9aqpcqWGPbqAaeaacqWGobGta0GaeyyeIuoaaOGaayjkaiaawMcaaiabc+caVmaabmaabaWaaabmaeaacqWGUbGBdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAcqGH9aqpcqWGPbqAaeaacqWGobGta0GaeyyeIuoaaOGaayjkaiaawMcaaaaa@4C5E@ . In order to obtain a smooth and strictly monotonically increasing probability estimate, we solve the following quadratic optimization problem:

min π , π c + N i = 1 N ( s i ( π ) + t i ( π c ) ) s . t . π i π i c ,  for all  i = 1 , ... , N π i π i + 1 ε ,  for all  i = 1 , ... , N 1 π i c π i + 1 c ε , for all  i = 1 , ... , N 1 , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeabcaaaaeaadaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaacciGae8hWdaNaeiilaWIae8hWda3aaWbaaWqabeaacqWGJbWyaaWccqGHiiIZtuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab+1risnaaDaaameaacqGHRaWkaeaacqWGobGtaaaaleqaaaGcbaWaaabCaeaacqGGOaakcqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab9b8aWjabcMcaPiabgUcaRiabdsha0naaBaaaleaacqWGPbqAaeqaaOGaeiikaGIae0hWda3aaWbaaSqabeaacqWGJbWyaaGccqGGPaqkcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGcbaGaee4CamNaeiOla4IaeeiDaqNaeiOla4cabaGae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGHKjYOcqWFapaCdaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeiilaWIaeeiiaaIaeeOzayMaee4Ba8MaeeOCaiNaeeiiaaIaeeyyaeMaeeiBaWMaeeiBaWMaeeiiaaIaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemOta4eabaaabaGae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGHKjYOcqWFapaCdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabgkHiTiab=v7aLjabcYcaSiabbccaGiabbAgaMjabb+gaVjabbkhaYjabbccaGiabbggaHjabbYgaSjabbYgaSjabbccaGiabdMgaPjabg2da9iabigdaXiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabd6eaojabgkHiTiabigdaXaqaaaqaaiab=b8aWnaaDaaaleaacqWGPbqAaeaacqWGJbWyaaGccqGHKjYOcqWFapaCdaqhaaWcbaGaemyAaKMaey4kaSIaeGymaedabaGaem4yamgaaOGaeyOeI0Iae8xTduMaeeilaWIaeeiiaaIaeeOzayMaee4Ba8MaeeOCaiNaeeiiaaIaeeyyaeMaeeiBaWMaeeiBaWMaeeiiaaIaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemOta4KaeyOeI0IaeGymaeJaeiilaWcaaaaa@C8C6@

where ε = 10-4 is a small constant ensuring that the functions are strictly monotonically increasing and s i ( π ) = n i j = 1 N n j ( π i π ^ i ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab=b8aWjabcMcaPiabg2da9maalaaabaGaemOBa42aaSbaaSqaaiabdMgaPbqabaaakeaadaaeWaqaaiabd6gaUnaaBaaaleaacqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaakiabcIcaOGGaciab+b8aWnaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0Iaf4hWdaNbaKaadaWgaaWcbaGaemyAaKgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaaaa@4B00@ and t i ( π c ) = j = i N n j j = 1 N n j ( π i c π ^ i c ) 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG0baDdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab=b8aWnaaCaaaleqabaGaem4yamgaaOGaeiykaKIaeyypa0ZaaSaaaeaadaaeWaqaaiabd6gaUnaaBaaaleaacqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabdMgaPbqaaiabd6eaobqdcqGHris5aaGcbaWaaabmaeaacqWGUbGBdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaaGccqGGOaakiiGacqGFapaCdaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeyOeI0Iaf4hWdaNbaKaadaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaaaa@5604@ ensuring that big differences between the final and empirical estimates in ranges with many outputs are penalized stronger. Using the newly computed values π1,...,π N , we can compute for any output value f(x) the corresponding posterior probability estimate P(y = 1|f(x)) by linear interpolation

P ( y = 1 | f ( x ) ) = { π 1 for  f ( x ) < φ 1 r ( φ i , φ i + 1 ) for  φ i f ( x ) < φ i + 1 , π N for  f ( x ) φ N MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWG5bqEcqGH9aqpcqaIXaqmcqGG8baFcqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGGPaqkcqGH9aqpdaGabeqaauaabaqadiaaaeaaiiGacqGFapaCdaWgaaWcbaGaeGymaedabeaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabdAgaMjabcIcaOiab=Hha4jabcMcaPiabgYda8iab+z8aMnaaBaaaleaacqaIXaqmaeqaaaGcbaGaemOCaiNaeiikaGIae4NXdy2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGFgpGzdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcMcaPaqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiab+z8aMnaaBaaaleaacqWGPbqAaeqaaOGaeyizImQaemOzayMaeiikaGIae8hEaGNaeiykaKIaeyipaWJae4NXdy2aaSbaaSqaaiabdMgaPjabgUcaRiabigdaXaqabaGccqGGSaalaeaacqGFapaCdaWgaaWcbaGaemOta4eabeaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabdAgaMjabcIcaOiab=Hha4jabcMcaPiabgwMiZkab+z8aMnaaBaaaleaacqWGobGtaeqaaaaaaOGaay5Eaaaaaa@7E5D@

where r ( φ i , φ i + 1 ) = π i + 1 ( f ( x ) φ i ) + π i ( φ i + 1 f ( x ) ) φ i + 1 φ i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGYbGCcqGGOaakiiGacqWFgpGzdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiab=z8aMnaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeyypa0ZaaSaaaeaacqWFapaCdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcIcaOiabdAgaMjabcIcaOGqadiab+Hha4jabcMcaPiabgkHiTiab=z8aMnaaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaey4kaSIae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqWFgpGzdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabgkHiTiabdAgaMjabcIcaOiab+Hha4jabcMcaPiabcMcaPaqaaiab=z8aMnaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeyOeI0Iae8NXdy2aaSbaaSqaaiabdMgaPbqabaaaaaaa@634A@ . The cumulative posterior probability Pc(y = 1|f(x)) is computed analogously. The above estimation procedure was performed separately for every classifier.

Identifying splice sites

Machine learning binary classification methods aim at estimating a classification function f : X MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@ → {±1} using labeled training data from X MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@ × {±1} such that g will correctly classify unseen examples. In our case, the input space X MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@ will contain simple representations of sequences of length N, {A, C, G, T}N, while ±1 corresponds to true splice and decoy sites, respectively. We will use the posterior log-odds of a simple probabilistic model and SVMs using different kernels as classifiers as discussed below.

Posterior log-odds

The posterior log-odds of a probabilistic model with parameters θ are defined by

g ( x ) : = log ( P ( y = + 1 | x , θ ) ) log ( P ( y = 1 | x , θ ) ) = log ( P ( x | θ + ) ) log ( P ( x | θ ) ) + b , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabaGaem4zaCMaeiikaGccbmGae8hEaGNaeiykaKcabaGaeiOoaOJaeyypa0dabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIaemyEaKNaeyypa0Jaey4kaSIaeGymaeJaeiiFaWNae8hEaGNaeiilaWcccmGae4hUdeNaeiykaKIaeiykaKIaeyOeI0IagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIaemyEaKNaeyypa0JaeyOeI0IaeGymaeJaeiiFaWNae8hEaGNaeiilaWIae4hUdeNaeiykaKIaeiykaKcabaaabaGaeyypa0dabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIae8hEaGNaeiiFaWNae4hUde3aaWbaaSqabeaacqGHRaWkaaGccqGGPaqkcqGGPaqkcqGHsislcyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqWGqbaucqGGOaakcqWF4baEcqGG8baFcqGF4oqCdaahaaWcbeqaaiabgkHiTaaakiabcMcaPiabcMcaPiabgUcaRiabdkgaIjabcYcaSaaaaaa@7B34@

where b is a bias term. We use f(x) = sign(g(x)) for classification and Markov chains of order d

P ( x | θ ± ) = P ( x 1 , ... , x N | θ ± ) = P ( x 1 , ... , x d | θ ± ) i = d + 1 N P ( x i | x i 1 , ... , x i d , θ ± ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabaGaemiuaaLaeiikaGccbmGae8hEaGNaeiiFaWhccmGae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkaeaacqGH9aqpaeaacqWGqbaucqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGobGtaeqaaOGaeiiFaWNae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkaeaaaeaacqGH9aqpaeaacqWGqbaucqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGKbazaeqaaOGaeiiFaWNae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkdaqeWbqaaiabdcfaqjabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaemiEaG3aaSbaaSqaaiabdMgaPjabgkHiTiabigdaXaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWG4baEdaWgaaWcbaGaemyAaKMaeyOeI0IaemizaqgabeaakiabcYcaSiab+H7aXnaaCaaaleqabaGaeyySaelaaaqaaiabdMgaPjabg2da9iabdsgaKjabgUcaRiabigdaXaqaaiabd6eaobqdcqGHpis1aOGaeiykaKcaaaaa@8056@

as for instance described in [44]. Each factor in this product has to be estimated in model training, i.e. one counts how often each symbol appears at each position in the training data conditioned on every possible xi-1,...,xi-d. Then for given model parameters θ we have

P ( x | θ ± ) = θ 0 ± ( x 1 , ... , x d ) i = d + 1 N θ i ± ( x i , ... , x i d ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakieWacqWF4baEcqGG8baFiiWacqGF4oqCdaahaaWcbeqaaiabgglaXcaakiabcMcaPiabg2da9GGaciab9H7aXnaaDaaaleaacqaIWaamaeaacqGHXcqSaaGccqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGKbazaeqaaOGaeiykaKYaaebCaeaacqqF4oqCdaqhaaWcbaGaemyAaKgabaGaeyySaelaaOGaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWG4baEdaWgaaWcbaGaemyAaKMaeyOeI0IaemizaqgabeaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaemizaqMaey4kaSIaeGymaedabaGaemOta4eaniabg+GivdGccqGGSaalaaa@655E@

where θ 0 ± MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF4oqCdaqhaaWcbaGaeGimaadabaGaeyySaelaaaaa@3172@ is an estimate for P(x1,...,x d ) and θ i (x i ,...,xi-d) an estimate for P(x i |xi-1,...,xi-d). As the alphabet has four letters, each model has (N - d + 1)·4d+1parameters and the maximum likelihood estimate is given by:

θ 0 ( s 1 , ... , s d ) = 1 m + π ( k = 1 m I ( s 1 = x 1 k s d = x d k ) + π ) θ i ( s i , ... , s i d ) = k = 1 m I ( s i = x i k s i d = x i d k ) + π k = 1 m I ( s i = x i 1 k s i d = x i d k ) + 4 π MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeabbaaaaeaaiiGacqWF4oqCdaWgaaWcbaGaeGimaadabeaakiabcIcaOiabdohaZnaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4Cam3aaSbaaSqaaiabdsgaKbqabaGccqGGPaqkcqGH9aqpaeaacaWLjaWaaSaaaeaacqaIXaqmaeaacqWGTbqBcqGHRaWkcqWFapaCaaWaaeWaaeaadaaeWbqaaGqabiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabigdaXaqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaakiaawIcacaGLPaaaaeaacqWF4oqCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4Cam3aaSbaaSqaaiabdMgaPjabgkHiTiabdsgaKbqabaGccqGGPaqkcqGH9aqpaeaacaWLjaWaaSaaaeaadaaeWaqaaiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPbqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGPbqAcqGHsislcqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaakeaadaaeWaqaaiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabigdaXaqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGPbqAcqGHsislcqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiabisda0iab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaaaaaaaaa@C8DB@

where I(·) is the indicator function, k enumerates over the number of observed sequences m, and π is the commonly used pseudocount (a model parameter, cf. [44]) which is also tuned within the model selection procedure (cf. the model selection and evaluation section).

SVM and kernels for splice site detection

As the second method we use SVMs. The generated classification function can be written as

f ( x ) = sign ( i = 1 m y i α i K ( x i , x ) + b ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGH9aqpcqqGZbWCcqqGPbqAcqqGNbWzcqqGUbGBdaqadaqaamaaqahabaGaemyEaK3aaSbaaSqaaiabdMgaPbqabaacciGccqGFXoqydaWgaaWcbaGaemyAaKgabeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaakiab9Pq8ljabcIcaOiab=Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiilaWIae8hEaGNaeiykaKIaey4kaSIaemOyaigaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGTbqBa0GaeyyeIuoaaOGaayjkaiaawMcaaiabcYcaSaaa@5C25@

where y i {-1, +1} (i = 1,...,m) is the label of example x i . The α i 's are Lagrange multipliers and b is the usual bias which are the results of SVM training [16]. The kernel K MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=saaa@3834@ is the key ingredient for learning with SVMs.

In the following paragraphs we describe the kernels which are used in this study. They are all functions defined on sequences. In the following x= x1x2...x N denotes a sequence of length N.

The locality improved (LI) kernel has been proven useful in the context of translation initiation site (TIS) recognition [21]. Similar to the polynomial kernel of degree d for discrete input data, this kernel considers correlations of matches up to order d. In contrast to polynomial kernels however, the LI kernel only considers local subsequence correlations within a small window of length 2l + 1 around a sequence position:

win p ( x , x ) = ( 1 2 l + 1 j = l + l I ( x p + j = x p + j ) ) d , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqG3bWDcqqGPbqAcqqGUbGBdaWgaaWcbaGaemiCaahabeaakiabcIcaOGqadiab=Hha4jabcYcaSiqb=Hha4zaafaGaeiykaKIaeyypa0ZaaeWaaeaadaWcaaqaaiabigdaXaqaaiabikdaYiabdYgaSjabgUcaRiabigdaXaaadaaeWbqaaGqabiab+LeajjabcIcaOiabdIha4naaBaaaleaacqWGWbaCcqGHRaWkcqWGQbGAaeqaaOGaeyypa0JafmiEaGNbauaadaWgaaWcbaGaemiCaaNaey4kaSIaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyypa0JaeyOeI0IaemiBaWgabaGaey4kaSIaemiBaWganiabggHiLdaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsgaKbaakiabcYcaSaaa@59FE@

where p = l + 1,...,N - l. These window scores are then summed up over the length of the sequence using a weighting w p which linearly decreases to both ends of the sequence, i.e. w p = { p l p N / 2 N p l + 1 p > N / 2 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemiCaahabeaakiabg2da9maaceqabaqbaeqabiGaaaqaaiabdchaWjabgkHiTiabdYgaSbqaaiabdchaWjabgsMiJkabd6eaojabc+caViabikdaYaqaaiabd6eaojabgkHiTiabdchaWjabgkHiTiabdYgaSjabgUcaRiabigdaXaqaaiabdchaWjabg6da+iabd6eaojabc+caViabikdaYaaaaiaawUhaaaaa@48CE@ . Then we have the following kernel:

K ( x , x ) = p = l + 1 N l w p win p ( x , x ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=scqGGOaakieWacqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaiabdchaWbqabaGccqqG3bWDcqqGPbqAcqqGUbGBdaWgaaWcbaGaemiCaahabeaakiabcIcaOiab+Hha4jabcYcaSiqb+Hha4zaafaGaeiykaKcaleaacqWGWbaCcqGH9aqpcqWGSbaBcqGHRaWkcqaIXaqmaeaacqWGobGtcqGHsislcqWGSbaBa0GaeyyeIuoakiabc6caUaaa@597D@

The weighting allows one to emphasize regions of the sequence which are believed to be of higher importance; in our case this is the center, which is the location of the splice site. (Note that the definition of the LI kernel in [21] is slightly different from ours. Previously the weighting was inside the window and was not very effective. Moreover, the version presented here of the kernel can be computed 2l + 1 times faster than the original one.)

The weighted degree (WD) kernel[28] uses a similar approach by counting matching subsequences uδ,l(x) and uδ,l(x') between two sequences x and x', with uδ,l(x) = x l xl+1...xl+δ-1for all l and 1 ≤ δd. Here, δ denotes the order (length of the subsequence) to be compared. The WD kernel is defined as

K ( x , x ) = δ = 1 d w δ l = 1 N δ + 1 I ( u δ , l ( x ) = u δ , l ( x ) ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=scqGGOaakieWacqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaGGaciab9r7aKbqabaaabaGae0hTdqMaeyypa0JaeGymaedabaGaemizaqganiabggHiLdGcdaaeWbqaaGqabiab8LeajjabcIcaOiab+vha1naaBaaaleaacqqF0oazcqGGSaalcqWGSbaBaeqaaOGaeiikaGIae4hEaGNaeiykaKIaeyypa0Jae4xDau3aaSbaaSqaaiab9r7aKjabcYcaSiabdYgaSbqabaGccqGGOaakcuGF4baEgaqbaiabcMcaPiabcMcaPaWcbaGaemiBaWMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0Iae0hTdqMaey4kaSIaeGymaedaniabggHiLdGccqGGSaalaaa@6A78@

where we choose the weighting to be w δ = d - δ + 1. This kernel emphasizes position dependent information and the weighting decreases the influence for higher order matches, which would anyway have a higher contribution due to all their matching subsequences. It can be computed very efficiently without even extracting and enumerating all subsequences of the sequences [40]. Note that this kernel is similar to the spectrum kernel as proposed by [49], with the main difference that the weighted degree kernel uses position specific information.

The weighted degree kernel with shifts (WDS)[34] is defined as

K ( x i , x j ) = δ = 1 d w δ l = 1 N δ + 1 s = 0 s + l N S ( l ) δ s μ δ , l , s , x i , x j , μ δ , l , s , x i , x j = I ( u δ , l + s ( x i ) = u δ , l ( x j ) ) + I ( u δ , l ( x i ) = u δ , l + s ( x j ) ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqadeGabaaabaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NcXVKaeiikaGccbmGae4hEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGF4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaGGaciab9r7aKbqabaaabaGae0hTdqMaeyypa0JaeGymaedabaGaemizaqganiabggHiLdGcdaaeWbqaamaaqahabaGae0hTdq2aaSbaaSqaaiabdohaZbqabaGccqqF8oqBdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaeiilaWIaem4CamNaeiilaWIae4hEaG3aaSbaaWqaaiabdMgaPbqabaWccqGGSaalcqGF4baEdaWgaaadbaGaemOAaOgabeaaaSqabaaabaqbaeqabiqaaaqaaiabdohaZjabg2da9iabicdaWaqaaiabdohaZjabgUcaRiabdYgaSjabgsMiJkabd6eaobaaaeaacqWGtbWucqGGOaakcqWGSbaBcqGGPaqka0GaeyyeIuoakiabcYcaSaWcbaGaemiBaWMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0Iae0hTdqMaey4kaSIaeGymaedaniabggHiLdaakeaacqqF8oqBdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaeiilaWIaem4CamNaeiilaWIae4hEaG3aaSbaaWqaaiabdMgaPbqabaWccqGGSaalcqGF4baEdaWgaaadbaGaemOAaOgabeaaaSqabaGccqGH9aqpieqacqaFjbqscqGGOaakcqGF1bqDdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaey4kaSIaem4CamhabeaakiabcIcaOiab+Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaeyypa0Jae4xDau3aaSbaaSqaaiab9r7aKjabcYcaSiabdYgaSbqabaGccqGGOaakcqGF4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabcMcaPiabgUcaRiab8LeajjabcIcaOiab+vha1naaBaaaleaacqaH0oazcqGGSaalcqWGSbaBaeqaaOGaeiikaGIae4hEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGH9aqpcqGF1bqDdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaey4kaSIaem4CamhabeaakiabcIcaOiab+Hha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeiykaKIaeiilaWcaaaaa@C3E8@

where w δ is as before, δ s = 1/(2(s + 1)) is the weight assigned to shifts (in either direction) of extent s, and S(l) determines the shift range at position l. Here, we choose S(l) = σ|l - l c |, where l c is the position of the splice site. An efficient implementation for this kernel allowing large scale computations is described in [40].

For both the WD and WDS kernel we use the following normalization

K ˜ ( x , x ) = K ( x , x ) K ( x , x ) K ( x , x ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacuWFke=sgaacaiabcIcaOGqadiab+Hha4jabcYcaSiqb+Hha4zaafaGaeiykaKIaeyypa0ZaaSaaaeaacqWFke=scqGGOaakcqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPaqaamaakaaabaGae8NcXVKaeiikaGIae4hEaGNaeiilaWIae4hEaGNaeiykaKIae8NcXVKaeiikaGIaf4hEaGNbauaacqGGSaalcuGF4baEgaqbaiabcMcaPaWcbeaaaaGccqGGUaGlaaa@55E7@

Training and evaluation of the SVMs and the MCs were performed using our shogun machine learning toolbox (cf. [38] in which efficient implementations of the aforementioned kernels can be found.

Interpreting the SVM classifier

Kernel methods are aimed directly at the classification task which is to discriminate between the true and decoy classes by learning a decision function separating the classes in an associated feature space. In contrast, generative methods like position weight matrices or Markov models are statistical models which represent the data under specific assumptions on the statistical structure and hence it is relatively straightforward to interpret their results. Although kernel methods outperform in many cases generative models, especially when the true statistical structure is more intricate than the assumed one, one of the main criticisms of kernel methods is the difficulty to directly interpret their decision function in a way that allows to gain biologically relevant insight. However, by taking advantage of our specific kernels and of their sparse representation, we are able to efficiently use the decision function of our SVMs in order to understand which k-mers at which positions are contributing the most in discriminating between true and decoy splice sites.

To see how this is possible, recall that, for SVMs, the resulting classifier can be written as a dot product between an α-weighted linear combination of support vectors mapped into the feature space (which is often only implicitly defined via the kernel function) [18]:

f ( x ) = i = 1 m α i y i Φ ( x i ) w Φ ( x ) = i = 1 m α i y i K ( x i , x ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGH9aqpdaagaaqaamaaqahabaacciGae4xSde2aaSbaaSqaaiabdMgaPbqabaGccqWG5bqEdaWgaaWcbaGaemyAaKgabeaakiabfA6agjabcIcaOiab=Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGTbqBa0GaeyyeIuoaaSqaaiab=Dha3bGccaGL44pacqGHflY1cqqHMoGrcqGGOaakcqWF4baEcqGGPaqkcqGH9aqpdaaeWbqaaiab+f7aHnaaBaaaleaacqWGPbqAaeqaaOGaemyEaK3aaSbaaSqaaiabdMgaPbqabaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGccqqFke=scqGGOaakcqWF4baEdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiab=Hha4jabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdGccqGGUaGlaaa@71B5@

In the case of sparse feature spaces, as with string kernels, one can represent w in a sparse form and then efficiently compute dot products between w and Φ(x) in order to speed up SVM training or testing [40]. This sparse representation comes with the additional benefit of providing us with means to interpret the SVM classifier. For k-mer based string kernels like the spectrum kernel, each dimension w u in w represents a weight assigned to that k-mer u. From the learned weighting one can thus easily identify the k-mers with highest absolute weight or above a given threshold τ: {u| |w u | >τ}. Note that the total number of k-mers appearing in the support vectors is bounded by dN s L where L is the maximum length of the sequences L = max i = 1 , ... , m l x i MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqGH9aqpcyGGTbqBcqGGHbqycqGG4baEdaWgaaWcbaGaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemyBa0gabeaakiabdYgaSnaaBaaaleaaieWacqWF4baEdaWgaaadbaGaemyAaKgabeaaaSqabaaaaa@40F0@ . This approach also works for the WD kernel (with and without shifts). Here a weight is assigned to each k-mer with 1 ≤ kd at each position in the sequence. This allows us to generate the k-mer importance matrices, displayed in Figure 4, associated with our splice classifiers [54]. They display the weight which the SVM assigns to each k-mer at each position in the input sequence, i.e. given a SVM classifier trained with a WD kernel of degree d we extract the k-mers weightings for 1 ≤ k d ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@ starting at position p = 1,...,N, where we used d as selected in model selection and d ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@ = 1,...,8. This leads to a weighting for d ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@ -mers u for each position in the sequence: Wu,p, which may be summarized by S d ˜ , p MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGafmizaqMbaGaacqGGSaalcqWGWbaCaeqaaaaa@31B0@ = max u (Wu,p). We compute this quantity for d ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@ = 1,...,8 leading to the two 8 × 141 matrices, which are transformed into percentile values and then displayed color-coded in Figure 4. Note that the above computation can be done efficiently using string index data structures implemented in SHOGUN and described in detail in [40].


  1. Bajic V, Brent M, Brown R, Frankish A, Harrow J, Ohler U, Solovyev V, Tan S: Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biology 2006,7(Suppl 1):S3-.

    Article  PubMed Central  PubMed  Google Scholar 

  2. Stein L, Blasiar D, Coghlan A, Fiedler T, McKay S, Flicek P: nGASP Gene prediction challenge.2007. []

    Google Scholar 

  3. Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer R, Schölkopf B: Improving the C. elegans genome annotation using machine learning. PLoS Computational Biology 2007,3(2):e20.

    Article  PubMed Central  PubMed  Google Scholar 

  4. Bernal A, Crammer K, Hatzigeorgiou A, Pereira F: Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction. PLoS Computational Biology 2007,3(3):e54.

    Article  PubMed Central  PubMed  Google Scholar 

  5. Hinds D, Stuve L, Nilsen G, Halperin E, Eskin E, Ballinger D, Frazer K, Cox D: Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 2005,307(5712):1072–1079.

    Article  CAS  PubMed  Google Scholar 

  6. International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437: 1299–1320.

    Article  Google Scholar 

  7. Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D: Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 2007,317(5836):338–342.

    Article  CAS  PubMed  Google Scholar 

  8. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268: 78–94.

    Article  CAS  PubMed  Google Scholar 

  9. Reese M, Eeckman FH, Kulp D, Haussler D: Improved splice site detection in Genie. Journal of Computational Biology 1997, 4: 311–323.

    Article  CAS  PubMed  Google Scholar 

  10. Salzberg S, Delcher A, Fasman K, Henderson J: A decision tree system for finding genes in DNA. Journal of Computational Biology 1998,5(4):667–680.

    Article  CAS  PubMed  Google Scholar 

  11. Delcher A, Harmon D, Kasif S, White O, Salzberg S: Improved microbial gene identification with GLIMMER. Nucleic Acids Research 1999,27(23):4636–4641.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  12. Pertea M, Lin X, Salzberg S: GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research 2001,29(5):1185–1190.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  13. Rampone S: Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 1998,14(8):676–684.

    Article  CAS  PubMed  Google Scholar 

  14. Cai D, Delcher A, Kao B, Kasif S: Modeling splice sites with Bayes networks. Bioinformatics 2000,16(2):152–158.

    Article  CAS  PubMed  Google Scholar 

  15. Rajapakse J, Ho L: Markov Encoding for Detecting Signals in Genomic Sequences. IEEE ACM Transactions on Computational Biology and Bioinformatics 2005,2(2):131–142.

    Article  CAS  PubMed  Google Scholar 

  16. Vapnik VN: The Nature of Statistical Learning Theory. New York, Springer Verlag; 1995.

    Book  Google Scholar 

  17. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B: An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks 2001,12(2):181–201.

    Article  PubMed  Google Scholar 

  18. Schölkopf B, Smola AJ: Learning with Kernels. Cambridge, MA, MIT Press; 2002.

    Google Scholar 

  19. Noble WS: What is a Support Vector Machine? Nature Biotechnology 2006,24(12):1565–7.

    Article  CAS  PubMed  Google Scholar 

  20. Jaakkola T, Haussler D: Exploiting Generative Models in Discriminative Classifiers. In Advances in Neural Information Processing Systems. Volume 11. Edited by: Kearns M, Solla S, Cohn D. Cambridge, MA, MIT Press; 1999:487–493.

    Google Scholar 

  21. Zien A, Rätsch G, Mika S, Schölkopf B, Lengauer T, Müller KR: Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics 2000,16(9):799–807.

    Article  CAS  PubMed  Google Scholar 

  22. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Furey TS, Ares JM, Haussler D: Knowledge-based analysis of microarray gene expression data using support vector machines. PNAS 2000, 97: 262–267.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Tsuda K, Kawanabe M, Rätsch G, Sonnenburg S, Müller K: A New Discriminative Kernel from Probabilistic Models. In Advances in Neural information processings systems Edited by: Dietterich T, Becker S, Ghahramani Z. 2002, 14: 977.

    Google Scholar 

  24. Sonnenburg S, Rätsch G, Jagota A, Müller KR: New Methods for Splice-Site Recognition. Proc ICANN'02 2002.

    Google Scholar 

  25. Sonnenburg S: New Methods for Splice Site Recognition. In Master's thesis. Humboldt University; 2002. [Supervised by K.-R. Müller H.-D. Burkhard and G. Rätsch] [Supervised by K.-R. Müller H.-D. Burkhard and G. Rätsch]

    Google Scholar 

  26. Lorena A, de Carvalho A: Human Splice Site Identifications with Multiclass Support Vector Machines and Bagging. Artificial Neural Neural Networks and Neural Information Processing – ICANN/ICONIP 2003 2003., 2714:

    Google Scholar 

  27. Yamamura M, Gotoh O: Detection of the Splicing Sites with Kernel Method Approaches Dealing with Nucleotide Doublets. Genome Informatics 2003, 14: 426–427.

    Google Scholar 

  28. Rätsch G, Sonnenburg S: Accurate Splice Site Detection for Caenorhabditis elegans . In Kernel Methods in Computational Biology. Edited by: B Schölkopf KT, Vert JP. MIT Press; 2004.

    Google Scholar 

  29. Degroeve S, Saeys Y, Baets BD, Rouzé P, de Peer YV: SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 2005,21(8):1332–8.

    Article  CAS  PubMed  Google Scholar 

  30. Huang J, Li T, Chen K, Wu J: An approach of encoding for predictionof splice sites using SVM. Biochimie 2006, 88: 923–929.

    Article  CAS  PubMed  Google Scholar 

  31. Zhang Y, Chu CH, Chen Y, Zha H, Ji X: Splice site prediction using support vector machines with a Bayes kernel. Expert Systems with Applications 2006, 30: 73–81.

    Article  Google Scholar 

  32. Baten A, Chang B, Halgamuge S, Li J: Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 2006,7(Suppl 5):S15.

    Article  PubMed Central  PubMed  Google Scholar 

  33. Liao L, Noble WS: Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In Proceedings of the Sixth Annual International Conference on Computational Molecular Biology (RECOMB). Edited by: Myers G, Hannenhalli S, Sankoff D, Istrail S, Pevzner P, Waterman M. New York: ACM Press; 2002:225–232.

    Google Scholar 

  34. Rätsch G, Sonnenburg S, Schölkopf B: RASE: Recognition of Alternatively Spliced Exons in C. elegans . Bioinformatics 2005,21(Suppl 1):i369-i377.

    Article  PubMed  Google Scholar 

  35. Pertea M, Mount S, Salzberg S: A computational survey of candidate exonic splicing enhancer motifs in the model plant Arabidopsis thaliana. BMC Bioinformatics 2007, 8: 159.

    Article  PubMed Central  PubMed  Google Scholar 

  36. Chen TM, Lu CC, Li WH: Prediction of splice sites with dependency graphs and their expanded bayesian networks. Bioinformatics 2005,21(4):471–482.

    Article  CAS  PubMed  Google Scholar 

  37. Sonnenburg S, Zien A, Rätsch G: ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics 2006,22(14):e472–480.

    Article  CAS  PubMed  Google Scholar 

  38. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B: Large Scale Multiple Kernel Learning. Journal of Machine Learning Research 2006, 7: 1531–1565. [Special Topic on Machine Learning and Optimization] [Special Topic on Machine Learning and Optimization]

    Google Scholar 

  39. Rätsch G, Sonnenburg S, Schäfer C: Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics 2006,7(Suppl 1):S9.

    Article  PubMed Central  PubMed  Google Scholar 

  40. Sonnenburg S, Rätsch G, Rieck K: Large Scale Learning with String Kernels. In Large Scale Kernel Machines. Edited by: Bottou L, Chapelle O, DeCoste D, Weston J. MIT Press; 2007.

    Google Scholar 

  41. Metz CE: Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978.,VIII(4):

  42. Fawcett T: ROC graphs: Notes and practical considerations for data mining researchers. In Technical report hpl-2003–4. HP Laboratories, Palo Alto, CA, USA; 2003.

    Google Scholar 

  43. Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. ICML 2006, 233–240.

    Chapter  Google Scholar 

  44. Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis – Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK, Cambridge University Press; 1998.

    Book  Google Scholar 

  45. Correction notes to BMC Bioinformatics 2006, 7(Suppl 5):S152006. []

  46. SpliceMachine[]

  47. GeneSplicer[]

  48. SpliceMachine feature extractor[]

  49. Leslie C, Eskin E, Noble WS: The spectrum kernel: A string kernel for SVM protein classification. In PSB. Edited by: Altman R, Dunker A, Hunter L, Lauerdale K, Klein T. River Edge, NJ, World Scientific; 2002:564–575.

    Google Scholar 

  50. Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci U S A 1987, 84: 4355–4358.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  51. Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. Computational Systems Bioinformatics Conference 2004 2004, 146–154.

    Google Scholar 

  52. Zhang XHF, Heller KA, Hefter I, Leslie CS, Chasin LA: Sequence information for the splicing of human pre-mRNA identified by support vector machine classification. Genome Res 2003,13(12):2637–2650.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  53. Zhang X, Leslie C, Chasin L: Dichotomous splicing signals in exon flanks. Genome Research 2005,15(6):768–79.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  54. Sonnenburg S, Zien A, Philips P, Rätsch G: POIMS: Positional Oligomer Importance Matrices. 2007. [In preparation]

    Google Scholar 

  55. Lewin B: Genes VII. Oxford University Press, New York; 2000.

    Google Scholar 

  56. Fruit fly genome sequence[]

  57. Boguski M, Tolstoshev TLC: dbEST-Database for "Expressed Sequence Tags". Nature Genetics 1993,4(4):332–3.

    Article  CAS  PubMed  Google Scholar 

  58. Wormbase[]

  59. Fruit fly expression sequence tags[]

  60. Riken cress sequence[]

  61. Ensemble[]

  62. Mammalian Gene Collection[]

  63. Kent W: BLAT-the BLAST-like alignment tool. Genome Research 2002,12(4):656–64.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  64. Ong C, Rätsch G: Prediction of Alternative Splicing in Eukaryotes. [In preparation] [In preparation]

Download references


We gratefully acknowledge helpful discussions with Anja Neuber, Alexander Zien, Georg Zeller, Andrei Lupas, Detlef Weigel, Alan Zahler, Koji Tsuda, Christina Leslie and Eleazar Eskin. Additionally, we thank Alexander Zien for helping with the implementation of the k-mer importance matrices and Cheng Soon Ong for the implementation of generating the splice graphs from aligned sequences. Finally, we would like to thank Michiel Van Bel from Ghent University for help to get SpliceMachine to work.

This article has been published as part of BMC Bioinformatics Volume 8 Supplement 10, 2007: Neural Information Processing Systems (NIPS) workshop on New Problems and Methods in Computational Biology. The full contents of the supplement are available online at

Author information

Authors and Affiliations


Corresponding author

Correspondence to Gunnar Rätsch.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SS provided code for large scale kernel learning and helped carrying out the experiments. PP performed most experiments in the pilot study and drafted the manuscript. GS and JB performed the experiments on the genome-wide data sets and helped generating the data. GR conceived the experiments, generated the data sets and helped performing experiments. All authors contributed to the writing and critically revising the manuscript.

Sören Sonnenburg, Gabriele Schweikert, Petra Philips contributed equally to this work.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Sonnenburg, S., Schweikert, G., Philips, P. et al. Accurate splice site prediction using support vector machines. BMC Bioinformatics 8 (Suppl 10), S7 (2007).

Download citation

  • Published:

  • DOI: