Skip to main content

EnsembleSplice: ensemble deep learning model for splice site prediction

Abstract

Background

Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy, and error rate.

Results

Given these deficits, we propose EnsembleSplice, an ensemble learning architecture made up of four (4) distinct convolutional neural networks (CNN) model architecture combination that outperform existing splice site detection methods in the experimental evaluation metrics considered including the accuracies and error rates. We trained and tested a variety of ensembles made up of CNNs and DNNs using the five-fold cross-validation method to identify the model that performed the best across the evaluation and diversity metrics. As a result, we developed our diverse and highly effective splice site (SS) detection model, which we evaluated using two (2) genomic Homo sapiens datasets and the Arabidopsis thaliana dataset. The results showed that for of the Homo sapiens EnsembleSplice achieved accuracies of 94.16% for one of the acceptor splice sites and 95.97% for donor splice sites, with an error rate for the same Homo sapiens dataset, 4.03% for the donor splice sites and 5.84% for the acceptor splice sites datasets.

Conclusions

Our five-fold cross validation ensured the prediction accuracy of our models are consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository here: https://github.com/OluwadareLab/EnsembleSplice

Peer Review reports

Background

The development of high-throughput computational sequencing methods and technologies has created a significant opportunity for gene structure analysis research and experiments. We focus on splice sites detection in this paper, which is critical for gene structure and expression analysis. The gene sequences essential for protein synthesis are composed of alternating nucleotide regions called introns, which are the non-protein-coding regions, and exons, which are the protein-coding regions. During DNA transcription in eukaryotic cells, an enzyme called spliceosomes cuts out introns and concatenates exons; this process is known as RNA splicing and is required for the creation of mature mRNA from pre-mRNA, which is required for gene expression and protein synthesis [1]. The dinucleotides AG and GT are biological markers involved in RNA splicing and are often found in the 3′ intron boundary, or donor splice site (DoSS) region, and the 5′ intron boundary, or acceptor splice site (AcSS) region, respectively [2] as shown in Fig. 1.

Fig. 1
figure 1

Illustration of 2 step biochemistry process for Splice Sites. This figure shows canonical sequence distribution in a splice site location, the Introns are spliced, hence the name splice sites resulting in proteins as a final product

Organismal genomes are studied primarily through genome annotation, which involves classifying genomic elements based on their function and location [3]. This annotation is typically performed at the nucleotide level to determine the locations of key genetic elements in DNA sequences, as well as at the protein level to assess proteomic function and investigate the mechanisms underlying gene interaction and splice site localization [4]. More specifically, different computational methods have been proposed to accurately detect splice sites location, which can be used to identify genes in eukaryotic genomes. This biological and biochemical process has proven to be time-consuming and ineffective in the real world, necessitating the development of computational tools for accurate splice site prediction.

The earliest research on genomic DNA splice site prediction primarily leveraged methods in machine learning and probabilistic modeling. GeneSplicer was the first to achieve record accuracies with its Markov model-enhanced maximal dependence decomposition decision trees, which contributed to the popularity of Markov models for splice site prediction [2]. Other earlier works used Markov Model as a preprocessing technique for other algorithms such as shallow neural networks, or to enhance performance [5, 6]. Burge et al. [7] developed the MDD method [8] as a decision tree approach to reduce the computational burden of increasing the Markov model order. Goel et al. [9] proposed a method also based on Markov model. Some other methods adopted the use of support vector machines (SVMs) for their simplicity and speed [10, 11]. While the intricacy of these machine learning models grew, their accuracy plateaued. This was due to both compute power and the bottleneck of having to select the model's features manually.

Deep learning, along with better computing methods and resources, has largely solved these issues. In recent years, splice site prediction has been performed using the deep learning (DL) approach with neural networks (NNs). Convolutional Neural Networks (CNN) are the most frequent neural network (NN) architecture adopted for this deep learning approaches, and widely deviates in their depth (number of layers) and parameters across studies. SpliceRover [12], SpliceFinder [13], DeepSplicer [14], DeepSS [15], Spliceator [16], and iSS-CNN [17], among others, employ CNNs. Donor and acceptor sites are typically one-hot-encoded and batch-fed into these architectures, which perform feature extraction and exceed the earlier ML techniques in classification accuracy. On genomic DNA, other deep learning methods have been used, including the Long-Short Term Memory (LSTM) neural network and the Recurrent Neural Network (RNN), which are sequence learning networks commonly used in time-series analyses. SpliceViNCI, for example, is a bidirectional LSTM with integrated gradients [18].

In this work, we propose a stacking ensemble method for splice site prediction to combine various classifiers to produce an alpha-classifier that is more effective at classification and generalization than the individual classifiers. Through training, a stack (ensemble) of various neural networks models (base-models) develops its own representation of the genomic data. Following this, each network predicts the unidentified splice sequences on its own. These predictions are combined into a new dataset's pool entries. For example, if the ensemble included three different CNNs and two different DNNs, and the predictions for a splice site were [1], [1], [1], [0], [1] for each network, then the row of entries would read [1, 1, 1, 0, 1]. Following the creation of this new dataset, a final prediction using the new dataset is then made using simple logistic regression (meta-model). The main importance of ensemble learning is that the diversity of predictions balances out the weaknesses of individual base model performances, increasing overall accuracy and resulting in improved performance and robustness. This performance and robustness importance can be seen in other deep learning works of literature, including models for positioning footballers [19] in sport science research, models for predicting generic Escherichia coli population in agricultural ponds based on weather station measurements [20], and improving model performances for the detection of Alzheimer's disease [21] in health science research.

Our method combines deep neural network architectures to create EnsembleSplice, a novel ensemble architecture. Hence, we propose a deep learning architecture that learns from an ensemble of CNNs to achieve a state-of-the-art performance in true and false splice sites prediction accuracy and efficiency. We used grid search methods to determine the best hyperparameters, and the best ensemble selection was achieved using five-fold cross-validation, as shown in the manuscript's tables and results. Furthermore, we compare EnsembleSplice’s splice site identification performance to that of existing splice site tools using three genomic DNA datasets as benchmarks. The datasets, datasets preprocessing using one-hot encoding, EnsembleSplice pipeline, performance benchmarks methods, subsections are discussed in the methodology section, while explanatory evaluation metrics, cross-validation, result discussion and model interpretability subsections are discussed in the experiments and results section, as well as the conclusion sections.

In summary, the aim and objective of this work is as follow:

  • Develop a deep ensemble model architecture consisting of DNNs and/or CNNs that achieves excellent performance on the task of splice site classification.

  • Ensure via cross validation, that the deep ensemble consists of effective component neural networks (CNNs and/or DNNs) with high diversity across them.

  • Ensure that our deep ensemble architecture is robust, with a minimum dispersion and consistent in performance in splice site prediction across different datasets, than current state-of-the-art algorithms.

Methods

Datasets

Each dataset used in this research consists of both confirmed true (positive) AcSS/DoSS and confirmed false (negative) AcSS/DoSS. Evaluation of classification performance is partitioned by splice site type. This means that EnsembleSplice is trained to distinguish between true and false DoSS regions and is trained again and separately to distinguish between true and false AcSS regions.

HS3D

The Homo Sapiens Splice Sites Dataset (HS3D) is a collection of human genomic DNA introns and exons extracted from GenBank Rel.123 [22] HS3D's Primate Division. There are 2796 confirmed true DoSS regions, 2880 true positive AcSS regions, 271,937 confirmed false DoSS regions, and 329,374 confirmed false AcSS regions. This paper randomly selects 2750 false DoSS regions and 2750 false AcSS regions from the 271,937 and 329,374 available in the dataset, respectively; the Python code snippet random.seed(123,454) is used to shuffle the entire HS3D dataset before the false DoSS and false AcSS subsets are selected. The full set of 2750 confirmed true DoSS regions and 2750 confirmed true positive AcSS regions are used. The nucleotide consensus AG for AcSS regions occurs at positions 69 and 70, and the nucleotide consensus GT for DoSS regions occurs at positions 71 and 72. In total, each HS3D donor and acceptor site splice sequence is 140 nucleotides long, with this sequence length used for the cross-validation, performance, and comparison experiment. The HS3D dataset can be accessed at http://www.sci.unisannio.it/docenti/rampone/.

Homo sapiens and Arabidopsis thaliana

The Homo sapiens and Arabidopsis thaliana datasets consist of splice site regions selected from annotated genomic DNA sequences for Homo sapiens and A. thaliana in Ensembl 2018 [23]. Using Bedtools [24, 25], the peripheral nucleotide sequences padding each AcSS, or DoSS were determined. Each splice site region in these datasets is 602 nucleotides long; each DoSS region has consensus GT at positions 301 and 302, and each AcSS has consensus AG also at positions 301 and 302. There are 250,400 confirmed true and false DoSS regions and 248,150 confirmed true and false AcSS regions in the Homo sapiens dataset. There are 110,314 confirmed true and false DoSS regions and 112,336 confirmed true and false AcSS regions in the A. thaliana dataset. The confirmed true AcSS and DoSS regions were selected from chromosomes 21, 2, 2L, 1, and I. This paper randomly selects 8000 true and false DoSS regions (totaling 16,000 entries) and 8000 true and false AcSS regions (totaling 16,000 entries) from both datasets. As with the HS3D dataset, the Python code snippet random.seed(123,454) is used for shuffling the Homo sapiens and A. thaliana datasets before the DoSS and AcSS subsets are selected. The Homo sapiens and A. thaliana datasets can be accessed at https://github.com/SomayahAlbaradei/Splice_Deep.

We used the source sequence length—140 nucleotides for HS3D and 602 for Homo sapiens and A. thaliana datasets —as discussed in the subsections for all cross-validation, performance, and comparison experiments executed and results reported.

One-hot encoding and hyper-parameter search space and tuning

Genomic DNA splice site regions are composed of four nucleotides: A (Adenine), G (Guanine), C (Cytosine), and T (Thymine). Given constraints on the input of DL architectures, these nucleotides are encoded numerically, with each nucleotide corresponding to a row in a 4 × 4 identity matrix. The encoding scheme utilized in this paper is that A corresponds to [1, 0, 0, 0], G corresponds to [0, 0, 1, 0], C corresponds to [0, 1, 0, 0], and T corresponds to [0, 0, 0, 1]. Now consider a family.

$$D = \left\{ {S_{0} , S_{1} , \ldots , S_{n} } \right\}$$

of nucleotide splice site regions. We have the ordered set.

$$S_{i} = \left\{ {x_{1} , x_{2} , \ldots , x_{{\left| {S_{i} } \right|}} } \right\}$$

which \(S_{i}\) is the i-th nucleotide splice site region, and

$$x_{j} \in X = \left\{ {A, C, G, T} \right\}, 0 \le j \le \left| {S_{i} } \right|$$

For all \(0 \le j \le \left| {S_{i} } \right|\) is encoded as a \(\left| {S_{i} } \right|*\left| X \right|\) binary matrix through one-hot encoding.

Alternatively stated, if each splice site region consists of some \(N\) nucleotides, the final numerical representation for each splice site region is a \(N \times 4\) matrix, where each row is a one-hot encoded nucleotide that occurs at the same index as it did in the splice site region's original representation.

We used an easily optimizable hyperparameter tool called KerasTurner (https://keras.io/api/keras_tuner/tuners/hyperband/) for our hyperparameter search. We configure this tool based on the search space parameters as shown in Table 1. This table shows the hyper-parameters, search range, steps, and selected parameters for each CNN and DNN subgroup. To reduce the learning rate as the training proceeds, we used the TensorFlow inverse time decay schedule. For the CNNs, the parameters are initial learning rate 0.001, decay steps 140, and decay rate 0.1, while for the DNNs, the parameters are initial learning rate 0.002, decay steps 80, and decay rate 1.4. We have used a 32-batch size for each neural network model compilation and a 30-epoch for each.

Table 1 EnsembleSplice neural network hyper-parameter search space

Deep learning

Deep learning is a branch of machine learning that uses layered learning and a hierarchical learning model to enable computers to learn complex concepts [26]. Deep Learning is based on an artificial neural network that mimics the concept of brain neurons. Artificial neural networks contain neuronal connections and the ability to send inputs within layers of neurons [26]. Moreso, an artificial neural network with convolutional blocks as its fundamental layers is known as a convolutional neural network. [27,28,29]. The EnsembleSplice model combines convolutional layers and dense layers networks to receive input, transform it, and output the transformed results between layers to a simple logistics regression. In other words, they combine features and pattern extraction on the genomic acceptor and donor datasets with organized (element-wise multiplication) operations between the layer inputs and their corresponding weights. To detect these patterns, the number and size of filters are given. These filters are matrices with randomly defined values in the rows and columns, allowing for effective differentiation of true/false acceptors and donor splice sites. We tested and analyzed mean cross-validation results for the different ensemble architectures across the acceptor and donor organism datasets to find the best performing model for predicting splice locations. The architecture and model parameters are covered in detail in the EnsembleSplice pipeline section below.

EnsembleSplice pipeline

EnsembleSplice is an ensemble learning architecture made up of eight sub-models: four deep neural networks and four convolutional neural networks. The architecture of each CNN and DNN sub-models is shown in Table 2 with colored pattern representation in Fig. 2. Each sub-model's architectural design choices differ significantly. These EnsembleSplice sub-models predict whether inputted genomic DNA sequences are true or false splice regions and handle DoSS and AcSS separately, implying that there are two sets of weights, one for DoSS and the other for AcSS classification. Both DoSS and AcSS use the same sub-model architecture (the architecture of the i-th CNN is identical in both). The sub-models produce binary predictions, which are then aggregated (stacked) into a new dataset, with row i containing all sub-model predictions for data entry i, and this dataset is then fed into an output predictor (logistic regression), which produces a final set of predictions for the inputted nucleotide sequences. Each CNN sub-model in EnsembleSplice is composed of some combination of convolutional layers, a dropout layer, and max-pooling layers. The convolutional layers automatically extract local and global features from the AcSS or DoSS input sequences. These layers create complex representations of the AcSS or DoSS, allowing CNN to distinguish between true and false AcSS/DoSS with accuracy. Each convolutional layer employs the ReLU activation function as its final component; this removes noisy or otherwise irrelevant features, thus improving feature selection [30, 31]. The dropout layer prunes a percentage of each network's total convolutional nodes, which reduces model overfitting by limiting the co-dependencies each node in the network has on other nodes in the network [32]. Each CNN optimizer uses the ADAM optimizer [33] with an inverse time decay learning rate schedule during model compilation.

Table 2 EnsembleSplices’ CNNs and DNNs model architecture
Fig. 2
figure 2

EnsembleSplices’ CNNs and DNNs model architecture. This figure depicts each CNNs and DNNs base model’s architecture used in this cross-validation experiment. This Figure contains a CNN 1; b CNN 2; c CNN 3; d CNN 4; e DNN 1; f DNN 2; g DNN 3; h DNN 4, with architecture containing its respective layers and their distinct labels

Each DNN sub-model in EnsembleSplice consists of several fully connected dense layers, up to 2 dropout layers, and, in some cases, an L2 kernel regularization penalty. Similar to the CNN sub-models, the ReLU activation function and the ADAM optimizer with an inverse time decay learning rate schedule are used.

EnsembleSplice is implemented via the TensorFlow/Keras framework [34, 35]. For all experiments conducted, we use a training maximum epoch of 30. The training and validation were performed in Google Coolaboratory using Graphical Processing Unit (GPU) hardware, and the early model stopping callback, which stops training if the model's validation loss does not decrease for a predetermined number of epochs. The CNN and DNN ensemble sub-model architecture evaluation and selection is discussed in details in the cross-validation section of the Results and Discission section.

Evaluation metrics

The counts of correctly identified True AcSS or DoSS (true positive, "TP"), correctly identified False AcSS or DoSS (true negative, "TN"), incorrectly annotated True AcSS or DoSS (false positive, "FP"), and incorrectly annotated False AcSS or DoSS (false negative, "FN") are used to evaluate EnsembleSplice's classification performance and to compare EnsembleSplice with other splice site detection models used.

For this experiment, we used evaluation metrics standard to splice site detection research. This includes.

  • Accuracy (Acc): the value of AcSS and DoSS correctly identified, given by \(Accuracy = \frac{TP + TN}{{TP + TN + FP + FN}}\).

  • Precision (Pre): the fraction of positive classifications for AcSS or DoSS that were positive, given by \(Precision = \frac{TP}{{{\text{TP}} + FP}}\).

  • Sensitivity (Sn): the fraction of positive AcSS or DoSS with a positive classification (true positive rate), given by \(Sensitivity = \frac{TP}{{{\text{TP}} + {\text{FN}}}}\).

  • Specificity (Sp): the fraction of negative AcSS or DoSS with a negative classification (true negative rate), given by \(Specificity = \frac{TN}{{{\text{TN}} + {\text{FP}}}}\).

  • Matthew’s Correlation Coefficient (MCC): the correlation between true and false AcSS and DoSS and the classifications for them generated by the mode, given by \(MCC = \frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)} }}\).

  • F1 Score (F1): the harmonic mean of the fraction of positive classifications for AcSS or DoSS that were positive and the fraction of positive AcSS or DoSS that were correctly identified, given by \(F1 Score = \frac{{2 \times {\text{TP}}}}{{2 \times {\text{TP}} + {\text{FP}} + {\text{FN}}}}\).

  • Error Rate: the fraction of AcSS or DoSS incorrectly identified, given by 1 − Accuracy.

We utilized four diversity metrics described below to evaluate how well the different ensembles might generalize in our ensemble cross-validation experiments. They are as follows: correlation, double fault, disagreement, and Q-statistic. For a mathematical illustration of these diversity metrics, we use two classifiers and define \(K^{ij}\) as the number of measures for which binary vector \(s_{y, x} = i\) and \(s_{y, z} = j\). Thus, \(K^{11}\) is the number of examples that is correctly classified by the ensemble classifier [36].

Given the output of two classifiers, \(Q_{x}\) and \(Q_{Z}\):

  • Correlation: the correlation is given by \(\frac{{K^{11} K^{00} - K^{01} K^{10} }}{{\sqrt {\left( {K^{11} + K^{10} } \right)\left( {K^{01} + K^{00} } \right)\left( {K^{11} + K^{01} } \right)\left( {K^{10} + K^{00} } \right)} }}\). The correlation measure is diverse when the value is low

  • Double Fault: this measure the fraction of the misclassified examples by both classifier ensemble and is given by \(\frac{{K^{00} }}{{K^{11} + K^{10} + K^{01} + K^{00} }}\). This metric is diverse when the value is low.

  • Disagreement: the fraction between the true classifier and false classifier to the total number of examples and is given by \(\frac{{K^{01} + K^{10} }}{{K^{11} + K^{10} + K^{01} + K^{00} }}\). Disagreement measure is diverse when the value is high.

  • Q-statistics: this measure is given by \(\frac{{K^{11} K^{00} - K^{01} K^{10} }}{{K^{11} K^{00} - K^{01} K^{10} }}\). A low value shows high diversity for the Q - statistics metrics.

Performance benchmark methods

In this study, we chose existing cutting-edge splice site models iSS-CNN [17], SpliceRover [12], SpliceFinder [13] and DeepSplicer [14] for benchmark comparison with EnsembleSplice based on their training architecture, experiment datasets and recent deep-learning based splice site state-of-the-arts.

Tayara et al. [17]

iSS-CNN [17], which was trained on a subset of HS3D data, has three layers: a dropout layer that prunes 30% of the nodes, a fully connected dense layer using the Sigmoid activation function, one convolutional layer of 16 filters and kernel size 7, stride size 3, and a classification threshold of 0.5 for predicting AcSS or DoSS. The testing was done on the public web server of iSS-CNN and is accessible at http://nsclbio.jbnu.ac.kr/tools/iSS-CNN/. For evaluation, EnsembleSplice uses the same HS3D testing subset as the benchmarked iSS-CNN.

Zuallaert et al. [12]

SpliceRover [12] which is also a deep learning approach to splice site prediction was trained on human genomic DNA data and A. thaliana genomic DNA data. Its architecture consists of a convolutional layer with filters equal in number to the AcSS or DoSS length, a max-pooling layer, and a series of convolutional and max-pooling layers. A fully connected dense layer follows the convolutional layers, and the output is input to the Softmax activation function. When comparing SpliceRover to EnsembleSplice, their publicly accessible web server is used. This time, a cut of 0.5 was used. The web server can be found at the following link: http://bioit2.irc.ugent.be/rover/splicerover.

Wang et al. [13]

SpliceFinder [13] was tested on other species of datasets after being trained on the human dataset. Its classification accuracy was 90.25% and it used one-hot encoding, one convolutional layer, a fully connected layer, and Softmax. We use this method as a benchmark for evaluation comparison since it is a more recent splice site prediction method that has been published.

Akpokiro et al. [14]

DeepSplicer [14] uses a five-fold cross-validation approach for its model selection. This convolutional neural network state-of-the-art method uses three convolution neural network layers with flatten, dense, dropout, and Softmax layers in its architecture. Similar to EnsembleSplice, this method is trained and tested on Homo sapiens and A. thaliana. The models, software architecture, and datasets for SpliceFinder [13] and DeepSplicer [14] are all available in the corresponding GitHub repositories.

Results and discussion

Cross-validation

To establish a more efficient and consistent model, we performed a five-fold cross-validation experiment. Through this experiment, we estimated the splice site prediction accuracy by dividing the balanced training datasets into K equal dataset splits. This split has an equal number of true and false genomic sequences, with true and false splice sites being genomic sequence patterns with consensus AcSS AG and DoSS GT dinucleotide molecules annotated as splice sites and not annotated as splice sites, respectively. We essentially used the K-1 fold for training and the one-fold for testing for each subset of the data partitions. Finally, the reported accuracy represents the mean accuracy computed from all K data splits across each balanced genomic organism dataset. EnsembleSplice employs the StratifiedKFold [17] ML module for its k-fold (k = 5) cross-validation for each acceptor and donor organism dataset. Consequently, there were five groups from the training datasets.

We tested potential ensemble architectures using the cross-validation method on the following set:

  • Ensemble ENS1 contains all DNN’s (DNN1, DNN2, DNN3, DNN4).

  • Ensemble ENS2 contains all the CNNs (CNN1, CNN2, CNN3, CNN4).

  • Ensemble ENS3 contains all the neural network models (DNN1, DNN2, DNN3, DNN4, CNN1, CNN2, CNN3, CNN4).

  • Ensemble ENS4 consists of CNN1, CNN2, CNN3, DNN1, DNN3, this does not include the two worst DNNs and one worst CNN.

  • Ensemble ENS5 contains all the neural network sub models except the single worst CNN and DNN (DNN1, DNN3, DNN4, CNN1, CNN2, CNN3).

  • Ensemble ENS6 includes all DNNs with the worst DNN removed and all CNNs with the worst two CNNs removed (DNN1, DNN3, DNN4, CNN1, CNN2).

The architecture of each of this ensemble sub-models—that is CNNs and DNNs— are provided in Table 2, with the architecture representation in Fig. 3. All the architectures use the one-hot encoding of genome data as their input. Additionally, the output of this architecture serves as the input for a dense and dropout layer. Consequently, we compute the mean results for the evaluation and diversity metrics of the cross-validation results across the organism for each acceptor and donor dataset with results shown in the Table 3. From the table, we observe that the performance of the ENS2 architecture is highly competitive across all the diversity metrics. Importantly, this architecture outperformed the competition in accuracy metrics and error rates for the acceptor and donor splice site datasets for the benchmark organisms. Thus, we selected the ENS2 as the representative EnsembleSplice model. The evaluation metrics section explains the metrics used in this experiment and the Fig. 4 outlines the entire architecture of the ENS2 model, from input, one-hot encoding of the genome data, to output specifying the false and true AcSS/DoSS splice site prediction score.

Fig. 3
figure 3figure 3figure 3

Cross-Validation Ensemble model architecture. These are the architectural representation of each Ensemble model architecture and their individual base model architecture combination used in the cross-validation experiment. This contain a Ensemble ENS1 contains all DNN’s (DNN1, DNN2, DNN3, DNN4); b Ensemble ENS2 contains all the CNNs (CNN1, CNN2, CNN3, CNN4); c Ensemble ENS3 contains all the neural network models (DNN1, DNN2, DNN3, DNN4, CNN1, CNN2, CNN3, CNN4); d Ensemble ENS4 consists of CNN1, CNN2, CNN3, DNN1, DNN3; e Ensemble ENS5 consists of DNN1, DNN3, DNN4, CNN1, CNN2, CNN3; f Ensemble ENS6 consist of DNN1, DNN3, DNN4, CNN1, CNN2. We selected the Ensemble ENS2 from our cross-validation experiment

Table 3 The cross-validation results for the dataset for the genomic organisms
Fig. 4
figure 4

EnsembleSplice architectural pipeline. This figure depicts the Ensemble architecture used for this experiment. This contains the one-hot encoded datasets, the ensemble neural network combination, prediction and label, and the logistics regression and evaluation

Performance evaluation

We evaluated and compared EnsembleSplice performance to the benchmarked methods based on the metrics described above in the evaluation metrics section and the datasets as discussed in the datasets section with the state-of-the-art methods considered because of their deep learning application. EnsembleSplice outperforms all other methods for the HS3D acceptor datasets, with the exception of the precision metrics, where DeepSplicer outperformed EnsembleSplice by a factor of 1.05%. Furthermore, our approach outperforms other cutting-edge methodologies and records an accuracy of 93.79% and a reduced error rate of 6.36%. With an improved accuracy of 96.25% and a reduced error rate of 3.81% in the HS3D donor datasets, EnsembleSplice outperforms competing methods. We continued to test EnsembleSplice to predict splice sites in the A. thaliana genomic dataset and discovered that it performed better than other methods in every metric for both the acceptor and donor genomic dataset organisms. We tested and compared other splice site models on the Homo-sapiens datasets in order to demonstrate EnsembleSplice's consistency in predicting the splice site. In the acceptor and donor datasets, EnsembleSplice records data with higher accuracy and lower error rates than other methods. In the Table 4 result, N/A denoted results for methods of no known datasets model.

Table 4 The Evaluation performance comparison results

Based on the results we have observed and reported above, we can conclude that each of our research objectives have been fulfilled. We have successfully developed a deep ensemble model architecture algorithm for splice site prediction (objective point 1). EnsembleSplice is the first deep ensemble model architecture algorithm proposed for splice site prediction. Our method records an outstanding performance in comparison to the state-of-the-art methods and across the evaluation metrics, especially in accuracy and error rate, as shown in Table 4. This superior performance can be attributed to both the use of individually effective DNN and CNN architectures for splice site prediction and the use five-fold cross-validation to select the best ensemble architecture capable of generalizing for maximum performance and the diversity of our ensemble-based model to provide model performance robustness (objective point 2). Comparing our stable and successful model to other state-of-the-art models, Table 4 demonstrates how the use of ensemble learning for splice site prediction out-performs other cutting-edge models (objective point 3).

Impact and benefit of this study

The primary appeal of deep learning for splice site prediction is that it is more accurate than earlier machine learning methods, especially ones that involved manual feature selection. Although deep learning is somewhat more computationally intensive, it is effective for solving complex problems, which has been its second major appeal. EnsembleSplice further benefits biological research involving splice site classification in that its deep ensemble architecture outperforms individual deep learning networks and exceeds state-of-the-art performance in splice site prediction, not just in terms of accuracy, but also in terms of other classification metrics, such as precision and sensitivity, because of the diverse combination of its base models. Additionally, in this study, we adopt the stacked ensemble learning algorithm which has the major advantage of using a variety of effective models to accomplish classification or regression tasks and produce predictions that perform better than any one model in the ensemble. In our benchmarking results, the performance of EnsembleSplice’s all-CNN stacked ensemble model demonstrates the advantages of using an ensemble architecture over a single CNN model for the prediction of splice sites, and this knowledge may be applied and transferred to other domains to address still unsolved complex regression or classification problems.

EnsembleSplice model interpretability

To increase the model's interpretability, we isolated and showed the motifs that drive our EnsembleSplice model's deep learning processes. Understanding the underlying pattern of the genomic sequence by generating the contribution scores of the sequence window is required for implementation. We used the WebLogo [37] (http://weblogo.threeplusone.com/create.cgi) web server was also used to illustrate the sequence logo for our model interpretability test outputs. WebLogo is a web-based tool for efficiently generating sequence logos from genomic datasets sequence alignment. This genomic sequence logo displays the weighted average nucleotide base position contribution score for the genomic sequence organism. To show the contributions of genomic motifs in each positive and negative acceptor and donor organism dataset, we use the entire HS3D sequence length of 140. Figure 5a indicates that the nucleotide sequence AG contributes significantly to the HS3D acceptor positive splice sites, as Fig. 5b shows that the nucleotide sequence AG contributes significantly to HS3D acceptor negative splice sites. While Fig. 5c shows the nucleotide sequence GT contributes significantly to the HS3D donor positive splice sites as Fig. 5d indicates the nucleotide sequence GT contributes significantly to the HS3D donor negative splice sites. According to this figure, the nucleotide consensus AG for AcSS regions occurs at positions 69 and 70 and the nucleotide consensus GT for DoSS regions occurs at positions 71 and 72 for the HS3D datasets. This figure also validates that the splice site distribution is most significant in sequence region position 70.

Fig. 5
figure 5figure 5

EnsembleSplice model interpretability. This figure is a sequence logo to visualize the importance score for each nucleotide per position for the HS3D datasets. a indicates the acceptor positive splice sites, as b shows that acceptor negative splice sites. While c shows the donor positive splice sites as d indicates the donor negative splice sites

Conclusion

Inspired by the stacking ensemble machine learning method, we introduce a method that combines heterogeneous base neural network models, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different base model predictions. EnsembleSplice has the advantage of balancing out the base model's flaws and produces a diverse and stable model that can be applied to both competitive, industrial, and academic research applications. EnsembleSplice has consistently shown competitive performance on all metrics used when compared to other methods considered in this experiment. As it contributes computationally to the foundation of protein synthesis and gene expression, this tool finds use in both industrial and academic research applications. In our future work, we will test the generalization strength of the EnsembleSplice model for the prediction of splice sites in DNA sequences across a variety of species.

Availability of data and materials

The datasets, models and codebase for this study are available at https://github.com/OluwadareLab/EnsembleSplice. The Python source codes for EnsembleSplice are available at https://github.com/OluwadareLab/EnsembleSplice.

Abbreviations

SS:

Splice site

CNN:

Convolutional neural networks

DL:

Deep learning

ReLU:

Rectified linear unit

ML:

Machine learning

SVM:

Support vector machines

MM:

Markov model

MDD:

Maximum dependency decomposition

AG:

Adenine–Guanine

GT:

Guanine–Thymine

References

  1. Pohl M, Bortfeldt RH, Grützmann K, Schuster S. Alternative splicing of mutually exclusive exons—a review. Biosystems. 2013;114(1):31–8.

    Article  CAS  Google Scholar 

  2. Pertea M, Lin X, Salzberg SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001;29(5):1185–90.

    Article  CAS  Google Scholar 

  3. Abril JF, Castellano Hereza S. Genome annotation. Elsevier; 2019.

    Google Scholar 

  4. de Sá PH, Guimarães LC, Das Graças DA, de Oliveira Veras AA, Barh D, Azevedo V, Ramos RT. Next-generation sequencing and data analysis: strategies, tools, pipelines and protocols. In: Omics technologies and bio-engineering. Academic Press; 2018. p. 191–207.

    Google Scholar 

  5. Ho LS, Rajapakse JC. Splice site detection with a higher-order Markov model implemented on a neural network. Genome Inf. 2003;14:64–72.

    CAS  Google Scholar 

  6. Huang W, Umbach DM, Ohler U, Li L. Optimized mixed Markov models for motif identification. BMC Bioinform. 2006;7(1):1–17.

    Article  Google Scholar 

  7. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268(1):78–94.

    Article  CAS  Google Scholar 

  8. Baten AK, Halgamuge SK, Chang BC. Fast splice site detection using information content and feature reduction. BMC Bioinform. 2008;9(12):1–12.

    Google Scholar 

  9. Goel N, Singh S, Aseri TC. A review of soft computing techniques for gene prediction. International Scholarly Research Notices, (2013).

  10. Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinform. 2007;8(10):1–16.

    Google Scholar 

  11. Zhang Q, Peng Q, Zhang Q, Yan Y, Li K, Li J. Splice sites prediction of human genome using length-variable Markov model and feature selection. Expert Syst Appl. 2010;37(4):2771–82.

    Article  Google Scholar 

  12. Zuallaert J, Godin F, Kim M, Soete A, Saeys Y, De Neve W. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics. 2018;34(24):4180–8.

    Article  CAS  Google Scholar 

  13. Wang R, Wang Z, Wang J, Li S. SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinform. 2019;20(23):1–13.

    Google Scholar 

  14. Akpokiro V, Oluwadare O, Kalita J. DeepSplicer: an improved method of splice sites prediction using deep learning. In: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). 2021. pp. 606–609

  15. Du X, Yao Y, Diao Y, Zhu H, Zhang Y, Li S. Deepss: exploring splice site motif through convolutional neural network directly from DNA sequence. IEEE Access. 2018;6:32958–78.

    Article  Google Scholar 

  16. Thompson J, Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Poch O. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinform. 2021;22(1):1–26.

    Google Scholar 

  17. Tayara H, Tahir M, Chong KT. iSS-CNN: identifying splicing sites using convolution neural network. Chemom Intell Lab Syst. 2019;188:63–9.

    Article  CAS  Google Scholar 

  18. Dutta A, Singh KK, Anand A. SpliceViNCI: visualizing the splicing of non-canonical introns through recurrent neural networks. J Bioinform Comput Biol. 2021;19(04):2150014.

    Article  Google Scholar 

  19. Buyrukoğlu S, Savaş S. Stacked-based ensemble machine learning model for positioning footballer. Arab J Sci Eng. 2022. https://doi.org/10.1007/s13369-022-06857-8.

    Article  Google Scholar 

  20. Buyrukoğlu G, Buyrukoğlu S, Topalcengiz Z. Comparing regression models with count data to artificial neural network and ensemble models for prediction of generic Escherichia coli population in agricultural ponds based on weather station measurements. Microb Risk Anal. 2021;19: 100171.

    Article  Google Scholar 

  21. Buyrukoğlu S. Improvement of machine learning models’ performances based on ensemble learning for the detection of Alzheimer disease. In 2021 6th International Conference on Computer Science and Engineering (UBMK). 2021. pp. 102–106.

  22. Pollastro P, Rampone S. HS3D, a dataset of Homo Sapiens splice regions, and its extraction procedure from a major public database. Int J Mod Phys C. 2002;13(08):1105–17.

    Article  CAS  Google Scholar 

  23. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Flicek P. Ensembl 2018. Nucleic Acids Res. 2018;46(D1):D754–61.

    Article  CAS  Google Scholar 

  24. Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, Jankovic BR. Splice2Deep: an ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene. 2020;763: 100035.

    Article  Google Scholar 

  25. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.

    Article  CAS  Google Scholar 

  26. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT press. 2016.

  27. Ren A, Li Z, Ding C, Qiu Q, Wang Y, Li J, Yuan B. Sc-dcnn: highly-scalable deep convolutional neural network using stochastic computing. ACM SIGPLAN Notices. 2017;52(4):405–18.

    Article  Google Scholar 

  28. Bačanin Džakula N. Convolutional neural network layers and architectures. In Sinteza 2019-International Scientific Conference on Information Technology and Data Related Research. Singidunum University; 2019. pp. 445–451.

  29. Tammina S. Transfer learning using VGG-16 with deep convolutional neural network for classifying images. Int J Sci Res Publ (IJSRP). 2019;9(10):143–50.

    Google Scholar 

  30. Hahnloser RH, Sarpeshkar R, Mahowald MA, Douglas RJ, Seung HS. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature. 2000;405(6789):947–51.

    Article  CAS  Google Scholar 

  31. Krizhevsky A, Hinton G. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 2010;40(7): 1–9.

  32. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

    Google Scholar 

  33. Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.

  34. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Zheng X. {TensorFlow}: a system for {Large-Scale} machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 2016. pp. 265–283.

  35. Chollet F. Keras: The python deep learning library. Astrophysics source code library, ascl-1806. (2018)

  36. Johansson U, Lofstrom T, Niklasson L. The importance of diversity in neural network ensembles-an empirical investigation. In: 2007 International Joint Conference on Neural Networks. 2007. pp. 661–666.

  37. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14(6):1188–90.

    Article  CAS  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

This work was supported by the National Science Foundation [2050919]. The APC was funded by the start-up funding from the University of Colorado, Colorado Springs [to O.O.]. Any opinions, findings and conclusions or recommendations expressed in this work are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Authors

Contributions

OO conceived the project. TM and OO designed the algorithm. TM implemented the algorithm and drafted the initial manuscript. VA wrote and revised the manuscript and performed the statistical and simulation analyses. VA, TM, and OO evaluated the results. All authors reviewed the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Oluwatosin Oluwadare.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Akpokiro, V., Martin, T. & Oluwadare, O. EnsembleSplice: ensemble deep learning model for splice site prediction. BMC Bioinformatics 23, 413 (2022). https://doi.org/10.1186/s12859-022-04971-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-022-04971-w

Keywords