
A sequence-based multiple kernel model for identifying DNA-binding proteins

Abstract

Background

DNA-Binding Proteins (DBP) play a pivotal role in biological systems, and a growing number of researchers are studying their mechanisms and detection methods. Traditional experimental methods for detecting DBP are time-consuming and resource-intensive. In recent years, machine learning methods have been used to detect DBP. However, it is difficult to adequately describe the information contained in proteins when predicting DNA-binding proteins. In this study, we extract six features from the protein sequence and use Multiple Kernel Learning based on Centered Kernel Alignment (MKL-CKA) to integrate these features. The integrated feature is fed into a Support Vector Machine (SVM) to build a predictive model and detect new DBP.

Results

In our work, the PDB1075 and PDB186 data sets are employed to test our method. Our model obtains better accuracy than other existing methods on PDB1075 (\(84.19\%\)) and PDB186 (\(83.7\%\)).

Conclusion

Multiple kernel learning can fuse the complementary information between different features. Compared with existing methods, our method achieves comparable or better results on benchmark data sets.

Background

DNA-Binding Proteins (DBP) play a vital role in the function of various biomolecules, including DNA transcription and replication. To detect DNA-binding proteins via biological assays, researchers usually employ the electrophoretic mobility shift assay, chromatin immunoprecipitation, the Yeast One-hybrid System (Y1H) and X-ray crystallography. However, these methods are time-consuming and extremely expensive. Machine learning-based methods have therefore been developed to solve the problem of detecting DNA-binding proteins [1,2,3].

In the identification of DNA-binding proteins, the main task is to determine whether an unknown protein can bind to DNA. In previous works, many researchers detected DBP based on structural information. Nimrod et al. [4] constructed a random forest prediction model for DNA-binding protein recognition using the average surface electrostatic potential, dipole moment, and amino acid conservation pattern information. Bhardwaj et al. [5] used overall charge, surface patches and composition features to train a predictive model via Support Vector Machine (SVM) [6]. Ahmad et al. [7] trained a neural network model to predict DBP, using the net charge of the protein, the electric dipole moment and the fourth moment tensor as features.

The number of known protein sequences is far larger than the number of known protein structures, so most proteins have no corresponding structural information. Therefore, structure-based models cannot be widely used to detect DBP. A method based on protein sequence [8] constructed a Support Vector Machine (SVM) model with amino acid composition and materialized property information. Liu and Cai et al. [9,10,11] extracted overall amino acid composition and Pseudo Amino Acid Composition (PseAAC) to represent protein features. Liu et al. [12] developed a model called iDNAPro-PseAAC, which is extended with evolutionary information of the protein sequence. Kumar et al. [13] used the Position Specific Scoring Matrix (PSSM) to propose an SVM-based classifier called DNAbinder. The PSSM was produced via the PSI-BLAST software [14] and captures evolutionary conservation information. Local-DPP [1] captured local conservation information of the PSSM and trained an ensemble model to predict DBP. DBPPred [15] employed Random Forest (RF) to get the optimal feature subset and trained a Gaussian Naive Bayes model for predicting DBP. Zou et al. utilized a Fuzzy Kernel Ridge Regression model with Multi-View Sequence Features (FKRR-MVSF) [16] to predict DBP. To further improve the accuracy of DBP prediction, Ding et al. [17] employed a Multi-Kernel SVM based on Heuristically Kernel Alignment (MKSVM-HKA) to integrate different features from the protein sequence. In addition, a multiple kernel-based fuzzy SVM model [18] of DNA-binding proteins was also developed to improve prediction performance. Liu et al. [19] proposed a stacking framework model, named MSFBinder, for predicting DBP by orchestrating multi-view features. Rahman et al. [20] developed a DNA-binding Protein Prediction model using Chou general PseAAC (DPP-PseAAC) and an SVM-based Recursive Feature Elimination (RFE) approach. Adilina et al. [21] extracted several features via PseAAC and carried out two different types of feature selection to build a predictive model of DBP.

In practical applications, sequence-based approaches are more adaptable. DNA-methylation sites, recombination spots, Post Translational Modification (PTM) sites (protein) and Protein-Protein Interactions (PPI) have been predicted by sequence-based methods. In recent years, machine learning methods have been widely used in bioinformatics [16, 17, 22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]. Some biological problems have been solved very well, including O-GlcNAcylation sites [23], protein subcellular localization [25, 39, 40], Methyladenosine Sites [22, 26], drug-target interactions [27,28,29,30,31, 37, 41], drug-drug interactions [42, 43], lncRNA-Protein interaction [35, 36], protein crystallization prediction [32, 44], potential disease-associated microRNAs [24, 33, 34, 45, 46] and other RNAs [47,48,49,50].

Inspired by previous work [1, 8, 9, 11, 13, 16, 17], we propose a new predictive model for DNA-binding proteins based on a multi-kernel support vector machine. Firstly, several types of features are extracted from protein sequences, and these features are employed to construct kernel matrices. We use the Multi-Kernel Learning based on Centered Kernel Alignment (MKL-CKA) algorithm to combine these kernels and obtain an integrated kernel for training an SVM model. We call this model the Multi-Kernel SVM (MKSVM) model. Finally, MKSVM is utilized to detect new DNA-binding proteins. Compared with other state-of-the-art models, the proposed method achieves better results. The accuracy of our model is \(84.19\%\) and \(83.7\%\) on the PDB1075 (leave one out test) and PDB186 (independent test) data sets, respectively.

Results

In this section, we test our method on the PDB1075 and PDB186 data sets. Firstly, we perform Leave One Out Cross Validation (LOOCV) on PDB1075. Next, our model is trained on PDB1075 and tested on PDB186. Other existing methods are also tested on PDB1075 and PDB186. The data sets and source code (in the Python programming language) can be obtained from https://figshare.com/s/cf56cef6659c7eed16c9.

Data sets

The details of the PDB1075 and PDB186 data sets are listed in Table 1. The benchmark data sets (PDB1075 and PDB186) are selected from the Protein Data Bank (PDB) [51]. No two sequences share more than \(25\%\) similarity. Protein sequences that are shorter than 50 amino acids or contain the ‘X’ character are removed. The PDB1075 data set (constructed by Liu et al. [9]) is used to test our model under LOOCV. The PDB186 data set (constructed by Lou et al. [15]) is used for independent testing.
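The length and ‘X’ filters described above can be sketched in a few lines of Python. This is only an illustration: the function names `keep_sequence` and `filter_sequences` are hypothetical, and the \(25\%\) similarity filter is not covered because it requires a sequence clustering tool that the text does not name.

```python
def keep_sequence(seq: str, min_length: int = 50) -> bool:
    """Return True if a sequence passes the two filters stated in the text:
    at least `min_length` residues and no ambiguous 'X' residue."""
    seq = seq.upper()
    return len(seq) >= min_length and "X" not in seq


def filter_sequences(records: dict) -> dict:
    """Keep only the (name -> sequence) entries that pass `keep_sequence`."""
    return {name: seq for name, seq in records.items() if keep_sequence(seq)}


# Hypothetical toy records, not taken from PDB1075 or PDB186.
toy = {"p1": "A" * 50, "p2": "A" * 49, "p3": "A" * 60 + "X"}
kept = filter_sequences(toy)  # only "p1" survives both filters
```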

Table 1 The detail information of two benchmark data sets
Table 2 The ACC of different parameter values on PDB1075 (five-fold cross validation)

Measurements

The main performance measures are Accuracy (ACC), Matthew’s Correlation Coefficient (MCC), Sensitivity (SN), Specificity (SP), and Area Under the ROC curve (AUC). ACC, SN, SP and MCC are calculated as follows:

$$\begin{aligned} ACC&=\frac{TP+TN}{TP+FP+TN+FN} \end{aligned}$$
(1a)
$$\begin{aligned} SN&=\frac{TP}{TP+FN} \end{aligned}$$
(1b)
$$\begin{aligned} SP&=\frac{TN}{TN+FP} \end{aligned}$$
(1c)
$$\begin{aligned} MCC&=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FN)\times (TN+FP) \times (TP+FP) \times (TN+FN)}} \end{aligned}$$
(1d)

where TP is the number of correctly predicted positive samples (true positives), TN is the number of correctly predicted negative samples (true negatives), FN is the number of false negative samples and FP is the number of false positive samples. The AUC is obtained by calculating the area under the Receiver Operating Characteristic (ROC) curve. The higher the AUC, the better the predictive performance.
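As a worked illustration of Eqs. (1a) to (1d), the following sketch computes the four count-based measures from a confusion matrix. The function name `binary_metrics` and the example counts are hypothetical, and the code assumes that both classes are present so that no denominator in SN or SP is zero.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute ACC, SN, SP and MCC from confusion-matrix counts (Eqs. 1a-1d)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)   # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)   # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention assumed here: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts for illustration only.
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

With these toy counts, ACC is 0.85, SN is 0.80, SP is 0.90 and MCC is about 0.70.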

Parameters selection

To achieve the best performance, we need to select the optimal parameters of the predictive model. In this section, we employ the grid search method to select the optimal parameters for the SVM model.

The parameters selection of features

To select the optimal parameters of the NMBAC and PsePSSM features, we test different values of lg (NMBAC) and \(lag_{max}\) (PsePSSM) under five-fold cross validation on the PDB1075 data set. Both parameters range from 5 to 45 in steps of 5. The results in Table 2 show that the optimal values in this study are 30 for lg (NMBAC) and 10 for \(lag_{max}\) (PsePSSM).

Selection of C and \(\boldsymbol{\gamma}\)  

For the selection of SVM parameters, we use the grid search method with 5-fold Cross Validation (5-CV). Each parameter ranges from \(2^{-5}\) to \(2^{5}\), with a multiplicative step of \(2^{1}\). The optimal parameters are shown in Table 3.
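The search grid described above can be written out explicitly. The sketch below enumerates the candidate values of C and \(\gamma\) and picks the best pair; the `score` callable stands in for the 5-CV accuracy of an SVM and is an assumption of this example, as is the name `grid_search`.

```python
from itertools import product

# Candidate values 2^-5, 2^-4, ..., 2^5 (11 values per parameter).
values = [2.0 ** k for k in range(-5, 6)]
grid = list(product(values, values))  # 121 (C, gamma) pairs


def grid_search(score, candidates):
    """Return the (C, gamma) pair with the highest score.

    `score(C, gamma)` is a placeholder for a cross-validated accuracy estimate."""
    return max(candidates, key=lambda pair: score(*pair))


# Toy score with a known optimum at C = 2, gamma = 0.5 (illustration only).
best = grid_search(lambda C, gamma: -(abs(C - 2.0) + abs(gamma - 0.5)), grid)
```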

Table 3 The optimal parameters for SVM (single kernel)
Table 4 The performance of different kernels (RBF kernel) on PDB1075 data set (leave one out)
Table 5 The weight of six kernels (RBF kernel) by MKL-CKA
Table 6 The sensitivity of different kernels (features) on PDB1075 data set (under the specificity of 0.5)

Before combining multiple kernels, the parameter \(\gamma\) of each of the 6 kernels is taken from its single-kernel model (Table 3). To find the optimal C for MKSVM (average weight for each kernel), we search the same range of C as above. The ACC values for the different values of C are shown in Fig. 1. When \(C=2 \ (\log _{2}C=1)\), MKSVM (average weight for each kernel) achieves the best ACC (\(82.8\%\)). In our study, MKSVM (with MKL-CKA) uses the same value of C as MKSVM with the mean weighted kernel.

Fig. 1 The ACC values for different values of C on the PDB1075 data set (five-fold cross validation)

To obtain the optimal parameter (\(\lambda\)) of MKL-CKA, we try values of \(\lambda\) from 0 to 1 (in steps of 0.05) under 5-CV on the PDB1075 data set. The results are shown in Fig. 2. The ACC value is highest when \(\lambda = 0.8\), so we set 0.8 as the optimal parameter (\(\lambda\)) of MKL-CKA.

Fig. 2 The ACC values for different values of \(\lambda\) on the PDB1075 data set (five-fold cross validation)

Performance analysis on PDB1075

We test the performance of different kernels (features) on PDB1075 (under LOOCV). The results are shown in Table 4 and Fig. 3.

Fig. 3 The ROC comparison of different kernels (features) via leave one out test on the PDB1075 data set

As Table 4 shows, multi-kernel learning performs much better than any single-kernel model. The kernels built on PSSM information, namely PSSM-AB (MCC: 0.547), PSSM-DWT (MCC: 0.522) and PsePSSM (MCC: 0.573), perform better than GE (MCC: 0.432), MCD (MCC: 0.417) and NMBAC (MCC: 0.424). We calculate the weights of the six kernels by the MKL-CKA method (Table 5). The integrated kernel (with MKL-CKA) has the highest ACC (\(84.2\%\)), MCC (0.684), SN (\(85.9\%\)), SP (\(82.6\%\)) and AUC (0.914), and it outperforms the mean weighted kernel.

Table 6 shows the sensitivity of different kernels (features) on the PDB1075 data set at a specificity of 0.5. The sensitivity values are as follows: \({\mathbf {K}}_{GE}\): 0.8857, \({\mathbf {K}}_{MCD}\): 0.8495, \({\mathbf {K}}_{NMBAC}\): 0.8590, \({\mathbf {K}}_{PSSM-AB}\): 0.9352, \({\mathbf {K}}_{PsePSSM}\): 0.9657, \({\mathbf {K}}_{PSSM-DWT}\): 0.9523, mean weighted kernel: 0.9847, and \({\mathbf {K}}_{MKL-CKA}\): 0.9885. Some kernels are biased in the learning process. MKL-CKA can filter out noisy kernels (reducing the bias of kernels) by assigning them low weights. The sensitivity of MKL-CKA (0.9885) is better than that of the best single kernel in the list above (\({\mathbf {K}}_{PsePSSM}\): 0.9657). Although our MKL algorithm improves the sensitivity by only a few percentage points, the purpose of MKL is to filter out noisy features (kernels) and integrate multiple effective features.

Table 7 The running time of different kernels (features) on PDB1075 data set (training)
Table 8 The performance of different kernel functions on PDB1075 data set (Five-fold cross validation)

We also evaluate the running time of different models with different kernels. The results are shown in Table 7. The programs were run on a computer with an Intel Core i5 3.2 GHz CPU and 8 GB of RAM. The running times (s) of our methods are \({\mathbf {K}}_{GE}\): 0.418, \({\mathbf {K}}_{MCD}\): 3.79, \({\mathbf {K}}_{NMBAC}\): 0.627, \({\mathbf {K}}_{PSSM-AB}\): 0.678, \({\mathbf {K}}_{PsePSSM}\): 3.7, \({\mathbf {K}}_{PSSM-DWT}\): 3.47, mean weighted kernel: 28.7, and MKL-CKA: 68. Because multiple kernel matrices must be calculated and the weight of each kernel matrix must be estimated, MKL-CKA is the most time-consuming.

In addition, other kernel functions (linear, polynomial and sigmoid kernels) are also tested. We compare the RBF kernel with these 3 kernel functions under five-fold cross validation. The results are listed in Table 8, which shows that the RBF kernel obtains better ACC on GE (\(69.97\%\)), MCD (\(70.21\%\)), PSSM-AB (\(76.54\%\)), PSSM-DWT (\(76.26\%\)) and PsePSSM (\(78.36\%\)). MKL-CKA is also employed to combine the 6 features under each of the four kernel functions. The RBF kernel (with MKL-CKA) achieves the best ACC (\(83.01\%\)).

Comparison to existing predictors on PDB1075

Table 9 Compared with existing methods on PDB1075 data set (LOOCV)

The MKSVM (with MKL-CKA) model and other methods are also tested on the PDB1075 data set (under LOOCV). The results for ACC, MCC, SN and SP are listed in Table 9. Existing methods include iDNA-Prot|dis [2], DNAbinder [13], iDNAPro-PseAAC [10], Kmer1+ACC [12], iDNA-Prot [52], DNA-Prot [53], PseDNA-Pro [9], MKSVM-HKA [17], MSFBinder [19], FKRR-MVSF [16] and Local-DPP [1]. Among these methods, MKSVM-HKA (MCC: 0.63), MSFBinder (MCC: 0.67), FKRR-MVSF (MCC: 0.67), iDNAPro-PseAAC (MCC: 0.53), PseDNA-Pro (MCC: 0.53), iDNA-Prot|dis (MCC: 0.54) and Local-DPP (MCC: 0.59) also obtained good performance. Local-DPP and iDNAPro-PseAAC take advantage of the PSSM feature to improve performance. MKSVM-HKA, FKRR-MVSF and MSFBinder employed MKL algorithms and ensemble strategies to integrate multiple sources of information and further improve the predictive accuracy. Our method (MKSVM with MKL-CKA) is also based on MKL and achieves the best MCC (0.68). Although the SP value of MSFBinder (\(83.09\%\)) is higher than that of our method (\(82.55\%\)), our method is the highest in ACC (\(84.19\%\)), MCC (0.68) and SN (\(85.91\%\)).

Statistical significance tests of the differences are necessary. Table 10 shows that our method makes a statistically significant improvement over the other methods (P-value \(<0.05\), by t-test, in terms of MCC). The comparison is made under 10-fold cross validation on PDB1075. The difference between Local-DPP and our method is significant (P-value: 6.0421E\(-\)6). Compared with MKSVM-HKA (P-value: 1.5438E\(-\)4), MSFBinder (P-value: 0.0098) and FKRR-MVSF (P-value: 0.0103), our method also shows significantly better prediction accuracy.

Table 10 The statistics of different methods
Table 11 The results of comparison between MKSVM (with MKL-CKA) model and other existing methods on PDB186 data set (independent test)

Independent test

To further evaluate the performance of the MKSVM (with MKL-CKA) model, we use PDB1075 to construct the MKSVM model and test it on the PDB186 data set. The results of the comparison are shown in Table 11.

Our method achieves \(83.7\%\), 0.691, \(93.6\%\), and \(74.2\%\) on ACC, MCC, SN, and SP, respectively. The results of the independent test indicate that our method predicts DBP with reasonable accuracy. Adilina’s work (MCC: 0.670), MKSVM-HKA (MCC: 0.648), MSFBinder (MCC: 0.616) and FKRR-MVSF (MCC: 0.676) obtained good results on PDB186. Adilina et al. [21] employed 7 types of features and a feature selection strategy to construct a predictive model. FKRR-MVSF [16] and MKSVM-HKA [17] utilized MKL algorithms to combine several features. MSFBinder [19] built a stacking framework model from multiple features. These methods, which are based on the fusion of multiple sources of information, achieved better results. Our method (MKSVM with MKL-CKA) performs better (MCC: 0.691) than most of the existing models on the PDB186 data set. These results indicate that the fusion of multiple sources of information can improve the performance of the prediction model. We also test the performance of Random Forest (RF) and Feed forward Neural Network (FNN) on PDB186. RF and FNN achieve MCC of 0.593 and 0.520, respectively, which suggests that SVM can achieve better performance on small data sets.

Discussion

The main difficulty in predicting DNA-binding proteins is how to describe and integrate the information contained in proteins. In our study, MKL-CKA is utilized to integrate 6 types of features and achieves better results on the PDB1075 (MCC: 0.68) and PDB186 (MCC: 0.69) data sets. Other methods, such as FKRR-MVSF, MKSVM-HKA, MSFBinder and Adilina’s work, also obtained good performance. We find that methods based on the fusion of multiple sources of information generalize better on DBP prediction. To obtain the optimal weights of kernels, MKL-CKA maximizes the alignment score between the feature space and the label space. The ideal kernel (label space) contains the category information of the training samples. The Laplacian smoothing term can further optimize the weight values. The performance of MKL-CKA (MCC: 0.684) is better than that of the mean weighted kernel (MCC: 0.664) on PDB1075 (LOOCV). The process of MKL is similar to feature selection: MKL assigns a weight to each kernel matrix (one per feature type). Whether the predictive models are based on MKL or on feature selection, noisy features can be effectively filtered out.

Conclusion

Although many models have been constructed to predict DBP, they can still be optimized to improve accuracy. Existing methods do not consider the removal of outliers in data sets. In the future, we will filter noise samples and improve the predictive accuracy of DBP by fuzzy theory and ensemble strategy.

Methods

DBP identification can be considered a traditional binary classification problem, and we use the SVM algorithm to construct the predictive model. Figure 4 shows the framework of MKSVM (with MKL-CKA). Firstly, six types of features are extracted from the protein sequences. Then, six kernel matrices are built from these features with the Radial Basis Function (RBF). The Multi-Kernel Learning based on Centered Kernel Alignment (MKL-CKA) algorithm combines the 6 kernels into an optimal kernel (including a training kernel and a testing kernel). Finally, we use the combined kernel and the SVM algorithm to construct the final predictive model and detect DBP.

Fig. 4 The framework of our method

Sequence feature

There are six types of features from protein sequence information, including PSSM-based Discrete Wavelet Transform (PSSM-DWT) [54], PSSM-based Average Blocks (PSSM-AB) [55], Pseudo-PSSM (PsePSSM) [10, 12, 56, 57], Multi-scale Continuous and Discontinuous descriptor (MCD) [58], Global Encoding (GE) [59] and Normalized Moreau-Broto Auto Correlation (NMBAC) [60, 61]. These features are described in detail in the related literature. We employ the RBF to construct six types of kernels. The RBF is defined as follows:

$$\begin{aligned} K_{ij}=K({\mathbf {x}}_{i},{\mathbf {x}}_{j}) = exp(-\gamma \Vert {\mathbf {x}}_{i} - {\mathbf {x}}_{j} \Vert ^{2}), \ i,j=1,2,...,N \end{aligned}$$
(2)

where \(\gamma\) is the kernel bandwidth. We can obtain a kernel set \({\mathbf {K}}\) as follows:

$$\begin{aligned} {\mathbf {K}}= \left\{ {\mathbf {K}}_{GE}, {\mathbf {K}}_{MCD}, {\mathbf {K}}_{NMBAC}, {\mathbf {K}}_{PSSM-AB}, {\mathbf {K}}_{PSSM-DWT}, {\mathbf {K}}_{PsePSSM} \right\} \end{aligned}$$
(3)
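Equations (2) and (3) can be illustrated with a short sketch that builds one RBF kernel matrix per feature type. Plain Python lists are used to keep the example dependency-free; the names `rbf_kernel` and `build_kernel_set`, the toy feature vectors and the \(\gamma\) values are assumptions of this example, not the authors' implementation.

```python
import math


def rbf_kernel(X, gamma):
    """Return the N x N matrix K_ij = exp(-gamma * ||x_i - x_j||^2) of Eq. (2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * sq_dist)  # symmetric by construction
    return K


def build_kernel_set(features, gammas):
    """Build one kernel per feature type, as in the kernel set of Eq. (3).

    `features` maps a feature name to its N x d matrix; `gammas` maps the same
    names to the bandwidth chosen for that feature."""
    return {name: rbf_kernel(X, gammas[name]) for name, X in features.items()}


# Toy feature vectors for three samples and two feature types (illustration only).
features = {"GE": [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
            "MCD": [[1.0], [1.5], [3.0]]}
kernels = build_kernel_set(features, {"GE": 0.5, "MCD": 1.0})
```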

Support vector machine

Support Vector Machine (SVM) is a classification algorithm developed by Vapnik [6]. By finding the optimal hyperplane, it separates the data set into positive and negative points. Each training sample is an instance-label pair \(\{{\mathbf {x}}_{i},y_{i}\}\), with \({\mathbf {x}}_{i}\in {\mathbf {R}}^{d \times 1}\), \(i=1,2,...,N\), and label \(y_{i}\in \{ +1,-1\}\). The decision function is defined as follows:

$$\begin{aligned} f({\mathbf {x}}) = sign[\sum _{i=1}^N y_{i}\alpha _{i}\cdot K({\mathbf {x}},{\mathbf {x}}_{i})+b] \end{aligned}$$
(4)

The coefficients \(\pmb {\alpha }\) are estimated by solving a Quadratic Programming (QP) problem:

$$\begin{aligned}&Maximize \quad \sum _{i=1}^N \alpha _{i} - \frac{1}{2}\sum _{i=1}^N \sum _{j=1}^N \alpha _{i}\alpha _{j}\cdot y_{i}y_{j}\cdot K({\mathbf {x}}_{i},{\mathbf {x}}_{j}) \end{aligned}$$
(5a)
$$\begin{aligned}&s.t. \quad 0 \le \alpha _{i} \le C \end{aligned}$$
(5b)
$$\begin{aligned}&\sum _{i=1}^N \alpha _{i}y_{i} = 0, i=1,2,...,N \end{aligned}$$
(5c)

\({\mathbf {x}}_{i}\) is a support vector when the corresponding \(\alpha _{i} > 0\). C controls the tradeoff between the margin and the misclassification error. We construct the SVM model with LIBSVM [62] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and employ the grid search method to obtain the optimal parameters of the SVM.
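To make Eq. (4) concrete, the sketch below evaluates the decision function for one test sample, given its kernel values against the training samples. It does not solve the QP of Eq. (5); the dual coefficients and bias are toy numbers, the name `svm_decision` is hypothetical, and mapping a score of exactly zero to \(+1\) is a convention assumed here.

```python
def svm_decision(kernel_row, y, alpha, b):
    """Evaluate f(x) = sign(sum_i y_i * alpha_i * K(x, x_i) + b) from Eq. (4).

    kernel_row[i] holds K(x, x_i) for the i-th training sample."""
    score = sum(yi * ai * ki for yi, ai, ki in zip(y, alpha, kernel_row)) + b
    return 1 if score >= 0 else -1


# Toy model with two training samples (hypothetical coefficients).
y = [+1, -1]
alpha = [0.5, 0.5]   # only samples with alpha_i > 0 act as support vectors
b = 0.0

close_to_positive = svm_decision([1.0, 0.2], y, alpha, b)   # score = +0.4
close_to_negative = svm_decision([0.2, 1.0], y, alpha, b)   # score = -0.4
```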

Multiple kernel learning

Because of strong theoretical guarantee and excellent experimental performance, the MKL-CKA [63, 64] method is adopted in our study. MKL-CKA is a multi-kernel learning algorithm based on kernel alignment. The optimal kernel is calculated as follows:

$$\begin{aligned}&{\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i} {\mathbf {K}}_{i}, \end{aligned}$$
(6a)
$$\begin{aligned}&{\mathbf {K}}_{i} \in {\mathbf {R}}^{N \times N}, \end{aligned}$$
(6b)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(6c)

where m is the number of kernels and \(\beta _{i}\) is the weight of the kernel \({\mathbf {K}}_{i}\).
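Equation (6) is a convex combination of kernel matrices, which the following sketch spells out. The function name `combine_kernels` and the toy matrices are hypothetical; the mean weighted kernel used as a baseline in the Results corresponds to \(\beta _{i} = 1/m\).

```python
def combine_kernels(kernels, beta, tol=1e-9):
    """Return K* = sum_i beta_i * K_i for N x N kernels and weights on the simplex."""
    if len(kernels) != len(beta):
        raise ValueError("one weight is required per kernel")
    if any(b < 0 for b in beta) or abs(sum(beta) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [[sum(b * K[r][c] for b, K in zip(beta, kernels)) for c in range(n)]
            for r in range(n)]


# Toy 2 x 2 kernels (illustration only).
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[1.0, 1.0], [1.0, 1.0]]
weighted = combine_kernels([K1, K2], [0.25, 0.75])
mean_weighted = combine_kernels([K1, K2], [0.5, 0.5])
```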

The kernel alignment is defined as follows:

$$\begin{aligned} A({\mathbf {P}},{\mathbf {Q}}) = \frac{\left\langle {\mathbf {P}},{\mathbf {Q}} \right\rangle _{F}}{\Vert {\mathbf {P}} \Vert _{F} \Vert {\mathbf {Q}} \Vert _{F}} \end{aligned}$$
(7)

where \({\mathbf {P}}, {\mathbf {Q}} \in {\mathbf {R}}^{N \times N}\), \(\left\langle {\mathbf {P}},{\mathbf {Q}} \right\rangle _{F} = Trace({\mathbf {P}}^{T}{\mathbf {Q}})\) is the Frobenius inner product and \(\Vert {\mathbf {P}} \Vert _{F} = \sqrt{ \left\langle {\mathbf {P}},{\mathbf {P}} \right\rangle _{F}}\) is Frobenius norm.
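A direct transcription of Eq. (7) is given below as a sketch. It relies on the identity \(Trace({\mathbf {P}}^{T}{\mathbf {Q}}) = \sum _{i,j} P_{ij}Q_{ij}\), so no matrix product is formed; the function names are hypothetical and the code assumes that neither matrix is all zeros.

```python
import math


def frobenius_inner(P, Q):
    """Frobenius inner product <P, Q>_F = Trace(P^T Q) = sum_ij P_ij * Q_ij."""
    return sum(p * q for row_p, row_q in zip(P, Q) for p, q in zip(row_p, row_q))


def alignment(P, Q):
    """Kernel alignment A(P, Q) of Eq. (7): the cosine between P and Q."""
    return frobenius_inner(P, Q) / math.sqrt(
        frobenius_inner(P, P) * frobenius_inner(Q, Q))


identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
score = alignment(identity, ones)   # 2 / (sqrt(2) * 2), about 0.707
```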

The kernel alignment score can be interpreted as the cosine similarity between two kernels. The higher the alignment score, the greater the similarity between the kernels. We want the alignment score between the combined kernel (feature space) and the ideal kernel (label space) to be high. The centered kernel alignment problem is therefore formulated as follows:

$$\begin{aligned}&\underset{\pmb {\beta } \ge 0}{\text{ max }} \quad CA({\mathbf {K}}^{*},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T}) = \underset{\pmb {\beta } \ge 0}{\text{ max }}\quad \frac{\left\langle {\mathbf {U}}_{N}{\mathbf {K}}^{*}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F}}{\Vert {\mathbf {U}}_{N}{\mathbf {K}}^{*}{\mathbf {U}}_{N} \Vert _{F} \Vert {\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \Vert _{F}} \end{aligned}$$
(8a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(8b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(8c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(8d)

where \({\mathbf {U}}_{N} = {\mathbf {I}}_{N} - (1/N){\mathbf {l}}_{N}{\mathbf {l}}_{N}^{T} \in {\mathbf {R}}^{N \times N}\) is the centering matrix, \({\mathbf {I}}_{N} \in {\mathbf {R}}^{N \times N}\) denotes the identity matrix and \({\mathbf {l}}_{N} \in {\mathbf {R}}^{N \times 1}\) is the vector of all ones. Eq. (8) can then be written as follows:

$$\begin{aligned}&\underset{\mathbf {\pmb {\beta }} \ge 0}{\text{ max }} \quad \frac{\pmb {\beta }^{T}{\mathbf {a}}}{\sqrt{\pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta }}} \end{aligned}$$
(9a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(9b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(9c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(9d)

In Eq. (9), \({\mathbf {a}}\in {\mathbf {R}}^{m \times 1}\) and \({\mathbf {M}}\in {\mathbf {R}}^{m \times m}\) are given by Eqs. (10) and (11).

$$\begin{aligned} \begin{aligned} {\mathbf {a}}&= \left( \left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{1}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F} ,...,\left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{m}{\mathbf {U}}_{N},{\mathbf {y}}_{train}{\mathbf {y}}_{train}^{T} \right\rangle _{F} \right) ^{T} \in {\mathbf {R}}^{m \times 1} \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} {\mathbf {M}}&= \left[ \begin{array}{cccc} M_{1,1} & M_{1,2} & \cdots & M_{1,m} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,m} \\ \vdots & \vdots & M_{e,f} & \vdots \\ M_{m,1} & M_{m,2} & \cdots & M_{m,m} \end{array} \right] _{m \times m} \end{aligned}$$
(11a)
$$\begin{aligned} M_{e,f}&= \left\langle {\mathbf {U}}_{N}{\mathbf {K}}_{e}{\mathbf {U}}_{N},{\mathbf {U}}_{N}{\mathbf {K}}_{f}{\mathbf {U}}_{N} \right\rangle _{F} \end{aligned}$$
(11b)
$$\begin{aligned} e,f&=1,2,...,m \end{aligned}$$
(11c)
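The quantities in Eqs. (10) and (11) can be computed as in the sketch below. Centering uses the fact that \(({\mathbf {U}}_{N}{\mathbf {K}}{\mathbf {U}}_{N})_{ij}\) equals \(K_{ij}\) minus its row mean, minus its column mean, plus the grand mean, so \({\mathbf {U}}_{N}\) is never formed. All function names and the toy kernels are assumptions of this example.

```python
def frobenius_inner(P, Q):
    """Frobenius inner product <P, Q>_F = sum_ij P_ij * Q_ij."""
    return sum(p * q for row_p, row_q in zip(P, Q) for p, q in zip(row_p, row_q))


def center_kernel(K):
    """Return U_N K U_N without building the centering matrix explicitly."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def alignment_terms(kernels, y_train):
    """Build the vector a (Eq. 10) and the matrix M (Eq. 11)."""
    ideal = [[yi * yj for yj in y_train] for yi in y_train]   # ideal kernel y y^T
    centered = [center_kernel(K) for K in kernels]
    a = [frobenius_inner(Kc, ideal) for Kc in centered]
    M = [[frobenius_inner(Ke, Kf) for Kf in centered] for Ke in centered]
    return a, M


# Toy example with N = 4 samples and m = 2 kernels (illustration only).
y = [1, 1, -1, -1]
K1 = [[yi * yj for yj in y] for yi in y]                              # ideal kernel itself
K2 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]   # identity kernel
a, M = alignment_terms([K1, K2], y)
```

For this toy input, \({\mathbf {a}} = (16, 4)^{T}\) and \({\mathbf {M}}\) has rows (16, 4) and (4, 3); the first kernel, which coincides with the ideal kernel, receives the larger alignment term.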

Equation (9) can also be represented as:

$$\begin{aligned}&\underset{\pmb {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} \end{aligned}$$
(12a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(12b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(12c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(12d)

In order to prevent extreme situations (the weight of a kernel is close to 1 and the remaining weights are close to 0), we employ the Laplacian regular term to smooth the weights:

$$\begin{aligned} \begin{aligned} \sum _{i,j}^{m} (\beta _{i} - \beta _{j})^2 W_{ij}&= \sum _{i,j}^{m} (\beta _{i}^2 + \beta _{j}^2 - 2 \beta _{i} \beta _{j}) W_{ij}\\&= \sum _{i}^{m} \beta _{i}^2 D_{ii} + \sum _{j}^{m} \beta _{j}^2 D_{jj} - 2 \sum _{i,j}^{m} \beta _{i} \beta _{j} W_{ij}\\&= 2 \pmb {\beta }^{T} {\mathbf {L}} \pmb {\beta } \end{aligned} \end{aligned}$$
(13)

In Eq. (13), \(i,j=1,...,m\), and \({\mathbf {W}}\in {\mathbf {R}}^{m \times m}\) is the similarity matrix whose entry \(W_{ij}\) is the cosine similarity between kernels \({\mathbf {K}}_{i}\) and \({\mathbf {K}}_{j}\), calculated by Eq. (7). \({\mathbf {D}} \in {\mathbf {R}}^{m \times m}\) is a diagonal matrix with \(D_{ii} = \sum _{j=1}^{m} W_{ij}\). \({\mathbf {L}}\in {\mathbf {R}}^{m \times m}\) is the graph Laplacian matrix, obtained by \({\mathbf {L}} = {\mathbf {D}} -{\mathbf {W}}\). Equations (12) and (13) are combined as follows:

$$\begin{aligned}&\underset{\pmb {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}{\mathbf {M}}\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} + \lambda \pmb {\beta }^{T} {\mathbf {L}} \pmb {\beta }=\underset{\pmb {\beta } \ge 0}{\text{ min }} \quad \pmb {\beta }^{T}({\mathbf {M}}+ \lambda {\mathbf {L}})\pmb {\beta } - 2\pmb {\beta }^{T}{\mathbf {a}} \end{aligned}$$
(14a)
$$\begin{aligned}&s.t. \ {\mathbf {K}}^{*} = \sum _{i=1}^{m} \beta _{i}{\mathbf {K}}_{i}, \end{aligned}$$
(14b)
$$\begin{aligned}&\beta _{i} \ge 0, \ i = 1,2,...,m, \end{aligned}$$
(14c)
$$\begin{aligned}&\sum _{i=1}^{m} \beta _{i} = 1 \end{aligned}$$
(14d)

where \(\lambda\) is a hyperparameter of MKL-CKA. Finally, the weights are obtained by solving Eq. (14), and the optimal kernel is calculated by Eq. (6a).
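The text does not say which solver is used for Eq. (14). As one possible approach, the sketch below applies projected gradient descent with a Euclidean projection onto the probability simplex, which enforces constraints (14c) and (14d). The solver choice, the step size, the iteration count and all function names are assumptions of this example; the inputs reuse a toy case with two kernels.

```python
def graph_laplacian(W):
    """L = D - W with D_ii = sum_j W_ij, as defined for Eq. (13)."""
    m = len(W)
    degree = [sum(row) for row in W]
    return [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(m)]
            for i in range(m)]


def project_to_simplex(v):
    """Euclidean projection of v onto {beta : beta_i >= 0, sum_i beta_i = 1}."""
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:          # holds for a prefix of k; the last hit gives theta
            theta = t
    return [max(x - theta, 0.0) for x in v]


def solve_weights(M, a, W, lam, iterations=2000):
    """Minimise beta^T (M + lam * L) beta - 2 beta^T a over the simplex (Eq. 14)."""
    m = len(a)
    L = graph_laplacian(W)
    Q = [[M[i][j] + lam * L[i][j] for j in range(m)] for i in range(m)]
    # The gradient 2 (Q beta - a) is Lipschitz with constant at most 2 * ||Q||_inf.
    step = 1.0 / (2.0 * max(sum(abs(x) for x in row) for row in Q))
    beta = [1.0 / m] * m                       # start from the mean weighted kernel
    for _ in range(iterations):
        grad = [2.0 * (sum(Q[i][j] * beta[j] for j in range(m)) - a[i])
                for i in range(m)]
        beta = project_to_simplex([b - step * g for b, g in zip(beta, grad)])
    return beta


# Toy inputs: two kernels, the first one identical to the ideal kernel.
M = [[16.0, 4.0], [4.0, 3.0]]
a = [16.0, 4.0]
W = [[1.0, 0.5], [0.5, 1.0]]                   # pairwise kernel similarities
beta_plain = solve_weights(M, a, W, lam=0.0)   # no smoothing
beta_smooth = solve_weights(M, a, W, lam=0.8)  # Laplacian smoothing
```

On this toy problem the unregularised solution puts all weight on the first kernel, while \(\lambda = 0.8\) moves roughly 6% of the weight to the second kernel, which is the smoothing effect the Laplacian term is meant to provide.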

Availability of data and materials

The datasets generated and/or analysed during this study are available under open licenses in the data repository, https://figshare.com/s/cf56cef6659c7eed16c9.

Abbreviations

DBP:

DNA-Binding Proteins

ML:

Machine Learning

MKL-CKA:

Multiple Kernel Learning-based on Centered Kernel Alignment

SVM:

Support Vector Machine

Y1H:

Yeast One-hybrid System

PseAAC:

Pseudo Amino Acid Composition

PSSM:

Position Specific Scoring Matrix

RF:

Random Forest

FKRR-MVSF:

Fuzzy Kernel Ridge Regression model with Multi-View Sequence Features

MKSVM-HKA:

Multi-Kernel SVM based on Heuristically Kernel Alignment

DPP-PseAAC:

DNA-binding Protein Prediction model using Chou general PseAAC

RFE:

Recursive Feature Elimination

PTM:

Post Translational Modification

PPI:

Protein–Protein Interactions

MKSVM:

Multi-Kernel SVM

RBF:

Radial Basis Function

PSSM-DWT:

PSSM-based Discrete Wavelet Transform

PSSM-AB:

PSSM-based Average Blocks

PsePSSM:

Pseudo-PSSM

MCD:

Multi-scale Continuous and Discontinuous descriptor

GE:

Global Encoding

NMBAC:

Normalized Moreau-Broto Auto Correlation

QP:

Quadratic Programming

LOOCV:

Leave One Out Cross validation

ACC:

Accuracy

MCC:

Matthew’s Correlation Coefficient

SN:

Sensitivity

SP:

Specificity

AUC:

Area under the receiver-operating characteristic curve

ROC:

Receiver Operating characteristic Curve

5-CV:

5-fold Cross Validation

References

  1. 1.

    Wei L, Tang J, Quan Z. Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information. Inf Sci. 2016;384:135–44.

    Article  Google Scholar 

  2. 2.

    Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, Chou KC. iDNA-Prot|dis: Identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS ONE. 2014;9:106691.

    Article  CAS  Google Scholar 

  3. 3.

    Wang Y, Ding Y, Guo F, Wei L, Tang J. Improved detection of DNA-binding proteins via compression technology on PSSM information. PLoS ONE. 2017;12(9):e0185587.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  4. 4.

    Nimrod G, Schushan M, Szilágyi A, Leslie C. iDBPS: a web server for the identification of DNA binding proteins. Bioinformatics. 2010;26(5):692–3.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    Bhardwaj N, Langlois RE, Zhao G, Lu H. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005;33(20):6486–93.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.

    Google Scholar 

  7. 7.

    Ahmad S, Sarai A. Moment-based prediction of DNA-binding proteins. J Mol Biol. 2004;341(1):65–71.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  8. 8.

    Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol. 2006;240(2):175–84.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  9. Liu B, Xu J, Fan S, Xu R, Zhou J, Wang X. PseDNA-Pro: DNA-binding protein identification by combining Chou’s PseAAC and physicochemical distance transformation. Mol Inf. 2015;34(1):8–17.

  10. Liu B, Wang S, Wang X. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep. 2015;5:15479.

  11. Cai YD, Lin SL. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta. 2003;1648(1):127–33.

  12. Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst Biol. 2015;9:10.

  13. Kumar M, Gromiha MM, Raghava GP. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinform. 2007;8:463.

  14. Lipman DJ, Zhang J, Madden T, Altschul SF, Schäffer AA, Miller W, Zhang Z. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.

  15. Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian Naïve Bayes. PLoS ONE. 2014;9:86703.

  16. Zou Y, Ding Y, Tang J, Guo F, Peng L. FKRR-MVSF: a fuzzy kernel ridge regression model for identifying DNA-binding proteins by multi-view sequence features via Chou’s five-step rule. Int J Mol Sci. 2019;20(17):4175.

  17. Ding Y, Chen F, Guo X, Tang J, Wu H. Identification of DNA-binding proteins by multiple kernel support vector machine and sequence information. Curr Proteomics. 2019. https://doi.org/10.2174/1570164616666190417100509.

  18. Ding YJ, Tang JJ, Guo F. Identification of DNA-binding proteins via fuzzy multiple kernel model and sequence information. Lect Notes Comput Sci. 2019;11644:468–79.

  19. Liu XJ, Gong XJ, Yu H, Xu JH. A model stacking framework for identifying DNA binding proteins by orchestrating multi-view features and classifiers. Genes. 2018;9:394.

  20. Rahman MS, Shatabda S, Saha S, Kaykobad M, Rahman MS. DPP-PseAAC: a DNA-binding protein prediction model using Chou’s general PseAAC. J Theor Biol. 2018;452:22–34.

  21. Adilina S, Farid D, Shatabda S. Effective DNA binding protein prediction by using key features via Chou’s general PseAAC. J Theor Biol. 2019;460:64–78.

  22. Wei L, Luan S, Nagai L, Su R, Zou Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics. 2019;35:1326–33.

  23. Jia C, Zuo Y, Zou Q. O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique. Bioinformatics. 2018;34:2029–36.

  24. Zeng X, Liu L, Lu L, Zou Q. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics. 2018;34:2425–32.

  25. Wei L, Ding Y, Su L, Tang J, Zou Q. Prediction of human protein subcellular localization using deep learning. J Parallel Distrib Comput. 2018;117:212–7.

  26. Zou Q, Xing P, Wei L, Liu B. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA. 2019;25(9):205–18.

  27. Ding YJ, Tang JJ, Guo F. The computational models of drug-target interaction prediction. Protein Pept Lett. 2019;26:1–11.

  28. Ding YJ, Tang JJ, Guo F. Identification of drug-side effect association via semi-supervised model and multiple kernel learning. IEEE J Biomed Health Inform. 2019;23(6):2619–32.

  29. Ding YJ, Tang JJ, Guo F. Identification of protein-ligand binding sites by sequence information and ensemble classifier. J Chem Inf Model. 2017;57(12):3149–61.

  30. Ding YJ, Tang JJ, Guo F. Identification of drug-target interactions via multiple information integration. Inf Sci. 2017;418:546–60.

  31. Ding YJ, Tang JJ, Guo F. Identification of drug-target interactions via fuzzy bipartite local model. Neural Comput Appl. 2019. https://doi.org/10.1007/s00521-019-04569-z.

  32. Wang YB, Ding YJ, Tang JJ, Dai Y, Guo F. CrystalM: a multi-view fusion approach for protein crystallization prediction. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2912173.

  33. Jiang L, Xiao Y, Ding Y, Tang J, Guo F. FKL-Spa-LapRLS: an accurate method for identifying human microRNA-disease association. BMC Genomics. 2018;19(Suppl 10):911.

  34. Jiang L, Ding Y, Tang J, Guo F. MDA-SKF: similarity kernel fusion for accurately discovering miRNA-disease association. Front Genet. 2018. https://doi.org/10.3389/fgene.2018.00618.

  35. Shen C, Ding YJ, Tang JJ, Guo F. Multivariate information fusion with fast kernel learning to kernel ridge regression in predicting LncRNA-protein interactions. Front Genet. 2019. https://doi.org/10.3389/fgene.2018.00716.

  36. Shen C, Ding YJ, Tang JJ, Jiang LM, Guo F. LPI-KTASLP: prediction of lncRNA-protein interaction by semi-supervised link learning with multivariate information. IEEE Access. 2019;7:13486–96.

  37. Shen C, Ding YJ, Tang JJ, Xu XY, Guo F. An ameliorated prediction of drug-target interactions based on multi-scale discrete wavelet transform and network features. Int J Mol Sci. 2017;18(8):1781.

  38. Shen C, Ding YJ, Tang JJ, Song J, Guo F. Identification of DNA-protein binding sites through multi-scale local average blocks on sequence information. Molecules. 2017;22(2):2079.

  39. Shen YN, Tang JJ, Guo F. Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC. J Theor Biol. 2019;462:230–9.

  40. Ding YJ, Tang JJ, Guo F. Human protein subcellular localization identification via fuzzy model on kernelized neighborhood representation. Appl Soft Comput. 2020;96:106596.

  41. Ding YJ, Tang JJ, Guo F. Identification of drug-target interactions via dual Laplacian regularized least squares with multiple kernel fusion. Knowl Based Syst. 2020;204:106254.

  42. Zhang W, Jing K, Huang F, Chen Y, Li B, Li J, Gong J. SFLLN: a sparse feature learning ensemble method with linear neighborhood regularization for predicting drug–drug interactions. Inf Sci. 2019;497:189–201.

  43. Deng Y, Xu X, Qiu Y, Xia J, Zhang W, Liu S. A multimodal deep learning framework for predicting drug–drug interaction events. Bioinformatics. 2020. https://doi.org/10.1093/bioinformatics/btaa501.

  44. Ding YJ, Tang JJ, Guo F. Protein crystallization identification via fuzzy model on linear neighborhood representation. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2954826.

  45. Zhang W, Li ZS, Guo WZ, Yang WT, Huang F. A fast linear neighborhood similarity-based network link inference method to predict microRNA-disease associations. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2931546.

  46. Gong YC, Niu YQ, Zhang W, Li XH. A network embedding-based multiple information integration method for the miRNA-disease association prediction. BMC Bioinform. 2019;20(1):468.

  47. Zhao Q, Yang YJ, Ren GF, Ge EX, Fan CL. Integrating bipartite network projection and KATZ measure to identify novel circRNA-disease associations. IEEE Trans Nanobiosci. 2019;18(4):578–84.

  48. Liu HS, Ren GF, Chen HY, Liu Q, Yang YJ, Zhao Q. Predicting lncRNA-miRNA interactions based on logistic matrix factorization with neighborhood regularized. Knowl Based Syst. 2020;191:105261.

  49. Zeng X, Lin W, Guo M, Zou Q. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput Biol. 2017;13(6):1005420.

  50. Zeng X, Lin W, Guo M, Zou Q. Details in the evaluation of circular RNA detection tools: reply to Chen and Chuang. PLoS Comput Biol. 2019;15(4):1006916.

  51. Rose PW, Prlić A, Bi C, et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015;4(Database issue):345–56.

  52. Lin W, Fang J, Xiao X, Chou K. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS ONE. 2011;6:24756.

  53. Kumar KK, Pugalenthi G, Suganthan PN. DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. J Biomol Struct Dyn. 2009;26(6):679–86.

  54. Nanni L, Brahnam S, Lumini A. Wavelet images and Chou’s pseudo amino acid composition for protein classification. Amino Acids. 2012;43:657–65.

  55. Cheol Jeong J, Lin X, Chen XW. On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(2):308–15.

  56. Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43:65–71.

  57. Chou KC, Shen HB. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through PSE-PSSM. Biochem Biophys Res Commun. 2007;360(2):339–45.

  58. You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein–protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinform. 2014;15:9.

  59. Li X, Liao B, Shu Y, Zeng Q, Luo J. Protein functional class prediction using global encoding of amino acid sequence. J Theor Biol. 2009;261(2):290–3.

  60. Feng ZP, Zhang CT. Prediction of membrane protein types based on the hydrophobic index of amino acids. J Protein Chem. 2000;19(4):269–75.

  61. Ding Y, Tang J, Guo F. Predicting protein–protein interactions via multivariate mutual information of protein sequences. BMC Bioinform. 2016;17(1):398–410.

  62. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(27):1–27.

  63. Cristianini N, Kandola J, Elisseeff A. On kernel-target alignment. Adv Neural Inf Process Syst. 2001;179(5):367–73.

  64. Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered alignment. J Mach Learn Res. 2012;13(2):795–828.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 22 Supplement 3, 2021: Proceedings of the 2019 International Conference on Intelligent Computing (ICIC 2019): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-3.

Funding

This work is supported by grants from the National Natural Science Foundation of China (NSFC 61902271, 61772362, 61772357, 61902272 and 61972280) and the Natural Science Research of Jiangsu Higher Education Institutions of China (19KJB520014). Publication costs of this article are funded by the above grants. The funding bodies did not play any role in the design of the study; the collection, analysis, and interpretation of the data; or the writing of the manuscript.

Author information

Contributions

YQ, YD and FG conceived the study. YQ performed the experiments and analyzed the data. LJ, YD, JT and FG drafted the manuscript. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Yijie Ding or Fei Guo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Qian, Y., Jiang, L., Ding, Y. et al. A sequence-based multiple kernel model for identifying DNA-binding proteins. BMC Bioinformatics 22, 291 (2021). https://doi.org/10.1186/s12859-020-03875-x

Keywords

  • DNA-binding proteins
  • Feature extraction
  • Multiple kernel learning
  • Support vector machine
  • Centered kernel alignment