Datasets
In this work, potential interactions between drugs and target proteins were investigated on three benchmark datasets. Two datasets were derived from the KEGG DRUG database, while the third was built from the DrugBank database (http://www.drugbank.ca/). KEGG DRUG is a comprehensive resource for drugs approved in Japan, the USA and Europe, organized around chemical structure and molecular interaction network information, and most of the drugs it records are annotated with their target proteins [24]. DrugBank offers a freely available public resource that includes 2555 approved small-molecule drugs and 5121 detailed, non-redundant target protein sequences [25].
The first dataset, called Dataset1, was taken from reference [26]. In Dataset1, 4797 drug-target pairs were regarded as positive samples: 2719 pairs for enzymes, 1372 for ion channels, 630 for GPCRs and 86 for nuclear receptors. The corresponding negative samples were generated by random selection, as illustrated in the sketch below. The detailed process consists of the following steps: (i) re-coupling all drugs and targets in the benchmark dataset into pairs, after removing the known drug-target interactions that appear in the positive samples; (ii) randomly selecting negative samples until their number reached exactly twice that of the positive samples.
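As a concrete illustration of steps (i) and (ii), the following minimal Python sketch generates a 2:1 negative set by random re-coupling. The names (`drugs`, `targets`, `positive_pairs`) are hypothetical placeholders, not the original code.

```python
# A minimal sketch of the negative-sample generation described above.
import random

def sample_negatives(drugs, targets, positive_pairs, ratio=2, seed=0):
    """Randomly re-couple drugs and targets, excluding known interactions,
    until the negative set is `ratio` times the size of the positive set."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < ratio * len(positives):
        pair = (rng.choice(drugs), rng.choice(targets))
        if pair not in positives:       # step (i): exclude known interactions
            negatives.add(pair)         # step (ii): sample until 2x positives
    return list(negatives)
```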
The second dataset of DTIs, called Dataset2, was manually collected. Protein kinases were grouped with the enzymes. In addition, drugs without structural information and target proteins without a primary sequence were discarded, and drug-target pairs that were redundant or overlapped with Dataset1 were also omitted. As in Dataset1, the number of negative samples in Dataset2 is twice that of the positive samples. Ultimately, 16140 drug-target pairs were obtained in Dataset2: 3627 for enzymes, 5511 for ion channels, 5955 for GPCRs and 1047 for nuclear receptors. Figure 1 illustrates the numbers of drugs, target proteins and drug-target pairs in Dataset1 and Dataset2; detailed information is given in Additional file 1.
The last dataset, called Dataset3, comes from reference [23]; inorganic compounds and very small molecules were omitted. It consists of 6262 positive samples and an equal number of randomly selected negative samples, so 12524 potential drug-target pairs were used in this work.
Drug representations
A molecular descriptor is a fixed-length numerical vector that captures the chemical information encoded within a symbolic representation of a small molecule [27]. Such descriptors are routinely applied in cheminformatics tasks such as QSAR analysis, virtual screening and drug ADME/T prediction, as well as in other drug discovery processes. PaDEL-Descriptor is a Java-based graphical user interface (GUI) toolbox for calculating descriptors of small chemical molecules, and it runs on multiple platforms. It currently provides 1444 1D and 2D descriptors, 134 3D descriptors and 10 kinds of fingerprints [28]. In this study, the 1444 1D and 2D descriptors were employed to represent the drug candidates, formulated as D=[D1,D2,D3,…,D1444].
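As an illustration, the drug feature matrix might be assembled from a PaDEL-Descriptor CSV export as follows; the file name and the "Name" identifier column are assumptions about the export format, not details reported in the paper.

```python
# A minimal sketch: load PaDEL-Descriptor output into a numeric matrix.
import pandas as pd

descriptors = pd.read_csv("padel_descriptors.csv")   # hypothetical export
names = descriptors["Name"]                          # molecule identifiers
# Remaining columns hold the 1D/2D descriptor values, one row per drug;
# shape is (n_drugs, 1444) when only 1D/2D descriptors were computed.
D = descriptors.drop(columns=["Name"]).to_numpy()
```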
Protein representations
For predicting DTIs, the sequences of target proteins are encoded using physicochemical properties of amino acids. To improve the final predictive performance, 34 weakly correlated properties (pairwise correlation coefficient below 0.5) were extracted from the AAindex1 database [29]. In this process, the correlation coefficients between all pairs of properties were first calculated. Then, for each property, the number of other properties with which its correlation coefficient exceeded 0.5 was recorded, and the properties were ranked by this count in descending order. Starting from the top-ranked property and working downward, every remaining property whose correlation coefficient with the current property exceeded 0.5 was removed. When this process completed, 34 properties were retained. Protein targets were then encoded using these properties via the Moran autocorrelation descriptor algorithm [30, 31]. Moran autocorrelation has been widely applied to the prediction of helix content, and it mainly accounts for the influence of neighboring amino acids around a given central amino acid [32]. The Moran autocorrelation descriptor of a target protein, denoted T, is formulated as follows:
$$ T(d)=\frac{\frac{1}{N-d}\sum_{i=1}^{N-d}\left(P_{i}-\overline{P}\right)\left(P_{i+d}-\overline{P}\right)}{\frac{1}{N}\sum_{i=1}^{N}\left(P_{i}-\overline{P}\right)^{2}} $$
(1)
where \(P_{i}\) and \(P_{i+d}\) are the values of one of the 34 amino acid properties at sequence positions i and i+d, respectively; d is the sequence distance (lag) between the i-th residue and its neighboring residue; N is the length of the protein sequence; and \(\overline{P}\) is the average of the \(P_{i}\), i.e. \(\overline{P}=\left(\sum_{i=1}^{N}P_{i}\right)/N\). The maximum lag d is set to 13 in this work. Therefore, for each of the 34 properties, a protein is represented by the vector \(T_{m}=[T_{1},T_{2},T_{3},\ldots,T_{13}]\), \(m=1\sim 34\). These 34 vectors are then concatenated, so each target is characterized by the 442-dimensional descriptor vector \(T_{p}=[T_{1},T_{2},T_{3},\ldots,T_{442}]\).
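To make the two steps above concrete, the following Python sketch implements one plausible reading of the greedy correlation filter together with the Moran descriptor of Eq. (1). Function names, the use of absolute correlations, and the filter's tie-breaking are assumptions of the sketch, not the authors' code.

```python
import numpy as np

def select_properties(props, threshold=0.5):
    """props: (n_properties, 20) matrix of amino acid property values.
    Returns indices of retained, weakly correlated properties."""
    corr = np.abs(np.corrcoef(props))             # absolute correlation: an assumption
    # Rank properties by how many others they correlate with above threshold
    counts = (corr > threshold).sum(axis=1) - 1   # subtract self-correlation
    order = np.argsort(-counts)                   # descending order
    kept, removed = [], set()
    for i in order:
        if i in removed:
            continue
        kept.append(i)
        for j in order:                           # drop properties too correlated with i
            if j != i and j not in removed and corr[i, j] > threshold:
                removed.add(j)
    return kept

def moran_descriptors(p, d_max=13):
    """p: values of one property along a protein sequence of length N.
    Returns [T(1), ..., T(d_max)] following Eq. (1)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    p_bar = p.mean()
    denom = np.sum((p - p_bar) ** 2) / n          # (1/N) * sum over all residues
    return [
        (np.sum((p[: n - d] - p_bar) * (p[d:] - p_bar)) / (n - d)) / denom
        for d in range(1, d_max + 1)
    ]
```

Concatenating the 13 values of `moran_descriptors` over the 34 retained properties yields the 442-dimensional target vector Tp described above.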
Convolutional neural network
The Convolutional Neural Network (CNN) is an effective deep learning architecture widely applied in areas including image and video recognition, recommender systems and natural language processing [33]. A growing number of promising results has also appeared in biomedical applications such as neuronal membrane segmentation and drug discovery. CNNs are feed-forward artificial neural networks inspired by biological processes, in that the connectivity pattern between neurons mimics that of biological neural systems [34]. Compared with the traditional multilayer perceptron (MLP), a CNN has far fewer trainable parameters, allowing the network to be deeper; this in turn helps alleviate vanishing or exploding gradients during backpropagation [35, 36]. A CNN architecture is formed by a stack of distinct layers, including convolutional layers, pooling layers and fully connected layers. The convolutional layer is the core building block of a CNN topology: it is parameterized by a set of learnable filters (or kernels) that slide over a vector or matrix, and the output of each filter is called a feature map [37]. Pooling is usually applied after each convolutional layer; it combines responses at nearby locations, which adds robustness to small spatial variations, speeds up convergence and reduces the amount of computation. Denote the outputs of the l-th layer and its previous layer by \(\mathbf{V}_{l}\) and \(\mathbf{V}_{l-1}\); the layer involves only two sets of trainable parameters, the weight matrix \(\mathbf{W}_{l}\) and the bias vector \(\mathbf{b}_{l}\). The operation can be formulated as:
$$ \mathbf{V}_{l}= \operatorname{pool}\left(f\left(\mathbf{V}_{l-1} \ast \mathbf{W}_{l}+\mathbf{b}_{l}\right)\right), $$
(2)
where ∗ represents the convolution operation, pool denotes the max-pooling operation, and f(·) is the activation function.
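The following NumPy sketch illustrates Eq. (2) for a single one-dimensional feature map, assuming a ReLU activation and non-overlapping max-pooling of width 2; both choices are assumptions of the sketch, not fixed by Eq. (2).

```python
# A minimal illustration of Eq. (2): convolution, activation, max-pooling.
import numpy as np

def conv_relu_pool(v_prev, w, b, pool=2):
    # valid convolution of the previous layer's output with one filter w
    conv = np.convolve(v_prev, w, mode="valid") + b
    act = np.maximum(conv, 0.0)                 # f(.) = ReLU (an assumption)
    trimmed = act[: len(act) // pool * pool]    # drop any leftover tail
    return trimmed.reshape(-1, pool).max(axis=1)  # non-overlapping max-pool

v_l = conv_relu_pool(np.random.randn(28), np.random.randn(3), b=0.1)
```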
A dropout layer is a regularization strategy designed to alleviate overfitting by stochastically adding noise to the hidden layers: nodes that are 'dropped out' contribute neither to the forward pass nor to backpropagation. Fully connected layers usually form the final layers of a deep neural network topology, in which every neuron is connected to all activations of the previous layer [38].
Figure 2 illustrates the CNN-based prediction model, which resembles the LeNet-5 framework with one additional convolutional layer and one additional pooling layer. In this work, LeNet-5 serves as the baseline for comparing deep learning algorithms because it contains a small number of parameters.
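A hedged Keras sketch of such a LeNet-5-style network with one extra convolution/pooling pair is given below, matching the 28×28 single-channel input used later in this paper. The filter counts, kernel sizes, dropout rate and optimizer are illustrative guesses, not the reported hyperparameters.

```python
# A LeNet-5-like model with one extra conv/pool pair (a sketch, not the
# authors' exact architecture).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                      # 28 -> 14
    layers.Conv2D(16, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                      # 14 -> 7
    layers.Conv2D(32, 3, padding="same", activation="relu"),  # extra pair
    layers.MaxPooling2D(2),                      # 7 -> 3
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dropout(0.5),                         # dropout regularization
    layers.Dense(84, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # interacting vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```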
The construction of the CNN-based model
Since a small training set may reduce the generalization ability of a deep learning model, a data augmentation scheme was adopted. Consider one potential drug-target pair represented by [D,T], a simple concatenation of a 342-dimensional drug vector D=[D1,D2,D3,…,D342] and the 442-dimensional protein descriptor vector T=[T1,T2,T3,…,T442]. In this way, the input vectors of the training model jointly capture the information of the small chemical molecule and the target protein. Moreover, the drug and target parts have nearly the same length, which reduces the bias that unequal vector lengths would introduce and makes it easier and fairer to train an appropriate model to identify DTIs. As shown in Fig. 3, the 342-dimensional drug vector of each drug-target pair (i.e. [D,T]=[D1,D2,D3,…,D342,T1,T2,T3,…,T442]) was generated by random selection from the full set of drug descriptors. This selection was repeated n times, and each of the n drug vectors was joined to the 442-dimensional target descriptor. That is, one drug-target pair was represented by n generated pairs (Vn=[Dn,T]1×784, n=1,2,…), each containing a different randomly selected drug vector. The process terminates when the total number of drug-target pairs reaches roughly 40000∼50000, and the n generated pairs together serve as the characterization of one original drug-target pair.
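Under this reading of the augmentation (each Dn is a random 342-descriptor subset of the 1444-dimensional drug vector), the step can be sketched in Python as follows; variable names and the value of n are illustrative.

```python
# A minimal sketch of the augmentation: each sample joins a random
# 342-descriptor subset of the drug vector to the full target vector,
# yielding a 784-dimensional input (784 = 342 + 442 = 28 * 28).
import numpy as np

def augment_pair(drug_vec, target_vec, n, rng):
    """drug_vec: (1444,), target_vec: (442,). Returns (n, 784) samples."""
    samples = []
    for _ in range(n):
        idx = rng.choice(drug_vec.shape[0], size=342, replace=False)
        samples.append(np.concatenate([drug_vec[idx], target_vec]))
    return np.stack(samples)            # each row reshapes to 28 x 28

rng = np.random.default_rng(0)
V = augment_pair(np.random.randn(1444), np.random.randn(442), n=5, rng=rng)
```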
After that, the 784 features of each generated pair were reshaped into a 28×28 matrix, analogous to a digit image, so that a predictive model could readily be trained with the CNN algorithm. The final prediction was obtained by ensembling the prediction values of the n generated pairs: a drug-target pair is predicted to be interacting if at least half of its n pairs are predicted as positive samples, and non-interacting otherwise. The full pipeline of our model construction is illustrated in Fig. 3.
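A minimal sketch of this majority-vote ensemble, reusing the trained CNN `model` sketched earlier, might look as follows.

```python
# Majority vote over the n augmented samples of one drug-target pair.
import numpy as np

def predict_pair(model, samples):
    """samples: (n, 784) augmented vectors for one drug-target pair."""
    x = samples.reshape(-1, 28, 28, 1)            # one 28x28 image per sample
    votes = model.predict(x).ravel() >= 0.5       # per-sample decisions
    return int(votes.mean() >= 0.5)               # 1 = predicted interaction
```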
Measurement of prediction quality
In this work, four metrics, accuracy (Acc), sensitivity (Sen), precision (Pre) and F1 score (F1), were used to evaluate whether a candidate drug and a target protein interact. In particular, the F1 score combines sensitivity and precision into a single measure and is generally regarded as more credible and objective [39, 40]. In addition, the area under the receiver operating characteristic curve (AUC) is a common evaluation metric in machine learning and data mining for assessing a binary classifier as its discrimination threshold is varied [41]. The following formulas give the detailed calculation of these metrics.
$$ {\begin{aligned} Acc &=\frac{TP+TN}{TP+FP+TN+FN} \\ Pre&=\frac{TP}{TP+FP}\\ Sen&=\frac{TP}{TP+FN} \\ F1&=\frac{2\times Sen\times Pre}{Sen+Pre} \\ \end{aligned}} $$
(3)
Here, TP (true positives) and TN (true negatives) denote the correctly predicted drug-target interaction pairs and non-interaction pairs, respectively; FP (false positives) are non-interacting pairs wrongly predicted as positive, and FN (false negatives) are interacting pairs wrongly predicted as non-interactions.
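For completeness, a small Python helper computing the four metrics of Eq. (3) directly from these confusion counts might look as follows.

```python
# A minimal sketch of Eq. (3); inputs are the confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    pre = tp / (tp + fp)                    # precision
    sen = tp / (tp + fn)                    # sensitivity (recall)
    f1 = 2 * sen * pre / (sen + pre)        # harmonic mean of Sen and Pre
    return acc, pre, sen, f1
```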