GCRNN for CPI prediction
Machine learning and computational methods are enhancing large-scale data analysis and providing faster solutions, with a strong impact on biological and pharmaceutical research. Biological data have been collected in data banks, and abundant information on genomes and proteins is available for researchers to obtain reliable results in areas such as healthcare.
This work addresses CPI prediction, an important task in drug discovery and development. Figure 3 illustrates the development of new drugs for improving health conditions based on protein information and the chemical structure of natural or artificial compounds.
A normal or abnormal condition is encoded in the genome sequence, which is translated into a protein sequence that interacts with a compound. By continuously storing such interactions, machine learning can be used to determine an effective compound and target protein for a specific disease. Laboratory experiments then provide accurate results for clinical trials, and the resulting compound extends the dataset for new cases and developments.
Deep learning techniques provide state-of-the-art performance and high accuracy for handling protein sequences and modeling molecules. Among the available models and architectures, we combine three powerful methods for CPI prediction, namely, GNN, CNN, and bidirectional RNN, as shown in Fig. 4. These methods constitute the proposed GCRNN and are detailed below.
GNN
The GNN provides a low-dimensional vector representation of a molecular graph. We use the GNN to obtain a molecular embedding that maps a graph into a vector through transformation and output functions. In the GNN, the transformation function updates the node values according to the neighboring nodes and edges, and the output function describes the nodes as vectors. The graph structure is G = (N, E), where N is the set of nodes and E is the set of edges connecting neighboring nodes. We consider an undirected graph G, in which node ni ∈ N represents atom i of a molecule, and eij ∈ E represents the bond between atoms ni and nj.
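As an illustration, the following minimal Python sketch builds the node and edge sets of such a graph from a SMILES string. It assumes RDKit is available; the function name and example compound are only for demonstration and are not part of the proposed method.

```python
# Minimal sketch (assumes RDKit): build the undirected graph G = (N, E),
# where nodes are atoms and edges are covalent bonds between atom indices.
from rdkit import Chem

def molecule_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    nodes = [atom.GetSymbol() for atom in mol.GetAtoms()]            # N: one entry per atom
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
             for b in mol.GetBonds()]                                # E: bonds between atoms
    return nodes, edges

# Example: aspirin
nodes, edges = molecule_to_graph("CC(=O)OC1=CC=CC=C1C(=O)O")
```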
Representing molecules as graphs simplifies the representation because only a few types of nodes and bonds, and hence few parameters, must be learned. We also adopt r-radius subgraphs [26], which improve representation learning by capturing the context of all neighboring nodes within radius r instead of single atoms and bonds. For graph G = (N, E), the set of all nodes within radius r of node i is denoted as S(i, r), and the subgraph of r-radius node ni is defined as
$$n_{i}^{\left( r \right)} = \left( {N_{i}^{\left( r \right)} ,E_{i}^{\left( r \right)} } \right),$$
(1)
where \(N_{i}^{\left( r \right)} = \left\{ {n_{j} \left| {j \in S\left( {i,r} \right)} \right.} \right\}\) and \(E_{i}^{\left( r \right)} = \left\{ {e_{mn} {|}\left( {m,n} \right) \in S\left( {i,r} \right) \times S\left( {i,r - 1} \right)} \right\}\). The subgraph for the r-radius edges is defined as
$$e_{ij}^{\left( r \right)} = \left( {N_{i}^{{\left( {r - 1} \right)}} \cup N_{j}^{{\left( {r - 1} \right)}} ,E_{i}^{{\left( {r - 1} \right)}} \cup E_{j}^{{\left( {r - 1} \right)}} } \right).$$
(2)
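For illustration, the neighborhood S(i, r) used in Eqs. (1) and (2) can be computed with a breadth-first search over the bond graph; the following plain-Python sketch uses an assumed adjacency-list format and a hypothetical helper name.

```python
# Illustrative sketch: S(i, r) = all node indices within graph distance r of node i.
from collections import deque

def neighborhood(adj, i, r):
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == r:          # do not expand beyond radius r
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# adj maps each atom index to its bonded neighbors, e.g. a 4-atom chain:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighborhood(adj, 1, 1))    # {0, 1, 2}
```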
Embedded vectors are assigned to the r-radius nodes and edges; they are randomly initialized and trained by backpropagation. To update the node information with respect to its neighborhood, the transition functions in Eqs. (3) and (4) are used. At time step t of a given graph with random embeddings of nodes and edges, n(t) represents a node in Eq. (3) and e(t) represents an edge in Eq. (4). The updated vectors are defined as
$$n_{i}^{{\left( {t + 1} \right)}} = \sigma \left( {n_{i}^{\left( t \right)} + \mathop \sum \limits_{j \in S\left( i \right)} p_{ij}^{\left( t \right)} } \right),$$
(3)
where σ is the sigmoid function [27] and \(p_{ij}^{\left( t \right)} = f\left( {W_{{{\text{neighbor}}}} \left[ {\begin{array}{*{20}c} {n_{j}^{\left( t \right)} } \\ {e_{ij}^{\left( t \right)} } \\ \end{array} } \right] + b_{{{\text{neighbor}}}} } \right)\) is a neural network with f being a ReLU (rectified linear unit) activation function [27] and Wneighbor and bneighbor being a weight matrix and bias vector, respectively, at time step t. In the same iteration, the edges are updated as follows:
$$e_{ij}^{{\left( {t + 1} \right)}} = \sigma \left( {e_{ij}^{\left( t \right)} + q_{ij}^{\left( t \right)} } \right),$$
(4)
where function q is the neural network model for the edges and \(q_{ij}^{\left( t \right)} = f\left( {W_{{{\text{side}}}} \left( {n_{i}^{\left( t \right)} + n_{j}^{\left( t \right)} } \right) + b_{{{\text{side}}}} } \right)\). The transition functions generate an updated set of nodes \(N = \left\{ {n_{1}^{\left( t \right)} ,n_{2}^{\left( t \right)} , \ldots ,n_{\left| N \right|}^{\left( t \right)} } \right\}\), where |N| is the number of nodes in the molecular graph. Finally, the molecular representation vector is given by
$$y_{{{\text{molecule}}}} = \frac{1}{\left| N \right|}\mathop \sum \limits_{i = 1}^{\left| N \right|} n_{i}^{\left( t \right)} .$$
(5)
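A possible PyTorch sketch of the transition functions in Eqs. (3) and (4) and the readout in Eq. (5) is shown below. The layer names, dimensions, single-step interface, and the listing of both directions of each bond in edge_index are illustrative assumptions, not the exact implementation.

```python
# Sketch of Eqs. (3)-(5): node/edge updates and mean readout.
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_neighbor = nn.Linear(2 * dim, dim)   # W_neighbor, b_neighbor in Eq. (3)
        self.w_side = nn.Linear(dim, dim)           # W_side, b_side in Eq. (4)

    def forward(self, nodes, edges, edge_index):
        # nodes: (|N|, dim), edges: (|E|, dim), edge_index: (|E|, 2) long tensor of (i, j)
        p = torch.relu(self.w_neighbor(torch.cat([nodes[edge_index[:, 1]], edges], dim=1)))
        agg = torch.zeros_like(nodes).index_add_(0, edge_index[:, 0], p)   # sum over j in S(i)
        new_nodes = torch.sigmoid(nodes + agg)                             # Eq. (3)
        q = torch.relu(self.w_side(nodes[edge_index[:, 0]] + nodes[edge_index[:, 1]]))
        new_edges = torch.sigmoid(edges + q)                               # Eq. (4)
        return new_nodes, new_edges

def readout(nodes):
    return nodes.mean(dim=0)                                               # Eq. (5): y_molecule

# Usage with a toy 3-atom molecule (both edge directions listed for the undirected graph)
dim = 16
layer = GNNLayer(dim)
nodes = torch.randn(3, dim)
edge_index = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1]])
edges = torch.randn(edge_index.shape[0], dim)
nodes, edges = layer(nodes, edges, edge_index)
y_molecule = readout(nodes)
```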
CNN
CNNs are DNNs that are also effective for analyzing protein sequences. Because a CNN uses a weight-sharing strategy to capture local patterns in data, it is suitable for studying biological sequences such as DNA (deoxyribonucleic acid): convolution filters can detect sequence motifs, that is, short repeating patterns that may have a biological function. The proposed deep CNN consists of sequential convolutional and pooling layers that extract features from the sequence at various scales, followed by a fully connected layer that integrates the whole-sequence information to extract protein features. Each CNN layer applies a linear transformation to the previous output, that is, multiplies it by a weight matrix, and then applies a nonlinear transformation. The weight matrices are learned during training to minimize prediction errors. The base layer of the CNN is a convolutional layer that computes a one-dimensional convolution with a given number of kernels (weight matrices followed by ReLU activation). The CNN for proteins maps sequence P into vector y using multiple filters. In the first CNN layer, applied to the protein, the n-gram technique (n = 3) represents amino acids as words: each group of three overlapping amino acids forms a word of the input sequence P. The convolution for input protein sequence P is defined as
$$convolution\left( P \right)_{ik}^{\left( t \right)} = ReLU\left( {\mathop \sum \limits_{m = 0}^{M - 1} \mathop \sum \limits_{n = 0}^{N - 1} W_{mn}^{k} P^{\left( t \right)}_{i + m,n} } \right),$$
(6)
where i is the index of the output position, k is the index of the kernel, and Wk is an M × N weight matrix with window size M and N input channels. A max-pooling layer then reduces the size of the input or hidden layers by selecting the maximally activated neuron of the convolutional layer. This makes the CNN independent of the length of the protein sequence: the number of hidden neurons generated by the convolution filters equals the number of filters and is not affected by the length of the input. For input Q, pooling is defined as
$$pooling\left( Q \right)_{ik}^{\left( t \right)} = {\text{max}}\left( {Q_{iM,k} ,Q_{{\left( {iM + 1,k} \right)}} , \ldots ,Q_{{\left( {iM + M - 1,k} \right)}} } \right).$$
(7)
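A minimal PyTorch sketch of Eqs. (6) and (7) is given below; the 3-gram vocabulary handling, embedding dimension, number of kernels, window sizes, and example sequence are illustrative assumptions.

```python
# Sketch of Eq. (6) (convolution over embedded 3-gram "words") and Eq. (7) (max-pooling).
import torch
import torch.nn as nn

def to_words(sequence, n=3):
    # overlapping n-grams (n = 3) of amino acids form the words of sequence P
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

vocab = {}                                      # maps each 3-gram to an integer id
ids = torch.tensor([vocab.setdefault(w, len(vocab))
                    for w in to_words("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])

embed = nn.Embedding(8000, 64)                  # N input channels
conv = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=5, padding=2)  # 32 kernels, window M = 5

x = embed(ids).T.unsqueeze(0)                   # (1, 64, L)
h = torch.relu(conv(x))                         # Eq. (6): (1, 32, L)

windowed = nn.MaxPool1d(kernel_size=4)(h)       # Eq. (7) with window M = 4
global_pooled = torch.max(h, dim=2).values      # window = L: (1, 32), independent of protein length
```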
Bidirectional RNN
An RNN is another type of DNN. Unlike in a CNN, the connections between RNN units form a directed cycle that creates an internal network state, allowing the network to exhibit dynamic temporal or spatial behavior. A bidirectional LSTM is an RNN variant that combines the outputs of two RNNs, one processing the sequence from left to right and the other from right to left. Instead of regular hidden units, the two proposed RNNs contain LSTM layers, which are network units that can remember a value over an arbitrary period. A bidirectional LSTM can capture long-term dependencies and has been effective in various machine learning applications. Bidirectional gated recurrent units (GRUs) are an alternative to bidirectional LSTMs that represent sequential input without using separate memory cells [28]. We use both the LSTM and GRU variants to verify that adding a recurrent structure after the CNN increases performance.
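A minimal sketch of such a two-layer bidirectional encoder, using PyTorch's built-in LSTM and GRU modules with assumed dimensions, is shown below; the gate equations that these modules implement are given next.

```python
# Sketch of a two-layer bidirectional recurrent encoder over the pooled CNN features.
import torch
import torch.nn as nn

dim = 32
x = torch.randn(1, 30, dim)                     # (batch, V pooled positions, features)

bilstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True, bidirectional=True)
bigru = nn.GRU(dim, dim, num_layers=2, batch_first=True, bidirectional=True)

out_lstm, _ = bilstm(x)                         # (1, 30, 2*dim): forward and backward states per position
out_gru, _ = bigru(x)                           # same shape; the GRU variant has no separate memory cell
```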
Features V provided by the pooling layer form a sequence \(x = \left\{ {x_{1} ,x_{2} , \ldots ,x_{V} } \right\}\), which serves as input for a two-layer bidirectional neural network. The bidirectional LSTM layers updated at step t depend on forward and backward processing as follows:
$$\begin{aligned} \vec{i}_{t} & = \sigma \left( {\vec{W}_{i} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{i} } \right), \\ \vec{f}_{t} & = \sigma \left( {\vec{W}_{f} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{f} } \right), \\ \vec{o}_{t} & = \sigma \left( {\vec{W}_{o} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{o} } \right), \\ \vec{g}_{t} & = \tanh \left( {\vec{W}_{c} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{c} } \right), \\ \vec{c}_{t} & = \vec{f}_{t} \odot \vec{c}_{t - 1} + \vec{i}_{t} \odot \vec{g}_{t} , \\ \vec{h}_{t} & = \vec{o}_{t} \odot \tanh \left( {\vec{c}_{t} } \right). \\ \end{aligned}$$
(8)
At time t, → and ← indicate the calculation direction, i is the input gate, f is the forget gate, o is the output gate, g is the candidate (modulation) input, c is the cell state, h is the hidden state, Wi, Wf, Wo, and Wc are the weight matrices of the corresponding gates, and \(\odot\) denotes elementwise multiplication. The equivalent bidirectional GRU is defined as follows:
$$\begin{aligned} \vec{z}_{t} & = \sigma \left( {\vec{W}_{z} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{z} } \right), \\ \vec{r}_{t} & = \sigma \left( {\vec{W}_{r} \left[ {x_{t} ,\vec{h}_{t - 1} } \right] + \vec{b}_{r} } \right), \\ \overrightarrow {{\tilde{h}}}_{t} & = \tanh \left( {\vec{W}_{h} \left[ {x_{t} ,\vec{r}_{t} \odot \vec{h}_{t - 1} } \right]} \right), \\ \vec{h}_{t} & = \left( {1 - \vec{z}_{t} } \right) \odot \vec{h}_{t - 1} + \vec{z}_{t} \odot \overrightarrow {{\tilde{h}}}_{t} , \\ \end{aligned}$$
(9)
where r is the reset gate and z is the update gate. The LSTM/GRU encodes \(\vec{h}_{t}\) along the left-to-right direction of the embedded protein at position t. As both the left and right directions are important for the global structure of proteins, we use a bidirectional LSTM (or bidirectional GRU). The bidirectional layers encode each position into leftward and rightward representations. The output H is the sum of the results along both directions:
$$H = \vec{W}_{h} \vec{h}_{t} + \overleftarrow{W}_{h} \overleftarrow{h}_{t} ,$$
(11)
where H is the set of hidden vectors \(H = \left\{ {h_{1}^{{\prime}\;\left( t \right)} ,h_{2}^{{\prime}\;\left( t \right)} , \ldots ,h_{\left| V \right|}^{{\prime}\;\left( t \right)} } \right\}\) obtained from the bidirectional LSTM/bidirectional GRU output. The protein vector representation is given by
$$y_{{{\text{protein}}}} = \frac{1}{\left| V \right|}\mathop \sum \limits_{i = 1}^{\left| V \right|} h_{i}^{{\prime}\;\left( t \right)} .$$
(12)
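The direction summation in Eq. (11) and the averaging in Eq. (12) could be sketched in PyTorch as follows; the weight names and shapes are assumptions, continuing from the bidirectional encoder sketch above.

```python
# Sketch of Eqs. (11)-(12): combine forward/backward states and average over positions.
import torch
import torch.nn as nn

dim = 32
out = torch.randn(1, 30, 2 * dim)               # bidirectional output (forward and backward halves)
w_fwd = nn.Linear(dim, dim, bias=False)         # forward-direction W_h
w_bwd = nn.Linear(dim, dim, bias=False)         # backward-direction W_h

H = w_fwd(out[..., :dim]) + w_bwd(out[..., dim:])    # Eq. (11): per-position hidden vectors h'_i
y_protein = H.mean(dim=1)                             # Eq. (12): (1, dim) protein representation
```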
The molecule and protein vectors are concatenated to obtain output vector \({\text{Out}} = W_{{{\text{out}}}} \left[ {y_{{{\text{molecule}}}} ;y_{{{\text{protein}}}} } \right] + b_{{{\text{out}}}}\), which is the input of a classifier, where \(W_{{{\text{out}}}}\) is a weight matrix and \(b_{{{\text{out}}}}\) is a bias vector. Finally, a softmax function is applied to \({\text{Out}} = \left[ {y_{0} ,y_{1} } \right]\) to predict a binary label that represents the presence or absence of a CPI.
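A sketch of this output layer in PyTorch, with assumed dimensions, is given below.

```python
# Sketch of the output layer: concatenate y_molecule and y_protein, apply the
# linear map (W_out, b_out), and take a softmax over the two classes.
import torch
import torch.nn as nn

dim = 32
y_molecule = torch.randn(1, dim)
y_protein = torch.randn(1, dim)

w_out = nn.Linear(2 * dim, 2)                         # produces Out = [y_0, y_1]
out = w_out(torch.cat([y_molecule, y_protein], dim=1))
probs = torch.softmax(out, dim=1)                     # probability of no interaction vs. interaction
```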