Our proposed GAT-based hybrid framework for metabolic pathway prediction, named HFGAT, is shown in Fig. 2. HFGAT mainly consists of a two-branch feature extraction layer and a fully connected (FC) layer.

In the two-branch feature extraction layer, one branch is responsible for extracting global features of the compound, while the other branch introduces a graph attention network (GAT) consisting of two graph attention layers to extract local structural features of the compound. Both the global and the local features are then fed into the FC layer, which outputs the predicted metabolic pathway categories to which the compound belongs. The details of HFGAT are described below.

### Representations of the input compounds

As in most related studies, we represent the input compound molecules as SMILES (Simplified Molecular Input Line Entry System) sequences.

SMILES is a line notation for representing molecules and reactions. It was developed by David Weininger in the late 1980s and later modified and expanded by other researchers [18]. It is designed for the storage of chemical information and is a compact chemical language with a small vocabulary (atoms and bonds) and a few grammar rules. Moreover, this molecular representation is unique and quite compact compared with most other ways of representing structure (e.g., fingerprint descriptors, molecular graph encodings, and 2D image encodings of molecular graphs), which has made it a popular representation language among both artificial intelligence researchers and chemists. For example, the SMILES sequence of L-Arginine is NC(CCCNC(N)=N)C(O)=O.

Every SMILES sequence of a compound molecule can then be processed with RDKit [19] to generate different kinds of molecular descriptors, which are used for the subsequent extraction of the global and local features of the compound.

### Global features extraction

The part above the dotted line in Fig. 2 extracts the global features of the compounds. In HFGAT, we extracted two global feature vectors for each compound molecule, denoted \(V_1\) and \(V_2\). \(V_1\) mainly characterizes the general molecular information of the compound, while \(V_2\) mainly describes the connections between the constituent atoms of the compound molecule.

\(V_1\) is generated in the same way as in [16]. For completeness, we briefly describe the process. First, seven molecular descriptors related to the size, rigidity, lipophilicity and polarizability of the compound are extracted to form the 7-dimensional vector \(V_0\); each descriptor value is obtained with RDKit [19]. Then the widely used MACCS fingerprint^{Footnote 1} of the compound is computed, also with RDKit [19], to form the vector \(V_0^{'}\). Since a MACCS fingerprint is a 166-bit binary string in which a “1” or “0” indicates the presence or absence of a specific type of substructure in the molecule, \(V_0^{'}\) is a 166-dimensional binary vector describing the overall structural information of the compound. Finally, \(V_0\) and \(V_{0}^{'}\) are concatenated to produce the 173-dimensional feature vector \(V_1\).
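The concatenation step can be sketched as follows (the descriptor values and fingerprint bits below are placeholders; in HFGAT they would be computed from the molecule with RDKit):

```python
import numpy as np

# Placeholder 7-dimensional descriptor vector V0 (size, rigidity,
# lipophilicity, polarizability, etc.); real values come from RDKit.
V0 = np.array([300.2, 4.0, 1.5, 0.8, 2.0, 60.3, 35.1])

# Placeholder 166-bit MACCS fingerprint V0'; each bit flags the
# presence (1) or absence (0) of a specific substructure.
V0_prime = np.random.default_rng(0).integers(0, 2, size=166)

# Concatenate into the 173-dimensional global feature vector V1.
V1 = np.concatenate([V0, V0_prime])
assert V1.shape == (173,)
```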

In addition to \(V_1\), we also extracted the connection information between the atoms of the compound. Concretely, we first generated the adjacency matrix of the compound, in which each row (column) corresponds to an atom. If atoms *i* and *j* are bonded, the element (*i*, *j*) is set to “1”; otherwise it is set to “0”. We then flatten the adjacency matrix into the feature vector \(V_2\).
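As a minimal illustration of how \(V_2\) is built, assuming a toy 4-atom molecule (a real adjacency matrix would be generated from the parsed molecule with RDKit):

```python
import numpy as np

# Toy adjacency matrix: entry (i, j) is 1 when atoms i and j are
# bonded, 0 otherwise. Here atom 1 is bonded to atoms 0, 2 and 3.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])

# Flatten the matrix row by row into the connectivity feature vector V2.
V2 = adj.flatten()
assert V2.shape == (16,)
```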

Both \(V_1\) and \(V_2\) characterize global information of the compound molecule and therefore serve as its global features.

### Local features extraction

The part below the dotted line in Fig. 2 extracts the local features of the compounds. First, a set of subgraphs is extracted from the molecule. The subgraphs are initially encoded with a set of *d*-dimensional (*d*=70 in this work) random vectors {\(I_1\), \(I_2\),\(\cdots\),\(I_n\)} (*n* is the number of atoms in the compound), which are processed by the GAT layers to generate the final feature vectors. The GAT thereby embeds the local structural information of the subgraphs into the vectors. In this way, a set of feature vectors {\(O_1\), \(O_2\),\(\cdots\),\(O_n\)} characterizing the local structural information of the molecule is obtained.

### GAT-based graph embedding

In our work, every local feature corresponds to a subgraph of an atom in the molecule. The subgraph of an atom consists of the atom and all its neighbors, as well as all bonds between the atom and its neighbors.
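A minimal sketch of extracting such a subgraph from an adjacency matrix (the `subgraph` helper and the toy matrix are hypothetical, for illustration only):

```python
import numpy as np

# Toy adjacency matrix: atom 1 is bonded to atoms 0, 2 and 3.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])

def subgraph(adj, i):
    """Return the subgraph of atom i: the atom itself, all its
    neighbors, and the bonds between the atom and its neighbors."""
    neighbors = np.flatnonzero(adj[i]).tolist()
    nodes = [i, *neighbors]
    bonds = [(i, j) for j in neighbors]
    return nodes, bonds

nodes, bonds = subgraph(adj, 1)
# nodes → [1, 0, 2, 3]; bonds → [(1, 0), (1, 2), (1, 3)]
```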

Like [16], we adopted a graph embedding method to obtain the embedding vector for each subgraph, rather than explicitly using atom and bond features, because the embedding method can easily convert the molecular graph into a low-dimensional vector that requires less computing power to process than the vector in its original form [20]. Unlike [16], we first assigned a *d*-dimensional random vector to every atom and then embedded the structural information of the atom's subgraph into the vector using GAT [17]. The key idea of GAT is to introduce the attention mechanism to calculate the influence weight of each node's neighbors on that node, so that the overall information of the whole graph can be aggregated from local information. The set of local features obtained from GAT should therefore be beneficial for metabolic pathway prediction.

The process of GAT-based graph embedding is shown in Fig. 3. Every input vector is first linearly transformed by a shared weight matrix \(\mathbf {W}\) (a parameter to be learnt). Then, for each atom, the importance score of each of its neighbor atoms is calculated by the self-attention mechanism and normalized. Finally, the normalized scores are used to compute a nonlinear weighted combination of the corresponding feature vectors as the final output feature vector.

To give the final output features sufficient representational power, at least one learnable linear transformation of the input features is required. Thus a shared weight matrix \(\mathbf {W}\) is used to linearly transform every *d*-dimensional input vector \(I_i\):

$$\begin{aligned} I^{'}_i=\mathbf {W} I_i \end{aligned}$$

(1)

where \(\mathbf {W}\) is the \(d \times d\) weight matrix that is learnt during model training; it is randomly initialized at the beginning.

#### Calculation of attention coefficient

The self-attention mechanism, based on the *LeakyReLU* nonlinear function, is adopted to compute the attention coefficient between atom *i* and its neighbor atom *j* in GAT [17]:

$$\begin{aligned} e_{ij}= LeakyReLU\left( A^{T}\left[ I^{'}_{i} \Vert I^{'}_{j}\right] \right) \end{aligned}$$

(2)

where the symbol \('||'\) denotes concatenation; *A* is a 2*d*-dimensional weight vector that is randomly initialized; \(j \in \mathcal {N}_i\); and \(\mathcal {N}_i\) is the set of neighbor atom nodes of *i*.

\(e_{ij}\) reflects the importance of neighbor atom *j* to atom *i*, so the contribution of *j* to the feature vector of *i* can be weighted accordingly. To make the coefficients comparable across different atom nodes, the *softmax* function is used to normalize \(e_{ij}\) over all *j*:

$$\begin{aligned} \alpha _{i j}=softmax _{j}\left( e_{i j}\right) =\frac{\exp \left( e_{i j}\right) }{\sum _{k \in \mathcal {N}_{i}} \exp \left( e_{i k}\right) } \end{aligned}$$

(3)
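Equations (2) and (3) can be sketched in NumPy as follows (the dimensions, random vectors, and neighbor set are toy placeholders; *d*=70 in the actual model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                # embedding dimension (d = 70 in the paper)
I_prime = rng.normal(size=(5, d))    # transformed vectors I'_k for 5 atoms
A = rng.normal(size=2 * d)           # 2d-dimensional weight vector, randomly initialized

def leaky_relu(x, slope=0.2):        # slope 0.2 is the GAT default
    return np.where(x > 0, x, slope * x)

# Attention coefficient e_ij between atom i and neighbor j (Eq. 2).
def e(i, j):
    return leaky_relu(A @ np.concatenate([I_prime[i], I_prime[j]]))

# Softmax normalization over atom 0's neighbors (Eq. 3).
neighbors = [1, 2, 3]
scores = np.array([e(0, j) for j in neighbors])
alpha = np.exp(scores) / np.exp(scores).sum()
assert np.isclose(alpha.sum(), 1.0)
```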

#### Generation of the feature vector

After obtaining the normalized attention coefficients of all *i*’s neighbor atoms, the embedded feature vector of atom *i* can be obtained by the following nonlinear weighted combination:

$$\begin{aligned} R_{i}=\sigma \left( \sum _{j \in \mathcal {N}_{i}} \alpha _{ij} I^{'}_{j}\right) \end{aligned}$$

(4)

where \(\sigma\) is the ELU (Exponential Linear Unit) activation function used in GAT [17]. ELU reduces the bias-shift effect and brings the normal gradient closer to the unit natural gradient, pushing mean activations towards zero and thus accelerating learning.
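A minimal NumPy sketch of Eq. (4), assuming the normalized coefficients \(\alpha_{ij}\) have already been computed:

```python
import numpy as np

def elu(x, a=1.0):
    # Exponential Linear Unit: identity for x > 0, a*(exp(x)-1) otherwise.
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

rng = np.random.default_rng(2)
d = 4
I_prime = rng.normal(size=(5, d))    # transformed vectors I'_k for 5 atoms
neighbors = [1, 2, 3]                # neighbor set N_i of atom i
alpha = np.array([0.5, 0.3, 0.2])    # normalized coefficients from Eq. (3)

# Eq. (4): nonlinear weighted combination of the neighbors' vectors.
R_i = elu((alpha[:, None] * I_prime[neighbors]).sum(axis=0))
assert R_i.shape == (d,)
```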

It has been found that multi-head attention, i.e., independently executing the above procedure more than once and concatenating the outputs as the final feature representation, helps stabilize the learning process of self-attention in GAT [17], so we also adopted this strategy. In particular, different heads attend to different positions of the graph structure, so the extracted features differ and complement each other. Suppose the transformation process is independently executed *m* times (*m*=2 in this work) and the output of Eq. (4) for atom *i* in the \(k\hbox {th}\) execution is \(R^{k}_{i}\); then the final embedded feature vector \(O_{i}\) characterizing the subgraph centered on atom *i* is:

$$\begin{aligned} O_{i}=\Vert _{k=1}^{m} R^{k}_{i} \end{aligned}$$

(5)
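Equation (5) amounts to a simple concatenation of the per-head outputs, e.g. (with toy dimensions):

```python
import numpy as np

d, m = 4, 2                          # m = 2 attention heads as in the paper
rng = np.random.default_rng(3)

# R[k]: output of Eq. (4) for atom i under head k (random placeholders).
R = [rng.normal(size=d) for _ in range(m)]

# Eq. (5): concatenate the per-head outputs into O_i.
O_i = np.concatenate(R)
assert O_i.shape == (m * d,)
```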

### Fully connected layer

Both the global features (\(V_1\), \(V_2\)) and the local features (\(O_1\), \(O_2\),\(\cdots\)) of the compound molecule are fed into the FC layer, which uses the SoftMax function to predict the metabolic pathways. The number of input nodes of the FC layer is therefore the number of feature vectors, and the number of output nodes is the number of metabolic pathway categories. Since the function outputs a set of probabilities that the compound belongs to specific metabolic pathway categories, we simply use 0.5 as the decision threshold: if the probability that a compound belongs to a certain metabolic pathway category is greater than 0.5, the compound is considered to belong to that category.
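The thresholding step can be sketched as follows (the probability values are hypothetical):

```python
import numpy as np

# Hypothetical per-category probabilities output by the FC layer for
# one compound over, say, 6 metabolic pathway categories.
probs = np.array([0.91, 0.12, 0.55, 0.49, 0.03, 0.78])

# Multi-label decision: every category whose probability exceeds 0.5.
predicted = np.flatnonzero(probs > 0.5)
# predicted → categories 0, 2 and 5
```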