A comparison of embedding aggregation strategies in drug–target interaction prediction
BMC Bioinformatics volume 25, Article number: 59 (2024)
Abstract
The prediction of interactions between novel drugs and biological targets is a vital step in the early stage of the drug discovery pipeline. Many deep learning approaches have been proposed over the last decade, with a substantial fraction of them sharing the same underlying two-branch architecture. Their distinction is limited to the use of different types of feature representations and branches (multilayer perceptrons, convolutional neural networks, graph neural networks and transformers). In contrast, the strategy used to combine the outputs (embeddings) of the branches has remained mostly the same. The same general architecture has also been used extensively in the area of recommender systems, where the choice of an aggregation strategy is still an open question. In this work, we investigate the effectiveness of three different embedding aggregation strategies in the area of drug–target interaction (DTI) prediction. We formally define these strategies and prove their universal approximator capabilities. We then present experiments that compare the different strategies on benchmark datasets from the area of DTI prediction, showcasing conditions under which specific strategies could be the obvious choice.
Introduction
Drug discovery is a challenging task that has consistently proven to be time-consuming and expensive [1]. As a result, pharmaceutical companies are turning towards computational methods capable of automating more stages of their drug discovery pipelines. One of the earliest stages involves the prediction of interactions between chemical compounds and biological targets, also known as drug–target interaction (DTI) prediction, replacing large-scale biological screening experiments with more efficient approaches. In this area, two groups of computational methods are distinguished: docking methods and machine learning approaches. Docking methods simulate the physical interaction of molecules and proteins in 3D space, accounting for the physical structure [2]. Machine learning approaches, on the other hand, predict drug–target interactions by learning from data, using databases that contain results of traditional high-throughput screening experiments [3].
Recently, deep learning methods have gained a lot of interest in DTI prediction. In these methods, explicit features for the compounds and proteins are available beforehand and passed to a deep neural network. Even though many approaches exist, we will focus on a popular subcategory of two-branch deep neural networks. As shown in Fig. 1, these architectures consist of two neural network components that encode the compound and protein features, respectively. The two embedding vectors generated by the branches are then aggregated to obtain the final prediction. A non-exhaustive list of recent approaches that consider an architecture of that kind can be found in Table 1.
DTI prediction is closely related to recommender systems based on collaborative filtering, such as the well-known Netflix challenge [16]. In both cases, a standard dataset takes the form of triplets: \(\{compound, protein, activity\}\) or \(\{user, item, interaction\}\). These triplets can be arranged in a sparse matrix, and in its simplest form the prediction task is matrix completion. However, when explicit feature representations are used, three additional prediction settings become feasible, in addition to matrix completion [17]: making predictions for new compounds, new proteins or new protein-compound combinations. In DTI prediction, every method utilizes the explicit features that are available for the compounds and proteins. In contrast, similar methods in recommender systems often do not use explicit features, but create one-hot dummy vectors (implicit features) to characterize the users and items [18, 19].
DTI prediction and collaborative filtering methods also consider different aggregation strategies for the drug–target or user-item embeddings. Every DTI prediction method in Table 1 concatenates the two embeddings and then passes the resulting vector to a multilayer perceptron (MLP strategy, see Fig. 1). Conversely, the area of collaborative filtering has a more diverse landscape, with various aggregation strategies that are regularly used, see e.g. [18, 20,21,22,23,24,25,26]. The dot product in particular has been used extensively in that area, and papers benchmarking the dot product versus the MLP have been published [27, 28]. Currently, a similar investigation is missing from the area of DTI prediction.
In this work, we will discuss the behavior of different embedding aggregation strategies in DTI prediction. To this end, we will analyze the above-mentioned MLP and dot product strategies, as well as a third strategy, the tensor product, which used to be popular in the era of kernel methods [29,30,31]. First, we will present theoretical results highlighting the universal approximation properties of all three strategies, departing from well-known mathematical building blocks. Subsequently, we will present benchmarking results of the three strategies on combinations of DTI datasets, branch types and prediction settings. Furthermore, we will investigate the effect of adding implicit feature representations, and we will interpret the learned embeddings. As a result, the main goal of this paper is to compare the aggregation strategies in detail and not to present a winning architecture that achieves state-of-the-art performance.
Methods
Aggregation strategies
We first describe the details of the considered aggregation strategies. As shown in Fig. 1, we compare three strategies: the dot product, the multilayer perceptron and the tensor product.

1. Dot product: In the dot product strategy, the dot product of the two embeddings is directly computed and used as the prediction of the model. This strategy requires both embeddings to have the same size, a restriction that we will further investigate in a later section.

2. Multilayer perceptron: The MLP strategy concatenates the embeddings and then passes the resulting vector as input to a multilayer perceptron, which, in turn, terminates at a single output node. As stated before, the use of an MLP increases the capacity of the overall model compared to the simple dot product operation, but, perhaps, also introduces an unwanted overhead.

3. Tensor product: The tensor product strategy first computes the tensor product of the two embeddings and then uses a single fully-connected layer that terminates at an output node. In fact, this aggregation strategy has never been suggested as an embedding aggregation strategy for deep neural networks, but it has been extensively used in kernel methods [29,30,31].
For reproducibility reasons, and to describe the universal approximation capabilities of the three aggregation strategies, we present formal definitions of the three strategies. Let \({\mathcal {X}}\) and \({\mathcal {T}}\) be two Euclidean spaces for compounds \(\vec {x}\) and proteins \(\vec {t}\), respectively. We formally define the problem of DTI prediction as that of estimating functions \(f: {\mathcal {X}} \times {\mathcal {T}} \rightarrow {\mathcal {Y}}\), where \({\mathcal {Y}} = {\mathbb {R}}\) in case of regression problems. Let us consider hypothesis spaces
$$\begin{aligned} {\mathcal {H}}_{{\mathcal {X}}} = \{\vec {g_{\theta _{\mathcal {X}}}}: {\mathcal {X}} \rightarrow {\mathbb {R}}^{D_1}\}, \qquad {\mathcal {H}}_{{\mathcal {T}}} = \{\vec {h_{\theta _{\mathcal {T}}}}: {\mathcal {T}} \rightarrow {\mathbb {R}}^{D_2}\} \end{aligned}$$
for learning compound embeddings from \({\mathcal {X}}\) to \({\mathbb {R}}^{D_1}\) with dimensionality \(D_1\), and protein embeddings from \({\mathcal {T}}\) to \({\mathbb {R}}^{D_2}\) with dimensionality \(D_2\). \(\vec {\theta }_{\mathcal {X}}\) and \(\vec {\theta }_{\mathcal {T}}\) denote the parameterizations of the two types of functions. Moreover, let us consider the space \(C({\mathcal {X}} \times {\mathcal {T}})\) of all continuous real-valued functions \(f:{\mathcal {X}} \times {\mathcal {T}} \rightarrow {\mathbb {R}}\). A subspace \({\mathcal {H}}_{\textrm{DP}}\) of \(C({\mathcal {X}} \times {\mathcal {T}})\) corresponds to the dot product strategy of the two-branch architecture, i.e., functions of the form
$$\begin{aligned} f(\vec {x},\vec {t}) = \langle \vec {g_{\theta _{\mathcal {X}}}}(\vec {x}), \vec {h_{\theta _{\mathcal {T}}}}(\vec {t}) \rangle = \sum _{d=1}^{D} g_d(\vec {x})\, h_d(\vec {t}), \end{aligned}$$
with \(\vec {g_{\theta _{\mathcal {X}}}}=(g_1,\ldots ,g_D) \in {\mathcal {H}}_{{\mathcal {X}}}\), \(\vec {h_{\theta _{\mathcal {T}}}}=(h_1,\ldots ,h_D) \in {\mathcal {H}}_{{\mathcal {T}}}\), and D the common dimensionality of the two embeddings.
A second subspace \({\mathcal {H}}_{\textrm{MLP}}\) corresponds to the MLP strategy of the two-branch architecture, in which the MLP is comprised of one hidden layer of size \(L\in {\mathbb {N}}\), i.e., functions of the form
where \(\vec {C}^{(3)} \in {\mathbb {R}}^{L \times (D_1 + D_2) }\), \(\vec {B}^{(3)} \in {\mathbb {R}}^{1 \times L}\), \(\vec {b}^{(3)} \in {\mathbb {R}}^{D_1+D_2}\), and \([\vec {g},\vec {h}]\) denotes vector concatenation of the \(D_1\)-dimensional vector \(\vec {g}\) and the \(D_2\)-dimensional vector \(\vec {h}\). \(\sigma \circ\) represents an element-wise nonlinear transformation.
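As a hedged sketch, a one-hidden-layer MLP over the concatenated embeddings is commonly written as shown below. This is an assumed form built from the symbols listed above, not necessarily the authors' exact expression:

$$\begin{aligned} f(\vec {x},\vec {t}) = \vec {B}^{(3)}\, \sigma \circ \left( \vec {C}^{(3)} [\vec {g_{\theta _{\mathcal {X}}}}(\vec {x}), \vec {h_{\theta _{\mathcal {T}}}}(\vec {t})] + \vec {b}^{(3)} \right) . \end{aligned}$$

In this reading the bias is added after the linear map, so it would have length \(L\). A bias of length \(D_1+D_2\), as stated in the text, would instead correspond to a bias added to the concatenated vector before \(\vec {C}^{(3)}\) is applied.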
We introduce a third and final subspace \({\mathcal {H}}_{\textrm{TP}}\) that defines the tensor product strategy of the two-branch architecture, in which compound and protein embeddings are aggregated by means of a tensor product, followed by a linear layer with a single output neuron. The resulting functions are of the form
$$\begin{aligned} f(\vec {x},\vec {t}) = \sum _{i=1}^{D_1} \sum _{j=1}^{D_2} W_{ij}\, g_i(\vec {x})\, h_j(\vec {t}), \end{aligned}$$
where \(\vec {g_{\theta _{\mathcal {X}}}}=(g_1,\ldots ,g_{D_1}) \in {\mathcal {H}}_{{\mathcal {X}}}\), \(\vec {h_{\theta _{\mathcal {T}}}}=(h_1,\ldots ,h_{D_2}) \in {\mathcal {H}}_{{\mathcal {T}}}\), and \(\vec {W} \in {\mathbb {R}}^{D_1 \times D_2 }\).
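The three aggregation strategies can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, embeddings are plain lists, the MLP bias is assumed to have length \(L\), and the tensor product layer is assumed to have no bias.

```python
import math


def dot_product(g, h):
    """Dot product strategy: both embeddings must have the same size D."""
    if len(g) != len(h):
        raise ValueError("dot product needs embeddings of equal size")
    return sum(gi * hi for gi, hi in zip(g, h))


def mlp(g, h, C, b, B, sigma=math.tanh):
    """MLP strategy with one hidden layer of size L.

    C: L rows of length D1 + D2, b: length-L bias (assumption),
    B: length-L output weights, sigma: element-wise nonlinearity.
    """
    z = list(g) + list(h)  # concatenation [g, h]
    hidden = [
        sigma(sum(c * zi for c, zi in zip(row, z)) + bias)
        for row, bias in zip(C, b)
    ]
    return sum(Bk * hk for Bk, hk in zip(B, hidden))


def tensor_product(g, h, W):
    """Tensor product strategy: sum over i, j of W[i][j] * g[i] * h[j]."""
    return sum(
        W[i][j] * g[i] * h[j]
        for i in range(len(g))
        for j in range(len(h))
    )
```

Note that only the dot product constrains the two embedding sizes to be equal; the other two accept any \(D_1\) and \(D_2\).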
Datasets
The proposed variants of the two-branch neural network architecture were evaluated on two benchmarks, Davis [32] and KIBA [33], because of their use as benchmarks in many of the studies presented in Table 1 and their varying numbers of compounds, proteins, and recorded affinities (see Table 2). The models were trained on the regression task, which aims to predict the affinity scores for compound-target pairs.
Prediction settings
Collaborative filtering methods usually predict missing interactions between collections of users and items that have been observed during training (random-split), something that does not require any explicit feature representations. In contrast, the availability of such features in the typical DTI prediction task makes three additional prediction settings feasible [17]. The model could be expected to generate predictions for novel drugs (cold-drug) or novel targets (cold-target) that have not been observed during training. A fourth option that combines the strategies of the previous two is concerned with the prediction for pairs of novel drugs and targets that have not been observed during training. In the collaborative filtering area, these types of prediction settings are used less frequently but also witness a growing interest (cold start collaborative filtering). In this paper, we run experiments for the first three prediction settings: random-split, cold-drug and cold-target. The fourth prediction setting (combination of cold-drug and cold-target) was excluded from our analysis, as it is not implemented in the DeepPurpose library, which is used as the building block for protein and compound branches (see the next section). For each prediction setting, every dataset is split into training, validation and test sets (70–10–20%). However, the way the data is separated differs. For the cold-target setting, 70% of the targets only appear in the training set, 10% only appear in the validation set, and the remaining 20% only appear in the test set. In the cold-drug setting, the same ratios are used to split the drugs. Finally, for the random-split setting, the ratios are used to split the \(\{compound, protein, activity\}\) triplets in the dataset.
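The difference between the three splitting schemes can be illustrated with a short sketch. It is not the DeepPurpose implementation: the function names, the seed handling and the rounding of group sizes are assumptions made for illustration only.

```python
import random


def split_ids(ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle unique ids and cut them into train/validation/test groups."""
    ids = sorted(set(ids))
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return (
        set(ids[:n_train]),
        set(ids[n_train:n_train + n_val]),
        set(ids[n_train + n_val:]),
    )


def split_triplets(triplets, setting, seed=0):
    """Split (compound, protein, activity) triplets.

    setting: 'random-split' splits the triplets themselves,
    'cold-drug' splits by compound, 'cold-target' splits by protein.
    """
    if setting == "random-split":
        keys = list(range(len(triplets)))
    elif setting == "cold-drug":
        keys = [t[0] for t in triplets]
    elif setting == "cold-target":
        keys = [t[1] for t in triplets]
    else:
        raise ValueError(f"unknown setting: {setting}")
    train, val, _ = split_ids(keys, seed=seed)
    parts = {"train": [], "val": [], "test": []}
    for key, triplet in zip(keys, triplets):
        name = "train" if key in train else "val" if key in val else "test"
        parts[name].append(triplet)
    return parts
```

In the cold settings the ratios apply to the entities rather than to the triplets, so the resulting triplet counts only approximate 70–10–20% when entities have different numbers of recorded affinities.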
Benchmarking experiments
For the implementation of the two-branch architectures, the DeepPurpose library [15] was chosen as the starting point. A forked version of the repository, which has been heavily modified, is available online (see Availability of data and materials).
Since multiple branch pairings were possible, we included three combinations, reflecting varying degrees of descriptor and branch baseline complexity. These combinations included:

An MLP compound branch on Morgan fingerprints [34] and an MLP protein branch on amino acid composition descriptors [35].

A 1D Convolutional Neural Network (CNN) [36] compound branch on SMILES strings and a CNN protein branch on amino acid sequences.

A Message-Passing Neural Network (MPNN) [37] compound branch on molecular graphs and a CNN protein branch on amino acid sequences.
Implicit feature representations
All the experiments mentioned above utilize different forms of explicit feature representations for both compounds and proteins, but not the structure of the interaction matrix. To better investigate the quality of these sources of information, we conducted additional experiments with the following differences:

An MLP branch for the compounds and proteins that, instead of using explicit features, utilizes one-hot encoded dummy vectors. Since no generalization to new compounds or new proteins is possible when using this type of feature, we only focus on the random-split setting. When the dot product is used as the aggregation strategy, the resulting architecture is a close analogue to traditional matrix factorization methods.

A two-branch architecture where each branch is comprised of an internal two-branch model (Fig. 1B). The internal model is designed to utilize the explicit and implicit features of the compounds/proteins, something that could potentially lead to improved performance. The MLP strategy is always used for the aggregation of the internal embeddings, while all three strategies of interest are available for the external embeddings.
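The analogy between one-hot inputs with dot product aggregation and matrix factorization can be made concrete with a small sketch. It assumes the simplest possible branch, a single bias-free linear layer, rather than the MLP branches used in the experiments; all names are hypothetical.

```python
def one_hot(index, size):
    """Implicit feature vector: 1.0 at `index`, 0.0 elsewhere."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def linear_branch(x, weights):
    """Bias-free linear layer; `weights` has one row per input feature."""
    dim = len(weights[0])
    return [sum(x[k] * weights[k][d] for k in range(len(x))) for d in range(dim)]


def predict(i, j, P, Q):
    """Prediction for compound i and protein j.

    P: one embedding row per compound, Q: one embedding row per protein.
    """
    g = linear_branch(one_hot(i, len(P)), P)  # selects row i of P
    h = linear_branch(one_hot(j, len(Q)), Q)  # selects row j of Q
    return sum(a * b for a, b in zip(g, h))
```

Under this assumption the full prediction matrix is the product of P with the transpose of Q, which is the low-rank form fitted by matrix factorization.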
Hyperparameter optimization
For the comparison of the three aggregation strategies across two DTI prediction datasets and the three prediction settings, we utilized random search as the hyperparameter optimization method of choice. For every optimization round, a budget of 100 configurations was allocated, with each experiment training for up to 100 epochs (early stopping on the validation loss was also used).
The hyperparameter ranges of every experiment had to be adapted based on the aggregation strategy and branch architectures used. The full details can be found in the Additional file 1: Appendix. Every model was trained on a single GPU (either NVIDIA Ampere A100 or NVIDIA Volta V100), and all the results were logged using the Weights and Biases platform [38].
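A random search with a fixed budget and early stopping can be sketched as follows. The search space, the patience value and the function names are hypothetical; the actual ranges are those listed in the appendix, and `train_and_validate` stands in for a full training run.

```python
import random


def random_search(space, train_and_validate, budget=100, seed=0):
    """Sample `budget` configurations; keep the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        loss = train_and_validate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def best_loss_with_early_stopping(val_losses, max_epochs=100, patience=10):
    """Scan per-epoch validation losses and stop once `patience`
    consecutive epochs bring no improvement (patience is an assumption)."""
    best, since_best = float("inf"), 0
    for loss in val_losses[:max_epochs]:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best
```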
Metrics
To guarantee a consistent comparison across all our experiments, we adopted the same regression metrics as used in the majority of the work presented in Table 1. These are:

(i) Mean squared error (MSE): Measures the differences between the predicted values and the real values. Assuming n drug–target pairs, the MSE is calculated as the average of the squared differences between the predicted affinity scores \({\hat{y}}\) and the true affinity scores y. The goal is to minimize the MSE score as this means that the predictions are close to the true values:
$$\begin{aligned} \text {MSE} = \frac{1}{n} \sum _{i=1}^n ( {\hat{y}}_i - y_i)^2. \end{aligned} \tag{2}$$

(ii) R-squared (R\(^2\)): Also known as the coefficient of determination, R\(^2\) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variables in a regression model. Unlike MSE, which is a measure of the model’s absolute error, R\(^2\) is a relative measure of how well the regression predictions approximate the true values. An R\(^2\) of 1 indicates that the regression predictions perfectly fit the data. In the context of drug–target interaction prediction, it quantifies how well the variations in the predicted affinity scores \({\hat{y}}\) explain the variation in the true affinity scores y. The formula for R\(^2\) is given by:
$$\begin{aligned} R^2 = 1 - \frac{\sum _{i=1}^n (y_i - {\hat{y}}_i)^2}{\sum _{i=1}^n (y_i - {\bar{y}})^2}, \end{aligned} \tag{3}$$
where \({\bar{y}}\) is the mean of the true affinity scores. A higher R\(^2\) score indicates a better model fit.

(iii) Concordance index (CI): CI is the probability that the predicted affinity scores of two randomly chosen drug–target pairs are in the correct order:
$$\begin{aligned} \textrm{CI} = \frac{1}{\sum _{i=1}^n \sum _{j=1}^n I(y_i> y_j)} \sum _{i=1}^n \sum _{j=1}^n I(y_i> y_j) I({\hat{y}}_i > {\hat{y}}_j), \end{aligned} \tag{4}$$
where I is the indicator function, taking value 1 if its argument is true, 0 otherwise. A higher value of the concordance index indicates a better model fit.
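The three metrics translate directly into code. The following plain-Python sketch follows the formulas above term by term; the function names are illustrative, and the quadratic-time loop for CI is chosen for clarity rather than speed.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def concordance_index(y_true, y_pred):
    """Fraction of ordered pairs with y_i > y_j whose predictions keep
    that order. Tied predictions count as incorrect, as in the CI
    formula above."""
    comparable = concordant = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1
    return concordant / comparable
```

Both `r_squared` and `concordance_index` are undefined when all true affinities are equal, because their denominators are then zero.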
Results
Universal approximation
Using existing mathematical results from [39, 40] as building blocks, one can easily show that all three aggregation strategies are universal approximators. We provide further details and formal derivations in Additional file 1: Appendix, but summarize the main insights here. Broadly speaking, universal approximation theorems imply that neural networks can represent a wide variety of interesting functions when given appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible.
In the setting of DTI prediction, universal approximation can only be guaranteed if the protein branch, compound branch and the aggregation strategy are universal approximators. For the formal results presented in the Additional file 1: Appendix, we assume that the protein and compound branches are universal approximators, and we show that this is sufficient to prove universal approximation of the three aggregation strategies. For simplicity, we also assume that protein and compound feature vectors can be represented in Euclidean spaces, with multilayer perceptrons as protein and compound branches. Similar universal approximation theorems could be easily derived for different activation functions [41, 42], non-Euclidean spaces [43], and other types of neural network architectures, such as deep convolutional neural networks [44].
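As a small illustration of how the strategies relate (a sketch, not the derivation given in the appendix), a tensor product model with weight matrix \(\vec {W} \in {\mathbb {R}}^{D_1 \times D_2}\) can be rewritten as a dot product model by folding \(\vec {W}\) into the protein branch. The symbol \({\tilde{h}}\) is notation introduced only for this sketch:

$$\begin{aligned} \sum _{i=1}^{D_1} \sum _{j=1}^{D_2} W_{ij}\, g_i(\vec {x})\, h_j(\vec {t}) = \sum _{i=1}^{D_1} g_i(\vec {x})\, {\tilde{h}}_i(\vec {t}), \qquad {\tilde{h}}_i(\vec {t}) = \sum _{j=1}^{D_2} W_{ij}\, h_j(\vec {t}). \end{aligned}$$

Since \({\tilde{h}}\) is a linear map of \(\vec {h}\), a protein branch that ends in a linear layer can represent it. Under that assumption, every tensor product model is also a dot product model with common dimensionality \(D = D_1\).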
Benchmarking experiments
In this section, we present extensive comparisons of the three embedding aggregation strategies on popular benchmark datasets from the field of DTI prediction. The experiments span two different DTI prediction datasets, three prediction settings (random, cold-drug, cold-target), as well as three different combinations of input feature representations and compound-protein branch architectures. Table 3 offers a quick summary of the results that have been obtained.
In the majority of cases and for both recorded metrics, we see that the dot product and tensor product strategies can be seen as competitive alternatives to the MLP. In many cases, they achieve superior performance. At the same time, none of the three strategies can be highlighted as the strategy of choice purely based on the final performance, as none of them consistently outperforms the others. These results confirm our theoretical findings presented in the Additional file 1: Appendix, as all three strategies can approximate any target function. Interestingly, specific combinations of dataset, prediction setting and branch pair exist, in which the three strategies result in unexpected performance differences. More specifically, when we used an MPNN as the compound branch and a CNN as the protein branch, the dot product and tensor product strategies clearly failed to reach the performance of the MLP strategy in both randomly split datasets. A more detailed investigation of the reasons behind this result as well as a potential remedy for the dot product and tensor product strategies are presented in a later subsection.
An important characteristic of any model that can influence many practical aspects of the training process is its overall capacity. Even though capacity measures of neural networks exist (e.g. Vapnik–Chervonenkis dimension [45]), they are primarily dealing with simple MLP architectures. Since the two-branch architecture we utilize can be equipped with more complex branches (CNNs, MPNNs, etc.), we decided to simplify things and instead use the total number of trainable parameters as the measure of model capacity. Since a smaller network with fewer parameters can result in reduced memory and lower computational requirements, a smaller model that can still achieve a similar performance compared with a larger counterpart is highly desirable. Our initial hypothesis was that the overhead of the MLP strategy introduced by the fully-connected layers after the concatenation of the compound and protein embeddings would result in larger models. Based on the results shown in Table 3, the aforementioned competitiveness of the dot product strategy is usually achieved by small models. By accounting for this extra information, we can more confidently suggest the dot product strategy as a replacement for the MLP-based architecture.
Revisiting the close connection with recommender systems, a highly-cited publication by He et al. [18] first presented the dot product strategy as the simplest neural network approximation of matrix factorization. He et al. [18] then suggested the MLP strategy as a more powerful approach with the capacity to model more complex relationships between the items and users. This experimentally-backed strategy was then adopted by a series of subsequent publications [20,21,22,23] in the area of recommender systems, while proposals with the dot product strategy continued to be considered [24,25,26].
The superiority of the MLP aggregation strategy in the area of collaborative filtering was questioned by several subsequent publications. Rendle et al. [27] showed that, with careful hyperparameter selection, the dot product strategy could outperform the MLP strategy. They also pointed out that an MLP cannot trivially approximate the seemingly basic dot product operation. Xu et al. [28] offered a more rigorous comparison by investigating the limiting expressivity of each strategy, the convergence under the practical gradient descent algorithm, and the generalization potential. The two aforementioned publications approach the comparison of strategies exclusively in the area of recommender systems using benchmark datasets that are missing any explicit features for the users or items. In our investigation, which includes explicit feature representations and multiple generalization settings, we reach conclusions similar to those of [27] and [28].
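The parameter overhead of each aggregation step can be estimated with simple arithmetic. The sketch below counts only the trainable parameters of the aggregation head, not those of the branches; the function names and the layer sizes in the usage example are illustrative and are not taken from the experiments.

```python
def dot_product_params():
    """The dot product has no trainable parameters of its own."""
    return 0


def tensor_product_params(d1, d2, bias=False):
    """One weight per pair of embedding entries, plus an optional bias."""
    return d1 * d2 + (1 if bias else 0)


def mlp_params(d1, d2, hidden_sizes):
    """Weights and biases of fully-connected layers on [g, h],
    ending in a single output node."""
    sizes = [d1 + d2, *hidden_sizes, 1]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Example with hypothetical sizes: two 128-dimensional embeddings.
print(dot_product_params())            # 0
print(tensor_product_params(128, 128)) # 16384
print(mlp_params(128, 128, [512]))     # 132097
```

With equal branch sizes, any difference in total model size between strategies therefore comes from the aggregation head alone.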
Implicit feature representations
Furthermore, Table 4 contains the results for two additional neural network configurations: two-branch neural networks that only use implicit features, and two-branch neural networks that use implicit and explicit features. So, in combination with the results of Table 3, which summarized the results for two-branch neural networks that only included explicit features, we compare here three types of two-branch neural networks. Overall, the initial setup with only explicit features gives the best results, but the differences between the three variants are small. The negligible differences let us conclude that adding implicit features does not have benefits for the datasets and models that we considered. However, the neural network that only uses implicit features still yields a satisfactory performance, so a clear structure must be present in the interaction matrices of the two datasets.
In the last decade, matrix factorization methods have been extensively used to exploit the structure in the interaction matrix by decomposing it into two small matrices that contain the implicit features [46]. Formally speaking, the structure of the interaction matrix can be summarized using the singular values of that matrix. If most singular values differ from zero, the interaction matrix has a high rank, so approximating it as a product of two smaller matrices will lead to little predictive power and meaningless implicit features. Conversely, if most singular values are equal to zero, the interaction matrix has a low rank, and low-rank matrix approximation via matrix factorization will result in good predictive performance and meaningful implicit features.
Matrix factorization methods share major similarities with the two-branch architectures we consider in this work. The simplest version of the two-branch architecture, which uses only the implicit features and aggregates the embeddings via the dot product strategy, can be seen as a way of performing matrix factorization [18, 28, 47,48,49]. For explicit feature representations and other aggregation strategies, the link with matrix factorization is less obvious, and the models become more difficult to analyze in a formal way. However, we believe that the structure of the interaction matrix is also exploited by the models in that case, because low-dimensional embeddings of proteins and compounds are constructed. In our opinion, that is the main reason why adding implicit features does not lead to performance gains in our experiments.
Let us remark that matrix factorization methods have been extensively used for DTI prediction. Early work by Cobanoglu et al. [50] used probabilistic matrix factorization combined with active learning without relying on compound or protein similarities. Ezzat et al. [51] proposed a graph regularized matrix factorization method (and a weighted variant) to perform manifold learning and improve the performance in the colddrug and coldtarget prediction settings. More recent work by Mazzone et al. [52] used the NXTfusion [53] library that extends traditional matrix factorization methods in a nonlinear fashion by inferring over an arbitrary number of data matrices, which are built as an entity relation graph. The data fusion step was performed by training a multitask neural network that includes side information.
The concept of combining implicit and explicit information to improve the performance of DTI prediction tasks has also been incorporated into various kernel-based methods. Gonen [29] proposed a Bayesian formulation that combines kernel-based nonlinear dimensionality reduction, matrix factorization and binary classification to predict interactions using only the chemical similarity between compounds and genomic similarity between proteins. Another kernel method, called NMTF-DTI [54], used a non-negative tri-factorization technique based on Laplacian regularization and multiple similarity matrices for drugs and targets. A third approach defined the Gaussian interaction profile kernel [55] to capture topological information from the drug–target interaction network, and a variation of that method improved the predictive performance even further by incorporating extra sources of chemical and genomic information with additional kernels. These results do not agree with our findings, but the datasets used in most kernel-based papers were significantly smaller than those we analyzed.
Similar to the two-branch neural networks we discuss in this work, kernel methods enable the projection of the compounds and proteins into a shared space from which the predictions are generated. However, in two-branch neural networks, the branches “learn” the compound and protein embeddings, while kernel-based methods precompute these embeddings by specifying a kernel. In general, one can assume that learning the embeddings is better than defining them based on a kernel, especially when datasets are large. That is probably the reason why deep learning methods have won the battle against kernel methods in DTI prediction. Furthermore, it is worth mentioning that the use of multiple kernels for drugs and targets with the goal of improving performance [56, 57] shows many similarities with multimodal neural network architectures that utilize multiple branches (one per feature representation) before the aggregation step [58,59,60,61]. Both concepts essentially try to boost performance by including different types of information from the same entity.
Embedding visualizations
To better understand the similarities and differences between the MLP and dot product, we experimented with various visualizations and included the most interesting examples in this work. From this point onwards, our analysis skips the tensor product, as it does not provide any performance gains or distinct characteristics compared with the MLP and dot product strategies. In their simplified form, one of the key conceptual distinctions among these strategies lies in the location of the learning process. In the dot product strategy, this is exclusively done in the two branches since the aggregation strategy is just the dot product operation. In contrast, the MLP strategy shares the learning between the two branches and the fully-connected layers after the concatenation of the embeddings. A basic visualization of the compound-protein embeddings obtained from the fully-connected layers (Fig. 2) of a well-performing MLP strategy shows an improving separation of the active and inactive compound-protein pairs the closer we get to the output node.
To investigate the quality of the compound embeddings, we experimented with another type of visualization in Fig. 3. The four subplots visualize the affinity scores of four proteins (with the highest number of recorded affinities in the KIBA dataset) after the Uniform Manifold Approximation and Projection (UMAP) [62] is applied on the compound embeddings.
Even though the affinity differences observed in compound clusters between the four proteins are quite interesting, we focus on the relative comparison of the learned clusters. Based on the aforementioned discussion about the location of the learning process, the well-defined clusters of the compound embeddings from the dot product strategy are largely expected, as the two branches are exclusively responsible for learning to predict affinities. At the same time, and somewhat surprisingly, the visual inspection of the compound embeddings obtained from the MLP strategy shows similar embedding quality as the dot product, even though the fully-connected layers after the embedding concatenation share the learning responsibility.
Oversmoothing effect and its importance when selecting an embedding aggregation strategy
For specific combinations of dataset, prediction setting and branch pair, Table 3 includes dot product and tensor product examples that are clearly outperformed by the MLP aggregation strategy. These cases include the use of the MPNN compound branch and the CNN protein branch on the randomly split version of the two datasets. As the CNN protein branch had been successfully used in the CNN–CNN combination, we focused our efforts on the MPNN branch as the defective model. In an effort to increase its capacity, we first tested configurations with deeper MPNNs, something that ultimately did not yield any improvements. Our hypothesis was that the oversmoothing effect that many graph neural networks suffer from [63] made any attempt to increase the capacity fail performance-wise. This is shown in the two plots of Fig. 4, where points from models that use the dot product strategy and varying sizes of the MPNN branch architecture fail to approach the performance the MLP strategy achieves. Assuming that the superior performance of the MLP-based architecture was a result of the fully-connected layers, we decided to test new dot-product-based configurations by appending fully-connected layers after a smaller MPNN model and before the embedding aggregation step. With this strategy, we managed to both increase the capacity of the compound branch and avoid the oversmoothing effect that larger MPNN models suffer from. Referring back to Fig. 4, these modified architectures colored in red show major improvements, as they became comparable performance-wise with the MLP strategy.
Oversmoothing is a well-recognized challenge in GNNs, and numerous research papers have been dedicated to addressing this issue [64,65,66]. The underlying message of this exploration is that the dot product cannot counterbalance weaknesses that may arise from the branches, as it lacks the trainable parameters that the MLP strategy has. This problem can be tackled by increasing the capacity of the defective branch. However, cases still exist where this strategy is inadequate because of branch-specific weaknesses (e.g., oversmoothing in GNNs). In such cases, it is suggested to increase the capacity of the defective branch with fully-connected layers or replace the branch altogether.
Conclusion
In this work, we analyzed alternatives to the traditional embedding aggregation strategy that has dominated the two-branch architectures in the field of DTI prediction. We presented formal and experimental results which show that all three analyzed strategies can be used to aggregate embeddings. We identified conditions under which a particular type of aggregation strategy might outperform others, and we presented various visualizations. We believe that this work can be the first step in convincing the DTI prediction community to also focus on the embedding aggregation options. Even though this may not seem vital when aggregating only two embeddings, it can become an important choice in multimodal architectures or when more than two embeddings have to be aggregated (e.g. drug–drug–protein interaction prediction).
With regard to future work, we intend to test attention-based embedding aggregation methods, which have become popular in other application domains. Furthermore, increasing the number of embeddings that are aggregated is also an interesting avenue we intend to explore. This approach is used when different representations are available for the same entity. For example, different feature representations for a given compound (Morgan fingerprint, 2D image, molecular graph) could be combined in different ways. Choosing the order and strategy used at every stage of aggregating embeddings is therefore a complex task. Finally, the evaluation of the added value that pretrained embeddings from different strategies can bring to transfer learning tasks is an interesting topic that we intend to investigate in future work.
Availability of data and materials
The KIBA and DAVIS datasets are publicly available at https://github.com/futianfan/DeepPurpose_Data. The scripts used to perform the experiments are also available online at https://github.com/diliadis/DeepPurpose as a forked and modified repository of the original DeepPurpose project.
References
Sinha S, Vohora D. Drug discovery and development: an overview. Pharm Med Transl Clin Res. 2018;19–32.
Pujadas G, Vaque M, Ardevol A, Blade C, Salvado MJ, Blay M, Fernandez-Larrea J, Arola L. Protein-ligand docking: a review of recent advances and future perspectives. Curr Pharm Anal. 2008;4(1):1–19. https://doi.org/10.2174/157341208783497597.
Zanni R, Gálvez-Llompart M, Gálvez J, García-Domenech R. QSAR multi-target in drug discovery: a review. Curr Comput Aided Drug Des. 2014;10(2):129–36. https://doi.org/10.2174/157340991002140708105124.
Lee I, Keum J, Nam H. DeepConv-DTI: prediction of drug–target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol. 2019;15(6):1007129.
Öztürk H, Özgür A, Ozkirimli E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics. 2018;34:821–9.
Shin B, Park S, Kang K, Ho JC. Self-attention based molecule representation for predicting drug–target interaction. Proc Mach Learn Res (PMLR). 2019;106:1–18.
Rifaioglu AS, Atalay RC, Kahraman DC, Doǧan T, Martin M, Atalay V. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics. 2021;37(5):693–704.
Chen W, Chen G, Zhao L, Yu-Chian Chen C. Predicting drug–target interactions with deep-embedding learning of graphs and sequences. J Phys Chem. 2021;125:5642.
Torng W, Altman RB. Graph convolutional neural networks for predicting drug–target interactions. J Chem Inf Model. 2019.
Karki N, Verma N, Trozzi F, Tao P, Kraka E, Zoltowski B. SSnet: a deep learning approach for protein–ligand interaction prediction. Int J Mol Sci. 2021;22(3):1392.
Tsubaki M, Tomii K, Sese J. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics. 2019;35(2):309–18.
Kang H, Goo S, Lee H, Chae JW, Yun HY, Jung S. Fine-tuning of BERT model to accurately predict drug–target interactions. Pharmaceutics. 2022;14(8):1710.
Nguyen T, Le H, Quinn TP, Nguyen T, Le TD, Venkatesh S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37(8):1140–7.
Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z. Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 2020;10(35):20701–12.
Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics. 2020;36(22–23):5545–7.
Bennett J, Lanning S. The Netflix Prize. In: Proceedings of KDD cup and workshop 2007. https://www.semanticscholar.org/paper/The-Netflix-Prize-Bennett-Lanning/31af4b8793e93fd35e89569ccd663ae8777f0072. Accessed 16 Feb 2023.
Pahikkala T, Airola A, Pietilä S, Shakyawar S, Szwajda A, Tang J, Aittokallio T. Toward more realistic drug–target interaction predictions. Brief Bioinform. 2015;16(2):325–37.
He X, Liao L, Zhang H, Nie L, Hu X, Chua TS. Neural collaborative filtering. In: Proceedings of the 26th international conference on world wide web (WWW), 2017. pp. 173–182.
Wu Y, DuBois C, Zheng AX, Ester M. Collaborative denoising auto-encoders for top-N recommender systems. In: Proceedings of the ninth ACM international conference on web search and data mining (WSDM), 2016. pp. 153–162.
Chen W, Cai F, Chen H, Rijke MD, Chen H. Joint neural collaborative filtering for recommender systems. ACM Trans Inf Syst (TOIS). 2019;37(4):39.
Karolina Dziugaite G, Roy DM. Neural network matrix factorization. arXiv preprint arXiv:1511.06443 2015.
Liu Y, Wang S, Khan MS, He J. A novel deep hybrid recommender system based on autoencoder with neural collaborative filtering. Big Data Min Anal. 2018;1(3):211–21.
Nguyen DM, Tsiligianni E, Deligiannis N. Extendable neural matrix completion. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2018. pp. 6328–6332.
Wang T, Brovman YM, Madhvanath S. Personalized embedding-based e-commerce recommendations at eBay. arXiv preprint arXiv:2102.06156 2021.
Yang J, Yi X, Zhiyuan Cheng D, Hong L, Li Y, Xiaoming Wang S, Xu T, Chi EH. Mixed negative sampling for learning two-tower neural networks in recommendations. In: Companion proceedings of the web conference 2020 (TheWebConf), 2020. pp. 441–447.
Yi X, Yang J, Hong L, Cheng DZ, Heldt L, Kumthekar A, Zhao Z, Wei L, Chi E. Sampling-bias-corrected neural modeling for large corpus item recommendations. In: Proceedings of the 13th ACM conference on recommender systems (RecSys), 2019. pp. 269–277.
Rendle S, Krichene W, Zhang L, Anderson J. Neural collaborative filtering vs. matrix factorization revisited. In: Proceedings of the 14th ACM conference on recommender systems (RecSys), 2020. pp. 240–248.
Xu D, Ruan C, Korpeoglu E, Kumar S, Achan K. Rethinking neural vs. matrix-factorization collaborative filtering: the theoretical perspectives. In: International conference on machine learning (ICML). PMLR; 2021. pp. 11514–11524.
Gönen M. Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics. 2012;28(18):2304–10.
Stock M, Pahikkala T, Airola A, De Baets B, Waegeman W. A comparative study of pairwise learning methods based on kernel ridge regression. Neural Comput. 2018;30(8):2245–83.
Vert JP, Qiu J, Noble WS. A new pairwise kernel for biological network inference with support vector machines. In: BMC bioinformatics, vol. 8. Springer; 2007. pp. 1–10.
Davis MI, Hunt JP, Herrgard S, Ciceri P, Wodicka LM, Pallares G, Hocker M, Treiber DK, Zarrinkar PP. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol. 2011;29(11):1046–51.
Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Aittokallio T. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model. 2014;54(3):735–43.
Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–54.
Reczko M, Bohr H. The def data base of sequence based protein fold class predictions. Nucleic Acids Res (NAR). 1994;22(17):3616.
Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: International conference on machine learning (ICML), Proceedings of machine learning research (PMLR). 2017. pp. 1263–1272.
Biewald L. Experiment tracking with weights and biases. Software available from wandb.com 2020. https://www.wandb.com/.
Allenby PD, Labuschagne CCA. On the uniform density of \(C(X) \otimes C(Y)\) in \(C(X \times Y)\). Indag Math. 2009;20(1):19–22. https://doi.org/10.1016/S0019-3577(09)00015-9.
Cybenko G. Approximation by superpositions of a sigmoidal function. Math Control Signals Syst. 1989;2(4):303–14.
Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993;6(6):861–7. https://doi.org/10.1016/S0893-6080(05)80131-5.
Pinkus A. Approximation theory of the MLP model in neural networks. Acta Numer. 1999;8:143–95.
Brüel Gabrielsson R. Universal function approximation on graphs. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in neural information processing systems (NIPS), vol. 33. Curran Associates, Inc; 2020. pp. 19762–19772. https://proceedings.neurips.cc/paper_files/paper/2020/file/e4acb4c86de9d2d9a41364f93951028d-Paper.pdf.
Zhou DX. Universality of deep convolutional neural networks. Appl Comput Harmon Anal (ACHA). 2020;48(2):787–94.
Vapnik VN, Chervonenkis AY. On the uniform convergence of relative frequencies of events to their probabilities. In: Measures of complexity: festschrift for Alexey Chervonenkis. Springer, Cham; 2015. pp. 11–30.
Waegeman W, Dembczyński K, Hüllermeier E. Multi-target prediction: a unifying view on problems and methods. Data Min Knowl Discov (KDD). 2019;33(2):293–324.
Chen X, Zhang Y, Ai Q, Xu H, Yan J, Qin Z. Personalized key frame recommendation. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, 2017. pp. 315–324.
Wang X, He X, Nie L, Chua TS. Item silk road: recommending items from information domains to social users. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, 2017. pp. 185–194.
Zhang S, Yao L, Sun A, Tay Y. Deep learning based recommender system: a survey and new perspectives. ACM Comput Surv (CSUR). 2019;52(1):1–38.
Cobanoglu MC, Liu C, Hu F, Oltvai ZN, Bahar I. Predicting drug–target interactions using probabilistic matrix factorization. J Chem Inf Model. 2013;53(12):3399–409.
Ezzat A, Zhao P, Wu M, Li XL, Kwoh CK. Drug–target interaction prediction with graph regularized matrix factorization. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2016;14(3):646–56.
Mazzone E, Moreau Y, Fariselli P, Raimondi D. Nonlinear data fusion over entity–relation graphs for drug–target interaction prediction. Bioinformatics. 2023;348.
Raimondi D, Simm J, Arany A, Moreau Y. A novel method for data fusion over entity–relation graphs and its application to protein–protein interaction prediction. Bioinformatics. 2021;37(16):2275–81.
Jamali AA, Kusalik A, Wu F. NMTF-DTI: a nonnegative matrix tri-factorization approach with multiple kernel fusion for drug–target interaction prediction. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2021.
Van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43.
Nascimento AC, Prudêncio RB, Costa IG. A multiple kernel learning algorithm for drug–target interaction prediction. BMC Bioinform. 2016;17:1–16.
Zheng X, Ding H, Mamitsuka H, Zhu S. Collaborative matrix factorization with multiple similarities for predicting drug–target interactions. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD), 2013. pp. 1025–1033.
Boezer M, Tavakol M, Sajadi Z. FastDTI: drug–target interaction prediction using multimodality and transformers. In: Proceedings of the northern lights deep learning workshop, vol. 4. 2023.
Ren ZH, You ZH, Zou Q, Yu CQ, Ma YF, Guan YJ, You HR, Wang XF, Pan J. DeepMPF: deep learning framework for predicting drug–target interactions based on multi-modal representation with meta-path semantic analysis. J Transl Med. 2023;21(1):1–18.
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, Zeng X. Modality-DTA: multimodality fusion strategy for drug–target affinity prediction. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2022;20(2):1200–10.
Zhou D, Xu Z, Li W, Xie X, Peng S. MultiDTI: drug–target interaction prediction based on multimodal representation learning to bridge the gap between new chemical entities and known heterogeneous network. Bioinformatics. 2021;37(23):4485–92.
McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 2020. https://doi.org/10.48550/arXiv.1802.03426.
Rusch TK, Bronstein MM, Mishra S. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993 2023.
Oono K, Suzuki T. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947 2019.
Chen D, Lin Y, Li W, Li P, Zhou J, Sun X. Measuring and relieving the oversmoothing problem for graph neural networks from the topological view. In: Proceedings of the AAAI conference on artificial intelligence, vol. 34. 2020. pp. 3438–3445.
Zhao BW, Su XR, Hu PW, Huang YA, You ZH, Hu L. IGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics. 2023;39(8):451.
Funding
This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen” programme.
Author information
Contributions
DI performed the research. DI, WW, TP and BB contributed to the manuscript, read and approved the final version.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: (A) Contains theorems that show how all three aggregation strategies we consider are universal approximators. (B) Contains detailed information about the hyperparameter ranges of every experiment included in this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Iliadis, D., De Baets, B., Pahikkala, T. et al. A comparison of embedding aggregation strategies in drug–target interaction prediction. BMC Bioinformatics 25, 59 (2024). https://doi.org/10.1186/s12859-024-05684-y
DOI: https://doi.org/10.1186/s12859-024-05684-y