Problem description
As shown in Fig. 1a, conventional DDI prediction focuses on SMDs: it involves only one type of drug node and drug-protein associations, and drug features consist only of structural forms such as SMILES. In comparison, Fig. 1b shows that adding BioDs yields three types of nodes and five types of associations, which makes SBI prediction more complex. Furthermore, BioDs are composed of amino acid sequences, which differ from the structural representation of SMDs. A second problem is that the database contains no accurately annotated negative samples, so the prediction results depend on the sampling strategy. To address these problems, we use multimodal representation learning to learn complex drug pair features and apply the PU-sampling method to deal with imbalanced data.
Multimodal representation learning
The performance of deep learning methods depends largely on efficient data representation: a model uses a set of techniques to automatically discover, from raw data, the representation needed for feature extraction or classification. This process is called representation learning, and it is one of the fundamental steps in end-to-end deep learning. Many works have integrated deep learning methods into the feature representation design of input data to extract useful feature information more easily [18,19,20,21,22,23,24].
The workflow of MultiSBI is depicted in Fig. 2. Considering the structural specificity and relational complexity of SMDs and BioDs, our multimodal representation learning comprises two separate pathways. As shown in Fig. 2a, a structure feature representation and a network topology representation are obtained. In addition to traditional methods, we propose two independent three-layer 1D-CNN blocks to learn the drug structure features from the sequence input (Structure/Sequence). After one-hot encoding the four interconnected networks (SMD-protein interaction (SPI), BioD-protein interaction (BPI), SMD-SMD interaction (SSI), and BioD-BioD interaction (BBI)), the similarity is encoded into a heterogeneous network to fully characterize the drugs' relational topology representation.
Structure feature representation
In previous studies, information about the chemical structure of an SMD is derived from the drug's chemical substructures, i.e., molecular fingerprints. Here, we apply the Chemistry Development Kit (CDK) [25], an open-source tool commonly used in DDI prediction, to generate substructures. More specifically, we select the Daylight fingerprint method in the CDK toolkit, which is the most typical representative of topological molecular fingerprints. The raw inputs are the simplified molecular input line entry system (SMILES) strings of all drugs downloaded from DrugBank [26], from which the algorithm extracts 1024-dimensional molecular structure features of SMDs.
The structure of a BioD is similar to that of a protein: both are composed of primary amino acid sequences. Many feature extraction methods are based on amino acid sequences [27, 28]. Specifically, these features usually represent information about the physicochemical properties or positions of the amino acids that appear in the protein sequence. However, BioD sequence data are scarce in the field of drug interaction. This study has only 148 unique BioDs, and traditional methods cannot extract highly discriminative features from such a small amount of data. Therefore, we use ESM [29], a pretrained protein language model, to encode BioDs. ESM is trained with a masked language modeling objective and thus captures information that is not available to other feature extraction methods. Given a BioD, we take the first 1024 residues of its amino acid sequence and encode them with ESM. In this way, each BioD is encoded into a 1280-dimensional vector.
Traditional methods directly apply molecular fingerprints or molecular descriptors of drugs and targets without considering the local connections between atoms and the chemical structure of amino acids [30, 31]. In addition to the Daylight fingerprint and ESM features, we apply two 1D-CNN blocks to the original sequences to extract complementary information about the complex chemistry and the contextual relationships between local structures in the sequence.
In this study, the SMILES strings of SMDs consist of 64 different characters, and the amino acid sequences of BioDs consist of 25 different characters. We represent each character with a corresponding integer (e.g. "[": 1, "H": 2, "@": 3). In addition, both SMILES strings and amino acid sequences vary in length. To represent the two classes of drugs efficiently, we convert each SMILES string and amino acid sequence into embedding vectors of length 1000 and 100, respectively, and input them into the two-channel CNN module.
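The character-to-integer step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the alphabetical ordering of the vocabulary, the use of 0 as a padding value, and truncation of over-long inputs are all assumptions, since the text specifies only that each character maps to an integer and that inputs are brought to a fixed length.

```python
def build_vocab(sequences):
    """Map every distinct character to an integer, starting at 1.

    Assumption: 0 is reserved for padding; the ordering is arbitrary
    (sorted here), so the ids differ from the example in the text.
    """
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(seq, vocab, max_len):
    """Integer-encode a sequence, then truncate or zero-pad to max_len."""
    ids = [vocab[ch] for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Toy SMILES strings, used only to build a small vocabulary.
smiles = ["CCO", "C[C@H]O"]
vocab = build_vocab(smiles)
padded = encode("CCO", vocab, max_len=5)      # shorter than max_len: padded
truncated = encode("C[C@H]O", vocab, max_len=3)  # longer than max_len: cut
print(vocab, padded, truncated)
```

The same two functions would apply to amino acid sequences with a separate vocabulary; only the fixed length differs between the two channels.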
As shown in Fig. 3, the two-channel CNN module in this study contains two independent CNN blocks, one learning representations from SMILES strings and the other from amino acid sequences. Each CNN block uses three consecutive 1D convolutional layers with an increasing number of filters: the second layer has twice as many filters as the first, and the third has three times as many. The last layer is a max pooling layer. The outputs of the two max pooling layers are concatenated and fed into the three-layer DNN classifier.
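To make the shape of one CNN block concrete, the sketch below implements three stacked 1D convolutions with n, 2n, and 3n filters followed by max pooling, in plain Python. Several details are assumptions because the text does not state them: the kernel size, the ReLU activation, pooling over the whole sequence (global max pooling), and the random untrained weights. The input is assumed to be an already embedded sequence with one row per channel.

```python
import random


def conv1d_relu(x, filters):
    """Valid 1D convolution followed by ReLU.

    x:       list of in_channels rows, each of length L
    filters: list of out_channels filters, each [in_channels][kernel]
    returns: out_channels rows of length L - kernel + 1
    """
    kernel = len(filters[0][0])
    out_len = len(x[0]) - kernel + 1
    out = []
    for filt in filters:
        row = []
        for t in range(out_len):
            s = sum(filt[c][k] * x[c][t + k]
                    for c in range(len(x)) for k in range(kernel))
            row.append(max(0.0, s))
        out.append(row)
    return out


def random_filters(rng, n_out, n_in, kernel):
    """Untrained random weights, for shape illustration only."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(kernel)]
             for _ in range(n_in)] for _ in range(n_out)]


def cnn_block(x, base_filters, kernel, rng):
    """Three conv layers with n, 2n, 3n filters, then global max pooling."""
    for multiplier in (1, 2, 3):
        filters = random_filters(rng, base_filters * multiplier, len(x), kernel)
        x = conv1d_relu(x, filters)
    return [max(row) for row in x]  # one value per filter of the last layer


rng = random.Random(0)
# Hypothetical embedded sequence: 4 embedding channels, 20 positions.
embedded = [[rng.random() for _ in range(20)] for _ in range(4)]
features = cnn_block(embedded, base_filters=2, kernel=3, rng=rng)
print(len(features))  # 3 * base_filters values
```

In the two-channel module, one such block would process the SMILES input and another the amino acid input, and the two pooled vectors would be concatenated before the classifier.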
Network topology feature representation
The integration of bioinformatics prior knowledge can effectively improve the accuracy of prediction [8]. Therefore, in addition to applicable drug structure and sequence features, we use four network topology features from the DrugBank database as another modality.
The topology network inputs for MultiSBI are constructed from known prior knowledge: SSI, BBI, SPI, and BPI. The proteins in SPI and BPI fall into four categories: targets, enzymes, carriers, and transporters. MultiSBI first performs one-hot encoding on each network to obtain the distribution of each drug node, which captures its topological relationship to all other nodes in the heterogeneous network. With this one-hot encoding strategy, we generate a 2308-dimensional SSI embedding and a 1910-dimensional SPI embedding for each SMD. Each value (1 or 0) indicates the presence or absence of an interaction between the drug and the corresponding node, for example a protein in the case of SPI. Similarly, we generate a 151-dimensional BBI embedding and a 201-dimensional BPI embedding for each BioD.
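As an illustration of this encoding, the following sketch turns a list of known interactions into one binary vector per drug. The function name and the toy identifiers (D1, P1, and so on) are hypothetical; real inputs would be the DrugBank interaction lists.

```python
def interaction_vectors(pairs, drugs, partners):
    """Build one binary vector per drug from known interaction pairs.

    pairs:    iterable of (drug, partner) tuples with a known interaction
    drugs:    ordered list of drug ids (one output vector each)
    partners: ordered list of partner ids (one vector position each)
    """
    position = {p: j for j, p in enumerate(partners)}
    vectors = {d: [0] * len(partners) for d in drugs}
    for drug, partner in pairs:
        vectors[drug][position[partner]] = 1  # 1 = interaction is recorded
    return vectors


# Toy SPI network: two SMDs, three proteins.
spi_pairs = [("D1", "P1"), ("D1", "P3"), ("D2", "P3")]
spi = interaction_vectors(spi_pairs, ["D1", "D2"], ["P1", "P2", "P3"])
print(spi)
```

The vector length equals the number of partner nodes in the network, which is why the four embeddings in the text have different dimensions.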
A critical problem with direct one-hot encoding is that the resulting topological relationships are not entirely accurate, partly because biological data are noisy, incomplete, and high-dimensional. To speed up prediction and remove as much noise as possible, we compress the features to reduce sparsity. Instead of using the bit vectors directly, we use the Jaccard similarity metric to calculate pairwise drug–drug similarity from them. Jaccard similarity is calculated by Eq. (1):
$$J\left( {A,B} \right) = \frac{\left| {A \cap B} \right|}{\left| A \right| + \left| B \right| - \left| {A \cap B} \right|} \tag{1}$$
Here, A and B are the set forms of the position vectors of the two drugs, and A ∩ B is the intersection of A and B. Using Jaccard similarity, we convert the topological features of SMDs and BioDs to 1941 and 148 dimensions, respectively (determined by the number of drugs of each type). Because the SMD features have 1941 dimensions, we use PCA to reduce them to 512 dimensions.
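A minimal sketch of this similarity computation is given below. It treats each bit vector as the set of positions that hold a 1 and uses the identity that the denominator of the Jaccard formula equals the size of the union of the two sets. Returning 0 for two all-zero vectors is an assumption, as the text does not cover that case, and the function names are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length bit vectors."""
    set_a = {i for i, bit in enumerate(a) if bit}
    set_b = {i for i, bit in enumerate(b) if bit}
    union = set_a | set_b  # |A| + |B| - |A ∩ B| elements
    if not union:
        return 0.0  # assumption: two empty profiles count as dissimilar
    return len(set_a & set_b) / len(union)


def similarity_features(bit_vectors):
    """Replace each drug's bit vector with its similarity to every drug."""
    return [[jaccard(a, b) for b in bit_vectors] for a in bit_vectors]


vectors = [[1, 0, 1], [0, 0, 1], [1, 0, 1]]
sim = similarity_features(vectors)
print(sim)
```

Each row of the result has one entry per drug, which is consistent with the feature dimension being determined by the number of drugs.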
Finally, we obtain the drug pair feature consisting of two types of sequence features and two types of topological features.
PU-sampling
In some applications, such as drug interaction prediction, only positive cases are known and labeled, while the unlabeled data may include both negative cases and positive cases that have not yet been labeled. Previous methods used experimentally verified DDIs as positive samples and randomly generated negative samples to learn predictive models. However, randomly generated negative samples may include unknown true positives. A classifier trained with such samples may achieve high cross-validation accuracy but is likely to perform poorly on an independent real test set. Therefore, screening highly reliable negative samples is essential to improve the effectiveness of computational prediction methods [32].
As shown in Fig. 2b, to address the unbalanced data set problem in DDI prediction, we introduce an undersampling method, PU-sampling, based on positive-unlabeled learning (PU Learning) [33]. The core concept of PU Learning is to convert positive and unlabeled examples into a series of supervised binary classification problems that discriminate the known positive examples from random subsamples of the unlabeled set. Fig. 4 gives more detail; positive samples are marked with red triangles. First, PU-sampling scores all unlabeled examples with many simple decision tree classifiers. It then removes the low-confidence negative drug pairs, shown as light green circles. Finally, during training, high-confidence samples are selected from the remaining unlabeled set in the same number as the positives to compose a 1:1 balanced data set. As will be introduced in the “Experiment” section, there are 148 BioDs and 1,941 SMDs in the data set, generating 287,268 potential SBI drug pairs. However, only 40,959 SBIs are verified as positive in DrugBank; the remaining 246,309 are unlabeled. Here, we denote positive drug pairs as set P, unlabeled drug pairs as set U, and selected high-confidence negative drug pairs as set N. The PU-sampling algorithm is as follows:

1. Randomly select from U as many drug pairs as there are in P, temporarily treat them as negatives in a binary classification task, and use a decision tree model to assign each unlabeled example a score from 0 (negative) to 1 (positive);
2. Repeat step (1) T times and record the scores from the classifiers, so that T decision tree models are trained and every unlabeled drug pair is evaluated many times. The average score is used as the confidence of the negative samples;
3. Finally, sort all the scores and use 1 as the threshold to eliminate positive samples. Samples with a score close to 0 can be regarded as high-confidence negatives, because "true" negative samples should in theory be distinguishable from the labeled positive drug pairs, so their scores should be very close to zero. The samples with the lowest scores are therefore taken as the negative sample set N in the following experiments.
Finally, since there are 40,959 positive samples, the same number of negative samples is retained from the 246,309 unlabeled drug pairs.
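The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: a depth-1 decision tree (a decision stump) written in plain Python stands in for the decision tree model, every unlabeled example is scored in every round, and the names `fit_stump`, `pu_sampling`, `T`, and `seed` are hypothetical. The explicit threshold step is omitted; the sketch simply keeps the lowest-scoring examples.

```python
import random


def fit_stump(X, y):
    """Fit a depth-1 decision tree; return a scorer giving the fraction
    of positives in the leaf that a row falls into (a value in [0, 1])."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue
            # Errors if each leaf predicts its majority class.
            errors = (min(sum(left), len(left) - sum(left))
                      + min(sum(right), len(right) - sum(right)))
            if best is None or errors < best[0]:
                best = (errors, f, thr,
                        sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no valid split: fall back to the base rate
        rate = sum(y) / len(y)
        return lambda row: rate
    _, f, thr, p_left, p_right = best
    return lambda row: p_left if row[f] <= thr else p_right


def pu_sampling(P, U, T=20, seed=0):
    """Return (N, average scores): the |P| lowest-scoring rows of U."""
    rng = random.Random(seed)
    k = min(len(P), len(U))
    totals = [0.0] * len(U)
    for _ in range(T):
        sampled = rng.sample(U, k)          # step 1: temporary negatives
        scorer = fit_stump(P + sampled, [1] * len(P) + [0] * k)
        for i, row in enumerate(U):         # score every unlabeled row
            totals[i] += scorer(row)
    avg = [t / T for t in totals]           # step 2: average over T rounds
    order = sorted(range(len(U)), key=lambda i: avg[i])
    return [U[i] for i in order[:k]], avg   # step 3: lowest scores form N


# Toy one-feature data: positives have large values.
P = [[0.8], [0.9], [1.0]]
U = [[0.0], [0.1], [0.15], [0.2], [0.85], [0.95]]
negatives, scores = pu_sampling(P, U)
print(negatives)
```

In this toy run the two unlabeled rows that resemble the positives are never selected as negatives, which is the behavior the procedure is designed to produce.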
DNN construction
MultiSBI is designed as a multi-classification model that can predict multiple SBI types for a given drug pair (multiple output neurons can be activated simultaneously, and each neuron represents one SBI type). In this work, we adopt a DNN as the multivariate classifier. Since there are four types of features, we construct four DNN submodels, one for each feature type. An average operator combines the outputs of the submodels to produce the final prediction.
Figure 2c shows that each prediction submodel concatenates a pair of SMD and BioD embedding vectors and feeds the result to the fully connected layers to calculate the interaction probability. The output layer has 49 output neurons, representing the 49 classification types considered in this study. These output neurons have activity values between 0 (no interaction) and 1 (possible interaction), which can be considered probabilities [34].
As shown in Fig. 2c, the DNN consists of three layers, with the number of nodes being 512, 256, and 49.
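The average operator that merges the four submodels can be written as an element-wise mean over their output vectors. The sketch below uses three output neurons instead of 49 to keep the example short; the function name and the toy values are illustrative.

```python
def average_predictions(submodel_outputs):
    """Element-wise mean of the per-type scores from each submodel."""
    n = len(submodel_outputs)
    return [sum(scores) / n for scores in zip(*submodel_outputs)]


# Four submodels (one per feature type), three SBI types each.
outputs = [
    [1.0, 0.0, 0.5],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 1.0, 0.5],
]
final = average_predictions(outputs)
print(final)
```

Because each submodel output lies between 0 and 1, the averaged values stay in the same range and can be read in the same way.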