Ensemble learning
Ensemble learning builds a set of diversified models and combines them. Theoretically and empirically, numerous studies have demonstrated that ensemble learning usually yields higher accuracy than individual models [11, 12, 30–32]; a collection of weak models (inducers) can be combined to produce a single strong ensemble model.
Framework
Ensemble learning can be divided into independent and dependent frameworks for building ensembles [33]. In the independent framework, also called the randomization-based approach, individual inducers can be trained independently in parallel. On the other hand, in the dependent framework (also called the boosting-based approach), base inducers are affected sequentially by previous inducers. In terms of individual learning, we used both independent and dependent frameworks, e.g., RF and gradient boosting, respectively. In terms of combined learning, we treated the individual inducers independently.
Diversity
Diversity is well known as a crucial condition for ensemble learning [34, 35]. Diversity leads to uncorrelated inducers, which in turn improves the final prediction performance [36]. In this paper, we focus on the following three types of diversity.

Dataset diversity
The original dataset can be diversified by sampling. Random sampling with replacement (bootstrapping) from an original dataset can generate multiple datasets with different levels of variation. If the original and bootstrap datasets are the same size (n), the bootstrap datasets are expected to contain \(\left(1-\frac{1}{e}\right)\) (≈63.2% for large n) of the unique samples in the original data, with the remainder being duplicates. Dataset variation results in different predictions, even with the same algorithm, which produces homogeneous base inducers. Bagging (bootstrap aggregating) belongs to this category and is known to improve unstable models or those with relatively large variance-error factors [37].
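As a quick empirical check of the 1 − 1/e figure, the following Python sketch draws one bootstrap sample and measures the fraction of original items that appear at least once (the dataset and seed are arbitrary):

```python
import random

random.seed(0)
n = 100_000
data = list(range(n))

# Bootstrap: draw n samples with replacement from the original dataset.
bootstrap = [random.choice(data) for _ in range(n)]

# Fraction of original items that appear at least once;
# for large n this concentrates around 1 - 1/e ~ 0.632.
unique_frac = len(set(bootstrap)) / n
```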

Learning method diversity
Diverse learning algorithms that produce heterogeneous inducers yield different predictions for the same problem. Combining the predictions from heterogeneous inducers leads to improved performance that is difficult to achieve with a single inducer. Ensemble combining of diverse methods is prevalently used as a final technique in competitions, as presented in [10]. We combined popular learning methods, including random forest (RF) [8, 38], support vector machine (SVM) [39], gradient boosting machine (GBM) [40], and neural network (NN).

Input representation diversity
Drugs (chemical compounds) can be expressed with diverse representations. The diversified input representations produce different types of input features and lead to different predictions. [21] demonstrated improved performance by applying ensemble learning to a diverse set of molecular fingerprints. We used diverse representations from PubChem [22], ECFP [23], and MACCS [24] fingerprints and from a simplified molecular input line entry system (SMILES) [25].
Combining a set of models
For the final decision, ensemble learning must combine predictions from multiple inducers. There are two main combination methods: weighting (non-learning) and meta-learning. Weighting methods, such as majority voting and averaging, have been frequently used for their convenience and are useful for homogeneous inducers. Meta-learning methods, such as stacking [41], are learning-based methods (second-level learning) that use the predictions from first-level inducers and are usually employed with heterogeneous inducers. For example, let f_{θ} be an individual QSAR classifier with parameter θ, trained for a single subject (drug-specific task) p(X) with dataset X that outputs y given an input x. The optimal θ can be achieved by
$$ \theta^{*} = \text{argmax}_{\theta}\,\mathbb{E}_{(x,y)\in X}[p_{\theta}(y \mid x)] $$
(1)
Then, the second-level learning will learn to maximize the output y by learning how to utilize the individual QSAR classifiers \(f_{\theta^{*}}\). The “First-level: individual learning” section details the first-level learning, and the “Second-level: combined learning” section details the second-level learning.
Chemical compound representation
Chemical compounds can be expressed with various types of chemical descriptors that represent their structural information. One representative type of chemical compound descriptor is a molecular fingerprint. Molecular fingerprints are encoded representations of a molecular structure as a bit string; these have been studied and used in drug discovery for a long time. Depending on the transformation to a bit string, there are several types of molecular fingerprints: structure key-based, topological or path-based, circular, and hybrid [42]. Structure key-based fingerprints, such as PubChem [22] and MACCS [24], encode molecular structures based on the presence of substructures or features. Circular fingerprints, such as ECFP [23], encode molecular structures based on hashing fragments up to a specific radius.
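To make the structure-key idea concrete, here is a deliberately simplified Python sketch: each bit records the presence of one predefined feature. The substring "keys" below are illustrative placeholders, not the actual PubChem or MACCS key definitions, and substring matching on SMILES is only a crude stand-in for real substructure search:

```python
# Toy structure-key-style fingerprint: one bit per predefined feature.
# These keys are hypothetical; real PubChem/MACCS keys use proper
# substructure matching, not string containment.
KEYS = ["C(=O)O",    # carboxylic acid fragment
        "N",         # nitrogen atom
        "c1ccccc1",  # aromatic benzene ring
        "O",         # oxygen atom
        "S"]         # sulfur atom

def structure_key_fp(smiles):
    return [1 if key in smiles else 0 for key in KEYS]

fp = structure_key_fp("CC(C)CC(C(=O)O)N")  # leucine, from the text
```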
Another chemical compound representation is the simplified molecular-input line-entry system (SMILES) [25], which is a string-type notation expressing a chemical compound structure with characters, e.g., C, O, or N for atoms, = for bonds, ( and ) for branches, and digits for ring closures. SMILES is generated by writing the symbols of the nodes encountered during a depth-first traversal of the 2D structure, in terms of a graph-based computational procedure. The generated SMILES can be reconverted into a 2D or 3D representation of the chemical compound.
Examples of SMILES and molecular fingerprints of leucine, which is an essential amino acid for hemoglobin formation, are as follows:

SMILES string: CC(C)CC(C(=O)O)N

PubChem fingerprint: 1,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,⋯

ECFP fingerprint: 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯

MACCS fingerprint: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
(Most values in this molecular fingerprint are zero).
Figure 3 shows the two-level learning procedure. First-level learning is individual learning from diversified learning algorithms and chemical compound representations. The prediction probabilities produced by the first-level models are used as inputs for second-level learning. Second-level learning makes the final decision by learning the importance of the individual models from the first-level predictions.
Notation
The notation used in our paper is as follows:

x: preprocessed chemical compound representation input, where x can be a particular type of molecular fingerprint or a SMILES string.

h: hidden representation

\(\mathcal {L}\): firstlevel individual learning algorithm (\(\mathcal {L}_{i}\): ith algorithm, i={1,⋯,n})

\(\boldsymbol {\mathcal {L}}\): secondlevel learning algorithm

P: predicted probability from the individual model (P_{i}: predicted probability from the \(\mathcal {L}_{i}\))

\(\hat {y}\): final predicted decision from the secondlevel learning

σ: activation function (σ_{s}: sigmoid, σ_{r}: rectified linear unit (ReLU), and σ_{t}: hyperbolic tangent)

n: total number of individual algorithms
First-level: individual learning
With combinations of learning algorithms and chemical compound input representations, we generated thirteen kinds of individual learning models: nine models from conventional machine learning methods, three models from a plain feedforward neural network, and one model from the newly proposed 1D-CNN- and RNN-based neural network model.
Conventional machine learning methods
Among the conventional machine learning methods, we used SVM, RF, and GBM with three types of molecular fingerprints, resulting in nine combination models consisting of all unique pairs of learning algorithms (SVM, RF, and GBM) and fingerprints (PubChem, ECFP, and MACCS). We set the penalty parameter to 0.05 for the linear SVM, and the number of estimators was set to 100 for RF and GBM based on a grid search and experimental efficiency. The prediction probabilities from these learning methods are used as inputs for second-level learning. However, SVM outputs a signed distance to the hyperplane rather than a probability, so we applied a probability calibration method to convert the SVM results into probabilistic outputs.
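The text does not name the calibration method used; a common choice for SVMs is Platt scaling, which fits a sigmoid mapping signed distances to probabilities. A minimal pure-Python sketch (the scores, labels, and hyperparameters are illustrative):

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit P(y=1 | s) = 1 / (1 + exp(A*s + B)) to SVM decision
    scores by gradient descent on binary cross-entropy."""
    A, B = -1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (y - p) * s   # dL/dA, summed over samples
            gB += (y - p)       # dL/dB
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def calibrate(s, A, B):
    """Convert a signed distance s into a probability."""
    return 1.0 / (1.0 + math.exp(A * s + B))

# Toy signed distances to the hyperplane, with binary activity labels.
scores = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 0, 0]
A, B = fit_platt(scores, labels)
```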
Plain feedforward neural network
We used a plain feedforward neural network (NN) for the vector-type fingerprints: PubChem-NN, ECFP-NN, and MACCS-NN. The network structure consists of three fully connected layers (Fcl) with 512, 64, and 1 units, using the ReLU, tanh, and sigmoid activation functions, respectively,
$$ P= \sigma_{s}(\mathbf{Fcl}(\sigma_{t}(\mathbf{Fcl}(\sigma_{r}(\mathbf{Fcl}(\mathbf{x})))))). $$
(2)
The sigmoid activation function outputs a probability for binary classification. We used the Adam optimizer [43] with binary cross-entropy loss (learning rate: 0.001, epochs: 30, and minibatch size: 256).
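A minimal NumPy sketch of the forward pass in Eq. (2). The weights here are random placeholders (in practice they are learned with Adam and cross-entropy loss), and the 881-bit input length is an assumption for the PubChem fingerprint:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the text: fingerprint -> 512 -> 64 -> 1.
d_in = 881  # assumed PubChem fingerprint length
W1, b1 = rng.normal(0, 0.05, (d_in, 512)), np.zeros(512)
W2, b2 = rng.normal(0, 0.05, (512, 64)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (64, 1)),    np.zeros(1)

def forward(x):
    """Eq. (2): P = sigmoid(Fcl(tanh(Fcl(relu(Fcl(x))))))."""
    h1 = relu(x @ W1 + b1)       # 512 units, ReLU
    h2 = np.tanh(h1 @ W2 + b2)   # 64 units, tanh
    return sigmoid(h2 @ W3 + b3) # 1 unit, sigmoid -> probability

x = rng.integers(0, 2, d_in).astype(float)  # toy binary fingerprint
P = float(forward(x))  # probability in (0, 1)
```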
Convolutional and recurrent neural networks
To automatically learn key features through end-to-end neural network learning, we used a SMILES string as input and exploited the neural network structures of 1D CNNs and RNNs. A CNN is used to recognize short-term dependencies, and an RNN is used as the next layer to learn long-term dependencies from the recognized local patterns.
As illustrated in the preprocessing step of Fig. 4, the input SMILES strings were preprocessed with one-hot encoding [44–46], which sets only the corresponding symbol to 1 and all others to 0. The input is truncated or padded to a maximum length of 100. We consider only the nine most frequent characters in SMILES and treat the remaining symbols as OTHERS; thus, the encoding dimension is reduced to 10.
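The one-hot preprocessing can be sketched in Python as follows. The text does not list the exact nine frequent characters, so the set below is an assumption; only the 100 × 10 truncate/pad/one-hot mechanics follow the description:

```python
# Assumed frequent-symbol set; the paper does not enumerate the nine
# characters it keeps, so this list is illustrative.
FREQUENT = ['C', 'O', 'N', '(', ')', '=', '1', '2', '3']
SYMBOLS = FREQUENT + ['OTHER']  # 10-dimensional encoding
MAXLEN = 100

def one_hot_smiles(smiles):
    rows = []
    for ch in smiles[:MAXLEN]:  # truncate to length 100
        idx = SYMBOLS.index(ch) if ch in FREQUENT else SYMBOLS.index('OTHER')
        row = [0] * len(SYMBOLS)
        row[idx] = 1            # only the matching symbol is set to 1
        rows.append(row)
    while len(rows) < MAXLEN:   # pad with all-zero rows
        rows.append([0] * len(SYMBOLS))
    return rows                 # 100 x 10 matrix

x = one_hot_smiles("CC(C)CC(C(=O)O)N")  # leucine, from the text
```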
As illustrated in the neural networks step of Fig. 4, the preprocessed input x was fed into the CNN layer without pooling (CNN filter length: 17, number of filters: 384). Then, the outputs from the CNN were fed into the GRU layer (dimension: 9, structure: many-to-many).
$$ \mathbf{h}= \sigma_{t}(\mathbf{GRU}(\sigma_{r}(\mathbf{Conv}(\mathbf{x})))), $$
(3)
where h is the output of the GRU layer, σ_{r} is the ReLU, and σ_{t} is the hyperbolic tangent. The output h was flattened and then fed into a fully connected neural network.
$$ P= \sigma_{s}(\mathbf{Fcl}(\sigma_{r}(\mathbf{Fcl}(\mathbf{h}_{\text{\texttt{flatten}}})))), $$
(4)
where P is the output probability from the sigmoid activation function for binary classification. The output P is subsequently used for secondlevel learning as in the last step in Fig. 4.
We used dropout for each layer (CNN: 0.9, RNN: 0.6, first Fcl: 0.6) and the Adam optimizer (learning rate: 0.001, epochs: 120, minibatch size: 256) with binary cross-entropy loss. Most of these hyperparameters were determined empirically.
Second-level: combined learning
We combined the firstlevel predictions generated from the set of individual models to obtain the final decision.
We have n individual learning algorithms \(\mathcal {L}_{i}\), where i={1,⋯,n}, and the ith model outputs the prediction probability P_{i} for a given x. We can determine the final prediction \(\hat {y}\) with weights w_{i}:
$$ \hat{y}=\sum_{i=1}^{n}w_{i}P_{i}(\mathbf{x}), $$
(5)
where the weights w_{i}=1/n, ∀i, correspond to uniform averaging.
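Eq. (5) in code: a weighted sum of the first-level probabilities, with `weights=None` giving the uniform-averaging special case (the probabilities below are made-up values for illustration):

```python
def combine(probs, weights=None):
    """Eq. (5): y_hat = sum_i w_i * P_i(x).
    weights=None gives uniform averaging (w_i = 1/n)."""
    n = len(probs)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * p for w, p in zip(weights, probs))

# Predicted probabilities from four hypothetical first-level models.
probs = [0.9, 0.8, 0.7, 0.6]
uniform  = combine(probs)                      # simple average
weighted = combine(probs, [0.4, 0.3, 0.2, 0.1])  # importance-weighted
```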
As another technique, we can combine the first-level output predictions through meta-learning. The performance of the individual methods varies depending on the dataset, as shown in the “Performance comparison with individual models” section; there is no universally superior method. The weights learned from the individual models are applied to the corresponding datasets. Thus, we use learning-based combining methods (meta-learning) rather than simple averaging or voting.
$$\begin{array}{*{20}l} \hat{y}&=\boldsymbol{\mathcal{L}}(\mathcal{L}_{1}(\mathbf{x}), \mathcal{L}_{2}(\mathbf{x}),\cdots,\mathcal{L}_{n}(\mathbf{x})) \end{array} $$
(6)
$$\begin{array}{*{20}l} &=\boldsymbol{\mathcal{L}} \left ([P_{1}, P_{2}, \cdots, P_{n}] \right), \end{array} $$
(7)
where \(\boldsymbol {\mathcal {L}}\) is a second-level learning algorithm, and any machine learning method can be applied at this level. All P_{i}, where i={1,2,⋯,n}, are concatenated and used as inputs. The model importance imposes a weight w_{i} on P_{i} and is determined through meta-learning.
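As one concrete instance of the second-level learner \(\boldsymbol{\mathcal{L}}\), a logistic regression over the concatenated probabilities [P_1, ⋯, P_n] directly yields learned importances w_i. A pure-Python sketch, where the training data and hyperparameters are illustrative and the paper's actual second-level model may differ:

```python
import math

def train_meta(P, y, lr=0.5, epochs=2000):
    """Second-level learner: logistic regression over first-level
    probabilities [P_1..P_n]; the learned w_i act as model importances."""
    n = len(P[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for probs, label in zip(P, y):
            z = sum(wi * pi for wi, pi in zip(w, probs)) + b
            pred = 1.0 / (1.0 + math.exp(-z))
            g = pred - label  # gradient of binary cross-entropy w.r.t. z
            w = [wi - lr * g * pi for wi, pi in zip(w, probs)]
            b -= lr * g
    return w, b

def predict(w, b, probs):
    z = sum(wi * pi for wi, pi in zip(w, probs)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy stacking data: rows are [P_1, P_2] from two first-level models,
# y is the true activity label for each compound.
P = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.2]]
y = [1, 1, 0, 0]
w, b = train_meta(P, y)
```

In practice the first-level probabilities would come from held-out (e.g., cross-validation) predictions to avoid leaking training labels into the meta-learner.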