Ensemble learning
Ensemble learning builds a set of diversified models and combines them. Theoretically and empirically, numerous studies have demonstrated that ensemble learning usually yields higher accuracy than individual models [11, 12, 30–32]; a collection of weak models (inducers) can be combined to produce a single strong ensemble model.
Framework
Ensemble learning can be divided into independent and dependent frameworks for building ensembles [33]. In the independent framework, also called the randomization-based approach, individual inducers can be trained independently in parallel. On the other hand, in the dependent framework (also called the boosting-based approach), base inducers are affected sequentially by previous inducers. In terms of individual learning, we used both independent and dependent frameworks, e.g., RF and gradient boosting, respectively. In terms of combined learning, we treated the individual inducers independently.
Diversity
Diversity is well known as a crucial condition for ensemble learning [34, 35]. Diversity leads to uncorrelated inducers, which in turn improves the final prediction performance [36]. In this paper, we focus on the following three types of diversity.

Dataset diversity
The original dataset can be diversified by sampling. Random sampling with replacement (bootstrapping) from an original dataset can generate multiple datasets with different levels of variation. If the original and bootstrap datasets are the same size (n), the bootstrap datasets are expected to contain \(\left(1-\frac{1}{e}\right)\) (≈63.2% for large n) of the unique samples in the original data, with the remainder being duplicates. Dataset variation results in different predictions, even with the same algorithm, which produces homogeneous base inducers. Bagging (bootstrap aggregating) belongs to this category and is known to improve unstable models or those with relatively large variance-error factors [37].
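As a quick empirical check of the 1 − 1/e figure, the following Python sketch draws one bootstrap sample and measures the fraction of original items that appear at least once (the dataset and seed are arbitrary):

```python
import random

random.seed(0)
n = 100_000
data = list(range(n))

# Bootstrap: draw n samples with replacement from the original dataset.
bootstrap = [random.choice(data) for _ in range(n)]

# Fraction of original items that appear at least once;
# for large n this concentrates around 1 - 1/e ~ 0.632.
unique_frac = len(set(bootstrap)) / n
```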

Learning method diversity
Diverse learning algorithms that produce heterogeneous inducers yield different predictions for the same problem. Combining the predictions from heterogeneous inducers leads to improved performance that is difficult to achieve with a single inducer. Ensemble combining of diverse methods is prevalently used as a final technique in competitions, as presented in [10]. We combined popular learning methods, including random forest (RF) [8, 38], support vector machine (SVM) [39], gradient boosting machine (GBM) [40], and neural network (NN).

Input representation diversity
Drugs (chemical compounds) can be expressed with diverse representations. The diversified input representations produce different types of input features and lead to different predictions. [21] demonstrated improved performance by applying ensemble learning to a diverse set of molecular fingerprints. We used diverse representations from PubChem [22], ECFP [23], and MACCS [24] fingerprints and from a simplified molecular input line entry system (SMILES) [25].
Combining a set of models
For the final decision, ensemble learning must combine predictions from multiple inducers. There are two main combination methods: weighting (non-learning) and meta-learning. Weighting methods, such as majority voting and averaging, have been frequently used for their convenience and are useful for homogeneous inducers. Meta-learning methods, such as stacking [41], are learning-based methods (second-level learning) that use the predictions from first-level inducers and are usually employed with heterogeneous inducers. For example, let f_{θ} be an individual QSAR classifier with parameter θ, trained for a single subject (drug-specific task) p(X) with dataset X that outputs y given an input x. The optimal θ can be achieved by
$$ \theta^{*} = \text{argmax}_{\theta}\,\mathbb{E}_{(x,y)\in X}[p_{\theta}(y \mid x)] $$
(1)
Then, the second-level learning will learn to maximize the output y by learning how to utilize the individual QSAR classifiers \(f_{\theta^{*}}\). The “First-level: individual learning” section details the first-level learning, and the “Second-level: combined learning” section details the second-level learning.
Chemical compound representation
Chemical compounds can be expressed with various types of chemical descriptors that represent their structural information. One representative type of chemical compound descriptor is a molecular fingerprint. Molecular fingerprints are encoded representations of a molecular structure as a bit string; these have been studied and used in drug discovery for a long time. Depending on the transformation to a bit string, there are several types of molecular fingerprints: structure key-based, topological or path-based, circular, and hybrid [42]. Structure key-based fingerprints, such as PubChem [22] and MACCS [24], encode molecular structures based on the presence of substructures or features. Circular fingerprints, such as ECFP [23], encode molecular structures based on hashing fragments up to a specific radius.
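To make the structure-key idea concrete, here is a deliberately simplified Python sketch: each bit records the presence of one predefined feature. The substring "keys" below are illustrative placeholders, not the actual PubChem or MACCS key definitions, and substring matching on SMILES is only a crude stand-in for real substructure search:

```python
# Toy structure-key-style fingerprint: one bit per predefined feature.
# These keys are hypothetical; real PubChem/MACCS keys use proper
# substructure matching, not string containment.
KEYS = ["C(=O)O",    # carboxylic acid fragment
        "N",         # nitrogen atom
        "c1ccccc1",  # aromatic benzene ring
        "O",         # oxygen atom
        "S"]         # sulfur atom

def structure_key_fp(smiles):
    return [1 if key in smiles else 0 for key in KEYS]

fp = structure_key_fp("CC(C)CC(C(=O)O)N")  # leucine, from the text
```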
Another chemical compound representation is the simplified molecular-input line-entry system (SMILES) [25], which is a string-type notation expressing a chemical compound structure with characters, e.g., C, O, or N for atoms, = for bonds, ( and ) for branches, and digits for ring closures. SMILES is generated by writing the symbols of the nodes encountered during a depth-first traversal of the 2D structure, in terms of a graph-based computational procedure. The generated SMILES can be reconverted into a 2D or 3D representation of the chemical compound.
Examples of SMILES and molecular fingerprints of leucine, which is an essential amino acid for hemoglobin formation, are as follows:

SMILES string: CC(C)CC(C(=O)O)N

PubChem fingerprint: 1,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,⋯

ECFP fingerprint: 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯

MACCS fingerprint: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
(Most values in this molecular fingerprint are zero).
Figure 3 shows the two-level learning procedure. First-level learning is individual learning from diversified learning algorithms and chemical compound representations. The prediction probabilities produced by the first-level models are used as inputs for second-level learning. Second-level learning makes the final decision by learning the importance of the individual models from the first-level predictions.
Notation
The notation used in our paper is as follows:

x: preprocessed chemical compound representation input, where x can be a particular type of molecular fingerprint or a SMILES string.

h: hidden representation

\(\mathcal {L}\): firstlevel individual learning algorithm (\(\mathcal {L}_{i}\): ith algorithm, i={1,⋯,n})

\(\boldsymbol {\mathcal {L}}\): secondlevel learning algorithm

P: predicted probability from the individual model (P_{i}: predicted probability from the \(\mathcal {L}_{i}\))

\(\hat {y}\): final predicted decision from the secondlevel learning

σ: activation function (σ_{s}: sigmoid, σ_{r}: rectified linear unit (ReLU), and σ_{t}: hyperbolic tangent)

n: total number of individual algorithms
First-level: individual learning
With combinations of learning algorithms and chemical compound input representations, we generated thirteen kinds of individual learning models: nine models from conventional machine learning methods, three models from a plain feedforward neural network, and one model from the newly proposed 1D-CNN- and RNN-based neural network model.
Conventional machine learning methods
Among the conventional machine learning methods, we used SVM, RF, and GBM with three types of molecular fingerprints, resulting in nine combination models consisting of all unique pairs of learning algorithms (SVM, RF, and GBM) and fingerprints (PubChem, ECFP, and MACCS). We set the penalty parameter to 0.05 for the linear SVM, and the number of estimators was set to 100 for RF and GBM based on a grid search and experimental efficiency. The prediction probabilities from these learning methods are used as inputs for second-level learning. However, SVM outputs a signed distance to the hyperplane rather than a probability, so we applied a probability calibration method to convert the SVM results into probabilistic outputs.
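The text does not name the calibration method used; a common choice for SVMs is Platt scaling, which fits a sigmoid mapping signed distances to probabilities. A minimal pure-Python sketch (the scores, labels, and hyperparameters are illustrative):

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit P(y=1 | s) = 1 / (1 + exp(A*s + B)) to SVM decision
    scores by gradient descent on binary cross-entropy."""
    A, B = -1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (y - p) * s   # dL/dA, summed over samples
            gB += (y - p)       # dL/dB
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def calibrate(s, A, B):
    """Convert a signed distance s into a probability."""
    return 1.0 / (1.0 + math.exp(A * s + B))

# Toy signed distances to the hyperplane, with binary activity labels.
scores = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 0, 0]
A, B = fit_platt(scores, labels)
```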
Plain feedforward neural network
We used a plain feedforward neural network (NN) for the vector-type fingerprints: PubChem-NN, ECFP-NN, and MACCS-NN. The network structure consists of three fully connected layers (Fcl) with 512, 64, and 1 units, using the ReLU, tanh, and sigmoid activation functions, respectively,
$$ P= \sigma_{s}(\mathbf{Fcl}(\sigma_{t}(\mathbf{Fcl}(\sigma_{r}(\mathbf{Fcl}(\mathbf{x})))))). $$
(2)
The sigmoid activation function outputs a probability for binary classification. We used the Adam optimizer [43] with binary cross-entropy loss (learning rate: 0.001, epochs: 30, and minibatch size: 256).
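A minimal NumPy sketch of the forward pass in Eq. (2). The weights here are random placeholders (in practice they are learned with Adam and cross-entropy loss), and the 881-bit input length is an assumption for the PubChem fingerprint:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the text: fingerprint -> 512 -> 64 -> 1.
d_in = 881  # assumed PubChem fingerprint length
W1, b1 = rng.normal(0, 0.05, (d_in, 512)), np.zeros(512)
W2, b2 = rng.normal(0, 0.05, (512, 64)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (64, 1)),    np.zeros(1)

def forward(x):
    """Eq. (2): P = sigmoid(Fcl(tanh(Fcl(relu(Fcl(x))))))."""
    h1 = relu(x @ W1 + b1)       # 512 units, ReLU
    h2 = np.tanh(h1 @ W2 + b2)   # 64 units, tanh
    return sigmoid(h2 @ W3 + b3) # 1 unit, sigmoid -> probability

x = rng.integers(0, 2, d_in).astype(float)  # toy binary fingerprint
P = float(forward(x))  # probability in (0, 1)
```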
Convolutional and recurrent neural networks
To automatically learn key features through end-to-end neural network learning, we used a SMILES string as input and exploited the neural network structures of 1D CNNs and RNNs. A CNN is used to recognize short-term dependencies, and an RNN is used as the next layer to learn long-term dependencies from the recognized local patterns.
As illustrated in the preprocessing step of Fig. 4, the input SMILES strings were preprocessed with one-hot encoding [44–46], which sets only the corresponding symbol to 1 and all others to 0. The input is truncated or padded to a maximum length of 100. We consider only the nine most frequent characters in SMILES and treat the remaining symbols as OTHERS; thus, the encoding dimension is reduced to 10.
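The one-hot preprocessing can be sketched in Python as follows. The text does not list the exact nine frequent characters, so the set below is an assumption; only the 100 × 10 truncate/pad/one-hot mechanics follow the description:

```python
# Assumed frequent-symbol set; the paper does not enumerate the nine
# characters it keeps, so this list is illustrative.
FREQUENT = ['C', 'O', 'N', '(', ')', '=', '1', '2', '3']
SYMBOLS = FREQUENT + ['OTHER']  # 10-dimensional encoding
MAXLEN = 100

def one_hot_smiles(smiles):
    rows = []
    for ch in smiles[:MAXLEN]:  # truncate to length 100
        idx = SYMBOLS.index(ch) if ch in FREQUENT else SYMBOLS.index('OTHER')
        row = [0] * len(SYMBOLS)
        row[idx] = 1            # only the matching symbol is set to 1
        rows.append(row)
    while len(rows) < MAXLEN:   # pad with all-zero rows
        rows.append([0] * len(SYMBOLS))
    return rows                 # 100 x 10 matrix

x = one_hot_smiles("CC(C)CC(C(=O)O)N")  # leucine, from the text
```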
As illustrated in the neural networks step of Fig. 4, the preprocessed input x was fed into the CNN layer without pooling (CNN filter length: 17, number of filters: 384). Then, the outputs from the CNN were fed into the GRU layer (dimension: 9, structure: many-to-many).
$$ \mathbf{h}= \sigma_{t}(\mathbf{GRU}(\sigma_{r}(\mathbf{Conv}(\mathbf{x})))), $$
(3)
where h is the output of the GRU layer, σ_{r} is the ReLU, and σ_{t} is the hyperbolic tangent. The output h was flattened and then fed into a fully connected neural network.
$$ P= \sigma_{s}(\mathbf{Fcl}(\sigma_{r}(\mathbf{Fcl}(\mathbf{h}_{\text{\texttt{flatten}}})))), $$
(4)
where P is the output probability from the sigmoid activation function for binary classification. The output P is subsequently used for secondlevel learning as in the last step in Fig. 4.
We used dropout for each layer (CNN: 0.9, RNN: 0.6, first Fcl: 0.6) and the Adam optimizer (learning rate: 0.001, epochs: 120, minibatch size: 256) with binary cross-entropy loss. Most of these hyperparameters were determined empirically.
Second-level: combined learning
We combined the firstlevel predictions generated from the set of individual models to obtain the final decision.
We have n individual learning algorithms \(\mathcal {L}_{i}\), where i={1,⋯,n}, and the ith model outputs the prediction probability P_{i} for a given x. We can determine the final prediction \(\hat {y}\) with weights w_{i}:
$$ \hat{y}=\sum_{i=1}^{n}w_{i}P_{i}(\mathbf{x}), $$
(5)
where the weights w_{i}=1/n, ∀i, correspond to uniform averaging.
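Eq. (5) in code: a weighted sum of the first-level probabilities, with `weights=None` giving the uniform-averaging special case (the probabilities below are made-up values for illustration):

```python
def combine(probs, weights=None):
    """Eq. (5): y_hat = sum_i w_i * P_i(x).
    weights=None gives uniform averaging (w_i = 1/n)."""
    n = len(probs)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * p for w, p in zip(weights, probs))

# Predicted probabilities from four hypothetical first-level models.
probs = [0.9, 0.8, 0.7, 0.6]
uniform  = combine(probs)                      # simple average
weighted = combine(probs, [0.4, 0.3, 0.2, 0.1])  # importance-weighted
```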
As another technique, we can combine the first-level output predictions through meta-learning. The performance of the individual methods varies depending on the dataset, as shown in the “Performance comparison with individual models” section; there is no universally superior method. The weights learned from the individual models are applied to the corresponding datasets. Thus, we use learning-based combining methods (meta-learning) rather than simple averaging or voting.
$$\begin{array}{*{20}l} \hat{y}&=\boldsymbol{\mathcal{L}}(\mathcal{L}_{1}(\mathbf{x}), \mathcal{L}_{2}(\mathbf{x}),\cdots,\mathcal{L}_{n}(\mathbf{x})) \end{array} $$
(6)
$$\begin{array}{*{20}l} &=\boldsymbol{\mathcal{L}} \left ([P_{1}, P_{2}, \cdots, P_{n}] \right), \end{array} $$
(7)
where \(\boldsymbol {\mathcal {L}}\) is a second-level learning algorithm, and any machine learning method can be applied at this level. All P_{i}, where i={1,2,⋯,n}, are concatenated and used as inputs. The model importance imposes a weight w_{i} on P_{i} and is determined through meta-learning.
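As one concrete instance of the second-level learner \(\boldsymbol{\mathcal{L}}\), a logistic regression over the concatenated probabilities [P_1, ⋯, P_n] directly yields learned importances w_i. A pure-Python sketch, where the training data and hyperparameters are illustrative and the paper's actual second-level model may differ:

```python
import math

def train_meta(P, y, lr=0.5, epochs=2000):
    """Second-level learner: logistic regression over first-level
    probabilities [P_1..P_n]; the learned w_i act as model importances."""
    n = len(P[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for probs, label in zip(P, y):
            z = sum(wi * pi for wi, pi in zip(w, probs)) + b
            pred = 1.0 / (1.0 + math.exp(-z))
            g = pred - label  # gradient of binary cross-entropy w.r.t. z
            w = [wi - lr * g * pi for wi, pi in zip(w, probs)]
            b -= lr * g
    return w, b

def predict(w, b, probs):
    z = sum(wi * pi for wi, pi in zip(w, probs)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy stacking data: rows are [P_1, P_2] from two first-level models,
# y is the true activity label for each compound.
P = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.2]]
y = [1, 1, 0, 0]
w, b = train_meta(P, y)
```

In practice the first-level probabilities would come from held-out (e.g., cross-validation) predictions to avoid leaking training labels into the meta-learner.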