Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery

Abstract

Background:

Drug discovery and development is the extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of the drug discovery and development process, where a central question is how a molecule can be represented informatively such that in-silico solutions are optimized.

Results:

This study introduces a novel hybrid fragment-SMILES tokenization method, coupled with two pre-training strategies, utilizing a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs.

Conclusion:

The findings reveal that while an excess of fragments can impede performance, using hybrid tokenization with high-frequency fragments enhances results beyond the base SMILES tokenization. This advancement underscores the potential of integrating fragment- and character-level molecular features within the training of Transformer models for ADMET property prediction.

Introduction

Drug design has evolved from serendipitous screening of natural compounds to an increasingly rational and data-driven approach, focusing on the molecular structure and mechanisms behind disease-related targets [1]. The application of artificial intelligence (AI), particularly machine learning (ML), has revolutionized the pharmaceutical field, enabling it to take advantage of the vast arrays of biomedical data that have been gathered [2]. AI and ML contribute to various aspects of drug design, including predicting pharmacokinetic and pharmacodynamic properties, identifying binding sites on a given biomolecular target, repurposing drugs, and creating new molecules with desired characteristics, all of which reduce the time and cost associated with developing effective and safe medications [3,4,5]. Furthermore, absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are crucial in evaluating drug post-administration behaviour and in minimizing clinical trial failures [6, 7]. Despite challenges such as data scarcity and complex molecular structures in the area of ADMET prediction, ML techniques have been able to extrapolate structural patterns that implicate molecular properties and circumvent the need for costly assays during a large-scale screening process. As a result, ML plays a significant role in the identification and early exclusion of unsuitable compounds, mitigating financial burdens from unsuccessful ventures in the drug development cycle.

In the field of computational chemistry, molecular structures can be represented through various formats. Line notations, such as the Simplified Molecular Input Line Entry System (SMILES) [8], provide a textual method to describe the structure of chemical entities, encoding molecular information such as C for carbon, = for a double bond, parentheses for branches, and @, /, and \ for stereochemistry. As an example, climbazole is represented as CC(C)(C)C(=O)C(N1C=CN=C1)OC2=CC=C(C=C2)Cl. Despite the widespread use of SMILES, its strict syntactical guidelines often result in the production of numerous invalid molecular structures, leading to the development of other line notations such as DeepSMILES [9], SELF-Referencing Embedded Strings (SELFIES) [10], and Group SELFIES [11], which mitigate some of the mentioned issues. However, they are not as widely supported as SMILES and may necessitate larger alphabets.

Fragmentation is another approach to representing a molecule where a large molecule is broken apart into smaller pieces [12]. The fragmentation process can reveal important structural and functional features of the original molecule that are not easily discernible from an atomic-level representation such as SMILES. For example, fragmentation can generate sub-molecules that contain specific functional groups or motifs that are relevant for physicochemical properties. However, fragmentation is complex due to the variety of methods and criteria involved in bond cleavage and the selection of sub-molecular entities [13]. Additionally, fragmentation presents several challenges, such as producing sub-molecules that are unusually large or small, and the formation of a vast library of fragments that appear with varying frequencies, with a significant number rarely occurring.

To the best of our knowledge, there exists no direct comparison between fragment- and atom-based representations for ADMET prediction using the same model. In this study, we construct a hybrid fragment-SMILES encoding technique to combine the advantages of both representations for use in machine language models. As illustrated in Fig. 1, a large number of fragments occur infrequently. Thus, we construct various models with varying frequency cutoffs to produce a fragment spectrum of models, and perform a fair comparative investigation between SMILES and the fragment spectrum using the hybrid encoding technique for ADMET prediction with a Transformer architecture. Moreover, we also experiment with two pre-training techniques, which we denote as one-phase and two-phase.

Fig. 1 Fragment library proportions by observation frequency. An integer in brackets is a frequency threshold; the corresponding percentage represents the proportion of fragments occurring at or above that threshold

The rest of this study is organized as follows. Section "Related works" discusses related works, giving information on the use of Transformers for ADMET prediction, and graph-based neural networks for ADMET prediction. Section "Methodology" discusses the methods used within this work, describing the model and the encoding approach. Afterwards, Section "Experimentation" illustrates all the necessary information for replicating the experiments performed in this study. Following is Section "Results and discussions" which displays and investigates the results of our experimentation. Lastly, concluding remarks are made in Section "Conclusion".

Related works

Transformer-based ADMET models

Language models are a class of deep learning models that learn the semantic and syntactic patterns of natural language from large corpora of text. They can also be applied to molecular sequences, such as SMILES strings, to capture the structural and functional features of molecules. Before the advent of Transformer models, recurrent neural networks (RNNs) were commonly used for language modelling tasks, as well as for tasks within ADMET prediction [14, 15]. One of the advantages of language models is their ability to leverage pre-trained weights learned in an unsupervised or semi-supervised fashion over a general domain, and then fine-tune for downstream tasks such as ADMET prediction. This process, known as transfer learning, helps improve performance and generalization capabilities by reducing the risk of overfitting and increasing the diversity of molecules available for syntactic and semantic understanding. Transfer learning is particularly useful when data are scarce or noisy, as is frequently the case in ADMET prediction, where gathering data is costly.

Attention mechanisms are not novel, having been applied in RNNs before the birth of Transformers [16, 17]; however, the Transformer architecture emphasizes self-attention to focus on the important sections of a sequence and capture long-range dependencies and relationships among them [18]. As Transformers have shown superior performance over RNNs for natural language processing (NLP) tasks, they too have become a focus of research towards ADMET prediction. Many of the constructed ADMET Transformer models make use of those popularized in the NLP literature, such as but not limited to BERT, RoBERTa, and GPT-2 [19,20,21,22,23,24,25,26,27,28,29]. Others combine graph representations with the Transformer to obtain graph-level contextual understanding [30,31,32,33,34,35,36]. In addition, some works use a combination of molecular line notation and pre-fabricated descriptors [25], while the remaining use the Transformer with various training strategies and architectural changes [37,38,39,40]. Despite the application of various modelling strategies, including pre-training techniques and attention mechanisms for ADMET prediction, a common thread in prior research is the use of SMILES or molecular graph representations. This differs from our investigation, which utilizes a hybridized fragment and SMILES encoding with the MTL-BERT model [41].

Although Transformer models have been proposed for ADMET prediction, they face several challenges and limitations, such as data availability, data quality, model interpretability, and robustness. Both data availability and quality are crucial for training accurate and reliable ADMET prediction models; however, many ADMET datasets are class imbalanced and can at times be inconsistent. In addition, datasets are often imbalanced in terms of sample quantity when combined for training purposes, and improper sampling strategies are likely to cause catastrophic forgetting among tasks when considering a multi-task approach [42]. Model interpretability and robustness are important for understanding the rationale behind predictions and ensuring their applicability in varying scenarios. Equally important, Transformer models must also be computationally efficient and scalable to handle large-scale and complex molecular datasets. This becomes especially relevant with the rise of foundation models in the NLP literature, which have in turn spilled over into the cheminformatics domain [37].

Graph-based ADMET models

Graph-based neural network (GNN) models are a popular and effective way of leveraging information gained from using graph-structured data such as molecules [43]. A graph is a collection of nodes and edges, where nodes are the building blocks, such as atoms, and edges are connections between entities, like bonds. GNNs learn meaningful representations of molecular structures and properties by aggregating information from local neighbourhoods through different operations such as message passing or convolution. Attention may also be included in GNNs to determine the important neighbouring node features to aggregate. Afterwards, node features constructed by the model can then be combined to obtain a feature vector for the whole graph, which is in turn used for downstream tasks such as ADMET prediction. GNNs have been widely used in ADMET prediction as they can capture the structural and chemical properties of molecules similar to SMILES, and similarly to Transformer models, have the ability to leverage structural information of molecules through transfer learning [44, 45].

GNNs applied to ADMET prediction can be broadly categorized into four groups: graph convolutional neural networks (GCNs) [43], graph attention networks (GATs) [46], message passing neural networks (MPNNs) [47], and graph isomorphism networks (GINs) [48]. GCNs apply convolution operations on the graph nodes to learn node embeddings, which are then pooled to obtain a graph-level representation. MT-PotentialNet [49], Weave [50], and other models [51, 52] are notable works that fall into this division. GATs use attention mechanisms to assign varying weights to neighbouring nodes and edges, allowing the model to focus on the most relevant parts of the graph. ADMETLab 2.0 [53], AttentiveFP [54], and GASA [55] are examples of GATs for ADMET prediction. MPNNs use a message passing scheme to propagate information across the graph, where each node sends and receives messages from its neighbours, and then updates its own hidden state accordingly. In this regard there are models like D-MPNN [56], GeomGCL [57], and MGSSL [58]. Lastly, GINs generalize the Weisfeiler-Lehman graph isomorphism test and learn node embeddings by aggregating and transforming features of neighbouring nodes with learnable parameters, which are afterwards pooled for a graph-level representation. This powerful GNN variant is represented by MolGIN [59]. It should also be noted that many GNN models for ADMET prediction use a hybridization of the various categories to improve performance [60,61,62,63,64], including multi-task learning to leverage information from multiple ADMET datasets, some of which may have a small number of samples. As is discussed in [65], many works using graph-based neural network models consider 2D chemical topology but disregard geometrical data that provides useful information when predicting molecular properties.

Methodology

Hybrid fragment-SMILES tokenization

Prior to inputting a molecule into a Transformer model, it must undergo tokenization and be encoded into a numerical representation. We propose a novel tokenization procedure, named hybrid fragment-SMILES tokenization (HFST), that incorporates both fragments and SMILES, where the fragments are generated using the method described in HierVAE [66] and DeepFMPO [12]. Following their technique, bonds connected away from a ring atom are broken, and an attachment point is inserted for later molecule reconstruction. To encode a molecule using the hybrid method, we first fragment the molecule and then loop through its fragments. If a fragment is in the vocabulary, we encode it with a single numerical value. Otherwise, we encode the fragment using the SMILES atomic-level representation. If two or more successive fragments are encoded using SMILES, a separator tag is placed in-between them to designate the ending of one fragment and the start of another.
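To make the encoding procedure concrete, the sketch below walks through a molecule's fragments and falls back to SMILES tokens for out-of-vocabulary fragments, as described above. It is a minimal sketch rather than the implementation used in this work: fragment_molecule is a hypothetical stand-in for the HierVAE/DeepFMPO fragmentation routine, the regex tokenizer is a common but simplified way to split SMILES, and the separator and unknown token ids are assumptions.

import re

# Hypothetical stand-in for the HierVAE/DeepFMPO fragmentation routine: it is
# assumed to return the molecule's fragments (SMILES strings with attachment
# points) in sequence order.
def fragment_molecule(smiles):
    raise NotImplementedError("plug in the fragmentation method of choice")

# A simple regex-based SMILES tokenizer (bracket atoms, two-letter elements,
# ring closures, bonds, branches); real tokenizers may differ in detail.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\\/@+\-().:~*$]|\d")

def smiles_tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

def hfst_encode(fragments, frag_vocab, smiles_vocab, sep_id, unk_id):
    """Hybrid fragment-SMILES encoding: one id per in-vocabulary fragment,
    otherwise fall back to SMILES tokens, inserting a separator id between
    two successive SMILES-encoded fragments."""
    ids, prev_was_smiles = [], False
    for frag in fragments:
        if frag in frag_vocab:
            ids.append(frag_vocab[frag])
            prev_was_smiles = False
        else:
            if prev_was_smiles:
                ids.append(sep_id)  # boundary between two SMILES-encoded fragments
            ids.extend(smiles_vocab.get(t, unk_id) for t in smiles_tokenize(frag))
            prev_was_smiles = True
    return ids

# Usage sketch:
# ids = hfst_encode(fragment_molecule(smiles), frag_vocab, smiles_vocab, sep_id, unk_id)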

The advantages of using the HFST encoding over the standard SMILES encoding are fourfold. (1) It overcomes the issue of low-frequency fragments in training. From Fig. 1, we can see that the frequency of fragments follows a power-law distribution. Low-frequency fragments dominate an over-sized vocabulary and thus lead to poor contextual embeddings. (2) It solves the problem of new fragments at inference time. In predictive tasks, some fragments of new molecules may not exist in the vocabulary of fragments extracted from the training data. Using an unknown token leads to a loss of information. (3) The fragment spectrum perspective unifies fragment-based and SMILES-based tokenizations, allowing us to select a cutoff that takes advantage of both fragment and SMILES representations. Fragment-only tokenization (no cutoff) and SMILES tokenization (all fragments cut off) are the two extreme cases of the HFST representation. (4) It can reduce the sequence length and thereby some of the computational complexity of the Transformer model, which is quadratic in the length of the input sequence. Our HFST method is illustrated in Fig. 2, where a single molecule on the left is first fragmented, as indicated by the colours blue, red, yellow, and green. During encoding, the fragments shaded blue, yellow, and green are found in the vocabulary and so each take the form of a single numerical value. The red fragment, however, is not found in the vocabulary and is therefore encoded as a SMILES string, producing one numerical value per SMILES token.

Fig. 2 Illustration of the hybrid fragment-SMILES encoding process using climbazole (SMILES representation: CC(C)(C)C(=O)C(N1C=CN=C1)OC2=CC=C(C=C2)Cl)

Transformer ADMET model

We adopt the MTL-BERT model originally proposed by Zhang et al. for predicting ADMET properties from SMILES strings [41]. The original MTL-BERT is depicted in Fig. 3. It uses a multi-task learning framework based on BERT, a Transformer-based model that learns bidirectional representations from large-scale unlabelled data. Like BERT, MTL-BERT uses transfer learning, which consists of two parts. In the pre-training phase, a masked language modelling objective is used to learn the contextual information of SMILES sequences from a large corpus of unlabelled molecules. Unlike the BERT model, next sentence prediction is not included as a pre-training objective. Afterwards, the pre-trained MTL-BERT model is fine-tuned on multiple downstream prediction tasks simultaneously with the inclusion of multiple task-specific tokens and prediction layers. The prepending of multiple task-specific tokens to a sequence, one for each predictive task, differs from the original BERT model, which prepends a single token. In the original work by Zhang et al., SMILES enumeration was used as a data augmentation technique to increase diversity; however, this is not included in our work due to the generation of rare fragments not present in our curated library.
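The sketch below captures this architecture at a high level in PyTorch, assuming the medium hyperparameters listed later in Table 2 (hidden size 256, 8 layers, 8 heads, feedforward dimension 1024, dropout 0.1): a shared encoder with a masked-LM head for pre-training and one prediction head per ADMET task reading the hidden state of its prepended task token. It is an illustrative re-implementation rather than the authors' MTL-BERT code; class and argument names are assumptions.

import torch
import torch.nn as nn

class MultiTaskMolBERT(nn.Module):
    """Minimal sketch of an MTL-BERT-style model: a Transformer encoder shared
    between a masked-LM head (pre-training) and per-task heads that read the
    hidden states of prepended task tokens (fine-tuning)."""

    def __init__(self, vocab_size, n_tasks, d_model=256, n_layers=8,
                 n_heads=8, d_ff=1024, dropout=0.1, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)       # pre-training head
        self.task_heads = nn.ModuleList(                     # one head per ADMET task
            [nn.Linear(d_model, 1) for _ in range(n_tasks)])
        self.n_tasks = n_tasks

    def encode(self, ids, pad_mask=None):
        # pad_mask: bool tensor, True at padding positions.
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)
        return self.encoder(h, src_key_padding_mask=pad_mask)

    def forward_mlm(self, ids, pad_mask=None):
        return self.mlm_head(self.encode(ids, pad_mask))      # (batch, length, vocab)

    def forward_tasks(self, ids_with_task_tokens, pad_mask=None):
        # Assumes the first n_tasks positions hold the prepended task tokens.
        h = self.encode(ids_with_task_tokens, pad_mask)
        return torch.cat([head(h[:, i]) for i, head in enumerate(self.task_heads)],
                         dim=-1)                               # (batch, n_tasks)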

MTL-BERT is selected as the model for this study as it leverages large-scale unlabelled data to learn contextual information about SMILES strings during pre-training, which has been shown in the previously mentioned studies to improve performance on downstream tasks. In addition, as MTL-BERT is inherently a multi-task model, it can benefit from sharing information amongst multiple tasks, enhancing the generalization capability of the model. Multi-task learning is particularly useful for ADMET prediction as there are numerous predictive tasks, many of which have a small number of samples. Furthermore, as reported in [41], MTL-BERT outperforms the multi-task graph attention (MGA) framework [53] on the same ADMET tasks. We were unable to set up the MGA framework because the code from https://github.com/wzxxxx/MGA is incomplete and lacks important details. However, we believe that adopting MTL-BERT for the HFST representation in our study is sufficient due to MTL-BERT’s reported superior performance.

Fig. 3 Schematic illustration of the MTL-BERT model architecture that utilizes the Transformer encoder, incorporates masked language modelling for pre-training, and operates with multiple task-specific tokens and heads in fine-tuning

Experimentation

This section outlines our experimental procedures to assess the performance of the proposed HFST method for ADMET prediction. First, we describe the data used for training our models. This is followed by examining the alteration of fragment vocabulary size through various frequency thresholds, resulting in a spectrum of fragments. Afterwards, the hyperparameters utilized during training to ensure replicability are presented. Additionally, two strategies for pre-training the Transformer model on large-scale unlabelled data are introduced. Lastly, the description of the metrics and methodologies used to ensure an equitable evaluation of the models is given.

Data and preprocessing

We train our model with transfer learning, which segregates the training process into two parts: pre-training and fine-tuning. During the pre-training stage, we use a large collection of unlabelled molecules to train our Transformer models. This ensures the model acquires a generalized representation of molecular structures through self-supervised learning, specifically with masked language modelling. The pre-training data consists of molecules from ChEMBL [67], MOSES [68], and ZINC-250K [14] datasets, where canonical duplicates are removed and SMILES strings above 100 tokens discarded. In total, the dataset comprises roughly 4 million molecules, which is divided into a random 80-20 train-test split.
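A minimal sketch of this preprocessing is given below, assuming RDKit is used for canonicalization and the smiles_tokenize helper from the earlier sketch for length filtering; the 100-token limit and the 80-20 split follow the text, while everything else (function names, seeding) is illustrative.

import random
from rdkit import Chem

def canonical(smi):
    # Canonical SMILES via RDKit; None if the string cannot be parsed.
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol is not None else None

def prepare_pretraining_set(raw_smiles, max_tokens=100, test_frac=0.2, seed=0):
    """Canonicalize, drop invalid/duplicate molecules and SMILES longer than
    max_tokens tokens, then make a random train/test split."""
    seen, kept = set(), []
    for smi in raw_smiles:
        can = canonical(smi)
        if can is None or can in seen:
            continue
        if len(smiles_tokenize(can)) > max_tokens:   # tokenizer sketched earlier
            continue
        seen.add(can)
        kept.append(can)
    random.Random(seed).shuffle(kept)
    n_test = int(len(kept) * test_frac)
    return kept[n_test:], kept[:n_test]              # train, test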

In the fine-tuning stage, a variety of smaller datasets are utilized to fine-tune the pre-trained network, enabling it to concurrently predict 29 ADMET properties through multi-task learning. The fine-tuning data consists of molecules with experimentally measured ADMET values from different sources, having a combined size of 108,315 samples. In line with our pre-training data, SMILES sequences that surpass 100 tokens are removed. Table 1 illustrates the various ADMET datasets with their accompanying size, task type, and ADMET category. All ADMET datasets for fine-tuning were obtained from Therapeutics Data Commons (TDC) [69].
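For reference, these datasets can be retrieved programmatically with the PyTDC package; the snippet below follows its documented interface, using Caco2_Wang as an illustrative dataset name and 'data/' as a placeholder cache path.

from tdc.single_pred import ADME
from tdc.benchmark_group import admet_group

# A single ADMET dataset with experimentally measured labels (here Caco2
# permeability, a regression task) and its default split.
data = ADME(name='Caco2_Wang')
split = data.get_split()                       # dict of train/valid/test frames

# The ADMET group benchmark used later in this study is exposed through a
# benchmark group object.
group = admet_group(path='data/')
benchmark = group.get('Caco2_Wang')
train_val, test = benchmark['train_val'], benchmark['test']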

Table 1 Fine-tuning datasets

Fragment spectrum

To apply the hybrid fragment-SMILES encoding of molecules, we constructed two vocabularies: one for SMILES and one for fragments. These vocabularies are derived exclusively from the pre-processed pre-training data, as described in the previous section. We use the RDKit Python package [70] to tokenize SMILES strings and the fragmentation technique from HierVAE [66] and DeepFMPO [12] to generate the fragments. Figure 1 illustrates the proportional distribution of fragment frequencies within the pre-training dataset, highlighting that a majority of fragments are uncommon, with only a select few being prevalent. The figure further segments the cutoff thresholds with vertical lines and shows the proportion of fragments meeting or surpassing these values, emphasizing the rarity of most fragments. In this study, we explore the impact of different fragment frequency thresholds ranging from 2 to 1000, as well as the absence of any threshold, on the efficacy of a Transformer model in predicting ADMET properties using our hybrid tokenization method.
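As a sketch of how the fragment spectrum can be assembled, the snippet below counts fragment frequencies over the pre-processed pre-training set and derives one vocabulary per cutoff. fragment_molecule is the hypothetical fragmentation stand-in from the earlier sketch, and the cutoff values listed are examples within the 2 to 1000 range explored here.

from collections import Counter

def build_fragment_spectrum(train_smiles, cutoffs=(2, 10, 100, 500, 1000)):
    """Count fragment occurrences over the pre-training set and derive one
    fragment vocabulary per frequency cutoff (plus the no-cutoff case)."""
    counts = Counter()
    for smi in train_smiles:
        counts.update(fragment_molecule(smi))        # stand-in from earlier sketch
    vocabs = {None: set(counts)}                     # no threshold: keep all fragments
    for cutoff in cutoffs:
        vocabs[cutoff] = {f for f, c in counts.items() if c >= cutoff}
    return counts, vocabs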

Model hyperparameters

Table 2 MTL-BERT hyperparameters

As mentioned previously, we adopt the MTL-BERT model proposed in [41] as the backbone of our experimental framework. In their work, Zhang et al. categorized hyperparameter values as small, medium, and large, where it was reported that the medium parameter size achieved a good balance between predictive performance and computational efficiency. Therefore, we follow their settings and use the medium hyperparameters for our model, as shown in Table 2. Specifically, our model has a hidden size of 256, 8 encoder layers, 8 attention heads, a dropout rate of 0.1, and a feedforward dimension of 1024. We use the Adam optimizer with a learning rate of \(1\times 10^{-4}\), betas of 0.9 and 0.98, and cross entropy loss to pre-train our model on a large corpus of molecules. Then, we fine-tune our model on the task-specific datasets using the AdamW optimizer with a learning rate of \(0.5\times 10^{-4}\), the same beta values as in pre-training, mean squared error loss for regression tasks, and binary cross entropy loss for classification tasks.
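The optimizer and loss configuration just described can be expressed compactly in PyTorch, as sketched below; variable names are ours, and applying binary cross entropy to raw head outputs via BCEWithLogitsLoss is an implementation assumption.

import torch
import torch.nn as nn

def make_pretrain_optimizer(model):
    # Masked language model pre-training: Adam, lr = 1e-4, betas (0.9, 0.98).
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # ignore unmasked positions
    return opt, loss_fn

def make_finetune_optimizer(model):
    # Multi-task fine-tuning: AdamW, lr = 0.5e-4, same betas.
    opt = torch.optim.AdamW(model.parameters(), lr=0.5e-4, betas=(0.9, 0.98))
    reg_loss = nn.MSELoss()                            # regression tasks
    clf_loss = nn.BCEWithLogitsLoss()                  # classification tasks
    return opt, reg_loss, clf_loss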

For both pre-training and fine-tuning, we set the batch size to 64. To monitor the training progress and avoid overfitting, we conduct a testing epoch every 5000 training batches during pre-training and stop the training process if there is no improvement in the testing loss for two consecutive epochs. For fine-tuning, we perform a testing epoch after every training epoch and terminate if the testing loss increases two epochs in a row. In the pre-training stage, 15% of tokens are chosen at random. Of these, there is an 80% probability that a token is substituted with a mask token, a 10% probability of alteration to a random token, and a 10% probability that it will remain unchanged.
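This is the standard BERT-style masking recipe; the sketch below is a generic implementation of the 15% / 80-10-10 scheme rather than the exact code used in this study. The mask token id, the set of special ids that are never selected, and the use of -100 as an "ignore" label are assumptions.

import torch

def mask_tokens(ids, mask_id, vocab_size, special_ids, p_select=0.15):
    """BERT-style masking: select 15% of (non-special) tokens; of those, 80%
    become the mask token, 10% a random token, and 10% stay unchanged.
    Returns the corrupted input and labels (-100 where no prediction is scored)."""
    labels = ids.clone()
    selectable = ~torch.isin(ids, special_ids)     # special_ids: 1-D tensor of ids
    selected = (torch.rand_like(ids, dtype=torch.float) < p_select) & selectable
    labels[~selected] = -100                       # only score selected positions

    corrupted = ids.clone()
    roll = torch.rand_like(ids, dtype=torch.float)
    corrupted[selected & (roll < 0.8)] = mask_id                      # 80% -> mask token
    random_ids = torch.randint_like(ids, vocab_size)
    swap = selected & (roll >= 0.8) & (roll < 0.9)                    # 10% -> random token
    corrupted[swap] = random_ids[swap]                                # remaining 10% unchanged
    return corrupted, labels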

One-phase and two-phase pre-training

We experiment with two different pre-training strategies for our Transformer model: one-phase and two-phase. In both strategies, we use a large corpus of unlabelled molecular structures as the pre-training data, accompanied by a masked language modelling objective. In one-phase pre-training, the Transformer model is pre-trained using the hybrid fragment-SMILES encoding from the start. This strategy allows the model to directly learn the hybrid encoding without any intermediate steps. In two-phase pre-training, the Transformer model is pre-trained first on the SMILES encoding until no further performance improvement is observed, and then on the hybrid fragment-SMILES encoding until completion. The two-phase approach is designed to capitalize on the insights gained from SMILES encoding before learning the hybrid fragment-SMILES encoding. We hypothesize that the inclusion of low-frequency fragments in the fragment vocabulary may result in reduced visibility of SMILES tokens. Thus, two-phase pre-training allows the model to gradually adapt to the hybrid encoding while preserving the knowledge learned from SMILES embeddings. After pre-training, we perform fine-tuning on the various ADMET datasets using the pre-trained model.
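The two strategies can be summarized by the high-level sketch below, written under stated assumptions: pretrain_until_plateau stands in for the masked language modelling loop with the early-stopping criterion described in the previous section, and the two-phase variant assumes the hybrid vocabulary extends the SMILES vocabulary so that phase-one embeddings remain usable in phase two.

def pretrain_until_plateau(model, train_loader, test_loader, train_block_fn,
                           eval_fn, patience=2):
    # Placeholder loop: run a block of training (e.g. 5000 batches), then a
    # testing epoch, and stop once the testing loss fails to improve
    # `patience` times in a row.
    best_loss, epochs_without_improvement = float("inf"), 0
    while epochs_without_improvement < patience:
        train_block_fn(model, train_loader)
        test_loss = eval_fn(model, test_loader)
        if test_loss < best_loss:
            best_loss, epochs_without_improvement = test_loss, 0
        else:
            epochs_without_improvement += 1

def pretrain_one_phase(model, hybrid_data, **kw):
    # One-phase: learn the hybrid fragment-SMILES encoding from the outset.
    pretrain_until_plateau(model, *hybrid_data, **kw)

def pretrain_two_phase(model, smiles_data, hybrid_data, **kw):
    # Two-phase: phase 1 on plain SMILES until the testing loss plateaus,
    # then phase 2 on the hybrid fragment-SMILES encoding.
    pretrain_until_plateau(model, *smiles_data, **kw)
    pretrain_until_plateau(model, *hybrid_data, **kw)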

Evaluation

We evaluate the performance of our Transformer model under three scenarios: pre-training (section "Pre-training"), fine-tuning on 29 ADMET datasets (section "Fine-tuning for ADMET prediction"), and fine-tuning on the ADMET group benchmarks from TDC (section "Fine-tuning on therapeutics data commons ADMET benchmark"). For pre-training, we compare model performance using testing loss and accuracy. For fine-tuning on the 29 ADMET datasets, we compare models using the area under the receiver operating characteristic curve (AUROC) on classification tasks and the coefficient of determination (\(R^2\)) on regression tasks, both computed on the testing set, as indicated in Table 1. For fine-tuning on the ADMET group benchmarks from TDC, various evaluation metrics are employed, including mean absolute error (MAE), AUROC, Spearman’s rank correlation coefficient (Spearman), and area under the precision-recall curve (AUPRC); each metric is identified, along with its corresponding dataset, in Table 6. Similarly, we report the testing set performance on the benchmark datasets.
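A small sketch of the metric computation for the 29 fine-tuning datasets is given below, assuming scikit-learn; predicted probabilities are expected for classification tasks.

import numpy as np
from sklearn.metrics import roc_auc_score, r2_score

def score_task(y_true, y_pred, task_type):
    """AUROC for classification tasks, coefficient of determination (R^2)
    for regression tasks, as used for the 29 ADMET fine-tuning datasets."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if task_type == "classification":
        return roc_auc_score(y_true, y_pred)   # y_pred: predicted probabilities
    return r2_score(y_true, y_pred)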

Since the combination of ADMET datasets used in this study is imbalanced in sample size, we adopt a stratified batching strategy during fine-tuning on the 29 ADMET datasets and the benchmark datasets, ensuring that each batch contains at least one sample from each dataset. By adopting this approach, we prevent the model from overfitting to larger datasets whose samples would otherwise be overrepresented in batches, thereby enhancing model generalization. To mitigate the impact of data partitioning on the variability of model performance, we repeat the entire training procedure 5 times using the distinct random seeds specified in Table 2. We also implement fivefold cross-validation when fine-tuning with the 29 ADMET datasets and present both the mean and the standard deviation of the performance metrics across all folds and the 5 training runs. For fine-tuning on the TDC benchmark datasets, a 70–10–20 train–validation–test scaffold split [71] is performed, with 5 training runs, and the mean and standard deviation of all performance metrics are reported.
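One plausible realization of the stratified batching strategy is sketched below; the text specifies only that each batch contains at least one sample from every dataset, so the handling of the remaining slots (uniform over datasets, with replacement) is an assumption.

import random

def stratified_batches(datasets, batch_size, seed=0):
    """Yield batches of (dataset_index, sample) pairs such that every batch
    contains at least one sample from each dataset; remaining slots are filled
    at random with replacement. Yields indefinitely; the caller decides how
    many batches make up an epoch."""
    rng = random.Random(seed)
    assert batch_size >= len(datasets)
    pools = [list(ds) for ds in datasets]
    while True:
        batch = [(i, rng.choice(pool)) for i, pool in enumerate(pools)]  # one per dataset
        while len(batch) < batch_size:
            i = rng.randrange(len(pools))
            batch.append((i, rng.choice(pools[i])))
        yield batch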

Results and discussions

Pre-training

Table 3 Pre-training results on the testing set in terms of mean and standard deviation among five executions
Fig. 4 Pre-training curves in the one-phase and two-phase strategies, averaged among executions

The results of all pre-training experiments in the final testing epoch, with the mean and standard deviation among the five executions, are shown in Table 3. The results indicate that lowering the frequency cutoff increases the testing loss and decreases the accuracy, where accuracy in the pre-training stage is calculated on the prediction of masked tokens. This implies that MTL-BERT has more difficulty in predicting masked fragment tokens when infrequent and diverse fragments are used in place of SMILES tokens. With a reduction in the cutoff, there is a swift rise in the number of fragments, as depicted in Fig. 1. A decrease in the frequency cutoff is therefore likely to lead to lower accuracy due to the increased presence of rarely occurring fragments, which the model may not effectively contextualize. A performance difference between the one-phase and two-phase pre-training strategies is also observed.

Apart from the outcomes at the 1000 frequency cutoff, the two-phase approach consistently results in a slightly higher loss alongside improved accuracy. This may be due to the difficulty of contextualizing a hybrid sequence input after starting with SMILES tokenization and subsequently switching to hybrid fragment-SMILES tokenization. In contrast, the one-phase strategy teaches the model to contextualize both fragments and SMILES simultaneously from the start. Hence, two-phase models might have to recalibrate their SMILES contextualization alongside integrating fragment information, which can result in a diminished overall performance compared to the one-phase pre-training approach. It is important to note that the comparison of pre-training performance across different cutoffs and pure SMILES, as shown by the testing loss and accuracy in Table 3, is biased, as the varying sizes of the vocabularies imply differing degrees of challenge.

The average pre-training curves of the hybrid tokenization models on the testing set are illustrated in Fig. 4, separated into one-phase and two-phase. The results show a rapid improvement in the first epoch, characterized by a steep learning curve. This is followed by a more gradual progression in subsequent epochs, with smaller but still evident improvements. While the SMILES performance pattern is not illustrated, it mirrors that of the hybrid models with a significant initial improvement, although demonstrating lower loss values and higher accuracy rates at the end of training. The observed trends raise questions about the optimal configuration of the learning rate. Rapid early improvements hint at a robust initial grasp of data representation learning; however, the plateau in later stages implies potential overfitting or an inability to further generalize from the training data. Adjusting the learning rate could help the model learn more effectively throughout the pre-training process; however, we did not do so due to the high computational demands of the pre-training phase, in conjunction with the need to tune the learning rate according to each fragment frequency cutoff.

Furthermore, the consistent, although marginal, gains after the first epoch suggest that the models are still extracting valuable information, albeit at a reduced rate. This could imply that the models are approaching their capacity for learning from the given dataset, or that the complexity of the data requires more nuanced learning strategies. Further improvements to modelling could come from a larger and more diverse dataset of molecules, altering the masking rates of the masked language modelling strategy, or employing a learning strategy other than masked language modelling.

In summary, in this section we investigated the performance of our HFST strategy in masked language model pre-training, finding it to be a challenging task as more infrequent fragments are included in the vocabulary. In terms of the testing loss metric, performance marginally declines with the application of two-phase pre-training compared to the one-phase approach. This reduction may be attributed to the necessity of recontextualizing the embeddings upon transitioning from the initial to the subsequent phase. Unlike two-phase, the one-phase approach employs our proposed HFST method from the outset, thereby averting the need for recontextualization. While the one-phase approach demonstrates preferable performance compared to the two-phase approach during pre-training, we evaluate the efficacy of our proposed method for ADMET prediction in the subsequent section.

Fine-tuning for ADMET prediction

Fig. 5 Comparison between the two-phase fine-tuning experimentation and SMILES for a classification and b regression tasks, averaged among folds and executions

Fig. 6 Comparison between the one-phase fine-tuning experimentation and SMILES for a classification and b regression tasks, averaged among folds and executions

The performance of the hybrid and SMILES tokenization models during the final testing epoch, averaged across multiple runs and folds, is presented in Figs. 5 and 6. Additionally, Tables 4 and 5 report the resulting values, categorized by the two-phase and one-phase strategies.

In the two-phase approach, we observed that SMILES tokenization consistently achieved the best performance, followed closely by the 1000 frequency hybrid tokenization, with metric values worsening as more infrequent fragments are incorporated. With rare fragments included, the model fails to contextualize them effectively and to accurately predict ADMET properties. Interestingly, the 1000 frequency hybrid approach outperformed SMILES specifically in the CYP2D6 substrate classification and hERG regression tasks, both of which are critical for drug metabolism and safety within the cardiovascular system.

When using one-phase pre-training, both the SMILES and 1000 frequency hybrid tokenizations emerge as top performers, with the trend of lower performance for lower frequency cutoffs persisting across the one-phase and two-phase strategies. Notably, the one-phase 1000 frequency hybrid tokenization consistently outperforms SMILES across most tasks, with an exception in the blood-brain barrier predictive task, where the 10 frequency cutoff matched SMILES and the 1000 frequency hybrid. In addition, the 500 frequency hybrid outpaces SMILES and all remaining frequencies for microsome clearance prediction. This suggests that specific tasks benefit from varying tokenization strategies, as certain molecular substructures carry high predictive weight.

Overall, our models demonstrated better performance on classification tasks than on regression tasks. The binary nature of the ADMET classification tasks makes them inherently easier to predict than regression tasks, which take on continuous values. Both the hybrid and SMILES tokenization models exhibited poor performance on the half-life dataset, with suboptimal average predictions and a high standard deviation, as seen in Figs. 5 and 6. This dataset likely poses unique challenges due to the complex interplay of molecular features that affect drug half-life, some of which may not be directly related to the molecule itself. Despite the overall poor performance, the one-phase 1000 frequency hybrid tokenization still outperformed SMILES on the half-life task, suggesting that the hybrid approach offers resilience in challenging predictive tasks.

Fine-tuning on therapeutics data commons ADMET benchmark

We fine-tune our HFST model and SMILES model utilizing the ADMET group benchmark from TDC, encompassing a total of 22 datasets. The mean and standard deviation of the test set performance of the 1000 frequency one-phase HFST and SMILES tokenization models are presented in Table 6, together with the corresponding performance metrics as explained in section "Evaluation", with the best metric values highlighted. Furthermore, we provide a comparative analysis between our model and five non-ensemble models that are prominently featured on the leaderboard, having submitted entries across all benchmark datasets. This comparison spans a diverse array of machine learning methodologies, ranging from conventional ADMET prediction employing molecular fingerprints and multilayer perceptron (MLP) models to contemporary deep learning methods, including convolutional neural networks (CNNs) and GNNs. The models used for comparison include Basic ML [72], DeepPurpose (with variants Morgan + MLP and CNN, each executed separately) [73], Chemprop (a message passing GNN model) [74], and AttentiveFP (a GAT model) [54].

The results in Table 6 show HFST demonstrating superior performance over the traditional SMILES notation for molecular language modelling in a majority of the benchmark tasks, corroborating the findings in section "Fine-tuning for ADMET prediction". Notably, the one-phase 1000 frequency HFST excels in predicting bioavailability and hepatocyte clearance, while SMILES tokenization shows its strengths in the CYP 2C9 substrate and microsome clearance tasks. This performance underscores the potential of HFST in certain ADMET applications. Extending beyond our approach, no single model dominates across all 22 benchmark datasets, suggesting that a tailored approach of selecting specific models for specific tasks may yield the most effective strategy in predictive modelling for drug discovery. This observation aligns with the current absence of language models on the TDC leaderboard and the prevalence of GNNs, indicating a need for enhancements in training Transformer models to establish their competitive edge in this domain. The results collectively signal an opportunity for the development of more robust models capable of consistent performance across a diverse array of ADMET prediction tasks.

Conclusion

In this study, we explore the impact of a novel hybrid fragment-SMILES tokenization (HFST) procedure alongside two pre-training strategies for Transformer-based ADMET prediction, while experimenting with a spectrum of fragment vocabularies. Our findings underscore the critical role of data representation and learning methodology in achieving accurate predictions for classification and regression tasks. Although SMILES tokenization remains a robust baseline, our hybrid approach, especially at the 1000 frequency level, consistently outperforms SMILES tokenization, both on a collection of 29 ADMET datasets and on the TDC ADMET group benchmark. However, it is important to recognize that the selection of a frequency cutoff significantly impacts model performance, and incorporating lower frequency fragments tends to have a detrimental effect on ADMET predictions. Therefore, the fragment frequency cutoff emerges as an important hyperparameter that requires tuning before model training.

From our experimentation, the need for learning rate optimization is clear, and further tuning could yield substantial improvements in ADMET prediction accuracy. Due to the large computational cost of each experiment, and the number of experiments performed in this study, we did not tune the learning rate. However, we expect that the learning rate must be tuned for each frequency cutoff vocabulary. In addition, we propose adjusting the masked language modelling approach to prioritize converting fragments into mask tokens and assigning them a higher weight during loss calculation. By doing so, the model should more effectively contextualize between fragments and SMILES tokens within our hybrid approach. Given the limited efficiency of the masked language modelling strategy, where only 15% of tokens are used for prediction, we recommend exploring an encoder-decoder Transformer model and language modelling strategy. A full Transformer model learns to contextualize entire sequences at once, potentially addressing some of the limitations observed in our current approach. Last but not least, the hybrid encoding idea is applicable to other line notation representations for molecules, such as SELFIES [10], along with suitable fragmentation techniques, and to a range of quantitative structure activity relationship (QSAR) predictive tasks. This generalization needs to be comprehensively studied as future work.

Availability of data and materials

The molecular data used in this research is a combined set of the MOSES, ChEMBL, and ZINC-250K databases for pre-training, 29 ADMET datasets for fine-tuning, and benchmark datasets under the ADMET group leaderboard, all of which have been gathered from Therapeutics Data Commons (accessible via https://tdcommons.ai). The implementation of this research can be found at https://github.com/Pixelatory/HybridFragmentTokenization.

Abbreviations

2D:

Two-dimensional

3D:

Three-dimensional

ADMET:

Absorption, distribution, metabolism, excretion, and toxicity

AI:

Artificial intelligence

AUPRC:

Area under the precision-recall curve

AUROC:

Area under the receiver operating characteristic curve

BERT:

Bidirectional encoder representations from transformers

CNN:

Convolutional neural network

HFST:

Hybrid fragment-SMILES tokenization

GAT:

Graph attention network

GCN:

Graph convolutional neural network

GIN:

Graph isomorphism network

GNN:

Graph-based neural network

GPT:

Generative pre-trained transformer

MAE:

Mean absolute error

ML:

Machine learning

MLP:

Multilayer perceptron

MPNN:

Message passing neural network

MTL-BERT:

Multi-task learning BERT

NLP:

Natural language processing

QSAR:

Quantitative structure activity relationship

\(R^2\) :

Coefficient of determination

RNN:

Recurrent neural network

SELFIES:

SELF-referencing embedded string

SMILES:

Simplified molecular input line entry system

Spearman:

Spearman’s rank correlation coefficient

TDC:

Therapeutics data commons

References

  1. Malerba F, Orsenigo L. The evolution of the pharmaceutical industry. Bus Hist. 2015;57(5):664–87.

  2. Lu M, Yin J, Zhu Q, Lin G, Mou M, Liu F, Pan Z, You N, Lian X, Li F, et al. Artificial intelligence in pharmaceutical sciences. Engineering 2023

  3. Kumar M, Nguyen TN, Kaur J, Singh TG, Soni D, Singh R, Kumar P. Opportunities and challenges in application of artificial intelligence in pharmacology. Pharmacol Rep. 2023;1–16.

  4. Lipinski CF, Maltarollo VG, Oliveira PR, Da Silva AB, Honorio KM. Advances and perspectives in applying deep learning for drug design and discovery. Front Robot AI. 2019;6:108.

  5. Tran TTV, Surya Wibowo A, Tayara H, Chong KT. Artificial intelligence in drug toxicity prediction: recent advances, challenges, and future perspectives. J Chem Inf Model. 2023;63(9):2628–43.

  6. Rajman I. PK/PD modelling and simulations: utility in drug development. Drug Discov Today. 2008;13(7–8):341–6.

  7. Ferreira LL, Andricopulo AD. ADMET modeling approaches in drug discovery. Drug Discov Today. 2019;24(5):1157–65.

  8. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–6.

  9. O’Boyle N, Dalke A. DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. ChemRxiv. 2018.

  10. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn: Sci Technol. 2020;1(4): 045024.

  11. Cheng AH, Cai A, Miret S, Malkomes G, Phielipp M, Aspuru-Guzik A. Group SELFIES: a robust fragment-based molecular string representation. Digit Discov. 2023.

  12. Ståhl N, Falkman G, Karlsson A, Mathiason G, Bostrom J. Deep reinforcement learning for multiparameter optimization in de novo drug design. J Chem Inf Model. 2019;59(7):3166–76.

  13. Degen J, Wegscheid-Gerlach C, Zaliani A, Rarey M. On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem. 2008;3(10):1503–7.

  14. Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci. 2018;4(2):268–76.

  15. Winter R, Montanari F, Noé F, Clevert D-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci. 2019;10(6):1692–701.

  16. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations. 2015.

  17. Luong M-T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. In: Conference on Empirical Methods in Natural Language Processing. 2015.

  18. Lin T, Wang Y, Liu X, Qiu X. A survey of transformers. AI Open. 2022;3:111–32.

  19. Fabian B, Edlich T, Gaspar H, Segler M, Meyers J, Fiscato M, Ahmed M. Molecular representation learning with language models and domain-relevant auxiliary tasks. 2020. arXiv preprint arXiv:2011.13230.

  20. Wu Z, Jiang D, Wang J, Zhang X, Du H, Pan L, Hsieh C-Y, Cao D, Hou T. Knowledge-based BERT: a method to extract molecular features like computational chemists. Brief Bioinform. 2022;23(3):131.

  21. Ahmad W, Simon E, Chithrananda S, Grand G, Ramsundar B. ChemBERTa-2: towards chemical foundation models. 2022. arXiv preprint arXiv:2209.01712.

  22. Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. 2020. arXiv preprint arXiv:2010.09885.

  23. Zhang X-C, Wu C-K, Yang Z-J, Wu Z-X, Yi J-C, Hsieh C-Y, Hou T-J, Cao D-S. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform. 2021;22(6):152.

  24. Wang S, Guo Y, Wang Y, Sun H, Huang J. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. In: ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 2019;429–436.

  25. Yang L, Jin C, Yang G, Bing Z, Huang L, Niu Y, Yang L. Transformer-based deep learning method for optimizing ADMET properties of lead compounds. Phys Chem Chem Phys. 2023;25:2377–85.

  26. Adilov S. Generative pre-training from molecules ChemRxiv preprint. 2021. https://doi.org/10.26434/chemrxiv-2021-5fwjd.

  27. Liu Y, Zhang R, Li T, Jiang J, Ma J, Wang P. MolRoPE-BERT: an enhanced molecular representation with rotary position embedding for molecular property prediction. J Mol Graph Model. 2023;118: 108344.

  28. Irwin R, Dimitriadis S, He J, Bjerrum EJ. Chemformer: a pre-trained transformer for computational chemistry. Mach Learn: Sci Technol. 2022;3(1):015022.

  29. Méndez-Lucio O, Nicolaou C, Earnshaw B. MolE: a molecular foundation model for drug discovery. 2022. arXiv preprint arXiv:2211.02657.

  30. Torres LH, Ribeiro B, Arrais JP. Few-shot learning with transformers via graph embeddings for molecular property prediction. Expert Syst Appl. 2023;225: 120005.

  31. Jiang Y, Jin S, Jin X, Xiao X, Wu W, Liu X, Zhang Q, Zeng X, Yang G, Niu Z. Pharmacophoric-constrained heterogeneous graph transformer model for molecular property prediction. Commun Chem. 2023;6(1):60.

  32. Song Y, Chen J, Wang W, Chen G, Ma Z. Double-head transformer neural network for molecular property prediction. J Cheminform. 2023;15(1):1–16.

  33. Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst. 2020;33:12559–71.

  34. Ying C, Cai T, Luo S, Zheng S, Ke G, He D, Shen Y, Liu T-Y. Do transformers really perform badly for graph representation? Adv Neural Inf Process Syst. 2021;34:28877–88.

  35. Chen J, Zheng S, Song Y, Rao J, Yang Y. Learning attributed graph representations with communicative message passing transformer. 2021. arXiv preprint arXiv:2107.08773.

  36. Li H, Zhao D, Zeng J. KPGT: knowledge-guided pre-training of graph transformer for molecular property prediction. 2022. arXiv:2206.03364.

  37. Ross J, Belgodere B, Chenthamarakshan V, Padhi I, Mroueh Y, Das P. Large-scale chemical language representations capture molecular structure and properties. Nature Mach Intell. 2022;4(12):1256–64.

  38. Karpov P, Godin G, Tetko IV. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform. 2020;12(1):1–12.

  39. Maziarka L, Danel T, Mucha S, Rataj K, Tabor J, Jastrzebski S. Molecule attention transformer. 2020. arXiv preprint arXiv:2002.08264.

  40. Honda S, Shi S, Ueda HR. SMILES Transformer: pre-trained molecular fingerprint for low data drug discovery. 2019. arXiv preprint arXiv:1911.04738.

  41. Zhang X-C, Wu C-K, Yi J-C, Zeng X-X, Yang C-Q, Lu A-P, Hou T-J, Cao D-S. Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. Research. 2022;2022:0004.

  42. Ke Z, Liu B, Ma N, Xu H, Shu L. Achieving forgetting prevention and knowledge transfer in continual learning. Adv Neural Inf Process Syst. 2021;34:22443–56.

  43. Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP. Convolutional networks on graphs for learning molecular fingerprints. In: Conference on Neural Information Processing Systems. 2015.

  44. Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, Leskovec J. Strategies for pre-training graph neural networks. 2019. arXiv preprint arXiv:1905.12265.

  45. Wieder O, Kohlbacher S, Kuenemann M, Garon A, Ducrot P, Seidel T, Langer T. A compact review of molecular property prediction with graph neural networks. Drug Discov Today Technol. 2020;37:1–12.

  46. Velickovic P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. In: International Conference on Learning Representations. 2018.

  47. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: ICML. 2017.

  48. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks. In: International Conference on Learning Representations. 2019.

  49. Feinberg EN, Joshi E, Pande VS, Cheng AC. Improvement in ADMET prediction with multitask deep featurization. J Med Chem. 2020;63(16):8835–48.

  50. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30:595–608.

  51. Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF. Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inf Model. 2017;57(8):1757–72.

  52. Montanari F, Kuhnke L, Ter Laak A, Clevert D-A. Modeling physico-chemical admet endpoints with multitask graph convolutional networks. Molecules. 2019;25(1):44.

  53. Xiong G, Wu Z, Yi J, Fu L, Yang Z, Hsieh C, Yin M, Zeng X, Wu C, Lu A, et al. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucl Acids Res. 2021;49(W1):5–14.

  54. Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, Li Z, Luo X, Chen K, Jiang H, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem. 2019;63(16):8749–60.

  55. Yu J, Wang J, Zhao H, Gao J, Kang Y, Cao D, Wang Z, Hou T. Organic compound synthetic accessibility prediction based on the graph attention mechanism. J Chem Inf Model. 2022;62(12):2973–86.

  56. Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model. 2019;59(8):3370–88.

  57. Li S, Zhou J, Xu T, Dou D, Xiong H. GeomGCL: geometric graph contrastive learning for molecular property prediction. In: AAAI Conference on Artificial Intelligence, Vol. 36. 2022. pp. 4541–9.

  58. Zhang Z, Liu Q, Wang H, Lu C, Lee C-K. Motif-based graph self-supervised learning for molecular property prediction. Adv Neural Inf Process Syst. 2021;34:15870–82.

  59. Peng Y, Lin Y, Jing X-Y, Zhang H, Huang Y, Luo GS. Enhanced graph isomorphism network for molecular ADMET properties prediction. IEEE Access. 2020;8:168344–60.

  60. Wei Y, Li S, Li Z, Wan Z, Lin J. Interpretable-ADMET: a web service for ADMET prediction and optimization based on deep neural representation. Bioinformatics. 2022;38(10):2863–71.

  61. Du B-X, Xu Y, Yiu S-M, Yu H, Shi J-Y. MTGL-ADMET: a novel multi-task graph learning framework for ADMET prediction enhanced by status-theory and maximum flow. In: International Conference on Research in Computational Molecular Biology. Springer. 2023. pp. 85–103.

  62. Zhang S, Yan Z, Huang Y, Liu L, He D, Wang W, Fang X, Zhang X, Wang F, Wu H, et al. HelixADMET: a robust and endpoint extensible ADMET system incorporating self-supervised knowledge transfer. Bioinformatics. 2022;38(13):3444–53.

  63. Wang Y, Wang J, Cao Z, Barati Farimani A. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell. 2022;4(3):279–87.

  64. Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X, Gao P, Xie G, Song S. Learn molecular representations from large-scale unlabeled molecules for drug discovery. 2020. arXiv preprint arXiv:2012.11175.

  65. Fang X, Liu L, Lei J, He D, Zhang S, Zhou J, Wang F, Wu H, Wang H. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell. 2022;4(2):127–34.

  66. Jin W, Barzilay R, Jaakkola T. Hierarchical generation of molecular graphs using structural motifs. In: International Conference on Machine Learning, 2020; 4839–4848. PMLR.

  67. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012;40(D1):1100–7.

  68. Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, Golovanov S, Tatanov O, Belyaev S, Kurbanov R, Artamonov A, Aladinskiy V, Veselov M, et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front Pharmacol. 2020;11: 565644.

  69. Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, Coley CW, Xiao C, Sun J, Zitnik M. Therapeutics Data Commons: machine learning datasets and tasks for drug discovery and development. 2021. arXiv preprint arXiv:2102.09548.

  70. Landrum G. RDKit: open-source cheminformatics. 2006. http://www.rdkit.org.

  71. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. Moleculenet: a benchmark for molecular machine learning. Chem Sci. 2018;9(2):513–30.

  72. Boral N, Ghosh P, Goswami A, Bhattacharyya M. Accountable prediction of drug ADMET properties with molecular descriptors. bioRxiv, 2022;2022-06.

  73. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics. 2020;36(22–23):5545–7.

  74. Heid E, Greenman KP, Chung Y, Li S-C, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ. Chemprop: a machine learning package for chemical property prediction. J Chem Inf Model. 2023;64(1):9–17.

Acknowledgements

The authors thank the anonymous reviewers for their valuable suggestions.

Funding

This work is supported in part by funds from (1) the AI for Design Challenge Program, National Research Council Canada (AI4D-108 to YL and AT), (2) the Discovery Grant Program, Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-05418 to BOB and RGPIN-2021-03879 to YL), (3) Canada Research Chair Program (2021-00214 to YL), and (4) Canada Foundation for Innovation (42115 to YL).

Author information

Contributions

NA collected and processed the molecular and ADMET data, implemented the HFST and ADMET prediction framework, conducted all the experiments, analyzed the results, and drafted the manuscript. AT provided essential feedback on this work. BOB and YL co-supervised NA and led this research. All authors proof-read and produced the final version of this manuscript.

Corresponding authors

Correspondence to Yifeng Li or Beatrice Ombuki-Berman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Comparison of ADMET modelling experimentation

The detailed results for the two-phase and one-phase pre-training strategies, followed by fine-tuning on the ADMET tasks, are given in Tables 4 and 5.

Table 4 Comparison of ADMET modelling experimentation using two-phase pre-training
Table 5 Comparison of ADMET modelling experimentation using one-phase pre-training
Table 6 Comparison of ADMET prediction models on TDC ADMET group benchmark

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Aksamit, N., Tchagang, A., Li, Y. et al. Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery. BMC Bioinformatics 25, 255 (2024). https://doi.org/10.1186/s12859-024-05861-z

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-024-05861-z
