Can Large Language Models Understand Molecules?

Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to encode SMILES strings into vector representations. Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT

pre-trained models [4,15,16]. These embedding techniques aim to capture relevant structural and chemical information in a compact numerical representation [17]. The fundamental hypothesis asserts that structurally similar molecules behave in similar ways. This enables machine learning algorithms to process and analyze molecular structures for property prediction and drug discovery tasks.
With the breakthroughs made in LLMs, one prominent question is whether LLMs can understand molecules and make inferences on molecule data. More specifically, can LLMs produce high-quality semantic representations? Guo et al. [18] made a preliminary study by evaluating several chemical inference tasks using LLMs. Their study was limited to utilizing and evaluating LLMs' performance in answering SMILES-related queries. The ability of these models to effectively embed SMILES has yet to be fully explored, perhaps partly due to the cost of API calls, and we move further by examining it. Our conclusions are: (1) LLMs do outperform traditional methods. (2) The performance is task dependent and sometimes data dependent. (3) Newer versions of LLMs do improve over older versions, even though they are trained on more generic tasks. (4) Embeddings from LLaMA overall outperform GPT embeddings. (5) LLaMA and LLaMA2 are very close regarding embedding performance.

Related work
For accurate prediction of chemical properties using machine learning, leveraging molecule embeddings as input feature vectors is crucial [19]. Early molecular embedding methods such as Morgan Fingerprint (FP) [20] encode the structural information of a molecule into a fixed-length binary or integer vector with the knowledge of chemistry. However, for a more generalized embedding, numerous studies have explored methods to embed molecular structures. While some studies focus on the graph representation of the molecular structure to encode the important topology information directly [21][22][23], many choose the string representation of molecules (SMILES) due to rapid advancements in natural language processing (NLP). Initial efforts in this domain utilized foundational NLP architectures like auto-encoders [24] and recurrent neural networks (RNN) to generate embeddings [19]. However, the scarcity of labelled data has shifted focus towards methods that can be pre-trained on unlabeled data, such as Mol2Vec and SPVec [14,25].
With the increasing prominence of transformer models in natural language analysis, where they are pre-trained on extensive unsupervised data and then fine-tuned for specific tasks like classification, transformer-based models have become increasingly relevant in the SMILES language domain. For instance, SMILES-BERT [15] has inspired numerous studies to adapt the transformer framework. These studies try to modify this framework to improve performance on SMILES strings by adopting RoBERTa (Robustly optimized BERT approach) instead of the BERT model [6], developing domain-specific self-supervised pre-training tasks [16], or integrating the local message passing mechanism of graph neural networks (GNNs) into BERT to enhance learning from molecular graphs [5]. Additionally, MolFormer [4] introduces a novel approach by combining molecular language with transformer encoder models, incorporating rotary positional embeddings (RoPE) from RoFormer, to produce more effective molecular embeddings [4,26].
However, pre-training these models on millions of molecules requires substantial hardware resources. For example, MolFormer requires up to 16 V100 graphics processing units (GPUs) [4]. Consequently, it is computationally more feasible to use pre-trained large language models (LLMs), such as GPT [27] and LLaMA [28,29], for generating embeddings. These models have already been trained on vast amounts of data, making them readily available for processing SMILES strings to obtain molecular embeddings without extensive hardware.
To the best of our knowledge, the application of GPT and LLaMA in chemistry has primarily been limited to utilizing and evaluating their performance in answering queries. Further exploration and implementation of LLMs for more advanced tasks within chemistry are yet to be thoroughly documented. For example, to examine how well LLMs understand chemistry, Guo et al. [18] assessed the performance of these models on practical chemistry tasks using only queries. Their results demonstrate that GPT models are comparable with classical machine learning models when applied to chemical problems that can be transformed into classification or ranking tasks, such as property prediction. However, they stop at evaluating the LLMs' ability to answer prompts and do not evaluate the embedding power of LLMs. Hence, inspired by the many language-based methods that extract molecular embeddings, our study is, to our knowledge, the first to rigorously assess the use of embeddings from LLMs like GPT and LLaMA for chemistry tasks.

LLMs
LLMs, exemplified by architectures like BERT [12], GPT [27], LLaMA [28], and LLaMA2 [29], excel at understanding context within sentences and generating coherent text. They leverage attention mechanisms and vast training data to capture contextual information, making them versatile for text generation, translation, and sentiment analysis tasks. While Word2Vec enhances word-level semantics, language models provide a deeper understanding of context and facilitate more comprehensive language understanding and generation. Pre-trained LLMs can transform text into dense, high-dimensional vectors, which capture contextual information and meaning. Using pre-trained LLMs offers an edge as they transfer knowledge from their vast training data, enabling the extraction of context-sensitive representations without requiring extensive task-specific data or feature engineering [30].
This work focuses on obtaining the embeddings of SMILES strings from GPT and LLaMA models to find the model that achieves the best performance. OpenAI [31] offers several GPT-based embedding models, including 'text-embedding-ada-002', 'text-embedding-3-small', and 'text-embedding-3-large'. Our research used text-embedding-3-small, one of the most recent embedding models. This model is acclaimed as one of the best among the available embedding models and is the most affordable one offered by OpenAI. It employs the 'cl100k_base' tokenizer and produces a 1536-dimensional vector representation. We input SMILES strings into this model, allowing GPT to create embeddings for each string. These embeddings serve as the feature vector for our classification tasks.
In parallel, we leveraged the capabilities of LLaMA [28] and its advanced variant, LLaMA2 [29]. These models, ranging from 7 to 65 billion parameters, are built on the Transformer architecture. LLaMA2, an enhancement of LLaMA, benefits from training on an expanded publicly available data set. Its pre-training corpus grew by 40%, and its context length doubled to 4096 tokens. LLaMA models employ a decoder-only Transformer architecture with causal multi-headed attention in each layer. Drawing architectural inspiration from prominent language models like GPT-3 and PaLM (Pathways Language Model) [32], they incorporate features such as pre-normalization, RMSNorm, SwiGLU activation functions, and rotary positional embeddings (RoPE) [26] in every transformer layer.
The training dataset of LLaMA [28,33] predominantly comprises webpages, accounting for over 80% of its content. This is supplemented by various sources, including 6.5% code-centric data from GitHub and StackExchange, 4.5% literary content from books, and 2.5% scientific material primarily sourced from arXiv.
In contrast, GPT [33,34] was developed using a comprehensive and mixed dataset. This dataset includes diverse sources like CommonCrawl, WebText2, two different book collections (Books1 and Books2), and Wikipedia.
SMILES is utilized as a "chemical language" that encodes the structural elements of a chemical graph, including atoms, bonds, and rings, into a brief textual format. This is achieved through a systematic, depth-first tree traversal of the chemical structure. The method uses alphanumeric characters to represent atoms (such as C, S, Br) and symbols such as '-', '=', and '#' to indicate different types of chemical bonds. For instance, the SMILES notation for Ibuprofen is CC(C)Cc1ccc(cc1)C(C)C(O)=O (Fig. 1).

Fig. 1 Drug chemical representations
Table 1 compares how each model tokenizes SMILES strings. ChemBERTa, explicitly designed for molecular embeddings, tokenizes SMILES using the Byte-Pair Encoding (BPE) strategy. Meanwhile, MolFormer-XL employs a SMILES-specific regular expression method, as described by Schwaller et al. [35], using an atom-wise tokenization strategy whose regular expression pattern is able to differentiate between atom characters and symbols for chemical bonds. LLaMA, as a general-purpose model, employs a different tokenization approach. Its tokenizer is based on SentencePiece Byte-Pair Encoding (BPE). This tokenizer processes the input string character by character, searching for the largest known subword units it can match based on its training. Consequently, as can be seen in Table 1, it treats 'CS' from the 'CCS(=O)(=O)CCBr' string as a single token, possibly interpreting it as an abbreviation in natural language. However, 'C' and 'S' should be considered separate tokens, since each represents a distinct atom.
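To make the contrast between atom-wise and subword tokenization concrete, the following is a minimal sketch of regular-expression tokenization. The pattern is the one commonly attributed to Schwaller et al.; it is an assumption here, since the exact pattern used in this study is not reproduced in the text, and the function name `tokenize_smiles` is hypothetical.

```python
import re
from typing import List

# Atom-wise SMILES pattern commonly attributed to Schwaller et al.
# Assumption: the study's exact pattern may differ from this one.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # 'C' and 'S' come out as separate tokens, while 'Br' stays together.
    print(tokenize_smiles("CCS(=O)(=O)CCBr"))
```

With such a pattern, two-letter atoms such as Br and Cl and bracketed atoms stay intact, while adjacent single-letter atoms are never merged into one subword.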
Table 2 compares the molecular embedding models in terms of the number of layers, the number of parameters, and their speed in generating a SMILES embedding. Compared with Morgan FP, language models are extremely slow. However, GPT is the fastest among the language models, while LLaMA models are the slowest. There is also a relation between the number of layers and the speed of embedding generation, although GPT remains an exception.

Experiments
Our study aims to generate molecular representation via LLMs and then evaluate the representation on various downstream tasks. To demonstrate the effectiveness of LLMs' molecular representations, we benchmarked their performance on numerous challenging classification and regression tasks from MoleculeNet [36] as well as link prediction from BioSnap [37] and DrugBank [38]. The objective of link prediction in this research is to map the drugs as nodes and their interactions as edges and identify whether there is a missing edge between two drug nodes.

Experimental setup
We experimented with seven models, each evaluated on six classification, three regression, and two link prediction tasks. To generate embeddings from the LLaMA, BERT, ChemBERTa, and MolFormer models, we first download and load the model weights using the Transformers library and then generate the embeddings. For LLaMA, we download the weights provided by Meta and then convert them into PyTorch format. We extract embeddings from the last layer of the LLMs, following the practice in [39]. Pooling strategies can impact performance, so we explored a variety of combinations; the overall results remained the same. Hence, for the sake of simplicity, we use only the last layer. For GPT embeddings, we choose the recent model text-embedding-3-small.
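As an illustration of turning last-layer token vectors into a single molecule embedding, here is a minimal sketch of masked mean pooling in plain Python. The text does not state which pooling operator was finally used, so mean pooling is an assumption, and `mean_pool` is a hypothetical helper rather than part of any library.

```python
from typing import List, Sequence


def mean_pool(
    last_hidden: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average last-layer token vectors, ignoring padded positions.

    last_hidden: one vector per token from the model's final layer.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    if len(last_hidden) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, keep in zip(last_hidden, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
    print(mean_pool(hidden, [1, 1, 0]))
```

In practice the same operation would be applied to the tensor of hidden states returned by the model; the list-based version only shows the arithmetic.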
To generate LLaMA and LLaMA2 embeddings, we employed four NVIDIA A2 GPUs to load the 7 billion parameter version of LLaMAs. In this configuration, the average speed of generating embeddings is one molecule per second. In our experiments, we generated embeddings for over 65,000 molecules.
Following MoleculeNet [36], for classification tasks, we partition the datasets into five stratified folds to ensure robust benchmarking. This approach ensures that each fold maintains the same proportion of observations for each target class as in the complete dataset. We employ a logistic regression model from scikit-learn with the following default parameters: L2 regularization, 'lbfgs' for optimization, and a maximum of 100 iterations allowed for the solver to converge. The reported performance metrics are the mean and standard deviation of the F1-score and AUROC, calculated across the five folds.
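The stratification step can be sketched in a few lines of standard-library Python. This is a simplified stand-in for a library splitter such as scikit-learn's, not the code used in the study; the function name `stratified_folds` and the round-robin assignment are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(
    labels: Sequence[int], n_folds: int = 5, seed: int = 0
) -> List[List[int]]:
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


if __name__ == "__main__":
    labels = [0] * 80 + [1] * 20  # 20% positives overall
    for fold in stratified_folds(labels):
        positives = sum(labels[i] for i in fold)
        print(len(fold), positives)  # each fold keeps the 20% positive rate
```

Each fold then serves once as the held-out set, and the F1-score and AUROC are averaged over the five held-out evaluations.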
For regression tasks, we implement five-fold cross-validation to assess model performance. We employ a Ridge regression model from scikit-learn, which is a linear regression model with L2 regularization, with the following default parameters: a tolerance of 0.001 for the optimization and the 'auto' solver, which automatically chooses the most appropriate solver method based on the data type. The metrics reported are the mean and standard deviation of the RMSE and the R², calculated across the five folds.
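For reference, the two regression metrics can be written out directly. The sketch below uses the standard definitions of RMSE and the coefficient of determination; the function names are illustrative and not taken from the study's code.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Raises ZeroDivisionError if all targets are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0]
    y_pred = [2.0, 2.0, 2.0]  # always predicting the mean
    print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

A model that always predicts the mean of the targets scores an R² of zero, which gives a useful baseline when reading the regression results.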
Following MIRACLE [40], a state-of-the-art method in DDI, for link prediction, we split all interaction samples from the DrugBank and BioSnap datasets into training and test sets using a 4:1 ratio. We further select 1/4 of the training dataset as a validation set. The reported results are the mean and standard deviation of AUROC and AUPR across 10 different runs of the graph convolutional network (GCN) model. We set the learning rate of each parameter using an exponentially decaying schedule with an initial learning rate of 0.0002 and a multiplicative factor of 0.96. For the model's hyperparameters, we set the dimension of the hidden state of drugs to 256 and use 3 layers for the GCN encoder. To further regularise the model, dropout with p = 0.3 is applied to every intermediate layer's output. We use PyTorch Geometric [41] for the GCN, which is trained using the Adam optimizer.
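The split and the learning-rate schedule described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say whether the decay is applied per epoch or per optimizer step, so `step` is left generic, and the helper names `split_edges` and `learning_rate` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def split_edges(
    edges: Sequence[Edge], seed: int = 0
) -> Tuple[List[Edge], List[Edge], List[Edge]]:
    """4:1 train/test split, then 1/4 of the training part held out for validation."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 5
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = len(train_full) // 4
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


def learning_rate(step: int, initial: float = 0.0002, gamma: float = 0.96) -> float:
    """Exponentially decayed learning rate after `step` decay steps."""
    return initial * gamma ** step


if __name__ == "__main__":
    edges = [(f"drug{i}", f"drug{i + 1}") for i in range(100)]
    train, val, test = split_edges(edges)
    print(len(train), len(val), len(test))  # a 60/20/20 split overall
    print([learning_rate(s) for s in range(3)])
```

Taken together, the two ratios amount to a 60/20/20 split of the interaction samples into training, validation, and test sets.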

Benchmarking data sets
For classification and regression tasks, we use datasets from MoleculeNet [36], which is a collection of diverse datasets that cover a range of tasks, such as identifying properties like toxicity, bioactivity, and whether a molecule is an inhibitor. MoleculeNet is a widely used benchmark in the field of computational chemistry and drug discovery. It is designed to evaluate and compare the performance of various machine learning models and algorithms on tasks related to molecular property prediction, compound screening, and other cheminformatics tasks [3-6, 18, 23, 42].
For the link prediction task, however, we utilize two DDI networks: BioSnap [37] and DrugBank [38]. These datasets represent interactions among FDA-approved drugs as a biological network, with drugs as nodes and interactions as edges.
We extracted the SMILES strings of drugs in the DrugBank database. It should be noted that we removed some drugs because their SMILES strings in DrugBank are improper and cannot be converted into molecular graphs, as determined by the RDKit library. The errors include outdated storage formats of SMILES strings, wrong characters, etc. Through these curation efforts, we have improved the quality and coherence of our DDI network, ensuring its suitability for comprehensive analysis and interpretation.
For the BioSnap dataset, 1320 drugs have SMILES strings, while the DrugBank dataset has 1690 drugs with SMILES strings. Hence, the number of edges for BioSnap and DrugBank is reduced to 41,577 and 190,609, respectively.

Results on classification tasks
Figure 2a and Tables 3 and 4 present our experiments on classification tasks. Surprisingly, LLaMA embeddings achieve comparable performance to established pre-trained models such as MolFormer-XL [4] and ChemBERTa [6] across all datasets. Conversely, GPT embeddings underperform in every case. Intriguingly, Morgan FP representations nearly match the performance of the pre-trained methods but are more computationally efficient; generating Morgan FP for a large dataset takes less than a minute without the need for a GPU, whereas LLaMA requires GPUs and processes only 117 molecules per minute (Table 2). We also tested other classifiers, including SVM and Random Forest, with similar results. The small standard deviation in the evaluation scores suggests that these performance differences are consistent across folds rather than artifacts of a particular split. Despite ChemBERTa and MolFormer-XL being pre-trained on millions of compounds from PubChem and ZINC, they perform comparably or, in some instances, less effectively than the BERT model. This showcases the importance of fine-tuning pre-trained models.

Results on regression tasks
Figure 2a and Table 5 present the evaluation results for the regression tasks. Similar to the classification results, GPT underperforms relative to other models, and in some instances, it even falls short of Morgan Fingerprint's performance. ChemBERTa consistently emerges as the top-performing model for regression across all tested datasets. BERT and LLaMA exhibit performances that are closely comparable to ChemBERTa in the regression tasks. Additionally, we observed a general decline in the performance of all methods when applied to larger datasets, such as Lipophilicity.

Table 3 Results on classification tasks
The reported performance metrics are the mean and standard deviation of the F1-score and AUROC, calculated across the five folds. The best performance is highlighted in bold

Table 4 Results on multi-task classification tasks
The reported performance metrics are the mean and standard deviation of the F1-score and AUROC, calculated across the five folds. The best performance is highlighted in bold

Results on link prediction tasks
Table 6 presents the results for the link prediction tasks on DDI networks. LLaMA consistently outperforms all other models across both datasets by a significant margin. Notably, Morgan FP surpasses the performance of embeddings from pre-trained models.
It appears that the size of the embeddings impacts model performance, as larger embeddings generally yield better results. Nevertheless, despite having the same size, there are still noticeable performance differences between the LLaMA and LLaMA2 models.

Ablation study
LLaMA vs LLaMA2. Figure 3 compares the LLaMA and LLaMA2 models. The performance of these two models is largely similar across various tasks. However, there are notable differences in specific instances. For example, in the link prediction tasks (Table 6), LLaMA2 outperforms LLaMA. This trend is also observed in classification and regression tasks, where LLaMA2 generally matches or exceeds the performance of LLaMA. Both models share similar architecture and training presets. Nevertheless, LLaMA2 has been trained on 40% more data and supports twice the context length of its predecessor, enhancing its capability to understand more complex language structures [28,29].
Dimension reduction. We investigated the impact of dimension reduction on LLMs with substantial embedding sizes, as illustrated in Fig. 4. Using Principal Component Analysis (PCA) for dimension reduction, we experimented with various reduction sizes. Our findings indicate that the impact of dimension reduction on the classification performance of GPT and LLaMA models is small, although a decrease in performance after reduction is still noticeable. In contrast, for regression tasks, dimension reduction significantly lowers the performance of the models. This suggests a correlation between the size of the embeddings in LLMs and their effectiveness in handling regression tasks.
LLM and anisotropy. It is well-documented that LLM embeddings suffer from the anisotropy problem, meaning they are not uniformly distributed in terms of direction [43][44][45]. Instead, these embeddings occupy a narrow cone in the vector space. Our comparative analysis makes this problem evident and also reveals that LLM embeddings demonstrate a higher degree of anisotropy than pre-trained embeddings and Morgan FP (Fig. 5). This is evident since the cosine similarities between LLM embeddings are more closely grouped together (Fig. 5). However, our experiments indicate that better isotropy does not imply a performance gain in machine-learning tasks. As can be seen, the cosine similarity distribution of LLaMA2 embeddings is much narrower than that of GPT and Morgan FP; however, LLaMA2 outperforms both in most cases.
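One simple way to quantify the anisotropy discussed above is the average cosine similarity over all pairs of embeddings: values close to 1 indicate vectors packed into a narrow cone. The sketch below is an illustrative measure under that assumption, not necessarily the exact procedure behind Fig. 5, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all distinct pairs (needs >= 2 vectors)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    spread_out = [[1.0, 0.0], [0.0, 1.0]]
    narrow_cone = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    print(mean_pairwise_cosine(spread_out))   # orthogonal directions
    print(mean_pairwise_cosine(narrow_cone))  # all vectors point the same way
```

For datasets of tens of thousands of molecules, the number of pairs grows quadratically, so a random sample of pairs is a reasonable substitute for the full enumeration.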
As illustrated in Fig. 6, we also noticed that the PCA representation of GPT's embeddings is predominantly concentrated within a range smaller than 1. This observation also suggests a high likelihood that the GPT embeddings have been pre-normalized.
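Whether a set of embeddings has been normalized can also be checked directly from the vector norms rather than inferred from a PCA plot. The following is a minimal sketch of such a check; the tolerance value and the helper names are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit_normalized(
    embeddings: Sequence[Sequence[float]], tol: float = 1e-3
) -> bool:
    """True if every embedding has an L2 norm within `tol` of 1."""
    return all(abs(l2_norm(vec) - 1.0) <= tol for vec in embeddings)


if __name__ == "__main__":
    print(is_unit_normalized([[0.6, 0.8], [1.0, 0.0]]))  # unit-length vectors
    print(is_unit_normalized([[3.0, 4.0]]))              # norm is 5, not 1
```

If every vector has unit length, all PCA coordinates are necessarily bounded by 1 in magnitude, which is consistent with the pattern observed for the GPT embeddings.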
GPT vs LLaMA. Figure 7 demonstrates that LLaMA consistently outperforms GPT across all datasets by a significant margin. This raises the question of whether these differences are due to the architectural design or the specific training of the models. As outlined in the GPT-4 technical report, GPT models are capable of interpreting SMILES strings. Notably, approximately 2.5% of the LLaMA training dataset, as reported in [28,33], consists of scientific material primarily sourced from arXiv, including bioinformatics papers.
Both LLaMA and GPT models utilize a transformer-based architecture with a heavy reliance on self-attention mechanisms and a decoder-only configuration. However, the opaque nature of GPT as a "black box" model complicates direct comparisons with LLaMA regarding whether their efficiency stems solely from architecture or pre-training specifics. Nonetheless, considering their training on SMILES strings, the data from Fig. 7 and Table 6 suggest that the LLaMA architecture is particularly adept at handling complex language structures like SMILES strings. Furthermore, Table 1 reveals that while the LLaMA2 tokenizer may not perform as well as the MolFormer tokenizer, it tokenizes SMILES strings more effectively than BERT. Unfortunately, we cannot compare the GPT tokenizer directly with other models due to limitations in OpenAI's API access.
Link prediction with SMILES vs drug description. We also extracted the text-format drug description information of drugs from the DrugBank database. Drug description embedding in DDI prediction significantly outperforms using SMILES strings when leveraging LLMs. This improvement is consistent with the LLMs having been pre-trained on general text data, as depicted in Fig. 8. When applied to drug descriptions, which are closer to natural language, GPT outperforms the LLaMA models on both datasets and on both the AUROC and AUPR metrics.

Conclusions
In summary, this research underscores the potential of LLMs like GPT and LLaMA for molecular embedding. We specifically recommend LLaMA models over GPT due to their superior performance in generating molecular embeddings from SMILES strings in our experiments. These findings suggest that LLaMA could be particularly effective in predicting molecular properties and drug interactions. Although models like LLaMA and GPT are not explicitly designed for SMILES string embedding, unlike specialized models such as ChemBERTa and MolFormer-XL, they still demonstrate competitive performance. Our work lays the groundwork for future improvements in utilizing LLMs for molecular embedding. Future efforts will focus on enhancing the quality of molecular embeddings derived from LLMs, inspired by natural language sentence embedding techniques, such as fine-tuning and modifications to LLaMA tokenization.

Fig. 2 Results on classification and regression tasks. Each line represents the mean value of five-fold cross-validation, while the shaded area shows the standard deviation

Fig. 3 Comparison of LLaMA and LLaMA2 performance

Fig. 4 Effect of dimension reduction on the performance of LLMs

Fig. 6 PCA representation of embeddings for a classification task. Red represents positive samples, while blue represents negative samples

Table 1 Comparison of tokenizers for molecular SMILES string

Table 2 Comparison of embedding models used in this study
*Speed of generating embedding. Speed is dependent on the machine

Table 5 Results on regression tasks
The reported performance metrics are the mean and standard deviation of the RMSE and R², calculated across the five folds. The best performance is highlighted in bold

Table 6 Results on link prediction tasks
The reported performance metrics are the mean and standard deviation of the AUROC and AUPR, calculated across the 10 runs. The best performance is highlighted in bold