Minimizing late-stage failure in drug development with transformer models: Enhancing drug screening and pharmacokinetic predictions

doi:10.1016/j.cej.2025.160423

Chemical Engineering Journal

Available online 8 February 2025, 160423

In Press, Journal Pre-proof

https://doi.org/10.1016/j.cej.2025.160423

Highlights

• The model uses a transformer for early-stage drug screening.
• The model was pre-trained on 1.8 billion molecules from ZINC and PubChem.
• It was fine-tuned on the ChEMBL, MoleculeNet, and TOXRIC datasets.
• The model predicts a wide array of ADME-T properties with high accuracy.
• A case study on HIV Integrase 1 narrowed 1.04 million compounds to 143 candidates.

Abstract

Drug discovery is an intricate, multi-phase process aimed at identifying therapeutic compounds capable of modulating disease pathways. Central to this process are target identification, initial candidate selection, and the evaluation of key pharmacokinetic and toxicity properties. Traditional machine learning (ML) methods, while beneficial in facilitating property prediction, often rely heavily on feature engineering and molecular descriptors, which limits their scalability and flexibility in multi-task predictions. In response to these limitations, we propose a transformer-based model that leverages self-attention mechanisms to generate molecular embeddings directly from SMILES sequences. These embeddings, free from the constraints of manual feature engineering, enable the prediction of a wide range of drug properties, including ADME-T (absorption, distribution, metabolism, excretion, and toxicity) characteristics. Our approach improves early-stage drug screening by providing a unified model capable of predicting multiple properties, reducing the need for separate pipelines for each task. Additionally, it enhances computational efficiency through linear attention mechanisms and provides robust molecular representations via Rotary Positional Embedding (RoPE). The model is validated through a case study on HIV Integrase-1, demonstrating its ability to identify promising drug candidates while filtering out compounds with poor pharmacokinetic or toxicity profiles. By streamlining the property prediction process and integrating early ADME-T testing, our method reduces the likelihood of late-stage failures, accelerates the identification of viable drug candidates, and ultimately lowers development costs. This transformer-based approach offers a scalable and adaptable tool for various drug discovery tasks, making it a pivotal advancement in data-driven drug development.

Introduction

Drug discovery is a multifaceted and iterative process aimed at identifying novel therapeutic compounds capable of modulating disease pathways. It typically begins with target identification, where key biological molecules involved in disease mechanisms (such as proteins, enzymes, or receptors) are pinpointed as potential intervention points. Following target identification, the focus shifts to initial candidate selection, which involves screening large chemical libraries to identify compounds that interact effectively with the target through mechanisms such as inhibition or activation [1], [2]. Promising compounds are refined through medicinal chemistry to enhance their potency, selectivity, and drug-like properties. This process is iterative, relying on continuous feedback from biochemical, cellular, and in vivo assays to optimize pharmacokinetic (PK) and pharmacodynamic (PD) profiles. A key component of these steps is testing molecules for their ADME-T (absorption, distribution, metabolism, excretion, and toxicity) profiles. Testing candidates for these properties is vital since approximately 40% of compounds fail during the ADME-T testing stage due to unfavorable pharmacokinetic and toxicity profiles [3], [4], [5], which substantially increases the cost and extends the timelines of drug development.

ADME-T testing evaluates how a drug is absorbed, distributed, metabolized, and excreted, as well as its potential toxicity. These factors are crucial for determining whether a compound can be safely and effectively used as a drug, as they influence its bioavailability, therapeutic efficacy, and safety profile. Poor performance in these areas can lead to a drug candidate's failure in later stages of development. For example, the estimated median cost of drug research and development is 319.3 million USD, but it rises to a median of 1.14 billion USD once the cost of failed trials is included [6]. This underscores the importance of integrating ADME-T testing immediately after target identification and initial candidate selection, so that compounds with poor pharmacokinetic or toxicity profiles are eliminated early in the drug discovery process, before advancing to more resource-intensive stages like lead optimization and clinical trials.

Given these high stakes, the increasing role of data-driven models in drug discovery is reshaping how we approach therapeutic compound identification and development [6], [7], [8], [9], [10]. Techniques such as decision trees (DTs) have been used to classify compounds by potential toxicity or therapeutic benefit [10]. Artificial neural networks (ANNs) are adept at predicting properties like lipophilicity [11] and aqueous solubility [12], [13], [14]. Methods like support vector machines (SVMs), k-nearest neighbors (kNN), and random forests (RFs) have been effectively applied to predict ADME-T properties [15], [16], [17]. These techniques use molecular descriptors to represent compounds and predict how the compounds will behave in biological systems. While these methods have facilitated advancements in property prediction and candidate screening, they are often constrained by their reliance on hand-crafted molecular descriptors and feature engineering [15]. This feature dependency can limit scalability and flexibility, particularly when predicting multiple drug-related properties. In a data-driven context, bypassing feature engineering and predicting diverse properties with a unified model is crucial to streamlining drug discovery workflows.

To address these limitations, our model leverages transformer-based architectures [16], which use self-attention mechanisms to generate molecular embeddings directly from SMILES sequences [17]. Transformers have been widely used in the chemical domain, including for process modeling and control. For example, attention mechanisms have been applied in time-series transformers (TSTs) to capture long- and short-term dependencies in chemical processes and predict the next system state, enabling real-time process optimization and control [18]. They have also been used to enhance system-to-system transferability and optimize batch crystallization [19], [20]. More prominently, transformers have been employed for molecular property prediction, where their ability to capture intricate molecular relationships has proven essential, for example in predicting nonlinear properties of ionic liquids and of polymeric and protein-based systems by generating context-aware chemical embeddings [21], [22], [23]. These embeddings capture intricate molecular features and interactions without needing explicit feature engineering, offering a significant advantage over conventional ML models that depend on predefined molecular descriptors. By their nature, transformers capture long-range dependencies and relationships in molecular sequences, making them more flexible and scalable in multi-task property prediction than traditional methods. The generalizability of transformers allows for seamless application to diverse prediction tasks, from toxicity profiling to absorption and distribution characteristics, through a unified framework that simplifies the prediction pipeline.

Building on this, we have enhanced the transformer architecture by integrating linear attention mechanisms, which improve computational efficiency, particularly when handling long molecular sequences. Additionally, the use of Rotary Positional Embedding (RoPE) allows the model to capture spatial relationships between atoms, further enriching the quality of the embeddings. Our transformer-based methodology addresses a critical challenge in drug discovery by minimizing late-stage failures caused by unfavorable pharmacokinetic and toxicity profiles. By incorporating ADME-T predictions early in the process, the approach ensures that only molecules with optimal properties progress to subsequent stages, significantly reducing the risk of costly failures. The framework's modular design allows for easy adaptation to specific drug discovery needs, enabling the addition or removal of properties and adjustment of thresholds based on therapeutic focus or domain requirements. This adaptability ensures the model's relevance across diverse contexts, supporting efficient evaluation of pharmacological properties.
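To make the two architectural ingredients named above concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the standard RoPE formulation (pairwise rotation with base 10000) and a kernel-style linear attention with the feature map elu(x) + 1; the text does not state which linear attention variant the model uses, so that choice, along with all function names, is an assumption made for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos` (standard RoPE)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even embedding dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # one frequency per coordinate pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    """Assumed positive feature map: elu(x) + 1."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal linear attention: cost grows linearly with sequence length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                         # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out
```

In this sketch the key-value summary `S` and normalizer `z` are computed once and reused for every query, which is where the saving over pairwise attention comes from. The rotation in `rope` makes the query-key dot product depend only on the offset between two token positions in the SMILES string.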

As shown in Fig. 1, during stage 1 screening, the embeddings generated by the transformer model are passed through a feed-forward network to predict drug-like properties. These predictions are based on Lipinski's Rule of 5, along with verification of biological activity against the target. In stage 2 screening, the same transformer-generated embeddings are passed through simple feed-forward networks to predict key ADME-T properties, including solubility, blood-brain barrier (BBB) penetration, enzyme inhibition, and toxicity profiles. The transformer thus generates the embeddings once, and lightweight feed-forward networks carry out each property prediction. Predicting such a wide array of properties is possible because the transformer's embeddings capture intricate molecular features and relationships. Finally, in stage 3 screening, the transformer predicts IC50 values, ensuring that only the most potent and promising candidates proceed to experimental testing. We present a case study on screening molecules for HIV Integrase 1, demonstrating the real-world application of the integrated approach for more effective drug candidate selection.
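The three-stage funnel described above can be sketched as a chain of filters over predicted properties. This is an illustrative outline only: the `Candidate` fields, the single `admet_ok` flag standing in for all stage 2 predictions, the one-violation allowance for Lipinski's rule, and the 100 nM IC50 cutoff are assumptions rather than values taken from the paper, and the numbers in `demo` are made up.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    smiles: str
    mol_weight: float   # Da
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors
    active: bool        # predicted activity against the target
    admet_ok: bool      # hypothetical aggregate of stage 2 ADME-T predictions
    ic50_nm: float      # predicted IC50 in nM

def lipinski_violations(c):
    """Count violated Rule-of-5 criteria."""
    return sum([
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def screen(cands, max_violations=1, ic50_cutoff_nm=100.0):
    """Return the survivors of each stage (thresholds are assumed, not from the paper)."""
    stage1 = [c for c in cands
              if lipinski_violations(c) <= max_violations and c.active]
    stage2 = [c for c in stage1 if c.admet_ok]
    stage3 = [c for c in stage2 if c.ic50_nm <= ic50_cutoff_nm]
    return stage1, stage2, stage3

# Made-up property values, for illustration only.
demo = [
    Candidate("CCO", 46.07, -0.3, 1, 1, True, True, 50.0),
    Candidate("c1ccccc1", 78.11, 2.1, 0, 0, False, True, 10.0),
    Candidate("CCCCCCCCCCCCCCCCCCCC", 700.0, 6.5, 2, 4, True, True, 5.0),
    Candidate("CC(=O)O", 60.05, -0.2, 1, 2, True, False, 20.0),
    Candidate("CCN", 45.08, -0.1, 1, 1, True, True, 5000.0),
]
s1, s2, s3 = screen(demo)
```

Because each stage only sees the survivors of the previous one, the more specific predictions run on progressively fewer molecules, which mirrors the ordering in the text.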

Section snippets

Encoder-based transformer architecture

In our study, we use an encoder-based transformer with 12 layers to process the SMILES representations of chemical compounds. The initial step involves tokenizing the input into discrete elements, typically individual atoms, which are then embedded into a continuous numerical space. This embedding is performed by a matrix whose dimensions correspond to the vocabulary size and the embedding dimension. Each token is transformed into a
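The tokenize-then-embed step in this snippet can be illustrated with a short sketch. The regular expression, the special tokens, the embedding dimension of 8, and the random initialization are all assumptions chosen for illustration; the snippet does not give the authors' tokenizer or dimensions. Note that a practical SMILES vocabulary also contains bond, branch, and ring-closure symbols alongside atoms.

```python
import random
import re

# Illustrative token pattern: bracket atoms first, then two-letter halogens,
# then single-character atoms, ring closures, bonds, and branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+()/\\.@:~*]"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def build_vocab(smiles_list):
    vocab = {"<pad>": 0, "<unk>": 1}   # hypothetical special tokens
    for s in smiles_list:
        for tok in tokenize(s):
            vocab.setdefault(tok, len(vocab))
    return vocab

def make_embedding(vocab_size, d_model, seed=0):
    """Embedding matrix of shape (vocab_size, d_model), randomly initialized."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(smiles, vocab, table):
    """Map each token to its row of the embedding matrix."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(smiles)]
    return [table[i] for i in ids]

corpus = ["CC(=O)Oc1ccccc1C(=O)O", "ClCCBr", "C[NH3+]"]
vocab = build_vocab(corpus)
table = make_embedding(len(vocab), d_model=8)
vectors = embed("ClCCBr", vocab, table)   # one 8-dimensional vector per token
```

In a trained model the rows of `table` would be learned parameters rather than random draws; the lookup itself is the same.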

Regression tasks for drug screening: Lipophilicity, aqueous solubility, VDss, and LD50

The model was trained on several regression tasks crucial to drug screening: lipophilicity, aqueous solubility, volume of distribution at steady state (VDss), and LD50 acute toxicity. Each of these parameters plays a vital role in evaluating potential therapeutic compounds, and we discuss their importance below.

1. Lipophilicity: Lipophilicity measures the affinity of a compound for lipid environments, often quantified by the partition coefficient (log P) between a nonpolar solvent like octanol and

Conclusion

In this paper, we introduced a transformer-based methodology designed to enhance early-stage drug screening by addressing key challenges such as scalability, flexibility, and the limitations of traditional machine learning models that rely on feature engineering. By utilizing attention mechanisms and deep learning architectures, our approach generated molecular embeddings directly from SMILES sequences, enabling the prediction of drug-like properties, ADME-T characteristics, and IC50 values in

CRediT authorship contribution statement

Aahil Khambhawala: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Chi Ho Lee: Writing – review & editing, Writing – original draft, Validation, Supervision, Software, Project administration, Methodology, Investigation, Formal analysis, Conceptualization. Silabrata Pahari: Writing – original draft, Visualization, Validation, Supervision, Software,

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge financial support from the Artie McFerrin Department of Chemical Engineering, and the Texas A&M Energy Institute. Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

References (42)

H. Beck et al. Small molecules and their impact in drug discovery: a perspective on the occasion of the 125th anniversary of the Bayer Chemical Research Laboratory. Drug Discov. Today (2022)
C.A. Lipinski. Drug-like properties and the causes of poor solubility and poor permeability. J. Pharmacol. Toxicol. Methods (2000)
F. Wan et al. DeepCPI: a deep learning-based framework for large-scale in silico drug screening. Genomics Proteomics Bioinformatics (2019)
P. Gao et al. Accurate predictions of drugs aqueous solubility via deep learning tools. J. Mol. Struct. (2022)
N. Sitapure et al. CrystalGPT: Enhancing system-to-system transferability in crystallization prediction and control using time-series-transformers. Comput. Chem. Eng. (2023)
N. Sitapure et al. Exploring the potential of time-series transformers for process modeling and control in chemical systems: an inevitable paradigm shift? Chem. Eng. Res. Des. (2023)
A. Khambhawala et al. Advanced transformer models for structure-property relationship predictions of ionic liquid melting points. Chem. Eng. J. (2025)
H. Gubler et al. Theoretical and experimental relationships between percent inhibition and IC50 data observed in high-throughput screening. SLAS Discovery (2013)
Z. Wu et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. (2018)
L.M. Berezhkovskiy. Volume of distribution at steady state for a linear pharmacokinetic system with peripheral elimination. J. Pharm. Sci. (2004)
B.E. Matter. Problems of testing drugs for potential mutagenicity. Mutation Research/Environmental Mutagenesis and Related Subjects (1976)
K. Mortelmans et al. The Ames Salmonella/microsome mutagenicity assay. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis (2000)
D. Clive. Mutagenicity in drug development: Interpretation and significance of test results. Regul. Toxicol. Pharm. (1985)
J. Drews. Drug discovery: a historical perspective. Science (1979)
J.A. DiMasi. Success rates for new drugs entering clinical testing in the United States. Clin. Pharmacol. Ther. (1995)
J.S. Akhila et al. Acute toxicity studies and determination of median lethal dose. Curr. Sci. (2007)
T. Unterthiner et al. Deep learning as an opportunity in virtual screening
Q. Pu et al. Screen efficiency comparisons of decision tree and neural network algorithms in machine learning assisted drug design. Sci. China Chem. (2019)
M. Bahi et al. Deep Learning for Ligand-Based Virtual Screening in Drug Discovery
M. Galushka et al. Prediction of chemical compounds properties using a deep learning model. Neural Comput. & Applic. (2021)
M.J. Waring. Lipophilicity in drug discovery. Expert Opin. Drug Discov. (2010)

¹: These authors have contributed equally.
