MolE: A Foundation Model for Molecular Graphs Using Disentangled Attention
Molecular property prediction has long been a challenging problem in the chemical sciences. Traditional representations often fail to capture the full structure of molecular graphs, which limits how accurately key properties can be predicted. To address these limitations, a new Transformer-based model called MolE has been introduced. MolE, short for Molecular Embeddings, uses disentangled attention to learn representations of molecular graphs and has shown significant improvements on downstream predictive tasks. In this post, we look at what makes MolE unique and how it can impact molecular property prediction.
The Challenges of Molecular Representation
Molecules have long been represented in various ways for machine learning applications, from physicochemical properties to molecular fingerprints such as MACCS keys or Extended Connectivity Fingerprints (ECFPs). These methods were effective in early Quantitative Structure-Activity Relationship (QSAR) studies. However, because these fixed descriptors do not preserve the complete molecular graph topology, they place a ceiling on how accurate predictions can be.
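For context, here is a minimal RDKit sketch of how these fixed fingerprints are typically computed. RDKit and the example molecule are our own choices for illustration; they are not part of the MolE work.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

# Example molecule: aspirin, written as a SMILES string.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# MACCS keys: 166 predefined substructure keys (stored as a 167-bit vector in RDKit).
maccs = MACCSkeys.GenMACCSKeys(mol)

# ECFP4: circular atom environments up to radius 2, hashed and folded into 2048 bits.
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

print(maccs.GetNumOnBits(), ecfp4.GetNumOnBits())
```

Both vectors are fixed-length summaries: useful for similarity search and classical QSAR, but they discard the explicit graph structure that MolE is designed to keep.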
Inspired by recent advances in natural language modeling, researchers began representing molecular structures as SMILES strings, a text format that allows deep learning models such as Recurrent Neural Networks (RNNs) and Transformers to be applied directly. However, these approaches have their own limitations: in particular, a single molecule can be written as many different SMILES strings, so the representation is not unique. To overcome these challenges, researchers turned to molecular graph representations, which support more expressive learning with Graph Neural Networks (GNNs).
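The non-uniqueness of SMILES is easy to see in practice: randomizing the atom ordering produces different, equally valid strings for the same molecule. A short illustrative RDKit snippet (ours, not from the paper):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Randomized atom ordering yields different but equally valid SMILES strings.
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(10)}
print(len(variants), "distinct SMILES strings for the same molecule")
```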
Introducing MolE: Molecular Embeddings with Disentangled Attention
MolE represents a significant advancement in modeling molecular graphs using the Transformer architecture with a disentangled attention mechanism. Developed by Recursion, MolE employs a two-step pretraining strategy to learn molecular embeddings from a dataset of over 842 million molecules. The disentangled attention mechanism is adapted from DeBERTa to effectively account for the relative positioning of atoms within a molecular graph.
Key Features of MolE
- Disentangled Attention Mechanism: MolE leverages disentangled self-attention to integrate both token and positional information, enabling the model to learn molecular relationships in a more nuanced way. Unlike standard attention, disentangled attention treats token content and relative position separately, which makes MolE invariant to the order of atoms in the graph (see the sketch after this list).
- Self-Supervised Pretraining: The first stage of MolE's pretraining is a self-supervised approach inspired by BERT-style masked token modeling. Atoms in the molecular graph are randomly masked, and the task is to predict each masked atom's environment, that is, all neighboring atoms within a specified radius.
- Supervised Multi-Task Learning: After the initial pretraining, MolE undergoes supervised pretraining on a labeled dataset of around 456,000 molecules. This combination of self-supervised and supervised learning allows MolE to excel at both local and global feature representation, making it highly suitable for downstream tasks like ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) predictions.
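To make the first point more concrete, below is a minimal, single-head sketch of what a DeBERTa-style disentangled attention layer over a molecular graph might look like, with relative positions given by pairwise topological distances. The class name, dimensions, and distance clipping are illustrative assumptions, not MolE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledGraphAttention(nn.Module):
    """Single-head sketch of DeBERTa-style disentangled attention where the
    relative position of two atoms is their topological (shortest-path) distance."""

    def __init__(self, dim, max_dist=8):
        super().__init__()
        self.q_c = nn.Linear(dim, dim)            # content query
        self.k_c = nn.Linear(dim, dim)            # content key
        self.v_c = nn.Linear(dim, dim)            # content value
        self.q_r = nn.Linear(dim, dim)            # projection for position-to-content
        self.k_r = nn.Linear(dim, dim)            # projection for content-to-position
        self.rel_emb = nn.Embedding(max_dist + 1, dim)  # one embedding per clipped distance
        self.max_dist = max_dist
        self.scale = (3 * dim) ** 0.5             # sqrt(3d), matching the three-term score

    def forward(self, x, dist):
        # x:    (n_atoms, dim) atom-token embeddings
        # dist: (n_atoms, n_atoms) integer topological distance matrix
        r = self.rel_emb(dist.clamp(max=self.max_dist))    # (n, n, dim)

        qc, kc, vc = self.q_c(x), self.k_c(x), self.v_c(x)

        c2c = qc @ kc.t()                                  # content-to-content
        c2p = torch.einsum("id,ijd->ij", qc, self.k_r(r))  # content-to-position
        p2c = torch.einsum("jd,ijd->ij", kc, self.q_r(r))  # position-to-content

        attn = F.softmax((c2c + c2p + p2c) / self.scale, dim=-1)
        return attn @ vc
```

Because the positional terms depend only on graph distances, the attention score is unchanged when atoms are renumbered, which is what makes the model invariant to atom order.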
Performance on ADMET Tasks
MolE was evaluated on 22 ADMET tasks from the Therapeutics Data Commons (TDC), comprising both regression and binary classification tasks. The results were strong: MolE achieved state-of-the-art performance on 10 of the 22 tasks, including important properties such as CYP inhibition and half-life prediction. The model performed particularly well on tasks with large training sets, but it also showed promise on smaller datasets such as DILI prediction, highlighting its adaptability.
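For readers who want to reproduce this kind of evaluation, the ADMET benchmark group can be loaded with the `tdc` Python package. The snippet below is a usage sketch based on TDC's documented interface rather than code from the paper, and the task shown (DILI) is just one of the 22 benchmarks.

```python
from tdc.benchmark_group import admet_group

# Download and cache the ADMET benchmark group locally.
group = admet_group(path="data/")

# DILI (drug-induced liver injury) is one of the smaller binary classification tasks.
benchmark = group.get("DILI")
train_val, test = benchmark["train_val"], benchmark["test"]

# TDC provides standardized train/validation splits so results are comparable.
train, valid = group.get_train_valid_split(
    benchmark=benchmark["name"], split_type="default", seed=1
)
print(len(train), len(valid), len(test))
```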
Impact of Pretraining Strategies
MolE's impressive results are largely due to its pretraining strategies. The model uses a combination of node-level and graph-level pretraining, which allows it to capture both atomic environments and global molecular properties. Ablation studies conducted during the research showed that the inclusion of disentangled attention significantly improved performance, while models trained without pretraining or with only supervised pretraining fell short.
A Deep Dive into MolE's Architecture
MolE's architecture is rooted in the Transformer, which is best known for modeling sequences but has increasingly been adapted to graphs. Using the disentangled self-attention mechanism adapted from DeBERTa, MolE takes atom identifiers and graph connectivity as inputs. Each atom identifier is a hashed value derived from atomic properties such as atomic mass, valence, and ring membership, while graph connectivity is encoded as a topological distance matrix that records the shortest path between every pair of atoms.
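A rough illustration of these two inputs using RDKit follows; the hashing of atomic properties here is a simplified stand-in for MolE's actual atom-environment identifiers.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Simplified atom identifiers: hash a tuple of per-atom properties.
# (A stand-in for MolE's identifiers, which are derived from atom environments.)
atom_ids = [
    hash((atom.GetAtomicNum(), atom.GetMass(),
          atom.GetTotalValence(), atom.IsInRing())) % 2**32
    for atom in mol.GetAtoms()
]

# Graph connectivity as a topological (shortest-path) distance matrix,
# which supplies the relative positions used by the disentangled attention.
dist = Chem.GetDistanceMatrix(mol).astype(int)

print(len(atom_ids), dist.shape)  # one identifier per atom, (n_atoms, n_atoms) matrix
```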
Training with Diverse Datasets
MolE was pretrained on data from ZINC20 and ExCAPE-DB, amounting to over 842 million molecular graphs. Learning from such an extensive dataset improves its generalizability and helps it handle diverse chemical structures. Further analysis showed that pretraining on larger and more diverse datasets improved the model's performance across numerous tasks.
Real-World Implications and Future Work
The development of MolE holds significant implications for the field of drug discovery and chemical sciences. Its ability to accurately predict molecular properties makes it a valuable tool for identifying promising compounds early in the research pipeline, potentially speeding up the drug development process. The researchers noted that MolE’s embeddings could also be useful for similarity searches and other predictive tasks, positioning it as a foundational model in chemical informatics.
Future directions include expanding the diversity of the pretraining data, exploring the model's application to non-drug-like chemical space, and investigating MolE's embeddings further for similarity analysis and more specialized chemical predictions.
Conclusion
MolE showcases the power of Transformer-based architectures in learning meaningful molecular representations. By utilizing disentangled attention and pretraining strategies that span both self-supervised and supervised learning, MolE has set a new standard for molecular property prediction. The potential applications in drug discovery are vast, and as the model continues to evolve, it is likely to become an indispensable tool for chemists and researchers worldwide.