A production-ready transformer-based molecular generation and property prediction system for drug discovery, featuring reward-guided sampling, scaffold conditioning, and multi-task learning.
This project implements a complete AI-driven drug discovery pipeline that generates novel, drug-like molecules using transformer-based language models and multi-task property prediction. The system combines:
- Generative Models: GRU and Transformer language models trained on SMILES strings
- Property Predictors: Multi-task transformer for pIC50, logP, and QED prediction
- Reward-Guided Search: Active filtering mechanism for quality-focused generation
- Scaffold Conditioning: Targeted exploration of chemical space around seed structures
- Dataset: 15,037 validated drug-like molecules
- Vocabulary Size: 36 chemical tokens
- Model Parameters: ~4.3M (language model), ~3.2M (predictors)
- Generation Quality: 70-85% valid SMILES, 80%+ drug-likeness (QED ≥ 0.5)
- Novelty: 60-70% of generated molecules not in training set
- Multiple sampling strategies (temperature, beam, top-k, nucleus)
- Conditioning modes (scaffold, property-guided, unconditioned)
- RDKit validation, synthesizability scoring, diversity metrics
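Beam search aside (it is deterministic), the stochastic sampling strategies listed above differ only in how they truncate and reshape the next-token distribution. A minimal sketch, in plain Python rather than the notebook's actual implementation (`sample_token` and its parameters are illustrative names):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Draw one token index from raw logits, illustrating temperature,
    top-k truncation, and nucleus (top-p) filtering."""
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidate tokens ranked by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]      # keep only the k most likely tokens
    if top_p is not None:
        kept, cum = [], 0.0
        for i in order:            # smallest prefix whose mass reaches top_p
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    # Renormalise over the survivors and sample.
    mass = sum(probs[i] for i in order)
    r = random.random() * mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```

With `top_k=1` this degenerates to greedy decoding; with no truncation and `temperature=1.0` it is plain ancestral sampling.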
INPUT → GENERATION MODELS (GRU / Transformer) → REWARD-GUIDED SEARCH → PROPERTY PREDICTION → RANKING & FILTERING → OUTPUT
This notebook is hosted on Kaggle, where the required packages are already available in the environment. To run it on Kaggle, open the notebook and click "Run"; no requirements.txt is needed.
If you need to run locally, create a virtual environment and install the minimal packages used in the notebook (example):
python -m venv venv
source venv/bin/activate
pip install torch gradio rdkit pandas numpy scikit-learn matplotlib tqdm

To sample SMILES from the trained language model:

from model import sample_smiles
smiles = sample_smiles(lm_model, max_len=150, temperature=0.8)

- Tokenization, encoding/decoding utilities
- Transformer-based language model and multi-task predictors
- Reward and filtering utilities using RDKit
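A reward used for guided search can be as simple as "zero for invalid SMILES, drug-likeness otherwise". A minimal sketch using RDKit's QED score (the function name and weighting are ours, not the notebook's):

```python
from rdkit import Chem
from rdkit.Chem import QED

def reward(smiles: str) -> float:
    """Score a SMILES string: 0.0 if it fails RDKit parsing,
    otherwise its QED drug-likeness in (0, 1]."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0          # invalid SMILES earns no reward
    return QED.qed(mol)     # quantitative estimate of drug-likeness
```

A real reward would typically blend several terms (QED, predicted pIC50, a synthesizability penalty) rather than QED alone.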
- Mode 1: Generate & Filter (fast)
- Mode 2: Reward-Guided Search (quality-focused)
- Mode 3: Scaffold-Conditioned (targeted)
Transformer LM (causal) and a shared-encoder multi-task predictor.
Hyperparameters and training loop examples are provided in the notebook.
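The causal LM part can be sketched in a few dozen lines of PyTorch. The dimensions below are placeholders chosen for brevity, not the project's ~4.3M-parameter configuration; only the vocabulary size (36) comes from the stats above:

```python
import torch
import torch.nn as nn

class SmilesLM(nn.Module):
    """Minimal causal Transformer language model over SMILES tokens."""
    def __init__(self, vocab_size=36, d_model=128, nhead=4,
                 num_layers=2, max_len=150):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # x: (batch, seq) of token ids
        t = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        # Causal mask: position i may only attend to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.encoder(h, mask=mask)
        return self.head(h)  # next-token logits, shape (batch, seq, vocab)
```

The multi-task predictor reuses the same encoder body and swaps the LM head for small regression heads (pIC50, logP, QED).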
- SMILES validity, QED, novelty, diversity, prediction MAE/RMSE
Top generated molecules, diversity analysis, and temperature sweep summaries are in the notebook.
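Novelty and prediction error, two of the metrics above, reduce to one-liners; a sketch with function names of our choosing:

```python
def novelty(generated, training_smiles):
    """Fraction of generated SMILES not present in the training set."""
    train = set(training_smiles)
    return sum(s not in train for s in generated) / len(generated)

def mae(predicted, actual):
    """Mean absolute error between predicted and measured properties."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

Validity and QED come from RDKit, and diversity is usually pairwise Tanimoto distance over fingerprints; those depend on chemistry tooling rather than arithmetic, so they are omitted here.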
Pipeline configuration dataclass and examples are available in the code.
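As an illustration of the shape such a dataclass might take (field names and defaults here are assumptions, not the project's actual configuration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineConfig:
    """Hypothetical configuration for the generation pipeline."""
    mode: str = "generate_and_filter"  # or "reward_guided", "scaffold"
    n_samples: int = 1000              # molecules to draw before filtering
    temperature: float = 0.8           # sampling temperature
    top_k: Optional[int] = None        # optional top-k truncation
    top_p: Optional[float] = None      # optional nucleus threshold
    min_qed: float = 0.5               # drug-likeness cutoff for filtering
    scaffold: Optional[str] = None     # seed SMILES for scaffold mode
```

A dataclass keeps the three pipeline modes behind one typed object, so a run is fully described by a single `PipelineConfig` instance.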
- Logging, checkpointing, memory management, and export utilities.
- Aram Elheni
- Youssef Jaziri
- Chaima Ben Yedder
- Zied Knani
Made with ❤️ for drug discovery