pritampanda15/Oncology-Drug-Response-Prediction-System_DL-RAG

Oncology Drug Response Prediction System

AI-Powered Drug Response Prediction with Biological Explanations

Python 3.10+ | PyTorch | License: MIT


🎯 Overview

This system combines Deep Learning and Retrieval-Augmented Generation (RAG) to predict cancer drug response and provide interpretable, biologically grounded explanations. It's designed for clinical decision support, helping oncologists understand not just what the prediction is, but why the model made that prediction.

Key Capabilities

  • Predict Drug Response: Binary classification (Responder/Non-Responder) for chemotherapy drugs
  • Identify Key Genes: Highlight the most important genes driving each prediction
  • Generate Biological Explanations: Use RAG to provide context-rich explanations based on biomedical knowledge
  • Interactive Web Interface: Streamlit app for real-time predictions and visualizations

Supported Drugs

✅ Cisplatin | ✅ Docetaxel | ✅ Paclitaxel | ✅ Gemcitabine


✨ Features

🤖 Deep Learning Model

  • Architecture: Feed-forward neural network optimized for gene expression data
  • Input: 1,000 curated gene expression features (prioritizing drug-response genes)
  • Output: Binary classification with confidence scores
  • Performance: AUC ~0.77-0.79 on the held-out test set across the four supported drugs

🧠 RAG-Powered Explanations

  • Vector Store: ChromaDB with sentence transformers (all-MiniLM-L6-v2)
  • Knowledge Base: Curated drug mechanism documents covering:
    • DNA damage response pathways
    • Drug resistance mechanisms
    • Key biomarkers (BRCA1/2, TP53, ERCC1, etc.)
  • LLM: GPT-4o-mini for natural language explanation generation

🖥️ Interactive Streamlit App

  • Patient Selection: Choose from 760+ cancer cell lines (GDSC2 dataset)
  • Real-Time Predictions: Instant results with confidence scores
  • Visual Gene Analysis: Bar charts and tables of top contributing genes
  • Biological Context: RAG-generated explanations with cited genes
  • Export Results: Download predictions as CSV

🔌 REST API

  • Flask-based API for programmatic access
  • JSON input/output format
  • Easy integration with other systems

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Streamlit Web Interface                  β”‚
β”‚         (Patient Selection β€’ Visualization β€’ Export)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   DrugResponseTool (Core)                   β”‚
β”‚         β€’ Model Loading  β€’ Prediction  β€’ Orchestration      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓                           ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   Deep Learning Module    β”‚   β”‚      RAG Pipeline        β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚ β€’ PyTorch Neural Network  β”‚   β”‚ β€’ ChromaDB Retriever     β”‚
    β”‚ β€’ 1000 gene features      β”‚   β”‚ β€’ Sentence Transformers  β”‚
    β”‚ β€’ Gradient-based          β”‚   β”‚ β€’ GPT-4o-mini Generator  β”‚
    β”‚   importance scoring      β”‚   β”‚ β€’ Biological context     β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓                           ↓
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚         Prediction + Explanation     β”‚
              β”‚  β€’ Response class                    β”‚
              β”‚  β€’ Confidence score                  β”‚
              β”‚  β€’ Top genes                         β”‚
              β”‚  β€’ Biological rationale              β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
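
The diagram lists gradient-based importance scoring without naming the exact method. One common variant is gradient × input saliency, and the sketch below assumes that variant. The stand-in network, the gene names, and the helper `top_genes_by_gradient` are hypothetical and are not taken from this repository.

```python
import torch
import torch.nn as nn


def top_genes_by_gradient(model, x, gene_names, k=5):
    """Rank genes by |gradient x input| for one sample (shape: 1 x n_genes)."""
    model.eval()  # disable dropout so the scores are deterministic
    x = x.clone().detach().requires_grad_(True)
    prob = model(x).squeeze()  # scalar response probability
    prob.backward()  # d(prob)/d(x) is stored in x.grad
    scores = (x.grad * x).abs().squeeze(0).detach()
    order = torch.argsort(scores, descending=True)[:k]
    return [(gene_names[i], scores[i].item()) for i in order.tolist()]


# Stand-in network and made-up inputs, for illustration only.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
toy_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical names
toy_sample = torch.randn(1, 4)
ranking = top_genes_by_gradient(toy_model, toy_sample, toy_genes, k=2)
print(ranking)
```

Scores like these describe what the model is sensitive to for a single sample; they are not evidence of a causal role for the gene.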

🚀 Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-capable GPU (optional, but recommended)
  • ~10GB disk space (for data and models)

1. Clone the Repository

git clone https://github.com/pritampanda15/Oncology-Drug-Response-Prediction-System_DL-RAG.git
cd Oncology-Drug-Response-Prediction-System_DL-RAG

2. Create Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Download GDSC2 Data

Download the following files from GDSC:

  • Cell_line_RMA_proc_basalExp.txt (Gene expression matrix)
  • GDSC2_fitted_dose_response_27Oct23.xlsx (Drug response data)

Place them in the data/raw/ directory.

5. Set Up Environment Variables

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key_here

⚡ Quick Start

Train Models

# Train all 4 drug models (takes ~10-15 minutes on GPU)
CUDA_VISIBLE_DEVICES=0 python3 scripts/train_model.py

This will create model files in models/:

  • cisplatin_predictor.pt
  • docetaxel_predictor.pt
  • paclitaxel_predictor.pt
  • gemcitabine_predictor.pt

Build Knowledge Base

# Create vector store from drug mechanism documents
python3 scripts/build_knowledge_base.py

Launch Streamlit App

# Start the interactive web interface
./run_streamlit.sh

# Or manually:
streamlit run streamlit_app.py

Then open your browser to http://localhost:8501


📖 Usage

Option 1: Streamlit Web App (Recommended)

  1. Launch the app: ./run_streamlit.sh
  2. Load data: Click "Load Data" in the sidebar
  3. Select drug: Choose from the dropdown (e.g., Cisplatin)
  4. Pick patient: Use "Random Patient" or select a specific COSMIC ID
  5. Run prediction: Click "Run Prediction"
  6. Explore results: View prediction, confidence, top genes, and biological explanation

See STREAMLIT_GUIDE.md for detailed instructions.

Option 2: Python Script

from src.data.loader import GDSCDataLoader
from src.tools.prediction_tool import DrugResponseTool

# Load expression and drug-response data from data/raw/
loader = GDSCDataLoader("data/raw")
loader.load_expression("Cell_line_RMA_proc_basalExp.txt")
loader.load_drug_response("GDSC2_fitted_dose_response_27Oct23.xlsx")

# Features (X) and response labels (y) for one drug
X, y = loader.get_dataset_for_drug("Cisplatin")
X.columns = X.columns.astype(str)  # make sure gene identifiers are strings

# Initialize the prediction + RAG tool
tool = DrugResponseTool(
    models_dir="models",
    vector_store_dir="knowledge_base/vector_store"
)

# Make prediction
sample_patient = X.iloc[[0]]
result = tool.predict_and_explain(
    drug_name="Cisplatin",
    gene_expression=sample_patient
)

print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.1%}")
print(f"Top Genes: {', '.join(result['top_genes'][:5])}")
print(f"\nExplanation:\n{result['explanation']}")

See examples/full_pipeline_DL+RAG.py for a complete example.

Option 3: REST API

# Start the Flask API server
python3 app.py

Then make POST requests to http://localhost:5000/api/predict (requires all 1000 gene values).
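
A minimal client sketch for the endpoint above, using only the standard library. The request schema is not documented here, so the JSON field names `drug_name` and `gene_expression` are assumptions; check src/api/routes.py for the real ones.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/api/predict"  # endpoint named in this README
N_GENES = 1000  # the API expects a value for every model input gene


def build_payload(drug_name, gene_values):
    """Validate inputs and return the JSON request body as bytes."""
    if len(gene_values) != N_GENES:
        raise ValueError(f"expected {N_GENES} gene values, got {len(gene_values)}")
    body = {
        "drug_name": drug_name,  # hypothetical field name
        "gene_expression": [float(v) for v in gene_values],  # hypothetical field name
    }
    return json.dumps(body).encode("utf-8")


def post_prediction(payload, url=API_URL, timeout=30):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload("Cisplatin", [0.0] * N_GENES)
# With the Flask server running: result = post_prediction(payload)
```

Validating the vector length on the client side gives a clearer error than a failed request when fewer than 1000 values are supplied.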


📁 Project Structure

oncology_cds/
├── config/
│   └── gene_sets.yaml              # Drug-specific gene sets
├── data/
│   └── raw/                        # GDSC2 data files (not versioned)
├── examples/
│   └── full_pipeline_DL+RAG.py     # Demo script
├── knowledge_base/
│   ├── documents/                  # Drug mechanism documents
│   │   ├── cisplatin.txt
│   │   ├── docetaxel.txt
│   │   ├── paclitaxel.txt
│   │   └── gemcitabine.txt
│   └── vector_store/               # ChromaDB vector store
├── models/                         # Trained model checkpoints (not versioned)
├── scripts/
│   ├── train_model.py              # Model training script
│   └── build_knowledge_base.py     # Vector store builder
├── src/
│   ├── api/
│   │   └── routes.py               # Flask API routes
│   ├── data/
│   │   ├── loader.py               # GDSC2 data loader
│   │   └── preprocessor.py         # Feature selection & scaling
│   ├── models/
│   │   ├── predictor.py            # Neural network architecture
│   │   └── trainer.py              # Training utilities
│   ├── rag/
│   │   ├── ingestion.py            # Document loading & chunking
│   │   ├── retriever.py            # Vector search (ChromaDB)
│   │   └── generator.py            # LLM explanation generation
│   └── tools/
│       └── prediction_tool.py      # Main prediction + RAG integration
├── app.py                          # Flask API entry point
├── streamlit_app.py                # Streamlit web interface
├── run_streamlit.sh                # Streamlit launcher script
├── requirements.txt                # Python dependencies
├── STREAMLIT_GUIDE.md              # Streamlit usage guide
└── README.md                       # This file

📊 Results

Model Performance

Drug         AUC     Accuracy  Selected Genes
Cisplatin    0.7696  70.4%     1000 (29 known + 971 variance)
Docetaxel    0.7901  72.0%     1000 (15 known + 985 variance)
Paclitaxel   0.7850  71.5%     1000 (9 known + 991 variance)
Gemcitabine  0.7780  70.8%     1000 (11 known + 989 variance)

Performance metrics on test set (20% of data)
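
For readers who want to recompute these two metrics from saved predictions, here is a dependency-free sketch. The labels and probabilities are made up, and the 0.5 decision threshold is an assumption rather than a documented setting of this project.

```python
def roc_auc(y_true, y_prob):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels (1 = responder) and predicted probabilities.
labels = [1, 0, 1, 0, 1, 0]
probs = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]
print(f"AUC: {roc_auc(labels, probs):.4f}  Accuracy: {accuracy(labels, probs):.1%}")
```

AUC does not depend on the threshold while accuracy does, which is one reason the two columns in the table can rank drugs differently.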

Sample Prediction Output

Drug: Cisplatin
Prediction: Responder
Confidence: 69.7%
Top Genes: HLA-B, HLA-DPA1, IFI30, HLA-DRA, LDHB

Biological Explanation:
The prediction of this patient as a responder to Cisplatin is supported by
the involvement of several key genes. HLA-B and HLA-DPA1 are part of the
immune response and may influence the effectiveness of Cisplatin by
modulating the immune system's ability to recognize and attack cancer cells.
GSTP1 is known for its role in drug metabolism and detoxification, where
variations can affect Cisplatin's cytotoxicity. Additionally, LDHB and VIM
are associated with cellular metabolism and epithelial-mesenchymal transition,
which can impact tumor response to chemotherapy.

🛠️ Technical Details

Data Processing

  • Raw data: 17,737 genes × 1,018 cell lines
  • After filtering: 17,419 genes (removed 318 rows with invalid gene names)
  • Feature selection: 1,000 genes per drug
    • Prioritizes known drug-response genes from curated gene sets
    • Fills remaining slots with high-variance genes
  • Preprocessing: StandardScaler normalization
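
The selection rule above (known genes first, then the highest-variance genes, then z-score scaling) can be sketched without any dependencies. The helper names `select_genes` and `standardize` and the toy numbers are hypothetical; the project's own logic lives in src/data/preprocessor.py and may differ in detail.

```python
from statistics import mean, pstdev, pvariance


def select_genes(expression, known_genes, n_features):
    """Pick known genes first, then fill the remaining slots by descending variance."""
    selected = [g for g in known_genes if g in expression][:n_features]
    chosen = set(selected)
    by_variance = sorted(
        (g for g in expression if g not in chosen),
        key=lambda g: pvariance(expression[g]),
        reverse=True,
    )
    return selected + by_variance[: n_features - len(selected)]


def standardize(values):
    """Z-score one gene across samples; constant genes map to zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


# Toy matrix: gene -> expression across four cell lines (made-up numbers).
toy_expression = {
    "ERCC1": [5.0, 5.1, 5.0, 5.2],
    "GENE_X": [1.0, 9.0, 2.0, 8.0],
    "GENE_Y": [4.0, 4.0, 4.0, 4.0],
    "GENE_Z": [3.0, 6.0, 3.0, 6.0],
}
genes = select_genes(toy_expression, known_genes=["ERCC1", "BRCA1"], n_features=3)
scaled = {g: standardize(toy_expression[g]) for g in genes}
print(genes)
```

Here the low-variance known gene ERCC1 is kept while the constant GENE_Y is dropped, which is the point of prioritizing curated genes. In a train/test setting, the scaling statistics would normally be computed on the training split only.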

Model Architecture

DrugResponsePredictor(
    input_dim=1000,
    hidden_dims=[256, 128, 32],
    dropout_rate=0.5
)
  • Layers: 1000 → 256 → 128 → 32 → 1
  • Activation: ReLU
  • Dropout: 0.5 (prevents overfitting)
  • Output: Sigmoid (binary classification)
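
A PyTorch sketch of a network with these layer sizes. The class name `SketchPredictor` is hypothetical, and the layer ordering (Linear, ReLU, Dropout) is an assumption; the actual definition is in src/models/predictor.py and may include details not listed here.

```python
import torch
import torch.nn as nn


class SketchPredictor(nn.Module):
    """Feed-forward binary classifier: Linear -> ReLU -> Dropout per hidden layer."""

    def __init__(self, input_dim=1000, hidden_dims=(256, 128, 32), dropout_rate=0.5):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout_rate)]
            prev = width
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]  # probability of "Responder"
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


model = SketchPredictor()
model.eval()  # turn dropout off for inference
with torch.no_grad():
    probs = model(torch.randn(4, 1000))  # four made-up samples
print(probs.shape)
```

Calling `model.eval()` matters here: with a dropout rate of 0.5, predictions made in training mode would vary from call to call.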

Training Configuration

  • Optimizer: Adam (lr=0.0005)
  • Weight Decay: 0.01 (L2 regularization)
  • Scheduler: ReduceLROnPlateau (monitors AUC)
  • Early Stopping: Patience=15 epochs
  • Batch Size: 64
  • Max Epochs: 100
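
The early-stopping rule can be illustrated in plain Python. The `EarlyStopping` class and the validation AUC curve below are hypothetical; the project's training loop is in src/models/trainer.py.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumption: any improvement counts
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.stale_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's validation AUC; return True if training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score, self.best_epoch = score, epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Made-up validation AUC curve: improves for 3 epochs, then plateaus.
MAX_EPOCHS = 100
val_auc = [0.70, 0.74, 0.77] + [0.76] * (MAX_EPOCHS - 3)
stopper = EarlyStopping(patience=15)
for epoch in range(MAX_EPOCHS):
    if stopper.step(epoch, val_auc[epoch]):
        break
print(f"stopped at epoch {epoch}, best AUC {stopper.best_score:.2f} at epoch {stopper.best_epoch}")
```

Because the monitored quantity is AUC (higher is better), a PyTorch `ReduceLROnPlateau` scheduler tracking the same value would need `mode="max"` rather than its default `mode="min"`.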

RAG Pipeline

  1. Document Chunking: 500-character chunks with overlap
  2. Embedding: all-MiniLM-L6-v2 (384 dimensions)
  3. Retrieval: Top 3 most relevant chunks per query
  4. Generation: GPT-4o-mini with temperature=0.3
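
Steps 1 and 3 can be sketched with the standard library alone. The 50-character overlap, the cosine similarity metric, and the 2-D stand-in vectors are assumptions for illustration; the overlap size and the distance metric configured in ChromaDB are not stated in this README.

```python
import math


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


document = "Cisplatin forms DNA crosslinks. " * 40  # made-up text, 1280 characters
chunks = chunk_text(document)
# Stand-in 2-D embeddings; the real pipeline uses 384-D all-MiniLM-L6-v2 vectors.
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
best = top_k([1.0, 0.1], toy_vectors, k=3)
print(len(chunks), best)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.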

πŸ› Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'src'

  • Solution: Ensure you're running scripts from the project root directory

Issue: FileNotFoundError: data/raw/Cell_line_RMA_proc_basalExp.txt

  • Solution: Download GDSC2 data files and place them in data/raw/

Issue: RAG explanations not working

  • Solution:
    1. Check .env has valid OPENAI_API_KEY
    2. Run python3 scripts/build_knowledge_base.py to build vector store
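
Both checks can be scripted. The sketch below is a hypothetical helper, not part of the repository: it only confirms that a key is present and that the vector store directory is non-empty, and it cannot tell whether the key is actually valid. The .env parsing is a simplified stand-in for whatever loader the project uses.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_setup(env_path=".env", store_dir="knowledge_base/vector_store"):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(env_path).get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key_here":
        problems.append("OPENAI_API_KEY is missing or still the placeholder")
    store = Path(store_dir)
    if not store.is_dir() or not any(store.iterdir()):
        problems.append(f"vector store at {store_dir} is missing or empty")
    return problems


for problem in check_setup():
    print("PROBLEM:", problem)
```

Run it from the project root so the relative paths resolve.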

Issue: CUDA out of memory

  • Solution: Reduce batch size in scripts/train_model.py (line 58)

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (if applicable)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • GDSC: Genomics of Drug Sensitivity in Cancer database
  • Wellcome Sanger Institute: For providing public cancer cell line data
  • PyTorch: Deep learning framework
  • ChromaDB: Vector database for RAG
  • Streamlit: Web application framework

📧 Contact

For questions or feedback, please open an issue on GitHub or contact pritam@stanford.edu


Built with ❤️ for advancing precision oncology

Report Bug • Request Feature
