- Overview
- Features
- Architecture
- Installation
- Quick Start
- Usage
- Project Structure
- Results
- Contributing
This system combines deep learning and Retrieval-Augmented Generation (RAG) to predict cancer drug response and provide interpretable, biologically grounded explanations. It is designed for clinical decision support: it helps oncologists understand not just what the model predicts, but why.
- Predict Drug Response: Binary classification (Responder/Non-Responder) for chemotherapy drugs
- Identify Key Genes: Highlight the most important genes driving each prediction
- Generate Biological Explanations: Use RAG to provide context-rich explanations based on biomedical knowledge
- Interactive Web Interface: Streamlit app for real-time predictions and visualizations
Cisplatin | Docetaxel | Paclitaxel | Gemcitabine
- Architecture: Feed-forward neural network optimized for gene expression data
- Input: 1,000 curated gene expression features (prioritizing drug-response genes)
- Output: Binary classification with confidence scores
- Performance: AUC ~0.77-0.79 across the four tested drugs (held-out test set)
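As a rough illustration of how a sigmoid output can be turned into the class label and confidence score listed above, the sketch below assumes a 0.5 decision threshold and defines confidence as the probability of the predicted class. Both choices, and the helper name `to_prediction`, are assumptions made for illustration and are not confirmed details of this project.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def to_prediction(prob_responder: float, threshold: float = 0.5) -> dict:
    """Turn P(responder) into a label plus a confidence score.

    Assumption: confidence is the probability of the predicted class,
    and 0.5 is the decision threshold.
    """
    is_responder = prob_responder >= threshold
    confidence = prob_responder if is_responder else 1.0 - prob_responder
    return {
        "prediction": "Responder" if is_responder else "Non-Responder",
        "confidence": confidence,
    }


result = to_prediction(sigmoid(0.83))
print(f"{result['prediction']} ({result['confidence']:.1%})")
```

Under this definition the confidence never drops below 50%, so values close to 50% indicate a borderline call rather than a strong one.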
- Vector Store: ChromaDB with sentence transformers (all-MiniLM-L6-v2)
- Knowledge Base: Curated drug mechanism documents covering:
- DNA damage response pathways
- Drug resistance mechanisms
- Key biomarkers (BRCA1/2, TP53, ERCC1, etc.)
- LLM: GPT-4o-mini for natural language explanation generation
- Patient Selection: Choose from 760+ cancer cell lines (GDSC2 dataset)
- Real-Time Predictions: Instant results with confidence scores
- Visual Gene Analysis: Bar charts and tables of top contributing genes
- Biological Context: RAG-generated explanations with cited genes
- Export Results: Download predictions as CSV
- Flask-based API for programmatic access
- JSON input/output format
- Easy integration with other systems
┌─────────────────────────────────────────────────────────────┐
│                   Streamlit Web Interface                   │
│        (Patient Selection • Visualization • Export)         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                   DrugResponseTool (Core)                   │
│        • Model Loading • Prediction • Orchestration         │
└─────────────────────────────────────────────────────────────┘
               ↓                               ↓
┌────────────────────────────┐   ┌────────────────────────────┐
│ Deep Learning Module       │   │ RAG Pipeline               │
├────────────────────────────┤   ├────────────────────────────┤
│ • PyTorch Neural Network   │   │ • ChromaDB Retriever       │
│ • 1000 gene features       │   │ • Sentence Transformers    │
│ • Gradient-based           │   │ • GPT-4o-mini Generator    │
│   importance scoring       │   │ • Biological context       │
└────────────────────────────┘   └────────────────────────────┘
               ↓                               ↓
           ┌──────────────────────────────────────┐
           │       Prediction + Explanation       │
           │ • Response class                     │
           │ • Confidence score                   │
           │ • Top genes                          │
           │ • Biological rationale               │
           └──────────────────────────────────────┘
- Python 3.10 or higher
- CUDA-capable GPU (optional, but recommended)
- ~10GB disk space (for data and models)
git clone https://github.com/yourusername/oncology_cds.git
cd oncology_cds
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Download the following files from GDSC:
- Cell_line_RMA_proc_basalExp.txt (Gene expression matrix)
- GDSC2_fitted_dose_response_27Oct23.xlsx (Drug response data)

Place them in the data/raw/ directory.
Create a .env file in the project root:
OPENAI_API_KEY=your_openai_api_key_here

# Train all 4 drug models (takes ~10-15 minutes on GPU)
CUDA_VISIBLE_DEVICES=0 python3 scripts/train_model.py

This will create model files in models/:
- cisplatin_predictor.pt
- docetaxel_predictor.pt
- paclitaxel_predictor.pt
- gemcitabine_predictor.pt
# Create vector store from drug mechanism documents
python3 scripts/build_knowledge_base.py

# Start the interactive web interface
./run_streamlit.sh
# Or manually:
streamlit run streamlit_app.py

Then open your browser to http://localhost:8501
- Launch the app: ./run_streamlit.sh
- Load data: Click "Load Data" in the sidebar
- Select drug: Choose from the dropdown (e.g., Cisplatin)
- Pick patient: Use "Random Patient" or select a specific COSMIC ID
- Run prediction: Click "Run Prediction"
- Explore results: View prediction, confidence, top genes, and biological explanation
See STREAMLIT_GUIDE.md for detailed instructions.
from src.data.loader import GDSCDataLoader
from src.tools.prediction_tool import DrugResponsePredictor
# Load data
loader = GDSCDataLoader("data/raw")
loader.load_expression("Cell_line_RMA_proc_basalExp.txt")
loader.load_drug_response("GDSC2_fitted_dose_response_27Oct23.xlsx")
X, y = loader.get_dataset_for_drug("Cisplatin")
X.columns = X.columns.astype(str)
# Initialize predictor
tool = DrugResponsePredictor(
    models_dir="models",
    vector_store_dir="knowledge_base/vector_store"
)
# Make prediction
sample_patient = X.iloc[[0]]
result = tool.predict_and_explain(
    drug_name="Cisplatin",
    gene_expression=sample_patient
)
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.1%}")
print(f"Top Genes: {', '.join(result['top_genes'][:5])}")
print(f"\nExplanation:\n{result['explanation']}")

See examples/full_pipeline_DL+RAG.py for a complete example.
# Start the Flask API server
python3 app.py

Then make POST requests to http://localhost:5000/api/predict (requires all 1000 gene values).
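A request to the prediction endpoint might be assembled as in the sketch below, which uses only the standard library. The JSON field names `drug_name` and `gene_expression` are assumptions borrowed from the keyword arguments of the Python API; the real schema lives in src/api/routes.py and may differ. The helper names `build_request` and `predict` are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/api/predict"  # endpoint named in this README
N_GENES = 1000  # the API requires all 1000 gene values


def build_request(drug_name: str, gene_values: list[float]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one sample.

    Assumption: the field names "drug_name" and "gene_expression" are
    hypothetical; check src/api/routes.py for the actual schema.
    """
    if len(gene_values) != N_GENES:
        raise ValueError(f"expected {N_GENES} gene values, got {len(gene_values)}")
    payload = {"drug_name": drug_name, "gene_expression": gene_values}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(drug_name: str, gene_values: list[float]) -> dict:
    """Send the request; requires the Flask server (python3 app.py) to be running."""
    with urllib.request.urlopen(build_request(drug_name, gene_values)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict("Cisplatin", values)` would return the parsed JSON response, where `values` is a list of 1000 floats in the gene order the model was trained on.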
oncology_cds/
├── config/
│   └── gene_sets.yaml              # Drug-specific gene sets
├── data/
│   └── raw/                        # GDSC2 data files (not versioned)
├── examples/
│   └── full_pipeline_DL+RAG.py     # Demo script
├── knowledge_base/
│   ├── documents/                  # Drug mechanism documents
│   │   ├── cisplatin.txt
│   │   ├── docetaxel.txt
│   │   ├── paclitaxel.txt
│   │   └── gemcitabine.txt
│   └── vector_store/               # ChromaDB vector store
├── models/                         # Trained model checkpoints (not versioned)
├── scripts/
│   ├── train_model.py              # Model training script
│   └── build_knowledge_base.py     # Vector store builder
├── src/
│   ├── api/
│   │   └── routes.py               # Flask API routes
│   ├── data/
│   │   ├── loader.py               # GDSC2 data loader
│   │   └── preprocessor.py         # Feature selection & scaling
│   ├── models/
│   │   ├── predictor.py            # Neural network architecture
│   │   └── trainer.py              # Training utilities
│   ├── rag/
│   │   ├── ingestion.py            # Document loading & chunking
│   │   ├── retriever.py            # Vector search (ChromaDB)
│   │   └── generator.py            # LLM explanation generation
│   └── tools/
│       └── prediction_tool.py      # Main prediction + RAG integration
├── app.py                          # Flask API entry point
├── streamlit_app.py                # Streamlit web interface
├── run_streamlit.sh                # Streamlit launcher script
├── requirements.txt                # Python dependencies
├── STREAMLIT_GUIDE.md              # Streamlit usage guide
└── README.md                       # This file
| Drug | AUC | Accuracy | Selected Genes |
|---|---|---|---|
| Cisplatin | 0.7696 | 70.4% | 1000 (29 known + 971 variance) |
| Docetaxel | 0.7901 | 72.0% | 1000 (15 known + 985 variance) |
| Paclitaxel | 0.7850 | 71.5% | 1000 (9 known + 991 variance) |
| Gemcitabine | 0.7780 | 70.8% | 1000 (11 known + 989 variance) |
Performance metrics on test set (20% of data)
Drug: Cisplatin
Prediction: Responder
Confidence: 69.7%
Top Genes: HLA-B, HLA-DPA1, IFI30, HLA-DRA, LDHB
Biological Explanation:
The prediction of this patient as a responder to Cisplatin is supported by
the involvement of several key genes. HLA-B and HLA-DPA1 are part of the
immune response and may influence the effectiveness of Cisplatin by
modulating the immune system's ability to recognize and attack cancer cells.
GSTP1 is known for its role in drug metabolism and detoxification, where
variations can affect Cisplatin's cytotoxicity. Additionally, LDHB and VIM
are associated with cellular metabolism and epithelial-mesenchymal transition,
which can impact tumor response to chemotherapy.
- Raw data: 17,737 genes × 1,018 cell lines
- After filtering: 17,419 genes (removed 318 rows with invalid gene names)
- Feature selection: 1,000 genes per drug
- Prioritizes known drug-response genes from curated gene sets
- Fills remaining slots with high-variance genes
- Preprocessing: StandardScaler normalization
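The feature-selection rule above (known drug-response genes first, remaining slots filled by variance) can be sketched in plain Python as follows. The function name `select_features`, the dictionary-based input format, and the toy gene names `GENE_A` to `GENE_C` are assumptions for illustration; the project's own preprocessor may be organized differently.

```python
from statistics import pvariance


def select_features(
    expression: dict[str, list[float]],
    known_genes: list[str],
    n_features: int = 1000,
) -> list[str]:
    """Pick known drug-response genes first, then fill with high-variance genes.

    expression maps gene name -> expression values across cell lines.
    """
    # Keep known genes that are present in the data, in order, without duplicates.
    selected = [g for g in dict.fromkeys(known_genes) if g in expression][:n_features]
    chosen = set(selected)
    # Rank the remaining genes by variance, highest first.
    remaining = sorted(
        (g for g in expression if g not in chosen),
        key=lambda g: pvariance(expression[g]),
        reverse=True,
    )
    return selected + remaining[: n_features - len(selected)]


toy = {
    "ERCC1": [5.0, 5.1, 5.0],   # low variance, but a known gene
    "GENE_A": [1.0, 9.0, 5.0],  # high variance
    "GENE_B": [2.0, 2.5, 3.0],  # moderate variance
    "GENE_C": [4.0, 4.0, 4.0],  # zero variance
}
print(select_features(toy, known_genes=["ERCC1", "BRCA1"], n_features=3))
# ['ERCC1', 'GENE_A', 'GENE_B']
```

Known genes that are absent from the expression matrix (BRCA1 in the toy data) are skipped, which is one way the number of known genes per drug could end up smaller than the curated set.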
DrugResponsePredictor(
    input_dim=1000,
    hidden_dims=[256, 128, 32],
    dropout_rate=0.5
)

- Layers: 1000 → 256 → 128 → 32 → 1
- Activation: ReLU
- Dropout: 0.5 (to reduce overfitting)
- Output: Sigmoid (binary classification)
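To give a sense of the model's size, the arithmetic below counts trainable parameters for the layer widths listed above. It assumes plain fully connected layers with bias terms and no normalization layers, which this README does not confirm; `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Count weights and biases of a fully connected network.

    Assumption: every layer is a dense layer with a bias vector and
    there are no batch-norm or other parameterized layers.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


layers = [1000, 256, 128, 32, 1]
print(count_parameters(layers))  # 293313
```

Under these assumptions the network has roughly 293k parameters, most of them in the first layer, which is large relative to a few hundred cell lines and is consistent with the use of dropout and weight decay.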
- Optimizer: Adam (lr=0.0005)
- Weight Decay: 0.01 (L2 regularization)
- Scheduler: ReduceLROnPlateau (monitors AUC)
- Early Stopping: Patience=15 epochs
- Batch Size: 64
- Max Epochs: 100
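The early-stopping rule in the list above can be sketched without any framework. This version assumes the monitored quantity is validation AUC (higher is better), matching what the scheduler monitors; the class name `EarlyStopping` and the `min_delta` parameter are hypothetical and not taken from the project's trainer.

```python
class EarlyStopping:
    """Stop training when the monitored metric stops improving (higher is better)."""

    def __init__(self, patience: int = 15, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: smallest gain that counts as progress
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's validation metric; return True if training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)  # short patience just for this demo
for epoch, auc in enumerate([0.70, 0.74, 0.76, 0.75, 0.76, 0.755], start=1):
    if stopper.step(auc):
        print(f"stopping at epoch {epoch}, best AUC {stopper.best:.2f}")
        break
```

In a real training loop the checkpoint with the best metric would typically be saved whenever `best` is updated, so that the stopped run can restore it.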
- Document Chunking: 500-character chunks with overlap
- Embedding: all-MiniLM-L6-v2 (384 dimensions)
- Retrieval: Top 3 most relevant chunks per query
- Generation: GPT-4o-mini with temperature=0.3
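The chunking step can be illustrated with a minimal character-based splitter. The 500-character chunk size comes from the configuration above; the overlap of 50 characters and the function name `chunk_text` are assumptions, since the README does not state the overlap size or how ingestion is implemented.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Assumption: overlap=50 is illustrative; the project's value is not documented here.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


document = "Cisplatin forms DNA crosslinks. " * 40  # toy text
chunks = chunk_text(document)
print(len(chunks), [len(c) for c in chunks])
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.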
Issue: ModuleNotFoundError: No module named 'src'
- Solution: Ensure you're running scripts from the project root directory
Issue: FileNotFoundError: data/raw/Cell_line_RMA_proc_basalExp.txt
- Solution: Download GDSC2 data files and place them in data/raw/
Issue: RAG explanations not working
- Solution:
- Check .env has valid OPENAI_API_KEY
- Run python3 scripts/build_knowledge_base.py to build vector store
- Check
Issue: CUDA out of memory
- Solution: Reduce batch size in scripts/train_model.py (line 58)
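Several of the issues above come down to a missing file or setting, so a small preflight check can save time. The sketch below uses the file names mentioned in this README; the function name `preflight` is hypothetical, and the environment check assumes the key is read from the process environment, so a key stored only in .env is seen only after that file has been loaded.

```python
import os
from pathlib import Path

REQUIRED_FILES = [
    "data/raw/Cell_line_RMA_proc_basalExp.txt",
    "data/raw/GDSC2_fitted_dose_response_27Oct23.xlsx",
    "models/cisplatin_predictor.pt",
    "models/docetaxel_predictor.pt",
    "models/paclitaxel_predictor.pt",
    "models/gemcitabine_predictor.pt",
]


def preflight(root: str = ".") -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    base = Path(root)
    for rel in REQUIRED_FILES:
        if not (base / rel).is_file():
            problems.append(f"missing file: {rel}")
    if not (base / "knowledge_base" / "vector_store").is_dir():
        problems.append("missing vector store: run scripts/build_knowledge_base.py")
    # Assumption: the key must be present in the process environment.
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set in the environment")
    return problems


for problem in preflight():
    print(problem)
```

Run it from the project root; no output means every check passed.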
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Run tests (if applicable)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- GDSC: Genomics of Drug Sensitivity in Cancer database
- Wellcome Sanger Institute: For providing public cancer cell line data
- PyTorch: Deep learning framework
- ChromaDB: Vector database for RAG
- Streamlit: Web application framework
For questions or feedback, please open an issue on GitHub or contact pritam@stanford.edu
Built with ❤️ for advancing precision oncology