
🚀 Quick Start Guide - Bioactivity Prediction ML

This guide walks you through setting up and running the Bioactivity Prediction ML platform in a few minutes.

📋 Prerequisites

  • Python 3.8+ (recommended: Python 3.10)
  • Git for version control
  • 8 GB+ RAM recommended
  • Internet connection for downloading dependencies

⚡ Quick Installation (5 minutes)

1. Clone the Repository

git clone https://github.com/yourusername/bioactivity-prediction-ml.git
cd bioactivity-prediction-ml

2. Create Virtual Environment

# Using venv (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n bioactivity python=3.10
conda activate bioactivity

3. Install Dependencies

# Install core dependencies
pip install -r requirements.txt

# Optional: Install development dependencies
pip install -r requirements-dev.txt

4. Download Sample Data

python scripts/download_data.py --dataset acetylcholinesterase --size 1000

5. Launch the Web Application

streamlit run app/main.py

The application will open in your browser at http://localhost:8501

🎯 Quick Demo (2 minutes)

Test the Pipeline with Sample Data

  1. Start the Web App:

    streamlit run app/main.py
  2. Navigate to "Data Upload" and click "Load Sample Dataset"

  3. Go to "Molecule Analysis" and explore molecular structures

  4. Try "Prediction" to see bioactivity predictions

  5. Check "Model Performance" for evaluation metrics

🧪 Command Line Usage

Train Models

python scripts/train_models.py \
    --data data/raw/acetylcholinesterase.csv \
    --algorithms random_forest xgboost \
    --optimize \
    --shap-analysis

Download Different Datasets

# Acetylcholinesterase inhibitors
python scripts/download_data.py --dataset acetylcholinesterase --size 1000

# General molecular dataset
python scripts/download_data.py --dataset molecular --size 500

# Try ChEMBL download (requires internet)
python scripts/download_data.py --dataset chembl --target-id CHEMBL220

📊 Jupyter Notebooks

Explore the complete workflow with interactive notebooks:

# Start Jupyter
jupyter lab

# Open the complete workflow notebook
notebooks/01_complete_workflow.ipynb

🐳 Docker Quick Start

Using Docker Compose (Recommended)

# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

# Access application at http://localhost:8501

Using Docker Directly

# Build image
docker build -t bioactivity-app .

# Run container
docker run -p 8501:8501 -v $(pwd)/data:/app/data bioactivity-app

🔧 Basic Configuration

Environment Variables

Create a .env file:

# Optional configurations
LOG_LEVEL=INFO
MODEL_CACHE_DIR=models/cache
DATA_DIR=data
STREAMLIT_SERVER_PORT=8501
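Assuming the application reads these variables through `os.environ` (an assumption about the implementation, not a documented API), a minimal sketch of resolving them with the defaults shown above might look like:

```python
import os

# Hypothetical helper: resolve settings from the environment,
# falling back to the defaults from the .env example above.
def load_settings():
    return {
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
        "model_cache_dir": os.environ.get("MODEL_CACHE_DIR", "models/cache"),
        "data_dir": os.environ.get("DATA_DIR", "data"),
        "port": int(os.environ.get("STREAMLIT_SERVER_PORT", "8501")),
    }

settings = load_settings()
print(settings["port"])  # 8501 unless overridden in the environment
```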

Custom Configuration

# Create custom config
from src.bioactivity.utils.config import Config

config = Config(
    test_size=0.3,
    algorithms=['random_forest', 'xgboost'],
    cv_folds=5
)

📈 Example Workflows

1. Basic Prediction Pipeline

from src.bioactivity.data.loader import BioactivityDataLoader
from src.bioactivity.features.descriptors import MolecularDescriptors
from src.bioactivity.models.training import ModelTrainer

# Load data
loader = BioactivityDataLoader()
df = loader.create_sample_dataset(size=500)

# Extract features
descriptor_calc = MolecularDescriptors()
features = descriptor_calc.calculate_descriptors_batch(df['smiles'].tolist())

# Train model
trainer = ModelTrainer()
X = features.drop('smiles', axis=1).fillna(0)
y = df['bioactivity_label']

X_train, X_test, y_train, y_test = trainer.prepare_data(X, y)
model = trainer.train_random_forest(X_train, y_train)

# Evaluate
metrics = trainer.evaluate_model(model, X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.3f}")

2. SHAP Analysis

from src.bioactivity.interpretation.shap_analysis import SHAPAnalyzer

# Initialize SHAP analyzer
analyzer = SHAPAnalyzer(model, feature_names=X.columns.tolist())

# Calculate SHAP values
shap_values = analyzer.calculate_shap_values(X_test[:50])

# Generate plots
analyzer.plot_summary(shap_values, X_test[:50])
importance_df = analyzer.get_feature_importance(shap_values)

3. Molecular Visualization

from rdkit import Chem
from rdkit.Chem import Draw

# Visualize molecules
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # Caffeine
mol = Chem.MolFromSmiles(smiles)
img = Draw.MolToImage(mol, size=(300, 300))
img.show()

🐛 Troubleshooting

Common Issues

1. RDKit Installation Issues

# Try conda installation
conda install -c conda-forge rdkit

# Or install from PyPI (the legacy rdkit-pypi package is deprecated)
pip install rdkit

2. SHAP Import Errors

pip install shap
# If still failing:
pip install shap --no-build-isolation

3. Streamlit Port Issues

# Use different port
streamlit run app/main.py --server.port 8502

4. Memory Issues with Large Datasets

  • Reduce batch size in training scripts
  • Use smaller fingerprint bit sizes
  • Limit the number of molecules processed
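The batching idea can be sketched in plain Python. `featurize` below is a hypothetical stand-in for whatever per-molecule descriptor calculator you use; the point is that only one batch is held in memory at a time:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_in_batches(smiles_list, featurize, batch_size=256):
    """Apply a featurizer batch by batch instead of all at once."""
    results = []
    for batch in batched(smiles_list, batch_size):
        results.extend(featurize(s) for s in batch)
    return results

# Example with a trivial stand-in featurizer (length of the SMILES string)
features = process_in_batches(["CCO", "c1ccccc1", "CC(=O)O"],
                              featurize=len, batch_size=2)
print(features)  # [3, 8, 7]
```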

Performance Tips

  1. For Large Datasets:

    • Use --batch-size parameter in training
    • Enable multiprocessing: n_jobs=-1
    • Consider feature selection
  2. For Faster Training:

    • Disable hyperparameter optimization initially
    • Use fewer cross-validation folds
    • Start with Random Forest (fastest)
  3. For Better Accuracy:

    • Enable hyperparameter optimization
    • Use ensemble methods
    • Include more diverse molecular descriptors
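As an illustration of the multiprocessing and cross-validation tips, a scikit-learn Random Forest can use all available cores via `n_jobs=-1`. This sketch uses synthetic data rather than the project's own trainer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in practice use your molecular descriptors
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# n_jobs=-1 parallelizes tree building across all available cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# Fewer CV folds (3 instead of 5) trades estimate stability for speed
scores = cross_val_score(clf, X, y, cv=3)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```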

📚 Next Steps

  1. Explore the Web Interface: Try all features in the Streamlit app
  2. Run Jupyter Notebooks: Follow the complete workflow tutorial
  3. Train Custom Models: Use your own datasets
  4. Read Documentation: Check the docs/ directory
  5. Join Community: Contribute to the project on GitHub

🤝 Getting Help

  • Documentation: Check docs/ directory
  • GitHub Issues: Report bugs and request features
  • Discussions: Ask questions in GitHub Discussions
  • Examples: Explore notebooks/ for detailed examples

🎉 You're Ready!

You now have a complete bioactivity prediction platform running. Start by:

  1. Opening the web application: http://localhost:8501
  2. Uploading your own molecular data (CSV with SMILES)
  3. Training models on your data
  4. Making predictions and interpreting results
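A minimal input CSV can be written with the standard library. The column names `smiles` and `bioactivity_label` follow the code examples earlier in this guide; check your own dataset against what the loader actually expects:

```python
import csv

rows = [
    {"smiles": "CCO", "bioactivity_label": 0},  # ethanol
    {"smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "bioactivity_label": 1},  # caffeine
]

with open("my_molecules.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "bioactivity_label"])
    writer.writeheader()
    writer.writerows(rows)
```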

Happy molecular modeling! 🧬