This guide will help you get the Bioactivity Prediction ML platform up and running quickly.
## Prerequisites

- Python 3.8+ (recommended: Python 3.10)
- Git for version control
- 8GB+ RAM for comfortable usage
- Internet connection for downloading dependencies
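If you want to confirm your interpreter meets the version requirement before installing anything, a minimal stdlib-only check looks like this (`python_ok` is just an illustrative helper, not part of the platform):

```python
import sys

# The platform targets Python 3.8+ (3.10 recommended)
REQUIRED = (3, 8)

def python_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True if the running interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= required

if __name__ == "__main__":
    if python_ok():
        print(f"Python {sys.version_info.major}.{sys.version_info.minor}: OK")
    else:
        print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ required")
```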
## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/bioactivity-prediction-ml.git
cd bioactivity-prediction-ml
```

Create and activate a virtual environment:

```bash
# Using venv (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n bioactivity python=3.10
conda activate bioactivity
```

Install the dependencies:

```bash
# Install core dependencies
pip install -r requirements.txt

# Optional: Install development dependencies
pip install -r requirements-dev.txt
```

Download a sample dataset and launch the app:

```bash
python scripts/download_data.py --dataset acetylcholinesterase --size 1000
streamlit run app/main.py
```

The application will open in your browser at http://localhost:8501.
## Quick Tour

1. Start the web app:

   ```bash
   streamlit run app/main.py
   ```

2. Navigate to "Data Upload" and click "Load Sample Dataset".
3. Go to "Molecule Analysis" and explore molecular structures.
4. Try "Prediction" to see bioactivity predictions.
5. Check "Model Performance" for evaluation metrics.
## Command-Line Training

```bash
python scripts/train_models.py \
    --data data/raw/acetylcholinesterase.csv \
    --algorithms random_forest xgboost \
    --optimize \
    --shap-analysis
```

## Downloading Datasets

```bash
# Acetylcholinesterase inhibitors
python scripts/download_data.py --dataset acetylcholinesterase --size 1000

# General molecular dataset
python scripts/download_data.py --dataset molecular --size 500

# Try ChEMBL download (requires internet)
python scripts/download_data.py --dataset chembl --target-id CHEMBL220
```

## Jupyter Notebooks

Explore the complete workflow with interactive notebooks:

```bash
# Start Jupyter
jupyter lab
```

Then open `notebooks/01_complete_workflow.ipynb`.

## Docker

```bash
# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

# Access application at http://localhost:8501
```

To build and run the image manually:

```bash
# Build image
docker build -t bioactivity-app .

# Run container
docker run -p 8501:8501 -v $(pwd)/data:/app/data bioactivity-app
```

## Configuration

Create a `.env` file:
```bash
# Optional configurations
LOG_LEVEL=INFO
MODEL_CACHE_DIR=models/cache
DATA_DIR=data
STREAMLIT_SERVER_PORT=8501
```

Or build a custom configuration in Python:

```python
# Create custom config
from src.bioactivity.utils.config import Config

config = Config(
    test_size=0.3,
    algorithms=['random_forest', 'xgboost'],
    cv_folds=5
)
```

## Python API

Train and evaluate a model programmatically:

```python
from src.bioactivity.data.loader import BioactivityDataLoader
from src.bioactivity.features.descriptors import MolecularDescriptors
from src.bioactivity.models.training import ModelTrainer

# Load data
loader = BioactivityDataLoader()
df = loader.create_sample_dataset(size=500)

# Extract features
descriptor_calc = MolecularDescriptors()
features = descriptor_calc.calculate_descriptors_batch(df['smiles'].tolist())

# Train model
trainer = ModelTrainer()
X = features.drop('smiles', axis=1).fillna(0)
y = df['bioactivity_label']
X_train, X_test, y_train, y_test = trainer.prepare_data(X, y)
model = trainer.train_random_forest(X_train, y_train)

# Evaluate
metrics = trainer.evaluate_model(model, X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.3f}")
```

## Model Interpretation with SHAP

```python
from src.bioactivity.interpretation.shap_analysis import SHAPAnalyzer

# Initialize SHAP analyzer
analyzer = SHAPAnalyzer(model, feature_names=X.columns.tolist())

# Calculate SHAP values
shap_values = analyzer.calculate_shap_values(X_test[:50])

# Generate plots
analyzer.plot_summary(shap_values, X_test[:50])
importance_df = analyzer.get_feature_importance(shap_values)
```

## Molecule Visualization

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Visualize molecules
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # Caffeine
mol = Chem.MolFromSmiles(smiles)
img = Draw.MolToImage(mol, size=(300, 300))
img.show()
```

## Troubleshooting

1. RDKit installation issues:

   ```bash
   # Try conda installation
   conda install -c conda-forge rdkit

   # Or use rdkit-pypi
   pip install rdkit-pypi
   ```

2. SHAP import errors:

   ```bash
   pip install shap

   # If still failing:
   pip install shap --no-build-isolation
   ```

3. Streamlit port conflicts:

   ```bash
   # Use a different port
   streamlit run app/main.py --server.port 8502
   ```

4. Memory issues with large datasets:

   - Reduce batch size in training scripts
   - Use smaller fingerprint bit sizes
   - Limit the number of molecules processed
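The batching advice above can be sketched in plain Python. `process_in_batches` and the trivial featurizer below are hypothetical illustrations, not part of the platform's API; the point is that only one batch of molecules is materialized at a time:

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size chunks of the input list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_in_batches(smiles: List[str],
                       featurize: Callable[[List[str]], list],
                       batch_size: int = 256) -> list:
    """Featurize a SMILES list batch by batch instead of all at once."""
    results = []
    for batch in batched(smiles, batch_size):
        results.extend(featurize(batch))
    return results

# Example with a stand-in featurizer (string length as the "feature")
features = process_in_batches(["CCO", "c1ccccc1", "CC(=O)O"],
                              lambda b: [len(s) for s in b],
                              batch_size=2)
print(features)  # [3, 8, 7]
```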
## Performance Tips

- **For large datasets:**
  - Use the `--batch-size` parameter in training
  - Enable multiprocessing: `n_jobs=-1`
  - Consider feature selection
- **For faster training:**
  - Disable hyperparameter optimization initially
  - Use fewer cross-validation folds
  - Start with Random Forest (fastest)
- **For better accuracy:**
  - Enable hyperparameter optimization
  - Use ensemble methods
  - Include more diverse molecular descriptors
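"Feature selection" can be as simple as dropping near-constant descriptor columns before training. Here is a stdlib-only sketch of a variance threshold; this is an illustrative technique, not a built-in mechanism of the platform:

```python
from statistics import pvariance

def select_features(rows, names, min_variance=1e-6):
    """Keep only feature columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names

# Three molecules x three descriptors; the first column never varies
rows = [[1.0, 0.5, 7.2],
        [1.0, 0.9, 3.1],
        [1.0, 0.1, 5.5]]
kept_rows, kept_names = select_features(rows, ["constant", "logp", "weight"])
print(kept_names)  # ['logp', 'weight']
```

The same idea scales to descriptor DataFrames via scikit-learn's `VarianceThreshold` if it is in your environment.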
## Next Steps

- **Explore the Web Interface:** Try all features in the Streamlit app
- **Run Jupyter Notebooks:** Follow the complete workflow tutorial
- **Train Custom Models:** Use your own datasets
- **Read Documentation:** Check the `docs/` directory
- **Join Community:** Contribute to the project on GitHub

## Getting Help

- **Documentation:** Check the `docs/` directory
- **GitHub Issues:** Report bugs and request features
- **Discussions:** Ask questions in GitHub Discussions
- **Examples:** Explore `notebooks/` for detailed examples
You now have a complete bioactivity prediction platform running. Start by:
- Opening the web application: http://localhost:8501
- Uploading your own molecular data (CSV with SMILES)
- Training models on your data
- Making predictions and interpreting results
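The upload step expects a CSV with a SMILES column. A minimal sketch of producing one with the standard library (the column names here mirror the sample dataset above but are assumptions; match whatever your data actually uses):

```python
import csv

# Hypothetical example rows: a SMILES string plus an activity label
rows = [
    {"smiles": "CCO", "bioactivity_label": "inactive"},
    {"smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "bioactivity_label": "active"},
]

# Write a CSV ready for the "Data Upload" page
with open("my_molecules.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "bioactivity_label"])
    writer.writeheader()
    writer.writerows(rows)
```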
Happy molecular modeling! 🧬