This guide will help you get the Bioactivity Prediction ML platform up and running quickly.
## Prerequisites

- Python 3.8+ (recommended: Python 3.10)
- Git for version control
- 8GB+ RAM for comfortable usage
- Internet connection for downloading dependencies
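If you want to confirm your interpreter meets the version requirement before installing anything, a minimal stdlib-only check looks like this (`python_ok` is just an illustrative helper, not part of the platform):

```python
import sys

# The platform targets Python 3.8+ (3.10 recommended)
REQUIRED = (3, 8)

def python_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True if the running interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= required

if __name__ == "__main__":
    if python_ok():
        print(f"Python {sys.version_info.major}.{sys.version_info.minor}: OK")
    else:
        print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ required")
```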
## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/bioactivity-prediction-ml.git
cd bioactivity-prediction-ml
```

Create and activate a virtual environment:

```bash
# Using venv (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n bioactivity python=3.10
conda activate bioactivity
```

Install the dependencies:

```bash
# Install core dependencies
pip install -r requirements.txt

# Optional: Install development dependencies
pip install -r requirements-dev.txt
```

Download a sample dataset and launch the app:

```bash
python scripts/download_data.py --dataset acetylcholinesterase --size 1000
streamlit run app/main.py
```

The application will open in your browser at http://localhost:8501.
## Quick Tour

1. Start the web app:

   ```bash
   streamlit run app/main.py
   ```

2. Navigate to "Data Upload" and click "Load Sample Dataset".
3. Go to "Molecule Analysis" and explore molecular structures.
4. Try "Prediction" to see bioactivity predictions.
5. Check "Model Performance" for evaluation metrics.
## Command-Line Training

```bash
python scripts/train_models.py \
    --data data/raw/acetylcholinesterase.csv \
    --algorithms random_forest xgboost \
    --optimize \
    --shap-analysis
```

## Downloading Datasets

```bash
# Acetylcholinesterase inhibitors
python scripts/download_data.py --dataset acetylcholinesterase --size 1000

# General molecular dataset
python scripts/download_data.py --dataset molecular --size 500

# Try ChEMBL download (requires internet)
python scripts/download_data.py --dataset chembl --target-id CHEMBL220
```

## Jupyter Notebooks

Explore the complete workflow with interactive notebooks:

```bash
# Start Jupyter
jupyter lab
```

Then open `notebooks/01_complete_workflow.ipynb`.

## Docker

```bash
# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

# Access application at http://localhost:8501
```

To build and run the image manually:

```bash
# Build image
docker build -t bioactivity-app .

# Run container
docker run -p 8501:8501 -v $(pwd)/data:/app/data bioactivity-app
```

## Configuration

Create a `.env` file:
```bash
# Optional configurations
LOG_LEVEL=INFO
MODEL_CACHE_DIR=models/cache
DATA_DIR=data
STREAMLIT_SERVER_PORT=8501
```

Or build a custom configuration in Python:

```python
# Create custom config
from src.bioactivity.utils.config import Config

config = Config(
    test_size=0.3,
    algorithms=['random_forest', 'xgboost'],
    cv_folds=5
)
```

## Python API

Train and evaluate a model programmatically:

```python
from src.bioactivity.data.loader import BioactivityDataLoader
from src.bioactivity.features.descriptors import MolecularDescriptors
from src.bioactivity.models.training import ModelTrainer

# Load data
loader = BioactivityDataLoader()
df = loader.create_sample_dataset(size=500)

# Extract features
descriptor_calc = MolecularDescriptors()
features = descriptor_calc.calculate_descriptors_batch(df['smiles'].tolist())

# Train model
trainer = ModelTrainer()
X = features.drop('smiles', axis=1).fillna(0)
y = df['bioactivity_label']
X_train, X_test, y_train, y_test = trainer.prepare_data(X, y)
model = trainer.train_random_forest(X_train, y_train)

# Evaluate
metrics = trainer.evaluate_model(model, X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.3f}")
```

## Model Interpretation with SHAP

```python
from src.bioactivity.interpretation.shap_analysis import SHAPAnalyzer

# Initialize SHAP analyzer
analyzer = SHAPAnalyzer(model, feature_names=X.columns.tolist())

# Calculate SHAP values
shap_values = analyzer.calculate_shap_values(X_test[:50])

# Generate plots
analyzer.plot_summary(shap_values, X_test[:50])
importance_df = analyzer.get_feature_importance(shap_values)
```

## Molecule Visualization

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Visualize molecules
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # Caffeine
mol = Chem.MolFromSmiles(smiles)
img = Draw.MolToImage(mol, size=(300, 300))
img.show()
```

## Troubleshooting

1. RDKit installation issues:

   ```bash
   # Try conda installation
   conda install -c conda-forge rdkit

   # Or use rdkit-pypi
   pip install rdkit-pypi
   ```

2. SHAP import errors:

   ```bash
   pip install shap

   # If still failing:
   pip install shap --no-build-isolation
   ```

3. Streamlit port conflicts:

   ```bash
   # Use a different port
   streamlit run app/main.py --server.port 8502
   ```

4. Memory issues with large datasets:

   - Reduce batch size in training scripts
   - Use smaller fingerprint bit sizes
   - Limit the number of molecules processed
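The batching advice above can be sketched in plain Python. `process_in_batches` and the trivial featurizer below are hypothetical illustrations, not part of the platform's API; the point is that only one batch of molecules is materialized at a time:

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size chunks of the input list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_in_batches(smiles: List[str],
                       featurize: Callable[[List[str]], list],
                       batch_size: int = 256) -> list:
    """Featurize a SMILES list batch by batch instead of all at once."""
    results = []
    for batch in batched(smiles, batch_size):
        results.extend(featurize(batch))
    return results

# Example with a stand-in featurizer (string length as the "feature")
features = process_in_batches(["CCO", "c1ccccc1", "CC(=O)O"],
                              lambda b: [len(s) for s in b],
                              batch_size=2)
print(features)  # [3, 8, 7]
```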
## Performance Tips

- **For large datasets:**
  - Use the `--batch-size` parameter in training
  - Enable multiprocessing: `n_jobs=-1`
  - Consider feature selection
- **For faster training:**
  - Disable hyperparameter optimization initially
  - Use fewer cross-validation folds
  - Start with Random Forest (fastest)
- **For better accuracy:**
  - Enable hyperparameter optimization
  - Use ensemble methods
  - Include more diverse molecular descriptors
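"Feature selection" can be as simple as dropping near-constant descriptor columns before training. Here is a stdlib-only sketch of a variance threshold; this is an illustrative technique, not a built-in mechanism of the platform:

```python
from statistics import pvariance

def select_features(rows, names, min_variance=1e-6):
    """Keep only feature columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names

# Three molecules x three descriptors; the first column never varies
rows = [[1.0, 0.5, 7.2],
        [1.0, 0.9, 3.1],
        [1.0, 0.1, 5.5]]
kept_rows, kept_names = select_features(rows, ["constant", "logp", "weight"])
print(kept_names)  # ['logp', 'weight']
```

The same idea scales to descriptor DataFrames via scikit-learn's `VarianceThreshold` if it is in your environment.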
## Next Steps

- **Explore the Web Interface:** Try all features in the Streamlit app
- **Run Jupyter Notebooks:** Follow the complete workflow tutorial
- **Train Custom Models:** Use your own datasets
- **Read Documentation:** Check the `docs/` directory
- **Join Community:** Contribute to the project on GitHub

## Getting Help

- **Documentation:** Check the `docs/` directory
- **GitHub Issues:** Report bugs and request features
- **Discussions:** Ask questions in GitHub Discussions
- **Examples:** Explore `notebooks/` for detailed examples
You now have a complete bioactivity prediction platform running. Start by:
- Opening the web application: http://localhost:8501
- Uploading your own molecular data (CSV with SMILES)
- Training models on your data
- Making predictions and interpreting results
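The upload step expects a CSV with a SMILES column. A minimal sketch of producing one with the standard library (the column names here mirror the sample dataset above but are assumptions; match whatever your data actually uses):

```python
import csv

# Hypothetical example rows: a SMILES string plus an activity label
rows = [
    {"smiles": "CCO", "bioactivity_label": "inactive"},
    {"smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "bioactivity_label": "active"},
]

# Write a CSV ready for the "Data Upload" page
with open("my_molecules.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "bioactivity_label"])
    writer.writeheader()
    writer.writerows(rows)
```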
Happy molecular modeling! 🧬