genpat-it
diff --git a/.github/workflows/docker-simple.yml b/.github/workflows/docker-simple.yml
Lines changed: 45 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 34 additions & 0 deletions
diff --git a/Dockerfile b/Dockerfile
Lines changed: 34 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,45 @@
+name: Build and Push Docker Image
+
+on:
+  push:
+    branches: [ main ]
+  workflow_dispatch:
+
+jobs:
+  docker:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Log in to GitHub Container Registry
+        uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.repository_owner }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ghcr.io/${{ github.repository }}
+          tags: |
+            type=ref,event=branch
+            type=sha,prefix={{branch}}-
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build and push
+        id: build
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          platforms: linux/amd64
+          push: true
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+
+      - name: Image digest
+        run: echo ${{ steps.build.outputs.digest }}
@@ -0,0 +1,34 @@
+# Old files and backups
+old/
+
+# Jupyter notebooks
+*.ipynb
+.ipynb_checkpoints/
+
+# Large data files
+*.csv
+!sample_data.csv
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+venv/
+ENV/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Docker
+*.log
@@ -0,0 +1,34 @@
+FROM python:3.10-slim
+
+# Set working directory
+WORKDIR /app
+
+# Install system dependencies required for some Python libraries
+RUN apt-get update && apt-get install -y \
+    gcc \
+    g++ \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy requirements file
+COPY requirements.txt .
+
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY campyml_model.py .
+
+# Copy models
+COPY models/*.pkl ./models/
+
+# Create output directory
+RUN mkdir -p /app/output
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+
+# Default entrypoint
+ENTRYPOINT ["python", "campyml_model.py"]
+
+# Default arguments (show help)
+CMD ["--help"]
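Review note: the exec-form `ENTRYPOINT` plus `CMD` pairing above is what lets the README pass `--mode predict ...` directly after the image name. The following is only a toy Python model of Docker's documented rule (arguments given to `docker run` replace `CMD`, while `ENTRYPOINT` stays unless `--entrypoint` is passed); the function name `container_argv` is made up for illustration and is not part of this repository.

```python
# Values copied from the Dockerfile above.
ENTRYPOINT = ["python", "campyml_model.py"]
CMD = ["--help"]


def container_argv(run_args):
    """Model the command a container starts with.

    Arguments placed after the image name on `docker run` replace CMD;
    the exec-form ENTRYPOINT is kept in front either way.
    """
    return ENTRYPOINT + (list(run_args) if run_args else CMD)


# No extra arguments: the default CMD applies and the help text is shown.
print(container_argv([]))
# ['python', 'campyml_model.py', '--help']

# Extra arguments replace CMD entirely.
print(container_argv(["--mode", "predict"]))
# ['python', 'campyml_model.py', '--mode', 'predict']
```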
@@ -0,0 +1,140 @@
+# CampyML - Machine Learning for Campylobacter Source Attribution
+
+A machine learning model for predicting the source (host species) of Campylobacter isolates based on genomic data using XGBoost.
+
+## Quick Start - Try It Now!
+
+```bash
+# Clone and enter the repository
+git clone https://github.com/genpat-it/campy-ml.git
+cd campy-ml
+
+# Run prediction on sample data (10 diverse Campylobacter samples)
+docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
+  --mode predict \
+  --data /workspace/data/sample_data.csv \
+  --model /app/models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
+  --output /workspace/data/predictions.csv
+
+# Check results
+cat data/predictions.csv
+```
+
+## Overview
+
+CampyML uses core genome MLST (cgMLST) allele profiles to predict the likely source of Campylobacter jejuni and Campylobacter coli isolates. The model is trained on data from PubMLST combined with internal IZS sequences and predicts sources such as:
+- Chicken
+- Cattle
+- Sheep
+- Turkey
+- Environmental waters
+- Human stool
+- Other sources
+
+## Usage
+
+### Making Predictions
+
+```bash
+# Pull the latest image
+docker pull ghcr.io/genpat-it/campy-ml:latest
+
+# Run prediction on your data
+docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
+  --mode predict \
+  --data /workspace/your_samples.csv \
+  --model /app/models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
+  --output /workspace/predictions.csv
+```
+
+### Training a New Model
+
+```bash
+# Prepare training data with 'source' column
+# Then train a new model
+docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
+  --mode train \
+  --data /workspace/training_data.csv \
+  --model /workspace/my_new_model.pkl \
+  --target source
+```
+
+### Local Development
+
+```bash
+# Install dependencies
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+
+# Run predictions locally
+python campyml_model.py --mode predict \
+  --data data/sample_data.csv \
+  --model models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
+  --output data/predictions.csv
+
+# Train locally
+python campyml_model.py --mode train \
+  --data data/training_data.csv \
+  --model data/my_model.pkl \
+  --target source
+```
+
+## Input Data Format
+
+### For Predictions
+- CSV file with genomic features (cgMLST allele profiles)
+- Required columns: `ID`, `Location`, `Species`, the seven MLST loci (`aspA`, `glnA`, `gltA`, `glyA`, `pgm`, `tkt`, `uncA`), and the cgMLST loci `CAMP0001` to `CAMP1164`
+- No source labels needed
+
+### For Training
+- Same columns as for predictions
+- One additional column: `source` (target labels such as "chicken" or "cattle")
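Review note: the column list above could be checked before a file is handed to the container. This is a minimal sketch using only the standard library, not code from `campyml_model.py`. It assumes the `CAMP0001-CAMP1164` range is contiguous and zero-padded to four digits; the helper name `missing_columns` is hypothetical. The seven MLST loci plus the 1164 CAMP loci add up to the 1171 features quoted under Model Details.

```python
import csv
import io

# Column names copied from the README. The CAMP numbering is ASSUMED to be
# contiguous and zero-padded, based on the "CAMP0001-CAMP1164" range.
META_COLUMNS = ["ID", "Location", "Species"]
MLST_LOCI = ["aspA", "glnA", "gltA", "glyA", "pgm", "tkt", "uncA"]
CGMLST_LOCI = [f"CAMP{i:04d}" for i in range(1, 1165)]
REQUIRED_COLUMNS = META_COLUMNS + MLST_LOCI + CGMLST_LOCI


def missing_columns(csv_file):
    """Return the required columns absent from the header row of an open CSV file."""
    header = next(csv.reader(csv_file), [])
    present = set(header)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical in-memory file whose header lacks the last locus.
demo = io.StringIO(",".join(REQUIRED_COLUMNS[:-1]) + "\n")
print(missing_columns(demo))  # ['CAMP1164']
```

For training data, the same check could be extended by appending `"source"` to the required list.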
+
+## Output
+
+The prediction output includes:
+- `prediction`: Predicted source/host species
+- `confidence`: Confidence score for the prediction
+- `prob_[source]`: Probability for each possible source class
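Review note: a short sketch of how the output columns listed above might be consumed downstream. It relies only on the `prediction` and `confidence` column names from the README and assumes `confidence` is a number between 0 and 1. The 0.6 threshold and the helper name `low_confidence_rows` are illustrative choices, not values from the project.

```python
import csv
import io


def low_confidence_rows(csv_file, threshold=0.6):
    """Yield (row_number, prediction, confidence) for rows below the threshold."""
    for number, row in enumerate(csv.DictReader(csv_file), start=1):
        confidence = float(row["confidence"])
        if confidence < threshold:
            yield number, row["prediction"], confidence


# Hypothetical output with two prob_ columns; a real file has one per class.
demo = io.StringIO(
    "prediction,confidence,prob_chicken,prob_cattle\n"
    "chicken,0.91,0.91,0.09\n"
    "cattle,0.55,0.45,0.55\n"
)
print(list(low_confidence_rows(demo)))  # [(2, 'cattle', 0.55)]
```

Rows flagged this way could be reviewed by hand before the attribution is reported.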
+
+## Model Details
+
+The current model (`modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl`) features:
+- 1171 allele features: the 7 MLST loci plus the 1164 cgMLST loci (CAMP0001-CAMP1164)
+- XGBoost classifier optimized for Campylobacter data
+- Trained on diverse samples from PubMLST and IZS
+- 9 source classes: chicken, cattle, sheep, environmental waters, human stool, etc.
+
+## Requirements
+
+- Docker (recommended) OR Python 3.8+
+- At least 4GB RAM for model operations
+- Input data in CSV format with cgMLST profiles
+
+## Sample Data
+
+The repository includes `data/sample_data.csv` with 10 test samples for immediate experimentation.
+
+## Credits
+
+**Developer**: Laura Di Egidio (Master's thesis project)
+**Organization**: IZS Teramo - GenPat Project
+**Contact**: For questions and support, please open an issue on GitHub
+
+## Citation
+
+If you use CampyML in your research, please cite:
+```
+CampyML: Machine Learning for Campylobacter Source Attribution
+Laura Di Egidio (Master's thesis), IZS Teramo - GenPat Project
+https://github.com/genpat-it/campy-ml
+```
+
+## License
+
+This project is part of the GenPat initiative at IZS Teramo.
+
+## Contact
+
+For questions and support, please open an issue on GitHub.