Commit b4e991d

Initial commit: CampyML - Machine Learning for Campylobacter Source Attribution
Complete machine learning pipeline for predicting Campylobacter source attribution using genomic data.

Features:
- RandomForest model with 1171 cgMLST features for maximum accuracy
- Docker containerization with automated CI/CD pipeline
- Sample data for immediate testing and training
- Comprehensive documentation and usage examples
- Both prediction and training modes supported

Developer: Laura Di Egidio (Master's thesis project)
Organization: IZS Teramo - GenPat Project
0 parents  commit b4e991d

10 files changed, +580 -0 lines changed

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
name: Build and Push Docker Image

on:
  push:
    branches: [ main ]
  workflow_dispatch:

jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=ref,event=branch
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push
        id: build  # referenced by the "Image digest" step below
        uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

      - name: Image digest
        run: echo ${{ steps.build.outputs.digest }}
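
The three `tags` rules in the "Extract metadata" step decide which image tags a push produces. As a rough illustration only (this is not how docker/metadata-action is implemented), the Python sketch below mimics those rules for a push to a simply named branch. The function name `expected_tags`, the 7-character short SHA, and the placeholder SHA are assumptions made for the example.

```python
def expected_tags(image: str, branch: str, sha: str, default_branch: str = "main") -> list[str]:
    """Approximate the image tags the three rules yield for a branch push."""
    tags = [
        f"{image}:{branch}",            # type=ref,event=branch
        f"{image}:{branch}-{sha[:7]}",  # type=sha,prefix={{branch}}- (short SHA assumed to be 7 chars)
    ]
    if branch == default_branch:
        tags.append(f"{image}:latest")  # type=raw,value=latest,enable={{is_default_branch}}
    return tags


if __name__ == "__main__":
    # Placeholder SHA for illustration only; it is not a real commit of this repository.
    placeholder_sha = "0123456789abcdef0123456789abcdef01234567"
    for tag in expected_tags("ghcr.io/genpat-it/campy-ml", "main", placeholder_sha):
        print(tag)
```

Under these assumptions a push to `main` would carry a branch tag, a branch-plus-short-SHA tag, and `latest`, while a manual run on any other branch would skip `latest`.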

.gitignore

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Old files and backups
old/

# Jupyter notebooks
*.ipynb
.ipynb_checkpoints/

# Large data files
*.csv
!sample_data.csv

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Docker
*.log
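
The `*.csv` / `!sample_data.csv` pair relies on gitignore's rule that the last matching pattern wins, so the sample file stays tracked while other CSV files are ignored. The Python sketch below is a deliberately simplified model of that rule for slash-free file patterns only. It is not git's matcher: it ignores directory patterns, anchoring, and the restriction that a file cannot be re-included when its parent directory is excluded. The helper name `is_ignored` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The two CSV rules from the .gitignore above, in file order.
CSV_RULES = ["*.csv", "!sample_data.csv"]


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Apply slash-free patterns to the file name; the last matching pattern decides."""
    name = PurePosixPath(path).name
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(name, glob):
            ignored = not negated  # a "!" pattern re-includes the file
    return ignored


if __name__ == "__main__":
    for candidate in ("data/training_data.csv", "data/sample_data.csv", "README.md"):
        print(candidate, is_ignored(candidate, CSV_RULES))
```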

Dockerfile

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install system dependencies required for some Python libraries
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY campyml_model.py .

# Copy models
COPY models/*.pkl ./models/

# Create output directory
RUN mkdir -p /app/output

# Set environment variables
ENV PYTHONUNBUFFERED=1

# Default entrypoint
ENTRYPOINT ["python", "campyml_model.py"]

# Default arguments (show help)
CMD ["--help"]
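
Because both `ENTRYPOINT` and `CMD` use the exec form, anything passed after the image name in `docker run` replaces `CMD` and is appended to the entrypoint; with no arguments, the container simply prints the help text. A minimal Python sketch of that composition follows (the function name `container_argv` is hypothetical and only models the rule):

```python
ENTRYPOINT = ["python", "campyml_model.py"]
CMD = ["--help"]


def container_argv(run_args: list[str]) -> list[str]:
    """Return the command line the container would execute for the given `docker run` arguments."""
    # Arguments given to `docker run` replace CMD; otherwise CMD supplies the defaults.
    return ENTRYPOINT + (run_args if run_args else CMD)


if __name__ == "__main__":
    print(container_argv([]))
    print(container_argv(["--mode", "predict"]))
```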

README.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# CampyML - Machine Learning for Campylobacter Source Attribution

A machine learning model for predicting the source (host species) of Campylobacter isolates based on genomic data using XGBoost.

## Quick Start - Try It Now!

```bash
# Clone and enter the repository
git clone https://github.com/genpat-it/campy-ml.git
cd campy-ml

# Run prediction on sample data (10 diverse Campylobacter samples)
docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
  --mode predict \
  --data /workspace/data/sample_data.csv \
  --model /app/models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
  --output /workspace/data/predictions.csv

# Check results
cat data/predictions.csv
```

## Overview

CampyML uses core genome MLST (cgMLST) profiles to predict the likely source of Campylobacter jejuni and Campylobacter coli isolates. The model is trained on data from pubMLST combined with internal IZS sequences to predict sources such as:
- Chicken
- Cattle
- Sheep
- Turkey
- Environmental waters
- Human stool
- Other sources

## Usage

### Making Predictions

```bash
# Pull the latest image
docker pull ghcr.io/genpat-it/campy-ml:latest

# Run prediction on your data
docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
  --mode predict \
  --data /workspace/your_samples.csv \
  --model /app/models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
  --output /workspace/predictions.csv
```

### Training a New Model

```bash
# Prepare training data with 'source' column
# Then train a new model
docker run --rm -v $(pwd):/workspace ghcr.io/genpat-it/campy-ml:latest \
  --mode train \
  --data /workspace/training_data.csv \
  --model /workspace/my_new_model.pkl \
  --target source
```

### Local Development

```bash
# Install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run predictions locally
python campyml_model.py --mode predict \
  --data data/sample_data.csv \
  --model models/modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl \
  --output data/predictions.csv

# Train locally
python campyml_model.py --mode train \
  --data data/training_data.csv \
  --model data/my_model.pkl \
  --target source
```
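
The commands above all use the same five flags. `campyml_model.py` itself is not shown in this view, so the following is only a sketch of an `argparse` interface that would accept them; the help texts, the `source` default for `--target`, and the rule that `--output` is mandatory in predict mode are assumptions, not the project's confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags used in the README commands."""
    parser = argparse.ArgumentParser(prog="campyml_model.py")
    parser.add_argument("--mode", choices=["predict", "train"], required=True,
                        help="predict with an existing model or train a new one")
    parser.add_argument("--data", required=True, help="input CSV with allele profiles")
    parser.add_argument("--model", required=True,
                        help="model .pkl to load (predict) or to write (train)")
    parser.add_argument("--output", help="CSV file for predictions (predict mode)")
    parser.add_argument("--target", default="source",
                        help="label column used in train mode (default is an assumption)")
    return parser


def parse_cli(argv: list[str]) -> argparse.Namespace:
    """Parse arguments and apply the assumed cross-flag check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "predict" and not args.output:
        parser.error("--output is required in predict mode")
    return args


if __name__ == "__main__":
    demo = parse_cli(["--mode", "train", "--data", "data/training_data.csv",
                      "--model", "data/my_model.pkl", "--target", "source"])
    print(demo.mode, demo.data, demo.model, demo.target)
```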

## Input Data Format

### For Predictions
- CSV file with genomic features (cgMLST allele profiles)
- Required columns: ID, Location, Species, aspA, glnA, gltA, glyA, pgm, tkt, uncA, CAMP0001-CAMP1164
- No source labels needed
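
Checking the header before running a prediction can catch a malformed file early. The sketch below builds the expected column list from the requirements above and reports which columns are missing. It assumes that `CAMP0001-CAMP1164` means every zero-padded locus in that range, in which case the seven MLST loci plus 1164 CAMP loci give the 1171 features mentioned elsewhere in this README. The helper names are hypothetical, and the real pipeline may validate input differently.

```python
import csv
import io

METADATA_COLUMNS = ["ID", "Location", "Species"]
MLST_LOCI = ["aspA", "glnA", "gltA", "glyA", "pgm", "tkt", "uncA"]
# Assumption: the range CAMP0001-CAMP1164 is contiguous and zero-padded to four digits.
CGMLST_LOCI = [f"CAMP{i:04d}" for i in range(1, 1165)]
FEATURE_COLUMNS = MLST_LOCI + CGMLST_LOCI  # 7 + 1164 = 1171 columns


def missing_columns(header: list[str]) -> list[str]:
    """Return the required columns that are absent from a CSV header."""
    present = set(header)
    return [col for col in METADATA_COLUMNS + FEATURE_COLUMNS if col not in present]


def check_csv(handle) -> list[str]:
    """Read only the first row of an open CSV file and report missing columns."""
    header = next(csv.reader(handle), [])
    return missing_columns(header)


if __name__ == "__main__":
    # In-memory stand-in for a file that has only the metadata columns.
    incomplete = io.StringIO("ID,Location,Species\n")
    print(len(FEATURE_COLUMNS), "feature columns expected")
    print(len(check_csv(incomplete)), "columns missing from the example header")
```

For a real file, `check_csv(open("your_samples.csv", newline=""))` would return an empty list when every required column is present.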

### For Training
- Same format as predictions PLUS
- Additional column: `source` (target labels like "chicken", "cattle", etc.)

## Output

The prediction output includes:
- `prediction`: Predicted source/host species
- `confidence`: Confidence score for the prediction
- `prob_[source]`: Probability for each possible source class
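
One way to work with these columns is to flag isolates whose prediction is uncertain. The sketch below reads a predictions file with the standard `csv` module, picks the most probable class from the `prob_*` columns, and lists rows below a confidence threshold. The README does not state how `confidence` is computed; treating it as comparable to the highest class probability is an assumption, as are the `ID` column in the output, the 0.6 threshold, and the helper names. The embedded rows are invented for illustration and are not model output.

```python
import csv
import io

# Invented example rows in the documented column layout (not real predictions).
EXAMPLE_CSV = """ID,prediction,confidence,prob_chicken,prob_cattle
isolate_A,chicken,0.82,0.82,0.18
isolate_B,cattle,0.55,0.45,0.55
"""


def top_class(row: dict) -> tuple[str, float]:
    """Return the class with the highest prob_* value and that probability."""
    probs = {key[len("prob_"):]: float(value)
             for key, value in row.items() if key.startswith("prob_")}
    best = max(probs, key=probs.get)
    return best, probs[best]


def low_confidence(rows: list[dict], threshold: float = 0.6) -> list[dict]:
    """Return rows whose confidence score falls below the (hypothetical) threshold."""
    return [row for row in rows if float(row["confidence"]) < threshold]


if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
    for row in rows:
        print(row["ID"], top_class(row))
    print([row["ID"] for row in low_confidence(rows)])
```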

## Model Details

The current model (`modello_xgb_jejuni_coli_pubmlst_IZS_v2.pkl`) features:
- 1171 cgMLST features for maximum accuracy
- RandomForest classifier optimized for Campylobacter data
- Trained on diverse samples including pubMLST and IZS data
- 9 source classes: chicken, cattle, sheep, environmental waters, human stool, etc.

## Requirements

- Docker (recommended) OR Python 3.8+
- At least 4GB RAM for model operations
- Input data in CSV format with cgMLST profiles

## Sample Data

The repository includes `data/sample_data.csv` with 10 test samples for immediate experimentation.

## Credits

**Developer**: Laura Di Egidio (Master's thesis project)
**Organization**: IZS Teramo - GenPat Project
**Contact**: For questions and support, please open an issue on GitHub

## Citation

If you use CampyML in your research, please cite:
```
CampyML: Machine Learning for Campylobacter Source Attribution
Laura Di Egidio (Master's thesis), IZS Teramo - GenPat Project
https://github.com/genpat-it/campy-ml
```

## License

This project is part of the GenPat initiative at IZS Teramo.

## Contact

For questions and support, please open an issue on GitHub.
