UNSW-NB15 ML Pipeline - Docker Setup

Automated ML training pipeline for UNSW-NB15 network intrusion detection with GPU support.

Prerequisites

- Docker (20.10+)
- Docker Compose (v2.0+)
- NVIDIA Docker Runtime (for GPU support)

# Install nvidia-docker2 (Debian/Ubuntu; NVIDIA now distributes this runtime as the NVIDIA Container Toolkit)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

- UNSW-NB15 dataset in ./training_and_test_sets/:
  - UNSW_NB15_training-set.csv
  - UNSW_NB15_testing-set.csv

Quick Start

1. Verify GPU Access

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

2. Run the Pipeline

docker compose up

That's it! The pipeline will:

1. Preprocess data (one-hot encoding, normalization)
2. Train Isolation Forest
3. Train XGBoost
4. Export models to ONNX format
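
To make the preprocessing step concrete, here is a dependency-free sketch of what one-hot encoding and normalization do. It is not the code in 1_preprocess_data_FIXED.py, which is not shown here and presumably uses a fitted scaler (see scaler.pkl). The helper names, the toy values, and the choice of min-max scaling are all assumptions made for illustration.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def min_max_scale(column):
    """Rescale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy values for illustration only (hypothetical, not taken from the dataset).
protocols = ["tcp", "udp", "tcp", "arp"]
durations = [0.0, 2.0, 4.0, 1.0]

categories, encoded = one_hot(protocols)
scaled = min_max_scale(durations)
print(categories)  # ['arp', 'tcp', 'udp']
print(encoded[0])  # [0, 1, 0]
print(scaled)      # [0.0, 0.5, 1.0, 0.25]
```

In a real pipeline the category list and the scaling bounds would be learned from the training set and then reused unchanged on the test set.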

3. Check Output

ls -lh ./data/     # Preprocessed data
ls -lh ./models/   # Trained models (.pkl, .json, .onnx)

Directory Structure

.
├── Dockerfile                          # Docker image definition
├── docker-compose.yml                  # Docker Compose config
├── requirements.txt                    # Python dependencies
├── run_pipeline.sh                     # Pipeline runner script
├── 1_preprocess_data_FIXED.py
├── 2_train_isolation_forest.py
├── 3_train_xgboost.py
├── 4_export_to_onnx.py
├── training_and_test_sets/             # Input data (you provide)
│   ├── UNSW_NB15_training-set.csv
│   └── UNSW_NB15_testing-set.csv
├── data/                               # Output: preprocessed data
│   ├── X_train.npy
│   ├── y_train.npy
│   ├── X_test.npy
│   ├── y_test.npy
│   ├── X_normal.npy
│   ├── scaler.pkl
│   └── feature_names.txt
└── models/                             # Output: trained models
    ├── isolation_forest.pkl
    ├── xgboost_classifier.pkl
    ├── xgboost_classifier.json
    ├── isolation_forest.onnx
    └── xgboost_classifier.onnx
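
After a run, the output layout above can be checked programmatically. The following is a minimal sketch using only the Python standard library; the helper name `missing_outputs` is hypothetical and not part of the pipeline. Note that isolation_forest.onnx may legitimately be absent if the Isolation Forest export fails (see Troubleshooting).

```python
from pathlib import Path

# Expected artifacts, copied from the directory tree above.
EXPECTED = {
    "data": ["X_train.npy", "y_train.npy", "X_test.npy", "y_test.npy",
             "X_normal.npy", "scaler.pkl", "feature_names.txt"],
    "models": ["isolation_forest.pkl", "xgboost_classifier.pkl",
               "xgboost_classifier.json", "isolation_forest.onnx",
               "xgboost_classifier.onnx"],
}


def missing_outputs(root="."):
    """Return relative paths of expected artifacts not found under root."""
    root = Path(root)
    return [f"{folder}/{name}"
            for folder, names in EXPECTED.items()
            for name in names
            if not (root / folder / name).is_file()]


if __name__ == "__main__":
    # Run from the project root; prints nothing when every artifact exists.
    for path in missing_outputs():
        print("missing:", path)
```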

Manual Steps

Build Only

docker compose build

Run Interactively

docker compose run --rm ml-pipeline bash
# Inside the container, run each stage in order:
python 1_preprocess_data_FIXED.py
python 2_train_isolation_forest.py
python 3_train_xgboost.py
python 4_export_to_onnx.py

View Logs

docker compose logs -f

Clean Up

docker compose down
docker rmi unsw-nb15-ml-pipeline:latest

Expected Output

Console Output

========================================
UNSW-NB15 ML Pipeline - Docker
========================================

✓ Training data found

========================================
Step 1/4: Data Preprocessing
========================================
==================================================
UNSW-NB15 Data Preprocessing (FIXED)
==================================================
...
✓ Training set: (175341, 45)
✓ Testing set: (82332, 45)
✓ One-hot encoded 3 categorical features
...

========================================
Step 2/4: Training Isolation Forest
========================================
...
✓ Test Accuracy: 0.XXXX

========================================
Step 3/4: Training XGBoost
========================================
...
✓ Test accuracy: 0.87XX

========================================
Step 4/4: Exporting to ONNX
========================================
...
✓ XGBoost exported to ./models/xgboost_classifier.onnx

========================================
Pipeline Complete!
========================================

Performance Expectations

Preprocessing: ~30-60 seconds
Isolation Forest: ~2-5 minutes
XGBoost: ~5-15 minutes (with GPU)
ONNX Export: ~10-30 seconds
Total Time: ~10-20 minutes
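
As a sanity check, the total can be derived by summing the per-stage ranges. The short sketch below does that arithmetic; the numbers are the estimates listed above, converted to seconds.

```python
# Per-stage (min, max) durations in seconds, taken from the estimates above.
stages = {
    "Preprocessing": (30, 60),
    "Isolation Forest": (2 * 60, 5 * 60),
    "XGBoost": (5 * 60, 15 * 60),
    "ONNX Export": (10, 30),
}

low = sum(lo for lo, _ in stages.values())
high = sum(hi for _, hi in stages.values())
print(f"{low / 60:.1f}-{high / 60:.1f} minutes")  # 7.7-21.5 minutes
```

The summed range is consistent with the rounded total of roughly 10-20 minutes.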

Troubleshooting

GPU Not Detected

# Check nvidia-docker is installed
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# If this fails, reinstall the nvidia-docker runtime
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Dataset Not Found

❌ ERROR: Training data not found!

Solution: Ensure CSV files are in ./training_and_test_sets/:

ls -lh ./training_and_test_sets/
# Should show:
# UNSW_NB15_training-set.csv
# UNSW_NB15_testing-set.csv

ONNX Export Fails (Isolation Forest)

❌ Error exporting Isolation Forest: ...
This is a known issue with sklearn-onnx.

Solution: This failure is expected. IsolationForest export through sklearn-onnx is fragile. Check that the XGBoost export succeeded:

ls -lh ./models/xgboost_classifier.onnx

XGBoost export is more reliable and sufficient for the PoC.

Out of Memory

If training fails with out-of-memory (OOM) errors, reduce the XGBoost parameters in 3_train_xgboost.py:

n_estimators=50  # Reduce from 100
max_depth=4      # Reduce from 6

Permission Denied (Output Files)

# Files written by the container are usually owned by root; reclaim ownership:
sudo chown -R $USER:$USER ./data ./models

CPU-Only Mode

If you don't have a GPU or nvidia-docker:

Edit requirements.txt:

# Change:
onnxruntime-gpu>=1.16.0
# To:
onnxruntime>=1.16.0

Edit docker-compose.yml:

# Comment out the deploy section:
# deploy:
#   resources:
#     reservations:
#       devices:
#         - driver: nvidia
#           count: 1
#           capabilities: [gpu]

Use a CPU base image in the Dockerfile:

# Change the FROM line to:
FROM python:3.10-slim
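
The requirements.txt change above can also be scripted. This is a minimal sketch, assuming the package is listed at the start of its own line; the helper name `to_cpu_requirements` and the other entries in the sample text are hypothetical.

```python
import re


def to_cpu_requirements(text):
    """Replace the onnxruntime-gpu requirement with the CPU-only package."""
    # Match the package name only at the start of a line and keep any version pin.
    return re.sub(r"(?m)^onnxruntime-gpu\b", "onnxruntime", text)


# Hypothetical file contents; only the onnxruntime-gpu line comes from this README.
sample = "numpy\nonnxruntime-gpu>=1.16.0\nxgboost\n"
print(to_cpu_requirements(sample))
```

To apply it, read requirements.txt, pass its contents through the function, and write the result back before rebuilding the image.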

Next Steps

After successful training:

1. Check the accuracy reported in the console output
2. Copy the ONNX models to the Raspberry Pi
3. Deploy using Rust and the ort crate
4. Test real-time inference

Support

For issues:

- Check Docker logs: docker compose logs
- Check GPU: nvidia-smi
- Verify data: ls -lh ./training_and_test_sets/
