The original 1_preprocess_data.py has bugs. Use the FIXED version instead!
- 1_preprocess_data_FIXED.py ← Use this one!
- 1_preprocess_data.py ← Has CSV header bug
- 2_train_isolation_forest.py ← Original is OK
- 3_train_xgboost.py ← Original is OK
- 4_export_to_onnx.py ← Original is OK
# 1. Create environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Create directories
mkdir -p data models
# 4. Download UNSW-NB15 dataset
# Get from: https://research.unsw.edu.au/projects/unsw-nb15-dataset
# Files needed:
# - UNSW_NB15_training-set.csv
# - UNSW_NB15_testing-set.csv
# Place in ./data/

# Step 1: Preprocess (FIXED VERSION)
python 1_preprocess_data_FIXED.py
# Step 2: Train Isolation Forest
python 2_train_isolation_forest.py
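The training script itself isn't reproduced here, but a minimal sketch of what training an Isolation Forest on normal traffic only looks like (synthetic data and illustrative hyperparameters, not the script's actual values):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for ./data/X_normal.npy: synthetic "normal" traffic features.
rng = np.random.default_rng(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 10))

# Train on normal traffic only (semi-supervised anomaly detection).
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(X_normal)

# predict() returns +1 for inliers (normal) and -1 for outliers (attacks).
X_test = np.vstack([rng.normal(0, 1, (5, 10)), rng.normal(8, 1, (5, 10))])
print(model.predict(X_test))  # far-off rows should come back as -1
```

Training only on normal traffic is what lets the model flag unseen attack types as anomalies.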
# Step 3: Train XGBoost
python 3_train_xgboost.py
# Step 4: Export to ONNX
python 4_export_to_onnx.py

# Check preprocessed data
ls -lh ./data/*.npy
# Check trained models
ls -lh ./models/*.pkl ./models/*.json
# Check ONNX models (for Rust)
ls -lh ./models/*.onnx

Expected output:
./data/X_train.npy - ~60 MB
./data/X_test.npy - ~30 MB
./data/X_normal.npy - ~50 MB
./models/isolation_forest.pkl - ~5 MB
./models/xgboost_classifier.pkl - ~10 MB
./models/xgboost_classifier.onnx - ~10 MB ← For Rust!
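Before moving to Rust, it can help to sanity-check the saved arrays. A minimal sketch using a dummy array (the real shapes depend on your preprocessing; the 100×42 shape here is made up):

```python
import os
import tempfile
import numpy as np

# Simulate one saved artifact (the real pipeline writes ./data/X_train.npy).
X = np.random.rand(100, 42).astype(np.float32)
path = os.path.join(tempfile.gettempdir(), "X_train_demo.npy")
np.save(path, X)

# Sanity checks before training: shape, dtype, and no NaNs.
loaded = np.load(path)
assert loaded.shape == (100, 42)
assert loaded.dtype == np.float32
assert not np.isnan(loaded).any()
print("OK:", loaded.shape, loaded.dtype)  # → OK: (100, 42) float32
```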
// Cargo.toml
[dependencies]
ort = "1.16"   # matches the Environment/SessionBuilder API used below
ndarray = "0.15"

// main.rs
use ort::{Environment, SessionBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let env = Environment::builder().build()?.into_arc();

    // Load the exported classifier
    let xgb = SessionBuilder::new(&env)?
        .with_model_from_file("xgboost_classifier.onnx")?;

    // Run inference...
    Ok(())
}

ValueError: Length mismatch: Expected 49 columns
Fix: Make sure you use 1_preprocess_data_FIXED.py
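The exact bug in the original script isn't reproduced here, but a "Length mismatch" error typically comes from assigning a column-name list whose length doesn't match what pandas actually parsed. A self-contained illustration (toy 3-column CSV, not the real 49-column dataset):

```python
import io
import pandas as pd

csv_text = "dur,proto,label\n0.1,tcp,0\n0.2,udp,1\n"

# Buggy pattern: assign a name list whose length doesn't match the
# parsed frame -> "ValueError: Length mismatch".
df = pd.read_csv(io.StringIO(csv_text))   # 3 columns parsed
try:
    df.columns = ["dur", "proto", "label", "extra"]   # 4 names: too many
except ValueError as e:
    print("Reproduced:", e)

# Safer pattern: let pandas take names from the file's own header row
# (or pass header=None plus names=[...] for truly headerless files).
df_ok = pd.read_csv(io.StringIO(csv_text))
print(df_ok.columns.tolist())  # → ['dur', 'proto', 'label']
```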
MemoryError: Unable to allocate array
Fix: Process in batches or use a smaller sample:

# In the preprocessing script, add:
train_df = train_df.sample(n=50000)  # Use a subset

Error converting Isolation Forest
Fix: This is a known issue. XGBoost export should work. If needed:
# Update packages
pip install --upgrade onnx onnxruntime skl2onnx onnxmltools

| Model | Accuracy | Notes |
|---|---|---|
| Isolation Forest | 85-95% | Trained on normal traffic only |
| XGBoost | 95-99% | Binary classification |
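These numbers depend on how the Isolation Forest's {+1, -1} output is mapped onto the dataset's binary labels. A toy sketch of that mapping and the accuracy computation (synthetic predictions, not real results):

```python
import numpy as np

# Hypothetical ground truth: 0 = normal, 1 = attack.
y_true = np.array([0, 0, 1, 1, 0, 1])

# Isolation Forest convention: +1 = inlier (normal), -1 = outlier (attack).
if_raw = np.array([1, 1, -1, 1, 1, -1])
y_pred = (if_raw == -1).astype(int)   # map to the label convention

accuracy = (y_pred == y_true).mean()
print(f"accuracy: {accuracy:.2f}")  # → accuracy: 0.83
```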
Original Plan: 3 models (FastText + Isolation Forest + XGBoost)
Actual Implementation: 2 models (Isolation Forest + XGBoost)
Why: FastText is designed for text data, while UNSW-NB15 consists mostly of numerical network-flow features, so FastText doesn't apply here. Using only the two tabular models is standard practice in this line of research.
- Downloaded UNSW-NB15 dataset
- Placed CSV files in ./data/
- Installed requirements.txt
- Ran 1_preprocess_data_FIXED.py (not the original!)
- Trained both models successfully
- Exported to ONNX
- Have .onnx files ready for Rust
- Dataset: https://research.unsw.edu.au/projects/unsw-nb15-dataset
- Rust ort crate: https://github.com/pykeio/ort
- Read ISSUES_AND_FIXES.md for technical details
Remember: Use the FIXED preprocessing script! 🎯