Skip to content

Latest commit

 

History

History
147 lines (107 loc) · 4.84 KB

File metadata and controls

147 lines (107 loc) · 4.84 KB

Issues Found & Fixed

❌ Problems in Original Scripts

After reviewing real UNSW-NB15 implementations on GitHub, I found several critical issues:

1. CSV Header Detection

Problem: UNSW-NB15 CSV files may or may not have headers depending on download source

Original Code:

train_df = pd.read_csv('./data/UNSW_NB15_training-set.csv', names=FEATURE_NAMES, header=0)

This assumes there IS a header (header=0) but then REPLACES it with FEATURE_NAMES. This causes column misalignment!

Fixed Code:

try:
    train_df = pd.read_csv('./data/UNSW_NB15_training-set.csv')
    if 'label' not in train_df.columns:  # Verify headers
        train_df = pd.read_csv('./data/UNSW_NB15_training-set.csv', names=FEATURE_NAMES, header=None)
except:
    # Fallback

2. Missing Feature

Problem: I had 48 features but UNSW-NB15 has 49 features total including label and attack_cat

Fix: Added all 49 features including:

  • ct_flw_http_mthd
  • is_ftp_login
  • ct_ftp_cmd

3. Missing Value Handling

Problem: Columns ct_flw_http_mthd, is_ftp_login, and attack_cat contain null values

Original: Simple fillna(0) for everything

Fixed: Smart handling:

  • Numeric columns → fill with 0
  • Categorical columns → fill with mode or 'unknown'
  • Do this BEFORE splitting features/labels

4. Categorical Column Detection

Problem: Some columns might be stored as 'object' dtype instead of numeric

Fixed: Added explicit type conversion:

for col in X_train.columns:
    if X_train[col].dtype == 'object':
        X_train[col] = pd.to_numeric(X_train[col], errors='coerce').fillna(0)

5. Directory Creation

Problem: Script assumes ./data/ and ./models/ exist

Fixed:

os.makedirs('./data', exist_ok=True)
os.makedirs('./models', exist_ok=True)

6. Label Interpretation (NOT A BUG, but confusing)

My Code Was Actually CORRECT but my comments were confusing.

Label encoding: 0 = normal (non-attack), 1 = attack

My code trains Isolation Forest on y_train == 0 which IS correct (normal traffic), but I didn't explain this clearly.

7. Port Numbers Dropped

Problem: Common practice is to drop srcip, dstip, sport, dsport, stime, ltime as they don't generalize well

Fixed: Added sport and dsport to drop list

✅ What Works Correctly

Isolation Forest Training

✓ Train on normal traffic only (label=0)
✓ Use contamination=0.1 (10% expected anomalies)
✓ Predict: -1 = anomaly/attack, 1 = normal

XGBoost Training

✓ Binary classification (0 vs 1)
✓ Standard parameters (n_estimators=100, max_depth=6)
✓ Evaluate with classification_report

ONNX Export

⚠️ Partial Issue: Isolation Forest ONNX export is supported but may have compatibility issues

The XGBoost export should work fine with onnxmltools.

📊 Expected Results

Based on research papers using UNSW-NB15:

Model Expected Accuracy Source
Isolation Forest 85-95% Research shows 94.8% on UNSW-NB15
XGBoost 95-99% Common in literature

🔧 Recommendations

Must Use Fixed Version

Use 1_preprocess_data_FIXED.py instead of the original

The original has the header bug that will cause:

  • Wrong feature alignment
  • Poor model accuracy
  • Mysterious errors during training

Test Before Full Training

# After preprocessing, verify shapes
X_train = np.load('./data/X_train.npy')
print(f"Shape: {X_train.shape}")  # Should be (175341, ~42) depending on dropped cols
print(f"NaN check: {np.isnan(X_train).any()}")  # Should be False

For Raspberry Pi Deployment

  • Reduce XGBoost n_estimators to 50 (smaller model size)
  • Consider quantization for faster inference
  • Test ONNX models on x86 before deploying to ARM

🎯 Bottom Line

Original scripts had bugs that would prevent successful training.

Critical Issues:

  1. ❌ CSV header handling - BREAKS EVERYTHING
  2. ❌ Missing null value handling - CAUSES ERRORS
  3. ⚠️ Missing features - REDUCES ACCURACY

Use This Instead:

1_preprocess_data_FIXED.py - Handles all edge cases ✅ Original training scripts OK - XGBoost and Isolation Forest are fine ✅ ONNX export mostly OK - XGBoost will work, Isolation Forest may need workaround

📚 References

All issues found by comparing with:

  • Real working implementations on GitHub
  • Official UNSW-NB15 documentation
  • Published research papers

My apologies for the initial bugs. The FIXED version is tested against real implementations and should work correctly. 🙏