This project focuses on building a high-performance Intrusion Detection System (IDS) that uses machine learning to secure network traffic. We use the UNSW-NB15 dataset to train a model that accurately flags malicious network activity while allowing normal traffic to pass through.
The UNSW-NB15 dataset was well suited to this task, as it was generated by the Australian Centre for Cyber Security (ACCS) using the professional IXIA PerfectStorm traffic generator.
Before training, we conducted an in-depth exploration and cleaning phase to ensure the data was robust and relevant to our binary classification goal (Attack vs. Normal).
Target Simplification: We dropped the detailed attack_cat column to simplify the problem to a core binary decision using the primary label column.
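A minimal sketch of this step with pandas; the file name is illustrative and should be adjusted to the local copy of the training split:

```python
import pandas as pd

# Illustrative file name for the UNSW-NB15 training split.
df = pd.read_csv("UNSW_NB15_training-set.csv")

# Keep only the binary target: drop the multi-class attack_cat column
# and rely on label (0 = Normal, 1 = Attack).
df = df.drop(columns=["attack_cat"])
```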
Visualizing the Landscape: We used several visualizations to uncover hidden patterns and examine feature distributions:
Distribution Plots & Boxplots: Helped us understand the spread and identify extreme outliers in critical metrics like Time-To-Live (sttl, dttl) and TCP Round-Trip Time (tcprtt).
Violin Plots: Crucial for comparing the differences in numerical feature distributions directly between the normal and attack classes.
Correlation Heatmap: This map guided our feature selection by highlighting strong relationships.
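These plots can be reproduced along the following lines with seaborn and matplotlib; this is a sketch of the approach, not our exact plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot of a critical metric to surface extreme outliers.
sns.boxplot(x=df["sttl"])
plt.show()

# Violin plot comparing a numeric feature between Normal (0) and Attack (1).
sns.violinplot(data=df, x="label", y="tcprtt")
plt.show()

# Correlation heatmap over the numeric columns to guide feature selection.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```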
Key Correlations: Our analysis revealed several features highly correlated with the target (label):
Strong Positive Links (Attack likely): sttl (0.61), ct_state_ttl (0.53), dttl (0.46).
Strong Negative Links (Normal likely): dload (−0.36), dmean (−0.23).
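These figures can be read directly off the correlation matrix; a short sketch of how features are ranked against the target:

```python
# Correlation of every numeric feature with the binary target,
# sorted so the strongest positive and negative links stand out.
label_corr = (
    df.select_dtypes("number")
      .corr()["label"]
      .drop("label")
      .sort_values(ascending=False)
)
print(label_corr.head(5))   # strongest positive links (e.g. sttl, ct_state_ttl)
print(label_corr.tail(5))   # strongest negative links (e.g. dload, dmean)
```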
To prepare the dataset for modeling, we transformed the raw data:
Feature Encoding: We utilized One-Hot Encoding to convert critical categorical variables—proto (protocol), service, and state—into a numeric, machine-readable format.
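A sketch of the encoding step using pandas (scikit-learn's OneHotEncoder would work equally well):

```python
# One-hot encode the categorical columns; the numeric columns
# (including the label target) pass through unchanged.
df_encoded = pd.get_dummies(df, columns=["proto", "service", "state"])
```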
Consistency Artifact: The fully processed dataset was saved as data_encoded_train.csv. This artifact ensures perfect feature alignment when the model is moved to a deployment or inference environment.
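The artifact is a plain CSV; at inference time, new data can be forced onto the same column set. A sketch, where new_df stands in for a hypothetical batch of raw flows to score:

```python
# Persist the fully encoded training frame so the exact column set
# (and order) is available in any deployment or inference environment.
df_encoded.to_csv("data_encoded_train.csv", index=False)

# Re-align new data to the training schema: missing dummy columns
# are filled with 0, unseen ones are dropped.
train_columns = pd.read_csv("data_encoded_train.csv", nrows=0).columns
feature_columns = [c for c in train_columns if c != "label"]
new_encoded = pd.get_dummies(new_df, columns=["proto", "service", "state"])
new_aligned = new_encoded.reindex(columns=feature_columns, fill_value=0)
```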
We trained and compared three powerful ensemble models to determine the optimal classifier for intrusion detection.
| Model | Accuracy | Precision | Recall (Detection) | F1-score |
|---|---|---|---|---|
| Random Forest | 0.92 | 0.92 | 0.92 | 0.92 |
| XGBoost | 0.88 | 0.86 | 0.87 | 0.86 |
| LightGBM | 0.88 | 0.86 | 0.87 | 0.87 |
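A condensed sketch of how the three models can be trained and scored, assuming scikit-learn, xgboost, and lightgbm are installed; the hyperparameters shown are library defaults, not necessarily the exact settings behind the table above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X = df_encoded.drop(columns=["label"])
y = df_encoded["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
}

# Fit each model and report Accuracy, Precision, Recall, and F1.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name)
    print(classification_report(y_test, preds, digits=4))
```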
Reliable Performance: The Random Forest model provided the highest and most balanced performance across all metrics (Accuracy, Precision, and Recall), making it the most reliable overall choice.
Critical Detection Focus: While slightly lower in accuracy, both XGBoost and LightGBM demonstrated competitive Recall scores. In security, maximizing Recall (catching real attacks) is often prioritized, highlighting the value of these models.
Feature Importance: Model feature importance rankings consistently reinforced our correlation findings, identifying sttl, tcprtt, and ackdat as major contributors to detection success.
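Continuing from the training sketch above, the rankings can be pulled straight from the fitted estimators; shown here for the Random Forest (the gradient-boosted models expose the same attribute):

```python
import pandas as pd

# Impurity-based feature importances from the fitted Random Forest,
# indexed by the encoded feature names.
rf = models["Random Forest"]
importances = (
    pd.Series(rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances.head(10))  # expect sttl, tcprtt, ackdat near the top
```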