This project presents a hybrid malware classification pipeline that integrates both static features (from EMBER v2) and dynamic features (from CIC-MalMem 2022) to enhance detection accuracy and robustness. It leverages machine learning models with a strong focus on interpretability using SHAP (SHapley Additive exPlanations).
To replicate or test this project, please download the following datasets from Kaggle:
- EMBER v2 (2018) - Static PE Features
  - https://www.kaggle.com/datasets/elastic/ember
  - Files used: `train_features.csv`, `test_features.csv`
- CIC-MalMem 2022 - Dynamic Malware Behavior
  - https://www.kaggle.com/datasets/kevingeorge/malware-analysis-dataset
  - File used: `Obfuscated-MalMem2022.parquet`
Note: Store all downloaded files in a root directory before running any scripts. The structure is presented below.
Note: You only need the `ember_v2_2018` folder, which holds the train (`train_ember_2018_v2.parquet`) and test (`test_ember_2018_v2.parquet`) datasets but not the merged file; the `cic_malmem_2022` folder, which holds a single dataset file (`Obfuscated-MalMem2022.parquet`); and the provided `.ipynb` files. All other files and folders are generated by the code.
FINAL_HYBRID/
├── cic_malmem_2022/ # Raw CIC-MalMem 2022 dataset
│ └── Obfuscated-MalMem2022.parquet # CIC-MalMem 2022 dynamic feature file
├── ember_v2_2018/ # Raw EMBER v2 dataset
│ ├── test_ember_2018_v2.parquet # Test set of EMBER 2018
│ ├── train_ember_2018_v2.parquet # Train set of EMBER 2018
│ └── ember_merged.parquet # Combined EMBER data (train + test)
│
├── dataset_cleanup.ipynb # Notebook for dataset loading & merging
├── data_preprocessing.ipynb # Notebook for scaling, PCA, and SMOTE
├── data_analysis.ipynb # Visualizes class distributions and PCA
├── model_training.ipynb # Trains XGBoost, LightGBM, MLP models
├── SHAP.ipynb # SHAP-based feature interpretability
├── visualization.ipynb # Final result visualizations (ROC, SHAP)
│
├── clean_hybrid_dataset.parquet # Cleaned combined dataset before PCA
├── hybrid_dataset_pca.parquet # PCA-reduced dataset (300 components)
├── hybrid_dataset_pca_smoted.parquet # PCA-reduced + SMOTE-balanced dataset
├── hybrid_dataset_reduced.parquet # (Optional) Alternative reduced dataset
├── hybrid_malware_dataset.csv # Final CSV snapshot of hybrid data
├── hybrid_malware_dataset.parquet # Final Parquet version of hybrid data
│
├── train_split.parquet # Training data split
├── val_split.parquet # Validation data split
├── test_split.parquet # Test data split
├── Xy_train.pkl # Training features + labels
├── Xy_val.pkl # Validation features + labels
├── Xy_test.pkl # Test features + labels
│
├── xgb_model.pkl # Base XGBoost model
├── xgb_tuned_model.pkl # Grid-searched optimized XGBoost
├── xgb_best_model.pkl # Best performing XGBoost model
├── xgboost_hybrid_model.json # Exported JSON model (optional use)
│
├── pca_model.pkl # Saved PCA transformation model
├── raw_feature_names.pkl # Saved raw feature names pre-PCA
├── shap_feature_importance.csv # SHAP feature importance values
├── shap_pca_feature_mapping.csv # Mapping: PCA component ↔️ raw features
│
├── README.md # Project documentation (you are here)
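Before running the notebooks, it can help to confirm that the three required input files are where the layout above expects them. The following is a minimal sketch, not part of the project's notebooks; the helper name `missing_inputs` is hypothetical, and the paths are copied from the directory tree.

```python
from pathlib import Path

# Required inputs, copied from the directory layout shown above.
REQUIRED_INPUTS = [
    "ember_v2_2018/train_ember_2018_v2.parquet",
    "ember_v2_2018/test_ember_2018_v2.parquet",
    "cic_malmem_2022/Obfuscated-MalMem2022.parquet",
]


def missing_inputs(root="."):
    """Return the required input files that are absent under `root`.

    Hypothetical helper for illustration; `root` is the project root
    (FINAL_HYBRID/ in the layout above).
    """
    root = Path(root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]
```

Calling `missing_inputs("FINAL_HYBRID")` returns an empty list when all three files are in place, and otherwise lists the relative paths that still need to be downloaded.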
The workflow is organized into the following sequential notebooks:
| Step | Notebook | Description |
|---|---|---|
| 1️⃣ | `dataset_cleanup.ipynb` | Loads and cleans both datasets. Adds source labels and merges them. |
| 2️⃣ | `data_preprocessing.ipynb` | Standardizes the data, applies PCA, and balances classes using SMOTE. |
| 3️⃣ | `data_analysis.ipynb` | Visualizes dataset distribution and PCA variance. |
| 4️⃣ | `model_training.ipynb` | Trains and evaluates XGBoost, LightGBM, and MLP models. Hyperparameter tuning with GridSearchCV included. |
| 5️⃣ | `SHAP.ipynb` | Explains XGBoost decisions using SHAP. Includes back-mapping of PCA features. |
| 6️⃣ | `visualization.ipynb` | Creates final ROC curves, metric comparisons, and SHAP vs XGBoost importance plots. |
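The standardize-then-PCA part of step 2 can be sketched with scikit-learn as follows. This is an illustrative sketch on synthetic data, not the code from `data_preprocessing.ipynb`: the function name `fit_scaler_pca`, the random demo matrix, and the choice of 5 components are assumptions made for the example (the project itself keeps 300 components), and the SMOTE balancing step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_scaler_pca(X_train, n_components):
    """Fit StandardScaler followed by PCA on the given feature matrix.

    Hypothetical helper: standardizing first keeps features with large
    numeric ranges from dominating the principal components.
    """
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=0),
    )
    pipeline.fit(X_train)
    return pipeline


# Synthetic stand-in for the hybrid feature matrix (200 samples, 20 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))

reducer = fit_scaler_pca(X_demo, n_components=5)
X_reduced = reducer.transform(X_demo)

# Share of variance captured by each retained component (scree plot input).
explained = reducer.named_steps["pca"].explained_variance_ratio_
```

The `explained` array is the quantity a scree plot would display, and `X_reduced` has one column per retained component.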
- 📦 Create and activate a new Conda environment (recommended): `conda create -n malware_env python=3.10`, then `conda activate malware_env`
- 🛠 Install dependencies from `requirements.txt`: `pip install -r requirements.txt`
- 📂 Download datasets from Kaggle and place them in their respective folders:
  - `ember_v2_2018/` → place `train_ember_2018_v2.parquet`, `test_ember_2018_v2.parquet`
  - `cic_malmem_2022/` → place `Obfuscated-MalMem2022.parquet`
- 🧪 Run notebooks in the following order using Jupyter Notebook:
  1. `dataset_cleanup.ipynb`
  2. `data_preprocessing.ipynb`
  3. `data_analysis.ipynb`
  4. `model_training.ipynb`
  5. `SHAP.ipynb`
  6. `visualization.ipynb`
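If you prefer to run the notebooks non-interactively, the order above could be scripted with `jupyter nbconvert`. The sketch below is an optional alternative and not part of the project; the helper names `nbconvert_command` and `run_all` are hypothetical, and it assumes `jupyter` is on the PATH of the activated environment and that the script is started from the project root.

```python
import subprocess

# Notebook order, as listed above.
NOTEBOOKS = [
    "dataset_cleanup.ipynb",
    "data_preprocessing.ipynb",
    "data_analysis.ipynb",
    "model_training.ipynb",
    "SHAP.ipynb",
    "visualization.ipynb",
]


def nbconvert_command(notebook):
    """Build the command that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(notebooks=NOTEBOOKS):
    """Execute each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        # check=True raises CalledProcessError if a notebook fails,
        # so later notebooks never run on missing intermediate files.
        subprocess.run(nbconvert_command(notebook), check=True)
```

Calling `run_all()` executes the six notebooks sequentially. Stopping at the first failure matters here because each notebook reads files written by the previous one.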
- Class distribution plots (before/after SMOTE)
- PCA Scree plot
- Confusion matrices, ROC curves, model metric tables
- SHAP summary, force plots, and raw feature mappings
- Visual comparison of XGBoost vs SHAP importance
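Because the model is trained on PCA components, SHAP values refer to components rather than raw features. One common heuristic for mapping them back is to weight each component's absolute loadings by that component's mean absolute SHAP value. The sketch below illustrates that heuristic only; it is not necessarily the method used in `SHAP.ipynb`, and the function names and example feature names are hypothetical.

```python
import numpy as np


def raw_feature_scores(components, component_importance):
    """Spread per-component importance back onto raw features.

    components: array of shape (n_components, n_raw_features), e.g. the
        `components_` attribute of a fitted scikit-learn PCA.
    component_importance: array of shape (n_components,), e.g. the mean
        absolute SHAP value of each PCA component.
    Returns one score per raw feature.
    """
    components = np.asarray(components, dtype=float)
    component_importance = np.asarray(component_importance, dtype=float)
    # Each raw feature collects |loading| * importance over all components.
    return np.abs(components).T @ component_importance


def top_raw_features(components, component_importance, feature_names, k=3):
    """Return the k highest-scoring raw features as (name, score) pairs."""
    scores = raw_feature_scores(components, component_importance)
    order = np.argsort(scores)[::-1][:k]
    return [(feature_names[i], float(scores[i])) for i in order]


# Toy example: 2 components over 3 hypothetical raw features.
demo_components = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
demo_importance = [2.0, 1.0]
demo_top = top_raw_features(demo_components, demo_importance, ["a", "b", "c"], k=2)
```

Taking absolute loadings discards sign, so the scores indicate how strongly a raw feature contributes to important components, not the direction of its effect.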
- XGBoost: Optimized via GridSearchCV; best performer
- LightGBM
- Multilayer Perceptron (MLP)
- Python 3.10
- Pandas, NumPy, Scikit-learn
- XGBoost, LightGBM, TensorFlow/Keras
- SHAP, imbalanced-learn, Matplotlib, Seaborn
- Ensure your system has sufficient memory, CPU, and GPU resources: preprocessing, PCA, SMOTE, model training, and SHAP analysis are memory- and compute-intensive.
- Trained model weights and PCA transformation matrices are saved for reproducibility.
- Comments are added throughout each notebook for code readability.
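The saved `.pkl` artifacts (for example `pca_model.pkl`) can be written and reloaded along the following lines. This is a minimal sketch that assumes the files are standard Python pickles; the notebooks may use a different serializer such as joblib, and the helper names `save_artifact` and `load_artifact` are hypothetical.

```python
import pickle
from pathlib import Path


def save_artifact(obj, path):
    """Serialize a fitted object (model, PCA, feature-name list) to disk."""
    with Path(path).open("wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    """Load an object saved by save_artifact.

    Only unpickle files you trust: loading a pickle can execute
    arbitrary code.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

Reloading the fitted PCA object instead of refitting it keeps the component definitions identical across runs, which the SHAP back-mapping depends on.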
- Rudraksh Gupta
- Neel Pankaj Soni
This project was submitted as part of graduate-level coursework and is currently for academic purposes only.