This project presents a hybrid malware classification pipeline that integrates both static features (from EMBER v2) and dynamic features (from CIC-MalMem 2022) to enhance detection accuracy and robustness. It leverages machine learning models with a strong focus on interpretability using SHAP (SHapley Additive exPlanations).

neelsoni26/hybrid-malware-classification

🛡️ Hybrid Malware Classification using Static and Dynamic Features


📁 Dataset Sources

To replicate or test this project, please download the following datasets from Kaggle:

  1. EMBER v2 (2018) - Static PE Features

  2. CIC-MalMem 2022 - Dynamic Malware Behavior

Note: Place all downloaded files under a single project root directory before running any notebooks. The expected layout is shown below.

📂 Project Directory Structure

Note: You only need to supply three things: the ember_v2_2018 folder with the train and test files (train_ember_2018_v2.parquet and test_ember_2018_v2.parquet; the merged file is not required), the cic_malmem_2022 folder with its single file (Obfuscated-MalMem2022.parquet), and the provided .ipynb notebooks. All other files and folders are generated by the code.

FINAL_HYBRID/
├── cic_malmem_2022/                      # Raw CIC-MalMem 2022 dataset
│   └── Obfuscated-MalMem2022.parquet     # CIC-MalMem 2022 dynamic feature file
├── ember_v2_2018/                        # Raw EMBER v2 dataset
│   ├── test_ember_2018_v2.parquet          # Test set of EMBER 2018
│   ├── train_ember_2018_v2.parquet         # Train set of EMBER 2018
│   └── ember_merged.parquet                # Combined EMBER data (train + test)
│
├── dataset_cleanup.ipynb                 # Notebook for dataset loading & merging
├── data_preprocessing.ipynb              # Notebook for scaling, PCA, and SMOTE
├── data_analysis.ipynb                   # Visualizes class distributions and PCA
├── model_training.ipynb                  # Trains XGBoost, LightGBM, MLP models
├── SHAP.ipynb                            # SHAP-based feature interpretability
├── visualization.ipynb                   # Final result visualizations (ROC, SHAP)
│
├── clean_hybrid_dataset.parquet          # Cleaned combined dataset before PCA
├── hybrid_dataset_pca.parquet            # PCA-reduced dataset (300 components)
├── hybrid_dataset_pca_smoted.parquet     # PCA-reduced + SMOTE-balanced dataset
├── hybrid_dataset_reduced.parquet        # (Optional) Alternative reduced dataset
├── hybrid_malware_dataset.csv            # Final CSV snapshot of hybrid data
├── hybrid_malware_dataset.parquet        # Final Parquet version of hybrid data
│
├── train_split.parquet                   # Training data split
├── val_split.parquet                     # Validation data split
├── test_split.parquet                    # Test data split
├── Xy_train.pkl                          # Training features + labels
├── Xy_val.pkl                            # Validation features + labels
├── Xy_test.pkl                           # Test features + labels
│
├── xgb_model.pkl                         # Base XGBoost model
├── xgb_tuned_model.pkl                   # Grid-searched optimized XGBoost
├── xgb_best_model.pkl                    # Best performing XGBoost model
├── xgboost_hybrid_model.json             # Exported JSON model (optional use)
│
├── pca_model.pkl                         # Saved PCA transformation model
├── raw_feature_names.pkl                 # Saved raw feature names pre-PCA
├── shap_feature_importance.csv           # SHAP feature importance values
├── shap_pca_feature_mapping.csv          # Mapping: PCA component ↔️ raw features
│
├── README.md                             # Project documentation (you are here)
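Before running the notebooks, it can help to confirm that the three required input files are where the notebooks expect them. The following is a minimal sketch using only the standard library; the helper name `missing_inputs` and the root folder name passed to it are assumptions for illustration and are not part of the project code.

```python
from pathlib import Path

# Input files named in the directory listing above; everything else is generated.
REQUIRED_INPUTS = [
    "ember_v2_2018/train_ember_2018_v2.parquet",
    "ember_v2_2018/test_ember_2018_v2.parquet",
    "cic_malmem_2022/Obfuscated-MalMem2022.parquet",
]


def missing_inputs(root):
    """Return the required input paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    # "FINAL_HYBRID" is the root folder name from the listing; adjust as needed.
    missing = missing_inputs("FINAL_HYBRID")
    if missing:
        print("Missing input files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All required input files are present.")
```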

🔧 Project Structure

The workflow is organized into the following sequential notebooks:

| Step | Notebook | Description |
|------|----------|-------------|
| 1️⃣ | dataset_cleanup.ipynb | Loads and cleans both datasets. Adds source labels and merges them. |
| 2️⃣ | data_preprocessing.ipynb | Standardizes the data, applies PCA, and balances classes using SMOTE. |
| 3️⃣ | data_analysis.ipynb | Visualizes dataset distribution and PCA variance. |
| 4️⃣ | model_training.ipynb | Trains and evaluates XGBoost, LightGBM, and MLP models. Includes hyperparameter tuning with GridSearchCV. |
| 5️⃣ | SHAP.ipynb | Explains XGBoost decisions using SHAP. Includes back-mapping of PCA features. |
| 6️⃣ | visualization.ipynb | Creates final ROC curves, metric comparisons, and SHAP vs XGBoost importance plots. |
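The standardize-then-PCA part of step 2 can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the notebook's actual code: the array shapes, the random seed, and the choice of 10 components are assumptions made so the toy example runs quickly (the project itself keeps 300 components).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged hybrid feature matrix (assumed shape).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))

# Standardize each feature, then project onto the leading principal components.
# 10 components are used only because the toy data has 40 features.
reducer = make_pipeline(StandardScaler(), PCA(n_components=10, random_state=0))
X_reduced = reducer.fit_transform(X)

# Fraction of total variance retained by the kept components.
explained = float(reducer.named_steps["pca"].explained_variance_ratio_.sum())
print(X_reduced.shape, round(explained, 3))
```

Scaling before PCA matters because PCA is driven by variance: without it, features with large numeric ranges would dominate the components.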

🚀 How to Run the Project (macOS with Miniconda)

  1. 📦 Create and activate a new Conda environment (recommended):

    conda create -n malware_env python=3.10
    conda activate malware_env
  2. 🛠 Install dependencies from requirements.txt:

    pip install -r requirements.txt
  3. 📂 Download datasets from Kaggle and place them in their respective folders:

    • ember_v2_2018/ → place train_ember_2018_v2.parquet, test_ember_2018_v2.parquet
    • cic_malmem_2022/ → place Obfuscated-MalMem2022.parquet
  4. 🧪 Run notebooks in the following order using Jupyter Notebook:

    • dataset_cleanup.ipynb
    • data_preprocessing.ipynb
    • data_analysis.ipynb
    • model_training.ipynb
    • SHAP.ipynb
    • visualization.ipynb

📊 Output Summary

  • Class distribution plots (before/after SMOTE)
  • PCA Scree plot
  • Confusion matrices, ROC curves, model metric tables
  • SHAP summary, force plots, and raw feature mappings
  • Visual comparison of XGBoost vs SHAP importance
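To make the before/after SMOTE class distributions concrete, the sketch below shows the interpolation idea behind SMOTE on synthetic binary data. It is a simplified NumPy illustration, not the imbalanced-learn implementation that the project lists among its libraries; the function name `smote_like_oversample` and all parameters are hypothetical.

```python
from collections import Counter

import numpy as np


def smote_like_oversample(X, y, minority_label, k=5, seed=0):
    """Add synthetic minority samples until both classes are equal in size.

    Each synthetic point lies on the segment between a minority sample and
    one of its k nearest minority neighbours (binary labels assumed).
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Pairwise squared distances among minority samples; exclude self-matches.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for every synthetic point.
    base = rng.integers(0, len(X_min), size=n_needed)
    nbr = neighbors[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[nbr] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.2).astype(int)  # label 1 is the minority class

X_bal, y_bal = smote_like_oversample(X, y, minority_label=1)
print("before:", Counter(y.tolist()))
print("after: ", Counter(y_bal.tolist()))
```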

🧠 Models Used

  • XGBoost: Optimized via GridSearchCV; best performer
  • LightGBM
  • Multilayer Perceptron (MLP)
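The GridSearchCV tuning mentioned for XGBoost follows the usual scikit-learn pattern, sketched below. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in; `xgboost.XGBClassifier` follows the same estimator interface and could be swapped in. The parameter grid, scoring metric, and synthetic data are assumptions, not the values used in model_training.ipynb.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the PCA-reduced dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Deliberately tiny grid so the sketch runs in seconds (assumed values).
param_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
# GridSearchCV.score uses the configured scoring metric (ROC AUC here).
print("validation ROC AUC:", search.score(X_val, y_val))
```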

🧰 Tools & Libraries

  • Python 3.10
  • Pandas, NumPy, Scikit-learn
  • XGBoost, LightGBM, TensorFlow/Keras
  • SHAP, imbalanced-learn, Matplotlib, Seaborn

📌 Notes

  • Ensure your system has sufficient RAM and a capable CPU and GPU: preprocessing, PCA, SMOTE, model training, and SHAP are all resource-intensive.
  • Trained model weights and PCA transformation matrices are saved for reproducibility.
  • Comments are added throughout each notebook for code readability.
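Saving the fitted PCA model alongside the raw feature names is what makes it possible to trace a PCA component back to the raw features that drive it. The sketch below shows one way this could work, using `pickle` and the absolute values of the component loadings; the feature names, the demo file name, and the helper `top_raw_features` are hypothetical, and the project's SHAP.ipynb may use a different mapping.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = [f"feat_{i}" for i in range(8)]  # hypothetical raw feature names
X = rng.normal(size=(100, 8))
X[:, 3] *= 10.0  # one high-variance feature so the first component is easy to read

pca = PCA(n_components=3, random_state=0).fit(X)

# Round-trip the fitted model and the names through a pickle file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pca_demo.pkl"  # demo file name, not the project's
    with path.open("wb") as fh:
        pickle.dump({"pca": pca, "feature_names": feature_names}, fh)
    with path.open("rb") as fh:
        saved = pickle.load(fh)


def top_raw_features(pca_model, names, component, top_n=3):
    """Return the raw features with the largest absolute loading on a component."""
    loadings = np.abs(pca_model.components_[component])
    order = np.argsort(loadings)[::-1][:top_n]
    return [(names[i], float(loadings[i])) for i in order]


top = top_raw_features(saved["pca"], saved["feature_names"], component=0)
print(top)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.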

👥 Authors

  • Rudraksh Gupta
  • Neel Pankaj Soni

📄 License

This project was submitted as part of graduate-level coursework and is currently intended for academic purposes only.

