🛡️ Hybrid Malware Classification using Static and Dynamic Features

This project presents a hybrid malware classification pipeline that integrates both static features (from EMBER v2) and dynamic features (from CIC-MalMem 2022) to enhance detection accuracy and robustness. It leverages machine learning models with a strong focus on interpretability using SHAP (SHapley Additive exPlanations).

📁 Dataset Sources

To replicate or test this project, please download the following datasets from Kaggle:

EMBER v2 (2018) - Static PE Features
- https://www.kaggle.com/datasets/elastic/ember
- Files used: train_ember_2018_v2.parquet, test_ember_2018_v2.parquet
CIC-MalMem 2022 - Dynamic Malware Behavior
- https://www.kaggle.com/datasets/kevingeorge/malware-analysis-dataset
- File used: Obfuscated-MalMem2022.parquet

Note: Store all downloaded files in a root directory before running any scripts. The structure is presented below.

📂 Project Directory Structure

Note: You only need three things to get started: the ember_v2_2018 folder with the train (train_ember_2018_v2.parquet) and test (test_ember_2018_v2.parquet) datasets (not the merged file), the cic_malmem_2022 folder with its single dataset file (Obfuscated-MalMem2022.parquet), and the provided .ipynb notebooks. All other files and folders are generated by the code.

FINAL_HYBRID/
├── cic_malmem_2022/                      # Raw CIC-MalMem 2022 dataset
│   └── Obfuscated-MalMem2022.parquet     # CIC-MalMem 2022 dynamic feature file
├── ember_v2_2018/                        # Raw EMBER v2 dataset
│   ├── test_ember_2018_v2.parquet          # Test set of EMBER 2018
│   ├── train_ember_2018_v2.parquet         # Train set of EMBER 2018
│   └── ember_merged.parquet                # Combined EMBER data (train + test)
│
├── dataset_cleanup.ipynb                 # Notebook for dataset loading & merging
├── data_preprocessing.ipynb              # Notebook for scaling, PCA, and SMOTE
├── data_analysis.ipynb                   # Visualizes class distributions and PCA
├── model_training.ipynb                  # Trains XGBoost, LightGBM, MLP models
├── SHAP.ipynb                            # SHAP-based feature interpretability
├── visualization.ipynb                   # Final result visualizations (ROC, SHAP)
│
├── clean_hybrid_dataset.parquet          # Cleaned combined dataset before PCA
├── hybrid_dataset_pca.parquet            # PCA-reduced dataset (300 components)
├── hybrid_dataset_pca_smoted.parquet     # PCA-reduced + SMOTE-balanced dataset
├── hybrid_dataset_reduced.parquet        # (Optional) Alternative reduced dataset
├── hybrid_malware_dataset.csv            # Final CSV snapshot of hybrid data
├── hybrid_malware_dataset.parquet        # Final Parquet version of hybrid data
│
├── train_split.parquet                   # Training data split
├── val_split.parquet                     # Validation data split
├── test_split.parquet                    # Test data split
├── Xy_train.pkl                          # Training features + labels
├── Xy_val.pkl                            # Validation features + labels
├── Xy_test.pkl                           # Test features + labels
│
├── xgb_model.pkl                         # Base XGBoost model
├── xgb_tuned_model.pkl                   # Grid-searched optimized XGBoost
├── xgb_best_model.pkl                    # Best performing XGBoost model
├── xgboost_hybrid_model.json             # Exported JSON model (optional use)
│
├── pca_model.pkl                         # Saved PCA transformation model
├── raw_feature_names.pkl                 # Saved raw feature names pre-PCA
├── shap_feature_importance.csv           # SHAP feature importance values
├── shap_pca_feature_mapping.csv          # Mapping: PCA component ↔️ raw features
│
└── README.md                             # Project documentation (you are here)
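
Before running any notebook, it can help to confirm that the raw inputs from the tree above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and the paths are the three raw files named in this README, resolved relative to the project root.

```python
from pathlib import Path

# Raw inputs named in this README; every other file in the tree is generated.
REQUIRED_FILES = [
    "ember_v2_2018/train_ember_2018_v2.parquet",
    "ember_v2_2018/test_ember_2018_v2.parquet",
    "cic_malmem_2022/Obfuscated-MalMem2022.parquet",
]


def missing_inputs(root: str = ".") -> list[str]:
    """Return the required raw files that are not present under `root`."""
    base = Path(root)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing raw inputs:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All raw inputs found.")
```

Run it from the FINAL_HYBRID/ directory; an empty result means the notebooks should find their inputs.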

🔧 Project Structure

The workflow is organized into the following sequential notebooks:

| Step | Notebook | Description |
| --- | --- | --- |
| 1️⃣ | `dataset_cleanup.ipynb` | Loads and cleans both datasets. Adds source labels and merges them. |
| 2️⃣ | `data_preprocessing.ipynb` | Standardizes the data, applies PCA, and balances classes using SMOTE. |
| 3️⃣ | `data_analysis.ipynb` | Visualizes dataset distribution and PCA variance. |
| 4️⃣ | `model_training.ipynb` | Trains and evaluates XGBoost, LightGBM, and MLP models. Hyperparameter tuning with GridSearchCV included. |
| 5️⃣ | `SHAP.ipynb` | Explains XGBoost decisions using SHAP. Includes back-mapping of PCA features. |
| 6️⃣ | `visualization.ipynb` | Creates final ROC curves, metric comparisons, and SHAP vs XGBoost importance plots. |
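
The standardize-then-PCA part of step 2 can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the array sizes, the 10 components (the project uses 300), and the choice to fit the scaler and PCA on the training split only are all assumptions made for the example. SMOTE is left out to keep dependencies light; with imbalanced-learn it would typically be applied to the training split afterwards via `SMOTE().fit_resample(X, y)`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged hybrid table (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = (rng.random(500) < 0.2).astype(int)  # imbalanced binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler and PCA on the training split only, then reuse them
# on the test split so no test statistics leak into the transform.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10, random_state=0).fit(scaler.transform(X_train))

Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

explained = float(pca.explained_variance_ratio_.sum())
print("train/test shapes:", Z_train.shape, Z_test.shape)
print("fraction of variance kept:", round(explained, 3))
```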

🚀 How to Run the Project (macOS with Miniconda)

📦 Create and activate a new Conda environment (recommended):

conda create -n malware_env python=3.10
conda activate malware_env

🛠 Install dependencies from requirements.txt:
```
pip install -r requirements.txt
```
📂 Download datasets from Kaggle and place them in their respective folders:
- ember_v2_2018/ → place train_ember_2018_v2.parquet, test_ember_2018_v2.parquet
- cic_malmem_2022/ → place Obfuscated-MalMem2022.parquet
🧪 Run notebooks in the following order using Jupyter Notebook:
- dataset_cleanup.ipynb
- data_preprocessing.ipynb
- data_analysis.ipynb
- model_training.ipynb
- SHAP.ipynb
- visualization.ipynb
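
If you prefer to run the notebooks non-interactively, the order above can be scripted. The sketch below is an optional convenience and not part of the repository; it assumes `jupyter nbconvert` is installed in the active environment, and the function names are hypothetical. By default it only prints the commands (a dry run).

```python
import subprocess

# Execution order taken from the list above.
NOTEBOOKS = [
    "dataset_cleanup.ipynb",
    "data_preprocessing.ipynb",
    "data_analysis.ipynb",
    "model_training.ipynb",
    "SHAP.ipynb",
    "visualization.ipynb",
]


def build_command(notebook: str) -> list[str]:
    """Command that executes one notebook and saves its outputs in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]


def run_pipeline(execute: bool = False) -> None:
    """Print each command; run it only when `execute` is True."""
    for notebook in NOTEBOOKS:
        cmd = build_command(notebook)
        print(" ".join(cmd))
        if execute:
            # check=True stops the pipeline at the first failing notebook.
            subprocess.run(cmd, check=True)


run_pipeline(execute=False)  # dry run: only prints the commands
```

Call `run_pipeline(execute=True)` from the project root to actually execute the notebooks one after another.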

📊 Output Summary

- Class distribution plots (before/after SMOTE)
- PCA Scree plot
- Confusion matrices, ROC curves, model metric tables
- SHAP summary, force plots, and raw feature mappings
- Visual comparison of XGBoost vs SHAP importance
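
Because the models are trained on PCA components, an importance score per component has to be mapped back to raw features to be readable. One possible way to do this is sketched below; it is an assumption for illustration, not necessarily the method used in `SHAP.ipynb`. Each component's importance (for example its mean absolute SHAP value) is split across raw features in proportion to the absolute PCA loadings. The function name, the toy loadings, and the feature names are hypothetical.

```python
import numpy as np


def map_components_to_features(component_importance, loadings, feature_names, top_k=5):
    """Spread per-component importance onto raw features.

    component_importance: shape (n_components,), e.g. mean |SHAP| per component.
    loadings: shape (n_components, n_features), e.g. a fitted PCA's components_.
    """
    imp = np.asarray(component_importance, dtype=float)
    weights = np.abs(np.asarray(loadings, dtype=float))
    # Normalize each row so a component's importance is split, not inflated.
    weights = weights / weights.sum(axis=1, keepdims=True)
    raw_scores = imp @ weights  # shape (n_features,)
    order = np.argsort(raw_scores)[::-1][:top_k]
    return [(feature_names[i], float(raw_scores[i])) for i in order]


# Toy example: 2 components, 3 hypothetical raw features.
loadings = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.5, 0.5]])
importance = np.array([2.0, 1.0])
names = ["f_a", "f_b", "f_c"]

ranking = map_components_to_features(importance, loadings, names, top_k=3)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With this weighting the raw-feature scores sum to the total component importance, which makes the two rankings comparable. Other schemes, such as squared loadings, are equally defensible.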

🧠 Models Used

- XGBoost: Optimized via GridSearchCV; best performer
- LightGBM
- Multilayer Perceptron (MLP)
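
The grid-search tuning mentioned for XGBoost follows the usual scikit-learn pattern, sketched below. To keep the example free of extra dependencies it uses scikit-learn's `GradientBoostingClassifier` as a stand-in rather than XGBoost itself; the synthetic data and the parameter grid are assumptions for illustration and are not the values used in `model_training.ipynb`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic, imbalanced binary problem (not the project data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; a real search would cover more values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for an XGBoost model
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# With scoring="roc_auc", score() reports ROC AUC on the held-out split.
test_auc = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("held-out ROC AUC:", round(test_auc, 3))
```

An estimator that follows the scikit-learn interface, such as `xgboost.XGBClassifier`, can be passed to `GridSearchCV` in the same way.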

🧰 Tools & Libraries

Python 3.10
Pandas, NumPy, Scikit-learn
XGBoost, LightGBM, TensorFlow/Keras
SHAP, imbalanced-learn, Matplotlib, Seaborn

📌 Notes

- Ensure your system has sufficient RAM and a capable CPU and GPU: preprocessing, PCA, SMOTE, model training, and SHAP are all resource-intensive.
- Trained model weights and PCA transformation matrices are saved for reproducibility.
- Comments are added throughout each notebook for code readability.
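
As an illustration of how a saved transformation supports reproducibility, the sketch below pickles a fitted PCA and checks that the restored object produces the same projections. It assumes the `.pkl` files are written with Python's standard `pickle` module, which this README does not state; the data is synthetic and the file is written to a temporary directory rather than the project folder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data standing in for the scaled hybrid features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
pca = PCA(n_components=3, random_state=0).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pca_model.pkl"
    with path.open("wb") as fh:
        pickle.dump(pca, fh)       # save the fitted transform
    with path.open("rb") as fh:
        restored = pickle.load(fh)  # reload it, as a later notebook would

same = bool(np.allclose(pca.transform(X), restored.transform(X)))
print("restored PCA gives identical projections:", same)
```

Pickled scikit-learn objects are best reloaded with the same library versions that created them, which is one reason to install from requirements.txt.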

👥 Authors

Rudraksh Gupta
Neel Pankaj Soni

📄 License

This project was submitted as part of graduate-level coursework and is currently for academic purposes only.

