Code for SLB: Deep Learning from Imperfectly Labeled Malware Data. CCS 2025

ICL-ml4csec/SLB

SLB: Deep Learning from Imperfectly Labeled Malware Data

SLB is a framework for correcting noisy labels in malware datasets. It has two components:

  • Data Split — partition the dataset into a clean subset and a noisy subset.
  • Continuous Revision — revise labels during training.

Output: a robust classifier and a revised dataset.
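In code, the two stages can be sketched roughly as follows. This is a minimal illustration, not the paper's exact algorithm: `data_split` uses a generic small-loss heuristic, and `continuous_revision` flips a label only when the model confidently disagrees with it. The actual criteria live in SLB.py.

```python
import numpy as np

def data_split(losses, keep_ratio=0.5):
    """Split sample indices into 'clean' and 'noisy' subsets by
    per-sample loss: low-loss samples are treated as clean
    (a common small-loss heuristic; SLB's criterion may differ)."""
    order = np.argsort(losses)                  # ascending loss
    n_clean = int(len(losses) * keep_ratio)
    return order[:n_clean], order[n_clean:]     # clean idx, noisy idx

def continuous_revision(labels, probs, conf_thres=0.9):
    """Revise a label when the model confidently disagrees with it.
    `probs` is an (N, C) array of predicted class probabilities."""
    revised = labels.copy()
    pred = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= conf_thres
    flip = confident & (pred != labels)
    revised[flip] = pred[flip]
    return revised
```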


Main Contribution

  • The core implementation is in SLB.py.

Supporting Files

  • Eval.py — reports and saves evaluation results
  • LoadData.py — loads and prepares datasets
  • model.py — defines the DNN models
  • NoiseFunction.py — injects random label noise into datasets
  • ImageModels.py — applies SLB to large image models on the Virus-MNIST (VM) dataset
  • ML.py — trains ML models using SLB-corrected labels

Prerequisites & Requirements

Reference environment (used for our experiments):
Ubuntu 22.04.5 LTS; AMD EPYC 7502P (32 cores); 128 GB RAM; 4× NVIDIA A40 (48 GB)
Python 3.8, PyTorch 2.1.2, CUDA 11.8

Other setups may work, but the above is what we validated.

Installation. Clone the repo and install dependencies in a fresh environment.

Conda (recommended):

conda create -n slb python=3.8 -y
conda activate slb
pip install -r requirements.txt

Contents of 'requirements.txt':

numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tqdm==4.66.5
torch==2.1.2
torchvision==0.16.2
pillow==10.4.0
matplotlib==3.7.3
joblib==1.4.2
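To sanity-check an existing environment against these pins before running anything, a small helper along these lines can compare installed versions (the helper names are ours; `importlib.metadata` requires Python 3.8+):

```python
from importlib import metadata

def parse_pins(text):
    """Parse 'name==version' requirement lines into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins

def check_pins(pins):
    """Return (name, expected, installed) tuples for packages that are
    missing or whose installed version differs from the pin."""
    issues = []
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            issues.append((name, expected, installed))
    return issues
```

For example, `check_pins(parse_pins(open("requirements.txt").read()))` returns an empty list when the environment matches the pins exactly.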

Data

Processed datasets

Download the exact processed datasets used in our experiments from:

https://zenodo.org/records/17281377

After downloading, extract everything into the SLB/ directory at the project root, so the datasets sit under SLB/data/ as shown below.

Expected layout:

project-root/
└─ SLB/data
   ├─ Android/
   ├─ continous/
   ├─ IDS17/
   ├─ LIDS17/
   ├─ virusnet/
   └─ Windows_PE/

These are pre-processed and ready to use with our codebase.

Original (raw) sources

If you prefer to obtain the datasets from their original sources and handle preprocessing yourself, please refer to the following works:

  • Malware PE (PE)
    Xian Wu, Wenbo Guo, Jia Yan, Baris Coskun, Xinyu Xing. From Grim Reality to Practical Solution: Malware Classification in Real-World Noise. IEEE S&P 2023, pp. 2602–2619. DOI: 10.1109/SP46215.2023.10179453

  • VirusShare 2018 (VS18)
    Tomás Concepción Miranda, Pierre-François Gimenez, Jean-François Lalande, Valérie Viet Triem Tong, Pierre Wilke. Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased? IEEE TIFS 17 (2022), 2182–2197. DOI: 10.1109/TIFS.2022.3180184

  • APIGraph (AG)
    Yizheng Chen, Zhoujie Ding, David A. Wagner. Continuous Learning for Android Malware Detection. In USENIX Security 2023, pp. 1127–1144.

  • CICIDS2017 (IDS17)
    Lisa Liu, Gints Engelen, Timothy M. Lynar, Daryl Essam, Wouter Joosen. Error Prevalence in NIDS Datasets: A Case Study on CIC-IDS-2017 and CSE-CIC-IDS-2018. In IEEE CNS 2022, pp. 254–262. DOI: 10.1109/CNS56114.2022.9947235

  • Virus-MNIST (VM)
    David A. Noever, Samantha E. Miller Noever. Virus-MNIST: A Benchmark Malware Dataset. arXiv: 2103.00602 (2021).


Re-running experiments in the paper

Shell scripts in ./Scripts are provided to re-run our experiments:

  • HPSearch.sh: extensive hyperparameter search across all datasets

  • RandomAG.sh, RandomMini-IDS.sh, RandomPE.sh, RandomVS18.sh: random noise experiments (§4.3.1) and clean experiments (§4.3.3)

  • RealNoise.sh: real noise experiments (§4.3.2)

  • Correction.sh: label accuracy experiments (§4.4.1)

  • ML.sh: training ML models with SLB-corrected labels (§4.4.2)

  • ImageModels.sh: SLB on image models (§4.5.1)

  • LargeAG.sh, LargeIDS.sh: SLB on large datasets and models (§4.5.2)

Run, e.g.:

bash RandomPE.sh

You can run SLB directly from the terminal. Example:

python SLB.py --noise_type random --noise_rate 0.1 --exp Random --dataset windows_pe --n_epoch 140 --con_thres 1.0 --init_num 15 --flip 1 --beta 0.9999 --model_size small

Key parameters:

  • n_epoch 140: T in Algorithm 2
  • con_thres 1.0: r in Equation 4
  • init_num 15: e in Equation 2
  • flip 1: m in Equation 10
  • beta 0.9999: β in Equation 6
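As a worked example of the role of β: a value this close to 1 is typical of class-balanced weighting by the "effective number of samples", where a class with n_c examples gets weight proportional to (1 − β)/(1 − β^{n_c}). The sketch below illustrates that behaviour; treat the exact form of Equation 6 as defined in the paper, not here.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.9999):
    """Per-class weights via the 'effective number of samples':
    w_c ∝ (1 - beta) / (1 - beta**n_c), normalised to sum to the
    number of classes. Illustrates how a beta close to 1 up-weights
    rare classes; the paper's Equation 6 may differ in detail."""
    counts = np.asarray(counts, dtype=float)
    eff = (1.0 - beta ** counts) / (1.0 - beta)   # effective sample counts
    w = 1.0 / eff
    return w * len(counts) / w.sum()
```

With counts [10, 1000] and β = 0.9999, the rare class receives a much larger weight than the frequent one.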

Citation

@inproceedings{alotaibi25imperfect,
  title        = {{Deep Learning from Imperfectly Labelled Malware Data}},
  author       = {Alotaibi, Fahad and Goodbrand, Euan and Maffeis, Sergio},
  booktitle    = {Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25)},
  year         = {2025},
  month        = oct,
  address      = {Taipei, Taiwan},
  organization = {Association for Computing Machinery},
  isbn         = {979-8-4007-1525-9/2025/10},
  doi          = {10.1145/3719027.3765197},
  url          = {https://doi.org/10.1145/3719027.3765197}
}

Contact

If you have any questions or need further assistance, please feel free to reach out:

  • Email: f.alotaibi21@imperial.ac.uk
  • Alternate Email: fahadalkarshmi@gmail.com
