Code for SLB: Deep Learning from Imperfectly Labeled Malware Data. CCS 2025

ICL-ml4csec/SLB

SLB: Deep Learning from Imperfectly Labeled Malware Data

SLB is a framework for correcting noisy labels in malware datasets. It has two components:

  • Data Split — partition the dataset into a clean subset and a noisy subset.
  • Continuous Revision — revise labels during training.

Output: a robust classifier and a revised dataset.
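In code, the two stages can be sketched roughly as follows. This is a minimal illustration, not the paper's exact algorithm: `data_split` uses a generic small-loss heuristic, and `continuous_revision` flips a label only when the model confidently disagrees with it. The actual criteria live in SLB.py.

```python
import numpy as np

def data_split(losses, keep_ratio=0.5):
    """Split sample indices into 'clean' and 'noisy' subsets by
    per-sample loss: low-loss samples are treated as clean
    (a common small-loss heuristic; SLB's criterion may differ)."""
    order = np.argsort(losses)                  # ascending loss
    n_clean = int(len(losses) * keep_ratio)
    return order[:n_clean], order[n_clean:]     # clean idx, noisy idx

def continuous_revision(labels, probs, conf_thres=0.9):
    """Revise a label when the model confidently disagrees with it.
    `probs` is an (N, C) array of predicted class probabilities."""
    revised = labels.copy()
    pred = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= conf_thres
    flip = confident & (pred != labels)
    revised[flip] = pred[flip]
    return revised
```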


Main Contribution

  • The core implementation is in SLB.py.

Supporting Files

  • Eval.py — reports and saves evaluation results
  • LoadData.py — loads and prepares datasets
  • model.py — defines the DNN models
  • NoiseFunction.py — injects random label noise into datasets
  • ImageModels.py — applies SLB to large image models on the Virus-MNIST (VM) dataset
  • ML.py — trains ML models using SLB-corrected labels

Prerequisites & Requirements

Reference environment (used for our experiments):
Ubuntu 22.04.5 LTS; AMD EPYC 7502P (32 cores); 128 GB RAM; 4× NVIDIA A40 (48 GB)
Python 3.8, PyTorch 2.1.2, CUDA 11.8

Other setups may work, but the above is what we validated.

Installation. Clone the repo and install dependencies in a fresh environment.

Conda (recommended):

conda create -n slb python=3.8 -y
conda activate slb
pip install -r requirements.txt

Contents of 'requirements.txt':

numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tqdm==4.66.5
torch==2.1.2
torchvision==0.16.2
pillow==10.4.0
matplotlib==3.7.3
joblib==1.4.2
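To sanity-check an existing environment against these pins before running anything, a small helper along these lines can compare installed versions (the helper names are ours; `importlib.metadata` requires Python 3.8+):

```python
from importlib import metadata

def parse_pins(text):
    """Parse 'name==version' requirement lines into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins

def check_pins(pins):
    """Return (name, expected, installed) tuples for packages that are
    missing or whose installed version differs from the pin."""
    issues = []
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            issues.append((name, expected, installed))
    return issues
```

For example, `check_pins(parse_pins(open("requirements.txt").read()))` returns an empty list when the environment matches the pins exactly.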

Data

Processed datasets

Download the exact processed datasets used in our experiments from:

https://zenodo.org/records/17281377

After downloading, extract everything into the SLB/ directory at the project root, so the datasets sit under SLB/data/ as shown below.

Expected layout:

project-root/
└─ SLB/data
   ├─ Android/
   ├─ continous/
   ├─ IDS17/
   ├─ LIDS17/
   ├─ virusnet/
   └─ Windows_PE/

These are pre-processed and ready to use with our codebase.

Original (raw) sources

If you prefer to obtain the datasets from their original sources and handle preprocessing yourself, please refer to the following works:

  • Malware PE (PE)
    Xian Wu, Wenbo Guo, Jia Yan, Baris Coskun, Xinyu Xing. From Grim Reality to Practical Solution: Malware Classification in Real-World Noise. IEEE S&P 2023, pp. 2602–2619. DOI: 10.1109/SP46215.2023.10179453

  • VirusShare 2018 (VS18)
    Tomás Concepción Miranda, Pierre-François Gimenez, Jean-François Lalande, Valérie Viet Triem Tong, Pierre Wilke. Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased? IEEE TIFS 17 (2022), 2182–2197. DOI: 10.1109/TIFS.2022.3180184

  • APIGraph (AG)
    Yizheng Chen, Zhoujie Ding, David A. Wagner. Continuous Learning for Android Malware Detection. In USENIX Security 2023, pp. 1127–1144.

  • CICIDS2017 (IDS17)
    Lisa Liu, Gints Engelen, Timothy M. Lynar, Daryl Essam, Wouter Joosen. Error Prevalence in NIDS Datasets: A Case Study on CIC-IDS-2017 and CSE-CIC-IDS-2018. In IEEE CNS 2022, pp. 254–262. DOI: 10.1109/CNS56114.2022.9947235

  • Virus-MNIST (VM)
    David A. Noever, Samantha E. Miller Noever. Virus-MNIST: A Benchmark Malware Dataset. arXiv: 2103.00602 (2021).


Re-running experiments in the paper

Shell scripts in ./Scripts are provided to re-run our experiments:

  • HPSearch.sh: extensive hyperparameter search across all datasets

  • RandomAG.sh, RandomMini-IDS.sh, RandomPE.sh, RandomVS18.sh: random noise experiments (§4.3.1) and clean experiments (§4.3.3)

  • RealNoise.sh: real noise experiments (§4.3.2)

  • Correction.sh: label accuracy experiments (§4.4.1)

  • ML.sh: training ML models with SLB-corrected labels (§4.4.2)

  • ImageModels.sh: SLB on image models (§4.5.1)

  • LargeAG.sh, LargeIDS.sh: SLB on large datasets and models (§4.5.2)

Run, e.g.:

bash RandomPE.sh

You can run SLB directly from the terminal. Example:

python SLB.py --noise_type random --noise_rate 0.1 --exp Random --dataset windows_pe --n_epoch 140 --con_thres 1.0 --init_num 15 --flip 1 --beta 0.9999 --model_size small

Key parameters:

  • n_epoch 140: T in Algorithm 2
  • con_thres 1.0: r in Equation 4
  • init_num 15: e in Equation 2
  • flip 1: m in Equation 10
  • beta 0.9999: β in Equation 6
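As a worked example of the role of β: a value this close to 1 is typical of class-balanced weighting by the "effective number of samples", where a class with n_c examples gets weight proportional to (1 − β)/(1 − β^{n_c}). The sketch below illustrates that behaviour; treat the exact form of Equation 6 as defined in the paper, not here.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.9999):
    """Per-class weights via the 'effective number of samples':
    w_c ∝ (1 - beta) / (1 - beta**n_c), normalised to sum to the
    number of classes. Illustrates how a beta close to 1 up-weights
    rare classes; the paper's Equation 6 may differ in detail."""
    counts = np.asarray(counts, dtype=float)
    eff = (1.0 - beta ** counts) / (1.0 - beta)   # effective sample counts
    w = 1.0 / eff
    return w * len(counts) / w.sum()
```

With counts [10, 1000] and β = 0.9999, the rare class receives a much larger weight than the frequent one.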

Citation

@inproceedings{alotaibi25imperfect,
  title        = {{Deep Learning from Imperfectly Labelled Malware Data}},
  author       = {Alotaibi, Fahad and Goodbrand, Euan and Maffeis, Sergio},
  booktitle    = {Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25)},
  year         = {2025},
  month        = oct,
  address      = {Taipei, Taiwan},
  organization = {Association for Computing Machinery},
  isbn         = {979-8-4007-1525-9/2025/10},
  doi          = {10.1145/3719027.3765197},
  url          = {https://doi.org/10.1145/3719027.3765197}
}

Contact

If you have any questions or need further assistance, please feel free to reach out:

  • Email: f.alotaibi21@imperial.ac.uk
  • Alternate Email: fahadalkarshmi@gmail.com
