SLB is a framework for correcting noisy labels in malware datasets. It has two components:
- Data Split — partition the dataset into a clean subset and a noisy subset.
- Continuous Revision — revise labels during training.
Output: a robust classifier and a revised dataset.
- Main Contribution
- Supporting Files
- Prerequisites & Requirements
- Data
- Re-running paper experiments
- Citation
- Contact
- The core implementation is in SLB.py.
- Eval.py: reports and saves evaluation results
- LoadData.py: loads and prepares datasets
- model.py: defines the DNN models
- NoiseFunction.py: generates random noise
- ImageModels.py: applies SLB to large image models on the Virus-MNIST (VM) dataset
- ML.py: trains ML models using SLB-corrected labels
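To make "random noise" concrete, here is a minimal sketch of symmetric label noise. It is an illustration only and not the code in NoiseFunction.py: the function name `inject_symmetric_noise`, its signature, and the choice to always flip to a different class are assumptions made for this example.

```python
import numpy as np


def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a different, uniformly chosen class.

    Hypothetical helper for illustration; the repository's NoiseFunction.py
    may define random noise differently.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    # Pick distinct sample indices whose labels will be corrupted.
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx


clean = np.array([0, 1] * 50)  # toy binary labels (benign / malware)
noisy, flipped = inject_symmetric_noise(clean, noise_rate=0.1, num_classes=2)
print("fraction of labels changed:", (noisy != clean).mean())
```

Keeping the flipped indices around makes it possible to measure afterwards how many corrupted labels a correction method restores.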
Reference environment (used for our experiments):
Ubuntu 22.04.5 LTS; AMD EPYC 7502P (32 cores); 128 GB RAM; 4× NVIDIA A40 (48 GB)
Python 3.8; PyTorch 2.1.2; CUDA 11.8
Other setups may work, but the above is what we validated.
Installation. Clone the repo and install dependencies in a fresh environment.
Conda (recommended):
conda create -n slb python=3.8 -y
conda activate slb
pip install -r requirements.txt
Contents of requirements.txt:
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tqdm==4.66.5
torch==2.1.2
torchvision==0.16.2
pillow==10.4.0
matplotlib==3.7.3
joblib==1.4.2
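After installing, it can help to confirm that the active environment matches these pins. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `find_mismatches` is hypothetical and not part of the repository. Local build suffixes such as `+cu118` on the installed torch version are ignored when comparing.

```python
from importlib import metadata

# Versions copied from requirements.txt above.
PINNED = {
    "numpy": "1.24.3",
    "pandas": "2.0.3",
    "scikit-learn": "1.3.0",
    "tqdm": "4.66.5",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "pillow": "10.4.0",
    "matplotlib": "3.7.3",
    "joblib": "1.4.2",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip local build tags such as "+cu118" before comparing.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

No output means every pinned package is present at the expected version.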
Download the exact processed datasets used in our experiments from:
https://zenodo.org/records/17281377
After downloading, extract everything into the SLB/ directory at the project root.
Expected layout:
project-root/
└─ SLB/data/
   ├─ Android/
   ├─ continous/
   ├─ IDS17/
   ├─ LIDS17/
   ├─ virusnet/
   └─ Windows_PE/
These are pre-processed and ready to use with our codebase.
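A quick way to confirm the extraction landed in the right place is to check for the folders listed above. This is a small sketch, assuming the layout shown (SLB/data under the project root) and that it is run from the project root; `missing_dataset_dirs` is a hypothetical helper, not part of the codebase.

```python
from pathlib import Path

# Folder names exactly as they appear in the expected layout.
EXPECTED_DIRS = ["Android", "continous", "IDS17", "LIDS17", "virusnet", "Windows_PE"]


def missing_dataset_dirs(project_root, expected=EXPECTED_DIRS):
    """Return the expected dataset folders that are absent under SLB/data."""
    data_dir = Path(project_root) / "SLB" / "data"
    return [name for name in expected if not (data_dir / name).is_dir()]


missing = missing_dataset_dirs(".")
if missing:
    print(f"Missing under SLB/data: {missing}")
else:
    print("All dataset folders found.")
```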
If you prefer to obtain the datasets from their original sources and handle preprocessing yourself, please refer to the following works:
- Malware PE (PE): Xian Wu, Wenbo Guo, Jia Yan, Baris Coskun, Xinyu Xing. From Grim Reality to Practical Solution: Malware Classification in Real-World Noise. IEEE S&P 2023, pp. 2602–2619. DOI: 10.1109/SP46215.2023.10179453
- VirusShare 2018 (VS18): Tomás Concepción Miranda, Pierre-François Gimenez, Jean-François Lalande, Valérie Viet Triem Tong, Pierre Wilke. Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased? IEEE TIFS 17 (2022), pp. 2182–2197. DOI: 10.1109/TIFS.2022.3180184
- APIGraph (AG): Yizheng Chen, Zhoujie Ding, David A. Wagner. Continuous Learning for Android Malware Detection. USENIX Security 2023, pp. 1127–1144.
- CICIDS2017 (IDS17): Lisa Liu, Gints Engelen, Timothy M. Lynar, Daryl Essam, Wouter Joosen. Error Prevalence in NIDS Datasets: A Case Study on CIC-IDS-2017 and CSE-CIC-IDS-2018. IEEE CNS 2022, pp. 254–262. DOI: 10.1109/CNS56114.2022.9947235
- Virus-MNIST (VM): David A. Noever, Samantha E. Miller Noever. Virus-MNIST: A Benchmark Malware Dataset. arXiv: 2103.00602 (2021).
Shell scripts in ./Scripts are provided to re-run our experiments:
- HPSearch.sh: extensive hyperparameter search across all datasets
- RandomAG.sh, RandomMini-IDS.sh, RandomPE.sh, RandomVS18.sh: random noise experiments (§4.3.1) and clean experiments (§4.3.3)
- RealNoise.sh: real noise experiments (§4.3.2)
- Correction.sh: label accuracy experiments (§4.4.1)
- ML.sh: training ML models with SLB-corrected labels (§4.4.2)
- ImageModels.sh: SLB on image models (§4.5.1)
- LargeAG.sh, LargeIDS.sh: SLB on large datasets and models (§4.5.2)
Run, e.g.:
bash RandomPE.sh
You can run SLB directly from the terminal. Example:
python SLB.py --noise_type random --noise_rate 0.1 --exp Random --dataset windows_pe --n_epoch 140 --con_thres 1.0 --init_num 15 --flip 1 --beta 0.9999 --model_size small
Key parameters (symbols refer to the paper):
- --n_epoch 140: T in Algorithm 2
- --con_thres 1.0: r in Equation 4
- --init_num 15: e in Equation 2
- --flip 1: m in Equation 10
- --beta 0.9999: β in Equation 6
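To repeat the example run with different settings, the command line can be assembled programmatically. The sketch below only reuses the flags and values from the example command; the helper `build_command` and the extra noise rates (0.2, 0.3) are assumptions for illustration, not settings taken from the paper or the provided scripts.

```python
import shlex

# Flag values copied from the example command above.
BASE_ARGS = {
    "noise_type": "random",
    "exp": "Random",
    "dataset": "windows_pe",
    "n_epoch": 140,
    "con_thres": 1.0,
    "init_num": 15,
    "flip": 1,
    "beta": 0.9999,
    "model_size": "small",
}


def build_command(noise_rate, overrides=None):
    """Return the argument list for one SLB.py run (hypothetical helper)."""
    args = {**BASE_ARGS, "noise_rate": noise_rate, **(overrides or {})}
    cmd = ["python", "SLB.py"]
    for key, value in args.items():
        cmd += [f"--{key}", str(value)]
    return cmd


# Print one command per noise rate; the rates beyond 0.1 are illustrative.
for rate in (0.1, 0.2, 0.3):
    print(shlex.join(build_command(rate)))
```

Each printed line can be pasted into a terminal, or the list returned by `build_command` can be passed to `subprocess.run(cmd, check=True)` to launch the run directly.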
@inproceedings{alotaibi25imperfect,
title = {{Deep Learning from Imperfectly Labelled Malware Data}},
author = {Alotaibi, Fahad and Goodbrand, Euan and Maffeis, Sergio},
booktitle = {Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25)},
year = {2025},
month = oct,
address = {Taipei, Taiwan},
organization = {Association for Computing Machinery},
isbn = {979-8-4007-1525-9/2025/10},
doi = {10.1145/3719027.3765197},
url = {https://doi.org/10.1145/3719027.3765197}
}
If you have any questions or need further assistance, please feel free to reach out to me at any time:
- Email: f.alotaibi21@imperial.ac.uk
- Alternate Email: fahadalkarshmi@gmail.com