Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark
WSADBench is a comprehensive benchmark for weakly-supervised anomaly detection, supporting multiple data modalities including tabular data (classical, CV features, NLP embeddings), video data, and inexact supervision (MIL bags).
- Key Features
- Installation
- Quick Start
- Data Preparation
- Supported Models
- Project Structure
- Advanced Usage
- Citation
- License
- Acknowledgments
- Multi-Modal Support: Tabular (classical, CV features, NLP embeddings), Video, and MIL bags
- 30+ Baseline Models: Weak supervision, semi-supervised, and unsupervised methods
- Flexible Supervision Settings: Configurable labeled anomaly ratios (RLA), labeled normal ratios (ELN), unlabeled ratios, and label noise
- Parallel Execution: Multi-GPU support with automatic GPU assignment
- Reproducible Experiments: Built-in result logging, resume capability, and statistical reporting
- Python 3.9+
- CUDA 11.8+ (for GPU support)
```bash
# Clone the repository
git clone https://github.com/your-org/WSADBench.git
cd WSADBench

# Create conda environment
conda create -n wsad python=3.9 -y
conda activate wsad

# Install PyTorch (adjust CUDA version as needed)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
pip install -r requirements.txt
pip install pytorchvideo opencv-python
```

Alternatively, use the provided setup script:

```bash
bash setup.sh
```

```bash
# Run a single model on classical tabular datasets
python run_experiment.py --data_type tabular_classical --models DevNet --rla_list 1.0
```
```bash
# Run multiple models with different labeled anomaly ratios
python run_experiment.py \
    --data_type tabular_classical \
    --models DeepSAD DevNet FEAWAD \
    --rla_list 0.01 0.05 0.1 0.5 1.0 \
    --n_jobs 4

# Run with custom seeds
python run_experiment.py \
    --data_type tabular_classical \
    --models DevNet \
    --seed_list 1 2 3 4 5
```
```bash
# Run Incomplete (rla/nla/unlabel) experiments
python run_experiment.py --data_type tabular_classical --models DevNet --rla_list 0.01 0.05 0.1 0.25 0.5 1.0 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.0 --flip_ar_list 0.0 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 7 --target_for_unlabeled fill_unlabel_0 --exp_note incomplete_rla
python run_experiment.py --data_type tabular_CV_by_ViT --models DeepSAD --rla_list 1 3 5 10 15 20 50 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.0 --flip_ar_list 0.0 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 6 --target_for_unlabeled fill_unlabel_0 --exp_note incomplete_nla
python run_experiment.py --data_type tabular_NLP_by_RoBERTa --models REPEN --rla_list 1 10 20 50 --eln_list 0.0 --ru_list 20 50 200 1000 --flip_nr_list 0.0 --flip_ar_list 0.0 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 1 --target_for_unlabeled fill_unlabel_0 --exp_note unlabel_nlanu

# Run Inaccurate (fnr/far/double) experiments
python run_experiment.py --data_type tabular_classical --models RoSAS --rla_list 1.0 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.01 0.05 0.1 0.25 0.5 --flip_ar_list 0.0 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 6 --target_for_unlabeled fill_unlabel_0 --noise_type label_contamination --is_cleanlab false --exp_note inaccurate_fnr
python run_experiment.py --data_type tabular_classical --models RoSAS --rla_list 1.0 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.0 --flip_ar_list 0.01 0.05 0.1 0.25 0.5 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 5 --target_for_unlabeled fill_unlabel_0 --noise_type label_contamination --is_cleanlab false --exp_note inaccurate_far
python run_experiment.py --data_type tabular_classical --models DevNet --rla_list 1.0 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.01 0.05 0.1 0.25 0.5 --flip_ar_list 0.01 0.05 0.1 0.25 0.5 --seed_list 0 1 2 3 4 --n_jobs 3 --gpus 3 --target_for_unlabeled fill_unlabel_0 --noise_type label_contamination --is_cleanlab false --exp_note inaccurate_double
```
```bash
# Run Inexact experiments
# Generate MIL bags datasets
python WSADBench/build_bags.py --input-dir WSADBench/datasets/Classical --output-dir WSADBench/datasets/classical_bags_inexact --bag-size 10 --bag-prob 0.3 --seed 331 --no-resume --gpus 0

# Run tabular inexact experiments
python run_experiment.py --data_type classical_bags_inexact --models Sultani TabPFN --rla_list 0.01 0.05 0.1 0.25 0.5 1.0 --eln_list 0.0 --ru_list 1.0 --flip_nr_list 0.0 --flip_ar_list 0.0 --seed_list 0 1 2 3 4 --n_jobs 1 --gpus 2 --target_for_unlabeled fill_unlabel_0 --exp_note tabular_inexact
```
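Under the inexact-supervision setting, instances are grouped into fixed-size bags and only bag-level labels are observed. The following is a minimal illustrative sketch of that construction, not the actual `build_bags.py` implementation; the function name and signature are assumptions mirroring the `--bag-size`/`--bag-prob` flags:

```python
import random

def build_bags(y, bag_size=10, bag_prob=0.3, seed=331):
    """Hypothetical sketch of MIL bag construction (not build_bags.py itself).
    Each bag holds bag_size instance indices; with probability bag_prob a bag
    is seeded with an anomalous instance. Following the standard MIL rule,
    a bag is labeled 1 iff it contains at least one anomalous instance."""
    rng = random.Random(seed)
    normal = [i for i, lbl in enumerate(y) if lbl == 0]
    anomalous = [i for i, lbl in enumerate(y) if lbl == 1]
    rng.shuffle(normal)
    rng.shuffle(anomalous)
    bags, bag_labels = [], []
    while len(normal) >= bag_size:
        if anomalous and rng.random() < bag_prob:
            # Seed this bag with one anomaly, fill the rest with normals.
            members = [anomalous.pop()] + [normal.pop() for _ in range(bag_size - 1)]
        else:
            members = [normal.pop() for _ in range(bag_size)]
        bags.append(members)
        bag_labels.append(int(any(y[i] == 1 for i in members)))
    return bags, bag_labels
```

Bag-level models such as Sultani then train only on `bag_labels`, never on the instance labels.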
```bash
# Run video anomaly detection
python run_experiment.py \
    --data_type video \
    --models Sultani \
    --datasets UCF_Crime \
    --rla_list 1.0 \
    --n_jobs 1 \
    --gpus 0

# Multi-GPU parallel execution
python run_experiment.py \
    --data_type video \
    --models Sultani \
    --datasets UCF_Crime \
    --n_jobs 2 \
    --rla_list 1.0 \
    --gpus 0,1
```

```bash
# WSADBench automatically skips completed experiments
python run_experiment.py --data_type tabular_classical --models DevNet

# Force re-run all experiments
python run_experiment.py --data_type tabular_classical --models DevNet --NO_RESUME

# Generate summary from existing results without running experiments
python run_experiment.py --data_type tabular_classical --dry_summary
```

Note: The complete benchmark datasets (including pre-extracted features for all modalities) will be released after the paper is accepted. For video datasets, we have unified the pretrained models used for feature extraction and re-extracted all features from the original videos to ensure consistency. The feature extraction code is available in this repository.
Datasets should be prepared as symbolic links in the `WSADBench/datasets/` directory. See DATASETS.md for detailed instructions on:
- Download links for all supported datasets
- Preprocessing instructions for each data type
- Directory structure requirements
- Feature extraction scripts (for CV/NLP features)
Quick Setup:
```bash
# After downloading datasets, create symlinks
ln -s /path/to/your/classical_datasets WSADBench/datasets/Classical
ln -s /path/to/your/video_features WSADBench/datasets/CV_by_I3D
ln -s /path/to/your/cv_features WSADBench/datasets/CV_by_ResNet18
```

| Data Type | CLI Flag | Description |
|---|---|---|
| Classical Tabular | `tabular_classical` | Traditional AD benchmarks (47 datasets) |
| CV Features (ResNet18) | `tabular_CV_by_ResNet18` | Image features extracted by ResNet18 |
| CV Features (ViT) | `tabular_CV_by_ViT` | Image features extracted by ViT |
| NLP Features (BERT) | `tabular_NLP_by_BERT` | Text embeddings from BERT |
| NLP Features (RoBERTa) | `tabular_NLP_by_RoBERTa` | Text embeddings from RoBERTa |
| Video | `video` | Video anomaly detection (I3D features) |
| MIL Bags (Classical) | `classical_bags_inexact` | Classical data in MIL bag format |
| MIL Bags (CV) | `CV_by_ViT_bags_inexact` | CV features in MIL bag format |
| Model | Category | Description |
|---|---|---|
| DevNet | Score Learning | Deviation networks for anomaly detection with limited supervision |
| DeepSAD | Score Learning | Deep semi-supervised anomaly detection via one-class classification |
| PReNet | Score Learning | Pairwise relation network for anomaly detection |
| REPEN | Repr. Learning | Representation learning for PU learning |
| XGBOD | Repr. Learning | Feature augmentation for outlier detection |
| RoSAS | Data Aug. | Robust semi-supervised anomaly scoring with contamination-resilient supervision |
| Dual-MGAN | Data Aug. | Dual-MGAN for anomaly detection |
| FEAWAD | Reconstruction | Feature encoding with autoencoders for weakly-supervised AD |
| DDAE | Diffusion DAE | Anomaly detection with denoising diffusion autoencoders |
| SOEL-NTL | Pseudo-Labeling | Self-training with outlier exposure |
| AA-BiGAN | GAN-based | Adversarially learned anomaly detection with BiGAN |
| GAnomaly | GAN-based | GAN-based anomaly detection |
| Model | Category | Description |
|---|---|---|
| IForest | Isolation-based | Isolation Forest - classical baseline |
| AutoEncoder | Reconstruction | Autoencoder reconstruction error |
| VAE | Reconstruction | Variational Autoencoder |
| PCA | Reconstruction | Principal Component Analysis |
| DeepSVDD | Deep One-class | Deep Support Vector Data Description |
| ECOD | Probabilistic | Empirical Cumulative Distribution |
| CBLOF | Cluster-based | Cluster-based Local Outlier Factor |
| LOF | Density-based | Local Outlier Factor |
| LUNAR | GNN-based | Graph neural network for anomaly detection |
| Model | Category | Description |
|---|---|---|
| Sultani | Vanilla MIL | MIL-based weakly supervised video anomaly detection |
| RTFM | Magnitude MIL | Robust temporal feature magnitude |
| MGFN | Magnitude MIL | Magnitude-contrastive glance-and-focus network |
| AR-Net | Dynamic MIL | Dynamic MIL for video anomaly detection |
| VadCLIP | Language-Guided MIL | Vision-language video anomaly detection |
| UR-DMU | Uncertainty-Aware MIL | Dual memory units with uncertainty regulation |
| GCN-Anomaly | Label Denoising | Graph convolutional network for anomaly detection |
| PUMA | PU MIL | PU-learning based multi-model anomaly detection |
| Model | Category | Description |
|---|---|---|
| XGBoost | GBDT | Gradient boosting decision trees |
| CatBoost | GBDT | Categorical boosting |
| FTTransformer | Deep (Sup.) | Feature-wise transformer for tabular data |
| TabM | Deep (Sup.) | Tabular deep learning model |
| TabR-S | Deep (Sup.) | Retrieval-augmented tabular deep learning (simple variant) |
| Model | Category | Description |
|---|---|---|
| TabPFN | Found. Model | Discriminative Foundation Model |
| LimiX | Found. Model | Generative Foundation Model |
```
WSADBench/
├── run_experiment.py             # Main entry point
├── requirements.txt              # Python dependencies
├── setup.sh                      # Environment setup script
├── LICENSE                       # MIT License
├── README.md                     # This file
├── DATASETS.md                   # Dataset preparation guide
│
├── WSADBench/                    # Core package
│   ├── baseline/                 # Model implementations
│   │   ├── DeepSAD/              # DeepSAD implementation
│   │   ├── DevNet/               # DevNet implementation
│   │   ├── FEAWAD/               # FEAWAD implementation
│   │   ├── Sultani/              # Sultani video AD
│   │   ├── PyOD.py               # PyOD wrapper (20+ models)
│   │   └── ...                   # 30+ other models
│   │
│   ├── datasets/                 # Dataset handling
│   │   ├── data_generator.py     # Data generation & loading
│   │   ├── cv_data_generator.py  # CV dataset handling
│   │   ├── dataset_configs/      # Dataset configuration (YAML)
│   │   └── dataset_support/      # Video preprocessing utilities
│   │
│   ├── model_configs/            # Model hyperparameters (YAML)
│   │   ├── tabular/              # Tabular model configs
│   │   ├── video/                # Video model configs
│   │   └── tabular_bags_inexact/ # MIL bag configs
│   │
│   ├── myutils.py                # Utility functions
│   └── build_bags.py             # Instance → MIL bag conversion
│
├── common_utils/                 # Shared utilities
│   ├── baseline_utils.py         # Video-specific utilities
│   └── argTypes.py               # Argument type parsing
│
└── results/                      # Experiment outputs (git-ignored)
```
| Argument | Description | Default |
|---|---|---|
| `--data_type` | Data modality (required) | - |
| `--models` | Model names to run | - |
| `--datasets` | Specific datasets | All available |
| `--rla_list` | Labeled anomaly ratios | `[1.0]` |
| `--eln_list` | Labeled normal ratios (relative to RLA) | `[0.0, 0.01, ...]` |
| `--ru_list` | Unlabeled sample ratios | `[1.0]` |
| `--flip_nr_list` | Label noise (normal→anomaly) | `[0.0]` |
| `--flip_ar_list` | Label noise (anomaly→normal) | `[0.0]` |
| `--target_for_unlabeled` | How to handle unlabeled samples | `fill_unlabel_0` |
| `--noise_type` | Noise type for experiments | None |
| `--is_cleanlab` | Enable cleanlab data cleaning | false |
| `--seed_list` | Random seeds | `[1-10]` |
| `--n_jobs` | Parallel jobs | 1 |
| `--gpus` | GPU IDs (e.g., "0,1,2") | All available |
| `--output_dir` | Results directory | `results/{data_type}` |
| `--NO_RESUME` | Force re-run completed experiments | False |
| `--dry_summary` | Only generate summary | False |
| `--DEBUG` | Enable debug mode | False |
| `--exp_note` | Experiment note for tracking | None |
WSADBench supports comprehensive weak supervision configurations:
- RLA (Ratio of Labeled Anomalies): Proportion of anomalies that are labeled in training data
- ELN (Ratio of Labeled Normal samples): Proportion of labeled normal samples relative to labeled anomalies
- RU (Ratio of Unlabeled): Proportion of unlabeled samples in training data
- Label Contamination: Simulate annotation errors with `flip_nr_list` and `flip_ar_list`
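To make the contamination setting concrete, here is a minimal sketch of how flip ratios could be applied to a label vector. This is illustrative only; the function name and signature are assumptions, not WSADBench's internal routine:

```python
import random

def contaminate_labels(y, flip_nr=0.0, flip_ar=0.0, seed=0):
    """Hypothetical sketch of label contamination.
    flip_nr: fraction of labeled normals flipped to anomalous (normal -> anomaly).
    flip_ar: fraction of labeled anomalies flipped to normal (anomaly -> normal)."""
    rng = random.Random(seed)
    y = list(y)
    normals = [i for i, lbl in enumerate(y) if lbl == 0]
    anomalies = [i for i, lbl in enumerate(y) if lbl == 1]
    for i in rng.sample(normals, int(round(flip_nr * len(normals)))):
        y[i] = 1  # normal sample mislabeled as anomaly
    for i in rng.sample(anomalies, int(round(flip_ar * len(anomalies)))):
        y[i] = 0  # anomaly mislabeled as normal
    return y
```

With `flip_nr=0.05`, roughly 5% of labeled normals receive an incorrect anomaly label, matching the `--flip_nr_list 0.05` setting below.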
```bash
# Example: 10% labeled anomalies, 50% unlabeled data, 5% label noise
python run_experiment.py \
    --data_type tabular_classical \
    --models DevNet \
    --rla_list 0.1 \
    --ru_list 0.5 \
    --flip_nr_list 0.05 \
    --flip_ar_list 0.05
```

Model hyperparameters are stored in `WSADBench/model_configs/{data_type}/{model_name}.yaml`:
```yaml
# Example: WSADBench/model_configs/tabular/DeepSAD.yaml
model_class: "WSADBench.baseline.DeepSAD.run.DeepSAD"
parameters:
  latent_dim: 32
  hidden_dims: [64, 32]
  epochs: 100
  batch_size: 256
  lr: 0.001
```

To add a new model:

- Create a new directory in `WSADBench/baseline/YourModel/`
- Implement `run.py` with a class that has:
  - `__init__(self, seed, **kwargs)`: Initialize the model
  - `fit(self, X, y, ...)`: Training method
  - `predict_score(self, X, ...)`: Return anomaly scores
- Create the config file `WSADBench/model_configs/{data_type}/YourModel.yaml`
- Add the model to `ModelRegistry` in `run_experiment.py`
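As a sketch of the interface above, here is a toy model that satisfies all three methods. It is not one of the benchmark's baselines: it simply scores each sample by its distance to the centroid of the labeled-normal training points:

```python
import math
import random

class CentroidScorer:
    """Toy model conforming to the expected interface
    (__init__ / fit / predict_score); purely illustrative."""

    def __init__(self, seed, **kwargs):
        self.seed = seed
        random.seed(seed)  # a real model would seed its framework RNGs here
        self.centroid = None

    def fit(self, X, y):
        # Fit on samples labeled normal (label 0).
        normal = [x for x, lbl in zip(X, y) if lbl == 0]
        n_dim = len(normal[0])
        self.centroid = [sum(x[d] for x in normal) / len(normal)
                         for d in range(n_dim)]
        return self

    def predict_score(self, X):
        # Larger distance from the normal centroid => more anomalous.
        return [math.dist(x, self.centroid) for x in X]
```

A class shaped like this, plus a matching YAML config, is all the runner needs to schedule it across datasets and seeds.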
Results are saved in JSONL format:
```
results/
└── {data_type}/
    ├── detail/
    │   └── {model_name}/
    │       ├── {model_name}_results.jsonl  # Individual results
    │       └── model_stats.json            # Model statistics
    └── summary/
        └── summary.xlsx                    # Aggregated statistics
```
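Since each JSONL file holds one JSON object per line, custom aggregation beyond the built-in summary is straightforward. A minimal sketch, where the record fields (`dataset`, `auc_roc`) are assumptions rather than the benchmark's documented schema:

```python
import json
import statistics
from collections import defaultdict

def summarize_results(jsonl_path):
    """Compute per-dataset mean/std of a metric from a results JSONL file.
    Illustrative sketch: field names 'dataset' and 'auc_roc' are assumed,
    not guaranteed to match WSADBench's actual output schema."""
    by_dataset = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)  # one JSON object per line
            by_dataset[record["dataset"]].append(record["auc_roc"])
    # Mean and population std over runs (e.g. seeds) for each dataset.
    return {name: (statistics.mean(scores), statistics.pstdev(scores))
            for name, scores in by_dataset.items()}
```

For the official aggregated statistics, prefer the generated `summary.xlsx`.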
If you use WSADBench in your research, please cite:
```bibtex
@article{wsadbench2025,
  title={Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark},
  author={WSADBench Authors},
  journal={arXiv preprint},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
For questions and issues, please open an issue on GitHub.