Implementation of Non-Parametric Transformers for anomaly detection on tabular data.

This repository contains an implementation of the NPT-AD (Non-Parametric Transformers for Anomaly Detection) method. The original code has been refactored to be more accessible and easier to use for both researchers and practitioners.
## Installation

- Clone the repository:

```bash
git clone git@github.com:hugothimonier/NPT-AD.git
cd NPT-AD
```

- Create and activate the conda environment, then install the dependencies:

```bash
conda create -n nptad python=3.8
conda activate nptad
pip install -r requirements.txt
```

## Quick Start

### Python API

```python
from npt.simple_trainer import train_anomaly_detector
from npt.config_manager import NPTADConfig

# Base configuration with default values
config = NPTADConfig()

# Modify parameters of your choice
config.training.num_total_steps = 5000
config.model.dim_hidden = 64

trainer = train_anomaly_detector('abalone', config)
```

### Command-Line Interface

```bash
# Quick test
python -m npt.cli test --dataset separable

# Train with custom settings
python -m npt.cli train --dataset abalone --steps 5000 --batch-size 32 --n_runs 5

# List available datasets
python -m npt.cli list-datasets

# Generate a configuration file
python -m npt.cli config --preset small_dataset --output my_config.json
```

Check out the `examples/` directory for comprehensive examples:

- `quick_start.py`: basic usage examples
- `custom_dataset.py`: how to create and use custom datasets
## Configuration

NPT-AD uses a configuration system with defaults and presets:

- `quick_test`: fast testing with minimal resources
- `small_dataset`: optimized for small datasets ($n < 1000$)
- `small_dataset_high_d`: optimized for small datasets ($n < 1000$) with more than 20 features ($d > 20$)
- `medium_dataset`: optimized for medium datasets ($1000 < n < 10,000$)
- `medium_dataset_high_d`: optimized for medium datasets ($1000 < n < 10,000$) with more than 20 features ($d > 20$)
- `large_dataset`: optimized for large datasets ($n > 10,000$)
- `large_dataset_high_d`: optimized for large datasets ($n > 10,000$) with more than 20 features ($d > 20$)
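As a rough illustration of the thresholds above, one could map a dataset's size and dimensionality to a preset name with a small helper. The function below is hypothetical (it is not part of the package), and the boundary handling at exactly $n = 1000$ or $n = 10,000$ is an arbitrary choice:

```python
def choose_preset(n: int, d: int) -> str:
    """Map dataset size n and feature count d to an NPT-AD preset name,
    following the size/dimensionality thresholds listed above.
    Hypothetical helper for illustration only."""
    if n < 1000:
        base = 'small_dataset'
    elif n <= 10_000:
        base = 'medium_dataset'
    else:
        base = 'large_dataset'
    # The high-dimensional variants apply above 20 features
    return base + '_high_d' if d > 20 else base

print(choose_preset(500, 10))    # small_dataset
print(choose_preset(5000, 64))   # medium_dataset_high_d
```

The returned string can then be passed to the CLI's `--preset` flag shown below.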
```bash
# Train with a preset setting on a custom dataset
python -m npt.cli train --dataset custom_dataset --preset medium_dataset_high_d
```

One can also override the preset parameters by adding arguments:

```bash
python -m npt.cli train --dataset custom_dataset --preset medium_dataset_high_d --steps 5000 --batch-size 32
```

### Custom configuration

```python
from npt.config_manager import NPTADConfig

config = NPTADConfig()

# Model settings
config.model.dim_hidden = 128
config.model.num_heads = 8
config.model.stacking_depth = 6

# Training settings
config.training.num_total_steps = 10000
config.training.lr = 0.001
config.training.batch_size = 32

# Data settings
config.data.dataset = 'your_dataset'
config.data.data_path = 'path/to/your/data'
```

## Custom Datasets

```python
from npt.datasets.dataset_registry import create_simple_dataset, DatasetRegistry
from npt.simple_trainer import train_anomaly_detector
import pandas as pd

# From a pandas DataFrame
df = pd.read_csv('your_data.csv')

YourDataset = create_simple_dataset(
    name='your_dataset',
    data_source=df,
    target_column='anomaly_label',
    categorical_columns=['cat_feature_1', 'cat_feature_2'],
    numerical_columns=['num_feature_1', 'num_feature_2'],
)

# Register the dataset
DatasetRegistry.register('your_dataset', YourDataset)

# Use it
trainer = train_anomaly_detector('your_dataset')
```

If the former does not work, one can directly create a custom dataset class following the defined `BaseDataset` class.
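To make the expected input concrete, here is a sketch of a toy DataFrame in the shape `create_simple_dataset` consumes above: numerical columns, categorical columns, and a binary target column marking anomalies. All column names and values below are illustrative, not part of the package:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Toy tabular data: two numerical features, two categorical features,
# and a binary anomaly label (1 = anomaly); roughly 5% anomalies
df = pd.DataFrame({
    'num_feature_1': rng.normal(size=n),
    'num_feature_2': rng.normal(size=n),
    'cat_feature_1': rng.choice(['a', 'b', 'c'], size=n),
    'cat_feature_2': rng.choice(['x', 'y'], size=n),
    'anomaly_label': rng.choice([0, 1], size=n, p=[0.95, 0.05]),
})

print(df.shape)  # (200, 5)
```

A frame like this can then be passed as `data_source` in place of the `read_csv` result in the snippet above.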
```python
from npt.datasets.base import BaseDataset
from npt.datasets.dataset_registry import DatasetRegistry
import pandas as pd

class YourCustomDataset(BaseDataset):
    def __init__(self, config):
        super().__init__(fixed_test_set_index=None)
        self.config = config
        self.is_data_loaded = False

    def load(self):
        # Your data loading logic here
        df = pd.read_csv('your_data.csv')
        # ... process data ...
        self.is_data_loaded = True

# Register and use
DatasetRegistry.register('your_custom', YourCustomDataset)
```

## Included Datasets

The following datasets are included by default:
- `abalone`: Abalone dataset from UCI ML Repository
- `separable`: Synthetic separable dataset for testing
- `annthyroid`: Thyroid dataset
- `arrhythmia`: Arrhythmia dataset
- `backdoor`: Backdoor dataset
- `breastw`: Breast cancer dataset
- `campaign`: Campaign dataset
- `cardio`: Cardiotocography dataset
- `ecoli`: E. coli dataset
- `fraud`: Fraud dataset
- `glass`: Glass identification dataset
- `ionosphere`: Ionosphere dataset
- `letter`: Letter recognition dataset
- `lympho`: Lymphography dataset
- `mammography`: Mammography dataset
- `mnist`: MNIST dataset
- `mullcross`: Mullcross dataset
- `musk`: Musk dataset
- `optdigits`: Optical digits dataset
- `pendigits`: Pen-based digits dataset
- `pima`: Pima Indians diabetes dataset
- `satellite`: Satellite dataset
- `satimage`: Satellite image dataset
- `seismic`: Seismic dataset
- `shuttle`: Shuttle dataset
- `speech`: Speech dataset
- `thyroid`: Thyroid dataset
- `vertebral`: Vertebral column dataset
- `vowels`: Vowel dataset
- `wbc`: Wisconsin breast cancer dataset
- `wine`: Wine dataset
## Hardware and Distributed Training

- Our code easily runs on CPU-only instances for small to medium datasets with a reasonable number of features.
- For larger datasets, our codebase supports both single-GPU and multi-GPU training.

Our code is compatible with distributed training with minimal adjustments and can be launched with the usual `torchrun` command.
One can set the distributed parameters as follows:

```python
config = NPTADConfig()
config.system.distributed = True
config.system.gpus = 4
config.training.batch_size = 64

trainer = train_anomaly_detector('large_dataset', config)
```

or with the command-line interface:

```bash
python -m npt.cli train --dataset custom_dataset --preset medium_dataset_high_d --steps 5000 --batch-size 32 --n_gpus 4 --distributed True
```

## Reducing Memory Usage

If you run into memory issues:

- During training: reduce the batch size or the model size (e.g. `config.model.dim_hidden`, `config.model.num_heads`, or `config.model.stacking_depth`).
- During inference: reduce `config.training.num_train_inference`.
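Combining the knobs above, a memory-light configuration might look like the fragment below. Only the parameter names come from this README; the specific values are illustrative, not tuned recommendations:

```python
from npt.config_manager import NPTADConfig

config = NPTADConfig()

# Training-time memory: smaller batches and a smaller model
config.training.batch_size = 16
config.model.dim_hidden = 32
config.model.num_heads = 4
config.model.stacking_depth = 4

# Inference-time memory: lower values of num_train_inference use less memory
config.training.num_train_inference = 1  # illustrative value
```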
## Citation

If you use this code in your research, please cite our paper:

```bibtex
@InProceedings{pmlr-v235-thimonier24a,
  title     = {Beyond Individual Input for Deep Anomaly Detection on Tabular Data},
  author    = {Thimonier, Hugo and Popineau, Fabrice and Rimmel, Arpad and Doan, Bich-Li\^{e}n},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {48097--48123},
  year      = {2024},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

This work is based on the original Non-Parametric Transformers implementation from OATML. We thank the original authors for their excellent work.