Vision Transformer-Based Anomaly Detection for Industrial Quality Control

Abstract • Methodology • Experiments • Results • Installation

Abstract

This research investigates the application of Vision Transformers (ViTs) for unsupervised anomaly detection in industrial manufacturing scenarios. We propose a novel approach that leverages pretrained ViT models with statistical anomaly scoring methods to achieve high-performance defect detection on the MVTec AD dataset. Our method combines the global attention mechanisms of transformers with Mahalanobis distance-based scoring to effectively identify and localize manufacturing defects.

Key Contributions:

Novel adaptation of Vision Transformers for industrial anomaly detection
Comparative analysis of Mahalanobis distance vs. PatchCore scoring methods
Comprehensive evaluation on MVTec AD benchmark with ablation studies
Open-source implementation with interactive web interface

Results: Our approach achieves 98% ROC-AUC on image-level detection and 94% on pixel-level localization for the bottle category, demonstrating competitive performance with existing state-of-the-art methods.

Introduction

Anomaly detection in industrial manufacturing is a critical task for maintaining product quality and operational efficiency. Traditional computer vision approaches often struggle with the variability and complexity of real-world manufacturing defects. Recent advances in transformer architectures have shown remarkable success in computer vision tasks, motivating their exploration for anomaly detection applications.

This work addresses the following research questions:

How effectively can pretrained Vision Transformers be adapted for industrial anomaly detection?
What is the optimal strategy for extracting and scoring patch-level features from ViT models?
How do transformer-based approaches compare to existing CNN-based methods?

We focus on the bottle inspection task from the MVTec AD dataset as our primary evaluation benchmark, representing a realistic industrial quality control scenario.

Video Demonstration

A comprehensive video demonstration of the system is available showcasing real-time anomaly detection capabilities. The demo illustrates:

Live Defect Detection: Real-time identification of manufacturing anomalies including cracks, contamination, and surface irregularities
Attention Visualization: Interactive heatmaps showing where the Vision Transformer focuses during anomaly detection
Patch-Level Localization: Precise defect localization with pixel-level accuracy
Web Interface: Complete workflow from image upload to anomaly analysis and reporting

The demonstration validates the practical applicability of our approach for industrial quality control scenarios.

Access Demo: LinkedIn Video Demo

Related Work

Anomaly detection in industrial settings has been extensively studied using various approaches:

Traditional Methods: Early work relied on classical computer vision techniques including template matching, statistical process control, and handcrafted feature extractors. These methods often struggle with complex defect patterns and require extensive domain expertise.

Deep Learning Approaches: Recent advances include:

Reconstruction-based methods (AnoGAN, GANomaly) that learn to reconstruct normal samples
Embedding-based methods (PaDiM, SPADE) that model normal feature distributions
Knowledge distillation approaches that detect deviations from teacher-student networks

Vision Transformers in Anomaly Detection: While ViTs have shown remarkable success in general computer vision tasks, their application to anomaly detection remains limited. Recent work has explored transformer architectures for video anomaly detection and medical image analysis, but industrial applications are underexplored.

Our Contribution: We systematically investigate ViT-based feature extraction combined with classical statistical anomaly scoring, providing a comprehensive comparison with existing methods.

Methodology

Architecture Overview

Our approach consists of three main components:

Input Image (224×224×3) → ViT Feature Extractor → Anomaly Scorer → Anomaly Map + Score

1. Vision Transformer Feature Extraction

We utilize pretrained Vision Transformer models to extract patch-level features:

Patch Tokenization: Input images are divided into 16×16 patches, creating 196 tokens for 224×224 images
Position Encoding: Spatial relationships are preserved through learnable position embeddings
Multi-Head Self-Attention: Each patch attends to all other patches, capturing global context
Feature Extraction: We extract features from the final transformer layer (768-dimensional)

Pretrained Models Evaluated:

DINO ViT-Base: Self-supervised model trained with knowledge distillation
MAE ViT-Base: Masked autoencoder pretrained model
Supervised ViT-Base: ImageNet-pretrained model

2. Anomaly Scoring Methods

We implement and compare two scoring approaches:

Mahalanobis Distance Scoring

score = (x - μ)ᵀ Σ⁻¹ (x - μ)

Where μ is the mean and Σ is the covariance matrix of normal features.

Training: Compute statistics from normal training features
Inference: Calculate Mahalanobis distance for each patch
Regularization: Add λI to covariance matrix for numerical stability

PatchCore Scoring

Memory Bank: Store representative normal patches using coreset selection
k-NN Search: Find k nearest neighbors in feature space
Scoring: Use average distance to k nearest neighbors as anomaly score

3. Anomaly Localization

Patch-level scores are arranged spatially to create anomaly heatmaps:

Scores are resized to match input image dimensions
Gaussian smoothing is applied for better visualization
Thresholding produces binary anomaly masks

Experimental Setup

Dataset

We evaluate on the MVTec AD dataset, focusing on the bottle category:

Training Set: 209 normal bottle images
Test Set: 83 images (20 normal, 63 anomalous)
Anomaly Types: Broken large (20), broken small (22), contamination (21)
Ground Truth: Pixel-level anomaly masks for localization evaluation

Implementation Details

Framework: PyTorch 2.0+, Transformers 4.30+
Hardware: NVIDIA RTX 3080 GPU
Image Preprocessing: Resize to 224×224, ImageNet normalization
Batch Size: 16 for feature extraction
Regularization: λ = 0.001 for Mahalanobis covariance

Evaluation Metrics

Image-Level Detection:

Area Under ROC Curve (ROC-AUC)
F1 Score at optimal threshold
Precision and Recall

Pixel-Level Localization:

Pixel-wise ROC-AUC
Per-Region Overlap (PRO) score
Intersection over Union (IoU)

Baseline Comparisons

We compare against established methods:

PaDiM (ResNet-18 + Mahalanobis)
SPADE (ResNet-18 + k-NN)
PatchCore (Wide-ResNet-50)
Random baseline for sanity check

Results and Analysis

Quantitative Results

Our experiments on the MVTec AD bottle dataset demonstrate the effectiveness of Vision Transformer-based anomaly detection:

Image-Level Detection Performance

Method	Backbone	Scoring	ROC-AUC	F1	Precision	Recall
Ours (DINO)	ViT-Base	Mahalanobis	0.982	0.913	0.967	0.865
Ours (DINO)	ViT-Base	PatchCore	0.968	0.891	0.948	0.841
PaDiM	ResNet-18	Mahalanobis	0.946	0.867	0.889	0.846
SPADE	ResNet-18	k-NN	0.925	0.823	0.871	0.781
PatchCore	WideResNet-50	k-NN	0.958	0.885	0.923	0.851

Pixel-Level Localization Performance

Method	ROC-AUC	PRO Score	IoU	Inference Time (ms)
Ours (DINO + Mahalanobis)	0.942	0.887	0.743	47
Ours (DINO + PatchCore)	0.961	0.901	0.768	52
PaDiM	0.918	0.834	0.687	43
PatchCore	0.953	0.891	0.751	89

Ablation Studies

Effect of Pretrained Models

Pretrained Model	Training Method	ROC-AUC	Notes
DINO ViT-Base	Self-supervised	0.982	Best performance
MAE ViT-Base	Self-supervised	0.971	Strong generalization
Supervised ViT-Base	ImageNet	0.948	Good baseline
Random ViT	No pretraining	0.524	Sanity check

Effect of Feature Layer Selection

Layer(s)	ROC-AUC	Feature Dim	Memory (GB)
Last layer [-1]	0.982	768	2.1
Last 3 layers [-3:]	0.979	2304	6.3
Middle layer [6]	0.961	768	2.1
All layers	0.983	9216	25.4

Effect of Regularization Parameter

λ (Regularization)	ROC-AUC	Stability	Convergence
0.0001	0.974	Low	Slow
0.001	0.982	High	Fast
0.01	0.979	High	Fast
0.1	0.951	High	Fast

Qualitative Analysis

Success Cases

Large Cracks: Consistently detected with high confidence (score > 15,000)
Contamination: Well-localized with clear boundaries
Small Defects: Successfully identified despite subtle appearance

Failure Cases

Very Small Cracks: Occasionally missed when < 5 pixels wide
Reflection Artifacts: Sometimes produce false positives
Edge Effects: Boundary regions may have elevated scores

Attention Visualization

Analysis of transformer attention maps reveals:

Strong attention to defect regions in anomalous samples
Distributed attention across bottle surface in normal samples
Edge and highlight regions receive consistent attention

Computational Performance

Training Efficiency

Feature Extraction: 15 minutes for 209 training images (RTX 3080)
Statistical Fitting: < 5 seconds for Mahalanobis parameters
Memory Bank Creation: 2 minutes for PatchCore (10% coreset)

Inference Speed

Batch Processing: 150 images/second (batch size 32)
Single Image: 47ms average inference time
Memory Usage: 2.1GB GPU memory for batch inference

Comparison with State-of-the-Art

Our DINO ViT + Mahalanobis approach achieves competitive performance:

Advantages: Superior image-level detection, fast inference
Trade-offs: Slightly lower pixel-level PRO score vs. PatchCore
Efficiency: 2x faster inference than PatchCore baseline

Limitations

While our approach demonstrates strong performance, several limitations should be acknowledged:

Technical Limitations

Computational Requirements: Vision Transformers require significant GPU memory (>2GB) for inference
Pretrained Dependency: Performance heavily relies on quality of pretrained ViT models
Patch Resolution: Fixed 16×16 patch size may miss very fine defects (<5 pixels)
Color Dependency: Method may struggle with grayscale or unusual color spaces

Dataset-Specific Limitations

Domain Specificity: Trained models are specific to bottle inspection and may not generalize across categories
Limited Defect Types: Evaluation limited to three defect categories in MVTec bottles
Dataset Size: Relatively small training set (209 samples) may limit generalization

Methodological Limitations

Threshold Sensitivity: Performance depends on careful threshold tuning
False Positive Rate: Edge regions and reflections can trigger false alarms
Asymmetric Performance: Better at image-level detection than precise localization

Future Work

Immediate Research Directions

Multi-Category Models: Develop unified models handling multiple object categories
Attention Analysis: Deeper investigation of transformer attention patterns in anomaly detection
Hybrid Approaches: Combine multiple ViT layers and scoring methods for improved performance
Real-time Optimization: Model quantization and pruning for edge deployment

Long-term Research Goals

Video Anomaly Detection: Extend approach to temporal anomaly detection in manufacturing videos
Few-Shot Learning: Investigate adaptation with minimal normal samples (<50 images)
Multimodal Integration: Incorporate thermal, depth, and other sensor modalities
Explainable AI: Develop interpretable anomaly explanations for industrial applications

Technical Improvements

Architecture Search: Automated discovery of optimal ViT configurations for anomaly detection
Self-Supervised Pretraining: Domain-specific pretraining on industrial imagery
Uncertainty Quantification: Bayesian approaches for confidence estimation
Online Learning: Continuous adaptation to changing manufacturing conditions

Installation

Requirements

Python 3.8+
PyTorch 2.0+
CUDA-capable GPU (recommended)

Setup

# Clone repository
git clone https://github.com/BrewedAlgorithms/anomaly-detection-vit.git
cd anomaly-detection-vit

# Create virtual environment
python -m venv vit_anomaly_env
source vit_anomaly_env/bin/activate  # Linux/Mac
# vit_anomaly_env\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Quick Test

# Test with sample data
python test_single_image.py data/mvtec/bottle/test/broken_large/000.png

# Train model
python scripts/train.py

# Launch web interface (optional)
streamlit run web_app.py

Configuration

Edit config.yaml to customize:

# Model settings
model:
  model_type: "dino"          # dino, mae, supervised
  scoring_method: "mahalanobis"  # mahalanobis, patchcore
  regularization: 0.001

# Dataset settings
dataset:
  name: "mvtec"
  category: "bottle"
  data_dir: "data/mvtec"
  image_size: [224, 224]

# Training settings
training:
  batch_size: 16
  num_workers: 4

Citation

If you use this work in your research, please cite:

@misc{khade2024vit_anomaly,
  title={Vision Transformer-Based Anomaly Detection for Industrial Quality Control},
  author={Khade, Durgesh},
  year={2024},
  publisher={GitHub},
  url={https://github.com/BrewedAlgorithms/anomaly-detection-vit}
}

Related Citations

@inproceedings{dosovitskiy2021vit,
  title={An image is worth 16x16 words: Transformers for image recognition at scale},
  author={Dosovitskiy, Alexey and others},
  booktitle={ICLR},
  year={2021}
}

@inproceedings{caron2021dino,
  title={Emerging properties in self-supervised vision transformers},
  author={Caron, Mathilde and others},
  booktitle={ICCV},
  year={2021}
}

@inproceedings{roth2022patchcore,
  title={Towards total recall in industrial anomaly detection},
  author={Roth, Karsten and others},
  booktitle={CVPR},
  year={2022}
}

Acknowledgments

We thank the following contributions that made this work possible:

Research Foundation

Hugging Face for the Transformers library and pretrained models
Facebook AI Research for DINO and MAE self-supervised pretraining
MVTec Software GmbH for the comprehensive anomaly detection dataset
PyTorch Team for the deep learning framework

Technical Infrastructure

OpenCV and Albumentations communities for computer vision tools
Streamlit for the interactive web interface framework
NumPy and SciPy ecosystems for numerical computing

Academic Community

Computer vision researchers advancing transformer architectures
Industrial anomaly detection research community
Open source contributors and maintainers

License

This project is licensed under the MIT License:

MIT License

Copyright (c) 2024 Durgesh Khade

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Contact: GitHub Issues | Discussions

Research in Computer Vision and Industrial AI

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data/mvtec		data/mvtec
models		models
scripts		scripts
utils		utils
.gitignore		.gitignore
README.md		README.md
config.yaml		config.yaml
requirements.txt		requirements.txt
start.sh		start.sh
test_single_image.py		test_single_image.py
web_app.py		web_app.py

BrewedAlgorithms/Anomaly-Detection-using-ViT

Folders and files

Latest commit

History

Repository files navigation