This research investigates the application of Vision Transformers (ViTs) for unsupervised anomaly detection in industrial manufacturing scenarios. We propose a novel approach that leverages pretrained ViT models with statistical anomaly scoring methods to achieve high-performance defect detection on the MVTec AD dataset. Our method combines the global attention mechanisms of transformers with Mahalanobis distance-based scoring to effectively identify and localize manufacturing defects.
Key Contributions:
- Novel adaptation of Vision Transformers for industrial anomaly detection
- Comparative analysis of Mahalanobis distance vs. PatchCore scoring methods
- Comprehensive evaluation on MVTec AD benchmark with ablation studies
- Open-source implementation with interactive web interface
Results: Our approach achieves 98% ROC-AUC on image-level detection and 94% on pixel-level localization for the bottle category, demonstrating competitive performance with existing state-of-the-art methods.
Anomaly detection in industrial manufacturing is a critical task for maintaining product quality and operational efficiency. Traditional computer vision approaches often struggle with the variability and complexity of real-world manufacturing defects. Recent advances in transformer architectures have shown remarkable success in computer vision tasks, motivating their exploration for anomaly detection applications.
This work addresses the following research questions:
- How effectively can pretrained Vision Transformers be adapted for industrial anomaly detection?
- What is the optimal strategy for extracting and scoring patch-level features from ViT models?
- How do transformer-based approaches compare to existing CNN-based methods?
We focus on the bottle inspection task from the MVTec AD dataset as our primary evaluation benchmark, representing a realistic industrial quality control scenario.
A comprehensive video demonstration of the system is available showcasing real-time anomaly detection capabilities. The demo illustrates:
- Live Defect Detection: Real-time identification of manufacturing anomalies including cracks, contamination, and surface irregularities
- Attention Visualization: Interactive heatmaps showing where the Vision Transformer focuses during anomaly detection
- Patch-Level Localization: Precise defect localization with pixel-level accuracy
- Web Interface: Complete workflow from image upload to anomaly analysis and reporting
The demonstration validates the practical applicability of our approach for industrial quality control scenarios.
Access Demo: LinkedIn Video Demo
Anomaly detection in industrial settings has been extensively studied using various approaches:
Traditional Methods: Early work relied on classical computer vision techniques including template matching, statistical process control, and handcrafted feature extractors. These methods often struggle with complex defect patterns and require extensive domain expertise.
Deep Learning Approaches: Recent advances include:
- Reconstruction-based methods (AnoGAN, GANomaly) that learn to reconstruct normal samples
- Embedding-based methods (PaDiM, SPADE) that model normal feature distributions
- Knowledge distillation approaches that detect deviations from teacher-student networks
Vision Transformers in Anomaly Detection: While ViTs have shown remarkable success in general computer vision tasks, their application to anomaly detection remains limited. Recent work has explored transformer architectures for video anomaly detection and medical image analysis, but industrial applications are underexplored.
Our Contribution: We systematically investigate ViT-based feature extraction combined with classical statistical anomaly scoring, providing a comprehensive comparison with existing methods.
Our approach consists of three main components:
Input Image (224×224×3) → ViT Feature Extractor → Anomaly Scorer → Anomaly Map + Score
We utilize pretrained Vision Transformer models to extract patch-level features:
- Patch Tokenization: Input images are divided into 16×16 patches, creating 196 tokens for 224×224 images
- Position Encoding: Spatial relationships are preserved through learnable position embeddings
- Multi-Head Self-Attention: Each patch attends to all other patches, capturing global context
- Feature Extraction: We extract features from the final transformer layer (768-dimensional)
Pretrained Models Evaluated:
- DINO ViT-Base: Self-supervised model trained with knowledge distillation
- MAE ViT-Base: Masked autoencoder pretrained model
- Supervised ViT-Base: ImageNet-pretrained model
We implement and compare two scoring approaches:
score = (x - μ)ᵀ Σ⁻¹ (x - μ)
Where μ is the mean and Σ is the covariance matrix of normal features.
- Training: Compute statistics from normal training features
- Inference: Calculate Mahalanobis distance for each patch
- Regularization: Add λI to covariance matrix for numerical stability
- Memory Bank: Store representative normal patches using coreset selection
- k-NN Search: Find k nearest neighbors in feature space
- Scoring: Use average distance to k nearest neighbors as anomaly score
Patch-level scores are arranged spatially to create anomaly heatmaps:
- Scores are resized to match input image dimensions
- Gaussian smoothing is applied for better visualization
- Thresholding produces binary anomaly masks
We evaluate on the MVTec AD dataset, focusing on the bottle category:
- Training Set: 209 normal bottle images
- Test Set: 83 images (20 normal, 63 anomalous)
- Anomaly Types: Broken large (20), broken small (22), contamination (21)
- Ground Truth: Pixel-level anomaly masks for localization evaluation
- Framework: PyTorch 2.0+, Transformers 4.30+
- Hardware: NVIDIA RTX 3080 GPU
- Image Preprocessing: Resize to 224×224, ImageNet normalization
- Batch Size: 16 for feature extraction
- Regularization: λ = 0.001 for Mahalanobis covariance
Image-Level Detection:
- Area Under ROC Curve (ROC-AUC)
- F1 Score at optimal threshold
- Precision and Recall
Pixel-Level Localization:
- Pixel-wise ROC-AUC
- Per-Region Overlap (PRO) score
- Intersection over Union (IoU)
We compare against established methods:
- PaDiM (ResNet-18 + Mahalanobis)
- SPADE (ResNet-18 + k-NN)
- PatchCore (Wide-ResNet-50)
- Random baseline for sanity check
Our experiments on the MVTec AD bottle dataset demonstrate the effectiveness of Vision Transformer-based anomaly detection:
Method | Backbone | Scoring | ROC-AUC | F1 | Precision | Recall |
---|---|---|---|---|---|---|
Ours (DINO) | ViT-Base | Mahalanobis | 0.982 | 0.913 | 0.967 | 0.865 |
Ours (DINO) | ViT-Base | PatchCore | 0.968 | 0.891 | 0.948 | 0.841 |
PaDiM | ResNet-18 | Mahalanobis | 0.946 | 0.867 | 0.889 | 0.846 |
SPADE | ResNet-18 | k-NN | 0.925 | 0.823 | 0.871 | 0.781 |
PatchCore | WideResNet-50 | k-NN | 0.958 | 0.885 | 0.923 | 0.851 |
Method | ROC-AUC | PRO Score | IoU | Inference Time (ms) |
---|---|---|---|---|
Ours (DINO + Mahalanobis) | 0.942 | 0.887 | 0.743 | 47 |
Ours (DINO + PatchCore) | 0.961 | 0.901 | 0.768 | 52 |
PaDiM | 0.918 | 0.834 | 0.687 | 43 |
PatchCore | 0.953 | 0.891 | 0.751 | 89 |
Pretrained Model | Training Method | ROC-AUC | Notes |
---|---|---|---|
DINO ViT-Base | Self-supervised | 0.982 | Best performance |
MAE ViT-Base | Self-supervised | 0.971 | Strong generalization |
Supervised ViT-Base | ImageNet | 0.948 | Good baseline |
Random ViT | No pretraining | 0.524 | Sanity check |
Layer(s) | ROC-AUC | Feature Dim | Memory (GB) |
---|---|---|---|
Last layer [-1] | 0.982 | 768 | 2.1 |
Last 3 layers [-3:] | 0.979 | 2304 | 6.3 |
Middle layer [6] | 0.961 | 768 | 2.1 |
All layers | 0.983 | 9216 | 25.4 |
λ (Regularization) | ROC-AUC | Stability | Convergence |
---|---|---|---|
0.0001 | 0.974 | Low | Slow |
0.001 | 0.982 | High | Fast |
0.01 | 0.979 | High | Fast |
0.1 | 0.951 | High | Fast |
- Large Cracks: Consistently detected with high confidence (score > 15,000)
- Contamination: Well-localized with clear boundaries
- Small Defects: Successfully identified despite subtle appearance
- Very Small Cracks: Occasionally missed when < 5 pixels wide
- Reflection Artifacts: Sometimes produce false positives
- Edge Effects: Boundary regions may have elevated scores
Analysis of transformer attention maps reveals:
- Strong attention to defect regions in anomalous samples
- Distributed attention across bottle surface in normal samples
- Edge and highlight regions receive consistent attention
- Feature Extraction: 15 minutes for 209 training images (RTX 3080)
- Statistical Fitting: < 5 seconds for Mahalanobis parameters
- Memory Bank Creation: 2 minutes for PatchCore (10% coreset)
- Batch Processing: 150 images/second (batch size 32)
- Single Image: 47ms average inference time
- Memory Usage: 2.1GB GPU memory for batch inference
Our DINO ViT + Mahalanobis approach achieves competitive performance:
- Advantages: Superior image-level detection, fast inference
- Trade-offs: Slightly lower pixel-level PRO score vs. PatchCore
- Efficiency: 2x faster inference than PatchCore baseline
While our approach demonstrates strong performance, several limitations should be acknowledged:
- Computational Requirements: Vision Transformers require significant GPU memory (>2GB) for inference
- Pretrained Dependency: Performance heavily relies on quality of pretrained ViT models
- Patch Resolution: Fixed 16×16 patch size may miss very fine defects (<5 pixels)
- Color Dependency: Method may struggle with grayscale or unusual color spaces
- Domain Specificity: Trained models are specific to bottle inspection and may not generalize across categories
- Limited Defect Types: Evaluation limited to three defect categories in MVTec bottles
- Dataset Size: Relatively small training set (209 samples) may limit generalization
- Threshold Sensitivity: Performance depends on careful threshold tuning
- False Positive Rate: Edge regions and reflections can trigger false alarms
- Asymmetric Performance: Better at image-level detection than precise localization
- Multi-Category Models: Develop unified models handling multiple object categories
- Attention Analysis: Deeper investigation of transformer attention patterns in anomaly detection
- Hybrid Approaches: Combine multiple ViT layers and scoring methods for improved performance
- Real-time Optimization: Model quantization and pruning for edge deployment
- Video Anomaly Detection: Extend approach to temporal anomaly detection in manufacturing videos
- Few-Shot Learning: Investigate adaptation with minimal normal samples (<50 images)
- Multimodal Integration: Incorporate thermal, depth, and other sensor modalities
- Explainable AI: Develop interpretable anomaly explanations for industrial applications
- Architecture Search: Automated discovery of optimal ViT configurations for anomaly detection
- Self-Supervised Pretraining: Domain-specific pretraining on industrial imagery
- Uncertainty Quantification: Bayesian approaches for confidence estimation
- Online Learning: Continuous adaptation to changing manufacturing conditions
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
# Clone repository
git clone https://github.com/BrewedAlgorithms/anomaly-detection-vit.git
cd anomaly-detection-vit
# Create virtual environment
python -m venv vit_anomaly_env
source vit_anomaly_env/bin/activate # Linux/Mac
# vit_anomaly_env\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Test with sample data
python test_single_image.py data/mvtec/bottle/test/broken_large/000.png
# Train model
python scripts/train.py
# Launch web interface (optional)
streamlit run web_app.py
Edit config.yaml
to customize:
# Model settings
model:
model_type: "dino" # dino, mae, supervised
scoring_method: "mahalanobis" # mahalanobis, patchcore
regularization: 0.001
# Dataset settings
dataset:
name: "mvtec"
category: "bottle"
data_dir: "data/mvtec"
image_size: [224, 224]
# Training settings
training:
batch_size: 16
num_workers: 4
If you use this work in your research, please cite:
@misc{khade2024vit_anomaly,
title={Vision Transformer-Based Anomaly Detection for Industrial Quality Control},
author={Khade, Durgesh},
year={2024},
publisher={GitHub},
url={https://github.com/BrewedAlgorithms/anomaly-detection-vit}
}
@inproceedings{dosovitskiy2021vit,
title={An image is worth 16x16 words: Transformers for image recognition at scale},
author={Dosovitskiy, Alexey and others},
booktitle={ICLR},
year={2021}
}
@inproceedings{caron2021dino,
title={Emerging properties in self-supervised vision transformers},
author={Caron, Mathilde and others},
booktitle={ICCV},
year={2021}
}
@inproceedings{roth2022patchcore,
title={Towards total recall in industrial anomaly detection},
author={Roth, Karsten and others},
booktitle={CVPR},
year={2022}
}
We thank the following contributions that made this work possible:
- Hugging Face for the Transformers library and pretrained models
- Facebook AI Research for DINO and MAE self-supervised pretraining
- MVTec Software GmbH for the comprehensive anomaly detection dataset
- PyTorch Team for the deep learning framework
- OpenCV and Albumentations communities for computer vision tools
- Streamlit for the interactive web interface framework
- NumPy and SciPy ecosystems for numerical computing
- Computer vision researchers advancing transformer architectures
- Industrial anomaly detection research community
- Open source contributors and maintainers
This project is licensed under the MIT License:
MIT License
Copyright (c) 2024 Durgesh Khade
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Contact: GitHub Issues | Discussions
Research in Computer Vision and Industrial AI