This repository contains the complete implementation of a sequence-based Video Anomaly Detection (VAD) system built for the PixelPlay Kaggle competition. The approach is based on an LSTM Autoencoder trained only on normal video sequences, where anomalies are detected using reconstruction error.
Video
↓
Frame Extraction
↓
Resize + Normalize
↓
Sliding Window Sequences
↓
LSTM Autoencoder (train on normal only)
↓
Reconstruction Error
↓
Temporal Smoothing
↓
Frame-wise Anomaly Scores → Submission
pixelplay-vad/
│
├── README.md ← This file (full explanation)
├── requirements.txt ← Python dependencies
│
│
├── src/
│ ├── README.md
│ ├── dataset.py ← Sequence construction logic
│ ├── feature_extractor.py
│ ├── model.py ← LSTM Autoencoder definition
│ ├── train.py ← Training loop (normal data only)
│ ├── inference.py ← Reconstruction error scoring
│ └── postprocess.py ← Smoothing + submission generation
│
└── pixel-play-9.ipynb ← Executed Kaggle notebook used for submission
The src/ folder contains clean modular code. The notebook is kept only for reproducibility and proof of execution on Kaggle.
Expected structure:
data/
├── train/
│ └── normal/
│ └── video_id/frame.jpg
└── test/
└── video_id/frame.jpg
- Training is unsupervised → only normal videos are used
- Test videos may contain anomalies
- Frame order must be preserved to build sequences
Anomaly detection is based on the assumption:
The model should only learn what "normal" looks like. Anything it cannot reconstruct well is considered abnormal.
Frames are loaded from disk using OpenCV.
Why frames instead of raw video?
- PyTorch models operate on tensors, not video files
- Extracted frames give direct pixel access
- Allows random access and sliding window generation
Frames are resized to 64×64.
Why resize?
- LSTM input size = H × W
- Memory usage grows quadratically with resolution
- Higher resolution caused GPU memory overflow
Trade-off:
- ✔ Faster training
- ❌ Loss of fine spatial details
Grayscale conversion was also tried, on the assumption that it would improve predictions, but it did not help.
Why did it seem promising?
- Motion and structure matter more than color
- Reduces input size by 3×
- Makes training more stable
Pixel values are scaled to [0,1].
Why?
- Neural networks train better on small numeric ranges
- Prevents unstable gradients
- Required for MSE reconstruction stability
Feeding raw 0–255 values slows convergence.
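As a concrete illustration, here is a minimal sketch of the loading and preprocessing steps above (the function name, the `*.jpg` pattern, and the `grayscale` flag are illustrative assumptions, not the exact competition code):

```python
import glob
import os

import cv2
import numpy as np

def load_video_frames(video_dir, size=(64, 64), grayscale=False):
    """Load one video's frames in temporal order, resize, and scale to [0, 1]."""
    # sorted() preserves frame order, which the sliding windows depend on.
    frame_paths = sorted(glob.glob(os.path.join(video_dir, "*.jpg")))
    flag = cv2.IMREAD_GRAYSCALE if grayscale else cv2.IMREAD_COLOR
    frames = []
    for path in frame_paths:
        img = cv2.resize(cv2.imread(path, flag), size)    # 64x64 keeps memory manageable
        frames.append(img.astype(np.float32) / 255.0)     # [0, 255] -> [0, 1] for stable MSE training
    return np.stack(frames)                               # (num_frames, H, W[, 3])
```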
Frames are grouped into overlapping sequences using sliding windows:
[1,2,3,4,5]
[2,3,4,5,6]
[3,4,5,6,7]
...
Each training sample = one temporal sequence.
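A minimal sketch of this windowing (the function name and the `stride` parameter are illustrative):

```python
import numpy as np

def make_sequences(frames, window=5, stride=1):
    """Build overlapping temporal windows: frames [0..4], [1..5], [2..6], ..."""
    starts = range(0, len(frames) - window + 1, stride)
    return np.stack([frames[i:i + window] for i in starts])  # (num_windows, window, ...)
```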
LSTM models temporal dependencies. If we feed only single frames:
- LSTM degenerates into a useless dense layer
- No motion or behavior can be learned
This step converts:
static image problem → temporal behavior problem
The model consists of:
- Encoder LSTM: compresses the input sequence into a latent representation
- Decoder LSTM: attempts to reconstruct the original sequence
- Linear output layer: maps hidden states back to pixel space
Why an autoencoder? Because we want:
- Low error for normal behavior
- High error for unseen abnormal behavior
This avoids the need for frame-level labels.
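A minimal PyTorch sketch of this architecture, assuming flattened 64×64 grayscale frames as timesteps (the hidden size and the repeated-latent decoding scheme are illustrative choices, not necessarily the exact submitted configuration):

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Encoder LSTM -> latent state -> Decoder LSTM -> linear map back to pixels."""

    def __init__(self, input_size=64 * 64, hidden_size=256):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, input_size)  # hidden state -> pixel space

    def forward(self, x):
        # x: (batch, seq_len, H*W) -- each flattened frame is one timestep.
        _, (h, _) = self.encoder(x)                       # compress sequence into latent state
        # Repeat the final hidden state as decoder input at every timestep.
        latent = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        out, _ = self.decoder(latent)
        return self.output(out)                           # reconstruction, same shape as x
```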
Ideally, the pipeline should be:
Frame → CNN → Feature Vector → LSTM Autoencoder
But in this project:
- Only raw pixels were used
- No spatial feature learning
Reason:
- CNN+LSTM pipelines are harder to debug
- Much higher GPU memory usage
- Time constraints during competition
Consequence:
- LSTM had to learn both spatial and temporal patterns
- This severely limited representation quality
Only normal video sequences are used during training.
Why?
- If anomalies are included, the model will learn to reconstruct them
- Reconstruction error would no longer signal abnormality
This is a core principle of reconstruction-based anomaly detection.
Loss = mean squared error (MSE) between:
- input sequence
- reconstructed sequence
Why MSE?
- Pixel-wise reconstruction task
- Penalizes small deviations across entire frame
Higher MSE → more abnormal motion pattern.
Why Adam?
- Stable for RNN training
- Less sensitive to learning rate
- Works well without heavy tuning
SGD would require more careful scheduling and tuning.
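Putting the loss and optimizer together, a minimal training step might look like this (the learning rate is an assumed default, and the structure builds on the model sketch above):

```python
import torch
import torch.nn.functional as F

model = LSTMAutoencoder()                                  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an illustrative assumption

def train_step(batch):
    """One optimization step on a batch of normal-only sequences, shape (B, T, H*W)."""
    optimizer.zero_grad()
    recon = model(batch)
    loss = F.mse_loss(recon, batch)  # pixel-wise reconstruction error over the sequence
    loss.backward()
    optimizer.step()
    return loss.item()
```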
Training configuration:
- Small batch size
- Limited epochs
Why?
- Sequences consume much more memory than images
- Kaggle GPU session limits
- Overfitting observed early
Increasing epochs did not significantly improve validation behavior.
During testing:
- Generate sequences
- Pass through autoencoder
- Compute reconstruction error
Each sequence receives one anomaly score.
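A sketch of the per-sequence scoring, reusing the LSTMAutoencoder sketch above (tensor shapes are assumptions):

```python
import torch

@torch.no_grad()
def score_sequences(model, sequences):
    """One anomaly score per sequence: mean squared reconstruction error."""
    model.eval()
    recon = model(sequences)                             # (B, T, H*W)
    return ((recon - sequences) ** 2).mean(dim=(1, 2))   # higher error = more anomalous
```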
Because submission requires frame-wise scores:
- Sequence scores are aligned to center frames
- Scores are propagated along the video timeline
This step is required because the model does not produce per-frame predictions directly.
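A minimal sketch of this alignment, assuming a window of 5 and center-frame assignment (the edge handling via interpolation is an illustrative choice, not necessarily what the submitted code does):

```python
import numpy as np

def sequence_to_frame_scores(seq_scores, num_frames, window=5):
    """Assign each sequence score to its center frame, then fill uncovered edge frames."""
    frame_scores = np.full(num_frames, np.nan)
    center_offset = window // 2
    for i, score in enumerate(seq_scores):
        frame_scores[i + center_offset] = score  # sequence starting at frame i covers its center
    # Edge frames have no centered sequence; propagate the nearest available scores.
    valid = ~np.isnan(frame_scores)
    return np.interp(np.arange(num_frames), np.flatnonzero(valid), frame_scores[valid])
```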
Rolling average is applied over time.
Why?
- Reconstruction error is noisy
- Slight lighting changes create spikes
- Real anomalies are temporally continuous
Smoothing enforces time consistency.
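For illustration, a rolling average implemented via convolution (the window size k=9 is an assumption and would need tuning):

```python
import numpy as np

def smooth_scores(scores, k=9):
    """Rolling average over the frame timeline to suppress single-frame spikes."""
    kernel = np.ones(k) / k
    return np.convolve(scores, kernel, mode="same")  # 'same' keeps output length == input length
```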
Scores are normalized per video.
Why?
- Different videos have different base reconstruction error
- Ranking-based metrics benefit from normalization
This improves relative ordering between frames.
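A minimal sketch assuming min-max normalization, which is one common per-video choice:

```python
import numpy as np

def normalize_per_video(scores, eps=1e-8):
    """Min-max normalize one video's scores to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + eps)  # eps guards against a constant-score video
```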
Final score: approximately 0.58 AUC
What worked:
- Proper temporal modeling using LSTM
- Unsupervised normal-only training
- Reconstruction-based anomaly scoring
- Temporal smoothing
What limited performance:
- No CNN feature extraction
- Short temporal windows
- High sensitivity to lighting noise
- Weak spatial representations
This was not a tuning problem. It was a representation limitation.
The file pixel-play-9.ipynb contains the full Kaggle pipeline:
- Data loading
- Training
- Inference
- Submission
It has already been executed so that reviewers can inspect the outputs directly.
This repository represents a learning-focused implementation of temporal anomaly detection rather than a production-grade or competition-winning system. The design choices were made to balance:
- correctness of anomaly detection principle
- feasibility within Kaggle resource limits
- implementation complexity for a first VAD project
Despite limitations, the project demonstrates a complete end-to-end VAD pipeline using sequence modeling, which is the correct conceptual direction for surveillance anomaly detection.