This project shows how to build a Transformer-based gesture recognition system using PyTorch, ONNX, and Gradio. You’ll train on a small dataset, export to ONNX for faster inference, and run a real-time demo app.
    transformer-gesture/
    │
    ├── data/              # Put your gesture videos here
    │   ├── swipe_left/
    │   ├── swipe_right/
    │   └── stop/
    │
    ├── images/            # Screenshots for tutorial & README
    │   ├── training-logs.png
    │   ├── confusion-matrix.png
    │   └── realtime-demo.png
    │
    ├── labels.txt         # One class name per line (matches folders in data/)
    ├── dataset.py         # Dataset loader
    ├── train.py           # Training script
    ├── export_onnx.py     # Export trained model to ONNX
    ├── app.py             # Gradio demo app (upload/record gestures)
    ├── eval.py            # Evaluate accuracy + confusion matrix
    ├── benchmark.py       # Measure inference latency
    ├── requirements.txt   # Dependencies
    └── README.md          # This file
- Clone this repo and create a virtual environment:

    git clone <your-repo-url>
    cd transformer-gesture
    python -m venv .venv
    source .venv/bin/activate    # (Linux/Mac)
    .venv\Scripts\activate       # (Windows)
- Install requirements:

    pip install -r requirements.txt
Place your gesture videos under data/<class_name>/. For example:
    data/
    ├── swipe_left/
    │   ├── clip1.mp4
    │   └── clip2.mp4
    ├── swipe_right/
    └── stop/
Update labels.txt so each line matches the folder names:
    swipe_left
    swipe_right
    stop
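For reference, here is a minimal sketch of how a loader like dataset.py might map these folders and labels.txt entries to training samples. The function names and the .mp4-only glob are illustrative assumptions, not the repo's actual code:

```python
# Hypothetical sketch: map data/<class_name>/ clips to (path, label) pairs.
# Names and the .mp4-only glob are assumptions, not the repo's actual code.
from pathlib import Path

def load_labels(path="labels.txt"):
    # One class name per line; line order defines the class index.
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

def index_clips(data_dir="data"):
    samples = []
    for idx, name in enumerate(load_labels()):
        for clip in sorted(Path(data_dir, name).glob("*.mp4")):
            samples.append((clip, idx))  # (video path, class index)
    return samples
```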
💡 Tip: In the Gradio app, you can also record clips directly from your webcam.
    python train.py

This saves the best weights to vit_temporal_best.pt.
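The actual architecture is defined in train.py; as a rough illustration of the "per-frame ViT features + temporal Transformer" idea the file names suggest, here is a hedged sketch. The feature dimension, layer counts, and mean-pooling head are all assumptions:

```python
# Illustrative sketch only: frame features run through a temporal
# Transformer encoder, mean-pooled over time, then classified.
import torch
import torch.nn as nn

class TemporalTransformerClassifier(nn.Module):
    def __init__(self, feat_dim=768, num_classes=3, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):       # (batch, frames, feat_dim)
        x = self.encoder(frame_feats)     # temporal self-attention
        return self.head(x.mean(dim=1))   # pool over time, then classify
```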
Here’s what the training logs look like:

![Training logs](images/training-logs.png)
    python export_onnx.py

This generates vit_temporal.onnx for fast inference.
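Under the hood this presumably calls torch.onnx.export along the following lines; the dummy input shape, tensor names, and model class (reusing the hypothetical sketch above) are assumptions rather than the repo's exact values:

```python
# Hedged export sketch; shapes and names are assumptions.
import torch

model = TemporalTransformerClassifier()  # hypothetical class from the sketch above
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 16, 768)          # (batch, frames, feat_dim) -- assumed
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["frames"], output_names=["logits"],
    dynamic_axes={"frames": {0: "batch"}},  # allow variable batch size
)
```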
    python app.py

Open the URL shown in the terminal (default: http://127.0.0.1:7860). You can record a short gesture and get predictions like this:

![Real-time demo](images/realtime-demo.png)
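For a rough idea of how app.py could wire the ONNX model into Gradio, here is a minimal sketch. The feature-extraction step is a stub returning random features, and the "frames" input name carries over from the export sketch above:

```python
# Minimal Gradio sketch; the feature extractor is a stub, not real preprocessing.
import gradio as gr
import numpy as np
import onnxruntime as ort

LABELS = [line.strip() for line in open("labels.txt") if line.strip()]
session = ort.InferenceSession("vit_temporal.onnx")

def extract_frame_features(video_path, frames=16, feat_dim=768):
    # Stub: the real app would run a per-frame ViT backbone here.
    return np.random.randn(1, frames, feat_dim).astype(np.float32)

def predict(video_path):
    feats = extract_frame_features(video_path)
    logits = session.run(None, {"frames": feats})[0]  # (1, num_classes)
    exps = np.exp(logits - logits.max())              # stable softmax
    probs = exps / exps.sum()
    return {name: float(p) for name, p in zip(LABELS, probs[0])}

demo = gr.Interface(fn=predict, inputs=gr.Video(), outputs=gr.Label())
demo.launch()
```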
    python eval.py

This prints validation accuracy and displays a confusion matrix heatmap:

![Confusion matrix](images/confusion-matrix.png)
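A report like this can be produced with scikit-learn; the sketch below assumes you already have lists of true and predicted class indices (all names are illustrative):

```python
# Sketch of an accuracy + confusion-matrix report, given predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

def report(y_true, y_pred, labels):
    print(f"Validation accuracy: {accuracy_score(y_true, y_pred):.3f}")
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=labels)
    plt.show()

# Example with dummy predictions:
report([0, 1, 2, 2], [0, 1, 1, 2], ["swipe_left", "swipe_right", "stop"])
```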
    python benchmark.py

This measures the average inference time per clip.
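A simple timing loop over the ONNX Runtime session is enough for this kind of measurement; the clip shape, warm-up count, and "frames" input name below are assumptions carried over from the earlier sketches:

```python
# Hedged latency sketch: time repeated ONNX Runtime calls on a dummy clip.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vit_temporal.onnx")
clip = np.random.randn(1, 16, 768).astype(np.float32)  # assumed input shape

for _ in range(10):                                    # warm-up runs
    session.run(None, {"frames": clip})

n = 100
start = time.perf_counter()
for _ in range(n):
    session.run(None, {"frames": clip})
print(f"Average latency: {(time.perf_counter() - start) / n * 1000:.2f} ms/clip")
```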
- This project is intended as a tutorial/demo, not production code.
- For higher accuracy, expand your dataset or use a stronger video Transformer like TimeSformer or VideoMAE.
- Always consider accessibility, fairness, and ethical use when deploying gesture/speech models.


