This project shows how to build a Transformer-based gesture recognition system using PyTorch, ONNX, and Gradio. You’ll train on a small dataset, export to ONNX for faster inference, and run a real-time demo app.
    transformer-gesture/
    │
    ├── data/              # Put your gesture videos here
    │   ├── swipe_left/
    │   ├── swipe_right/
    │   └── stop/
    │
    ├── images/            # Screenshots for tutorial & README
    │   ├── training-logs.png
    │   ├── confusion-matrix.png
    │   └── realtime-demo.png
    │
    ├── labels.txt         # One class name per line (matches folders in data/)
    ├── dataset.py         # Dataset loader
    ├── train.py           # Training script
    ├── export_onnx.py     # Export trained model to ONNX
    ├── app.py             # Gradio demo app (upload/record gestures)
    ├── eval.py            # Evaluate accuracy + confusion matrix
    ├── benchmark.py       # Measure inference latency
    ├── requirements.txt   # Dependencies
    └── README.md          # This file
- Clone this repo and create a virtual environment:

    git clone <your-repo-url>
    cd transformer-gesture
    python -m venv .venv
    source .venv/bin/activate    # (Linux/Mac)
    .venv\Scripts\activate       # (Windows)
- Install requirements:

    pip install -r requirements.txt
Place your gesture videos under data/<class_name>/. For example:
    data/
    ├── swipe_left/
    │   ├── clip1.mp4
    │   └── clip2.mp4
    ├── swipe_right/
    └── stop/
Update labels.txt so each line matches the folder names:
    swipe_left
    swipe_right
    stop
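For reference, here is a minimal sketch of how a loader like dataset.py might map these folders and labels.txt entries to training samples. The function names and the .mp4-only glob are illustrative assumptions, not the repo's actual code:

```python
# Hypothetical sketch: map data/<class_name>/ clips to (path, label) pairs.
# Names and the .mp4-only glob are assumptions, not the repo's actual code.
from pathlib import Path

def load_labels(path="labels.txt"):
    # One class name per line; line order defines the class index.
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

def index_clips(data_dir="data"):
    samples = []
    for idx, name in enumerate(load_labels()):
        for clip in sorted(Path(data_dir, name).glob("*.mp4")):
            samples.append((clip, idx))  # (video path, class index)
    return samples
```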
💡 Tip: In the Gradio app, you can also record clips directly from your webcam.
    python train.py

This saves the best weights to vit_temporal_best.pt.
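The actual architecture is defined in train.py; as a rough illustration of the "per-frame ViT features + temporal Transformer" idea the file names suggest, here is a hedged sketch. The feature dimension, layer counts, and mean-pooling head are all assumptions:

```python
# Illustrative sketch only: frame features run through a temporal
# Transformer encoder, mean-pooled over time, then classified.
import torch
import torch.nn as nn

class TemporalTransformerClassifier(nn.Module):
    def __init__(self, feat_dim=768, num_classes=3, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):       # (batch, frames, feat_dim)
        x = self.encoder(frame_feats)     # temporal self-attention
        return self.head(x.mean(dim=1))   # pool over time, then classify
```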
Here’s what the training logs look like:

![Training logs](images/training-logs.png)
    python export_onnx.py

This generates vit_temporal.onnx for fast inference.
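Under the hood this presumably calls torch.onnx.export along the following lines; the dummy input shape, tensor names, and model class (reusing the hypothetical sketch above) are assumptions rather than the repo's exact values:

```python
# Hedged export sketch; shapes and names are assumptions.
import torch

model = TemporalTransformerClassifier()  # hypothetical class from the sketch above
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 16, 768)          # (batch, frames, feat_dim) -- assumed
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["frames"], output_names=["logits"],
    dynamic_axes={"frames": {0: "batch"}},  # allow variable batch size
)
```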
    python app.py

Open the URL shown in the terminal (default: http://127.0.0.1:7860). You can record a short gesture and get predictions like this:

![Real-time demo](images/realtime-demo.png)
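For a rough idea of how app.py could wire the ONNX model into Gradio, here is a minimal sketch. The feature-extraction step is a stub returning random features, and the "frames" input name carries over from the export sketch above:

```python
# Minimal Gradio sketch; the feature extractor is a stub, not real preprocessing.
import gradio as gr
import numpy as np
import onnxruntime as ort

LABELS = [line.strip() for line in open("labels.txt") if line.strip()]
session = ort.InferenceSession("vit_temporal.onnx")

def extract_frame_features(video_path, frames=16, feat_dim=768):
    # Stub: the real app would run a per-frame ViT backbone here.
    return np.random.randn(1, frames, feat_dim).astype(np.float32)

def predict(video_path):
    feats = extract_frame_features(video_path)
    logits = session.run(None, {"frames": feats})[0]  # (1, num_classes)
    exps = np.exp(logits - logits.max())              # stable softmax
    probs = exps / exps.sum()
    return {name: float(p) for name, p in zip(LABELS, probs[0])}

demo = gr.Interface(fn=predict, inputs=gr.Video(), outputs=gr.Label())
demo.launch()
```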
    python eval.py

This prints validation accuracy and displays a confusion matrix heatmap:

![Confusion matrix](images/confusion-matrix.png)
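A report like this can be produced with scikit-learn; the sketch below assumes you already have lists of true and predicted class indices (all names are illustrative):

```python
# Sketch of an accuracy + confusion-matrix report, given predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

def report(y_true, y_pred, labels):
    print(f"Validation accuracy: {accuracy_score(y_true, y_pred):.3f}")
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=labels)
    plt.show()

# Example with dummy predictions:
report([0, 1, 2, 2], [0, 1, 1, 2], ["swipe_left", "swipe_right", "stop"])
```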
    python benchmark.py

This measures the average inference time per clip.
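A simple timing loop over the ONNX Runtime session is enough for this kind of measurement; the clip shape, warm-up count, and "frames" input name below are assumptions carried over from the earlier sketches:

```python
# Hedged latency sketch: time repeated ONNX Runtime calls on a dummy clip.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vit_temporal.onnx")
clip = np.random.randn(1, 16, 768).astype(np.float32)  # assumed input shape

for _ in range(10):                                    # warm-up runs
    session.run(None, {"frames": clip})

n = 100
start = time.perf_counter()
for _ in range(n):
    session.run(None, {"frames": clip})
print(f"Average latency: {(time.perf_counter() - start) / n * 1000:.2f} ms/clip")
```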
- This project is intended as a tutorial/demo, not production code.
- For higher accuracy, expand your dataset or use a stronger video Transformer like TimeSformer or VideoMAE.
- Always consider accessibility, fairness, and ethical use when deploying gesture/speech models.


