Real-time pose-based action classification for fencing footage
An end-to-end machine learning system that automatically recognizes four fundamental fencing actions (idle, advance, retreat, lunge) from video footage using pose estimation and temporal modeling.
This project demonstrates a complete ML pipeline for sports video analysis:
- Video Processing → Extract pose keypoints from fencing videos using MediaPipe
- Feature Engineering → Convert raw poses into 23 biomechanical features per frame
- Temporal Modeling → Train lightweight CNN to classify action sequences
- Real-Time Inference → Process live video with action predictions overlay
Key Results:
- 97.14% accuracy on the validation split
- Real-time inference at 30+ FPS on CPU (M2 MacBook)
- 246K parameters (~0.25MB model) suitable for edge deployment
- Works well on dynamic actions; style-dependent on static positions
Limitations:
- Trained on single fencer (limited style diversity)
- Struggles with wide guard positions (sometimes misclassifies as lunge)
- Best performance on clear, isolated actions
# Clone repository
git clone <your-repo-url>
cd fencingveo
# Install dependencies
pip install -r requirements.txt
Required packages:
- torch >= 2.0.0
- opencv-python >= 4.8.0
- mediapipe >= 0.10.0
- numpy >= 1.24.0
- pandas >= 2.0.0
- scikit-learn >= 1.3.0
- matplotlib >= 3.7.0
Organize your fencing videos by action type (there are already some default ones loaded):
src/data/videoes/
├── idle/
│   └── idle_video.mp4
├── advance/
│   └── advance_video.mp4
├── retreat/
│   └── retreat_video.mp4
└── lunge/
    └── lunge_video.mp4
Process all videos at once:
python src/data/batch_process_videos.py \
--videos_dir src/data/videoes \
--output_dir data/real \
--sequence_length 60 \
--overlap 30
Or process individual videos:
python src/data/extract_poses_from_video.py \
--video_path path/to/video.mp4 \
--output_dir data/real \
--action_label lunge \
--sequence_length 60 \
--visualize
What this does:
- Uses MediaPipe to detect poses frame-by-frame
- Segments video into 60-frame sequences (2 seconds at 30fps)
- Saves pose sequences as .npy files
- Creates/updates data/real/labels.csv with action labels
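For reference, the core of this step looks roughly like the sketch below. It is a simplified illustration, not the repository's exact code: the video path is a placeholder, and the windowing assumes --overlap 30 means a 30-frame stride between consecutive 60-frame windows.

import cv2
import mediapipe as mp
import numpy as np

pose = mp.solutions.pose.Pose(static_image_mode=False)
cap = cv2.VideoCapture("path/to/video.mp4")   # placeholder path
frames = []
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks:
        # 33 MediaPipe landmarks, normalized (x, y) per frame
        frames.append([(lm.x, lm.y) for lm in result.pose_landmarks.landmark])
cap.release()

# Slice into overlapping 60-frame windows (stride 30 frames)
frames = np.array(frames)                                   # [num_frames, 33, 2]
windows = [frames[s:s + 60] for s in range(0, len(frames) - 59, 30)]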
Train the model:
python src/training/train.py \
--data_dir data/real \
--model_type cnn \
--batch_size 16 \
--epochs 50 \
--learning_rate 0.0005
Training outputs:
- models/best_model.pt - Best model checkpoint
- models/final_model.pt - Final model after training
- models/training_history.json - Loss/accuracy logs
- models/training_curves.png - Training curve visualization
Expected training time: ~5-10 minutes on CPU for 200-300 sequences
Evaluate the trained model:
python src/training/evaluate.py \
--model_path models/best_model.pt \
--data_dir data/real
Evaluation outputs:
- Per-class precision, recall, F1 scores
- Confusion matrix (saved as PNG)
- Overall accuracy metrics
- Metrics saved to results/evaluation_results_test.json
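The per-class metrics and confusion matrix correspond to the standard scikit-learn utilities; a minimal sketch (with placeholder labels and predictions standing in for the test set) looks like this:

from sklearn.metrics import classification_report, confusion_matrix

ACTIONS = ["idle", "advance", "retreat", "lunge"]
y_true = [0, 1, 2, 3, 1]   # placeholder ground-truth class indices
y_pred = [0, 1, 2, 3, 2]   # placeholder model predictions

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, target_names=ACTIONS))
# Confusion matrix: rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred))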
Process a fencing video with live predictions:
python src/inference/video_inference.py \
--video_path path/to/test_video.mp4 \
--model_path models/best_model.pt \
--output_path results/annotated.mp4
What you'll see:
- Pose skeleton overlay on video
- Current action prediction with confidence
- Color-coded by action (idle=blue, advance=green, retreat=yellow, lunge=red)
- Press 'q' to quit
Performance: ~30 FPS on M2 MacBook (CPU only)
MediaPipe Pose extracts 33 3D landmarks per frame, which we convert to 18 2D keypoints (OpenPose format):
Keypoints: Nose, Neck, Shoulders, Elbows, Wrists, Hips, Knees, Ankles, Eyes, Ears
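The exact conversion lives in src/features/pose_features.py; the sketch below is an illustrative version based on the standard MediaPipe and OpenPose landmark orderings. The neck, which MediaPipe does not provide, is taken as the shoulder midpoint; the MP_TO_OPENPOSE table is a name introduced here for illustration.

import numpy as np

# OpenPose order: nose, neck, R/L shoulder, elbow, wrist, R/L hip, knee, ankle, eyes, ears
MP_TO_OPENPOSE = [0, None, 12, 14, 16, 11, 13, 15, 24, 26, 28, 23, 25, 27, 5, 2, 8, 7]

def mediapipe_to_openpose(landmarks_33: np.ndarray) -> np.ndarray:
    """landmarks_33: [33, 2] normalized (x, y). Returns [18, 2] in OpenPose order."""
    out = np.zeros((18, 2), dtype=np.float32)
    for i, mp_idx in enumerate(MP_TO_OPENPOSE):
        if mp_idx is None:
            # Neck has no MediaPipe landmark: use the midpoint of the shoulders
            out[i] = landmarks_33[[11, 12]].mean(axis=0)
        else:
            out[i] = landmarks_33[mp_idx]
    return out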
From raw pose keypoints, we extract 23 biomechanical features per frame:
- 6 joint angles: Elbows, knees, hips (radians)
- 7 distances: Torso, upper arms, thighs, shoulder/hip width
- 6 spatial stats: Mean position, std deviation, span
- 2 center of mass: (x, y) coordinates
- 2 velocities: Frame-to-frame COM displacement
These features capture the biomechanical patterns that distinguish fencing actions.
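As a flavor of how such features are computed, here is a simplified example assuming an [18, 2] array of normalized (x, y) keypoints; the full 23-feature set is implemented in src/features/pose_features.py, and the helper names here are illustrative.

import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at joint b (radians) between segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def center_of_mass(keypoints: np.ndarray) -> np.ndarray:
    """Approximate COM as the mean of all keypoints."""
    return keypoints.mean(axis=0)

# Example: right-knee angle from hip (idx 8), knee (idx 9), ankle (idx 10) in OpenPose order
# knee_angle = joint_angle(kps[8], kps[9], kps[10])
# The two velocity features are the frame-to-frame COM displacement:
# vel = center_of_mass(kps_t) - center_of_mass(kps_prev)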
Architecture:
Input: [batch, 60 frames, 23 features]
    ↓
Conv1D Block 1: 23 → 64 channels (kernel=5)
    ↓ MaxPool + Dropout
Conv1D Block 2: 64 → 128 channels
    ↓ MaxPool + Dropout
Conv1D Block 3: 128 → 256 channels
    ↓ MaxPool + Dropout
Global Average Pooling
    ↓
FC: 256 → 128 → 4 classes
Total parameters: 246,916 (~0.25MB)
The model learns temporal patterns across the 60-frame sequences:
- Early layers detect local motion (5-10 frames)
- Deeper layers recognize full action patterns (20+ frames)
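A PyTorch sketch of this architecture is shown below; it follows the layer sizes listed above, though details such as padding and dropout placement may differ slightly from src/models/temporal_cnn.py.

import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    def __init__(self, in_features=23, num_classes=4):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                nn.BatchNorm1d(c_out), nn.ReLU(),
                nn.MaxPool1d(2), nn.Dropout(0.3),
            )
        self.conv = nn.Sequential(block(in_features, 64), block(64, 128), block(128, 256))
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, x):              # x: [batch, 60, 23]
        x = x.transpose(1, 2)          # Conv1d expects [batch, channels, time]
        x = self.conv(x)               # [batch, 256, T']
        x = x.mean(dim=2)              # global average pooling over time
        return self.head(x)            # [batch, 4] class logits

logits = TemporalCNN()(torch.randn(8, 60, 23))   # -> torch.Size([8, 4])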
Pipeline:
- Read video frame
- MediaPipe pose detection (~20ms)
- Maintain sliding 60-frame buffer
- Extract features when buffer full
- Model prediction every 15 frames (~5ms)
- Overlay prediction on video
Total latency: <100ms per prediction
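A simplified sketch of that sliding-window loop is below; the class name, buffer handling, and feature/model calls are placeholders for the real implementation in src/inference/video_inference.py.

from collections import deque
import numpy as np
import torch

class ActionBuffer:
    """Rolling 60-frame feature buffer that re-predicts every 15 frames."""
    def __init__(self, model, actions, seq_len=60, predict_every=15):
        self.model, self.actions = model, actions
        self.buffer = deque(maxlen=seq_len)
        self.predict_every = predict_every
        self.frame_idx = 0
        self.current_action = "idle"

    def update(self, features_23):
        """Push one frame's 23-feature vector; return the latest prediction."""
        self.buffer.append(features_23)
        self.frame_idx += 1
        if len(self.buffer) == self.buffer.maxlen and self.frame_idx % self.predict_every == 0:
            seq = torch.tensor(np.stack(self.buffer), dtype=torch.float32).unsqueeze(0)  # [1, 60, 23]
            with torch.no_grad():
                probs = torch.softmax(self.model(seq), dim=1)[0]
            self.current_action = self.actions[int(probs.argmax())]
        return self.current_action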
fencingveo/
├── README.md                            # This file
├── MEDIUM_ARTICLE.md                    # Detailed technical writeup
├── requirements.txt                     # Python dependencies
├── src/
│   ├── data/
│   │   ├── batch_process_videos.py     # Batch video processing
│   │   ├── extract_poses_from_video.py # Single video pose extraction
│   │   ├── dataset.py                  # PyTorch Dataset classes
│   │   └── videoes/                    # Place your training videos here
│   ├── features/
│   │   └── pose_features.py            # Feature extraction from keypoints
│   ├── models/
│   │   ├── temporal_cnn.py             # CNN model architecture
│   │   └── lstm_model.py               # LSTM alternative (optional)
│   ├── training/
│   │   ├── train.py                    # Training pipeline
│   │   └── evaluate.py                 # Model evaluation
│   └── inference/
│       └── video_inference.py          # Real-time video inference
├── data/
│   └── real/                           # Extracted pose sequences go here
├── models/                             # Trained model checkpoints
└── results/                            # Evaluation outputs, annotated videos
Best practices for recording:
- Camera Position: Side view, 3-5 meters from fencer
- Frame Rate: 30 FPS or higher
- Lighting: Good, even lighting (avoid shadows)
- Background: Uncluttered, contrasting with fencer
- Full Body: Keep entire body in frame throughout action
- Clothing: Regular clothes work fine (no fencing gear needed)
Video organization:
Place videos in folders by action type:
src/data/videoes/
idle/ # Standing in en-garde position
advance/ # Forward footwork movements
retreat/ # Backward footwork movements
lunge/ # Attack lunges
How many videos?
- Minimum: 1-2 videos per action (~30-60 seconds each)
- Better: 3-5 videos per action with variation
- Ideal: Multiple fencers, different styles, various speeds
Default is 60 frames (2 seconds at 30fps). Adjust based on your actions:
# Shorter sequences for quick actions
python src/data/extract_poses_from_video.py --sequence_length 30
# Longer for complex combinations
python src/data/extract_poses_from_video.py --sequence_length 90
To add a new action:
- Create a new folder in src/data/videoes/
- Add videos of the new action
- Update the action list in src/data/dataset.py:
  ACTIONS = ['idle', 'advance', 'retreat', 'lunge', 'parry']  # Add your action
- Re-extract poses and retrain
Alternative to Temporal CNN:
python src/training/train.py \
--data_dir data/real \
--model_type lstm \
--batch_size 16 \
--epochs 50
The LSTM has more parameters (~520K) but can capture longer-range dependencies.
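A rough sketch of what such an LSTM classifier can look like is below; the actual architecture is defined in src/models/lstm_model.py, and the layer sizes here are guesses rather than the repository's values.

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, in_features=23, hidden=128, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: [batch, 60, 23]
        out, _ = self.lstm(x)          # [batch, 60, hidden]
        return self.fc(out[:, -1])     # classify from the last time step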
To make the model more robust:
- Record multiple fencers - 3-5 different people with various styles
- Vary guard positions - Include both compact and wide stances
- Mix video conditions - Different lighting, backgrounds, camera angles
- Data augmentation - Time warping, spatial jittering during training
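For the last point, two illustrative augmentations on [T, K, 2] pose sequences might look like this (hypothetical helpers, not part of the repo):

import numpy as np

def spatial_jitter(seq: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Add small Gaussian noise to normalized keypoint coordinates."""
    return seq + np.random.normal(0.0, sigma, seq.shape)

def time_warp(seq: np.ndarray, factor: float = 1.2) -> np.ndarray:
    """Speed the action up or down by resampling frame indices (clipped to range), keeping length T."""
    T = seq.shape[0]
    idx = np.clip(np.round(np.arange(T) * factor), 0, T - 1).astype(int)
    return seq[idx]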
Pose sequences: NumPy arrays [T, K, 2]
- T = 60 frames (time dimension)
- K = 18 keypoints
- Last dim = (x, y) normalized coordinates in [0, 1]
Labels: CSV file with columns:
- sequence_path: Path to .npy file
- label: Action name (idle/advance/retreat/lunge)
Data split: 70% train / 15% validation / 15% test
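Given that format, loading one stored sequence and its label is straightforward (paths follow the labels.csv layout described above):

import numpy as np
import pandas as pd

labels = pd.read_csv("data/real/labels.csv")   # columns: sequence_path, label
row = labels.iloc[0]
seq = np.load(row["sequence_path"])            # [60, 18, 2] normalized keypoints
print(seq.shape, row["label"])                 # e.g. (60, 18, 2) lunge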
Optimizer: Adam (lr=0.0005, weight_decay=1e-4)
Loss: CrossEntropyLoss
Scheduler: ReduceLROnPlateau (factor=0.5, patience=5)
Early stopping: Patience=10 epochs on validation accuracy
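Written out in PyTorch, that setup corresponds roughly to the sketch below; the authoritative version is in src/training/train.py, and the stand-in model is only there to make the snippet self-contained.

import torch
import torch.nn as nn

model = nn.Linear(23, 4)   # stand-in for the Temporal CNN; any nn.Module works here
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

# Each epoch (sketch): step the scheduler on the validation metric, and stop
# training if validation accuracy has not improved for 10 epochs.
# scheduler.step(val_loss)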
Temporal CNN:
- 3 Conv1D blocks (64 → 128 → 256 channels)
- Kernel size: 5 (captures ~0.15 sec patterns)
- BatchNorm + ReLU + MaxPool + Dropout(0.3)
- Global average pooling
- FC layers: 256 → 128 → 4
Parameters: 246,916 (~0.25MB file size)
This system demonstrates capabilities relevant to sports analytics platforms:
- Automated Tagging - Automatically label video segments by action type
- Performance Metrics - Count action frequencies (advances per minute)
- Tactical Analysis - Track movement patterns and tendencies
- Coaching Tools - Identify technique issues in real-time
- Highlight Generation - Detect exciting moments (lunges, exchanges)
For Veo specifically: The edge-deployable model (~0.25MB) can run on camera hardware, enabling real-time on-device analysis without cloud dependencies.
MIT License - Free to use for educational and portfolio purposes.