An unsupervised deep learning framework for realistic image animation using learned keypoints, local affine motion modeling, and occlusion-aware generation.
This project implements an unsupervised image animation system that transfers motion from a driving video to a static source image of the same semantic category (e.g., faces, articulated human bodies, animated GIFs).
The method jointly learns:
- A compact first-order motion representation consisting of sparse keypoints and local affine transformations.
- A dense motion network that aggregates local motions into a full optical flow field and an occlusion map.
- An occlusion-aware generator that warps, filters (using the occlusion mask), and inpaints to produce temporally coherent frames.
Training is self-supervised — the model reconstructs frames from videos without any explicit landmark, pose, or segmentation annotations.
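As a mental model, one self-supervised training step can be sketched as follows. The module interfaces here are hypothetical stand-ins for the real components under `modules/*`, and a plain L1 loss stands in for the project's actual reconstruction losses:

```python
# Illustrative only: module names and signatures are hypothetical,
# not the exact interfaces defined under modules/*.
import torch.nn.functional as F

def training_step(kp_detector, dense_motion, generator, source, driving):
    kp_source = kp_detector(source)    # keypoints (and optional Jacobians)
    kp_driving = kp_detector(driving)
    # Dense flow + occlusion map from the sparse first-order motions
    motion = dense_motion(source, kp_source, kp_driving)
    generated = generator(source, motion)  # warp, mask, inpaint
    # Plain L1 stands in for the actual loss terms used in training
    return F.l1_loss(generated, driving)
```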
The core of the method is a first-order motion model that uses a learned set of keypoints to represent complex deformations between a source image and a driving video frame.
Given a frame $X$, we estimate each transformation $T_{X \leftarrow R}$ from an abstract reference frame $R$ in the neighborhood of the learned keypoints.

Formally, for $K$ keypoints $p_1, \dots, p_K$, we take first-order Taylor expansions of $T_{S \leftarrow R}$ and $T_{D \leftarrow R}$ around each $p_k$ and approximate the motion from the driving frame $D$ to the source frame $S$ as

$$
T_{S \leftarrow D}(z) \approx T_{S \leftarrow R}(p_k) + J_k \bigl(z - T_{D \leftarrow R}(p_k)\bigr),
$$

where the Jacobian $J_k$ represents the local affine transformation at keypoint $p_k$:

$$
J_k = \left(\frac{d}{dp} T_{S \leftarrow R}(p) \Big|_{p = p_k}\right) \left(\frac{d}{dp} T_{D \leftarrow R}(p) \Big|_{p = p_k}\right)^{-1}
$$
Legend

| Symbol | Meaning |
|---|---|
| $p_k$ | Keypoint locations in the reference pose space $R$ |
| $z$ | Point locations in the source ($S$) or driving ($D$) frame |
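A minimal PyTorch sketch of this approximation for a single keypoint; tensor names and shapes are illustrative, not the repo's exact API:

```python
# Illustrative sketch of the first-order motion approximation above.
import torch

def first_order_flow(z, kp_source, kp_driving, jac_source, jac_driving):
    """Approximate T_{S<-D}(z) in the neighborhood of one keypoint p_k.

    z:           (..., 2) query points in the driving frame D
    kp_source:   (2,)   T_{S<-R}(p_k), keypoint location in S
    kp_driving:  (2,)   T_{D<-R}(p_k), keypoint location in D
    jac_source:  (2, 2) d/dp T_{S<-R}(p) evaluated at p_k
    jac_driving: (2, 2) d/dp T_{D<-R}(p) evaluated at p_k
    """
    # J_k = (d T_{S<-R}/dp) (d T_{D<-R}/dp)^{-1}
    jac = jac_source @ torch.linalg.inv(jac_driving)
    # T_{S<-D}(z) ~ T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k))
    return kp_source + (z - kp_driving) @ jac.T
```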
Keypoint Detector
- Architecture: Hourglass encoder–decoder
- Outputs: heatmaps, converted to keypoint positions via spatial expectation (soft-argmax; see the sketch below)
- Optional: predicts a local Jacobian (a 2×2 affine transformation) per keypoint
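The spatial-expectation step can be sketched as a soft-argmax over each predicted heatmap; this is a minimal version with illustrative tensor names:

```python
# Sketch: heatmaps -> keypoints via spatial expectation (soft-argmax).
import torch
import torch.nn.functional as F

def heatmap_to_keypoints(heatmaps):
    """heatmaps: (B, K, H, W) detector logits -> (B, K, 2) coords in [-1, 1]."""
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    # Normalized coordinate grids
    ys = torch.linspace(-1, 1, h, device=heatmaps.device)
    xs = torch.linspace(-1, 1, w, device=heatmaps.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    # Expected (x, y) position under each heatmap's distribution
    kp_x = (probs * grid_x).sum(dim=(2, 3))
    kp_y = (probs * grid_y).sum(dim=(2, 3))
    return torch.stack([kp_x, kp_y], dim=-1)
```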
Dense Motion Network
- Builds Gaussian heatmaps centered at the keypoints (for source and driving frames)
- Constructs sparse motion fields from the local affine transforms $J_k$
- Composes them into a dense deformation field (optical flow), as sketched below
- Outputs:
  - Deformation field (dense flow grid)
  - Occlusion map (pixel-wise visibility mask)
  - `sparse_deformed`: per-keypoint warped source features
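The composition step amounts to a weighted sum of the per-keypoint motions. A minimal sketch, under the assumption that the network predicts softmax attention masks over the $K$ keypoint motions plus an identity (background) motion:

```python
# Sketch of composing K sparse affine motions into one dense flow field.
# The softmax masks over K+1 motions (keypoints + background) are an
# assumption about how the aggregation is weighted.
import torch

def compose_dense_flow(sparse_flows, masks):
    """
    sparse_flows: (B, K+1, H, W, 2) per-keypoint flows plus an identity flow
    masks:        (B, K+1, H, W)    softmax weights over the K+1 motions
    returns:      (B, H, W, 2)      dense deformation field
    """
    return (sparse_flows * masks.unsqueeze(-1)).sum(dim=1)
```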
Generator
- Encoder–decoder with residual bottleneck
- Warps encoder features using the dense deformation field (sketched below)
- Applies the occlusion mask to filter unreliable regions
- Decoder inpaints the masked regions to produce coherent, realistic frames
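A minimal sketch of the warp-and-mask step, assuming the flow and feature maps share a spatial resolution (helper names are illustrative):

```python
# Sketch of the occlusion-aware generator step: warp encoder features with
# the dense flow, then gate them with the occlusion map so the decoder must
# inpaint wherever visibility is low.
import torch
import torch.nn.functional as F

def warp_and_mask(features, flow, occlusion):
    """
    features:  (B, C, H, W) source encoder features
    flow:      (B, H, W, 2) dense deformation field in [-1, 1] grid coords
    occlusion: (B, 1, H, W) visibility mask in [0, 1]
    """
    warped = F.grid_sample(features, flow, align_corners=True)
    return warped * occlusion  # decoder inpaints where occlusion ~ 0
```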
All datasets are configured via `config/*.yaml` files.
| Dataset | Directory | Frame Shape | Description |
|---|---|---|---|
| BAIR | `data/bair` | 256×256×3 | Robotic arm motion |
| Fashion | `data/fashion-png` | 256×256×3 | Clothing articulation |
| Moving GIF | `data/moving-gif` | 256×256×3 | Diverse synthetic motions |
| Vox / Taichi | (in config folder) | 256×256×3 | Human articulation / performance |
- The dataset is randomly split into training and test sets unless predefined split folders exist.
- During training, the `frames_dataset.py` loader samples two random frames from the same video (see the sketch below):
  - One as the source frame
  - One as the driving frame
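Conceptually, the pair sampling reduces to something like this; the snippet is illustrative only, and the real logic lives in `frames_dataset.py`:

```python
# Illustrative sketch of source/driving pair sampling from one video.
import random

def sample_frame_pair(video_frames):
    """video_frames: list of frames from a single video (at least two)."""
    src_idx, drv_idx = random.sample(range(len(video_frames)), 2)
    return video_frames[src_idx], video_frames[drv_idx]
```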
| File / Directory | Purpose |
|---|---|
| `train.py` | Main training loop orchestrator |
| `animate.py` | Inference script for generating animations |
| `frames_dataset.py` | Loads, pairs, and transforms video frames |
| `augmentation.py` | Data augmentations (flip, crop, color jitter) |
| `modules/*` | Core model components (KPDetector, DenseMotion, Generator, Discriminator, Loss) |
| `config/*.yaml` | Dataset-specific hyperparameter configs |
| `logger.py` | Handles checkpoints, TensorBoard logs, and visualization |
| `sync_batchnorm/*` | Custom synchronized BatchNorm for multi-GPU setups |
```bash
# 1. Clone the repository
git clone https://github.com/your-username/unsupervised-animation.git
cd unsupervised-animation

# 2. Install dependencies
pip install -r requirements.txt
```

Run the training script, specifying the dataset config file.
```bash
# Example training run on the 'fashion' dataset
python train.py --config config/fashion-256.yaml
```

Generate a video animation by transferring motion from a driving video to a source image.
```bash
python animate.py --config config/fashion-256.yaml \
    --source_image path/to/source.png \
    --driving_video path/to/driving.mp4 \
    --checkpoint path/to/model.pth.tar \
    --result_video path/to/output.mp4
```