
Commit 0bade2e

add SV4D 2.0

1 parent: 1659a1c


44 files changed: +1008 / -76 lines

.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -12,4 +12,5 @@
 /outputs
 /build
 /src
-/.vscode
+/.vscode
+**/__pycache__/
```

README.md

Lines changed: 26 additions & 0 deletions
@@ -5,6 +5,32 @@
## News

**April 3, 2025**
- We are releasing **[Stable Video 4D 2.0 (SV4D 2.0)](https://huggingface.co/stabilityai/sv4d2.0)**, an enhanced video-to-4D diffusion model for high-fidelity novel-view video synthesis and 4D asset generation. For research purposes:
  - **SV4D 2.0** was trained to generate 48 frames (12 video frames x 4 camera views) at 576x576 resolution, given a 12-frame input video of the same size, ideally consisting of white-background images of a moving object.
  - Compared to our previous 4D model [SV4D](https://huggingface.co/stabilityai/sv4d), **SV4D 2.0** can generate videos with higher fidelity, sharper details during motion, and better spatio-temporal consistency. It also generalizes much better to real-world videos. Moreover, it does not rely on reference multi-views of the first frame generated by SV3D, making it more robust to self-occlusions.
  - To generate longer novel-view videos, we autoregressively generate 12 frames at a time and use the previous generation as conditioning views for the remaining frames.
  - Please check our [project page](https://sv4d20.github.io), [arXiv paper](https://arxiv.org/pdf/2503.16396), and [video summary](https://www.youtube.com/watch?v=dtqj-s50ynU) for more details.
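The autoregressive scheme described above can be sketched as follows. This is an illustrative sketch only: `generate_chunk` is a hypothetical stand-in for one SV4D 2.0 sampling call, not the repository's actual API; only the 12-frame chunk size and the reuse of the previous chunk as conditioning come from the description above.

```python
CHUNK = 12  # frames generated per sampling call, per the model's 12-frame window

def generate_chunk(input_frames, cond_views):
    # Hypothetical placeholder: a real implementation would run the
    # diffusion sampler on `input_frames`, conditioned on `cond_views`.
    return [f"novel_view_of_{f}" for f in input_frames]

def generate_long_video(input_frames):
    """Generate novel views for a longer input video, 12 frames at a
    time, reusing the previous chunk's output as conditioning views."""
    outputs, cond_views = [], []
    for start in range(0, len(input_frames), CHUNK):
        chunk = input_frames[start:start + CHUNK]
        out = generate_chunk(chunk, cond_views)
        outputs.extend(out)
        cond_views = out  # previous generation conditions the next chunk
    return outputs
```

With a 21-frame input, this produces two sampling calls (12 frames, then the remaining 9), yielding one novel-view frame per input frame.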
**QUICKSTART**:
- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs/sv4d2`
- We also train an 8-view model that generates 5 frames x 8 views at a time (same as SV4D). For example, run `python scripts/sampling/simple_video_sample_4d2_8views.py --input_path assets/sv4d_videos/chest.gif --output_folder outputs/sv4d2_8views`

To run **SV4D 2.0** on a single input video of 21 frames:
- Download the SV4D 2.0 models (`sv4d2.safetensors` and `sv4d2_8views.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`
- Run `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
  - `input_path`: The input video `<path/to/video>` can be
    - a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/camel.gif`, or
    - a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or
    - a file name pattern matching images of video frames.
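The three accepted `input_path` forms could be resolved along these lines. This is a stdlib-only sketch under stated assumptions, not the script's actual loader; the helper name `resolve_input` is hypothetical.

```python
from pathlib import Path
import glob

VIDEO_EXTS = {".gif", ".mp4"}
FRAME_EXTS = {".jpg", ".jpeg", ".png"}

def resolve_input(input_path: str):
    """Return ('video', [path]) for a single video file, or
    ('frames', sorted image paths) for a folder or file-name pattern."""
    p = Path(input_path)
    if p.is_file() and p.suffix.lower() in VIDEO_EXTS:
        return "video", [p]
    if p.is_dir():
        # Folder of frames: collect supported image files in name order
        frames = sorted(f for f in p.iterdir() if f.suffix.lower() in FRAME_EXTS)
        return "frames", frames
    # Otherwise treat the string as a file-name pattern, e.g. "frames/*.png"
    return "frames", sorted(Path(m) for m in glob.glob(input_path))
```

Sorting by file name keeps frames in temporal order, assuming zero-padded frame numbering.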
  - `num_steps`: default is 50; can be decreased to shorten sampling time.
  - `elevations_deg`: specified elevations (relative to input view); default is 0.0 (same as input view).
  - **Background removal**: For input videos with a plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove the background and crop video frames by setting `--remove_bg=True`. To obtain higher-quality outputs on real-world input videos with noisy backgrounds, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D.
  - **Low VRAM environment**: To run on GPUs with low VRAM, try setting `--encoding_t=1` (number of frames encoded at a time) and `--decoding_t=1` (number of frames decoded at a time), or a lower video resolution like `--img_size=512`.

![tile](assets/sv4d2.gif)
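The reason lowering `encoding_t`/`decoding_t` reduces peak VRAM can be seen from a minimal batching sketch. The names below are illustrative stand-ins, not the repository's actual API; the point is only that peak activation memory scales with the batch size, not the total frame count.

```python
def encode_frames(frames, encoding_t=1):
    """Encode `frames` through a (stand-in) VAE, `encoding_t` at a time.

    Each loop iteration holds only `encoding_t` frames' worth of
    activations in memory, so smaller values trade speed for lower
    peak memory usage.
    """
    latents = []
    for start in range(0, len(frames), encoding_t):
        batch = frames[start:start + encoding_t]
        # Stand-in for vae.encode(batch); peak memory ~ len(batch)
        latents.extend(f"latent({f})" for f in batch)
    return latents
```

With `encoding_t=1`, each call touches a single frame; larger values are faster but need proportionally more VRAM.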

**July 24, 2024**
- We are releasing **[Stable Video 4D (SV4D)](https://huggingface.co/stabilityai/sv4d)**, a video-to-4D diffusion model for novel-view video synthesis. For research purposes:
  - **SV4D** was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.

Asset files:

- `assets/sv4d2.gif` (9.71 MB, added)
- `assets/sv4d_videos/bear.gif` (2.16 MB, added)
- `assets/sv4d_videos/bee.gif` (638 KB, added)
- `assets/sv4d_videos/bmx-bumps.gif` (2.23 MB, added)
- `assets/sv4d_videos/bunnyman.mp4` (47.1 KB, removed; binary file not shown)
- `assets/sv4d_videos/camel.gif` (1.93 MB, added)
- `assets/sv4d_videos/chameleon.gif` (1.4 MB, added)
- `assets/sv4d_videos/chest.gif` (2.2 MB, added)
