README.md: 21 additions, 6 deletions
@@ -5,20 +5,19 @@
## News
-**April 3, 2025**
+**April 4, 2025**
- We are releasing **[Stable Video 4D 2.0 (SV4D 2.0)](https://huggingface.co/stabilityai/sv4d2.0)**, an enhanced video-to-4D diffusion model for high-fidelity novel-view video synthesis and 4D asset generation. For research purposes:
- **SV4D 2.0** was trained to generate 48 frames (12 video frames x 4 camera views) at 576x576 resolution, given a 12-frame input video of the same size, ideally consisting of white-background images of a moving object.
- Compared to our previous 4D model [SV4D](https://huggingface.co/stabilityai/sv4d), **SV4D 2.0** can generate videos with higher fidelity, sharper details during motion, and better spatio-temporal consistency. It also generalizes much better to real-world videos. Moreover, it does not rely on reference multi-views of the first frame generated by SV3D, making it more robust to self-occlusions.
- To generate longer novel-view videos, we autoregressively generate 12 frames at a time and use the previous generation as conditioning views for the remaining frames (see the sketch below).
- Please check our [project page](https://sv4d20.github.io), [arxiv paper](https://arxiv.org/pdf/2503.16396) and [video summary](https://www.youtube.com/watch?v=dtqj-s50ynU) for more details.
-- We also train a 8-view model that generates 5 frames x 8 views at a time (same as SV4D). For example, run `python scripts/sampling/simple_video_sample_4d2_8views.py --input_path assets/sv4d_videos/chest.gif --output_folder outputs/sv4d2_8views`
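The autoregressive scheme described above can be pictured as a simple loop. The following is an illustrative sketch only, not code from this repository: `generate_novel_views` is a placeholder name standing in for one SV4D 2.0 sampling pass (12 input frames in, 12 frames x 4 views out).

```
# Illustrative sketch of the autoregressive generation loop described above.
# NOT the repository's API: generate_novel_views is a placeholder for one
# SV4D 2.0 sampling pass (12 input frames -> 12 frames x 4 views = 48 frames).

def generate_novel_views(frames, conditioning=None):
    """Stand-in for the actual SV4D 2.0 sampling call."""
    raise NotImplementedError

def generate_long_video(input_frames, chunk_size=12):
    """Process a long input clip 12 frames at a time, reusing previous output as conditioning."""
    outputs = []           # each entry holds 12 frames x 4 camera views
    conditioning = None    # previous generation, reused as conditioning views
    for start in range(0, len(input_frames), chunk_size):
        chunk = input_frames[start:start + chunk_size]
        views = generate_novel_views(chunk, conditioning=conditioning)
        outputs.append(views)
        conditioning = views  # condition the next chunk on what was just generated
    return outputs
```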
+**QUICKSTART** :
+- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs` (after downloading [sv4d2.safetensors](https://huggingface.co/stabilityai/sv4d2.0) from HuggingFace into `checkpoints/`)
To run **SV4D 2.0** on a single input video of 21 frames:
-- Download SV4D 2.0 models (`sv4d2.safetensors` and `sv4d2_8views.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`
-- Run `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
+- Download the SV4D 2.0 model (`sv4d2.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`: `huggingface-cli download stabilityai/sv4d2.0 sv4d2.safetensors --local-dir checkpoints`
+- Run inference: `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
- `input_path` : The input video `<path/to/video>` can be
- a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/camel.gif`, or
- a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format (see the conversion sketch below), or
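To prepare the folder-of-frames input mentioned above from an `mp4` or `gif`, a minimal conversion sketch could look like the following. It assumes the `imageio` package is available (reading `mp4` may additionally require `imageio-ffmpeg`); neither is a requirement stated in this README, and the output folder name is only an example.

```
# Minimal sketch: dump a video's frames into a folder of .png images, which is
# one of the accepted --input_path formats. Assumes imageio is installed
# (mp4 input may also need imageio-ffmpeg); the output folder name is illustrative.
import os
import imageio

def video_to_frame_folder(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    reader = imageio.get_reader(video_path)
    for i, frame in enumerate(reader):
        imageio.imwrite(os.path.join(out_dir, f"{i:05d}.png"), frame)
    reader.close()

video_to_frame_folder("assets/sv4d_videos/camel.gif", "inputs/camel_frames")
```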
@@ -28,6 +27,21 @@ To run **SV4D 2.0** on a single input video of 21 frames:
- **Background removal** : For input videos with a plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove the background and crop the video frames by setting `--remove_bg=True`. To obtain higher-quality outputs on real-world input videos with noisy backgrounds, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D (a preprocessing sketch follows this list).
- **Low VRAM environment** : To run on GPUs with low VRAM, try setting `--encoding_t=1` (number of frames encoded at a time) and `--decoding_t=1` (number of frames decoded at a time), or use a lower video resolution like `--img_size=512`.
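The background-removal preprocessing mentioned above can also be done offline, before calling the sampling script. Below is a rough sketch, assuming `rembg` and `Pillow` are installed and using illustrative folder names; the in-script `--remove_bg=True` path may behave differently.

```
# Rough sketch: strip the background from each frame with rembg and composite it
# onto white, matching the white-background inputs the model was trained on.
# Assumes rembg and Pillow are installed; folder names are illustrative.
import os
from PIL import Image
from rembg import remove

def remove_backgrounds(frame_dir, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(frame_dir)):
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        frame = Image.open(os.path.join(frame_dir, name)).convert("RGB")
        cutout = remove(frame)  # RGBA image with the background made transparent
        white = Image.new("RGBA", cutout.size, (255, 255, 255, 255))
        white.alpha_composite(cutout)
        white.convert("RGB").save(os.path.join(out_dir, name))

remove_backgrounds("inputs/camel_frames", "inputs/camel_frames_nobg")
```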
+Notes:
+- We also train an 8-view model that generates 5 frames x 8 views at a time (same as SV4D).
+- Download the model from huggingface: `huggingface-cli download stabilityai/sv4d2.0 sv4d2_8views.safetensors --local-dir checkpoints`
+- Run `python scripts/sampling/simple_video_sample_4d2_8views.py --input_path assets/sv4d_videos/chest.gif --output_folder outputs/sv4d2_8views`
+- The 5x8 model takes 5 frames of input at a time, but the inference scripts for both models take a 21-frame video as input by default (same as SV3D and SV4D); we run the model autoregressively until all 21 frames are generated.
+- Install dependencies before running:
+```
+python3.10 -m venv .generativemodels
+source .generativemodels/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # check CUDA version