
Commit 08c19d0

Author: Vikram Voleti (committed)
Combined sv4dv2 and sv4dv2_8views sampling scripts
1 parent 5f5aaf9 commit 08c19d0

File tree: 5 files changed (+245, -293 lines)


README.md

Lines changed: 21 additions & 6 deletions
@@ -5,20 +5,19 @@
 ## News
 
 
-**April 3, 2025**
+**April 4, 2025**
 - We are releasing **[Stable Video 4D 2.0 (SV4D 2.0)](https://huggingface.co/stabilityai/sv4d2.0)**, an enhanced video-to-4D diffusion model for high-fidelity novel-view video synthesis and 4D asset generation. For research purposes:
   - **SV4D 2.0** was trained to generate 48 frames (12 video frames x 4 camera views) at 576x576 resolution, given a 12-frame input video of the same size, ideally consisting of white-background images of a moving object.
   - Compared to our previous 4D model [SV4D](https://huggingface.co/stabilityai/sv4d), **SV4D 2.0** can generate videos with higher fidelity, sharper details during motion, and better spatio-temporal consistency. It also generalizes much better to real-world videos. Moreover, it does not rely on reference multi-view images of the first frame generated by SV3D, making it more robust to self-occlusions.
   - To generate longer novel-view videos, we autoregressively generate 12 frames at a time and use the previous generation as conditioning views for the remaining frames.
   - Please check our [project page](https://sv4d20.github.io), [arxiv paper](https://arxiv.org/pdf/2503.16396) and [video summary](https://www.youtube.com/watch?v=dtqj-s50ynU) for more details.
 
-**QUICKSTART** :
-- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs/sv4d2`
-- We also train an 8-view model that generates 5 frames x 8 views at a time (same as SV4D). For example, run `python scripts/sampling/simple_video_sample_4d2_8views.py --input_path assets/sv4d_videos/chest.gif --output_folder outputs/sv4d2_8views`
+**QUICKSTART** :
+- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs` (after downloading [sv4d2.safetensors](https://huggingface.co/stabilityai/sv4d2.0) from HuggingFace into `checkpoints/`)
 
 To run **SV4D 2.0** on a single input video of 21 frames:
-- Download the SV4D 2.0 models (`sv4d2.safetensors` and `sv4d2_8views.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`
-- Run `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
+- Download the SV4D 2.0 model (`sv4d2.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`: `huggingface-cli download stabilityai/sv4d2.0 sv4d2.safetensors --local-dir checkpoints`
+- Run inference: `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
 - `input_path` : The input video `<path/to/video>` can be
   - a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/camel.gif`, or
   - a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or
@@ -28,6 +27,21 @@ To run **SV4D 2.0** on a single input video of 21 frames:
 - **Background removal** : For input videos with a plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove the background and crop video frames by setting `--remove_bg=True`. To obtain higher-quality outputs on real-world input videos with a noisy background, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D.
 - **Low VRAM environment** : To run on GPUs with low VRAM, try setting `--encoding_t=1` (number of frames encoded at a time) and `--decoding_t=1` (number of frames decoded at a time), or a lower video resolution like `--img_size=512`.
 
+Notes:
+- We also train an 8-view model that generates 5 frames x 8 views at a time (same as SV4D).
+  - Download the model from HuggingFace: `huggingface-cli download stabilityai/sv4d2.0 sv4d2_8views.safetensors --local-dir checkpoints`
+  - Run inference: `python scripts/sampling/simple_video_sample_4d2.py --model_path checkpoints/sv4d2_8views.safetensors --input_path assets/sv4d_videos/chest.gif --output_folder outputs`
+  - The 5x8 model takes 5 frames of input at a time, but the inference script for both models takes a 21-frame video as input by default (same as SV3D and SV4D); we run the model autoregressively until all 21 frames are generated.
+- Install dependencies before running:
+```
+python3.10 -m venv .generativemodels
+source .generativemodels/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # check CUDA version
+pip3 install -r requirements/pt2.txt
+pip3 install .
+pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
+```
+
 ![tile](assets/sv4d2.gif)
 
 
@@ -190,6 +204,7 @@ This is assuming you have navigated to the `generative-models` root after clonin
 # install required packages from pypi
 python3 -m venv .pt2
 source .pt2/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 pip3 install -r requirements/pt2.txt
 ```

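The quickstart above depends on the autoregressive scheme the README describes: the model emits 12 frames per pass (5 for the 8-view variant) while the scripts consume a 21-frame video, reusing the previous generation as conditioning. The sketch below is purely illustrative and is not code from this commit; in particular, the one-frame overlap between passes is an assumption based on the README's wording.

```python
# Illustrative sketch, not code from this repository: cover a 21-frame input with
# autoregressive passes of t_gen frames each (12 for sv4d2, 5 for the 8-view model),
# overlapping by one frame so each pass can be conditioned on a previously
# generated frame. The overlap size is an assumption.
def autoregressive_windows(n_frames: int = 21, t_gen: int = 12) -> list[list[int]]:
    windows, start = [], 0
    while start < n_frames - 1:
        end = min(start + t_gen, n_frames)
        windows.append(list(range(start, end)))
        start = end - 1  # reuse the last generated frame as conditioning
    return windows

print(autoregressive_windows(21, 12))  # 2 passes: frames 0-11, then 11-20
print(autoregressive_windows(21, 5))   # 5 passes of 5 frames, each overlapping by one
```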
requirements/pt2.txt

Lines changed: 4 additions & 1 deletion
@@ -5,13 +5,16 @@ einops>=0.6.1
 fairscale>=0.4.13
 fire>=0.5.0
 fsspec>=2023.6.0
+imageio[ffmpeg]
+imageio[pyav]
 invisible-watermark>=0.2.0
 kornia==0.6.9
 matplotlib>=3.7.2
 natsort>=8.4.0
 ninja>=1.11.1
-numpy>=1.24.4
+numpy==2.1
 omegaconf>=2.3.0
+onnxruntime
 open-clip-torch>=2.20.0
 opencv-python==4.6.0.66
 pandas>=2.0.3

scripts/demo/sv4d_helpers.py

Lines changed: 112 additions & 56 deletions
@@ -13,9 +13,6 @@
 from omegaconf import ListConfig, OmegaConf
 from PIL import Image, ImageSequence
 from rembg import remove
-from torch import autocast
-from torchvision.transforms import ToTensor
-
 from scripts.util.detection.nsfw_and_watermark_dectection import DeepFloydDataFiltering
 from sgm.modules.autoencoding.temporal_ae import VideoDecoder
 from sgm.modules.diffusionmodules.guiders import (
@@ -34,6 +31,8 @@
     LinearMultistepSampler,
 )
 from sgm.util import default, instantiate_from_config
+from torch import autocast
+from torchvision.transforms import ToTensor
 
 
 def load_module_gpu(model):
@@ -166,14 +165,14 @@ def read_video(
 
 
 def preprocess_video(
-    input_path,
-    remove_bg=False,
-    n_frames=21,
-    W=576,
-    H=576,
-    output_folder=None,
+    input_path,
+    remove_bg=False,
+    n_frames=21,
+    W=576,
+    H=576,
+    output_folder=None,
     image_frame_ratio=0.917,
-    base_count=0
+    base_count=0,
 ):
     print(f"preprocess {input_path}")
     if output_folder is None:
@@ -208,7 +207,9 @@ def preprocess_video(
     images = [Image.open(img_path) for img_path in all_img_paths]
 
     if len(images) != n_frames:
-        raise ValueError(f"Input video contains {len(images)} frames, fewer than {n_frames} frames.")
+        raise ValueError(
+            f"Input video contains {len(images)} frames, fewer than {n_frames} frames."
+        )
 
     # Remove background
     for i, image in enumerate(images):
@@ -235,29 +236,36 @@ def preprocess_video(
         else:
             # assume the input image has white background
             ret, mask = cv2.threshold(
-                (np.array(image).mean(-1) <= white_thresh).astype(np.uint8) * 255, 0, 255, cv2.THRESH_BINARY
+                (np.array(image).mean(-1) <= white_thresh).astype(np.uint8) * 255,
+                0,
+                255,
+                cv2.THRESH_BINARY,
             )
-
+
         x, y, w, h = cv2.boundingRect(mask)
         box_coord[0] = min(box_coord[0], x)
         box_coord[1] = min(box_coord[1], y)
         box_coord[2] = max(box_coord[2], x + w)
         box_coord[3] = max(box_coord[3], y + h)
-    box_square = max(original_center[0] - box_coord[0], original_center[1] - box_coord[1])
+    box_square = max(
+        original_center[0] - box_coord[0], original_center[1] - box_coord[1]
+    )
     box_square = max(box_square, box_coord[2] - original_center[0])
     box_square = max(box_square, box_coord[3] - original_center[1])
-    x, y = max(0, original_center[0] - box_square), max(0, original_center[1] - box_square)
-    w, h = min(image_arr.shape[0], 2 * box_square), min(image_arr.shape[1], 2 * box_square)
+    x, y = max(0, original_center[0] - box_square), max(
+        0, original_center[1] - box_square
+    )
+    w, h = min(image_arr.shape[0], 2 * box_square), min(
+        image_arr.shape[1], 2 * box_square
+    )
     box_size = box_square * 2
 
     for image in images:
         if image.mode == "RGB":
             image = image.convert("RGBA")
         image_arr = np.array(image)
         side_len = (
-            int(box_size / image_frame_ratio)
-            if image_frame_ratio is not None
-            else in_w
+            int(box_size / image_frame_ratio) if image_frame_ratio is not None else in_w
         )
         padded_image = np.zeros((side_len, side_len, 4), dtype=np.uint8)
         center = side_len // 2
@@ -273,9 +281,9 @@ def preprocess_video(
         rgba_arr = np.array(rgba) / 255.0
         rgb = rgba_arr[..., :3] * rgba_arr[..., -1:] + (1 - rgba_arr[..., -1:])
         image = (rgb * 255).astype(np.uint8)
-
+
         images_v0.append(image)
-
+
     processed_file = os.path.join(output_folder, f"{base_count:06d}_process_input.mp4")
     imageio.mimwrite(processed_file, images_v0, fps=10)
     return processed_file
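For orientation, here is a hypothetical call to the `preprocess_video` helper reformatted above, as it might be invoked ahead of sampling. The signature, defaults, and return value follow the diff; the import path and the chosen argument values are illustrative assumptions, not code from this commit.

```python
# Hypothetical usage of the helper shown above; the import path is assumed.
from scripts.demo.sv4d_helpers import preprocess_video

processed_mp4 = preprocess_video(
    "assets/sv4d_videos/camel.gif",  # gif/mp4 file or a folder of frames
    remove_bg=True,       # optional rembg-based background removal
    n_frames=21,          # raises ValueError if the input is not 21 frames
    W=576,
    H=576,
    output_folder="outputs",
)
# With the default base_count=0, this writes outputs/000000_process_input.mp4
print(processed_mp4)
```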
@@ -393,13 +401,17 @@ def denoiser(input, sigma, c):
     return samples
 
 
-def decode_latents(model, samples_z, img_matrix, frame_indices, view_indices, timesteps):
+def decode_latents(
+    model, samples_z, img_matrix, frame_indices, view_indices, timesteps
+):
     load_module_gpu(model.first_stage_model)
     for t in frame_indices:
         for v in view_indices:
-            if True: # t != 0 and v != 0:
+            if True:  # t != 0 and v != 0:
                 if isinstance(model.first_stage_model.decoder, VideoDecoder):
-                    samples_x = model.decode_first_stage(samples_z[t, v][None], timesteps=timesteps)
+                    samples_x = model.decode_first_stage(
+                        samples_z[t, v][None], timesteps=timesteps
+                    )
                 else:
                     samples_x = model.decode_first_stage(samples_z[t, v][None])
                 samples = torch.clamp((samples_x + 1.0) / 2.0, min=0.0, max=1.0)
@@ -785,7 +797,19 @@ def run_img2vid(
     return samples
 
 
-def prepare_inputs_forward_backward(img_matrix, view_indices, frame_indices, v0, t0, t1, model, version_dict, seed, polars, azims):
+def prepare_inputs_forward_backward(
+    img_matrix,
+    view_indices,
+    frame_indices,
+    v0,
+    t0,
+    t1,
+    model,
+    version_dict,
+    seed,
+    polars,
+    azims,
+):
     # forward sampling
     forward_frame_indices = frame_indices.copy()
     image = img_matrix[t0][v0]
@@ -801,7 +825,7 @@ def prepare_inputs_forward_backward(img_matrix, view_indices, frame_indices, v0,
         cond_motion,
         cond_view,
     )
-
+
     # backward sampling
     backward_frame_indices = frame_indices[::-1].copy()
     image = img_matrix[t1][v0]
@@ -817,10 +841,25 @@ def prepare_inputs_forward_backward(img_matrix, view_indices, frame_indices, v0,
         cond_motion,
         cond_view,
     )
-    return forward_inputs, forward_frame_indices, backward_inputs, backward_frame_indices
+    return (
+        forward_inputs,
+        forward_frame_indices,
+        backward_inputs,
+        backward_frame_indices,
+    )
 
 
-def prepare_inputs(frame_indices, img_matrix, v0, view_indices, model, version_dict, seed, polars, azims):
+def prepare_inputs(
+    frame_indices,
+    img_matrix,
+    v0,
+    view_indices,
+    model,
+    version_dict,
+    seed,
+    polars,
+    azims,
+):
     load_module_gpu(model.conditioner)
     # forward sampling
     forward_frame_indices = frame_indices.copy()
@@ -829,35 +868,40 @@ def prepare_inputs(frame_indices, img_matrix, v0, view_indices, model, version_d
     cond_motion = torch.cat([img_matrix[t][v0] for t in forward_frame_indices], 0)
     cond_view = torch.cat([img_matrix[t0][v] for v in view_indices], 0)
     forward_inputs = prepare_sampling(
-        version_dict,
-        model,
-        image,
-        seed,
-        polars,
-        azims,
-        cond_motion,
-        cond_view,
-    )
-
+        version_dict,
+        model,
+        image,
+        seed,
+        polars,
+        azims,
+        cond_motion,
+        cond_view,
+    )
+
     # backward sampling
     backward_frame_indices = frame_indices[::-1].copy()
     t0 = backward_frame_indices[0]
     image = img_matrix[t0][v0]
     cond_motion = torch.cat([img_matrix[t][v0] for t in backward_frame_indices], 0)
     cond_view = torch.cat([img_matrix[t0][v] for v in view_indices], 0)
     backward_inputs = prepare_sampling(
-        version_dict,
-        model,
-        image,
-        seed,
-        polars,
-        azims,
-        cond_motion,
-        cond_view,
-    )
+        version_dict,
+        model,
+        image,
+        seed,
+        polars,
+        azims,
+        cond_motion,
+        cond_view,
+    )
 
     unload_module_gpu(model.conditioner)
-    return forward_inputs, forward_frame_indices, backward_inputs, backward_frame_indices
+    return (
+        forward_inputs,
+        forward_frame_indices,
+        backward_inputs,
+        backward_frame_indices,
+    )
 
 
 def do_sample(
@@ -913,7 +957,7 @@ def do_sample(
                             lambda y: y[k][: math.prod(num_samples)].to("cuda"), (c, uc)
                         )
 
-                if value_dict['image_only_indicator'] == 0:
+                if value_dict["image_only_indicator"] == 0:
                     c["cond_view"] *= 0
                     uc["cond_view"] *= 0
 
@@ -932,9 +976,12 @@
                                 SpatiotemporalPredictionGuider,
                             ),
                         ):
-                            additional_model_inputs[k] = torch.zeros(
-                                num_samples[0] * 2, num_samples[1]
-                            ).to("cuda") + value_dict['image_only_indicator']
+                            additional_model_inputs[k] = (
+                                torch.zeros(num_samples[0] * 2, num_samples[1]).to(
+                                    "cuda"
+                                )
+                                + value_dict["image_only_indicator"]
+                            )
                         else:
                             additional_model_inputs[k] = torch.zeros(num_samples).to(
                                 "cuda"
@@ -949,6 +996,7 @@ def denoiser(input, sigma, c):
                     return model.denoiser(
                         model.model, input, sigma, c, **additional_model_inputs
                     )
+
                 load_module_gpu(model.model)
                 load_module_gpu(model.denoiser)
                 samples_z = sampler(denoiser, randn, cond=c, uc=uc)
@@ -1034,9 +1082,12 @@ def prepare_sampling_(
                                 SpatiotemporalPredictionGuider,
                             ),
                         ):
-                            additional_model_inputs[k] = torch.zeros(
-                                num_samples[0] * 2, num_samples[1]
-                            ).to("cuda") + value_dict['image_only_indicator']
+                            additional_model_inputs[k] = (
+                                torch.zeros(num_samples[0] * 2, num_samples[1]).to(
+                                    "cuda"
+                                )
+                                + value_dict["image_only_indicator"]
+                            )
                         else:
                             additional_model_inputs[k] = torch.zeros(num_samples).to(
                                 "cuda"
@@ -1047,7 +1098,9 @@ def prepare_sampling_(
     return c, uc, additional_model_inputs
 
 
-def do_sample_per_step(model, sampler, noisy_latents, c, uc, step, additional_model_inputs):
+def do_sample_per_step(
+    model, sampler, noisy_latents, c, uc, step, additional_model_inputs
+):
     precision_scope = autocast
     with torch.no_grad():
         with precision_scope("cuda"):
@@ -1337,6 +1390,7 @@ def load_model(
     num_frames: int,
     num_steps: int,
     verbose: bool = False,
+    ckpt_path: str = None,
 ):
     config = OmegaConf.load(config)
     if device == "cuda":
@@ -1349,6 +1403,8 @@ def load_model(
     config.model.params.sampler_config.params.guider_config.params.num_frames = (
         num_frames
     )
+    if ckpt_path is not None:
+        config.model.params.ckpt_path = ckpt_path
     if device == "cuda":
         with torch.device(device):
             model = instantiate_from_config(config.model).to(device).eval()

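The last two hunks give `load_model` an optional `ckpt_path` override, which is what lets a single sampling script serve both checkpoints. Below is a hedged sketch of how that override might be exercised; the config file name, the frame and step counts, and the assumption that `load_model` returns the instantiated model are illustrative, not taken from this commit.

```python
# Hedged sketch only: exercise the new ckpt_path override from the diff above.
# The import path, config name, num_frames/num_steps values, and the return-value
# assumption are all illustrative.
from scripts.demo.sv4d_helpers import load_model  # import path assumed

model = load_model(
    config="scripts/sampling/configs/sv4d2.yaml",  # assumed config name
    device="cuda",
    num_frames=48,                                 # 12 video frames x 4 views (sv4d2)
    num_steps=50,                                  # illustrative step count
    verbose=False,
    ckpt_path="checkpoints/sv4d2.safetensors",     # or checkpoints/sv4d2_8views.safetensors
)
```

Presumably the `--model_path checkpoints/sv4d2_8views.safetensors` flag from the README's quickstart is forwarded to this `ckpt_path` argument (with the matching 5 frames x 8 views settings) inside `simple_video_sample_4d2.py`.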