Commit a9aa3e1

text to video synthesis bug fix (mindspore-lab#985)
Co-authored-by: jijiarong <jijiarong@huawei.com>
1 parent 49b439d commit a9aa3e1

File tree: 1 file changed (+51 −0 lines changed)


docs/diffusers/api/pipelines/text_to_video_zero.md

Lines changed: 51 additions & 0 deletions
@@ -91,6 +91,57 @@ imageio.mimsave("video.mp4", result, fps=4)
```

### Text-To-Video with Pose Control

To generate a video from a prompt with additional pose control:

1. Download a demo video:

```python
from huggingface_hub import hf_hub_download

filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
```

2. Read the video containing the extracted pose images:

```python
from PIL import Image
import imageio

reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 8
pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
```
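
The snippet above simply takes the first eight frames of the clip. For a longer video it can be preferable to sample frames evenly across the whole clip instead; a small pure-Python sketch (the helper name `sample_indices` is illustrative, not part of the library):

```python
def sample_indices(total_frames: int, frame_count: int) -> list[int]:
    """Return evenly spaced frame indices covering the whole clip."""
    if frame_count >= total_frames:
        return list(range(total_frames))
    step = total_frames / frame_count
    return [int(i * step) for i in range(frame_count)]

print(sample_indices(64, 8))  # [0, 8, 16, 24, 32, 40, 48, 56]
```

The resulting indices can be passed to `reader.get_data(i)` in place of `range(frame_count)`.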

To extract poses from an actual video, read the [ControlNet documentation](controlnet).

3. Run `StableDiffusionControlNetPipeline` with our custom attention processor:

```python
import mindspore as ms
from mindone.diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from mindone.diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", mindspore_dtype=ms.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id, controlnet=controlnet, mindspore_dtype=ms.float16
)

# Set the cross-frame attention processor on both the UNet and the ControlNet
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

# Fix the latents for all frames, so every frame is denoised from the same noise
latents = ms.ops.randn((1, 4, 64, 64), dtype=ms.float16).tile((len(pose_images), 1, 1, 1))

prompt = "Darth Vader dancing in a desert"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)
```
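
The tiled latents are what keep the output temporally coherent: every frame starts from the same initial noise, and only the pose conditioning varies per frame. A toy sketch of the same shape logic, with NumPy standing in for the MindSpore tensor op (shapes are illustrative):

```python
import numpy as np

# One latent of shape (1, C, H, W), repeated along the batch axis for 8 frames
latent = np.random.randn(1, 4, 64, 64).astype(np.float16)
latents = np.tile(latent, (8, 1, 1, 1))

print(latents.shape)                            # (8, 4, 64, 64)
print(np.array_equal(latents[0], latents[7]))   # True: all frames share the noise
```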
### Text-To-Video with Edge Control
To generate a video from a prompt with additional Canny edge control, follow the same steps described above for pose-guided generation, using the [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).
