
Question about cross-chunk latent stitching #81

@Binn37

Description

def generate_avc(...):
    ......
    # 5. Prepare latent variables
    video = self.video_processor.preprocess_video(video, height=height, width=width, resize_mode=resize_mode)
    video = video.to(device=device, dtype=prompt_embeds.dtype)
    # Take the last `num_cond_frames` frames of the decoded video as the conditioning context
    cond_videos = video[:, :, -num_cond_frames:]
    # Re-encode the context frames through the VAE to obtain the context latents
    cond_videos_latents = retrieve_latents(self.vae.encode(cond_videos), generator, sample_mode="argmax")
    cond_videos_latents = self.normalize_latents(cond_videos_latents)
    ......

Narration in the paper:
However, during the generation of each continuation, obtaining the context latents requires first decoding the previously
predicted chunk via the VAE and then re-encoding the selected context frames. This repetitive VAE decode-encode
cycle introduces two major issues. First, the repeated encoding and decoding can result in information loss and error
accumulation within each chunk. Second, it significantly reduces inference efficiency.
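The effect described in the quoted passage can be illustrated with a toy numerical sketch (purely illustrative, not the repo's actual code): here each VAE decode–encode round trip is modeled as a slight attenuation of the latent, so chaining round trips across continuation chunks accumulates error, whereas stitching the previous chunk's latents directly keeps the context lossless.

```python
def vae_roundtrip(z):
    # Simulated lossy VAE decode -> re-encode: each pass attenuates the
    # signal slightly (a stand-in for reconstruction error).
    return z * 0.99

z = 1.0  # context latent produced for the first chunk

# Scheme (b): cross-chunk latent stitching -- reuse latents directly.
z_stitched = z

# Scheme (a): decode and re-encode the context once per continuation chunk.
z_roundtrip = z
for _ in range(10):  # ten continuation chunks
    z_roundtrip = vae_roundtrip(z_roundtrip)

print(z_stitched)             # 1.0
print(round(z_roundtrip, 4))  # 0.9044 -- error has accumulated
```

The point of the sketch is only that the round-trip error compounds with the number of chunks, while the stitched latent is bit-identical to the original, which is the efficiency and fidelity argument made in the paper.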

My question: for multi-chunk video generation, does this code implement the old scheme (i.e., Figure 7 (a) in the technical report), re-encoding the motion frames of the previous chunk before feeding them back in? Is there a mistake here? Could you explain? Thanks!
