
Question about cross-chunk latent stitching #81

@Binn37

Description

def generate_avc(...):
    ......
    # 5. Prepare latent variables
    video = self.video_processor.preprocess_video(video, height=height, width=width, resize_mode=resize_mode)
    video = video.to(device=device, dtype=prompt_embeds.dtype)
    # Take the last `num_cond_frames` frames of the decoded video as the conditioning context
    cond_videos = video[:, :, -num_cond_frames:]
    # Re-encode the context frames through the VAE to obtain the context latents
    cond_videos_latents = retrieve_latents(self.vae.encode(cond_videos), generator, sample_mode="argmax")
    cond_videos_latents = self.normalize_latents(cond_videos_latents)
    ......

Narration in the paper:
However, during the generation of each continuation, obtaining the context latents requires first decoding the previously
predicted chunk via the VAE and then re-encoding the selected context frames. This repetitive VAE decode-encode
cycle introduces two major issues. First, the repeated encoding and decoding can result in information loss and error
accumulation within each chunk. Second, it significantly reduces inference efficiency.
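The effect described in the quoted passage can be illustrated with a toy numerical sketch (purely illustrative, not the repo's actual code): here each VAE decode–encode round trip is modeled as a slight attenuation of the latent, so chaining round trips across continuation chunks accumulates error, whereas stitching the previous chunk's latents directly keeps the context lossless.

```python
def vae_roundtrip(z):
    # Simulated lossy VAE decode -> re-encode: each pass attenuates the
    # signal slightly (a stand-in for reconstruction error).
    return z * 0.99

z = 1.0  # context latent produced for the first chunk

# Scheme (b): cross-chunk latent stitching -- reuse latents directly.
z_stitched = z

# Scheme (a): decode and re-encode the context once per continuation chunk.
z_roundtrip = z
for _ in range(10):  # ten continuation chunks
    z_roundtrip = vae_roundtrip(z_roundtrip)

print(z_stitched)             # 1.0
print(round(z_roundtrip, 4))  # 0.9044 -- error has accumulated
```

The point of the sketch is only that the round-trip error compounds with the number of chunks, while the stitched latent is bit-identical to the original, which is the efficiency and fidelity argument made in the paper.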

My question: for multi-chunk video generation, does this code implement the old scheme (i.e., Figure 7 (a) in the technical report), re-encoding the motion frames of the previous chunk before feeding them back in? Is there a mistake here? Could you explain? Thanks!
