```python
def generate_avc(...):
    ...
    # 5. Prepare latent variables
    video = self.video_processor.preprocess_video(video, height=height, width=width, resize_mode=resize_mode)
    video = video.to(device=device, dtype=prompt_embeds.dtype)
    # Take the last `num_cond_frames` frames as conditioning context
    cond_videos = video[:, :, -num_cond_frames:]
    # VAE-encode the context frames and normalize the resulting latents
    cond_videos_latents = retrieve_latents(self.vae.encode(cond_videos), generator, sample_mode="argmax")
    cond_videos_latents = self.normalize_latents(cond_videos_latents)
    ...
```
Narration in the paper:
However, during the generation of each continuation, obtaining the context latents requires first decoding the previously
predicted chunk via the VAE and then re-encoding the selected context frames. This repetitive VAE decode-encode
cycle introduces two major issues. First, the repeated encoding and decoding can result in information loss and error
accumulation within each chunk. Second, it significantly reduces inference efficiency.
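To make the paper's point concrete, here is a toy numerical sketch of the two strategies. The functions below are illustrative stand-ins I made up, not the repository's API: a lossy VAE decode-then-re-encode round-trip is modeled as a 1% reconstruction loss, so repeating it once per chunk compounds the error, while passing the predicted latents through directly does not.

```python
def vae_decode_encode(latents):
    """Model one lossy VAE round-trip (decode, then re-encode)."""
    return [v * 0.99 for v in latents]  # assume ~1% reconstruction loss per cycle

def context_via_pixels(latents, num_chunks):
    """Old scheme: decode the previous chunk, re-encode the context frames."""
    for _ in range(num_chunks):
        latents = vae_decode_encode(latents)  # one lossy round-trip per chunk
    return latents

def context_via_latents(latents, num_chunks):
    """Latent-space scheme: reuse the predicted latents directly."""
    return list(latents)  # no VAE round-trip, no accumulated loss

start = [1.0, 1.0]
print(context_via_pixels(start, 10))   # error compounds: each value shrinks to ~0.904
print(context_via_latents(start, 10))  # unchanged: [1.0, 1.0]
```

This is only meant to illustrate why the quoted snippet, which calls `self.vae.encode` on frames sliced from the previous chunk, matches the decode-encode scheme rather than a pure latent-space handoff.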
The question is: for multi-chunk video generation, does the code implement the old scheme (i.e., Figure 7(a) in the technical report), or does it re-encode the motion frames of the previous chunk before feeding them back in? Is there a mistake here? Could you please explain? Thanks!