Edit the subject of a talking portrait video given an edit prompt while maintaining high fidelity (using Diffusion Models)
Source: https://ustc3dv.github.io/CosAvatar/
Key challenges:
- Maintaining high fidelity to the source footage
- Preserving lip movements, gestures, and other expressions
- Preserving the background, lighting, etc.
- Avoiding deformed eyes, faces, and fingers (a known failure mode of Stable Diffusion v1.5)
- Temporal consistency of edits across the video
- Maintaining the original resolution / picture quality
There are two groups of approaches found in the literature:
- Perform frame-level edits using Text2Image (T2I) models
  - Eg. Prompt2Prompt, InstructPix2Pix, ControlNet, etc.
  - Stitch the edited frames back into a video while maintaining temporal consistency (a minimal sketch of this recipe follows the list below)
  - With fine-tuning
    - Eg. Tune-A-Video, Text2LIVE, CCEdit, etc.
  - Without fine-tuning (zero-shot)
    - Eg. Pix2Video, TokenFlow, Text2Video-Zero, ControlVideo, etc.
- Edit and re-render in 3D (with per-video fine-tuning)
  - Perform face edits using Text2Image models (like InstructPix2Pix)
  - Render the edits in 3D using a dynamic NeRF
  - Advantages: better temporal and 3D consistency
  - Eg. CosAvatar, AvatarStudio, etc.
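
As a concrete illustration of the first group's recipe, the sketch below edits every frame independently with InstructPix2Pix and stitches the results into a video. This is a minimal sketch, assuming the diffusers library, the public timbrooks/instruct-pix2pix checkpoint, and hypothetical file paths and prompt; it is not the exact pipeline of any of the papers above.

```python
# Naive frame-level editing: edit each frame with InstructPix2Pix, then
# stitch the frames back into a video. Paths, prompt, and the fixed
# per-frame seed are illustrative assumptions.
import imageio.v2 as imageio
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

reader = imageio.get_reader("portrait.mp4")  # hypothetical input video
writer = imageio.get_writer("edited.mp4", fps=reader.get_meta_data()["fps"])
for frame in reader:
    edited = pipe(
        "turn the subject into a marble statue",  # example edit prompt
        image=Image.fromarray(frame),
        num_inference_steps=20,
        image_guidance_scale=1.5,
        generator=torch.Generator("cuda").manual_seed(0),  # same seed per frame
    ).images[0]
    writer.append_data(np.asarray(edited))
writer.close()
```

Even with a fixed seed, textures and fine details drift from frame to frame; that flicker is exactly what the cross-frame attention and TokenFlow directions below try to fix.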
The approach taken in this work is based on the Text2Video-Zero framework:
- Edit frames using a T2I model - InstructPix2Pix or ControlNet
- Use cross-frame attention to enforce appearance consistency across video frames (see the first sketch after this list)
- For the ControlNet model, explore (combined in the second sketch below):
  - Face landmark / pose conditioning to preserve lip movements and facial expressions
  - RealisticVision Stable Diffusion checkpoints to generate realistic portraits
  - Portrait masking to preserve the background
- For InstructPix2Pix
  - Produces the perceptually best edits, but fails to maintain a coherent appearance across frames
  - Explore better video-diffusion / frame-interaction approaches, such as TokenFlow, to improve temporal appearance consistency
- General
  - Stable Diffusion 1.5 ruins faces when the face is small relative to the full frame - explore separate face and body generation (see the third sketch below)
  - Explore fine-tuning / adaptation with video-specific training for best results
  - Explore 3D-rendering-based approaches like CosAvatar
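
The cross-frame attention mentioned above is the core consistency mechanism of Text2Video-Zero: inside the denoising U-Net's self-attention layers, every frame's queries attend to the keys and values of an anchor frame rather than its own. The sketch below is a minimal PyTorch illustration assuming batched per-frame attention tensors and the first frame as anchor; the paper's exact anchoring scheme differs in details.

```python
# Sparse cross-frame attention: each frame's queries attend to the
# keys/values of an anchor frame (frame 0 here), tying appearance
# across frames. Tensor layout is an illustrative assumption.
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, heads, tokens, dim) per-frame self-attention tensors."""
    frames = q.shape[0]
    # Replace every frame's keys/values with the anchor frame's, so all
    # frames attend to the same appearance features.
    k_anchor = k[:1].expand(frames, -1, -1, -1)
    v_anchor = v[:1].expand(frames, -1, -1, -1)
    return F.scaled_dot_product_attention(q, k_anchor, v_anchor)
```

In practice this function would be wired into the U-Net's self-attention layers (e.g., via a custom attention processor) so the substitution happens at every denoising step.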
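For the ControlNet variant, the sketch below combines the three explorations listed above: landmark/pose conditioning, a RealisticVision base checkpoint, and portrait masking to restore the original background. The model ids, step count, and the external sources of the pose map and portrait mask (any landmark detector and person-segmentation model) are assumptions for illustration, not a tested configuration.

```python
# Per-frame ControlNet editing with a RealisticVision base model, pose/landmark
# conditioning, and background restoration via a portrait mask.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",  # assumed realistic-portrait SD-1.5 checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def edit_frame(frame, pose_map, mask, prompt, seed=0):
    """frame, pose_map: PIL images of equal size; mask: HxW float array, 1 = portrait."""
    edited = pipe(
        prompt,
        image=pose_map,  # face landmarks / pose rendered as the control image
        num_inference_steps=20,
        generator=torch.Generator("cuda").manual_seed(seed),  # fixed seed per frame
    ).images[0]
    # Portrait masking: keep the edited subject, restore the original background.
    m = mask[..., None]
    out = m * np.asarray(edited, float) + (1 - m) * np.asarray(frame, float)
    return Image.fromarray(out.astype(np.uint8))
```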
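Finally, for the small-face failure mode of Stable Diffusion 1.5, one workaround is a separate face pass: crop the face, edit it near the resolution the model was trained at, and paste it back. A minimal sketch, assuming the face box comes from an off-the-shelf detector and using a hard paste (feathered blending would look better):

```python
# Separate face pass: upscale the face crop to SD-native resolution,
# run the diffusion edit on it, and paste the result back into the frame.
from PIL import Image

def edit_face_separately(frame, face_box, edit_fn, size=512):
    """face_box: (left, top, right, bottom); edit_fn: PIL image -> PIL image."""
    face = frame.crop(face_box).resize((size, size))  # upscale the small face
    edited = edit_fn(face)                            # run the diffusion edit
    w, h = face_box[2] - face_box[0], face_box[3] - face_box[1]
    frame.paste(edited.resize((w, h)), face_box[:2])  # paste back at original size
    return frame
```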
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., & Dekel, T. (2022). Text2LIVE: Text-Driven Layered Image and Video Editing. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 (pp. 707–723). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-19784-0_41
Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning to Follow Image Editing Instructions (arXiv:2211.09800). arXiv. https://doi.org/10.48550/arXiv.2211.09800
Ceylan, D., Huang, C.-H. P., & Mitra, N. J. (2023). Pix2Video: Video Editing using Image Diffusion. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 23206–23217. https://openaccess.thecvf.com/content/ICCV2023/html/Ceylan_Pix2Video_Video_Editing_using_Image_Diffusion_ICCV_2023_paper.html
Chu, E., Huang, T., Lin, S.-Y., & Chen, J.-C. (2023). MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance (arXiv:2308.10079). arXiv. https://doi.org/10.48550/arXiv.2308.10079
Feng, R., Weng, W., Wang, Y., Yuan, Y., Bao, J., Luo, C., Chen, Z., & Guo, B. (2023). CCEdit: Creative and Controllable Video Editing via Diffusion Models (arXiv:2309.16496). arXiv. https://doi.org/10.48550/arXiv.2309.16496
Geyer, M., Bar-Tal, O., Bagon, S., & Dekel, T. (2023). TokenFlow: Consistent Diffusion Features for Consistent Video Editing (arXiv:2307.10373). arXiv. https://doi.org/10.48550/arXiv.2307.10373
Haque, A., Tancik, M., Efros, A. A., Holynski, A., & Kanazawa, A. (2023). Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (arXiv:2303.12789). arXiv. https://doi.org/10.48550/arXiv.2303.12789
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Prompt-to-Prompt Image Editing with Cross Attention Control (arXiv:2208.01626). arXiv. https://doi.org/10.48550/arXiv.2208.01626
Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (arXiv:2303.13439). arXiv. https://doi.org/10.48550/arXiv.2303.13439
Liu, S., Zhang, Y., Li, W., Lin, Z., & Jia, J. (2023). Video-P2P: Video Editing with Cross-attention Control (arXiv:2303.04761). arXiv. https://doi.org/10.48550/arXiv.2303.04761
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., & Ermon, S. (2022). SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations (arXiv:2108.01073). arXiv. https://doi.org/10.48550/arXiv.2108.01073
Mendiratta, M., Pan, X., Elgharib, M., Teotia, K., R, M. B., Tewari, A., Golyanik, V., Kortylewski, A., & Theobalt, C. (2023). AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars (arXiv:2306.00547). arXiv. https://doi.org/10.48550/arXiv.2306.00547
Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., & Chen, Q. (2023). FateZero: Fusing Attentions for Zero-shot Text-based Video Editing (arXiv:2303.09535). arXiv. https://doi.org/10.48550/arXiv.2303.09535
Shi, F., Gu, J., Xu, H., Xu, S., Zhang, W., & Wang, L. (2023). BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models (arXiv:2312.02813). arXiv. https://doi.org/10.48550/arXiv.2312.02813
Wu, J. Z., Ge, Y., Wang, X., Lei, S. W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., & Shou, M. Z. (2023). Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 7623–7633. https://openaccess.thecvf.com/content/ICCV2023/html/Wu_Tune-A-Video_One-Shot_Tuning_of_Image_Diffusion_Models_for_Text-to-Video_Generation_ICCV_2023_paper.html
Wu, J. Z., Li, X., Gao, D., Dong, Z., Bai, J., Singh, A., Xiang, X., Li, Y., Huang, Z., Sun, Y., He, R., Hu, F., Hu, J., Huang, H., Zhu, H., Cheng, X., Tang, J., Shou, M. Z., Keutzer, K., & Iandola, F. (2023). CVPR 2023 Text Guided Video Editing Competition (arXiv:2310.16003). arXiv. https://doi.org/10.48550/arXiv.2310.16003
Xiao, H., Zhong, C., Gao, X., Guo, Y., & Zhang, J. (2023). CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt (arXiv:2311.18288). arXiv. https://doi.org/10.48550/arXiv.2311.18288
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3836–3847. https://openaccess.thecvf.com/content/ICCV2023/html/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html
Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., & Tian, Q. (2023). ControlVideo: Training-free Controllable Text-to-Video Generation (arXiv:2305.13077). arXiv. https://doi.org/10.48550/arXiv.2305.13077