| title | venue | paper | code | dataset | keywords |
|---|---|---|---|---|---|
| EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model | SIGGRAPH(22) | paper | - | - | emotion |
| Expressive Talking Head Generation with Granular Audio-Visual Control | CVPR(22) | paper | - | - | - |
| Deep Learning for Visual Speech Analysis: A Survey | - | paper | - | - | survey |
| StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | - | paper | code | - | stylegan |
| Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation | - | paper | code(coming soon) | - | NeRF |
| Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation | - | paper | - | - | - |
| One-shot talking face generation from single-speaker audio-visual correlation learning | AAAI(22) | paper | code | - | - |
| SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory | AAAI(22) | paper(temp) | - | LRW, LRS2, BBC News | - |
| DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering | - | paper | - | - | NeRF |
| Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos | - | paper | - | - | - |
| Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions | - | paper | - | - | - |
| DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation | - | paper | - | - | - |
| Talking Head Generation Driven by Speech-Related Facial Action Units and Audio-Based on Multimodal Representation Fusion | - | paper | - | - | - |

| title | venue | paper | code | dataset |
|---|---|---|---|---|
| Parallel and High-Fidelity Text-to-Lip Generation | paper | |||
| [Survey] Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis | - | paper | - | - |
| FaceFormer: Speech-Driven 3D Facial Animation with Transformers | CVPR(22) | paper | code | - |
| Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices | - | paper | code | - |
| FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning | ICCV(21) | paper | code | - |
| Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis | - | paper | code | - |
| Audio-Driven Emotional Video Portraits | CVPR(21) | paper | code | MEAD, LRW |
| LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization | CVPR(21) | paper | - | - |
| Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | CVPR(21) | paper | code | VoxCeleb2, LRW |
| Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset | CVPR(21) | paper | code | HDTF |
| MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement | ICCV(21) | paper | code(coming soon) | - |
| AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | ICCV(21) | paper | code | - |
| Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation | AAAI(21) | paper | code(coming soon) | Mocap dataset |
| Visual Speech Enhancement Without A Real Visual Stream | - | paper | - | - |
| Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary | - | paper | code | - |
| Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion | IJCAI(21) | paper | code | VoxCeleb, GRID, LRW |
| 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head | - | paper | - | - |
| AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person | - | paper | - | VoxCeleb2, Obama |

| title | venue | paper | code | dataset |
|---|---|---|---|---|
| [Survey] What comprises a good talking-head video generation?: A survey and benchmark | - | paper | code | - |
| One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing | CVPR(21) | paper | code | - |
| Speech Driven Talking Face Generation from a Single Image and an Emotion Condition | - | paper | code | CREMA-D |
| A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild | ACMMM(20) | paper | code | LRS2 |
| Talking-head Generation with Rhythmic Head Motion | ECCV(20) | paper | code | CREMA, GRID, VoxCeleb, LRS3 |
| MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation | ECCV(20) | paper | code | VoxCeleb2, AffectNet |
| Neural Voice Puppetry: Audio-driven Facial Reenactment | ECCV(20) | paper | - | - |
| Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars | ECCV(20) | paper | code | - |
| HeadGAN: Video-and-Audio-Driven Talking Head Synthesis | - | paper | - | VoxCeleb2 |
| MakeItTalk: Speaker-Aware Talking Head Animation | - | paper | code, code | VoxCeleb2, VCTK |
| Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose | - | paper | code | ImageNet, FaceWarehouse, LRW |
| Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks | - | paper | - | - |
| Speech-Driven Facial Animation Using Polynomial Fusion of Features | - | paper | - | LRW |
| Animating Face using Disentangled Audio Representations | WACV | paper | - | - |
| Everybody’s Talkin’: Let Me Talk as You Want | - | paper | - | - |
| Multimodal Inputs Driven Talking Face Generation With Spatial-Temporal Dependency | - | paper | - | - |

| title | venue | paper | code | dataset |
|---|---|---|---|---|
| Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss | CVPR(19) | paper | code | VGG Face, LRW |

- PSNR (peak signal-to-noise ratio)
- SSIM (structural similarity index measure)
- LMD (landmark distance error)
- LRA (lip-reading accuracy)
- FID (Fréchet inception distance)
- LSE-D (Lip Sync Error - Distance)
- LSE-C (Lip Sync Error - Confidence)
- LPIPS (Learned Perceptual Image Patch Similarity)
- NIQE (Natural Image Quality Evaluator)
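Two of the simpler metrics above, PSNR and LMD, can be computed directly from frames and landmark sets. A minimal NumPy sketch (the function names `psnr` and `landmark_distance` are ours for illustration, not from any paper's released code):

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def landmark_distance(ref_lms: np.ndarray, gen_lms: np.ndarray) -> float:
    """LMD: mean Euclidean distance between corresponding (N, 2) landmark arrays."""
    return float(np.mean(np.linalg.norm(ref_lms - gen_lms, axis=-1)))
```

Note that papers differ in preprocessing (e.g. whether LMD is computed on mouth landmarks only, and whether landmarks are normalized by face size), so reported numbers are only comparable within a single evaluation protocol.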