🤖✨ Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen1,2 ·
Fei Wang1,2 ·
Zhihao Huang5,6 ·
Qing Zhou8 ·
Kun Li7
Dan Guo1 ·
Linfeng Zhang4 ·
Xun Yang3
1 Hefei University of Technology ·
2 IAI, Hefei Comprehensive National Science Center
3 USTC ·
4 SJTU ·
5 TeleAI, China Telecom ·
6 Northwestern Polytechnical University
7 United Arab Emirates University ·
8 Anhui Polytechnic University
- 🧠 Causal turn-level formulation for streaming conversational generation
- 🔄 Unified talking & listening modeling within a single framework
- 🎧🗣️ Interleaved multimodal tokens from both interlocutors
- 🌊 Diffusion-based 3D head decoding for expressive and stochastic motion
- 📉 15–30% error reduction over strong baselines (e.g., DualTalk)
Human conversation is a continuous exchange of speech and nonverbal cues—including head nods, gaze shifts, and subtle expressions.
Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.
We propose a causal, turn-level framework for interactive 3D conversational head generation.
Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.
TIMAR (Turn-level Interleaved Masked AutoRegression) is the core method of this work. It:
- Represents each conversation as interleaved audio–visual tokens:
  - 👤 User speech + user head motion
  - 🤖 Agent speech + agent head motion
- Performs:
  - 🔁 Bidirectional fusion within each turn (intra-turn alignment)
  - ⏱️ Strictly causal reasoning across turns (inter-turn dependency)
This design mirrors how humans coordinate speaking and listening over time; the attention-mask sketch below makes the pattern concrete.
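Here is a minimal PyTorch sketch of that attention pattern, assuming each interleaved token is labeled with a turn index (the `turn_ids` tensor is our illustrative name, not an identifier from the released code):

```python
import torch

def turn_level_causal_mask(turn_ids: torch.Tensor) -> torch.Tensor:
    # turn_ids: (seq_len,) non-decreasing turn index per interleaved token
    q = turn_ids.unsqueeze(1)  # queries, shape (seq_len, 1)
    k = turn_ids.unsqueeze(0)  # keys,    shape (1, seq_len)
    # Allow attention to any token from the same or an earlier turn:
    # bidirectional inside a turn, strictly causal across turns.
    return k <= q

# Two turns, each holding user/agent audio + motion tokens:
turn_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
mask = turn_level_causal_mask(turn_ids)  # (8, 8) boolean mask
```

Keying the mask on turn indices rather than token positions is what lets tokens inside a turn fuse bidirectionally while the sequence as a whole stays causal.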
Core components:
- 🧠 Turn-Level Causal Attention (TLCA)
  - Bidirectional attention inside a turn
  - Causal masking across turns (no future leakage)
- 🌊 Lightweight Diffusion Head (see the sampling sketch below)
  - Predicts continuous 3D head motion
  - Captures expressive stochasticity beyond deterministic regression
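Since the diffusion head is described only at a high level here, the following is an illustrative DDPM-style sampling loop rather than TIMAR's exact formulation; the `denoiser` network, the linear beta schedule, `motion_dim`, and `steps` are all assumptions:

```python
import torch

@torch.no_grad()
def sample_head_motion(denoiser, context, motion_dim=64, steps=50):
    # Linear beta schedule (an assumption, not taken from the paper).
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from Gaussian noise and denoise step by step,
    # conditioned on the transformer's turn-level context.
    x = torch.randn(context.shape[0], motion_dim)
    for t in reversed(range(steps)):
        t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
        eps = denoiser(x, t_batch, context)  # hypothetical noise predictor
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```

Sampling each motion chunk from noise, conditioned on the turn-level context, is what lets the head capture stochastic variation that a deterministic MLP regressor would average away.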
We evaluate our framework on the interactive 3D conversational head benchmark, following the DualTalk protocol.
Results at a glance:
- ⬇️ 15–30% reduction in Fréchet Distance (FD, sketched below) and MSE
- 📈 Improved expressiveness and synchronization (SID ↑)
- 🌍 Strong generalization on out-of-distribution conversations
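For reference, a common way to compute the Fréchet Distance reported above is to fit a Gaussian to each set of motion features and compare the two distributions. This sketch assumes pre-extracted feature arrays and does not reproduce the DualTalk benchmark's exact feature extractor:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each feature set of shape (n, d).
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```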
Click to see the results
| Demo | Preview |
|---|---|
| Demo 1 | demo_1.mp4 |
| Demo 2 | demo_2.mp4 |
| Demo 3 | demo_3.mp4 |
Notation
- Agent GT denotes the ground-truth 3D head motion.
- TIMAR Agent denotes our generated results.
- DualTalk Agent denotes the outputs from the DualTalk baseline.
TIMAR produces:
- Natural listening behavior when the agent is silent
- Context-aware reactions with longer conversational history
- Smoother and more stable 3D head motion
We analyze the contribution of each design choice:
- ❌ MLP head vs ✅ 🌊 diffusion-based head
- ❌ Full bidirectional attention vs ✅ turn-level causal attention
- ❌ Encoder–decoder vs ✅ encoder-only backbone
Each component is critical for causal coherence and generalization.
🚧 Code will be released soon!
The full implementation of TIMAR, including training and inference pipelines, will be publicly released.
If you are interested, feel free to ⭐️ this repository and check back later.
If you find this work useful, please consider citing:
@article{chen2025timar,
title={Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics},
author={Chen, Junjie and Wang, Fei and Huang, Zhihao and Zhou, Qing and Li, Kun and Guo, Dan and Zhang, Linfeng and Yang, Xun},
journal={arXiv preprint arXiv:2512.15340},
year={2025}
}



