[🚧 Code will be released soon!] Official repository of the paper "Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics"

CoderChen01/towards-seamless-interaction

Project Logo

🤖✨ Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen1,2 · Fei Wang1,2 · Zhihao Huang5,6 · Qing Zhou8 · Kun Li7
Dan Guo1 · Linfeng Zhang4 · Xun Yang3

1 Hefei University of Technology   ·   2 IAI, Hefei Comprehensive National Science Center
3 USTC   ·   4 SJTU   ·   5 TeleAI, China Telecom   ·   6 Northwestern Polytechnical University
7 United Arab Emirates University   ·   8 Anhui Polytechnic University


🔥 Highlights

  • 🧠 Causal turn-level formulation for streaming conversational generation
  • 🔄 Unified talking & listening modeling within a single framework
  • 🎧🗣️ Interleaved multimodal tokens from both interlocutors
  • 🌊 Diffusion-based 3D head decoding for expressive and stochastic motion
  • 📉 15–30% error reduction over strong baselines (e.g., DualTalk)

🚀 Overview

Human conversation is a continuous exchange of speech and nonverbal cues, including head nods, gaze shifts, and subtle expressions.
Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.

We propose a causal, turn-level framework for interactive 3D conversational head generation.
Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.

Framework Overview

🧩 Method: TIMAR

TIMAR (Turn-level Interleaved Masked AutoRegression) is the core method proposed in this work.

🧱 Key Idea

  • Represent conversation as interleaved audio–visual tokens:
    • 👤 User speech + user head motion
    • 🤖 Agent speech + agent head motion
  • Perform:
    • 🔁 Bidirectional fusion within each turn (intra-turn alignment)
    • ⏱️ Strictly causal reasoning across turns (inter-turn dependency)

This design mirrors how humans coordinate speaking and listening over time.
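
The interleaved layout described above can be illustrated with a minimal sketch. It is not the released implementation: the `Token` class, the `interleave_turns` helper, the order of the four streams inside a turn, and the number of tokens per stream are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical placeholder for one audio or motion token."""
    turn: int       # index of the conversational turn this token belongs to
    speaker: str    # "user" or "agent"
    modality: str   # "speech" or "motion"


def interleave_turns(num_turns: int, tokens_per_stream: int = 1) -> List[Token]:
    """Lay out tokens turn by turn, emitting the four streams of each turn together.

    The stream order inside a turn is an assumption, not taken from the paper.
    """
    streams = [
        ("user", "speech"),
        ("user", "motion"),
        ("agent", "speech"),
        ("agent", "motion"),
    ]
    sequence: List[Token] = []
    for turn in range(num_turns):
        for speaker, modality in streams:
            sequence.extend(
                Token(turn, speaker, modality) for _ in range(tokens_per_stream)
            )
    return sequence


sequence = interleave_turns(num_turns=3)
# Per-token turn index; this is the only information a turn-level mask needs.
turn_ids = [tok.turn for tok in sequence]
print(turn_ids)
```

Keeping a per-token turn index alongside the sequence makes it straightforward to apply different attention rules within a turn and across turns.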

⚙️ Architecture

TIMAR Architecture

Core components:

  • 🧠 Turn-Level Causal Attention (TLCA)
    • Bidirectional attention inside a turn
    • Causal masking across turns (no future leakage)
  • 🌊 Lightweight Diffusion Head
    • Predicts continuous 3D head motion
    • Captures expressive stochasticity beyond deterministic regression
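
The TLCA masking rule listed above (bidirectional inside a turn, causal across turns) can be sketched as a boolean attention mask built from per-token turn indices. This is an illustrative sketch only: the function name `turn_level_causal_mask` and the `turn_ids` input are hypothetical, and the actual implementation may construct the mask differently.

```python
from typing import List


def turn_level_causal_mask(turn_ids: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    A query sees every token of its own turn (bidirectional) and every token
    of earlier turns, but no token of a later turn (no future leakage).
    """
    return [[k_turn <= q_turn for k_turn in turn_ids] for q_turn in turn_ids]


# Toy example: three turns with two tokens each.
turn_ids = [0, 0, 1, 1, 2, 2]
mask = turn_level_causal_mask(turn_ids)

# Print the mask: "x" = attention allowed, "." = blocked.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

For the toy input, the printed pattern is block lower-triangular: tokens in turn 0 see only turn 0, tokens in turn 1 see turns 0 and 1, and tokens in turn 2 see everything. A standard token-level causal mask would instead block the second token of a turn from the view of the first.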

🧪 Experiments

We evaluate our framework on the interactive 3D conversational head benchmark, following the DualTalk protocol.

📊 Quantitative Results

Quantitative Results

Results at a glance:

  • ⬇️ 15–30% reduction in Fréchet Distance (FD) and MSE
  • 📈 Improved expressiveness and synchronization (SID ↑)
  • 🌍 Strong generalization on out-of-distribution conversations
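
For readers unfamiliar with the two error metrics named above, the sketch below shows one common way to compute them with NumPy. It assumes the Fréchet Distance is taken between Gaussians fitted to two sets of feature vectors; the feature space and exact metric definitions used by the benchmark are not specified in this README, so the function names and the toy data are illustrative only.

```python
import numpy as np


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to the rows of x and y.

    FD = ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^(1/2))
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # Tr((C_x C_y)^(1/2)) equals the sum of square roots of the eigenvalues of
    # C_x C_y, which are real and non-negative up to numerical noise.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Toy data (random, not results from the paper): 500 samples of 4-D features.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 1.0  # same spread, mean moved by 1 in each dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
print(mse(shifted, real))
```

Because `shifted` has the same covariance as `real`, the covariance term cancels and the distance reduces to the squared mean offset, which gives a quick sanity check for the implementation.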

🎭 Qualitative Results

Qualitative Results

Demo Preview

  • Demo 1: demo_1.mp4
  • Demo 2: demo_2.mp4
  • Demo 3: demo_3.mp4

Notation

  • Agent GT denotes the ground-truth 3D head motion.
  • TIMAR Agent denotes our generated results.
  • DualTalk Agent denotes the outputs from the DualTalk baseline.

TIMAR produces:

  • Natural listening behavior when the agent is silent
  • Context-aware reactions that draw on longer conversational history
  • Smoother and more stable 3D head motion

🧩 Ablation Studies

Ablation Studies

We analyze the contribution of each design choice:

  • ❌ MLP head vs 🌊 diffusion-based head
  • ❌ Full bidirectional attention vs ✅ turn-level causal attention
  • ❌ Encoder–decoder vs ✅ encoder-only backbone

Each component is critical for causal coherence and generalization.

📦 Code Release

🚧 Code will be released soon!

The full implementation of TIMAR, including training and inference pipelines, will be publicly released.
If you are interested, feel free to ⭐️ this repository and check back later.

📚 Citation

If you find this work useful, please consider citing:

@article{chen2025timar,
  title={Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics},
  author={Chen, Junjie and Wang, Fei and Huang, Zhihao and Zhou, Qing and Li, Kun and Guo, Dan and Zhang, Linfeng and Yang, Xun},
  journal={arXiv preprint arXiv:2512.15340},
  year={2025}
}
