Fantasy AIGC Family is an open-source initiative exploring Human-centric AI, World Modeling, and Human-World Interaction, aiming to bridge perception, understanding, and generation in the real and digital worlds.
- 📢 Jan 2026 – We released the training and inference code and model weights of FantasyVLN.
- 🏆 Dec 2025 – FantasyWorld ranked 1st on the WorldScore Leaderboard (from Stanford Prof. Fei-Fei Li's team), validating our approach against global state-of-the-art models.
- 🏛 Nov 2025 – Two papers from our family, FantasyTalking2 and FantasyHSI, have been accepted to AAAI 2026.
- 🏛 Jul 2025 – FantasyTalking was accepted to ACM MM 2025.
- 📢 Apr 2025 – We released the inference code and model weights of FantasyTalking and FantasyID.
A unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations.
Corresponds to the "Worlds" dimension. A unified world model integrating video priors and geometric grounding for synthesizing explorable and geometrically consistent 3D scenes. It emphasizes spatiotemporal consistency driven by Action and serves as a verifiable structural anchor for spatial intelligence.
The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes through dual-stage audio-visual alignment and controllable motion modulation.
A novel Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) method that enhances the quality of audio-driven avatars along three dimensions: lip-sync accuracy, motion naturalness, and visual quality.
A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.
Corresponds to the "Interaction" dimension. A graph-based multi-agent framework that grounds video generation within 3D world dynamics. It unifies the action space with a broader interaction loop, transforming video generation from a content endpoint into a control channel for interactive systems.
A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.
- Giving Back to the Community: In our daily work, we benefit immensely from the resources, expertise, and support of the open source community, and we aim to give back by making our own projects open source.
- Attracting More Contributors: By open sourcing our code, we invite developers worldwide to collaborate—making our models smarter, our engineering more robust, and extending benefits to even more users.
- Building an Open Ecosystem: We believe that open source brings together diverse expertise to create a collaborative innovation platform—driving technological progress, industry growth, and broader societal impact.