OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction (ACL 2025 main)
Haonan Zhang*, Run Luo*, Xiong Liu*, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song‡, Fei Huang, Yongbin Li‡ (* Equal contribution ‡ Corresponding author)
This is the official code implementation of the paper "OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction".
We are continuously refactoring our code; please stay tuned for the latest updates!
- Release the pre-trained weights and datasets.
- Release the training and evaluation code.
- We release the paper for OmniCharacter!
- Clone the repo
git clone --recursive https://github.com/AlibabaResearch/DAMO-ConvAI.git
cd DAMO-ConvAI/OmniCharacter
- Create Conda env:
conda create -n omnicharacter python=3.10 -y
conda activate omnicharacter
pip install --upgrade pip # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install packaging ninja
ninja --version; echo $? # Verify Ninja --> should return exit code "0"
pip install "flash-attn" --no-build-isolation
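After installation, a quick sanity check can catch missing dependencies before you launch training. The sketch below uses only the standard library; the package list is illustrative (note that flash-attn's import name is `flash_attn`, unlike its pip name):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    # Illustrative list; extend with whatever requirements.txt pins.
    required = ["torch", "transformers", "flash_attn"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```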
- Download the dataset. First, download the OmniCharacter training and test sets from our Hugging Face 🤗 repository. After downloading, place the dataset in a folder named data/ under the project root:
mkdir -p data
# Put the downloaded files into the data/ folder
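To confirm the download is intact, you can load each dataset file and count its entries. This sketch assumes the files hold top-level JSON arrays (common for LLaVA-style training data); adjust if the actual layout differs:

```python
import json
from pathlib import Path

def count_examples(path):
    """Load a JSON dataset file and return the number of examples.

    Assumes a top-level JSON array; raises if the file is structured
    differently so problems surface before training starts.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return len(data)

if __name__ == "__main__":
    for name in ("data/omnicharacter_10k_train.json", "data/omnicharacter_test.json"):
        if Path(name).exists():
            print(name, "->", count_examples(name), "examples")
        else:
            print(name, "not found")
```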
- Prepare checkpoints and Speech Modules
We finetune OmniCharacter from the OpenOmni pre-trained weights. Download the OpenOmni checkpoints from Hugging Face 🤗 and place them in a checkpoints/ directory; the speech modules below are distributed together with OpenOmni:
mkdir -p checkpoints
# Put the OpenOmni weights into checkpoints/
In addition, make sure the following modules are also placed under the checkpoints/ directory:
- speech_projector: the pre-trained speech encoder used to extract speech features from reference audio.
- speech_generator: the pre-trained speech decoder used to generate speech tokens.
Your directory structure should look like this:
OmniCharacter/
├── checkpoints/
│   ├── openomni/
│   ├── pretrained/
│   │   ├── speech_projector/
│   │   └── speech_generator/
│   └── qwen/
└── data/
    ├── omnicharacter_10k_train.json
    ├── omnicharacter_test.json
    └── audio_data/
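Before launching training, a small script can verify that everything is in place. The relative paths below mirror the tree above and are easy to extend:

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = [
    "checkpoints/openomni",
    "checkpoints/pretrained/speech_projector",
    "checkpoints/pretrained/speech_generator",
    "checkpoints/qwen",
    "data/omnicharacter_10k_train.json",
    "data/omnicharacter_test.json",
    "data/audio_data",
]

def missing_paths(root="."):
    """Return the expected paths (relative to `root`) that do not exist yet."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]

if __name__ == "__main__":
    for p in missing_paths():
        print("missing:", p)
```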
- You can train the model with the following command:
Stage-1 focuses on aligning speech features (the user query) and text (role profile, dialogue context, etc.) in a shared personality space. Use the provided shell script to launch training:
bash omnicharacter_stage1_qwen2.5.sh
This saves outputs to the results/ directory.
Stage-2 further finetunes the speech generator.
Once Stage 1 completes, locate the checkpoint (e.g., results/stage1/checkpoint-xxx/) and pass it to Stage 2 as --model_name_or_path:
bash omnicharacter_stage2_qwen2.5.sh
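When several Stage-1 checkpoints exist, picking the most recent one by step number can be automated. This helper assumes the standard `checkpoint-<step>` naming used by Hugging Face trainers, as in the example path above:

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step,
    or None if no checkpoint directories are found."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for entry in Path(output_dir).iterdir():
        m = pattern.match(entry.name)
        if entry.is_dir() and m and int(m.group(1)) > best_step:
            best, best_step = entry, int(m.group(1))
    return best

if __name__ == "__main__":
    out = Path("results/stage1")
    if out.is_dir():
        # Pass this path to Stage 2 as --model_name_or_path.
        print("Latest Stage-1 checkpoint:", latest_checkpoint(out))
```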
After downloading the weights, configure the paths accordingly. A speech tokenizer, e.g., GLM-4-Voice, is needed for speech discretization and reconstruction.
Fast inference:
python inference.py
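If you want to persist the synthesized speech yourself, the standard-library `wave` module is enough for writing raw PCM. The 16 kHz mono int16 format below is an assumption about the decoder output, not taken from the repo; adjust to match what inference.py actually produces:

```python
import wave

def save_wav(path, pcm_bytes, sample_rate=16000):
    """Write mono 16-bit PCM bytes to a WAV file at the given sample rate."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```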
If this project contributes to your research, we kindly ask you to cite the following paper:
@article{zhang2025omnicharacter,
title={OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
journal={ACL 2025},
year={2025}
}
@article{luo2025openomni,
title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
journal={arXiv preprint arXiv:2501.04561},
year={2025}
}
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={ACL 2025},
year={2024}
}
If you have any questions or need assistance, feel free to reach out via the contact information below.
- Haonan Zhang: zchiowal@gmail.com
- Run Luo: r.luo@siat.ac.cn
- OpenOmni: The backbone multimodal foundation model powering our speech-language finetuning. We are truly excited to build on top of this open effort!
- LLaVA and LLaVA-Omni: The foundational codebases our work builds upon. We sincerely appreciate their pioneering contributions to the community!
- CosyVoice: An excellent open-source speech tokenizer enabling discretization and reconstruction with a 6k vocabulary, essential for expressive speech representation.
- GLM4Voice: Another impressive speech tokenizer supporting high-fidelity reconstruction with a 16k vocabulary. Huge thanks for making this resource available!