
Commit df02bb6

Merge pull request #212 from zchoi/main
Update the code and README of OmniCharacter
2 parents e36b409 + 3509b95 commit df02bb6

File tree

115 files changed: 22845 additions, 1 deletion


.DS_Store

2 KB
Binary file not shown.

OmniCharacter/README.md

Lines changed: 143 additions & 1 deletion
@@ -1 +1,143 @@
- This repo contains the official code for the paper: "OmniCharacter: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction". The source code, dataset, and demo will be released soon. Please wait patiently.
# _OmniCharacter_: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction (ACL 2025 main)

[Haonan Zhang](https://zchoi.github.io/)\*, [Run Luo](https://scholar.google.com/citations?user=phg8yxoAAAAJ&hl=zh-CN&oi=ao)\*, Xiong Liu\*, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song<sup>‡</sup>, Fei Huang, Yongbin Li<sup>‡</sup> (\* Equal contribution, ‡ Corresponding author)

This is the official code implementation of the paper "**OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction**".

We are continuously refactoring our code; please be patient and stay tuned for the latest updates!

## 🔥 Updates

- [ ] Release the pre-trained weights and datasets.
- [x] Release the training and evaluation code.
- [x] We released the [paper](https://arxiv.org/abs/2505.20277) for OmniCharacter!

## ⚙️ Installation

1. Clone the repo:

```
# git cannot clone a /tree/ URL directly, so clone the parent repo and enter OmniCharacter/
git clone --recursive https://github.com/AlibabaResearch/DAMO-ConvAI.git
cd DAMO-ConvAI/OmniCharacter
```

2. Create a Conda environment:

```
conda create -n omnicharacter python=3.10 -y
conda activate omnicharacter
pip install --upgrade pip  # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn" --no-build-isolation
```

## 🚀 Train

1. Download the dataset.

First, download the OmniCharacter training and test sets from our [HuggingFace🤗](https://huggingface.co/datasets/Tongyi-ConvAI/OmniCharacter) repository.
After downloading, place the dataset in a folder named data/ under the project root:

```
mkdir -p data
# Put the downloaded files into the data/ folder
```

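As a quick, optional sanity check (not part of the repo), the snippet below loads the training file from the data/ layout shown in step 2 and reports its size; the record schema is not documented here, so it only prints the keys of the first sample:

```
# Optional sanity check; assumes data/omnicharacter_10k_train.json as in the layout in step 2.
import json
from pathlib import Path

train_path = Path("data/omnicharacter_10k_train.json")
with train_path.open("r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"{train_path}: {len(samples)} training samples")
# Peek at the first record's keys without assuming a particular schema.
first = samples[0] if isinstance(samples, list) else samples
print("example keys:", list(first)[:10])
```
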
2. Prepare checkpoints and speech modules

We finetune OmniCharacter based on the [OpenOmni](https://arxiv.org/abs/2501.04561) pre-trained weights.
You can download the OpenOmni checkpoints from [HuggingFace🤗](https://huggingface.co/Tongyi-ConvAI/OpenOmni/tree/main) and place them in a checkpoints/ directory; the speech modules below ship together with the OpenOmni download:

```
mkdir -p checkpoints
# Put the OpenOmni weights into checkpoints/
```

In addition, make sure the following modules are also placed under the checkpoints/ directory:

- speech_projector: the pre-trained speech encoder used to extract speech features from reference audio.

- speech_generator: the pre-trained speech decoder used to generate speech tokens.

Your directory structure should look like this:

```
OmniCharacter/
├── checkpoints/
│   └── openomni/
│       ├── pretrained/
│       │   ├── speech_projector/
│       │   └── speech_generator/
│       └── qwen/
├── data/
│   ├── omnicharacter_10k_train.json
│   ├── omnicharacter_test.json
│   └── audio_data/
```

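Before launching training, it can help to confirm this layout is actually in place. A minimal convenience check (not part of the repo), assuming exactly the paths shown in the tree above:

```
# Convenience check: verify the checkpoint and data layout shown above.
from pathlib import Path

required = [
    "checkpoints/openomni/pretrained/speech_projector",
    "checkpoints/openomni/pretrained/speech_generator",
    "checkpoints/openomni/qwen",
    "data/omnicharacter_10k_train.json",
    "data/omnicharacter_test.json",
    "data/audio_data",
]

missing = [p for p in required if not Path(p).exists()]
print("Missing paths:", missing if missing else "none - layout looks good")
```
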
3. Train the model with the following commands:

**Stage-1**: focuses on aligning speech features (user query) and text (role profile, dialogue contexts, etc.) in a shared personality space. Use the provided shell script to launch training:

```
bash omnicharacter_stage1_qwen2.5.sh
```

This saves outputs to the ```results/``` directory.

**Stage-2**: further finetunes the speech generator.

Once Stage 1 completes, locate its checkpoint (e.g., results/stage1/checkpoint-xxx/) and pass it to Stage 2 as ```--model_name_or_path```:

```
bash omnicharacter_stage2_qwen2.5.sh
```

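As a small illustration (not part of the repo), assuming Stage 1 wrote its outputs under results/stage1/ as in the example above, you can pick the latest checkpoint to pass as ```--model_name_or_path```:

```
# Illustrative helper: print the newest Stage-1 checkpoint directory.
from pathlib import Path

ckpts = sorted(Path("results/stage1").glob("checkpoint-*"),
               key=lambda p: int(p.name.split("-")[-1]))
if ckpts:
    print(f"--model_name_or_path {ckpts[-1]}/")
else:
    print("No Stage-1 checkpoints found under results/stage1/")
```
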
## 🍃 Inference
After downloading the weights and configuring the paths properly, a speech tokenizer is needed for speech discretization and reconstruction, _i.e._, [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice).

Fast inference:

```
python inference.py
```

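As a rough setup sketch (not pinned down by this repo): the HuggingFace repo IDs below are assumptions based on the upstream GLM-4-Voice release, and the target directories are placeholders; check that project's README and point inference.py at wherever it expects the tokenizer.

```
# Sketch only: fetch the GLM-4-Voice tokenizer/decoder weights with huggingface_hub.
# The repo IDs are assumptions from the upstream release; local_dir paths are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="THUDM/glm-4-voice-tokenizer",
                  local_dir="checkpoints/glm-4-voice-tokenizer")
snapshot_download(repo_id="THUDM/glm-4-voice-decoder",
                  local_dir="checkpoints/glm-4-voice-decoder")
```
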
## 📖 Citation

If this project contributes to your research, we kindly ask you to cite the following papers:

```
@article{zhang2025omnicharacter,
  title={OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
  author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
  journal={ACL 2025},
  year={2025}
}
```

```
@article{luo2025openomni,
  title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
  author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
  journal={arXiv preprint arXiv:2501.04561},
  year={2025}
}
```

```
@article{luo2024mmevol,
  title={MMEvol: Empowering multimodal large language models with Evol-Instruct},
  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
  journal={ACL 2025},
  year={2024}
}
```

## 📧 Contact

If you have any questions or need assistance, feel free to reach out via the contact information below.

- Haonan Zhang: zchiowal@gmail.com

- Run Luo: r.luo@siat.ac.cn

## Acknowledgement

- [**OpenOmni**](https://huggingface.co/AlibabaResearch/OpenOmni): The backbone multimodal foundation model powering our speech-language finetuning. We are truly excited to build on top of this open effort!

- [**LLaMA-Omni**](https://github.com/ictnlp/LLaMA-Omni) and [**LLaVA**](https://github.com/haotian-liu/LLaVA): The foundational codebases our work builds upon. We sincerely appreciate their pioneering contributions to the community!

- [**CosyVoice**](https://github.com/FunAudioLLM/CosyVoice): An excellent open-source speech tokenizer enabling discretization and reconstruction with a 6k vocabulary, essential for expressive speech representation.

- [**GLM-4-Voice**](https://github.com/THUDM/GLM-4-Voice): Another impressive speech tokenizer supporting high-fidelity reconstruction with a 16k vocabulary. Huge thanks for making this resource available!

OmniCharacter/assets/logo.png

219 KB

OmniCharacter/cosyvoice/__init__.py

Whitespace-only changes.
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function

import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os

import torch
from torch.utils.data import DataLoader
import torchaudio
from hyperpyyaml import load_hyperpyyaml
from tqdm import tqdm
from cosyvoice.cli.model import CosyVoiceModel

from cosyvoice.dataset.dataset import Dataset


def get_args():
    parser = argparse.ArgumentParser(description='inference with your model')
    parser.add_argument('--config', required=True, help='config file')
    parser.add_argument('--prompt_data', required=True, help='prompt data file')
    parser.add_argument('--prompt_utt2data', required=True, help='prompt data file')
    parser.add_argument('--tts_text', required=True, help='tts input file')
    parser.add_argument('--llm_model', required=True, help='llm model file')
    parser.add_argument('--flow_model', required=True, help='flow model file')
    parser.add_argument('--hifigan_model', required=True, help='hifigan model file')
    parser.add_argument('--gpu',
                        type=int,
                        default=-1,
                        help='gpu id for this rank, -1 for cpu')
    parser.add_argument('--mode',
                        default='sft',
                        choices=['sft', 'zero_shot'],
                        help='inference mode')
    parser.add_argument('--result_dir', required=True, help='asr result file')
    args = parser.parse_args()
    print(args)
    return args


def main():
    args = get_args()
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(levelname)s %(message)s')
    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu)

    # Init cosyvoice models from configs
    use_cuda = args.gpu >= 0 and torch.cuda.is_available()
    device = torch.device('cuda' if use_cuda else 'cpu')
    with open(args.config, 'r') as f:
        configs = load_hyperpyyaml(f)

    model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'])
    model.load(args.llm_model, args.flow_model, args.hifigan_model)

    test_dataset = Dataset(args.prompt_data, data_pipeline=configs['data_pipeline'], mode='inference', shuffle=False, partition=False, tts_file=args.tts_text, prompt_utt2data=args.prompt_utt2data)
    test_data_loader = DataLoader(test_dataset, batch_size=None, num_workers=0)

    del configs
    os.makedirs(args.result_dir, exist_ok=True)
    fn = os.path.join(args.result_dir, 'wav.scp')
    f = open(fn, 'w')
    with torch.no_grad():
        for batch_idx, batch in tqdm(enumerate(test_data_loader)):
            utts = batch["utts"]
            assert len(utts) == 1, "inference mode only support batchsize 1"
            text = batch["text"]
            text_token = batch["text_token"].to(device)
            text_token_len = batch["text_token_len"].to(device)
            tts_text = batch["tts_text"]
            tts_index = batch["tts_index"]
            tts_text_token = batch["tts_text_token"].to(device)
            tts_text_token_len = batch["tts_text_token_len"].to(device)
            speech_token = batch["speech_token"].to(device)
            speech_token_len = batch["speech_token_len"].to(device)
            speech_feat = batch["speech_feat"].to(device)
            speech_feat_len = batch["speech_feat_len"].to(device)
            utt_embedding = batch["utt_embedding"].to(device)
            spk_embedding = batch["spk_embedding"].to(device)
            if args.mode == 'sft':
                model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
                               'llm_embedding': spk_embedding, 'flow_embedding': spk_embedding}
            else:
                model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
                               'prompt_text': text_token, 'prompt_text_len': text_token_len,
                               'llm_prompt_speech_token': speech_token, 'llm_prompt_speech_token_len': speech_token_len,
                               'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
                               'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
                               'llm_embedding': utt_embedding, 'flow_embedding': utt_embedding}
            model_output = model.inference(**model_input)
            tts_key = '{}_{}'.format(utts[0], tts_index[0])
            tts_fn = os.path.join(args.result_dir, '{}.wav'.format(tts_key))
            torchaudio.save(tts_fn, model_output['tts_speech'], sample_rate=22050)
            f.write('{} {}\n'.format(tts_key, tts_fn))
            f.flush()
    f.close()
    logging.info('Result wav.scp saved in {}'.format(fn))


if __name__ == '__main__':
    main()
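
For reference, a hypothetical invocation of the script above, driven from Python: every path below is a placeholder (including the script path itself), while the flags mirror get_args() in the listing.

```
# Hypothetical usage only: all paths are placeholders; the flags come from get_args() above.
import subprocess

cmd = [
    "python", "path/to/this_inference_script.py",  # placeholder: this file's path in your checkout
    "--config", "path/to/cosyvoice_config.yaml",
    "--prompt_data", "path/to/prompt.data.list",
    "--prompt_utt2data", "path/to/prompt.utt2data.list",
    "--tts_text", "path/to/tts_text.json",
    "--llm_model", "path/to/llm.pt",
    "--flow_model", "path/to/flow.pt",
    "--hifigan_model", "path/to/hifigan.pt",
    "--mode", "sft",                # or "zero_shot"
    "--gpu", "0",                   # -1 for CPU
    "--result_dir", "results/cosyvoice_inference",
]
subprocess.run(cmd, check=True)
```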
