AlibabaResearch
diff --git a/.DS_Store b/.DS_Store
2 KB
diff --git a/OmniCharacter/README.md b/OmniCharacter/README.md
Lines changed: 143 additions & 1 deletion
diff --git a/OmniCharacter/assets/logo.png b/OmniCharacter/assets/logo.png
219 KB
diff --git a/OmniCharacter/cosyvoice/__init__.py b/OmniCharacter/cosyvoice/__init__.py
diff --git a/OmniCharacter/cosyvoice/bin/inference.py b/OmniCharacter/cosyvoice/bin/inference.py
Lines changed: 114 additions & 0 deletions
@@ -1 +1,143 @@
-This repo contains the official code for the paper: "OmniCharacter: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction". The source code, dataset, and demo will be released soon. Please wait patiently.
+# _OmniCharacter_: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction (ACL 2025 main)
+
+[Haonan Zhang](https://zchoi.github.io/)\*, [Run Luo](https://scholar.google.com/citations?user=phg8yxoAAAAJ&hl=zh-CN&oi=ao)\*, Xiong Liu\*, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song<sup>‡</sup>, Fei Huang, Yongbin Li<sup>‡</sup> (\* Equal contribution ‡ Corresponding author)
+
+This is the official code implementation of the paper "**OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction**".
+
+We are continuously refactoring the code; please be patient and watch for the latest updates!
+
+## 🔥 Updates
+
+- [ ] Release the pre-trained weights and datasets.
+- [x] Release the training and evaluation code.
+- [x] We release the [paper](https://arxiv.org/abs/2505.20277) for OmniCharacter!
+
+## ⚙️ Installation
+
+1.  Clone the repo
+
+```
+git clone --recursive https://github.com/AlibabaResearch/DAMO-ConvAI.git
+cd DAMO-ConvAI/OmniCharacter
+```
+
+2. Create Conda env:
+```
+conda create -n omnicharacter python=3.10 -y
+conda activate omnicharacter
+pip install --upgrade pip  # enable PEP 660 support
+pip install -e ".[train]"
+pip install -r requirements.txt
+
+# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
+#   =>> If you run into difficulty, try `pip cache remove flash_attn` first
+pip install packaging ninja
+ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
+pip install "flash-attn" --no-build-isolation
+```
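After the installation steps above, a quick check can confirm that the key packages are visible to the new environment. This is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and the import names `torch` and `flash_attn` are assumptions about what the packages expose.

```python
import importlib.util


def check_packages(names):
    """Return {name: True/False} depending on whether each top-level module can be found."""
    # find_spec only locates the module; it does not import it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for PyTorch and Flash Attention 2.
    for name, ok in check_packages(["torch", "flash_attn"]).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A package reported as missing usually means the corresponding `pip install` step did not complete in the active Conda environment.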
+
+## 🚀 Train
+1. Download the dataset.
+First, download the OmniCharacter training and test sets from our [HuggingFace🤗](https://huggingface.co/datasets/Tongyi-ConvAI/OmniCharacter) repository.
+After downloading, place the dataset in a folder named data/ under the project root:
+
+```
+mkdir -p data
+# Put the downloaded files into the data/ folder
+```
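The download step above can also be scripted. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with the dataset repository linked above. The helper name `download_dataset` is hypothetical, and whether the downloaded files already match the expected `data/` layout is not confirmed by the source, so some manual rearranging may still be needed.

```python
from pathlib import Path

# Dataset repository id, taken from the HuggingFace link above.
DATASET_REPO = "Tongyi-ConvAI/OmniCharacter"


def download_dataset(local_dir="data"):
    """Download the dataset repository into local_dir and return the local path."""
    # Imported lazily so this module loads even when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return snapshot_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",  # the link points at a dataset repo, not a model repo
        local_dir=local_dir,
    )
```

Calling `download_dataset()` from the project root would place the files under `data/`.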
+
+2. Prepare checkpoints and speech modules.
+
+We finetune OmniCharacter from the [OpenOmni](https://arxiv.org/abs/2501.04561) pre-trained weights.
+Download the OpenOmni checkpoints from [HuggingFace🤗](https://huggingface.co/Tongyi-ConvAI/OpenOmni/tree/main) and place them in a `checkpoints/` directory (the speech modules described below are included in the same download):
+```
+mkdir -p checkpoints
+# Put the OpenOmni weights into checkpoints/
+```
+In addition, make sure the following modules are also placed under the checkpoints/ directory:
+
+- speech_projector: The pre-trained speech encoder used to extract speech features from reference audio.
+
+- speech_generator: The pre-trained speech decoder model used for generating speech tokens.
+
+Your directory structure should look like this:
+```
+OmniCharacter/
+├── checkpoints/
+│   └── openomni/
+│       ├── pretrained/
+│       │   ├── speech_projector/
+│       │   └── speech_generator/
+│       └── qwen/
+├── data/
+│   ├── omnicharacter_10k_train.json
+│   ├── omnicharacter_test.json
+│   └── audio_data/
+```
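Before launching training, it can help to verify that the layout above is in place. The following is a minimal sketch: the path list is copied from the directory tree above, while the helper name `missing_paths` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the project root.
EXPECTED_PATHS = [
    "checkpoints/openomni/pretrained/speech_projector",
    "checkpoints/openomni/pretrained/speech_generator",
    "checkpoints/openomni/qwen",
    "data/omnicharacter_10k_train.json",
    "data/omnicharacter_test.json",
    "data/audio_data",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Layout looks complete.")
```

Run it from the `OmniCharacter/` project root; an empty result means every expected file and folder was found.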
+
+3. Train the model in two stages.
+
+**Stage-1** aligns speech features (user query) and text (role profile, dialogue contexts, etc.) in a shared personality space. Use the provided shell script to launch training:
+```
+bash omnicharacter_stage1_qwen2.5.sh
+```
+This saves outputs to the `results/` directory.
+
+**Stage-2** further finetunes the speech generator.
+
+Once Stage-1 completes, locate the checkpoint (e.g., `results/stage1/checkpoint-xxx/`) and pass it to Stage-2 as `--model_name_or_path`:
+```
+bash omnicharacter_stage2_qwen2.5.sh
+```
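Locating the Stage-1 checkpoint can be automated. The sketch below assumes checkpoints are saved as `checkpoint-<step>` directories with an integer step, which is a common trainer convention; the source only shows `checkpoint-xxx`. The helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(stage1_dir):
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    best_step, best_path = -1, None
    for path in Path(stage1_dir).glob("checkpoint-*"):
        # Assumed naming: "checkpoint-" followed by an integer training step.
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

For example, `latest_checkpoint("results/stage1")` returns the path to use as the `--model_name_or_path` value, or `None` if no matching directory exists.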
+
+## 🍃 Inference
+After downloading the weights and configuring the paths, you also need a speech tokenizer for speech discretization and reconstruction, _i.e._, [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice).
+
+Fast inference:
+```
+python inference.py
+```
+
+## 📖 Citation
+If this project contributes to your research, we kindly ask you to cite the following paper:
+```
+@article{zhang2025omnicharacter,
+  title={OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
+  author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
+  journal={ACL 2025},
+  year={2025}
+}
+```
+```
+@article{luo2025openomni,
+  title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
+  author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
+  journal={arXiv preprint arXiv:2501.04561},
+  year={2025}
+}
+```
+```
+@article{luo2024mmevol,
+  title={Mmevol: Empowering multimodal large language models with evol-instruct},
+  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
+  journal={ACL 2025},
+  year={2024}
+}
+```
+## 📧 Contact
+If you have any questions or need assistance, feel free to reach out via the contact information below.
+
+- Haonan Zhang — zchiowal@gmail.com
+
+- Run Luo — r.luo@siat.ac.cn
+
+
+## Acknowledgement
+
+- [**OpenOmni**](https://huggingface.co/AlibabaResearch/OpenOmni): The backbone multimodal foundation model powering our speech-language finetuning. We are truly excited to build on top of this open effort!
+  
+- [**LLaVA**](https://github.com/haotian-liu/LLaVA) and [**LLaMA-Omni**](https://github.com/ictnlp/LLaMA-Omni): The foundational codebases our work builds upon. We sincerely appreciate their pioneering contributions to the community!
+
+- [**CosyVoice**](https://github.com/FunAudioLLM/CosyVoice): An excellent open-source speech tokenizer enabling discretization and reconstruction with a 6k vocabulary, essential for expressive speech representation.
+
+- [**GLM-4-Voice**](https://github.com/THUDM/GLM-4-Voice): Another impressive speech tokenizer supporting high-fidelity reconstruction with a 16k vocabulary. Huge thanks for making this resource available!
@@ -0,0 +1,114 @@
+# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+
+import argparse
+import logging
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+import os
+
+import torch
+from torch.utils.data import DataLoader
+import torchaudio
+from hyperpyyaml import load_hyperpyyaml
+from tqdm import tqdm
+from cosyvoice.cli.model import CosyVoiceModel
+
+from cosyvoice.dataset.dataset import Dataset
+
+def get_args():
+    parser = argparse.ArgumentParser(description='inference with your model')
+    parser.add_argument('--config', required=True, help='config file')
+    parser.add_argument('--prompt_data', required=True, help='prompt data file')
+    parser.add_argument('--prompt_utt2data', required=True, help='prompt utt2data file')
+    parser.add_argument('--tts_text', required=True, help='tts input file')
+    parser.add_argument('--llm_model', required=True, help='llm model file')
+    parser.add_argument('--flow_model', required=True, help='flow model file')
+    parser.add_argument('--hifigan_model', required=True, help='hifigan model file')
+    parser.add_argument('--gpu',
+                        type=int,
+                        default=-1,
+                        help='gpu id for this rank, -1 for cpu')
+    parser.add_argument('--mode',
+                        default='sft',
+                        choices=['sft', 'zero_shot'],
+                        help='inference mode')
+    parser.add_argument('--result_dir', required=True, help='output directory for generated wavs and wav.scp')
+    args = parser.parse_args()
+    print(args)
+    return args
+
+
+def main():
+    args = get_args()
+    logging.basicConfig(level=logging.DEBUG,
+                        format='%(asctime)s %(levelname)s %(message)s')
+    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu)
+
+    # Init cosyvoice models from configs
+    use_cuda = args.gpu >= 0 and torch.cuda.is_available()
+    device = torch.device('cuda' if use_cuda else 'cpu')
+    with open(args.config, 'r') as f:
+        configs = load_hyperpyyaml(f)
+
+    model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'])
+    model.load(args.llm_model, args.flow_model, args.hifigan_model)
+
+    test_dataset = Dataset(args.prompt_data, data_pipeline=configs['data_pipeline'], mode='inference', shuffle=False, partition=False, tts_file=args.tts_text, prompt_utt2data=args.prompt_utt2data)
+    test_data_loader = DataLoader(test_dataset, batch_size=None, num_workers=0)
+
+    del configs
+    os.makedirs(args.result_dir, exist_ok=True)
+    fn = os.path.join(args.result_dir, 'wav.scp')
+    f = open(fn, 'w')
+    with torch.no_grad():
+        for batch_idx, batch in tqdm(enumerate(test_data_loader)):
+            utts = batch["utts"]
+            assert len(utts) == 1, "inference mode only supports batch size 1"
+            text = batch["text"]
+            text_token = batch["text_token"].to(device)
+            text_token_len = batch["text_token_len"].to(device)
+            tts_text = batch["tts_text"]
+            tts_index = batch["tts_index"]
+            tts_text_token = batch["tts_text_token"].to(device)
+            tts_text_token_len = batch["tts_text_token_len"].to(device)
+            speech_token = batch["speech_token"].to(device)
+            speech_token_len = batch["speech_token_len"].to(device)
+            speech_feat = batch["speech_feat"].to(device)
+            speech_feat_len = batch["speech_feat_len"].to(device)
+            utt_embedding = batch["utt_embedding"].to(device)
+            spk_embedding = batch["spk_embedding"].to(device)
+            if args.mode == 'sft':
+                model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                               'llm_embedding': spk_embedding, 'flow_embedding': spk_embedding}
+            else:
+                model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
+                               'prompt_text': text_token, 'prompt_text_len': text_token_len,
+                               'llm_prompt_speech_token': speech_token, 'llm_prompt_speech_token_len': speech_token_len,
+                               'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
+                               'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
+                               'llm_embedding': utt_embedding, 'flow_embedding': utt_embedding}
+            model_output = model.inference(**model_input)
+            tts_key = '{}_{}'.format(utts[0], tts_index[0])
+            tts_fn = os.path.join(args.result_dir, '{}.wav'.format(tts_key))
+            torchaudio.save(tts_fn, model_output['tts_speech'], sample_rate=22050)
+            f.write('{} {}\n'.format(tts_key, tts_fn))
+            f.flush()
+    f.close()
+    logging.info('Result wav.scp saved in {}'.format(fn))
+
+
+if __name__ == '__main__':
+    main()
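The script above writes one `key path` line per synthesized utterance to `wav.scp` in the result directory. The sketch below shows one way to read that index back. It assumes keys contain no spaces, which holds when utterance ids have none; the helper name `read_wav_scp` is hypothetical and not part of the repository.

```python
def read_wav_scp(path):
    """Parse a wav.scp file into {utterance_key: wav_path}."""
    entries = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # The first space separates the key from the path,
            # so paths containing spaces are kept intact.
            key, wav_path = line.split(" ", 1)
            entries[key] = wav_path
    return entries
```

Each key has the form `<utt>_<tts_index>`, matching the `tts_key` built in the inference loop.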