VITA-QinYu is the first end-to-end spoken language model that supports role-playing and singing, built on a hybrid text–speech modeling framework and a large-scale data synthesis pipeline, while achieving state-of-the-art performance in natural conversational speech.
- Expressive Speech Generation. VITA-QinYu supports singing and role-playing capabilities within a unified end-to-end model.
- Hybrid Speech–Text Modeling. VITA-QinYu adopts an interleaved speech–text modeling paradigm with parallel multi-codebook audio tokens to enable richer paralinguistic representation.
- Large-Scale Synthetic Data. VITA-QinYu employs a comprehensive pipeline to generate large-scale, high-quality expressive speech data for training.
- Strong Performance. VITA-QinYu achieves state-of-the-art conversational accuracy while improving expressive speech generation.
- Open-Source Deployment. VITA-QinYu provides open-source models, training code, and a streaming full-duplex web demonstration.
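The interleaved speech–text paradigm with parallel multi-codebook audio tokens can be sketched as a toy token layout. This is purely illustrative and not the actual VITA-QinYu implementation; the token IDs, codebook count, and chunk sizes below are all assumptions:

```python
# Toy sketch of an interleaved speech-text sequence with parallel
# multi-codebook audio tokens. All constants here are hypothetical.

NUM_CODEBOOKS = 4   # assumed number of parallel audio codebooks
TEXT_CHUNK = 3      # assumed text tokens emitted per interleave step
AUDIO_CHUNK = 2     # assumed audio frames emitted per interleave step

def interleave(text_tokens, audio_frames):
    """Merge text tokens and audio frames into one interleaved sequence.

    Each audio frame is a tuple of NUM_CODEBOOKS codebook indices that the
    model would predict in parallel at a single sequence position.
    """
    seq, t, a = [], 0, 0
    while t < len(text_tokens) or a < len(audio_frames):
        chunk_t = text_tokens[t:t + TEXT_CHUNK]
        seq.extend(("text", tok) for tok in chunk_t)
        t += len(chunk_t)
        chunk_a = audio_frames[a:a + AUDIO_CHUNK]
        seq.extend(("audio", frame) for frame in chunk_a)
        a += len(chunk_a)
    return seq

text = [101, 102, 103, 104]
audio = [(5, 9, 2, 7), (1, 3, 8, 0), (4, 4, 4, 4)]
mixed = interleave(text, audio)
```

Carrying all codebooks of a frame at one position is what lets a single decoding step emit a full multi-codebook audio frame, which is the key to the richer paralinguistic representation mentioned above.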
2026.04.x 🌟 We release VITA-QinYu with model weights, inference and training code, and a web demo.
| Model | LLM Size | Huggingface Weights |
|---|---|---|
| VITA-QinYu-8B | 8B | https://huggingface.co/VITA-MLLM/VITA-QinYu-8B |
| VITA-QinYu-4B | 4B | https://huggingface.co/VITA-MLLM/VITA-QinYu-4B |
docker pull mikexu/vita-qinyu:base
git clone https://github.com/VITA-MLLM/VITA-QinYu.git
cd VITA-QinYu
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
mkdir /vita-qinyu-models
# VITA-QinYu
hf download VITA-MLLM/VITA-QinYu-Models --local-dir /vita-qinyu-models/VITA-QinYu-Models
hf download VITA-MLLM/VITA-QinYu-4B --local-dir /vita-qinyu-models/VITA-QinYu-4B
hf download VITA-MLLM/VITA-QinYu-8B --local-dir /vita-qinyu-models/VITA-QinYu-8B
# FunAudioLLM/SenseVoiceSmall
hf download FunAudioLLM/SenseVoiceSmall --local-dir /vita-qinyu-models/FunAudioLLM/SenseVoiceSmall
# openai/whisper-large-v3
hf download openai/whisper-large-v3 --local-dir /vita-qinyu-models/openai/whisper-large-v3
# TEN-framework/TEN_Turn_Detection
hf download TEN-framework/TEN_Turn_Detection --local-dir /vita-qinyu-models/TEN-framework/TEN_Turn_Detection

We provide a simple inference script that covers speech-to-speech, ASR, and TTS examples.
CUDA_VISIBLE_DEVICES=0 python tools/inference_sts.py --model /vita-qinyu-models/VITA-QinYu-8B --output_dir /output
CUDA_VISIBLE_DEVICES=0 python tools/inference_sts.py --model /vita-qinyu-models/VITA-QinYu-4B --output_dir /output
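If you launch the inference script from your own code, a small helper can assemble the command line shown above. This helper is hypothetical (not part of the repo); it only reproduces the documented `tools/inference_sts.py` invocation:

```python
# Hypothetical helper that builds the tools/inference_sts.py command line
# for a chosen model size, mirroring the commands documented above.
import shlex

MODELS = {
    "4B": "/vita-qinyu-models/VITA-QinYu-4B",
    "8B": "/vita-qinyu-models/VITA-QinYu-8B",
}

def build_inference_cmd(size, output_dir="/output", gpu=0):
    """Return (env, command string) for a speech-to-speech inference run."""
    cmd = [
        "python", "tools/inference_sts.py",
        "--model", MODELS[size],
        "--output_dir", output_dir,
    ]
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}  # pin the run to one GPU
    return env, shlex.join(cmd)

env, cmd = build_inference_cmd("8B")
```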
# Natural
CUDA_VISIBLE_DEVICES=0 python web_demo_stream.py --port 8080 --model /vita-qinyu-models/VITA-QinYu-4B
# RolePlay (the Chinese role description below reads: "the character is a young girl, the daughter of a distinguished family; lively, quick-witted, and fond of acting cute; innocent and spirited in temperament, with a sweet voice and a fast speaking pace")
CUDA_VISIBLE_DEVICES=0 python web_demo_stream.py --port 8080 --model /vita-qinyu-models/VITA-QinYu-4B --mode roleplay --role_description "该角色是一个幼儿女性,身份是世家千金,性格活泼机敏、爱撒娇,气质天真灵动,音色甜润,语速较快"
- `--model` path to the VITA-QinYu model
- `--mode` natural conversation when `mode=default`; role-playing when `mode=roleplay`
- `--role_description` describes the role, used when `mode=roleplay`
- `--port` port of the web demo (default: 8080)
Then visit http://localhost:8080 in your browser.
You can fine-tune your own model.
hf download VITA-MLLM/VITA-QinYu-ToySample --local-dir /vita-qinyu-models/ToySample
bash scripts/deepspeed/vita_qinyu_utu/finetune.sh /vita-qinyu-models/ToySample/toy_sample.yaml /vita-qinyu-models/ToySample /vita-qinyu-models
bash scripts/deepspeed/vita_qinyu_qwen3/finetune.sh /vita-qinyu-models/ToySample/toy_sample.yaml /vita-qinyu-models/ToySample /vita-qinyu-models

Discuss on GitHub Issues.
Scan the QR code to join our official QQ chat group.
VITA-QinYu is trained on large-scale open-source corpora, and its outputs are inherently random. Content generated by VITA-QinYu does not represent the views of the model developers. We are not responsible for any problems arising from the use, misuse, or dissemination of VITA-QinYu, including but not limited to public-opinion risks and data-security issues.
Coming soon...

