
Commit 1850e2a

Merge pull request #1489 from yuekaizhang/triton
[runtime] Support Cosyvoice2 Nvidia TensorRT-LLM Inference Solution
2 parents 11515d0 + 07cbc51, commit 1850e2a

File tree

18 files changed: +3464 -0 lines changed

runtime/triton_trtllm/Dockerfile.server

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
FROM nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3
# cmake is required to build torchaudio from source
RUN apt-get update && apt-get install -y cmake
# Build torchaudio from source at a pinned commit, with the CUDA toolchain on PATH
RUN git clone https://github.com/pytorch/audio.git && cd audio && git checkout c670ad8 && PATH=/usr/local/cuda/bin:$PATH python3 setup.py develop
# Install the remaining Python dependencies
COPY ./requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt
WORKDIR /workspace

runtime/triton_trtllm/README.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
## Best Practices for Serving CosyVoice with NVIDIA Triton Inference Server

### Quick Start
Launch the service directly with Docker Compose:
```sh
docker compose up
```
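
Assuming the compose service publishes Triton's default HTTP port (8000) on the host, you can check that the server came up before sending any requests:
```sh
# Triton implements the KServe v2 health API; an HTTP 200 here means the server is ready
curl -sf http://localhost:8000/v2/health/ready && echo "Triton is ready"
```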

### Build the Docker Image
Build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
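
Once inside the container, `nvidia-smi` should list the GPUs that `--gpus all` passed through:
```sh
# Verify GPU visibility from within the container
nvidia-smi
```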

### Understanding `run.sh`
The `run.sh` script orchestrates the entire workflow through numbered stages.

Run a subset of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>` – stage to start from (0-5).
- `<stop_stage>` – stage to stop after (0-5).
- `[service_type]` – optional; `streaming` or `offline` (see the benchmark section below).

Stages:
- **Stage 0** – Download the CosyVoice2-0.5B model from HuggingFace.
- **Stage 1** – Convert the HuggingFace checkpoint to TensorRT-LLM format and build TensorRT engines.
- **Stage 2** – Create the Triton model repository and configure the model files (adjusted depending on whether `Decoupled=True/False` will be used later; see the example below).
- **Stage 3** – Launch the Triton Inference Server.
- **Stage 4** – Run the single-utterance HTTP client.
- **Stage 5** – Run the gRPC benchmark client.

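If you later switch between the decoupled and non-decoupled configurations, stages 2 and 3 can be rerun on their own. A sketch, assuming the optional third argument selects the service type the same way it does for the benchmark stage:
```sh
# Hypothetical invocation: rebuild the model repository for Decoupled=True, then relaunch Triton
bash run.sh 2 3 streaming
```
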
### Export Models to TensorRT-LLM and Launch the Server
Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# Runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
*Note: Stage 2 prepares the model repository differently depending on whether you intend to run with `Decoupled=False` or `Decoupled=True`. Rerun stage 2 if you switch the service type.*

### Single-Utterance HTTP Client
Send a single HTTP inference request:
```sh
bash run.sh 4 4
```
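
Under the hood this goes through Triton's KServe v2 HTTP API. For orientation, a raw request has the shape below; the model name and input tensor name are placeholders, not the ones this repository actually registers (check the generated model repository's `config.pbtxt` files for the real ones):
```sh
# Hypothetical payload shape for a KServe v2 inference request
curl -X POST http://localhost:8000/v2/models/<model_name>/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "TEXT", "shape": [1, 1], "datatype": "BYTES", "data": ["text to synthesize"]}]}'
```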

### Benchmark with a Dataset
Benchmark the running Triton server. Pass either `streaming` or `offline` as the third argument.
```sh
bash run.sh 5 5 offline   # or: bash run.sh 5 5 streaming

# You can also customise parameters such as num_tasks and the dataset split directly:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```
> [!TIP]
> Only offline CosyVoice TTS is currently supported. Setting the client to `streaming` simply enables NVIDIA Triton’s decoupled mode so that responses are returned as soon as they are ready.

### Benchmark Results
Decoding on a single L20 GPU with 26 prompt_audio/target_text [pairs](https://huggingface.co/datasets/yuekai/seed_tts) (≈221 s of audio). RTF is the real-time factor, i.e. wall-clock decoding time divided by the duration of the audio produced, so lower is better: at an RTF of 0.0891, synthesizing the ≈221 s test set takes about 19.7 s.

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|------|------|-------------|------------------|------------------|-----|
| Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 1 | 758.04 | 615.79 | 0.0891 |
| Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 2 | 1025.93 | 901.68 | 0.0657 |
| Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 4 | 1914.13 | 1783.58 | 0.0610 |
| Decoupled=True | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 1 | 659.87 | 655.63 | 0.0891 |
| Decoupled=True | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 2 | 1103.16 | 992.96 | 0.0693 |
| Decoupled=True | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 4 | 1790.91 | 1668.63 | 0.0604 |

### OpenAI-Compatible Server
To launch an OpenAI-compatible service, run:
```sh
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt
# After the Triton service is up, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000
# Test with curl
bash test/test_cosyvoice.sh
```
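
Once the bridge is running, you can also call it by hand. A sketch of an OpenAI-style `/v1/audio/speech` request; the `model` and `voice` values here are hypothetical and depend on how the bridge maps voices to the files in `--ref_audios_dir`:
```sh
# Hypothetical request body; adjust model/voice to match the bridge's configuration
curl -s http://localhost:10086/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "cosyvoice", "input": "Hello, this is a test.", "voice": "default"}' \
  -o output.wav
```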

### Acknowledgements
This section originates from the NVIDIA CISI project. We also provide other multimodal resources; see [mair-hub](https://github.com/nvidia-china-sae/mair-hub) for details.
