
Commit 4de4c53 (release v1.1.0)
Merge of 2 parents: 30bc57b + c603bdd

File tree: 192 files changed (+84801, -689 lines)


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -5,4 +5,5 @@ dist
 *.egg-info
 .idea
 .vscode
-test/outs
+test/outs
+pretrained_models

.gitmodules

Lines changed: 0 additions & 3 deletions
@@ -1,6 +1,3 @@
-[submodule "cosyvoice"]
-	path = cosyvoice
-	url = https://github.com/FunAudioLLM/CosyVoice.git
 [submodule "third_party/Matcha-TTS"]
 	path = third_party/Matcha-TTS
 	url = https://github.com/shivammehta25/Matcha-TTS.git

.pre-commit-config.yaml

Lines changed: 3 additions & 1 deletion
@@ -6,9 +6,11 @@ repos:
         language_version: python3
         args: [--line-length=120]
         additional_dependencies: ['click==8.0.4']
+        exclude: '^cosyvoice/'
   - repo: https://github.com/pycqa/flake8
     rev: 3.9.0
     hooks:
       - id: flake8
         additional_dependencies: [flake8-typing-imports==1.9.0]
-        args: ['--config=.flake8', '--max-line-length=120', '--ignore=C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606']
+        args: ['--config=.flake8', '--max-line-length=120', '--ignore=C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606']
+        exclude: '^cosyvoice/'

Dockerfile

Lines changed: 35 additions & 7 deletions
@@ -1,18 +1,46 @@
-FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
-ARG MAMBA_VERSION=23.1.0-1
+ARG CUDA_VERSION=12.8.0
+FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu22.04
+ARG PYTHON_VERSION=3.10
+ARG MAMBA_VERSION=24.7.1-0
 ARG TARGETPLATFORM
+ENV PATH=/opt/conda/bin:$PATH \
+    CONDA_PREFIX=/opt/conda
 
-WORKDIR /opt
+WORKDIR /root
 
 RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
     ca-certificates \
     libssl-dev \
     curl \
     g++ \
     make \
-    git && \
+    git \
+    ffmpeg \
+    unzip && \
     rm -rf /var/lib/apt/lists/*
 
-RUN git clone --recursive https://github.com/ModelTC/light-tts.git
-RUN cd light-tts && pip3 install -r requirements.txt
-WORKDIR /opt/light-tts
+RUN case ${TARGETPLATFORM} in \
+        "linux/arm64") MAMBA_ARCH=aarch64 ;; \
+        *) MAMBA_ARCH=x86_64 ;; \
+    esac && \
+    curl -fsSL -o ~/mambaforge.sh "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh" && \
+    bash ~/mambaforge.sh -b -p /opt/conda && \
+    rm ~/mambaforge.sh
+
+RUN case ${TARGETPLATFORM} in \
+        "linux/arm64") exit 1 ;; \
+        *) /opt/conda/bin/conda update -y conda && \
+           /opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
+    esac && \
+    /opt/conda/bin/conda clean -ya
+
+COPY ./requirements.txt /lighttts/requirements.txt
+RUN pip install -U pip
+RUN pip install -r /lighttts/requirements.txt --no-cache-dir
+
+COPY . /lighttts
+WORKDIR /lighttts
+RUN cd pretrained_models/CosyVoice-ttsfrd/ && \
+    unzip resource.zip -d . && \
+    pip install ttsfrd_dependency-0.1-py3-none-any.whl && \
+    pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl

NOTICE

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+Light TTS
+Copyright (c) 2024 Light TTS Contributors
+
+This project contains code from the following third-party projects:
+
+================================================================================
+
+CosyVoice (cosyvoice/)
+https://github.com/FunAudioLLM/CosyVoice
+Copyright (c) Alibaba, Inc. and its affiliates.
+Licensed under the Apache License, Version 2.0
+Original commit: bc34459
+
+The cosyvoice/ directory contains a modified copy of the CosyVoice project.
+We have integrated and adapted this code for use in Light TTS.
+The original LICENSE file is preserved in cosyvoice/LICENSE.
+
+All modifications to the original CosyVoice code are also licensed under
+the Apache License, Version 2.0.
+
+================================================================================
+

README.md

Lines changed: 123 additions & 81 deletions
@@ -1,8 +1,13 @@
 ![Light TTS Banner](asset/light-tts.jpg)
 
-# light-tts
+# LightTTS
 
-**light-tts** is a lightweight and high-performance text-to-speech (TTS) inference and service framework based on Python. It is built around the [cosyvoice](https://github.com/FunAudioLLM/CosyVoice) model and based on the [lightllm](https://github.com/ModelTC/lightllm), with optimizations to support fast, scalable, and service-ready TTS deployment.
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Docker](https://img.shields.io/badge/docker-ready-brightgreen.svg)](https://hub.docker.com/r/lighttts/light-tts)
+
+**⚡ Lightning-Fast Text-to-Speech Inference & Service Framework**
+
+**LightTTS** is a lightweight and high-performance text-to-speech (TTS) inference and service framework based on Python. It supports **CosyVoice2** and **CosyVoice3** models, built upon the [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) architecture and [LightLLM](https://github.com/ModelTC/lightllm) framework, with optimizations to support fast, scalable, and service-ready TTS deployment.
 
 ---
 
@@ -19,34 +24,33 @@
 
 ### Installation
 
-- Installing with Docker
+- (Option 1 Recommended) Run with Docker
 ```bash
-# The easiest way to install Lightllm is by using the official image. You can directly pull and run the official image
-docker pull lighttts/light-tts:v1.0
+# The easiest way to install LightTTS is by using the official image. You can directly pull and run the official image
+docker pull lighttts/light-tts:latest
 
 # Or you can manually build the image
-docker build -t light-tts:v1.0 .
+docker build -t light-tts:latest .
 
 # Run the image
-docker run -it --gpus all -p 8080:8080 --shm-size 4g -v your_local_path:/data/ light-tts:v1.0 /bin/bash
+docker run -it --gpus all -p 8080:8080 --shm-size 4g -v your_local_path:/data/ light-tts:latest /bin/bash
 
-- Installing from Source
+- (Option 2) Install from Source
 
 ```bash
 # Clone the repo
-git clone --recursive https://github.com/ModelTC/light-tts.git
-cd light-tts
+git clone --recursive https://github.com/ModelTC/LightTTS.git
+cd LightTTS
 # If you failed to clone the submodule due to network failures, please run the following command until success
-# cd light-tts
+# cd LightTTS
 # git submodule update --init --recursive
 
 # (Recommended) Create a new conda environment
-conda create -n light-tts python=3.10 -y
+conda create -n light-tts python=3.10
 conda activate light-tts
 
-# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
-conda install -y -c conda-forge pynini==2.1.5
-pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+# Install dependencies (We use the latest torch==2.9.1, but other versions are also compatible)
+pip install -r requirements.txt
 
 # If you encounter sox compatibility issues
 # ubuntu
@@ -55,23 +59,25 @@
 sudo yum install sox sox-devel
 ```
 
-### Model download
+### Model Download
 
-We now only support CosyVoice2 model.
+We now support CosyVoice2 and CosyVoice3 models.
 
 ```python
-# SDK模型下载
+# ModelScope SDK model download (SDK模型下载)
 from modelscope import snapshot_download
+snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
 snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
 snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
-```
-```python
-# git模型下载,请确保已安装git lfs
-mkdir -p pretrained_models
-git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
-git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
+
+# For overseas users, HuggingFace SDK model download
+from huggingface_hub import snapshot_download
+snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
+snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
+snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
 ```
 
+(We have already installed the ttsfrd package in the docker image. If you are using docker image, you can skip this installation)
 For better text normalization performance, you can optionally install the ttsfrd package and unzip its resources. This step is not required — if skipped, the system will fall back to WeTextProcessing by default.
 
 ```bash
@@ -80,73 +86,109 @@ unzip resource.zip -d .
 pip install ttsfrd_dependency-0.1-py3-none-any.whl
 pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
 ```
-📝 This setup instruction is based on the original guide from the [CosyVoice repository](https://github.com/FunAudioLLM/CosyVoice).
 
 ### Start the Model Service
 
+**Note:** It is recommended to enable the `load_trt` parameter for acceleration. The default flow precision is fp16 for CosyVoice2 and fp32 for CosyVoice3.
+
+**For CosyVoice2:**
+
 ```bash
-# It is recommended to enable the load_trt parameter for acceleration.
-# The default is fp16 mode.
-python -m light_tts.server.api_server --model_dir ./pretrained_models/CosyVoice2-0.5B-latest --load_trt True --max_total_token_num 65536 --max_req_total_len 32768
+python -m light_tts.server.api_server --model_dir ./pretrained_models/CosyVoice2-0.5B
 ```
 
-- max_total_token_num: llm arg, the total token nums the gpu and model can support, equals = `max_batch * (input_len + output_len)`
-- max_req_total_len: llm arg, the max value for `req_input_len + req_output_len`, 32768 is set here because the `max_position_embeddings` of the llm part is 32768
-- There are many other parameters that can be viewed in `light_tts/server/api_cli.py`
+**For CosyVoice3:**
+
+```bash
+python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
+```
+
+**With custom data type** (float32, bfloat16, or float16; default: float16):
+
+```bash
+# Use float32 for better accuracy or float16 for faster speed
+python -m light_tts.server.api_server --model_dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512 --data_type float32
+```
+
+**Available Parameters:**
+
+The default values are usually the fastest and generally do not need to be adjusted. If you need to customize them, please refer to the following parameter descriptions:
+- `load_trt`: Whether to load the flow_decoder in TensorRT mode (default: True).
+- `data_type`: The data type for LLM inference (default: float16)
+- `load_jit`: Whether to load the flow_encoder in JIT mode (default: False).
+- `max_total_token_num`: LLM arg, total token count the GPU and model can support = `max_batch * (input_len + output_len)` (default: 64 * 1024)
+- `max_req_total_len`: LLM arg, maximum value for `req_input_len + req_output_len` (default: 32768, matches `max_position_embeddings`)
+- `graph_max_len_in_batch`: Maximum sequence length for CUDA graph capture in decoding stage (default: 32768)
+- `graph_max_batch_size`: Maximum batch size for CUDA graph capture in decoding stage (default: 16)
+
+For more parameters, see `light_tts/server/api_cli.py`
 
-Wait for a while, this service will be started. The default startup is localhost:8080.
+Wait for the service to initialize. The default address is `http://localhost:8080`.
 
 ### Request Examples
 
-When your service is started, you can call the service through the http API. We support three modes: non-streaming, streaming and bi-streaming.
-
-- non-streaming and streaming. You can also use `test/test_zero_shot.py`, which can print information such as rtf and ttft.
-
-
-```python
-import requests
-import time
-import soundfile as sf
-import numpy as np
-import os
-import threading
-import json
-
-url = "http://localhost:8080/inference_zero_shot"
-path = "cosyvoice/asset/zero_shot_prompt.wav" # wav file path
-prompt_text = "希望你以后能够做的比我还好呦。"
-tts_text = "收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。"
-stream = True # Whether to use streaming inference
-files = {
-    "prompt_wav": ("sample.wav", open(path, "rb"), "audio/wav")
-}
-data = {
-    "tts_text": tts_text,
-    "prompt_text": prompt_text,
-    "stream": stream
-}
-response = requests.post(url, files=files, data=data, stream=True)
-sample_rate = 24000
-
-audio_data = bytearray()
-try:
-    for chunk in response.iter_content(chunk_size=4096):
-        if chunk:
-            audio_data.extend(chunk)
-except Exception as e:
-    print(f"Exception: {e}")
-    print(f"Error: {response.status_code}, {response.text}")
-    return
-audio_np = np.frombuffer(audio_data, dtype=np.int16)
-if response.status_code == 200:
-    output_wav = f"./outs/output{'_stream' if stream else ''}_{index}.wav"
-    sf.write(output_wav, audio_np, samplerate=sample_rate, subtype="PCM_16")
-    print(f"saved as {output_wav}")
-else:
-    print("Error:", response.status_code, response.text)
-```
+Once the service is running, you can interact with it through the HTTP API. We support three modes: **non-streaming**, **streaming**, and **bi-streaming**.
+
+- **Non-streaming and Streaming**: Use `test/test_zero_shot.py` for examples, which prints metrics such as RTF (Real-Time Factor) and TTFT (Time To First Token)
+- **Bi-streaming**: Uses WebSocket interface. See usage examples in `test/test_bistream.py`
+
+## 📊 Performance Benchmarks
+
+We have conducted performance benchmarks on different GPU configurations to demonstrate the throughput and latency characteristics of LightTTS in streaming mode.
+
+Model: `Fun-CosyVoice3-0.5B-2512` datatype: `float16`
+
+### NVIDIA GeForce RTX 4090D
+non-stream: `test/test_zs.py`
+
+|num_workers|cost time 50%|cost time 90%|cost time 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
+|------|------|------|------|------|------|------|------|------|------|
+|1|0.61|1.09|1.51|0.13|0.16|0.22|0.13|33.95|1.47|
+|2|0.8|1.24|1.71|0.15|0.22|0.25|0.16|21.46|2.33|
+|4|1.02|1.88|2.27|0.22|0.29|0.38|0.23|15.31|3.27|
+|8|1.76|2.36|3.48|0.33|0.49|0.62|0.36|12.18|4.1|
+
+stream: `test/test_zs_stream.py`
 
-- bi-streaming. We use the websocket interface implementation, and we can find usage examples in `test/test_bistream.py`.
+|num_workers|cost time 50%|cost time 90%|cost time 99%|ttft 50%|ttft 90%|ttft 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
+|------|------|------|------|------|------|------|------|------|------|------|------|------|
+|1|1.01|2.15|2.82|0.33|0.34|0.9|0.21|0.25|0.34|0.22|60.13|0.83|
+|2|1.83|3.56|5.16|0.93|1.53|2.3|0.34|0.63|0.81|0.4|52.47|0.95|
+|4|3.43|5.76|7.31|2.62|4.37|5.8|0.7|1.28|2.16|0.81|48.74|1.03|
+|8|7.27|10.01|10.45|6.4|8.55|9.03|1.28|2.67|3.66|1.57|47.37|1.06|
+
+### NVIDIA GeForce RTX 5090
+non-stream
+
+|num_workers|cost time 50%|cost time 90%|cost time 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
+|------|------|------|------|------|------|------|------|------|------|
+|1|0.51|0.81|1.61|0.11|0.13|0.23|0.11|28.9|1.73|
+|2|0.64|1.1|1.48|0.13|0.16|0.26|0.13|17.54|2.85|
+|4|0.87|1.28|1.68|0.17|0.23|0.36|0.18|11.45|4.37|
+|8|1.32|1.86|2.14|0.25|0.4|0.6|0.29|8.97|5.57|
+
+stream
+
+|num_workers|cost time 50%|cost time 90%|cost time 99%|ttft 50%|ttft 90%|ttft 99%|rtf 50%|rtf 90%|rtf 99%|avg rtf|total_cost_time|qps|
+|------|------|------|------|------|------|------|------|------|------|------|------|------|
+|1|0.76|1.41|2.27|0.28|0.3|0.31|0.16|0.18|0.22|0.16|44.06|1.13|
+|2|1.45|2.34|3.46|0.74|1.28|1.75|0.27|0.45|0.7|0.3|38.82|1.29|
+|4|2.9|4.04|4.7|2.16|3.03|3.4|0.5|1.04|1.51|0.61|37.75|1.32|
+|8|5.78|7.74|8.49|5.01|6.73|7.35|1.03|2.09|2.85|1.22|37.67|1.33|
+
+**Metrics Explanation:**
+- **num_workers**: Number of concurrent workers
+- **cost time**: Total request processing time in seconds (50th/90th/99th percentile)
+- **ttft**: Time to First Token in seconds (50th/90th/99th percentile)
+- **rtf**: Real-Time Factor (50th/90th/99th percentile)
+- **avg rtf**: Average Real-Time Factor
+- **total_cost_time**: Total benchmark duration in seconds
+- **qps**: Queries Per Second
 
 ## License
-This repository is released under the [Apache-2.0](LICENSE) license.
+
+This repository is released under the [Apache-2.0](LICENSE) license.
+
+### Third-Party Code Attribution
+
+This project includes code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) (Copyright Alibaba, Inc. and its affiliates), which is also licensed under Apache-2.0. The CosyVoice code is located in the `cosyvoice/` directory and has been integrated and modified as part of LightTTS. See the [NOTICE](NOTICE) file for complete attribution details.
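
Reader's note: the new README points to `test/test_zero_shot.py` instead of an inline client snippet. For reference, here is a minimal zero-shot request sketch in the style of the example this commit removes from the README. It assumes a server on `http://localhost:8080`, the `/inference_zero_shot` endpoint with `prompt_wav`, `prompt_text`, `tts_text`, and `stream` form fields, and a raw 16-bit PCM response at 24 kHz, exactly as in the old example; the maintained reference client remains `test/test_zero_shot.py`.

```python
# Minimal zero-shot TTS client sketch, based on the example removed from the
# README in this commit (see test/test_zero_shot.py for the maintained version).
# Assumption: the server streams raw 16-bit PCM at 24 kHz on localhost:8080.
import numpy as np
import requests
import soundfile as sf

url = "http://localhost:8080/inference_zero_shot"
prompt_wav_path = "cosyvoice/asset/zero_shot_prompt.wav"  # reference speaker audio
prompt_text = "希望你以后能够做的比我还好呦。"  # transcript of the prompt audio
tts_text = "收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。"
stream = True  # whether to request streaming inference

with open(prompt_wav_path, "rb") as f:
    files = {"prompt_wav": ("prompt.wav", f, "audio/wav")}
    data = {"tts_text": tts_text, "prompt_text": prompt_text, "stream": stream}
    response = requests.post(url, files=files, data=data, stream=True)

if response.status_code != 200:
    raise RuntimeError(f"request failed: {response.status_code} {response.text}")

# Collect the (possibly chunked) PCM bytes and write them out as a WAV file.
audio_data = bytearray()
for chunk in response.iter_content(chunk_size=4096):
    if chunk:
        audio_data.extend(chunk)

audio_np = np.frombuffer(bytes(audio_data), dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000, subtype="PCM_16")
print(f"saved {len(audio_np) / 24000:.2f}s of audio to output.wav")
```

The benchmark tables added to the README report RTF, TTFT, and QPS. As a rough guide to how such figures are conventionally derived (a sketch only; the exact bookkeeping in `test/test_zs.py` and `test/test_zs_stream.py` may differ):

```python
# Conventional definitions behind the benchmark columns (illustrative sketch).
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: values below 1.0 mean synthesis is faster than playback."""
    return synthesis_seconds / audio_seconds

def qps(num_requests: int, total_seconds: float) -> float:
    """Queries Per Second over the whole benchmark run."""
    return num_requests / total_seconds

# Example: 0.61 s of compute for roughly 4.7 s of audio gives rtf of about 0.13,
# in the same ballpark as the RTX 4090D non-stream row for num_workers=1.
print(rtf(0.61, 4.7), qps(50, 33.95))
```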

cosyvoice

Lines changed: 0 additions & 1 deletion
This file was deleted (the cosyvoice submodule gitlink was removed; per the NOTICE above, the code is now vendored in the cosyvoice/ directory).
