Commit 10c3246

feat: support Dia TTS
1 parent 4eff0a9 commit 10c3246

10 files changed: +857 -60 lines changed

.gitmodules

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
 [submodule "vox_box/third_party/CosyVoice"]
 path = vox_box/third_party/CosyVoice
 url = https://github.com/FunAudioLLM/CosyVoice/
+[submodule "vox_box/third_party/dia"]
+path = vox_box/third_party/dia
+url = https://github.com/nari-labs/dia.git
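
As a quick way to check which third-party sources the repository registers after this change, the post-commit `.gitmodules` content can be read with Python's standard `configparser`. This is only a sketch: the `GITMODULES` string is copied from the diff above, and `list_submodules` is a hypothetical helper name, not part of vox-box.

```python
import configparser

# Post-commit .gitmodules content, copied from the diff above.
GITMODULES = """\
[submodule "vox_box/third_party/CosyVoice"]
path = vox_box/third_party/CosyVoice
url = https://github.com/FunAudioLLM/CosyVoice/
[submodule "vox_box/third_party/dia"]
path = vox_box/third_party/dia
url = https://github.com/nari-labs/dia.git
"""


def list_submodules(text: str) -> dict[str, str]:
    """Hypothetical helper: map each submodule path to its URL."""
    # Interpolation is disabled so a '%' in a URL would not be misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        parser[section]["path"]: parser[section]["url"]
        for section in parser.sections()
    }


submodules = list_submodules(GITMODULES)
for path, url in submodules.items():
    print(f"{path} -> {url}")
```

In a fresh clone, registered submodules are usually fetched with `git submodule update --init --recursive`; whether the vox-box build needs that step for Dia is not stated in this commit.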

README.md

Lines changed: 9 additions & 4 deletions
@@ -1,13 +1,13 @@
 # Vox Box
 
-A text-to-speech and speech-to-text server compatible with the OpenAI API, powered by backend support from Whisper, FunASR, Bark, and CosyVoice.
+A text-to-speech and speech-to-text server compatible with the OpenAI API, powered by backend support from Whisper, FunASR, Bark, Dia and CosyVoice.
 
 ## Requirements
 
 - Python 3.10 or greater
 - Support Nvidia GPU, requires the following NVIDIA libraries to be installed:
 - [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
-- [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)
+- [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)
 
 ## Installation
 
@@ -34,6 +34,7 @@ vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\U
 ```
 
 ### Options
+
 - -d, --debug: Enable debug mode.
 - --host: Host to bind the server to. Default is 0.0.0.0.
 - --port: Port to bind the server to. Default is 80.
@@ -71,16 +72,18 @@ vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\U
 | CosyVoice-300M-SFT | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-SFT) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
 | CosyVoice-300M | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
 | CosyVoice-300M-25Hz | text-to-speech | [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-25Hz) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
+| Dia-1.6B | text-to-speech | [Hugging Face](https://huggingface.co/nari-labs/Dia-1.6B), [ModelScope](https://modelscope.cn/models/nari-labs/Dia-1.6B) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
 
 ## Supported APIs
 
-### Create speech
+### Create speech
 
 **Endpoint**: `POST /v1/audio/speech`
 
 Generates audio from the input text. Compatible with the [OpenAI audio/speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech).
 
 **Example Request**:
+
 ```bash
 curl http://localhost/v1/audio/speech \
 -H "Authorization: Bearer $OPENAI_API_KEY" \
@@ -96,13 +99,14 @@ curl http://localhost/v1/audio/speech \
 **Response**:
 The audio file content.
 
-### Create transcription
+### Create transcription
 
 **Endpoint**: `POST /v1/audio/transcriptions`
 
 Transcribes audio into the input language. Compatible with the [OpenAI audio/transcription API](https://platform.openai.com/docs/api-reference/audio/createTranscription).
 
 **Example Request**:
+
 ```bash
 curl https://localhost/v1/audio/transcriptions \
 -H "Authorization: Bearer $OPENAI_API_KEY" \
@@ -112,6 +116,7 @@ curl https://localhost/v1/audio/transcriptions \
 ```
 
 **Response**:
+
 ```json
 {
 "text": "Hello world."
