 # Vox Box

-A text-to-speech and speech-to-text server compatible with the OpenAI API, powered by backend support from Whisper, FunASR, Bark, and CosyVoice.
+A text-to-speech and speech-to-text server compatible with the OpenAI API, powered by backend support from Whisper, FunASR, Bark, Dia and CosyVoice.

 ## Requirements

 - Python 3.10 or greater
 - Support Nvidia GPU, requires the following NVIDIA libraries to be installed:
   - [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
-  - [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)
+  - [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)

 ## Installation

@@ -34,6 +34,7 @@ vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\U
 ```

 ### Options
+
 - -d, --debug: Enable debug mode.
 - --host: Host to bind the server to. Default is 0.0.0.0.
 - --port: Port to bind the server to. Default is 80.
@@ -71,16 +72,18 @@ vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\U
 | CosyVoice-300M-SFT | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M-SFT), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-SFT) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
 | CosyVoice-300M | text-to-speech | [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice-300M), [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
 | CosyVoice-300M-25Hz | text-to-speech | [ModelScope](https://modelscope.cn/models/iic/CosyVoice-300M-25Hz) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |
+| Dia-1.6B | text-to-speech | [Hugging Face](https://huggingface.co/nari-labs/Dia-1.6B), [ModelScope](https://modelscope.cn/models/nari-labs/Dia-1.6B) | Linux(ARM not supported) ✅, Windows(Not supported), macOS ✅ |

 ## Supported APIs

-### Create speech
+### Create speech

 **Endpoint**: `POST /v1/audio/speech`

 Generates audio from the input text. Compatible with the [OpenAI audio/speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech).

 **Example Request**:
+
 ```bash
 curl http://localhost/v1/audio/speech \
   -H "Authorization: Bearer $OPENAI_API_KEY" \
@@ -96,13 +99,14 @@ curl http://localhost/v1/audio/speech \
 **Response**:
 The audio file content.

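For readers who prefer Python to curl, the speech request might be sketched as follows using only the standard library. This is an illustration, not part of the README: the field names `model`, `input`, and `voice` follow the OpenAI audio/speech API that the endpoint is described as compatible with, while the helper names and the `api_key` default are hypothetical. Model and voice values are left to the caller because valid values depend on the backend that is loaded.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model, voice, api_key="unused"):
    """Build (without sending) a POST request for /v1/audio/speech.

    `model` and `voice` are caller-supplied; this sketch does not know
    which values a given Vox Box deployment accepts.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize(base_url, text, model, voice, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(base_url, text, model, voice)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Calling `synthesize("http://localhost", "Hello world.", model, voice, "speech.audio")` against a running server would save the response body, which the README describes as the audio file content. The output file extension depends on the format the server returns.
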
-### Create transcription
+### Create transcription

 **Endpoint**: `POST /v1/audio/transcriptions`

 Transcribes audio into the input language. Compatible with the [OpenAI audio/transcription API](https://platform.openai.com/docs/api-reference/audio/createTranscription).

 **Example Request**:
+
 ```bash
 curl https://localhost/v1/audio/transcriptions \
   -H "Authorization: Bearer $OPENAI_API_KEY" \
@@ -112,6 +116,7 @@ curl https://localhost/v1/audio/transcriptions \
 ```

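The transcription request can also be sketched in Python with the standard library alone. This is an illustration under assumptions, not part of the README: the multipart field names `file` and `model` follow the OpenAI audio/transcription API that the endpoint is described as compatible with, and the helper names and `api_key` default are hypothetical. The model value is left to the caller.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(base_url, filename, audio_bytes, model, api_key="unused"):
    """Build (without sending) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field carrying the caller-supplied model name.
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode("utf-8") + b"\r\n",
        # File field carrying the raw audio bytes.
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes + b"\r\n",
        # Closing boundary ends the multipart body.
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def transcribe(base_url, path, model):
    """Upload the file at `path` and return the "text" field of the JSON reply."""
    with open(path, "rb") as f:
        req = build_transcription_request(base_url, os.path.basename(path), f.read(), model)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

The multipart body is built by hand to keep the sketch free of dependencies. A client library with built-in multipart support would be shorter.
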
 **Response**:
+
 ```json
 {
   "text": "Hello world."