Commit ee5be65

Merge pull request #6 from dino65-dev/master
Enhance Audio Converter
2 parents d1c3feb + 4089444 commit ee5be65

8 files changed, +484 -65 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -14,3 +14,5 @@ env/
 .Python
 
 
+.coverage
+

CHANGELOG.md

Lines changed: 5 additions & 1 deletion
@@ -2,8 +2,12 @@
 
 Notable changes to this project will be documented in this file.
 
-## 2024-01-09
 
+## 2025-01-02
+- Audio Format Support:
+- Added comprehensive audio format conversion support (mp3, wav, opus, flac)
+
+## 2025-01-01
 ### Added
 - Gradio Web Interface:
 - Added simple web UI utility for audio generation from input or txt file

README.md

Lines changed: 43 additions & 27 deletions
@@ -3,8 +3,8 @@
 </p>
 
 # Kokoro TTS API
-[![Tests](https://img.shields.io/badge/tests-81%20passed-darkgreen)]()
-[![Coverage](https://img.shields.io/badge/coverage-76%25-darkgreen)]()
+[![Tests](https://img.shields.io/badge/tests-89%20passed-darkgreen)]()
+[![Coverage](https://img.shields.io/badge/coverage-80%25-darkgreen)]()
 [![Tested at Model Commit](https://img.shields.io/badge/last--tested--model--commit-a67f113-blue)](https://huggingface.co/hexgrad/Kokoro-82M/tree/c3b0d86e2a980e027ef71c28819ea02e351c2667)
 
 Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model
@@ -14,8 +14,7 @@ Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokor
 - automatic chunking/stitching for long texts
 - simple audio generation web ui utility
 
-<details open>
-<summary><b>Quick Start</b></summary>
+## Quick Start
 
 The service can be accessed through either the API endpoints or the Gradio web interface.
 
@@ -48,9 +47,10 @@ The service can be accessed through either the API endpoints or the Gradio web i
 <p align="center">
 <img src="ui\GradioScreenShot.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
 </p>
-</details>
+
+## Features
 <details>
-<summary><b>OpenAI-Compatible Speech Endpoint</b></summary>
+<summary>OpenAI-Compatible Speech Endpoint</summary>
 
 ```python
 # Using OpenAI's Python library
@@ -98,7 +98,10 @@ python examples/test_all_voices.py # Test all available voices
 </details>
 
 <details>
-<summary><b>Voice Combination</b></summary>
+<summary>Voice Combination</summary>
+
+- Averages model weights of any existing voicepacks
+- Saves generated voicepacks for future use
 
 Combine voices and generate audio:
 ```python
@@ -129,7 +132,23 @@ response = requests.post(
 </details>
 
 <details>
-<summary><b>Gradio Web Utility</b></summary>
+<summary>Multiple Output Audio Formats</summary>
+
+- mp3
+- wav
+- opus
+- flac
+- aac
+- pcm
+
+<p align="center">
+<img src="examples/benchmarks/format_comparison.png" width="80%" alt="Audio Format Comparison" style="border: 2px solid #333; padding: 10px;">
+</p>
+
+</details>
+
+<details>
+<summary>Gradio Web Utility</summary>
 
 Access the interactive web UI at http://localhost:7860 after starting the service. Features include:
 - Voice/format/speed selection
@@ -141,9 +160,9 @@ If you only want the API, just comment out everything in the docker-compose.yml
 Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added
 </details>
 
-
+## Processing Details
 <details>
-<summary><b>Performance Benchmarks</b></summary>
+<summary>Performance Benchmarks</summary>
 
 Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:
 - Windows 11 Home w/ WSL2
@@ -163,7 +182,7 @@ Key Performance Metrics:
 - Average Processing Rate: 137.67 tokens/second (cl100k_base)
 </details>
 <details>
-<summary><b>GPU Vs. CPU<b></summary>
+<summary>GPU Vs. CPU</summary>
 
 ```bash
 # GPU: Requires NVIDIA GPU with CUDA 12.1 support
@@ -172,35 +191,29 @@ docker compose up --build
 # CPU: ~10x slower than GPU inference
 docker compose -f docker-compose.cpu.yml up --build
 ```
-</details>
-<details>
-<summary><b>Features</b></summary>
 
-- OpenAI-compatible API endpoints (with optional Gradio Web UI)
-- GPU-accelerated inference (if desired)
-- Multiple audio formats: mp3, wav, opus, flac, (aac & pcm not implemented)
-- Natural Boundary Detection:
-- Automatically splits and stitches at sentence boundaries to reduce artifacts and maintain performacne
-- Voice Combination:
-- Averages model weights of any existing voicepacks
-- Saves generated voicepacks for future use
+*Note: CPU Inference is currently a very basic implementation, and not heavily tested*
 
+</details>
 
+<details>
+<summary>Natural Boundary Detection</summary>
 
-*Note: CPU Inference is currently a very basic implementation, and not heavily tested*
+- Automatically splits and stitches at sentence boundaries
+- Helps to reduce artifacts and allow long form processing as the base model is only currently configured for approximately 30s output
 </details>
 
+## Model and License
+
 <details open>
-<summary><b>Model</b></summary>
+<summary>Model</summary>
 
 This API uses the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) model from HuggingFace.
 
 Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.
 </details>
-
 <details>
-<summary><b>License</b></summary>
-
+<summary>License</summary>
 This project is licensed under the Apache License 2.0 - see below for details:
 
 - The Kokoro model weights are licensed under Apache 2.0 (see [model page](https://huggingface.co/hexgrad/Kokoro-82M))
@@ -209,3 +222,6 @@ This project is licensed under the Apache License 2.0 - see below for details:
 
 The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
 </details>
+
+
+
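
Note on the README change: the new "Natural Boundary Detection" entry says long input is split and stitched at sentence boundaries, but this commit does not touch that code. The following is only a minimal sketch of the idea under stated assumptions, not the project's implementation. `chunk_by_sentence`, `max_chars`, and the regular expression are hypothetical names and choices made for illustration.

```python
import re

# Split after ., ! or ? when followed by whitespace. Deliberately naive:
# abbreviations such as "Dr." will also trigger a split.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk,
    so a chunk never ends in the middle of a sentence.
    """
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated. Cutting only at sentence ends is what keeps the joins away from the middle of a phrase.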

api/src/services/audio.py

Lines changed: 20 additions & 31 deletions
@@ -4,7 +4,6 @@
 
 import numpy as np
 import soundfile as sf
-import scipy.io.wavfile as wavfile
 from loguru import logger
 
 
@@ -20,7 +19,7 @@ def convert_audio(
         Args:
             audio_data: Numpy array of audio samples
             sample_rate: Sample rate of the audio
-            output_format: Target format (wav, mp3, etc.)
+            output_format: Target format (wav, mp3, opus, flac, pcm)
 
         Returns:
             Bytes of the converted audio
@@ -30,46 +29,36 @@ def convert_audio(
         try:
             if output_format == "wav":
                 logger.info("Writing to WAV format...")
-                wavfile.write(buffer, sample_rate, audio_data)
-                return buffer.getvalue()
-
+                # Ensure audio_data is in int16 format for WAV
+                audio_data_wav = (
+                    audio_data / np.abs(audio_data).max() * np.iinfo(np.int16).max
+                ).astype(np.int16)  # Normalize
+                sf.write(buffer, audio_data_wav, sample_rate, format="WAV")
             elif output_format == "mp3":
-                # For MP3, we need to convert to WAV first
                 logger.info("Converting to MP3 format...")
-                wav_buffer = BytesIO()
-                wavfile.write(wav_buffer, sample_rate, audio_data)
-                wav_buffer.seek(0)
-
-                # Convert WAV to MP3 using soundfile
-                buffer = BytesIO()
-                sf.write(buffer, audio_data, sample_rate, format="mp3")
-                return buffer.getvalue()
-
+                # soundfile can write MP3 if ffmpeg or libsox is installed
+                sf.write(buffer, audio_data, sample_rate, format="MP3")
             elif output_format == "opus":
                 logger.info("Converting to Opus format...")
-                sf.write(buffer, audio_data, sample_rate, format="ogg", subtype="opus")
-                return buffer.getvalue()
-
+                sf.write(buffer, audio_data, sample_rate, format="OGG", subtype="OPUS")
             elif output_format == "flac":
                 logger.info("Converting to FLAC format...")
-                sf.write(buffer, audio_data, sample_rate, format="flac")
-                return buffer.getvalue()
-
-            elif output_format == "aac":
-                raise ValueError(
-                    "AAC format is not currently supported. Please use wav, mp3, opus, or flac."
-                )
-
+                sf.write(buffer, audio_data, sample_rate, format="FLAC")
             elif output_format == "pcm":
-                raise ValueError(
-                    "PCM format is not currently supported. Please use wav, mp3, opus, or flac."
-                )
-
+                logger.info("Extracting PCM data...")
+                # Ensure audio_data is in int16 format for PCM
+                audio_data_pcm = (
+                    audio_data / np.abs(audio_data).max() * np.iinfo(np.int16).max
+                ).astype(np.int16)  # Normalize
+                buffer.write(audio_data_pcm.tobytes())
             else:
                 raise ValueError(
-                    f"Format {output_format} not supported. Supported formats are: wav, mp3, opus, flac."
+                    f"Format {output_format} not supported. Supported formats are: wav, mp3, opus, flac, pcm."
                 )
 
+            buffer.seek(0)
+            return buffer.getvalue()
+
         except Exception as e:
             logger.error(f"Error converting audio to {output_format}: {str(e)}")
             raise ValueError(f"Failed to convert audio to {output_format}: {str(e)}")
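
Note on the normalization above: the WAV and PCM branches both divide by `np.abs(audio_data).max()`. For an all-zero (silent) array that peak is 0, so the division yields NaN values before the cast to int16. The following is a minimal sketch of a shared helper that guards this case. `to_int16` is a hypothetical name, not part of the commit, and returning silence for silent input is an assumption about the desired behaviour. Empty input still raises `ValueError`, which matches what `test_invalid_audio_data` expects.

```python
import numpy as np


def to_int16(audio_data: np.ndarray) -> np.ndarray:
    """Peak-normalize samples to the int16 range (hypothetical helper).

    Raises ValueError on empty input and returns zeros for silent input
    instead of dividing by a zero peak.
    """
    samples = np.asarray(audio_data, dtype=np.float64)
    if samples.size == 0:
        raise ValueError("audio_data is empty")
    peak = np.abs(samples).max()
    if peak == 0:
        # Silent clip: nothing to scale, so emit int16 silence.
        return np.zeros(samples.shape, dtype=np.int16)
    # Same formula as the commit: scale the peak to int16 max, then truncate.
    return (samples / peak * np.iinfo(np.int16).max).astype(np.int16)
```

Because the scale factor is always the clip's own peak, quiet input is amplified to full range; callers that need to preserve relative loudness would want a fixed scale instead. Separately, as far as I know soundfile writes MP3 through libsndfile itself (MP3 support was added in libsndfile 1.1.0) rather than through ffmpeg or libsox, so the comment in the MP3 branch may be worth revisiting.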

api/tests/test_audio_service.py

Lines changed: 57 additions & 5 deletions
@@ -51,19 +51,71 @@ def test_convert_to_flac(sample_audio):
 def test_convert_to_aac_raises_error(sample_audio):
     """Test that converting to AAC raises an error"""
     audio_data, sample_rate = sample_audio
-    with pytest.raises(ValueError, match="AAC format is not currently supported"):
+    with pytest.raises(
+        ValueError,
+        match="Format aac not supported. Supported formats are: wav, mp3, opus, flac, pcm.",
+    ):
         AudioService.convert_audio(audio_data, sample_rate, "aac")
 
 
-def test_convert_to_pcm_raises_error(sample_audio):
-    """Test that converting to PCM raises an error"""
+def test_convert_to_pcm(sample_audio):
+    """Test converting to PCM format"""
     audio_data, sample_rate = sample_audio
-    with pytest.raises(ValueError, match="PCM format is not currently supported"):
-        AudioService.convert_audio(audio_data, sample_rate, "pcm")
+    result = AudioService.convert_audio(audio_data, sample_rate, "pcm")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
 
 
 def test_convert_to_invalid_format_raises_error(sample_audio):
     """Test that converting to an invalid format raises an error"""
     audio_data, sample_rate = sample_audio
     with pytest.raises(ValueError, match="Format invalid not supported"):
         AudioService.convert_audio(audio_data, sample_rate, "invalid")
+
+
+def test_normalization_wav(sample_audio):
+    """Test that WAV output is properly normalized to int16 range"""
+    audio_data, sample_rate = sample_audio
+    # Create audio data outside int16 range
+    large_audio = audio_data * 1e5
+    result = AudioService.convert_audio(large_audio, sample_rate, "wav")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
+
+
+def test_normalization_pcm(sample_audio):
+    """Test that PCM output is properly normalized to int16 range"""
+    audio_data, sample_rate = sample_audio
+    # Create audio data outside int16 range
+    large_audio = audio_data * 1e5
+    result = AudioService.convert_audio(large_audio, sample_rate, "pcm")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
+
+
+def test_invalid_audio_data():
+    """Test handling of invalid audio data"""
+    invalid_audio = np.array([])  # Empty array
+    sample_rate = 24000
+    with pytest.raises(ValueError):
+        AudioService.convert_audio(invalid_audio, sample_rate, "wav")
+
+
+def test_different_sample_rates(sample_audio):
+    """Test converting audio with different sample rates"""
+    audio_data, _ = sample_audio
+    sample_rates = [8000, 16000, 44100, 48000]
+
+    for rate in sample_rates:
+        result = AudioService.convert_audio(audio_data, rate, "wav")
+        assert isinstance(result, bytes)
+        assert len(result) > 0
+
+
+def test_buffer_position_after_conversion(sample_audio):
+    """Test that buffer position is reset after writing"""
+    audio_data, sample_rate = sample_audio
+    result = AudioService.convert_audio(audio_data, sample_rate, "wav")
+    # Convert again to ensure buffer was properly reset
+    result2 = AudioService.convert_audio(audio_data, sample_rate, "wav")
+    assert len(result) == len(result2)
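
Note on the PCM tests: the `pcm` branch returns raw int16 samples with no header, so the bytes carry no sample rate or channel count, and `test_convert_to_pcm` only checks that some bytes come back. The following is a minimal sketch of how a consumer, or a stricter test, could wrap that output in a WAV container with the standard library and read it back. `pcm_to_wav` is a hypothetical helper, and mono little-endian audio is an assumption: `tobytes()` uses the machine's native byte order, which is little-endian on common platforms.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap headerless 16-bit PCM in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # int16 means 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

Since each int16 sample occupies exactly two bytes, a stricter PCM test could also assert `len(result) == 2 * audio_data.size` for mono input instead of only `len(result) > 0`.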
