remsky
diff --git a/.gitignore b/.gitignore
Lines changed: 2 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 5 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 43 additions & 27 deletions
diff --git a/api/src/services/audio.py b/api/src/services/audio.py
Lines changed: 20 additions & 31 deletions
diff --git a/api/tests/test_audio_service.py b/api/tests/test_audio_service.py
Lines changed: 57 additions & 5 deletions
@@ -14,3 +14,5 @@ env/
 .Python
 
 
+.coverage
+
@@ -2,8 +2,12 @@
 
 Notable changes to this project will be documented in this file.
 
-## 2024-01-09
 
+## 2025-01-02
+- Audio Format Support:
+  - Added comprehensive audio format conversion support (mp3, wav, opus, flac)
+
+## 2025-01-01
 ### Added
 - Gradio Web Interface:
   - Added simple web UI utility for audio generation from input or txt file
 
@@ -3,8 +3,8 @@
 </p>
 
 # Kokoro TTS API
-[![Tests](https://img.shields.io/badge/tests-81%20passed-darkgreen)]()
-[![Coverage](https://img.shields.io/badge/coverage-76%25-darkgreen)]()
+[![Tests](https://img.shields.io/badge/tests-89%20passed-darkgreen)]()
+[![Coverage](https://img.shields.io/badge/coverage-80%25-darkgreen)]()
 [![Tested at Model Commit](https://img.shields.io/badge/last--tested--model--commit-a67f113-blue)](https://huggingface.co/hexgrad/Kokoro-82M/tree/c3b0d86e2a980e027ef71c28819ea02e351c2667)
 
 Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) text-to-speech model
@@ -14,8 +14,7 @@ Dockerized FastAPI wrapper for [Kokoro-82M](https://huggingface.co/hexgrad/Kokor
 - automatic chunking/stitching for long texts
 - simple audio generation web ui utility
 
-<details open>
-<summary><b>Quick Start</b></summary>
+## Quick Start
 
 The service can be accessed through either the API endpoints or the Gradio web interface.
 
@@ -48,9 +47,10 @@ The service can be accessed through either the API endpoints or the Gradio web i
     <p align="center">
     <img src="ui\GradioScreenShot.png" width="80%" alt="Voice Analysis Comparison" style="border: 2px solid #333; padding: 10px;">
     </p>
-</details>
+
+## Features 
 <details>
-<summary><b>OpenAI-Compatible Speech Endpoint</b></summary>
+<summary>OpenAI-Compatible Speech Endpoint</summary>
 
 ```python
 # Using OpenAI's Python library
@@ -98,7 +98,10 @@ python examples/test_all_voices.py # Test all available voices
 </details>
 
 <details>
-<summary><b>Voice Combination</b></summary>
+<summary>Voice Combination</summary>
+
+- Averages model weights of any existing voicepacks
+- Saves generated voicepacks for future use
 
 Combine voices and generate audio:
 ```python
@@ -129,7 +132,23 @@ response = requests.post(
 </details>
 
 <details>
-<summary><b>Gradio Web Utility</b></summary>
+<summary>Multiple Output Audio Formats</summary>
+
+- mp3
+- wav
+- opus 
+- flac
+- aac (not yet implemented; requests for it are rejected)
+- pcm
+
+<p align="center">
+<img src="examples/benchmarks/format_comparison.png" width="80%" alt="Audio Format Comparison" style="border: 2px solid #333; padding: 10px;">
+</p>
+
+</details>
+
+<details>
+<summary>Gradio Web Utility</summary>
 
 Access the interactive web UI at http://localhost:7860 after starting the service. Features include:
 - Voice/format/speed selection
@@ -141,9 +160,9 @@ If you only want the API, just comment out everything in the docker-compose.yml
 Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added
 </details>
 
-
+## Processing Details
 <details>
-<summary><b>Performance Benchmarks</b></summary>
+<summary>Performance Benchmarks</summary>
 
 Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on: 
 - Windows 11 Home w/ WSL2 
@@ -163,7 +182,7 @@ Key Performance Metrics:
 - Average Processing Rate: 137.67 tokens/second (cl100k_base)
 </details>
 <details>
-<summary><b>GPU Vs. CPU<b></summary>
+<summary>GPU Vs. CPU</summary>
 
 ```bash
 # GPU: Requires NVIDIA GPU with CUDA 12.1 support
@@ -172,35 +191,29 @@ docker compose up --build
 # CPU: ~10x slower than GPU inference
 docker compose -f docker-compose.cpu.yml up --build
 ```
-</details>
-<details>
-<summary><b>Features</b></summary>
 
-- OpenAI-compatible API endpoints (with optional Gradio Web UI)
-- GPU-accelerated inference (if desired)
-- Multiple audio formats: mp3, wav, opus, flac, (aac & pcm not implemented)
-- Natural Boundary Detection:
-    - Automatically splits and stitches at sentence boundaries to reduce artifacts and maintain performacne
-- Voice Combination:
-    - Averages model weights of any existing voicepacks
-    - Saves generated voicepacks for future use
+*Note: CPU Inference is currently a very basic implementation, and not heavily tested*
 
+</details>
 
+<details>
+<summary>Natural Boundary Detection</summary>
 
-*Note: CPU Inference is currently a very basic implementation, and not heavily tested*
+- Automatically splits and stitches at sentence boundaries 
+- Reduces artifacts and enables long-form processing, since the base model is currently configured for only about 30 s of output
 </details>
 
+## Model and License
+
 <details open>
-<summary><b>Model</b></summary>
+<summary>Model</summary>
 
 This API uses the [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) model from HuggingFace. 
 
 Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.
 </details>
-
 <details>
-<summary><b>License</b></summary>
-
+<summary>License</summary>
 This project is licensed under the Apache License 2.0 - see below for details:
 
 - The Kokoro model weights are licensed under Apache 2.0 (see [model page](https://huggingface.co/hexgrad/Kokoro-82M))
@@ -209,3 +222,6 @@ This project is licensed under the Apache License 2.0 - see below for details:
 
 The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
 </details>
+
+
+
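The "Natural Boundary Detection" entry in the README diff above describes splitting long text at sentence boundaries and stitching the results. The service's actual chunking code is not part of this diff, so the following is only a minimal sketch of the idea. The function name `split_into_chunks`, the `max_chars` limit, and the punctuation-based splitting rule are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: the splitting rule and the limit are assumptions,
    not the service's real implementation.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two! Three? Four.", max_chars=10):
        print(repr(chunk))
```

A sentence longer than `max_chars` is kept whole as its own chunk in this sketch. A real implementation would need a fallback (for example, splitting on commas) to stay within the model's output limit.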
@@ -4,7 +4,6 @@
 
 import numpy as np
 import soundfile as sf
-import scipy.io.wavfile as wavfile
 from loguru import logger
 
 
@@ -20,7 +19,7 @@ def convert_audio(
         Args:
             audio_data: Numpy array of audio samples
             sample_rate: Sample rate of the audio
-            output_format: Target format (wav, mp3, etc.)
+            output_format: Target format (wav, mp3, opus, flac, pcm)
 
         Returns:
             Bytes of the converted audio
@@ -30,46 +29,36 @@ def convert_audio(
         try:
             if output_format == "wav":
                 logger.info("Writing to WAV format...")
-                wavfile.write(buffer, sample_rate, audio_data)
-                return buffer.getvalue()
-
+                # Ensure audio_data is in int16 format for WAV
+                audio_data_wav = (
+                    audio_data / np.abs(audio_data).max() * np.iinfo(np.int16).max
+                ).astype(np.int16)  # Normalize
+                sf.write(buffer, audio_data_wav, sample_rate, format="WAV")
             elif output_format == "mp3":
-                # For MP3, we need to convert to WAV first
                 logger.info("Converting to MP3 format...")
-                wav_buffer = BytesIO()
-                wavfile.write(wav_buffer, sample_rate, audio_data)
-                wav_buffer.seek(0)
-
-                # Convert WAV to MP3 using soundfile
-                buffer = BytesIO()
-                sf.write(buffer, audio_data, sample_rate, format="mp3")
-                return buffer.getvalue()
-
+                # soundfile writes MP3 through libsndfile (MP3 support requires libsndfile >= 1.1.0)
+                sf.write(buffer, audio_data, sample_rate, format="MP3")
             elif output_format == "opus":
                 logger.info("Converting to Opus format...")
-                sf.write(buffer, audio_data, sample_rate, format="ogg", subtype="opus")
-                return buffer.getvalue()
-
+                sf.write(buffer, audio_data, sample_rate, format="OGG", subtype="OPUS")
             elif output_format == "flac":
                 logger.info("Converting to FLAC format...")
-                sf.write(buffer, audio_data, sample_rate, format="flac")
-                return buffer.getvalue()
-
-            elif output_format == "aac":
-                raise ValueError(
-                    "AAC format is not currently supported. Please use wav, mp3, opus, or flac."
-                )
-
+                sf.write(buffer, audio_data, sample_rate, format="FLAC")
             elif output_format == "pcm":
-                raise ValueError(
-                    "PCM format is not currently supported. Please use wav, mp3, opus, or flac."
-                )
-
+                logger.info("Extracting PCM data...")
+                # Ensure audio_data is in int16 format for PCM
+                audio_data_pcm = (
+                    audio_data / np.abs(audio_data).max() * np.iinfo(np.int16).max
+                ).astype(np.int16)  # Normalize
+                buffer.write(audio_data_pcm.tobytes())
             else:
                 raise ValueError(
-                    f"Format {output_format} not supported. Supported formats are: wav, mp3, opus, flac."
+                    f"Format {output_format} not supported. Supported formats are: wav, mp3, opus, flac, pcm."
                 )
 
+            buffer.seek(0)
+            return buffer.getvalue()
+
         except Exception as e:
             logger.error(f"Error converting audio to {output_format}: {str(e)}")
             raise ValueError(f"Failed to convert audio to {output_format}: {str(e)}")
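The WAV and PCM branches in the hunk above both divide by `np.abs(audio_data).max()`. For an all-zero (silent) array that peak is 0, so the division yields NaN values before the cast to `int16`. The sketch below shows one way the shared normalization could be factored out with a guard. The helper name `normalize_to_int16` is hypothetical and not part of the diff.

```python
import numpy as np


def normalize_to_int16(audio_data: np.ndarray) -> np.ndarray:
    """Peak-normalize samples to the int16 range.

    Hypothetical helper mirroring the expression used in the diff,
    with explicit handling for empty and silent input.
    """
    if audio_data.size == 0:
        raise ValueError("audio_data is empty")

    peak = float(np.abs(audio_data).max())
    if peak == 0.0:
        # Silent input: return zeros instead of dividing by zero.
        return np.zeros(audio_data.shape, dtype=np.int16)

    scaled = audio_data.astype(np.float64) / peak * np.iinfo(np.int16).max
    return scaled.astype(np.int16)


if __name__ == "__main__":
    samples = np.array([0.5, -1.0, 0.25])
    print(normalize_to_int16(samples))
    print(normalize_to_int16(np.zeros(4)))
```

Peak normalization always rescales the loudest sample to full scale, so quiet input becomes louder. Whether that is desirable for the API is a design decision the diff does not discuss.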
@@ -51,19 +51,71 @@ def test_convert_to_flac(sample_audio):
 def test_convert_to_aac_raises_error(sample_audio):
     """Test that converting to AAC raises an error"""
     audio_data, sample_rate = sample_audio
-    with pytest.raises(ValueError, match="AAC format is not currently supported"):
+    with pytest.raises(
+        ValueError,
+        match="Format aac not supported. Supported formats are: wav, mp3, opus, flac, pcm.",
+    ):
         AudioService.convert_audio(audio_data, sample_rate, "aac")
 
 
-def test_convert_to_pcm_raises_error(sample_audio):
-    """Test that converting to PCM raises an error"""
+def test_convert_to_pcm(sample_audio):
+    """Test converting to PCM format"""
     audio_data, sample_rate = sample_audio
-    with pytest.raises(ValueError, match="PCM format is not currently supported"):
-        AudioService.convert_audio(audio_data, sample_rate, "pcm")
+    result = AudioService.convert_audio(audio_data, sample_rate, "pcm")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
 
 
 def test_convert_to_invalid_format_raises_error(sample_audio):
     """Test that converting to an invalid format raises an error"""
     audio_data, sample_rate = sample_audio
     with pytest.raises(ValueError, match="Format invalid not supported"):
         AudioService.convert_audio(audio_data, sample_rate, "invalid")
+
+
+def test_normalization_wav(sample_audio):
+    """Test that WAV output is properly normalized to int16 range"""
+    audio_data, sample_rate = sample_audio
+    # Create audio data outside int16 range
+    large_audio = audio_data * 1e5
+    result = AudioService.convert_audio(large_audio, sample_rate, "wav")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
+
+
+def test_normalization_pcm(sample_audio):
+    """Test that PCM output is properly normalized to int16 range"""
+    audio_data, sample_rate = sample_audio
+    # Create audio data outside int16 range
+    large_audio = audio_data * 1e5
+    result = AudioService.convert_audio(large_audio, sample_rate, "pcm")
+    assert isinstance(result, bytes)
+    assert len(result) > 0
+
+
+def test_invalid_audio_data():
+    """Test handling of invalid audio data"""
+    invalid_audio = np.array([])  # Empty array
+    sample_rate = 24000
+    with pytest.raises(ValueError):
+        AudioService.convert_audio(invalid_audio, sample_rate, "wav")
+
+
+def test_different_sample_rates(sample_audio):
+    """Test converting audio with different sample rates"""
+    audio_data, _ = sample_audio
+    sample_rates = [8000, 16000, 44100, 48000]
+
+    for rate in sample_rates:
+        result = AudioService.convert_audio(audio_data, rate, "wav")
+        assert isinstance(result, bytes)
+        assert len(result) > 0
+
+
+def test_buffer_position_after_conversion(sample_audio):
+    """Test that buffer position is reset after writing"""
+    audio_data, sample_rate = sample_audio
+    result = AudioService.convert_audio(audio_data, sample_rate, "wav")
+    # Convert again to ensure buffer was properly reset
+    result2 = AudioService.convert_audio(audio_data, sample_rate, "wav")
+    assert len(result) == len(result2)
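The PCM tests above only assert that some bytes come back. Because the PCM branch writes raw `int16` samples with no header, a consumer has to supply the sample rate, sample width, and channel count itself. The sketch below uses the standard-library `wave` module to wrap such bytes in a WAV container. It assumes mono, 16-bit, little-endian samples at 24000 Hz (the rate used in `test_invalid_audio_data`). The function name `pcm16_to_wav_bytes` is hypothetical.

```python
import io
import struct
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container.

    Hypothetical client-side helper; the defaults are assumptions about
    the service's output, not values confirmed by the diff.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    # Four little-endian int16 samples standing in for a "pcm" response body.
    raw = struct.pack("<4h", 0, 1000, -1000, 32767)
    wav_bytes = pcm16_to_wav_bytes(raw)
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        print(reader.getframerate(), reader.getnframes())
```

A round trip like this could also back a stricter test than the length checks above, for example by asserting that the decoded frame count matches the number of input samples.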