
Commit ed4c701

[Enhancement] Adding support for Openai TTS models (#54)

* Adding support for Openai TTS models
* Linting and format fixes
* Cleaner design to handle multimodal outputs
* Adding new tests for custom_models.py
* Renaming test files
* Minor fixes
* Creating base64 encoded data url to ensure reusability in subsequent nodes
* Changes to store multimodal data in output file as paths in place of b64 encoded urls
* linting fixes
* fix lint issues
* mypy fixes
* Adding missing documentation to tts

1 parent 3818055 commit ed4c701

19 files changed: +2800 −48 lines

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
# Text to Speech Data Generation

This module introduces support for multimodal data generation pipelines that accept **text** as input and produce **audio** outputs using text-to-speech (TTS) models. It expands traditional text-only pipelines to support audio generation tasks such as audiobook creation, voice narration, and multi-voice dialogue generation.

## Key Features

- Supports **text-to-audio** generation using OpenAI TTS models.
- Converts text inputs into **base64-encoded audio data URLs** compatible with standard audio formats.
- Compatible with HuggingFace datasets, streaming, and on-disk formats.
- Supports multiple voice options and audio formats.
- Variable **speed control** (0.25x to 4.0x).
- Automatic handling of multimodal outputs with file-saving capabilities.

## Supported Models

**Currently, we only support OpenAI TTS models:**

- `tts-1` - Standard quality, optimized for speed
- `tts-1-hd` - High-definition quality, optimized for quality
- `gpt-4o-mini-tts` - OpenAI's newest and most reliable text-to-speech model

All three models support every voice option and audio format listed below.

## Input Requirements

### Text Input

Each text field to be converted to speech must:

- Be a string containing the text to synthesize
- Not exceed **4096 characters** (OpenAI TTS limit)
- Be specified in the model configuration
- Come from a local dataset or a HuggingFace dataset
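The 4096-character ceiling means longer documents must be split before synthesis. A minimal pre-processing sketch (the helper name `split_for_tts` is illustrative and not part of this module):

```python
def split_for_tts(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks below the TTS character limit, breaking at sentence boundaries.

    Single sentences longer than `limit` are not handled; this is a sketch, not production code.
    """
    chunks: list[str] = []
    current = ""
    for sentence in text.replace("\n", " ").split(". "):
        piece = sentence if sentence.endswith(".") else sentence + ". "
        if len(current) + len(piece) > limit:
            chunks.append(current.strip())
            current = piece
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each chunk can then be synthesized as a separate record, and the resulting audio files concatenated downstream.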
### Voice Options

You can choose from the following voices: https://platform.openai.com/docs/guides/text-to-speech#voice-options

### Audio Formats

You can choose from the following audio formats: https://platform.openai.com/docs/guides/text-to-speech#supported-output-formats

### Supported Languages

The TTS models support multiple languages, including but not limited to: https://platform.openai.com/docs/guides/text-to-speech#supported-languages

## How Text-to-Speech Generation Works

1. Text input is extracted from the specified field in each record.
2. The TTS model generates audio from the text.
3. Audio is returned as a **base64-encoded data URL** (e.g., `data:audio/mp3;base64,...`).
4. The data URL is converted to a file and saved to disk.
5. The output JSON/JSONL record stores the absolute path to the audio file.
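The decode-and-save step can be sketched as follows; the helper name and paths are illustrative, not the module's actual implementation:

```python
import base64
from pathlib import Path

def data_url_to_file(data_url: str, out_path: Path) -> Path:
    """Decode a base64 audio data URL (e.g. 'data:audio/mp3;base64,...') and write the bytes to disk."""
    header, b64_payload = data_url.split(",", 1)
    if not (header.startswith("data:audio/") and header.endswith(";base64")):
        raise ValueError(f"Not a base64 audio data URL: {header}")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(base64.b64decode(b64_payload))
    return out_path
```

The record's audio field is then replaced with the returned file path.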
## Model Configuration

The model configuration for TTS generation must specify `output_type: audio` and include TTS-specific parameters:

```yaml
tts_openai:
  model: tts
  output_type: audio
  model_type: azure_openai
  api_version: 2025-03-01-preview
  parameters:
    voice: "alloy"
    response_format: "wav"
```
## Example Configuration: Audiobook Generation

```yaml
data_config:
  source:
    type: "disk"
    file_path: "data/chapters.json"

graph_config:
  nodes:
    generate_chapter_audio:
      node_type: llm
      output_keys: audio
      prompt:
        - user: |
            "{chapter_text}"
      model:
        parameters:
          voice: nova
          response_format: mp3
          speed: 1.0

  edges:
    - from: START
      to: generate_chapter_audio
    - from: generate_chapter_audio
      to: END

output_config:
  output_map:
    id:
      from: "id"
    chapter_number:
      from: "chapter_number"
    chapter_text:
      from: "chapter_text"
    audio:
      from: "audio"
```
### Input Data (`data/chapters.json`)

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night..."
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies..."
  }
]
```
### Output

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night...",
    "audio": "/path/to/multimodal_output/audio/1_audio_0.mp3"
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies...",
    "audio": "/path/to/multimodal_output/audio/2_audio_0.mp3"
  }
]
```

---
## Notes

- **Text-to-speech generation is currently only supported for OpenAI TTS models.** Support for additional providers may be added in future releases.
- The `output_type` in the model configuration must be set to `audio` to enable TTS generation.
- Audio files are automatically saved and managed.

---

## See Also

- [Audio to Text](./audio_to_text.md) - For speech recognition and audio transcription
- [Image to Text](./image_to_text.md) - For vision-based multimodal pipelines
- [OpenAI TTS Documentation](https://platform.openai.com/docs/guides/text-to-speech) - Official OpenAI TTS API reference

mkdocs.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -36,6 +36,7 @@ nav:
     - Multimodal:
         - Audio to Text: concepts/multimodal/audio_to_text.md
         - Image to Text: concepts/multimodal/image_to_text.md
+        - Text to Speech: concepts/multimodal/text_to_speech.md
     - Nodes:
         - Agent Node: concepts/nodes/agent_node.md
         - Lambda Node: concepts/nodes/lambda_node.md
```

sygra/config/models.yaml

Lines changed: 11 additions & 0 deletions

```diff
@@ -78,3 +78,14 @@ qwen3_1.7b:
   post_process: sygra.core.models.model_postprocessor.RemoveThinkData
   parameters:
     temperature: 0.8
+
+# TTS openai model
+tts_openai:
+  model: tts
+  output_type: audio  # This triggers TTS functionality
+  model_type: azure_openai  # Use azure_openai or openai model type
+  api_version: 2025-03-01-preview
+  # URL and api_key should be defined in the .env file as SYGRA_TTS_OPENAI_URL and SYGRA_TTS_OPENAI_TOKEN
+  parameters:
+    voice: "alloy"
+    response_format: "wav"
```
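Per the comment in the hunk above, the endpoint URL and API key live in `.env` rather than in models.yaml. A sketch of the corresponding entries (the values are placeholders):

```
# .env
SYGRA_TTS_OPENAI_URL=https://<your-resource>.openai.azure.com
SYGRA_TTS_OPENAI_TOKEN=<your-api-key>
```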

sygra/core/dataset/dataset_processor.py

Lines changed: 13 additions & 1 deletion

```diff
@@ -3,6 +3,7 @@
 import signal
 import time
 import uuid
+from pathlib import Path
 from typing import Any, Callable, Optional, Union, cast

 import datasets  # type: ignore[import-untyped]
@@ -13,7 +14,7 @@
 from sygra.core.resumable_execution import ResumableExecutionManager
 from sygra.data_mapper.mapper import DataMapper
 from sygra.logger.logger_config import logger
-from sygra.utils import constants, graph_utils, utils
+from sygra.utils import constants, graph_utils, multimodal_processor, utils
 from sygra.validators.schema_validator_base import SchemaValidator


@@ -286,6 +287,17 @@ async def _write_checkpoint(self, is_oasst_mapper_required: bool) -> None:
             self.graph_results, self.output_record_generator
         )

+        # Process multimodal data: save base64 data URLs to files and replace with file paths
+        try:
+            multimodal_output_dir = ".".join(self.output_file.split(".")[:-1])
+            output_records = multimodal_processor.process_batch_multimodal_data(
+                output_records, Path(multimodal_output_dir)
+            )
+        except Exception as e:
+            logger.warning(
+                f"Failed to process multimodal data: {e}. Continuing with original records."
+            )
+
         # Handle intermediate writing if needed
         if (
             is_oasst_mapper_required
```
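The output-directory derivation in the hunk above simply strips the output file's final extension; its behavior can be sketched as:

```python
def multimodal_dir_for(output_file: str) -> str:
    # Drop everything after the last dot: "out/results.jsonl" -> "out/results"
    return ".".join(output_file.split(".")[:-1])
```

Note that only the last extension is removed, so multi-dot names keep their earlier dots.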

sygra/core/models/client/openai_azure_client.py

Lines changed: 38 additions & 0 deletions

```diff
@@ -160,3 +160,41 @@ def send_request(
             return client.chat.completions.create(**payload, model=model_name, **generation_params)
         else:
             return client.completions.create(**payload, model=model_name, **generation_params)
+
+    async def create_speech(
+        self,
+        model: str,
+        input: str,
+        voice: str,
+        response_format: str = "mp3",
+        speed: float = 1.0,
+    ) -> Any:
+        """
+        Create speech audio from text using Azure OpenAI's text-to-speech API.
+
+        Args:
+            model (str): The TTS model deployment name (e.g., 'tts-1', 'tts-1-hd')
+            input (str): The text to convert to speech
+            voice (str): The voice to use, e.g. alloy, echo, fable, onyx, nova, shimmer
+            response_format (str, optional): The audio format, e.g. mp3, opus, aac, flac, wav, pcm. Defaults to 'mp3'
+            speed (float, optional): The speed of the audio (0.25 to 4.0). Defaults to 1.0
+
+        Returns:
+            Any: The audio response from the API
+
+        Raises:
+            ValueError: If async_client is False (TTS requires async client)
+        """
+        if not self.async_client:
+            raise ValueError(
+                "TTS API requires async client. Please initialize with async_client=True"
+            )
+
+        client = cast(Any, self.client)
+        return await client.audio.speech.create(
+            model=model,
+            input=input,
+            voice=voice,
+            response_format=response_format,
+            speed=speed,
+        )
```

sygra/core/models/client/openai_client.py

Lines changed: 38 additions & 0 deletions

```diff
@@ -182,3 +182,41 @@ def send_request(
             extra_body=additional_params,
             **standard_params,
         )
+
+    async def create_speech(
+        self,
+        model: str,
+        input: str,
+        voice: str,
+        response_format: str = "mp3",
+        speed: float = 1.0,
+    ) -> Any:
+        """
+        Create speech audio from text using OpenAI's text-to-speech API.
+
+        Args:
+            model (str): The TTS model to use (e.g., 'tts-1', 'tts-1-hd')
+            input (str): The text to convert to speech
+            voice (str): The voice to use, e.g. alloy, echo, fable, onyx, nova, shimmer
+            response_format (str, optional): The audio format, e.g. mp3, opus, aac, flac, wav, pcm. Defaults to 'mp3'
+            speed (float, optional): The speed of the audio (0.25 to 4.0). Defaults to 1.0
+
+        Returns:
+            Any: The audio response from the API
+
+        Raises:
+            ValueError: If async_client is False (TTS requires async client)
+        """
+        if not self.async_client:
+            raise ValueError(
+                "TTS API requires async client. Please initialize with async_client=True"
+            )
+
+        client = cast(Any, self.client)
+        return await client.audio.speech.create(
+            model=model,
+            input=input,
+            voice=voice,
+            response_format=response_format,
+            speed=speed,
+        )
```
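For reference, a hedged usage sketch of the equivalent call against the official `openai` Python SDK (the function name `synthesize` and the output path are illustrative; `OPENAI_API_KEY` must be set in the environment):

```python
import asyncio

async def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    """Generate speech for `text` and persist the binary audio payload."""
    from openai import AsyncOpenAI  # official SDK; reads OPENAI_API_KEY from the environment

    client = AsyncOpenAI()
    response = await client.audio.speech.create(
        model="tts-1",
        input=text,
        voice="alloy",
        response_format="mp3",
        speed=1.0,
    )
    # The response exposes the raw audio bytes; write them to disk.
    with open(out_path, "wb") as f:
        f.write(response.read())

# asyncio.run(synthesize("It was a dark and stormy night..."))
```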

0 commit comments