157 changes: 157 additions & 0 deletions docs/concepts/multimodal/text_to_speech.md
@@ -0,0 +1,157 @@
# Text to Speech Data Generation

This module introduces support for multimodal data generation pipelines that accept **text** as input and produce **audio outputs** using text-to-speech (TTS) models. It expands traditional text-only pipelines to support audio generation tasks like audiobook creation, voice narration, and multi-voice dialogue generation.

## Key Features

- Supports **text-to-audio** generation using OpenAI TTS models.
- Converts text inputs into **base64-encoded audio data URLs** compatible with standard audio formats.
- Compatible with HuggingFace datasets, streaming, and on-disk formats.
- Supports multiple voice options and audio formats.
- Variable **speed control** (0.25x to 4.0x).
- Automatic handling of multimodal outputs with file saving capabilities.

## Supported Models

**Currently, we only support OpenAI TTS models:**

- `tts-1` - Standard quality, optimized for speed
- `tts-1-hd` - High-definition quality, optimized for quality
- `gpt-4o-mini-tts` - OpenAI's newest and most reliable text-to-speech model

All three models support all of the voice options and audio formats listed below.

## Input Requirements

### Text Input

Each text field to be converted to speech must:

- Be a string containing the text to synthesize
- Not exceed **4096 characters** (the OpenAI TTS limit); split longer text first, as in the chunking sketch below
- Be specified in the model configuration
- Come from a local dataset or a HuggingFace dataset
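The 4096-character limit means longer documents must be split before synthesis. A minimal chunking sketch (the helper name and sentence-boundary heuristic are illustrative, not part of the library):

```python
def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks under `limit` characters, preferring sentence boundaries."""
    chunks = []
    while len(text) > limit:
        # Prefer the last sentence end inside the window; fall back to a hard cut.
        cut = text.rfind(". ", 0, limit)
        cut = limit if cut == -1 else cut + 1  # keep the period with its chunk
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```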

### Voice Options
You can choose from the following voices: https://platform.openai.com/docs/guides/text-to-speech#voice-options

### Audio Formats
You can choose from the following audio formats: https://platform.openai.com/docs/guides/text-to-speech#supported-output-formats

### Supported Languages
The TTS models support a wide range of languages; see the full list here: https://platform.openai.com/docs/guides/text-to-speech#supported-languages

## How Text-to-Speech Generation Works

1. Text input is extracted from the specified field in each record.
2. The TTS model generates audio from the text.
3. Audio is returned as a **base64-encoded data URL** (e.g., `data:audio/mp3;base64,...`).
4. The data URL is converted to a file and saved to disk.
5. The output JSON/JSONL record contains the absolute path to the saved audio file.
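For reference, a data URL of that shape can be decoded back to raw bytes with just the standard library. A minimal sketch of the step the pipeline performs for you (the function name is illustrative):

```python
import base64

def data_url_to_file(data_url: str, out_path: str) -> None:
    """Decode a `data:audio/...;base64,...` URL and write the raw audio bytes to disk."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:audio/"):
        raise ValueError("expected an audio data URL")
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(payload))
```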

## Model Configuration

The model configuration for TTS generation must specify `output_type: audio` and include TTS-specific parameters:

```yaml
tts_openai:
  model: tts
  output_type: audio
  model_type: azure_openai
  api_version: 2025-03-01-preview
  parameters:
    voice: "alloy"
    response_format: "wav"
```
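The API rejects out-of-range parameters, so it can be worth validating them before launching a long run. A hedged client-side sketch (the function name is illustrative; the voice and format lists mirror the OpenAI documentation linked above):

```python
VALID_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
VALID_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def validate_tts_params(voice: str, response_format: str, speed: float) -> None:
    """Fail fast on parameters the TTS endpoint would reject."""
    if voice not in VALID_VOICES:
        raise ValueError(f"Unknown voice: {voice!r}")
    if response_format not in VALID_FORMATS:
        raise ValueError(f"Unknown audio format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
```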

## Example Configuration: Audiobook Generation

```yaml
data_config:
  source:
    type: "disk"
    file_path: "data/chapters.json"

graph_config:
  nodes:
    generate_chapter_audio:
      node_type: llm
      output_keys: audio
      prompt:
        - user: |
            "{chapter_text}"
      model:
        parameters:
          voice: nova
          response_format: mp3
          speed: 1.0

  edges:
    - from: START
      to: generate_chapter_audio
    - from: generate_chapter_audio
      to: END

output_config:
  output_map:
    id:
      from: "id"
    chapter_number:
      from: "chapter_number"
    chapter_text:
      from: "chapter_text"
    audio:
      from: "audio"
```

### Input Data (`data/chapters.json`)

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night..."
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies..."
  }
]
```

### Output

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night...",
    "audio": "/path/to/multimodal_output/audio/1_audio_0.mp3"
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies...",
    "audio": "/path/to/multimodal_output/audio/2_audio_0.mp3"
  }
]
```
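Because each record now carries a file path rather than inline audio, downstream validation is simple. A small sketch that checks every referenced audio file exists (the output file name is an assumption):

```python
import json
from pathlib import Path

records = json.loads(Path("output.json").read_text())  # hypothetical output file name
missing = [r["id"] for r in records if not Path(r["audio"]).is_file()]
print(f"{len(records) - len(missing)}/{len(records)} audio files present; missing: {missing}")
```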

---

## Notes

- **Text-to-speech generation is currently only supported for OpenAI TTS models.** Support for additional providers may be added in future releases.
- The `output_type` in the model configuration must be set to `audio` to enable TTS generation.
- Audio files are automatically saved and managed.

---

## See Also

- [Audio to Text](./audio_to_text.md) - For speech recognition and audio transcription
- [Image to Text](./image_to_text.md) - For vision-based multimodal pipelines
- [OpenAI TTS Documentation](https://platform.openai.com/docs/guides/text-to-speech) - Official OpenAI TTS API reference
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -36,6 +36,7 @@ nav:
  - Multimodal:
      - Audio to Text: concepts/multimodal/audio_to_text.md
      - Image to Text: concepts/multimodal/image_to_text.md
      - Text to Speech: concepts/multimodal/text_to_speech.md
  - Nodes:
      - Agent Node: concepts/nodes/agent_node.md
      - Lambda Node: concepts/nodes/lambda_node.md
11 changes: 11 additions & 0 deletions sygra/config/models.yaml
@@ -78,3 +78,14 @@ qwen3_1.7b:
  post_process: sygra.core.models.model_postprocessor.RemoveThinkData
  parameters:
    temperature: 0.8

# TTS OpenAI model
tts_openai:
  model: tts
  output_type: audio  # This triggers TTS functionality
  model_type: azure_openai  # Use the azure_openai or openai model type
  api_version: 2025-03-01-preview
  # URL and API key must be defined in the .env file as SYGRA_TTS_OPENAI_URL and SYGRA_TTS_OPENAI_TOKEN
  parameters:
    voice: "alloy"
    response_format: "wav"
14 changes: 13 additions & 1 deletion sygra/core/dataset/dataset_processor.py
@@ -3,6 +3,7 @@
import signal
import time
import uuid
from pathlib import Path
from typing import Any, Callable, Optional, Union, cast

import datasets # type: ignore[import-untyped]
@@ -13,7 +14,7 @@
from sygra.core.resumable_execution import ResumableExecutionManager
from sygra.data_mapper.mapper import DataMapper
from sygra.logger.logger_config import logger
from sygra.utils import constants, graph_utils, utils
from sygra.utils import constants, graph_utils, multimodal_processor, utils
from sygra.validators.schema_validator_base import SchemaValidator


@@ -286,6 +287,17 @@ async def _write_checkpoint(self, is_oasst_mapper_required: bool) -> None:
                self.graph_results, self.output_record_generator
            )

        # Process multimodal data: save base64 data URLs to files and replace with file paths
        try:
            multimodal_output_dir = ".".join(self.output_file.split(".")[:-1])
            output_records = multimodal_processor.process_batch_multimodal_data(
                output_records, Path(multimodal_output_dir)
            )
        except Exception as e:
            logger.warning(
                f"Failed to process multimodal data: {e}. Continuing with original records."
            )

        # Handle intermediate writing if needed
        if (
            is_oasst_mapper_required
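Conceptually, `process_batch_multimodal_data` walks each output record, writes any base64 audio data URL to a file under the derived output directory, and replaces the field value with the saved path. A rough illustration of that contract, not the actual implementation in `sygra.utils.multimodal_processor`:

```python
import base64
from pathlib import Path

def process_record(record: dict, out_dir: Path) -> dict:
    """Illustrative only: replace audio data URLs in a record with saved file paths."""
    for key, value in record.items():
        if isinstance(value, str) and value.startswith("data:audio/"):
            fmt = value.split(";")[0].split("/")[1]  # e.g. "mp3"
            path = out_dir / "audio" / f"{record.get('id', 'rec')}_{key}_0.{fmt}"
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(base64.b64decode(value.split(",", 1)[1]))
            record[key] = str(path.resolve())
    return record
```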
38 changes: 38 additions & 0 deletions sygra/core/models/client/openai_azure_client.py
@@ -160,3 +160,41 @@ def send_request(
            return client.chat.completions.create(**payload, model=model_name, **generation_params)
        else:
            return client.completions.create(**payload, model=model_name, **generation_params)

    async def create_speech(
        self,
        model: str,
        input: str,
        voice: str,
        response_format: str = "mp3",
        speed: float = 1.0,
    ) -> Any:
        """
        Create speech audio from text using Azure OpenAI's text-to-speech API.

        Args:
            model (str): The TTS model deployment name (e.g., 'tts-1', 'tts-1-hd')
            input (str): The text to convert to speech
            voice (str): The voice to use (e.g., alloy, echo, fable, onyx, nova, shimmer)
            response_format (str, optional): The audio format (mp3, opus, aac, flac, wav, or pcm). Defaults to 'mp3'
            speed (float, optional): The playback speed (0.25 to 4.0). Defaults to 1.0

        Returns:
            Any: The audio response from the API

        Raises:
            ValueError: If async_client is False (TTS requires async client)
        """
        if not self.async_client:
            raise ValueError(
                "TTS API requires async client. Please initialize with async_client=True"
            )

        client = cast(Any, self.client)
        return await client.audio.speech.create(
            model=model,
            input=input,
            voice=voice,
            response_format=response_format,
            speed=speed,
        )
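A minimal usage sketch for this method; the same call shape applies to the plain OpenAI client below. Client construction is assumed to have happened elsewhere with `async_client=True`, and reading `.content` off the SDK's binary response is how the raw bytes are obtained:

```python
# Illustrative usage, assuming `client` is an async-initialized client wrapper.
import asyncio

async def synthesize(client) -> None:
    response = await client.create_speech(
        model="tts",  # Azure deployment name from models.yaml
        input="It was a dark and stormy night...",
        voice="alloy",
        response_format="wav",
        speed=1.0,
    )
    with open("sample.wav", "wb") as f:
        f.write(response.content)  # raw audio bytes from the SDK response

# asyncio.run(synthesize(client))
```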
38 changes: 38 additions & 0 deletions sygra/core/models/client/openai_client.py
@@ -182,3 +182,41 @@ def send_request(
            extra_body=additional_params,
            **standard_params,
        )

    async def create_speech(
        self,
        model: str,
        input: str,
        voice: str,
        response_format: str = "mp3",
        speed: float = 1.0,
    ) -> Any:
        """
        Create speech audio from text using OpenAI's text-to-speech API.

        Args:
            model (str): The TTS model to use (e.g., 'tts-1', 'tts-1-hd')
            input (str): The text to convert to speech
            voice (str): The voice to use (e.g., alloy, echo, fable, onyx, nova, shimmer)
            response_format (str, optional): The audio format (mp3, opus, aac, flac, wav, or pcm). Defaults to 'mp3'
            speed (float, optional): The playback speed (0.25 to 4.0). Defaults to 1.0

        Returns:
            Any: The audio response from the API

        Raises:
            ValueError: If async_client is False (TTS requires async client)
        """
        if not self.async_client:
            raise ValueError(
                "TTS API requires async client. Please initialize with async_client=True"
            )

        client = cast(Any, self.client)
        return await client.audio.speech.create(
            model=model,
            input=input,
            voice=voice,
            response_format=response_format,
            speed=speed,
        )