157 changes: 157 additions & 0 deletions docs/concepts/multimodal/text_to_speech.md
@@ -0,0 +1,157 @@
# Text to Speech Data Generation

This module introduces support for multimodal data generation pipelines that accept **text** as input and produce **audio outputs** using text-to-speech (TTS) models. It expands traditional text-only pipelines to support audio generation tasks like audiobook creation, voice narration, and multi-voice dialogue generation.

## Key Features

- Supports **text-to-audio** generation using OpenAI TTS models.
- Converts text inputs into **base64-encoded audio data URLs** compatible with standard audio formats.
- Compatible with HuggingFace datasets, streaming, and on-disk formats.
- Supports multiple voice options and audio formats.
- Variable **speed control** (0.25x to 4.0x).
- Automatic handling of multimodal outputs with file saving capabilities.

## Supported Models

**Currently, we only support OpenAI TTS models:**

- `tts-1` - Standard quality, optimized for speed
- `tts-1-hd` - High-definition quality, optimized for quality
- `gpt-4o-mini-tts` - OpenAI's newest and most reliable text-to-speech model

All three models support all of the voice options and audio formats listed below.

## Input Requirements

### Text Input

Each text field to be converted to speech must:

- Be a string containing the text to synthesize
- Not exceed **4096 characters** (the OpenAI TTS limit); split longer text first, as in the chunking sketch below
- Be specified in the model configuration
- Come from a local dataset or a HuggingFace dataset
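The 4096-character limit means longer documents must be split before synthesis. A minimal chunking sketch (the helper name and sentence-boundary heuristic are illustrative, not part of the library):

```python
def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks under `limit` characters, preferring sentence boundaries."""
    chunks = []
    while len(text) > limit:
        # Prefer the last sentence end inside the window; fall back to a hard cut.
        cut = text.rfind(". ", 0, limit)
        cut = limit if cut == -1 else cut + 1  # keep the period with its chunk
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```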

### Voice Options
You can choose from the following voices: https://platform.openai.com/docs/guides/text-to-speech#voice-options

### Audio Formats
You can choose from the following audio formats: https://platform.openai.com/docs/guides/text-to-speech#supported-output-formats

### Supported Languages
The TTS models support a wide range of languages; see the full list here: https://platform.openai.com/docs/guides/text-to-speech#supported-languages

## How Text-to-Speech Generation Works

1. Text input is extracted from the specified field in each record.
2. The TTS model generates audio from the text.
3. Audio is returned as a **base64-encoded data URL** (e.g., `data:audio/mp3;base64,...`).
4. The data URL is converted to a file and saved to disk.
5. The output JSON/JSONL record contains the absolute path to the saved audio file.
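For reference, a data URL of that shape can be decoded back to raw bytes with just the standard library. A minimal sketch of the step the pipeline performs for you (the function name is illustrative):

```python
import base64

def data_url_to_file(data_url: str, out_path: str) -> None:
    """Decode a `data:audio/...;base64,...` URL and write the raw audio bytes to disk."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:audio/"):
        raise ValueError("expected an audio data URL")
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(payload))
```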

## Model Configuration

The model configuration for TTS generation must specify `output_type: audio` and include TTS-specific parameters:

```yaml
tts_openai:
  model: tts
  output_type: audio
  model_type: azure_openai
  api_version: 2025-03-01-preview
  parameters:
    voice: "alloy"
    response_format: "wav"
```
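The API rejects out-of-range parameters, so it can be worth validating them before launching a long run. A hedged client-side sketch (the function name is illustrative; the voice and format lists mirror the OpenAI documentation linked above):

```python
VALID_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
VALID_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def validate_tts_params(voice: str, response_format: str, speed: float) -> None:
    """Fail fast on parameters the TTS endpoint would reject."""
    if voice not in VALID_VOICES:
        raise ValueError(f"Unknown voice: {voice!r}")
    if response_format not in VALID_FORMATS:
        raise ValueError(f"Unknown audio format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
```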

## Example Configuration: Audiobook Generation

```yaml
data_config:
  source:
    type: "disk"
    file_path: "data/chapters.json"

graph_config:
  nodes:
    generate_chapter_audio:
      node_type: llm
      output_keys: audio
      prompt:
        - user: |
            "{chapter_text}"
      model:
        parameters:
          voice: nova
          response_format: mp3
          speed: 1.0

  edges:
    - from: START
      to: generate_chapter_audio
    - from: generate_chapter_audio
      to: END

output_config:
  output_map:
    id:
      from: "id"
    chapter_number:
      from: "chapter_number"
    chapter_text:
      from: "chapter_text"
    audio:
      from: "audio"
```

### Input Data (`data/chapters.json`)

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night..."
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies..."
  }
]
```

### Output

```json
[
  {
    "id": "1",
    "chapter_number": 1,
    "chapter_text": "Chapter One: The Beginning. It was a dark and stormy night...",
    "audio": "/path/to/multimodal_output/audio/1_audio_0.mp3"
  },
  {
    "id": "2",
    "chapter_number": 2,
    "chapter_text": "Chapter Two: The Journey. The next morning brought clear skies...",
    "audio": "/path/to/multimodal_output/audio/2_audio_0.mp3"
  }
]
```
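Because each record now carries a file path rather than inline audio, downstream validation is simple. A small sketch that checks every referenced audio file exists (the output file name is an assumption):

```python
import json
from pathlib import Path

records = json.loads(Path("output.json").read_text())  # hypothetical output file name
missing = [r["id"] for r in records if not Path(r["audio"]).is_file()]
print(f"{len(records) - len(missing)}/{len(records)} audio files present; missing: {missing}")
```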

---

## Notes

- **Text-to-speech generation is currently only supported for OpenAI TTS models.** Support for additional providers may be added in future releases.
- The `output_type` in the model configuration must be set to `audio` to enable TTS generation.
- Audio files are automatically saved and managed.

---

## See Also

- [Audio to Text](./audio_to_text.md) - For speech recognition and audio transcription
- [Image to Text](./image_to_text.md) - For vision-based multimodal pipelines
- [OpenAI TTS Documentation](https://platform.openai.com/docs/guides/text-to-speech) - Official OpenAI TTS API reference
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -36,6 +36,7 @@ nav:
  - Multimodal:
      - Audio to Text: concepts/multimodal/audio_to_text.md
      - Image to Text: concepts/multimodal/image_to_text.md
      - Text to Speech: concepts/multimodal/text_to_speech.md
  - Nodes:
      - Agent Node: concepts/nodes/agent_node.md
      - Lambda Node: concepts/nodes/lambda_node.md
11 changes: 11 additions & 0 deletions sygra/config/models.yaml
@@ -78,3 +78,14 @@ qwen3_1.7b:
  post_process: sygra.core.models.model_postprocessor.RemoveThinkData
  parameters:
    temperature: 0.8

# TTS OpenAI model
tts_openai:
  model: tts
  output_type: audio  # This triggers TTS functionality
  model_type: azure_openai  # Use the azure_openai or openai model type
  api_version: 2025-03-01-preview
  # URL and API key must be defined in the .env file as SYGRA_TTS_OPENAI_URL and SYGRA_TTS_OPENAI_TOKEN
  parameters:
    voice: "alloy"
    response_format: "wav"
14 changes: 13 additions & 1 deletion sygra/core/dataset/dataset_processor.py
@@ -3,6 +3,7 @@
import signal
import time
import uuid
from pathlib import Path
from typing import Any, Callable, Optional, Union, cast

import datasets # type: ignore[import-untyped]
@@ -13,7 +14,7 @@
from sygra.core.resumable_execution import ResumableExecutionManager
from sygra.data_mapper.mapper import DataMapper
from sygra.logger.logger_config import logger
from sygra.utils import constants, graph_utils, utils
from sygra.utils import constants, graph_utils, multimodal_processor, utils
from sygra.validators.schema_validator_base import SchemaValidator


@@ -286,6 +287,17 @@ async def _write_checkpoint(self, is_oasst_mapper_required: bool) -> None:
                self.graph_results, self.output_record_generator
            )

        # Process multimodal data: save base64 data URLs to files and replace with file paths
        try:
            multimodal_output_dir = ".".join(self.output_file.split(".")[:-1])
            output_records = multimodal_processor.process_batch_multimodal_data(
                output_records, Path(multimodal_output_dir)
            )
        except Exception as e:
            logger.warning(
                f"Failed to process multimodal data: {e}. Continuing with original records."
            )

        # Handle intermediate writing if needed
        if (
            is_oasst_mapper_required
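Conceptually, `process_batch_multimodal_data` walks each output record, writes any base64 audio data URL to a file under the derived output directory, and replaces the field value with the saved path. A rough illustration of that contract, not the actual implementation in `sygra.utils.multimodal_processor`:

```python
import base64
from pathlib import Path

def process_record(record: dict, out_dir: Path) -> dict:
    """Illustrative only: replace audio data URLs in a record with saved file paths."""
    for key, value in record.items():
        if isinstance(value, str) and value.startswith("data:audio/"):
            fmt = value.split(";")[0].split("/")[1]  # e.g. "mp3"
            path = out_dir / "audio" / f"{record.get('id', 'rec')}_{key}_0.{fmt}"
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(base64.b64decode(value.split(",", 1)[1]))
            record[key] = str(path.resolve())
    return record
```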
38 changes: 38 additions & 0 deletions sygra/core/models/client/openai_azure_client.py
@@ -160,3 +160,41 @@ def send_request(
            return client.chat.completions.create(**payload, model=model_name, **generation_params)
        else:
            return client.completions.create(**payload, model=model_name, **generation_params)

    async def create_speech(
        self,
        model: str,
        input: str,
        voice: str,
        response_format: str = "mp3",
        speed: float = 1.0,
    ) -> Any:
        """
        Create speech audio from text using Azure OpenAI's text-to-speech API.

        Args:
            model (str): The TTS model deployment name (e.g., 'tts-1', 'tts-1-hd')
            input (str): The text to convert to speech
            voice (str): The voice to use (e.g., alloy, echo, fable, onyx, nova, shimmer)
            response_format (str, optional): The audio format (mp3, opus, aac, flac, wav, or pcm). Defaults to 'mp3'
            speed (float, optional): The playback speed (0.25 to 4.0). Defaults to 1.0

        Returns:
            Any: The audio response from the API

        Raises:
            ValueError: If async_client is False (TTS requires async client)
        """
        if not self.async_client:
            raise ValueError(
                "TTS API requires async client. Please initialize with async_client=True"
            )

        client = cast(Any, self.client)
        return await client.audio.speech.create(
            model=model,
            input=input,
            voice=voice,
            response_format=response_format,
            speed=speed,
        )
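A minimal usage sketch for this method; the same call shape applies to the plain OpenAI client below. Client construction is assumed to have happened elsewhere with `async_client=True`, and reading `.content` off the SDK's binary response is how the raw bytes are obtained:

```python
# Illustrative usage, assuming `client` is an async-initialized client wrapper.
import asyncio

async def synthesize(client) -> None:
    response = await client.create_speech(
        model="tts",  # Azure deployment name from models.yaml
        input="It was a dark and stormy night...",
        voice="alloy",
        response_format="wav",
        speed=1.0,
    )
    with open("sample.wav", "wb") as f:
        f.write(response.content)  # raw audio bytes from the SDK response

# asyncio.run(synthesize(client))
```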
38 changes: 38 additions & 0 deletions sygra/core/models/client/openai_client.py
@@ -182,3 +182,41 @@ def send_request(
            extra_body=additional_params,
            **standard_params,
        )

    async def create_speech(
        self,
        model: str,
        input: str,
        voice: str,
        response_format: str = "mp3",
        speed: float = 1.0,
    ) -> Any:
        """
        Create speech audio from text using OpenAI's text-to-speech API.

        Args:
            model (str): The TTS model to use (e.g., 'tts-1', 'tts-1-hd')
            input (str): The text to convert to speech
            voice (str): The voice to use (e.g., alloy, echo, fable, onyx, nova, shimmer)
            response_format (str, optional): The audio format (mp3, opus, aac, flac, wav, or pcm). Defaults to 'mp3'
            speed (float, optional): The playback speed (0.25 to 4.0). Defaults to 1.0

        Returns:
            Any: The audio response from the API

        Raises:
            ValueError: If async_client is False (TTS requires async client)
        """
        if not self.async_client:
            raise ValueError(
                "TTS API requires async client. Please initialize with async_client=True"
            )

        client = cast(Any, self.client)
        return await client.audio.speech.create(
            model=model,
            input=input,
            voice=voice,
            response_format=response_format,
            speed=speed,
        )