# Audio to Text Data Generation

This module introduces support for multimodal data generation pipelines that convert **audio** to **text**. SyGra supports two distinct approaches for audio-to-text conversion:

1. **Audio Understanding LLMs** - Models like `Qwen2-Audio-7B` that can reason about, analyze, and answer questions about audio content
2. **Dedicated Transcription Models** - Models like `Whisper` and `gpt-4o-transcribe` optimized specifically for accurate speech-to-text conversion

> **Note:**
> For gpt-4o-audio multimodal generation, see the [GPT-4o Audio](./gpt_4o_audio.md) documentation.

## Key Features

### Audio Understanding LLMs
- Supports **audio-only** and **audio+text** prompts
- Audio reasoning, classification, and Q&A capabilities
- Uses the standard chat completions API
- Contextual understanding of audio content

### Dedicated Transcription Models
- Accurate speech-to-text conversion
- Multilingual support (50+ languages)
- Multiple output formats (JSON, SRT, VTT, text)
- Word- and segment-level timestamps
- Optimized for transcription accuracy

### Common Features
- Converts audio fields into **base64-encoded data URLs** compatible with LLM APIs (see the sketch after this list)
- Compatible with HuggingFace datasets, streaming, and on-disk formats
- Automatically handles **lists of audio** per field
- Seamless round-tripping between loading, prompting, and output publishing
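
Roughly, the data-URL conversion works like the sketch below. This is illustrative only; the function name and MIME fallback are assumptions, not SyGra's exact implementation.

```python
import base64
import mimetypes

def audio_file_to_data_url(path: str) -> str:
    """Encode a local audio file as a base64 data URL (illustrative sketch)."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "audio/wav" or "audio/mpeg"
    mime = mime or "audio/wav"            # assumed fallback when the type can't be guessed
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Example usage (hypothetical path):
# print(audio_file_to_data_url("/path/to/audio/dog_bark.wav")[:80])
```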

## Choosing the Right Approach

| Use Case | Recommended Approach |
|----------|---------------------|
| Accurate speech-to-text transcription | **Transcription Models** |
| Generating subtitles with timestamps | **Transcription Models** |
| Multilingual transcription | **Transcription Models** |
| Audio classification or event detection | **Audio Understanding LLMs** |
| Answering questions about audio | **Audio Understanding LLMs** |
| Audio reasoning or analysis | **Audio Understanding LLMs** |
| Combining audio with text context | **Audio Understanding LLMs** |

---

# Part 1: Audio Understanding with LLMs

This section covers audio understanding using LLMs like `Qwen2-Audio-7B` that can reason about audio content.

## Supported Audio Input Types

Each audio field in a dataset record may be one of the following:

@@ -202,9 +237,291 @@ output_config: |
      from: "animal"
```

---

# Part 2: Speech-to-Text Transcription

This section covers dedicated transcription models optimized for accurate speech-to-text conversion.

## Supported Transcription Models

- `whisper-1` - OpenAI's Whisper model for general-purpose transcription
- `gpt-4o-transcribe` - OpenAI's GPT-4o-based transcription model with improved accuracy

## Transcription Model Configuration

Configure the transcription model in your `sygra/config/models.yaml`:

```yaml
transcribe:
  model: gpt-4o-transcribe          # or whisper-1
  input_type: audio                 # Required for transcription routing
  model_type: azure_openai          # or openai
  api_version: 2025-03-01-preview
  # URL and auth_token are taken from the environment variables
  # SYGRA_TRANSCRIBE_URL and SYGRA_TRANSCRIBE_TOKEN
  parameters:
    language: en                    # Optional: ISO-639-1 language code
    response_format: json           # json, verbose_json, text, srt, vtt
    temperature: 0                  # 0-1, controls randomness
```

### Critical Configuration: `input_type: audio`

Transcription requires `input_type: audio` in the model configuration so that requests are routed to the transcription API:

```yaml
# ✓ Correct - routes to the transcription API
transcribe:
  model: whisper-1
  input_type: audio
  model_type: openai

# ✗ Incorrect - will not route to the transcription API
transcribe:
  model: whisper-1
  model_type: openai
```
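
In effect, the flag selects which OpenAI endpoint serves the request. The sketch below shows the transcription call it maps to, using the OpenAI Python SDK directly; the file path is hypothetical and this is not SyGra's internal code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# With `input_type: audio`, the request is served by the transcription
# endpoint, conceptually equivalent to:
with open("/path/to/audio/interview.wav", "rb") as f:  # hypothetical path
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",
        response_format="json",
        temperature=0,
    )
print(result.text)

# Without `input_type: audio`, the same model name would be sent to
# client.chat.completions.create(...), which is the wrong API for whisper-1.
```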

## Supported Languages

Transcription models support 50+ languages including:

| Language | Code | Language | Code |
|----------|------|----------|------|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Russian | ru |
| Chinese | zh | Japanese | ja |
| Korean | ko | Arabic | ar |
| Hindi | hi | Turkish | tr |

For a complete list, see [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text).

## Response Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `json` | JSON with transcribed text only | Simple transcription |
| `verbose_json` | JSON with text, timestamps, and metadata | Detailed analysis |
| `text` | Plain text only | Direct text output |
| `srt` | SubRip subtitle format with timestamps | Video subtitles |
| `vtt` | WebVTT subtitle format with timestamps | Web video subtitles |

### Example Outputs

**JSON Format:**
```json
{
  "text": "Hello, how are you today?"
}
```

**Verbose JSON Format:**
```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 2.5,
  "text": "Hello, how are you today?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": " Hello, how are you today?",
      "temperature": 0.0,
      "avg_logprob": -0.2
    }
  ]
}
```
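
The `avg_logprob` value in each segment can serve as a rough confidence signal. A minimal sketch; the file path and the -0.5 threshold are arbitrary assumptions.

```python
import json
import math

# A verbose_json response saved to disk (hypothetical path)
with open("transcription_verbose.json") as f:
    result = json.load(f)

for seg in result["segments"]:
    # exp(avg_logprob) gives a crude 0-1 confidence proxy for the segment
    confidence = math.exp(seg["avg_logprob"])
    flag = "LOW" if seg["avg_logprob"] < -0.5 else "ok"
    print(f"[{seg['start']:6.2f}-{seg['end']:6.2f}] {flag} ({confidence:.2f}) {seg['text'].strip()}")
```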

**SRT Format:**
```
1
00:00:00,000 --> 00:00:02,500
Hello, how are you today?
```
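
If you need more control than the built-in `srt` format provides, you can also build SRT yourself from a `verbose_json` response. A minimal sketch, assuming the segment fields shown above:

```python
import json

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 2.5 -> 00:00:02,500."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Convert verbose_json segments into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# Example usage with a saved verbose_json response (hypothetical path):
# with open("transcription_verbose.json") as f:
#     print(segments_to_srt(json.load(f)["segments"]))
```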

## Transcription Example Configuration

Based on `tasks/examples/transcription_apis/graph_config.yaml`:

### Input Data (`test.json`)

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3"
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav"
  }
]
```
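
A small helper like the one below can generate such a manifest from a directory of audio files. The directory layout and extension filter are assumptions; adjust them to your data.

```python
import json
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

def build_manifest(audio_dir: str, output_path: str) -> None:
    """Write a test.json-style manifest listing every audio file in audio_dir."""
    audio_files = sorted(
        p for p in Path(audio_dir).iterdir() if p.suffix.lower() in AUDIO_EXTENSIONS
    )
    records = [
        {"id": str(i), "audio": str(path.resolve())}
        for i, path in enumerate(audio_files, start=1)
    ]
    with open(output_path, "w") as f:
        json.dump(records, f, indent=2)

# build_manifest("/path/to/audio", "tasks/examples/transcription_apis/test.json")
```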

### Graph Configuration

```yaml
data_config:
  source:
    type: "disk"
    file_path: "tasks/examples/transcription_apis/test.json"

graph_config:
  nodes:
    audio_to_text:
      output_keys: transcription
      node_type: llm
      prompt:
        - user:
            - type: audio_url
              audio_url: "{audio}"
      model:
        name: transcribe

  edges:
    - from: START
      to: audio_to_text
    - from: audio_to_text
      to: END

output_config:
  output_map:
    id:
      from: id
    audio:
      from: audio
    transcription:
      from: transcription
```

### Output

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3",
    "transcription": "Welcome everyone to today's meeting. Let's start with the agenda..."
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav",
    "transcription": "Thank you for joining us today. Can you tell us about your background?"
  }
]
```
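
After a run, a quick pass over the output file can catch empty or failed transcriptions. A minimal sketch; the output path is an assumption, adjust it to your task's configured location.

```python
import json

# Generated output file (assumed location)
with open("tasks/examples/transcription_apis/output.json") as f:
    records = json.load(f)

empty_ids = [r["id"] for r in records if not r.get("transcription", "").strip()]
total_words = sum(len(r.get("transcription", "").split()) for r in records)

print(f"{len(records)} records, {total_words} transcribed words in total")
if empty_ids:
    print(f"Records with empty transcriptions: {empty_ids}")
```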

## Advanced Transcription Features

### Language Specification

Specifying the language improves accuracy and speed:

```yaml
model:
  name: transcribe
  parameters:
    language: es  # Spanish
    response_format: json
    temperature: 0
```

### Timestamps (Verbose JSON)

For detailed timestamp information:

```yaml
model:
  name: transcribe
  parameters:
    response_format: verbose_json
    timestamp_granularities: ["word", "segment"]  # Word- and segment-level timestamps
```
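
For reference, word-level timestamps surface in the response roughly as shown below when the OpenAI SDK is called directly with `whisper-1`; this is a sketch of the underlying API, not SyGra's internal call, and the file path is hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("/path/to/audio/interview.wav", "rb") as f:  # hypothetical path
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries a word plus its start/end time in seconds
for word in result.words:
    print(f"{word.start:6.2f}-{word.end:6.2f}  {word.word}")
```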

### Context Prompt

Provide context to improve accuracy on specific terms:

```yaml
prompt:
  - user:
      - type: audio_url
        audio_url: "{audio}"
      - type: text
        text: "The audio contains technical terms like Kubernetes, Docker, and CI/CD."
```

The text prompt is automatically passed as the `prompt` parameter to the transcription API.

## Comparison: Transcription Models vs. Audio Understanding LLMs

| Feature | Transcription Models | Audio LLMs (Qwen2-Audio) |
|---------|---------------------|---------------------------|
| **Primary Use** | Speech-to-text conversion | Audio understanding, reasoning, Q&A |
| **API Endpoint** | `audio.transcriptions.create` | `chat.completions.create` |
| **Output** | Transcribed text only | Contextual text responses |
| **Timestamps** | Yes (word/segment level) | No |
| **Multiple Formats** | Yes (JSON, SRT, VTT, text) | No (text only) |
| **Language Support** | 50+ languages | Varies by model |
| **Best For** | Accurate transcription, subtitles | Audio reasoning, classification, Q&A |
| **Configuration** | `input_type: audio` required | Standard LLM config |
| **Supported Audio** | MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, FLAC, OGG | Same |

## Best Practices for Transcription

### 1. Language Specification
Always specify the language if known:
```yaml
parameters:
  language: en  # or es, fr, de, etc.
```

### 2. Temperature Setting
Use `temperature: 0` for deterministic transcription:
```yaml
parameters:
  temperature: 0  # Recommended for transcription
```

### 3. Audio Quality
- Use high-quality audio files (a 16 kHz or higher sample rate; see the sketch after this list for a quick WAV check)
- Minimize background noise for better accuracy
- Ensure clear speech with minimal overlapping speakers
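
For WAV files, the sample rate can be checked with the standard library before sending audio for transcription. A sketch; other formats need an external tool or library such as ffprobe or soundfile, and the path below is hypothetical.

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sample rate of a WAV file in Hz."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

rate = wav_sample_rate("/path/to/audio/interview.wav")  # hypothetical path
if rate < 16_000:
    print(f"Warning: sample rate is only {rate} Hz; transcription accuracy may suffer")
```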

### 4. Context Prompts
Provide context for technical terms or specific vocabulary:
```yaml
- type: text
  text: "This audio discusses machine learning models including BERT, GPT, and transformers."
```

### 5. File Size Limits
- Maximum audio file size: 25 MB (OpenAI limit)
- For longer audio, split it into chunks before transcription, as in the sketch below
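
One way to chunk long recordings is with `pydub`, assuming `pydub` and `ffmpeg` are installed; the ten-minute chunk length is arbitrary.

```python
from pathlib import Path
from pydub import AudioSegment  # requires pydub and ffmpeg

def split_audio(path: str, out_dir: str, chunk_minutes: int = 10) -> list[str]:
    """Split an audio file into fixed-length chunks and return the chunk paths."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = out / f"{Path(path).stem}_part{i:03d}.mp3"
        audio[start:start + chunk_ms].export(str(chunk_path), format="mp3")
        chunk_paths.append(str(chunk_path))
    return chunk_paths

# split_audio("/path/to/audio/long_meeting.mp3", "/path/to/audio/chunks")  # hypothetical paths
```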

---

## Notes

- **Audio generation is not supported** in this module. The `audio_url` type is strictly for passing existing audio inputs (e.g., loaded from datasets), not for generating new audio via model output.
- **Transcription models** require `input_type: audio` in the model configuration to route requests to the transcription API.
- For audio understanding LLM examples, see [`tasks/examples/audio_to_text`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/audio_to_text).
- For transcription examples, see [`tasks/examples/transcription_apis`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/transcription_apis).

---

## See Also

- [GPT-4o Audio](./gpt_4o_audio.md) - Multimodal audio generation and understanding with GPT-4o
- [Text to Speech](./text_to_speech.md) - Text-to-speech generation
- [Image to Text](./image_to_text.md) - Vision-based multimodal pipelines
- [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text) - Official OpenAI Whisper API reference