
Commit d72998a

Authored by amitsnowp, sriramsnc, and vipul-mittal
[Enhancement] Audio Transcription Support (#63)
* Added VLLM, OpenAI litellm models
* Updated VLLM, OpenAI, AzureOpenAI litellm models
* Fixed imports VLLM litellm model
* Updated Model factory to set litellm backend for supported models and updated test cases
* Updated Documentation
* Added tests cases for litellm models
* Updated readme
* Support for gpt transcription apis
* add missing _generate_audio_chat_completion for openai lite_llm integration
* Remove local path info from gpt-4o-audio example
* Updates to test message to handle transcription models and introduction of input_type to route to transcription api
* fix langgraph factory tests
* fix custom litellm vllm tests
* format fixes
* lint fixes
* fix formatter issues
* Readme update for transcription API
* format fixes
* Add reference for gpt-4o-audio in audio_to_text.md
* Update sygra/core/models/custom_models.py

  Co-authored-by: Sriram Puttagunta <[email protected]>

* Review comments
* lint fixes
* transcription refactor based on new code
* transcription refactor based on new code
* test fixes
* format fixes
* doc update
* Fix lint issues for unused imports and unit tests
* Adding clearer logging for extracting image and audio urls
* utils refactor and test fixes

---------

Co-authored-by: sriram.puttagunta <[email protected]>
Co-authored-by: Vipul Mittal <[email protected]>
1 parent 907c2e5 commit d72998a

File tree

17 files changed (+1622, -195 lines)


docs/concepts/multimodal/audio_to_text.md

Lines changed: 325 additions & 8 deletions
@@ -1,17 +1,52 @@
# Audio to Text Data Generation

This module introduces support for multimodal data generation pipelines that convert **audio** to **text**. SyGra supports two distinct approaches for audio-to-text conversion:

1. **Audio Understanding LLMs** - Models like `Qwen2-Audio-7B` that can reason about, analyze, and answer questions about audio content
2. **Dedicated Transcription Models** - Models like `Whisper` and `gpt-4o-transcribe` optimized specifically for accurate speech-to-text conversion

> **Note:**
> For gpt-4o-audio multimodal generation, see the [GPT-4o Audio](./gpt_4o_audio.md) documentation.

## Key Features

### Audio Understanding LLMs
- Supports **audio-only** and **audio+text** prompts
- Audio reasoning, classification, and Q&A capabilities
- Uses the standard chat completions API
- Contextual understanding of audio content

### Dedicated Transcription Models
- Accurate speech-to-text conversion
- Multilingual support (50+ languages)
- Multiple output formats (JSON, SRT, VTT, text)
- Word- and segment-level timestamps
- Optimized for transcription accuracy

### Common Features
- Converts audio fields into **base64-encoded data URLs** compatible with LLM APIs (see the sketch after this list)
- Compatible with HuggingFace datasets, streaming, and on-disk formats
- Automatically handles **lists of audio** per field
- Seamless round-tripping between loading, prompting, and output publishing
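
For intuition, the data-URL conversion works roughly like the minimal sketch below. This is illustrative only; the helper name and fallback MIME type are assumptions, not SyGra's actual implementation.

```python
import base64
import mimetypes


def audio_to_data_url(path: str) -> str:
    """Encode a local audio file as a base64 data URL (illustrative only)."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "audio/wav"  # fall back if the type cannot be guessed
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# e.g. audio_to_data_url("meeting_recording.mp3") -> "data:audio/mpeg;base64,..."
```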
## Choosing the Right Approach

| Use Case | Recommended Approach |
|----------|---------------------|
| Accurate speech-to-text transcription | **Transcription Models** |
| Generating subtitles with timestamps | **Transcription Models** |
| Multilingual transcription | **Transcription Models** |
| Audio classification or event detection | **Audio Understanding LLMs** |
| Answering questions about audio | **Audio Understanding LLMs** |
| Audio reasoning or analysis | **Audio Understanding LLMs** |
| Combining audio with text context | **Audio Understanding LLMs** |

---

# Part 1: Audio Understanding with LLMs

This section covers audio understanding using LLMs like `Qwen2-Audio-7B` that can reason about audio content.

## Supported Audio Input Types

Each audio field in a dataset record may be one of the following:

@@ -202,9 +237,291 @@ output_config:
      from: "animal"
```

---

# Part 2: Speech-to-Text Transcription

This section covers dedicated transcription models optimized for accurate speech-to-text conversion.

## Supported Transcription Models

- `whisper-1` - OpenAI's Whisper model, general-purpose transcription
- `gpt-4o-transcribe` - OpenAI's GPT-4o-based transcription model with improved accuracy

## Transcription Model Configuration

Configure the transcription model in your `sygra/config/models.yaml`:

```yaml
transcribe:
  model: gpt-4o-transcribe        # or whisper-1
  input_type: audio               # Required for transcription routing
  model_type: azure_openai        # or openai
  api_version: 2025-03-01-preview
  # URL and auth_token from environment variables:
  # SYGRA_TRANSCRIBE_URL and SYGRA_TRANSCRIBE_TOKEN
  parameters:
    language: en                  # Optional: ISO-639-1 language code
    response_format: json         # json, verbose_json, text, srt, vtt
    temperature: 0                # 0-1, controls randomness
```

### Critical Configuration: `input_type: audio`

Transcription requires `input_type: audio` in the model configuration to route requests to the transcription API:

```yaml
# ✓ Correct - Routes to transcription API
transcribe:
  model: whisper-1
  input_type: audio
  model_type: openai

# ✗ Incorrect - Will not route to transcription API
transcribe:
  model: whisper-1
  model_type: openai
```
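
Conceptually, `input_type: audio` acts as a switch between OpenAI's `audio.transcriptions.create` and `chat.completions.create` endpoints. The sketch below is purely illustrative; the function name and config shape are assumptions, not SyGra's actual code.

```python
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_model(model_config: dict, audio_file, messages: list[dict]):
    """Route to the transcription API when input_type is 'audio' (illustrative only)."""
    if model_config.get("input_type") == "audio":
        # Dedicated transcription endpoint
        return client.audio.transcriptions.create(
            model=model_config["model"],
            file=audio_file,
            **model_config.get("parameters", {}),
        )
    # Default: standard chat completions endpoint
    return client.chat.completions.create(
        model=model_config["model"],
        messages=messages,
    )
```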

## Supported Languages

Transcription models support 50+ languages, including:

| Language | Code | Language | Code |
|----------|------|----------|------|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Russian | ru |
| Chinese | zh | Japanese | ja |
| Korean | ko | Arabic | ar |
| Hindi | hi | Turkish | tr |

For a complete list, see the [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text).

## Response Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `json` | JSON with transcribed text only | Simple transcription |
| `verbose_json` | JSON with text, timestamps, and metadata | Detailed analysis |
| `text` | Plain text only | Direct text output |
| `srt` | SubRip subtitle format with timestamps | Video subtitles |
| `vtt` | WebVTT subtitle format with timestamps | Web video subtitles |

### Example Outputs

**JSON Format:**
```json
{
  "text": "Hello, how are you today?"
}
```

**Verbose JSON Format:**
```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 2.5,
  "text": "Hello, how are you today?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": " Hello, how are you today?",
      "temperature": 0.0,
      "avg_logprob": -0.2
    }
  ]
}
```

**SRT Format:**
```
1
00:00:00,000 --> 00:00:02,500
Hello, how are you today?
```
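
For illustration, the `verbose_json` segments shown above map onto SRT blocks roughly as follows. This is a minimal sketch for intuition, not part of SyGra.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 2.5 -> '00:00:02,500'."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from verbose_json-style segments (illustrative only)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks)


# segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello, how are you today?"}])
# -> "1\n00:00:00,000 --> 00:00:02,500\nHello, how are you today?"
```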

## Transcription Example Configuration

Based on `tasks/examples/transcription_apis/graph_config.yaml`:

### Input Data (`test.json`)

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3"
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav"
  }
]
```

### Graph Configuration

```yaml
data_config:
  source:
    type: "disk"
    file_path: "tasks/examples/transcription_apis/test.json"

graph_config:
  nodes:
    audio_to_text:
      output_keys: transcription
      node_type: llm
      prompt:
        - user:
            - type: audio_url
              audio_url: "{audio}"
      model:
        name: transcribe

  edges:
    - from: START
      to: audio_to_text
    - from: audio_to_text
      to: END

output_config:
  output_map:
    id:
      from: id
    audio:
      from: audio
    transcription:
      from: transcription
```

### Output

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3",
    "transcription": "Welcome everyone to today's meeting. Let's start with the agenda..."
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav",
    "transcription": "Thank you for joining us today. Can you tell us about your background?"
  }
]
```

## Advanced Transcription Features

### Language Specification

Specifying the language improves accuracy and speed:

```yaml
model:
  name: transcribe
  parameters:
    language: es          # Spanish
    response_format: json
    temperature: 0
```

### Timestamps (Verbose JSON)

For detailed timestamp information:

```yaml
model:
  name: transcribe
  parameters:
    response_format: verbose_json
    timestamp_granularities: ["word", "segment"]  # Word- and segment-level timestamps
```

### Context Prompt

Provide context to improve accuracy on specific terms:

```yaml
prompt:
  - user:
      - type: audio_url
        audio_url: "{audio}"
      - type: text
        text: "The audio contains technical terms like Kubernetes, Docker, and CI/CD."
```

The text prompt is automatically passed as the `prompt` parameter to the transcription API.
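
Under the hood, this corresponds to the `prompt` argument of OpenAI's transcription endpoint. The call below is a hedged illustration of that mapping, not SyGra's actual code; the file name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("interview.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en",
        response_format="json",
        temperature=0,
        # The `type: text` part of the SyGra prompt ends up here:
        prompt="The audio contains technical terms like Kubernetes, Docker, and CI/CD.",
    )

print(result.text)
```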

## Comparison: Transcription vs Audio-Understanding LLMs

| Feature | Transcription Models | Audio LLMs (Qwen2-Audio) |
|---------|---------------------|---------------------------|
| **Primary Use** | Speech-to-text conversion | Audio understanding, reasoning, Q&A |
| **API Endpoint** | `audio.transcriptions.create` | `chat.completions.create` |
| **Output** | Transcribed text only | Contextual text responses |
| **Timestamps** | Yes (word/segment level) | No |
| **Multiple Formats** | Yes (JSON, SRT, VTT, text) | No (text only) |
| **Language Support** | 50+ languages | Varies by model |
| **Best For** | Accurate transcription, subtitles | Audio reasoning, classification, Q&A |
| **Configuration** | `input_type: audio` required | Standard LLM config |
| **Supported Audio** | MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, FLAC, OGG | Same |

## Best Practices for Transcription

### 1. Language Specification
Always specify the language if it is known:
```yaml
parameters:
  language: en  # or es, fr, de, etc.
```

### 2. Temperature Setting
Use temperature 0 for deterministic transcription:
```yaml
parameters:
  temperature: 0  # Recommended for transcription
```

### 3. Audio Quality
- Use high-quality audio files (16 kHz or higher sample rate)
- Minimize background noise for better accuracy
- Ensure clear speech with minimal overlapping speakers

### 4. Context Prompts
Provide context for technical terms or specific vocabulary:
```yaml
- type: text
  text: "This audio discusses machine learning models including BERT, GPT, and transformers."
```

### 5. File Size Limits
- Maximum audio file size: 25 MB (OpenAI limit)
- For longer audio, split it into chunks before transcription (see the sketch below)
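
A minimal chunking sketch, assuming the third-party `pydub` package (and `ffmpeg`) is available; the chunk length and file naming here are arbitrary choices, not SyGra conventions.

```python
from pydub import AudioSegment  # third-party; needs ffmpeg for mp3 support


def split_audio(path: str, chunk_minutes: int = 10) -> list[str]:
    """Split an audio file into fixed-length chunks and return the chunk paths."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"{path}.part{i}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        paths.append(chunk_path)
    return paths


# Each resulting chunk can then be listed as a separate record in test.json.
```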

---

## Notes

- **Audio generation is not supported** in this module. The `audio_url` type is strictly for passing existing audio inputs (e.g., loaded from datasets), not for generating new audio via model output.
- **Transcription models** require `input_type: audio` in the model configuration to route to the transcription API.
- For audio understanding LLM examples, see: [`tasks/examples/audio_to_text`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/audio_to_text)
- For transcription examples, see: [`tasks/examples/transcription_apis`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/transcription_apis)

---

## See Also

- [GPT-4o Audio](./gpt_4o_audio.md) - Multimodal audio generation and understanding with GPT-4o
- [Text to Speech](./text_to_speech.md) - Text-to-speech generation
- [Image to Text](./image_to_text.md) - Vision-based multimodal pipelines
- [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text) - Official OpenAI Whisper API reference

docs/getting_started/model_configuration.md

Lines changed: 20 additions & 19 deletions
@@ -45,25 +45,26 @@ SYGRA_MIXTRAL_8X7B_CHAT_TEMPLATE={% for m in messages %} ... {% endfor %}
### Configuration Properties

| Key | Description |
|-----------------------------|-------------|
| `model_type` | Type of backend server (`tgi`, `vllm`, `openai`, `azure_openai`, `azure`, `mistralai`, `ollama`, `triton`, `bedrock`, `vertex_ai`) |
| `model_name` | Model name for your deployments (for Azure/Azure OpenAI) |
| `api_version` | API version for Azure or Azure OpenAI |
| `input_type` | *(Optional)* Type of input the model accepts (default: `text`) <br/> Supported values: `text`, `image`, `audio`.<br/><br/>**`input_type: audio` is mandatory for transcription models** |
| `output_type` | *(Optional)* Type of output the model generates (default: `text`) <br/> Supported values: `text`, `image`, `audio` |
| `backend` | *(Optional)* Backend for the model (default: `litellm` for litellm-supported models, `custom` for other models). Supported values: `litellm`, `custom` |
| `completions_api` | *(Optional)* Boolean: use the completions API instead of the chat completions API (default: false) <br/> Supported models: `tgi`, `vllm`, `ollama` |
| `hf_chat_template_model_id` | *(Optional)* Hugging Face model ID. Make sure to set this when `completions_api` is set to `true` |
| `modify_tokenizer` | *(Optional)* Boolean: apply a custom chat template and modify the base model tokenizer (default: false) |
| `special_tokens` | *(Optional)* List of special stop tokens used in generation |
| `post_process` | *(Optional)* Post-processor applied after model inference (e.g. `models.model_postprocessor.RemoveThinkData`) |
| `parameters` | *(Optional)* Generation parameters (see below) |
| `image_capabilities` | *(Optional)* Image model limits as a dict. Supports `prompt_char_limit` (warn if exceeded) and `max_edit_images` (truncate extra input images). |
| `chat_template_params` | *(Optional)* Chat template parameters (e.g. `reasoning_effort` for `gpt-oss-120b`), used when `completions_api` is enabled |
| `ssl_verify` | *(Optional)* Verify SSL certificate (default: true) |
| `ssl_cert` | *(Optional)* Path to SSL certificate file |
| `json_payload` | *(Optional)* Boolean: use a JSON payload instead of a JSON string for `http client`-based models (default: false) |
| `headers` | *(Optional)* Dictionary of headers to be sent with the request for `http client`-based models |

![Note](https://img.shields.io/badge/Note-important-yellow)
> - Do **not** include `url`, `auth_token`, or `api_key` in your YAML config. These are sourced from environment variables as described above.<br>
> - If you want to set **ssl_verify** to **false** globally, you can set `ssl_verify: false` under the `model_config` section in `config/configuration.yaml`.

0 commit comments
