# Audio to Text Data Generation

This module introduces support for multimodal data generation pipelines that convert **audio** to **text**. SyGra supports two distinct approaches for audio-to-text conversion:

1. **Audio Understanding LLMs** - Models like `Qwen2-Audio-7B` that can reason about, analyze, and answer questions about audio content
2. **Dedicated Transcription Models** - Models like `Whisper` and `gpt-4o-transcribe` optimized specifically for accurate speech-to-text conversion

> **Note:**
> For gpt-4o-audio multimodal generation, see the [GPT-4o Audio](./gpt_4o_audio.md) documentation.

## Key Features

### Audio Understanding LLMs
- Supports **audio-only** and **audio+text** prompts
- Audio reasoning, classification, and Q&A capabilities
- Uses the standard chat completions API
- Contextual understanding of audio content

### Dedicated Transcription Models
- Accurate speech-to-text conversion
- Multilingual support (50+ languages)
- Multiple output formats (JSON, SRT, VTT, text)
- Word- and segment-level timestamps
- Optimized for transcription accuracy

### Common Features
- Converts audio fields into **base64-encoded data URLs** compatible with LLM APIs (see the sketch after this list)
- Compatible with HuggingFace datasets, streaming, and on-disk formats
- Automatically handles **lists of audio** per field
- Seamless round-tripping between loading, prompting, and output publishing
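
Roughly, the data-URL conversion works like the sketch below. This is illustrative only; the function name and MIME fallback are assumptions, not SyGra's exact implementation.

```python
import base64
import mimetypes

def audio_file_to_data_url(path: str) -> str:
    """Encode a local audio file as a base64 data URL (illustrative sketch)."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "audio/wav" or "audio/mpeg"
    mime = mime or "audio/wav"            # assumed fallback when the type can't be guessed
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Example usage (hypothetical path):
# print(audio_file_to_data_url("/path/to/audio/dog_bark.wav")[:80])
```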

## Choosing the Right Approach

| Use Case | Recommended Approach |
|----------|---------------------|
| Accurate speech-to-text transcription | **Transcription Models** |
| Generating subtitles with timestamps | **Transcription Models** |
| Multilingual transcription | **Transcription Models** |
| Audio classification or event detection | **Audio Understanding LLMs** |
| Answering questions about audio | **Audio Understanding LLMs** |
| Audio reasoning or analysis | **Audio Understanding LLMs** |
| Combining audio with text context | **Audio Understanding LLMs** |

---

# Part 1: Audio Understanding with LLMs

This section covers audio understanding using LLMs like `Qwen2-Audio-7B` that can reason about audio content.

## Supported Audio Input Types

Each audio field in a dataset record may be one of the following:

@@ -202,9 +237,291 @@ output_config: |
      from: "animal"
```

---

# Part 2: Speech-to-Text Transcription

This section covers dedicated transcription models optimized for accurate speech-to-text conversion.

## Supported Transcription Models

- `whisper-1` - OpenAI's Whisper model for general-purpose transcription
- `gpt-4o-transcribe` - OpenAI's GPT-4o-based transcription model with improved accuracy

## Transcription Model Configuration

Configure the transcription model in your `sygra/config/models.yaml`:

```yaml
transcribe:
  model: gpt-4o-transcribe          # or whisper-1
  input_type: audio                 # Required for transcription routing
  model_type: azure_openai          # or openai
  api_version: 2025-03-01-preview
  # URL and auth_token are taken from the environment variables
  # SYGRA_TRANSCRIBE_URL and SYGRA_TRANSCRIBE_TOKEN
  parameters:
    language: en                    # Optional: ISO-639-1 language code
    response_format: json           # json, verbose_json, text, srt, vtt
    temperature: 0                  # 0-1, controls randomness
```

### Critical Configuration: `input_type: audio`

Transcription requires `input_type: audio` in the model configuration so that requests are routed to the transcription API:

```yaml
# ✓ Correct - routes to the transcription API
transcribe:
  model: whisper-1
  input_type: audio
  model_type: openai

# ✗ Incorrect - will not route to the transcription API
transcribe:
  model: whisper-1
  model_type: openai
```
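
In effect, the flag selects which OpenAI endpoint serves the request. The sketch below shows the transcription call it maps to, using the OpenAI Python SDK directly; the file path is hypothetical and this is not SyGra's internal code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# With `input_type: audio`, the request is served by the transcription
# endpoint, conceptually equivalent to:
with open("/path/to/audio/interview.wav", "rb") as f:  # hypothetical path
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",
        response_format="json",
        temperature=0,
    )
print(result.text)

# Without `input_type: audio`, the same model name would be sent to
# client.chat.completions.create(...), which is the wrong API for whisper-1.
```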

## Supported Languages

Transcription models support 50+ languages including:

| Language | Code | Language | Code |
|----------|------|----------|------|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Russian | ru |
| Chinese | zh | Japanese | ja |
| Korean | ko | Arabic | ar |
| Hindi | hi | Turkish | tr |

For a complete list, see [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text).

## Response Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `json` | JSON with transcribed text only | Simple transcription |
| `verbose_json` | JSON with text, timestamps, and metadata | Detailed analysis |
| `text` | Plain text only | Direct text output |
| `srt` | SubRip subtitle format with timestamps | Video subtitles |
| `vtt` | WebVTT subtitle format with timestamps | Web video subtitles |

### Example Outputs

**JSON Format:**
```json
{
  "text": "Hello, how are you today?"
}
```

**Verbose JSON Format:**
```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 2.5,
  "text": "Hello, how are you today?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": " Hello, how are you today?",
      "temperature": 0.0,
      "avg_logprob": -0.2
    }
  ]
}
```
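
The `avg_logprob` value in each segment can serve as a rough confidence signal. A minimal sketch; the file path and the -0.5 threshold are arbitrary assumptions.

```python
import json
import math

# A verbose_json response saved to disk (hypothetical path)
with open("transcription_verbose.json") as f:
    result = json.load(f)

for seg in result["segments"]:
    # exp(avg_logprob) gives a crude 0-1 confidence proxy for the segment
    confidence = math.exp(seg["avg_logprob"])
    flag = "LOW" if seg["avg_logprob"] < -0.5 else "ok"
    print(f"[{seg['start']:6.2f}-{seg['end']:6.2f}] {flag} ({confidence:.2f}) {seg['text'].strip()}")
```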

**SRT Format:**
```
1
00:00:00,000 --> 00:00:02,500
Hello, how are you today?
```
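
If you need more control than the built-in `srt` format provides, you can also build SRT yourself from a `verbose_json` response. A minimal sketch, assuming the segment fields shown above:

```python
import json

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 2.5 -> 00:00:02,500."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Convert verbose_json segments into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# Example usage with a saved verbose_json response (hypothetical path):
# with open("transcription_verbose.json") as f:
#     print(segments_to_srt(json.load(f)["segments"]))
```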

## Transcription Example Configuration

Based on `tasks/examples/transcription_apis/graph_config.yaml`:

### Input Data (`test.json`)

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3"
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav"
  }
]
```
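
A small helper like the one below can generate such a manifest from a directory of audio files. The directory layout and extension filter are assumptions; adjust them to your data.

```python
import json
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

def build_manifest(audio_dir: str, output_path: str) -> None:
    """Write a test.json-style manifest listing every audio file in audio_dir."""
    audio_files = sorted(
        p for p in Path(audio_dir).iterdir() if p.suffix.lower() in AUDIO_EXTENSIONS
    )
    records = [
        {"id": str(i), "audio": str(path.resolve())}
        for i, path in enumerate(audio_files, start=1)
    ]
    with open(output_path, "w") as f:
        json.dump(records, f, indent=2)

# build_manifest("/path/to/audio", "tasks/examples/transcription_apis/test.json")
```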

### Graph Configuration

```yaml
data_config:
  source:
    type: "disk"
    file_path: "tasks/examples/transcription_apis/test.json"

graph_config:
  nodes:
    audio_to_text:
      output_keys: transcription
      node_type: llm
      prompt:
        - user:
            - type: audio_url
              audio_url: "{audio}"
      model:
        name: transcribe

  edges:
    - from: START
      to: audio_to_text
    - from: audio_to_text
      to: END

output_config:
  output_map:
    id:
      from: id
    audio:
      from: audio
    transcription:
      from: transcription
```

### Output

```json
[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3",
    "transcription": "Welcome everyone to today's meeting. Let's start with the agenda..."
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav",
    "transcription": "Thank you for joining us today. Can you tell us about your background?"
  }
]
```
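
After a run, a quick pass over the output file can catch empty or failed transcriptions. A minimal sketch; the output path is an assumption, adjust it to your task's configured location.

```python
import json

# Generated output file (assumed location)
with open("tasks/examples/transcription_apis/output.json") as f:
    records = json.load(f)

empty_ids = [r["id"] for r in records if not r.get("transcription", "").strip()]
total_words = sum(len(r.get("transcription", "").split()) for r in records)

print(f"{len(records)} records, {total_words} transcribed words in total")
if empty_ids:
    print(f"Records with empty transcriptions: {empty_ids}")
```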

## Advanced Transcription Features

### Language Specification

Specifying the language improves accuracy and speed:

```yaml
model:
  name: transcribe
  parameters:
    language: es  # Spanish
    response_format: json
    temperature: 0
```

### Timestamps (Verbose JSON)

For detailed timestamp information:

```yaml
model:
  name: transcribe
  parameters:
    response_format: verbose_json
    timestamp_granularities: ["word", "segment"]  # Word- and segment-level timestamps
```
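
For reference, word-level timestamps surface in the response roughly as shown below when the OpenAI SDK is called directly with `whisper-1`; this is a sketch of the underlying API, not SyGra's internal call, and the file path is hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("/path/to/audio/interview.wav", "rb") as f:  # hypothetical path
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries a word plus its start/end time in seconds
for word in result.words:
    print(f"{word.start:6.2f}-{word.end:6.2f}  {word.word}")
```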

### Context Prompt

Provide context to improve accuracy on specific terms:

```yaml
prompt:
  - user:
      - type: audio_url
        audio_url: "{audio}"
      - type: text
        text: "The audio contains technical terms like Kubernetes, Docker, and CI/CD."
```

The text prompt is automatically passed as the `prompt` parameter to the transcription API.

## Comparison: Transcription Models vs. Audio Understanding LLMs

| Feature | Transcription Models | Audio LLMs (Qwen2-Audio) |
|---------|---------------------|---------------------------|
| **Primary Use** | Speech-to-text conversion | Audio understanding, reasoning, Q&A |
| **API Endpoint** | `audio.transcriptions.create` | `chat.completions.create` |
| **Output** | Transcribed text only | Contextual text responses |
| **Timestamps** | Yes (word/segment level) | No |
| **Multiple Formats** | Yes (JSON, SRT, VTT, text) | No (text only) |
| **Language Support** | 50+ languages | Varies by model |
| **Best For** | Accurate transcription, subtitles | Audio reasoning, classification, Q&A |
| **Configuration** | `input_type: audio` required | Standard LLM config |
| **Supported Audio** | MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, FLAC, OGG | Same |

## Best Practices for Transcription

### 1. Language Specification
Always specify the language if known:
```yaml
parameters:
  language: en  # or es, fr, de, etc.
```

### 2. Temperature Setting
Use `temperature: 0` for deterministic transcription:
```yaml
parameters:
  temperature: 0  # Recommended for transcription
```

### 3. Audio Quality
- Use high-quality audio files (a 16 kHz or higher sample rate; see the sketch after this list for a quick WAV check)
- Minimize background noise for better accuracy
- Ensure clear speech with minimal overlapping speakers
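
For WAV files, the sample rate can be checked with the standard library before sending audio for transcription. A sketch; other formats need an external tool or library such as ffprobe or soundfile, and the path below is hypothetical.

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sample rate of a WAV file in Hz."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

rate = wav_sample_rate("/path/to/audio/interview.wav")  # hypothetical path
if rate < 16_000:
    print(f"Warning: sample rate is only {rate} Hz; transcription accuracy may suffer")
```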

### 4. Context Prompts
Provide context for technical terms or specific vocabulary:
```yaml
- type: text
  text: "This audio discusses machine learning models including BERT, GPT, and transformers."
```

### 5. File Size Limits
- Maximum audio file size: 25 MB (OpenAI limit)
- For longer audio, split it into chunks before transcription, as in the sketch below
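
One way to chunk long recordings is with `pydub`, assuming `pydub` and `ffmpeg` are installed; the ten-minute chunk length is arbitrary.

```python
from pathlib import Path
from pydub import AudioSegment  # requires pydub and ffmpeg

def split_audio(path: str, out_dir: str, chunk_minutes: int = 10) -> list[str]:
    """Split an audio file into fixed-length chunks and return the chunk paths."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = out / f"{Path(path).stem}_part{i:03d}.mp3"
        audio[start:start + chunk_ms].export(str(chunk_path), format="mp3")
        chunk_paths.append(str(chunk_path))
    return chunk_paths

# split_audio("/path/to/audio/long_meeting.mp3", "/path/to/audio/chunks")  # hypothetical paths
```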

---

## Notes

- **Audio generation is not supported** in this module. The `audio_url` type is strictly for passing existing audio inputs (e.g., loaded from datasets), not for generating new audio via model output.
- **Transcription models** require `input_type: audio` in the model configuration to route requests to the transcription API.
- For audio understanding LLM examples, see [`tasks/examples/audio_to_text`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/audio_to_text).
- For transcription examples, see [`tasks/examples/transcription_apis`](https://github.com/ServiceNow/SyGra/tree/main/tasks/examples/transcription_apis).

---

## See Also

- [GPT-4o Audio](./gpt_4o_audio.md) - Multimodal audio generation and understanding with GPT-4o
- [Text to Speech](./text_to_speech.md) - Text-to-speech generation
- [Image to Text](./image_to_text.md) - Vision-based multimodal pipelines
- [OpenAI Whisper Documentation](https://platform.openai.com/docs/guides/speech-to-text) - Official OpenAI Whisper API reference