
Commit 39660fe

Merge pull request #8475 from ovh/mb-ai-endpoints-audio-fix
[AI Endpoints] - Audio documentation - Fix typos
2 parents 5636663 + f5ef6e1 commit 39660fe

2 files changed: +192 -41 lines changed


pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md

Lines changed: 96 additions & 20 deletions
@@ -1,7 +1,7 @@
---
title: AI Endpoints - Speech to Text
excerpt: Learn how to transcribe audio files with OVHcloud AI Endpoints
-updated: 2025-10-01
+updated: 2025-10-03
---

> [!primary]
@@ -15,7 +15,7 @@ updated: 2025-10-01

**Speech to Text** is a powerful feature that enables the conversion of spoken language into written text.

-The Speech to Text endpoints on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various audio formats and provide flexible configuration options to suit your specific use cases.
+The Speech to Text APIs on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various [audio formats](#parameters-overview) and provide flexible configuration options to suit your specific use cases.

## Objective

@@ -48,7 +48,7 @@ The examples provided during this guide can be used with one of the following en
>> A standard terminal, with [cURL](https://cURL.se/) installed on the system.
>>

-*These exmaples will be using the [Whisper-large-v3](https://endpoints.ai.cloud.ovh.net/models/whisper-large-v3) model.*
+*These examples will be using the [Whisper-large-v3](https://endpoints.ai.cloud.ovh.net/models/whisper-large-v3) model.*

## Authentication & Rate Limiting

@@ -66,7 +66,7 @@ The request body for the audio transcription endpoint is of type `multipart/form
|--------------------------|----------|---------------|---------------------------------------------------------------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **file** | Yes | binary | `mp3`, `mp4`, `aac`, `m4a`, `wav`, `flac`, `ogg`, `opus`, `webm`, `mpeg`, `mpga` | - | The **audio file object (not file name)** to transcribe. |
| **chunking_strategy** | No | `string`/`server_vad object`/`null` | - | null | Strategy for dividing the audio into chunks. More details [here](#chunking-strategy). |
-| **diarize** | No | `boolean`/`null` | `true`/`false` | false | Enables speaker separation in the transcript. When set to true, the system separates the audio into segments based on speakers, by adding labels like "Speaker 1" and "Speaker 2", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details [here](#diarize). |
+| **diarize** | No | `boolean`/`null` | `true`/`false` | false | Enables speaker separation in the transcript. When set to true, the system separates the audio into segments based on speakers, by adding labels like "Speaker 1" and "Speaker 2", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details [here](#diarization). |
| **language** | No | `string`/`null` | [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) | - | The language parameter specifies the language spoken in the input audio. Providing it can improve transcription accuracy and reduce latency (e.g. `en` for English, `fr` for French, `de` for German, `es` for Spanish, `zh` for Chinese, `ar` for Arabic ...). If not provided, the system will attempt automatic language detection, which may be slightly slower and less accurate in some cases. [More details on language compatibility and performance](#language-compatibility-and-performances). |
| **model** | No | `string`/`null` | ID of the model to use | - | Specifies the model to use for transcription. Useful when using our [unified endpoint](/pages/public_cloud/ai_machine_learning/endpoints_guide_07_virtual_models). |
| **prompt** | No | `string`/`null` | - | - | Text to guide the model's style, translate transcript to english or continue a previous audio segment. The language in which you write the prompt must match the audio's one. More details about prompt usage [here](#prompt). |
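
To make the fields above concrete, here is a minimal request sketch in Python. The endpoint URL, API key, and file name are placeholders (they are not part of this diff), and the same `multipart/form-data` request can equally be sent with cURL:

```python
import requests

# Placeholders: the real endpoint URL and auth header are not shown in this diff.
URL = "https://<your-ai-endpoint>/v1/audio/transcriptions"
HEADERS = {"Authorization": "Bearer <your-api-key>"}

with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        URL,
        headers=HEADERS,
        files={"file": audio_file},  # the audio file object, not its name
        data={
            "language": "en",   # ISO-639-1 code; optional, but improves accuracy and latency
            "diarize": "true",  # optional: label segments as "Speaker 1", "Speaker 2", ...
        },
    )
print(response.json())
```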
@@ -314,7 +314,7 @@ The `prompt` parameter lets you provide extra context to improve transcription.
>> }
>> ```
>>
->> **Translating transcript into English**
+> **Translating transcript into English**
>>
>> To directly translate the transcription into English instead of keeping it in the source language, you can pass the special translation token `<|translate|>` in your prompt:
>>
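
As a sketch of the `<|translate|>` usage described above (same placeholder URL and key as before; the French file name is hypothetical):

```python
import requests

URL = "https://<your-ai-endpoint>/v1/audio/transcriptions"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-key>"}        # placeholder

# Pass the special translation token as the prompt to get an English
# transcript of a non-English recording.
with open("discours_fr.mp3", "rb") as audio_file:
    response = requests.post(
        URL,
        headers=HEADERS,
        files={"file": audio_file},
        data={"prompt": "<|translate|>"},
    )
print(response.json())
```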
@@ -401,8 +401,39 @@ The `timestamp_granularities` parameter controls the level of time markers inclu
>> ```json
>> words=[],
>> segments=[
->> {'id': 1, 'seek': 0, 'start': 1.76, 'end': 4.58, 'text': ' France is the world's leading tourist destination', 'tokens': [50365, 1456, 1181, 650, 1459, 1030, 476, 8124, 515, 50609], 'temperature': 0.0, 'avg_logprob': -0.14139344, 'compression_ratio': 1.2769231, 'no_speech_prob': 0.007171631},
->> {'id': 2, 'seek': 0, 'start': 9.44, 'end': 14.92, 'text': ' having received 100 million foreign visitors in 2023.', 'tokens': [50609, 4042, 25011, 3925, 650, 1459, 11, 1022, 517, 594, 2672, 14303, 11, 2064, 1001, 465, 1872, 312, 66, 517, 490, 609, 13, 51117], 'temperature': 0.0, 'avg_logprob': -0.14139344, 'compression_ratio': 1.2769231, 'no_speech_prob': 0.007171631},
+>> {
+>>   'id': 1,
+>>   'seek': 0,
+>>   'start': 1.76,
+>>   'end': 4.58,
+>>   'text': ' France is the world's leading tourist destination',
+>>   'tokens': [
+>>     50365,
+>>     1456,
+>>     1181,
+>>     ...
+>>   ],
+>>   'temperature': 0.0, 'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
+>> {
+>>   'id': 2,
+>>   'seek': 0,
+>>   'start': 9.44,
+>>   'end': 14.92,
+>>   'text': 'having received 100 million foreign visitors in 2023.',
+>>   'tokens': [
+>>     50609,
+>>     4042,
+>>     25011,
+>>     ...
+>>   ],
+>>   'temperature': 0.0,
+>>   'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
>> ...
>> ]
>> ```
@@ -454,8 +485,39 @@ The `timestamp_granularities` parameter controls the level of time markers inclu
>> ...
>> ],
>> segments=[
->> {'id': 1, 'seek': 0, 'start': 1.76, 'end': 4.58, 'text': ' France is the world's leading tourist destination', 'tokens': [50365, 1456, 1181, 650, 1459, 1030, 476, 8124, 515, 50609], 'temperature': 0.0, 'avg_logprob': -0.14139344, 'compression_ratio': 1.2769231, 'no_speech_prob': 0.007171631},
->> {'id': 2, 'seek': 0, 'start': 9.44, 'end': 14.92, 'text': ' having received 100 million foreign visitors in 2023.', 'tokens': [50609, 4042, 25011, 3925, 650, 1459, 11, 1022, 517, 594, 2672, 14303, 11, 2064, 1001, 465, 1872, 312, 66, 517, 490, 609, 13, 51117], 'temperature': 0.0, 'avg_logprob': -0.14139344, 'compression_ratio': 1.2769231, 'no_speech_prob': 0.007171631},
+>> {
+>>   'id': 1,
+>>   'seek': 0,
+>>   'start': 1.76,
+>>   'end': 4.58,
+>>   'text': ' France is the world's leading tourist destination',
+>>   'tokens': [
+>>     50365,
+>>     1456,
+>>     1181,
+>>     ...
+>>   ],
+>>   'temperature': 0.0, 'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
+>> {
+>>   'id': 2,
+>>   'seek': 0,
+>>   'start': 9.44,
+>>   'end': 14.92,
+>>   'text': 'having received 100 million foreign visitors in 2023.',
+>>   'tokens': [
+>>     50609,
+>>     4042,
+>>     25011,
+>>     ...
+>>   ],
+>>   'temperature': 0.0,
+>>   'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
>> ...
>> ]
>> ```
@@ -520,6 +582,7 @@ The `response_format` determines how the transcription data is returned. Availab
>> "duration": 5
>> }
>> }
+>> ```
>>
> **Text**
>>
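
For illustration, a minimal sketch of requesting a different `response_format` (JSON and plain text are the formats this guide shows as supported; URL and key remain placeholders, as before):

```python
import requests

URL = "https://<your-ai-endpoint>/v1/audio/transcriptions"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-key>"}        # placeholder

with open("meeting.wav", "rb") as audio_file:
    response = requests.post(
        URL,
        headers=HEADERS,
        files={"file": audio_file},
        data={"response_format": "text"},  # plain transcript instead of the JSON object
    )
print(response.text)  # raw text; no JSON parsing needed
```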
@@ -530,11 +593,11 @@ The `response_format` determines how the transcription data is returned. Availab
>>
> **SRT**
>>
->> Not yet supported.
+>> **Not yet supported.**
>>
> **VTT**
>>
->> Not yet supported.
+>> **Not yet supported.**
>>

#### Chunking Strategy
@@ -577,7 +640,7 @@ However, transcription quality and speed depend on the **language of the input a
- Less common or low-resource languages may yield lower accuracy or longer processing times.
- Regional accents, dialects, or code-switching (switching between multiple languages in the same recording) can reduce accuracy further.

-Providing the `language` parameter explicitly (instead of relying on automatic detection) generally improves both accuracy and latency.
+Providing the `language` parameter explicitly (instead of relying on automatic detection) generally improves both accuracy and latency. Expected format is [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) (e.g. `en` for English, `fr` for French, `de` for German, `es` for Spanish, `zh` for Chinese, `ar` for Arabic ...).

For a detailed performance breakdown by language, see [Whisper’s benchmark results](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). This includes word error rates (WER) and character error rates (CER) across different datasets.

@@ -610,29 +673,42 @@ Try to avoid splitting mid-sentence, as this can cause context to be lost and re

**Example**

-Splitting Audio with open-source Python PyDub library:
+Splitting Audio with open-source Python `pydub` library:

```python
from pydub import AudioSegment
+import math
+import os

# Load the audio file
audio = AudioSegment.from_mp3("long_interview.mp3")

-# Define chunk duration in milliseconds (e.g., 10 minutes)
-chunk_duration = 10 * 60 * 1000
+# Define chunk duration in milliseconds (e.g., 30 minutes)
+chunk_duration = 30 * 60 * 1000 # 30 minutes
+
+# Calculate how many chunks we need
+num_chunks = math.ceil(len(audio) / chunk_duration)
+
+# Ensure output folder exists
+output_dir = "chunks"
+os.makedirs(output_dir, exist_ok=True)

-# Split first chunk
-first_chunk = audio[:chunk_duration]
+# Loop through and export each chunk
+for i in range(num_chunks):
+    start_time = i * chunk_duration
+    end_time = min((i + 1) * chunk_duration, len(audio))
+    chunk = audio[start_time:end_time]

+    chunk_filename = os.path.join(output_dir, f"long_interview_part{i+1}.mp3")
+    chunk.export(chunk_filename, format="mp3")
+    print(f"Exported {chunk_filename}")
-# Export chunk
-first_chunk.export("long_interview_part1.mp3", format="mp3")
```

Repeat this process to create multiple chunks, then transcribe each chunk individually.

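As a possible follow-up for that last step, here is a sketch that transcribes each exported chunk in order and stitches the results together (the endpoint URL, the key, and the `text` field of the JSON response are assumptions, not taken from this diff):

```python
import os
import re
import requests

URL = "https://<your-ai-endpoint>/v1/audio/transcriptions"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-key>"}        # placeholder

# Sort chunk files numerically so "part10" does not come before "part2"
chunk_files = sorted(
    os.listdir("chunks"),
    key=lambda name: int(re.search(r"\d+", name).group()),
)

transcript_parts = []
for name in chunk_files:
    with open(os.path.join("chunks", name), "rb") as audio_file:
        response = requests.post(URL, headers=HEADERS, files={"file": audio_file})
    # Assumes the default JSON response exposes the transcript under "text"
    transcript_parts.append(response.json().get("text", ""))

print(" ".join(transcript_parts))
```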
> [!warning]
>
-> OVHcloud makes no guarantees about the usability or security of third-party software like PyDub.
+> OVHcloud makes no guarantees about the usability or security of third-party softwares like `pydub`.

## Conclusion
