pages/public_cloud/ai_machine_learning/endpoints_guide_08_audio_transcriptions/guide.en-gb.md
96 additions, 20 deletions
@@ -1,7 +1,7 @@
 ---
 title: AI Endpoints - Speech to Text
 excerpt: Learn how to transcribe audio files with OVHcloud AI Endpoints
-updated: 2025-10-01
+updated: 2025-10-03
 ---
 
 > [!primary]
@@ -15,7 +15,7 @@ updated: 2025-10-01
 
 **Speech to Text** is a powerful feature that enables the conversion of spoken language into written text.
 
-The Speech to Text endpoints on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various audio formats and provide flexible configuration options to suit your specific use cases.
+The Speech to Text APIs on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various [audio formats](#parameters-overview) and provide flexible configuration options to suit your specific use cases.
 
 ## Objective
 
@@ -48,7 +48,7 @@ The examples provided during this guide can be used with one of the following en
 >> A standard terminal, with [cURL](https://curl.se/) installed on the system.
 >>
 
-*These exmaples will be using the [Whisper-large-v3](https://endpoints.ai.cloud.ovh.net/models/whisper-large-v3) model.*
+*These examples will be using the [Whisper-large-v3](https://endpoints.ai.cloud.ovh.net/models/whisper-large-v3) model.*
 
 ## Authentication & Rate Limiting
 
@@ -66,7 +66,7 @@ The request body for the audio transcription endpoint is of type `multipart/form
 |**chunking_strategy**| No |`string`/`server_vad object`/`null`| - | null | Strategy for dividing the audio into chunks. More details [here](#chunking-strategy). |
-|**diarize**| No |`boolean`/`null`|`true`/`false`|false| Enables speaker separation in the transcript. When set to true, the system separates the audio into segments based on speakers, by adding labels like "Speaker 1" and "Speaker 2", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details [here](#diarize). |
+|**diarize**| No |`boolean`/`null`|`true`/`false`|false| Enables speaker separation in the transcript. When set to true, the system separates the audio into segments based on speakers, by adding labels like "Speaker 1" and "Speaker 2", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details [here](#diarization). |
 |**language**| No |`string`/`null`| [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) | - | The language parameter specifies the language spoken in the input audio. Providing it can improve transcription accuracy and reduce latency (e.g. `en` for English, `fr` for French, `de` for German, `es` for Spanish, `zh` for Chinese, `ar` for Arabic ...). If not provided, the system will attempt automatic language detection, which may be slightly slower and less accurate in some cases. [More details on language compatibility and performances](#language-compatibility-and-performances). |
 |**model**| No |`string`/`null`| ID of the model to use | - | Specifies the model to use for transcription. Useful when using our [unified endpoint](/pages/public_cloud/ai_machine_learning/endpoints_guide_07_virtual_models). |
 |**prompt**| No |`string`/`null`| - | - | Text to guide the model's style, translate the transcript to English, or continue a previous audio segment. The language in which you write the prompt must match the language of the audio. More details about prompt usage [here](#prompt). |
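
To make the table concrete, here is a minimal Python sketch that sends a `multipart/form-data` request combining several of these parameters. The endpoint URL and the `OVH_AI_ENDPOINTS_ACCESS_TOKEN` variable are illustrative placeholders, not values confirmed by this guide; substitute the real ones from your AI Endpoints dashboard.

```python
import os
import requests

# Illustrative placeholder; use the URL shown for your model in the catalog.
URL = "https://<your-endpoint>/v1/audio/transcriptions"

with open("meeting.wav", "rb") as audio:
    response = requests.post(
        URL,
        # Token variable name is an assumption; use your own access token.
        headers={"Authorization": f"Bearer {os.environ['OVH_AI_ENDPOINTS_ACCESS_TOKEN']}"},
        files={"file": audio},           # the audio file part of the form
        data={
            "model": "whisper-large-v3", # relevant on the unified endpoint
            "language": "en",            # ISO-639-1 code, skips auto-detection
            "diarize": "true",           # label turns as Speaker 1, Speaker 2, ...
        },
    )
response.raise_for_status()
print(response.json())
```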
@@ -314,7 +314,7 @@ The `prompt` parameter lets you provide extra context to improve transcription.
 >> }
 >>```
 >>
->>**Translating transcript into English**
+>**Translating transcript into English**
 >>
 >> To directly translate the transcription into English instead of keeping it in the source language, you can pass the special translation token `<|translate|>` in your prompt:
 >>
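
As a sketch of this translation usage (same placeholder URL and token variable as in the earlier sketch), the special token is simply sent as the `prompt` form field:

```python
import os
import requests

URL = "https://<your-endpoint>/v1/audio/transcriptions"  # illustrative placeholder

with open("discours_fr.mp3", "rb") as audio:
    response = requests.post(
        URL,
        headers={"Authorization": f"Bearer {os.environ['OVH_AI_ENDPOINTS_ACCESS_TOKEN']}"},
        files={"file": audio},
        # The special token asks the model to output English text directly,
        # instead of transcribing in the source language.
        data={"prompt": "<|translate|>"},
    )
print(response.text)
```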
@@ -401,8 +401,39 @@ The `timestamp_granularities` parameter controls the level of time markers inclu
 >>```json
 >> words=[],
 >> segments=[
->> {'id': 1, 'seek': 0, 'start': 1.76, 'end': 4.58, 'text': ' France is the world's leading tourist destination', 'tokens': [50365, 1456, 1181, 650, 1459, 1030, 476, 8124, 515, 50609], 'temperature': 0.0, 'avg_logprob': -0.14139344, 'compression_ratio': 1.2769231, 'no_speech_prob': 0.007171631},
+>> {
+>>   'id': 1,
+>>   'seek': 0,
+>>   'start': 1.76,
+>>   'end': 4.58,
+>>   'text': ' France is the world's leading tourist destination',
+>>   'tokens': [
+>>     50365,
+>>     1456,
+>>     1181,
+>>     ...
+>>   ],
+>>   'temperature': 0.0,
+>>   'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
+>> {
+>>   'id': 2,
+>>   'seek': 0,
+>>   'start': 9.44,
+>>   'end': 14.92,
+>>   'text': 'having received 100 million foreign visitors in 2023.',
+>>   'tokens': [
+>>     50609,
+>>     4042,
+>>     25011,
+>>     ...
+>>   ],
+>>   'temperature': 0.0,
+>>   'avg_logprob': -0.14139344,
+>>   'compression_ratio': 1.2769231,
+>>   'no_speech_prob': 0.007171631
+>> },
 >> ...
 >> ]
 >>```
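
A short sketch of consuming the segment-level timestamps shown above; the sample data is copied from that output, and the field names follow its structure:

```python
# Sample taken from the verbose output above.
result = {
    "segments": [
        {"id": 1, "start": 1.76, "end": 4.58,
         "text": " France is the world's leading tourist destination"},
        {"id": 2, "start": 9.44, "end": 14.92,
         "text": "having received 100 million foreign visitors in 2023."},
    ]
}

# Print each segment with its time window, e.g. as a first step towards subtitles.
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f}s -> {seg['end']:6.2f}s] {seg['text'].strip()}")
```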
@@ -520,6 +582,7 @@ The `response_format` determines how the transcription data is returned. Availab
 >>"duration": 5
 >> }
 >> }
+>>```
 >>
 >**Text**
 >>
@@ -530,11 +593,11 @@ The `response_format` determines how the transcription data is returned. Availab
 >>
 >**SRT**
 >>
->> Not yet supported.
+>>**Not yet supported.**
 >>
 >**VTT**
 >>
->> Not yet supported.
+>>**Not yet supported.**
 >>
 
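Since SRT and VTT are not yet supported, one workaround is to request `verbose_json` and build subtitle cues from the segments yourself. A minimal sketch, assuming the `segments` structure shown earlier (this is not an official feature of the API, only client-side post-processing):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Turn verbose_json segments into a simple SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```
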
#### Chunking Strategy
@@ -577,7 +640,7 @@ However, transcription quality and speed depend on the **language of the input a
 - Less common or low-resource languages may yield lower accuracy or longer processing times.
 - Regional accents, dialects, or code-switching (switching between multiple languages in the same recording) can reduce accuracy further.
 
-Providing the `language` parameter explicitly (instead of relying on automatic detection) generally improves both accuracy and latency.
+Providing the `language` parameter explicitly (instead of relying on automatic detection) generally improves both accuracy and latency. Expected format is [ISO-639-1 format](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) (e.g. `en` for English, `fr` for French, `de` for German, `es` for Spanish, `zh` for Chinese, `ar` for Arabic ...).
 
 For a detailed performance breakdown by language, see [Whisper’s benchmark results](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). This includes word error rates (WER) and character error rates (CER) across different datasets.
 
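As a quick illustration of the difference, both variants below are valid; the explicit one is generally preferable when the language is known (the URL and token variable are the same illustrative placeholders as in the earlier sketches):

```python
import os
import requests

URL = "https://<your-endpoint>/v1/audio/transcriptions"  # illustrative placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['OVH_AI_ENDPOINTS_ACCESS_TOKEN']}"}

def transcribe(path: str, language: str | None = None) -> str:
    """Transcribe `path`; pass an ISO-639-1 code to skip auto-detection."""
    data = {"language": language} if language else {}
    with open(path, "rb") as audio:
        resp = requests.post(URL, headers=HEADERS, files={"file": audio}, data=data)
    resp.raise_for_status()
    return resp.text

# Explicit code: generally faster and more accurate when the language is known.
print(transcribe("podcast.mp3", language="fr"))
# Omitting it falls back to automatic detection.
print(transcribe("podcast.mp3"))
```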
@@ -610,29 +673,42 @@ Try to avoid splitting mid-sentence, as this can cause context to be lost and re
 
 **Example**
 
-Splitting Audio with open-source Python PyDub library:
+Splitting Audio with open-source Python `pydub` library:
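
A minimal sketch of such splitting, using silence detection so that cuts tend to fall between sentences rather than mid-word (the file name and thresholds are illustrative, not values from this guide):

```python
from pydub import AudioSegment           # pip install pydub (requires ffmpeg)
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("long_recording.mp3")

# Split on pauses so context is preserved within each chunk.
chunks = split_on_silence(
    audio,
    min_silence_len=700,             # ms of silence that counts as a break
    silence_thresh=audio.dBFS - 16,  # 16 dB below average loudness counts as silence
    keep_silence=200,                # keep a little padding around each chunk
)

for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:03d}.mp3", format="mp3")
```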