
Commit 0d14407

RoRoJ and fpagny authored
feat(gen apis): add info for audio models (#5556)
* feat(genapi): update supported models
* feat(genapi): update model catalog with voxtral
* feat(genapi): update rate limits with voxstral
* feat(genapi): update faq for audio models
* feat(genapis): add audio model info
* fix(genapis): fix dates
* Apply suggestions from code review
* fix(genapis): fix conflcit
* Update pages/generative-apis/how-to/query-audio-models.mdx

  Co-authored-by: fpagny <[email protected]>

* fix(gen): switch order

---------

Co-authored-by: fpagny <[email protected]>
1 parent 390be8f commit 0d14407

File tree

6 files changed: +215 -5 lines changed


menu/navigation.json

Lines changed: 4 additions & 0 deletions
@@ -812,6 +812,10 @@
   "label": "Query code models",
   "slug": "query-code-models"
 },
+{
+  "label": "Query audio models",
+  "slug": "query-audio-models"
+},
 {
   "label": "Use structured outputs",
   "slug": "use-structured-outputs"

pages/generative-apis/faq.mdx

Lines changed: 5 additions & 2 deletions
@@ -70,9 +70,12 @@ Note that in this example, the first line where the free tier applies will not d
 ### What are tokens, and how are they counted?
 A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
 - For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long)
-- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
+- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens are `28x28` pixels (28 pixels in height and 28 pixels in width, hence `784` pixels in total).
+- For audio:
+  - `1` token corresponds to a duration of time. For example, `voxtral-small-24b-2507` model audio tokens are `80` milliseconds long.
+  - Some models process audio in chunks with a minimum duration. For example, the `voxtral-small-24b-2507` model processes audio in `30`-second chunks. This means audio lasting `13` seconds will be counted as `375` tokens (`30` seconds / `0.08` seconds), and audio lasting `178` seconds will be counted as `2,250` tokens (`30` seconds * `6` chunks / `0.08` seconds).
 
-The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
+The exact token count and definition depend on the [tokenizer](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
 
 ### How can I monitor my token consumption?
 You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics).
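
As a quick illustration of the chunk arithmetic above, here is a minimal Python sketch (not part of the commit), assuming the `30`-second chunk size and `80` ms token duration quoted for `voxtral-small-24b-2507`:

```python
import math

def audio_tokens(duration_s: float, chunk_s: float = 30.0, token_s: float = 0.08) -> int:
    """Estimate audio input tokens: audio is padded up to whole chunks,
    and each token covers a fixed 80 ms slice of time."""
    chunks = math.ceil(duration_s / chunk_s)
    return int(chunks * chunk_s / token_s)

print(audio_tokens(13))   # 375 tokens  (1 chunk  * 30 s / 0.08 s)
print(audio_tokens(178))  # 2250 tokens (6 chunks * 30 s / 0.08 s)
```
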
pages/generative-apis/how-to/query-audio-models.mdx

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+---
+title: How to query audio models
+description: Learn how to interact with powerful audio models using Scaleway's Generative APIs service.
+tags: generative-apis ai-data audio-models voxtral
+dates:
+  validation: 2025-09-22
+  posted: 2025-09-22
+---
+import Requirements from '@macros/iam/requirements.mdx'
+
+Scaleway's Generative APIs service allows users to interact with powerful audio models hosted on the platform.
+
+There are several ways to interact with audio models:
+- The Scaleway [console](https://console.scaleway.com) provides a complete [playground](/generative-apis/how-to/query-language-models/#accessing-the-playground) where you can test models, adapt parameters, and observe how these changes affect the output in real time.
+- Via the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion)
+
+<Requirements />
+
+- A Scaleway account logged into the [console](https://console.scaleway.com)
+- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- A valid [API key](/iam/how-to/create-api-keys/) for API authentication
+- Python 3.7+ installed on your system
+
+## Accessing the playground
+
+Scaleway provides a web playground for instruct-based models hosted on Generative APIs.
+
+1. Navigate to **Generative APIs** under the **AI** section of the [Scaleway console](https://console.scaleway.com/) side menu. The list of models you can query displays.
+2. Click the name of the chat model you want to try. Alternatively, click <Icon name="more" /> next to the chat model, and click **Try model** in the menu.
+
+The web playground displays.
+
+## Using the playground
+
+1. Enter a prompt at the bottom of the page, or use one of the suggested prompts in the conversation area.
+2. Edit the hyperparameters listed in the right-hand column, for example the default temperature, for more or less randomness in the outputs.
+3. Switch models at the top of the page to observe the capabilities of chat models offered via Generative APIs.
+4. Click **View code** to get code snippets configured according to your settings in the playground.
+
+<Message type="tip">
+  You can also use the upload button to send supported audio file formats, such as MP3, to audio models for transcription purposes.
+</Message>
+
+## Querying audio models via API
+
+You can query the models programmatically using your favorite tools or languages.
+In the example that follows, we will use the OpenAI Python client.
+
+### Installing the OpenAI SDK
+
+Install the OpenAI SDK using pip:
+
+```bash
+pip install openai
+```
+
+### Initializing the client
+
+Initialize the OpenAI client with your base URL and API key:
+
+```python
+from openai import OpenAI
+
+# Initialize the client with your base URL and API key
+client = OpenAI(
+    base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL
+    api_key="<SCW_SECRET_KEY>" # Your unique API secret key from Scaleway
+)
+```
+
+### Transcribing audio
+
+You can now generate a text transcription of a given audio file using the Chat Completions API. This audio file can be remote or local.
+
+#### Transcribing a remote audio file
+
+In the example below, an audio file from a remote URL (`https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3`) is downloaded using the `requests` library, base64-encoded, and then sent to the model in a chat completion request alongside a transcription prompt. The resulting text transcription is printed to the screen.
+
+```python
+import base64
+import requests
+
+MODEL = "voxtral-small-24b-2507"
+
+url = "https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3"
+response = requests.get(url)
+audio_data = response.content
+encoded_string = base64.b64encode(audio_data).decode("utf-8")
+
+content = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Transcribe this audio"
+            },
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": encoded_string,
+                    "format": "mp3"
+                }
+            }
+        ]
+    }
+]
+
+response = client.chat.completions.create(
+    model=MODEL,
+    messages=content,
+    temperature=0.2, # Adjusts creativity
+    max_tokens=2048, # Limits the length of the output
+    top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature.
+)
+
+print(response.choices[0].message.content)
+```
+
+Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.
+
+#### Transcribing a local audio file
+
+In the example below, a local audio file, [scaleway-ai-revolution.mp3](https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3), is base64-encoded and sent to the model alongside a transcription prompt. The resulting text transcription is printed to the screen.
+
+```python
+import base64
+
+MODEL = "voxtral-small-24b-2507"
+
+with open('scaleway-ai-revolution.mp3', 'rb') as raw_file:
+    audio_data = raw_file.read()
+encoded_string = base64.b64encode(audio_data).decode("utf-8")
+
+content = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Transcribe this audio"
+            },
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": encoded_string,
+                    "format": "mp3"
+                }
+            }
+        ]
+    }
+]
+
+response = client.chat.completions.create(
+    model=MODEL,
+    messages=content,
+    temperature=0.2, # Adjusts creativity
+    max_tokens=2048, # Limits the length of the output
+    top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature.
+)
+
+print(response.choices[0].message.content)
+```
+
+Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.
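
For long recordings, you may prefer to stream the transcription as it is generated rather than wait for the full response. A minimal sketch (not part of the commit), reusing `client`, `MODEL`, and `content` from the examples above, and assuming `stream=True` is accepted for audio inputs as it is for text:

```python
# Stream the transcription chunk by chunk instead of waiting for the full response.
stream = client.chat.completions.create(
    model=MODEL,
    messages=content,
    temperature=0.2,
    max_tokens=2048,
    stream=True,  # assumption: streaming is accepted for audio inputs as it is for text
)

for chunk in stream:
    # Each streamed chunk carries a small delta of the transcription text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```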

pages/generative-apis/reference-content/supported-models.mdx

Lines changed: 11 additions & 3 deletions
@@ -3,19 +3,27 @@ title: Supported models
 description: This page lists which open-source chat or embedding models Scaleway is currently hosting
 tags: generative-apis ai-data supported-models
 dates:
-  validation: 2025-08-20
+  validation: 2025-09-12
   posted: 2024-09-02
 ---
 
-Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).
+Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/), [Audio](/generative-apis/how-to/query-audio-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).
 
-## Multimodal models (chat and vision)
+## Multimodal models
+
+### Chat and Vision models
 
 | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
 |-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
 | Google (Preview) | `gemma-3-27b-it` | 40k | 8192 | [Gemma](https://ai.google.dev/gemma/terms) | [HF](https://huggingface.co/google/gemma-3-27b-it) |
 | Mistral | `mistral-small-3.2-24b-instruct-2506` | 128k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) |
 
+### Chat and Audio models
+
+| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
+|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
+| Mistral | `voxtral-small-24b-2507` | 32k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) |
+
 ## Chat models
 
 | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
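
To check programmatically which model strings are available to your account, you can query the OpenAI-compatible models endpoint. A minimal sketch (not part of the commit), assuming the `/v1/models` route is exposed by Generative APIs and reusing the client setup from the how-to guide above:

```python
from openai import OpenAI

# Client setup as in the how-to guide.
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key="<SCW_SECRET_KEY>",
)

# Print the available model strings, e.g. to confirm that voxtral-small-24b-2507 is listed.
for model in client.models.list():
    print(model.id)
```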

pages/managed-inference/reference-content/model-catalog.mdx

Lines changed: 26 additions & 0 deletions
@@ -30,6 +30,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
 | [`mistral-small-3.2-24b-instruct-2506`](#mistral-small-32-24b-instruct-2506) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-small-3.1-24b-instruct-2503`](#mistral-small-31-24b-instruct-2503) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-small-24b-instruct-2501`](#mistral-small-24b-instruct-2501) | Mistral | 32k | Text | L40S (20k), H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
+| [`voxtral-small-24b-2507`](#voxtral-small-24b-2507) | Mistral | 32k | Text, Audio | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-nemo-instruct-2407`](#mistral-nemo-instruct-2407) | Mistral | 128k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mixtral-8x7b-instruct-v0.1`](#mixtral-8x7b-instruct-v01) | Mistral | 32k | Text | H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`magistral-small-2506`](#magistral-small-2506) | Mistral | 32k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
@@ -60,6 +61,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
 | `mistral-small-3.2-24b-instruct-2506` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
 | `mistral-small-3.1-24b-instruct-2503` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
 | `mistral-small-24b-instruct-2501` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Polish, Portuguese, Chinese, Japanese, Korean |
+| `voxtral-small-24b-2507` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Portuguese, Hindi |
 | `mistral-nemo-instruct-2407` | Yes | Yes | English, French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese |
 | `mixtral-8x7b-instruct-v0.1` | Yes | No | English, French, German, Italian, Spanish |
 | `magistral-small-2506` | Yes | Yes | English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, Bengali |
@@ -164,6 +166,30 @@ Vision-language models like Molmo can analyze an image and offer insights from v
 allenai/molmo-72b-0924:fp8
 ```
 
+## Multimodal models (Text and Audio)
+
+### Voxtral-small-24b-2507
+Voxtral-small-24b-2507 is a model developed by Mistral to perform text processing and audio analysis in many languages.
+This model is optimized to enable transcription in many languages while keeping conversational capabilities (translation, classification, etc.).
+
+| Attribute | Value |
+|-----------|-------|
+| Supports parallel tool calling | Yes |
+| Supported audio formats | WAV and MP3 |
+| Audio chunk duration | 30 seconds |
+| Token duration (audio) | 80 ms |
+
+#### Model names
+```
+mistral/voxtral-small-24b-2507:bf16
+mistral/voxtral-small-24b-2507:fp8
+```
+
+- Mono and stereo audio formats are supported. For stereo formats, both left and right channels are merged before being processed.
+- Audio files are processed in 30-second chunks:
+  - If the audio sent is shorter than 30 seconds, the rest of the chunk is considered silent.
+  - 80 ms of audio is equal to 1 input token.
+
 ## Text models
 
 ### Qwen3-235b-a22b-instruct-2507
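
Before sending a recording, you may want to check its duration and channel count locally, since audio shorter than 30 seconds is padded to a full chunk and stereo is merged to mono. A minimal sketch (not part of the commit) using Python's standard `wave` module for WAV files; the file name is hypothetical:

```python
import math
import wave

# Inspect a local WAV file (one of the two supported formats) before sending it.
with wave.open("recording.wav", "rb") as wav:  # hypothetical file name
    duration_s = wav.getnframes() / wav.getframerate()
    channels = wav.getnchannels()  # mono and stereo are both accepted; stereo is merged before processing

chunks = math.ceil(duration_s / 30)  # audio is processed in 30-second chunks, padded with silence
print(f"{duration_s:.1f} s, {channels} channel(s), {chunks} chunk(s) of 30 s")
```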

pages/organizations-and-projects/additional-content/organization-quotas.mdx

Lines changed: 2 additions & 0 deletions
@@ -203,6 +203,7 @@ Generative APIs are rate limited based on:
 | mistral-small-3.1-24b-instruct-2503 | 200k | 400k |
 | mistral-small-3.2-24b-instruct-2506 | 200k | 400k |
 | mistral-nemo-instruct-2407 | 200k | 400k |
+| voxtral-small-24b-2507 | 200k | 400k |
 | pixtral-12b-2409 | 200k | 400k |
 | qwen3-235b-a22b-instruct-2507 | 200k | 400k |
 | qwen2.5-coder-32b-instruct | 200k | 400k |
@@ -221,6 +222,7 @@ Generative APIs are rate limited based on:
 | mistral-small-3.1-24b-instruct-2503 | 300 | 600 |
 | mistral-small-3.2-24b-instruct-2506 | 300 | 600 |
 | mistral-nemo-instruct-2407 | 300 | 600 |
+| voxtral-small-24b-2507 | 300 | 600 |
 | pixtral-12b-2409 | 300 | 600 |
 | qwen3-235b-a22b-instruct-2507 | 300 | 600 |
 | qwen2.5-coder-32b-instruct | 300 | 600 |
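
When these per-minute limits are exceeded, the API typically responds with an HTTP 429 error. A minimal retry sketch (not part of the commit) with the OpenAI Python client, assuming the error surfaces as `openai.RateLimitError`:

```python
import time

import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Generative APIs endpoint, as in the how-to guide
    api_key="<SCW_SECRET_KEY>",
)

def create_with_retry(messages, model="voxtral-small-24b-2507", retries=5):
    """Retry a chat completion with exponential backoff when rate limited."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...
    raise RuntimeError("still rate limited after retries")
```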
