
Commit 0d14407

RoRoJ and fpagny authored
feat(gen apis): add info for audio models (#5556)
* feat(genapi): update supported models
* feat(genapi): update model catalog with voxtral
* feat(genapi): update rate limits with voxstral
* feat(genapi): update faq for audio models
* feat(genapis): add audio model info
* fix(genapis): fix dates
* Apply suggestions from code review
* fix(genapis): fix conflcit
* Update pages/generative-apis/how-to/query-audio-models.mdx

  Co-authored-by: fpagny <[email protected]>

* fix(gen): switch order

---------

Co-authored-by: fpagny <[email protected]>
1 parent 390be8f commit 0d14407

File tree

6 files changed: +215 -5 lines changed


menu/navigation.json

Lines changed: 4 additions & 0 deletions
@@ -812,6 +812,10 @@
   "label": "Query code models",
   "slug": "query-code-models"
 },
+{
+  "label": "Query audio models",
+  "slug": "query-audio-models"
+},
 {
   "label": "Use structured outputs",
   "slug": "use-structured-outputs"

pages/generative-apis/faq.mdx

Lines changed: 5 additions & 2 deletions
@@ -70,9 +70,12 @@ Note that in this example, the first line where the free tier applies will not d
 ### What are tokens, and how are they counted?
 A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
 - For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long)
-- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
+- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens are `28x28` pixels (28 pixels in height and 28 pixels in width, hence `784` pixels in total).
+- For audio:
+  - `1` token corresponds to a duration of time. For example, `voxtral-small-24b-2507` model audio tokens are `80` milliseconds long.
+  - Some models process audio in chunks with a minimum duration. For example, the `voxtral-small-24b-2507` model processes audio in `30`-second chunks. This means audio lasting `13` seconds will be counted as `375` tokens (`30` seconds / `0.08` seconds), and audio lasting `178` seconds will be counted as `2,250` tokens (`30` seconds * `6` chunks / `0.08` seconds).
 
-The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
+The exact token count and definition depend on the [tokenizer](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
 
 ### How can I monitor my token consumption?
 You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics).
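
As a quick illustration of the chunk arithmetic above, here is a minimal Python sketch (not part of the commit), assuming the `30`-second chunk size and `80` ms token duration quoted for `voxtral-small-24b-2507`:

```python
import math

def audio_tokens(duration_s: float, chunk_s: float = 30.0, token_s: float = 0.08) -> int:
    """Estimate audio input tokens: audio is padded up to whole chunks,
    and each token covers a fixed 80 ms slice of time."""
    chunks = math.ceil(duration_s / chunk_s)
    return int(chunks * chunk_s / token_s)

print(audio_tokens(13))   # 375 tokens  (1 chunk  * 30 s / 0.08 s)
print(audio_tokens(178))  # 2250 tokens (6 chunks * 30 s / 0.08 s)
```
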
pages/generative-apis/how-to/query-audio-models.mdx

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
+---
+title: How to query audio models
+description: Learn how to interact with powerful audio models using Scaleway's Generative APIs service.
+tags: generative-apis ai-data audio-models voxtral
+dates:
+  validation: 2025-09-22
+  posted: 2025-09-22
+---
+import Requirements from '@macros/iam/requirements.mdx'
+
+Scaleway's Generative APIs service allows users to interact with powerful audio models hosted on the platform.
+
+There are several ways to interact with audio models:
+- The Scaleway [console](https://console.scaleway.com) provides a complete [playground](/generative-apis/how-to/query-language-models/#accessing-the-playground) where you can test models, adapt parameters, and observe how these changes affect the output in real time.
+- Via the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion)
+
+<Requirements />
+
+- A Scaleway account logged into the [console](https://console.scaleway.com)
+- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- A valid [API key](/iam/how-to/create-api-keys/) for API authentication
+- Python 3.7+ installed on your system
+
+## Accessing the playground
+
+Scaleway provides a web playground for instruct-based models hosted on Generative APIs.
+
+1. Navigate to **Generative APIs** under the **AI** section of the [Scaleway console](https://console.scaleway.com/) side menu. The list of models you can query displays.
+2. Click the name of the chat model you want to try. Alternatively, click <Icon name="more" /> next to the chat model, and click **Try model** in the menu.
+
+The web playground displays.
+
+## Using the playground
+
+1. Enter a prompt at the bottom of the page, or use one of the suggested prompts in the conversation area.
+2. Edit the hyperparameters listed in the right-hand column, for example the default temperature, for more or less randomness in the outputs.
+3. Switch models at the top of the page to observe the capabilities of chat models offered via Generative APIs.
+4. Click **View code** to get code snippets configured according to your settings in the playground.
+
+<Message type="tip">
+  You can also use the upload button to send supported audio file formats, such as MP3, to audio models for transcription purposes.
+</Message>
+
+## Querying audio models via API
+
+You can query the models programmatically using your favorite tools or languages.
+In the example that follows, we will use the OpenAI Python client.
+
+### Installing the OpenAI SDK
+
+Install the OpenAI SDK using pip:
+
+```bash
+pip install openai
+```
+
+### Initializing the client
+
+Initialize the OpenAI client with your base URL and API key:
+
+```python
+from openai import OpenAI
+
+# Initialize the client with your base URL and API key
+client = OpenAI(
+    base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL
+    api_key="<SCW_SECRET_KEY>" # Your unique API secret key from Scaleway
+)
+```
+
+### Transcribing audio
+
+You can now generate a text transcription of a given audio file using the Chat Completions API. This audio file can be remote or local.
+
+#### Transcribing a remote audio file
+
+In the example below, an audio file from a remote URL (`https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3`) is downloaded using the `requests` library, base64-encoded, and then sent to the model in a chat completion request alongside a transcription prompt. The resulting text transcription is printed to the screen.
+
+```python
+import base64
+import requests
+
+MODEL = "voxtral-small-24b-2507"
+
+url = "https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3"
+response = requests.get(url)
+audio_data = response.content
+encoded_string = base64.b64encode(audio_data).decode("utf-8")
+
+content = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Transcribe this audio"
+            },
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": encoded_string,
+                    "format": "mp3"
+                }
+            }
+        ]
+    }
+]
+
+response = client.chat.completions.create(
+    model=MODEL,
+    messages=content,
+    temperature=0.2, # Adjusts creativity
+    max_tokens=2048, # Limits the length of the output
+    top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature.
+)
+
+print(response.choices[0].message.content)
+```
+
+Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.
+
+#### Transcribing a local audio file
+
+In the example below, a local audio file, [scaleway-ai-revolution.mp3](https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3), is base64-encoded and sent to the model alongside a transcription prompt. The resulting text transcription is printed to the screen.
+
+```python
+import base64
+
+MODEL = "voxtral-small-24b-2507"
+
+with open('scaleway-ai-revolution.mp3', 'rb') as raw_file:
+    audio_data = raw_file.read()
+encoded_string = base64.b64encode(audio_data).decode("utf-8")
+
+content = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "Transcribe this audio"
+            },
+            {
+                "type": "input_audio",
+                "input_audio": {
+                    "data": encoded_string,
+                    "format": "mp3"
+                }
+            }
+        ]
+    }
+]
+
+response = client.chat.completions.create(
+    model=MODEL,
+    messages=content,
+    temperature=0.2, # Adjusts creativity
+    max_tokens=2048, # Limits the length of the output
+    top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature.
+)
+
+print(response.choices[0].message.content)
+```
+
+Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.
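
For long recordings, you may prefer to stream the transcription as it is generated rather than wait for the full response. A minimal sketch (not part of the commit), reusing `client`, `MODEL`, and `content` from the examples above, and assuming `stream=True` is accepted for audio inputs as it is for text:

```python
# Stream the transcription chunk by chunk instead of waiting for the full response.
stream = client.chat.completions.create(
    model=MODEL,
    messages=content,
    temperature=0.2,
    max_tokens=2048,
    stream=True,  # assumption: streaming is accepted for audio inputs as it is for text
)

for chunk in stream:
    # Each streamed chunk carries a small delta of the transcription text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```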

pages/generative-apis/reference-content/supported-models.mdx

Lines changed: 11 additions & 3 deletions
@@ -3,19 +3,27 @@ title: Supported models
 description: This page lists which open-source chat or embedding models Scaleway is currently hosting
 tags: generative-apis ai-data supported-models
 dates:
-  validation: 2025-08-20
+  validation: 2025-09-12
   posted: 2024-09-02
 ---
 
-Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).
+Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/), [Audio](/generative-apis/how-to/query-audio-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).
 
-## Multimodal models (chat and vision)
+## Multimodal models
+
+### Chat and Vision models
 
 | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
 |-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
 | Google (Preview) | `gemma-3-27b-it` | 40k | 8192 | [Gemma](https://ai.google.dev/gemma/terms) | [HF](https://huggingface.co/google/gemma-3-27b-it) |
 | Mistral | `mistral-small-3.2-24b-instruct-2506` | 128k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) |
 
+### Chat and Audio models
+
+| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
+|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
+| Mistral | `voxtral-small-24b-2507` | 32k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) |
+
 ## Chat models
 
 | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
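
To check programmatically which model strings are available to your account, you can query the OpenAI-compatible models endpoint. A minimal sketch (not part of the commit), assuming the `/v1/models` route is exposed by Generative APIs and reusing the client setup from the how-to guide above:

```python
from openai import OpenAI

# Client setup as in the how-to guide.
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key="<SCW_SECRET_KEY>",
)

# Print the available model strings, e.g. to confirm that voxtral-small-24b-2507 is listed.
for model in client.models.list():
    print(model.id)
```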

pages/managed-inference/reference-content/model-catalog.mdx

Lines changed: 26 additions & 0 deletions
@@ -30,6 +30,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
 | [`mistral-small-3.2-24b-instruct-2506`](#mistral-small-32-24b-instruct-2506) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-small-3.1-24b-instruct-2503`](#mistral-small-31-24b-instruct-2503) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-small-24b-instruct-2501`](#mistral-small-24b-instruct-2501) | Mistral | 32k | Text | L40S (20k), H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
+| [`voxtral-small-24b-2507`](#voxtral-small-24b-2507) | Mistral | 32k | Text, Audio | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mistral-nemo-instruct-2407`](#mistral-nemo-instruct-2407) | Mistral | 128k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`mixtral-8x7b-instruct-v0.1`](#mixtral-8x7b-instruct-v01) | Mistral | 32k | Text | H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
 | [`magistral-small-2506`](#magistral-small-2506) | Mistral | 32k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
@@ -60,6 +61,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
 | `mistral-small-3.2-24b-instruct-2506` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
 | `mistral-small-3.1-24b-instruct-2503` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
 | `mistral-small-24b-instruct-2501` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Polish, Portuguese, Chinese, Japanese, Korean |
+| `voxtral-small-24b-2507` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Portuguese, Hindi |
 | `mistral-nemo-instruct-2407` | Yes | Yes | English, French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese |
 | `mixtral-8x7b-instruct-v0.1` | Yes | No | English, French, German, Italian, Spanish |
 | `magistral-small-2506` | Yes | Yes | English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, Bengali |
@@ -164,6 +166,30 @@ Vision-language models like Molmo can analyze an image and offer insights from v
 allenai/molmo-72b-0924:fp8
 ```
 
+## Multimodal models (Text and Audio)
+
+### Voxtral-small-24b-2507
+Voxtral-small-24b-2507 is a model developed by Mistral to perform text processing and audio analysis in many languages.
+This model is optimized to enable transcription in many languages while keeping conversational capabilities (translation, classification, etc.).
+
+| Attribute | Value |
+|-----------|-------|
+| Supports parallel tool calling | Yes |
+| Supported audio formats | WAV and MP3 |
+| Audio chunk duration | 30 seconds |
+| Token duration (audio) | 80 ms |
+
+#### Model names
+```
+mistral/voxtral-small-24b-2507:bf16
+mistral/voxtral-small-24b-2507:fp8
+```
+
+- Mono and stereo audio formats are supported. For stereo formats, both left and right channels are merged before being processed.
+- Audio files are processed in 30-second chunks:
+  - If the audio sent is shorter than 30 seconds, the rest of the chunk is considered silent.
+  - 80 ms of audio is equal to 1 input token.
+
 ## Text models
 
 ### Qwen3-235b-a22b-instruct-2507
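
Before sending a recording, you may want to check its duration and channel count locally, since audio shorter than 30 seconds is padded to a full chunk and stereo is merged to mono. A minimal sketch (not part of the commit) using Python's standard `wave` module for WAV files; the file name is hypothetical:

```python
import math
import wave

# Inspect a local WAV file (one of the two supported formats) before sending it.
with wave.open("recording.wav", "rb") as wav:  # hypothetical file name
    duration_s = wav.getnframes() / wav.getframerate()
    channels = wav.getnchannels()  # mono and stereo are both accepted; stereo is merged before processing

chunks = math.ceil(duration_s / 30)  # audio is processed in 30-second chunks, padded with silence
print(f"{duration_s:.1f} s, {channels} channel(s), {chunks} chunk(s) of 30 s")
```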

pages/organizations-and-projects/additional-content/organization-quotas.mdx

Lines changed: 2 additions & 0 deletions
@@ -203,6 +203,7 @@ Generative APIs are rate limited based on:
 | mistral-small-3.1-24b-instruct-2503 | 200k | 400k |
 | mistral-small-3.2-24b-instruct-2506 | 200k | 400k |
 | mistral-nemo-instruct-2407 | 200k | 400k |
+| voxtral-small-24b-2507 | 200k | 400k |
 | pixtral-12b-2409 | 200k | 400k |
 | qwen3-235b-a22b-instruct-2507 | 200k | 400k |
 | qwen2.5-coder-32b-instruct | 200k | 400k |
@@ -221,6 +222,7 @@ Generative APIs are rate limited based on:
 | mistral-small-3.1-24b-instruct-2503 | 300 | 600 |
 | mistral-small-3.2-24b-instruct-2506 | 300 | 600 |
 | mistral-nemo-instruct-2407 | 300 | 600 |
+| voxtral-small-24b-2507 | 300 | 600 |
 | pixtral-12b-2409 | 300 | 600 |
 | qwen3-235b-a22b-instruct-2507 | 300 | 600 |
 | qwen2.5-coder-32b-instruct | 300 | 600 |
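
When these per-minute limits are exceeded, the API typically responds with an HTTP 429 error. A minimal retry sketch (not part of the commit) with the OpenAI Python client, assuming the error surfaces as `openai.RateLimitError`:

```python
import time

import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Generative APIs endpoint, as in the how-to guide
    api_key="<SCW_SECRET_KEY>",
)

def create_with_retry(messages, model="voxtral-small-24b-2507", retries=5):
    """Retry a chat completion with exponential backoff when rate limited."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off: 1 s, 2 s, 4 s, ...
    raise RuntimeError("still rate limited after retries")
```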
