Merged
Changes from 8 commits
4 changes: 4 additions & 0 deletions menu/navigation.json
@@ -812,6 +812,10 @@
"label": "Query code models",
"slug": "query-code-models"
},
{
"label": "Query audio models",
"slug": "query-audio-models"
},
{
"label": "Use structured outputs",
"slug": "use-structured-outputs"
7 changes: 5 additions & 2 deletions pages/generative-apis/faq.mdx
@@ -83,9 +83,12 @@ Note that in this example, the first line where the free tier applies will not d
### What are tokens and how are they counted?
A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
- For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long)
- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens are `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
- For audio:
- `1` token corresponds to a duration of time. For example, `voxtral-small-24b-2507` model audio tokens are `80` milliseconds.
  - Some models process audio in chunks with a minimum duration. For example, the `voxtral-small-24b-2507` model processes audio in `30`-second chunks. This means audio lasting `13` seconds counts as `375` tokens (`30` seconds / `0.08` seconds), and audio lasting `178` seconds counts as `2,250` tokens (`6` chunks × `30` seconds / `0.08` seconds).
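The chunk arithmetic above can be sketched as follows. This is a minimal illustration, not an official token counter; exact counts depend on each model's tokenizer:

```python
import math

def estimate_audio_tokens(duration_s: float, chunk_s: int = 30, token_ms: int = 80) -> int:
    """Estimate the audio token count for models that pad audio to whole chunks."""
    chunks = math.ceil(duration_s / chunk_s)  # partial chunks are padded with silence
    return chunks * chunk_s * 1000 // token_ms  # integer math avoids float rounding

print(estimate_audio_tokens(13))   # 1 chunk  -> 375 tokens
print(estimate_audio_tokens(178))  # 6 chunks -> 2250 tokens
```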

The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
The exact token count and definition depend on the [tokenizer](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.

### How can I monitor my token consumption?
You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics).
167 changes: 167 additions & 0 deletions pages/generative-apis/how-to/query-audio-models.mdx
@@ -0,0 +1,167 @@
---
title: How to query audio models
description: Learn how to interact with powerful audio models using Scaleway's Generative APIs service.
tags: generative-apis ai-data audio-models voxtral
dates:
validation: 2025-09-22
posted: 2025-09-22
---
import Requirements from '@macros/iam/requirements.mdx'

Scaleway's Generative APIs service allows users to interact with powerful audio models hosted on the platform.

There are several ways to interact with audio models:
- The Scaleway [console](https://console.scaleway.com) provides a complete [playground](/generative-apis/how-to/query-language-models/#accessing-the-playground), allowing you to test models, adapt parameters, and observe how these changes affect the output in real time.
- Via the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion)

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- A valid [API key](/iam/how-to/create-api-keys/) for API authentication
- Python 3.7+ installed on your system

## Accessing the playground

Scaleway provides a web playground for instruct-based models hosted on Generative APIs.

1. Navigate to **Generative APIs** under the **AI** section of the [Scaleway console](https://console.scaleway.com/) side menu. The list of models you can query displays.
2. Click the name of the model you want to try. Alternatively, click <Icon name="more" /> next to the model, and click **Try model** in the menu.

The web playground displays.

## Using the playground

1. Enter a prompt at the bottom of the page, or use one of the suggested prompts in the conversation area.
2. Edit the hyperparameters listed on the right column, for example the default temperature for more or less randomness on the outputs.
3. Switch models at the top of the page to compare the capabilities of the models offered via Generative APIs.
4. Click **View code** to get code snippets configured according to your settings in the playground.

<Message type="tip">
You can also use the upload button to send supported audio file formats, such as MP3, to audio models for transcription purposes.
</Message>

## Querying audio models via API

You can query the models programmatically using your favorite tools or languages.
In the examples that follow, we use the OpenAI Python client.

### Installing the OpenAI SDK

Install the OpenAI SDK using pip:

```bash
pip install openai
```

### Initializing the client

Initialize the OpenAI client with your base URL and API key:

```python
from openai import OpenAI

# Initialize the client with your base URL and API key
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",  # Scaleway's Generative APIs service URL
    api_key="<SCW_SECRET_KEY>"  # Your unique API secret key from Scaleway
)
```
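Rather than hard-coding the secret key, you can read it from an environment variable. `SCW_SECRET_KEY` below is an assumed variable name for illustration, not an official convention:

```python
import os

# Read the key from the environment, falling back to a placeholder so the
# script still loads when the variable is unset.
api_key = os.environ.get("SCW_SECRET_KEY", "<SCW_SECRET_KEY>")
```

You would then pass `api_key=api_key` to the `OpenAI(...)` constructor above.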

### Transcribing audio

You can now generate a text transcription of a given audio file using the Chat Completions API. This audio file can be local or remote.

#### Transcribing a local audio file

In the example below, a local audio file called `scaleway-ai-revolution.mp3` is base64-encoded and sent to the model alongside a transcription prompt. The resulting text transcription is printed to the screen.

```python
import base64

MODEL = "voxtral-small-24b-2507"

with open('scaleway-ai-revolution.mp3', 'rb') as raw_file:
    audio_data = raw_file.read()

encoded_string = base64.b64encode(audio_data).decode("utf-8")

content = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Transcribe this audio"
            },
            {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }
        ]
    }
]

response = client.chat.completions.create(
    model=MODEL,
    messages=content,
    temperature=0.2,  # Adjusts creativity
    max_tokens=2048,  # Limits the length of the output
    top_p=0.95  # Controls diversity through nucleus sampling. You usually only need to adjust temperature.
)

print(response.choices[0].message.content)
```

Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters.
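The message-building step in the snippet above generalizes to any local file. The helper below is a sketch (not part of the OpenAI SDK) that infers the audio format from the file extension:

```python
import base64
from pathlib import Path

def audio_message(path: str, prompt: str = "Transcribe this audio") -> dict:
    """Build a Chat Completions user message carrying a base64-encoded audio file."""
    audio_format = Path(path).suffix.lstrip(".").lower()  # e.g. "mp3" or "wav"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }
```

You could then call `client.chat.completions.create(model=MODEL, messages=[audio_message("scaleway-ai-revolution.mp3")])` as in the example above.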

#### Transcribing a remote audio file

In the example below, an audio file from a remote URL (`https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3`) is downloaded using the `requests` library, base64-encoded, and then sent to the model in a chat completion request alongside a transcription prompt. The resulting text transcription is printed to the screen.

```python
import base64
import requests

MODEL = "voxtral-small-24b-2507"

url = "https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3"
audio_response = requests.get(url)
audio_response.raise_for_status()  # Fail early if the download did not succeed
audio_data = audio_response.content
encoded_string = base64.b64encode(audio_data).decode("utf-8")

content = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Transcribe this audio"
            },
            {
                "type": "input_audio",
                "input_audio": {
                    "data": encoded_string,
                    "format": "mp3"
                }
            }
        ]
    }
]

response = client.chat.completions.create(
    model=MODEL,
    messages=content,
    temperature=0.2,  # Adjusts creativity
    max_tokens=2048,  # Limits the length of the output
    top_p=0.95  # Controls diversity through nucleus sampling. You usually only need to adjust temperature.
)

print(response.choices[0].message.content)
```

14 changes: 11 additions & 3 deletions pages/generative-apis/reference-content/supported-models.mdx
@@ -3,19 +3,27 @@ title: Supported models
description: This page lists which open-source chat or embedding models Scaleway is currently hosting
tags: generative-apis ai-data supported-models
dates:
validation: 2025-08-20
validation: 2025-09-12
posted: 2024-09-02
---

Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).
Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/), [Audio](/generative-apis/how-to/query-audio-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/).

## Multimodal models (chat and vision)
## Multimodal models

### Chat and Vision models

| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Google (Preview) | `gemma-3-27b-it` | 40k | 8192 | [Gemma](https://ai.google.dev/gemma/terms) | [HF](https://huggingface.co/google/gemma-3-27b-it) |
| Mistral | `mistral-small-3.2-24b-instruct-2506` | 128k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) |

### Chat and Audio models

| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Mistral | `voxtral-small-24b-2507` | 32k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) |

## Chat models

| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card |
26 changes: 26 additions & 0 deletions pages/managed-inference/reference-content/model-catalog.mdx
@@ -30,6 +30,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
| [`mistral-small-3.2-24b-instruct-2506`](#mistral-small-32-24b-instruct-2506) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`mistral-small-3.1-24b-instruct-2503`](#mistral-small-31-24b-instruct-2503) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`mistral-small-24b-instruct-2501`](#mistral-small-24b-instruct-2501) | Mistral | 32k | Text | L40S (20k), H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`voxtral-small-24b-2507`](#voxtral-small-24b-2507) | Mistral | 32k | Text, Audio | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`mistral-nemo-instruct-2407`](#mistral-nemo-instruct-2407) | Mistral | 128k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`mixtral-8x7b-instruct-v0.1`](#mixtral-8x7b-instruct-v01) | Mistral | 32k | Text | H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
| [`magistral-small-2506`](#magistral-small-2506) | Mistral | 32k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
@@ -60,6 +61,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib
| `mistral-small-3.2-24b-instruct-2506` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
| `mistral-small-3.1-24b-instruct-2503` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi |
| `mistral-small-24b-instruct-2501` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Polish, Portuguese, Chinese, Japanese, Korean |
| `voxtral-small-24b-2507` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Portuguese, Hindi |
| `mistral-nemo-instruct-2407` | Yes | Yes | English, French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese |
| `mixtral-8x7b-instruct-v0.1` | Yes | No | English, French, German, Italian, Spanish |
| `magistral-small-2506` | Yes | Yes | English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, Bengali |
@@ -164,6 +166,30 @@ Vision-language models like Molmo can analyze an image and offer insights from v
allenai/molmo-72b-0924:fp8
```

## Multimodal models (Text and Audio)

### Voxtral-small-24b-2507
Voxtral-small-24b-2507 is a model developed by Mistral to perform text processing and audio analysis in many languages.
The model is optimized for transcription across many languages while keeping conversational capabilities (translation, classification, etc.).

| Attribute | Value |
|-----------|-------|
| Supports parallel tool calling | Yes |
| Supported audio formats | WAV and MP3 |
| Audio chunk duration | 30 seconds |
| Token duration (audio) | 80 ms |

#### Model names
```
mistral/voxtral-small-24b-2507:bf16
mistral/voxtral-small-24b-2507:fp8
```

- Mono and stereo audio formats are supported. For stereo formats, the left and right channels are merged before processing.
- Audio files are processed in 30-second chunks:
  - If the audio sent is shorter than 30 seconds, the rest of the chunk is considered silent.
  - Every 80 ms of audio counts as 1 input token.
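Given the 80 ms-per-token rate, you can estimate roughly how much audio fits in the model's 32k context window. This is an upper bound for illustration only, since it ignores text and prompt tokens:

```python
def max_audio_seconds(context_tokens: int = 32_000, token_ms: int = 80) -> float:
    """Rough upper bound on the audio duration that fits in the context window."""
    return context_tokens * token_ms / 1000  # milliseconds of audio, in seconds

print(max_audio_seconds())  # 2560.0 seconds, i.e. about 42 minutes
```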

## Text models

### Qwen3-235b-a22b-instruct-2507
@@ -203,6 +203,7 @@ Generative APIs are rate limited based on:
| mistral-small-3.1-24b-instruct-2503 | 200k | 400k |
| mistral-small-3.2-24b-instruct-2506 | 200k | 400k |
| mistral-nemo-instruct-2407 | 200k | 400k |
| voxtral-small-24b-2507 | 200k | 400k |
| pixtral-12b-2409 | 200k | 400k |
| qwen3-235b-a22b-instruct-2507 | 200k | 400k |
| qwen2.5-coder-32b-instruct | 200k | 400k |
@@ -221,6 +222,7 @@ Generative APIs are rate limited based on:
| mistral-small-3.1-24b-instruct-2503 | 300 | 600 |
| mistral-small-3.2-24b-instruct-2506 | 300 | 600 |
| mistral-nemo-instruct-2407 | 300 | 600 |
| voxtral-small-24b-2507 | 300 | 600 |
| pixtral-12b-2409 | 300 | 600 |
| qwen3-235b-a22b-instruct-2507 | 300 | 600 |
| qwen2.5-coder-32b-instruct | 300 | 600 |