diff --git a/menu/navigation.json b/menu/navigation.json index 97b07114d4..693274ba6a 100644 --- a/menu/navigation.json +++ b/menu/navigation.json @@ -812,6 +812,10 @@ "label": "Query code models", "slug": "query-code-models" }, + { + "label": "Query audio models", + "slug": "query-audio-models" + }, { "label": "Use structured outputs", "slug": "use-structured-outputs" diff --git a/pages/generative-apis/faq.mdx b/pages/generative-apis/faq.mdx index 84010a2d6f..97e9eb3b70 100644 --- a/pages/generative-apis/faq.mdx +++ b/pages/generative-apis/faq.mdx @@ -83,9 +83,12 @@ Note that in this example, the first line where the free tier applies will not d ### What are tokens and how are they counted? A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types: - For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long) -- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total). +- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens are `28x28` pixels (`28` pixels high and `28` pixels wide, hence `784` pixels in total). +- For audio: + - `1` token corresponds to a fixed duration of audio. For example, `voxtral-small-24b-2507` model audio tokens are `80` milliseconds long. + - Some models process audio in chunks with a minimum duration. For example, the `voxtral-small-24b-2507` model processes audio in `30`-second chunks. This means audio lasting `13` seconds is counted as `375` tokens (`30` seconds / `0.08` seconds), and audio lasting `178` seconds is counted as `2250` tokens (`30` seconds * `6` chunks / `0.08` seconds). 
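+The chunk arithmetic above can be sketched in a few lines of Python. This is an illustrative helper using the `voxtral-small-24b-2507` figures quoted above (30-second chunks, 80 ms per token), not an official formula or SDK function:

```python
import math

CHUNK_SECONDS = 30      # audio is processed in whole 30-second chunks
TOKEN_SECONDS = 0.08    # one audio token covers 80 milliseconds

def audio_tokens(duration_seconds: float) -> int:
    """Estimate the audio token count: the duration is rounded up
    to a whole number of 30-second chunks, then divided by the
    per-token duration."""
    chunks = math.ceil(duration_seconds / CHUNK_SECONDS)
    return int(chunks * CHUNK_SECONDS / TOKEN_SECONDS)

print(audio_tokens(13))   # 375 tokens (1 chunk of 30 seconds)
print(audio_tokens(178))  # 2250 tokens (6 chunks of 30 seconds)
```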
-The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file. +The exact token count and definition depend on the [tokenizer](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file. ### How can I monitor my token consumption? You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics). diff --git a/pages/generative-apis/how-to/query-audio-models.mdx b/pages/generative-apis/how-to/query-audio-models.mdx new file mode 100644 index 0000000000..5ab51a1212 --- /dev/null +++ b/pages/generative-apis/how-to/query-audio-models.mdx @@ -0,0 +1,167 @@ +--- +title: How to query audio models +description: Learn how to interact with powerful audio models using Scaleway's Generative APIs service. 
+tags: generative-apis ai-data audio-models voxtral dates: validation: 2025-09-22 posted: 2025-09-22 --- import Requirements from '@macros/iam/requirements.mdx' + Scaleway's Generative APIs service allows users to interact with powerful audio models hosted on the platform. + There are several ways to interact with audio models: +- The Scaleway [console](https://console.scaleway.com) provides a complete [playground](/generative-apis/how-to/query-language-models/#accessing-the-playground), allowing you to test models, adapt parameters, and observe how these changes affect the output in real time. +- Via the [Chat Completions API](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) + + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- A valid [API key](/iam/how-to/create-api-keys/) for API authentication +- Python 3.7+ installed on your system + +## Accessing the Playground + +Scaleway provides a web playground for instruct-based models hosted on Generative APIs. + +1. Navigate to **Generative APIs** under the **AI** section of the [Scaleway console](https://console.scaleway.com/) side menu. The list of models you can query displays. +2. Click the name of the chat model you want to try. Alternatively, click next to the chat model, and click **Try model** in the menu. + +The web playground displays. + +## Using the playground + +1. Enter a prompt at the bottom of the page, or use one of the suggested prompts in the conversation area. +2. Edit the hyperparameters listed on the right column, for example, the default temperature for more or less randomness on the outputs. +3. Switch models at the top of the page, to observe the capabilities of chat models offered via Generative APIs. +4. 
Click **View code** to get code snippets configured according to your settings in the playground. + + +You can also use the upload button to send supported audio file formats, such as MP3, to audio models for transcription purposes. + + +## Querying audio models via API + +You can query the models programmatically using your favorite tools or languages. +In the example that follows, we will use the OpenAI Python client. + +### Installing the OpenAI SDK + +Install the OpenAI SDK using pip: + +```bash +pip install openai +``` + +### Initializing the client + +Initialize the OpenAI client with your base URL and API key: + +```python +from openai import OpenAI + +# Initialize the client with your base URL and API key +client = OpenAI( + base_url="https://api.scaleway.ai/v1", # Scaleway's Generative APIs service URL + api_key="" # Your unique API secret key from Scaleway +) +``` + +### Transcribing audio + +You can now generate a text transcription of a given audio file using the Chat Completions API. This audio file can be remote or local. + +#### Transcribing a remote audio file + +In the example below, an audio file from a remote URL (`https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3`) is downloaded using the `requests` library, base64-encoded, and then sent to the model in a chat completion request alongside a transcription prompt. The resulting text transcription is printed to the screen. 
+ +```python +import base64 +import requests + +MODEL = "voxtral-small-24b-2507" + +url = "https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3" +response = requests.get(url) +audio_data = response.content +encoded_string = base64.b64encode(audio_data).decode("utf-8") + +content = [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "Transcribe this audio" + }, + { + "type": "input_audio", + "input_audio": { + "data": encoded_string, + "format": "mp3" + } + } + ] + } + ] + + +response = client.chat.completions.create( + model=MODEL, + messages=content, + temperature=0.2, # Adjusts creativity + max_tokens=2048, # Limits the length of the output + top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature. +) + +print(response.choices[0].message.content) +``` + +Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters. + +#### Transcribing a local audio file + +In the example below, a local audio file [scaleway-ai-revolution.mp3](https://genapi-documentation-assets.s3.fr-par.scw.cloud/scaleway-ai-revolution.mp3) is base-64 encoded and sent to the model, alongside a transcription prompt. The resulting text transcription is printed to the screen. 
+ +```python +import base64 + +MODEL = "voxtral-small-24b-2507" + +with open('scaleway-ai-revolution.mp3', 'rb') as raw_file: + audio_data = raw_file.read() +encoded_string = base64.b64encode(audio_data).decode("utf-8") + +content = [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "Transcribe this audio" + }, + { + "type": "input_audio", + "input_audio": { + "data": encoded_string, + "format": "mp3" + } + } + ] + } + ] + + +response = client.chat.completions.create( + model=MODEL, + messages=content, + temperature=0.2, # Adjusts creativity + max_tokens=2048, # Limits the length of the output + top_p=0.95 # Controls diversity through nucleus sampling. You usually only need to use temperature. +) + +print(response.choices[0].message.content) +``` + +Various parameters such as `temperature` and `max_tokens` control the output. See the [dedicated API documentation](https://www.scaleway.com/en/developers/api/generative-apis/#path-chat-completions-create-a-chat-completion) for a full list of all available parameters. \ No newline at end of file diff --git a/pages/generative-apis/reference-content/supported-models.mdx b/pages/generative-apis/reference-content/supported-models.mdx index 6236874f1e..61976f99c1 100644 --- a/pages/generative-apis/reference-content/supported-models.mdx +++ b/pages/generative-apis/reference-content/supported-models.mdx @@ -3,19 +3,27 @@ title: Supported models description: This page lists which open-source chat or embedding models Scaleway is currently hosting tags: generative-apis ai-data supported-models dates: - validation: 2025-08-20 + validation: 2025-09-12 posted: 2024-09-02 --- -Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/). 
+Our API supports the most popular models for [Chat](/generative-apis/how-to/query-language-models), [Vision](/generative-apis/how-to/query-vision-models/), [Audio](/generative-apis/how-to/query-audio-models/) and [Embeddings](/generative-apis/how-to/query-embedding-models/). -## Multimodal models (chat and vision) +## Multimodal models + +### Chat and Vision models | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card | |-----------------|-----------------|-----------------|-----------------|-----------------|-----------------| | Google (Preview) | `gemma-3-27b-it` | 40k | 8192 | [Gemma](https://ai.google.dev/gemma/terms) | [HF](https://huggingface.co/google/gemma-3-27b-it) | | Mistral | `mistral-small-3.2-24b-instruct-2506` | 128k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506) | +### Chat and Audio models + +| Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card | +|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------| +| Mistral | `voxtral-small-24b-2507` | 32k | 8192 | [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) | [HF](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) | + ## Chat models | Provider | Model string | Context window (Tokens) | Maximum output (Tokens)| License | Model card | diff --git a/pages/managed-inference/reference-content/model-catalog.mdx b/pages/managed-inference/reference-content/model-catalog.mdx index 68ac784629..5c2ef3b6b5 100644 --- a/pages/managed-inference/reference-content/model-catalog.mdx +++ b/pages/managed-inference/reference-content/model-catalog.mdx @@ -30,6 +30,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib | [`mistral-small-3.2-24b-instruct-2506`](#mistral-small-32-24b-instruct-2506) | Mistral | 128k | Text, Vision | H100, 
H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | | [`mistral-small-3.1-24b-instruct-2503`](#mistral-small-31-24b-instruct-2503) | Mistral | 128k | Text, Vision | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | | [`mistral-small-24b-instruct-2501`](#mistral-small-24b-instruct-2501) | Mistral | 32k | Text | L40S (20k), H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | +| [`voxtral-small-24b-2507`](#voxtral-small-24b-2507) | Mistral | 32k | Text, Audio | H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | | [`mistral-nemo-instruct-2407`](#mistral-nemo-instruct-2407) | Mistral | 128k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | | [`mixtral-8x7b-instruct-v0.1`](#mixtral-8x7b-instruct-v01) | Mistral | 32k | Text | H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | | [`magistral-small-2506`](#magistral-small-2506) | Mistral | 32k | Text | L40S, H100, H100-2 | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | @@ -60,6 +61,7 @@ A quick overview of available models in Scaleway's catalog and their core attrib | `mistral-small-3.2-24b-instruct-2506` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi | | `mistral-small-3.1-24b-instruct-2503` | Yes | Yes | English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, Farsi | | `mistral-small-24b-instruct-2501` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Polish, Portuguese, Chinese, Japanese, Korean | +| `voxtral-small-24b-2507` | Yes | Yes | English, French, German, Dutch, Spanish, Italian, Portuguese, Hindi | | 
`mistral-nemo-instruct-2407` | Yes | Yes | English, French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese | | `mixtral-8x7b-instruct-v0.1` | Yes | No | English, French, German, Italian, Spanish | | `magistral-small-2506` | Yes | Yes | English, French, German, Spanish, Portuguese, Italian, Japanese, Korean, Russian, Chinese, Arabic, Persian, Indonesian, Malay, Nepali, Polish, Romanian, Serbian, Swedish, Turkish, Ukrainian, Vietnamese, Hindi, Bengali | @@ -164,6 +166,30 @@ Vision-language models like Molmo can analyze an image and offer insights from v allenai/molmo-72b-0924:fp8 ``` +## Multimodal models (Text and Audio) + +### Voxtral-small-24b-2507 +Voxtral-small-24b-2507 is a model developed by Mistral to perform text processing and audio analysis in many languages. +This model was optimized to enable transcription in many languages while keeping conversational capabilities (translation, classification, etc.). + +| Attribute | Value | +|-----------|-------| +| Supports parallel tool calling | Yes | +| Supported audio formats | WAV and MP3 | +| Audio chunk duration | 30 seconds | +| Token duration (audio) | 80 ms | + +#### Model names +``` +mistral/voxtral-small-24b-2507:bf16 +mistral/voxtral-small-24b-2507:fp8 +``` + +- Mono and stereo audio formats are supported. For stereo formats, both left and right channels are merged before being processed. +- Audio files are processed in 30-second chunks: + - If the audio sent is shorter than 30 seconds, the remainder of the chunk is treated as silence. 
+ - 80 ms of audio corresponds to 1 input token. + ## Text models ### Qwen3-235b-a22b-instruct-2507 diff --git a/pages/organizations-and-projects/additional-content/organization-quotas.mdx b/pages/organizations-and-projects/additional-content/organization-quotas.mdx index 0b4e9ace02..f02f9f552a 100644 --- a/pages/organizations-and-projects/additional-content/organization-quotas.mdx +++ b/pages/organizations-and-projects/additional-content/organization-quotas.mdx @@ -203,6 +203,7 @@ Generative APIs are rate limited based on: | mistral-small-3.1-24b-instruct-2503 | 200k | 400k | | mistral-small-3.2-24b-instruct-2506 | 200k | 400k | | mistral-nemo-instruct-2407 | 200k | 400k | +| voxtral-small-24b-2507 | 200k | 400k | | pixtral-12b-2409 | 200k | 400k | | qwen3-235b-a22b-instruct-2507 | 200k | 400k | | qwen2.5-coder-32b-instruct | 200k | 400k | @@ -221,6 +222,7 @@ Generative APIs are rate limited based on: | mistral-small-3.1-24b-instruct-2503 | 300 | 600 | | mistral-small-3.2-24b-instruct-2506 | 300 | 600 | | mistral-nemo-instruct-2407 | 300 | 600 | +| voxtral-small-24b-2507 | 300 | 600 | | pixtral-12b-2409 | 300 | 600 | | qwen3-235b-a22b-instruct-2507 | 300 | 600 | | qwen2.5-coder-32b-instruct | 300 | 600 |
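+When a project exceeds its tokens-per-minute or queries-per-minute quota, requests fail until the window resets. A small client-side retry with exponential backoff can absorb short bursts. The sketch below is a generic helper; the function name, and the suggested pairing with the OpenAI Python client's `RateLimitError` exception, are illustrative assumptions rather than part of Scaleway's API:

```python
import time

def with_backoff(fn, retries=5, base_delay=1.0, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff when a retriable
    error is raised. Delays grow as base_delay * 2**attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...

# Hypothetical pairing with the OpenAI client configured earlier:
# from openai import RateLimitError
# result = with_backoff(
#     lambda: client.chat.completions.create(model=MODEL, messages=content),
#     retriable=(RateLimitError,),
# )
```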