feat(genapi): update faq for audio models

fpagny · web-flow · commit 9991a74c82b9 · 2025-09-12T17:09:00.000+02:00
diff --git a/pages/generative-apis/faq.mdx b/pages/generative-apis/faq.mdx
@@ -83,7 +83,10 @@ Note that in this example, the first line where the free tier applies will not d
 ### What are tokens and how are they counted?
 A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types:
 - For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long)
-- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
+- For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens are `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total).
+- For audio:
+  - `1` token corresponds to a time duration. For example, each audio token of the `voxtral-small-24b-2507` model covers `80` milliseconds.
+  - Some models process audio in chunks with a minimum duration. For example, the `voxtral-small-24b-2507` model processes audio in `30`-second chunks. This means `13` seconds of audio fits in a single chunk and is counted as `375` tokens (`30` seconds / `0.08` seconds), and `178` seconds of audio spans `6` chunks and is counted as `2 250` tokens (`30` seconds * `6` / `0.08` seconds).
 
 The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file.
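
Note on the audio arithmetic added in this hunk: the two worked examples (`13` seconds giving `375` tokens, `178` seconds giving `2 250` tokens) can be reproduced with the minimal sketch below. The function name `audio_tokens` and its parameters are hypothetical and not part of any Scaleway or Mistral API. The sketch assumes a partial chunk is billed as a full chunk, which is what the FAQ examples imply but do not state as a general rule.

```python
import math


def audio_tokens(duration_seconds, chunk_seconds=30, token_ms=80):
    """Estimate audio tokens for a model that processes fixed-size chunks.

    Hypothetical helper: the defaults (30-second chunks, 80 ms per token)
    are the values the FAQ gives for voxtral-small-24b-2507.
    Assumption: any partial chunk is rounded up to a full chunk.
    """
    if duration_seconds <= 0:
        return 0
    # Number of chunks needed to cover the audio, rounding up.
    chunks = math.ceil(duration_seconds / chunk_seconds)
    # Work in integer milliseconds to avoid floating-point drift.
    tokens_per_chunk = (chunk_seconds * 1000) // token_ms
    return chunks * tokens_per_chunk


if __name__ == "__main__":
    print(audio_tokens(13))   # one 30-second chunk
    print(audio_tokens(178))  # six 30-second chunks
```

Working in whole milliseconds keeps the per-chunk token count an exact integer (`30 000 / 80 = 375`) instead of relying on a floating-point division by `0.08`.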