diff --git a/pages/generative-apis/faq.mdx b/pages/generative-apis/faq.mdx
index d4b90486b3..6fa8aee5f0 100644
--- a/pages/generative-apis/faq.mdx
+++ b/pages/generative-apis/faq.mdx
@@ -81,7 +81,7 @@ The exact token count and definition depend on the [tokenizer](https://huggingfa
 
 You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics). Note that:
 - Cockpits are isolated by Project, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`).
-- Cockpit graphs can take up to 1 hour to update token consumption. See [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details.
+- Cockpit graphs can take up to 5 minutes to update token consumption. See [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details.
 
 ### Can I configure a maximum billing threshold?
 Currently, you cannot configure a specific threshold after which your usage will be blocked. However:
@@ -92,7 +92,6 @@ Currently, you cannot configure a specific threshold after which your usage will
 
 ### How can I give access to token consumption to my users outside Scaleway?
 If your users do not have a Scaleway account, you can still give them access to their Generative API usage consumption by either:
-- Providing them with access to Grafana inside [Cockpit](https://console.scaleway.com/cockpit/overview). You can create dedicated [Grafana users](https://console.scaleway.com/cockpit/users) with read-only access (**Viewer** Role). Note that these users will still have access to all other Cockpit dashboards for this project.
 - Collecting consumption data from the [Billing API](https://www.scaleway.com/en/developers/api/billing/#path-consumption-get-monthly-consumption) and exposing it to your users. Consumption can be detailed by Projects.
 - Collecting consumption data from [Cockpit data sources](https://console.scaleway.com/cockpit/dataSource) and exposing it to your users. As an example, you can query consumption using the following query:
 ```curl
@@ -111,15 +110,36 @@
 Make sure that you replace the following values:
 
 You can see your token consumption in [Scaleway Cockpit](https://console.scaleway.com/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics). Note that:
 - Cockpits are isolated by Projects. You first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for the desired Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`).
-- Cockpit graphs can take up to 1 hour to update token consumption. See [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details.
+- Cockpit graphs can take up to 5 minutes to update token consumption. See [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details.
 
 ## Specifications
 
 ### What are the SLAs applicable to Generative APIs?
-We are currently working on defining our SLAs for Generative APIs. We will provide more information on this topic soon.
+Generative APIs targets a 99.9% monthly availability rate, as detailed in the [Service Level Agreement for Generative APIs](https://www.scaleway.com/en/generative-apis/sla/).
 
 ### What are the performance guarantees (vs Managed Inference)?
-We are currently working on defining our performance guarantees for Generative APIs. We will provide more information on this topic soon.
+Generative APIs is optimized and monitored to provide reliable performance in most use cases, but performance is not strictly guaranteed, as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.
+
+As an order of magnitude, for Chat models, when performing requests with `stream` enabled:
+- Time to first token should be less than `1` second for most standard queries (with fewer than `1 000` input tokens).
+- Output token generation speed should be above `100` tokens per second for recent small to medium-sized models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`). See the indicative measurement example below.
+
+Exact performance still varies, based mainly on the following factors:
+- Model size and architecture: Smaller and more recent models usually provide better performance.
+- Model type:
+  - Chat models' time to first token increases proportionally with the input context size beyond a certain threshold (usually above `1 000` tokens).
+  - Audio transcription models' time to first token remains mostly constant, as they only need to process a small number of input tokens (a `30` second audio chunk) to generate a first output.
+- Input and output size: In rough terms, total processing time is proportional to input and output size. However, for larger queries (usually above `10 000` tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries into the smallest meaningful parts (`10` queries with `1 000` input tokens and `100` output tokens each will be processed faster than `1` query with `10 000` input tokens and `1 000` output tokens).
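+As an indicative check (assuming your API secret key is stored in the `SCW_SECRET_KEY` environment variable and that the `gpt-oss-120b` model is available to your account), you can approximate time to first token using curl's built-in `%{time_starttransfer}` timer, which reports the time until the first byte of a streaming response arrives:
+
+```curl
+# Send a streaming chat completion request, discard the body, and print curl's timers.
+# With `stream` enabled, the first byte roughly corresponds to the first token.
+curl -s -o /dev/null "https://api.scaleway.ai/v1/chat/completions" \
+  -H "Authorization: Bearer $SCW_SECRET_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-oss-120b", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}' \
+  -w "Time to first token (approximate): %{time_starttransfer}s\nTotal duration: %{time_total}s\n"
+```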
 
 ## Quotas and limitations
 
@@ -150,4 +170,4 @@ Yes, you need to comply with model licenses when using Generative APIs. Applicab
 ## Privacy and security
 
 ### Where can I find the privacy policy regarding Generative APIs?
-You can find the privacy policy applicable to all use of Generative APIs [here](/generative-apis/reference-content/data-privacy/).
\ No newline at end of file
+You can find the privacy policy applicable to all use of Generative APIs [here](/generative-apis/reference-content/data-privacy/).
diff --git a/pages/generative-apis/troubleshooting/fixing-common-issues.mdx b/pages/generative-apis/troubleshooting/fixing-common-issues.mdx
index c5403e13e5..d0de036148 100644
--- a/pages/generative-apis/troubleshooting/fixing-common-issues.mdx
+++ b/pages/generative-apis/troubleshooting/fixing-common-issues.mdx
@@ -155,7 +155,18 @@ For queries where the model enters an infinite loop (more frequent when using **
 
 ### Causes
 - Cockpit is isolated by `project_id` and only displays token consumption related to one Project.
-- Cockpit `Tokens Processed` graphs along time can take up to an hour to update (to provide more accurate average consumptions over time). The overall `Tokens Processed` counter is updated in real-time.
+- Cockpit `Tokens Processed` graphs over time can take up to 5 minutes to update (to provide more accurate average consumption over time). The overall `Tokens Processed` counter is updated in real time.
 
 ### Solution
 - Ensure you are connecting to the Cockpit corresponding to your Project. Cockpits are currently isolated by `project_id`, which you can see in their URL: `https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/`. This Project should correspond to the one used in the URL you used to perform Generative APIs requests, such as `https://api.scaleway.ai/{PROJECT_ID}/v1/chat/completions`. You can list your projects and their IDs in your [Organization dashboard](https://console.scaleway.com/organization/projects).
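+
+As an indicative check (assuming `SCW_SECRET_KEY` holds a valid API secret key; `mistral-small-3.2-24b-instruct-2506` is used here only as an example model), you can send a request through the Project-scoped URL and verify that the same ID appears in your Cockpit URL:
+
+```curl
+# PROJECT_ID below must match the one shown in your Cockpit URL:
+# https://PROJECT_ID.dashboard.obs.fr-par.scw.cloud/
+curl "https://api.scaleway.ai/$PROJECT_ID/v1/chat/completions" \
+  -H "Authorization: Bearer $SCW_SECRET_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "mistral-small-3.2-24b-instruct-2506", "messages": [{"role": "user", "content": "ping"}]}'
+```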