From b8e2e22d6973faf83f702eb982a8b67a1129aeed Mon Sep 17 00:00:00 2001 From: Rowena Date: Wed, 20 Aug 2025 17:02:34 +0200 Subject: [PATCH 1/2] fix(review): review docs --- pages/generative-apis/faq.mdx | 99 ++++++++++--------- .../adding-ai-to-intellij-using-continue.mdx | 5 +- .../adding-ai-to-vscode-using-continue.mdx | 2 +- .../reference-content/supported-models.mdx | 2 +- .../how-to/create-manage-acls.mdx | 2 +- pages/managed-inference/faq.mdx | 86 ++++++++-------- 6 files changed, 106 insertions(+), 90 deletions(-) diff --git a/pages/generative-apis/faq.mdx b/pages/generative-apis/faq.mdx index 48c200561a..8615a86170 100644 --- a/pages/generative-apis/faq.mdx +++ b/pages/generative-apis/faq.mdx @@ -2,20 +2,52 @@ title: Generative APIs FAQ description: Get answers to the most frequently asked questions about Scaleway Generative APIs. dates: - validation: 2025-02-12 + validation: 2025-08-20 productIcon: GenerativeApiProductIcon --- -## What are Scaleway Generative APIs? +## General + +### What are Scaleway Generative APIs? Scaleway's Generative APIs provide access to pre-configured, serverless endpoints of leading AI models, hosted in European data centers. This allows you to integrate advanced AI capabilities into your applications without managing underlying infrastructure. -## Which models are supported by Generative APIs? +### How do I get started with Generative APIs? +To get started, explore the [Generative APIs Playground](/generative-apis/quickstart/#start-with-the-generative-apis-playground) in the Scaleway console. For application integration, refer to our [Quickstart guide](/generative-apis/quickstart/), which provides step-by-step instructions on accessing, configuring, and using a Generative APIs endpoint. + +### Where are the inference servers located? +All models are currently hosted in a secure data center located in Paris, France, operated by [OPCORE](https://www.opcore.com/). This ensures low latency for European users and compliance with European data privacy regulations. + +### What is the difference between Generative APIs and Managed Inference? +- **Generative APIs**: A serverless service providing access to pre-configured AI models via API, billed per token usage. +- **Managed Inference**: Allows deployment of curated or custom models with chosen quantization and instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not. + +## Models and libraries + +### Which models are supported by Generative APIs? Our Generative APIs support a range of popular models, including: - Chat / Text Generation models: Refer to our dedicated [documentation](/generative-apis/reference-content/supported-models/#chat-models) for a list of supported chat models. - Vision models: Refer to our dedicated [documentation](/generative-apis/reference-content/supported-models/#vision-models) for a list of supported vision models. - Embedding models: Refer to our dedicated [documentation](/generative-apis/reference-content/supported-models/#embedding-models) for a list of supported embedding models. -## How does the free tier work? +### What is the model lifecycle for Generative APIs? +Scaleway is dedicated to updating and offering the latest versions of generative AI models, while ensuring older models remain accessible for a significant time, and also ensuring the reliability of your production applications. 
Learn more in our [model lifecycle policy](/generative-apis/reference-content/model-lifecycle/). + +### Do model licenses apply when using Generative APIs? +Yes, you need to comply with model licenses when using Generative APIs. Applicable licenses are available for [each model in our documentation](/generative-apis/reference-content/supported-models/#vision-models) and in Console Playground. + +### Can I increase maximum output (completion) tokens for a model? +No, you cannot increase maximum output tokens above [limits for each models](https://www.scaleway.com/en/docs/generative-apis/reference-content/supported-models/) in Generative APIs. +These limits are in place to protect you against: +- Long generation which may be ended by an HTTP timeout. Limits are designed to ensure a model will send its HTTP response in less than 5 minutes. +- Uncontrolled billing, as several models are known to be able to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without stopping at all). +If you require higher maximum output tokens, you can use [Managed Inference](https://console.scaleway.com/inference/deployments) where these limts do not apply (as your bill will be limited by the size of your deployment). + +### Can I use OpenAI libraries and APIs with Scaleway's Generative APIs? +Yes, Scaleway's Generative APIs are designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows. + +## Billing and monitoring + +### How does the free tier work? The free tier allows you to process up to 1,000,000 tokens without incurring any costs. After reaching this limit, you will be charged per million tokens processed. Free tier usage is calculated by adding all input and output tokens consumed from all models used. For more information, refer to our [pricing page](https://www.scaleway.com/en/pricing/model-as-a-service/#generative-apis) or access your bills by token types and models in [billing section from Scaleway Console](https://console.scaleway.com/billing/payment) (past and provisional bills for the current month). @@ -46,22 +78,28 @@ Total tokens consumed: `900k` Total billed consumption: `6 million tokens` Total bill: `3.20€` -Note that in this example, the first line where the free tier applies will not display in your current Scaleway bills by model but will instead be listed under `Generative APIs Free Tier - First 1M tokens for free`. +Note that in this example, the first line where the free tier applies will not display in your current Scaleway bills by model, but will instead be listed under `Generative APIs Free Tier - First 1M tokens for free`. -## What is a token and how are they counted? +### What are tokens and how are they counted? A token is the minimum unit of content that is seen and processed by a model. Hence, token definitions depend on input types: - For text, on average, `1` token corresponds to `~4` characters, and thus `0.75` words (as words are on average five characters long) - For images, `1` token corresponds to a square of pixels. For example, `mistral-small-3.1-24b-instruct-2503` model image tokens of `28x28` pixels (28-pixels height, and 28-pixels width, hence `784` pixels in total). The exact token count and definition depend on [tokenizers](https://huggingface.co/learn/llm-course/en/chapter2/4) used by each model. 
When this difference is significant (such as for image processing), you can find detailed information in each model's documentation (for instance in [`mistral-small-3.1-24b-instruct-2503` size limit documentation](/managed-inference/reference-content/model-catalog/#mistral-small-31-24b-instruct-2503)). When the model is open, you can also find this information in the model files on platforms such as Hugging Face, usually in the `tokenizer_config.json` file. -## How can I monitor my token consumption? +### How can I monitor my token consumption? You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics). Note that: -- Cockpits are isolated by Projects, hence you first need to select the right project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`. +- Cockpits are isolated by Project, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`. - Cockpit graphs can take up to 1 hour to update token consumption, see [Troubleshooting](https://www.scaleway.com/en/docs/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details. -## How can I give access to token consumption to my users outside of Scaleway? +### Can I configure a maximum billing threshold? +Currently, you cannot configure a specific threshold after which your usage will blocked. However: +- You can [configure billing alerts](/billing/how-to/use-billing-alerts/) to ensure you are warned when you hit specific budget thresholds. +- Your total billing remains limited by the amount of tokens you can consume within [rate limits](/generative-apis/reference-content/rate-limits/). +- If you want to ensure fixed billing, you can use [Managed Inference](https://www.scaleway.com/en/inference/), which provides the same set of OpenAI-compatible APIs and a wider range of models. + +### How can I give access to token consumption to my users outside of Scaleway? If your users do not have a Scaleway account, you can still give them access to their Generative API usage consumption by either: - Providing them an access to Grafana inside [Cockpit](https://console.scaleway.com/cockpit/overview). You can create dedicated [Grafana users](https://console.scaleway.com/cockpit/users) with read-only access (**Viewer** Role). Note that these users will still have access to all other Cockpit dashboards for this project. - Collecting consumption data from the [Billing API](https://www.scaleway.com/en/developers/api/billing/#path-consumption-get-monthly-consumption) and exposing it to your users. Consumption can be detailed by Projects. @@ -84,53 +122,26 @@ Note that: - Cockpits are isolated by Projects. You first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for the desired Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`. 
- Cockpit graphs can take up to 1 hour to update token consumption, see [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details. -## Can I configure a maximum billing threshold? -Currently, you cannot configure a specific threshold after which your usage will blocked. However: -- You can [configure billing alerts](/billing/how-to/use-billing-alerts/) to ensure you are warned when you hit specific budget thresholds. -- Your total billing remains limited by the amount of tokens you can consume within [rate limits](/generative-apis/reference-content/rate-limits/). -- If you want to ensure fixed billing, you can use [Managed Inference](https://www.scaleway.com/en/inference/), which provides the same set of OpenAI-compatible APIs and a wider range of models. - -## How can I access and use the Generative APIs? -Access is open to all Scaleway customers. You can start by using the Generative APIs Playground in the Scaleway console to experiment with different models. For integration into applications, you can use the OpenAI-compatible APIs provided by Scaleway. Detailed instructions are available in our [Quickstart guide](/generative-apis/quickstart/). +## Privacy, performance and rate limiting -## Where are the inference servers located? -All models are currently hosted in a secure data center located in Paris, France, operated by [OPCORE](https://www.opcore.com/). This ensures low latency for European users and compliance with European data privacy regulations. - -## Where can I find the privacy policy regarding Generative APIs? +### Where can I find the privacy policy regarding Generative APIs? You can find the privacy policy applicable to all use of Generative APIs [here](/generative-apis/reference-content/data-privacy/). -## Can I use OpenAI libraries and APIs with Scaleway's Generative APIs? -Yes, Scaleway's Generative APIs are designed to be compatible with OpenAI libraries and SDKs, including the OpenAI Python client library and LangChain SDKs. This allows for seamless integration with existing workflows. - -## What is the difference between Generative APIs and Managed Inference? -- **Generative APIs**: A serverless service providing access to pre-configured AI models via API, billed per token usage. -- **Managed Inference**: Allows deployment of curated or custom models with chosen quantization and instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not. +### What are the SLAs applicable to Generative APIs? +We are currently working on defining our SLAs for Generative APIs. We will provide more information on this topic soon. -## How do I get started with Generative APIs? -To get started, explore the [Generative APIs Playground](/generative-apis/quickstart/#start-with-the-generative-apis-playground) in the Scaleway console. For application integration, refer to our [Quickstart guide](/generative-apis/quickstart/), which provides step-by-step instructions on accessing, configuring, and using a Generative APIs endpoint. +### What are the performance guarantees (vs Managed Inference)? +We are currently working on defining our performance guarantees for Generative APIs. We will provide more information on this topic soon. -## Are there any rate limits for API usage? +### Are there any rate limits for API usage? 
Yes, API rate limits define the maximum number of requests a user can make within a specific time frame to ensure fair access and resource allocation between users. If you require increased rate limits we recommend either: - Using [Managed Inference](https://console.scaleway.com/inference/deployments), which provides dedicated capacity and doesn't enforce rate limits (you remain limited by the total provisioned capacity) - Contacting your existing Scaleway account manager or our Sales team to discuss volume commitment for specific models that will allow us to increase your quota proportionally. Refer to our dedicated [documentation](/generative-apis/reference-content/rate-limits/) for more information on rate limits. -## Can I increase maximum output (completion) tokens for a model? -No, you cannot increase maximum output tokens above [limits for each models](https://www.scaleway.com/en/docs/generative-apis/reference-content/supported-models/) in Generative APIs. -These limits are in place to protect you against: -- Long generation which may be ended by an HTTP timeout. Limits are designed to ensure a model will send its HTTP response in less than 5 minutes. -- Uncontrolled billing, as several models are known to be able to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without stopping at all). -If you require higher maximum output tokens, you can use [Managed Inference](https://console.scaleway.com/inference/deployments) where these limts do not apply (as your bill will be limited by the size of your deployment). -## What is the model lifecycle for Generative APIs? -Scaleway is dedicated to updating and offering the latest versions of generative AI models, while ensuring older models remain accessible for a significant time, and also ensuring the reliability of your production applications. Learn more in our [model lifecycle policy](/generative-apis/reference-content/model-lifecycle/). -## What are the SLAs applicable to Generative APIs? -We are currently working on defining our SLAs for Generative APIs. We will provide more information on this topic soon. -## What are the performance guarantees (vs Managed Inference)? -We are currently working on defining our performance guarantees for Generative APIs. We will provide more information on this topic soon. -## Do model licenses apply when using Generative APIs? -Yes, you need to comply with model licenses when using Generative APIs. Applicable licenses are available for [each model in our documentation](/generative-apis/reference-content/supported-models/#vision-models) and in Console Playground. + diff --git a/pages/generative-apis/reference-content/adding-ai-to-intellij-using-continue.mdx b/pages/generative-apis/reference-content/adding-ai-to-intellij-using-continue.mdx index b8e462658e..8322949f3a 100644 --- a/pages/generative-apis/reference-content/adding-ai-to-intellij-using-continue.mdx +++ b/pages/generative-apis/reference-content/adding-ai-to-intellij-using-continue.mdx @@ -3,7 +3,7 @@ title: Adding AI to IntelliJ IDEA using Continue and Generative APIs description: Learn how to integrate AI-powered code models into IntelliJ IDEA with Continue and Scaleway's Generative APIs. 
tags: generative-apis ai machine-learning language-models code-assistance intellij-idea continue dates: - validation: 2025-02-14 + validation: 2025-08-20 posted: 2025-02-14 --- import Requirements from '@macros/iam/requirements.mdx' @@ -20,8 +20,6 @@ This guide will help you integrate AI-powered code models into JetBrain's Intell - A valid [API key](/iam/how-to/create-api-keys/) for API authentication - Installed [IntelliJ IDEA](https://www.jetbrains.com/idea/) on your local machine. - - You can install Continue from the [JetBrains marketplace](https://plugins.jetbrains.com/plugin/22707-continue): 1. Open IntelliJ IDEA and go to **Preferences/Settings** (`Ctrl+Alt+S` on Windows/Linux, `Cmd+,` on macOS). @@ -149,7 +147,6 @@ Alternatively, a `config.json` file can be used with the following format. Note } ``` - For more details on configuring `config.yaml`, refer to the [official Continue documentation](https://docs.continue.dev/reference). If you want to limit access to a specific Scaleway Project, you should add the field `"apiBase": "https://api.scaleway.ai/###PROJECT_ID###/v1/"` for each model (ie. `models`, `embeddingsProvider` and `tabAutocompleteModel`) since the default URL `https://api.scaleway.ai/v1/` can only be used with the `default` project. diff --git a/pages/generative-apis/reference-content/adding-ai-to-vscode-using-continue.mdx b/pages/generative-apis/reference-content/adding-ai-to-vscode-using-continue.mdx index 069b6960d2..49cf4d76df 100644 --- a/pages/generative-apis/reference-content/adding-ai-to-vscode-using-continue.mdx +++ b/pages/generative-apis/reference-content/adding-ai-to-vscode-using-continue.mdx @@ -3,7 +3,7 @@ title: Adding AI to VS Code using Continue and Generative APIs description: Learn how to integrate AI-powered code models into VS Code with Continue and Scaleway's Generative APIs. tags: generative-apis ai machine-learning language-models code-assistance vs-code continue dates: - validation: 2025-02-14 + validation: 2025-08-20 posted: 2025-02-14 --- import Requirements from '@macros/iam/requirements.mdx' diff --git a/pages/generative-apis/reference-content/supported-models.mdx b/pages/generative-apis/reference-content/supported-models.mdx index c3034f9452..6236874f1e 100644 --- a/pages/generative-apis/reference-content/supported-models.mdx +++ b/pages/generative-apis/reference-content/supported-models.mdx @@ -3,7 +3,7 @@ title: Supported models description: This page lists which open-source chat or embedding models Scaleway is currently hosting tags: generative-apis ai-data supported-models dates: - validation: 2025-02-14 + validation: 2025-08-20 posted: 2024-09-02 --- diff --git a/pages/load-balancer/how-to/create-manage-acls.mdx b/pages/load-balancer/how-to/create-manage-acls.mdx index db2b46b938..1bb6e4763b 100644 --- a/pages/load-balancer/how-to/create-manage-acls.mdx +++ b/pages/load-balancer/how-to/create-manage-acls.mdx @@ -3,7 +3,7 @@ title: How to create and manage ACLs description: Discover how to create and manage ACLs for Scaleway Load Balancers. Improve security, manage traffic efficiently, and optimize your network setup easily. 
tags: acl load-balancer acls access-control-list access control HTTP-header filter allow reject redirect HTTPSacl load-balancer access-control access-control-list dates: - validation: 2025-02-13 + validation: 2025-08-20 posted: 2022-02-04 --- import Requirements from '@macros/iam/requirements.mdx' diff --git a/pages/managed-inference/faq.mdx b/pages/managed-inference/faq.mdx index f9cd098ca7..2a32a8df67 100644 --- a/pages/managed-inference/faq.mdx +++ b/pages/managed-inference/faq.mdx @@ -6,34 +6,32 @@ dates: productIcon: InferenceProductIcon --- -## What is Scaleway Managed Inference? +## General + +### What is Scaleway Managed Inference? Scaleway's Managed Inference is a fully managed service that allows you to deploy, run, and scale AI models in a dedicated environment. It provides optimized infrastructure, customizable deployment options, and secure access controls to meet the needs of enterprises and developers looking for high-performance inference solutions. -## Where are the inference servers located? +### Where are the inference servers located? All models are currently hosted in a secure data center located in Paris, France, operated by [OPCORE](https://www.opcore.com/). This ensures low latency for European users and compliance with European data privacy regulations. -## What is the difference between Managed Inference and Generative APIs? +### What is the difference between Managed Inference and Generative APIs? - **Managed Inference**: Allows deployment of curated or custom models with chosen quantization and instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not. - **Generative APIs**: A serverless service providing access to pre-configured AI models via API, billed per token usage. -## Where can I find information regarding the data, privacy, and security policies applied to Scaleway's AI services? -You can find detailed information regarding the policies applied to Scaleway's AI services in our [Data, privacy, and security for Scaleway's AI services](/managed-inference/reference-content/data-privacy-security-scaleway-ai-services/) documentation. - -## Is Managed Inference compatible with Open AI APIs? -Managed Inference aims to achieve seamless compatibility with OpenAI APIs. Find detailed information in the [Scaleway Managed Inference as drop-in replacement for the OpenAI APIs](/managed-inference/reference-content/openai-compatibility/) documentation. +### What Instance types are available for inference? +Managed Inference offers different Instance types optimized for various workloads from Scaleway's [GPU Instances](/gpu/reference-content/choosing-gpu-instance-type/) range. +You can select the Instance type based on your model’s computational needs and compatibility. -## What are the SLAs applicable to Managed Inference? -We are currently working on defining our SLAs for Managed Inference. We will provide more information on this topic soon. +### Is Managed Inference suitable for real-time applications? +Yes, Managed Inference is designed for low-latency, high-throughput applications, making it suitable for real-time use cases such as chatbots, recommendation systems, fraud detection, and live video processing. -## What are the performance guarantees (vs. Generative APIs)? 
-Managed Inference provides dedicated resources, ensuring predictable performance and lower latency compared to Generative APIs, which are a shared, serverless offering optimized for infrequent traffic with moderate peak loads. Managed Inference is ideal for workloads that require consistent response times, high availability, custom hardware configurations, or generate extreme peak loads during a narrow period. -Compared to Generative APIs, no usage quota is applied to the number of tokens per second generated, since the output is limited by the GPU Instance size and number of your Managed Inference Deployment. +### Can I use Managed Inference with other Scaleway services? +Absolutely. Managed Inference integrates seamlessly with other Scaleway services, such as [Object Storage](/object-storage/quickstart/) for model hosting, [Kubernetes](/kubernetes/quickstart/) for containerized applications, and [Scaleway IAM](/iam/quickstart/) for access management. -## How can I monitor performance? -Managed Inference metrics and logs are available in [Scaleway Cockpit](https://console.scaleway.com/cockpit/overview). You can follow your deployment metrics in real-time, such as tokens throughput, requests latency, GPU power usage, and GPU VRAM usage. +## Models and APIs -## What types of models can I deploy with Managed Inference? +### What types of models can I deploy with Managed Inference? You can deploy a variety of models, including: * Large language models (LLMs) * Image processing models @@ -41,37 +39,47 @@ You can deploy a variety of models, including: * Custom AI models (through API only yet) Managed Inference supports both open-source models and your own uploaded proprietary models. -## How do I deploy a model using Managed Inference? +### How do I deploy a model using Managed Inference? Deployment is done through Scaleway's [console](https://console.scaleway.com/inference/deployments) or [API](https://www.scaleway.com/en/developers/api/inference/). You can choose a model from Scaleway’s selection or import your own directly from Hugging Face's repositories, configure [Instance types](/gpu/reference-content/choosing-gpu-instance-type/), set up networking options, and start inference with minimal setup. -## Can I fine-tune or retrain my models within Managed Inference? +### Can I fine-tune or retrain my models within Managed Inference? Managed Inference is primarily designed for deploying and running inference workloads. If you need to fine-tune or retrain models, you may need to use a separate training environment, such as [Scaleway’s GPU Instances](/gpu/quickstart/), and then deploy the trained model in Managed Inference. -## What Instance types are available for inference? -Managed Inference offers different Instance types optimized for various workloads from Scaleway's [GPU Instances](/gpu/reference-content/choosing-gpu-instance-type/) range. -You can select the Instance type based on your model’s computational needs and compatibility. - -## How is Managed Inference billed? -Billing is based on the Instance type and usage duration (in minutes). Unlike [Generative APIs](/generative-apis/quickstart/), which are billed per token, Managed Inference provides predictable costs based on the allocated infrastructure. Billing only starts when model a deployment is ready and can be queried. -Pricing details can be found on the [Scaleway pricing page](https://www.scaleway.com/en/pricing/model-as-a-service/#managed-inference). 
- -## Can I pause Managed Inference billing when the instance is not in use? -When a Managed Inference deployment is running, corresponding resources are provisioned and thus billed. Resources can therefore not be paused. -However, you can still optimize your Managed Inference deployment to fit within specific time ranges (such as during working hours). To do so, you can automate deployment creation and deletion using the [Managed Inference API](https://www.scaleway.com/en/developers/api/inference/), [Terraform](https://registry.terraform.io/providers/scaleway/scaleway/latest/docs/resources/inference_deployment) or [Scaleway SDKs](https://www.scaleway.com/en/docs/scaleway-sdk/). These actions can be programmed using [Serverless Jobs](/serverless-jobs/) to be automatically carried out periodically. - -## Can I run inference on private models? +### Can I run inference on private models? Yes, Managed Inference allows you to deploy private models with access control settings. You can restrict access to specific users, teams, or networks. -## Does Managed Inference support model quantization? +### Does Managed Inference support model quantization? Yes, Scaleway Managed Inference supports model [quantization](/managed-inference/concepts/#quantization) to optimize performance and reduce inference latency. You can select different quantization options depending on your accuracy and efficiency requirements. -## Is Managed Inference suitable for real-time applications? -Yes, Managed Inference is designed for low-latency, high-throughput applications, making it suitable for real-time use cases such as chatbots, recommendation systems, fraud detection, and live video processing. - -## Can I use Managed Inference with other Scaleway services? -Absolutely. Managed Inference integrates seamlessly with other Scaleway services, such as [Object Storage](/object-storage/quickstart/) for model hosting, [Kubernetes](/kubernetes/quickstart/) for containerized applications, and [Scaleway IAM](/iam/quickstart/) for access management. - -## Do model licenses apply when using Managed Inference? +### Do model licenses apply when using Managed Inference? Yes, model licenses need to be complied with when using Managed Inference. Applicable licenses are available for [each model in our documentation](/managed-inference/reference-content/). - For models provided in the Scaleway catalog, you need to accept licenses (including potential EULA) before creating any Managed Inference deployment. - For custom models you choose to import on Scaleway, you are responsible for complying with model licenses (as with any software you choose to install on a GPU Instance for example). + +### Is Managed Inference compatible with Open AI APIs? +Managed Inference aims to achieve seamless compatibility with OpenAI APIs. Find detailed information in the [Scaleway Managed Inference as drop-in replacement for the OpenAI APIs](/managed-inference/reference-content/openai-compatibility/) documentation. + +## Privacy, security and performance + +### Where can I find information regarding the data, privacy, and security policies applied to Scaleway's AI services? +You can find detailed information regarding the policies applied to Scaleway's AI services in our [Data, privacy, and security for Scaleway's AI services](/managed-inference/reference-content/data-privacy-security-scaleway-ai-services/) documentation. + +### What are the SLAs applicable to Managed Inference? +We are currently working on defining our SLAs for Managed Inference. 
We will provide more information on this topic soon. + +### What are the performance guarantees (vs. Generative APIs)? +Managed Inference provides dedicated resources, ensuring predictable performance and lower latency compared to Generative APIs, which are a shared, serverless offering optimized for infrequent traffic with moderate peak loads. Managed Inference is ideal for workloads that require consistent response times, high availability, custom hardware configurations, or generate extreme peak loads during a narrow period. +Compared to Generative APIs, no usage quota is applied to the number of tokens per second generated, since the output is limited by the GPU Instance size and number of your Managed Inference Deployment. + +### How can I monitor performance? +Managed Inference metrics and logs are available in [Scaleway Cockpit](https://console.scaleway.com/cockpit/overview). You can follow your deployment metrics in real-time, such as tokens throughput, requests latency, GPU power usage, and GPU VRAM usage. + +## Billing + +### How is Managed Inference billed? +Billing is based on the Instance type and usage duration (in minutes). Unlike [Generative APIs](/generative-apis/quickstart/), which are billed per token, Managed Inference provides predictable costs based on the allocated infrastructure. Billing only starts when model a deployment is ready and can be queried. +Pricing details can be found on the [Scaleway pricing page](https://www.scaleway.com/en/pricing/model-as-a-service/#managed-inference). + +### Can I pause Managed Inference billing when the instance is not in use? +When a Managed Inference deployment is running, corresponding resources are provisioned and thus billed. Resources can therefore not be paused. +However, you can still optimize your Managed Inference deployment to fit within specific time ranges (such as during working hours). To do so, you can automate deployment creation and deletion using the [Managed Inference API](https://www.scaleway.com/en/developers/api/inference/), [Terraform](https://registry.terraform.io/providers/scaleway/scaleway/latest/docs/resources/inference_deployment) or [Scaleway SDKs](https://www.scaleway.com/en/docs/scaleway-sdk/). These actions can be programmed using [Serverless Jobs](/serverless-jobs/) to be automatically carried out periodically. From 417eb6e6e8c439d16287f1774e5115c8afbf7a9a Mon Sep 17 00:00:00 2001 From: Rowena Jones <36301604+RoRoJ@users.noreply.github.com> Date: Thu, 21 Aug 2025 14:43:33 +0200 Subject: [PATCH 2/2] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Jessica <113192637+jcirinosclwy@users.noreply.github.com> Co-authored-by: Néda <87707325+nerda-codes@users.noreply.github.com> --- pages/generative-apis/faq.mdx | 10 +++++----- pages/managed-inference/faq.mdx | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/pages/generative-apis/faq.mdx b/pages/generative-apis/faq.mdx index 8615a86170..99cd33b139 100644 --- a/pages/generative-apis/faq.mdx +++ b/pages/generative-apis/faq.mdx @@ -19,7 +19,7 @@ All models are currently hosted in a secure data center located in Paris, France ### What is the difference between Generative APIs and Managed Inference? - **Generative APIs**: A serverless service providing access to pre-configured AI models via API, billed per token usage. 
-- **Managed Inference**: Allows deployment of curated or custom models with chosen quantization and instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not. +- **Managed Inference**: Allows deployment of curated or custom models with chosen quantization and Instances, offering predictable throughput and enhanced security features like private network isolation and access control. Managed Inference is billed by hourly usage, whether provisioned capacity is receiving traffic or not. ## Models and libraries @@ -33,10 +33,10 @@ Our Generative APIs support a range of popular models, including: Scaleway is dedicated to updating and offering the latest versions of generative AI models, while ensuring older models remain accessible for a significant time, and also ensuring the reliability of your production applications. Learn more in our [model lifecycle policy](/generative-apis/reference-content/model-lifecycle/). ### Do model licenses apply when using Generative APIs? -Yes, you need to comply with model licenses when using Generative APIs. Applicable licenses are available for [each model in our documentation](/generative-apis/reference-content/supported-models/#vision-models) and in Console Playground. +Yes, you need to comply with model licenses when using Generative APIs. Applicable licenses are available for [each model in our documentation](/generative-apis/reference-content/supported-models/#vision-models) and in the console Playground. ### Can I increase maximum output (completion) tokens for a model? -No, you cannot increase maximum output tokens above [limits for each models](https://www.scaleway.com/en/docs/generative-apis/reference-content/supported-models/) in Generative APIs. +No, you cannot increase maximum output tokens above [limits for each models](/generative-apis/reference-content/supported-models/) in Generative APIs. These limits are in place to protect you against: - Long generation which may be ended by an HTTP timeout. Limits are designed to ensure a model will send its HTTP response in less than 5 minutes. - Uncontrolled billing, as several models are known to be able to enter infinite generation loops (specific prompts can make the model generate the same sentence over and over, without stopping at all). @@ -91,10 +91,10 @@ The exact token count and definition depend on [tokenizers](https://huggingface. You can see your token consumption in [Scaleway Cockpit](/cockpit/). You can access it from the Scaleway console under the [Metrics tab](https://console.scaleway.com/generative-api/metrics). Note that: - Cockpits are isolated by Project, hence you first need to select the right Project in the Scaleway console before accessing Cockpit to see your token consumption for this Project (you can see the `project_id` in the Cockpit URL: `https://{project_id}.dashboard.obs.fr-par.scw.cloud/`. -- Cockpit graphs can take up to 1 hour to update token consumption, see [Troubleshooting](https://www.scaleway.com/en/docs/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details. +- Cockpit graphs can take up to 1 hour to update token consumption, see [Troubleshooting](/generative-apis/troubleshooting/fixing-common-issues/#tokens-consumption-is-not-displayed-in-cockpit-metrics) for further details. ### Can I configure a maximum billing threshold? 
-Currently, you cannot configure a specific threshold after which your usage will blocked. However: +Currently, you cannot configure a specific threshold after which your usage will be blocked. However: - You can [configure billing alerts](/billing/how-to/use-billing-alerts/) to ensure you are warned when you hit specific budget thresholds. - Your total billing remains limited by the amount of tokens you can consume within [rate limits](/generative-apis/reference-content/rate-limits/). - If you want to ensure fixed billing, you can use [Managed Inference](https://www.scaleway.com/en/inference/), which provides the same set of OpenAI-compatible APIs and a wider range of models. diff --git a/pages/managed-inference/faq.mdx b/pages/managed-inference/faq.mdx index 2a32a8df67..ca9ba611b3 100644 --- a/pages/managed-inference/faq.mdx +++ b/pages/managed-inference/faq.mdx @@ -77,9 +77,9 @@ Managed Inference metrics and logs are available in [Scaleway Cockpit](https://c ## Billing ### How is Managed Inference billed? -Billing is based on the Instance type and usage duration (in minutes). Unlike [Generative APIs](/generative-apis/quickstart/), which are billed per token, Managed Inference provides predictable costs based on the allocated infrastructure. Billing only starts when model a deployment is ready and can be queried. +Billing is based on the Instance type and usage duration (in minutes). Unlike [Generative APIs](/generative-apis/quickstart/), which are billed per token, Managed Inference provides predictable costs based on the allocated infrastructure. Billing only starts when a deployment is ready and can be queried. Pricing details can be found on the [Scaleway pricing page](https://www.scaleway.com/en/pricing/model-as-a-service/#managed-inference). -### Can I pause Managed Inference billing when the instance is not in use? +### Can I pause Managed Inference billing when the Instance is not in use? When a Managed Inference deployment is running, corresponding resources are provisioned and thus billed. Resources can therefore not be paused. However, you can still optimize your Managed Inference deployment to fit within specific time ranges (such as during working hours). To do so, you can automate deployment creation and deletion using the [Managed Inference API](https://www.scaleway.com/en/developers/api/inference/), [Terraform](https://registry.terraform.io/providers/scaleway/scaleway/latest/docs/resources/inference_deployment) or [Scaleway SDKs](https://www.scaleway.com/en/docs/scaleway-sdk/). These actions can be programmed using [Serverless Jobs](/serverless-jobs/) to be automatically carried out periodically.
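As an illustration of the automation described in the answer above, below is a minimal sketch of a shutdown script that could run on a schedule (for example as a Serverless Job with a cron trigger). It is not an official example: the API path (`/inference/v1/regions/{region}/deployments`) and the response field names used here are assumptions for illustration only — refer to the [Managed Inference API](https://www.scaleway.com/en/developers/api/inference/) reference for the authoritative endpoints and payloads.

```python
# Minimal sketch: delete a Managed Inference deployment outside working hours.
# Intended to run on a schedule (e.g. as a Serverless Job with a cron trigger).
# NOTE: the API path and response field names below are assumptions for
# illustration only - check the Managed Inference API reference before use.
import os
from typing import Optional

import requests

API_BASE = "https://api.scaleway.com/inference/v1/regions/fr-par"  # assumed path/version
HEADERS = {"X-Auth-Token": os.environ["SCW_SECRET_KEY"]}  # Scaleway API key secret


def find_deployment_id(name: str) -> Optional[str]:
    """Return the ID of the deployment with the given name, if it exists."""
    resp = requests.get(f"{API_BASE}/deployments", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    for deployment in resp.json().get("deployments", []):  # field name assumed
        if deployment.get("name") == name:
            return deployment.get("id")
    return None


def delete_deployment(name: str) -> None:
    """Delete the named deployment so that it is no longer provisioned (and billed)."""
    deployment_id = find_deployment_id(name)
    if deployment_id is None:
        print(f"No deployment named {name!r} found, nothing to do.")
        return
    resp = requests.delete(
        f"{API_BASE}/deployments/{deployment_id}", headers=HEADERS, timeout=30
    )
    resp.raise_for_status()
    print(f"Deletion requested for deployment {name!r} ({deployment_id}).")


if __name__ == "__main__":
    delete_deployment(os.environ.get("DEPLOYMENT_NAME", "my-inference-deployment"))
```

A symmetric job can re-create the deployment at the start of working hours (for example with the Terraform resource or SDKs linked above); billing stops once the deployment is deleted and resumes when the new deployment is ready and can be queried.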