`pages/generative-apis/faq.mdx` (7 additions, 7 deletions)
@@ -118,18 +118,18 @@ Note that:
 Generative APIs targets a 99.9% monthly availability rate detailed in [Service Level Agreement for Generative APIs](https://www.scaleway.com/en/generative-apis/sla/).
 
 ### What are the performance guarantees (vs Managed Inference)?
-Generative APIs is optimized and monitored to provide reliable performance in most use cases but does not strictly guarantee performance as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.
+Generative APIs is optimized and monitored to provide reliable performance in most use cases, but does not strictly guarantee performance as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.
 
 As an order of magnitude, for Chat models, when performing request with `stream` activated:
-- time to first token should be less than `1` second for most standard queries (with less than 1000 input tokens)
-- output tokens generation speed should be above `100` tokens per second for recent small to medium size models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`)
+- Time to first token should be less than `1` second for most standard queries (with less than 1000 input tokens)
+- Output token generation speed should be above `100` tokens per second for recent small to medium size models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`)
 
-Exact performance will still vary based on these main factors:
+Exact performance will still vary based mainly on the following factors:
 - Model size and architecture: Smaller and more recent models usually provide better performance.
 - Model type:
-  - Chat models time to first token increase proportionally to the input context size after a certain threshold (usually above `1 000` tokens).
-  - Audio transcription models time to first token remains mostly constant, as they only need to process small number of input tokens (`30` seconds audio chunk) to generate a first output.
-- Input and output size: As a first approximation, total processing time is proportionnal to input and output size. However, for significant size queries (usually above `10 000` tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries in the smallest meaningful part (`10` queries with `1 000` input tokens and `100` output tokens will be processed faster than `1` query with `10 000` input tokens and `1 000` output tokens).
+  - Chat models' time to first token increases proportionally to the input context size after a certain threshold (usually above `1 000` tokens).
+  - Audio transcription models' time to first token remains mostly constant, as they only need to process small numbers of input tokens (`30` seconds audio chunk) to generate a first output.
+- Input and output size: In rough terms, total processing time is proportional to input and output size. However, for larger queries (usually above `10 000` tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries into the smallest meaningful parts (`10` queries with `1 000` input tokens and `100` output tokens will be processed faster than `1` query with `10 000` input tokens and `1 000` output tokens).
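
To illustrate the streaming behavior described in the updated FAQ text, the following minimal sketch measures time to first token and output speed for a streamed chat completion. It assumes an OpenAI-compatible endpoint at `https://api.scaleway.ai/v1`, an API key in a `SCW_SECRET_KEY` environment variable, and the `openai` Python package; these details are assumptions for the example, not part of the diff.

```python
import os
import time

from openai import OpenAI

# Assumed endpoint and credential variable; adjust to your own configuration.
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.environ["SCW_SECRET_KEY"],
)

start = time.perf_counter()
first_token_at = None
parts = []

# Streamed request: chunks arrive as they are generated.
stream = client.chat.completions.create(
    model="mistral-small-3.2-24b-instruct-2506",  # model name taken from the FAQ text
    messages=[{"role": "user", "content": "Explain streaming responses in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token
    parts.append(delta)

end = time.perf_counter()
output = "".join(parts)

print(f"Time to first token: {first_token_at - start:.2f} s")
# Rough output-speed estimate in characters per second; counting tokens exactly
# would require the model's tokenizer.
print(f"Output speed: {len(output) / (end - first_token_at):.0f} chars/s")
```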
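
The recommendation to split large queries into the smallest meaningful parts could look like the sketch below, reusing the `client` object from the previous example. The documents and prompt are placeholders, not part of the FAQ.

```python
# Placeholder inputs: several short documents instead of one very large prompt.
documents = ["<document 1>", "<document 2>", "<document 3>"]

summaries = []
for doc in documents:
    # One request per document (around 1 000 input tokens each) rather than a single
    # request concatenating everything (around 10 000 input tokens).
    response = client.chat.completions.create(
        model="mistral-small-3.2-24b-instruct-2506",
        messages=[{"role": "user", "content": f"Summarize the following text:\n\n{doc}"}],
    )
    summaries.append(response.choices[0].message.content)

print(summaries)
```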