
Commit 5146ea4

nerda-codes and RoRoJ authored
docs(add): rowena review
Co-authored-by: Rowena Jones <[email protected]>
1 parent 94ff7bc commit 5146ea4

File tree

1 file changed: +7 -7 lines changed


pages/generative-apis/faq.mdx

Lines changed: 7 additions & 7 deletions
```diff
@@ -118,18 +118,18 @@ Note that:
 Generative APIs targets a 99.9% monthly availability rate detailed in [Service Level Agreement for Generative APIs](https://www.scaleway.com/en/generative-apis/sla/).

 ### What are the performance guarantees (vs Managed Inference)?
-Generative APIs is optimized and monitored to provide reliable performance in most use cases but does not strictly guarantee performance as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.
+Generative APIs is optimized and monitored to provide reliable performance in most use cases, but does not strictly guarantee performance as it depends on many client-side parameters. We recommend using Managed Inference (dedicated deployment capacity) for applications with critical performance requirements.

 As an order of magnitude, for Chat models, when performing request with `stream` activated:
-- time to first token should be less than `1` second for most standard queries (with less than 1000 input tokens)
-- output tokens generation speed should be above `100` tokens per second for recent small to medium size models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`)
+- Time to first token should be less than `1` second for most standard queries (with less than 1000 input tokens)
+- Output token generation speed should be above `100` tokens per second for recent small to medium size models (such as `gpt-oss-120b` or `mistral-small-3.2-24b-instruct-2506`)

-Exact performance will still vary based on these main factors:
+Exact performance will still vary based mainly on the following factors:
 - Model size and architecture: Smaller and more recent models usually provide better performance.
 - Model type:
-  - Chat models time to first token increase proportionally to the input context size after a certain threshold (usually above `1 000` tokens).
-  - Audio transcription models time to first token remains mostly constant, as they only need to process small number of input tokens (`30` seconds audio chunk) to generate a first output.
-- Input and output size: As a first approximation, total processing time is proportionnal to input and output size. However, for significant size queries (usually above `10 000` tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries in the smallest meaningful part (`10` queries with `1 000` input tokens and `100` output tokens will be processed faster than `1` query with `10 000` input tokens and `1 000` output tokens).
+  - Chat models' time to first token increases proportionally to the input context size after a certain threshold (usually above `1 000` tokens).
+  - Audio transcription models' time to first token remains mostly constant, as they only need to process small numbers of input tokens (`30` seconds audio chunk) to generate a first output.
+- Input and output size: In rough terms, total processing time is proportional to input and output size. However, for larger queries (usually above `10 000` tokens), processing speed may degrade with query size. For optimal performance, we recommend splitting queries into the smallest meaningful parts (`10` queries with `1 000` input tokens and `100` output tokens will be processed faster than `1` query with `10 000` input tokens and `1 000` output tokens).

 ## Quotas and limitations
```
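As a quick sanity check on the figures quoted in the FAQ answer above (time to first token and output token speed with `stream` activated), the snippet below is a minimal sketch using the `openai` Python client against an OpenAI-compatible chat completions endpoint. The base URL and environment variable names are placeholders, not values taken from this commit; the model name `gpt-oss-120b` is one of those cited in the FAQ.

```python
# Minimal sketch: measure time to first token and rough output speed for a
# streaming chat completion. Assumes an OpenAI-compatible endpoint; the base
# URL and environment variable names below are placeholders to adapt.
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=os.environ.get("GENAI_BASE_URL", "https://example.invalid/v1"),  # placeholder endpoint
    api_key=os.environ["GENAI_API_KEY"],  # placeholder variable name
)

start = time.perf_counter()
first_token_time = None
chunk_count = 0

stream = client.chat.completions.create(
    model="gpt-oss-120b",  # model name mentioned in the FAQ above
    messages=[{"role": "user", "content": "Explain streaming responses in two sentences."}],
    stream=True,
)

for chunk in stream:
    # Some chunks (for example the final one) may carry no choices or an empty delta.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunk_count += 1

if first_token_time is not None:
    total = time.perf_counter() - first_token_time
    print(f"Time to first token: {first_token_time - start:.2f}s")
    # Each streamed chunk usually carries roughly one token, so this is only
    # an approximation of output tokens per second.
    print(f"Approx. generation speed: {chunk_count / max(total, 1e-9):.1f} chunks/s")
```

The same harness can be reused to compare one `10 000`-token request against several `1 000`-token requests, in line with the FAQ's recommendation to split large queries into smaller meaningful parts.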
