Commit a6542d2

update documentations for streaming mode
Signed-off-by: Jintao Zhang <[email protected]>
1 parent 7c12482 commit a6542d2

2 files changed: +14 -1 lines changed

website/docs/api/router.md

Lines changed: 6 additions & 0 deletions
@@ -351,6 +351,12 @@ histogram_quantile(0.95, sum(rate(llm_model_tpot_seconds_bucket[5m])) by (le, mo
 
 These are included in the provided Grafana dashboard at deploy/llm-router-dashboard.json as “TTFT (p95) by Model” and “TPOT (p95) by Model (sec/token)”.
 
+#### Streaming (SSE) notes
+
+- For Server-Sent Events (SSE) responses, the router measures TTFT on the first streamed body chunk (i.e., the first token), not on response headers.
+- No manual change to your Envoy config is required: the ExtProc handler automatically sets a ModeOverride with `response_body_mode: STREAMED` for SSE responses so the first chunk reaches ExtProc immediately.
+- Prerequisite: Envoy’s ext_proc filter must have `allow_mode_override: true` (the default configs in `config/envoy.yaml` and `config/envoy-docker.yaml` already include this). Keeping `response_body_mode: BUFFERED` in the static processing mode is fine; the router will flip it to STREAMED at runtime for SSE.
+
 ### Pricing Configuration
 
 Provide per-1M pricing for your models so the router can compute request cost and emit metrics/logs.
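
To illustrate the TTFT rule described in the streaming notes above, here is a minimal Python sketch of a recorder that ignores response headers and stamps TTFT on the first body chunk. It is not the router's implementation: `TTFTRecorder`, its method names, and the injectable `clock` parameter are hypothetical and exist only for this example.

```python
import time
from typing import Callable, Optional


class TTFTRecorder:
    """Hypothetical helper: tracks time-to-first-token for one response."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # taken when the request begins
        self.ttft_seconds: Optional[float] = None

    def on_response_headers(self) -> None:
        # For SSE, headers can arrive long before any token is generated,
        # so they are deliberately not used for TTFT.
        return None

    def on_response_body_chunk(self, chunk: bytes) -> None:
        # Only the first streamed body chunk sets TTFT; later chunks
        # leave the recorded value untouched.
        if self.ttft_seconds is None:
            self.ttft_seconds = self._clock() - self._start
```

If the response body were buffered instead of streamed, the first call to `on_response_body_chunk` would only happen after the whole response had been generated, which is why the notes above tie accurate TTFT to the STREAMED body mode.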

website/docs/overview/architecture/envoy-extproc.md

Lines changed: 8 additions & 1 deletion
@@ -410,7 +410,7 @@ static_resources:
 request_header_mode: "SEND"
 response_header_mode: "SEND"
 request_body_mode: "BUFFERED" # Required for content analysis
-response_body_mode: "BUFFERED" # Required for caching
+response_body_mode: "BUFFERED" # Default: router flips to STREAMED at runtime for SSE
 request_trailer_mode: "SKIP"
 response_trailer_mode: "SKIP"
 
@@ -419,6 +419,13 @@ static_resources:
 allow_mode_override: true # Allow ExtProc to change modes
 message_timeout: 300s # Timeout for ExtProc responses
 max_message_timeout: 600s # Maximum allowed timeout
+
+> Note on SSE (streaming):
+>
+> When the upstream responds with `Content-Type: text/event-stream`, the router sets a per-message
+> `ModeOverride` with `response_body_mode: STREAMED` so the first chunk reaches ExtProc immediately.
+> This enables accurate TTFT measurement on the first token. No manual change to the static
+> `processing_mode` is required as long as `allow_mode_override: true` is set (it is in the default configs).
 
 # Advanced configuration
 mutation_rules:
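
The decision described in the SSE note can be sketched as follows. This is an illustrative Python model only, not the router's handler: the function name `response_mode_override` and the plain-dict return value are assumptions made for the example, standing in for whatever structure carries the real `ModeOverride`.

```python
from typing import Mapping, Optional


def response_mode_override(headers: Mapping[str, str]) -> Optional[dict]:
    """Hypothetical helper: choose a per-message override from response headers.

    Returns an override requesting STREAMED response bodies for SSE,
    or None to keep the static processing_mode (e.g. BUFFERED).
    """
    content_type = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-type":
            content_type = value
            break

    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/event-stream":
        return {"response_body_mode": "STREAMED"}
    return None
```

For a non-SSE response such as `application/json`, the sketch returns `None`, which corresponds to leaving the static `response_body_mode` in effect.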
