## Problem

Long multi-turn conversations regularly exceed a model's context window.
Today LocalAI returns:

```
HTTP 500: request (61234 tokens) exceeds the available context size (32768)
```

The user has to truncate the conversation by hand, losing context.
Client-side workarounds exist (e.g. LibreChat's summarization), but they
duplicate effort across every frontend talking to LocalAI, and they don't
help operator-driven scenarios like the `/v1/mcp/chat/completions` path
or long-lived agent workflows.
## Proposal

Add an opt-in, per-model middleware that compresses the head of the
conversation with a fast secondary model when a request approaches
`context_size`. The compressed content is replaced by a single system
message containing the summary, and the request is then proxied to the
primary model.
## Config (per-model YAML)

```yaml
compression:
  enabled: true
  trigger_at_ratio: 0.75      # compress when a request hits 75% of context_size
  keep_tail_tokens: 8000      # never compress the final 8k tokens
  max_summary_tokens: 2048
  compressor_model: "LFM2-24B-A2B-GGUF"                # optional; falls back to the primary model
  on_post_compression_overflow: "drop_oldest_summary"  # alternative: "error"
```

Opt-in and off by default: an absent `compression` block means no change
to existing behavior.
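To make the shape concrete, here is a minimal sketch of the config struct and trigger check. The struct name `CompressionConfig` comes from the PR outline below; the Go field names, yaml tags, and the `ShouldCompress` helper are illustrative, not the final API:

```go
package main

import "fmt"

// CompressionConfig mirrors the proposed per-model YAML block.
// Field names are illustrative; the yaml tags match the keys above.
type CompressionConfig struct {
	Enabled                   bool    `yaml:"enabled"`
	TriggerAtRatio            float64 `yaml:"trigger_at_ratio"`
	KeepTailTokens            int     `yaml:"keep_tail_tokens"`
	MaxSummaryTokens          int     `yaml:"max_summary_tokens"`
	CompressorModel           string  `yaml:"compressor_model"`
	OnPostCompressionOverflow string  `yaml:"on_post_compression_overflow"`
}

// ShouldCompress reports whether a request of requestTokens tokens crosses
// the trigger threshold for a model with the given context size. With the
// defaults above (ratio 0.75, context_size 32768) the threshold is 24576.
func (c CompressionConfig) ShouldCompress(requestTokens, contextSize int) bool {
	if !c.Enabled || contextSize <= 0 {
		return false
	}
	ratio := c.TriggerAtRatio
	if ratio <= 0 || ratio > 1 {
		ratio = 0.75 // fall back to the documented default when unset or out of range
	}
	return float64(requestTokens) >= ratio*float64(contextSize)
}

func main() {
	cfg := CompressionConfig{Enabled: true, TriggerAtRatio: 0.75}
	fmt.Println(cfg.ShouldCompress(61234, 32768)) // the request from the error above; prints "true"
}
```

The 61234-token request from the Problem section trips the check because 61234 ≥ 0.75 × 32768 = 24576.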
## Flow

1. Middleware reads `config.Compression` after `SetOpenAIRequest`.
2. Counts request tokens via tiktoken-go.
3. Under the threshold → pass through.
4. Over the threshold → partition the messages into *compress* (head) and
   *keep* (tail, preserving `keep_tail_tokens`).
5. Invokes `compressor_model` with a fixed prompt:

   > Summarize the following conversation for an AI agent to continue
   > coherently. Preserve: names, numbers, decisions, URLs, error
   > messages, tool names and their results. Drop: pleasantries,
   > repetition. Max: 500 tokens.

6. Replaces the compressed turns with
   `{"role": "system", "content": "[COMPRESSED: <summary>]"}`.
7. If the request is still over context after compression: drops the
   oldest summary and retries (configurable). Two consecutive drops →
   HTTP 413.
8. Proxies the transformed request to the primary model.
9. Attaches `usage.compression_meta`:

   ```json
   {
     "original_tokens": 61234,
     "compressed_tokens": 15120,
     "dropped_turns": 18,
     "compressor": "LFM2-24B-A2B-GGUF",
     "summary_tokens": 497,
     "overflow_recoveries": 0
   }
   ```
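Steps 4 and 6 can be sketched as follows. This is a self-contained illustration, not the implementation: `Message`, `countTokens` (a whitespace stand-in for the tiktoken-go counter), and `summaryMessage` are all hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a minimal stand-in for an OpenAI chat message.
type Message struct {
	Role    string
	Content string
}

// countTokens is a placeholder for the tiktoken-go counter; a crude
// whitespace split stands in so the sketch runs on its own.
func countTokens(m Message) int {
	return len(strings.Fields(m.Content))
}

// partition splits msgs into a head to compress and a tail to keep,
// walking backwards until the tail holds at least keepTailTokens tokens.
// Whole messages are kept, so the tail may slightly exceed the budget;
// that errs on the side of never compressing the protected tail.
func partition(msgs []Message, keepTailTokens int) (head, tail []Message) {
	total := 0
	cut := len(msgs)
	for i := len(msgs) - 1; i >= 0; i-- {
		total += countTokens(msgs[i])
		cut = i
		if total >= keepTailTokens {
			break
		}
	}
	return msgs[:cut], msgs[cut:]
}

// summaryMessage builds the single system message that replaces the head.
func summaryMessage(summary string) Message {
	return Message{Role: "system", Content: "[COMPRESSED: " + summary + "]"}
}

func main() {
	msgs := []Message{
		{"user", "first question with several words"},
		{"assistant", "a long early answer"},
		{"user", "recent follow-up"},
		{"assistant", "latest answer"},
	}
	head, tail := partition(msgs, 5) // keep a 5-token tail in this toy example
	fmt.Println(len(head), len(tail)) // prints "1 3"
	_ = summaryMessage("toy summary of the head")
}
```

If the whole conversation fits inside `keep_tail_tokens`, the head comes back empty and the middleware passes the request through untouched.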
## Prometheus metrics

```
localai_compression_events_total{model, result}   # counter; result: success | skipped | error
localai_compression_ratio{model}                  # histogram, original/compressed
localai_compression_duration_seconds{model}       # histogram
```
## API compatibility

- Additive only (`compression.*` YAML, `usage.compression_meta` response
  field).
- OpenAI-compatible clients ignore unknown `usage.*` keys.
- Applies to both `/v1/chat/completions` and `/v1/mcp/chat/completions`
  (automatic via the existing handler delegation in
  `core/http/endpoints/localai/mcp.go:61`).
## Non-goals

- No recursive summarize-the-summaries (a single pass covers the observed needs).
- No persistent compression store (stateless, per-request).
- No automatic re-embedding of compressed content.
## Implementation outline

- Config struct in `core/config/model_config.go:32` (mirrors the
  existing `FunctionsConfig`, `ReasoningConfig`, and `MCP` blocks).
- New `pkg/tokens/count.go` wrapping tiktoken-go (already an indirect
  dependency at `go.mod:413` — promote it to direct).
- Middleware in `core/http/middleware/compression.go` following the
  `trace.go` template, inserted after `SetOpenAIRequest` in the
  `chatMiddleware` chain at `core/http/routes/openai.go:35-51`.
- Ginkgo tests inline; docs under `docs/content/features/compression.md`.
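For `pkg/tokens`, the surface I have in mind is a one-method interface so the tiktoken-go encoder and any fallback are interchangeable in tests. Everything below is a sketch under that assumption — `Counter`, `heuristicCounter`, and the ~4-characters-per-token rule of thumb are illustrative, not the tiktoken-go API:

```go
package main

import (
	"fmt"
	"strings"
)

// Counter is the proposed surface of pkg/tokens: one method, trivially
// mockable, with the concrete tiktoken-go encoder hidden behind it.
type Counter interface {
	Count(text string) int
}

// heuristicCounter approximates tokens as ~4 characters each (a common
// BPE rule of thumb); the real implementation would wrap tiktoken-go and
// could fall back to this when no encoder is available for a model.
type heuristicCounter struct{}

func (heuristicCounter) Count(text string) int {
	text = strings.TrimSpace(text)
	if text == "" {
		return 0
	}
	return (len(text) + 3) / 4 // ceil(len/4)
}

func main() {
	var c Counter = heuristicCounter{}
	fmt.Println(c.Count("Summarize the following conversation")) // 36 chars; prints "9"
}
```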
## Open questions for mudler

- Where should the `tokens` helper live? Proposing `pkg/tokens/`, which
  is reusable by future features. Alternatives: `core/util/tokens/` or
  `internal/tokens/`. Preference?
- Prometheus naming — `localai_compression_*` or a `localai_mw_*`
  prefix? Is there an existing metric-naming convention I should match?
- Compressor model resolution — if `compressor_model` is set but not
  loaded, should the middleware (a) skip compression and pass through
  with a warning, (b) try loading it via the model loader, or (c) error?
  Our production default is (a).
- SSE streaming requests — compression operates on the request before it
  is sent, so streaming the response is unaffected. Sound?
## Prior art

This design is based on 3+ months of production use at walcz.de (in a
Python proxy called `prompt-optimizer`). The 1021-LOC Python reference
implementation handles token counting, partition logic, multi-message
tool-chain preservation, and overflow recovery. Happy to share it for
reference if helpful.
## Next step

If the design lands, I'll submit a PR with:

- `pkg/tokens/count.go` + tests (can merge independently)
- `CompressionConfig` + middleware + metrics + tests
- `docs/content/features/compression.md`

Approx. 3-5 days of work.

Assisted-by: Claude:claude-opus-4-7