
Commit 7d97074

Merge pull request #7 from vstorm-co/feat/hybrid-improvements
feat: improve hybrid methods and add examples

2 parents 863c10b + 8c27ef7, commit 7d97074

27 files changed, +2109 −187 lines

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions

@@ -85,6 +85,7 @@ jobs:
       - name: Upload coverage to Coveralls
         if: matrix.python-version == '3.12'
         uses: coverallsapp/github-action@v2
+        continue-on-error: true
         with:
           github-token: ${{ secrets.GITHUB_TOKEN }}
           file: coverage.lcov

CHANGELOG.md

Lines changed: 57 additions & 0 deletions

@@ -5,6 +5,62 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.0.4] - 2026-02-25
+
+### Added
+
+- **`on_before_compress` callback** on `ContextManagerMiddleware` — called with
+  `(messages_to_discard, cutoff_index)` before compression summarizes and discards
+  messages. Enables persistent history archival (e.g. saving the full conversation
+  to files before pruning).
+- **`on_after_compress` callback** — called with the compressed messages after
+  compression. Return a string to re-inject it into context as a `SystemPromptPart`
+  (inspired by Claude Code's SessionStart hook with the compact matcher).
+- **Continuous message persistence** via `messages_path` on `ContextManagerMiddleware` —
+  every message (user input, agent responses, tool calls) is saved to a single
+  `messages.json` file on every history processor call. On compression, the summary
+  is appended to the same file. The file is the permanent, uncompressed record of
+  the full conversation. Supports session resume (loads existing history on init).
+- **Guided compaction** — `_compress()` and `_create_summary()` accept a `focus`
+  parameter (e.g., "Focus on the API changes") appended to the summary prompt.
+- **`request_compact(focus)`** method — request manual compaction on the next
+  `__call__`, with optional focus instructions.
+- **`compact(messages, focus)`** method — directly compact messages with LLM
+  summarization (for CLI `/compact` commands).
+- **`max_tokens` auto-detection** from `genai-prices` — when `max_tokens=None`
+  (the new default), the middleware resolves the model's context window
+  automatically via `genai-prices`. Falls back to 200,000 if not found.
+- **`resolve_max_tokens(model_name)`** function exported from the package —
+  standalone lookup of context windows from genai-prices.
+- **`model_name` parameter** on `ContextManagerMiddleware` and the factory — used for
+  auto-detection of `max_tokens` when not explicitly set.
+- **Async token counting** — the `TokenCounter` type now accepts both sync and async
+  callables (`Callable[..., int] | Callable[..., Awaitable[int]]`). Enables use of
+  provider token-counting APIs (e.g. Anthropic's `/count_tokens` endpoint) or
+  pydantic-ai's `count_tokens()` method. ([#6](https://github.com/vstorm-co/summarization-pydantic-ai/issues/6))
+- **`async_count_tokens()`** helper function exported from the package.
+- `BeforeCompressCallback` and `AfterCompressCallback` type aliases exported.
+- `messages_path`, `model_name`, `on_before_compress`, and `on_after_compress`
+  parameters added to the `create_context_manager_middleware()` factory.
+- **Examples** — 6 runnable examples in `examples/` covering all features:
+  auto-compression, persistence, callbacks, auto-detection, interactive chat,
+  standalone processors.
+
+### Changed
+
+- **`max_tokens` default** changed from `200_000` to `None` (auto-detect from
+  genai-prices, falling back to 200,000).
+- **`keep` default** changed from `("messages", 20)` to `("messages", 0)` — on
+  compression, only the LLM summary survives (like Claude Code). This produces
+  the most compact context after compression.
+- **Validation** now allows `0` for messages/tokens keep and trigger values
+  (previously required > 0). Negative values are still rejected.
+
+### Dependencies
+
+- `genai-prices` is used for auto-detection of context windows (already a transitive
+  dependency via pydantic-ai-middleware).
+
 ## [0.0.3] - 2025-02-15
 
 ### Added
@@ -97,6 +153,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - Requires `pydantic-ai>=0.1.0`
 - Optional `tiktoken` support for accurate token counting
 
+[0.0.4]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.4
 [0.0.3]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.3
 [0.0.2]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.2
 [0.0.1]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.1

README.md

Lines changed: 10 additions & 5 deletions

@@ -124,20 +124,22 @@ processor = create_sliding_window_processor(
 
 ### Real-Time Context Manager
 
-Dual-protocol middleware combining token tracking, auto-compression, and tool output truncation:
+Dual-protocol middleware combining token tracking, auto-compression, message persistence, and tool output truncation:
 
 ```python
 from pydantic_ai import Agent
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
+    model_name="openai:gpt-4.1",  # auto-detect max_tokens from genai-prices
     compress_threshold=0.9,
+    messages_path="messages.json",  # persist all messages
     on_usage_update=lambda pct, cur, mx: print(f"{pct:.0%} used ({cur:,}/{mx:,})"),
+    on_after_compress=lambda msgs: "Re-inject critical instructions here",
 )
 
 agent = Agent(
-    "openai:gpt-4o",
+    "openai:gpt-4.1",
     history_processors=[middleware],
 )
 ```
@@ -242,8 +244,11 @@ processor = create_summarization_processor(
 | **Two Strategies** | Intelligent summarization or fast sliding window |
 | **Flexible Triggers** | Message count, token count, or fraction-based |
 | **Safe Cutoff** | Never breaks tool call/response pairs |
-| **Custom Counters** | Bring your own token counting logic |
-| **Custom Prompts** | Control how summaries are generated |
+| **Auto max_tokens** | Auto-detect context window from genai-prices |
+| **Message Persistence** | Save all messages to JSON for session resume |
+| **Guided Compaction** | Focus summaries on specific topics |
+| **Callbacks** | `on_before_compress`/`on_after_compress` with instruction re-injection |
+| **Async Token Counting** | Sync or async token counter support |
 | **Token Tracking** | Real-time usage monitoring with callbacks |
 | **Tool Truncation** | Automatic truncation of large tool outputs |
 | **Custom Models** | Use any pydantic-ai Model (Azure, custom providers) |

docs/advanced/context-manager.md

Lines changed: 113 additions & 13 deletions

@@ -25,10 +25,13 @@ The middleware operates on two levels during each agent run:
 │                                                            │
 │  1. History Processor (__call__)                           │
 │     ├─ Count tokens in current messages                    │
+│     ├─ Persist messages to messages.json (if configured)   │
 │     ├─ Notify usage callback (percentage, current, max)    │
 │     ├─ If usage >= compress_threshold:                     │
-│     │   ├─ Summarize older messages via LLM                │
+│     │   ├─ Call on_before_compress callback                │
+│     │   ├─ Summarize older messages via LLM (with focus)   │
 │     │   ├─ Replace old messages with summary               │
+│     │   ├─ Call on_after_compress → re-inject instructions │
 │     │   └─ Notify updated usage                            │
 │     └─ Return (possibly compressed) messages               │
 │                                                            │
@@ -44,22 +47,80 @@ The middleware operates on two levels during each agent run:
 
 **Tool output truncation**: When `max_tool_output_tokens` is set, the middleware intercepts tool results via the `after_tool_call` hook and truncates any output that exceeds the token limit, keeping configurable head and tail lines.
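*(Editorial aside: a minimal sketch of the head/tail truncation idea described above. The function name and the truncation-marker text are illustrative assumptions, not the library's actual implementation.)*

```python
# Illustrative head/tail truncation sketch — names here are hypothetical,
# not the library's internals.
def truncate_output(text: str, head_lines: int = 5, tail_lines: int = 5) -> str:
    """Keep the first `head_lines` and last `tail_lines` lines of `text`."""
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text  # small enough — nothing to truncate
    omitted = len(lines) - head_lines - tail_lines
    return "\n".join(
        lines[:head_lines]
        + [f"... [{omitted} lines truncated] ..."]
        + lines[-tail_lines:]
    )
```

The real middleware works on a token budget (`max_tool_output_tokens`) rather than a raw line count, but the head-plus-tail shape of the result is the same.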
 
+**Message persistence**: When `messages_path` is set, all messages are saved to a JSON file on every history processor call. This provides a permanent, uncompressed record of the full conversation — ideal for session resume.
+
 ## Parameters
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `max_tokens` | `int` | `200_000` | Maximum token budget for the conversation |
+| `max_tokens` | `int \| None` | `None` | Maximum token budget. `None` auto-detects from genai-prices (falls back to 200,000) |
+| `model_name` | `str \| None` | `None` | Model name for auto-detecting `max_tokens` (e.g., `"openai:gpt-4.1"`) |
 | `compress_threshold` | `float` | `0.9` | Fraction of `max_tokens` at which auto-compression triggers (0.0, 1.0] |
-| `keep` | `ContextSize` | `("messages", 20)` | How much context to retain after compression |
+| `keep` | `ContextSize` | `("messages", 0)` | How much context to retain after compression. `0` = only the summary survives |
 | `summarization_model` | `str` | `"openai:gpt-4.1-mini"` | Model used for generating summaries |
-| `token_counter` | `TokenCounter` | `count_tokens_approximately` | Function to count tokens in messages |
+| `token_counter` | `TokenCounter` | `count_tokens_approximately` | Function to count tokens (sync or async) |
 | `summary_prompt` | `str` | `DEFAULT_SUMMARY_PROMPT` | Prompt template for summary generation |
 | `trim_tokens_to_summarize` | `int` | `4000` | Max tokens to include when generating the summary |
 | `max_input_tokens` | `int \| None` | `None` | Model max input tokens (required for fraction-based keep) |
 | `max_tool_output_tokens` | `int \| None` | `None` | Per-tool-output token limit before truncation. `None` disables truncation |
 | `tool_output_head_lines` | `int` | `5` | Lines to show from the beginning of truncated tool output |
 | `tool_output_tail_lines` | `int` | `5` | Lines to show from the end of truncated tool output |
+| `messages_path` | `str \| None` | `None` | Path to persist messages as JSON. Enables session resume |
 | `on_usage_update` | `UsageCallback \| None` | `None` | Callback invoked with usage stats before each model call |
+| `on_before_compress` | `BeforeCompressCallback \| None` | `None` | Callback before compression — receives messages and cutoff index |
+| `on_after_compress` | `AfterCompressCallback \| None` | `None` | Callback after compression — return a string to re-inject into context |
+
+## Auto-Detection of max_tokens
+
+When `max_tokens=None` (the default), the middleware uses `resolve_max_tokens(model_name)` to look up the model's context window from `genai-prices`:
+
+```python
+from pydantic_ai_summarization import resolve_max_tokens
+
+# Returns the context window size, or None if the model is unknown
+resolve_max_tokens("openai:gpt-4.1")  # → 1,000,000
+resolve_max_tokens("anthropic:claude-sonnet-4-20250514")  # → 200,000
+resolve_max_tokens("unknown:model")  # → None (falls back to 200,000)
+```
+
+This means you typically don't need to set `max_tokens` manually — just pass `model_name`:
+
+```python
+middleware = create_context_manager_middleware(
+    model_name="openai:gpt-4.1",  # auto-detects a 1M-token budget
+)
+```
+
+## Callbacks
+
+### on_before_compress
+
+Called before compression begins. Useful for logging or archival:
+
+```python
+from pydantic_ai.messages import ModelMessage
+
+def on_before_compress(messages: list[ModelMessage], cutoff_index: int) -> None:
+    print(f"About to compress {cutoff_index} messages out of {len(messages)}")
+```
+
+### on_after_compress
+
+Called after compression. Return a string to re-inject it as a `SystemPromptPart`:
+
+```python
+from pydantic_ai.messages import ModelMessage
+
+CRITICAL_INSTRUCTIONS = "Always respond in English. Never use markdown."
+
+def on_after_compress(messages: list[ModelMessage]) -> str | None:
+    # Re-inject instructions that must survive compression
+    return CRITICAL_INSTRUCTIONS
+
+middleware = create_context_manager_middleware(
+    on_after_compress=on_after_compress,
+)
+```
+
+This is inspired by Claude Code's SessionStart hook with the compact matcher — it ensures critical rules survive context compression.
 
 ## UsageCallback
 
@@ -77,6 +138,43 @@ The callback receives three arguments:
 
 Both sync and async callables are supported. If the callable returns an awaitable, it will be awaited automatically.
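*(Editorial aside: the "await if awaitable" dispatch described above can be sketched as follows. `notify_usage` is a hypothetical name for illustration — the middleware's internals may differ.)*

```python
import asyncio
import inspect

# Sketch of uniform sync-or-async callback dispatch (illustrative only).
async def notify_usage(callback, percentage: float, current: int, maximum: int) -> None:
    result = callback(percentage, current, maximum)
    if inspect.isawaitable(result):
        await result  # async callbacks are awaited automatically


def sync_callback(pct: float, cur: int, mx: int) -> None:
    print(f"sync: {pct:.0%} used")


async def async_callback(pct: float, cur: int, mx: int) -> None:
    await asyncio.sleep(0)  # stand-in for e.g. an async metrics write
    print(f"async: {pct:.0%} used")


async def main() -> None:
    # Both callback flavors go through the same dispatch path
    await notify_usage(sync_callback, 0.42, 84_000, 200_000)
    await notify_usage(async_callback, 0.42, 84_000, 200_000)


asyncio.run(main())
```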
 
+## Message Persistence
+
+When `messages_path` is set, all messages are written to a JSON file on every history processor call:
+
+```python
+middleware = create_context_manager_middleware(
+    messages_path="/tmp/session/messages.json",
+)
+```
+
+The file contains the full, uncompressed conversation history. On compression, the summary message is appended — the file is always the permanent record.
+
+To resume a session, load the file and pass it as `message_history`:
+
+```python
+from pathlib import Path
+from pydantic_ai.messages import ModelMessagesTypeAdapter
+
+raw = Path("/tmp/session/messages.json").read_bytes()
+history = list(ModelMessagesTypeAdapter.validate_json(raw))
+result = await agent.run("Continue...", message_history=history)
+```
+
+## Guided Compaction
+
+Both `compact()` and `request_compact()` accept a `focus` parameter to guide the summary:
+
+```python
+# Direct compaction (for CLI commands)
+history = await middleware.compact(history, focus="Focus on the API design decisions")
+
+# Request compaction on the next __call__ (deferred)
+middleware.request_compact(focus="Focus on the debugging session")
+```
+
+The focus string is appended to the summary prompt, telling the LLM what to prioritize.
+
 ## Basic Usage
 
 ```python
@@ -85,9 +183,8 @@ from pydantic_ai_middleware import MiddlewareAgent
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
+    model_name="openai:gpt-4.1",  # auto-detect max_tokens
     compress_threshold=0.9,
-    keep=("messages", 20),
 )
 
 # Register as both history processor and middleware
@@ -109,7 +206,6 @@ def on_usage(percentage: float, current: int, maximum: int) -> None:
     print(f"Token usage: {percentage:.0%} ({current:,} / {maximum:,})")
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
     on_usage_update=on_usage,
 )
 ```
@@ -133,7 +229,6 @@ Prevent large tool outputs from consuming too much of the token budget:
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
     max_tool_output_tokens=2000,  # Truncate outputs > ~2000 tokens
     tool_output_head_lines=10,    # Show first 10 lines
     tool_output_tail_lines=10,    # Show last 10 lines
@@ -163,19 +258,19 @@ The [`create_context_manager_middleware()`][pydantic_ai_summarization.middleware
 ```python
 from pydantic_ai_summarization import create_context_manager_middleware
 
-# With defaults
+# With defaults (auto-detect max_tokens)
 middleware = create_context_manager_middleware()
 
 # Fully configured
 middleware = create_context_manager_middleware(
-    max_tokens=150_000,
+    model_name="openai:gpt-4.1",
     compress_threshold=0.85,
-    keep=("messages", 30),
+    keep=("messages", 10),
     summarization_model="openai:gpt-4.1-mini",
+    messages_path="/tmp/session/messages.json",
    max_tool_output_tokens=1000,
-    tool_output_head_lines=5,
-    tool_output_tail_lines=5,
     on_usage_update=lambda pct, cur, mx: print(f"{pct:.0%}"),
+    on_after_compress=lambda msgs: "Re-injected instructions here",
 )
 ```
 
@@ -201,6 +296,11 @@ print(f"Compressed {middleware.compression_count} times")
 | Usage callbacks | Yes | No | No |
 | Auto-compression | Yes (threshold-based) | Yes (trigger-based) | No |
 | Tool output truncation | Yes | No | No |
+| Message persistence | Yes (`messages_path`) | No | No |
+| Guided compaction | Yes (`focus`) | No | No |
+| Callbacks | Before/after compress | No | No |
+| Auto max_tokens | Yes (genai-prices) | No | No |
+| Async token counter | Yes | No | No |
 | LLM cost | Per compression | Per trigger | Zero |
 | Requires extra | `[hybrid]` | No | No |

docs/api/middleware.md

Lines changed: 3 additions & 0 deletions

@@ -7,4 +7,7 @@
       members:
         - ContextManagerMiddleware
         - create_context_manager_middleware
+        - resolve_max_tokens
         - UsageCallback
+        - BeforeCompressCallback
+        - AfterCompressCallback
