
Commit 7d97074

Merge pull request #7 from vstorm-co/feat/hybrid-improvements
feat: improve hybrid methods and add examples

2 parents 863c10b + 8c27ef7, commit 7d97074

27 files changed, +2109 −187 lines

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions

@@ -85,6 +85,7 @@ jobs:
       - name: Upload coverage to Coveralls
         if: matrix.python-version == '3.12'
         uses: coverallsapp/github-action@v2
+        continue-on-error: true
         with:
           github-token: ${{ secrets.GITHUB_TOKEN }}
           file: coverage.lcov

CHANGELOG.md

Lines changed: 57 additions & 0 deletions

@@ -5,6 +5,62 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.0.4] - 2026-02-25
+
+### Added
+
+- **`on_before_compress` callback** on `ContextManagerMiddleware` — called with
+  `(messages_to_discard, cutoff_index)` before compression summarizes and discards
+  messages. Enables persistent history archival (e.g. saving the full conversation
+  to files before pruning).
+- **`on_after_compress` callback** — called with the compressed messages after
+  compression. Return a string to re-inject it into context as a `SystemPromptPart`
+  (inspired by Claude Code's SessionStart hook with the compact matcher).
+- **Continuous message persistence** via `messages_path` on `ContextManagerMiddleware` —
+  every message (user input, agent responses, tool calls) is saved to a single
+  `messages.json` file on every history processor call. On compression, the summary
+  is appended to the same file. The file is the permanent, uncompressed record of
+  the full conversation. Supports session resume (loads existing history on init).
+- **Guided compaction** — `_compress()` and `_create_summary()` accept a `focus`
+  parameter (e.g., "Focus on the API changes") appended to the summary prompt.
+- **`request_compact(focus)`** method — request manual compaction on the next
+  `__call__`, with optional focus instructions.
+- **`compact(messages, focus)`** method — directly compact messages with LLM
+  summarization (for CLI `/compact` commands).
+- **`max_tokens` auto-detection** from `genai-prices` — when `max_tokens=None`
+  (the new default), the middleware resolves the model's context window
+  automatically via `genai-prices`. Falls back to 200,000 if not found.
+- **`resolve_max_tokens(model_name)`** function exported from the package —
+  standalone lookup of context windows from genai-prices.
+- **`model_name` parameter** on `ContextManagerMiddleware` and the factory — used for
+  auto-detection of `max_tokens` when not explicitly set.
+- **Async token counting** — the `TokenCounter` type now accepts both sync and async
+  callables (`Callable[..., int] | Callable[..., Awaitable[int]]`). Enables use of
+  provider token-counting APIs (e.g. Anthropic's `/count_tokens` endpoint) or
+  pydantic-ai's `count_tokens()` method. ([#6](https://github.com/vstorm-co/summarization-pydantic-ai/issues/6))
+- **`async_count_tokens()`** helper function exported from the package.
+- `BeforeCompressCallback` and `AfterCompressCallback` type aliases exported.
+- `messages_path`, `model_name`, `on_before_compress`, and `on_after_compress`
+  parameters added to the `create_context_manager_middleware()` factory.
+- **Examples** — 6 runnable examples in `examples/` covering all features:
+  auto-compression, persistence, callbacks, auto-detection, interactive chat,
+  standalone processors.
+
+### Changed
+
+- **`max_tokens` default** changed from `200_000` to `None` (auto-detect from
+  genai-prices, falling back to 200,000).
+- **`keep` default** changed from `("messages", 20)` to `("messages", 0)` — on
+  compression, only the LLM summary survives (like Claude Code). This produces
+  the most compact context after compression.
+- **Validation** now allows `0` for messages/tokens keep and trigger values
+  (previously required > 0). Negative values are still rejected.
+
+### Dependencies
+
+- `genai-prices` is used for auto-detection of context windows (already a transitive
+  dependency via pydantic-ai-middleware).
+
 ## [0.0.3] - 2025-02-15
 
 ### Added
@@ -97,6 +153,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - Requires `pydantic-ai>=0.1.0`
 - Optional `tiktoken` support for accurate token counting
 
+[0.0.4]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.4
 [0.0.3]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.3
 [0.0.2]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.2
 [0.0.1]: https://github.com/vstorm-co/summarization-pydantic-ai/releases/tag/v0.0.1

README.md

Lines changed: 10 additions & 5 deletions

@@ -124,20 +124,22 @@ processor = create_sliding_window_processor(
 
 ### Real-Time Context Manager
 
-Dual-protocol middleware combining token tracking, auto-compression, and tool output truncation:
+Dual-protocol middleware combining token tracking, auto-compression, message persistence, and tool output truncation:
 
 ```python
 from pydantic_ai import Agent
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
+    model_name="openai:gpt-4.1",  # auto-detect max_tokens from genai-prices
     compress_threshold=0.9,
+    messages_path="messages.json",  # persist all messages
     on_usage_update=lambda pct, cur, mx: print(f"{pct:.0%} used ({cur:,}/{mx:,})"),
+    on_after_compress=lambda msgs: "Re-inject critical instructions here",
 )
 
 agent = Agent(
-    "openai:gpt-4o",
+    "openai:gpt-4.1",
     history_processors=[middleware],
 )
 ```
@@ -242,8 +244,11 @@ processor = create_summarization_processor(
 | **Two Strategies** | Intelligent summarization or fast sliding window |
 | **Flexible Triggers** | Message count, token count, or fraction-based |
 | **Safe Cutoff** | Never breaks tool call/response pairs |
-| **Custom Counters** | Bring your own token counting logic |
-| **Custom Prompts** | Control how summaries are generated |
+| **Auto max_tokens** | Auto-detect context window from genai-prices |
+| **Message Persistence** | Save all messages to JSON for session resume |
+| **Guided Compaction** | Focus summaries on specific topics |
+| **Callbacks** | `on_before_compress`/`on_after_compress` with instruction re-injection |
+| **Async Token Counting** | Sync or async token counter support |
 | **Token Tracking** | Real-time usage monitoring with callbacks |
 | **Tool Truncation** | Automatic truncation of large tool outputs |
 | **Custom Models** | Use any pydantic-ai Model (Azure, custom providers) |

docs/advanced/context-manager.md

Lines changed: 113 additions & 13 deletions

@@ -25,10 +25,13 @@ The middleware operates on two levels during each agent run:
 │                                                            │
 │  1. History Processor (__call__)                           │
 │     ├─ Count tokens in current messages                    │
+│     ├─ Persist messages to messages.json (if configured)   │
 │     ├─ Notify usage callback (percentage, current, max)    │
 │     ├─ If usage >= compress_threshold:                     │
-│     │   ├─ Summarize older messages via LLM                │
+│     │   ├─ Call on_before_compress callback                │
+│     │   ├─ Summarize older messages via LLM (with focus)   │
 │     │   ├─ Replace old messages with summary               │
+│     │   ├─ Call on_after_compress → re-inject instructions │
 │     │   └─ Notify updated usage                            │
 │     └─ Return (possibly compressed) messages               │
 │                                                            │
@@ -44,22 +47,80 @@ The middleware operates on two levels during each agent run:
 
 **Tool output truncation**: When `max_tool_output_tokens` is set, the middleware intercepts tool results via the `after_tool_call` hook and truncates any output that exceeds the token limit, keeping configurable head and tail lines.
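*(Editorial aside: a minimal sketch of the head/tail truncation idea described above. The function name and the truncation-marker text are illustrative assumptions, not the library's actual implementation.)*

```python
# Illustrative head/tail truncation sketch — names here are hypothetical,
# not the library's internals.
def truncate_output(text: str, head_lines: int = 5, tail_lines: int = 5) -> str:
    """Keep the first `head_lines` and last `tail_lines` lines of `text`."""
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text  # small enough — nothing to truncate
    omitted = len(lines) - head_lines - tail_lines
    return "\n".join(
        lines[:head_lines]
        + [f"... [{omitted} lines truncated] ..."]
        + lines[-tail_lines:]
    )
```

The real middleware works on a token budget (`max_tool_output_tokens`) rather than a raw line count, but the head-plus-tail shape of the result is the same.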
 
+**Message persistence**: When `messages_path` is set, all messages are saved to a JSON file on every history processor call. This provides a permanent, uncompressed record of the full conversation — ideal for session resume.
+
 ## Parameters
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `max_tokens` | `int` | `200_000` | Maximum token budget for the conversation |
+| `max_tokens` | `int \| None` | `None` | Maximum token budget. `None` auto-detects from genai-prices (falls back to 200,000) |
+| `model_name` | `str \| None` | `None` | Model name for auto-detecting `max_tokens` (e.g., `"openai:gpt-4.1"`) |
 | `compress_threshold` | `float` | `0.9` | Fraction of `max_tokens` at which auto-compression triggers (0.0, 1.0] |
-| `keep` | `ContextSize` | `("messages", 20)` | How much context to retain after compression |
+| `keep` | `ContextSize` | `("messages", 0)` | How much context to retain after compression. `0` = only the summary survives |
 | `summarization_model` | `str` | `"openai:gpt-4.1-mini"` | Model used for generating summaries |
-| `token_counter` | `TokenCounter` | `count_tokens_approximately` | Function to count tokens in messages |
+| `token_counter` | `TokenCounter` | `count_tokens_approximately` | Function to count tokens (sync or async) |
 | `summary_prompt` | `str` | `DEFAULT_SUMMARY_PROMPT` | Prompt template for summary generation |
 | `trim_tokens_to_summarize` | `int` | `4000` | Max tokens to include when generating the summary |
 | `max_input_tokens` | `int \| None` | `None` | Model max input tokens (required for fraction-based keep) |
 | `max_tool_output_tokens` | `int \| None` | `None` | Per-tool-output token limit before truncation. `None` disables truncation |
 | `tool_output_head_lines` | `int` | `5` | Lines to show from the beginning of truncated tool output |
 | `tool_output_tail_lines` | `int` | `5` | Lines to show from the end of truncated tool output |
+| `messages_path` | `str \| None` | `None` | Path to persist messages as JSON. Enables session resume |
 | `on_usage_update` | `UsageCallback \| None` | `None` | Callback invoked with usage stats before each model call |
+| `on_before_compress` | `BeforeCompressCallback \| None` | `None` | Callback before compression — receives messages and cutoff index |
+| `on_after_compress` | `AfterCompressCallback \| None` | `None` | Callback after compression — return a string to re-inject into context |
+
+## Auto-Detection of max_tokens
+
+When `max_tokens=None` (the default), the middleware uses `resolve_max_tokens(model_name)` to look up the model's context window from `genai-prices`:
+
+```python
+from pydantic_ai_summarization import resolve_max_tokens
+
+# Returns the context window size, or None if the model is unknown
+resolve_max_tokens("openai:gpt-4.1")  # → 1,000,000
+resolve_max_tokens("anthropic:claude-sonnet-4-20250514")  # → 200,000
+resolve_max_tokens("unknown:model")  # → None (falls back to 200,000)
+```
+
+This means you typically don't need to set `max_tokens` manually — just pass `model_name`:
+
+```python
+middleware = create_context_manager_middleware(
+    model_name="openai:gpt-4.1",  # auto-detects a 1M-token budget
+)
+```
+
+## Callbacks
+
+### on_before_compress
+
+Called before compression begins. Useful for logging or archival:
+
+```python
+from pydantic_ai.messages import ModelMessage
+
+def on_before_compress(messages: list[ModelMessage], cutoff_index: int) -> None:
+    print(f"About to compress {cutoff_index} messages out of {len(messages)}")
+```
+
+### on_after_compress
+
+Called after compression. Return a string to re-inject it as a `SystemPromptPart`:
+
+```python
+from pydantic_ai.messages import ModelMessage
+
+CRITICAL_INSTRUCTIONS = "Always respond in English. Never use markdown."
+
+def on_after_compress(messages: list[ModelMessage]) -> str | None:
+    # Re-inject instructions that must survive compression
+    return CRITICAL_INSTRUCTIONS
+
+middleware = create_context_manager_middleware(
+    on_after_compress=on_after_compress,
+)
+```
+
+This is inspired by Claude Code's SessionStart hook with the compact matcher — it ensures critical rules survive context compression.
 
 ## UsageCallback
 
@@ -77,6 +138,43 @@ The callback receives three arguments:
 
 Both sync and async callables are supported. If the callable returns an awaitable, it will be awaited automatically.
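*(Editorial aside: the "await if awaitable" dispatch described above can be sketched as follows. `notify_usage` is a hypothetical name for illustration — the middleware's internals may differ.)*

```python
import asyncio
import inspect

# Sketch of uniform sync-or-async callback dispatch (illustrative only).
async def notify_usage(callback, percentage: float, current: int, maximum: int) -> None:
    result = callback(percentage, current, maximum)
    if inspect.isawaitable(result):
        await result  # async callbacks are awaited automatically


def sync_callback(pct: float, cur: int, mx: int) -> None:
    print(f"sync: {pct:.0%} used")


async def async_callback(pct: float, cur: int, mx: int) -> None:
    await asyncio.sleep(0)  # stand-in for e.g. an async metrics write
    print(f"async: {pct:.0%} used")


async def main() -> None:
    # Both callback flavors go through the same dispatch path
    await notify_usage(sync_callback, 0.42, 84_000, 200_000)
    await notify_usage(async_callback, 0.42, 84_000, 200_000)


asyncio.run(main())
```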
 
+## Message Persistence
+
+When `messages_path` is set, all messages are written to a JSON file on every history processor call:
+
+```python
+middleware = create_context_manager_middleware(
+    messages_path="/tmp/session/messages.json",
+)
+```
+
+The file contains the full, uncompressed conversation history. On compression, the summary message is appended — the file is always the permanent record.
+
+To resume a session, load the file and pass it as `message_history`:
+
+```python
+from pathlib import Path
+from pydantic_ai.messages import ModelMessagesTypeAdapter
+
+raw = Path("/tmp/session/messages.json").read_bytes()
+history = list(ModelMessagesTypeAdapter.validate_json(raw))
+result = await agent.run("Continue...", message_history=history)
+```
+
+## Guided Compaction
+
+Both `compact()` and `request_compact()` accept a `focus` parameter to guide the summary:
+
+```python
+# Direct compaction (for CLI commands)
+history = await middleware.compact(history, focus="Focus on the API design decisions")
+
+# Request compaction on the next __call__ (deferred)
+middleware.request_compact(focus="Focus on the debugging session")
+```
+
+The focus string is appended to the summary prompt, telling the LLM what to prioritize.
+
 ## Basic Usage
 
 ```python
@@ -85,9 +183,8 @@ from pydantic_ai_middleware import MiddlewareAgent
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
+    model_name="openai:gpt-4.1",  # auto-detect max_tokens
     compress_threshold=0.9,
-    keep=("messages", 20),
 )
 
 # Register as both history processor and middleware
@@ -109,7 +206,6 @@ def on_usage(percentage: float, current: int, maximum: int) -> None:
     print(f"Token usage: {percentage:.0%} ({current:,} / {maximum:,})")
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
     on_usage_update=on_usage,
 )
 ```
@@ -133,7 +229,6 @@ Prevent large tool outputs from consuming too much of the token budget:
 from pydantic_ai_summarization import create_context_manager_middleware
 
 middleware = create_context_manager_middleware(
-    max_tokens=200_000,
     max_tool_output_tokens=2000,  # Truncate outputs > ~2000 tokens
     tool_output_head_lines=10,    # Show first 10 lines
     tool_output_tail_lines=10,    # Show last 10 lines
@@ -163,19 +258,19 @@ The [`create_context_manager_middleware()`][pydantic_ai_summarization.middleware
 ```python
 from pydantic_ai_summarization import create_context_manager_middleware
 
-# With defaults
+# With defaults (auto-detect max_tokens)
 middleware = create_context_manager_middleware()
 
 # Fully configured
 middleware = create_context_manager_middleware(
-    max_tokens=150_000,
+    model_name="openai:gpt-4.1",
     compress_threshold=0.85,
-    keep=("messages", 30),
+    keep=("messages", 10),
     summarization_model="openai:gpt-4.1-mini",
+    messages_path="/tmp/session/messages.json",
    max_tool_output_tokens=1000,
-    tool_output_head_lines=5,
-    tool_output_tail_lines=5,
     on_usage_update=lambda pct, cur, mx: print(f"{pct:.0%}"),
+    on_after_compress=lambda msgs: "Re-injected instructions here",
 )
 ```
 
@@ -201,6 +296,11 @@ print(f"Compressed {middleware.compression_count} times")
 | Usage callbacks | Yes | No | No |
 | Auto-compression | Yes (threshold-based) | Yes (trigger-based) | No |
 | Tool output truncation | Yes | No | No |
+| Message persistence | Yes (`messages_path`) | No | No |
+| Guided compaction | Yes (`focus`) | No | No |
+| Callbacks | Before/after compress | No | No |
+| Auto max_tokens | Yes (genai-prices) | No | No |
+| Async token counter | Yes | No | No |
 | LLM cost | Per compression | Per trigger | Zero |
 | Requires extra | `[hybrid]` | No | No |

docs/api/middleware.md

Lines changed: 3 additions & 0 deletions

@@ -7,4 +7,7 @@
       members:
         - ContextManagerMiddleware
         - create_context_manager_middleware
+        - resolve_max_tokens
         - UsageCallback
+        - BeforeCompressCallback
+        - AfterCompressCallback
