---
title: Evaluating Context Compression
description: Summary of Factory Research’s evaluation of context compression strategies for long-running AI agent sessions.
keywords: ['context compression', 'summarization', 'agents', 'memory', 'token efficiency', 'evaluation']
---

This page summarizes Factory Research’s post: **[Evaluating Context Compression for AI Agents](https://factory.ai/news/evaluating-compression)** (Dec 16, 2025).

<Info>
  For the full methodology, charts, and examples, read the original post linked above.
</Info>

---

## TL;DR

- Long-running agent sessions can exceed any model’s context window, so some form of **context compression** is required.
- The key metric isn’t *tokens per request*; it’s **tokens per task** (because missing details force costly re-fetching and rework).
- In Factory’s evaluation, **structured summarization** retained more “continue-the-task” information than OpenAI’s `/responses/compact` and Anthropic’s SDK compression, at similar compression rates.

---

## Why context compression matters

As agent sessions stretch into hundreds or thousands of turns, the full transcript can reach **millions of tokens**. If an agent loses critical state (e.g., the exact endpoint, which file paths changed, or the current next step), it often:

- re-reads files it already read
- repeats debugging dead ends
- forgets what changed and where

That costs more time and tokens than the compression saved.

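The “tokens per task” framing can be made concrete with a back-of-the-envelope calculation. The numbers below are hypothetical, not from the post; they only illustrate how aggressive compression can lose on total cost once rework is counted:

```python
def tokens_per_task(tokens_per_request: int, requests_to_finish: int) -> int:
    """Total tokens spent to complete one task, not one request."""
    return tokens_per_request * requests_to_finish

# Hypothetical scenario: aggressive compression shrinks each request,
# but lost details force extra re-reads and retries to finish the task.
aggressive = tokens_per_task(tokens_per_request=20_000, requests_to_finish=12)
moderate = tokens_per_task(tokens_per_request=30_000, requests_to_finish=6)

print(aggressive)  # 240000 tokens for the whole task
print(moderate)    # 180000: cheaper overall despite larger individual requests
```

Under these (made-up) numbers, the strategy with bigger per-request contexts wins on total cost, which is the point of optimizing tokens per task rather than tokens per request.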
---

## How Factory evaluated “context quality”

Instead of using summary-similarity metrics (e.g., ROUGE), Factory used a **probe-based evaluation**:

1. Take real, long-running production sessions.
2. Compress the earlier portion.
3. Ask probes that require remembering specific, task-relevant details from the truncated history.
4. Grade the answers for functional usefulness.

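The four steps above can be sketched as a single evaluation loop. All callables here (`compress`, `ask_probe`, `grade`) are hypothetical stand-ins, not Factory’s actual harness:

```python
# Sketch of a probe-based evaluation loop, assuming the session is a list of
# turns and the most recent turns are kept verbatim while the rest is compressed.

def evaluate_compression(session: list[str], compress, ask_probe, grade,
                         probes: list[str], keep_recent: int = 50) -> float:
    """Average probe score for one session under one compression method."""
    truncated, recent = session[:-keep_recent], session[-keep_recent:]
    summary = compress(truncated)        # step 2: compress the earlier portion
    context = [summary] + recent         # the agent sees summary + recent turns
    scores = [grade(probe, ask_probe(context, probe), truncated)  # steps 3-4
              for probe in probes]
    return sum(scores) / len(scores)
```

The key property is that `grade` can consult the full truncated history, so the score measures what the compressed context actually lost.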
### Probe types

- **Recall**: factual retention (e.g., “What was the original error?”)
- **Artifact**: file tracking (e.g., “Which files did we modify and how?”)
- **Continuation**: task planning (e.g., “What should we do next?”)
- **Decision**: reasoning chain (e.g., “What did we decide and why?”)

### Scoring dimensions

Responses were scored (0–5) by an LLM judge (**GPT-5.2**) across:

- Accuracy
- Context awareness
- Artifact trail
- Completeness
- Continuity
- Instruction following

The judge was blinded to which compression method produced each response.
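One simple way to implement that blinding, sketched below under assumptions of our own (the post does not describe the mechanics): strip the method labels and randomize presentation order before the judge sees anything, then average the six dimension scores. `judge` is a hypothetical stand-in for the LLM call.

```python
import random

DIMENSIONS = ["accuracy", "context_awareness", "artifact_trail",
              "completeness", "continuity", "instruction_following"]

def blind_and_score(responses: dict[str, str], judge) -> dict[str, float]:
    """Score each method's response without revealing which method produced it."""
    methods = list(responses)
    random.shuffle(methods)  # blind: randomize order, drop method names
    overall = {}
    for method in methods:
        scores = judge(responses[method])  # hypothetical: returns {dimension: 0-5}
        overall[method] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return overall
```

The judge only ever sees anonymized response text, so per-dimension scores cannot systematically favor a known method.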

---

## Compression approaches compared

| Approach | What it produces | Key trade-off |
|---|---|---|
| **Factory** | A **structured, persistent summary** with explicit sections (intent, file modifications, decisions, next steps). Updates by summarizing only the newly truncated span and **merging** it into the existing summary (“anchored iterative summarization”). | Slightly larger summaries than the most aggressive compression, but better retention of task-critical details. |
| **OpenAI** | `/responses/compact`: an **opaque** compressed representation optimized for reconstruction fidelity. | Highest compression, but low interpretability (you can’t inspect what was preserved). |
| **Anthropic** | Claude SDK built-in compression: detailed structured summaries (often 7–12k chars), regenerated on each compression. | High-quality summaries, but regenerating the whole summary each time can cause drift across repeated compressions. |

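The “anchored iterative summarization” pattern in the Factory row can be sketched as below. Everything here is an illustrative assumption, not Factory’s published implementation: the section names, the `summarize` stand-in, and the choice to replace (rather than accumulate) next steps.

```python
# Hypothetical sketch: summarize only the newly truncated span, then merge it
# into a persistent structured summary instead of regenerating from scratch.

SECTIONS = ("intent", "file_modifications", "decisions", "next_steps")

def merge_summary(existing: dict[str, list[str]],
                  new_turns: list[str],
                  summarize) -> dict[str, list[str]]:
    """Fold a summary of `new_turns` into the persistent summary, section by section."""
    delta = summarize(new_turns)  # {section: [new facts]} for this span only
    merged = {s: list(existing.get(s, [])) for s in SECTIONS}
    for section in SECTIONS:
        for fact in delta.get(section, []):
            if fact not in merged[section]:  # anchor: prior facts are never dropped
                merged[section].append(fact)
    # Assumption: "next steps" reflects current state, so it is replaced outright.
    if delta.get("next_steps"):
        merged["next_steps"] = list(delta["next_steps"])
    return merged
```

Because only the delta is re-summarized, earlier facts stay anchored in place, which is the property that guards against the drift noted in the Anthropic row.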
---

## Results (high-level)

Factory reports evaluating **36,000+ messages** from production sessions across tasks like PR review, bug fixes, feature implementation, and refactoring.

### Overall scores (0–5)

| Method | Overall | Accuracy | Context | Artifact | Completeness | Continuity | Instruction |
|---|---:|---:|---:|---:|---:|---:|---:|
| Factory | **3.70** | **4.04** | **4.01** | **2.45** | **4.44** | **3.80** | **4.99** |
| Anthropic | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| OpenAI | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

### Compression ratio vs. quality

The post notes similar compression rates across methods:

- OpenAI: **99.3%** token removal
- Anthropic: **98.7%** token removal
- Factory: **98.6%** token removal

Factory retained ~0.7 percentage points more of the original tokens than OpenAI (roughly twice as much kept context: 1.4% vs. 0.7%) and scored **+0.35** higher on overall quality.
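The removal rates above follow directly from token counts before and after compression. A minimal sketch, with hypothetical counts chosen only to reproduce the quoted percentages:

```python
def removal_rate(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens removed by compression."""
    return 100 * (tokens_before - tokens_after) / tokens_before

# Hypothetical counts that match the rates quoted above.
print(round(removal_rate(1_000_000, 7_000), 1))   # 99.3 (OpenAI-like)
print(round(removal_rate(1_000_000, 14_000), 1))  # 98.6 (Factory-like)
# 98.6% vs. 99.3% removal means retaining 1.4% vs. 0.7% of the original:
# twice the kept context, at nearly the same headline compression rate.
```

This is why small differences in headline removal rate can hide large differences in how much context survives.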

---

## What Factory says it learned

- **Structure matters**: forcing explicit sections (files/decisions/next steps) reduces the chance that critical details “silently disappear” over time.
- **Compression ratio is a misleading target**: aggressive compression can “save tokens” but lose details that cause expensive rework; optimize for **tokens per task**.
- **Artifact tracking is still hard**: all methods scored low on tracking which files were created/modified/examined (Factory’s best score was **2.45/5**), suggesting this may need dedicated state tracking beyond summarization.
- **Probe-based evaluation is closer to agent reality** than text-similarity metrics, because it tests whether work can continue effectively.

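The “dedicated state tracking” idea from the artifact bullet could look something like the sketch below: a deterministic ledger updated from tool-call events, kept outside the lossy summary. The class, event shapes, and action names are all assumptions for illustration, not a published design.

```python
from collections import defaultdict

# Hypothetical artifact ledger: file state is recorded deterministically from
# tool-call events, so "which files did we touch?" never depends on a summary.

class ArtifactLedger:
    def __init__(self):
        self._events = defaultdict(list)  # path -> ordered list of actions

    def record(self, action: str, path: str) -> None:
        """action is one of 'created', 'modified', 'examined'."""
        self._events[path].append(action)

    def touched(self, action: str) -> list[str]:
        """All paths with at least one event of the given action, in order first seen."""
        return [p for p, acts in self._events.items() if action in acts]

ledger = ArtifactLedger()
ledger.record("examined", "src/app.py")
ledger.record("modified", "src/app.py")
ledger.record("created", "tests/test_app.py")
print(ledger.touched("modified"))  # ['src/app.py']
```

Because the ledger is updated by code rather than by summarization, it cannot lose entries when the transcript is compressed.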
---

## Related docs

- [Memory and Context Management](/guides/power-user/memory-management)
- [Token Efficiency Strategies](/guides/power-user/token-efficiency)