
Commit 22a7dce

docs: add CLI v0.38.0 and v0.39.0 changelog entries (#543)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
1 parent e98dc97 commit 22a7dce

File tree

3 files changed: +141 −1 lines changed


docs/changelog/cli-updates.mdx

Lines changed: 27 additions & 0 deletions
@@ -4,6 +4,33 @@ description: "Recent features and improvements to Factory CLI"
rss: true
---

<Update label="December 19" rss={{ title: "CLI Updates", description: "Context utilization setting and Custom Droids enabled by default" }}>
`v0.39.0`

## New features

* **Context utilization setting** - New setting to display token usage indicator in the status bar
* **Custom Droids enabled by default** - Custom Droids feature is now available to all users without requiring opt-in

## Bug fixes

* **Grep tool fix** - Fixed pattern argument handling in the Grep tool
* **BYOK Grok fix** - Fixed crash when using Grok models with thinking/reasoning streams
* **Warmup improvements** - Added option to disable warmup requests and skip warmup for slash commands
* **Non-git directory support** - Use current working directory as project directory when not in a git repository

</Update>

<Update label="December 18" rss={{ title: "CLI Updates", description: "GPT-5.2 improvements and .env loading fix" }}>
`v0.38.0`

## Bug fixes

* **GPT-5.2 improvements** - Fixed request parameters and reasoning effort options for better model performance
* **Prevent .env auto-loading** - CLI no longer automatically loads `.env` files from the working directory in standalone builds, making behavior more predictable

</Update>

<Update label="December 17" rss={{ title: "CLI Updates", description: "Todo tool improvements, Chrome DevTools MCP, and model cleanup" }}>
`v0.37.0`

docs/docs.json

Lines changed: 2 additions & 1 deletion
@@ -195,7 +195,8 @@
       "guides/power-user/memory-management",
       "guides/power-user/rules-conventions",
       "guides/power-user/prompt-crafting",
-      "guides/power-user/token-efficiency"
+      "guides/power-user/token-efficiency",
+      "guides/power-user/evaluating-context-compression"
     ]
   },
   {
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
---
title: Evaluating Context Compression
description: Summary of Factory Research’s evaluation of context compression strategies for long-running AI agent sessions.
keywords: ['context compression', 'summarization', 'agents', 'memory', 'token efficiency', 'evaluation']
---

This page summarizes Factory Research’s post: **[Evaluating Context Compression for AI Agents](https://factory.ai/news/evaluating-compression)** (Dec 16, 2025).

<Info>
For the full methodology, charts, and examples, read the original post linked above.
</Info>

---
## TL;DR

- Long-running agent sessions can exceed any model’s context window, so some form of **context compression** is required.
- The key metric isn’t *tokens per request*; it’s **tokens per task** (because missing details force costly re-fetching and rework).
- In Factory’s evaluation, **structured summarization** retained more “continue-the-task” information than OpenAI’s `/responses/compact` and Anthropic’s SDK compression, at similar compression rates.

---
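The tokens-per-task point can be made concrete with a toy calculation. The figures below are hypothetical, not taken from Factory’s post; they only illustrate how rework caused by lost context can swamp per-request savings.

```python
# Hypothetical illustration of tokens-per-task vs. tokens-per-request.
# None of these figures come from Factory's post.

def tokens_per_task(requests: int, tokens_per_request: int, rework_tokens: int) -> int:
    """Total tokens spent to finish one task, including rework from lost context."""
    return requests * tokens_per_request + rework_tokens

# Aggressive compression: smaller requests, but lost details force re-reading
# files and repeating debugging dead ends.
aggressive = tokens_per_task(requests=40, tokens_per_request=3_000, rework_tokens=150_000)

# Structured summarization: slightly larger requests, little rework.
structured = tokens_per_task(requests=40, tokens_per_request=5_000, rework_tokens=5_000)

print(aggressive)  # 270000
print(structured)  # 205000
```

Despite cheaper individual requests, the aggressive strategy spends more tokens overall: the per-request savings are smaller than the rework they cause.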
## Why context compression matters

As agent sessions stretch into hundreds or thousands of turns, the full transcript can reach **millions of tokens**. If an agent loses critical state (e.g., the exact endpoint, which file paths changed, or the current next step), it often:

- re-reads files it already read
- repeats debugging dead ends
- forgets what changed and where

That costs more time and tokens than the compression saved.

---
## How Factory evaluated “context quality”

Instead of using summary similarity metrics (e.g., ROUGE), Factory used a **probe-based evaluation**:

1. Take real, long-running production sessions.
2. Compress the earlier portion.
3. Ask probes that require remembering specific, task-relevant details from the truncated history.
4. Grade the answers for functional usefulness.

### Probe types

- **Recall**: factual retention (e.g., “What was the original error?”)
- **Artifact**: file tracking (e.g., “Which files did we modify and how?”)
- **Continuation**: task planning (e.g., “What should we do next?”)
- **Decision**: reasoning chain (e.g., “What did we decide and why?”)

### Scoring dimensions

Responses were scored (0–5) by an LLM judge (**GPT-5.2**) across:

- Accuracy
- Context awareness
- Artifact trail
- Completeness
- Continuity
- Instruction following

The judge is blinded to which compression method produced the response.

---
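The evaluation loop described above can be sketched as follows. The `compress`, `answer`, and `judge` callables stand in for LLM calls, and every name here is an illustrative assumption, not Factory’s actual harness.

```python
# Sketch of a probe-based evaluation loop (illustrative, not Factory's code).
from dataclasses import dataclass
from statistics import mean
from typing import Callable

DIMENSIONS = ["accuracy", "context", "artifact", "completeness", "continuity", "instruction"]

@dataclass
class Probe:
    kind: str      # "recall" | "artifact" | "continuation" | "decision"
    question: str  # e.g. "Which files did we modify and how?"

def evaluate(session: list[str],
             compress: Callable[[list[str]], str],
             answer: Callable[[str, list[str], str], str],
             judge: Callable[[str, str], dict[str, float]],
             probes: list[Probe],
             keep_recent: int = 20) -> dict[str, float]:
    """Compress the early history, then grade probe answers against it."""
    summary = compress(session[:-keep_recent])   # steps 1-2: compress the older span
    recent = session[-keep_recent:]
    scores: dict[str, list[float]] = {d: [] for d in DIMENSIONS}
    for probe in probes:                         # step 3: probes needing truncated details
        response = answer(summary, recent, probe.question)
        graded = judge(probe.question, response) # step 4: blinded 0-5 grading per dimension
        for d in DIMENSIONS:
            scores[d].append(graded[d])
    return {d: mean(vals) for d, vals in scores.items()}
```

The key property is that scoring depends on whether the compressed context still supports answering task-relevant questions, not on how textually similar the summary is to the transcript.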
## Compression approaches compared

| Approach | What it produces | Key trade-off |
|---|---|---|
| **Factory** | A **structured, persistent summary** with explicit sections (intent, file modifications, decisions, next steps). Updates by summarizing only the newly truncated span and **merging** it into the existing summary (“anchored iterative summarization”). | Slightly larger summaries than the most aggressive compression, but better retention of task-critical details. |
| **OpenAI** | `/responses/compact`: an **opaque** compressed representation optimized for reconstruction fidelity. | Highest compression, but low interpretability (you can’t inspect what was preserved). |
| **Anthropic** | Claude SDK built-in compression: detailed structured summaries (often 7–12k chars), regenerated on each compression. | High-quality summaries, but regenerating the whole summary each time can cause drift across repeated compressions. |

---
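The “anchored iterative summarization” idea can be sketched as a merge over a fixed-section summary. The per-span LLM summarization call is omitted, and all names here are illustrative assumptions rather than Factory’s implementation.

```python
# Sketch of the merge step in anchored iterative summarization (illustrative).

# Sections the persistent summary always carries, so details can't
# "silently disappear" when the summary is updated.
SECTIONS = ("intent", "file_modifications", "decisions", "next_steps")

def merge_summary(anchor: dict[str, list[str]],
                  span_summary: dict[str, list[str]]) -> dict[str, list[str]]:
    """Merge the summary of the newly truncated span into the persistent anchor.

    Only the new span is summarized on each compression; earlier entries are
    preserved rather than regenerated, which avoids drift across compressions.
    """
    merged = {s: list(anchor.get(s, [])) for s in SECTIONS}
    for section in SECTIONS:
        for item in span_summary.get(section, []):
            if item not in merged[section]:  # keep entries unique per section
                merged[section].append(item)
    return merged
```

Contrast this with regenerating the whole summary on every compression (the Anthropic approach described above), where each regeneration is a fresh chance for an earlier detail to drop out.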
## Results (high-level)

Factory reports evaluating **36,000+ messages** from production sessions across tasks like PR review, bug fixes, feature implementation, and refactoring.

### Overall scores (0–5)

| Method | Overall | Accuracy | Context | Artifact | Completeness | Continuity | Instruction |
|---|---:|---:|---:|---:|---:|---:|---:|
| Factory | **3.70** | **4.04** | **4.01** | **2.45** | **4.44** | **3.80** | **4.99** |
| Anthropic | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| OpenAI | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |
### Compression ratio vs. quality

The post notes similar compression rates across methods:

- OpenAI: **99.3%** token removal
- Anthropic: **98.7%** token removal
- Factory: **98.6%** token removal

Factory retained ~0.7 percentage points more tokens than OpenAI (kept more context), and scored **+0.35** higher on overall quality.

---
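The removal rates above imply the retained fractions directly (retained = 1 − removed), which is where the ~0.7-point figure comes from:

```python
# Removal rates reported in the post; retained fraction = 1 - removal rate.
removal = {"OpenAI": 0.993, "Anthropic": 0.987, "Factory": 0.986}
retained = {name: round(1 - rate, 3) for name, rate in removal.items()}

print(retained)  # {'OpenAI': 0.007, 'Anthropic': 0.013, 'Factory': 0.014}
# Factory keeps 1.4% of tokens vs. OpenAI's 0.7%: 0.7 percentage points more
# (roughly double the retained context) for a +0.35 gain in overall quality.
```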
## What Factory says it learned

- **Structure matters**: forcing explicit sections (files/decisions/next steps) reduces the chance that critical details “silently disappear” over time.
- **Compression ratio is a misleading target**: aggressive compression can “save tokens” but lose details that cause expensive rework; optimize for **tokens per task**.
- **Artifact tracking is still hard**: all methods scored low on tracking which files were created/modified/examined (Factory’s best was **2.45/5**), suggesting this may need dedicated state tracking beyond summarization.
- **Probe-based evaluation is closer to agent reality** than text similarity metrics, because it tests whether work can continue effectively.

---
## Related docs

- [Memory and Context Management](/guides/power-user/memory-management)
- [Token Efficiency Strategies](/guides/power-user/token-efficiency)
