---
title: Evaluating Context Compression
description: Summary of Factory Research’s evaluation of context compression strategies for long-running AI agent sessions.
keywords: ['context compression', 'summarization', 'agents', 'memory', 'token efficiency', 'evaluation']
---

This page summarizes Factory Research’s post: **[Evaluating Context Compression for AI Agents](https://factory.ai/news/evaluating-compression)** (Dec 16, 2025).

<Info>
  For the full methodology, charts, and examples, read the original post linked above.
</Info>

---

## TL;DR

- Long-running agent sessions can exceed any model’s context window, so some form of **context compression** is required.
- The key metric isn’t *tokens per request*; it’s **tokens per task** (because missing details force costly re-fetching and rework).
- In Factory’s evaluation, **structured summarization** retained more “continue-the-task” information than OpenAI’s `/responses/compact` and Anthropic’s SDK compression, at similar compression rates.

---

## Why context compression matters

As agent sessions stretch into hundreds or thousands of turns, the full transcript can reach **millions of tokens**. If an agent loses critical state (e.g., the exact endpoint, which file paths changed, or the current next step), it often:

- re-reads files it already read
- repeats debugging dead ends
- forgets what changed and where

That costs more time and tokens than the compression saved.

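The “tokens per task” framing can be made concrete with a back-of-the-envelope calculation. The numbers below are hypothetical, not from the post; they only illustrate how aggressive compression can lose on total cost once rework is counted:

```python
def tokens_per_task(tokens_per_request: int, requests_to_finish: int) -> int:
    """Total tokens spent to complete one task, not one request."""
    return tokens_per_request * requests_to_finish

# Hypothetical scenario: aggressive compression shrinks each request,
# but lost details force extra re-reads and retries to finish the task.
aggressive = tokens_per_task(tokens_per_request=20_000, requests_to_finish=12)
moderate = tokens_per_task(tokens_per_request=30_000, requests_to_finish=6)

print(aggressive)  # 240000 tokens for the whole task
print(moderate)    # 180000: cheaper overall despite larger individual requests
```

Under these (made-up) numbers, the strategy with bigger per-request contexts wins on total cost, which is the point of optimizing tokens per task rather than tokens per request.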
---

## How Factory evaluated “context quality”

Instead of using summary-similarity metrics (e.g., ROUGE), Factory used a **probe-based evaluation**:

1. Take real, long-running production sessions.
2. Compress the earlier portion.
3. Ask probes that require remembering specific, task-relevant details from the truncated history.
4. Grade the answers for functional usefulness.

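The four steps above can be sketched as a single evaluation loop. All callables here (`compress`, `ask_probe`, `grade`) are hypothetical stand-ins, not Factory’s actual harness:

```python
# Sketch of a probe-based evaluation loop, assuming the session is a list of
# turns and the most recent turns are kept verbatim while the rest is compressed.

def evaluate_compression(session: list[str], compress, ask_probe, grade,
                         probes: list[str], keep_recent: int = 50) -> float:
    """Average probe score for one session under one compression method."""
    truncated, recent = session[:-keep_recent], session[-keep_recent:]
    summary = compress(truncated)        # step 2: compress the earlier portion
    context = [summary] + recent         # the agent sees summary + recent turns
    scores = [grade(probe, ask_probe(context, probe), truncated)  # steps 3-4
              for probe in probes]
    return sum(scores) / len(scores)
```

The key property is that `grade` can consult the full truncated history, so the score measures what the compressed context actually lost.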
### Probe types

- **Recall**: factual retention (e.g., “What was the original error?”)
- **Artifact**: file tracking (e.g., “Which files did we modify and how?”)
- **Continuation**: task planning (e.g., “What should we do next?”)
- **Decision**: reasoning chain (e.g., “What did we decide and why?”)

### Scoring dimensions

Responses were scored (0–5) by an LLM judge (**GPT-5.2**) across:

- Accuracy
- Context awareness
- Artifact trail
- Completeness
- Continuity
- Instruction following

The judge was blinded to which compression method produced each response.
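One simple way to implement that blinding, sketched below under assumptions of our own (the post does not describe the mechanics): strip the method labels and randomize presentation order before the judge sees anything, then average the six dimension scores. `judge` is a hypothetical stand-in for the LLM call.

```python
import random

DIMENSIONS = ["accuracy", "context_awareness", "artifact_trail",
              "completeness", "continuity", "instruction_following"]

def blind_and_score(responses: dict[str, str], judge) -> dict[str, float]:
    """Score each method's response without revealing which method produced it."""
    methods = list(responses)
    random.shuffle(methods)  # blind: randomize order, drop method names
    overall = {}
    for method in methods:
        scores = judge(responses[method])  # hypothetical: returns {dimension: 0-5}
        overall[method] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return overall
```

The judge only ever sees anonymized response text, so per-dimension scores cannot systematically favor a known method.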

---

## Compression approaches compared

| Approach | What it produces | Key trade-off |
|---|---|---|
| **Factory** | A **structured, persistent summary** with explicit sections (intent, file modifications, decisions, next steps). Updates by summarizing only the newly truncated span and **merging** it into the existing summary (“anchored iterative summarization”). | Slightly larger summaries than the most aggressive compression, but better retention of task-critical details. |
| **OpenAI** | `/responses/compact`: an **opaque** compressed representation optimized for reconstruction fidelity. | Highest compression, but low interpretability (you can’t inspect what was preserved). |
| **Anthropic** | Claude SDK built-in compression: detailed structured summaries (often 7–12k chars), regenerated on each compression. | High-quality summaries, but regenerating the whole summary each time can cause drift across repeated compressions. |

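The “anchored iterative summarization” pattern in the Factory row can be sketched as below. Everything here is an illustrative assumption, not Factory’s published implementation: the section names, the `summarize` stand-in, and the choice to replace (rather than accumulate) next steps.

```python
# Hypothetical sketch: summarize only the newly truncated span, then merge it
# into a persistent structured summary instead of regenerating from scratch.

SECTIONS = ("intent", "file_modifications", "decisions", "next_steps")

def merge_summary(existing: dict[str, list[str]],
                  new_turns: list[str],
                  summarize) -> dict[str, list[str]]:
    """Fold a summary of `new_turns` into the persistent summary, section by section."""
    delta = summarize(new_turns)  # {section: [new facts]} for this span only
    merged = {s: list(existing.get(s, [])) for s in SECTIONS}
    for section in SECTIONS:
        for fact in delta.get(section, []):
            if fact not in merged[section]:  # anchor: prior facts are never dropped
                merged[section].append(fact)
    # Assumption: "next steps" reflects current state, so it is replaced outright.
    if delta.get("next_steps"):
        merged["next_steps"] = list(delta["next_steps"])
    return merged
```

Because only the delta is re-summarized, earlier facts stay anchored in place, which is the property that guards against the drift noted in the Anthropic row.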
---

## Results (high-level)

Factory reports evaluating **36,000+ messages** from production sessions across tasks like PR review, bug fixes, feature implementation, and refactoring.

### Overall scores (0–5)

| Method | Overall | Accuracy | Context | Artifact | Completeness | Continuity | Instruction |
|---|---:|---:|---:|---:|---:|---:|---:|
| Factory | **3.70** | **4.04** | **4.01** | **2.45** | **4.44** | **3.80** | **4.99** |
| Anthropic | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| OpenAI | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

### Compression ratio vs. quality

The post notes similar compression rates across methods:

- OpenAI: **99.3%** token removal
- Anthropic: **98.7%** token removal
- Factory: **98.6%** token removal

Factory retained ~0.7 percentage points more of the original tokens than OpenAI (roughly twice as much kept context: 1.4% vs. 0.7%) and scored **+0.35** higher on overall quality.
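The removal rates above follow directly from token counts before and after compression. A minimal sketch, with hypothetical counts chosen only to reproduce the quoted percentages:

```python
def removal_rate(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens removed by compression."""
    return 100 * (tokens_before - tokens_after) / tokens_before

# Hypothetical counts that match the rates quoted above.
print(round(removal_rate(1_000_000, 7_000), 1))   # 99.3 (OpenAI-like)
print(round(removal_rate(1_000_000, 14_000), 1))  # 98.6 (Factory-like)
# 98.6% vs. 99.3% removal means retaining 1.4% vs. 0.7% of the original:
# twice the kept context, at nearly the same headline compression rate.
```

This is why small differences in headline removal rate can hide large differences in how much context survives.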

---

## What Factory says it learned

- **Structure matters**: forcing explicit sections (files/decisions/next steps) reduces the chance that critical details “silently disappear” over time.
- **Compression ratio is a misleading target**: aggressive compression can “save tokens” but lose details that cause expensive rework; optimize for **tokens per task**.
- **Artifact tracking is still hard**: all methods scored low on tracking which files were created/modified/examined (Factory’s best score was **2.45/5**), suggesting this may need dedicated state tracking beyond summarization.
- **Probe-based evaluation is closer to agent reality** than text-similarity metrics, because it tests whether work can continue effectively.

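The “dedicated state tracking” idea from the artifact bullet could look something like the sketch below: a deterministic ledger updated from tool-call events, kept outside the lossy summary. The class, event shapes, and action names are all assumptions for illustration, not a published design.

```python
from collections import defaultdict

# Hypothetical artifact ledger: file state is recorded deterministically from
# tool-call events, so "which files did we touch?" never depends on a summary.

class ArtifactLedger:
    def __init__(self):
        self._events = defaultdict(list)  # path -> ordered list of actions

    def record(self, action: str, path: str) -> None:
        """action is one of 'created', 'modified', 'examined'."""
        self._events[path].append(action)

    def touched(self, action: str) -> list[str]:
        """All paths with at least one event of the given action, in order first seen."""
        return [p for p, acts in self._events.items() if action in acts]

ledger = ArtifactLedger()
ledger.record("examined", "src/app.py")
ledger.record("modified", "src/app.py")
ledger.record("created", "tests/test_app.py")
print(ledger.touched("modified"))  # ['src/app.py']
```

Because the ledger is updated by code rather than by summarization, it cannot lose entries when the transcript is compressed.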
---

## Related docs

- [Memory and Context Management](/guides/power-user/memory-management)
- [Token Efficiency Strategies](/guides/power-user/token-efficiency)