Skip to content

Commit 777829a

Browse files
committed
feat(writer): add garbage collection on full save
1 parent 5e7c260 commit 777829a

File tree

6 files changed

+536
-84
lines changed

6 files changed

+536
-84
lines changed
Lines changed: 293 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,293 @@
1+
# 041: Garbage Collection on Full Save
2+
3+
## Problem Statement
4+
5+
When performing a **full (non-incremental) save**, orphan objects are written to the output PDF. An "orphan object" is any indirect object that exists in the registry but is not reachable from the document's root (catalog) or Info dictionary.
6+
7+
### Current Behavior
8+
9+
1. `ensureObjectsLoaded()` walks from Root/Info, loading all reachable objects into the registry
10+
2. `writeComplete()` iterates `registry.entries()` and writes **all** objects (loaded + new)
11+
3. Objects that were loaded but are no longer referenced still get written
12+
13+
### How Orphans Accumulate
14+
15+
```typescript
16+
// Load PDF where page 1 has /Annots [5 0 R]
17+
// Object 5 references appearance stream object 10
18+
const pdf = await Pdf.open(bytes);
19+
20+
// Access page triggers loading objects 5 and 10 into registry
21+
const page = pdf.getPage(0);
22+
23+
// Remove all annotations - deletes the reference to object 5
24+
page.setAnnotations([]);
25+
26+
// Full save: objects 5 and 10 are still in registry.loaded
27+
// They get written even though they're unreachable
28+
const output = await pdf.save(); // Contains orphan objects 5 and 10
29+
```
30+
31+
### Affected Scenarios
32+
33+
- Removing annotations (orphans: annotation dicts, appearance streams)
34+
- Deleting pages (orphans: content streams, resources, annotations)
35+
- Removing form fields (orphans: field dicts, widget annotations)
36+
- Replacing fonts/images (orphans: old font/image objects)
37+
- Any operation that removes an indirect reference
38+
39+
## Goals
40+
41+
1. **Full saves should not include orphan objects** - only reachable objects get written
42+
2. **Incremental saves are unaffected** - they already work correctly (only write dirty/new objects)
43+
3. **No performance regression for typical workflows** - garbage collection should be efficient
44+
4. **Preserve backward compatibility** - existing API unchanged
45+
46+
## Non-Goals
47+
48+
- Object stream compression/reordering (separate optimization)
49+
- Free list management for incremental saves
50+
- Cross-reference stream optimization
51+
52+
## Proposed Solution
53+
54+
### Option A: Reachability-Based Writing (Recommended)
55+
56+
Change `writeComplete()` to only write objects reachable from Root and Info, rather than all registry entries.
57+
58+
```typescript
59+
// Desired API usage (unchanged externally)
60+
const output = await pdf.save(); // Only writes reachable objects
61+
```
62+
63+
#### Implementation
64+
65+
1. **Collect reachable objects** - Walk from Root and Info, collecting all `PdfRef` encountered
66+
2. **Write only reachable objects** - Filter `registry.entries()` to only include refs in the reachable set
67+
3. **Handle circular references** - The walk already handles cycles via a visited set
68+
69+
```typescript
70+
// In writeComplete():
71+
function collectReachableRefs(registry: ObjectRegistry, root: PdfRef, info?: PdfRef): Set<PdfRef> {
72+
const reachable = new Set<PdfRef>();
73+
74+
const walk = (obj: PdfObject | null): void => {
75+
if (obj === null) return;
76+
77+
if (obj instanceof PdfRef) {
78+
if (reachable.has(obj)) return; // Already visited
79+
reachable.add(obj);
80+
81+
const resolved = registry.resolve(obj);
82+
walk(resolved);
83+
} else if (obj instanceof PdfDict || obj instanceof PdfStream) {
84+
for (const [, value] of obj) {
85+
walk(value);
86+
}
87+
} else if (obj instanceof PdfArray) {
88+
for (const item of obj) {
89+
walk(item);
90+
}
91+
}
92+
};
93+
94+
walk(root);
95+
if (info) walk(info);
96+
97+
return reachable;
98+
}
99+
```
100+
101+
Then in `writeComplete()`:
102+
103+
```typescript
104+
// Collect only reachable objects
105+
const reachableRefs = collectReachableRefs(registry, options.root, options.info);
106+
107+
// Write only reachable objects
108+
for (const [ref, obj] of registry.entries()) {
109+
if (!reachableRefs.has(ref)) continue; // Skip orphans
110+
// ... write object
111+
}
112+
```
113+
114+
#### Pros
115+
116+
- Clean, explicit garbage collection
117+
- Easy to understand and debug
118+
- No changes to registry structure
119+
120+
#### Cons
121+
122+
- Two walks (ensureObjectsLoaded + collectReachableRefs) - but they're fast for in-memory data
123+
- Could be combined into one pass if performance matters
124+
125+
### Option B: Track Reachability in Registry
126+
127+
Modify `ObjectRegistry` to track which objects are reachable, updating on mutations.
128+
129+
#### Pros
130+
131+
- Single source of truth for reachability
132+
- Could enable other optimizations
133+
134+
#### Cons
135+
136+
- Significant complexity increase
137+
- Every mutation needs to update reachability
138+
- Hard to maintain correctly
139+
140+
**Recommendation: Option A** - simpler, correct, and the performance cost of an extra walk is negligible.
141+
142+
## Detailed Design
143+
144+
### Changes to `pdf-writer.ts`
145+
146+
```typescript
147+
/**
148+
* Collect all refs reachable from the document root.
149+
*
150+
* Walks the object graph starting from Root and Info,
151+
* returning the set of all PdfRef that are reachable.
152+
*/
153+
function collectReachableRefs(registry: ObjectRegistry, root: PdfRef, info?: PdfRef): Set<PdfRef> {
154+
const reachable = new Set<PdfRef>();
155+
156+
const walk = (obj: PdfObject | null): void => {
157+
if (obj === null) return;
158+
159+
if (obj instanceof PdfRef) {
160+
// Use string key for Set comparison (PdfRef instances may differ)
161+
const key = `${obj.objectNumber} ${obj.generation}`;
162+
if ([...reachable].some(r => `${r.objectNumber} ${r.generation}` === key)) {
163+
return;
164+
}
165+
reachable.add(obj);
166+
167+
const resolved = registry.resolve(obj);
168+
walk(resolved);
169+
} else if (obj instanceof PdfDict) {
170+
for (const [, value] of obj) {
171+
walk(value);
172+
}
173+
} else if (obj instanceof PdfStream) {
174+
for (const [, value] of obj) {
175+
walk(value);
176+
}
177+
} else if (obj instanceof PdfArray) {
178+
for (const item of obj) {
179+
walk(item);
180+
}
181+
}
182+
};
183+
184+
walk(root);
185+
if (info) walk(info);
186+
187+
return reachable;
188+
}
189+
190+
export function writeComplete(registry: ObjectRegistry, options: WriteOptions): WriteResult {
191+
// ... existing header writing ...
192+
193+
// Collect only reachable objects (garbage collection)
194+
const reachableRefs = collectReachableRefs(registry, options.root, options.info);
195+
196+
// Create a map for efficient lookup
197+
const reachableKeys = new Set([...reachableRefs].map(r => `${r.objectNumber} ${r.generation}`));
198+
199+
// Write only reachable objects
200+
for (const [ref, obj] of registry.entries()) {
201+
const key = `${ref.objectNumber} ${ref.generation}`;
202+
if (!reachableKeys.has(key)) continue; // Skip orphans
203+
204+
// ... existing object writing logic ...
205+
}
206+
207+
// ... existing xref/trailer writing ...
208+
}
209+
```
210+
211+
### No Changes Required
212+
213+
- `ObjectRegistry` - no changes needed
214+
- `ensureObjectsLoaded()` - can be kept or removed (collectReachableRefs does the same loading)
215+
- `writeIncremental()` - works correctly, no changes needed
216+
- Public API - no changes needed
217+
218+
### Optional Optimization: Combine Walks
219+
220+
Remove `ensureObjectsLoaded()` from `saveInternal()` since `collectReachableRefs()` does the same work:
221+
222+
```typescript
223+
// In saveInternal():
224+
// Remove: this.ensureObjectsLoaded();
225+
226+
// In writeComplete():
227+
// collectReachableRefs already resolves refs, loading them into registry
228+
```
229+
230+
## Test Plan
231+
232+
### Unit Tests
233+
234+
1. **Orphan from removed annotation**
235+
- Create PDF with annotation having appearance stream
236+
- Remove annotation
237+
- Full save should not include orphan objects
238+
- Verify by re-parsing and checking object count
239+
240+
2. **Orphan from removed page**
241+
- Create PDF with multiple pages
242+
- Remove a page
243+
- Full save should not include orphan page/content objects
244+
245+
3. **Circular references handled**
246+
- Create objects with circular refs (A → B → A)
247+
- Full save should not infinite loop or crash
248+
249+
4. **New objects included**
250+
- Create new annotation
251+
- Full save should include new objects
252+
253+
5. **Incremental save unaffected**
254+
- Remove annotation
255+
- Incremental save should still only write dirty objects
256+
- Original orphans preserved (incremental doesn't GC)
257+
258+
### Integration Tests
259+
260+
1. **Round-trip with modifications**
261+
- Load real PDF
262+
- Remove annotations
263+
- Save and reload
264+
- Verify saved size is smaller (no orphans)
265+
- Verify document still valid
266+
267+
## Migration / Breaking Changes
268+
269+
**None** - This is a bug fix. The API remains unchanged. Output PDFs will be smaller and cleaner.
270+
271+
## Open Questions
272+
273+
1. **Should we expose GC as an option?**
274+
- e.g., `pdf.save({ garbageCollect: false })` to preserve orphans
275+
- Probably not needed - why would you want orphans?
276+
277+
2. **Should we warn when orphans are removed?**
278+
- Could add to `registry.warnings` for debugging
279+
- Probably not - it's expected behavior
280+
281+
3. **Should `ensureObjectsLoaded()` be removed?**
282+
- It's now redundant with `collectReachableRefs()`
283+
- Could keep for explicit "warm up cache" use case
284+
- Recommend: keep but mark as optional optimization
285+
286+
## Implementation Checklist
287+
288+
- [ ] Add `collectReachableRefs()` function to `pdf-writer.ts`
289+
- [ ] Modify `writeComplete()` to filter by reachable refs
290+
- [ ] Add unit tests for orphan removal
291+
- [ ] Add integration test with real PDF modification
292+
- [ ] Consider removing redundant `ensureObjectsLoaded()` call
293+
- [ ] Update architecture docs if needed

src/api/pdf-fonts.test.ts

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -323,9 +323,11 @@ describe("Font subsetting options", () => {
323323
const font1 = pdf1.embedFont(fontBytes);
324324
const font2 = pdf2.embedFont(fontBytes);
325325

326-
// Encode just a few characters in both
327-
font1.encodeText("ABC");
328-
font2.encodeText("ABC");
326+
// Actually draw text to ensure fonts are reachable from the page
327+
const page1 = pdf1.getPage(0)!;
328+
const page2 = pdf2.getPage(0)!;
329+
page1.drawText("ABC", { x: 10, y: 10, font: font1, size: 12 });
330+
page2.drawText("ABC", { x: 10, y: 10, font: font2, size: 12 });
329331

330332
// Save both ways
331333
const subsetBytes = await pdf1.save({ subsetFonts: true });

src/api/pdf.ts

Lines changed: 1 addition & 57 deletions
Original file line numberDiff line numberDiff line change
@@ -2730,7 +2730,6 @@ export class PDF {
27302730
const blocker = this.canSaveIncrementally();
27312731

27322732
// Check if incremental is requested but not possible
2733-
27342733
if (wantsIncremental && blocker !== null) {
27352734
this.ctx.registry.addWarning(
27362735
`Incremental save not possible (${blocker}), performing full save`,
@@ -2846,11 +2845,7 @@ export class PDF {
28462845
return result;
28472846
}
28482847

2849-
// For full save with changes, we need all referenced objects loaded
2850-
// Walk from catalog to ensure we have everything
2851-
this.ensureObjectsLoaded();
2852-
2853-
// Full save
2848+
// Full save (collectReachableRefs in writeComplete will load all reachable objects)
28542849
const result = writeComplete(this.ctx.registry, {
28552850
version: this.ctx.info.version,
28562851
root,
@@ -2866,55 +2861,4 @@ export class PDF {
28662861

28672862
return result;
28682863
}
2869-
2870-
/**
2871-
* Ensure all reachable objects are loaded into the registry.
2872-
*
2873-
* Walks from the catalog to load all referenced objects.
2874-
*/
2875-
private ensureObjectsLoaded(): void {
2876-
const visited = new Set<string>();
2877-
2878-
const walk = (obj: PdfObject | null): void => {
2879-
if (obj === null) {
2880-
return;
2881-
}
2882-
2883-
if (obj instanceof PdfRef) {
2884-
const key = `${obj.objectNumber} ${obj.generation}`;
2885-
2886-
if (visited.has(key)) {
2887-
return;
2888-
}
2889-
2890-
visited.add(key);
2891-
2892-
const resolved = this.getObject(obj);
2893-
2894-
walk(resolved);
2895-
} else if (obj instanceof PdfDict) {
2896-
for (const [, value] of obj) {
2897-
walk(value);
2898-
}
2899-
} else if (obj instanceof PdfArray) {
2900-
for (const item of obj) {
2901-
walk(item);
2902-
}
2903-
}
2904-
};
2905-
2906-
// Start from root
2907-
const root = this.ctx.info.trailer.getRef("Root");
2908-
2909-
if (root) {
2910-
walk(root);
2911-
}
2912-
2913-
// Also load Info if present
2914-
const infoRef = this.ctx.info.trailer.getRef("Info");
2915-
2916-
if (infoRef) {
2917-
walk(infoRef);
2918-
}
2919-
}
29202864
}

0 commit comments

Comments
 (0)