|
| 1 | +# 041: Garbage Collection on Full Save |
| 2 | + |
| 3 | +## Problem Statement |
| 4 | + |
| 5 | +When performing a **full (non-incremental) save**, orphan objects are written to the output PDF. An "orphan object" is any indirect object that exists in the registry but is not reachable from the document's root (catalog) or Info dictionary. |
| 6 | + |
| 7 | +### Current Behavior |
| 8 | + |
| 9 | +1. `ensureObjectsLoaded()` walks from Root/Info, loading all reachable objects into the registry |
| 10 | +2. `writeComplete()` iterates `registry.entries()` and writes **all** objects (loaded + new) |
| 11 | +3. Objects that were loaded but are no longer referenced still get written |
| 12 | + |
| 13 | +### How Orphans Accumulate |
| 14 | + |
| 15 | +```typescript |
| 16 | +// Load PDF where page 1 has /Annots [5 0 R] |
| 17 | +// Object 5 references appearance stream object 10 |
| 18 | +const pdf = await Pdf.open(bytes); |
| 19 | + |
| 20 | +// Access page triggers loading objects 5 and 10 into registry |
| 21 | +const page = pdf.getPage(0); |
| 22 | + |
| 23 | +// Remove all annotations - deletes the reference to object 5 |
| 24 | +page.setAnnotations([]); |
| 25 | + |
| 26 | +// Full save: objects 5 and 10 are still in registry.loaded |
| 27 | +// They get written even though they're unreachable |
| 28 | +const output = await pdf.save(); // Contains orphan objects 5 and 10 |
| 29 | +``` |
| 30 | + |
| 31 | +### Affected Scenarios |
| 32 | + |
| 33 | +- Removing annotations (orphans: annotation dicts, appearance streams) |
| 34 | +- Deleting pages (orphans: content streams, resources, annotations) |
| 35 | +- Removing form fields (orphans: field dicts, widget annotations) |
| 36 | +- Replacing fonts/images (orphans: old font/image objects) |
| 37 | +- Any operation that removes an indirect reference |
| 38 | + |
| 39 | +## Goals |
| 40 | + |
| 41 | +1. **Full saves should not include orphan objects** - only reachable objects get written |
| 42 | +2. **Incremental saves are unaffected** - they already work correctly (only write dirty/new objects) |
| 43 | +3. **No performance regression for typical workflows** - garbage collection should be efficient |
| 44 | +4. **Preserve backward compatibility** - existing API unchanged |
| 45 | + |
| 46 | +## Non-Goals |
| 47 | + |
| 48 | +- Object stream compression/reordering (separate optimization) |
| 49 | +- Free list management for incremental saves |
| 50 | +- Cross-reference stream optimization |
| 51 | + |
| 52 | +## Proposed Solution |
| 53 | + |
| 54 | +### Option A: Reachability-Based Writing (Recommended) |
| 55 | + |
| 56 | +Change `writeComplete()` to only write objects reachable from Root and Info, rather than all registry entries. |
| 57 | + |
| 58 | +```typescript |
| 59 | +// Desired API usage (unchanged externally) |
| 60 | +const output = await pdf.save(); // Only writes reachable objects |
| 61 | +``` |
| 62 | + |
| 63 | +#### Implementation |
| 64 | + |
| 65 | +1. **Collect reachable objects** - Walk from Root and Info, collecting all `PdfRef` encountered |
| 66 | +2. **Write only reachable objects** - Filter `registry.entries()` to only include refs in the reachable set |
| 67 | +3. **Handle circular references** - The walk already handles cycles via a visited set |
| 68 | + |
| 69 | +```typescript |
| 70 | +// In writeComplete(): |
| 71 | +function collectReachableRefs(registry: ObjectRegistry, root: PdfRef, info?: PdfRef): Set<PdfRef> { |
| 72 | + const reachable = new Set<PdfRef>(); |
| 73 | + |
| 74 | + const walk = (obj: PdfObject | null): void => { |
| 75 | + if (obj === null) return; |
| 76 | + |
| 77 | + if (obj instanceof PdfRef) { |
| 78 | + if (reachable.has(obj)) return; // Already visited |
| 79 | + reachable.add(obj); |
| 80 | + |
| 81 | + const resolved = registry.resolve(obj); |
| 82 | + walk(resolved); |
| 83 | + } else if (obj instanceof PdfDict || obj instanceof PdfStream) { |
| 84 | + for (const [, value] of obj) { |
| 85 | + walk(value); |
| 86 | + } |
| 87 | + } else if (obj instanceof PdfArray) { |
| 88 | + for (const item of obj) { |
| 89 | + walk(item); |
| 90 | + } |
| 91 | + } |
| 92 | + }; |
| 93 | + |
| 94 | + walk(root); |
| 95 | + if (info) walk(info); |
| 96 | + |
| 97 | + return reachable; |
| 98 | +} |
| 99 | +``` |
| 100 | + |
| 101 | +Then in `writeComplete()`: |
| 102 | + |
| 103 | +```typescript |
| 104 | +// Collect only reachable objects |
| 105 | +const reachableRefs = collectReachableRefs(registry, options.root, options.info); |
| 106 | + |
| 107 | +// Write only reachable objects |
| 108 | +for (const [ref, obj] of registry.entries()) { |
| 109 | + if (!reachableRefs.has(ref)) continue; // Skip orphans |
| 110 | + // ... write object |
| 111 | +} |
| 112 | +``` |
| 113 | + |
| 114 | +#### Pros |
| 115 | + |
| 116 | +- Clean, explicit garbage collection |
| 117 | +- Easy to understand and debug |
| 118 | +- No changes to registry structure |
| 119 | + |
| 120 | +#### Cons |
| 121 | + |
| 122 | +- Two walks (ensureObjectsLoaded + collectReachableRefs) - but they're fast for in-memory data |
| 123 | +- Could be combined into one pass if performance matters |
| 124 | + |
| 125 | +### Option B: Track Reachability in Registry |
| 126 | + |
| 127 | +Modify `ObjectRegistry` to track which objects are reachable, updating on mutations. |
| 128 | + |
| 129 | +#### Pros |
| 130 | + |
| 131 | +- Single source of truth for reachability |
| 132 | +- Could enable other optimizations |
| 133 | + |
| 134 | +#### Cons |
| 135 | + |
| 136 | +- Significant complexity increase |
| 137 | +- Every mutation needs to update reachability |
| 138 | +- Hard to maintain correctly |
| 139 | + |
| 140 | +**Recommendation: Option A** - simpler, correct, and the performance cost of an extra walk is negligible. |
| 141 | + |
| 142 | +## Detailed Design |
| 143 | + |
| 144 | +### Changes to `pdf-writer.ts` |
| 145 | + |
| 146 | +```typescript |
| 147 | +/** |
| 148 | + * Collect all refs reachable from the document root. |
| 149 | + * |
| 150 | + * Walks the object graph starting from Root and Info, |
| 151 | + * returning the set of all PdfRef that are reachable. |
| 152 | + */ |
| 153 | +function collectReachableRefs(registry: ObjectRegistry, root: PdfRef, info?: PdfRef): Set<PdfRef> { |
| 154 | + const reachable = new Set<PdfRef>(); |
| 155 | + |
| 156 | + const walk = (obj: PdfObject | null): void => { |
| 157 | + if (obj === null) return; |
| 158 | + |
| 159 | + if (obj instanceof PdfRef) { |
| 160 | + // Use string key for Set comparison (PdfRef instances may differ) |
| 161 | + const key = `${obj.objectNumber} ${obj.generation}`; |
| 162 | + if ([...reachable].some(r => `${r.objectNumber} ${r.generation}` === key)) { |
| 163 | + return; |
| 164 | + } |
| 165 | + reachable.add(obj); |
| 166 | + |
| 167 | + const resolved = registry.resolve(obj); |
| 168 | + walk(resolved); |
| 169 | + } else if (obj instanceof PdfDict) { |
| 170 | + for (const [, value] of obj) { |
| 171 | + walk(value); |
| 172 | + } |
| 173 | + } else if (obj instanceof PdfStream) { |
| 174 | + for (const [, value] of obj) { |
| 175 | + walk(value); |
| 176 | + } |
| 177 | + } else if (obj instanceof PdfArray) { |
| 178 | + for (const item of obj) { |
| 179 | + walk(item); |
| 180 | + } |
| 181 | + } |
| 182 | + }; |
| 183 | + |
| 184 | + walk(root); |
| 185 | + if (info) walk(info); |
| 186 | + |
| 187 | + return reachable; |
| 188 | +} |
| 189 | + |
| 190 | +export function writeComplete(registry: ObjectRegistry, options: WriteOptions): WriteResult { |
| 191 | + // ... existing header writing ... |
| 192 | + |
| 193 | + // Collect only reachable objects (garbage collection) |
| 194 | + const reachableRefs = collectReachableRefs(registry, options.root, options.info); |
| 195 | + |
| 196 | + // Create a map for efficient lookup |
| 197 | + const reachableKeys = new Set([...reachableRefs].map(r => `${r.objectNumber} ${r.generation}`)); |
| 198 | + |
| 199 | + // Write only reachable objects |
| 200 | + for (const [ref, obj] of registry.entries()) { |
| 201 | + const key = `${ref.objectNumber} ${ref.generation}`; |
| 202 | + if (!reachableKeys.has(key)) continue; // Skip orphans |
| 203 | + |
| 204 | + // ... existing object writing logic ... |
| 205 | + } |
| 206 | + |
| 207 | + // ... existing xref/trailer writing ... |
| 208 | +} |
| 209 | +``` |
| 210 | + |
| 211 | +### No Changes Required |
| 212 | + |
| 213 | +- `ObjectRegistry` - no changes needed |
| 214 | +- `ensureObjectsLoaded()` - can be kept or removed (collectReachableRefs does the same loading) |
| 215 | +- `writeIncremental()` - works correctly, no changes needed |
| 216 | +- Public API - no changes needed |
| 217 | + |
| 218 | +### Optional Optimization: Combine Walks |
| 219 | + |
| 220 | +Remove `ensureObjectsLoaded()` from `saveInternal()` since `collectReachableRefs()` does the same work: |
| 221 | + |
| 222 | +```typescript |
| 223 | +// In saveInternal(): |
| 224 | +// Remove: this.ensureObjectsLoaded(); |
| 225 | + |
| 226 | +// In writeComplete(): |
| 227 | +// collectReachableRefs already resolves refs, loading them into registry |
| 228 | +``` |
| 229 | + |
| 230 | +## Test Plan |
| 231 | + |
| 232 | +### Unit Tests |
| 233 | + |
| 234 | +1. **Orphan from removed annotation** |
| 235 | + - Create PDF with annotation having appearance stream |
| 236 | + - Remove annotation |
| 237 | + - Full save should not include orphan objects |
| 238 | + - Verify by re-parsing and checking object count |
| 239 | + |
| 240 | +2. **Orphan from removed page** |
| 241 | + - Create PDF with multiple pages |
| 242 | + - Remove a page |
| 243 | + - Full save should not include orphan page/content objects |
| 244 | + |
| 245 | +3. **Circular references handled** |
| 246 | + - Create objects with circular refs (A → B → A) |
| 247 | + - Full save should not infinite loop or crash |
| 248 | + |
| 249 | +4. **New objects included** |
| 250 | + - Create new annotation |
| 251 | + - Full save should include new objects |
| 252 | + |
| 253 | +5. **Incremental save unaffected** |
| 254 | + - Remove annotation |
| 255 | + - Incremental save should still only write dirty objects |
| 256 | + - Original orphans preserved (incremental doesn't GC) |
| 257 | + |
| 258 | +### Integration Tests |
| 259 | + |
| 260 | +1. **Round-trip with modifications** |
| 261 | + - Load real PDF |
| 262 | + - Remove annotations |
| 263 | + - Save and reload |
| 264 | + - Verify saved size is smaller (no orphans) |
| 265 | + - Verify document still valid |
| 266 | + |
| 267 | +## Migration / Breaking Changes |
| 268 | + |
| 269 | +**None** - This is a bug fix. The API remains unchanged. Output PDFs will be smaller and cleaner. |
| 270 | + |
| 271 | +## Open Questions |
| 272 | + |
| 273 | +1. **Should we expose GC as an option?** |
| 274 | + - e.g., `pdf.save({ garbageCollect: false })` to preserve orphans |
| 275 | + - Probably not needed - why would you want orphans? |
| 276 | + |
| 277 | +2. **Should we warn when orphans are removed?** |
| 278 | + - Could add to `registry.warnings` for debugging |
| 279 | + - Probably not - it's expected behavior |
| 280 | + |
| 281 | +3. **Should `ensureObjectsLoaded()` be removed?** |
| 282 | + - It's now redundant with `collectReachableRefs()` |
| 283 | + - Could keep for explicit "warm up cache" use case |
| 284 | + - Recommend: keep but mark as optional optimization |
| 285 | + |
| 286 | +## Implementation Checklist |
| 287 | + |
| 288 | +- [ ] Add `collectReachableRefs()` function to `pdf-writer.ts` |
| 289 | +- [ ] Modify `writeComplete()` to filter by reachable refs |
| 290 | +- [ ] Add unit tests for orphan removal |
| 291 | +- [ ] Add integration test with real PDF modification |
| 292 | +- [ ] Consider removing redundant `ensureObjectsLoaded()` call |
| 293 | +- [ ] Update architecture docs if needed |
0 commit comments