> `Defer` now grabs a zeroed struct from the pool, fills it, and `Apply` returns each struct to the pool after invoking `setField`. No other behavior changes: branches still copy `apply` slices, and errors propagate the same way.
Would love to get your input on whether it makes sense to keep looking for these low-hanging-fruit performance improvements. I don't have the time to go through the deep stuff, but using the exact same technique (building a `sync.Pool`), and compared to the last run in this PR (~98 KB/op for runtime-built, ~94 KB/op for generated), pooling the lexer buffer would trim another ~40 KB/op and ~60–70 allocs/op. That would need to come in a separate PR though, as it needs deeper analysis from you.
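For illustration, a pooled lexer buffer along the lines suggested above might look like the sketch below. This is a hypothetical standalone example, not participle's actual lexer API; `lexIdent` and `bufPool` are invented names. The key point is that the pooled `bytes.Buffer` never escapes: the token text is copied out with `String()` before the buffer is returned.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles scratch buffers so each lex call doesn't allocate one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// lexIdent reads characters up to the first space into a pooled buffer.
func lexIdent(input string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()             // pooled buffers may hold stale bytes
	defer bufPool.Put(buf)  // hand the buffer back when done
	for i := 0; i < len(input) && input[i] != ' '; i++ {
		buf.WriteByte(input[i])
	}
	return buf.String() // copies out, so the pooled buffer never escapes
}

func main() {
	fmt.Println(lexIdent("hello world")) // prints "hello"
}
```

Pooling a `*bytes.Buffer` (rather than a raw `[]byte`) also avoids the allocation that boxing a slice into `any` would cost on every `Put`.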
I don't mind the idea of performance improvements in principle, though I am a bit concerned about readability dropping. I think some abstractions on top of the pool might help here, so that the low-level pool casting etc. isn't spread throughout the code. There's possibly something to be done here with generics, though there might be a performance hit. Also, would you mind running these through …
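The generics abstraction suggested above could be sketched roughly like this. `Pool[T]` and its methods are illustrative names, not an existing API; the idea is simply that callers never see the raw `sync.Pool` or its `any` casts, and every value comes back zeroed.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool is a typed wrapper over sync.Pool that hides the interface casts.
type Pool[T any] struct {
	p sync.Pool
}

func NewPool[T any]() *Pool[T] {
	return &Pool[T]{p: sync.Pool{New: func() any { return new(T) }}}
}

// Get returns a zeroed *T from the pool.
func (pl *Pool[T]) Get() *T {
	return pl.p.Get().(*T)
}

// Put zeroes the value before recycling it, so the next Get is clean.
func (pl *Pool[T]) Put(v *T) {
	var zero T
	*v = zero
	pl.p.Put(v)
}

type contextFieldSet struct {
	field string
	value int
}

func main() {
	pool := NewPool[contextFieldSet]()
	c := pool.Get()
	c.field, c.value = "Name", 42
	fmt.Println(c.field, c.value) // prints "Name 42"
	pool.Put(c)
	fmt.Println(pool.Get().field == "") // prints "true": values come back zeroed
}
```

Because `*T` is already a pointer, storing it in the underlying `sync.Pool` avoids the boxing allocation a value type would incur, so the generic wrapper shouldn't cost much over the raw pool.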
Problem & Rationale
Parsing defers field assignments until the winning branch is known. Each `Defer` call allocates a fresh `contextFieldSet` so the captured values survive branch backtracking. Benchmarks showed `parseContext.Defer`/`Branch` accounting for nearly half of total allocations (pprof: `Branch` ~25%, `Defer` ~19%) even on tiny inputs. These structs are short-lived, small, and have a fixed shape, so recycling them avoids steady heap pressure and reduces GC work without touching parser semantics.

Fix
This change adds a `sync.Pool` of `contextFieldSet` objects. `Defer` now grabs a zeroed struct from the pool, fills it, and `Apply` returns each struct to the pool after invoking `setField`. No other behaviour changes: branches still copy `apply` slices, and errors propagate the same way.

Benchmark
Both participle variants improved by about 6–7% in wall time and shed roughly 30 KB and ~150 allocations per parse (compared with the pre-change baselines of 127 µs / 172 KB / 2053 allocs for the runtime-built parser and 78 µs / 167 KB / 1817 allocs for the generated one).
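The pooling pattern described under Fix can be sketched roughly as below. The real participle types carry more state; `parseContext`, `Defer`, and `Apply` here are pared down to show only the pool round-trip (get a zeroed struct, fill it, return it after `setField` runs).

```go
package main

import (
	"fmt"
	"sync"
)

// contextFieldSet stands in for the deferred capture record.
type contextFieldSet struct {
	field string
	value string
}

var fieldSetPool = sync.Pool{
	New: func() any { return &contextFieldSet{} },
}

type parseContext struct {
	apply []*contextFieldSet
}

// Defer grabs a recycled struct instead of allocating a fresh one.
func (p *parseContext) Defer(field, value string) {
	c := fieldSetPool.Get().(*contextFieldSet)
	c.field, c.value = field, value
	p.apply = append(p.apply, c)
}

// Apply runs the captured assignments, then returns each struct to the pool.
func (p *parseContext) Apply(setField func(field, value string)) {
	for _, c := range p.apply {
		setField(c.field, c.value)
		*c = contextFieldSet{} // zero before recycling
		fieldSetPool.Put(c)
	}
	p.apply = p.apply[:0]
}

func main() {
	ctx := &parseContext{}
	ctx.Defer("Name", "thrift")
	ctx.Apply(func(f, v string) { fmt.Printf("%s=%s\n", f, v) }) // prints "Name=thrift"
}
```

Zeroing before `Put` keeps stale pointers from pinning memory and guarantees that `Defer` always starts from a clean struct, matching the "grabs a zeroed struct" behaviour described above.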
Extending the Technique
Hoping that this technique is sound: even after pooling `contextFieldSet`, profiling the Thrift benchmark still showed `parseContext.Branch` dominating allocations. Every speculative branch clones an entire `parseContext`, and failed branches keep their deferred captures alive until GC. `go tool pprof -alloc_space` attributed ~25% of bytes to `Branch` and ~19% to `Defer`, so eliminating those short-lived context copies promised another allocation drop.

Extending the fix
This adds a `sync.Pool` for `parseContext` instances (`context.go:37-118`) plus small helpers:

- `discardDeferred` zeroes and returns any unused capture records, and `recycle` hands the whole context back to the pool. `Accept` now recycles the accepted branch automatically.
- Failure paths (`nodes.go:263-512`) now explicitly call `branch.recycle(false)` when a branch fails, ensuring both the context and any deferred captures are released immediately.
- `Stop`, `Accept`, and error tracking all behave exactly as before; only raw allocations were swapped for pooled scratch structs.

Benchmark (second round)
With both optimisations:
Compared to the prior (already pooled captures) run at ~119 µs/op with 140 kB / 1902 allocs, the new branch pooling holds throughput steady while cutting another ~40% of heap use (98 kB, 1638 allocs) for the runtime-built parser; the generated parser sees a similar improvement (from 136 kB / 1666 allocs down to 94 kB / 1402 allocs). Go-thrift remains the same, so participle now wins clearly on allocation footprint while matching its earlier speed.
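For illustration, the branch lifecycle described under "Extending the fix" might be modelled roughly as follows. This is a deliberately pared-down sketch, not the real implementation: the fields, `Branch`, `Accept`, `recycle`, and `discardDeferred` shown here are simplifications with illustrative bodies.

```go
package main

import (
	"fmt"
	"sync"
)

type parseContext struct {
	pos   int
	apply []string // stands in for deferred capture records
}

var ctxPool = sync.Pool{New: func() any { return new(parseContext) }}

// Branch clones the context from the pool instead of allocating.
func (p *parseContext) Branch() *parseContext {
	b := ctxPool.Get().(*parseContext)
	b.pos = p.pos
	b.apply = append(b.apply[:0], p.apply...) // branches still copy apply slices
	return b
}

// discardDeferred drops captures a failed branch accumulated.
func (p *parseContext) discardDeferred() { p.apply = p.apply[:0] }

// recycle returns the context to the pool; failed branches discard
// their captures first so nothing stays alive until GC.
func (p *parseContext) recycle(accepted bool) {
	if !accepted {
		p.discardDeferred()
	}
	p.pos = 0
	p.apply = p.apply[:0]
	ctxPool.Put(p)
}

// Accept copies the winning branch back into the parent and recycles it.
func (p *parseContext) Accept(b *parseContext) {
	p.pos = b.pos
	p.apply = append(p.apply[:0], b.apply...)
	b.recycle(true)
}

func main() {
	root := &parseContext{}

	ok := root.Branch()
	ok.pos = 7
	ok.apply = append(ok.apply, "Name=x")
	root.Accept(ok) // winner merged, context recycled automatically

	bad := root.Branch()
	bad.apply = append(bad.apply, "junk")
	bad.recycle(false) // failed branch: captures and context released at once

	fmt.Println(root.pos, len(root.apply)) // prints "7 1"
}
```

The explicit `recycle(false)` on failure paths is what removes the "failed branches keep their deferred captures alive until GC" problem: the scratch state is reclaimed the moment a branch loses.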