Pool the dictionary buffer when training a Zstandard dictionary#129125
Pool the dictionary buffer when training a Zstandard dictionary#129125alinpahontu2912 wants to merge 1 commit into
Conversation
ZstandardDictionary.Train allocated 'new byte[maxDictionarySize]' on every call. Dictionary sizes are typically tens to hundreds of KB (zstd recommends up to ~100 KB, but the API allows more), so each training call paid for a fresh GC allocation that often landed on the LOH. Rent the buffer from ArrayPool<byte>.Shared instead. Create copies the trained slice into an exact-sized array before returning, so the rented buffer can be returned immediately. Use clearArray: true on Return because the trained dictionary is derived from caller-supplied samples and must not linger in the shared pool. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @dotnet/area-system-collections |
There was a problem hiding this comment.
Pull request overview
This PR reduces repeated large allocations in ZstandardDictionary.Train by renting the native training output buffer from ArrayPool<byte>.Shared instead of allocating a new byte[] every call, then returning the rented buffer after copying the trained dictionary into its own array.
Changes:
- Replace per-call
new byte[maxDictionarySize]withArrayPool<byte>.Shared.Rent(maxDictionarySize)for the native training output buffer. - Ensure the rented dictionary buffer is returned in a
finallyblock, currently usingclearArray: trueto avoid retaining caller-derived data in the pool.
| // Clear before returning: the trained dictionary is derived from caller-supplied samples. | ||
| ArrayPool<byte>.Shared.Return(dictionaryBuffer, clearArray: true); |
| // Clear before returning: the trained dictionary is derived from caller-supplied samples. | ||
| ArrayPool<byte>.Shared.Return(dictionaryBuffer, clearArray: true); |
There was a problem hiding this comment.
| // Clear before returning: the trained dictionary is derived from caller-supplied samples. | |
| ArrayPool<byte>.Shared.Return(dictionaryBuffer, clearArray: true); | |
| ArrayPool<byte>.Shared.Return(dictionaryBuffer); |
We practically never clear buffers outside of crypto, I don't see a reason to do it here
| ZstandardUtils.ThrowIfError(dictSize); | ||
| return Create(dictionaryBuffer.AsSpan(0, (int)dictSize)); | ||
| } | ||
| finally |
There was a problem hiding this comment.
We already have a try/finally for the lengthsArray buffer, can we avoid even more nesting by reusing the existing blocks?
There was a problem hiding this comment.
Another option is to not return the array to the pool on the exceptional path. I can't find a reference for this in learn.microsoft.com or this repo's docs, but I've seen Stephen Toub and others mentioned in a PRs that generally returning arrays to the pool in exceptions paths is more trouble than its worth. For example:
| } | ||
| finally | ||
| { | ||
| // Clear before returning: the trained dictionary is derived from caller-supplied samples. |
There was a problem hiding this comment.
This clearing is expensive and unnecessary. We are not doing it anywhere else in similar situations
ZstandardDictionary.Trainallocatesnew byte[maxDictionarySize]on every call to hold the trained dictionary returned by the nativeZDICT_trainFromBuffer. Upstream zstd guidance puts a reasonable dictionary size at ~100 KB and the CLI defaults to 112,640 bytes — both above the .NET 85,000-byte LOH threshold. So for the recommended-and-up sizes, every training call costs a fresh Large Object Heap allocation that survives until the next gen2 collection. This change rents the buffer fromArrayPool<byte>.Sharedand returns it (withclearArray: true) once the trained dictionary has been copied into its own array.Why dictionary sizes are LOH-heavy
Upstream zstd (
zdict.h):CLI manpage (
zstd.1.md):--maxdict=#default is 112640 bytes.The managed wrapper repeats this guidance (
ZstandardDictionary.cs:85-87) and only guards the lower bound (ThrowIfLessThan(maxDictionarySize, 256, ...)at line 128) — there is no upper-bound check, so MB-sized buffers are also legal..NET LOH threshold is 85,000 bytes, so the recommended ~100 KB dictionary and the 112,640-byte CLI default both land on the LOH on every call.
Empirical evidence
I modelled the exact allocation pattern (
new byte[N]vsRent/Return) and measured per-call allocations withGC.GetAllocatedBytesForCurrentThread(Release build, 1,000 iterations, workstation GC, .NET 11 preview):new byte[]per callRentper call (steady state)For the recommended/default dictionary sizes, every
Traincall previously produced a ~100 KB LOH allocation that only a gen2 collection could reclaim. After this change, steady-stateTrainallocations for the dictionary buffer drop to zero.Reproducer
Program.cs(run from a scratchnet11.0console project,dotnet run -c Release):Output (workstation GC, single thread):
The 24-byte overhead per allocation is the 64-bit
byte[]object header (sync block + method table pointer + length).Rentreports 0 bytes/iter because the pool returns the same array each iteration after warm-up. The pre-PR path stays on the SOH for the first two rows and lands on the LOH for the last three.