Skip to content

Conversation

@tcard
Copy link
Contributor

@tcard tcard commented Nov 21, 2025

What this PR does

Adds a common.instrument-reference-leaks-pct flag. When set to >0, a percentage of gRPC BufferHolder objects (most prominently, WriteRequest) will be instrumented so that, once the buffer's reference count reaches zero, if there are remaining references to the buffer, a panic is thrown.

For this to work, either the object needs to come from a generated gRPC server using our custom global codec, or the newly introduced mimirpb.Unmarshal must be used.

Since ingester storage doesn't receive WriteRequests from gRPC anymore but from Kafka, storage/ingest now calls mimirpb.Unmarshal.

To-do: plumb this mechanism with the distributor too, which has its own unmarshaling path.

Which issue(s) this PR fixes or relates to

Follow up of #13573

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]. If changelog entry is not needed, please add the changelog-not-needed label to the PR.
  • about-versioning.md updated with experimental features.

Note

Introduces an experimental mechanism and flags to instrument a fraction of BufferHolder-backed gRPC buffers and detect leaked references, wiring it into the global codec and unmarshalling paths.

  • Core/Runtime:
    • Add mimirpb.CustomCodecConfig and global codec registration; optional leak instrumentation using dedicated pages and delayed unmapping to detect stale references.
    • New mimirpb.Unmarshal(...) helper to use the global codec; replace direct unmarshalling in storage/ingest paths.
    • pkg/blockbuilder/tsdb: ensure request buffers are freed via defer req.FreeBuffer().
  • Config/Flags:
    • New experimental block common.instrument_ref_leaks with flags: -common.instrument-reference-leaks.percentage, -common.instrument-reference-leaks.before-reuse-period, -common.instrument-reference-leaks.max-inflight-instrumented-bytes.
    • Wire config into CommonConfig, inheritance, help text, descriptors, defaults, and process setup.
  • Tests:
    • Add tests for reference-leak detection, RW2 unmarshalling without leaks, and config inheritance.
  • Docs/Changelog:
    • Update docs for new experimental feature and add CHANGELOG entry; help text reflects new flags.

Written by Cursor Bugbot for commit ff97b8c. This will update automatically on new commits. Configure here.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 21, 2025

@tcard
Copy link
Contributor Author

tcard commented Nov 21, 2025

I expected the revert 69ca33a to make the regression test I added to fail, but it doest in CI? It fails locally though:

 go test ./pkg/ingester/ -run=TestIngesterNoRW2MetadataRefLeaks -v
=== RUN   TestIngesterNoRW2MetadataRefLeaks
    ingester_test.go:9167:
        	Error Trace:	/Users/tcard/repos/mimir/pkg/ingester/ingester_test.go:9167
        	Error:      	reference leak detected
        	Test:       	TestIngesterNoRW2MetadataRefLeaks
        	Messages:   	addr: 140003f0210
--- FAIL: TestIngesterNoRW2MetadataRefLeaks (0.31s)

I'll increase the timeout. I've instead reworked NextRefLeakChannel to make it report negative detections too, and used that to make tests deterministic(-ish).

@tcard tcard force-pushed the instrument-ref-leaks branch from 8abb9c9 to 9d6d2d0 Compare November 24, 2025 08:26
@tcard
Copy link
Contributor Author

tcard commented Nov 24, 2025

There's the regression failure, after reverting the fix:

  --- FAIL: TestIngesterNoRW2MetadataRefLeaks (0.41s)
      ingester_test.go:9169: 
          	Error Trace:	/__w/mimir/mimir/pkg/ingester/ingester_test.go:9169
          	Error:      	reference leak detected
          	Test:       	TestIngesterNoRW2MetadataRefLeaks
          	Messages:   	addr: 0xc000246420

I'll now reapply the fix.

@tcard tcard marked this pull request as ready for review November 24, 2025 12:11
@tcard tcard requested review from a team and tacole02 as code owners November 24, 2025 12:11
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look, not enough to do a proper review. I just left a couple of nits I noticed.

Comment on lines 147 to 149
// Unmarshal unmarshals an object using the global codec. Prefer this over calling
// the Unmarshal method directly, as it will take advantage of pools and leak
// detection.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using it everywhere and add a faillint rule (defined in the Makefile) if the Unmarshal-function-with-receiver is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea long-term, but maybe for now it's better to restrict its usage to the places I know work fine, and then introduce it everywhere else progressively. What do you think?

func (c *CommonConfig) RegisterFlags(f *flag.FlagSet) {
c.Storage.RegisterFlagsWithPrefix("common.storage.", f)
c.ClientClusterValidation.RegisterFlagsWithPrefix("common.client-cluster-validation.", f)
f.Float64Var(&c.InstrumentRefLeaksPct, "common.instrument-reference-leaks-pct", 0, `Percentage of buffers from pools to instrument for reference leaks. 0 to disable.`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I suggest to change "pct" to "percentage". It will be consistent with other percentage-based settings we have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs look good! I left a few minor suggestions. Thank you!

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing initialization for YAML unmarshaling field

The InstrumentRefLeaksPct field is defined in commonConfigUnmarshaler but not initialized when creating the unmarshaler in UnmarshalCommonYAML. This means setting instrument_ref_leaks_pct under the common: section in YAML config will fail. Either the field should be removed from commonConfigUnmarshaler if inheritance isn't needed, or it should be initialized here like the other fields.

pkg/mimir/mimir.go#L716-L721

mimir/pkg/mimir/mimir.go

Lines 716 to 721 in da56796

common := configWithCustomCommonUnmarshaler{
Common: &commonConfigUnmarshaler{
Storage: &specificStorageLocations,
ClientClusterValidation: &specificClusterValidationLocations,
},

Fix in Cursor Fix in Web


buf := b.ReadOnlyData()
ptr := unsafe.SliceData(buf)
allPages := unsafe.Slice(ptr, roundUpToMultiple(len(buf), pageSize))
err := syscall.Munmap(allPages)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If buf is (say) (0x70000000, 0x74000000] and immediately after this Munmap call returns, another instrumented pool buffer is mmapped that at least partially overlaps with that same range, then there can be times when dangling references to this buf are permitted to dereference data in the new buf without error. It may be more confidence-inspiring to claim "our new build served 500,000 requests out of the leak-instrumented pool with zero use-after-frees" if the guard mechanism is precise.

The idea I had was to have a long-lived MMapped buffer that has a cool-down time that is >> any reasonable request processing time. Which would have the same problem described above but only if there's a pathological request that runs for more than the cooldown time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same concern initially, but then noticed the docs for munmap say:

The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references.

It seems to me the emphasized part guarantees that the address space range won't be reused, but maybe there's an implicit "... until the addresses are reused" in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more googling/LLMing and a short foray into OpenBSD's implementation point in the direction that you are right, and the address space can potentially be reused. Thanks for challenging this!

See b39485d


func deserializeRecordContentV1(content []byte, wr *mimirpb.PreallocWriteRequest) error {
return wr.Unmarshal(content)
return mimirpb.Unmarshal(content, wr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing FreeBuffer() call causes memory leak in block builder

The change from wr.Unmarshal(content) to mimirpb.Unmarshal(content, wr) now sets a buffer reference on the PreallocWriteRequest, requiring callers to invoke FreeBuffer() when done. While Ingester.PushToStorageAndReleaseRequest correctly calls req.FreeBuffer(), TSDBBuilder.PushToStorageAndReleaseRequest in pkg/blockbuilder/tsdb.go only calls mimirpb.ReuseSlice(req.Timeseries) without freeing the buffer. When reference leak instrumentation is enabled via common.instrument-reference-leaks-percentage, this causes mmap'd buffers to never be unmapped, resulting in memory leaks in the block builder path.

Additional Locations (1)

Fix in Cursor Fix in Web

var unmapQueue chan unmapTask

var startFreeingInstrumentedBuffers = sync.OnceFunc(func() {
unmapQueue = make(chan unmapTask, 1000)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sets a limit of 1000 WriteRequest buffers held up waiting. Is that too much maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I read this PR one concern I have is a queueing pileup one - we can dial in that X% of requests should get the instrumented pool, but in some cell it may work out that X% of requests may be a tipping point where requests take longer and longer and the number of mmapped regions grows until the process crashes.

To connect it to your question here, what if we have both an "X%" config knob but also a "maximum instrumented pools" knob so that we never exceed that limit. And then you don't need to limit the unmaps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added a knob at f44250c

I have kept behavior of immediately unmapping if unmapQueue is full because it seems to me it's saner than the alternative of blocking and waiting.

allPages := unsafe.Slice(ptr, roundUpToMultiple(len(buf), pageSize))
if b.waitBeforeReuse > 0 {
select {
case unmapQueue <- unmapTask{buf: allPages, at: time.Now().Add(b.waitBeforeReuse)}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this now needs mprotect() so that the memory is rendered inaccessible between the free and unmap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed 🤦 I still need to actually verify this new machinery works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed at 325f3c6

panic(fmt.Errorf("mprotect: %w", err))
}
select {
case unmapQueue <- unmapTask{buf: allPages, at: time.Now().Add(b.waitBeforeReuse)}:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing field causes nil pointer dereference in unmap

When sending an unmapTask to unmapQueue, the inflightInstrumentedBytes field is omitted from the struct literal, leaving it as nil. Later, when the background goroutine processes this task and calls unmap(), it will call inflightInstrumentedBytes.Sub() on a nil pointer, causing a panic. The struct literal needs to include inflightInstrumentedBytes: b.inflightInstrumentedBytes.

Additional Locations (1)

Fix in Cursor Fix in Web


require.NoError(t, err)
wr.ClearTimeseriesUnmarshalData()
wr.BufferHolder = mimirpb.BufferHolder{} // We don't want to compare this.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Test missing defer FreeBuffer causes memory leak

The v1 test case is missing defer wr.FreeBuffer() that was added to the v0 test case. This causes the mmap'd buffer to never be freed in tests. Since tests run with 100% instrumentation, every buffer is mmap'd, and without FreeBuffer() being called, the mmap'd memory is leaked. Additionally, the v0 test's defer wr.FreeBuffer() is also ineffective because the BufferHolder is immediately zeroed on line 101, clearing the buffer reference before the defer can run.

Fix in Cursor Fix in Web

@seizethedave seizethedave self-requested a review December 5, 2025 03:25
@tcard
Copy link
Contributor Author

tcard commented Dec 5, 2025

Testing ff97b8c under some real load, it's apparent that we're not calling WriteRequest.FreeBuffer everywhere we should. We need to plug those leaks before merging this.

imagen

@tcard tcard marked this pull request as draft December 5, 2025 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants