Add LFBP Instrumentation, Server Shutdown Ordering, and Test TearDown Hardening#1632
Merged
Add LFBP Instrumentation, Server Shutdown Ordering, and Test TearDown Hardening#1632
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a teardown hang caused by incomplete network-handler disposal and leaked pooled buffers, and adds DEBUG-only diagnostics to identify unreturned pool entries and stuck handlers.
Changes:
- Ensure
TcpNetworkHandlerBase.Dispose()performs full cleanup by invokingDisposeImpl(). - Fix pooled buffer leak by disposing
GarnetTcpNetworkSender’s heldresponseObjectduring sender disposal. - Add pool ownership/buffer-role tagging and DEBUG diagnostics for pool/handler disposal hangs.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| libs/server/Servers/GarnetServerTcp.cs | Tags the server network pool with an owner type for diagnostics. |
| libs/server/Servers/GarnetServerBase.cs | Adds DEBUG hang diagnostics for stuck active handler disposal (but introduces a Release build issue via unused using). |
| libs/common/Networking/TcpNetworkHandlerBase.cs | Calls DisposeImpl() from public Dispose() and tags receive-buffer allocations. |
| libs/common/Networking/NetworkHandler.cs | Tags transport/network buffer allocations to aid leak diagnostics. |
| libs/common/Networking/GarnetTcpNetworkSender.cs | Disposes held responseObject during sender disposal to avoid pool reference leaks. |
| libs/common/Networking/GarnetSaeaBuffer.cs | Tags SAEA send-buffer allocations. |
| libs/common/NetworkBufferSettings.cs | Extends buffer-pool creation to accept an owner type. |
| libs/common/Memory/PoolEntryTypes.cs | Introduces enums for pool owner and buffer role tagging. |
| libs/common/Memory/PoolEntry.cs | Adds packed source field used for diagnostics. |
| libs/common/Memory/LimitedFixedBufferPool.cs | Adds owner tagging, DEBUG outstanding-entry tracking, and disposal-time diagnostics (but introduces a Release build issue via unused using). |
| libs/cluster/Server/Replication/ReplicationManager.cs | Tags replication-created pools with owner type. |
| libs/client/ClientSession/GarnetClientSession.cs | Tags client-session-created pools with owner type. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Two bugs caused LimitedFixedBufferPool.Dispose() to hang indefinitely during server teardown (blocking ClusterResetHardDuringDisklessReplicationAttach): 1. TcpNetworkHandlerBase.Dispose() never called DisposeImpl(), so when a handler thread was blocked synchronously (e.g. in TryBeginDisklessSync), the CTS was never cancelled and activeHandlerCount was never decremented. DisposeActiveHandlers() would spin forever waiting for it to reach 0. 2. GarnetTcpNetworkSender.DisposeNetworkSender() disposed the saeaStack but not the current responseObject, leaking a PoolEntry that was never returned to the pool. LimitedFixedBufferPool.Dispose() then spun forever waiting for totalReferences to reach 0. Also adds PoolEntry source tracking infrastructure (PoolEntryBufferType and PoolOwnerType enums) with DEBUG-only diagnostics that log unreturned buffer details after a 5-second timeout during pool disposal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
59ce319 to
ce26903
Compare
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Use Interlocked.Exchange to atomically take ownership of responseObject and ReturnBuffer it back to the saeaStack before disposal. This ensures the PoolEntry is disposed exactly once when saeaStack.Dispose() iterates all items, and avoids a race with ReturnResponseObject() on the handler thread that could cause Debug.Assert(!disposed) to fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
reviewed
Mar 17, 2026
9a4db3b to
e113972
Compare
a4dd6eb to
e65a0e9
Compare
9fd3b9b to
be7a296
Compare
be7a296 to
9a2eb04
Compare
badrishc
approved these changes
Mar 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds instrumentation for tracking owners and type of buffers acquired from
LimitedFixedBufferPool(LFBP) to diagnose dispose issues and buffer leaks. Improves server shutdown ordering so listening sockets are closed immediately (freeing ports), and hardens cluster test teardown so cleanup always runs even when tests fail.Changes
Buffer Pool Tracking Instrumentation
PoolEntryTypes.cs):PoolEntryBufferType(identifies buffer role, e.g.NetworkReceiveBuffer,SaeaSendBuffer) andPoolOwnerType(identifies pool creator, e.g.ServerNetwork,Replication,ClientSession).PoolEntry.sourcefield: packed int where the low byte storesPoolEntryBufferTypeand byte 1 storesPoolOwnerType, set when the entry is acquired viaLimitedFixedBufferPool.Get().LimitedFixedBufferPoolnow acceptsPoolOwnerTypeat construction andPoolEntryBufferTypeinGet(). Under#if DEBUG, aConcurrentDictionarytracks all outstanding (checked-out) entries.LimitedFixedBufferPool.Dispose()logs all unreturned buffer details (owner type, buffer type, size) to aid leak diagnosis.PoolOwnerTypeandPoolEntryBufferTypevalues (GarnetClientSession,ReplicationManager,GarnetServerTcp,NetworkHandler,TcpNetworkHandlerBase,GarnetSaeaBuffer,LightClient,MigrationManager,GarnetClient).Server Shutdown Ordering & Port Reuse
IGarnetServer.Close(): new interface method that stops listening without waiting for active connections to drain.GarnetServerTcp.Close(): closes the listening socket to free the port immediately.GarnetServer.InternalDispose: three-phase shutdown — (1)Close()all servers to free ports, (2) dispose the provider (storage engine), (3)Dispose()servers to drain active handlers.GarnetServerTcpconstructor: setsSO_REUSEADDRon TCP sockets beforeBindto handleTIME_WAITstates. UDS initialization refactored for clarity.GarnetServerTcp.Dispose: reordered to close the listening socket before callingbase.Dispose()(which drains active handlers).ExceptionInjectionHelper Improvements
Task.Yield()loops with aTaskCompletionSource<bool>notification pattern, eliminating spin-waiting when tests wait for exception injection points.WaitOnSet→ResetAndWaitAsyncfor clarity.WaitOnClearAsyncto use the same TCS-based notification.MigrateSessionSlots,ReplicaSyncSession,ReplicaDisklessSync,ReplicaReceiveCheckpoint.LocalStorageDevice Logging
ILoggerparameter toLocalStorageDeviceconstructors.ReadAsync/WriteAsyncnow log caught exceptions atCriticallevel instead of silently swallowing them.Devices.cspasses the logger through on Windows.Cluster Test TearDown Hardening
ClusterTestContext.TearDown(): capturesTestContext.CurrentContext.Result.Outcomebefore cleanup begins. All cleanup phases (DisposeCluster, logger factory, delete directory,OnTearDown) now run unconditionally;Assert.Failis deferred to the end so resources are always released.SimplePrimaryReplicaSetup(): new helper that configures a one-primary-one-replica cluster with slot assignment andMEET.AttachAndWaitForSync: removed redundantMeetandBumpEpochcalls (now handled bySimplePrimaryReplicaSetup).ReplicaSyncSession: addedcts.Token.ThrowIfCancellationRequested()at the top of theAcquireCheckpointEntryloop.Misc
DisposeActiveHandlers()diagnostics fromHANGDETECTtoDEBUGwithStopwatch-based timeout logging of stuck handlers.LimitedFixedBufferPooldoc comment.