Add LFBP Instrumentation, Server Shutdown Ordering, and Test TearDown Hardening #1632

Merged: vazois merged 19 commits into dev from vazois/fix-lfbp-dispose-hang-dev on Mar 25, 2026

@vazois commented Mar 16, 2026

Summary

Adds instrumentation that tracks the owner and type of each buffer acquired from LimitedFixedBufferPool (LFBP), to help diagnose dispose hangs and buffer leaks. Improves server shutdown ordering so that listening sockets are closed immediately (freeing ports), and hardens cluster test teardown so that cleanup always runs even when tests fail.

Changes

Buffer Pool Tracking Instrumentation

  • New enums (PoolEntryTypes.cs): PoolEntryBufferType (identifies buffer role, e.g. NetworkReceiveBuffer, SaeaSendBuffer) and PoolOwnerType (identifies pool creator, e.g. ServerNetwork, Replication, ClientSession).
  • PoolEntry.source field: packed int where the low byte stores PoolEntryBufferType and byte 1 stores PoolOwnerType, set when the entry is acquired via LimitedFixedBufferPool.Get().
  • LimitedFixedBufferPool now accepts PoolOwnerType at construction and PoolEntryBufferType in Get(). Under #if DEBUG, a ConcurrentDictionary tracks all outstanding (checked-out) entries.
  • Dispose diagnostics (DEBUG only): after a 5-second timeout, LimitedFixedBufferPool.Dispose() logs all unreturned buffer details (owner type, buffer type, size) to aid leak diagnosis.
  • All call sites updated to pass appropriate PoolOwnerType and PoolEntryBufferType values (GarnetClientSession, ReplicationManager, GarnetServerTcp, NetworkHandler, TcpNetworkHandlerBase, GarnetSaeaBuffer, LightClient, MigrationManager, GarnetClient).
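
As an illustration of the packed source layout described above, here is a minimal sketch. It assumes the enums are byte-sized; the numeric values, the `Unknown` members, and the `PoolEntrySource` helper are hypothetical and not taken from the PR. Only the member names NetworkReceiveBuffer, SaeaSendBuffer, ServerNetwork, Replication and ClientSession appear in the description.

```csharp
using System;

// Hypothetical values; the real enums live in PoolEntryTypes.cs and may differ.
public enum PoolEntryBufferType : byte { Unknown = 0, NetworkReceiveBuffer = 1, SaeaSendBuffer = 2 }
public enum PoolOwnerType : byte { Unknown = 0, ServerNetwork = 1, Replication = 2, ClientSession = 3 }

// Hypothetical helper showing one way to pack both tags into a single int.
public static class PoolEntrySource
{
    // Low byte (bits 0-7) holds the buffer type; byte 1 (bits 8-15) holds the owner type.
    public static int Pack(PoolOwnerType owner, PoolEntryBufferType bufferType)
        => ((int)owner << 8) | (int)bufferType;

    // Mask off everything above the low byte to recover the buffer type.
    public static PoolEntryBufferType BufferTypeOf(int source)
        => (PoolEntryBufferType)(source & 0xFF);

    // Shift byte 1 down and mask to recover the owner type.
    public static PoolOwnerType OwnerOf(int source)
        => (PoolOwnerType)((source >> 8) & 0xFF);
}
```

Packing both tags into one int keeps the per-entry overhead to a single field, and a diagnostic log line can decode them again with `OwnerOf` and `BufferTypeOf`.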

Server Shutdown Ordering & Port Reuse

  • IGarnetServer.Close(): new interface method that stops listening without waiting for active connections to drain.
  • GarnetServerTcp.Close(): closes the listening socket to free the port immediately.
  • GarnetServer.InternalDispose: shutdown now runs in three phases: (1) Close() all servers to free ports, (2) dispose the provider (storage engine), (3) Dispose() servers to drain active handlers.
  • GarnetServerTcp constructor: sets SO_REUSEADDR on TCP sockets before Bind to handle TIME_WAIT states. UDS initialization refactored for clarity.
  • GarnetServerTcp.Dispose: reordered to close the listening socket before calling base.Dispose() (which drains active handlers).
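
The listen-socket handling above can be sketched roughly as follows. `ListenerSketch` is a hypothetical stand-in rather than the GarnetServerTcp code, and the backlog value is arbitrary; the socket calls themselves are standard System.Net.Sockets APIs.

```csharp
using System.Net;
using System.Net.Sockets;

// Hypothetical stand-in for a TCP server's listening side.
public sealed class ListenerSketch
{
    private readonly Socket listenSocket;

    public ListenerSketch(IPEndPoint endpoint)
    {
        listenSocket = new Socket(endpoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        // SO_REUSEADDR must be set before Bind to apply to this socket.
        listenSocket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, true);
        listenSocket.Bind(endpoint);
        listenSocket.Listen(128);
    }

    // Port actually bound (useful when the endpoint asked for port 0).
    public int Port => ((IPEndPoint)listenSocket.LocalEndPoint).Port;

    // Stops listening only; already-accepted connections are not touched here.
    public void Close() => listenSocket.Close();
}
```

Closing the listening socket first means a test that restarts a server on the same port does not have to wait for active handlers to drain before it can bind again.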

ExceptionInjectionHelper Improvements

  • Replaced busy Task.Yield() loops with a TaskCompletionSource<bool> notification pattern, eliminating spin-waiting when tests wait for exception injection points.
  • Renamed WaitOnSetResetAndWaitAsync for clarity.
  • Updated WaitOnClearAsync to use the same TCS-based notification.
  • All callers updated: MigrateSessionSlots, ReplicaSyncSession, ReplicaDisklessSync, ReplicaReceiveCheckpoint.
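
For readers unfamiliar with the pattern, a TaskCompletionSource-based wait might look like the sketch below. `InjectionPointSketch` and its members are hypothetical and do not mirror the ExceptionInjectionHelper API; the sketch only shows how a waiter can be woken by a notification instead of polling with Task.Yield(). It assumes .NET 6 or later for `Task.WaitAsync`.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical single injection point that can be enabled, cleared, and awaited.
public sealed class InjectionPointSketch
{
    private readonly object gate = new object();
    private bool enabled;
    private TaskCompletionSource<bool> cleared = NewTcs();

    // Run continuations asynchronously so Clear() never executes waiter code inline.
    private static TaskCompletionSource<bool> NewTcs()
        => new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    public void Enable()
    {
        lock (gate)
        {
            enabled = true;
            // A completed source cannot be reused, so swap in a fresh one.
            if (cleared.Task.IsCompleted) cleared = NewTcs();
        }
    }

    public void Clear()
    {
        TaskCompletionSource<bool> toSignal;
        lock (gate)
        {
            enabled = false;
            toSignal = cleared;
        }
        // Signal outside the lock; TrySetResult is a no-op if already signalled.
        toSignal.TrySetResult(true);
    }

    // Completes immediately when the point is not enabled, otherwise when Clear() is called.
    public Task WaitOnClearAsync(CancellationToken token = default)
    {
        lock (gate)
        {
            if (!enabled) return Task.CompletedTask;
            return cleared.Task.WaitAsync(token);
        }
    }
}
```

Compared with a yield loop, the waiting test consumes no CPU while blocked and resumes as soon as the point is cleared.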

LocalStorageDevice Logging

  • Added ILogger parameter to LocalStorageDevice constructors.
  • ReadAsync / WriteAsync now log caught exceptions at Critical level instead of silently swallowing them.
  • Devices.cs passes the logger through on Windows.

Cluster Test TearDown Hardening

  • ClusterTestContext.TearDown(): captures TestContext.CurrentContext.Result.Outcome before cleanup begins. All cleanup phases (DisposeCluster, logger factory, delete directory, OnTearDown) now run unconditionally; Assert.Fail is deferred to the end so resources are always released.
  • Increased teardown timeout from 5 → 30 seconds.
  • SimplePrimaryReplicaSetup(): new helper that configures a one-primary-one-replica cluster with slot assignment and MEET.
  • AttachAndWaitForSync: removed redundant Meet and BumpEpoch calls (now handled by SimplePrimaryReplicaSetup).
  • ReplicaSyncSession: added cts.Token.ThrowIfCancellationRequested() at the top of the AcquireCheckpointEntry loop.
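
The "always run cleanup, fail at the end" idea can be sketched without NUnit as below. `TearDownSketch.RunAll` is a hypothetical helper, and the final AggregateException merely stands in for the deferred Assert.Fail; capturing the test outcome before cleanup is not modelled here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs every cleanup phase, then reports failures once.
public static class TearDownSketch
{
    public static void RunAll(params Action[] cleanupSteps)
    {
        var errors = new List<Exception>();
        foreach (var step in cleanupSteps)
        {
            try
            {
                step();
            }
            catch (Exception ex)
            {
                // Remember the failure but keep going so later phases still run.
                errors.Add(ex);
            }
        }

        // Deferred failure: raised only after every phase has had its chance.
        if (errors.Count > 0)
            throw new AggregateException("One or more cleanup phases failed", errors);
    }
}
```

A call such as `TearDownSketch.RunAll(DisposeCluster, DisposeLoggerFactory, DeleteDirectory)` (method names illustrative) would release resources in order even if the first phase throws.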

Misc

  • Upgraded DisposeActiveHandlers() diagnostics from HANGDETECT to DEBUG with Stopwatch-based timeout logging of stuck handlers.
  • Fixed typo: "Dipose" → "Dispose" in LimitedFixedBufferPool doc comment.

Copilot AI review requested due to automatic review settings March 16, 2026 22:17

Copilot AI left a comment

Pull request overview

Fixes a teardown hang caused by incomplete network-handler disposal and leaked pooled buffers, and adds DEBUG-only diagnostics to identify unreturned pool entries and stuck handlers.

Changes:

  • Ensure TcpNetworkHandlerBase.Dispose() performs full cleanup by invoking DisposeImpl().
  • Fix pooled buffer leak by disposing GarnetTcpNetworkSender’s held responseObject during sender disposal.
  • Add pool ownership/buffer-role tagging and DEBUG diagnostics for pool/handler disposal hangs.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| libs/server/Servers/GarnetServerTcp.cs | Tags the server network pool with an owner type for diagnostics. |
| libs/server/Servers/GarnetServerBase.cs | Adds DEBUG hang diagnostics for stuck active handler disposal (but introduces a Release build issue via unused using). |
| libs/common/Networking/TcpNetworkHandlerBase.cs | Calls DisposeImpl() from public Dispose() and tags receive-buffer allocations. |
| libs/common/Networking/NetworkHandler.cs | Tags transport/network buffer allocations to aid leak diagnostics. |
| libs/common/Networking/GarnetTcpNetworkSender.cs | Disposes held responseObject during sender disposal to avoid pool reference leaks. |
| libs/common/Networking/GarnetSaeaBuffer.cs | Tags SAEA send-buffer allocations. |
| libs/common/NetworkBufferSettings.cs | Extends buffer-pool creation to accept an owner type. |
| libs/common/Memory/PoolEntryTypes.cs | Introduces enums for pool owner and buffer role tagging. |
| libs/common/Memory/PoolEntry.cs | Adds packed source field used for diagnostics. |
| libs/common/Memory/LimitedFixedBufferPool.cs | Adds owner tagging, DEBUG outstanding-entry tracking, and disposal-time diagnostics (but introduces a Release build issue via unused using). |
| libs/cluster/Server/Replication/ReplicationManager.cs | Tags replication-created pools with owner type. |
| libs/client/ClientSession/GarnetClientSession.cs | Tags client-session-created pools with owner type. |

Two bugs caused LimitedFixedBufferPool.Dispose() to hang indefinitely
during server teardown (blocking ClusterResetHardDuringDisklessReplicationAttach):

1. TcpNetworkHandlerBase.Dispose() never called DisposeImpl(), so when a
   handler thread was blocked synchronously (e.g. in TryBeginDisklessSync),
   the CTS was never cancelled and activeHandlerCount was never decremented.
   DisposeActiveHandlers() would spin forever waiting for it to reach 0.

2. GarnetTcpNetworkSender.DisposeNetworkSender() disposed the saeaStack
   but not the current responseObject, leaking a PoolEntry that was never
   returned to the pool. LimitedFixedBufferPool.Dispose() then spun
   forever waiting for totalReferences to reach 0.

Also adds PoolEntry source tracking infrastructure (PoolEntryBufferType
and PoolOwnerType enums) with DEBUG-only diagnostics that log unreturned
buffer details after a 5-second timeout during pool disposal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vazois force-pushed the vazois/fix-lfbp-dispose-hang-dev branch from 59ce319 to ce26903 (Mar 16, 2026 22:24)
vazois and others added 4 commits March 16, 2026 15:26
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Use Interlocked.Exchange to atomically take ownership of responseObject
and ReturnBuffer it back to the saeaStack before disposal. This ensures
the PoolEntry is disposed exactly once when saeaStack.Dispose() iterates
all items, and avoids a race with ReturnResponseObject() on the handler
thread that could cause Debug.Assert(!disposed) to fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
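
The atomic hand-off described in the commit message above can be sketched as follows. `SenderSketch` is a hypothetical simplification: a ConcurrentStack stands in for saeaStack, plain IDisposable objects stand in for pooled buffers, and the sketch does not model a handler thread returning a buffer after the stack has already been drained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical simplification of a sender that holds one in-flight response buffer.
public sealed class SenderSketch : IDisposable
{
    private readonly ConcurrentStack<IDisposable> stack = new ConcurrentStack<IDisposable>();
    private IDisposable responseObject;

    public SenderSketch(IDisposable initial) => responseObject = initial;

    // Called from both the handler path and Dispose. Interlocked.Exchange swaps the
    // field to null atomically, so only one caller ever sees the non-null reference.
    public bool ReturnResponseObject()
    {
        var taken = Interlocked.Exchange(ref responseObject, null);
        if (taken == null) return false; // someone else already took ownership
        stack.Push(taken);
        return true;
    }

    public void Dispose()
    {
        // Hand any in-flight buffer back to the stack first, then dispose each item once.
        ReturnResponseObject();
        while (stack.TryPop(out var item)) item.Dispose();
    }
}
```

Because ownership is taken with a single atomic swap, the buffer is pushed back at most once regardless of which thread gets there first.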
@vazois changed the title from "Fix dispose hang in network handler and buffer pool cleanup" to "Add LFBP Instrumentation for Tracking Ownership" (Mar 18, 2026)
@vazois force-pushed the vazois/fix-lfbp-dispose-hang-dev branch from 9a4db3b to e113972 (Mar 19, 2026 19:59)
@vazois force-pushed the vazois/fix-lfbp-dispose-hang-dev branch from a4dd6eb to e65a0e9 (Mar 20, 2026 01:27)
@vazois changed the title from "Add LFBP Instrumentation for Tracking Ownership" to "Add LFBP Instrumentation, Server Shutdown Ordering, and Test TearDown Hardening" (Mar 20, 2026)
@vazois force-pushed the vazois/fix-lfbp-dispose-hang-dev branch from 9fd3b9b to be7a296 (Mar 20, 2026 20:06)
@vazois force-pushed the vazois/fix-lfbp-dispose-hang-dev branch from be7a296 to 9a2eb04 (Mar 20, 2026 21:38)
@vazois merged commit 943bd93 into dev (Mar 25, 2026)
30 checks passed
@vazois deleted the vazois/fix-lfbp-dispose-hang-dev branch (Mar 25, 2026 19:28)