Skip to content

Comments

Fix gossip shutdown hang caused by blocking stop signal#5373

Merged
yacovm merged 1 commit intohyperledger:mainfrom
remo-lab:fix-stateinfocache-shutdown-hang
Jan 18, 2026
Merged

Fix gossip shutdown hang caused by blocking stop signal#5373
yacovm merged 1 commit intohyperledger:mainfrom
remo-lab:fix-stateinfocache-shutdown-hang

Conversation

@remo-lab
Copy link
Contributor

This PR fixes a shutdown hang in the gossip channel state management logic.

Problem

stateInfoCache.Stop() was using a blocking send on an unbuffered channel:

func (cache *stateInfoCache) Stop() {
    cache.stopChan <- struct{}{}
}

If Stop() was called while the background goroutine was executing Purge(), the goroutine wasn’t listening on stopChan, causing the send to block indefinitely.
This could hang gossipChannel.Stop() and prevent peers from shutting down cleanly, especially under load.

Fix

The stop signaling was updated to use a non-blocking, idempotent close pattern:

  1. Added a sync.Once guard to stateInfoCache
  2. Replaced the blocking send with a guarded close(stopChan)
func (cache *stateInfoCache) Stop() {
    cache.stopOnce.Do(func() {
        close(cache.stopChan)
    })
}

Why this works

  • close() never blocks, unlike a channel send
  • sync.Once ensures the channel is closed exactly once
  • Closing the channel reliably unblocks the background goroutine
  • This matches the shutdown pattern already used elsewhere in Fabric (e.g. messageStoreImpl.Stop())

Testing

All channel-level tests pass, including shutdown-related coverage:

go test ./gossip/gossip/channel/... -v
Screenshot 2026-01-17 215925 Screenshot 2026-01-17 215948

This change is minimal, localized, and only affects shutdown behavior by removing the possibility of an indefinite block.


@remo-lab remo-lab requested a review from a team as a code owner January 17, 2026 16:48
@remo-lab remo-lab force-pushed the fix-stateinfocache-shutdown-hang branch from c98b5a3 to 5c994c6 Compare January 17, 2026 16:50
@remo-lab
Copy link
Contributor Author

Hi @C0rWin @denyeart @yacovm ,
This fixes a shutdown hang in the gossip channel caused by a blocking stop signal and switches it to a non-blocking, idempotent close pattern.

Whenever you have time, I’d really appreciate a review. Thanks!


func (cache *stateInfoCache) Stop() {
cache.stopChan <- struct{}{}
cache.stopOnce.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an easier way of fixing it, without adding the sync.Once.

Just do:

select {
    case <- cache.stopChan:
         return
   default:
          close(cache.stopChan)
}

Signed-off-by: remo-lab <remopanda7@gmail.com>
@remo-lab remo-lab force-pushed the fix-stateinfocache-shutdown-hang branch from 5c994c6 to ce11aa7 Compare January 17, 2026 21:57
@remo-lab
Copy link
Contributor Author

Hi @yacovm ,
I’ve updated the fix based on your feedback.
Switched to a simple select-based channel close instead of the blocking send, and kept the struct unchanged (no sync.Once).
I’ve tested this locally and all channel tests are passing.

@yacovm yacovm merged commit ad1b227 into hyperledger:main Jan 18, 2026
24 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants