fix: prevent concurrent gRPC Send() on blockchain subscription streams #613

freemans13 wants to merge 2 commits into bsv-blockchain:main
Conversation
Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
🤖 Claude Code Review Status: Complete

Summary: This PR correctly fixes two critical concurrency bugs. Both fixes are well-implemented with comprehensive test coverage. The synchronous notification delivery may introduce performance considerations for high subscriber counts, but correctness takes precedence.
```go
// cleanupDeletedTxs performs actual deletion from currentTxMap for transactions
// that were previously soft-deleted. Called after subtree storage completes.
// Only deletes if the transaction is still marked as deleted (not re-added).
```
✅ Resolved: Comment updated at line 1010 to accurately describe behavior: "Remove from deletedTxs backup map (transaction data no longer needed after storage)"
The rpcCallCache uses ttlcache with a 10s TTL, but without DisableTouchOnHit. By default, every Get() resets the TTL timer. When coinbase polls getinfo or getbestblockhash every 5s, the cache entry is touched before it expires, keeping stale data alive forever. Adding WithDisableTouchOnHit ensures entries expire exactly 10s after creation regardless of how often they are read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR addresses a critical reliability issue in the blockchain subscription system where concurrent gRPC Send() calls on the same stream could corrupt the stream and effectively stall downstream components that depend on block notifications.
Changes:
- Serialize blockchain subscription notifications (and the initial notification) to avoid concurrent gRPC `Send()` on a single server stream.
- Add tests demonstrating the concurrent-send race and the expected serial-send behavior.
- Introduce a `DeletedTxs` fallback + `OnStorageComplete` callback for subtree meta creation to tolerate parent-map mutations during async storage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| services/blockchain/Server.go | Sends subscription notifications synchronously; sends initial notification before adding subscriber; defers removal of dead subscribers. |
| services/blockchain/subscription_race_test.go | Adds tests simulating concurrent vs serial Send() patterns. |
| services/blockassembly/subtreeprocessor/SubtreeProcessor.go | Adds deletedTxs tracking + request fields to support BlockAssembly fallback and cleanup callback. |
| services/blockassembly/Server.go | Falls back to DeletedTxs when ParentTxMap is missing entries; invokes OnStorageComplete after storage completes. |
| services/blockassembly/server_test.go | Updates tests for new request fields and adds coverage for DeletedTxs fallback behavior. |
| services/blockassembly/subtreeprocessor/subtree_size_benchmark_test.go | Updates benchmark request literal to include new fields. |
| services/rpc/handlers.go | Disables “touch on hit” for the RPC TTL cache (behavior change, unrelated to the subscription fix). |
```go
// Queue dead subscribers for removal
for _, s := range dead {
	b.deadSubscriptions <- s
```
Potential deadlock: startSubscriptions now sends into b.deadSubscriptions from the same goroutine that drains it. If more than cap(deadSubscriptions) subscribers fail during one notification, the loop at the end of this case will block before returning to the select, and dead subscribers will never be drained. Consider removing dead subscribers directly (e.g., delete from b.subscribers under a write lock and close done) instead of enqueueing to b.deadSubscriptions here, or otherwise ensure these sends cannot block (e.g., unbounded queue/non-blocking send).
Suggested change:

```diff
-// Queue dead subscribers for removal
-for _, s := range dead {
-	b.deadSubscriptions <- s
+// Remove dead subscribers directly to avoid potential deadlock on
+// b.deadSubscriptions when many subscribers fail in one notification.
+if len(dead) > 0 {
+	b.subscribersMu.Lock()
+	for _, s := range dead {
+		delete(b.subscribers, s)
+		safeClose(s.done)
+	}
+	remaining := len(b.subscribers)
+	b.subscribersMu.Unlock()
+	b.logger.Infof("[Blockchain][startSubscriptions] Removed %d dead subscription(s) (Total=%d).", len(dead), remaining)
```
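The other alternative named in the comment, a non-blocking send, can be sketched as follows. This is a standalone illustration, not the PR's code; the `subscriber` type, `enqueueNonBlocking` helper, and channel capacity are placeholders:

```go
package main

import "fmt"

// subscriber is a minimal placeholder for the blockchain server's subscriber type.
type subscriber struct{ source string }

// enqueueNonBlocking queues dead subscribers without ever blocking the
// manager goroutine: if the channel is full, the subscriber is counted as
// dropped and left for the next failed Send to surface again.
func enqueueNonBlocking(ch chan subscriber, dead []subscriber) (queued, dropped int) {
	for _, s := range dead {
		select {
		case ch <- s:
			queued++
		default:
			dropped++
		}
	}
	return queued, dropped
}

func main() {
	// More failures than channel capacity: a plain send here would deadlock
	// the goroutine that also drains this channel.
	deadSubscriptions := make(chan subscriber, 1)
	q, d := enqueueNonBlocking(deadSubscriptions, []subscriber{{"a"}, {"b"}, {"c"}})
	fmt.Println("queued:", q, "dropped:", d) // queued: 1 dropped: 2
}
```

Dropping on overflow is safe here only because a dead subscriber's next `Send()` will fail again and re-surface it; direct removal under the write lock, as in the suggestion above, avoids that retry entirely.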
```diff
 b.subscribersMu.RLock()
+// Collect dead subscribers to remove after releasing the read lock
+var dead []subscriber
 for sub := range b.subscribers {
-	b.logger.Debugf("[Blockchain][startSubscriptions] Sending notification to %s in background: %s", sub.source, notification.Stringify())
-
-	go func(s subscriber) {
-		b.logger.Debugf("[Blockchain][startSubscriptions] Sending notification to %s: %s", s.source, notification.Stringify())
-
-		if err := s.subscription.Send(notification); err != nil {
-			b.deadSubscriptions <- s
-		}
-	}(sub)
+	b.logger.Debugf("[Blockchain][startSubscriptions] Sending notification to %s: %s", sub.source, notification.Stringify())
+
+	// Send synchronously — NOT in a goroutine. Concurrent Send() calls
+	// on the same gRPC ServerStream corrupt the stream, causing the
+	// subscriber to be silently dropped and never receive notifications.
+	if err := sub.subscription.Send(notification); err != nil {
+		dead = append(dead, sub)
+	}
 }
 b.subscribersMu.RUnlock()
```
b.subscribersMu.RLock is held while calling sub.subscription.Send(notification). Since Send may block on network backpressure, this can stall writers (new/dead subscription handling) and prolong the read-lock hold time across potentially many subscribers. A safer pattern is to snapshot the current subscribers under the RLock (copy the keys to a slice), release the lock, then iterate and Send without holding the lock.
```go
// Send initial notification BEFORE adding to the subscribers map.
// This prevents concurrent Send() between sendInitialNotification
// and the notification delivery loop above.
b.sendInitialNotification(s)
```
sendInitialNotification can enqueue to b.deadSubscriptions on Send() error. Because this is now called synchronously from the subscription manager goroutine, a full deadSubscriptions buffer would block here and prevent the manager from draining the channel. Since the subscriber hasn't been added to the map yet, consider having sendInitialNotification return an error (or bool) and let this case close s.done / skip adding the subscriber without sending to deadSubscriptions.
```go
// Trigger cleanup of soft-deleted transactions
if subtreeRequest.OnStorageComplete != nil {
	subtreeRequest.OnStorageComplete()
}
```
OnStorageComplete is only invoked by the coordinator goroutine after subtree/meta workers complete. If storeSubtreeData returns early (e.g., subtree already exists / setup error before goroutines start), the callback never runs, which can leave deletedTxs entries around longer than intended. Consider ensuring OnStorageComplete is invoked (or explicitly skipped) on all return paths where the subtree is already stored or storage is aborted, so cleanup semantics are predictable.
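One way to guarantee the callback runs on every return path is to defer it at function entry. The sketch below is a minimal illustration under assumed names (`request`, `storeSubtreeData`, the `alreadyStored` flag), not the repository's actual signatures:

```go
package main

import "fmt"

// request is a stand-in for the subtree storage request carrying the callback.
type request struct{ OnStorageComplete func() }

// storeSubtreeData defers the callback immediately, so it fires on every
// return path, including early returns before any worker goroutines start.
func storeSubtreeData(req request, alreadyStored bool) error {
	if req.OnStorageComplete != nil {
		defer req.OnStorageComplete()
	}
	if alreadyStored {
		return nil // early return: cleanup still runs via defer
	}
	// ... launch subtree/meta workers and wait for completion ...
	return nil
}

func main() {
	calls := 0
	req := request{OnStorageComplete: func() { calls++ }}
	_ = storeSubtreeData(req, true)  // early-return path
	_ = storeSubtreeData(req, false) // normal path
	fmt.Println("callback invocations:", calls) // callback invocations: 2
}
```

If the real coordinator completes asynchronously, the defer would instead wrap whatever function observes completion, but the principle of a single unconditional cleanup point is the same.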
```go
for i := 0; i < 100; i++ {
	wg.Add(2)
	// Goroutine 1: simulates sendInitialNotification (old line 698: go b.sendInitialNotification(s))
	go func() {
		defer wg.Done()
		_ = mock.Send(notification)
	}()
	// Goroutine 2: simulates regular notification send (old line 679: go func(s subscriber) { s.subscription.Send(...) })
	go func() {
		defer wg.Done()
		_ = mock.Send(notification)
	}()
}
wg.Wait()

if !mock.raceDetected.Load() {
	t.Skip("Race condition not triggered in this run (timing dependent)")
```
This test is timing-dependent and can silently skip, which makes CI flaky and can mask regressions (it might pass without ever demonstrating the race). Consider making the overlap deterministic (e.g., use a start barrier so both goroutines enter Send() simultaneously) and fail the test if concurrent Send isn't detected under the forced conditions.
Suggested change:

```diff
+startCh := make(chan struct{})
 for i := 0; i < 100; i++ {
 	wg.Add(2)
 	// Goroutine 1: simulates sendInitialNotification (old line 698: go b.sendInitialNotification(s))
 	go func() {
 		defer wg.Done()
+		<-startCh
 		_ = mock.Send(notification)
 	}()
 	// Goroutine 2: simulates regular notification send (old line 679: go func(s subscriber) { s.subscription.Send(...) })
 	go func() {
 		defer wg.Done()
+		<-startCh
 		_ = mock.Send(notification)
 	}()
 }
+// Release all goroutines at once to force overlapping Send() calls.
+close(startCh)
 wg.Wait()
 if !mock.raceDetected.Load() {
-	t.Skip("Race condition not triggered in this run (timing dependent)")
+	t.Fatalf("expected concurrent Send() calls to be detected under forced conditions, but none were")
```
```go
// cleanupDeletedTxs performs actual deletion from currentTxMap for transactions
// that were previously soft-deleted. Called after subtree storage completes.
// Only deletes if the transaction is still marked as deleted (not re-added).
//
// This function is called via the OnStorageComplete callback to safely remove
// transactions that were marked for deletion while the subtree was being stored.
//
```
The comment for cleanupDeletedTxs says it "performs actual deletion from currentTxMap", but the function only deletes entries from stp.deletedTxs. Please update the comment to match the implementation (or adjust the implementation if currentTxMap cleanup is intended).
```go
// Save to deleted backup map before removing (for Server fallback during async storage)
if txInpoints, found := stp.currentTxMap.Get(hash); found {
	stp.deletedTxs.Set(hash, *txInpoints)
}
stp.currentTxMap.Delete(hash)
```
deletedTxs entries added here are only removed when cleanupDeletedTxs is invoked for a stored subtree containing the tx hash. If a tx is removed after its subtree was already stored (e.g., removing from chainedSubtrees during rechain), the entry can remain in deletedTxs indefinitely, causing unbounded growth over time. Consider adding a TTL/size limit for deletedTxs, or ensuring entries are removed once it's known no in-flight storage will reference them.
```go
c := ttlcache.New[string, any](
	ttlcache.WithTTL[string, any](10*time.Second),
	ttlcache.WithDisableTouchOnHit[string, any](),
)
```
This changes rpcCallCache semantics by disabling touch-on-hit, which is unrelated to the PR’s stated goal (gRPC subscription Send concurrency). If this behavior change is intentional, it likely deserves a brief note in the PR description; otherwise consider moving it to a separate PR to keep this fix scoped and reduce risk.


Summary
- `Send()` calls on the same gRPC `ServerStream` were issued from multiple goroutines
- Concurrent `Send()` corrupts the stream silently, causing subsequent sends to fail
- `Recv()` never gets an error, so it blocks forever without reconnecting

Fix

Test plan
- `TestSubscriptionConcurrentSendRace`: proves the old goroutine-based approach triggers concurrent `Send()` on the same stream (200/200 races detected)
- `TestSubscriptionSerialSend`: proves the fixed synchronous approach has zero concurrent access (101/101 serial)
- Run `generate`, verify RPC height updates without bouncing pods

🤖 Generated with Claude Code