
[data-race] Monitoring the queues (sidecar) when the queue is updated with a new queue #449

@liran-funaro

Description


In the sidecar, we update the blockToBeCommitted queue upon recovery.
The monitoring loop might check the queue length concurrently.

This is only an apparent race, not a real issue.


Technical Analysis

Race Location:

  • Write (Line 188 in sendBlocksAndReceiveStatus()): Replaces blockToBeCommitted channel during recovery
  • Read (Line 408 in monitorQueues()): Reads channel length for Prometheus metrics

Root Cause:
During sidecar recovery/reconnection to coordinator, sendBlocksAndReceiveStatus() replaces the entire blockToBeCommitted channel to drop enqueued blocks from the previous session, while the monitoring goroutine concurrently reads the channel length.

Why It's Benign:

  1. In practice, both len() on a channel and replacing the channel variable are single pointer-sized operations, so they behave atomically on mainstream platforms (the Go memory model gives no formal guarantee for unsynchronized access, which is why the race detector flags it)
  2. Worst case: monitoring briefly reads the length of the old channel before observing the new one
  3. No functional impact on block processing or status delivery
  4. Race window is extremely small (only during recovery)

Recommended Solution: atomic.Pointer with Generics

Use atomic.Pointer[chan *common.Block] for type-safe atomic operations:

import "sync/atomic"

type Service struct {
    // ... existing fields ...
    blockToBeCommitted atomic.Pointer[chan *common.Block]
    committedBlock     atomic.Pointer[chan *common.Block]
}

// In New():
blockToBeCommittedChan := make(chan *common.Block, c.ChannelBufferSize)
s.blockToBeCommitted.Store(&blockToBeCommittedChan)

committedBlockChan := make(chan *common.Block, c.ChannelBufferSize)
s.committedBlock.Store(&committedBlockChan)

// In sendBlocksAndReceiveStatus() (line 188):
newChan := make(chan *common.Block, s.config.ChannelBufferSize)
s.blockToBeCommitted.Store(&newChan)

// In monitorQueues() (line 408):
blockChan := s.blockToBeCommitted.Load()
if blockChan != nil {
    promutil.SetGauge(m.yetToBeCommittedBlocksQueueSize, len(*blockChan))
}

// Wherever the channel is used:
outputBlock := *s.blockToBeCommitted.Load()

Benefits:

  • Type-safe with generics (Go 1.19+)
  • No mutex overhead
  • Lock-free atomic operations
  • Clean, idiomatic Go code

Alternative Solutions:

Option 2: Mutex Protection

type Service struct {
    blockToBeCommittedMu sync.RWMutex
    blockToBeCommitted   chan *common.Block
}

Simpler but adds mutex overhead.

Option 3: Accept & Document
Document as known benign race; suppress in race detector config.

Priority: Low - cosmetic issue for tooling, not a production bug.

Note for CI: the RWMutex option would also provide clean race detector runs for CI (#442) with negligible overhead, since monitoring is ticker-based and recovery is rare.



==================
WARNING: DATA RACE
Read at 0x00c0003b8508 by goroutine 129:
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).monitorQueues()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:408 +0xfd
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).monitorQueues-fm()
      <autogenerated>:1 +0x3e
  github.com/hyperledger/fabric-x-committer/utils/monitoring.(*Provider).StartPrometheusServer.func2()
      /home/runner/work/fabric-x-committer/fabric-x-committer/utils/monitoring/provider.go:120 +0x58
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /home/runner/go/pkg/mod/golang.org/x/sync@v0.19.0/errgroup/errgroup.go:93 +0x86

Previous write at 0x00c0003b8508 by goroutine 112:
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).sendBlocksAndReceiveStatus()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:188 +0x644
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).Run.func4.1()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:157 +0x13c
  github.com/hyperledger/fabric-x-committer/utils/connection.Sustain()
      /home/runner/work/fabric-x-committer/fabric-x-committer/utils/connection/lifecycle.go:45 +0x159
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).Run.func4()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:153 +0xf9
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /home/runner/go/pkg/mod/golang.org/x/sync@v0.19.0/errgroup/errgroup.go:93 +0x86

Goroutine 129 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /home/runner/go/pkg/mod/golang.org/x/sync@v0.19.0/errgroup/errgroup.go:78 +0x11c
  github.com/hyperledger/fabric-x-committer/utils/monitoring.(*Provider).StartPrometheusServer()
      /home/runner/work/fabric-x-committer/fabric-x-committer/utils/monitoring/provider.go:119 +0xfe4
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).Run.func1()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:122 +0x152

Goroutine 112 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /home/runner/go/pkg/mod/golang.org/x/sync@v0.19.0/errgroup/errgroup.go:78 +0x11c
  github.com/hyperledger/fabric-x-committer/service/sidecar.(*Service).Run()
      /home/runner/work/fabric-x-committer/fabric-x-committer/service/sidecar/sidecar.go:151 +0x944
  github.com/hyperledger/fabric-x-committer/utils/connection.StartService.func1()
      /home/runner/work/fabric-x-committer/fabric-x-committer/utils/connection/server_util.go:177 +0x9e
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /home/runner/go/pkg/mod/golang.org/x/sync@v0.19.0/errgroup/errgroup.go:93 +0x86
==================
