Skip to content

Conversation

@NadavLevi
Copy link
Collaborator

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@codecov
Copy link

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/lavasession/consumer_session_manager.go 60.00% 5 Missing and 1 partial ⚠️
protocol/lavasession/consumer_types.go 87.50% 3 Missing ⚠️
Flag Coverage Δ
consensus 8.54% <ø> (ø)
protocol 33.91% <76.92%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/lavasession/consumer_types.go 73.12% <87.50%> (+1.77%) ⬆️
protocol/lavasession/consumer_session_manager.go 62.86% <60.00%> (+0.40%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link

github-actions bot commented Jan 28, 2026

Test Results

    7 files  ±0     85 suites  ±0   31m 12s ⏱️ -34s
3 324 tests +4  3 323 ✅ +4  1 💤 ±0  0 ❌ ±0 
3 517 runs  +4  3 516 ✅ +4  1 💤 ±0  0 ❌ ±0 

Results for commit ea5a923. ± Comparison against base commit e4f7444.

♻️ This comment has been updated with latest results.

The busy-wait goroutine that polls connection state did not check for
context cancellation, causing it to run forever when connections timed
out before becoming Ready. This was the primary cause of goroutine leaks
observed in production when all providers were blocked during startup.

Changes:
- Add context.Done() check in the goroutine's polling loop
- Use buffered channel to prevent goroutine blocking on send
- Add non-blocking send to handle race between Ready state and context cancel
- Add regression tests for goroutine cleanup behavior

The fix ensures that when ConnectRawClientWithTimeout returns due to
context timeout/cancellation, the internal goroutine exits promptly
instead of running indefinitely.
…ic epoch checks

The second chance goroutine previously slept for the full 3-minute
retrySecondChanceAfter duration before checking if the epoch changed.
This caused goroutine accumulation when many providers were blocked
in quick succession during startup.

Changes:
- Replace single time.After(3min) with ticker checking every 10 seconds
- Exit early if epoch changes, reducing goroutine lifetime from 3min to ~10s
- Add trace logging for early exit to aid debugging
- Add test for early exit on epoch change behavior

This fix reduces goroutine accumulation when providers are repeatedly
blocked during startup before they become available. Instead of 251
goroutines living for 3 minutes each, they now exit within ~10 seconds
of an epoch change.
Add a sync.RWMutex to the Endpoint struct to protect concurrent access
to Connections, ConnectionRefusals, and Enabled fields. This fixes a
pre-existing race condition that occurred when multiple goroutines
(from probeProviders) accessed the same Endpoint object simultaneously.

The race was detected in fetchEndpointConnectionFromConsumerSessionWithProvider
where multiple providers could share the same Endpoint object and
concurrently modify its fields without synchronization.

The fix:
- Added `mu sync.RWMutex` field to Endpoint struct
- Wrapped modifications to Connections, ConnectionRefusals, and Enabled
  with Lock/Unlock
- Used RLock/RUnlock for read-only access to Enabled field
- Carefully release lock before blocking network calls and re-acquire after
@NadavLevi NadavLevi force-pushed the fix-goroutine-leaks branch from d82099a to ea5a923 Compare January 28, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants