Fix parallel mode worker state #272

jakeloo · 2025-08-14T17:16:21Z

Summary by CodeRabbit

Bug Fixes
- Resolved race conditions during mode changes, improving stability.
- Prevented unnecessary gap handling in live mode.
- Ensured published and committed positions align on startup, with safe recovery if updates fail.
Refactor
- Introduced mode-aware gating to skip work when inactive, reducing unnecessary processing.
- Improved startup behavior with immediate cleanup of processed data for faster, leaner operation.
- Enhanced robustness by selecting data sources based on active mode.

coderabbitai · 2025-08-14T17:16:30Z

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (3)

internal/orchestrator/committer.go (3)

254-269: Potential nil dereference and logging-before-error-check when querying GetMaxBlockNumber.

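You log latestCommittedBlockNumber.String() before checking err, so a failed query is logged as if it had succeeded. (String() on a nil *big.Int returns "<nil>" rather than panicking, so the log line is misleading, not fatal.)
You also call Sign() without guarding against a nil pointer; unlike String(), Sign() dereferences the receiver and panics on nil.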

Apply this diff to fix both issues:

- latestCommittedBlockNumber, err := c.storage.MainStorage.GetMaxBlockNumber(c.rpc.GetChainID())
- log.Debug().Msgf("Committer found this max block number in main storage: %s", latestCommittedBlockNumber.String())
- if err != nil {
-   return nil, err
- }
+ latestCommittedBlockNumber, err := c.storage.MainStorage.GetMaxBlockNumber(c.rpc.GetChainID())
+ if err != nil {
+   return nil, err
+ }
+ if latestCommittedBlockNumber != nil {
+   log.Debug().Msgf("Committer found this max block number in main storage: %s", latestCommittedBlockNumber.String())
+ } else {
+   log.Debug().Msg("Committer found nil max block number in main storage")
+ }
 
-if latestCommittedBlockNumber.Sign() == 0 {
+if latestCommittedBlockNumber == nil || latestCommittedBlockNumber.Sign() == 0 {
   // If no blocks have been committed yet, start from the fromBlock specified in the config
   latestCommittedBlockNumber = new(big.Int).Sub(c.commitFromBlock, big.NewInt(1))
 } else {
   lastCommitted := new(big.Int).SetUint64(c.lastCommittedBlock.Load())
   if latestCommittedBlockNumber.Cmp(lastCommitted) < 0 {
     log.Warn().Msgf("Max block in storage (%s) is less than last committed block in memory (%s).", latestCommittedBlockNumber.String(), lastCommitted.String())
     return []*big.Int{}, nil
   }
 }

293-329: Potential nil dereference and misleading variable name in getBlockNumbersToPublish.

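You log lastestPublishedBlockNumber.String() before checking err, so a failed query is logged as if it had succeeded. (String() on a nil *big.Int returns "<nil>" rather than panicking, so the log line is misleading, not fatal.)
You call Sign() on a potentially nil pointer; unlike String(), Sign() dereferences the receiver and panics on nil.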
Variable name “lastestPublishedBlockNumber” is a typo; prefer “latestPublishedBlockNumber” for clarity.

Apply this diff:

- lastestPublishedBlockNumber, err := c.storage.StagingStorage.GetLastPublishedBlockNumber(c.rpc.GetChainID())
- log.Debug().Msgf("Committer found this last published block number in staging storage: %s", lastestPublishedBlockNumber.String())
- if err != nil {
-   return nil, err
- }
+ latestPublishedBlockNumber, err := c.storage.StagingStorage.GetLastPublishedBlockNumber(c.rpc.GetChainID())
+ if err != nil {
+   return nil, err
+ }
+ if latestPublishedBlockNumber != nil {
+   log.Debug().Msgf("Committer found this last published block number in staging storage: %s", latestPublishedBlockNumber.String())
+ } else {
+   log.Debug().Msg("Committer found nil last published block number in staging storage")
+ }
 
-if lastestPublishedBlockNumber.Sign() == 0 {
+if latestPublishedBlockNumber == nil || latestPublishedBlockNumber.Sign() == 0 {
   // If no blocks have been committed yet, start from the fromBlock specified in the config
-  lastestPublishedBlockNumber = new(big.Int).Sub(c.commitFromBlock, big.NewInt(1))
+  latestPublishedBlockNumber = new(big.Int).Sub(c.commitFromBlock, big.NewInt(1))
 } else {
   lastPublished := new(big.Int).SetUint64(c.lastPublishedBlock.Load())
-  if lastestPublishedBlockNumber.Cmp(lastPublished) < 0 {
+  if latestPublishedBlockNumber.Cmp(lastPublished) < 0 {
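-    log.Warn().Msgf("Max block in storage (%s) is less than last published block in memory (%s).", lastestPublishedBlockNumber.String(), lastPublished.String())
+    log.Warn().Msgf("Max block in storage (%s) is less than last published block in memory (%s).", latestPublishedBlockNumber.String(), lastPublished.String())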
     return []*big.Int{}, nil
   }
 }
 
-startBlock := new(big.Int).Add(lastestPublishedBlockNumber, big.NewInt(1))
-endBlock, err := c.getBlockToCommitUntil(ctx, lastestPublishedBlockNumber)
+startBlock := new(big.Int).Add(latestPublishedBlockNumber, big.NewInt(1))
+endBlock, err := c.getBlockToCommitUntil(ctx, latestPublishedBlockNumber)

426-432: Metric for first missed block is incorrect.

You’re recording blocksData[0].Block.Number, which is the first present block, not the first missed one. Use expectedBlockNumber.

- // record the first missed block number in prometheus
- metrics.MissedBlockNumbers.Set(float64(blocksData[0].Block.Number.Int64()))
+ // record the first missed block number in prometheus
+ metrics.MissedBlockNumbers.Set(float64(expectedBlockNumber.Int64()))

🧹 Nitpick comments (5)

internal/orchestrator/committer.go (5)
105-126: Startup alignment logic is sensible; consider persisting when lastPublished is unset.

Today, when lastPublished is nil/zero you only set the in-memory lastPublishedBlock to match lastCommitted, but you don’t persist it to staging. In parallel mode, that can cause the publisher to start again from commitFromBlock (staging still at 0) and re-publish older data.

Either persist the aligned lastPublished to staging when lastPublished is nil/zero, or confirm that downstream deduplication makes this safe and acceptable.

Would you like me to provide a patch that also updates StagingStorage when lastPublished is nil/zero to prevent potential duplicate publishes on restart?

162-176: Only the commit loop drains workMode updates; publisher responsiveness may lag.

Since runPublishLoop doesn't read from workModeChan, mode changes are only applied when the commit loop’s select runs. During the default branch’s sleep, mode switches can take up to a full interval to reflect on the publisher. Consider:

Moving workMode channel handling to a dedicated goroutine in Start, or

Replacing time.Sleep with a time.Ticker and including its channel in the select to avoid blocking other cases.

This will improve responsiveness and avoid potential back-pressure on the sender (if the channel is unbuffered).

451-459: Nit: incorrect error message in getSequentialBlockDataToPublish.

The error says “commit” but this function is for publish; update the message for clarity:

- return nil, fmt.Errorf("error determining blocks to commit: %v", err)
+ return nil, fmt.Errorf("error determining blocks to publish: %v", err)

531-555: Skip gap handling in live mode: LGTM; consider handling poller errors (if any).

The live-mode short-circuit is correct. If Poll returns an error, consider logging it to aid triage during backfill gap recovery.

Do you want me to add defensive logging around poller.Poll if its signature returns an error?

202-215: Replace time.Sleep with time.Ticker in both loops to avoid delaying mode updates.

Sleeping in the default branch prevents the select from serving workMode updates until the sleep ends. Using time.Ticker (and selecting on ticker.C alongside ctx.Done and the mode channel) will improve responsiveness without extra complexity.

I can provide a patch converting both loops to ticker-based selects if you’d like.

Also applies to: 162-185

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7b572b1 and cfc8cf6.

📒 Files selected for processing (1)

internal/orchestrator/committer.go (7 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

internal/orchestrator/committer.go (1)

internal/orchestrator/work_mode_monitor.go (2)

WorkModeBackfill (22-22)

WorkModeLive (21-21)

🔇 Additional comments (5)

internal/orchestrator/committer.go (5)

34-34: Thread-safe workMode access: good addition.

Protecting workMode with an RWMutex is the right call to eliminate races across commit/publish loops.

179-185: Mode gating in commit loop: LGTM.

Skipping work when mode is unset is appropriate and prevents undefined behavior during initialization.

209-215: Mode gating in publish loop: LGTM.

Same rationale as commit loop; avoids work until the system is ready.

333-350: Mode-aware “until” computation: LGTM.

Backfill mode returning computed window and live mode clamping to RPC latest block is correct.

353-379: Mode-aware data source selection: LGTM.

Using staging for backfill and a poller for live mode is appropriate. The warning + handleMissingStagingData on gaps in backfill is a nice touch.

jakeloo added 2 commits August 14, 2025 17:16

Fix publish parallel mode

3a7bace

Gofmt

cfc8cf6

jakeloo marked this pull request as ready for review August 14, 2025 17:16

coderabbitai bot reviewed Aug 14, 2025

View reviewed changes

jakeloo merged commit db4d974 into main Aug 14, 2025
5 checks passed

jakeloo deleted the jl/fix-worker-state branch August 14, 2025 18:01
