Fix COC consistency: scrub stale jobTracking entries on connect #2

Closed
anthony-murphy-agent wants to merge 1 commit into fix/taskmanager-validation from fix/coc-consistency

@anthony-murphy-agent
Owner

Summary

  • Fix ConsensusOrderedCollection (COC) data divergence between live clients. When a new client loads from a snapshot plus saved ops, it processes historical acquire ops (adding items to jobTracking) but never receives the removeMember quorum events for clients that left before it joined. Those items stay permanently stuck in jobTracking, so the new client diverges from long-lived clients, which correctly returned the items to the queue.
  • Add scrubJobTrackingByQuorum() in onConnect() to clean up stale entries after all saved ops are processed and the quorum is current.
  • Add ConsensusOrderedCollection to quorumDependentDdsTypes for frozen container validation (frozen containers never connect, so the scrub can't run).
  • Seeds fixed: 37, 60. Seeds still skipped: 56, 63, 180 (queue ordering divergence caused by quorum-event replay timing).
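
The behavior described in this summary can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the code in this PR: `CocModel`, `TrackedJob`, and `removeClient` are hypothetical names, and the quorum is reduced to a plain set of client ids. Only `jobTracking` and `scrubJobTrackingByQuorum` are names taken from the description above, and the body of the scrub here is a guess at its shape (return items held by departed clients to the end of the queue).

```typescript
// Hypothetical, simplified model of a consensus queue; NOT the Fluid Framework code.
interface TrackedJob<T> {
  value: T;
  clientId: string; // client that acquired the item
}

class CocModel<T> {
  readonly queue: T[] = [];
  readonly jobTracking = new Map<string, TrackedJob<T>>();

  add(value: T): void {
    this.queue.push(value);
  }

  // Processing an acquire op moves the head of the queue into jobTracking.
  acquire(acquireId: string, clientId: string): void {
    const value = this.queue.shift();
    if (value !== undefined) {
      this.jobTracking.set(acquireId, { value, clientId });
    }
  }

  // What a long-lived client does when it observes removeMember (assumed shape).
  removeClient(clientId: string): void {
    for (const [id, job] of Array.from(this.jobTracking.entries())) {
      if (job.clientId === clientId) {
        this.jobTracking.delete(id);
        this.queue.push(job.value);
      }
    }
  }

  // Sketch of the scrub on connect: return items held by clients that are
  // no longer in the quorum to the end of the queue.
  scrubJobTrackingByQuorum(quorumMembers: ReadonlySet<string>): void {
    for (const [id, job] of Array.from(this.jobTracking.entries())) {
      if (!quorumMembers.has(job.clientId)) {
        this.jobTracking.delete(id);
        this.queue.push(job.value);
      }
    }
  }
}

// Long-lived client: sees clientA leave between the acquire and the next add.
const longLived = new CocModel<string>();
longLived.add("a");
longLived.acquire("acq1", "clientA");
longLived.removeClient("clientA");
longLived.add("b");

// New client: replays the same ops but never sees removeMember for clientA.
const fresh = new CocModel<string>();
fresh.add("a");
fresh.acquire("acq1", "clientA");
fresh.add("b");
const stuckBeforeScrub = fresh.jobTracking.size; // 1: "a" is stuck

// On connect, the quorum is current and no longer contains clientA.
fresh.scrubJobTrackingByQuorum(new Set(["clientB"]));

console.log(longLived.queue); // [ 'a', 'b' ]
console.log(fresh.queue);     // [ 'b', 'a' ]
```

In this model the scrub removes the count divergence (both clients end with two queued items and an empty jobTracking), but the item order differs because the scrub appends at connect time rather than at the point where removeMember originally fired. That matches the kind of ordering divergence reported for the seeds that remain skipped.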

Test plan

  • Full 200-seed stress test suite passes (196 passing, 4 pending, 0 failing)
  • No regressions on previously-passing seeds
  • COC DDS-level fuzz tests still pass

🤖 Generated with Claude Code

When a new client loads from snapshot + saved ops, it processes historical
acquire ops that add items to jobTracking. However, it never receives the
removeMember quorum events for clients that left before it joined. This causes
items to be permanently stuck in jobTracking on the new client, while long-lived
clients correctly returned those items to the queue via their removeMember
handlers — producing data divergence between live clients.

Fix: override onConnect() in ConsensusOrderedCollection to scrub jobTracking
entries for clients no longer in the quorum. This runs after all saved ops are
processed and the quorum is current, ensuring stale entries are cleaned up.

Also add ConsensusOrderedCollection to the quorumDependentDdsTypes set in the
stress test harness, since frozen containers (which never connect) can't
perform this scrub.

Seeds fixed: 37, 60 (data count divergence)
Seeds remaining: 56, 63, 180 (queue ordering divergence — the scrub adds items
back at the end of the queue, whereas the original removeMember event happened
at a different point during op processing, producing different item ordering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>