fix: prevent orphaned DynamoDB entries when Kinesis shards close during pod termination #644
Conversation
fix: prevent orphaned DynamoDB entries when Kinesis shards close during pod termination

When Kinesis scales down (merges shards) and Bento pods scale down simultaneously in Kubernetes, DynamoDB lease table entries for closed shards were not being cleaned up, causing false positive latency alerts. This fix addresses a race condition where pods could be terminated after Kinesis closed shards but before consumers finished processing them.

Changes:
- Modified shutdown logic to detect finished shards (empty iterator) and delete their DynamoDB entries instead of leaving orphaned checkpoints
- Added periodic cleanup in the rebalancing loop to proactively remove DynamoDB entries for closed Kinesis shards
- Both Delete operations are idempotent and safe to call multiple times

The rebalancing cleanup (every 30s by default) serves as a safety net for edge cases and cleans up any existing orphaned entries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This pull request fixes a race condition where DynamoDB lease table entries for closed Kinesis shards are not cleaned up when Bento pods are terminated during Kubernetes scaling events, leading to orphaned entries and false positive latency alerts.
- Adds conditional cleanup logic during pod shutdown to delete checkpoints for finished shards instead of saving them
- Implements periodic background cleanup that removes DynamoDB entries for shards that have been closed by Kinesis
- Add lease expiry check in periodic cleanup to avoid interfering with active consumers still processing final records of closed shards
- Use context.Background() in shutdown Delete to ensure cleanup completes even when k.ctx is cancelled during shutdown

Co-authored-by: GitHub Copilot
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
Changes based on PR review comments:
1. Added OrphanedShards() method to query DynamoDB entries without ClientID
   - AllClaims() skips entries without ClientID, so the previous check would never find truly orphaned entries
   - New method uses a DynamoDB filter expression to find orphaned entries
2. Improved cleanup logic and logging
- Split cleanup into two phases: expired leases and orphaned entries
- Added debug logging when skipping cleanup due to active lease
- Better error messages distinguishing between different cleanup scenarios
3. Documented double-deletion edge case
- Added comment explaining why Delete can be called from multiple paths
- Clarified that this is safe due to idempotency
These changes ensure that:
- Orphaned entries (from Checkpoint(final=true)) are properly cleaned up
- Active consumers are not interfered with (lease expiry check)
- Debugging is easier with detailed logging
Summary
Fixes #643
This PR addresses a race condition where DynamoDB lease table entries for closed Kinesis shards are not cleaned up when Bento pods are terminated during Kubernetes scaling events.
Changes
1. Fix Shutdown Logic (input_kinesis.go:486-507)

Modified the awsKinesisConsumerClosing case to detect finished shards.

What this fixes: When a pod is terminated and the shard iterator is empty (shard closed and fully consumed), the DynamoDB entry is now deleted instead of leaving an orphaned checkpoint.
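The diff itself isn't reproduced here, so the following is a minimal Go sketch of the shape of this change, not Bento's actual code: the checkpointer interface, finishShard, and its parameters are illustrative stand-ins for the consumer's awsKinesisConsumerClosing handling.

```go
package kinesissketch

import (
	"context"
	"log"
	"time"
)

// checkpointer captures only the two calls the shutdown path needs; the
// real Bento checkpointer has more methods and different signatures.
type checkpointer interface {
	Delete(ctx context.Context, streamID, shardID string) error
	Checkpoint(ctx context.Context, streamID, shardID, sequence string, final bool) error
}

// finishShard sketches the shutdown decision: an empty shard iterator means
// Kinesis closed the shard and it was fully consumed, so the lease entry is
// deleted instead of checkpointed.
func finishShard(cp checkpointer, streamID, shardID, iter, sequence string) {
	// The consumer's own context is cancelled during shutdown, so a fresh
	// context guarantees the cleanup call can still complete.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if iter == "" {
		// Shard closed and drained: delete the entry so no orphaned
		// checkpoint lingers after the pod is gone.
		if err := cp.Delete(ctx, streamID, shardID); err != nil {
			log.Printf("failed to delete entry for finished shard %v: %v", shardID, err)
		}
		return
	}
	// Shard not fully drained: store a final checkpoint so another pod can
	// resume from where this one stopped.
	if err := cp.Checkpoint(ctx, streamID, shardID, sequence, true); err != nil {
		log.Printf("failed to store final checkpoint for shard %v: %v", shardID, err)
	}
}
```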
2. Add Periodic Cleanup (input_kinesis.go:700-746)

Added proactive cleanup in the rebalancing loop (runs every 30s by default) with two phases, sketched below:

Phase 1: Clean up entries with expired leases
Phase 2: Clean up orphaned entries (no ClientID)
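Again a hedged sketch rather than the real diff: leaseStore, claim, and cleanupClosedShards are hypothetical names, and the lease record is reduced to the two fields the sweep needs.

```go
package kinesissketch

import (
	"context"
	"log"
	"time"
)

// claim is a simplified lease record; the real table also stores the
// client ID and sequence number.
type claim struct {
	ShardID      string
	LeaseTimeout time.Time
}

type leaseStore interface {
	AllClaims(ctx context.Context, streamID string) ([]claim, error)
	OrphanedShards(ctx context.Context, streamID string) ([]string, error)
	Delete(ctx context.Context, streamID, shardID string) error
}

// cleanupClosedShards sketches the two-phase sweep run from the rebalancing
// loop. closed is the set of shard IDs Kinesis reports as closed.
func cleanupClosedShards(ctx context.Context, store leaseStore, streamID string, closed map[string]bool) {
	// Phase 1: closed shards whose lease has expired. A still-live lease
	// means another consumer may be draining the shard's final records,
	// so those entries are left alone.
	claims, err := store.AllClaims(ctx, streamID)
	if err != nil {
		log.Printf("failed to list claims: %v", err)
		return
	}
	for _, c := range claims {
		if !closed[c.ShardID] || time.Now().Before(c.LeaseTimeout) {
			continue
		}
		if err := store.Delete(ctx, streamID, c.ShardID); err != nil {
			log.Printf("failed to delete expired lease for closed shard %v: %v", c.ShardID, err)
		}
	}

	// Phase 2: orphaned entries with no ClientID, left behind by a final
	// checkpoint during an earlier shutdown. AllClaims never returns these.
	orphans, err := store.OrphanedShards(ctx, streamID)
	if err != nil {
		log.Printf("failed to list orphaned entries: %v", err)
		return
	}
	for _, shardID := range orphans {
		if !closed[shardID] {
			continue
		}
		if err := store.Delete(ctx, streamID, shardID); err != nil {
			log.Printf("failed to delete orphaned entry for shard %v: %v", shardID, err)
		}
	}
}
```

The lease-expiry guard in phase 1 is what keeps the sweep from racing an active consumer that is still draining a closed shard's final records.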
3. New OrphanedShards Method (input_kinesis_checkpointer.go:183-208)

Added a method to query DynamoDB entries without a ClientID.

Why needed: The existing AllClaims() function skips entries without ClientID, so a separate query is required to find truly orphaned entries created by Checkpoint(final=true).
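The PR text confirms the method is built on a DynamoDB filter expression. Below is a self-contained sketch using the AWS SDK for Go v2, assuming the checkpointer's documented table layout (StreamID partition key, ShardID sort key, ClientID as an optional attribute); the function name and parameters are illustrative, and pagination is elided.

```go
package kinesissketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// orphanedShards returns shard IDs for lease entries of a stream that were
// never claimed by a client, i.e. have no ClientID attribute at all.
func orphanedShards(ctx context.Context, svc *dynamodb.Client, table, streamID string) ([]string, error) {
	res, err := svc.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(table),
		KeyConditionExpression: aws.String("StreamID = :stream"),
		// Orphaned entries are exactly those missing a ClientID.
		FilterExpression: aws.String("attribute_not_exists(ClientID)"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":stream": &types.AttributeValueMemberS{Value: streamID},
		},
	})
	if err != nil {
		return nil, err
	}
	shards := make([]string, 0, len(res.Items))
	for _, item := range res.Items {
		if s, ok := item["ShardID"].(*types.AttributeValueMemberS); ok {
			shards = append(shards, s.Value)
		}
	}
	return shards, nil
}
```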
Safety

- k.checkpointer.Delete() is idempotent (safe to call multiple times)
- DeleteItem succeeds even if the item doesn't exist
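To make the idempotency claims concrete: DeleteItem without a condition expression is a no-op on a missing item, so calling Delete from both the shutdown path and the periodic sweep is harmless. A sketch with the same illustrative names as above:

```go
package kinesissketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// deleteEntry succeeds whether or not the entry still exists, because
// DeleteItem without a condition expression ignores missing items.
func deleteEntry(ctx context.Context, svc *dynamodb.Client, table, streamID, shardID string) error {
	_, err := svc.DeleteItem(ctx, &dynamodb.DeleteItemInput{
		TableName: aws.String(table),
		Key: map[string]types.AttributeValue{
			"StreamID": &types.AttributeValueMemberS{Value: streamID},
			"ShardID":  &types.AttributeValueMemberS{Value: shardID},
		},
	})
	return err
}
```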
Testing

Impact
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>