Skip to content

Conversation

@cbb330
Copy link
Collaborator

@cbb330 cbb330 commented Jan 2, 2026

Summary

  • Fix stale snapshot detection during concurrent modifications to return HTTP 409 (Conflict) instead of HTTP 400 (Bad Request)
  • Reclassify ValidationException with stale snapshot message to CommitFailedException (409) to allow client retry
  • Ensure other ValidationException instances are handled as HTTP 400 Bad Request responses (e.g., attempting to delete a non-existent snapshot)

Problem

When concurrent modifications occur during a transaction commit:

  1. Client builds snapshots based on table version N (e.g., lastSequenceNumber = 4)
  2. Client sends commit request with these snapshots in SNAPSHOTS_JSON_KEY
  3. Meanwhile, another process commits version N+1 (e.g., lastSequenceNumber = 5)
  4. Server calls doRefresh() which updates current() to version N+1
  5. Bug: The snapshots in SNAPSHOTS_JSON_KEY are now stale (their sequence numbers are based on version N)
  6. Iceberg's TableMetadata.addSnapshot() throws ValidationException → mapped to 400 Bad Request
  7. Should return 409 Conflict so clients know to refresh and retry

Solution

Let Iceberg's existing validation detect sequence number conflicts, then catch the ValidationException and reclassify it as CommitFailedException for the specific stale snapshot error pattern:

} catch (ValidationException e) {
  // Stale snapshot errors are retryable - client should refresh and retry
  if (isStaleSnapshotError(e)) {
    throw new CommitFailedException(e);
  }
  throw new BadRequestException(e, e.getMessage());
}

This approach is simpler than pre-checking and leverages Iceberg's existing validation.

Test Plan

  • Unit test testStaleSnapshotErrorDetection() verifies error detection logic
  • All existing internalcatalog tests pass
  • Integration testing in staging environment

cbb330 and others added 6 commits January 1, 2026 23:08
When concurrent modifications occur during a transaction commit, the
metadata is refreshed but the SNAPSHOTS_JSON_KEY property still contains
snapshots with stale sequence numbers. Previously, this caused Iceberg's
internal validation to throw a ValidationException (mapped to 400 Bad
Request), when it should return 409 Conflict.

This fix:
1. Uses current() to get the refreshed metadata's sequence number instead
   of metadataToCommit which is derived from stale transaction metadata
2. Includes snapshots from current() in existingSnapshotIds to detect
   duplicates added by concurrent processes
3. Throws CommitFailedException (409) before Iceberg's validation can
   throw ValidationException (400)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a test that verifies stale snapshot detection during concurrent
modifications returns HTTP 409 Conflict (CommitFailedException) instead
of HTTP 400 Bad Request (BadRequestException).

This is the TDD test for the stale snapshot fix - the test verifies that
when a client tries to add a snapshot with sequenceNumber <= lastSequenceNumber,
the server returns 409 (retry needed) not 400 (invalid request).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use current() as the primary source for existing snapshot IDs since it is
refreshed by commit() before doCommit() runs. Fallback to metadataToCommit
only for tests that call doCommit() directly, bypassing the normal flow.
Rather than pre-checking for stale snapshots before Iceberg validation,
let Iceberg's TableMetadata.Builder.addSnapshot() detect the sequence
number conflict and throw ValidationException. Then reclassify that
specific error as CommitFailedException (409) instead of BadRequestException
(400) to allow client retry.

This approach is simpler and leverages Iceberg's existing validation:
- Remove manual stale snapshot check before adding snapshots
- Catch ValidationException and check if it's a stale snapshot error
- Reclassify stale snapshot errors to CommitFailedException (409)
- Use current() as the authoritative source for existing snapshot IDs
- Simplify test to verify error detection logic
Replace simplified error detection test with full integration test that
exercises the actual doCommit path. Key changes:

- Add parent-snapshot-id to stale_snapshot.json (required for Iceberg 1.5+
  validation to trigger - snapshots without parents bypass sequence check)
- Test verifies both direct Iceberg validation and OpenHouse doCommit path
- Confirms ValidationException (400) is reclassified to CommitFailedException (409)
@sumedhsakdeo sumedhsakdeo requested a review from teamurko January 5, 2026 03:14
Copy link
Collaborator

@sumedhsakdeo sumedhsakdeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @cbb330

Copy link
Collaborator

@teamurko teamurko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @cbb330! Looks good, asking for a test

@teamurko teamurko self-requested a review January 5, 2026 18:51
@teamurko teamurko merged commit 93886d8 into linkedin:main Jan 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants