Skip to content

iceberg: some retry fixes#4117

Merged
rockwotj merged 6 commits intomainfrom
iceberg
Mar 18, 2026
Merged

iceberg: some retry fixes#4117
rockwotj merged 6 commits intomainfrom
iceberg

Conversation

@rockwotj
Copy link
Contributor

@rockwotj rockwotj commented Mar 17, 2026

There are a few issues where iceberg go does not refresh authentication, so we do that when we interact with the library by agressively retrying.

  • chore: update iceberg library
  • iceberg: refresh on auth errors
  • iceberg: retry failed table operations

Upstream fixes:

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Iceberg integration to better recover from auth/metadata issues by refreshing the REST catalog client on auth-related failures, adding retry behavior around table operations, and upgrading to apache/iceberg-go v0.5.0.

Changes:

  • Upgrade github.com/apache/iceberg-go to v0.5.0 (and related dependency bumps).
  • Add catalog refresh + retry-on-auth-error behavior in catalogx.Client methods, with new unit tests.
  • Add writer/committer reload paths and retry behavior for write/commit operations.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/impl/iceberg/router.go Adjusts write retry behavior and wires a table reload callback into the committer.
internal/impl/iceberg/committer.go Adds table reload on commit retry/failure so future commits use fresh metadata/auth.
internal/impl/iceberg/catalogx/catalog.go Implements auth-error-triggered catalog refresh and a refresh coordinator.
internal/impl/iceberg/catalogx/catalog_test.go Adds tests validating retry-on-auth behavior and concurrent refresh behavior.
go.mod Bumps iceberg-go and several dependencies (including x/sync for semaphore).
go.sum Updates dependency checksums for bumped modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@claude
Copy link

claude bot commented Mar 17, 2026

Commits
LGTM

Review
The auth retry and table reload logic in the catalog client, committer, and router looks sound. The semaphore-based refresh coalescing in refreshCatalog correctly serializes concurrent refreshes, and the router's writeWithRetry now properly closes stale writers before retrying.

  1. Bug: incorrect test expectation for rest.ErrUnauthorizedcatalog_test.go#L41: isAuthErr explicitly includes rest.ErrUnauthorized in its check (catalog.go#L120), but the test expects false. This test case will fail.

This can happen if auth expires, etc, and forcing a refresh and retry is
a good idea (the snowflake connector does this too).
@claude
Copy link

claude bot commented Mar 17, 2026

Commits
LGTM

Review
The PR adds auth-error retry logic to the iceberg catalog client, table reload on commit failures, and a generic single-retry for transient write errors. The implementation is sound overall.

  1. Error wrapping format (catalog.go#L194-L195): The %w verb wraps the original auth error while the message prefix describes a refresh failure, making the error chain misleading. Same pattern at L205, L217, L225. Should use gerund prefix and wrap the refresh error per project error handling patterns.
  2. context.Background() in tests (catalog_test.go#L72-L74): All test functions use context.Background() instead of t.Context() per project test patterns.

url string
opts []rest.Option
namespace []string
mu *semaphore.Weighted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be renamed to avoid confusion, also note why we have a Semaphore would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled it out into a syncx package and added comments to explain the rationale. PTAL

Comment on lines +215 to +219
if isAuthErr(err) {
if refreshErr := c.refreshCatalog(ctx); refreshErr != nil {
return nil, fmt.Errorf("refreshing catalog during appending data files %w: %v", err, refreshErr)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce code duplication we could have refreshCatalogOnAuthErr() that would handle this uniformly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - it's not always that simple because we retry somethings


func (c *Client) refreshCatalog(ctx context.Context) error {
if !c.mu.TryAcquire(1) {
// In this case someone else is trying to refresh the catalog,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand what is the benefit compared to RW mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the context can be cancelled and that is respected. When doing IO and the lock can be held for a while it's still nice to support context cancellation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool, thanks

@rockwotj rockwotj requested a review from mmatczuk March 18, 2026 16:16
It uses mainline of iceberg-go: apache/iceberg-go@6842055
With the following PRs applied on top:
- apache/iceberg-go#795
- apache/iceberg-go#803
@rockwotj rockwotj merged commit 5420cad into main Mar 18, 2026
6 of 7 checks passed
@rockwotj rockwotj deleted the iceberg branch March 18, 2026 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants