
fix(rust): refetch presigned URL on HTTP 401/403/404 from cloud storage#249

Open
eric-wang-1990 wants to merge 15 commits into main from feat/peco-2918-cloudfetch-403-404-refetch

Conversation

@eric-wang-1990
Collaborator

Summary

Fixes a bug in StreamingCloudFetchProvider::download_chunk_with_retry() where HTTP 401/403/404 responses from cloud storage would cause the driver to retry indefinitely with the same invalid presigned URL.

Root cause: The function only called refetch_link() when a URL was expired by timestamp check (link.is_expired()). When cloud storage returned actual 401/403/404 errors (URL revoked or expired before timestamp), the cached link was never cleared, so retries kept using the same bad URL.

Fix: On HTTP 401/403/404 in the error handler, set entry.link = None before sleeping. The existing logic at the top of the retry loop already handles None links by calling refetch_link(), so no further changes are needed.
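The clear-then-refetch flow can be sketched with a synchronous mock. The real `download_chunk_with_retry()` is async and stores chunk state in a DashMap; the signature, closures, and mock URLs below are illustrative only:

```rust
/// Mirrors the helper described above: 401/403/404 mean the presigned URL
/// itself is bad, not that the download transiently failed.
fn is_cloud_storage_auth_error(status: u16) -> bool {
    matches!(status, 401 | 403 | 404)
}

/// Hypothetical, synchronous sketch of the retry-loop shape: a None link at
/// the top of the loop triggers a refetch; an auth error clears the link.
fn download_chunk_with_retry(
    mut link: Option<String>,
    max_retries: u32,
    mut download: impl FnMut(&str) -> Result<Vec<u8>, u16>,
    mut refetch_link: impl FnMut() -> String,
) -> Option<Vec<u8>> {
    for _attempt in 0..max_retries {
        // Top of the loop: a missing (or expired) link is refetched first.
        let url = match link.take() {
            Some(url) => url,
            None => refetch_link(),
        };
        match download(&url) {
            Ok(bytes) => return Some(bytes),
            Err(status) if is_cloud_storage_auth_error(status) => {
                link = None; // the fix: force a refetch on the next iteration
            }
            Err(_) => link = Some(url), // transient error: retry the same URL
        }
    }
    None
}
```

With a stale URL that keeps returning 403, the loop now obtains a fresh URL on the next iteration; without the `link = None` line it would retry the same stale URL until `max_retries` is exhausted.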

Changes:

  • rust/src/reader/cloudfetch/streaming_provider.rs: Add is_cloud_storage_auth_error() helper and clear cached link on 401/403/404
  • rust/spec/cloudfetch-implementation-plan.md: Document the 401/403/404 refetch behavior

Fixes: PECO-2918

Test plan

  • cargo test passes (117 tests, +4 new unit tests for is_cloud_storage_auth_error)
  • cargo clippy -- -D warnings clean
  • cargo fmt applied
  • New unit tests cover HTTP 401, 403, 404 detection and non-matching statuses (500, 429, network errors)

🤖 Generated with Claude Code

eric-wang-1990 and others added 8 commits February 22, 2026 11:15
- Rename adbc-pr-test → adbc-csharp-pr-test to match test repo
- Add adbc-rust-pr-test dispatch when rust/ files change
- For PRs: detect changed paths via GitHub API, dispatch only relevant tests
- For merge queue: output separate csharp-changed/rust-changed flags,
  dispatch only relevant tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ange

- Workflow file changes now trigger both C# and Rust tests
- Changes outside csharp/, rust/, .github/workflows/ default to all targets
- Remove auto-approve-no-relevant-changes (can never fire with always-true logic)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l tests

- Changes outside csharp/, rust/, .github/workflows/ now auto-approve
- Only workflow changes trigger all targets; otherwise per-folder targeting
- Restore auto-approve-no-relevant-changes for merge queue

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to sub-workflows

- Create a single pending "Integration Tests" check before dispatching
- Pass check_run_id to both C# and Rust sub-workflow payloads
- Auto-pass immediately if no relevant driver files changed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rate checks

- Remove umbrella check approach; keep each driver's check independent
- Pass checks for non-dispatched drivers instead of creating an umbrella
- check-previous-run now validates both checks passed before auto-approving
- auto-approve jobs create both checks for no-changes and already-passed cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During CloudFetch chunk downloads, cloud storage can return HTTP 401/403/404
when a presigned URL is revoked or expires before its timestamp. Previously,
download_chunk_with_retry() would retry the same invalid URL indefinitely.

Now, when a 401/403/404 is detected, the cached link is cleared (entry.link = None)
so the next retry iteration calls refetch_link() to obtain a fresh presigned URL
from the Databricks SEA API.

Fixes: PECO-2918

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eric-wang-1990 eric-wang-1990 added the integration-test Trigger integration tests in internal repo label Feb 25, 2026
@github-actions

🚀 Integration tests triggered! View workflow run

The `entry` DashMap read guard was declared before the `match stored_link`
block, keeping its shard read lock alive across both `refetch_link().await`
and `chunks.get_mut()`. When `stored_link` is None/expired (i.e. after the
401/403/404 handler clears `entry.link`), the task would:

  1. Suspend at `refetch_link().await` while still holding the shard read lock
  2. Resume and call `chunks.get_mut()` on the same shard
  3. Block forever waiting for the exclusive write lock — which it can never
     acquire because it holds the read lock on the same shard

Fix: extract `stored_link` in a nested block so the read guard (`entry`) is
dropped before the match, `refetch_link().await`, and `get_mut()`. No logic
change — just scope narrowing to avoid the self-deadlock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
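The scope-narrowing fix can be illustrated with a plain `std::sync::RwLock<HashMap>` standing in for DashMap's sharded locking. The names and types here are hypothetical and the real code is async; the point is only where the read guard is dropped:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical chunk entry; the real provider stores more state per chunk.
struct ChunkEntry {
    link: Option<String>,
}

/// Stand-in for the async SEA refetch call.
fn refetch_link() -> String {
    "fresh-presigned-url".to_string()
}

fn resolve_link(chunks: &RwLock<HashMap<u64, ChunkEntry>>, chunk_index: u64) -> String {
    // Nested block: the read guard lives only inside these braces, so it is
    // dropped BEFORE refetch_link() and before the write lock below.
    // Holding it across both is exactly the self-deadlock described above.
    let stored_link = {
        let map = chunks.read().unwrap();
        map.get(&chunk_index).and_then(|e| e.link.clone())
    }; // read guard dropped here

    match stored_link {
        Some(url) => url,
        None => {
            let url = refetch_link(); // would be `.await` in the real code
            // Safe to take the exclusive lock now: no read guard is held.
            if let Some(entry) = chunks.write().unwrap().get_mut(&chunk_index) {
                entry.link = Some(url.clone());
            }
            url
        }
    }
}
```

The same discipline applies to DashMap: its `Ref`/`RefMut` guards hold a shard lock, so calling `get_mut()` (or suspending at an `.await`) while a guard on the same shard is alive can deadlock.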
@github-actions github-actions bot removed the integration-test Trigger integration tests in internal repo label Feb 26, 2026
@github-actions

🔒 Integration test approval reset

New commits were pushed to this PR. The integration-test label has been automatically removed for security.

A maintainer must re-review the changes and re-add the label to trigger tests again.

Why is this necessary?
  • New code requires fresh security review
  • Prevents approved PRs from adding malicious code later
  • Ensures all tested code has been explicitly approved

Latest commit: 9205781

If Databricks repeatedly returns an already-expired presigned URL
(clock skew, server bug), the is_expired() refetch path would loop
indefinitely. Add a refresh_attempts counter capped at max_retries,
mirroring C# CloudFetchDownloader's maxUrlRefreshAttempts bound.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
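The bound described above can be sketched as a simple counter. The function name, closures, and error string are hypothetical; the real counter lives inside the retry loop:

```rust
/// Hypothetical sketch of the refresh cap: if refetched URLs keep coming back
/// already expired (clock skew, server bug), give up after a bounded number
/// of refresh attempts instead of looping forever.
fn fetch_fresh_url(
    max_refresh_attempts: u32,
    mut is_expired: impl FnMut(&str) -> bool,
    mut refetch: impl FnMut() -> String,
) -> Result<String, String> {
    let mut refresh_attempts = 0;
    loop {
        let url = refetch();
        if !is_expired(&url) {
            return Ok(url);
        }
        refresh_attempts += 1;
        if refresh_attempts >= max_refresh_attempts {
            return Err(format!(
                "presigned URL still expired after {refresh_attempts} refresh attempts"
            ));
        }
    }
}
```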

Copilot AI left a comment


Pull request overview


This PR fixes a bug in the CloudFetch download retry logic where HTTP 401/403/404 responses from cloud storage would cause indefinite retries with the same invalid presigned URL. The fix adds explicit detection of auth errors and triggers immediate URL refetch from Databricks.

Changes:

  • Added is_cloud_storage_auth_error() helper to detect HTTP 401/403/404 status codes in error messages
  • Refactored download_chunk_with_retry() to check for auth errors and refetch URLs immediately without sleeping
  • Updated documentation to describe the 401/403/404 refetch behavior
  • Added unit tests for the error detection helper

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • rust/src/reader/cloudfetch/streaming_provider.rs: Refactored retry logic to detect and handle auth errors (401/403/404) by refetching presigned URLs; added helper function and unit tests
  • rust/spec/cloudfetch-implementation-plan.md: Documented the new behavior of refetching URLs on 401/403/404 errors



Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



if let Some(mut entry) = chunks.get_mut(&chunk_index) {
entry.state = ChunkState::DownloadRetry;

// HTTP 403/404 from cloud storage means the presigned URL was

Copilot AI Feb 26, 2026


The comment mentions "HTTP 403/404" but the actual implementation also handles HTTP 401 (line 493). Update the comment to include 401 for accuracy: "HTTP 401/403/404 from cloud storage means..."

Suggested change
// HTTP 403/404 from cloud storage means the presigned URL was
// HTTP 401/403/404 from cloud storage means the presigned URL was

Collaborator Author


✅ Fixed — updated the comment to say "HTTP 401/403/404" to match the actual implementation.

/// so we should refetch a fresh URL from Databricks rather than retrying
/// the same invalid URL.
fn is_cloud_storage_auth_error(err_str: &str) -> bool {
err_str.contains("HTTP 401") || err_str.contains("HTTP 403") || err_str.contains("HTTP 404")
Collaborator


@copilot is this the best to match http errors? i.e. is string matching required, is the code not available directly?

Collaborator


@eric-wang-1990 -> this change might be better: #254 (you can merge into this branch)

Collaborator Author


This is merged


Copilot AI commented Feb 26, 2026

@vikrantpuppala I've opened a new pull request, #254, to work on those changes. Once the pull request is ready, I'll request review from you.

…ing for presigned URL error detection (#254)

## What's Changed

`is_cloud_storage_auth_error` was detecting HTTP 401/403/404 via string
matching on the error message (`err_str.contains("HTTP 401")`). The
`driverbase::error::Error` type has a `vendor_code` field purpose-built
for structured codes — use it instead.

- **`client/http.rs`**: Attach `.vendor_code(status.as_u16() as i32)` to
non-retryable HTTP errors so the status code is available structurally
on the error
- **`streaming_provider.rs`**: Replace
`is_cloud_storage_auth_error(err_str: &str)` with
`is_cloud_storage_auth_error(e: &Error)`, using
`matches!(e.get_vendor_code(), 401 | 403 | 404)` — no string parsing

```rust
// Before
fn is_cloud_storage_auth_error(err_str: &str) -> bool {
    err_str.contains("HTTP 401") || err_str.contains("HTTP 403") || err_str.contains("HTTP 404")
}

// After
fn is_cloud_storage_auth_error(e: &Error) -> bool {
    matches!(e.get_vendor_code(), 401 | 403 | 404)
}
```


Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vikrantpuppala <21273801+vikrantpuppala@users.noreply.github.com>
"Chunk {} got auth/not-found error from cloud storage, clearing cached link for refetch",
chunk_index
);
entry.link = None;
Collaborator Author


Does this work? Only entry.link is set to None; does the current loop still sleep and then retry with the empty link, repeating until the attempt limit is met?

Collaborator Author


Yes, this works correctly. When entry.link = None is set, the loop sleeps (via tokio::time::sleep), then on the next iteration:

  1. stored_link is read as None
  2. The match stored_link hits the _ arm, which handles both None and expired links
  3. refetch_link() is called to get a fresh URL from the Databricks API
  4. The fresh URL is stored back in entry.link and used for the next download attempt

So we don't retry with an empty link — we refetch a valid one first. The sleep before the refetch is intentional (avoids hammering the API), and each auth error does count against attempts (also intentional — if Databricks keeps returning invalid URLs, we should eventually give up).

Collaborator Author


Good catch — updated in 9f6a9ac. When we get a 401/403/404 we now continue immediately instead of sleeping, so refetch_link() is called on the next iteration without the backoff delay. Calling refetch_link() directly inside the get_mut block would hold the DashMap write lock across an .await, risking the same deadlock described in the comment above — so the continue approach keeps things clean.

eric-wang-1990 and others added 3 commits February 27, 2026 19:52
…Fetch retry

When cloud storage returns an auth error, we know the fix is a fresh URL
from Databricks — sleeping first just adds unnecessary latency. Use
`continue` to jump straight to the next iteration's refetch_link() call.

Also fix comment text: HTTP 403/404 -> HTTP 401/403/404.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add max_refresh_retries (default 3) as a distinct config field, bounded
  independently from max_retries (mirrors C# maxUrlRefreshAttempts)
- Reset attempts=0 after a successful refetch so each fresh URL gets its
  own full download retry budget
- Make retry sleep cancellation-aware via tokio::select! so cancellation
  is not delayed by the full backoff window
- Document refetch_link(chunk_index, 0): row_offset=0 is intentional for
  the SEA (chunk-index based) backend; a Thrift backend would use it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…haviour

Instead of the two-step dance (clear entry.link, continue, refetch on next
iteration), drop the DashMap write lock before the async call and refetch
directly in the same iteration — no sleep, no extra loop turn.

The lock guard is scoped so it drops before .await; a second brief get_mut
writes the fresh link back. This avoids the deadlock risk of holding a shard
write lock across an await point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>