fix(rust): refetch presigned URL on HTTP 401/403/404 from cloud storage (#249)
eric-wang-1990 wants to merge 15 commits into main from
Conversation
- Rename adbc-pr-test → adbc-csharp-pr-test to match test repo
- Add adbc-rust-pr-test dispatch when rust/ files change
- For PRs: detect changed paths via GitHub API, dispatch only relevant tests
- For merge queue: output separate csharp-changed/rust-changed flags, dispatch only relevant tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ange
- Workflow file changes now trigger both C# and Rust tests
- Changes outside csharp/, rust/, .github/workflows/ default to all targets
- Remove auto-approve-no-relevant-changes (can never fire with always-true logic)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…l tests
- Changes outside csharp/, rust/, .github/workflows/ now auto-approve
- Only workflow changes trigger all targets; otherwise per-folder targeting
- Restore auto-approve-no-relevant-changes for merge queue

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… to sub-workflows
- Create a single pending "Integration Tests" check before dispatching
- Pass check_run_id to both C# and Rust sub-workflow payloads
- Auto-pass immediately if no relevant driver files changed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rate checks
- Remove umbrella check approach; keep each driver's check independent
- Pass checks for non-dispatched drivers instead of creating an umbrella
- check-previous-run now validates both checks passed before auto-approving
- auto-approve jobs create both checks for no-changes and already-passed cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o feat/trigger-rust-integration-tests
During CloudFetch chunk downloads, cloud storage can return HTTP 401/403/404 when a presigned URL is revoked or expires before its timestamp. Previously, `download_chunk_with_retry()` would retry the same invalid URL indefinitely. Now, when a 401/403/404 is detected, the cached link is cleared (`entry.link = None`) so the next retry iteration calls `refetch_link()` to obtain a fresh presigned URL from the Databricks SEA API.

Fixes: PECO-2918
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🚀 Integration tests triggered!
The `entry` DashMap read guard was declared before the `match stored_link`
block, keeping its shard read lock alive across both `refetch_link().await`
and `chunks.get_mut()`. When `stored_link` is None/expired (i.e. after the
401/403/404 handler clears `entry.link`), the task would:
1. Suspend at `refetch_link().await` while still holding the shard read lock
2. Resume and call `chunks.get_mut()` on the same shard
3. Block forever waiting for the exclusive write lock — which it can never
acquire because it holds the read lock on the same shard
Fix: extract `stored_link` in a nested block so the read guard (`entry`) is
dropped before the match, `refetch_link().await`, and `get_mut()`. No logic
change — just scope narrowing to avoid the self-deadlock.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🔒 Integration test approval reset: new commits were pushed to this PR. A maintainer must re-review the changes and re-add the label to trigger tests again.
Latest commit: 9205781
If Databricks repeatedly returns an already-expired presigned URL (clock skew, server bug), the is_expired() refetch path would loop indefinitely. Add a refresh_attempts counter capped at max_retries, mirroring C# CloudFetchDownloader's maxUrlRefreshAttempts bound. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR fixes a bug in the CloudFetch download retry logic where HTTP 401/403/404 responses from cloud storage would cause indefinite retries with the same invalid presigned URL. The fix adds explicit detection of auth errors and triggers immediate URL refetch from Databricks.
Changes:
- Added `is_cloud_storage_auth_error()` helper to detect HTTP 401/403/404 status codes in error messages
- Refactored `download_chunk_with_retry()` to check for auth errors and refetch URLs immediately without sleeping
- Updated documentation to describe the 401/403/404 refetch behavior
- Added unit tests for the error detection helper
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rust/src/reader/cloudfetch/streaming_provider.rs | Refactored retry logic to detect and handle auth errors (401/403/404) by refetching presigned URLs; added helper function and unit tests |
| rust/spec/cloudfetch-implementation-plan.md | Documented the new behavior of refetching URLs on 401/403/404 errors |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
```rust
if let Some(mut entry) = chunks.get_mut(&chunk_index) {
    entry.state = ChunkState::DownloadRetry;

    // HTTP 403/404 from cloud storage means the presigned URL was
```
The comment mentions "HTTP 403/404" but the actual implementation also handles HTTP 401 (line 493). Update the comment to include 401 for accuracy: "HTTP 401/403/404 from cloud storage means..."
```diff
- // HTTP 403/404 from cloud storage means the presigned URL was
+ // HTTP 401/403/404 from cloud storage means the presigned URL was
```
✅ Fixed — updated the comment to say "HTTP 401/403/404" to match the actual implementation.
```rust
/// so we should refetch a fresh URL from Databricks rather than retrying
/// the same invalid URL.
fn is_cloud_storage_auth_error(err_str: &str) -> bool {
    err_str.contains("HTTP 401") || err_str.contains("HTTP 403") || err_str.contains("HTTP 404")
}
```
@copilot is this the best way to match HTTP errors? i.e. is string matching required, or is the status code not available directly?
@eric-wang-1990 -> this change might be better: #254 (you can merge into this branch)
This is merged
@vikrantpuppala I've opened a new pull request, #254, to work on those changes. Once the pull request is ready, I'll request review from you.
…ing for presigned URL error detection (#254)

## What's Changed

`is_cloud_storage_auth_error` was detecting HTTP 401/403/404 via string matching on the error message (`err_str.contains("HTTP 401")`). The `driverbase::error::Error` type has a `vendor_code` field purpose-built for structured codes — use it instead.

- **`client/http.rs`**: Attach `.vendor_code(status.as_u16() as i32)` to non-retryable HTTP errors so the status code is available structurally on the error
- **`streaming_provider.rs`**: Replace `is_cloud_storage_auth_error(err_str: &str)` with `is_cloud_storage_auth_error(e: &Error)`, using `matches!(e.get_vendor_code(), 401 | 403 | 404)` — no string parsing

```rust
// Before
fn is_cloud_storage_auth_error(err_str: &str) -> bool {
    err_str.contains("HTTP 401") || err_str.contains("HTTP 403") || err_str.contains("HTTP 404")
}

// After
fn is_cloud_storage_auth_error(e: &Error) -> bool {
    matches!(e.get_vendor_code(), 401 | 403 | 404)
}
```

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vikrantpuppala <21273801+vikrantpuppala@users.noreply.github.com>
```rust
    "Chunk {} got auth/not-found error from cloud storage, clearing cached link for refetch",
    chunk_index
);
entry.link = None;
```
Does this work? Only `entry.link` is set to `None`; does it still go through the current loop to delay and retry the empty link, repeating until the attempt limit is met?
Yes, this works correctly. When `entry.link = None` is set, the loop sleeps (via `tokio::time::sleep`), then on the next iteration:

1. `stored_link` is read as `None`
2. The `match stored_link` hits the `_` arm, which handles both `None` and expired links
3. `refetch_link()` is called to get a fresh URL from the Databricks API
4. The fresh URL is stored back in `entry.link` and used for the next download attempt
So we don't retry with an empty link — we refetch a valid one first. The sleep before the refetch is intentional (avoids hammering the API), and each auth error does count against attempts (also intentional — if Databricks keeps returning invalid URLs, we should eventually give up).
Good catch — updated in 9f6a9ac. When we get a 401/403/404 we now continue immediately instead of sleeping, so refetch_link() is called on the next iteration without the backoff delay. Calling refetch_link() directly inside the get_mut block would hold the DashMap write lock across an .await, risking the same deadlock described in the comment above — so the continue approach keeps things clean.
…Fetch retry
When cloud storage returns an auth error, we know the fix is a fresh URL from Databricks — sleeping first just adds unnecessary latency. Use `continue` to jump straight to the next iteration's refetch_link() call. Also fix comment text: HTTP 403/404 -> HTTP 401/403/404.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `max_refresh_retries` (default 3) as a distinct config field, bounded independently from `max_retries` (mirrors C# `maxUrlRefreshAttempts`)
- Reset `attempts = 0` after a successful refetch so each fresh URL gets its own full download retry budget
- Make retry sleep cancellation-aware via `tokio::select!` so cancellation is not delayed by the full backoff window
- Document `refetch_link(chunk_index, 0)`: `row_offset = 0` is intentional for the SEA (chunk-index based) backend; a Thrift backend would use it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…haviour
Instead of the two-step dance (clear entry.link, continue, refetch on next iteration), drop the DashMap write lock before the async call and refetch directly in the same iteration — no sleep, no extra loop turn. The lock guard is scoped so it drops before .await; a second brief get_mut writes the fresh link back. This avoids the deadlock risk of holding a shard write lock across an await point.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes a bug in `StreamingCloudFetchProvider::download_chunk_with_retry()` where HTTP 401/403/404 responses from cloud storage would cause the driver to retry indefinitely with the same invalid presigned URL.

Root cause: The function only called `refetch_link()` when a URL was expired by timestamp check (`link.is_expired()`). When cloud storage returned actual 401/403/404 errors (URL revoked or expired before its timestamp), the cached link was never cleared, so retries kept using the same bad URL.

Fix: On HTTP 401/403/404 in the error handler, set `entry.link = None` before sleeping. The existing logic at the top of the retry loop already handles `None` links by calling `refetch_link()`, so no further changes are needed.

Changes:
- `rust/src/reader/cloudfetch/streaming_provider.rs`: Add `is_cloud_storage_auth_error()` helper and clear cached link on 401/403/404
- `rust/spec/cloudfetch-implementation-plan.md`: Document the 401/403/404 refetch behavior

Fixes: PECO-2918
Test plan
- `cargo test` passes (117 tests, +4 new unit tests for `is_cloud_storage_auth_error`)
- `cargo clippy -- -D warnings` clean
- `cargo fmt` applied

🤖 Generated with Claude Code