
Conversation

@Avinash-1394 Avinash-1394 (Contributor) commented Aug 18, 2025

resolves #1270

docs dbt-labs/docs.getdbt.com/#

Problem

dbt-athena currently does not implement retry logic for S3 deletion operations, causing dbt runs to fail immediately when encountering transient S3 issues. This makes the adapter fragile in production environments where temporary S3 service disruptions, network connectivity issues, rate limiting, or eventual consistency problems are common.

Current problematic behavior:

  • S3 deletion failures immediately terminate dbt operations
  • No distinction between retryable errors (5xx, throttling) and permanent errors (403, 404)
  • Lack of resilience against common AWS S3 transient issues
  • Poor user experience with cryptic error messages on temporary failures

Solution

This PR implements a robust retry mechanism for S3 deletion operations with the following features (a sketch of the approach appears after the list):

Implementation details:

  • Retry logic: Adds retry functionality with attempt limits (default: 3 retries)
  • Exponential backoff with jitter: Implements exponential backoff (base delay: 1s, max delay: 30s) with random jitter to prevent thundering herd issues
  • Selective error handling: Only retries on transient errors (5xx HTTP codes, throttling, network timeouts) while immediately failing on permanent errors
  • Enhanced logging: Adds structured logging for retry attempts with appropriate log levels for debugging and monitoring
  • Clear error messages: Provides actionable runtime errors after exhausting retry attempts, including context about the failure and retry history
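
For illustration, a minimal sketch of this approach using tenacity and boto3 is shown below; names such as _is_transient_s3_error, TRANSIENT_CODES, and delete_prefix_from_s3 are illustrative, not the PR's actual identifiers:

    from botocore.exceptions import ClientError
    from tenacity import (
        retry,
        retry_if_exception,
        stop_after_attempt,
        wait_exponential_jitter,
    )

    # Throttling / 5xx error codes worth retrying (illustrative list)
    TRANSIENT_CODES = {"SlowDown", "Throttling", "ThrottlingException",
                       "RequestTimeout", "InternalError", "ServiceUnavailable"}

    def _is_transient_s3_error(exc: BaseException) -> bool:
        """Retry only on 5xx/throttling; fail fast on permanent errors like 403/404."""
        if not isinstance(exc, ClientError):
            return False
        error = exc.response.get("Error", {})
        status = exc.response.get("ResponseMetadata", {}).get("HTTPStatusCode", 0)
        return status >= 500 or error.get("Code") in TRANSIENT_CODES

    @retry(
        retry=retry_if_exception(_is_transient_s3_error),
        stop=stop_after_attempt(3),                       # attempt limit
        wait=wait_exponential_jitter(initial=1, max=30),  # 1s base, 30s cap, jittered
        reraise=True,
    )
    def delete_prefix_from_s3(s3_client, bucket: str, prefix: str) -> None:
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
            if keys:
                s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})

Centralizing the transient-vs-permanent decision in a single predicate keeps the retry policy explicit and makes the fail-fast behavior on 403/404 easy to verify in tests.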

Alternatives considered:

  • Using AWS SDK built-in retry logic (rejected: less control over dbt-specific retry behavior)
  • Simple linear backoff (rejected: less efficient than exponential backoff)
  • Unlimited retries (rejected: could cause indefinite hangs)

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@Avinash-1394 Avinash-1394 requested a review from a team as a code owner August 18, 2025 14:11
@cla-bot cla-bot bot commented Aug 18, 2025

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form linked above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @Avinash-1394

@cla-bot cla-bot bot added the cla:yes The PR author has signed the CLA label Aug 19, 2025

    @retry(
        stop=stop_after_attempt(4),  # up to 4x boto3 retries
        wait=wait_exponential(multiplier=30, min=30, max=300),  # 30s, 60s, 120s, 240s
Contributor
Are these supposed to be hard-coded? Shouldn't this use the s3_deletion_retry_* config?

Contributor Author

I checked the available connection configs here:
https://github.com/dbt-labs/dbt-adapters/blob/main/dbt-athena/src/dbt/adapters/athena/connections.py

I couldn't find one specific to S3 retries. Should I add one there, or are you referring to another config I have access to?

Contributor

In your PR description you mentioned that you were going to add those:

Configuration options added:

  • s3_deletion_retry_attempts: Number of retry attempts (default: 3)
  • s3_deletion_retry_base_delay: Base delay in seconds (default: 1)
  • s3_deletion_retry_max_delay: Maximum delay in seconds (default: 30)

are you adding those?

Contributor Author

Ah, sorry for including that. I tried to, but since the connection parameters were not available to the decorator, I couldn't find a way to make it work. I've removed that from the description 👍🏽

Contributor
I think if you nest _delete_with_app_retry within the delete_from_s3 function it should be accessible (that's the way it's done in AthenaCursor.execute).
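
For context, a hedged sketch of that nesting pattern (assuming tenacity; the s3_deletion_retry_* names come from the discussion above, while self.credentials and self._delete_objects are hypothetical placeholders):

    from tenacity import retry, stop_after_attempt, wait_exponential

    def delete_from_s3(self, s3_path: str) -> None:
        creds = self.credentials  # assumed accessor for the connection config

        @retry(
            stop=stop_after_attempt(creds.s3_deletion_retry_attempts),
            wait=wait_exponential(
                multiplier=creds.s3_deletion_retry_base_delay,
                max=creds.s3_deletion_retry_max_delay,
            ),
            reraise=True,
        )
        def _delete_with_app_retry() -> None:
            self._delete_objects(s3_path)  # hypothetical helper doing the actual deletion

        _delete_with_app_retry()

Because the decorated inner function is defined inside the method, it can close over connection parameters that a module-level decorator cannot see.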

Contributor Author
Hey, just wanted to provide an update: I'm working on this and will reach out over Slack once I have it working with tests.

