fix(aws_s3 source): add retry delay in sqs::Ingestor #22999
Conversation
This commit adds logic to wait 500ms between subsequent calls of run_once if it returns an error. This is required to prevent spamming the SQS API if it fails to receive messages, e.g. due to invalid IAM permissions or an SQS outage.
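The delay-on-error loop described above can be sketched as follows. This is a simplified stand-in, not the actual `sqs::Ingestor` code: `run_once` here is a dummy that fails twice and then succeeds, and a synchronous `std::thread::sleep` replaces the `tokio::time::sleep` used in the real async loop.

```rust
use std::thread::sleep;
use std::time::Duration;

// Stand-in for sqs::Ingestor::run_once: the real one polls SQS; this
// dummy fails twice, then succeeds, to exercise the retry path.
fn run_once(attempt: u32) -> Result<(), &'static str> {
    if attempt < 2 {
        Err("receive_message failed")
    } else {
        Ok(())
    }
}

// Loop with a fixed delay between failed calls, mirroring the PR's logic.
// Returns the number of retries needed before success.
fn run_with_retry_delay(delay: Duration) -> u32 {
    let mut attempt = 0;
    loop {
        match run_once(attempt) {
            Ok(()) => return attempt,
            Err(e) => {
                // Without this sleep, a persistent error (e.g. bad IAM
                // permissions or an SQS outage) would spin in a tight loop.
                eprintln!("run_once failed ({e}), retrying after {delay:?}");
                sleep(delay);
            }
        }
        attempt += 1;
    }
}

fn main() {
    let retries = run_with_retry_delay(Duration::from_millis(500));
    println!("succeeded after {retries} retries");
}
```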
Thanks @medzin, I understand the problem you are trying to fix here. Left a question.
```rust
Ok(()) => {}
Err(_) => {
    trace!("run_once failed, will retry after delay");
    tokio::time::sleep(delay).await;
}
```
I wonder how this constant was chosen. Would an exponential backoff strategy work better here, or is that overkill?
I was treating this code as a hotfix rather than a long-term solution, so I chose a simple constant delay to quickly address the issue. The current 0.5-second delay doesn't introduce significant latency but greatly reduces the number of retries triggered by the source, which helps stabilize things short-term. That said, I agree a constant delay may not be ideal across all use cases - different values might work better for different users. An exponential backoff could offer a more flexible and resilient solution without needing to make the delay configurable. Happy to explore that in a follow-up PR.
I can see how this improves things in your situation but we generally don't want to merge hotfixes. If you have urgent need for this commit, you can always maintain a fork with this commit and use a custom Vector build.
Going back to this PR, what do you think about introducing a new opt-in config parameter here? We can probably reuse `ExponentialBackoff` from `src/sinks/util/retries.rs`.
Example:
- 1st retry: 1 second
- 2nd retry: 2 seconds
- 3rd retry: 4 seconds
- 4th retry: 8 seconds
- 5th retry: 16 seconds
- 6th retry: 30 seconds (capped)
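The suggested schedule can be sketched as a simple iterator. This is a hand-rolled stand-in for illustration only; the actual `ExponentialBackoff` in `src/sinks/util/retries.rs` may have a different API.

```rust
use std::time::Duration;

// Doubling backoff capped at a maximum, matching the schedule above:
// 1s, 2s, 4s, 8s, 16s, then 30s for every retry after that.
struct CappedBackoff {
    next: Duration,
    max: Duration,
}

impl Iterator for CappedBackoff {
    type Item = Duration;

    fn next(&mut self) -> Option<Duration> {
        let current = self.next.min(self.max);
        // Double for the following retry, but never exceed the cap.
        self.next = (current * 2).min(self.max);
        Some(current)
    }
}

fn main() {
    let backoff = CappedBackoff {
        next: Duration::from_secs(1),
        max: Duration::from_secs(30),
    };
    let secs: Vec<u64> = backoff.take(7).map(|d| d.as_secs()).collect();
    println!("{secs:?}"); // [1, 2, 4, 8, 16, 30, 30]
}
```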
Sure. I will check it.
Couldn't commit to this branch due to perms, moved to: #23996
Change Type
Is this a breaking change?
How did you test this PR?
Does this PR include user facing changes?
Notes
- Mention @vectordotdev/vector to reach out to us regarding this PR.
- Some checks are run by the pre-push hook; please see this template.
- Checks to run locally:
  - `cargo fmt --all`
  - `cargo clippy --workspace --all-targets -- -D warnings`
  - `cargo nextest run --workspace` (alternatively, you can run `cargo test --all`)
  - `./scripts/check_changelog_fragments.sh`
- To sync with the latest changes, use `git merge origin master` and `git push`.
- If this PR changes Vector dependencies (`Cargo.lock`), please run `cargo vdev build licenses` to regenerate the license inventory and commit the changes (if any). More details here.

References