@AndreKurait
Contributor

@AndreKurait AndreKurait commented Jan 22, 2026

Fixes #15276

Motivation

When using exponential backoff with high retry counts, the calculation baseDuration * factor^retryNumber can overflow Go's int64 (which backs time.Duration), causing:

  • Negative backoff duration
  • The cap check if timeToWait > capDuration never triggers, because a negative value is always less than the positive cap
  • Immediate retries instead of respecting the cap

Overflow occurs at these retry counts (with 1s base duration):

Factor | First Overflow Retry
2      | 32
100    | 6

Modifications

  • Check if factor > MaxInt64/baseDuration before multiplication to detect potential overflow
  • Cap timeToWait at math.MaxInt64 when overflow would occur
  • The subsequent cap check then correctly applies the user-configured cap

Verification

Unit test: Added TestProcessNodeRetriesBackoffOverflow that simulates 32 retries with factor=2, duration=5s

Manual E2E testing: Deployed fixed controller to minikube and ran test workflow with duration=1s, factor=100, cap=30s:

Before fix - retry intervals at overflow point:

overflow-repro(5)   22:52:32Z   (+50s) ✓
overflow-repro(6)   22:52:42Z   (+10s) ← BUG! Should be +50s
overflow-repro(7)   22:53:32Z   (+50s) ✓
overflow-repro(8)   22:53:42Z   (+10s) ← BUG!

After fix - all intervals consistent:

overflow-repro(5)   23:23:29Z   (+50s)
overflow-repro(6)   23:24:19Z   (+50s) ← Fixed!
overflow-repro(7)   23:25:09Z   (+50s)
overflow-repro(8)   23:25:59Z   (+50s)

Controller logs confirm proper backoff message for all retries including overflow point:

time=23:24:29Z msg="node message changed" message="Backoff for 30 seconds" ← Retry 6

Documentation

No documentation changes needed - this is a bug fix for an edge case.


Summary by CodeRabbit

  • Bug Fixes

    • Fixed potential integer overflow in exponential backoff calculations during retry operations, ensuring system stability when processing retries with high retry counts and large backoff factor multiplications.
  • Tests

    • Added test coverage for retry backoff overflow edge cases to verify robust handling under extreme conditions.

@AndreKurait AndreKurait marked this pull request as draft January 22, 2026 23:42
@AndreKurait AndreKurait marked this pull request as ready for review January 22, 2026 23:42
For high retry counts with exponential backoff, the calculation
baseDuration * factor^retryNumber can overflow int64 (time.Duration).

Example: 5s base * 2^31 factor exceeds MaxInt64 nanoseconds.

This fix checks if the multiplication would overflow before performing
it, and caps at math.MaxInt64 (max time.Duration) if so.

Signed-off-by: Andre Kurait <[email protected]>
@AndreKurait AndreKurait force-pushed the fix/retry-backoff-overflow branch from 02910d2 to b6dc454 Compare January 22, 2026 23:43
@Joibel
Member

Joibel commented Jan 23, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Jan 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

The pull request adds integer overflow protection to the exponential backoff calculation in the retry logic. When computing backoff duration with a factor, the code now checks if the result would exceed the maximum int64 value and clamps it to prevent negative duration overflows that caused retries to skip backoff waits.

Changes

Cohort / File(s) | Summary
Backoff overflow protection (workflow/controller/operator.go) | Modified processNodeRetries to compute exponential backoff with overflow safeguard: checks if baseDuration * factor exceeds MaxInt64 and caps result accordingly, replacing direct multiplication
Overflow test coverage (workflow/controller/operator_test.go) | Added TestProcessNodeRetriesBackoffOverflow to verify processing with high retry counts does not panic and correctly applies backoff messages despite potential overflow scenarios
Dependency update (go.mod) | Updated module dependencies (+7/-1)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely describes the main change: preventing int64 overflow in retry backoff calculation, which directly matches the primary objective of the PR.
Description check | ✅ Passed | The PR description includes all required template sections: Fixes reference (#15276), Motivation explaining the overflow problem, Modifications detailing the solution, Verification with unit and e2e testing results, and Documentation noting none are needed.
Linked Issues check | ✅ Passed | The PR code changes directly address issue #15276 by detecting overflow before multiplication and capping timeToWait at MaxInt64, ensuring the configured cap is correctly applied and preventing negative durations that bypass backoff logic.
Out of Scope Changes check | ✅ Passed | All changes are narrowly scoped to fixing the int64 overflow issue in exponential backoff calculation with a corresponding unit test; no unrelated modifications detected.

@Joibel Joibel self-assigned this Jan 23, 2026
Member

@Joibel Joibel left a comment

Thanks for the fix

@Joibel Joibel added the cherry-pick/3.6, cherry-pick/3.7, and area/retryStrategy labels Jan 23, 2026
@Joibel Joibel merged commit b8cf9ed into argoproj:main Jan 23, 2026
77 of 78 checks passed
argo-cd-cherry-pick-bot bot pushed a commit that referenced this pull request Jan 23, 2026
argo-cd-cherry-pick-bot bot pushed a commit that referenced this pull request Jan 23, 2026
@argo-cd-cherry-pick-bot

🍒 Cherry-pick PR created for 3.6: #15290

@argo-cd-cherry-pick-bot

🍒 Cherry-pick PR created for 3.7: #15291

Joibel added a commit that referenced this pull request Jan 26, 2026
…15277 for 3.7) (#15291)

Signed-off-by: Andre Kurait <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Andre Kurait <[email protected]>
Co-authored-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this pull request Jan 26, 2026
…15277 for 3.6) (#15290)

Signed-off-by: Andre Kurait <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Andre Kurait <[email protected]>
Co-authored-by: Alan Clucas <[email protected]>
AndreKurait added a commit to AndreKurait/opensearch-migrations that referenced this pull request Jan 29, 2026
Update argo-workflows chart version from 0.45.24 to 0.47.1 and override
images.tag to v3.7.9 to incorporate the retry delay overflow fix from
argoproj/argo-workflows#15277

Signed-off-by: Andre Kurait <[email protected]>
jugal-chauhan pushed a commit to jugal-chauhan/opensearch-migrations that referenced this pull request Jan 30, 2026
Update argo-workflows chart version from 0.45.24 to 0.47.1 and override
images.tag to v3.7.9 to incorporate the retry delay overflow fix from
argoproj/argo-workflows#15277

Signed-off-by: Andre Kurait <[email protected]>
@Joibel Joibel added the cherry-pick/4.0 label Feb 4, 2026
argo-cd-cherry-pick-bot bot pushed a commit that referenced this pull request Feb 4, 2026
@argo-cd-cherry-pick-bot

🍒 Cherry-pick PR created for 4.0: #15508

Joibel pushed a commit that referenced this pull request Feb 5, 2026
Labels

  • area/retryStrategy: Template-level retryStrategy
  • cherry-pick/3.6: Cherry-pick this to release-3.6
  • cherry-pick/3.7: Cherry-pick this to release-3.7
  • cherry-pick/4.0: Cherry-pick this to release-4.0

Development

Successfully merging this pull request may close these issues.

bug: int64 overflow in retry backoff calculation for high retry counts

2 participants