Generalize retry code#380

Merged
ggreer merged 4 commits into main from ggreer/retry
May 12, 2025
Conversation

@ggreer
Contributor

@ggreer ggreer commented May 12, 2025

Syncing, granting, and revoking all have retry logic, and they're all pretty similar. This unifies them into a single retryer package.

Summary by CodeRabbit

  • New Features

    • Introduced a standardized retry mechanism with configurable backoff for improved reliability during operations that may encounter transient errors.
  • Refactor

    • Unified and simplified retry logic across grant, revoke, and sync operations, reducing code duplication and enhancing maintainability.
  • Tests

    • Added tests to validate the new retry behavior and ensure correct handling of retryable and non-retryable errors.

@coderabbitai

coderabbitai bot commented May 12, 2025

Walkthrough

The changes introduce a new retry package to centralize and standardize retry logic with configurable backoff. Existing custom retry mechanisms in connector builder and syncer implementations are refactored to use this new package, removing duplicated logic and internal state management. Associated tests validate the new retry behavior.

Changes

  • pkg/connectorbuilder/connectorbuilder.go: Refactored Grant and Revoke methods to use the new retry.Retryer for retry logic; removed the custom retry helper and related imports.
  • pkg/retry/retry.go: Introduced new retry package with Retryer, RetryConfig, and standardized retry/backoff logic for gRPC operations.
  • pkg/retry/retry_test.go: Added unit test TestBasicRetry to validate retry behavior and timing in the new retry package.
  • pkg/sync/syncer.go: Replaced internal retry logic and state with the new retry.Retryer; removed the custom retry method and related fields/imports.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Retryer
    participant Provisioner

    Caller->>Retryer: ShouldWaitAndRetry(ctx, err)
    alt Error is retryable and attempts remain
        Retryer->>Caller: Wait and return true
        loop While ShouldWaitAndRetry returns true
            Caller->>Provisioner: Grant/Revoke/Sync operation
            Provisioner-->>Caller: Response or Error
            Caller->>Retryer: ShouldWaitAndRetry(ctx, err)
        end
    else Error is not retryable or attempts exhausted
        Retryer->>Caller: Return false
    end

Suggested reviewers

  • jirwin
  • laurenleach

Poem

In the warren of code, retries did sprawl,
But now a new package brings order to all.
No more math or custom loops to maintain,
Just one Retryer to handle the pain.
With each hop and delay, we gracefully try,
Until success or carrots run dry! 🥕



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f268889 and 535ed56.

📒 Files selected for processing (1)
  • pkg/connectorbuilder/connectorbuilder.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/connectorbuilder/connectorbuilder.go
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: go-test (1.23.x, windows-latest)
✨ Finishing Touches
  • 📝 Generate Docstrings

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
pkg/retry/retry.go (2)

67-74: Linear back-off without jitter can create thundering-herd issues and overflows.

  1. A fleet of workers will all retry at the same deterministic interval, amplifying load.
  2. time.Duration(int64(r.attempts))*r.initialDelay can overflow before you reach math.MaxInt64 nanoseconds on 32-bit platforms.

Consider exponential back-off with ±15 % jitter and a saturating multiply to avoid overflow.

Also applies to: 99-101


105-112: Unnecessary for { select { … }} loop – single wait is enough.

The loop executes at most once because both branches return. Removing the loop simplifies reasoning and avoids future bugs if new cases are added.

-	for {
-		select {
-		case <-time.After(wait):
-			return true
-		case <-ctx.Done():
-			return false
-		}
-	}
+	select {
+	case <-time.After(wait):
+		return true
+	case <-ctx.Done():
+		return false
+	}
pkg/retry/retry_test.go (1)

38-44: Tests rely on real wall-clock sleeps → slow & flaky.

ShouldWaitAndRetry blocks for ~100 ms and the assertion assumes <300 ms.
On contended CI agents this can overshoot and fail spuriously.

Use a tiny InitialDelay (e.g. 1 ms) or inject a clock interface so the test can fast-forward time deterministically.

pkg/sync/syncer.go (1)

365-370: Retryer is hard-coded; make it configurable for flexibility and testing.

Creating a retryer with unlimited attempts & fixed delays inside Sync:

retryer := retry.NewRetryer(ctx, retry.RetryConfig{ … })

ties production and unit tests to these constants. Consider:

  • Accepting a RetryConfig or retry.Retryer via WithRetryConfig/WithRetryer option.
  • Falling back to the current defaults when the option is not provided.

This allows tighter retry budgets in tests and different back-off strategies for callers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3055c58 and f2e1816.

📒 Files selected for processing (4)
  • pkg/connectorbuilder/connectorbuilder.go (3 hunks)
  • pkg/retry/retry.go (1 hunks)
  • pkg/retry/retry_test.go (1 hunks)
  • pkg/sync/syncer.go (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pkg/sync/syncer.go (2)
pkg/retry/retry.go (2)
  • NewRetryer (31-45)
  • RetryConfig (25-29)
pkg/sync/state.go (3)
  • SyncResourcesOp (110-110)
  • SyncExternalResourcesOp (114-114)
  • SyncAssetsOp (115-115)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: go-test (1.23.x, windows-latest)
🔇 Additional comments (5)
pkg/connectorbuilder/connectorbuilder.go (5)

25-25: Clean import addition.

The code correctly adds the new retry package import needed for the refactored retry functionality.


918-943: Well-structured refactoring of the Grant method.

This refactoring effectively abstracts provisioner behavior for both v1 and v2 implementations using a common function type while centralizing retry logic. The creation of grantFunc to handle the differences between provisioner versions is especially clean.

The retry configuration parameters (3 max attempts, 30s initial delay, 120s max delay) appear reasonable for this use case.


948-960: Clean retry loop implementation.

The retry loop has been simplified to use the centralized retry mechanism. The code now:

  1. Attempts the operation
  2. Returns on success
  3. Uses the retryer's ShouldWaitAndRetry method to determine whether to retry
  4. Properly handles and reports errors

This is more maintainable than the previous custom retry implementation.


972-990: Good reuse of retry pattern in Revoke method.

This refactoring applies the same pattern used in the Grant method. The code sets up a consistent retryer configuration and abstracts over different provisioner versions, promoting code consistency between the two methods.


996-1008: Consistent error handling in Revoke method.

The retry implementation follows the same pattern as in the Grant method, ensuring consistency across the codebase. The error handling, logging, and metric recording are all properly maintained during the refactoring.

Comment on lines +75 to +96
// If error contains rate limit data, use that instead
if st, ok := status.FromError(err); ok {
details := st.Details()
for _, detail := range details {
if rlData, ok := detail.(*v2.RateLimitDescription); ok {
waitResetAt := time.Until(rlData.ResetAt.AsTime())
if waitResetAt <= 0 {
continue
}
duration := time.Duration(rlData.Limit)
if duration <= 0 {
continue
}
waitResetAt /= duration
// Round up to the nearest second to make sure we don't hit the rate limit again
waitResetAt = time.Duration(math.Ceil(waitResetAt.Seconds())) * time.Second
if waitResetAt > 0 {
wait = waitResetAt
break
}
}
}


⚠️ Potential issue

Rate-limit wait calculation appears incorrect.

waitResetAt /= duration divides a duration by the numeric limit, yielding an arbitrary nanosecond value.
Usually you want time.Until(reset) + smallBuffer or time.Until(reset)/remaining, not “duration ÷ limit”.

Please revisit the algorithm; otherwise the retryer may sleep far too short or far too long.

Comment on lines +55 to +57
if status.Code(err) != codes.Unavailable && status.Code(err) != codes.DeadlineExceeded {
return false
}



Enhance retry logic to include additional transient gRPC codes

Currently only Unavailable and DeadlineExceeded are retried. In practice you’ll also see transient throttling surfaced as ResourceExhausted and connection‐reset errors as Internal. Without including these, short‐lived failures will fall through and not be retried.

Suggested change in pkg/retry/retry.go around line 55:

-   if status.Code(err) != codes.Unavailable && status.Code(err) != codes.DeadlineExceeded {
-       return false
-   }
+   switch status.Code(err) {
+   case codes.Unavailable,
+        codes.DeadlineExceeded,
+        codes.ResourceExhausted,
+        codes.Internal:
+       // retryable
+   default:
+       return false
+   }

This ensures you’ll back off and retry on throttling (ResourceExhausted) and connection resets (Internal) as well.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
pkg/retry/retry.go (3)

59-65: Off-by-one semantics make MaxAttempts act as "max retries − 1".

r.attempts++ happens before the check, so a MaxAttempts of 3 allows only 2 retries.
If the intent is "retry at most N times", move the increment after the guard (or change the comparison to >).
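A minimal sketch of the proposed fix, with a stripped-down Retryer carrying only the fields needed to show the guard-before-increment ordering (the real type has more state):

```go
package main

// Retryer is a minimal stand-in for the package's type; only the
// fields needed to demonstrate the fix are included.
type Retryer struct {
	attempts    int
	maxAttempts int
}

// shouldRetry checks the budget before incrementing, so maxAttempts == 3
// really permits three retries. The original incremented first, which
// made MaxAttempts behave as "max retries minus one".
func (r *Retryer) shouldRetry() bool {
	if r.maxAttempts > 0 && r.attempts >= r.maxAttempts {
		return false
	}
	r.attempts++
	return true
}
```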


55-57: Enhance retry logic to include additional transient gRPC codes

Currently only Unavailable and DeadlineExceeded are retried. In practice you'll also see transient throttling surfaced as ResourceExhausted and connection‐reset errors as Internal. Without including these, short‐lived failures will fall through and not be retried.


75-96: Rate-limit wait calculation appears incorrect.

waitResetAt /= duration divides a duration by the numeric limit, yielding an arbitrary nanosecond value.
Usually you want time.Until(reset) + smallBuffer or time.Until(reset)/remaining, not "duration ÷ limit".

Please revisit the algorithm; otherwise the retryer may sleep far too short or far too long.

🧹 Nitpick comments (3)
pkg/retry/retry.go (3)

25-29: Documentation inconsistency with implementation for MaxDelay.

The comment states "0 means no limit" for MaxDelay, but the implementation at line 42 sets a default of 60 seconds if it's 0. Either update the comment or change the implementation to match the documented behavior.

-	MaxDelay     time.Duration // Default is 60 seconds. 0 means no limit.
+	MaxDelay     time.Duration // Default is 60 seconds. 0 will be set to default.

31-45: Context parameter is unused in NewRetryer.

The function accepts a context parameter but doesn't use it anywhere within the function. Consider removing it to avoid confusion.

-func NewRetryer(ctx context.Context, config RetryConfig) *Retryer {
+func NewRetryer(config RetryConfig) *Retryer {

67-73: Handling overflow check is unnecessary.

The check for r.attempts > math.MaxInt64 is incredibly unlikely to be true in practice, as it would require billions of years of constant retrying at reasonable intervals. Consider simplifying this.

-	if r.attempts > math.MaxInt64 {
-		wait = r.maxDelay
-	} else {
-		wait = time.Duration(int64(r.attempts)) * r.initialDelay
-	}
+	wait = time.Duration(int64(r.attempts)) * r.initialDelay
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2e1816 and f268889.

📒 Files selected for processing (2)
  • pkg/retry/retry.go (1 hunks)
  • pkg/retry/retry_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/retry/retry_test.go
🔇 Additional comments (1)
pkg/retry/retry.go (1)

103-112: LGTM! Good implementation of cancellable delay.

The wait loop pattern correctly handles both waiting for the specified duration and respecting context cancellation, which is important for clean shutdown behavior.

annos, err := provisioner.Grant(ctx, request.Principal, request.Entitlement)
if err != nil {
l.Error("error: grant failed", zap.Error(err))
if !b.shouldWaitAndRetry(ctx, err, baseDelay) || attempt >= 2 {
Contributor

this change I think increased the attempt count fwiw

Contributor Author

Yes, the old code tried twice before erroring. I changed it to three because that seems like a good number of attempts before giving up.


// use linear backoff by default
var wait time.Duration
if r.attempts > math.MaxInt64 {
Contributor

that's a lot of attempts

Contributor Author

It was to get the linter to shut up.

provisionerV2, v2ok := b.resourceProvisionersV2[rt]
if !v1ok && !v2ok {
retryer := retry.NewRetryer(ctx, retry.RetryConfig{
MaxAttempts: 3,
Contributor

MaxAttempts should be 2.
The Temporal activity has a timeout of 5 minutes, so the time from start to close shouldn't exceed that limit. c1 ticket would end up in a weird state if we retry 3 times.

For example: [screenshot of ticket status]

Customers are not aware of the current status of the ticket.

Contributor Author

Three retries would wait 30, 60, then 90 seconds, which is 3 minutes total.

Contributor

Three minutes seems to be the ideal timing. I haven’t looked into the code to see how C1 tickets change status, but based on my testing, three attempts don’t work — the C1 ticket seems to be stuck.

Contributor

[screenshot] I am not sure if it's because of the dev env. I retried twice, which is 30 + 60 seconds, but the c1 ticket still takes around 4 minutes to respond, so I don't think it would work if the last retry takes 90 seconds.

@ggreer ggreer merged commit f99253c into main May 12, 2025
5 checks passed
@ggreer ggreer deleted the ggreer/retry branch May 12, 2025 23:30