[BB-1755, BB-1771] Add retry logic for not found errors in connector methods and fix missed grants with oauth #32

Closed
FeliLucero1 wants to merge 3 commits into main from felipelucero/add-retry-logic

Conversation

@FeliLucero1
Contributor

@FeliLucero1 FeliLucero1 commented Nov 12, 2025

Description

  • Bug fix
  • New feature

Summary by CodeRabbit

  • Bug Fixes
    • Converted transient "not found" lookups for users, groups, and service principals into retryable errors to improve resilience.
    • Explicitly fetch and validate group members; invalid member references now produce clear errors.
    • Error messages for group lookups now include the group identifier for easier troubleshooting.
    • Rate-limit info now propagates into grant/revoke annotations so operations surface throttling metadata.
  • Refactor
    • Role-grant generation updated to handle nested group expansions and attach per-operation annotations.

@coderabbitai

coderabbitai bot commented Nov 12, 2025

Walkthrough

Replaces string-based group-member parsing with explicit Group fetches (now passing attribute Vars), iterates and parses group.Members by member.Ref, expands nested groups for grants, standardizes GetGroup/Get* lookups to return gRPC NotFound errors, and adds a helper that converts NotFound results into retryable gRPC Unavailable errors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Group grants and member resolution**<br>`pkg/connector/groups.go` | Remove trait-profile members parsing; call `g.client.GetGroup(ctx, workspaceId, groupId.Resource, databricks.NewGroupAttrVars())`, iterate `group.Members`, derive principal type/ID from `member.Ref` (e.g., `Users/123`), error on invalid refs, preserve Users/ServicePrincipals handling and expand nested Groups via `groupGrantExpansion`, and accumulate annotations from expansions. |
| **Grant/Revoke Get error wrapping & Vars**<br>`pkg/connector/roles.go`, `pkg/connector/groups.go`, `pkg/connector/users.go` | Wrap errors returned from `GetUser`/`GetGroup`/`GetServicePrincipal` with `makeRetryableIfNotFound(err)`; update `GetGroup` invocations to pass `databricks.NewGroupRolesAttrVars()` or `databricks.NewGroupAttrVars()` where appropriate; propagate rate-limit annotations from role rule-set and group membership retrievals. |
| **Retryable helper and plumbing**<br>`pkg/connector/helpers.go` | Add private helper `makeRetryableIfNotFound` that converts certain NotFound-style errors into gRPC Unavailable (retryable) status; apply it across lookup/error branches in `prepareResourceId` and `preparePrincipalId`; add required imports (`errors`, `net/http`, `google.golang.org/grpc/codes`, `google.golang.org/grpc/status`). |
| **Client NotFound → gRPC status errors & GetGroup signature**<br>`pkg/databricks/client.go` | Replace prior NotFound fallbacks with `status.Error(codes.NotFound, "...")` for finders (users, groups, service principals); extend `GetGroup` to accept variadic Vars (`GetGroup(ctx, workspaceId, groupId string, vars ...Vars)`) and forward them to the underlying Get call; preserve returned rate-limit data. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Connector as Connector
    participant Databricks as DatabricksClient
    participant GrantLogic as GrantProcessor

    rect rgb(240, 248, 255)
    Connector->>Databricks: GetGroup(ctx, workspace, groupId, NewGroupAttrVars())
    note right of Databricks: returns Group with Members[]
    Databricks-->>Connector: Group{Members:[{Ref:"Users/123"}, {Ref:"Groups/456"}, ...]}
    end

    rect rgb(255, 245, 230)
    Connector->>GrantLogic: For each member in Group.Members
    GrantLogic->>GrantLogic: parse member.Ref -> (type, id) or error
    alt member is User or ServicePrincipal
        GrantLogic->>GrantLogic: build resource ID, append annotations
    else member is Group
        GrantLogic->>Connector: expand nested group via groupGrantExpansion(...)
        Connector-->>GrantLogic: nested members + annotations
    else invalid Ref
        GrantLogic--xConnector: return error("invalid member Ref")
    end
    end

    rect rgb(246, 255, 240)
    note over Databricks,Connector: Error handling flow
    Databricks--xConnector: status.Error(codes.NotFound,...)
    Connector->>Connector: makeRetryableIfNotFound(err) -> grpc.Unavailable (retryable)
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay special attention to:
    • pkg/connector/groups.go — correctness and coverage of member.Ref parsing and error messages.
    • pkg/databricks/client.go — updated GetGroup signature propagation and correct forwarding of Vars.
    • pkg/connector/helpers.go — ensure makeRetryableIfNotFound correctly identifies only intended NotFound cases and preserves other errors.
    • Rate-limit annotation propagation across grants and revokes.

Poem

🐇 I hopped to fetch the group's fine list,
Each member Ref I gently parsed and kissed.
If not found, I whisper "try again" —
A retryable hop, then back to the glen.
🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the two main objectives: adding retry logic for not-found errors and fixing missed grants with OAuth, matching the substantive changes across multiple connector files. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/connector/groups.go (1)

199-204: Consider consistent error wrapping for GetGroup calls.

This GetGroup call in the Grants method returns errors directly, while the Grant (line 309) and Revoke (line 453) methods wrap errors with makeRetryableIfNotFound.

If the intent is to make not-found errors retryable only for mutation operations and not for read operations during grant listing, this is fine. Otherwise, consider applying consistent error wrapping across all GetGroup calls.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ba9cf1 and 8a16f69.

📒 Files selected for processing (3)
  • pkg/connector/groups.go (3 hunks)
  • pkg/connector/roles.go (4 hunks)
  • pkg/databricks/client.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/connector/roles.go
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/connector/groups.go (1)
pkg/databricks/vars.go (1)
  • NewGroupAttrVars (86-94)
pkg/databricks/client.go (1)
pkg/databricks/vars.go (1)
  • Vars (8-10)
🔇 Additional comments (4)
pkg/databricks/client.go (2)

259-259: LGTM - Backward compatible extension.

The GetGroup signature has been extended to accept variadic Vars parameters, allowing callers to specify which attributes to fetch (e.g., databricks.NewGroupAttrVars() for members). This is a backward-compatible change since existing calls without the parameter will continue to work.

Also applies to: 272-272
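A minimal sketch of that backward-compatible variadic extension follows. The stub `Vars` and `client` types here are stand-ins — the real definitions live in `pkg/databricks` and are not shown in this thread:

```go
package main

import "fmt"

// Vars stands in for the attribute-selection type in pkg/databricks/vars.go;
// its real fields are not visible in this thread, so it is modeled as a map.
type Vars map[string]string

type client struct{}

// GetGroup accepts optional Vars and forwards them into the request it builds.
// Because the parameter is variadic, existing call sites that pass no Vars
// keep compiling unchanged -- the backward compatibility noted above.
func (c *client) GetGroup(workspaceId, groupId string, vars ...Vars) string {
	query := ""
	for _, v := range vars {
		for key, val := range v {
			query += fmt.Sprintf("&%s=%s", key, val)
		}
	}
	return fmt.Sprintf("GET /workspaces/%s/Groups/%s?%s", workspaceId, groupId, query)
}

func main() {
	c := &client{}
	fmt.Println(c.GetGroup("ws1", "g1"))                                // old call site, unchanged
	fmt.Println(c.GetGroup("ws1", "g1", Vars{"attributes": "members"})) // opts in to member data
}
```

The variadic form avoids touching every existing caller while letting new callers request extra attributes.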


200-200: Verification confirms all callers properly handle the new NotFound errors.

All six Find* methods are called exclusively in pkg/connector/helpers.go, and every call site properly checks for errors and wraps them with makeRetryableIfNotFound(). There are no other callers in the codebase that would be affected by the breaking change.

pkg/connector/groups.go (2)

206-234: LGTM - Clear member parsing logic.

The member parsing logic correctly:

  • Validates the member.Ref format ("Type/ID")
  • Handles all expected member types (Users, Groups, ServicePrincipals)
  • Applies group expansion for nested group memberships
  • Returns descriptive errors for invalid formats

The refactoring from profile string parsing to explicit API fetching with member iteration is a reliability improvement.
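The Ref parsing described above can be sketched as follows; the function name and error wording are illustrative, not the connector's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMemberRef splits a SCIM-style member reference such as "Users/123"
// into its principal type and ID, rejecting malformed refs and unknown
// principal types. Names here are hypothetical.
func parseMemberRef(ref string) (principalType, id string, err error) {
	parts := strings.SplitN(ref, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid member ref %q: expected \"Type/ID\"", ref)
	}
	switch parts[0] {
	case "Users", "Groups", "ServicePrincipals":
		return parts[0], parts[1], nil
	default:
		return "", "", fmt.Errorf("invalid member ref %q: unknown principal type %q", ref, parts[0])
	}
}

func main() {
	for _, ref := range []string{"Users/123", "Groups/456", "bogus"} {
		t, id, err := parseMemberRef(ref)
		fmt.Println(t, id, err)
	}
}
```

A `Groups/...` result would then feed the nested-group expansion path rather than being emitted as a direct grant.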


307-310: The implementation is correct and all concerns are addressed.

The makeRetryableIfNotFound function at lines 216-230 in pkg/connector/helpers.go:

  • ✓ Correctly identifies both gRPC NotFound (line 226) and HTTP NotFound (line 222) errors
  • ✓ Converts both to codes.Unavailable status, which is retryable
  • ✓ Preserves the original error message for debugging via fmt.Sprintf(..., err)
  • ✓ Returns the original error unchanged if it's not a NotFound error (line 230)

The implementation properly handles the error transformation required by the PR's retry logic.

Comment on lines +12 to +13
```go
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verify the gRPC library version.

The imports are necessary for the new error handling, but the library context indicates google.golang.org/grpc version v1.73.0-dev is being used. Development versions can be unstable and are generally not recommended for production use.

Run the following script to check the gRPC version in use:


🏁 Script executed:

```shell
#!/bin/bash
# Check the version of google.golang.org/grpc in go.mod
cat go.mod | grep "google.golang.org/grpc"
```

Length of output: 144

**No actionable concerns with the imports; gRPC v1 is stable, but verify dependency management.**

When a project reaches major version v1 it is considered stable. The imports for `google.golang.org/grpc/codes` and `google.golang.org/grpc/status` are appropriate for error handling. While the codebase currently pulls in v1.73.0-dev (a pre-release version), Version 1.73.0 (stable) should be used when available. Since this is marked as an indirect dependency, it's being pulled in by another package—consider verifying your dependency management to use stable releases where possible rather than development versions in production builds.


---

🌐 Web query:

```
google.golang.org/grpc v1.73.0-dev stability production use
```

💡 Result:

Short answer: don't use a -dev module version in production. The "-dev" suffix denotes a development (unpublished/unstable) build; use an official tagged release (for example v1.73.0 or a later non-dev release) and pin that version after testing. [1][2][3]

Details and sources:

  • grpc-go has an official v1.73.0 release (use the tagged release, not a dev build). [1]
  • A v1.74.0 tagging issue was reported (the project advised using v1.73.0 until a fixed 1.74.x), illustrating that unstable tags/dev builds can be transient or incorrect. [2]
  • The grpc-go repo documentation and releases are the authoritative source for which versions are stable for production; prefer published semver tags, run your tests, and pin the module in go.mod. [3]

If you want, I can look up the latest non-dev grpc-go release right now and suggest the exact module line to put in go.mod.


🌐 Web query:

```
grpc-go v1.73.0 release date stable version
```

💡 Result:

grpc-go v1.73.0 (stable) was released on June 5, 2025. [1][2]

Sources:
[1] GitHub Releases — grpc/grpc-go (v1.73.0).
[2] SourceForge mirror — grpc-go v1.73.0.


Address the development version dependency before production use.

The gRPC library version v1.73.0-dev is a development (unstable) build and must not be used in production. The stable release v1.73.0 has been available since June 5, 2025. Update go.mod to pin the stable release instead. Since this is an indirect dependency, investigate which direct dependency is pulling in the -dev version and either update that dependency or explicitly pin google.golang.org/grpc to v1.73.0 in your go.mod.
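Concretely, the pin described above would be a single require directive in go.mod (sketch; run `go mod tidy` afterwards so the -dev entry is dropped):

```
require google.golang.org/grpc v1.73.0
```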

🤖 Prompt for AI Agents
pkg/databricks/client.go around lines 12-13: your module is pulling an indirect
gRPC dev release (v1.73.0-dev); locate which direct dependency requires it (use
go mod graph or go list -m -json all) and either update that direct dependency
to a version that depends on google.golang.org/grpc v1.73.0 or explicitly pin
google.golang.org/grpc to v1.73.0 in go.mod (add a require directive and run go
mod tidy) so the stable v1.73.0 is used instead of the -dev build.

@FeliLucero1 FeliLucero1 changed the title [BB-1755]Add retry logic for not found errors in connector methods [BB-1755, BB-1771]Add retry logic for not found errors in connector methods Nov 14, 2025
@FeliLucero1 FeliLucero1 changed the title [BB-1755, BB-1771]Add retry logic for not found errors in connector methods [BB-1755, BB-1771]Add retry logic for not found errors in connector methods and fix missed grants with oauth Nov 14, 2025
@FeliLucero1 FeliLucero1 requested a review from ggreer November 14, 2025 13:51
Contributor

@luisina-santos luisina-santos left a comment


we should confirm with core teams this is a safe change based on the number of times the connector will retry the action when the user does not actually exist, causing potential infinite retries.

I want to make sure the retry operation is something that we should handle at connector level, since we only want to retry if app access is configured through delegated provision (and we don't have this context here)

CC: @laurenleach @btipling


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a16f69 and 5aaf4be.

📒 Files selected for processing (1)
  • pkg/connector/groups.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/connector/groups.go (2)
pkg/databricks/vars.go (1)
  • NewGroupAttrVars (86-94)
pkg/connector/account.go (1)
  • GroupsType (33-33)
🔇 Additional comments (4)
pkg/connector/groups.go (4)

211-239: Member processing logic is well-structured.

The refactored approach to parse member.Ref is clear and handles the expected member types (Users, Groups, ServicePrincipals) correctly. Format validation and nested group expansion are properly implemented.


206-209: Rate limiting annotations properly implemented.

The rate limiting data from both GetGroup and ListRuleSets is now correctly captured and accumulated in annotations, which are returned to the caller. This addresses the previous review feedback about reading rateLimitData and returning proper annotations.

Based on learnings

Also applies to: 247-249, 271-271


316-319: Retry logic correctly applied.

The GetGroup call now includes NewGroupAttrVars() to ensure member data is fetched, and the error is properly wrapped with makeRetryableIfNotFound to enable retry behavior for not-found scenarios per the PR objectives.


460-463: Retry logic correctly applied.

Consistent with the Grant method, the error handling properly wraps not-found errors to enable retry behavior.

Comment on lines +199 to +204
// Always fetch the group with members attribute to ensure we get the members
// regardless of authentication type (OAuth vs personal access token)
group, rateLimitData, err := g.client.GetGroup(ctx, workspaceId, groupId.Resource, databricks.NewGroupAttrVars())
if err != nil {
return nil, "", nil, fmt.Errorf("databricks-connector: failed to get group %s: %w", groupId.Resource, err)
}

⚠️ Potential issue | 🟠 Major

Apply retry wrapper consistently with Grant and Revoke methods.

The GetGroup error is not wrapped with makeRetryableIfNotFound here, but the same calls in Grant (line 318) and Revoke (line 462) do wrap it. Per the PR objectives (adding retry logic for not found errors), this method should also benefit from the retry behavior.

Apply this diff:

```diff
 	group, rateLimitData, err := g.client.GetGroup(ctx, workspaceId, groupId.Resource, databricks.NewGroupAttrVars())
 	if err != nil {
+		err = makeRetryableIfNotFound(err)
 		return nil, "", nil, fmt.Errorf("databricks-connector: failed to get group %s: %w", groupId.Resource, err)
 	}
```
🤖 Prompt for AI Agents
In pkg/connector/groups.go around lines 199 to 204, the call to
g.client.GetGroup returns an error that is not wrapped with
makeRetryableIfNotFound like the same calls in Grant and Revoke; update the
error handling to pass the returned error through makeRetryableIfNotFound before
returning so transient "not found" errors become retryable (i.e., replace the
current fmt.Errorf return with returning makeRetryableIfNotFound(err) wrapped
appropriately to preserve the original error context).

```go
		return nil, "", nil, fmt.Errorf("databricks-connector: failed to parse group resource id: %w", err)
	}

	groupTrait, err := rs.GetGroupTrait(resource)
```


@FeliLucero1 was the group trait missing members? or just didn't include them at all?

Contributor Author


Used to lose some

Contributor Author


It seems that the users it wasn't losing were users who had permissions over the group but weren't actually members of it; in other words, it was losing all of the real members.

@laurenleach

> we should confirm with core teams this is a safe change based on the number of times the connector will retry the action when the user does not actually exist, causing potential infinite retries.
>
> I want to make sure the retry operation is something that we should handle at connector level, since we only want to retry if app access is configured through delegated provision (and we don't have this context here)
>
> CC: @laurenleach @btipling

@luisina-santos I don't think this should have been a BB ticket in the first place, this is c1 platform behavior and connectors shouldn't have special logic for this

@btipling btipling closed this Nov 26, 2025
@btipling btipling deleted the felipelucero/add-retry-logic branch November 26, 2025 02:04