Handle too big commits, missing commits from fork PRs, golang REST fallback#4836
Handle too big commits, missing commits from fork PRs, golang REST fallback#4836lukaszgryglicki merged 2 commits intodevfrom
Conversation
…llback Signed-off-by: Lukasz Gryglicki <lgryglicki@cncf.io> Assisted by [OpenAI](https://platform.openai.com/) Assisted by [GitHub Copilot](https://github.com/features/copilot)
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughRefactored PR commit-author retrieval across Go and Python to prefer GraphQL with REST fallbacks, renamed GetPRHeadSHA→GetPRCommitSHA, changed doGraphQL to return only an error, and added long-CLA-comment trimming (64KB cap) plus helper utilities and pagination/merge-base logic. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant PR as Service
participant GQL as GraphQL API
participant REST as REST API
User->>PR: Request CLA comment / PR processing
rect rgba(200,220,240,0.6)
note over PR,GQL: Phase 1 — fetch base/head OIDs
PR->>GQL: Query base & head OIDs
alt GQL success
GQL-->>PR: base & head OIDs
else GQL fails
PR->>REST: Fallback: ListCommits / ListCommits compare
REST-->>PR: commit data / compare result
end
end
rect rgba(200,240,220,0.6)
note over PR,REST: Phase 2 — determine merge-base
PR->>REST: compare(base...head)
REST-->>PR: merge-base SHA
end
rect rgba(240,220,200,0.6)
note over PR,GQL: Phase 3 — page commit history until merge-base
PR->>GQL: history query (after cursor)
GQL-->>PR: commit nodes + author info
loop until merge-base found (cap pages)
PR->>GQL: fetch next page
GQL-->>PR: next nodes
end
end
rect rgba(240,240,200,0.6)
note over PR: Phase 4 — assemble & trim comment
PR->>PR: assemble_cla_comment()
alt body > 64KB
PR->>PR: trim_comment()
end
PR-->>User: return CLA comment / status updates
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull Request Overview
This PR enhances the handling of large pull requests and fork scenarios by implementing size limits, fallback mechanisms, and improved commit tracking across both Python and Golang backends.
Key changes:
- Implements comment body size limits (64KB) with intelligent trimming of SHA lists
- Adds GraphQL-to-REST fallback for PRs from forks with >250 commits
- Introduces enhanced logging for distinct commit metrics (SHAs, author IDs, logins, emails, names)
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| cla-backend/cla/utils.py | Adds comment body size validation and SHA list trimming functionality |
| cla-backend/cla/models/github_models.py | Refactors commit fetching to use merge-base strategy with proper fallbacks and enhanced logging |
| cla-backend-go/github/github_repository.go | Mirrors Python changes with GraphQL history-based approach, comment trimming, and comprehensive commit metrics logging |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
cla-backend/cla/utils.py (1)
1066-1085: Clamp head when head > max_items; drop unnecessary DOTALL; minor list build nit
- If head > max_items, kept list can exceed intent. Clamp head; tail already adjusted.
- flags=re.S is unnecessary since the pattern forbids inner parens.
- Minor: apply Ruff RUF005 nit for the ellipsis append.
- # ensure head+tail <= max_items - tail = max(0, min(tail, max_items - head)) + # ensure head+tail <= max_items + head = max(0, min(head, max_items)) + tail = max(0, min(tail, max_items - head)) @@ - if tail > 0: - kept += [ellipsis] + parts[-tail:] - else: - kept += [ellipsis] + if tail > 0: + kept += [ellipsis, *parts[-tail:]] + else: + kept += [ellipsis] @@ - return re.sub(r'\(([^()]*)\)', repl, html, flags=re.S) + return re.sub(r'\(([^()]*)\)', repl, html)cla-backend/cla/models/github_models.py (2)
2257-2261: Prefer graceful REST fallback when merge-base not found in historyRaising ValueError here forces the outer caller to restart. Consider signaling “not found” and let the caller directly fall back, or invoke the REST path here for continuity.
- if d is None: - cla.log.error(f"Failed to history commits for {owner}/{repo_name} PR #{number}") - raise ValueError("failed to fetch history commits using GraphQL") + if d is None: + cla.log.error(f"Failed to fetch commit history for {owner}/{repo_name} PR #{number}") + raise ValueError("gql-history-unavailable") # caller falls back to REST @@ - if not hist["pageInfo"]["hasNextPage"]: - # merge-base not found — extremely rare (rebases or unusual ancestry). - # Fall back to non-GQL old approach (limited to 250 commits but still better that error) - raise ValueError("merge-base commit not found in PR commit history") + if not hist["pageInfo"]["hasNextPage"]: + # merge-base not in history — signal fallback to REST + raise ValueError("gql-merge-base-missing")Also applies to: 2278-2287
2698-2703: Commit lookup on base repo can 404 for fork PRs; try head.repo firstFor forked PRs, the last commit may not exist in the base repo yet. Prefer head.repo when available, falling back to base.repo.
- repo = getattr(getattr(pull_request, "base", None), "repo", None) + repo = getattr(getattr(pull_request, "head", None), "repo", None) or \ + getattr(getattr(pull_request, "base", None), "repo", None)If you’ve seen 404s from base-only lookups, this will reduce fallback churn. Please confirm against a fork PR scenario.
cla-backend-go/github/github_repository.go (2)
1987-1993: Avoid logging full (trimmed) comment bodyLogging the entire body can flood logs and leak emails. Log lengths and a small prefix only.
- body = TrimComment(body, 40, 20, 20, "…") - log.WithFields(f).Debugf("comment trimmed to (%d): %s", len(body), body) + body = TrimComment(body, 40, 20, 20, "…") + log.WithFields(f).Debugf("comment trimmed to length=%d, head='%s...'", len(body), utils.Left(body, 256))Note: utils.Left here denotes a helper; alternatively inline a safe substring.
1987-1989: Tiny HTML nit for consistencyPython uses "
". Consider standardizing the break tag across services.- body := fmt.Sprintf("%s<br >%s", badge, commentBody) + body := fmt.Sprintf("%s<br />%s", badge, commentBody)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cla-backend-go/github/github_repository.go(11 hunks)cla-backend/cla/models/github_models.py(3 hunks)cla-backend/cla/utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
cla-backend/cla/models/github_models.py
2195-2195: Avoid specifying long messages outside the exception class
(TRY003)
2219-2219: Avoid specifying long messages outside the exception class
(TRY003)
2221-2221: Multiple statements on one line (semicolon)
(E702)
2259-2259: Avoid specifying long messages outside the exception class
(TRY003)
2268-2268: Multiple statements on one line (semicolon)
(E702)
2281-2281: Avoid specifying long messages outside the exception class
(TRY003)
2287-2287: Avoid specifying long messages outside the exception class
(TRY003)
cla-backend/cla/utils.py
1077-1077: Consider [ellipsis, *parts[-tail:]] instead of concatenation
Replace with [ellipsis, *parts[-tail:]]
(RUF005)
Signed-off-by: Lukasz Gryglicki <lgryglicki@cncf.io> Assisted by [OpenAI](https://platform.openai.com/) Assisted by [GitHub Copilot](https://github.com/features/copilot)
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cla-backend/cla/models/github_models.py (1)
2697-2707: Clarify error message to reflect head commit access attempt.The error message at line 2701 should indicate that this exception occurred while attempting to access the PR's head commit, not just "commit sha", since the code now falls back to retrieving the last commit from the PR's commits list.
Apply this diff:
try: last_commit_sha = getattr(getattr(pull_request, "head", None), "sha", None) repo = getattr(getattr(pull_request, "base", None), "repo", None) commit_obj = repo.get_commit(last_commit_sha) except (GithubException, AttributeError, TypeError) as exc: - cla.log.error(f"{fn} - PR {pull_request.number}: exception getting commit sha: {exc}") + cla.log.error(f"{fn} - PR {pull_request.number}: exception getting head commit from base.repo: {exc} - attempting fallback") try: commit_obj = pull_request.get_commits().reversed[0] last_commit_sha = commit_obj.sha except Exception as exc2: cla.log.error(f"{fn} - PR {pull_request.number}: exception getting last commit from PR commits: {exc2}") last_commit_sha = NoneBased on learnings
♻️ Duplicate comments (3)
cla-backend/cla/utils.py (1)
1056-1059: Add safety check after trimming to guarantee size limit.The byte-size check correctly uses
len(body.encode('utf-8')), but there's no verification thattrim_commentactually reduces the body below 0xFF00 bytes. Iftrim_commentdoesn't trim enough (e.g., no SHA-only parenthesized groups found), the body could still exceed the limit.Consider adding a final safety truncation:
body = badge + "<br />" + comment if len(body.encode('utf-8')) > 0xFF00: body = trim_comment(body, max_items=40, head=20, tail=20, ellipsis="…") + # Final safety: ensure we're under the limit + body_bytes = body.encode('utf-8') + if len(body_bytes) > 0xFF00: + body = body_bytes[:0xFF00].decode('utf-8', errors='ignore') return bodycla-backend/cla/models/github_models.py (1)
2192-2229: Add defensive checks to prevent KeyError on GraphQL responses.Direct dictionary indexing can raise
KeyErrorif the GraphQL response structure is unexpected. Additionally, multiple statements on one line (line 2221) reduce readability.Apply this diff to add safe navigation and split statements:
info = pygithub_graphql(g, q_info, {"owner": owner, "name": repo_name, "number": number}) if info is None: cla.log.error(f"Failed to fetch base and head OIDs for commits for {owner}/{repo_name} PR #{number}") raise ValueError("failed to fetch base and head OIDs for commits using GraphQL") -pr = info["repository"]["pullRequest"] +repo_data = info.get("repository") +if not repo_data: + cla.log.error(f"No repository data for {owner}/{repo_name} PR #{number}") + raise ValueError("failed to fetch repository data using GraphQL") +pr = repo_data.get("pullRequest") +if not pr: + cla.log.error(f"No pullRequest data for {owner}/{repo_name} PR #{number}") + raise ValueError("failed to fetch pull request data using GraphQL") base_oid, head_oid = pr["baseRefOid"], pr["headRefOid"] # 2) merge-base via REST (fast and reliable) repo = g.get_repo(f"{owner}/{repo_name}") mb_sha = repo.compare(base_oid, head_oid).merge_base_commit.sha # Fetch & yield HEAD commit, because history won't include the starting commit itself q_head = """ query($owner:String!, $name:String!, $oid:GitObjectID!) { repository(owner:$owner, name:$name) { object(oid:$oid) { ... on Commit { oid message author { name email user { databaseId login name email } } } } } }""" -head = pygithub_graphql(g, q_head, {"owner": owner, "name": repo_name, "oid": head_oid})["repository"]["object"] +head_data = pygithub_graphql(g, q_head, {"owner": owner, "name": repo_name, "oid": head_oid}) +if not head_data or not head_data.get("repository"): + cla.log.error(f"Failed to fetch head commit repository data for {owner}/{repo_name} PR #{number}") + raise ValueError("failed to fetch head commit using GraphQL") +head = head_data["repository"].get("object") if head is None: cla.log.error(f"Failed to fetch head commit for {owner}/{repo_name} PR #{number}") raise ValueError("failed to fetch head commit using GraphQL") if head["oid"] != mb_sha: - a = head.get("author") or {}; u = a.get("user") or {} + a = head.get("author") or {} + u = a.get("user") or {} yield CommitLite(Based on learnings
cla-backend-go/github/github_repository.go (1)
2021-2032: Fix slice bounds checking to prevent edge case bug.The conditions at lines 2023 and 2027 use
< len(parts)which excludes the boundary case wherehead == len(parts)ortail == len(parts). This causes no items to be included when they could be.For example, if
len(parts) == 50,maxItems == 40, andhead == 20, the conditionhead < len(parts)evaluates to20 < 50(true), which is fine. But if through adjustmenthead == 50, then50 < 50is false, and no head items are added even thoughparts[:50]is valid.Apply this diff:
if len(parts) > maxItems { var out []string - if head > 0 && head < len(parts) { + if head > 0 && head <= len(parts) { out = append(out, parts[:head]...) } out = append(out, ellipsis) - if tail > 0 && tail < len(parts) { + if tail > 0 && tail <= len(parts) { out = append(out, parts[len(parts)-tail:]...) } return "(" + strings.Join(out, ", ") + ")" }
🧹 Nitpick comments (3)
cla-backend/cla/utils.py (1)
1077-1079: Consider using spread operator for cleaner list concatenation.The static analysis tool suggests using the spread operator for better performance and readability.
Apply this diff:
if tail > 0: - kept += [ellipsis] + parts[-tail:] + kept += [ellipsis, *parts[-tail:]] else: kept += [ellipsis]cla-backend-go/github/github_repository.go (2)
1442-1507: Consider adding panic recovery in the enqueue goroutine.While the code appears safe, processing external GitHub data in concurrent goroutines could benefit from panic recovery to prevent crashes from unexpected conditions.
Apply this pattern inside the goroutine:
go func() { defer wg.Done() defer func() { <-sem }() + defer func() { + if r := recover(); r != nil { + log.WithFields(f).Errorf("panic in enqueue worker: %v", r) + } + }() var id64 *int64 // ... rest of function
1690-1690: Redundant but harmless nil check forpr.A past review noted that the
pr != nilcheck at line 1690 is redundant since the error path (lines 1662-1687) handles the case where the PR cannot be fetched. If execution reaches line 1690,err == nil, which means go-github should return a non-nilpr.However, the defensive nil check doesn't cause issues and makes the code more robust. Consider removing it only if you prefer a stricter error handling contract.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cla-backend-go/github/github_repository.go(11 hunks)cla-backend/cla/models/github_models.py(3 hunks)cla-backend/cla/utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
cla-backend/cla/models/github_models.py
2195-2195: Avoid specifying long messages outside the exception class
(TRY003)
2219-2219: Avoid specifying long messages outside the exception class
(TRY003)
2221-2221: Multiple statements on one line (semicolon)
(E702)
2259-2259: Avoid specifying long messages outside the exception class
(TRY003)
2268-2268: Multiple statements on one line (semicolon)
(E702)
2281-2281: Avoid specifying long messages outside the exception class
(TRY003)
2286-2286: Avoid specifying long messages outside the exception class
(TRY003)
cla-backend/cla/utils.py
1077-1077: Consider [ellipsis, *parts[-tail:]] instead of concatenation
Replace with [ellipsis, *parts[-tail:]]
(RUF005)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-lint
- GitHub Check: cypress-functional
🔇 Additional comments (11)
cla-backend/cla/models/github_models.py (2)
2441-2450: LGTM! Excellent debugging instrumentation.The detailed logging of distinct commit SHAs and author attributes (IDs, logins, emails, names) will be invaluable for tracking and debugging commit summary processing.
2284-2286: Good defensive pagination limit.The 2000-page safety limit prevents infinite loops in case of unexpected data. This is solid defensive coding.
Note: The static analysis tool suggests extracting the error message to avoid TRY003, but given this is a rare edge case, the inline message clarity is acceptable.
cla-backend-go/github/github_repository.go (9)
93-108: LGTM!The new internal types
gqlUserandhistNodeappropriately model GraphQL response data and maintain good encapsulation by being unexported.
130-153: LGTM!The signature change from returning
(response, error)to justerrorsimplifies the API and improves usability. The internal unmarshaling logic is clean and error handling is appropriate.
1293-1327: LGTM!The new logging for distinct commit SHAs and author attributes (IDs, logins, emails, names) provides valuable observability for debugging commit author resolution. The implementation correctly uses maps for deduplication.
1355-1426: LGTM! GraphQL-first approach with appropriate REST fallbacks.The refactored commit author retrieval correctly:
- Fetches PR base/head OIDs via GraphQL with REST fallback (lines 1355-1377)
- Computes merge-base via REST Compare API with fallback (lines 1379-1392)
- Fetches HEAD commit via GraphQL with fallback (lines 1395-1426)
This addresses the past review concern about falling back to REST when merge-base is not found (line 1577).
1538-1586: LGTM! Pagination loop with proper fallback handling.The commit history pagination correctly:
- Fetches pages until merge-base is found
- Falls back to REST when GraphQL fails (line 1560)
- Falls back to REST when merge-base not found after all pages (line 1577)
- Includes safety limit of 2000 pages to prevent infinite loops (lines 1582-1585)
This addresses the past review concern about fallback behavior when merge-base is not found.
1817-1828: LGTM!The function correctly uses the renamed
GetPRCommitSHAand includes helpful logging before and after status creation. Error handling is appropriate.
1982-1993: LGTM! Comment trimming applied when exceeding length limit.The logic appropriately trims long comments and logs the result. The trimming parameters (maxItems=40, head=20, tail=20) provide reasonable truncation behavior.
2036-2068: LGTM!The helper functions
splitAndTrimByComma,isSHA, andisHexare correctly implemented:
- Proper string splitting and whitespace trimming
- SHA validation with appropriate length range (7-40 hex characters)
- Hex character validation covering digits and both cases
1591-1626: LGTM!The distinct counts logging mirrors the REST version and provides consistent observability across both code paths. This helps track commit author resolution and debug issues.
pythonandgolangbackends.pythonandgolangbackends).cc @mlehotskylf @ahmedomosanya @jarias-lfx - this is for #4835.
Signed-off-by: Lukasz Gryglicki lgryglicki@cncf.io
Assisted by OpenAI
Assisted by GitHub Copilot