Skip to content

Handle too big commits, missing commits from fork PRs, golang REST fallback#4836

Merged
lukaszgryglicki merged 2 commits intodevfrom
unicron-4835-handle-edge-cases
Oct 22, 2025
Merged

Handle too big commits, missing commits from fork PRs, golang REST fallback#4836
lukaszgryglicki merged 2 commits intodevfrom
unicron-4835-handle-edge-cases

Conversation

@lukaszgryglicki
Copy link
Member

  • Handle commend body too big (> 64K characters - can happen if PR has a lot of commits) in both python and golang backends.
  • Handle PRs from forks with > 250 commits with graceful fallack to REST (both python and golang backends).
  • Correctly locate PR head SHA using quick API call and fallback to getting last PR's commit SHA (also both backends).
  • Log number of distinct commit SHAs, authors (ids, logins, emails, names) in both backends so we can see how many commit summaries were there and how many actual commits (one commit can have > 1 authors) - both backends.
  • Handle private repos and forks correctly in both backends with fallbacks on errors.

cc @mlehotskylf @ahmedomosanya @jarias-lfx - this is for #4835.

Signed-off-by: Lukasz Gryglicki lgryglicki@cncf.io

Assisted by OpenAI

Assisted by GitHub Copilot

…llback

Signed-off-by: Lukasz Gryglicki <lgryglicki@cncf.io>

Assisted by [OpenAI](https://platform.openai.com/)

Assisted by [GitHub Copilot](https://github.com/features/copilot)
@coderabbitai
Copy link

coderabbitai bot commented Oct 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Refactored PR commit-author retrieval across Go and Python to prefer GraphQL with REST fallbacks, renamed GetPRHeadSHA→GetPRCommitSHA, changed doGraphQL to return only an error, and added long-CLA-comment trimming (64KB cap) plus helper utilities and pagination/merge-base logic.

Changes

Cohort / File(s) Summary
Go GraphQL & comment logic
cla-backend-go/github/github_repository.go
Added MaxCommentLength (0xff00). Changed doGraphQL to return only error. Renamed GetPRHeadSHAGetPRCommitSHA. Introduced GraphQL types (gqlUser, histNode), history-based pagination with merge-base detection, per-commit worker enqueueing, expanded logging, and comment trimming helpers (TrimComment, splitAndTrimByComma, isSHA, isHex).
Python PR commit history & authors
cla-backend/cla/models/github_models.py
Reworked iter_pr_commits_full to two-phase GraphQL/REST flow: fetch base/head OIDs, determine merge-base via REST compare, fetch HEAD via GraphQL, then page history until merge-base with safety cap. Updated update_pull_request to use PR base repo, added logging and stricter error handling.
Python comment trimming util
cla-backend/cla/utils.py
Added public trim_comment(...) to collapse long parenthesized SHA lists (head/tail/ellipsis strategy). assemble_cla_comment() now constructs body = badge + "<br />" + comment, trims when UTF‑8 length > 0xFF00, and returns the possibly trimmed body.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant PR as Service
    participant GQL as GraphQL API
    participant REST as REST API

    User->>PR: Request CLA comment / PR processing

    rect rgba(200,220,240,0.6)
    note over PR,GQL: Phase 1 — fetch base/head OIDs
    PR->>GQL: Query base & head OIDs
    alt GQL success
        GQL-->>PR: base & head OIDs
    else GQL fails
        PR->>REST: Fallback: ListCommits / ListCommits compare
        REST-->>PR: commit data / compare result
    end
    end

    rect rgba(200,240,220,0.6)
    note over PR,REST: Phase 2 — determine merge-base
    PR->>REST: compare(base...head)
    REST-->>PR: merge-base SHA
    end

    rect rgba(240,220,200,0.6)
    note over PR,GQL: Phase 3 — page commit history until merge-base
    PR->>GQL: history query (after cursor)
    GQL-->>PR: commit nodes + author info
    loop until merge-base found (cap pages)
        PR->>GQL: fetch next page
        GQL-->>PR: next nodes
    end
    end

    rect rgba(240,240,200,0.6)
    note over PR: Phase 4 — assemble & trim comment
    PR->>PR: assemble_cla_comment()
    alt body > 64KB
        PR->>PR: trim_comment()
    end
    PR-->>User: return CLA comment / status updates
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.90% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The PR title "Handle too big commits, missing commits from fork PRs, golang REST fallback" is fully related to the main changes in the changeset. It addresses the three primary objectives: handling large comment bodies (exceeding 64KB/0xFF00 limit) when PRs have many commits, implementing graceful fallback mechanisms for fork PRs with many commits via GraphQL/REST approaches, and adding REST fallback logic in the golang backend. The title is specific, clear, and directly summarizes the key changes across both the Python and Golang backends without using vague terminology.
Description Check ✅ Passed The PR description is clearly related to the changeset and provides meaningful information about the implementation. It outlines the five core objectives: handling large comment bodies (>64K), graceful REST fallback for fork PRs with >250 commits, correctly locating PR head SHA with fallbacks, logging commit and author statistics, and handling private repositories with error fallbacks. All stated objectives are reflected in the raw summary changes across both backends, including the new MaxCommentLength constant, enhanced GraphQL/REST logic, refactored commit retrieval functions, and improved logging.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch unicron-4835-handle-edge-cases

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the handling of large pull requests and fork scenarios by implementing size limits, fallback mechanisms, and improved commit tracking across both Python and Golang backends.

Key changes:

  • Implements comment body size limits (64KB) with intelligent trimming of SHA lists
  • Adds GraphQL-to-REST fallback for PRs from forks with >250 commits
  • Introduces enhanced logging for distinct commit metrics (SHAs, author IDs, logins, emails, names)

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
cla-backend/cla/utils.py Adds comment body size validation and SHA list trimming functionality
cla-backend/cla/models/github_models.py Refactors commit fetching to use merge-base strategy with proper fallbacks and enhanced logging
cla-backend-go/github/github_repository.go Mirrors Python changes with GraphQL history-based approach, comment trimming, and comprehensive commit metrics logging

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
cla-backend/cla/utils.py (1)

1066-1085: Clamp head when head > max_items; drop unnecessary DOTALL; minor list build nit

  • If head > max_items, kept list can exceed intent. Clamp head; tail already adjusted.
  • flags=re.S is unnecessary since the pattern forbids inner parens.
  • Minor: apply Ruff RUF005 nit for the ellipsis append.
-    # ensure head+tail <= max_items
-    tail = max(0, min(tail, max_items - head))
+    # ensure head+tail <= max_items
+    head = max(0, min(head, max_items))
+    tail = max(0, min(tail, max_items - head))
@@
-            if tail > 0:
-                kept += [ellipsis] + parts[-tail:]
-            else:
-                kept += [ellipsis]
+            if tail > 0:
+                kept += [ellipsis, *parts[-tail:]]
+            else:
+                kept += [ellipsis]
@@
-    return re.sub(r'\(([^()]*)\)', repl, html, flags=re.S)
+    return re.sub(r'\(([^()]*)\)', repl, html)
cla-backend/cla/models/github_models.py (2)

2257-2261: Prefer graceful REST fallback when merge-base not found in history

Raising ValueError here forces the outer caller to restart. Consider signaling “not found” and let the caller directly fall back, or invoke the REST path here for continuity.

-        if d is None:
-            cla.log.error(f"Failed to history commits for {owner}/{repo_name} PR #{number}")
-            raise ValueError("failed to fetch history commits using GraphQL")
+        if d is None:
+            cla.log.error(f"Failed to fetch commit history for {owner}/{repo_name} PR #{number}")
+            raise ValueError("gql-history-unavailable")  # caller falls back to REST
@@
-        if not hist["pageInfo"]["hasNextPage"]:
-            # merge-base not found — extremely rare (rebases or unusual ancestry).
-            # Fall back to non-GQL old approach (limited to 250 commits but still better that error)
-            raise ValueError("merge-base commit not found in PR commit history")
+        if not hist["pageInfo"]["hasNextPage"]:
+            # merge-base not in history — signal fallback to REST
+            raise ValueError("gql-merge-base-missing")

Also applies to: 2278-2287


2698-2703: Commit lookup on base repo can 404 for fork PRs; try head.repo first

For forked PRs, the last commit may not exist in the base repo yet. Prefer head.repo when available, falling back to base.repo.

-        repo = getattr(getattr(pull_request, "base", None), "repo", None)
+        repo = getattr(getattr(pull_request, "head", None), "repo", None) or \
+               getattr(getattr(pull_request, "base", None), "repo", None)

If you’ve seen 404s from base-only lookups, this will reduce fallback churn. Please confirm against a fork PR scenario.

cla-backend-go/github/github_repository.go (2)

1987-1993: Avoid logging full (trimmed) comment body

Logging the entire body can flood logs and leak emails. Log lengths and a small prefix only.

-        body = TrimComment(body, 40, 20, 20, "…")
-        log.WithFields(f).Debugf("comment trimmed to (%d): %s", len(body), body)
+        body = TrimComment(body, 40, 20, 20, "…")
+        log.WithFields(f).Debugf("comment trimmed to length=%d, head='%s...'", len(body), utils.Left(body, 256))

Note: utils.Left here denotes a helper; alternatively inline a safe substring.


1987-1989: Tiny HTML nit for consistency

Python uses "
". Consider standardizing the break tag across services.

-    body := fmt.Sprintf("%s<br >%s", badge, commentBody)
+    body := fmt.Sprintf("%s<br />%s", badge, commentBody)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a7cadb5 and dfc44b7.

📒 Files selected for processing (3)
  • cla-backend-go/github/github_repository.go (11 hunks)
  • cla-backend/cla/models/github_models.py (3 hunks)
  • cla-backend/cla/utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
cla-backend/cla/models/github_models.py

2195-2195: Avoid specifying long messages outside the exception class

(TRY003)


2219-2219: Avoid specifying long messages outside the exception class

(TRY003)


2221-2221: Multiple statements on one line (semicolon)

(E702)


2259-2259: Avoid specifying long messages outside the exception class

(TRY003)


2268-2268: Multiple statements on one line (semicolon)

(E702)


2281-2281: Avoid specifying long messages outside the exception class

(TRY003)


2287-2287: Avoid specifying long messages outside the exception class

(TRY003)

cla-backend/cla/utils.py

1077-1077: Consider [ellipsis, *parts[-tail:]] instead of concatenation

Replace with [ellipsis, *parts[-tail:]]

(RUF005)

Signed-off-by: Lukasz Gryglicki <lgryglicki@cncf.io>

Assisted by [OpenAI](https://platform.openai.com/)

Assisted by [GitHub Copilot](https://github.com/features/copilot)
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cla-backend/cla/models/github_models.py (1)

2697-2707: Clarify error message to reflect head commit access attempt.

The error message at line 2701 should indicate that this exception occurred while attempting to access the PR's head commit, not just "commit sha", since the code now falls back to retrieving the last commit from the PR's commits list.

Apply this diff:

 try:
     last_commit_sha = getattr(getattr(pull_request, "head", None), "sha", None)
     repo = getattr(getattr(pull_request, "base", None), "repo", None)
     commit_obj = repo.get_commit(last_commit_sha)
 except (GithubException, AttributeError, TypeError) as exc:
-    cla.log.error(f"{fn} - PR {pull_request.number}: exception getting commit sha: {exc}")
+    cla.log.error(f"{fn} - PR {pull_request.number}: exception getting head commit from base.repo: {exc} - attempting fallback")
     try:
         commit_obj = pull_request.get_commits().reversed[0]
         last_commit_sha = commit_obj.sha
     except Exception as exc2:
         cla.log.error(f"{fn} - PR {pull_request.number}: exception getting last commit from PR commits: {exc2}")
         last_commit_sha = None

Based on learnings

♻️ Duplicate comments (3)
cla-backend/cla/utils.py (1)

1056-1059: Add safety check after trimming to guarantee size limit.

The byte-size check correctly uses len(body.encode('utf-8')), but there's no verification that trim_comment actually reduces the body below 0xFF00 bytes. If trim_comment doesn't trim enough (e.g., no SHA-only parenthesized groups found), the body could still exceed the limit.

Consider adding a final safety truncation:

 body = badge + "<br />" + comment
 if len(body.encode('utf-8')) > 0xFF00:
     body = trim_comment(body, max_items=40, head=20, tail=20, ellipsis="…")
+    # Final safety: ensure we're under the limit
+    body_bytes = body.encode('utf-8')
+    if len(body_bytes) > 0xFF00:
+        body = body_bytes[:0xFF00].decode('utf-8', errors='ignore')
 return body
cla-backend/cla/models/github_models.py (1)

2192-2229: Add defensive checks to prevent KeyError on GraphQL responses.

Direct dictionary indexing can raise KeyError if the GraphQL response structure is unexpected. Additionally, multiple statements on one line (line 2221) reduce readability.

Apply this diff to add safe navigation and split statements:

 info = pygithub_graphql(g, q_info, {"owner": owner, "name": repo_name, "number": number})
 if info is None:
     cla.log.error(f"Failed to fetch base and head OIDs for commits for {owner}/{repo_name} PR #{number}")
     raise ValueError("failed to fetch base and head OIDs for commits using GraphQL")
-pr = info["repository"]["pullRequest"]
+repo_data = info.get("repository")
+if not repo_data:
+    cla.log.error(f"No repository data for {owner}/{repo_name} PR #{number}")
+    raise ValueError("failed to fetch repository data using GraphQL")
+pr = repo_data.get("pullRequest")
+if not pr:
+    cla.log.error(f"No pullRequest data for {owner}/{repo_name} PR #{number}")
+    raise ValueError("failed to fetch pull request data using GraphQL")
 base_oid, head_oid = pr["baseRefOid"], pr["headRefOid"]

 # 2) merge-base via REST (fast and reliable)
 repo = g.get_repo(f"{owner}/{repo_name}")
 mb_sha = repo.compare(base_oid, head_oid).merge_base_commit.sha

 # Fetch & yield HEAD commit, because history won't include the starting commit itself
 q_head = """
 query($owner:String!, $name:String!, $oid:GitObjectID!) {
   repository(owner:$owner, name:$name) {
     object(oid:$oid) {
       ... on Commit {
         oid
         message
         author { name email user { databaseId login name email } }
       }
     }
   }
 }"""
-head = pygithub_graphql(g, q_head, {"owner": owner, "name": repo_name, "oid": head_oid})["repository"]["object"]
+head_data = pygithub_graphql(g, q_head, {"owner": owner, "name": repo_name, "oid": head_oid})
+if not head_data or not head_data.get("repository"):
+    cla.log.error(f"Failed to fetch head commit repository data for {owner}/{repo_name} PR #{number}")
+    raise ValueError("failed to fetch head commit using GraphQL")
+head = head_data["repository"].get("object")
 if head is None:
     cla.log.error(f"Failed to fetch head commit for {owner}/{repo_name} PR #{number}")
     raise ValueError("failed to fetch head commit using GraphQL")
 if head["oid"] != mb_sha:
-    a = head.get("author") or {}; u = a.get("user") or {}
+    a = head.get("author") or {}
+    u = a.get("user") or {}
     yield CommitLite(

Based on learnings

cla-backend-go/github/github_repository.go (1)

2021-2032: Fix slice bounds checking to prevent edge case bug.

The conditions at lines 2023 and 2027 use < len(parts) which excludes the boundary case where head == len(parts) or tail == len(parts). This causes no items to be included when they could be.

For example, if len(parts) == 50, maxItems == 40, and head == 20, the condition head < len(parts) evaluates to 20 < 50 (true), which is fine. But if through adjustment head == 50, then 50 < 50 is false, and no head items are added even though parts[:50] is valid.

Apply this diff:

         if len(parts) > maxItems {
             var out []string
-            if head > 0 && head < len(parts) {
+            if head > 0 && head <= len(parts) {
                 out = append(out, parts[:head]...)
             }
             out = append(out, ellipsis)
-            if tail > 0 && tail < len(parts) {
+            if tail > 0 && tail <= len(parts) {
                 out = append(out, parts[len(parts)-tail:]...)
             }
             return "(" + strings.Join(out, ", ") + ")"
         }
🧹 Nitpick comments (3)
cla-backend/cla/utils.py (1)

1077-1079: Consider using spread operator for cleaner list concatenation.

The static analysis tool suggests using the spread operator for better performance and readability.

Apply this diff:

 if tail > 0:
-    kept += [ellipsis] + parts[-tail:]
+    kept += [ellipsis, *parts[-tail:]]
 else:
     kept += [ellipsis]
cla-backend-go/github/github_repository.go (2)

1442-1507: Consider adding panic recovery in the enqueue goroutine.

While the code appears safe, processing external GitHub data in concurrent goroutines could benefit from panic recovery to prevent crashes from unexpected conditions.

Apply this pattern inside the goroutine:

 go func() {
     defer wg.Done()
     defer func() { <-sem }()
+    defer func() {
+        if r := recover(); r != nil {
+            log.WithFields(f).Errorf("panic in enqueue worker: %v", r)
+        }
+    }()
     
     var id64 *int64
     // ... rest of function

1690-1690: Redundant but harmless nil check for pr.

A past review noted that the pr != nil check at line 1690 is redundant since the error path (lines 1662-1687) handles the case where the PR cannot be fetched. If execution reaches line 1690, err == nil, which means go-github should return a non-nil pr.

However, the defensive nil check doesn't cause issues and makes the code more robust. Consider removing it only if you prefer a stricter error handling contract.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dfc44b7 and 6040b31.

📒 Files selected for processing (3)
  • cla-backend-go/github/github_repository.go (11 hunks)
  • cla-backend/cla/models/github_models.py (3 hunks)
  • cla-backend/cla/utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
cla-backend/cla/models/github_models.py

2195-2195: Avoid specifying long messages outside the exception class

(TRY003)


2219-2219: Avoid specifying long messages outside the exception class

(TRY003)


2221-2221: Multiple statements on one line (semicolon)

(E702)


2259-2259: Avoid specifying long messages outside the exception class

(TRY003)


2268-2268: Multiple statements on one line (semicolon)

(E702)


2281-2281: Avoid specifying long messages outside the exception class

(TRY003)


2286-2286: Avoid specifying long messages outside the exception class

(TRY003)

cla-backend/cla/utils.py

1077-1077: Consider [ellipsis, *parts[-tail:]] instead of concatenation

Replace with [ellipsis, *parts[-tail:]]

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-lint
  • GitHub Check: cypress-functional
🔇 Additional comments (11)
cla-backend/cla/models/github_models.py (2)

2441-2450: LGTM! Excellent debugging instrumentation.

The detailed logging of distinct commit SHAs and author attributes (IDs, logins, emails, names) will be invaluable for tracking and debugging commit summary processing.


2284-2286: Good defensive pagination limit.

The 2000-page safety limit prevents infinite loops in case of unexpected data. This is solid defensive coding.

Note: The static analysis tool suggests extracting the error message to avoid TRY003, but given this is a rare edge case, the inline message clarity is acceptable.

cla-backend-go/github/github_repository.go (9)

93-108: LGTM!

The new internal types gqlUser and histNode appropriately model GraphQL response data and maintain good encapsulation by being unexported.


130-153: LGTM!

The signature change from returning (response, error) to just error simplifies the API and improves usability. The internal unmarshaling logic is clean and error handling is appropriate.


1293-1327: LGTM!

The new logging for distinct commit SHAs and author attributes (IDs, logins, emails, names) provides valuable observability for debugging commit author resolution. The implementation correctly uses maps for deduplication.


1355-1426: LGTM! GraphQL-first approach with appropriate REST fallbacks.

The refactored commit author retrieval correctly:

  • Fetches PR base/head OIDs via GraphQL with REST fallback (lines 1355-1377)
  • Computes merge-base via REST Compare API with fallback (lines 1379-1392)
  • Fetches HEAD commit via GraphQL with fallback (lines 1395-1426)

This addresses the past review concern about falling back to REST when merge-base is not found (line 1577).


1538-1586: LGTM! Pagination loop with proper fallback handling.

The commit history pagination correctly:

  • Fetches pages until merge-base is found
  • Falls back to REST when GraphQL fails (line 1560)
  • Falls back to REST when merge-base not found after all pages (line 1577)
  • Includes safety limit of 2000 pages to prevent infinite loops (lines 1582-1585)

This addresses the past review concern about fallback behavior when merge-base is not found.


1817-1828: LGTM!

The function correctly uses the renamed GetPRCommitSHA and includes helpful logging before and after status creation. Error handling is appropriate.


1982-1993: LGTM! Comment trimming applied when exceeding length limit.

The logic appropriately trims long comments and logs the result. The trimming parameters (maxItems=40, head=20, tail=20) provide reasonable truncation behavior.


2036-2068: LGTM!

The helper functions splitAndTrimByComma, isSHA, and isHex are correctly implemented:

  • Proper string splitting and whitespace trimming
  • SHA validation with appropriate length range (7-40 hex characters)
  • Hex character validation covering digits and both cases

1591-1626: LGTM!

The distinct counts logging mirrors the REST version and provides consistent observability across both code paths. This helps track commit author resolution and debug issues.

@lukaszgryglicki lukaszgryglicki merged commit 2d417d6 into dev Oct 22, 2025
4 of 6 checks passed
@lukaszgryglicki lukaszgryglicki deleted the unicron-4835-handle-edge-cases branch October 22, 2025 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants