Bug 2023624 - Some older repos have comments that have empty comment body and user object is null. The ETL script needs to handle these better instead of crashing#12
dklawren
commented
Mar 16, 2026
- extract_reviewers: filters out any review where user is null, preserving empty-body reviews (e.g. approve without comment)
- extract_comments: filters out any comment where user is null or body is empty
- transform_data: defensive (review.get("user") or {}) so a null user won't raise AttributeError even if it somehow reaches the transform
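The three changes above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: the helper names `filter_reviews`, `filter_comments`, and `transform_review`, the log wording, and the sample payloads are all hypothetical; only the `user`, `body`, `login`, and `submitted_at` keys come from the diff.

```python
import logging

logger = logging.getLogger(__name__)


def filter_reviews(reviews, pr_number):
    """Keep reviews that have a user; empty bodies are fine (e.g. a bare approve)."""
    kept = [r for r in reviews if r.get("user") is not None]
    logger.info("PR #%s: kept %d reviews, skipped %d",
                pr_number, len(kept), len(reviews) - len(kept))
    return kept


def filter_comments(comments, pr_number):
    """Keep comments that have both a user and a non-empty body."""
    kept = [c for c in comments if c.get("user") is not None and c.get("body")]
    logger.info("PR #%s: kept %d comments, skipped %d",
                pr_number, len(kept), len(comments) - len(kept))
    return kept


def transform_review(review):
    """Build an output row; a null user yields a None username instead of raising."""
    return {
        "date_reviewed": review.get("submitted_at"),
        "reviewer_username": (review.get("user") or {}).get("login"),
    }


# Hypothetical payloads shaped like the problem cases described above.
sample_reviews = [
    {"user": {"login": "alice"}, "body": "", "submitted_at": "2020-01-01T00:00:00Z"},
    {"user": None, "body": "LGTM", "submitted_at": "2020-01-02T00:00:00Z"},
]
sample_comments = [
    {"user": {"login": "bob"}, "body": "Please rename this."},
    {"user": {"login": "bob"}, "body": ""},
    {"user": None, "body": "orphaned comment"},
]

kept_reviews = filter_reviews(sample_reviews, 12)
kept_comments = filter_comments(sample_comments, 12)
rows = [transform_review(r) for r in kept_reviews]
```

In this sketch the empty-body review from `alice` survives, while the empty-body comment and both null-user entries are dropped.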
Pull request overview
Adds defensive handling for GitHub PR reviews/comments where user is null or body is empty, preventing ETL script crashes on older repositories with malformed data.
Changes:
- extract_reviewers: Filters out reviews with null user and logs skipped count
- extract_comments: Filters out comments with null user or empty body and logs skipped count
- transform_data: Uses (review.get("user") or {}) pattern to safely handle null user objects
shtrom
left a comment
I don't think the error we saw will be addressed by this fix.
Is there a way you could run one iteration of the loop against a given PR locally? That would make it easier to reproduce topical issues.
logger.info(f"Extracted {len(reviewers)} reviewers for PR #{pr_number}")
return reviewers
filtered = [r for r in reviewers if r.get("user") is not None]
nit: Unless you expect r["user"] to be validly False or "", you can just rely on truthiness:
- filtered = [r for r in reviewers if r.get("user") is not None]
+ filtered = [r for r in reviewers if r.get("user")]
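To make the trade-off in this nit concrete, here is a small sketch using made-up payloads (hypothetical data, not taken from the failing repos). The two filters agree when `user` is missing, null, or a populated object; they differ only if `user` can be an empty object.

```python
# Hypothetical review payloads covering the interesting cases.
reviews = [
    {"user": {"login": "alice"}},  # populated user
    {"user": None},                # explicit null
    {},                            # key missing entirely
    {"user": {}},                  # empty object: falsy, but not None
]

strict = [r for r in reviews if r.get("user") is not None]
truthy = [r for r in reviews if r.get("user")]

# The strict filter keeps the empty-object case; the truthiness filter drops it.
only_in_strict = [r for r in strict if r not in truthy]
```

Here `only_in_strict` contains just `{"user": {}}`, so the shorter form is equivalent as long as an empty user object is never expected.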
logger.info(f"Extracted {len(comments)} comments for PR #{pr_number}")
return comments
filtered = [c for c in comments if c.get("user") is not None and c.get("body")]
Ditto.
- filtered = [c for c in comments if c.get("user") is not None and c.get("body")]
+ filtered = [c for c in comments if c.get("user") and c.get("body")]
  "date_reviewed": review.get("submitted_at"),
  "reviewer_email": None, # TODO Placeholder for reviewer email extraction logic
- "reviewer_username": review.get("user", {}).get("login", "None"),
+ "reviewer_username": (review.get("user") or {}).get("login"),
I think the problem behind the error message we saw is not that user was empty, but that review itself was None, so that the initial review.get(...) was effectively a None.get(...), which failed.
shtrom
left a comment
Actually, I didn't consider all options.
I thought review == None, so that review.get raises that error.
However, I now think you may be right about review['user'] == None, in which case your fix is correct.
In any case, I think that calls for some local one-shot runnability to confirm (:
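The ambiguity discussed here can be reproduced in isolation. The sketch below assumes the original failure was an AttributeError on `.get` (the thread does not quote the traceback); `username_old`, `username_new`, and `error_of` are hypothetical helpers wrapping the two expressions from the diff.

```python
def username_old(review):
    # Pre-fix expression: the {} default only applies when the key is absent.
    return review.get("user", {}).get("login", "None")


def username_new(review):
    # Post-fix expression: `or {}` also covers a key that is present but null.
    return (review.get("user") or {}).get("login")


def error_of(fn, arg):
    """Return the AttributeError message raised by fn(arg), or None if it succeeds."""
    try:
        fn(arg)
    except AttributeError as exc:
        return str(exc)
    return None


# Case A: the review itself is None.
err_review_none = error_of(username_old, None)
# Case B: the review is a dict whose "user" key is explicitly null.
err_user_none = error_of(username_old, {"user": None})

# The fix resolves case B but not case A.
fixed_user_none = error_of(username_new, {"user": None})
fixed_review_none = error_of(username_new, None)
```

Both cases fail on a `NoneType` with no `get` attribute, so the message alone cannot tell them apart; only case B is covered by the `or {}` change, which is why a one-shot run against the offending PR would settle it.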