Skip to content

[TD] Script to determine which reverts are caused by bad TD #6911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

clee2000
Copy link
Contributor

@clee2000 clee2000 commented Jul 9, 2025

Honestly a pretty messy script

Determines a commit was caused by bad TD, basically:

  1. find revert
  2. find originally merged commit
  3. find last commit on the PR before the merge
  4. check merged commit for failures
  5. check if failure was excluded by TD on the last commit on the PR

Obviously not perfect, but spot checking it does seem to be ok

Counts are pretty volatile if granularity is too small (week granularity is very volatile), so I'm not sure if this can really be displayed. Maybe should make this periodic and still upload to clickhouse? idk

End of the output looks like this but theres a lot of extra output before it to help debug

CAUSED BY BAD TD: 27 / 184 = 14.67%
Unable to check (lack run id) on PR: 1 / 184 = 0.54%
Total caused by bad TD: 27 / 184 = 14.67%
Month 674: 13 bad TD / 77 total = 16.88%
Month 675: 14 bad TD / 107 total = 13.08%

Also make clickhouse.py client able to be used in thread pool executor (http client didn't like having 1 http client for multiple threads i think)

@pytorch-bot pytorch-bot bot added the ci-no-td label Jul 9, 2025
Copy link

vercel bot commented Jul 9, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Visit Preview Aug 5, 2025 10:36pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 9, 2025
@clee2000 clee2000 marked this pull request as ready for review July 9, 2025 21:45
@clee2000 clee2000 requested a review from a team July 9, 2025 21:45
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@zxiiro zxiiro requested a review from Copilot August 7, 2025 15:50
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a script to analyze reverts and determine which were caused by bad TD (Test Dependency) exclusions, along with making the ClickHouse client thread-safe.

Key changes:

  • Adds a comprehensive script to analyze Git commit history and correlate reverts with TD exclusions
  • Removes the @lru_cache decorator from get_clickhouse_client() to enable thread-safe usage
  • Implements parallel processing using ThreadPoolExecutor for performance optimization

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
tools/torchci/td/get_reverts_caused_by_td.py New script that analyzes Git history, queries ClickHouse for job failures, and determines if reverts were caused by bad TD exclusions
tools/torchci/clickhouse.py Removes @lru_cache decorator from get_clickhouse_client() to make it thread-safe for concurrent usage

month_groups[month] = (0, 0)
month_groups[month] = (month_groups[month][0] + 1, month_groups[month][1])
for commit in commits_reverted:
month = commit.timestamp_of_merge // (30 * 24 * 60 * 60)
Copy link
Preview

Copilot AI Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line uses timestamp_of_merge but the loop is iterating over commits_reverted, not caused_by_bad_td like the previous loop. This will incorrectly group all reverted commits by their merge timestamp instead of revert timestamp for the month calculation.

Suggested change
month = commit.timestamp_of_merge // (30 * 24 * 60 * 60)
month = commit.timestamp_of_revert // (30 * 24 * 60 * 60)

Copilot uses AI. Check for mistakes.

failed_test=get_test_file(line),
)
)
del futures
Copy link
Preview

Copilot AI Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Explicit deletion of futures is unnecessary as it will be garbage collected when it goes out of scope. This adds clutter without benefit.

Suggested change
del futures

Copilot uses AI. Check for mistakes.

f"found last pr sha != alt, {commit.last_pr_sha} != {alt_last_pr_sha[0]}"
)
bad += 1
if commit.last_pr_sha is None:
Copy link
Preview

Copilot AI Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If alt_last_pr_sha[0] is empty string (when no matching commit is found), setting commit.last_pr_sha to an empty string could cause issues in subsequent queries. Consider checking if alt_last_pr_sha[0] is not empty before assignment.

Suggested change
if commit.last_pr_sha is None:
if commit.last_pr_sha is None and alt_last_pr_sha[0]:

Copilot uses AI. Check for mistakes.

x.revert_commit_sha,
x.merge_commit_sha,
x.merge_commit_sha_prev,
x.last_pr_sha,
Copy link
Preview

Copilot AI Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] This commented-out line should either be removed if it's not needed or properly documented with a comment explaining why it's disabled.

Suggested change
x.last_pr_sha,
x.last_pr_sha,
# Disabled: revert_commit_sha_prev is not always available or needed for this analysis

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-no-td CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants