Commit e9bc36e

[autorevert] implement autorevert and fix detection logic (#6983)
### Summary

- Implemented revert detection/recording
- Implemented failure-only rule matching in the autorevert detector to prevent “success” jobs with a classification label from contaminating pattern detection
- Added a unit test

### Bug Fixed

- Cause: The detector previously matched on `classification_rule` regardless of job `conclusion`. Baseline commit `33ec6e3` had multiple “success” shards labeled with `rule='pytest failure'`, which the detector misread as “older commit already has the same failure,” suppressing the pattern for `bbc0df1`/`4fd5fab`.
- Fix: Require `conclusion == 'failure'` wherever the detector compares rules (both for newer commit confirmation and older baseline exclusion). This prevents noise from success+rule rows and correctly flags commit-caused failures like the ROCm case.

### Testing

<details>
<summary>python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run</summary>

```
python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-08-04T08:56:25.851470...
Found 161 commits with job data for workflow 'rocm'
✓ 3 AUTOREVERT PATTERNS DETECTED

Pattern #1:
Failure rule: 'pytest failure'
Recent commits with failure: bdb07a2b 8085edc8
Older commit without failure: 41081276
✗ NOT REVERTED: 8085edc8f9c98f670f585586b4286a942927537a was not reverted
⟳ DRY RUN: Would restart rocm for 8085edc8
⟳ DRY RUN: Would restart rocm for 41081276

Pattern #2:
Failure rule: 'pytest failure'
Recent commits with failure: 908c5cc4 b6c53383
Older commit without failure: 33ec6e3e
✗ NOT REVERTED: b6c53383fe2f29e6ed35430e90867dbeb8980d42 was not reverted
⟳ DRY RUN: Would restart rocm for b6c53383
⟳ DRY RUN: Would restart rocm for 33ec6e3e

Pattern #3:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours

==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): rocm
Timeframe: 82 hours
Commits checked: 161
Auto revert patterns detected: 3
Actual reverts inside auto revert patterns detected (precision): 1 (33.3%)
Total revert commits in period: 9

Revert categories:
  nosignal: 5 (55.6%)
  ignoredsignal: 2 (22.2%)
  ghfirst: 2 (22.2%)

Total reverts excluding ghfirst: 7
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 6 (85.7%)

Per workflow precision:
  rocm: 1 reverts out of 3 patterns (33.3%) [excluding ghfirst: 1 (33.3%)]

Reverted patterns:
  - pytest failure: bbc0df10 (nosignal)

Restarted workflows: 4
  - rocm for 8085edc8
  - rocm for 41081276
  - rocm for b6c53383
  - rocm for 33ec6e3e
```

</details>

the actual culprit was correctly identified:

```
Pattern #7:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours
```

there are multiple patterns detected, because the failure was jumping across **workflows**: rocm and rocm-mi300

---------

Co-authored-by: Jean Schmidt <[email protected]>
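The cause/fix described under "Bug Fixed" can be illustrated with a minimal sketch. This is not the detector's actual code: `Job`, `rules_any_conclusion`, `failing_rules`, and `is_pattern` are hypothetical names invented for illustration, and the job data is made up. It only shows why matching on `classification_rule` alone lets a "success" shard in the baseline suppress a pattern, and how requiring `conclusion == 'failure'` avoids that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical, simplified stand-in for one job row of a commit.
    name: str
    conclusion: str           # e.g. "success" or "failure"
    classification_rule: str  # e.g. "pytest failure"; "" when unclassified


def rules_any_conclusion(jobs):
    """Behaviour described as the bug: collect rules regardless of conclusion."""
    return {j.classification_rule for j in jobs if j.classification_rule}


def failing_rules(jobs):
    """Behaviour described as the fix: only failed jobs contribute rules."""
    return {
        j.classification_rule
        for j in jobs
        if j.conclusion == "failure" and j.classification_rule
    }


def is_pattern(rule, newer, suspect, older, rules_of):
    """Two recent commits show the rule and the older baseline does not."""
    return (
        rule in rules_of(suspect)
        and rule in rules_of(newer)
        and rule not in rules_of(older)
    )


# Made-up data: the baseline has a *successful* shard that still carries a rule label.
older = [Job("test (shard 1)", "success", "pytest failure")]
suspect = [Job("test (shard 1)", "failure", "pytest failure")]
newer = [Job("test (shard 1)", "failure", "pytest failure")]

suppressed = is_pattern("pytest failure", newer, suspect, older, rules_any_conclusion)
detected = is_pattern("pytest failure", newer, suspect, older, failing_rules)
print(suppressed, detected)
```

With the conclusion check applied on both sides (newer-commit confirmation and older-baseline exclusion), the labelled success shard no longer counts as "the older commit already has the same failure".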
1 parent bcc20e4 commit e9bc36e

6 files changed

+401
-83
lines changed

aws/lambda/pytorch-auto-revert/Makefile

Lines changed: 5 additions & 1 deletion
@@ -21,9 +21,13 @@ venv/bin/lintrunner: venv/bin/python
 run-local: venv/bin/python
 	venv/bin/python -m pytorch_auto_revert
 
+.PHONY: run-local-dry
+run-local-dry: venv/bin/python
+	venv/bin/python -m pytorch_auto_revert --dry-run
+
 .PHONY: run-local-workflows
 run-local-workflows: venv/bin/python
-	venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 4320 --ignore-common-errors
+	venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 4380 --ignore-common-errors
 
 deployment.zip:
 	mkdir -p deployment

aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py

Lines changed: 12 additions & 5 deletions
@@ -62,6 +62,11 @@ def get_opts() -> argparse.Namespace:
         type=int,
         default=int(os.environ.get("GITHUB_INSTALLATION_ID", "0")),
     )
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Show what would be restarted without actually doing it (use with --do-restart)",
+    )
 
     # no subcommand runs the lambda flow
     subparsers = parser.add_subparsers(dest="subcommand")
@@ -91,9 +96,9 @@ def get_opts() -> argparse.Namespace:
         help="Actually restart workflows for detected autorevert patterns",
     )
     workflow_parser.add_argument(
-        "--dry-run",
+        "--do-revert",
         action="store_true",
-        help="Show what would be restarted without actually doing it (use with --do-restart)",
+        help="When restarts complete and secondary pattern matches, log REVERT",
     )
     workflow_parser.add_argument(
         "--ignore-common-errors",
@@ -173,18 +178,20 @@ def main(*args, **kwargs) -> None:
                 "inductor",
                 "linux-binary-manywheel",
             ],
+            do_restart=True,
+            do_revert=False,
             hours=2,
             verbose=True,
-            do_restart=True,
-            dry_run=False,
+            dry_run=opts.dry_run,
             ignore_common_errors=True,
         )
     elif opts.subcommand == "autorevert-checker":
         autorevert_checker(
             opts.workflows,
+            do_restart=opts.do_restart,
+            do_revert=opts.do_revert,
             hours=opts.hours,
             verbose=opts.verbose,
-            do_restart=opts.do_restart,
             dry_run=opts.dry_run,
             ignore_common_errors=opts.ignore_common_errors,
         )
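The hunks above move `--dry-run` from the `autorevert-checker` subparser to the top-level parser. The sketch below is a simplified stand-in for `get_opts()`, not the real function: it keeps only the flags visible in this diff, and `build_parser` is a hypothetical name. It shows how standard `argparse` treats a flag that is defined only on the top-level parser: it is accepted before the subcommand name (and without any subcommand), while the same flag placed after the subcommand is left unrecognized by the subparser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Simplified parser: only the arguments that appear in this diff.
    parser = argparse.ArgumentParser(prog="pytorch_auto_revert")
    parser.add_argument("--dry-run", action="store_true")  # now top-level
    subparsers = parser.add_subparsers(dest="subcommand")
    checker = subparsers.add_parser("autorevert-checker")
    checker.add_argument("workflows", nargs="+")
    checker.add_argument("--do-restart", action="store_true")
    checker.add_argument("--do-revert", action="store_true")
    return parser


parser = build_parser()

# Flag before the subcommand: handled by the top-level parser.
before = parser.parse_args(
    ["--dry-run", "autorevert-checker", "rocm", "--do-restart"]
)

# Flag after the subcommand: the subparser does not define it, so it is
# returned as a leftover (parse_args would exit with an error instead).
after, leftover = parser.parse_known_args(
    ["autorevert-checker", "rocm", "--do-restart", "--dry-run"]
)

# No subcommand at all (the lambda flow in main()): the flag still parses.
no_sub = parser.parse_args(["--dry-run"])

print(before.dry_run, after.dry_run, leftover, no_sub.subcommand)
```

Under these assumptions, the new `run-local-dry` Makefile target (`python -m pytorch_auto_revert --dry-run`) corresponds to the `no_sub` case, and a dry run of the checker would place the flag ahead of `autorevert-checker`. Whether the real CLI behaves identically depends on parts of `get_opts()` not shown in this diff.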
