[autorevert] implement autorevert and fix detection logic #6983

Merged
merged 7 commits into main from autorevert-do-autorevert on Aug 12, 2025

Conversation

@izaitsevfb (Contributor) commented Aug 8, 2025

Summary

  • Implemented revert detection/recording
  • Implemented failure-only rule matching in the autorevert detector to prevent “success” jobs with a classification label from contaminating pattern detection
  • Added a unit test

Bug Fixed

  • Cause: The detector previously matched on classification_rule regardless of
    job conclusion. Baseline commit 33ec6e3 had multiple “success” shards labeled
    with rule='pytest failure', which the detector misread as “older commit already
    has the same failure,” suppressing the pattern for bbc0df1/4fd5fab.
  • Fix: Require conclusion == 'failure' wherever the detector compares rules (both
    for newer-commit confirmation and older-baseline exclusion). This prevents noise
    from success+rule rows and correctly flags commit-caused failures like the
    ROCm case. A minimal sketch of this check follows below.
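
A minimal sketch of the failure-only matching described above, assuming a hypothetical job-row shape and helper name (not the PR's actual code):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRow:
    # Hypothetical shape of a job record; the field names are assumptions.
    name: str
    conclusion: str                      # e.g. 'success' or 'failure'
    classification_rule: Optional[str]   # e.g. 'pytest failure'


def has_matching_failure(jobs: List[JobRow], rule: str) -> bool:
    """Count a rule match only when the job actually failed.

    Requiring conclusion == 'failure' keeps 'success' shards that still carry a
    classification_rule label from being treated as a matching failure.
    """
    return any(
        j.conclusion == "failure" and j.classification_rule == rule for j in jobs
    )


# With a rule-only check, a baseline like 33ec6e3 (success shards labeled
# rule='pytest failure') appears to "already have" the failure and suppresses
# the pattern; with the failure-only check it no longer does.
baseline = [JobRow("rocm / test (shard 1)", "success", "pytest failure")]
suspect = [JobRow("rocm / test (shard 1)", "failure", "pytest failure")]
assert not has_matching_failure(baseline, "pytest failure")
assert has_matching_failure(suspect, "pytest failure")
```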

Testing

python -m pytorch_auto_revert autorevert-checker rocm --hours 82 --do-restart --dry-run
Fetching workflow data for 1 workflows since 2025-08-04T08:56:25.851470...
Found 161 commits with job data for workflow 'rocm'
✓ 3 AUTOREVERT PATTERNS DETECTED

Pattern #1:
Failure rule: 'pytest failure'
Recent commits with failure: bdb07a2b 8085edc8
Older commit without failure: 41081276
✗ NOT REVERTED: 8085edc8f9c98f670f585586b4286a942927537a was not reverted
  ⟳ DRY RUN: Would restart rocm for 8085edc8
  ⟳ DRY RUN: Would restart rocm for 41081276

Pattern #2:
Failure rule: 'pytest failure'
Recent commits with failure: 908c5cc4 b6c53383
Older commit without failure: 33ec6e3e
✗ NOT REVERTED: b6c53383fe2f29e6ed35430e90867dbeb8980d42 was not reverted
  ⟳ DRY RUN: Would restart rocm for b6c53383
  ⟳ DRY RUN: Would restart rocm for 33ec6e3e

Pattern #3:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours

==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): rocm
Timeframe: 82 hours
Commits checked: 161
Auto revert patterns detected: 3
Actual reverts inside auto revert patterns detected (precision): 1 (33.3%)
Total revert commits in period: 9

Revert categories:
  nosignal: 5 (55.6%)
  ignoredsignal: 2 (22.2%)
  ghfirst: 2 (22.2%)

Total reverts excluding ghfirst: 7
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (recall): 6 (85.7%)
Per workflow precision:
  rocm: 1 reverts out of 3 patterns (33.3%) [excluding ghfirst: 1 (33.3%)]

Reverted patterns:
  - pytest failure: bbc0df10 (nosignal)

Restarted workflows: 4
  - rocm for 8085edc8
  - rocm for 41081276
  - rocm for b6c53383
  - rocm for 33ec6e3e

The actual culprit was correctly identified:

Pattern #7:
Failure rule: 'pytest failure'
Recent commits with failure: 4fd5fabe bbc0df10
Older commit without failure: efc4b460
✓ REVERTED (nosignal): bbc0df1094b5a4dcd2cce83f8402127b07913231 was reverted by 41081276 after 18.5 hours

There are multiple patterns detected because the failure was jumping across workflows: rocm and rocm-mi300.

@pytorch-bot bot added the ci-no-td label on Aug 8, 2025

@meta-cla bot added the CLA Signed label on Aug 8, 2025
@izaitsevfb force-pushed the autorevert-do-autorevert branch from 2891ee8 to 1508f12 on August 8, 2025 01:57
@@ -230,6 +256,25 @@ def _find_last_commit_with_job(
)
return None, None

def _find_last_commit_with_rule(
Contributor

this function is not called anywhere, what is its purpose?

@@ -74,13 +84,24 @@ def __init__(
self._workflow_commits_cache: Dict[str, List[CommitJobs]] = {}
self._commit_history = None
self._ignore_classification_rules = ignore_classification_rules or set()
# Controls whether queries target restarted runs only (workflow_dispatch/tagged trunk/<sha>)
self._use_restarted_runs_only = False
Contributor

Why is this a variable, if it never changes?

query = """
base_where = (
"workflow_event = 'workflow_dispatch' AND head_branch LIKE 'trunk/%'"
if self._use_restarted_runs_only
Contributor

condition is never true

@@ -311,25 +359,61 @@ def detect_autorevert_pattern_workflow(self, workflow_name: str) -> List[Dict]:
)

if not last_commit_with_same_job or not last_same_jobs:
# No older commit with the same job found
# No older commit with any jobs found
Contributor

The function _find_last_commit_with_job is expected to only return the commit with the given job, not any job.

If a commit ran some jobs, but not the specific one being checked via job_name, it should be skipped in favor of finding the next one.

So, I believe the fix in the comment here is misplaced.

If there is a bug, maybe we should fix it in the function itself? But I re-read it and could not pinpoint a problem.

https://github.com/pytorch/test-infra/pull/6983/files#diff-7174b1a731f38e3efb2765fac87a65ce7835d26a68cccd9c1e329e0b2070f1e2R234

continue

# Ensure there is some overlap in job coverage between suspected and older commit
older_coverage = list(
Contributor

this can only happen because you removed the check in https://github.com/pytorch/test-infra/pull/6983/files#diff-7174b1a731f38e3efb2765fac87a65ce7835d26a68cccd9c1e329e0b2070f1e2L285

If you keep the check, it is guaranteed that the job_name is present in both commits. This is due to checks in _find_last_commit_with_job.

Contributor Author

the check is different (older commit vs newer commit), but your point is correct

self, workflow_name: str, sha: str
) -> Optional[CommitJobs]:
"""Return CommitJobs for a workflow and head_sha if present in cache."""
for cj in self.get_workflow_commits(workflow_name):
Contributor

This loop happens inside another loop; I suspect that when running it for a long period, like 2 years for evaluation, this would be a significant bottleneck in the code due to its quadratic nature.

Maybe the search here could leverage a map...?
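
For illustration, a hedged sketch of the map-based lookup suggested here; the function and variable names are assumptions, and it assumes full-sha (not prefix) lookups:

```python
from typing import Dict, Iterable


def build_sha_index(commits: Iterable["CommitJobs"]) -> Dict[str, "CommitJobs"]:
    """Hypothetical one-time index keyed by head_sha.

    Built once per workflow and reused, it replaces the linear scan inside the
    outer commit loop with an O(1) dict lookup.
    """
    return {cj.head_sha: cj for cj in commits}


# Sketch of usage at the call site (names are assumptions):
#   index = build_sha_index(self.get_workflow_commits(workflow_name))
#   commit_jobs = index.get(sha)
```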

Contributor Author

trie would be even better for prefix search, but I don't think that would be a bottleneck

Contributor

I'll remove this function; it is not being used anywhere.

# No overlapping jobs -> insufficient comparable signal
continue

# Cross-workflow baseline check: if multiple workflows were provided,
Contributor

can you post stats with and without this? I suspect this might be overkill and removing lots of possible commits from the list.

But need to do other fixes before running it...

@izaitsevfb (Contributor Author) Aug 11, 2025

no point, it's slop, removed

# Secondary verification: compare first failing vs previous on restarted runs.
if do_revert:
# Best-effort; skip if query fails or restarted runs not yet present
with suppress(Exception):
Contributor

OMG

I do not believe we're suppressing all exceptions and not printing anything when one occurs, in production code.

Smells like vibe coding, and using old versions of llama :P
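
For comparison, a hedged sketch of keeping the best-effort behavior while still surfacing the error; the function below is a placeholder, not the PR's actual call:

```python
import logging

logger = logging.getLogger(__name__)


def verify_on_restarted_runs() -> None:
    """Placeholder for the secondary verification step (hypothetical)."""
    raise RuntimeError("restarted runs not yet present")


# Instead of `with suppress(Exception): ...`, record what went wrong and move on:
try:
    verify_on_restarted_runs()
except Exception:
    logger.exception("Secondary verification on restarted runs failed; continuing")
```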

Contributor Author

gpt-5 actually!

@jeanschmidt (Contributor)

Left a few comments; they are major IMO. Let's work on those and evaluate the stats changes before merging this change.

Overall great catch, seems like a problematic bug :P

# Conflicts:
#	aws/lambda/pytorch-auto-revert/pytorch_auto_revert/autorevert_checker.py
@izaitsevfb (Contributor Author)

@jeanschmidt, thanks for the review! I removed all unnecessary changes, preserving the fix and the "do revert" functionality. Please take a look.

@jeanschmidt (Contributor)

I double-checked the stats and they are misleading. Not that the numbers are not relevant, but the names and values displayed don't match.

So I recalculated precision, recall, and F1 scores (what we want).

And kept the other values that use non-ghfirst signals only for inverse recall.

stats are now:

==================================================
SUMMARY STATISTICS
==================================================
Workflow(s): Lint, trunk, pull, inductor, linux-binary-manywheel
Timeframe: 4380 hours
Commits checked: 34040
Auto revert patterns detected: 664
Actual reverts inside auto revert patterns detected (%): 184 (27.7%)
Total revert commits in period: 589

Revert categories:
  nosignal: 209 (35.5%)
  ghfirst: 153 (26.0%)
  uncategorized: 101 (17.1%)
  ignoredsignal: 70 (11.9%)
  weird: 45 (7.6%)
  landrace: 11 (1.9%)

Total reverts excluding ghfirst: 436
Reverts (excluding ghfirst) that dont match any auto revert pattern detected (%): (296) (64.0%)

*********************************************************************
STATS SUMMARY:
 PRECISION: 27.7%
 RECALL: 31.2%
 F1: 29.4%
*********************************************************************

Per workflow precision:
  Lint: 47 reverts out of 67 patterns (70.1%) [excluding ghfirst: 43 (64.2%)]
  trunk: 30 reverts out of 79 patterns (38.0%) [excluding ghfirst: 29 (36.7%)]
  pull: 74 reverts out of 359 patterns (20.6%) [excluding ghfirst: 67 (18.7%)]
  inductor: 31 reverts out of 151 patterns (20.5%) [excluding ghfirst: 29 (19.2%)]
  linux-binary-manywheel: 2 reverts out of 8 patterns (25.0%) [excluding ghfirst: 1 (12.5%)]
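
For reference, the F1 line follows directly from the printed precision and recall; a quick check (not part of the tool's output):

```python
# Rounded values from the summary above; the tool's 29.4% comes from unrounded inputs.
precision, recall = 0.277, 0.312
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 ≈ {f1:.1%}")  # ≈ 29.3% with the rounded inputs
```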

@jeanschmidt merged commit e9bc36e into main on Aug 12, 2025
5 checks passed
@jeanschmidt deleted the autorevert-do-autorevert branch on August 12, 2025 12:18