[autorevert] When checking job status, only expect target job list to be finished, not waiting for all jobs to have finished (#7023)

jeanschmidt · izaitsevfb · web-flow · commit 46a680f6c852 · 2025-08-18T17:41:33.000-07:00
Some of the pending-job checks are overly cautious and can cause
problems when an exotic runner type is queuing. They can hold
autorevert hostage if there is a long queue or an outage in a runner
type that is unrelated to the patterns we're looking for.

## Older commits (baseline)

When checking for patterns, we don't really care if there are pending
jobs on the older commit, as long as they are unrelated to the
identified pattern.
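
The per-job baseline check described above can be sketched roughly as
follows. This is a simplified illustration, not the code in this PR:
`Job` and `base_name` are hypothetical stand-ins for `JobResult` and
`normalize_job_name`, and the job names are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Job:
    # Hypothetical stand-in for JobResult; only the fields used here.
    name: str
    status: str  # e.g. "completed", "queued", "in_progress"


def base_name(name: str) -> str:
    # Assumed, simplified normalization: drop a trailing " (...)" suffix.
    return name.split(" (", 1)[0]


def jobs_with_base_name(jobs: Iterable[Job], job_base_name: str) -> List[Job]:
    # Keep only the jobs whose normalized name matches the pattern's job.
    return [j for j in jobs if base_name(j.name) == job_base_name]


def baseline_is_stable(jobs: Iterable[Job], job_base_name: str) -> bool:
    # Only jobs tied to the pattern must be finished; unrelated jobs
    # may still be queued or running.
    relevant = jobs_with_base_name(jobs, job_base_name)
    return all(j.status == "completed" for j in relevant)


older_commit_jobs = [
    Job("linux-test (shard 1)", "completed"),
    Job("linux-test (shard 2)", "completed"),
    Job("exotic-runner-test (shard 1)", "queued"),
]
```

With this data, a pattern on `linux-test` is not blocked by the queued
`exotic-runner-test` job, while a pattern on `exotic-runner-test`
itself would still wait.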

## Repeated error identification

Already handled correctly: it should retry jobs at the first
opportunity when a single breaking job repeats.

## `confirm_commit_caused_failure_on_restarted`

`has_rule` already checks for a `failure` conclusion, so it is more
correct to ignore jobs that are still pending on the failing commit and
match on the first confirmation. It is still important to require that
the relevant jobs have finished on the base commit, since any job still
running there could finish as a failure, in which case we should not
revert.
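
A minimal sketch of that confirmation rule, assuming the job lists are
already filtered to the suspected job base name. `RestartedJob` and
`confirms_commit_caused_failure` are hypothetical names used only for
illustration, and the rule name in the sample data is made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RestartedJob:
    # Hypothetical stand-in for JobResult; only the fields used here.
    status: str
    conclusion: Optional[str]
    classification_rule: Optional[str]


def has_rule(jobs: Sequence[RestartedJob], rule: str) -> bool:
    # A pending job has no "failure" conclusion yet, so it never matches.
    return any(
        j.classification_rule == rule and j.conclusion == "failure" for j in jobs
    )


def confirms_commit_caused_failure(
    failing_jobs: Sequence[RestartedJob],
    prev_jobs: Sequence[RestartedJob],
    rule: str,
) -> bool:
    # The base commit must be fully finished for the relevant job,
    # because a still-running job there could yet fail the same way.
    if any(j.status != "completed" for j in prev_jobs):
        return False
    # Commit-caused if the failing commit reproduces and the base does not.
    return has_rule(failing_jobs, rule) and not has_rule(prev_jobs, rule)


failing = [
    RestartedJob("completed", "failure", "example-rule"),
    RestartedJob("in_progress", None, None),  # ignored while pending
]
base_done = [RestartedJob("completed", "success", None)]
base_pending = [RestartedJob("in_progress", None, None)]
```

Here `failing` against `base_done` confirms the pattern even though one
restarted job on the failing commit is still running, while `failing`
against `base_pending` does not.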

---------

Co-authored-by: Ivan Zaitsev <ivanzaitsev@fb.com>
diff --git a/aws/lambda/pytorch-auto-revert/Makefile b/aws/lambda/pytorch-auto-revert/Makefile
@@ -27,7 +27,7 @@ run-local-dry: venv/bin/python
 
 .PHONY: run-local-workflows
 run-local-workflows: venv/bin/python
-	venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 4380 --ignore-common-errors
+	venv/bin/python -m pytorch_auto_revert autorevert-checker Lint trunk pull inductor linux-binary-manywheel --hours 4380 --ignore-common-errors --verbose
 
 deployment.zip:
 	mkdir -p deployment
diff --git a/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/autorevert_checker.py b/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/autorevert_checker.py
@@ -49,6 +49,12 @@ def has_pending_jobs(self) -> bool:
     def job_base_names(self) -> Set[str]:
         return self.get_job_base_names()
 
+    def jobs_with_base_name(self, job_base_name: str) -> List[JobResult]:
+        """Get all jobs with a specific normalized base name."""
+        return [
+            j for j in self.jobs if self.normalize_job_name(j.name) == job_base_name
+        ]
+
     def normalize_job_name(self, name: str) -> str:
         """Normalize job name to a stable base for matching across commits.
 
@@ -272,9 +278,6 @@ def detect_autorevert_pattern_workflow(self, workflow_name: str) -> List[Dict]:
         for i in range(1, len(commits) - 1):
             suspected_commit1 = commits[i]
 
-            if suspected_commit1.has_pending_jobs:
-                continue
-
             # Extract unique (classification_rule, normalized job) pairs for failing jobs on the suspected commit
             suspected_failures = {
                 (
@@ -340,8 +343,8 @@ def detect_autorevert_pattern_workflow(self, workflow_name: str) -> List[Dict]:
                     # No older commit with same normalized job name found
                     continue
 
-                # Ensure the oldest commit has stable signal (no running jobs)
-                if last_commit_with_same_job.has_pending_jobs:
+                # Ensure the oldest commit has stable signal for the jobs we care about (no running jobs)
+                if any(j.status != "completed" for j in last_same_jobs):
                     continue
 
                 if any(
@@ -503,30 +506,30 @@ def confirm_commit_caused_failure_on_restarted(self, pattern: Dict) -> bool:
         previous_commit = pattern["older_commit"]
 
         # Fetch restarted jobs for first failing and previous commits
-        failing_jobs = self._fetch_single_commit_jobs(
+        failing_commit_jobs = self._fetch_single_commit_jobs(
             workflow_name, first_failing, restarted_only=True
         )
-        prev_jobs = self._fetch_single_commit_jobs(
+        prev_commit_jobs = self._fetch_single_commit_jobs(
             workflow_name, previous_commit, restarted_only=True
         )
-        if not failing_jobs or not prev_jobs:
+        if not failing_commit_jobs or not prev_commit_jobs:
             return False
 
-        # Pending check
-        if failing_jobs.has_pending_jobs or prev_jobs.has_pending_jobs:
+        failing_suspected_jobs = failing_commit_jobs.jobs_with_base_name(job_base)
+        prev_suspected_jobs = prev_commit_jobs.jobs_with_base_name(job_base)
+        if any(j.status != "completed" for j in prev_suspected_jobs):
+            # Previous commit has pending jobs, cannot confirm
             return False
 
-        def has_rule(cj: CommitJobs, rule: str) -> bool:
+        def has_rule(jobs: Iterable[JobResult], rule: str) -> bool:
             return any(
-                cj.normalize_job_name(j.name) == job_base
-                and j.classification_rule == rule
-                and j.conclusion == "failure"
-                for j in cj.jobs
+                j.classification_rule == rule and j.conclusion == "failure"
+                for j in jobs
             )
 
         # Commit-caused if failing commit reproduces, previous does not
-        return has_rule(failing_jobs, failure_rule) and not has_rule(
-            prev_jobs, failure_rule
+        return has_rule(failing_suspected_jobs, failure_rule) and not has_rule(
+            prev_suspected_jobs, failure_rule
         )
 
     def _get_commits_reverted(self) -> Set[str]: