
Conversation

@sbryngelson
Member

@sbryngelson sbryngelson commented Dec 5, 2025

User description

PR Type

Enhancement, Tests


Description

  • Add comprehensive error handling and validation to benchmark execution

    • Early file existence checks
    • Return code validation after case execution
    • Summary file and target data validation
    • Detailed error reporting with failed case tracking
  • Implement per-test timeout mechanism (1 hour) with signal handling

  • Add early abort logic when failure rate exceeds 30% threshold

  • Improve job monitoring in CI workflows with real-time output streaming

  • Reduce benchmark workflow timeout from 1400 to 480 minutes


Diagram Walkthrough

flowchart LR
  A["Benchmark Execution"] --> B["Early File Validation"]
  A --> C["Run Case with Error Handling"]
  C --> D["Check Return Code"]
  D --> E["Validate Summary File"]
  E --> F["Validate Target Data"]
  F --> G["Add to Results"]
  D -->|Failure| H["Track Failed Cases"]
  E -->|Failure| H
  F -->|Failure| H
  H --> I["Report Failed Cases"]
  I --> J["Raise Exception"]
  
  K["Test Execution"] --> L["Set Timeout Alarm"]
  L --> M["Run Test Case"]
  M --> N["Cancel Alarm"]
  M -->|Timeout| O["Raise TestTimeoutError"]
  
  P["Test Loop"] --> Q["Check Failure Rate"]
  Q -->|Rate >= 30%| R["Abort Early"]

File Walkthrough

Relevant files
Error handling
bench.py
Add comprehensive error handling and validation to benchmarks

toolchain/mfc/bench.py

  • Added early validation to check if benchmark case files exist before
    execution
  • Wrapped case execution in try-except block with comprehensive error
    handling
  • Added return code validation after running each benchmark case
  • Added validation for summary file existence and required target data
    fields
  • Implemented failed case tracking and reporting with detailed error
    messages
  • Added success indicator for completed cases
  • Changed exception handling in diff function to log errors instead of
    silently passing
+85/-16 
test.py
Add timeout mechanism and early abort on high failure rate

toolchain/mfc/test/test.py

  • Added signal import for timeout handling
  • Introduced TEST_TIMEOUT_SECONDS constant (3600 seconds/1 hour) and
    timeout thresholds
  • Created TestTimeoutError exception class for timeout scenarios
  • Implemented timeout_handler function using SIGALRM signal
  • Wrapped entire test case execution in try-except-finally block with
    timeout alarm
  • Added early abort logic that triggers when failure rate exceeds 30%
    after 20 completed tests
  • Added detailed error messages for timeout scenarios with log file
    references
+97/-57 
Configuration changes
bench.yml
Improve job monitoring with real-time output and error handling

.github/workflows/bench.yml

  • Updated job name to include interface parameter in display name
  • Reduced timeout-minutes from 1400 to 480 minutes
  • Replaced simple background job execution with comprehensive
    submit_and_monitor function
  • Added job ID extraction and output file monitoring with tail streaming
  • Implemented job status checking via squeue and exit code validation
  • Added error handling for job submission failures and non-zero exit
    codes
  • Improved real-time output visibility during benchmark execution
+79/-5   
test.yml
Update job naming and reduce workflow timeout                       

.github/workflows/test.yml

  • Updated self-hosted job name to display cluster and device information
  • Reduced timeout-minutes from 1400 to 480 minutes
+2/-2     


CodeAnt-AI Description

Add per-test 1-hour timeout, fail-fast on high failure rates, stricter benchmark validation, and improved CI bench monitoring

What Changed

  • Tests now abort and report a clear error (with path to the test log) if an individual test hangs for more than 1 hour.
  • Test runner will stop the whole test run early when at least 20 tests have completed and >=30% have failed, failing fast to surface systemic issues.
  • Benchmark runner validates each case file exists up front, captures per-case exit codes, verifies summary files include required metrics (exec and simulation grind), and accumulates and reports failed cases instead of crashing mid-run.
  • CI benchmark job submission now monitors and streams cluster job output, checks job submission and exit status, and fails the workflow when bench jobs or summaries are missing.
  • Workflow job names now include the device interface in the label and overall job timeouts were reduced from 1400 to 480 minutes to limit excessively long CI runs.

Impact

✅ Fewer hung tests
✅ Shorter CI runs on systemic failures
✅ Clearer benchmark failure logs

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • New Features

    • Per-test timeouts with global early-abort on sustained failures
    • Expanded CI variants with dynamic, descriptive job naming
    • Submit-and-monitor benchmark flow with real-time job output streaming and explicit exit-status reporting
  • Bug Fixes

    • Improved validation and clearer error reporting for missing or malformed benchmark outputs
    • More robust handling and propagation of monitored job failures and timeout conditions
  • Refactor

    • Reduced CI job timeouts and streamlined benchmark orchestration

✏️ Tip: You can customize this high-level summary in your review settings.

Copilot AI review requested due to automatic review settings December 5, 2025 16:32

codeant-ai bot commented Dec 5, 2025

CodeAnt AI is reviewing your PR.



Contributor

coderabbitai bot commented Dec 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

CI workflows shorten timeouts and make job names matrix-driven; bench workflow now submits SLURM jobs and monitors them via a new script; bench tooling adds per-case validation and failure aggregation; tests gain per-test timeouts, global abort signaling, and failure-rate–based early abort.

Changes

  • Benchmark workflow (.github/workflows/bench.yml): Job name now includes a conditional interface suffix; timeout reduced from 1400m to 480m; replaces background submissions with a submit_and_monitor() flow that submits SLURM jobs, extracts job IDs, runs PR and master submissions in parallel, waits for PIDs, and propagates failures.
  • Test workflow matrix (.github/workflows/test.yml): Self-hosted job name made matrix-driven with cluster_name, device, and a conditional interface suffix; timeout reduced to 480m; generic matrix replaced by explicit include entries for Phoenix/Frontier variants with device, interface, lbl, and cluster_name.
  • SLURM monitoring script (.github/scripts/monitor_slurm_job.sh): New Bash script that waits for the job output file, tails it while polling squeue (with retry/backoff), falls back to sacct/scontrol for terminal state and exit code, stabilizes the output file before finishing, prints final output, and exits according to the job exit code.
  • Benchmark validation & error handling (toolchain/mfc/bench.py): Adds early existence checks for case files, failed_cases aggregation, per-case try/except handling for non-zero runs and missing summaries, detailed per-target validations, enhanced error logging (with tracebacks), and raising an exception if any cases failed.
  • Test timeouts & early abort (toolchain/mfc/test/test.py): Introduces abort_tests (threading.Event()), constants MIN_CASES_BEFORE_ABORT=20, FAILURE_RATE_THRESHOLD=0.3, TEST_TIMEOUT_SECONDS=3600, a TestTimeoutError class, per-test threading.Timer timeout handling, a per-silo post-processing helper, coordinated early-abort checks, and aggregated abort reporting.

Sequence Diagram(s)

sequenceDiagram
  participant CI as CI Runner
  participant WF as submit_and_monitor (workflow)
  participant Monitor as monitor_slurm_job.sh
  participant SLURM as SLURM (sbatch / squeue / sacct / scontrol)
  participant FS as Filesystem (output file)

  Note over CI,WF: Parallel submit_and_monitor for PR and Master
  CI->>WF: invoke submit_and_monitor(branch)
  WF->>SLURM: sbatch bench job
  SLURM-->>WF: job_id
  WF->>Monitor: run monitor_slurm_job.sh(job_id, output_file)
  Monitor->>FS: wait for output_file (retries/backoff)
  Monitor->>FS: tail -f output_file
  loop while job running
    Monitor->>SLURM: poll squeue
    SLURM-->>Monitor: state info
    FS-->>Monitor: appended logs
  end
  Monitor->>SLURM: scontrol / sacct for ExitCode (fallback)
  alt ExitCode == 0:0
    Monitor-->>WF: success (0)
  else
    Monitor-->>WF: failure (non-zero)
  end
  WF-->>CI: aggregate statuses and set workflow exit

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus review on:
    • toolchain/mfc/bench.py: per-case try/except flow, failed_cases aggregation, summary/target validation, and final exception behavior.
    • toolchain/mfc/test/test.py: correct start/cancel of threading.Timer, propagation of abort_tests across threads, and accurate failure-rate computation and threshold handling.
    • .github/workflows/bench.yml & .github/scripts/monitor_slurm_job.sh: robust job-id parsing, output-file wait/backoff, squeue/sacct/scontrol fallbacks, and correct interpretation of SLURM ExitCode.
    • .github/workflows/test.yml: correctness of matrix include entries and conditional name interpolation.

Poem

🐰 I hopped into CI with a tiny drum,
I nudged the jobs gently till outputs come,
Timers tick, tails sing, watchers keep guard,
I counted the stumbles and flagged each hard yard,
Hop — logs glow bright and runners hum.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.22%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title check (❓ Inconclusive): The title "benchmarks hardening!" is vague and generic, lacking specificity about the main changes or their nature. Consider a more descriptive title such as "Add timeout, early-abort, and error handling for benchmarks and tests" to clearly convey the key improvements.

✅ Passed checks (1 passed)

  • Description check (✅ Passed): The PR description comprehensively covers changes with detailed context, diagrams, file walkthroughs, and testing guidance, though it deviates from the template structure.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Early file existence check only validates the case path but not writability/creation of the summary and log paths, which could still fail later and be reported as generic errors; consider validating bench output directory and file paths up front.

# Validate case file exists early
if not os.path.exists(case.path):
    raise MFCException(f"Benchmark case file not found: {case.path}")
Timeout Handling

Using SIGALRM for per-test timeout may not work on non-Unix or in threads; also ensure long-running subprocesses are terminated on timeout to avoid orphaned jobs.

TEST_TIMEOUT_SECONDS = 3600

class TestTimeoutError(MFCException):
    pass

def timeout_handler(signum, frame):
    raise TestTimeoutError("Test case exceeded 1 hour timeout")
Monitoring Robustness

Job monitoring relies on parsing squeue/scontrol output and tailing a single file; failures before file creation or differing SLURM formats could cause false negatives—consider stronger checks and timeouts.

- name: Bench (Master v. PR)
  run: |
    set -e

    # Function to submit and monitor
    submit_and_monitor() {
      local dir=$1
      local device=$2
      local interface=$3
      local cluster=$4

      cd "$dir"

      # Submit job
      submit_output=$(bash .github/workflows/$cluster/submit-bench.sh \
        .github/workflows/$cluster/bench.sh $device $interface 2>&1)

      job_id=$(echo "$submit_output" | grep -oP 'Submitted batch job \K[0-9]+' || echo "")
      job_slug="bench-$device-$interface"
      output_file="${job_slug}.out"

      if [ -z "$job_id" ]; then
        echo "ERROR: Failed to submit job"
        echo "$submit_output"
        return 1
      fi

      echo "Submitted batch job $job_id"
      echo "Monitoring output file: $output_file"

      # Wait for file to appear (check job status if it takes a while)
      echo "Waiting for job to start..."
      while [ ! -f "$output_file" ]; do
        # Check if job failed to start
        if ! squeue -j "$job_id" &>/dev/null && [ ! -f "$output_file" ]; then
          echo "ERROR: Job $job_id finished without creating output file"
          return 1
        fi
        sleep 5
      done

      echo "=== Streaming output for job $job_id ==="
      # Stream output while job runs
      tail -f "$output_file" &
      tail_pid=$!

      # Wait for job to complete (will wait up to GitHub Actions timeout)
      while squeue -j "$job_id" &>/dev/null; do
        sleep 5
      done

      # Stop tailing
      kill $tail_pid 2>/dev/null || true

      echo ""
      echo "=== Final output ==="
      cat "$output_file"

      # Check exit status
      exit_code=$(scontrol show job "$job_id" 2>/dev/null | grep -oP 'ExitCode=\K[0-9]+' || echo "0:0")
      if [ "$exit_code" != "0" ] && [ "$exit_code" != "0:0" ]; then
        echo "ERROR: Job $job_id failed with exit code $exit_code"
        return 1
      fi
    }

    # Run both jobs with monitoring
    (submit_and_monitor pr ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}) &
    pr_pid=$!

    (submit_and_monitor master ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}) &
    master_pid=$!

    wait $pr_pid && pr_exit=$? || pr_exit=$?
    wait $master_pid && master_exit=$? || master_exit=$?

    if [ $pr_exit -ne 0 ] || [ $master_exit -ne 0 ]; then
      exit 1
    fi

@codeant-ai codeant-ai bot added the size:L This PR changes 100-499 lines, ignoring generated files label Dec 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot finished reviewing on behalf of sbryngelson December 5, 2025 16:36
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces "benchmarking hardening" features to improve the robustness and reliability of MFC's testing and benchmarking infrastructure. The changes add timeout mechanisms, early failure detection, enhanced error handling, and better job monitoring for SLURM-based workflows.

Key changes:

  • Added per-test timeout (1 hour) and early abort on high failure rates (30% after 20 tests) to the test suite
  • Enhanced benchmark validation with comprehensive error checking and failed case tracking
  • Improved GitHub Actions workflows with better job monitoring and reduced timeout (from 1400 to 480 minutes)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
toolchain/mfc/test/test.py Added signal-based timeout handler (1 hour per test) and early abort logic when failure rate exceeds 30% after 20 tests
toolchain/mfc/bench.py Added comprehensive validation (file existence, return codes, summary validation) and improved error reporting with failed case tracking
.github/workflows/test.yml Updated job name formatting to include interface and reduced timeout from 1400 to 480 minutes
.github/workflows/bench.yml Reduced timeout to 480 minutes and added detailed SLURM job monitoring with output streaming and exit code validation


codeant-ai bot commented Dec 5, 2025

CodeAnt AI finished reviewing your PR.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
.github/workflows/bench.yml (2)

105-164: Consider extracting submit_and_monitor to a standalone script.

This ~60-line function inline in YAML is hard to test and maintain. Extracting it to .github/scripts/monitor_slurm_job.sh would improve reusability and testability. This was also noted in a previous review.


146-149: Transient squeue failures could cause premature exit.

If squeue fails transiently (network blip, scheduler hiccup), the loop exits immediately even though the job may still be running. Consider adding a brief grace period or retry.

             # Wait for job to complete (will wait up to GitHub Actions timeout)
-            while squeue -j "$job_id" &>/dev/null; do
+            squeue_failures=0
+            while true; do
+              if squeue -j "$job_id" &>/dev/null; then
+                squeue_failures=0
+              else
+                squeue_failures=$((squeue_failures + 1))
+                # Allow a few transient failures before concluding job is done
+                if [ $squeue_failures -ge 3 ]; then
+                  break
+                fi
+              fi
               sleep 5
             done
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)

337-339: Remove extraneous f prefix from strings without placeholders.

-                cons.print(f"[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
-                cons.print(f"[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
+                cons.print("[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
+                cons.print("[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
toolchain/mfc/bench.py (1)

112-128: cons.unindent() may be skipped if validation fails without explicit continue.

If an exception is raised inside the try block (line 124-126), cons.unindent() at line 128 is still called. However, if validation_failed is True (line 112), you call cons.unindent() at 114 then continue, skipping line 128—which is correct. The logic is sound but fragile.

Consider using a finally block to guarantee unindent:

         try:
+            # ... existing try block code ...
+        except Exception as e:
+            cons.print(f"[bold red]ERROR[/bold red]: Unexpected error running {case.slug}: {e}")
+            failed_cases.append(case.slug)
+        finally:
             cons.unindent()
-
-        cons.unindent()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c380cd and 8c01b99.

📒 Files selected for processing (4)
  • .github/workflows/bench.yml (3 hunks)
  • .github/workflows/test.yml (1 hunks)
  • toolchain/mfc/bench.py (5 hunks)
  • toolchain/mfc/test/test.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Add and update focused tests for changes; ensure no regressions in golden test outputs without clear justification

Applied to files:

  • toolchain/mfc/bench.py
🧬 Code graph analysis (2)
toolchain/mfc/bench.py (3)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • system (34-41)
  • file_load_yaml (65-70)
toolchain/mfc/state.py (1)
  • ARG (85-93)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
toolchain/mfc/test/test.py (6)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • file_write (44-54)
  • find (165-174)
toolchain/mfc/test/case.py (1)
  • compute_tolerance (248-276)
toolchain/mfc/state.py (2)
  • ARG (85-93)
  • items (33-34)
toolchain/mfc/packer/pack.py (3)
  • has_NaNs (90-96)
  • save (46-88)
  • find (34-35)
toolchain/mfc/packer/tol.py (1)
  • compare (50-97)
toolchain/mfc/test/cases.py (1)
  • modify_example_case (1035-1052)
🪛 Ruff (0.14.7)
toolchain/mfc/bench.py

42-42: Avoid specifying long messages outside the exception class

(TRY003)


124-124: Do not catch blind exception: Exception

(BLE001)


137-137: Avoid specifying long messages outside the exception class

(TRY003)


211-211: Do not catch blind exception: Exception

(BLE001)

toolchain/mfc/test/test.py

36-36: Unused function argument: signum

(ARG001)


36-36: Unused function argument: frame

(ARG001)


37-37: Avoid specifying long messages outside the exception class

(TRY003)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


228-228: Avoid specifying long messages outside the exception class

(TRY003)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Avoid specifying long messages outside the exception class

(TRY003)


256-256: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


283-283: Avoid specifying long messages outside the exception class

(TRY003)


295-299: Avoid specifying long messages outside the exception class

(TRY003)


338-338: f-string without any placeholders

Remove extraneous f prefix

(F541)


339-339: f-string without any placeholders

Remove extraneous f prefix

(F541)


340-342: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Build & Publish
🔇 Additional comments (12)
.github/workflows/test.yml (3)

96-96: Dynamic job naming is clear and informative.

The conditional formatting nicely handles the interface suffix, making job names like "Georgia Tech | Phoenix (gpu-acc)" or "Oak Ridge | Frontier (cpu)" more descriptive for debugging CI failures.


100-100: Significant timeout reduction from 1400 to 480 minutes.

This is a substantial reduction (from ~23 hours to 8 hours). Verify this is sufficient for the longest-running test configurations, particularly on Frontier with GPU tests.


101-115: Matrix expansion and runner configuration look correct.

The new matrix properly covers CPU and GPU variants across both clusters. The runs-on configuration correctly uses the phoenix group with cluster-specific labels.

.github/workflows/bench.yml (2)

26-26: Job naming and timeout updates align with test.yml.

Consistent naming pattern and timeout reduction across both workflows improves maintainability.

Also applies to: 76-76


173-174: Exit status capture pattern has subtle issues.

The idiom wait $pid && var=$? || var=$? assigns $? on both branches: 0 through the && branch when the process succeeds, and the non-zero status through the || branch when it fails. Its purpose is to capture the exit status without tripping set -e; because both exit variables are checked at the end, failures are still detected correctly.

toolchain/mfc/test/test.py (2)

294-301: Good use of try/except/finally for alarm cleanup.

The finally block ensures the alarm is always cancelled, preventing spurious timeouts on subsequent tests. The exception chaining with from exc preserves the original traceback.


26-37: Timeout constants and exception class are well-defined.

The 1-hour timeout and 30% failure threshold are reasonable defaults. Consider making these configurable via command-line arguments for flexibility in different environments.

toolchain/mfc/bench.py (5)

40-42: Good early validation of benchmark case files.

Failing fast with a clear error message is better than failing deep in the execution with a cryptic error.


64-128: Robust error handling with comprehensive validation.

The try/except block properly handles:

  • Non-zero return codes
  • Missing summary files
  • Missing target data
  • Unexpected exceptions

The structure ensures failed_cases is always updated on failure. However, there's a potential issue with cons.unindent() placement.


124-126: Broad exception catch is acceptable here for robustness.

In benchmark orchestration code, catching all exceptions ensures one failing case doesn't abort the entire benchmark run. The error is logged with context, and the case is tracked as failed.


131-137: Clear failure summary with actionable output.

Listing all failed cases at the end and raising with the count helps quickly identify systemic issues. Good practice for CI diagnostics.


211-213: Improved error logging in diff() replaces silent pass.

This is a significant improvement—previously, speedup computation failures were silently ignored. Now they're logged with context and properly set the error flag.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
toolchain/mfc/test/test.py (3)

202-204: SIGALRM in worker threads is still a critical issue.

This has been flagged in previous reviews but appears unaddressed. The test scheduler uses WorkerThread (see sched.py), and signal.signal()/signal.alarm() only work in the main thread—calling from worker threads raises ValueError: signal only works in main thread.

Consider one of these approaches:

  1. Guard with a thread check and skip timeout in worker threads
  2. Use threading.Timer with a flag-based approach
  3. Move timeout handling to the subprocess level with subprocess.run(timeout=...)

264-266: Directory existence check still missing.

This was flagged in previous reviews. os.listdir() will raise FileNotFoundError if the silo_hdf5/p0 directory doesn't exist (which may happen for test cases that don't produce silo outputs).

+            silo_dir = os.path.join(case.get_dirpath(), 'silo_hdf5', 'p0')
+            if not os.path.isdir(silo_dir):
+                # No silo outputs for this case; skip post-processing check
+                pass
+            else:
-            for silo_filepath in os.listdir(os.path.join(case.get_dirpath(), 'silo_hdf5', 'p0')):
+                for silo_filepath in os.listdir(silo_dir):

342-344: Exception propagation from worker task may break scheduler.

As noted in previous reviews, raising MFCException here propagates through sched.sched, potentially leaving worker threads in an undefined state. Consider using a shared abort flag that the scheduler checks between tasks for a cleaner shutdown.
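A shared abort flag of the kind suggested here might be sketched as follows, reusing the threshold constants the PR describes; the function names are illustrative, not the toolchain's API:

```python
import threading

# Thresholds matching the PR description
MIN_CASES_BEFORE_ABORT = 20
FAILURE_RATE_THRESHOLD = 0.3

abort_tests = threading.Event()

def record_result(completed, failed):
    """After each test, signal abort rather than raising inside a worker."""
    if completed >= MIN_CASES_BEFORE_ABORT and failed / completed >= FAILURE_RATE_THRESHOLD:
        abort_tests.set()

def should_run_next():
    """The scheduler polls this between tasks for an orderly shutdown."""
    return not abort_tests.is_set()
```

Because the event is only ever set (never raised across threads), workers finish their current task and the scheduler stops dispatching new ones, avoiding the undefined state this comment warns about.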

🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)

340-341: Remove unnecessary f-string prefixes.

These strings have no placeholders, so the f prefix is extraneous.

-                    cons.print(f"[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
-                    cons.print(f"[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
+                    cons.print("[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
+                    cons.print("[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c01b99 and 2e1eb87.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (2)
toolchain/mfc/common.py (1)
  • MFCException (31-32)
toolchain/mfc/state.py (1)
  • ARG (85-93)
🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

36-36: Unused function argument: signum (ARG001)
36-36: Unused function argument: frame (ARG001)
37-37, 224-224, 228-228, 231-231, 239-239, 256-256, 270-270, 277-277, 280-280, 283-283, 295-299, 342-344: Avoid specifying long messages outside the exception class (TRY003)
340-340, 341-341: f-string without any placeholders; remove extraneous f prefix (F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
toolchain/mfc/test/test.py (2)

26-38: LGTM on timeout and abort constants.

The timeout infrastructure is well-structured. The TestTimeoutError properly extends MFCException, and the constants are reasonable (1-hour timeout, 30% failure threshold after 20 tests).

Note: The unused signum and frame parameters in timeout_handler are required by the signal handler signature, so the Ruff warnings can be safely ignored.
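The abort threshold described here can be sketched as a small predicate (constant names are assumed for illustration, not taken from the source):

```python
# Hedged sketch of the early-abort rule: only abort once enough tests
# have completed, then compare the failure rate to the threshold.
MIN_TESTS_BEFORE_ABORT = 20    # assumed name for the 20-test gate
FAILURE_RATE_THRESHOLD = 0.30  # assumed name for the 30% threshold

def should_abort(num_failed: int, num_completed: int) -> bool:
    """True when the failure rate warrants fast-failing the run."""
    if num_completed < MIN_TESTS_BEFORE_ABORT:
        return False  # too few samples to call the run systemically broken
    return (num_failed / num_completed) >= FAILURE_RATE_THRESHOLD

print(should_abort(5, 19))  # False: gate not reached
print(should_abort(6, 20))  # True: 30% of 20 tests failed
```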


294-301: Good timeout exception handling with cleanup.

The finally block ensures the alarm is always cancelled, preventing lingering signals. The re-raised exception provides helpful context including the log file path.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="toolchain/mfc/test/test.py">

<violation number="1" location="toolchain/mfc/test/test.py:203">
P0: `signal.signal()` and `signal.alarm()` cannot be called from worker threads - they only work in the main thread. Since `_handle_case` is executed via `sched.sched()` which uses `WorkerThread`, this will raise `ValueError: signal only works in main thread`. Consider using `threading.Timer` or a timeout wrapper that works across threads instead.</violation>
</file>
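A minimal sketch of the thread-safe alternative suggested above, using `threading.Timer` to set a flag rather than `signal.alarm()`; the wrapper name `run_with_timeout` is hypothetical. Note this detects an overrun but cannot interrupt a blocked call:

```python
import threading
import time

def run_with_timeout(fn, timeout_s):
    """Run fn() and report whether it exceeded timeout_s.

    threading.Timer works from worker threads, unlike signal.alarm(),
    but it only sets a flag: a blocked fn() still runs to completion.
    """
    timed_out = threading.Event()
    timer = threading.Timer(timeout_s, timed_out.set)
    timer.start()
    try:
        result = fn()
    finally:
        timer.cancel()  # always cancel so no stray callback fires later
    return result, timed_out.is_set()

print(run_with_timeout(lambda: sum(range(10)), 5.0))              # (45, False)
print(run_with_timeout(lambda: time.sleep(0.2) or "slow", 0.05))  # ('slow', True)
```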

<file name=".github/workflows/bench.yml">

<violation number="1" location=".github/workflows/bench.yml:159">
P2: Slurm reports exit codes as `X:Y`, but this regex only captures the first component. If a job finishes with `0:9` (success but terminated by signal), `$exit_code` becomes `0` and the failure check is bypassed, so signaled jobs are treated as success. Capture the full `X:Y` pair (or parse both fields) and ensure any non-`0:0` result fails the workflow.</violation>
</file>
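For illustration, the full-pair check can be expressed as follows (a Python sketch of logic the actual workflow performs in shell; the function name is hypothetical):

```python
import re

def slurm_job_failed(exit_field: str) -> bool:
    """Slurm's ExitCode field is 'X:Y' (exit status : terminating signal).

    Only '0:0' is a clean success; '0:9' means the process exited 0 but
    was killed by signal 9, so both components must be checked.
    """
    m = re.fullmatch(r"(\d+):(\d+)", exit_field.strip())
    if m is None:
        return True  # unparseable output is treated as a failure
    status, sig = (int(g) for g in m.groups())
    return status != 0 or sig != 0

print(slurm_job_failed("0:0"))  # False
print(slurm_job_failed("0:9"))  # True: signaled despite status 0
```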

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="toolchain/mfc/bench.py">

<violation number="1" location="toolchain/mfc/bench.py:127">
P2: The `finally` block will cause double `cons.unindent()` calls. Lines 81 and 89 still have explicit `cons.unindent()` before `continue`, but since `finally` always executes (even on `continue`), the unindent will be called twice in those error paths. Remove the `cons.unindent()` calls at lines 81 and 89, similar to how it was removed for the `validation_failed` case.</violation>
</file>
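The double-cleanup hazard comes from a Python rule worth spelling out: a `finally` block runs even when the loop body leaves via `continue`. A self-contained demonstration:

```python
def cleanup_calls() -> int:
    """Count cleanup invocations across one failing and one passing
    iteration when cleanup happens both before `continue` and in
    `finally` (mirroring the double cons.unindent() issue)."""
    calls = 0
    for failed in (True, False):
        try:
            if failed:
                calls += 1  # explicit cleanup before continue
                continue
        finally:
            calls += 1      # finally also runs on continue
    return calls

print(cleanup_calls())  # 3: the failing iteration cleaned up twice
```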

<file name=".github/scripts/monitor_slurm_job.sh">

<violation number="1" location=".github/scripts/monitor_slurm_job.sh:57">
P2: Exit code defaults to `0:0` (success) if `scontrol` or `grep` fails. This could mask real job failures when job information is unavailable. Consider failing explicitly if exit code cannot be determined.</violation>
</file>
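The fail-explicitly behavior the reviewer asks for can be sketched like this (hypothetical helper name; the real logic lives in the shell script):

```python
from typing import Optional

def resolve_exit_code(raw: Optional[str]) -> str:
    """Return the job's ExitCode field, raising when it cannot be
    determined rather than silently defaulting to success ('0:0')."""
    if raw is None or not raw.strip():
        raise RuntimeError("could not determine Slurm job exit code")
    return raw.strip()

print(resolve_exit_code(" 1:0 "))  # 1:0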


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
toolchain/mfc/bench.py (2)

58-83: Fix double cons.unindent() on error paths to avoid printer stack underflow

In the non-zero return-code and missing-summary branches you call cons.unindent() and then continue, but cons.unindent() is also executed in the finally block. This will unindent twice for failing cases and can corrupt the printer’s indentation stack or raise underflow errors.

Rely on the finally block for unindent and drop the inner calls:

-            if rc != 0:
-                cons.print(f"[bold red]ERROR[/bold red]: Case {case.slug} failed with exit code {rc}")
-                cons.print(f"[bold red]      Check log at: {log_filepath}[/bold red]")
-                failed_cases.append(case.slug)
-                cons.unindent()
-                continue
+            if rc != 0:
+                cons.print(f"[bold red]ERROR[/bold red]: Case {case.slug} failed with exit code {rc}")
+                cons.print(f"[bold red]      Check log at: {log_filepath}[/bold red]")
+                failed_cases.append(case.slug)
+                continue
@@
-            if not os.path.exists(summary_filepath):
-                cons.print(f"[bold red]ERROR[/bold red]: Summary file not created for {case.slug}")
-                cons.print(f"[bold red]      Expected: {summary_filepath}[/bold red]")
-                failed_cases.append(case.slug)
-                cons.unindent()
-                continue
+            if not os.path.exists(summary_filepath):
+                cons.print(f"[bold red]ERROR[/bold red]: Summary file not created for {case.slug}")
+                cons.print(f"[bold red]      Expected: {summary_filepath}[/bold red]")
+                failed_cases.append(case.slug)
+                continue

Also applies to: 124-129


1-10: Move traceback import to the module top to satisfy lint and avoid repeated imports

Pylint is flagging import traceback inside the except block (C0415 Import outside toplevel). Since this code path is not particularly hot, it’s simpler and clearer to import at the top-level and just use it in the handler.

Suggested change:

-import os, sys, uuid, subprocess, dataclasses, typing, math
+import os, sys, uuid, subprocess, dataclasses, typing, math, traceback
@@
-            except Exception as e:
-                import traceback
+            except Exception as e:
                 cons.print(
                     f"[bold red]ERROR[/bold red]: Failed to compute speedup for {target.name} in {slug}: {e}\n"
                     f"{traceback.format_exc()}"
                 )

Also applies to: 211-217

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2e1eb87 and f79f4d1.

📒 Files selected for processing (5)
  • .github/scripts/monitor_slurm_job.sh (1 hunks)
  • .github/workflows/bench.yml (3 hunks)
  • .github/workflows/test.yml (1 hunks)
  • toolchain/mfc/bench.py (5 hunks)
  • toolchain/mfc/test/test.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
.github/workflows/bench.yml (1)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
toolchain/mfc/test/test.py (2)
toolchain/mfc/common.py (4)
  • MFCException (31-32)
  • file_write (44-54)
  • find (165-174)
  • does_command_exist (127-138)
toolchain/mfc/state.py (1)
  • ARG (85-93)
toolchain/mfc/bench.py (2)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • system (34-41)
  • file_load_yaml (65-70)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
🪛 actionlint (1.7.9)
.github/workflows/test.yml

105-105, 109-109, 113-113: label "gt" is unknown (runner-label)
118-118, 122-122, 126-126: label "frontier" is unknown (runner-label)

Available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". If these are custom labels for self-hosted runners, set the list of labels in the actionlint.yaml config file.

🪛 GitHub Actions: Lint Toolchain
toolchain/mfc/test/test.py

[error] 230-230: pylint: R1702 Too many nested blocks (6/5).

toolchain/mfc/bench.py

[error] 212-212: pylint: C0415 Import outside toplevel ('traceback').

🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

184-186, 233-233, 238-238, 246-246, 250-250, 253-253, 261-261, 278-278, 294-294, 301-301, 304-304, 307-307, 319-323: Avoid specifying long messages outside the exception class (TRY003)
233-233, 238-238: Abstract raise to an inner function (TRY301)

toolchain/mfc/bench.py

42-42, 137-137: Avoid specifying long messages outside the exception class (TRY003)
124-124, 211-211: Do not catch blind exception: Exception (BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Build & Publish
🔇 Additional comments (4)
.github/workflows/bench.yml (1)

25-27: Nice extraction of submit/monitor logic and parallel PR/master runs

The new submit_and_monitor helper plus .github/scripts/monitor_slurm_job.sh integration gives you much clearer control over submission errors, job IDs, and exit-code handling, while preserving parallel PR/master execution and avoiding set -e pitfalls in the wait logic. The dynamic job name and reduced timeout also look reasonable.

Also applies to: 76-77, 100-143

.github/workflows/test.yml (1)

95-101: Matrix reshaping and dynamic self-hosted job naming look consistent

Explicit Phoenix/Frontier variants, the lbl-driven labels, and the dynamic name format align cleanly with the later build/test steps and log handling. The reduced timeout is also appropriate for CI hardening, assuming your self-hosted runners are configured with these labels.

Also applies to: 104-129

toolchain/mfc/test/test.py (2)

26-37: Early-abort mechanism via abort_tests event is structured cleanly

The combination of:

  • global abort_tests event,
  • per-case pre-check in handle_case,
  • failure-rate computation gated on a minimum number of completed tests, and
  • a post-scheduler check in test() that surfaces a clear aggregated failure,

gives you a deterministic, thread-safe way to fast‑fail runs with systemic breakage without throwing from worker threads. This is a solid improvement over raising directly inside _handle_case.

Also applies to: 179-187, 333-373


30-41: Timer-based timeout only detects timeouts after case.run returns; add note and fix pylint nesting error

Using threading.Timer with a flag is safe in worker threads, but it can't interrupt a blocked case.run call — if case.run hangs (e.g., stuck MPI job), the worker thread will remain blocked and the timeout will only be noticed after (or if) case.run returns. That means this mechanism detects slow tests but does not strictly bound wall-clock time for truly hung runs.

Two concrete suggestions:

  1. Document/verify behavior of case.run
    Confirm (and, if possible, document) that case.run itself uses a timeout-capable mechanism (e.g., subprocess with timeout= or equivalent), or that hung simulations are otherwise impossible/handled elsewhere; otherwise the 1‑hour timeout expectation is misleading.

  2. Resolve pylint R1702 ("too many nested blocks") cheaply
    Given this function is already structurally complex and you're explicitly disabling other size checks, the quickest unblock is to extend the existing pylint disable:

-# pylint: disable=too-many-locals, too-many-branches, too-many-statements, trailing-whitespace
-def _handle_case(case: TestCase, devices: typing.Set[int]):
+# pylint: disable=too-many-locals, too-many-branches, too-many-statements, too-many-nested-blocks, trailing-whitespace
+def _handle_case(case: TestCase, devices: typing.Set[int]):

If you prefer not to disable, you'd need to extract parts of _handle_case (e.g., golden-file comparison and test_all post-processing) into helpers to reduce nesting.

@sbryngelson sbryngelson requested a review from Copilot December 5, 2025 17:20
@sbryngelson
Copy link
Member Author

/improve

Copilot finished reviewing on behalf of sbryngelson December 5, 2025 17:24
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/bench.yml">

<violation number="1" location=".github/workflows/bench.yml:142">
P1: Bug: `$?` inside the `if` block captures the condition's exit status (0 after successful negation), not the process's actual exit code. This will always leave `pr_exit` as 0 even when the job fails, causing the workflow to incorrectly pass. Use `wait "$pr_pid" || pr_exit=$?` instead.</violation>

<violation number="2" location=".github/workflows/bench.yml:145">
P1: Bug: Same issue - `$?` will be 0 inside the block, not the actual exit code. Use `wait "$master_pid" || master_exit=$?` instead.</violation>
</file>


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)

266-289: Consider subprocess-level timeout for more precise interruption.

The current timeout checks (lines 268-269, 272-274) detect timeouts between execution phases but don't interrupt a hung case.run() call. For 1-hour timeouts this is acceptable, but if you need precise interruption, consider passing a timeout parameter to the underlying subprocess.run() calls within case.run().
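The subprocess-level timeout suggested here might look like the following sketch (`run_case_cmd` is a hypothetical stand-in for the calls inside `case.run()`):

```python
import subprocess
import sys

def run_case_cmd(cmd, timeout_s):
    """Run a command with a hard wall-clock limit: subprocess.run kills
    the child on TimeoutExpired, so a hung case cannot block forever."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, check=False)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        return None, True

print(run_case_cmd([sys.executable, "-c", "pass"], 30))  # (0, False)
```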

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f79f4d1 and b62854c.

📒 Files selected for processing (4)
  • .github/scripts/monitor_slurm_job.sh (1 hunks)
  • .github/workflows/bench.yml (3 hunks)
  • toolchain/mfc/bench.py (6 hunks)
  • toolchain/mfc/test/test.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
.github/workflows/bench.yml (1)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
toolchain/mfc/bench.py (5)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • system (34-41)
  • file_load_yaml (65-70)
toolchain/mfc/state.py (1)
  • ARG (85-93)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
toolchain/mfc/run/run.py (1)
  • run (134-162)
toolchain/mfc/run/input.py (1)
  • generate (109-114)
🪛 Ruff (0.14.7)
toolchain/mfc/bench.py

42-42, 136-136: Avoid specifying long messages outside the exception class (TRY003)
122-122, 210-210: Do not catch blind exception: Exception (BLE001)

toolchain/mfc/test/test.py

185-188, 189-191, 220-220, 226-229, 232-235, 238-241, 269-269, 274-274, 282-282, 286-286, 289-289, 297-297, 314-314, 320-320, 348-352: Avoid specifying long messages outside the exception class (TRY003)
269-269, 274-274, 320-320: Abstract raise to an inner function (TRY301)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Build & Publish
🔇 Additional comments (24)
toolchain/mfc/bench.py (8)

1-1: LGTM: Traceback import added for enhanced error diagnostics.

The traceback import enables full stack traces in exception handlers (lines 124, 213), improving debuggability when benchmarks fail.


40-42: LGTM: Early file validation prevents cryptic failures.

Checking case file existence before execution provides fast, clear feedback when benchmark definitions are missing.


52-52: LGTM: Failure tracking enables comprehensive error reporting.

The failed_cases list allows the benchmark run to continue after individual failures and provides a complete failure summary at the end (lines 130-136).


64-81: LGTM: Robust execution with defensive error handling.

The code properly addresses previous concerns:

  • Line 71 converts ARG('mem') to string, preventing potential TypeError in subprocess calls
  • Lines 75-76 use hasattr to safely handle different return types from system(), avoiding AttributeError

The error messages provide clear diagnostics including exit codes and log file locations.


83-113: LGTM: Thorough post-execution validation ensures data integrity.

The validation sequence properly checks:

  • Summary file existence (lines 83-88)
  • Target presence and required fields (lines 93-109)
  • Specific validation for simulation's grind metric (lines 106-109)

Each failure path provides clear diagnostics and continues to the next case, enabling complete failure visibility.


115-128: LGTM: Robust per-case error isolation with complete diagnostics.

The broad Exception handler (line 122) is appropriate here because each benchmark case should be isolated—failures in one case shouldn't prevent others from running. The handler properly:

  • Logs full tracebacks for debugging (line 124)
  • Marks the case as failed
  • Ensures cleanup via finally (line 127)

129-136: LGTM: Clear failure summary for CI diagnostics.

The final report provides actionable information:

  • Complete list of failed case slugs
  • Failure ratio for quick assessment
  • Raises exception to properly fail the CI workflow

210-215: LGTM: Enhanced diff() error diagnostics.

Replacing the silent pass with explicit error logging including full traceback (line 213) significantly improves debuggability when speedup computations fail. The error is properly propagated via err = 1 (line 215).

toolchain/mfc/test/test.py (7)

26-41: LGTM: Well-designed early-abort mechanism with thread-safe signaling.

The implementation properly addresses concurrency concerns:

  • abort_tests = threading.Event() provides thread-safe coordination (lines 33-37)
  • Custom TestTimeoutError exception separates timeout failures from other errors
  • Public constants make thresholds configurable and self-documenting

This is a significant improvement over raising exceptions directly from worker threads.


179-191: LGTM: Abort logic properly handles edge cases.

The code defensively prevents division by zero:

  • Line 184 checks total_completed > 0 before calculating failure percentage (line 187)
  • Lines 189-191 provide a fallback message when no tests completed

This addresses the past review concern about potential ZeroDivisionError.


214-241: LGTM: Well-structured helper improves code organization.

Extracting silo file processing into a dedicated function improves maintainability. The function properly:

  • Locates h5dump binary with fallback to system PATH
  • Validates execution and output
  • Detects NaN/Infinity values with clear error messages

The directory existence check is appropriately handled by the caller (line 326).


249-264: LGTM: Threading.Timer correctly replaces signal.alarm for worker threads.

The implementation properly addresses the critical issue with signal handling in worker threads:

  • threading.Timer works correctly in worker threads (lines 249-255), unlike signal.alarm()
  • Clear comments explain the design rationale (lines 250-252)
  • Timer is properly cancelled in all paths (line 263 for dry_run, line 354 in finally)

This resolves the past review concern that would have caused ValueError: signal only works in main thread.


316-329: LGTM: Test_all flow properly validated and timeout-protected.

The code addresses previous concerns:

  • Line 319 adds timeout check before the potentially long post-process run
  • Line 326 checks directory existence before listing, preventing FileNotFoundError

340-354: LGTM: Comprehensive timeout error handling with conditional diagnostics.

The exception handler properly addresses the past concern about missing log files:

  • Lines 342-347 check file existence before referencing it
  • Provides contextual error messages for debugging
  • Line 354 ensures timer cleanup in all exit paths (normal, timeout, exception)

389-401: LGTM: Fail-fast logic correctly checks after every test.

The early-abort check after each test (not just failures) is the correct design—it enables the fastest possible detection when the threshold is crossed. The implementation properly:

  • Skips abort checks during dry-run (line 391)
  • Uses thread-safe flag setting instead of raising (line 400), addressing past review concerns
  • Returns gracefully to allow coordinated shutdown
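The flag-based shutdown can be sketched as follows (names are illustrative; the real `_handle_case` does much more):

```python
import threading

abort_tests = threading.Event()  # shared, thread-safe abort flag

def handle_case(case_id: int, results: list) -> None:
    """Worker-side sketch: check the flag on entry and set it on
    systemic failure instead of raising from a worker thread."""
    if abort_tests.is_set():
        return                  # skip remaining cases after an abort
    if case_id == 2:            # pretend this case trips the threshold
        abort_tests.set()
        return
    results.append(case_id)

done = []
for i in range(5):
    handle_case(i, done)
print(done)  # [0, 1]: cases after the abort are skipped
```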
.github/workflows/bench.yml (3)

26-26: LGTM: Enhanced job naming improves CI observability.

The dynamic job name now includes the interface (e.g., "Phoenix (gpu-acc)"), making it easier to identify specific benchmark variants in the GitHub Actions UI.


76-76: LGTM: Timeout reduction aligns with fail-fast objectives.

Reducing timeout from 1400 to 480 minutes (8 hours) is appropriate given the enhanced error detection, per-test timeouts (3600s), and early-abort logic introduced in this PR.


100-153: LGTM: Refactored workflow addresses all portability and reliability concerns.

The implementation properly addresses previous review feedback:

  • Line 117: Uses sed for portable job ID extraction (no -P flag dependency)
  • Line 128: Delegates complex monitoring logic to a dedicated script for reusability
  • Lines 138-153: Reliable exit code capture without shell evaluation pitfalls:
    • Explicit initialization (lines 139-140)
    • Clear if ! wait pattern (lines 142-147)
    • Proper quoting and comparison (lines 150-153)
.github/scripts/monitor_slurm_job.sh (6)

1-17: LGTM: Well-structured script header with proper validation.

The script follows bash best practices:

  • set -e ensures error propagation
  • Clear usage documentation
  • Input validation with helpful error message

18-42: LGTM: Robust retry logic handles transient SLURM failures.

The exponential backoff mechanism (lines 37-40) properly distinguishes transient squeue failures from actual job completion, addressing the past review concern. The script only errors (lines 31-34) if the job is truly gone and no output file exists after multiple retries.


50-77: LGTM: Sophisticated completion detection with sacct fallback.

The implementation comprehensively addresses SLURM monitoring best practices:

  • Uses sacct as fallback after repeated squeue failures (lines 56-69)
  • Only treats terminal states as completion (COMPLETED, FAILED, etc., lines 61-62)
  • Resets counter for non-terminal states to handle transient failures (line 67)
  • Provides graceful degradation when sacct is unavailable (lines 71-72)

This resolves past review concerns about premature completion detection.


79-100: LGTM: File stabilization prevents output truncation.

The stabilization logic (lines 79-97) ensures output is fully flushed before proceeding:

  • Waits for file size to remain constant for 2 consecutive checks (~10s total)
  • Prevents race conditions between SLURM output buffering and log capture

Line 100 safely terminates tail with || true to handle already-exited processes.
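The size-stabilization wait can be sketched in a few lines (a Python rendering of the shell logic; interval and check count are illustrative):

```python
import os
import time

def wait_until_stable(path: str, checks: int = 2, interval_s: float = 0.1) -> int:
    """Block until the file's size is unchanged for `checks` consecutive
    polls, then return that size; guards against buffered output still
    being flushed."""
    last_size, stable = -1, 0
    while stable < checks:
        size = os.path.getsize(path) if os.path.exists(path) else 0
        stable = stable + 1 if size == last_size else 0
        last_size = size
        time.sleep(interval_s)
    return last_size
```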


106-129: LGTM: Robust exit code retrieval with explicit failure handling.

The implementation properly addresses all past review concerns:

  • Line 112: Uses portable grep -oE + cut instead of PCRE-specific -P with \K
  • Lines 115-122: Falls back to sacct for aged-out jobs
  • Lines 124-129: Fails explicitly when exit code cannot be determined, preventing silent success on errors

This is a significant reliability improvement over defaulting to "0:0".


131-138: LGTM: Clear final exit status validation.

The script properly validates only "0:0" as success (line 132), rejecting any signal or non-zero exit code, and provides clear success/failure messaging.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
toolchain/mfc/test/test.py (1)

257-263: Timer cleanup is correct, but timeout doesn't interrupt blocking operations.

The finally block ensures the timer is always cancelled. However, the timeout mechanism only checks the flag before and after case.run() calls—it won't interrupt a blocking case.run() that exceeds 1 hour. If case.run() takes 2 hours, the timeout flag will be set after 1 hour, but the process continues until completion. This is a known limitation flagged by previous reviewers.

For true interruption, case.run() would need to use subprocess.run(timeout=...) or check the timeout flag internally.

Also applies to: 369-370

🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)

307-322: Optimize: Reuse already-loaded golden file.

At line 307, you load golden = packer.load(golden_filepath). Then at line 320 (in the else branch), you reload the same file with packer.load(golden_filepath). You can reuse the golden variable instead.

-            err, msg = packtol.compare(pack, packer.load(golden_filepath), packtol.Tolerance(tol, tol))
+            err, msg = packtol.compare(pack, golden, packtol.Tolerance(tol, tol))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc26a0 and 0b31278.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (3)
toolchain/mfc/common.py (4)
  • MFCException (31-32)
  • does_command_exist (127-138)
  • get_program_output (104-106)
  • find (165-174)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
toolchain/mfc/state.py (1)
  • ARG (85-93)
🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

185-188: Avoid specifying long messages outside the exception class

(TRY003)


189-191: Avoid specifying long messages outside the exception class

(TRY003)


218-221: Avoid specifying long messages outside the exception class

(TRY003)


227-227: Avoid specifying long messages outside the exception class

(TRY003)


233-237: Avoid specifying long messages outside the exception class

(TRY003)


240-243: Avoid specifying long messages outside the exception class

(TRY003)


246-249: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Abstract raise to an inner function

(TRY301)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


282-282: Abstract raise to an inner function

(TRY301)


282-282: Avoid specifying long messages outside the exception class

(TRY003)


290-290: Avoid specifying long messages outside the exception class

(TRY003)


294-294: Avoid specifying long messages outside the exception class

(TRY003)


297-297: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


322-322: Avoid specifying long messages outside the exception class

(TRY003)


328-328: Abstract raise to an inner function

(TRY301)


328-328: Avoid specifying long messages outside the exception class

(TRY003)


336-339: Avoid specifying long messages outside the exception class

(TRY003)


364-368: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Build & Publish
🔇 Additional comments (5)
toolchain/mfc/test/test.py (5)

1-1: LGTM: Thread-safe timeout and abort mechanism.

The use of threading.Event for global abort signaling and threading.Timer for per-test timeouts is the correct approach for the worker-thread context. This properly addresses the past review concerns about signal.alarm() not working in non-main threads.

Also applies to: 26-41


342-345: LGTM: Directory existence check prevents errors.

The os.path.isdir(silo_dir) check properly guards against FileNotFoundError when test cases don't produce silo output. This correctly addresses previous review feedback.


356-368: LGTM: Helpful timeout error messages.

The conditional log message (lines 358-363) correctly handles cases where the timeout occurs before out_pre_sim.txt is written, providing clear guidance to users. This addresses previous feedback about potentially misleading error messages.


179-191: LGTM: Thread-safe early abort with proper guards.

The abort mechanism correctly uses threading.Event for cross-thread signaling:

  • Line 380: Pre-check allows graceful early exit
  • Line 416: abort_tests.set() from worker thread avoids raising exceptions that could leave the scheduler in an undefined state
  • Line 184: Guards against division by zero
  • Line 407: Sensibly skips abort check during dry-run

The early abort check runs after every test (not just failures), which is slightly less efficient but ensures rapid detection when the failure rate crosses the threshold.

Also applies to: 378-380, 405-417


214-250: LGTM: Well-structured silo file validation.

The _process_silo_file helper provides clear separation of concerns with appropriate checks:

  • File existence validation before processing
  • h5dump availability checks (both installed and system paths)
  • Return code validation
  • NaN/Infinity detection with actionable error messages

@codeant-ai codeant-ai bot commented Dec 5, 2025

CodeAnt AI is running Incremental review

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (2)

179-192: Abort check logic is sound, though the else branch appears unreachable.

The division at line 185 is safe because abort_tests is only set when total_completed >= MIN_CASES_BEFORE_ABORT (20) at lines 465-472. However, the else branch at lines 189-191 appears technically unreachable given the current logic, though it provides a defensive safety net if the abort mechanism changes in the future.


257-263: Timeout mechanism is correct for threading context, but note it's soft.

The threading.Timer approach properly works in worker threads and the timer is correctly cancelled in the finally block (line 426). The timeout flag checks at lines 276, 281, 327 provide reasonable timeout enforcement between operations.

However, this is a soft timeout that only sets a flag—it won't interrupt a hung subprocess. If case.run() blocks indefinitely (e.g., deadlocked simulation), the timeout won't trigger until the process completes or returns control. For stricter enforcement, consider passing timeout=TEST_TIMEOUT_SECONDS to subprocess.run() within case.run() if it uses subprocess internally.

Also applies to: 276-283, 412-427

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0b31278 and b0eca32.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (4)
toolchain/mfc/common.py (4)
  • MFCException (31-32)
  • does_command_exist (127-138)
  • get_program_output (104-106)
  • find (165-174)
toolchain/mfc/test/case.py (4)
  • TestCase (115-276)
  • get_uuid (154-155)
  • get_uuid (288-289)
  • get_dirpath (157-158)
toolchain/mfc/packer/pack.py (2)
  • has_NaNs (90-96)
  • find (34-35)
toolchain/mfc/packer/tol.py (1)
  • compare (50-97)
🪛 GitHub Actions: Lint Toolchain
toolchain/mfc/test/test.py

[error] 274-274: Pylint: R1702 Too many nested blocks (7/5) in test.py. (Command './mfc.sh lint' failed with exit code 8).

🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

185-188: Avoid specifying long messages outside the exception class

(TRY003)


189-191: Avoid specifying long messages outside the exception class

(TRY003)


218-221: Avoid specifying long messages outside the exception class

(TRY003)


227-227: Avoid specifying long messages outside the exception class

(TRY003)


233-237: Avoid specifying long messages outside the exception class

(TRY003)


240-243: Avoid specifying long messages outside the exception class

(TRY003)


246-249: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Abstract raise to an inner function

(TRY301)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


282-282: Abstract raise to an inner function

(TRY301)


282-282: Avoid specifying long messages outside the exception class

(TRY003)


290-290: Avoid specifying long messages outside the exception class

(TRY003)


294-294: Avoid specifying long messages outside the exception class

(TRY003)


297-297: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Avoid specifying long messages outside the exception class

(TRY003)


322-322: Avoid specifying long messages outside the exception class

(TRY003)


328-328: Abstract raise to an inner function

(TRY301)


328-328: Avoid specifying long messages outside the exception class

(TRY003)


392-395: Avoid specifying long messages outside the exception class

(TRY003)


420-424: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-acc)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-acc)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Build & Publish
🔇 Additional comments (3)
toolchain/mfc/test/test.py (3)

1-41: LGTM! Threading-based approach correctly addresses worker thread constraints.

The use of threading.Event for abort_tests and the new timeout infrastructure using threading.Timer (lines 257-263) correctly addresses past review concerns about signal.alarm() failing in worker threads. This approach works properly with the sched.sched scheduler that runs tests via WorkerThread.


214-250: LGTM! Helper function is well-structured with proper error handling.

The _process_silo_file helper correctly checks file existence (line 217) before processing, handles h5dump lookup gracefully, and provides informative error messages. The string-based NaN/Infinity detection is simple but effective for the use case.


429-475: LGTM! Early abort logic is correctly placed and thread-safe.

The abort check at lines 461-473 correctly runs after each test completes (whether success or failure), allowing proper tracking of the overall failure rate. The placement after both the success path (line 452) and final failure path (line 456) is correct—contrary to a past review suggestion, this should not be moved inside the exception handler.

Setting abort_tests flag (line 472) and returning gracefully (line 473) properly signals other worker threads without raising exceptions that could leave the scheduler in an undefined state. The check at line 434 ensures already-queued tasks exit cleanly when abort is requested.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
toolchain/mfc/test/test.py (1)

312-431: Timeout detection gaps around POST_PROCESS and non-preemptive behavior

The threading.Timer + timeout_flag approach is a good fix for the earlier SIGALRM-in-worker-threads issue, and the timer is correctly cancelled in finally. Two points to consider:

  1. No final timeout check after POST_PROCESS and packing
    For test_all cases, you currently check timeout_flag:

    • Before the PRE_PROCESS+SIMULATION run.
    • After PRE_PROCESS+SIMULATION completes.
    • Before the POST_PROCESS run.

    If POST_PROCESS (or subsequent packing / golden comparison) pushes the total case time beyond TEST_TIMEOUT_SECONDS, the timeout will not be reported. You can add a final check near the end of _handle_case:

        if ARG("test_all"):
            ...
            if cmd.returncode != 0:
                ...
                raise MFCException(
                    f"Test {case}: Failed to execute MFC (post-process). "
                    f"See log at: {out_filepath}"
                )
-        case.delete_output()
+        # Final timeout check in case post-process or packing pushed us over the limit
+        if timeout_flag.is_set():
+            raise TestTimeoutError("Test case exceeded 1 hour timeout")
+
+        case.delete_output()
    
    
    
  2. Timeout is still non-preemptive for hung case.run()
    As earlier reviewers noted, the current pattern can only detect a timeout once case.run(...) returns; it can’t interrupt a truly hung or extremely long-running subprocess. Fully enforcing hard per-test timeouts would require plumbing a timeout into the underlying execution layer (e.g., subprocess.run(..., timeout=...) inside whatever case.run uses) or running the test in a separate process that can be killed on timeout. That’s a larger change, but worth keeping in mind if “hard” time limits are required rather than post-facto detection.
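The separate-process option mentioned above can be sketched with `subprocess.Popen` plus `kill`, which enforces the limit regardless of what the child does (names are illustrative, not a proposed MFC API):

```python
import subprocess
import sys

def run_python_snippet_with_limit(code, timeout_s):
    """Run a Python snippet in a child interpreter, killing it on
    timeout; return (finished, returncode_or_None)."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        proc.wait(timeout=timeout_s)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()      # hard stop for the hung child
        proc.wait()      # reap the killed process
        return False, None
```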
🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)

26-41: Global abort/timeout state persists across runs; consider resetting in test()

MIN_CASES_BEFORE_ABORT, FAILURE_RATE_THRESHOLD, TEST_TIMEOUT_SECONDS, abort_tests, and the global counters are all module-level, and test() doesn’t reset them. In normal CLI usage (test() once per process) this is fine, but if test() is ever invoked multiple times in the same interpreter, an earlier abort or non-zero nFAIL/nPASS will leak into subsequent runs.

If you expect (now or later) to reuse test() programmatically, consider resetting the global state at the top of test():

 def test():
-    # pylint: disable=global-statement, global-variable-not-assigned, too-many-statements
-    global nFAIL, nPASS, nSKIP, total_test_count
-    global errors
+    # pylint: disable=global-statement, global-variable-not-assigned, too-many-statements
+    global nFAIL, nPASS, nSKIP, total_test_count
+    global errors
+
+    # Reset global state for a fresh test run (important if test() is reused)
+    abort_tests.clear()
+    nFAIL = nPASS = nSKIP = 0
+    total_test_count = 0
+    errors = []

If test() is strictly a one-shot CLI entry point, this can be left as-is.

Also applies to: 109-113

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b0eca32 and 0589b85.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (2)
toolchain/mfc/common.py (4)
  • MFCException (31-32)
  • does_command_exist (127-138)
  • get_program_output (104-106)
  • find (165-174)
toolchain/mfc/state.py (1)
  • ARG (85-93)
🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

185-188: Avoid specifying long messages outside the exception class

(TRY003)


189-191: Avoid specifying long messages outside the exception class

(TRY003)


218-221: Avoid specifying long messages outside the exception class

(TRY003)


227-227: Avoid specifying long messages outside the exception class

(TRY003)


233-237: Avoid specifying long messages outside the exception class

(TRY003)


240-243: Avoid specifying long messages outside the exception class

(TRY003)


246-249: Avoid specifying long messages outside the exception class

(TRY003)


332-332: Abstract raise to an inner function

(TRY301)


332-332: Avoid specifying long messages outside the exception class

(TRY003)


337-337: Abstract raise to an inner function

(TRY301)


337-337: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


349-349: Avoid specifying long messages outside the exception class

(TRY003)


352-352: Avoid specifying long messages outside the exception class

(TRY003)


360-360: Avoid specifying long messages outside the exception class

(TRY003)


377-377: Avoid specifying long messages outside the exception class

(TRY003)


383-383: Abstract raise to an inner function

(TRY301)


383-383: Avoid specifying long messages outside the exception class

(TRY003)


397-400: Avoid specifying long messages outside the exception class

(TRY003)


425-429: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-acc)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Georgia Tech | Phoenix (cpu)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-omp)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-acc)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Georgia Tech | Phoenix (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
toolchain/mfc/test/test.py (2)

214-305: New silo post-process and multi-rank debug helpers look solid

_process_silo_file and _print_multirank_debug_info add clear, localized validation and diagnostics: they check silo file existence, verify h5dump availability with a PATH fallback, surface non-zero return codes and NaN/Inf findings as MFCExceptions, and print concrete multi-rank debug info (restart_data contents, key params, and post_process.inp snippet). This is a good hardening / observability improvement with no obvious correctness issues.


179-191: Early-abort coordination via abort_tests Event is well-structured

Using a module-level threading.Event for early abort solves the prior problem of raising from worker threads:

  • handle_case checks abort_tests.is_set() before doing any work, so already-aborted runs exit cheaply.
  • After each case finishes (success or final failure), handle_case recomputes the global failure rate and sets abort_tests once MIN_CASES_BEFORE_ABORT and FAILURE_RATE_THRESHOLD are met, then returns gracefully.
  • The main test() function detects abort_tests.is_set() after sched.sched(...) completes and raises a single summary MFCException with the aggregated failure statistics.

This cleanly centralizes the “we are aborting the run” decision in the main thread while still letting worker threads stop themselves without throwing cross-thread exceptions.

Also applies to: 439-442, 466-478
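The threshold logic summarized above amounts to a small pure check plus a shared Event. A self-contained sketch using the review's illustrative constants (20 cases, 30%); function and variable names are hypothetical, not the `test.py` symbols:

```python
import threading

MIN_CASES_BEFORE_ABORT = 20    # illustrative thresholds from the review
FAILURE_RATE_THRESHOLD = 0.30
abort = threading.Event()      # shared "stop scheduling new work" flag

def check_abort(n_fail, n_pass):
    """Set the abort flag once enough cases have completed and the
    failure rate crosses the threshold; return whether we are aborting."""
    total = n_fail + n_pass
    if total >= MIN_CASES_BEFORE_ABORT and n_fail / total >= FAILURE_RATE_THRESHOLD:
        abort.set()
    return abort.is_set()
```

The `total >= MIN_CASES_BEFORE_ABORT` guard is what makes the division safe and avoids aborting on a noisy early sample.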

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (2)

239-249: String-based NaN/Infinity detection may have edge cases.

The checks for "nan," and "inf," rely on h5dump's output format. This works for typical cases but could miss variations like "nan]", "-inf,", or locale-specific formatting. Consider using a more robust pattern if false negatives become an issue.

-    if "nan," in output:
+    if "nan," in output.lower():

This at least handles case variations (e.g., "NaN,").


486-488: Minor race condition in failure rate calculation (benign).

The reads of nFAIL and nPASS at lines 486-488 are not atomic across threads. However, since both counters only increment and the 30% threshold is a heuristic for detecting systemic failures, the imprecision is acceptable. The worst case is aborting slightly earlier or later than the exact threshold.

If stricter accuracy is desired in the future, consider guarding the counters with a threading.Lock (Python's threading module has no atomic counter primitive), but this is not necessary for the current use case.
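If lock-guarded counters were ever wanted, a minimal sketch (illustrative class name, not part of the PR) could look like:

```python
import threading

class FailureCounter:
    """Pass/fail counters guarded by a lock so the failure rate is
    always computed from a consistent snapshot."""
    def __init__(self):
        self._lock = threading.Lock()
        self.n_fail = 0
        self.n_pass = 0

    def record(self, passed):
        with self._lock:
            if passed:
                self.n_pass += 1
            else:
                self.n_fail += 1

    def failure_rate(self):
        with self._lock:
            total = self.n_fail + self.n_pass
            return self.n_fail / total if total else 0.0
```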

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0589b85 and 6d830f3.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (3)
toolchain/mfc/common.py (5)
  • MFCException (31-32)
  • does_command_exist (127-138)
  • get_program_output (104-106)
  • delete_file (81-83)
  • find (165-174)
toolchain/mfc/test/case.py (6)
  • TestCase (115-276)
  • get_filepath (160-164)
  • get_dirpath (157-158)
  • delete_output (166-185)
  • get_uuid (154-155)
  • get_uuid (288-289)
toolchain/mfc/state.py (1)
  • ARG (85-93)
🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

185-188: Avoid specifying long messages outside the exception class

(TRY003)


189-191: Avoid specifying long messages outside the exception class

(TRY003)


218-221: Avoid specifying long messages outside the exception class

(TRY003)


227-227: Avoid specifying long messages outside the exception class

(TRY003)


233-237: Avoid specifying long messages outside the exception class

(TRY003)


240-243: Avoid specifying long messages outside the exception class

(TRY003)


246-249: Avoid specifying long messages outside the exception class

(TRY003)


341-341: Abstract raise to an inner function

(TRY301)


341-341: Avoid specifying long messages outside the exception class

(TRY003)


346-346: Abstract raise to an inner function

(TRY301)


346-346: Avoid specifying long messages outside the exception class

(TRY003)


357-357: Avoid specifying long messages outside the exception class

(TRY003)


361-361: Avoid specifying long messages outside the exception class

(TRY003)


364-364: Avoid specifying long messages outside the exception class

(TRY003)


372-372: Avoid specifying long messages outside the exception class

(TRY003)


389-389: Avoid specifying long messages outside the exception class

(TRY003)


413-416: Avoid specifying long messages outside the exception class

(TRY003)


442-446: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-omp)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Build & Publish
🔇 Additional comments (6)
toolchain/mfc/test/test.py (6)

26-41: Well-designed abort and timeout mechanism.

The use of threading.Event() for cross-thread abort signaling is the correct approach—it avoids the ValueError that would occur with signal.alarm() in worker threads. The constants and exception class are appropriately defined.


179-191: Good handling of abort state after scheduler completion.

The check for abort_tests.is_set() after sched.sched returns is correct. The division-by-zero edge case (total_completed > 0) is properly handled with a separate error message.


252-305: Well-implemented debug helper with proper error handling.

The function handles filesystem errors gracefully, limits output to avoid log overflow, and provides useful diagnostic information for multi-rank failures. The _param inner function is a clean approach for defaulting missing parameters.


339-346: Timeout cannot interrupt a running subprocess.

The threading.Timer sets a flag, but if case.run() spawns a long-running subprocess, it will continue until completion—the timeout won't interrupt it mid-execution. The checks at lines 340-341 and 345-346 only detect timeouts between operations.

For true subprocess timeouts, consider passing a timeout parameter to the underlying subprocess.run() call within case.run(). This is a known limitation that may be acceptable for now, but worth tracking for future improvement.


483-495: Graceful abort handling with thread-safe flag.

Setting abort_tests.set() instead of raising an exception from the worker thread is the correct approach—it avoids leaving the scheduler in an undefined state. The placement after every test (not just failures) is appropriate since the failure rate calculation depends on total completed tests.


406-416: Redundant return code check.

This cmd.returncode != 0 check at line 406 is unreachable because the same condition was already checked at line 352, which raises MFCException. Since targets includes POST_PROCESS when ARG("test_all") is true, the single case.run(targets, ...) call at line 342 runs all stages including post-process, and any failure would have been caught earlier.

-            if cmd.returncode != 0:
-                cons.print(cmd.stdout)
-
-                # Extra debug for multi-rank restart/post-process issues
-                if getattr(case, "ppn", 1) >= 2:
-                    _print_multirank_debug_info(case)
-
-                raise MFCException(
-                    f"Test {case}: Failed to execute MFC (post-process). "
-                    f"See log at: {out_post_filepath}"
-                )

This entire block can be removed as the error is already handled at lines 352-357.

Likely an incorrect or invalid review comment.

Labels

Review effort 3/5 size:L This PR changes 100-499 lines, ignoring generated files