-
Notifications
You must be signed in to change notification settings - Fork 125
hangtimeout for 2 rank cases #1058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CodeAnt AI is reviewing your PR. |
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThe changes implement timeout handling for test execution in the MFC toolchain. A new Changes
Sequence DiagramsequenceDiagram
participant User
participant TestPipeline as Test Pipeline
participant SystemCall as system.call()
participant Subprocess as Subprocess
User->>TestPipeline: Execute test (--timeout 1000)
TestPipeline->>TestPipeline: Extract timeout from ARG("timeout")
TestPipeline->>SystemCall: Call with timeout parameter
rect rgb(200, 220, 255)
Note over SystemCall,Subprocess: Normal Completion Path
SystemCall->>Subprocess: Run process
Subprocess-->>SystemCall: Completes within timeout
SystemCall-->>TestPipeline: stdout
TestPipeline->>TestPipeline: Write full output
TestPipeline-->>User: Success
end
rect rgb(255, 200, 200)
Note over SystemCall,Subprocess: Timeout Path
SystemCall->>Subprocess: Run process
Subprocess-->>SystemCall: TimeoutExpired (exceeds 1000s)
SystemCall-->>TestPipeline: subprocess.TimeoutExpired
TestPipeline->>TestPipeline: Catch exception
TestPipeline->>TestPipeline: Save partial stdout
TestPipeline-->>User: MFCException with timeout details
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10–15 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level Suggestion
The current timeout logic is limited to 2-rank tests. It should be generalized to apply to all multi-process tests (where ppn > 1) to prevent hangs in any MPI-based test. [High-level, importance: 7]
Solution Walkthrough:
Before:
# toolchain/mfc/test/case.py
class TestCase(case.Case):
def run(self, ...):
# ...
# Enforce per-test timeout only for 2-rank cases (to catch hangs)
timeout = ARG("timeout") if self.ppn == 2 else None
return common.system(..., timeout=timeout)
# toolchain/mfc/test/test.py
def _handle_case(...):
try:
case.run(...)
except subprocess.TimeoutExpired as exc:
raise MFCException(f"Test {case} (2-rank case): Timed out...")After:
# toolchain/mfc/test/case.py
class TestCase(case.Case):
def run(self, ...):
# ...
# Enforce per-test timeout for all multi-rank cases
timeout = ARG("timeout") if self.ppn > 1 else None
return common.system(..., timeout=timeout)
# toolchain/mfc/test/test.py
def _handle_case(...):
try:
case.run(...)
except subprocess.TimeoutExpired as exc:
raise MFCException(f"Test {case} (multi-rank case): Timed out...")| # Enforce per-test timeout only for 2-rank cases (to catch hangs) | ||
| timeout = ARG("timeout") if self.ppn == 2 else None | ||
| return common.system( | ||
| command, | ||
| print_cmd=False, | ||
| text=True, | ||
| stdout=subprocess.PIPE, | ||
| stderr=subprocess.STDOUT, | ||
| timeout=timeout | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Add validation to ensure the timeout value is positive before passing it to common.system. If the value is non-positive, it should be set to None to prevent an unhandled ValueError. [possible issue, importance: 8]
| # Enforce per-test timeout only for 2-rank cases (to catch hangs) | |
| timeout = ARG("timeout") if self.ppn == 2 else None | |
| return common.system( | |
| command, | |
| print_cmd=False, | |
| text=True, | |
| stdout=subprocess.PIPE, | |
| stderr=subprocess.STDOUT, | |
| timeout=timeout | |
| ) | |
| # Enforce per-test timeout only for 2-rank cases (to catch hangs) | |
| timeout = None | |
| if self.ppn == 2: | |
| val = ARG("timeout") | |
| if val > 0: | |
| timeout = val | |
| return common.system( | |
| command, | |
| print_cmd=False, | |
| text=True, | |
| stdout=subprocess.PIPE, | |
| stderr=subprocess.STDOUT, | |
| timeout=timeout | |
| ) |
|
|
||
| out_filepath = os.path.join(case.get_dirpath(), "out_pre_sim.txt") | ||
|
|
||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Extract the duplicated subprocess.TimeoutExpired exception handling logic into a new helper function to reduce code redundancy and improve maintainability. [general, importance: 6]
|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds timeout handling for 2-rank MPI test cases to catch hung or deadlocked processes. The changes enable per-test timeouts specifically for 2-rank cases, which appear to be more prone to hanging, and provide detailed error messages when timeouts occur.
Key Changes:
- Added
--timeoutargument (default: 1000 seconds) to test command that applies only to 2-rank test cases - Implemented timeout exception handling with partial output capture and detailed error messages
- Updated dependency constraints: relaxed cantera version to
>=3.1.0and switched pyrometheus repository from wilfonba to sbryngelson fork
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| toolchain/pyproject.toml | Relaxed cantera version constraint from ==3.1.0 to >=3.1.0 and switched pyrometheus repository to sbryngelson fork |
| toolchain/mfc/test/test.py | Added subprocess import and implemented timeout exception handling in two test execution paths (pre/sim and post-process) with partial output capture |
| toolchain/mfc/test/case.py | Added conditional timeout enforcement for 2-rank cases (ppn==2) in the test run method |
| toolchain/mfc/args.py | Added --timeout command-line argument with default value of 1000 seconds for 2-rank test cases |
| common.file_write(out_filepath, partial_output) | ||
|
|
||
| raise MFCException( | ||
| f"Test {case} (2-rank case): Timed out after {ARG('timeout')} seconds.\n" |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message hardcodes "2-rank case" but should dynamically reference the actual ppn value from the test case. Consider using f"Test {case} ({case.ppn}-rank case): Timed out..." to make the message more accurate and maintainable, especially if the timeout logic is ever extended to other ppn values.
| f"Test {case} (2-rank case): Timed out after {ARG('timeout')} seconds.\n" | |
| f"Test {case} ({case.ppn}-rank case): Timed out after {ARG('timeout')} seconds.\n" |
| common.file_write(out_filepath, partial_output) | ||
|
|
||
| raise MFCException( | ||
| f"Test {case} (2-rank case, post-process): Timed out after {ARG('timeout')} seconds.\n" |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message hardcodes "2-rank case" but should dynamically reference the actual ppn value from the test case. Consider using f"Test {case} ({case.ppn}-rank case, post-process): Timed out..." to make the message more accurate and maintainable, especially if the timeout logic is ever extended to other ppn values.
| f"Test {case} (2-rank case, post-process): Timed out after {ARG('timeout')} seconds.\n" | |
| f"Test {case} ({case.ppn}-rank case, post-process): Timed out after {ARG('timeout')} seconds.\n" |
|
|
||
| # Chemistry | ||
| "cantera==3.1.0", | ||
| "cantera>=3.1.0", |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The version constraint for cantera was changed from ==3.1.0 to >=3.1.0. While this allows for updates, it could potentially allow future major versions (e.g., 4.x) that might introduce breaking API changes. Consider using >=3.1.0,<4.0.0 to allow minor and patch updates while preventing breaking changes from major version updates.
| "cantera>=3.1.0", | |
| "cantera>=3.1.0,<4.0.0", |
| try: | ||
| cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices) | ||
| except subprocess.TimeoutExpired as exc: | ||
| # Save any partial stdout we have | ||
| partial_output = "" | ||
| if exc.stdout: | ||
| try: | ||
| partial_output = exc.stdout.decode() if isinstance(exc.stdout, bytes) else exc.stdout | ||
| except Exception: | ||
| partial_output = str(exc.stdout) | ||
|
|
||
| if partial_output: | ||
| common.file_write(out_filepath, partial_output) | ||
|
|
||
| raise MFCException( | ||
| f"Test {case} (2-rank case): Timed out after {ARG('timeout')} seconds.\n" | ||
| f"This suggests the MPI job may have hung or deadlocked.\n" | ||
| f"Partial output (if any) saved to {out_filepath}.\n" | ||
| f"Case dictionary: {case.get_filepath()}.\n" | ||
| f"Command: {' '.join(str(x) for x in exc.cmd)}" | ||
| ) from exc |
Copilot
AI
Nov 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout handling logic is duplicated between this block (lines 199-219) and lines 264-284. Consider extracting this into a helper function that takes the case, command callable, out_filepath, and optional context string (e.g., "post-process") as parameters. This would improve maintainability and ensure consistent error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)
264-285: Consider extracting timeout handling into a helper function.This timeout handling block is nearly identical to lines 199-220. To reduce duplication and improve maintainability, consider extracting the common logic into a helper function.
Example refactor:
def _run_with_timeout_handling(case: TestCase, targets: list, devices: typing.Set[int], out_filepath: str, phase_description: str = "") -> subprocess.CompletedProcess: """Run case with timeout handling, saving partial output on timeout.""" try: return case.run(targets, gpus=devices) except subprocess.TimeoutExpired as exc: # Save any partial stdout we have partial_output = "" if exc.stdout: try: partial_output = exc.stdout.decode() if isinstance(exc.stdout, bytes) else exc.stdout except Exception: partial_output = str(exc.stdout) if partial_output: common.file_write(out_filepath, partial_output) phase = f" ({phase_description})" if phase_description else "" raise MFCException( f"Test {case}{phase}: Timed out after {ARG('timeout')} seconds.\n" f"This suggests the MPI job may have hung or deadlocked.\n" f"Partial output (if any) saved to {out_filepath}.\n" f"Case dictionary: {case.get_filepath()}.\n" f"Command: {' '.join(str(x) for x in exc.cmd)}" ) from excThen replace both blocks with calls to this helper:
# Pre-processing/simulation cmd = _run_with_timeout_handling(case, [PRE_PROCESS, SIMULATION], devices, out_filepath) common.file_write(out_filepath, cmd.stdout) # Post-processing (if test_all) cmd = _run_with_timeout_handling(case, [PRE_PROCESS, SIMULATION, POST_PROCESS], devices, out_filepath, "post-process") common.file_write(out_filepath, cmd.stdout)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
toolchain/mfc/args.py(1 hunks)toolchain/mfc/test/case.py(1 hunks)toolchain/mfc/test/test.py(3 hunks)toolchain/pyproject.toml(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
toolchain/mfc/test/case.py (2)
toolchain/mfc/state.py (1)
ARG(85-93)toolchain/mfc/common.py (1)
system(34-41)
toolchain/mfc/test/test.py (3)
toolchain/mfc/test/case.py (1)
run(127-159)toolchain/mfc/common.py (2)
file_write(44-54)MFCException(31-32)toolchain/mfc/state.py (1)
ARG(85-93)
🪛 Ruff (0.14.5)
toolchain/mfc/test/test.py
207-207: Do not catch blind exception: Exception
(BLE001)
213-219: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Do not catch blind exception: Exception
(BLE001)
278-284: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu)
- GitHub Check: Coverage Test on CodeCov
- GitHub Check: Code Cleanliness Check
- GitHub Check: Self Hosted (gpu, omp, gt)
- GitHub Check: Self Hosted (gpu, acc, gt)
- GitHub Check: Github (macos, mpi, debug, false)
- GitHub Check: Github (ubuntu, mpi, no-debug, true)
- GitHub Check: Self Hosted (cpu, none, gt)
- GitHub Check: Github (ubuntu, mpi, no-debug, false)
- GitHub Check: Github (macos, mpi, no-debug, false)
- GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
- GitHub Check: Github (ubuntu, mpi, debug, false)
- GitHub Check: Github (ubuntu, mpi, debug, true)
- GitHub Check: cubic · AI code reviewer
- GitHub Check: Agent
- GitHub Check: Build & Publish
🔇 Additional comments (6)
toolchain/mfc/args.py (1)
100-100: LGTM!The new
--timeoutargument is well-defined with a reasonable default value and clear help text. It aligns with the PR objective to handle timeout for 2-rank test cases.toolchain/mfc/test/case.py (1)
150-159: LGTM!The timeout implementation correctly applies to 2-rank cases only (when
ppn == 2) and properly integrates with the new--timeoutCLI argument. The timeout is passed tocommon.system(), which forwards it tosubprocess.run()via**kwargs.toolchain/mfc/test/test.py (2)
1-1: LGTM!The
subprocessimport is needed for theTimeoutExpiredexception handling introduced in this PR.
199-220: Robust timeout handling with helpful diagnostics.The timeout exception handling correctly:
- Catches
subprocess.TimeoutExpiredfrom the underlying system call- Preserves partial stdout for debugging (even attempting decode from bytes)
- Raises a detailed
MFCExceptionwith timeout value, case info, and commandThe broad
Exceptioncatch on Line 207 is defensive coding for decode errors and is acceptable here. The detailed error message (flagged by static analysis) provides valuable diagnostic information for debugging hung MPI runs.toolchain/pyproject.toml (2)
42-42: Pyrometheus repository accessibility verified.The verification confirms that the new repository
sbryngelson/pyrometheus-bryngelsonis publicly accessible and theOpenMPTestbranch exists. The git dependency should resolve correctly.
40-40: I can see that cantera 3.2.0 is available. Now I need to verify two things: (1) whether there are breaking changes between 3.1.0 and 3.2.0, and (2) what timeout handling was introduced in this PR and whether it's compatible with newer versions.Now I need to check the actual PR changes to understand what timeout handling was introduced and whether it's compatible with newer cantera versions:
Based on my verification:
Cantera 3.2.0 is available, with over 800 commits since 3.1.0 (released in December 2024). The version constraint relaxation from
==3.1.0to>=3.1.0allows cantera 3.2.0 and future versions.The public release notes for 3.2.0 do not highlight major breaking changes that would prevent compatibility. However, the full detailed changelog for 3.2.0 was not fully accessible through web search to comprehensively verify all API compatibility concerns.
Verify cantera 3.2.0 compatibility with your timeout handling implementation.
The version constraint was relaxed from
==3.1.0to>=3.1.0, allowing cantera 3.2.0+. While no major breaking changes are documented in publicly available release notes, you should test the timeout handling code introduced in this PR against cantera 3.2.0 to ensure compatibility before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues found across 4 files
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1058 +/- ##
=======================================
Coverage 44.35% 44.35%
=======================================
Files 71 71
Lines 20587 20587
Branches 1993 1993
=======================================
Hits 9132 9132
Misses 10310 10310
Partials 1145 1145 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
CodeAnt AI is running Incremental review |
|
CodeAnt AI Incremental review completed. |
User description
User description
2 rank hanging timeout
PR Type
Enhancement, Bug fix
Description
Add timeout mechanism for 2-rank MPI test cases to catch hung processes
Implement subprocess timeout handling with partial output preservation
Add
--timeoutCLI argument with 1000-second default for test executionUpdate dependency versions (cantera and pyrometheus repository)
Diagram Walkthrough
File Walkthrough
args.py
Add timeout CLI argument for test executiontoolchain/mfc/args.py
--timeoutargument to test command group with 1000-second defaultruns
case.py
Apply timeout to 2-rank case executiontoolchain/mfc/test/case.py
2)
common.system()call for subprocessexecution
time
test.py
Handle timeout exceptions with partial output preservationtoolchain/mfc/test/test.py
subprocessmodule to handle TimeoutExpired exceptionsblocks
subprocess.TimeoutExpiredand save partial stdout before raisingMFCException
and command
pyproject.toml
Update dependency versions and sourcestoolchain/pyproject.toml
==3.1.0to flexible>=3.1.0fork
CodeAnt-AI Description
Time out 2-rank MPI tests to avoid hung runs
What Changed
Impact
✅ Fewer hung two-rank MPI tests✅ Clearer MPI hang diagnostics✅ Predictable dependencies for MPI builds💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.