hangtimeout for 2 rank cases #1058
@@ -147,7 +147,16 @@ def run(self, targets: List[Union[str, MFCTarget]], gpus: Set[int]) -> subproces
             *jobs, "-t", *target_names, *gpus_select, *ARG("--")
         ]

-        return common.system(command, print_cmd=False, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+        # Enforce per-test timeout only for 2-rank cases (to catch hangs)
+        timeout = ARG("timeout") if self.ppn == 2 else None
+        return common.system(
+            command,
+            print_cmd=False,
+            text=True,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            timeout=timeout
+        )
Comment on lines +150 to +159
Contributor
Suggestion: Add validation to ensure the

     def get_trace(self) -> str:
         return self.trace

@@ -1,4 +1,4 @@
-import os, typing, shutil, time, itertools
+import os, typing, shutil, time, itertools, subprocess
 from random import sample, seed

 import rich, rich.table

@@ -194,10 +194,31 @@ def _handle_case(case: TestCase, devices: typing.Set[int]):
         cons.print(f" [bold magenta]{case.get_uuid()}[/bold magenta] SKIP {case.trace}")
         return

-    cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)

     out_filepath = os.path.join(case.get_dirpath(), "out_pre_sim.txt")

+    try:
Contributor
Suggestion: Extract the duplicated
+        cmd = case.run([PRE_PROCESS, SIMULATION], gpus=devices)
+    except subprocess.TimeoutExpired as exc:
+        # Save any partial stdout we have
+        partial_output = ""
+        if exc.stdout:
+            try:
+                partial_output = exc.stdout.decode() if isinstance(exc.stdout, bytes) else exc.stdout
+            except Exception:
+                partial_output = str(exc.stdout)
+
+        if partial_output:
+            common.file_write(out_filepath, partial_output)
+
+        raise MFCException(
+            f"Test {case} (2-rank case): Timed out after {ARG('timeout')} seconds.\n"
Suggested change
-            f"Test {case} (2-rank case): Timed out after {ARG('timeout')} seconds.\n"
+            f"Test {case} ({case.ppn}-rank case): Timed out after {ARG('timeout')} seconds.\n"
Copilot AI, Nov 22, 2025
The timeout handling logic is duplicated between this block (lines 199-219) and lines 264-284. Consider extracting this into a helper function that takes the case, command callable, out_filepath, and optional context string (e.g., "post-process") as parameters. This would improve maintainability and ensure consistent error handling.
Copilot AI, Nov 22, 2025
The error message hardcodes "2-rank case" but should dynamically reference the actual ppn value from the test case. Consider using f"Test {case} ({case.ppn}-rank case, post-process): Timed out..." to make the message more accurate and maintainable, especially if the timeout logic is ever extended to other ppn values.
Suggested change
-            f"Test {case} (2-rank case, post-process): Timed out after {ARG('timeout')} seconds.\n"
+            f"Test {case} ({case.ppn}-rank case, post-process): Timed out after {ARG('timeout')} seconds.\n"
@@ -37,9 +37,9 @@ dependencies = [
     "matplotlib",

     # Chemistry
-    "cantera==3.1.0",
+    "cantera>=3.1.0",
Suggested change
-    "cantera>=3.1.0",
+    "cantera>=3.1.0,<4.0.0",
High-level Suggestion
The current timeout logic is limited to 2-rank tests. It should be generalized to apply to all multi-process tests (where ppn > 1) to prevent hangs in any MPI-based test. [High-level, importance: 7]
Solution Walkthrough:
Before:
After:
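The generalization proposed in this suggestion can be illustrated with a short sketch. This is an assumption-laden example, not the walkthrough's own code: `select_timeout` is a hypothetical helper, `ppn` is the case's process count, and `timeout` stands in for the value the PR reads via `ARG("timeout")`.

```python
from typing import Optional


def select_timeout(ppn: int, timeout: Optional[float]) -> Optional[float]:
    """Return the per-test timeout for multi-process cases, else None.

    Hypothetical helper: the PR computes this inline inside `run`.
    """
    # The PR as submitted guards only 2-rank cases:
    #   return timeout if ppn == 2 else None
    # The suggestion is to guard every multi-process case instead:
    return timeout if ppn > 1 else None
```

With this condition, single-process cases still run without a timeout, while any case with two or more ranks is subject to it.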