Fix unhandled transient state issue in tools/commands that surfaces in k8s TorchX scheduling. #2047

@johnwhumphreys

Description

The server_ready function in commands.py does not handle some transient states (e.g. UNKNOWN) properly. This causes intermittent failures: sometimes the job passes through the transient state quickly enough that it is never observed, and sometimes the state is reported and allocation fails.

I have a fix for this already; will commit here soon.
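
For illustration only, here is a minimal sketch of how a readiness poll could treat transient states as "keep waiting" rather than as failures. This is not the actual fix and not Monarch's or TorchX's API: `State`, `TRANSIENT`, `wait_until_running`, and the `get_state` callback are all hypothetical names, and the choice of which states count as transient is an assumption.

```python
import enum
import time
from typing import Callable


class State(enum.Enum):
    # Hypothetical state set for illustration; not the real scheduler enum.
    UNKNOWN = "UNKNOWN"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# Assumption: these states can show up briefly while the job is being placed.
TRANSIENT = frozenset({State.UNKNOWN, State.PENDING})


def wait_until_running(
    get_state: Callable[[], State],
    check_interval: float = 0.0,
    timeout: float = 5.0,
) -> State:
    """Poll get_state() until RUNNING, tolerating transient states.

    Raises RuntimeError on a non-transient, non-running state and
    TimeoutError if the job stays in a transient state past the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state is State.RUNNING:
            return state
        if state not in TRANSIENT:
            # A terminal state such as FAILED is a real error.
            raise RuntimeError(f"the new server has {state.name}")
        if time.monotonic() >= deadline:
            # Bound the wait so a job stuck in UNKNOWN cannot hang forever.
            raise TimeoutError(f"server still {state.name} after {timeout}s")
        time.sleep(check_interval)
```

The timeout is there so that tolerating UNKNOWN does not turn a permanently stuck job into an infinite wait.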

Function Reference:

https://www.internalfb.com/code/fbsource/[97d54e1b9761]/fbcode/monarch/python/monarch/tools/commands.py?lines=21-22%2C24%2C28-29%2C98%2C104%2C126%2C133%2C140%2C221%2C234%2C302%2C323%2C334

Error Message:

File /opt/conda/lib/python3.11/site-packages/monarch/tools/commands.py:342, in get_or_create(name, config, check_interval, force_restart)
    337         raise RuntimeError(
    338             f"the new server `{new_server_handle}` went missing (should never happen)"
    339         )
    341     if not server_info.is_running:
--> 342         raise RuntimeError(
    343             f"the new server `{new_server_handle}` has {server_info.state}"
    344         )
    346     print(f"{CYAN}New job `{new_server_handle}` is ready to serve.{ENDC}")
    347 else:

RuntimeError: the new server `k8s:///monarch-tests:monarch-testsmonarch-root-ztfvdkdbtmf4nd` has UNKNOWN
