Open
Description
The server_ready function in commands.py does not handle some transient states (e.g. UNKNOWN) properly. This causes intermittent failures: sometimes the server passes through the transient state quickly enough that it is never observed, and sometimes the state is reported and allocation fails.
I have a fix for this already; will commit here soon.
Function Reference:
Error Message:
File /opt/conda/lib/python3.11/site-packages/monarch/tools/commands.py:342, in get_or_create(name, config, check_interval, force_restart)
    337         raise RuntimeError(
    338             f"the new server `{new_server_handle}` went missing (should never happen)"
    339         )
    341     if not server_info.is_running:
--> 342         raise RuntimeError(
    343             f"the new server `{new_server_handle}` has {server_info.state}"
    344         )
    346     print(f"{CYAN}New job `{new_server_handle}` is ready to serve.{ENDC}")
    347 else:
RuntimeError: the new server `k8s:///monarch-tests:monarch-testsmonarch-root-ztfvdkdbtmf4nd` has UNKNOWN
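
For illustration only, here is a minimal sketch of one way a readiness poll could tolerate transient states such as UNKNOWN instead of failing on the first non-running report. It is not the pending fix and does not use the real monarch API: `State`, `ServerInfo`, `TRANSIENT_STATES`, `wait_until_ready`, and `max_transient_polls` are all hypothetical names, and which states count as transient is an assumption.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    # Hypothetical stand-in for the scheduler's job states.
    UNKNOWN = "UNKNOWN"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# Assumption: states a healthy server may briefly report before RUNNING.
TRANSIENT_STATES = {State.UNKNOWN, State.PENDING}


@dataclass
class ServerInfo:
    # Hypothetical stand-in, not the real monarch class.
    name: str
    state: State

    @property
    def is_running(self) -> bool:
        return self.state == State.RUNNING


def wait_until_ready(
    fetch_info: Callable[[], Optional[ServerInfo]],
    check_interval: float = 1.0,
    max_transient_polls: int = 30,
) -> ServerInfo:
    """Poll until the server is running, retrying on transient states."""
    transient_polls = 0
    while True:
        info = fetch_info()
        if info is None:
            raise RuntimeError("the server went missing")
        if info.is_running:
            return info
        if info.state not in TRANSIENT_STATES:
            # A non-transient, non-running state is a real failure.
            raise RuntimeError(f"the server `{info.name}` has {info.state.name}")
        transient_polls += 1
        if transient_polls > max_transient_polls:
            # Bound the retries so a stuck server cannot hang the caller.
            raise TimeoutError(
                f"the server `{info.name}` stayed in {info.state.name} too long"
            )
        time.sleep(check_interval)
```

With a simulated state sequence of UNKNOWN, PENDING, RUNNING, the call `wait_until_ready(lambda: ServerInfo("demo", next(states)), check_interval=0.0)` returns once RUNNING is reported, while a sequence that reaches FAILED still raises RuntimeError. Bounding the retries keeps a server that never leaves UNKNOWN from blocking indefinitely.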