Commit 5526463
fix: ray module not found handling (meta-pytorch#1049)
Summary:
TorchX has handled `ModuleNotFoundError` gracefully for a while now. For example, when the `sagemaker` package is not installed, `torchx runopts` reports the missing module for `aws_sagemaker` and moves on to the next scheduler:
```
...
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
aws_sagemaker: No module named 'sagemaker'
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
...
```
For `ray`, however, we get an exception, after which the runopts of the remaining schedulers are never printed:
```
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
optional arguments:
project=PROJECT (str, None)
Name of the GCP project. Defaults to the configured GCP project in the environment
location=LOCATION (str, us-central1)
Name of the location to schedule the job in. Defaults to us-central1
Traceback (most recent call last):
File "/usr/local/bin/torchx", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
run_main(get_sub_cmds(), argv)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
opts = runner.scheduler_run_opts(scheduler)
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
return self._scheduler(scheduler).run_opts()
File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
sched = factory(self._name, **self._scheduler_params)
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
module = importlib.import_module(path)
File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined
```
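That's because `ray_scheduler` has its own custom `ModuleNotFoundError` handling, perhaps for historical reasons. As the traceback shows, importing the module then fails with a `NameError` on `JobSubmissionClient` instead of a `ModuleNotFoundError`, so the failure is not reported the way the missing `sagemaker` module is.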
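The difference between the two behaviours can be sketched in a few lines. This is an illustration only, not TorchX's actual code: `describe_schedulers`, the factory functions, and the package name `hypothetical_missing_dependency_xyz` are all hypothetical names made up for the example. It assumes the caller catches only `ModuleNotFoundError`, which is what the SageMaker output above suggests.

```python
import importlib
from typing import Callable, Dict


def describe_schedulers(factories: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Collect one description per scheduler (hypothetical helper).

    A missing optional dependency is reported as text so the loop can
    continue with the next scheduler. Any other exception propagates.
    """
    descriptions: Dict[str, str] = {}
    for name, factory in factories.items():
        try:
            descriptions[name] = factory()
        except ModuleNotFoundError as e:
            descriptions[name] = str(e)
    return descriptions


def missing_dep_factory() -> str:
    # The import happens when the factory is called, so the
    # ModuleNotFoundError reaches the caller unchanged.
    importlib.import_module("hypothetical_missing_dependency_xyz")
    return "unreachable"


def ok_factory() -> str:
    return "usage: [project=PROJECT],[location=LOCATION]"


# Module source that swallows the import error itself and later uses the
# name at module level. Executing it raises NameError, not ModuleNotFoundError.
SWALLOWING_MODULE_SRC = """
try:
    from hypothetical_missing_dependency_xyz import Client
except ModuleNotFoundError:
    pass

DEFAULT_CLIENT_CLS = Client
"""


def swallowing_factory() -> str:
    exec(SWALLOWING_MODULE_SRC, {})
    return "unreachable"


if __name__ == "__main__":
    print(describe_schedulers({"missing": missing_dep_factory, "ok": ok_factory}))
    try:
        describe_schedulers({"swallowing": swallowing_factory, "ok": ok_factory})
    except NameError as e:
        print(f"aborted before reaching 'ok': {e}")
```

In the first call the missing dependency is reported and the `ok` entry is still produced. In the second call the `NameError` escapes the loop, which mirrors how the `ray` failure cuts off the remaining output.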
Test Plan: [x] existing tests must pass
Differential Revision: D73751531
Pulled By: andywag1
Parent commit: 41be1d8
File tree
3 files changed (+990, -1018 lines)
- docs/source/schedulers
- torchx/schedulers
- test