Usage of jax.distributed.initialize on HPC clusters #9582
-
Hi, I wanted to know if anyone has used Is there some other way to get things working on these clusters? I'd appreciate any advice/guidance on this front. Thanks!
-
Adding some more details to this, I tried running the provided example from the PR on a single node. However, I'm running into this error:
$ python3 nvidia_gpu_pjit.py --server_addr="172.11.92.39" --num_hosts=1 --host_idx=0
I0215 14:53:07.547920 22468454270784 distributed.py:49] Starting JAX distributed service on 172.11.92.39
E0215 14:53:07.551977037 475947 server_chttp2.cc:40] {"created":"@1644954787.551954935","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1644954787.551953102","description":"Unable to configure socket","fd":7,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1644954787.551951008","description":"Permission denied","errno":13,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Permission denied","syscall":"bind"}]}]}
Traceback (most recent call last):
  File "distributed_test.py", line 50, in <module>
    app.run(main)
  File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "distributed_test.py", line 25, in main
    jax.distributed.initialize(FLAGS.server_addr, FLAGS.num_hosts, FLAGS.host_idx)
  File "/home/user/.local/lib/python3.8/site-packages/jax/_src/distributed.py", line 50, in initialize
    _service = xla_extension.get_distributed_runtime_service(coordinator_address,
RuntimeError: UNKNOWN: Failed to start RPC server
Would I need admin rights to run this code?
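The log above shows `bind` failing with errno 13 (Permission denied). On Linux, binding to a port below 1024 normally requires elevated privileges, so one quick way to narrow this down is to test whether a plain TCP socket can bind to a given address and port without JAX involved. The sketch below is only a diagnostic under that assumption; `can_bind` is a hypothetical helper name, not part of JAX.

```python
import errno
import socket
from typing import Tuple


def can_bind(host: str, port: int) -> Tuple[bool, str]:
    """Try to bind a TCP socket to (host, port) and report the outcome.

    Hypothetical diagnostic helper; it only checks the OS-level bind,
    not anything specific to the JAX distributed service.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            # errno 13 (EACCES), the same error reported in the gRPC log.
            return False, "permission denied (EACCES)"
        except OSError as exc:
            # Any other failure, e.g. address in use or address not available.
            return False, errno.errorcode.get(exc.errno, str(exc))
    return True, "ok"


if __name__ == "__main__":
    # Port 1234 is an arbitrary unprivileged port; replace host as needed.
    print(can_bind("127.0.0.1", 1234))
```

If an unprivileged port binds fine for the same user on the same node, admin rights are probably not the issue and the port selection is the more likely culprit.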
-
Try providing the port too?
--server_addr="172.11.92.39:1234"
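As a rough illustration of this suggestion, the sketch below makes sure the coordinator address carries an explicit port before it is passed to `jax.distributed.initialize`, using the same positional call that appears in the traceback. The helper names `with_port` and `init_distributed` are hypothetical, the default port 1234 is just the value from the suggestion above, and the address parsing assumes IPv4 or hostname addresses only.

```python
def with_port(addr: str, default_port: int = 1234) -> str:
    """Return addr as 'host:port', appending default_port if none is given.

    Hypothetical helper; handles IPv4 addresses and hostnames, not IPv6.
    """
    _host, sep, port = addr.rpartition(":")
    if sep and port.isdigit():
        return addr  # A port is already present.
    return f"{addr}:{default_port}"


def init_distributed(server_addr: str, num_hosts: int, host_idx: int) -> None:
    """Initialize JAX's distributed runtime with an explicit port."""
    import jax  # Imported lazily so with_port can be used without JAX.

    # Same positional arguments as the call shown in the traceback.
    jax.distributed.initialize(with_port(server_addr), num_hosts, host_idx)
```

With this, `with_port("172.11.92.39")` yields `"172.11.92.39:1234"`, and an address that already includes a port is passed through unchanged.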
-
@zhangqiaorjc I got this working perfectly fine on one of the HPC clusters I have access to. However, I was trying out
Is there a reason I'm seeing this? It's pretty much the same Python environment AFAIK. How can I go about fixing this? Or do you think this may have something to do with permissions on this machine? Thanks