Description
Hello everyone,
As seen in my previous issue here, I am testing the distributed capabilities of the trn1.2xlarge machine for data-shuffling use cases. Here I wanted to try the peer-to-peer operations "send" and "receive". These are available in PyTorch XLA either through `dist.send(tensor=send_tensor, dst=dst)` / `dist.recv(tensor=receive_tensor, src=src)`, or by calling the XLA interface directly via `xm.send(send_tensor, channel_id)` / `xm.recv(recv_tensor, channel_id)`.
The documentation states that these operations are generally supported. When used in PyTorch NeuronX, they produce the following errors:
- `NotImplementedError` for the method `make_send_channel_id` when using the `torch.distributed` package
- The Python program hangs for more than 5 minutes when sending a small tensor using the `torch_xla.core.xla_model` package
- An NRT error when using `torch_xla.core.xla_model` and simultaneously sending AND receiving on different channels
I provide the Python code used and the complete errors below.
My questions: are the peer-to-peer communication operations "send"/"receive" not supported in PyTorch NeuronX, or is this a bug? Is simultaneous send/receive also unsupported? Any help is appreciated!
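For context on the first failure: judging from the first traceback further below, the generic XLA process-group backend leaves the channel-id hook used by `dist.send`/`dist.recv` unimplemented. A rough sketch of that code path is shown here; the class name and exact signatures are assumptions, only the call chain `send` → `make_send_channel_id` → `NotImplementedError` is taken from the traceback, and this is not the actual torch_xla source:
```python
# Rough reconstruction from the first traceback below
# (torch_xla/distributed/xla_backend.py); not the actual torch_xla source.
class ProcessGroupXla:
    def make_send_channel_id(self, dst_rank, tag):
        # The generic XLA backend provides no rank/tag -> channel_id mapping,
        # so dist.send()/dist.recv() fail here before anything is lowered.
        raise NotImplementedError

    def send(self, tensors, dst_rank, tag=0):
        channel_id = self.make_send_channel_id(dst_rank, tag)  # raises immediately
        # ... would otherwise lower to xm.send(...) on that channel ...
```
If this hook is meant to be supplied by a device-specific backend (e.g. the Neuron plugin), that would explain why `dist.send`/`dist.recv` fail immediately while `xm.send`/`xm.recv` at least compile and run.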
Steps to reproduce:
- Start a trn1.2xlarge instance using image-id "ami-080b4a9b6e048125e" (Deep Learning AMI Neuron (Amazon Linux 2023) 20250115)
- `source /opt/aws_neuronx_venv_pytorch_2_5/bin/activate` (PyTorch 2.5 Torch NeuronX, NxD Core environment)
- `python3 test_send_receive.py` (one of the code snippets below)
Here are the code snippets and the corresponding errors:
- Using the PyTorch `torch.distributed` package
```python
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch.distributed as dist

def _mp_fn(index):
    device = xm.xla_device()
    world_size = xm.xrt_world_size()
    rank = xm.get_ordinal()
    dist.init_process_group(
        backend='xla',
        world_size=world_size,
        rank=rank,
    )
    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)
    # Rank 0 sends, rank 1 receives.
    src, dst = 0, 1
    if rank == src:
        dist.send(tensor=send_tensor, dst=dst)
    elif rank == dst:
        dist.recv(tensor=receive_tensor, src=src)
    xm.mark_step()
    torch_xla.sync()
    if rank == dst:
        expected = torch.ones((2, 3))
        assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
        print(f"Rank {rank} got:\n{receive_tensor.cpu()}")

if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'
    xmp.spawn(_mp_fn, args=())
```
Error:
```
2025-06-28T22:11:23Z Compiler status PASS
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/lib64/python3.9/concurrent/futures/process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/lib64/python3.9/concurrent/futures/process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
replica_results = list(
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
yield fs.pop().result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
return fn()
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 185, in __call__
self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
File "/home/ec2-user/home/ec2-user/test_neuron/test_send_recv_2.py", line 27, in _mp_fn
dist.send(tensor=send_tensor, dst=dst)
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2148, in send
default_pg.send([tensor], dst, tag).wait()
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/distributed/xla_backend.py", line 249, in send
channel_id = self.make_send_channel_id(dst_rank, tag)
File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/distributed/xla_backend.py", line 242, in make_send_channel_id
raise NotImplementedError
NotImplementedError
"""- Using
torch_xla.core.xla_modelto specify the channel_id directly
```python
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.xla_backend
import torch.distributed as dist

def _mp_fn(index):
    device = xm.xla_device()
    rank = xm.get_ordinal()
    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)
    # Rank 0 sends on a fixed channel id, rank 1 receives on the same channel.
    channel_id = 100
    if rank == 0:
        xm.send(send_tensor, channel_id)
    else:
        xm.recv(receive_tensor, channel_id)
    xm.mark_step()
    torch_xla.sync()
    if rank == 1:
        expected = torch.ones((2, 3))
        assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
        print(f"Rank {rank} got:\n{receive_tensor.cpu()}")

if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'
    xmp.spawn(_mp_fn, args=())
```
Error (it hangs for more than 5 minutes):
```
2025-06-28T22:22:39Z Compiler status PASS
2025-Jun-28 22:22:39.0863 9036:9274 [1] nccl_net_ofi_rdma_init:7734 CCOM WARN NET/OFI OFI fi_getinfo() call failed: No data available
2025-Jun-28 22:22:39.0870 9036:9274 [1] nccl_net_ofi_create_plugin:251 CCOM WARN NET/OFI Unable to find a protocol that worked. Failing initialization.
2025-Jun-28 22:22:39.0880 9036:9274 [1] nccl_net_ofi_create_plugin:316 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jun-28 22:22:39.0887 9036:9274 [1] nccl_net_ofi_init:139 CCOM WARN NET/OFI Initializing plugin failed
2025-Jun-28 22:22:39.0893 9036:9274 [1] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jun-28 22:25:39.0792 9036:9274 [1] include/socket.h:483 CCOM WARN Connect to 127.0.0.1<62182> failed : Connection refused - retrying [2-rank bootstrap: rank 1 sends its info to root./If a connection error occurs here, root may be unresponsive or not yet be active./-1]
```
- Using `torch_xla.core.xla_model` with two simultaneous send/receive operations on different `channel_id`s
```python
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.xla_backend

def _mp_fn(index):
    device = xm.xla_device()
    rank = xm.get_ordinal()
    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)
    # Each rank sends on one channel and receives on the other,
    # so both ranks send and receive at the same time.
    channel_id = 100
    channel_id2 = 101
    if rank == 0:
        xm.send(send_tensor, channel_id)
        xm.recv(receive_tensor, channel_id2)
    else:
        xm.send(send_tensor, channel_id2)
        xm.recv(receive_tensor, channel_id)
    xm.mark_step()
    torch_xla.sync()
    expected = torch.ones((2, 3))
    assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
    print(f"Rank {rank} got:\n{receive_tensor.cpu()}")

if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'
    xmp.spawn(_mp_fn, args=())
```
Error:
```
2025-06-28T22:19:15Z Compiler status PASS
2025-06-28T22:19:15Z Compiler status PASS
2025-Jun-28 22:19:15.466545 8652:8728 ERROR ENC:enc_enqueue_operation [nec_dev 0] peer_id does not match
2025-Jun-28 22:19:15.466613 8649:8757 ERROR ENC:enc_enqueue_operation [nec_dev 1] peer_id does not match
2025-Jun-28 22:19:15.467540 8652:8728 ERROR TDRV:instr_col_enc_enqueue Failed to post to enc for pseudo send receive instruction
2025-Jun-28 22:19:15.468485 8649:8757 ERROR TDRV:instr_col_enc_enqueue Failed to post to enc for pseudo send receive instruction
2025-Jun-28 22:19:15.469570 8652:8728 ERROR TDRV:instr_col_translate_psr Failed to setup pseudo send receive instruction
2025-Jun-28 22:19:15.470670 8649:8757 ERROR TDRV:instr_col_translate_psr Failed to setup pseudo send receive instruction
2025-Jun-28 22:19:15.471782 8652:8728 ERROR TDRV:ib_create_one_block failed to translate instructions
2025-Jun-28 22:19:15.472831 8649:8757 ERROR TDRV:ib_create_one_block failed to translate instructions
2025-Jun-28 22:19:15.473624 8652:8728 ERROR TDRV:ib_create_eib_v2 Failed to create instruction blocks
2025-Jun-28 22:19:15.474579 8649:8757 ERROR TDRV:ib_create_eib_v2 Failed to create instruction blocks
2025-Jun-28 22:19:15.475676 8652:8728 ERROR TDRV:sequencer_v2_setup_instr_one_eng Failed to allocate eib
2025-Jun-28 22:19:15.476457 8649:8757 ERROR TDRV:sequencer_v2_setup_instr_one_eng Failed to allocate eib
2025-Jun-28 22:19:15.477298 8652:8728 ERROR TDRV:kbl_model_add create_engine_refill_rings_v1() error
2025-Jun-28 22:19:15.478131 8649:8757 ERROR TDRV:kbl_model_add create_engine_refill_rings_v1() error
2025-Jun-28 22:19:15.479169 8652:8728 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2025-Jun-28 22:19:15.480125 8649:8757 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2025-Jun-28 22:19:15.480867 8652:8728 ERROR NMGR:kmgr_load_nn_internal_v2 Failed to stage graph to NeuronCore
2025-Jun-28 22:19:15.481719 8649:8757 ERROR NMGR:kmgr_load_nn_internal_v2 Failed to stage graph to NeuronCore
2025-Jun-28 22:19:15.482643 8652:8728 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/tmpadnd5hrb/model.neff, err: 2
2025-Jun-28 22:19:15.483591 8649:8757 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/tmpqpcbkpq0/model.neff, err: 2
2025-Jun-28 22:19:15.484682 8652:8728 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2025-Jun-28 22:19:15.485774 8649:8757 ERROR NRT:nrt_infodump Neuron runtime information - please include in any support request:
2025-Jun-28 22:19:15.486938 8652:8728 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2025-Jun-28 22:19:15.488110 8649:8757 ERROR NRT:nrt_infodump ------------->8------------[ cut here ]------------>8-------------
2025-Jun-28 22:19:15.489291 8652:8728 ERROR NRT:nrt_infodump NRT version: 2.23.112.0 (9b51794926c2814d842669698af897a9d4f9b4d5)
2025-Jun-28 22:19:15.490466 8649:8757 ERROR NRT:nrt_infodump NRT version: 2.23.112.0 (9b51794926c2814d842669698af897a9d4f9b4d5)
2025-Jun-28 22:19:15.491627 8652:8728 ERROR NRT:nrt_infodump Embedded FW version: 1.13.13.0 (e7baf1ca28ed552f50ed9db7ffb9d3ec70555d22)
2025-Jun-28 22:19:15.492811 8649:8757 ERROR NRT:nrt_infodump Embedded FW version: 1.13.13.0 (e7baf1ca28ed552f50ed9db7ffb9d3ec70555d22)
2025-Jun-28 22:19:15.494035 8652:8728 ERROR NRT:nrt_infodump CCOM version: 2.23.135.0- (compat 55)
2025-Jun-28 22:19:15.495277 8649:8757 ERROR NRT:nrt_infodump CCOM version: 2.23.135.0- (compat 55)
2025-Jun-28 22:19:15.496244 8652:8728 ERROR NRT:nrt_infodump Instance ID: i-0323a99be88c6a673
2025-Jun-28 22:19:15.497189 8649:8757 ERROR NRT:nrt_infodump Instance ID: i-0323a99be88c6a673
2025-Jun-28 22:19:15.498153 8652:8728 ERROR NRT:nrt_infodump Cluster ID: N/A
2025-Jun-28 22:19:15.499043 8649:8757 ERROR NRT:nrt_infodump Cluster ID: N/A
2025-Jun-28 22:19:15.499807 8652:8728 ERROR NRT:nrt_infodump Kernel: Linux 6.1.115-126.197.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 5 17:36:57 UTC 2024
2025-Jun-28 22:19:15.500598 8649:8757 ERROR NRT:nrt_infodump Kernel: Linux 6.1.115-126.197.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 5 17:36:57 UTC 2024
2025-Jun-28 22:19:15.502016 8652:8728 ERROR NRT:nrt_infodump Nodename: ip-172-31-62-160.us-west-2.compute.internal
2025-Jun-28 22:19:15.503437 8649:8757 ERROR NRT:nrt_infodump Nodename: ip-172-31-62-160.us-west-2.compute.internal
2025-Jun-28 22:19:15.504560 8652:8728 ERROR NRT:nrt_infodump Driver version: 2.19.64.0
2025-Jun-28 22:19:15.505628 8649:8757 ERROR NRT:nrt_infodump Driver version: 2.19.64.0
2025-Jun-28 22:19:15.506590 8652:8728 ERROR NRT:nrt_infodump Failure: NRT_INVALID in nrt_load()
2025-Jun-28 22:19:15.507588 8649:8757 ERROR NRT:nrt_infodump Failure: NRT_INVALID in nrt_load()
2025-Jun-28 22:19:15.508518 8652:8728 ERROR NRT:nrt_infodump Visible cores: 0
2025-Jun-28 22:19:15.509436 8649:8757 ERROR NRT:nrt_infodump Visible cores: 1
2025-Jun-28 22:19:15.510224 8652:8728 ERROR NRT:nrt_infodump Environment:
2025-Jun-28 22:19:15.511029 8649:8757 ERROR NRT:nrt_infodump Environment:
2025-Jun-28 22:19:15.511782 8652:8728 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/libneuronxla/libneuronpjrt.so
2025-Jun-28 22:19:15.512549 8649:8757 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/libneuronxla/libneuronpjrt.so
2025-Jun-28 22:19:15.514099 8652:8728 ERROR NRT:nrt_infodump NEURON_KEEP_DONATIONS_ALIVE=1
2025-Jun-28 22:19:15.515679 8649:8757 ERROR NRT:nrt_infodump NEURON_KEEP_DONATIONS_ALIVE=1
2025-Jun-28 22:19:15.516599 8652:8728 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2025-Jun-28 22:19:15.517521 8649:8757 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2025-Jun-28 22:19:15.518499 8652:8728 ERROR NRT:nrt_infodump NEURON_CC_FLAGS= --no_cache
2025-Jun-28 22:19:15.519497 8649:8757 ERROR NRT:nrt_infodump NEURON_CC_FLAGS= --no_cache
2025-Jun-28 22:19:15.520405 8652:8728 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1
2025-Jun-28 22:19:15.521316 8649:8757 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1
2025-Jun-28 22:19:15.522282 8652:8728 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=0
2025-Jun-28 22:19:15.523260 8649:8757 ERROR NRT:nrt_infodump NEURON_PJRT_PROCESS_INDEX=1
2025-Jun-28 22:19:15.524168 8652:8728 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.54
2025-Jun-28 22:19:15.525081 8649:8757 ERROR NRT:nrt_infodump NEURON_INTERNAL_PJRT_C_API_VERSION=0.54
2025-Jun-28 22:19:15.526072 8652:8728 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
2025-Jun-28 22:19:15.527060 8649:8757 ERROR NRT:nrt_infodump -------------8<-----------[ cut to here ]-----------8<------------
```