Peer to Peer Operation Send/Receive Issues | Pytorch NeuronX Collective Communication #1189

@aleks-tu

Description

Hello everyone,

As mentioned in my previous issue here, I am testing the distributed capabilities of the trn1.2xlarge machine for data-shuffling use cases. This time I wanted to try the peer-to-peer operations "send" and "receive". In PyTorch/XLA these are available either through dist.send(tensor=send_tensor, dst=dst) / dist.recv(tensor=receive_tensor, src=src), or by calling the XLA interface directly with xm.send(send_tensor, channel_id) / xm.recv(recv_tensor, channel_id).
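
To make the two call styles concrete, here is a condensed sketch of how I am invoking them; rank, src, dst and channel_id are placeholders, and the full reproduction scripts follow further below. As the traceback under snippet 1 suggests, the torch.distributed path has to map the destination rank to a channel id internally, while the xm path takes the channel id explicitly:

import torch.distributed as dist
import torch_xla.core.xla_model as xm


def p2p_via_torch_distributed(rank, send_tensor, receive_tensor, src=0, dst=1):
    # Variant 1: torch.distributed with the 'xla' backend
    if rank == src:
        dist.send(tensor=send_tensor, dst=dst)
    elif rank == dst:
        dist.recv(tensor=receive_tensor, src=src)


def p2p_via_xla_model(rank, send_tensor, receive_tensor, src=0, dst=1, channel_id=100):
    # Variant 2: torch_xla.core.xla_model with an explicit channel id
    if rank == src:
        xm.send(send_tensor, channel_id)
    elif rank == dst:
        xm.recv(receive_tensor, channel_id)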

The documentation states that these operations are generally supported, but when used with PyTorch NeuronX they produce the following errors:

  • A NotImplementedError in the method make_send_channel_id when using the torch.distributed package
  • The Python program hangs for more than 5 minutes when sending a small tensor via the torch_xla.core.xla_model package
  • An NRT error when using torch_xla.core.xla_model to send AND receive simultaneously on different channels

The Python code used and the complete errors are provided below.

So my questions are: are the peer-to-peer communication operations "send"/"receive" not supported in PyTorch NeuronX, or is this a bug? Is simultaneous send/receive also unsupported? Any help is appreciated!

Steps to reproduce:

  1. Start a trn1.2xlarge instance using image-id "ami-080b4a9b6e048125e" (Deep Learning AMI Neuron (Amazon Linux 2023) 20250115)
  2. source /opt/aws_neuronx_venv_pytorch_2_5/bin/activate (PyTorch 2.5 Torch NeuronX, NxD Core Environment)
  3. python3 test_send_receive.py (the given code snippets)

Here are the code snippets and the corresponding errors:

  1. Using the PyTorch distributed package
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch.distributed as dist


def _mp_fn(index):
    device = xm.xla_device()
    world_size = xm.xrt_world_size()
    rank = xm.get_ordinal()
    
    dist.init_process_group(
        backend='xla',
        world_size=world_size,
        rank=rank,
    )
    
    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)

    src, dst = 0, 1

    if rank == src:
        dist.send(tensor=send_tensor, dst=dst)
    elif rank == dst:
        dist.recv(tensor=receive_tensor, src=src)

    xm.mark_step()
    torch_xla.sync()

    if rank == dst:
        expected = torch.ones((2, 3))
        assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
        print(f"Rank {rank} got:\n{receive_tensor.cpu()}")


if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'

    xmp.spawn(_mp_fn, args=())

Error:

2025-06-28T22:11:23Z Compiler status PASS
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
    replica_results = list(
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
    return fn()
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/_internal/pjrt.py", line 185, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/home/ec2-user/home/ec2-user/test_neuron/test_send_recv_2.py", line 27, in _mp_fn
    dist.send(tensor=send_tensor, dst=dst)
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2148, in send
    default_pg.send([tensor], dst, tag).wait()
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/distributed/xla_backend.py", line 249, in send
    channel_id = self.make_send_channel_id(dst_rank, tag)
  File "/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/torch_xla/distributed/xla_backend.py", line 242, in make_send_channel_id
    raise NotImplementedError
NotImplementedError
"""
  2. Using torch_xla.core.xla_model to specify the channel_id directly
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.xla_backend
import torch.distributed as dist


def _mp_fn(index):
    device = xm.xla_device()
    rank = xm.get_ordinal()

    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)
    channel_id = 100

    if rank == 0:
        xm.send(send_tensor, channel_id)
    else:
        xm.recv(receive_tensor, channel_id)

    xm.mark_step()
    torch_xla.sync()

    if rank == 1:
        expected = torch.ones((2, 3))
        assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
        print(f"Rank {rank} got:\n{receive_tensor.cpu()}")


if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'
    xmp.spawn(_mp_fn, args=())

Error (the program hangs for more than 5 minutes):

2025-06-28T22:22:39Z Compiler status PASS
2025-Jun-28 22:22:39.0863 9036:9274 [1] nccl_net_ofi_rdma_init:7734 CCOM WARN NET/OFI OFI fi_getinfo() call failed: No data available
2025-Jun-28 22:22:39.0870 9036:9274 [1] nccl_net_ofi_create_plugin:251 CCOM WARN NET/OFI Unable to find a protocol that worked.  Failing initialization.
2025-Jun-28 22:22:39.0880 9036:9274 [1] nccl_net_ofi_create_plugin:316 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jun-28 22:22:39.0887 9036:9274 [1] nccl_net_ofi_init:139 CCOM WARN NET/OFI Initializing plugin failed
2025-Jun-28 22:22:39.0893 9036:9274 [1] net_plugin.cc:94 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jun-28 22:25:39.0792 9036:9274 [1] include/socket.h:483 CCOM WARN Connect to 127.0.0.1<62182> failed : Connection refused - retrying [2-rank bootstrap: rank 1 sends its info to root./If a connection error occurs here, root may be unresponsive or not yet be active./-1]
  3. Using torch_xla.core.xla_model with two simultaneous send/receive operations on different channel_ids
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.xla_backend


def _mp_fn(index):
    device = xm.xla_device()
    rank = xm.get_ordinal()

    data_to_send = torch.ones((2, 3))
    send_tensor = data_to_send.to(device)
    receive_tensor = torch.empty_like(send_tensor).to(device)
    channel_id = 100
    channel_id2 = 101

    if rank == 0:
        xm.send(send_tensor, channel_id)
        xm.recv(receive_tensor, channel_id2)
    else:
        xm.send(send_tensor, channel_id2)
        xm.recv(receive_tensor, channel_id)

    xm.mark_step()
    torch_xla.sync()

    expected = torch.ones((2, 3))
    assert torch.all(receive_tensor.cpu() == expected), f'{receive_tensor} != {expected}'
    print(f"Rank {rank} got:\n{receive_tensor.cpu()}")


if __name__ == '__main__':
    os.environ["NEURONCORE_NUM_DEVICES"] = "2"
    os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --no_cache'
    xmp.spawn(_mp_fn, args=())

Error:

2025-06-28T22:19:15Z Compiler status PASS
2025-06-28T22:19:15Z Compiler status PASS
2025-Jun-28 22:19:15.466545  8652:8728  ERROR   ENC:enc_enqueue_operation                   [nec_dev 0] peer_id does not match
2025-Jun-28 22:19:15.466613  8649:8757  ERROR   ENC:enc_enqueue_operation                   [nec_dev 1] peer_id does not match
2025-Jun-28 22:19:15.467540  8652:8728  ERROR  TDRV:instr_col_enc_enqueue                   Failed to post to enc for pseudo send receive instruction
2025-Jun-28 22:19:15.468485  8649:8757  ERROR  TDRV:instr_col_enc_enqueue                   Failed to post to enc for pseudo send receive instruction
2025-Jun-28 22:19:15.469570  8652:8728  ERROR  TDRV:instr_col_translate_psr                 Failed to setup pseudo send receive instruction
2025-Jun-28 22:19:15.470670  8649:8757  ERROR  TDRV:instr_col_translate_psr                 Failed to setup pseudo send receive instruction
2025-Jun-28 22:19:15.471782  8652:8728  ERROR  TDRV:ib_create_one_block                     failed to translate instructions
2025-Jun-28 22:19:15.472831  8649:8757  ERROR  TDRV:ib_create_one_block                     failed to translate instructions
2025-Jun-28 22:19:15.473624  8652:8728  ERROR  TDRV:ib_create_eib_v2                        Failed to create instruction blocks
2025-Jun-28 22:19:15.474579  8649:8757  ERROR  TDRV:ib_create_eib_v2                        Failed to create instruction blocks
2025-Jun-28 22:19:15.475676  8652:8728  ERROR  TDRV:sequencer_v2_setup_instr_one_eng        Failed to allocate eib
2025-Jun-28 22:19:15.476457  8649:8757  ERROR  TDRV:sequencer_v2_setup_instr_one_eng        Failed to allocate eib
2025-Jun-28 22:19:15.477298  8652:8728  ERROR  TDRV:kbl_model_add                           create_engine_refill_rings_v1() error
2025-Jun-28 22:19:15.478131  8649:8757  ERROR  TDRV:kbl_model_add                           create_engine_refill_rings_v1() error
2025-Jun-28 22:19:15.479169  8652:8728  ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2025-Jun-28 22:19:15.480125  8649:8757  ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2025-Jun-28 22:19:15.480867  8652:8728  ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph to NeuronCore
2025-Jun-28 22:19:15.481719  8649:8757  ERROR  NMGR:kmgr_load_nn_internal_v2                Failed to stage graph to NeuronCore
2025-Jun-28 22:19:15.482643  8652:8728  ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/tmpadnd5hrb/model.neff, err: 2
2025-Jun-28 22:19:15.483591  8649:8757  ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: /tmp/tmpqpcbkpq0/model.neff, err: 2
2025-Jun-28 22:19:15.484682  8652:8728  ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2025-Jun-28 22:19:15.485774  8649:8757  ERROR   NRT:nrt_infodump                            Neuron runtime information - please include in any support request:
2025-Jun-28 22:19:15.486938  8652:8728  ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2025-Jun-28 22:19:15.488110  8649:8757  ERROR   NRT:nrt_infodump                            ------------->8------------[ cut here ]------------>8-------------
2025-Jun-28 22:19:15.489291  8652:8728  ERROR   NRT:nrt_infodump                            NRT version: 2.23.112.0 (9b51794926c2814d842669698af897a9d4f9b4d5)
2025-Jun-28 22:19:15.490466  8649:8757  ERROR   NRT:nrt_infodump                            NRT version: 2.23.112.0 (9b51794926c2814d842669698af897a9d4f9b4d5)
2025-Jun-28 22:19:15.491627  8652:8728  ERROR   NRT:nrt_infodump                            Embedded FW version: 1.13.13.0 (e7baf1ca28ed552f50ed9db7ffb9d3ec70555d22)
2025-Jun-28 22:19:15.492811  8649:8757  ERROR   NRT:nrt_infodump                            Embedded FW version: 1.13.13.0 (e7baf1ca28ed552f50ed9db7ffb9d3ec70555d22)
2025-Jun-28 22:19:15.494035  8652:8728  ERROR   NRT:nrt_infodump                            CCOM version: 2.23.135.0- (compat 55)
2025-Jun-28 22:19:15.495277  8649:8757  ERROR   NRT:nrt_infodump                            CCOM version: 2.23.135.0- (compat 55)
2025-Jun-28 22:19:15.496244  8652:8728  ERROR   NRT:nrt_infodump                            Instance ID: i-0323a99be88c6a673
2025-Jun-28 22:19:15.497189  8649:8757  ERROR   NRT:nrt_infodump                            Instance ID: i-0323a99be88c6a673
2025-Jun-28 22:19:15.498153  8652:8728  ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2025-Jun-28 22:19:15.499043  8649:8757  ERROR   NRT:nrt_infodump                            Cluster ID: N/A
2025-Jun-28 22:19:15.499807  8652:8728  ERROR   NRT:nrt_infodump                            Kernel: Linux 6.1.115-126.197.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov  5 17:36:57 UTC 2024
2025-Jun-28 22:19:15.500598  8649:8757  ERROR   NRT:nrt_infodump                            Kernel: Linux 6.1.115-126.197.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov  5 17:36:57 UTC 2024
2025-Jun-28 22:19:15.502016  8652:8728  ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-62-160.us-west-2.compute.internal
2025-Jun-28 22:19:15.503437  8649:8757  ERROR   NRT:nrt_infodump                            Nodename: ip-172-31-62-160.us-west-2.compute.internal
2025-Jun-28 22:19:15.504560  8652:8728  ERROR   NRT:nrt_infodump                            Driver version: 2.19.64.0

2025-Jun-28 22:19:15.505628  8649:8757  ERROR   NRT:nrt_infodump                            Driver version: 2.19.64.0

2025-Jun-28 22:19:15.506590  8652:8728  ERROR   NRT:nrt_infodump                            Failure: NRT_INVALID in nrt_load()
2025-Jun-28 22:19:15.507588  8649:8757  ERROR   NRT:nrt_infodump                            Failure: NRT_INVALID in nrt_load()
2025-Jun-28 22:19:15.508518  8652:8728  ERROR   NRT:nrt_infodump                            Visible cores: 0
2025-Jun-28 22:19:15.509436  8649:8757  ERROR   NRT:nrt_infodump                            Visible cores: 1
2025-Jun-28 22:19:15.510224  8652:8728  ERROR   NRT:nrt_infodump                            Environment:
2025-Jun-28 22:19:15.511029  8649:8757  ERROR   NRT:nrt_infodump                            Environment:
2025-Jun-28 22:19:15.511782  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/libneuronxla/libneuronpjrt.so
2025-Jun-28 22:19:15.512549  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_LIBRARY_PATH=/opt/aws_neuronx_venv_pytorch_2_5/lib64/python3.9/site-packages/libneuronxla/libneuronpjrt.so
2025-Jun-28 22:19:15.514099  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_KEEP_DONATIONS_ALIVE=1
2025-Jun-28 22:19:15.515679  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_KEEP_DONATIONS_ALIVE=1
2025-Jun-28 22:19:15.516599  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2025-Jun-28 22:19:15.517521  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_RT_ROOT_COMM_ID=localhost:62182
2025-Jun-28 22:19:15.518499  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_CC_FLAGS= --no_cache
2025-Jun-28 22:19:15.519497  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_CC_FLAGS= --no_cache
2025-Jun-28 22:19:15.520405  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1
2025-Jun-28 22:19:15.521316  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESSES_NUM_DEVICES=1,1
2025-Jun-28 22:19:15.522282  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=0
2025-Jun-28 22:19:15.523260  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_PJRT_PROCESS_INDEX=1
2025-Jun-28 22:19:15.524168  8652:8728  ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.54
2025-Jun-28 22:19:15.525081  8649:8757  ERROR   NRT:nrt_infodump                                NEURON_INTERNAL_PJRT_C_API_VERSION=0.54
2025-Jun-28 22:19:15.526072  8652:8728  ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
2025-Jun-28 22:19:15.527060  8649:8757  ERROR   NRT:nrt_infodump                            -------------8<-----------[ cut to here ]-----------8<------------
