
PJRT_Client_Destroy is not invoked in PT/XLA 2.8.0+ #9669

@rajkthakur

Description


🐛 Bug

We are observing a memory leak in PT/XLA 2.8.0+: PJRT_Client_Destroy is not invoked, so the backend runtime is not properly cleaned up for AWS Neuron devices. The issue is also reproducible with the CPU device. Investigation points to issue #9384: the ComputationClient is created as a raw pointer rather than a unique_ptr in runtime.cpp. This behavior persists on the current development branch of PyTorch/XLA.

We verified that converting the raw pointer to a smart pointer fixes the issue, because the destructor is then properly invoked at program termination.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Python 3.10+ virtual env
  2. pip install torch==2.8.0 torch-xla==2.8.0
  3. Run export TF_CPP_MIN_LOG_LEVEL=0; export TF_CPP_VMODULE="cpu_client=1"; export NEURON_RT_LOG_LEVEL=DEBUG; export PJRT_DEVICE=CPU
  4. Run python -c "import torch_xla; device=torch_xla.device()"
  5. Notice in the logs below that "PjRtCpuClient created." is logged but "PjRtCpuClient destroyed" never is:
(pt_28) ubuntu@ip-10-1-203-61:~$ python -c "import torch_xla; device=torch_xla.device()"
/home/ubuntu/pt_28/lib/python3.10/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
2025-10-03 04:17:42.142612: I external/xla/xla/pjrt/cpu/cpu_client.cc:334] PjRtCpuClient created.
  6. If you run the same steps [1-5] with PT/XLA 2.7.0, you will observe that the client is properly destroyed:
(pt_27) ubuntu@ip-10-1-203-61:~$ python -c "import torch_xla; device=torch_xla.device()"
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:174: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
2025-10-03 03:47:17.209860: I external/xla/xla/pjrt/cpu/cpu_client.cc:395] TfrtCpuClient created.
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:789: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:991: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
2025-10-03 03:47:19.926677: I external/xla/xla/pjrt/cpu/cpu_client.cc:398] TfrtCpuClient destroyed.
(pt_27)

Expected behavior

It is expected that PJRT_Client_Destroy is invoked on program termination, logging "PjRtCpuClient destroyed".

Environment

  • Reproducible on XLA backend: CPU, Neuron
  • torch_xla version: 2.8, 2.8.1, Top of Tree

Additional context
