🐛 Bug
We are observing a memory leak in PT/XLA 2.8.0+: PJRT_Client_Destroy is never invoked, so the backend runtime is not properly cleaned up for AWS Neuron devices. The issue is also reproducible with the CPU device. Investigation points to issue #9384: the ComputationClient is created as a raw pointer rather than a unique_ptr in runtime.cpp. The behavior persists on the current development branch of PyTorch/XLA.
We verified that switching from the raw pointer to a smart pointer fixes the issue, because the destructor is then properly called on program termination.
To Reproduce
Steps to reproduce the behavior:
- Create a Python 3.10+ virtual env
- pip install torch==2.8.0 torch-xla==2.8.0
- Run:
  export TF_CPP_MIN_LOG_LEVEL=0; export TF_CPP_VMODULE="cpu_client=1"; export NEURON_RT_LOG_LEVEL=DEBUG; export PJRT_DEVICE=CPU
- Run:
  python -c "import torch_xla; device=torch_xla.device()"
- You will see logs like the ones below, where "PjRtCpuClient created." is logged but "PjRtCpuClient destroyed" is not:
(pt_28) ubuntu@ip-10-1-203-61:~$ python -c "import torch_xla; device=torch_xla.device()"
/home/ubuntu/pt_28/lib/python3.10/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
2025-10-03 04:17:42.142612: I external/xla/xla/pjrt/cpu/cpu_client.cc:334] PjRtCpuClient created.
- If you run the same steps [1-5] with PT/XLA 2.7.0, you will observe that the client is properly destroyed:
(pt_27) ubuntu@ip-10-1-203-61:~$ python -c "import torch_xla; device=torch_xla.device()"
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:174: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
2025-10-03 03:47:17.209860: I external/xla/xla/pjrt/cpu/cpu_client.cc:395] TfrtCpuClient created.
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:789: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/home/ubuntu/pt_27/lib/python3.10/site-packages/torch/cuda/__init__.py:991: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
2025-10-03 03:47:19.926677: I external/xla/xla/pjrt/cpu/cpu_client.cc:398] TfrtCpuClient destroyed.
(pt_27)
Expected behavior
PJRT_Client_Destroy is expected to be invoked on program termination, logging "PjRtCpuClient destroyed".
Environment
- Reproducible on XLA backend [CPU/TPU]: CPU, Neuron
- torch_xla version: 2.8, 2.8.1, Top of Tree