Description

I am encountering errors while running a simple Bayesian inference with flowMC, a JAX-based project. The issue arises when I increase the number of chains and the step size; the same code runs without errors for smaller values. I have consulted the following resources:

I have tried setting os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false" (a sketch of how it was applied follows below), and I have tested on two different GPU clusters. The code uses a small neural network with the following layers:

I would greatly appreciate guidance on resolving this issue. Specifically, I am seeking help with:
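As mentioned above, here is a minimal sketch of how the preallocation flag was applied (the `XLA_PYTHON_CLIENT_MEM_FRACTION` line is an additional allocator knob shown here as an assumption, not part of the failing run):

```python
import os

# XLA allocator flags only take effect if set before the first `import jax`.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate on demand
# Assumed extra knob (not from the original run): cap JAX's share of GPU memory.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.90"

import jax  # noqa: E402  # import must come after the env vars

print(jax.devices())
```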
Thank you for your time and consideration. I look forward to your guidance.

Error
[Loading lalsimutils.py : MonteCarloMarginalization version]
scipy : 1.13.0
numpy : 1.26.4
['n_dim', 'n_chains', 'n_local_steps', 'n_global_steps', 'n_loop', 'output_thinning', 'verbose']
Global Tuning: 0%| | 0/100 [00:00<?, ?it/s]2024-05-22 12:55:11.339369: W external/xla/xla/service/hlo_rematerialization.cc:2948] Can't reduce memory use below -266.95GiB (-286634288046 bytes) by rematerialization; only reduced to 156.25GiB (167772160000 bytes), down from 156.25GiB (167772160000 bytes) originally
[The same hlo_rematerialization.cc warning repeats roughly 30 more times within the same second; every instance is identical except the one below, which reports a larger buffer.]
2024-05-22 12:55:11.372054: W external/xla/xla/service/hlo_rematerialization.cc:2948] Can't reduce memory use below -266.95GiB (-286634288046 bytes) by rematerialization; only reduced to 312.50GiB (335548514320 bytes), down from 312.50GiB (335548514320 bytes) originally
Global Tuning: 0%| | 0/100 [02:45<?, ?it/s]
Compiling MALA body
jax.errors.SimplifiedTraceback: For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/muhammad.zeeshan/gwk-runs/o4_single_gaussian/single_gaussian.py", line 120, in <module>
nfmcmc_handler.run()
File "/home/muhammad.zeeshan/gwkokab/gwkokab/inference/nfmcmchandler.py", line 198, in run
nf_sampler = self.run_sampler()
File "/home/muhammad.zeeshan/gwkokab/gwkokab/inference/nfmcmchandler.py", line 128, in run_sampler
nf_sampler.sample(
File "/home/muhammad.zeeshan/gwkokab/kvenv/lib/python3.10/site-packages/flowMC/Sampler.py", line 204, in sample
) = strategy(
File "/home/muhammad.zeeshan/gwkokab/kvenv/lib/python3.10/site-packages/flowMC/strategy/global_tuning.py", line 174, in __call__
) = global_sampler.sample(
File "/home/muhammad.zeeshan/gwkokab/kvenv/lib/python3.10/site-packages/flowMC/proposal/NF_proposal.py", line 130, in sample
proposal_position, log_prob_proposal, log_prob_nf_proposal = self.sample_flow(
File "/home/muhammad.zeeshan/gwkokab/kvenv/lib/python3.10/site-packages/flowMC/proposal/NF_proposal.py", line 210, in sample_flow
log_prob_proposal = self.logpdf_vmap(proposal_position, data)
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 156.27GiB (167788937216B) on device ordinal 0

GPUs info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:01:00.0 Off | 0 |
| N/A 43C P0 81W / 500W | 29354MiB / 81920MiB | 27% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB Off | 00000000:41:00.0 Off | 0 |
| N/A 48C P0 215W / 500W | 3052MiB / 81920MiB | 89% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB Off | 00000000:81:00.0 Off | 0 |
| N/A 65C P0 243W / 500W | 14620MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB Off | 00000000:C1:00.0 Off | 0 |
| N/A 46C P0 175W / 500W | 30688MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Off | 00000000:01:00.0 Off | N/A |
| 30% 56C P2 236W / 320W | 2312MiB / 10240MiB | 57% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:C1:00.0 Off | 0 |
| N/A 47C P0 53W / 250W | 5416MiB / 40960MiB | 50% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

System info
It is the same for both clusters.

>>> python
>>> import jax; jax.print_environment_info()
jax: 0.4.28
jaxlib: 0.4.28
numpy: 1.26.4
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
jax.devices (2 total, 2 local): [cuda(id=0) cuda(id=1)]
process_count: 1
platform: uname_result(system='Linux', node='ldas-pcdev2', release='4.18.0-513.24.1.el8_9.x86_64', version='#1 SMP Thu Apr 4 18:13:02 UTC 2024', machine='x86_64')
>>> nvidia-smi
Wed May 22 15:01:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Off | 00000000:01:00.0 Off | N/A |
| 30% 58C P2 244W / 320W | 2530MiB / 10240MiB | 61% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:C1:00.0 Off | 0 |
| N/A 52C P0 212W / 250W | 5841MiB / 40960MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
Hey @kazewong, can you share any insight regarding this error?
The current implementation of flowMC does not utilize multiple GPUs. The chains and the network are created on the default device, and I am not sure how you are sharding the data onto multiple devices. You can see from the error message:
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 156.27GiB (167788937216B) on device ordinal 0
JAX is trying to allocate ~156 GB of memory on your first GPU, which is far more than the device can handle. Your computation is simply too big for the device; I would advise looking into reducing the memory footprint for now.
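As a rough illustration of one way to do that (this is a hypothetical sketch, not flowMC's actual API; `logpdf`, the shapes, and `batch_size` are placeholders), you can evaluate a large vmapped log-density in sequential chunks so that only one chunk's intermediates are live at a time:

```python
import jax
import jax.numpy as jnp


def logpdf(x):
    # Stand-in for an expensive per-sample log-density.
    return -0.5 * jnp.sum(x**2)


def chunked_logpdf(positions, batch_size=1000):
    """Evaluate logpdf over a large batch in sequential chunks.

    Peak memory scales with batch_size instead of the full batch,
    at the cost of less parallelism. Assumes batch_size divides n.
    """
    n, dim = positions.shape
    chunks = positions.reshape(n // batch_size, batch_size, dim)
    # lax.map runs the vmapped function over chunks one at a time.
    return jax.lax.map(jax.vmap(logpdf), chunks).reshape(n)


positions = jnp.ones((10_000, 5))
print(chunked_logpdf(positions).shape)  # (10000,)
```

The trade-off is throughput: `lax.map` serializes the chunks, but peak memory now scales with `batch_size` rather than the total number of samples. Since the failure happens in `logpdf_vmap` over the NF proposal positions, reducing `n_chains` or the number of proposal samples would have a similar effect.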