Out of memory even if hundreds of GB are available #35121

@JanLuca

Description

We are currently facing a hard-to-debug problem: LLVM reports an out-of-memory error even though 480 of 500 GB are free (Intel hardware, two CPU sockets, inside a Slurm batch run), while the same code works fine on the development PC with 32 GB of system RAM (Apple M processor, one CPU socket). The code runs without any accelerator, on CPUs only.

The relevant code is inside an outer non-jitted function that iteratively calls multiple jitted functions. This outer function runs several times without problems before, after some number of calls, the error cited below arises. We monitored the RAM usage of the script and noticed that it consumes only roughly 20 GB of the 500 GB available before failing. Therefore, it does not seem to be an issue of real memory pressure.

Do you have any advice on how to debug the problem and how to fix it?

Since some of the inner functions are expected to be retraced and recompiled due to changing tensor shapes, it would be surprising if the x-th recompilation of the same set of inner functions led to this explosion of memory usage...
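
To make the structure described above concrete, here is a minimal sketch of an outer non-jitted function that repeatedly calls a jitted inner function with changing input shapes. The names `inner_step` and `outer_routine` are hypothetical and unrelated to the actual varipeps code; the sort inside `inner_step` is only a stand-in operation. The sketch relies on the fact that Python side effects inside a jitted function run only while JAX traces it, so the counter shows how often a new shape triggers a retrace.

```python
import jax
import jax.numpy as jnp

trace_count = 0  # incremented only while JAX traces inner_step


@jax.jit
def inner_step(x):
    """Hypothetical jitted inner function (stand-in for the real routines)."""
    global trace_count
    trace_count += 1  # Python side effect: runs at trace time, not on cached calls
    return jnp.sort(x) * 2.0


def outer_routine(sizes):
    """Hypothetical non-jitted driver that calls the jitted function repeatedly."""
    results = []
    for n in sizes:
        # A shape not seen before forces a retrace and a fresh compilation;
        # a shape seen before reuses the cached executable.
        results.append(inner_step(jnp.ones((n,))))
    return results


outs = outer_routine([4, 8, 4, 16, 8])
print("number of traces:", trace_count)
```

With the five calls above only three distinct shapes occur, so the function is traced three times. In a real run, setting `jax.config.update("jax_log_compiles", True)` before the first call should make JAX log each compilation, which may help to correlate the failure with a particular recompilation.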

E0217 00:51:03.002669 1438751 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002673 1438754 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002674 1438767 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002674 1438765 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002714 1438770 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002744 1438743 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002780 1438749 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.002971 1438752 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.003504 1438756 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.003054 1438763 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.004015 1438744 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.010430 1438745 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.013461 1438761 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.020497 1438750 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
E0217 00:51:03.056769 1438747 execution_engine.cc:54] LLVM compilation error: Cannot allocate memory
Traceback (most recent call last):
  File "variance_mapleLeafLattice_Tron.py", line 296, in <module>
    result = varipeps.variance.triangular.calc_variance_triangular_nearest_neighbor(
        unitcell_iPEPS,
    ...<6 lines>...
    
    )
  File "venv_var/lib/python3.13/site-packages/varipeps/variance/triangular.py", line 51, in calc_variance_triangular_nearest_neighbor
    variance_unitcell = calc_variance_ctmrg_triangular(
        variance_unitcell,
        extra_updates=extra_updates,
        centered_bonds=("H",),
    )
  File "venv_var/lib/python3.13/site-packages/varipeps/variance/triangular_routine.py", line 2195, in calc_variance_ctmrg_triangular
    working_unitcell = do_absorption_step_triangular_variance(
        working_unitcell,
    ...<3 lines>...
        varipeps_global_state,
    )
  File "venv_var/lib/python3.13/site-packages/varipeps/variance/triangular_routine.py", line 2021, in do_absorption_step_triangular_variance
    proj_30, smallest_S_T = calc_T_30_150_270_projectors(
                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        view_tensors, view_tensor_objs, config, state, ("30",)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
jax.errors.JaxRuntimeError: INTERNAL: Failed to materialize symbols: { (<xla_jit_dylib_6>, { wrapped_compare }) }
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
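
"Cannot allocate memory" is the standard message for the `ENOMEM` error code, and on Linux `mmap` can return `ENOMEM` for reasons other than exhausted physical RAM, for example when the per-process limit on the number of memory mappings or an address-space resource limit is reached. Whether any of this applies here is an assumption, not a diagnosis. The following sketch uses only the Python standard library to print these limits for the current process; the helper names are hypothetical, and the `/proc` paths exist only on Linux.

```python
import resource


def read_int(path):
    """Return the first integer stored in a procfs file, or None if unavailable."""
    try:
        with open(path) as fh:
            return int(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def count_mappings(pid="self"):
    """Count the memory mappings of a process (one line per mapping in /proc/<pid>/maps)."""
    try:
        with open(f"/proc/{pid}/maps") as fh:
            return sum(1 for _ in fh)
    except OSError:
        return None  # not on Linux, or the process is not accessible


def fmt_limit(value):
    """Render a resource limit, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_limit_report():
    """Collect limits that can make an allocation fail despite free RAM."""
    report = {}
    for name in ("RLIMIT_AS", "RLIMIT_DATA"):
        if hasattr(resource, name):  # not every platform defines both
            soft, hard = resource.getrlimit(getattr(resource, name))
            report[name] = (fmt_limit(soft), fmt_limit(hard))
    report["mappings"] = count_mappings()
    report["max_map_count"] = read_int("/proc/sys/vm/max_map_count")
    return report


for key, value in memory_limit_report().items():
    print(f"{key}: {value}")
```

Calling `memory_limit_report()` from inside the Slurm job, ideally shortly before the failing step, would show whether the mapping count approaches `max_map_count` or whether the batch environment imposes an address-space limit that the development machine does not.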

System info (python version, jaxlib version, accelerator, etc.)

Machine where the code fails:

jax: 0.9.0.1
jaxlib: 0.9.0.1
numpy: 2.4.2
python: 3.13.5 (main, Jun 25 2025, 18:55:22) [GCC 14.2.0]

Development machine where the code is running fine:

jax: 0.9.0.1
jaxlib: 0.9.0.1
numpy: 2.4.2
python: 3.13.2 (main, Feb 4 2025, 14:51:09) [Clang 16.0.0 (clang-1600.0.26.6)]


Labels

bug (Something isn't working)
