Update kokkos and kokkos-kernels to latest #65

Merged
wcwitt merged 3 commits into main from update_external_kokkos on Jan 23, 2026
@bernstei
Collaborator

bernstei commented Jan 20, 2026

Avoid HostMirror issues now that it has been deprecated in favor of host_mirror_type, in particular to keep up with lammps updates; see https://github.com/lammps/lammps/blob/66da026147b692d3d130b44a3d9bd4e1eb5a91a3/lib/kokkos/CHANGELOG.md?plain=1#L77

closes #64

@bernstei
Collaborator Author

I don't understand why it's crashing after the tests are apparently done. I'll test locally.

@bernstei
Collaborator Author

Is it true that the CI runs on CPU only?

@bernstei
Collaborator Author

bernstei commented Jan 22, 2026

I can reproduce the seg fault locally. It happens after

test_multilayer_perceptron_kokkos.py::evaluate_gradient_batch
test_symmetrix_calc.py

but not any of the other tests.

Must be something about cleanup, because the tests themselves pass. Does test_mace.py not check kokkos? Because I'm assuming that it's the kokkos MLP that's causing the Symmetrix calculator test to segfault, and I'd have expected that to affect test_mace as well.
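
One way to confirm that a crash happens at interpreter shutdown rather than inside a test is to run each test file in its own process and look at how that process ended. The following is a minimal sketch, assuming a POSIX system where Python's `subprocess` reports a child killed by a signal as a negative return code. The helper names `describe_exit` and `run_isolated` are hypothetical and not part of the symmetrix test suite.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Classify a subprocess return code (POSIX convention)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit status {returncode}"


def run_isolated(test_path: str) -> str:
    """Run one pytest file in a fresh interpreter and classify how it ended."""
    proc = subprocess.run([sys.executable, "-m", "pytest", "-q", test_path])
    return describe_exit(proc.returncode)
```

With this, a file whose tests all pass but whose process dies during cleanup would be reported as killed by a signal rather than as a test failure, for example by calling `run_isolated("test_symmetrix_calc.py")`.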

@bernstei
Collaborator Author

I ran the MLP test manually and, with gdb, got the following failure location

Program received signal SIGSEGV, Segmentation fault.
0x000015554a853a32 in Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
(gdb) where
#0  0x000015554a853a32 in Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#1  0x000015554a86c7cc in Kokkos::HostSpace::deallocate(char const*, void*, unsigned long, unsigned long) const ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#2  0x000015554a875263 in Kokkos::Impl::SerialInternal::~SerialInternal() ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#3  0x000015554a8760f1 in std::_Function_handler<void (Kokkos::Impl::SerialInternal*), Kokkos::Impl::HostSharedPtr<Kokkos::Impl::SerialInternal>::HostSharedPtr(Kokkos::Impl::SerialInternal*)::{lambda(Kokkos::Impl::SerialInternal*)#1}>::_M_invoke(std::_Any_data const&, Kokkos::Impl::SerialInternal*&&) ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#4  0x000015554a876080 in Kokkos::Impl::HostSharedPtr<Kokkos::Impl::SerialInternal>::~HostSharedPtr() ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#5  0x000015554e801d4c in __run_exit_handlers () from /lib64/libc.so.6
#6  0x000015554e801e80 in exit () from /lib64/libc.so.6
#7  0x000015554e7eb86c in __libc_start_main () from /lib64/libc.so.6
#8  0x00000000005bbdd3 in _start () at /usr/local/src/conda/python-3.11.10/Modules/frameobject.c:33931

@wcwitt
Owner

wcwitt commented Jan 23, 2026

> Is it true that the CI runs on CPU only?

Yes - that's the only simple option, unfortunately.

@bernstei
Collaborator Author

When I run with a cpu-capable kokkos, I get somewhat different behavior. test_evaluate gives a seg fault, and test_evaluate_gradient_batch just complains about not calling kokkos finalize. The stack trace for the seg fault is:

#0  0x000015554e890389 in __memset_avx512_unaligned_erms () from /lib64/libc.so.6
#1  0x0000155549980ec3 in MultilayerPerceptronKokkos::evaluate(Kokkos::View<double const**, Kokkos::LayoutRight>, Kokkos::View<double*, Kokkos::LayoutRight>) ()
   from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#2  0x0000155549813d0b in ?? () from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#3  0x0000155549815964 in ?? () from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#4  0x000015554977cdfa in ?? () from /home/cluster/bernstei/src/work/MACE/symmetrix/symmetrix/venv_pytest/lib/python3.11/site-packages/symmetrix/symmetrix.cpython-311-x86_64-linux-gnu.so
#5  0x0000000000528a57 in cfunction_call (func=0x15554a137010, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.10/Include/weakrefobject.h:542
#6  0x000000000050451c in _PyObject_MakeTpCall (tstate=0x8a7a38 <_PyRuntime+166328>, callable=0x15554a137010, args=<optimized out>, nargs=<optimized out>, keywords=0x0)
    at /usr/local/src/conda/python-3.11.10/Modules/abstract.c:2373
#7  0x0000000000511a5d in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.10/Programs/ceval_gil.h:4769
#8  0x00000000005cc1ea in _PyEval_EvalFrame (throwflag=0, throwflag@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, frame=0x155555386020, 
    tstate=0x8a7a38 <_PyRuntime+166328>) at /croot/python-split_1727939961902/_build_env/x86_64-conda-linux-gnu/sysroot/usr/include/bits/pycore_frame.h:73
#9  _PyEval_Vector (tstate=0x8a7a38 <_PyRuntime+166328>, func=0x1555554844a0, locals=0x1555554e9e80, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.10/Programs/ceval_gil.h:6434
#10 0x00000000005cb8bf in PyEval_EvalCode (co=<optimized out>, globals=0x1555554e9e80, locals=<optimized out>) at /usr/local/src/conda/python-3.11.10/Programs/ceval_gil.h:1148
#11 0x00000000005ec9e7 in run_eval_code_obj (tstate=0x8a7a38 <_PyRuntime+166328>, co=0x155555430310, globals=0x1555554e9e80, locals=0x1555554e9e80) at Python/deepfreeze/stat.h:1741
#12 0x00000000005e8580 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x1555554e9e80, locals=0x1555554e9e80, flags=<optimized out>, arena=<optimized out>)
    at Python/deepfreeze/stat.h:1762
#13 0x00000000005fd4d2 in pyrun_file (fp=fp@entry=0x8eb370, filename=filename@entry=0x15554e6c8ab0, start=start@entry=257, globals=globals@entry=0x1555554e9e80, 
    locals=locals@entry=0x1555554e9e80, closeit=closeit@entry=1, flags=0x7fffffffaa38) at Python/deepfreeze/stat.h:1657
#14 0x00000000005fc89f in _PyRun_SimpleFileObject (fp=0x8eb370, filename=0x15554e6c8ab0, closeit=1, flags=0x7fffffffaa38) at Python/deepfreeze/stat.h:440
#15 0x00000000005fc5c3 in _PyRun_AnyFileObject (fp=0x8eb370, filename=0x15554e6c8ab0, closeit=1, flags=0x7fffffffaa38) at Python/deepfreeze/stat.h:79
#16 0x00000000005f723e in pymain_run_file_obj (skip_source_first_line=0, filename=0x15554e6c8ab0, program_name=0x155555448a50)
    at /usr/local/src/conda/python-3.11.10/Programs/initconfig.c:360
#17 pymain_run_file (config=0x88da80 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.10/Programs/initconfig.c:379
#18 pymain_run_python (exitcode=0x7fffffffaa30) at /usr/local/src/conda/python-3.11.10/Programs/initconfig.c:605
#19 Py_RunMain () at /usr/local/src/conda/python-3.11.10/Programs/initconfig.c:684
#20 0x00000000005bbf89 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.10/Programs/initconfig.c:738
#21 0x000015554e7eb865 in __libc_start_main () from /lib64/libc.so.6
#22 0x00000000005bbdd3 in _start () at /usr/local/src/conda/python-3.11.10/Modules/frameobject.c:33931
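
The "not calling kokkos finalize" complaint and the earlier crash inside the process exit handlers both involve shutdown ordering. A common pattern for Python extension modules that wrap a native runtime is to register the runtime's finalize routine with `atexit`, so that it runs during interpreter shutdown and before the process exit handlers. The following is a minimal sketch of that pattern only. `FakeRuntime` is a hypothetical stand-in and not the symmetrix or Kokkos API, and whether this pattern would resolve the crashes reported here is not established.

```python
import atexit


class FakeRuntime:
    """Hypothetical stand-in for a native runtime with initialize/finalize."""

    def __init__(self):
        self.initialized = False
        self.finalize_calls = 0

    def initialize(self):
        if not self.initialized:
            self.initialized = True
            # Register on each successful initialize; atexit callbacks run
            # during interpreter shutdown.
            atexit.register(self.finalize)

    def finalize(self):
        # Idempotent: safe to call explicitly and again from atexit.
        if self.initialized:
            self.initialized = False
            self.finalize_calls += 1


runtime = FakeRuntime()
runtime.initialize()
runtime.initialize()  # second call is a no-op
runtime.finalize()    # explicit early finalize
runtime.finalize()    # harmless repeat
```

Making `finalize` idempotent means the `atexit` callback is harmless even when the caller has already shut the runtime down explicitly.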

@wcwitt wcwitt merged commit 274f002 into main Jan 23, 2026
4 checks passed
@wcwitt wcwitt deleted the update_external_kokkos branch January 23, 2026 22:52
Development

Successfully merging this pull request may close these issues.

incompatibility with latest lammps develop in kokkos

2 participants