Description
The same job ran on an interactive node with 4 GPUs:
srun -u -n 4 --gpus 4 ./cuda-build/gyrokinetic/creg/rt_gk_bgk_im_asdex_high_adapt_RECY_3x2v_p1 -g -M -c 1 -d 1 -e 4
Gyrokinetic simulation initialized...
Initialization completed in 114.869 sec
Taking time-step 1 at t = 0.0000000 ... dt = 1.746118e-09
Taking time-step 10 at t = 1.5716367e-08 ... dt = 1.746407e-09 (0.0% complete)
Taking time-step 20 at t = 3.3181379e-08 ... dt = 1.746606e-09 (0.0% complete)
Taking time-step 30 at t = 5.0647801e-08 ... dt = 1.746671e-09 (0.0% complete)
Taking time-step 40 at t = 6.8114999e-08 ... dt = 1.746755e-09 (0.0% complete)
Taking time-step 50 at t = 8.5582774e-08 ... dt = 1.746806e-09 (0.0% complete)
Taking time-step 60 at t = 1.0305115e-07 ... dt = 1.746923e-09 (0.0% complete)
Taking time-step 70 at t = 1.2052108e-07 ... dt = 1.747083e-09 (0.0% complete)
Taking time-step 80 at t = 1.3799239e-07 ... dt = 1.747159e-09 (0.0% complete)
Taking time-step 90 at t = 1.5546407e-07 ... dt = 1.747202e-09 (0.0% complete)
Taking time-step 100 at t = 1.7293638e-07 ... dt = 1.747272e-09 (0.0% complete)
but failed with a segmentation fault as a Slurm job on 6 nodes with 24 GPUs:
The following have been reloaded with a version change:
- cray-mpich/8.1.28 => cray-mpich/8.1.30
The following have been reloaded with a version change:
- cray-mpich/8.1.30 => cray-mpich/8.1.28
srun -u -n 24 --gpus 24 ./cuda-build/gyrokinetic/creg/rt_gk_bgk_im_asdex_high_adapt_RECY_3x2v_p1 -g -M -c 1 -d 1 -e 24
srun: error: nid008312: task 6: Segmentation fault
srun: Terminating StepId=48692363.0
slurmstepd: error: *** STEP 48692363.0 ON nid008232 CANCELLED AT 2026-02-10T14:26:45 ***
srun: error: nid008461: task 12: Terminated
srun: error: nid008668: task 20: Terminated
srun: error: nid008232: task 0: Terminated
srun: error: nid008633: task 16: Terminated
srun: error: nid008400: task 8: Terminated
srun: error: nid008312: task 4: Terminated
srun: error: nid008633: task 17: Terminated
srun: error: nid008232: task 3: Terminated
srun: error: nid008400: task 11: Terminated
srun: error: nid008312: task 7: Terminated
srun: error: nid008668: tasks 21,23: Terminated
srun: error: nid008633: task 19: Terminated
srun: error: nid008461: tasks 13-14: Terminated
srun: error: nid008232: task 2: Terminated
srun: error: nid008312: task 5: Terminated
srun: error: nid008400: task 9: Terminated
srun: error: nid008668: task 22: Terminated
srun: error: nid008461: task 15: Terminated
srun: error: nid008633: task 18: Terminated
srun: error: nid008232: task 1: Terminated
srun: error: nid008400: task 10: Terminated
srun: Force Terminated StepId=48692363.0
This issue occurs at the following commit:
commit 4343535 (HEAD -> main)
Merge: 8c39693 b60c356
Author: James Juno junoravin@gmail.com
Date: Wed Nov 26 12:08:09 2025 -0500

    Merge pull request #912 from ammarhakim/maxgauss_proj_mom_correct_fix

    Set the moment correction option when using a Maxwellian Gaussian projection
Attached is the input file.
rt_gk_bgk_im_asdex_high_adapt_RECY_3x2v_p1.c