I've built HemePure-GPU on the ga005 node of the Cosma GPU testbed, which has 2 MI210 GPUs, with the following toolchain:
HIP version: 6.3.42131-fa1d09cbd
AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.0 24455 f24aa3b4a91f6ee2fcd15629ba0b49fa545d8d6b)
Here is my entire build script:
module load hipcc
module load openmpi
export HIP_PATH=/opt/rocm/
export hip_DIR=/opt/rocm/lib/cmake/hip
export HIP_PLATFORM=amd HIP_COMPILER=clang HIP_RUNTIME=rocclr
export HCC_AMDGPU_TARGET="gfx90a"
cmake -DCMAKE_C_COMPILER=hipcc \
-DCMAKE_CXX_COMPILER=hipcc \
-DHEMELB_GPU_BACKEND=HIP_ROCM \
-DHEMELB_CUDA_AWARE_MPI=OFF \
-DHEMELB_COMPUTE_ARCHITECTURE=NEUTRAL \
-DCMAKE_CXX_EXTENSIONS=OFF \
-DHEMELB_USE_VELOCITY_WEIGHTS_FILE=OFF \
-DHEMELB_INLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
-DHEMELB_WALL_INLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
-DHEMELB_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
-DHEMELB_WALL_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
-DHEMELB_LOG_LEVEL="Info" \
-DHEMELB_USE_MPI_PARALLEL_IO=OFF \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
..
cmake --build . -j 8 -v
I run the code with this script:
#!/bin/bash
module load hipcc openmpi
export OMP_NUM_THREADS=1
OUTDIR=${1-results}
rm -rf $OUTDIR
mpirun -np 3 ../build/hemepure_gpu -in ./NVIDIA-TestPipe/input_PP.xml -out $OUTDIR
Both GPU ranks report "GPU constant memory copy failed", and the run then segfaults:
[Rank 0000000, 5.152948e-03 s, 00000000531872 kB] :: INITIALISE
[Rank 0000000, 5.206819e-03 s, 00000000531872 kB] :: ----------
[Rank 0000000, 5.240793e-03 s, 00000000531872 kB] :: --> loading input and decomposing geometry
[Rank 0000000, 4.341046e-02 s, 00000000533920 kB] :: ----> opened data file ./NVIDIA-TestPipe/pipe.gmy
[Rank 0000000, 4.346045e-02 s, 00000000533920 kB] :: ----> reading preamble
[Rank 0000000, 4.670267e-02 s, 00000000533920 kB] :: ----> reading header (start)
[Rank 0000000, 6.204022e-02 s, 00000000541000 kB] :: ----> reading header (end)
[Rank 0000000, 6.249226e-02 s, 00000000541000 kB] :: ----> non-empty blocks: 29640
[Rank 0000000, 6.252750e-02 s, 00000000541000 kB] :: total blocks: 38000
[Rank 0000000, 6.256220e-02 s, 00000000541000 kB] :: ratio: 0.780000
[Rank 0000000, 6.259775e-02 s, 00000000541000 kB] :: sites: 13144201
[Rank 0000000, 6.263216e-02 s, 00000000541000 kB] :: ----> blockInformation.size(): 29640
[Rank 0000000, 6.266447e-02 s, 00000000541000 kB] :: fluidSitesOnEachBlock.size(): 0
[Rank 0000000, 6.269722e-02 s, 00000000541000 kB] :: blockWeights.size(): 0
[Rank 0000000, 6.272921e-02 s, 00000000541000 kB] :: ----> is blockInformation.size() == nonEmptyBlocks? yes
[Rank 0000000, 6.276130e-02 s, 00000000541000 kB] :: ----> not optimising decomposition
[Rank 0000000, 6.302023e-02 s, 00000000541000 kB] :: ----> basic decomposition (start)
[Rank 0000001, 7.083371e-02 s, 00000000182500 kB] :: ----> load distribution: 0.000000
[Rank 0000000, 7.152354e-02 s, 00000000545096 kB] :: ----> basic decomposition (end)
[Rank 0000000, 7.156072e-02 s, 00000000545096 kB] :: ----> read blocks (start)
[Rank 0000000, 5.518123e+01 s, 00000004237640 kB] :: ----> read blocks (end)
[Rank 0000000, 5.595535e+01 s, 00000004237640 kB] :: --> lattice data
[Rank 0000000, 5.595543e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: sorting Nonempty blocks
[Rank 0000000, 5.667528e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: setting basic Details
[Rank 0000000, 5.667535e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: processing read sites
[Rank 0000000, 5.957566e+01 s, 00000011316504 kB] :: ----> LatticeDataInitializer: Collect Fluid Distributions
[Rank 0000000, 5.957576e+01 s, 00000011316504 kB] :: ----> gathering lattice information (start)
[Rank 0000000, 5.957586e+01 s, 00000011316504 kB] :: ----> gathering lattice information (end)
[Rank 0000000, 5.957590e+01 s, 00000011316504 kB] :: ----> collecting site extrema (start)
[Rank 0000000, 5.964989e+01 s, 00000011316504 kB] :: ----> collecting site extrema (end)
[Rank 0000000, 5.964993e+01 s, 00000011316504 kB] :: ----> LatticeDataInitializer: Initialize Neighbor lookups
[Rank 0000000, 5.964996e+01 s, 00000011316504 kB] :: ----> initializing neighbour lookups (start)
[Rank 0000000, 5.965000e+01 s, 00000011316504 kB] :: ----> initializing neighbour lookups: NLookup
[Rank 0000000, 6.287945e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups: Point to Point comms
[Rank 0000000, 6.288084e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups: Receive Lookup
[Rank 0000000, 6.288391e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups (end)
[Rank 0000000, 6.288394e+01 s, 00000013270596 kB] :: ----> LatticeDataInitializer: done
[Rank 0000000, 6.288398e+01 s, 00000013270596 kB] :: --> neighbouring data manager
[Rank 0000000, 6.288401e+01 s, 00000013270596 kB] :: --> lattice-Boltzmann model
LocalRank 1 attaching to GPU Device 0
LocalRank 2 attaching to GPU Device 1
Rank 0: Detected 2 GPU device(s)
===============================================
Device properties:
Device name: AMDGCN GFX-1122101688
Compute Capability: 9.0
Total Global Mem: 64.0GB
Number of Streaming Multiprocessors: 104
Shared Mem Per SM: 64KB
Max Number of Threads per Block: 1024
Max Number of Blocks allowed in x-dir: 2147483647
Max Number of Blocks allowed in y-dir: 65536
Warp Size: 64
===============================================
================================================================
Rank: 1, n_Inlets_Inner: 0, n_InletsWall_Inner: 0, n_Inlets_Edge: 0, n_InletsWall_Edge: 0
Rank: 1, n_Outlets_Inner: 1, n_OutletsWall_Inner: 1, n_Outlets_Edge: 0, n_OutletsWall_Edge: 0
================================================================
GPU constant memory copy failed (1)
================================================================
Rank: 2, n_Inlets_Inner: 1, n_InletsWall_Inner: 1, n_Inlets_Edge: 0, n_InletsWall_Edge: 0
Rank: 2, n_Outlets_Inner: 0, n_OutletsWall_Inner: 0, n_Outlets_Edge: 0, n_OutletsWall_Edge: 0
================================================================
GPU constant memory copy failed (1)
[Rank 0000000, 6.701489e+01 s, 00000034609576 kB] :: -------------------
[Rank 0000000, 6.701494e+01 s, 00000034609576 kB] :: INITIALISE FINISHED
[ga005:1318189:0:1318189] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x27c)
[Rank 0000000, 6.723233e+01 s, 00000034609576 kB] :: SIMULATION STARTING
[Rank 0000000, 6.723239e+01 s, 00000034609576 kB] :: -------------------
[ga005:1318188:0:1318188] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x27c)
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
==== backtrace (tid:1318188) ====
0 0x000000000003ebf0 __GI___sigaction() :0
1 0x0000000000222c29 hipRegisterTracerCallback() ???:0
2 0x000000000022df7e hipRegisterTracerCallback() ???:0
3 0x00000000002571f9 hemelb::__device_stub__GPU_CollideStream_mMidFluidCollision_mWallCollision_sBB_WallShearStress() ???:0
4 0x0000000000248e37 hemelb::lb::LBM<hemelb::lb::lattices::D3Q19>::PreSend() ???:0
5 0x0000000000274ed3 hemelb::net::IteratedAction::CallAction() ???:0
6 0x000000000027927d hemelb::net::phased::StepManager::CallActionsForStep() ???:0
7 0x000000000027964e hemelb::net::phased::StepManager::CallActions() ???:0
8 0x0000000000241669 SimulationMaster::DoTimeStep() ???:0
9 0x00000000002414ab SimulationMaster::RunSimulation() ???:0
10 0x000000000023e262 main() ???:0
11 0x00000000000295d0 __libc_start_call_main() ???:0
12 0x0000000000029680 __libc_start_main_alias_2() :0
13 0x000000000023e105 _start() ???:0
=================================
BFD: DWARF error: invalid or unhandled FORM value: 0x25
==== backtrace (tid:1318189) ====
0 0x000000000003ebf0 __GI___sigaction() :0
1 0x0000000000222c29 hipRegisterTracerCallback() ???:0
2 0x000000000022df7e hipRegisterTracerCallback() ???:0
3 0x00000000002571f9 hemelb::__device_stub__GPU_CollideStream_mMidFluidCollision_mWallCollision_sBB_WallShearStress() ???:0
4 0x0000000000248e37 hemelb::lb::LBM<hemelb::lb::lattices::D3Q19>::PreSend() ???:0
5 0x0000000000274ed3 hemelb::net::IteratedAction::CallAction() ???:0
6 0x000000000027927d hemelb::net::phased::StepManager::CallActionsForStep() ???:0
7 0x000000000027964e hemelb::net::phased::StepManager::CallActions() ???:0
8 0x0000000000241669 SimulationMaster::DoTimeStep() ???:0
9 0x00000000002414ab SimulationMaster::RunSimulation() ???:0
10 0x000000000023e262 main() ???:0
11 0x00000000000295d0 __libc_start_call_main() ???:0
12 0x0000000000029680 __libc_start_main_alias_2() :0
13 0x000000000023e105 _start() ???:0
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 1318188 on node ga005 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
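For triage, the log above can be condensed with a small shell helper. This is only a sketch: the function name `summarise_log` and the log file name `run.log` are hypothetical, and it assumes the run's combined stdout and stderr were saved to a file. The match strings are copied from the output above.

```shell
#!/bin/bash
# Hypothetical helper (not part of HemePure-GPU): condense a saved run log.
# Assumes the run's stdout and stderr were captured to a file beforehand.
summarise_log() {
    local log="$1"
    local copy_fail segv dwarf
    # grep -c prints the number of matching lines (0 if there are none).
    copy_fail=$(grep -c 'GPU constant memory copy failed' "$log")
    segv=$(grep -c 'Caught signal 11' "$log")
    dwarf=$(grep -c '^BFD: DWARF error' "$log")
    echo "constant memory copy failures: $copy_fail"
    echo "segfaults caught: $segv"
    echo "BFD DWARF warnings: $dwarf"
    # First backtrace frame that mentions a device stub, with the frame
    # index and address stripped; prints nothing if there is no such frame.
    grep -m 1 '__device_stub__' "$log" | sed 's/^ *[0-9]* *0x[0-9a-f]* *//'
}
```

Usage: `summarise_log run.log`. The summary separates the two real failures (the constant memory copy and the segfault in the first kernel launch) from the repeated BFD warnings, which come from the backtrace printer failing to parse the debug info rather than from the application itself.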