GPU constant memory copy failed #11

@tkoskela

I've built HemePure-GPU on the ga005 node of the Cosma GPU testbed, which has 2 MI210 GPUs, with the following toolchain:

HIP version: 6.3.42131-fa1d09cbd
AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.0 24455 f24aa3b4a91f6ee2fcd15629ba0b49fa545d8d6b)

Here is my entire build script:

module load hipcc
module load openmpi

export HIP_PATH=/opt/rocm/
export hip_DIR=/opt/rocm/lib/cmake/hip
export HIP_PLATFORM=amd HIP_COMPILER=clang HIP_RUNTIME=rocclr
export HCC_AMDGPU_TARGET="gfx90a"

cmake -DCMAKE_C_COMPILER=hipcc \
      -DCMAKE_CXX_COMPILER=hipcc \
      -DHEMELB_GPU_BACKEND=HIP_ROCM \
      -DHEMELB_CUDA_AWARE_MPI=OFF \
      -DHEMELB_COMPUTE_ARCHITECTURE=NEUTRAL \
      -DCMAKE_CXX_EXTENSIONS=OFF \
      -DHEMELB_USE_VELOCITY_WEIGHTS_FILE=OFF \
      -DHEMELB_INLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
      -DHEMELB_WALL_INLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
      -DHEMELB_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
      -DHEMELB_WALL_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
      -DHEMELB_LOG_LEVEL="Info" \
      -DHEMELB_USE_MPI_PARALLEL_IO=OFF \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      ..

cmake --build . -j 8  -v

I run the code with this script:

#!/bin/bash
module load hipcc openmpi
export OMP_NUM_THREADS=1

OUTDIR=${1-results}

rm -rf "$OUTDIR"
mpirun -np 3 ../build/hemepure_gpu -in ./NVIDIA-TestPipe/input_PP.xml -out "$OUTDIR"

The run fails with "GPU constant memory copy failed (1)" errors, followed by a segfault:

[Rank 0000000, 5.152948e-03 s, 00000000531872 kB] :: INITIALISE
[Rank 0000000, 5.206819e-03 s, 00000000531872 kB] :: ----------
[Rank 0000000, 5.240793e-03 s, 00000000531872 kB] :: --> loading input and decomposing geometry
[Rank 0000000, 4.341046e-02 s, 00000000533920 kB] :: ----> opened data file ./NVIDIA-TestPipe/pipe.gmy
[Rank 0000000, 4.346045e-02 s, 00000000533920 kB] :: ----> reading preamble
[Rank 0000000, 4.670267e-02 s, 00000000533920 kB] :: ----> reading header (start)
[Rank 0000000, 6.204022e-02 s, 00000000541000 kB] :: ----> reading header (end)
[Rank 0000000, 6.249226e-02 s, 00000000541000 kB] :: ----> non-empty blocks: 29640
[Rank 0000000, 6.252750e-02 s, 00000000541000 kB] ::           total blocks: 38000
[Rank 0000000, 6.256220e-02 s, 00000000541000 kB] ::                  ratio: 0.780000
[Rank 0000000, 6.259775e-02 s, 00000000541000 kB] ::                  sites: 13144201
[Rank 0000000, 6.263216e-02 s, 00000000541000 kB] :: ----> blockInformation.size(): 29640
[Rank 0000000, 6.266447e-02 s, 00000000541000 kB] ::  fluidSitesOnEachBlock.size(): 0
[Rank 0000000, 6.269722e-02 s, 00000000541000 kB] ::           blockWeights.size(): 0
[Rank 0000000, 6.272921e-02 s, 00000000541000 kB] :: ----> is blockInformation.size() == nonEmptyBlocks? yes
[Rank 0000000, 6.276130e-02 s, 00000000541000 kB] :: ----> not optimising decomposition
[Rank 0000000, 6.302023e-02 s, 00000000541000 kB] :: ----> basic decomposition (start)
[Rank 0000001, 7.083371e-02 s, 00000000182500 kB] :: ----> load distribution: 0.000000
[Rank 0000000, 7.152354e-02 s, 00000000545096 kB] :: ----> basic decomposition (end)
[Rank 0000000, 7.156072e-02 s, 00000000545096 kB] :: ----> read blocks (start)
[Rank 0000000, 5.518123e+01 s, 00000004237640 kB] :: ----> read blocks (end)
[Rank 0000000, 5.595535e+01 s, 00000004237640 kB] :: --> lattice data
[Rank 0000000, 5.595543e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: sorting Nonempty blocks
[Rank 0000000, 5.667528e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: setting basic Details
[Rank 0000000, 5.667535e+01 s, 00000004237640 kB] :: ----> LatticeDataInitializer: processing read sites
[Rank 0000000, 5.957566e+01 s, 00000011316504 kB] :: ----> LatticeDataInitializer: Collect Fluid Distributions
[Rank 0000000, 5.957576e+01 s, 00000011316504 kB] :: ----> gathering lattice information (start)
[Rank 0000000, 5.957586e+01 s, 00000011316504 kB] :: ----> gathering lattice information (end)
[Rank 0000000, 5.957590e+01 s, 00000011316504 kB] :: ----> collecting site extrema (start)
[Rank 0000000, 5.964989e+01 s, 00000011316504 kB] :: ----> collecting site extrema (end)
[Rank 0000000, 5.964993e+01 s, 00000011316504 kB] :: ----> LatticeDataInitializer: Initialize Neighbor lookups
[Rank 0000000, 5.964996e+01 s, 00000011316504 kB] :: ----> initializing neighbour lookups (start)
[Rank 0000000, 5.965000e+01 s, 00000011316504 kB] :: ----> initializing neighbour lookups: NLookup
[Rank 0000000, 6.287945e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups: Point to Point comms
[Rank 0000000, 6.288084e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups: Receive Lookup
[Rank 0000000, 6.288391e+01 s, 00000013270596 kB] :: ----> initializing neighbour lookups (end)
[Rank 0000000, 6.288394e+01 s, 00000013270596 kB] :: ----> LatticeDataInitializer: done
[Rank 0000000, 6.288398e+01 s, 00000013270596 kB] :: --> neighbouring data manager
[Rank 0000000, 6.288401e+01 s, 00000013270596 kB] :: --> lattice-Boltzmann model
LocalRank 1 attaching to GPU Device 0
LocalRank 2 attaching to GPU Device 1
Rank 0: Detected 2 GPU device(s)
===============================================
Device properties: 
Device name:  AMDGCN GFX-1122101688
Compute Capability: 9.0

Total Global Mem:    64.0GB
Number of Streaming Multiprocessors:  104
Shared Mem Per SM:   64KB
Max Number of Threads per Block:  1024
Max Number of Blocks allowed in x-dir:  2147483647
Max Number of Blocks allowed in y-dir:  65536
Warp Size:  64
===============================================

================================================================
Rank: 1, n_Inlets_Inner: 0, n_InletsWall_Inner: 0, n_Inlets_Edge: 0, n_InletsWall_Edge: 0 
Rank: 1, n_Outlets_Inner: 1, n_OutletsWall_Inner: 1, n_Outlets_Edge: 0, n_OutletsWall_Edge: 0 
================================================================
GPU constant memory copy failed (1)
================================================================
Rank: 2, n_Inlets_Inner: 1, n_InletsWall_Inner: 1, n_Inlets_Edge: 0, n_InletsWall_Edge: 0 
Rank: 2, n_Outlets_Inner: 0, n_OutletsWall_Inner: 0, n_Outlets_Edge: 0, n_OutletsWall_Edge: 0 
================================================================
GPU constant memory copy failed (1)
[Rank 0000000, 6.701489e+01 s, 00000034609576 kB] :: -------------------
[Rank 0000000, 6.701494e+01 s, 00000034609576 kB] :: INITIALISE FINISHED
[ga005:1318189:0:1318189] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x27c)
[Rank 0000000, 6.723233e+01 s, 00000034609576 kB] :: SIMULATION STARTING
[Rank 0000000, 6.723239e+01 s, 00000034609576 kB] :: -------------------
[ga005:1318188:0:1318188] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x27c)
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
BFD: DWARF error: invalid or unhandled FORM value: 0x25
==== backtrace (tid:1318188) ====
 0 0x000000000003ebf0 __GI___sigaction()  :0
 1 0x0000000000222c29 hipRegisterTracerCallback()  ???:0
 2 0x000000000022df7e hipRegisterTracerCallback()  ???:0
 3 0x00000000002571f9 hemelb::__device_stub__GPU_CollideStream_mMidFluidCollision_mWallCollision_sBB_WallShearStress()  ???:0
 4 0x0000000000248e37 hemelb::lb::LBM<hemelb::lb::lattices::D3Q19>::PreSend()  ???:0
 5 0x0000000000274ed3 hemelb::net::IteratedAction::CallAction()  ???:0
 6 0x000000000027927d hemelb::net::phased::StepManager::CallActionsForStep()  ???:0
 7 0x000000000027964e hemelb::net::phased::StepManager::CallActions()  ???:0
 8 0x0000000000241669 SimulationMaster::DoTimeStep()  ???:0
 9 0x00000000002414ab SimulationMaster::RunSimulation()  ???:0
10 0x000000000023e262 main()  ???:0
11 0x00000000000295d0 __libc_start_call_main()  ???:0
12 0x0000000000029680 __libc_start_main_alias_2()  :0
13 0x000000000023e105 _start()  ???:0
=================================
BFD: DWARF error: invalid or unhandled FORM value: 0x25
==== backtrace (tid:1318189) ====
 0 0x000000000003ebf0 __GI___sigaction()  :0
 1 0x0000000000222c29 hipRegisterTracerCallback()  ???:0
 2 0x000000000022df7e hipRegisterTracerCallback()  ???:0
 3 0x00000000002571f9 hemelb::__device_stub__GPU_CollideStream_mMidFluidCollision_mWallCollision_sBB_WallShearStress()  ???:0
 4 0x0000000000248e37 hemelb::lb::LBM<hemelb::lb::lattices::D3Q19>::PreSend()  ???:0
 5 0x0000000000274ed3 hemelb::net::IteratedAction::CallAction()  ???:0
 6 0x000000000027927d hemelb::net::phased::StepManager::CallActionsForStep()  ???:0
 7 0x000000000027964e hemelb::net::phased::StepManager::CallActions()  ???:0
 8 0x0000000000241669 SimulationMaster::DoTimeStep()  ???:0
 9 0x00000000002414ab SimulationMaster::RunSimulation()  ???:0
10 0x000000000023e262 main()  ???:0
11 0x00000000000295d0 __libc_start_call_main()  ???:0
12 0x0000000000029680 __libc_start_main_alias_2()  :0
13 0x000000000023e105 _start()  ???:0
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 1318188 on node ga005 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
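Because the repeated BFD lines bury the messages that matter, a small helper can summarise a captured log. This is only a sketch: the function name `summarise_log` and the idea of saving the run output to a file are assumptions for illustration, not part of HemePure-GPU. It just counts matching lines with `grep -c`.

```shell
#!/bin/bash
# Hypothetical helper (not part of HemePure-GPU): summarise a saved run log
# by counting the lines that matter and the repeated BFD/DWARF noise.
summarise_log() {
    local log="$1"
    local copy_failures segfaults dwarf_noise
    # grep -c prints the number of matching lines (0 if there are none).
    copy_failures=$(grep -c 'GPU constant memory copy failed' "$log")
    segfaults=$(grep -c 'Caught signal 11' "$log")
    dwarf_noise=$(grep -c '^BFD: DWARF error' "$log")
    echo "constant memory copy failures: $copy_failures"
    echo "segfault reports: $segfaults"
    echo "BFD DWARF lines: $dwarf_noise"
}
```

Assuming the output of the run script was saved to a file such as `run.log` (for example by piping it through `tee`), calling `summarise_log run.log` on the log above would report two constant memory copy failures and two segfault reports.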
