accessor_helper.hpp:57 assertion failure with HIP backend on MI250X (ROCm 6.4.3) in GMRES solver #1986

@Tchoupi2001

Description

When running the GMRES solver with the HIP backend on AMD Instinct MI250X hardware (ROCm 6.4.3),
an assertion is triggered at every iteration of the solver:

accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed

The assertion fires exactly 4 times per iteration, starting from iteration 0, in a perfectly
systematic way. Despite the assertions, the solver converges correctly.
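For readers unfamiliar with this kind of check: the failing condition compares an index against the extent of its dimension. The snippet below is a simplified, standalone sketch of such a bounds check for a row-major index computation. It is not Ginkgo's actual implementation, and the names `index_in_bounds` and `row_major_index` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if every index lies within its dimension's extent.
// This mirrors the shape of the failing condition `first < size[dim_idx]`.
template <std::size_t Dim>
bool index_in_bounds(const std::array<std::size_t, Dim>& size,
                     const std::array<long, Dim>& idx)
{
    for (std::size_t d = 0; d < Dim; ++d) {
        if (idx[d] < 0 || idx[d] >= static_cast<long>(size[d])) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: row-major linear offset of a multi-dimensional index.
// The assert plays the role of the check reported in this issue.
template <std::size_t Dim>
std::size_t row_major_index(const std::array<std::size_t, Dim>& size,
                            const std::array<long, Dim>& idx)
{
    assert(index_in_bounds(size, idx));
    std::size_t linear = 0;
    for (std::size_t d = 0; d < Dim; ++d) {
        linear = linear * size[d] + static_cast<std::size_t>(idx[d]);
    }
    return linear;
}
```

In this sketch, an index equal to the extent (for example row 3 of a 3-row range) already fails the check, so an off-by-one in a loop bound or launch configuration would be enough to trigger it.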

Environment

Ginkgo version: v1.10.0
ROCm version: 6.4.3-128
GPU: AMD Instinct MI250X (gfx90a)
OS: Linux
Build: HIP backend, no MPI, no CUDA

Minimal reproducer

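#include <ginkgo/ginkgo.hpp>

const gko::size_type N = 99792;

// HIP executor on device 0, with a reference (host) executor as its master.
auto exec = gko::HipExecutor::create(0, gko::ReferenceExecutor::create());

// System matrix and vectors allocated on the device (filling them is omitted here).
auto A   = gko::share(gko::matrix::Csr<double>::create(exec, gko::dim<2>{N, N}));
auto rhs = gko::matrix::Dense<double>::create(exec, gko::dim<2>(N, 1));
auto u   = gko::matrix::Dense<double>::create(exec, gko::dim<2>(N, 1));

// GMRES with a Jacobi preconditioner; stops after 2000 iterations or once the
// residual norm has dropped by 1e-6 relative to the norm of the right-hand side.
auto solver = gko::solver::Gmres<double>::build()
    .with_criteria(
        gko::stop::Iteration::build().with_max_iters(2000),
        gko::stop::ResidualNorm<double>::build()
            .with_baseline(gko::stop::mode::rhs_norm)
            .with_reduction_factor(1e-6))
    .with_preconditioner(gko::preconditioner::Jacobi<double>::build())
    .on(exec)->generate(A);

// The assertions appear during this call.
solver->apply(rhs, u);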

With N = 99792 (99792×99792 sparse CSR matrix, ~7 non-zeros per row).
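The report does not show how A is filled. For anyone trying to reproduce the problem without the original data, one option is a synthetic matrix with about 7 non-zeros per row. The sketch below builds the three CSR arrays for a diagonally dominant banded matrix in plain C++. The band layout, the values, and the names `CsrData` and `make_banded_csr` are assumptions for illustration only; this is not the reporter's matrix.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the three CSR arrays.
struct CsrData {
    std::vector<std::int32_t> row_ptrs;
    std::vector<std::int32_t> col_idxs;
    std::vector<double> values;
};

// Builds an n x n banded matrix with column offsets -3..3 (up to 7 entries per
// row). Diagonal entries are 8.0 and off-diagonal entries are -1.0, so every
// row is strictly diagonally dominant.
CsrData make_banded_csr(std::int32_t n)
{
    CsrData csr;
    csr.row_ptrs.reserve(static_cast<std::size_t>(n) + 1);
    csr.row_ptrs.push_back(0);
    for (std::int32_t row = 0; row < n; ++row) {
        for (std::int32_t offset = -3; offset <= 3; ++offset) {
            const std::int32_t col = row + offset;
            if (col < 0 || col >= n) {
                continue;  // skip entries that fall outside the matrix
            }
            csr.col_idxs.push_back(col);
            csr.values.push_back(offset == 0 ? 8.0 : -1.0);
        }
        // End of this row = start of the next one.
        csr.row_ptrs.push_back(static_cast<std::int32_t>(csr.col_idxs.size()));
    }
    return csr;
}
```

Arrays like these could then be transferred into the Csr matrix, for instance through Ginkgo's `matrix_data` and `read` interface. Whether a synthetic matrix of this shape triggers the same assertions is untested.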

Actual behavior

4 assertions per iteration, every iteration, from iteration 0:

[LOG] >>> apply started on A LinOp[gko::solver::Gmres<double>,...] ...
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
[LOG] >>> iteration 0 completed ... Stopped the iteration process false
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
...

The solver still converges (has_converged = 1, 175 iterations), but the failing bounds check
suggests that the HIP kernels are using out-of-range indices.
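To check the "4 per iteration" pattern over a full run instead of by eye, the captured output can be scanned mechanically. This is a minimal sketch, assuming the assertion messages and logger lines were captured into one stream in the order shown in the excerpt above; the function name `count_assertions_per_iteration` is hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts assertion lines seen before each "[LOG] >>> iteration" marker.
// Returns one count per completed iteration, in log order. Assertion lines
// after the last marker are not reported.
std::vector<std::size_t> count_assertions_per_iteration(const std::string& log)
{
    const std::string assertion_tag = "accessor_helper.hpp:57";
    const std::string iteration_tag = "[LOG] >>> iteration";

    std::vector<std::size_t> counts;
    std::size_t current = 0;
    std::istringstream stream(log);
    std::string line;
    while (std::getline(stream, line)) {
        if (line.find(assertion_tag) != std::string::npos) {
            ++current;
        } else if (line.find(iteration_tag) != std::string::npos) {
            counts.push_back(current);
            current = 0;
        }
    }
    return counts;
}
```

If stdout and stderr are buffered differently, the interleaving in the captured file may not reflect the true order, so the counts are only as reliable as the capture.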

Steps to reproduce

  1. Build Ginkgo v1.10.0 with ROCm 6.4.3 and HIP backend (-DKokkos_ARCH_AMD_GFX90A=ON)
  2. Create a GMRES solver with Jacobi preconditioner on a HipExecutor
  3. Call solver->apply(rhs, u) on a sparse system of size ~100k
  4. Observe assertions firing 4 times per iteration in accessor_helper.hpp:57

I hope you can help me with this. Thank you very much in advance.
