Description
When running the GMRES solver with the HIP backend on AMD Instinct MI250X hardware (ROCm 6.4.3),
an assertion is triggered at every iteration of the solver:
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
The assertion fires exactly 4 times per iteration, starting from iteration 0, in a perfectly
systematic way. Despite the assertions, the solver converges correctly.
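For readers unfamiliar with the message, here is a minimal sketch of the kind of per-dimension bounds check the assertion text describes. It is not Ginkgo's accessor code: the function name `checked_row_major_index` and the row-major layout are assumptions made purely for illustration, and it throws an exception where the real code uses an assertion, so the failure can be observed without aborting.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (NOT Ginkgo's implementation): computes a row-major
// linear index and checks every index against the extent of its dimension,
// mirroring the condition `first < size[dim_idx]` from the assertion text.
template <std::size_t Dim>
std::size_t checked_row_major_index(const std::array<std::size_t, Dim>& size,
                                    const std::array<std::size_t, Dim>& idx)
{
    std::size_t linear = 0;
    for (std::size_t d = 0; d < Dim; ++d) {
        if (idx[d] >= size[d]) {
            // This is the situation the reported assertion complains about.
            throw std::out_of_range("index out of range for this dimension");
        }
        linear = linear * size[d] + idx[d];
    }
    return linear;
}
```

With extents {3, 4}, the index {1, 2} maps to linear position 6, while {3, 0} is rejected because 3 is not smaller than the first extent.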
Environment
Ginkgo version: v1.10.0
ROCm version: 6.4.3-128
GPU: AMD Instinct MI250X (gfx90a)
OS: Linux
Build: HIP backend, no MPI, no CUDA
Minimal reproducer
auto exec = gko::HipExecutor::create(0, gko::ReferenceExecutor::create());
auto A = share(gko::matrix::Csr<double>::create(exec, gko::dim<2>{N, N}));
auto rhs = gko::matrix::Dense<double>::create(exec, gko::dim<2>(N, 1));
auto u = gko::matrix::Dense<double>::create(exec, gko::dim<2>(N, 1));
auto solver = gko::solver::Gmres<double>::build()
    .with_criteria(
        gko::stop::Iteration::build().with_max_iters(2000),
        gko::stop::ResidualNorm<double>::build()
            .with_baseline(gko::stop::mode::rhs_norm)
            .with_reduction_factor(1e-6))
    .with_preconditioner(gko::preconditioner::Jacobi<double>::build())
    .on(exec)->generate(A);
solver->apply(rhs, u);

With N = 99792 (a 99792×99792 sparse CSR matrix, ~7 non-zeros per row).
Actual behavior
4 assertions per iteration, every iteration, from iteration 0:
[LOG] >>> apply started on A LinOp[gko::solver::Gmres<double>,...] ...
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
[LOG] >>> iteration 0 completed ... Stopped the iteration process false
accessor/accessor_helper.hpp:57: compute: Assertion `first < static_cast<IndexType>(size[dim_idx])' failed
...
The solver still converges (has_converged = 1, 175 iterations) but the assertions suggest
incorrect memory access patterns in the HIP kernels.
Steps to reproduce
- Build Ginkgo v1.10.0 with ROCm 6.4.3 and HIP backend (-DKokkos_ARCH_AMD_GFX90A=ON)
- Create a GMRES solver with Jacobi preconditioner on a HipExecutor
- Call solver->apply(rhs, u) on a sparse system of size ~100k
- Observe assertions firing 4 times per iteration in accessor_helper.hpp:57
Thank you very much in advance for any help with this.