[Issue]: [MIOpen] Windows gfx1151/gfx110x-all: 12 test failures (MultiMarginLoss, CPU_CandidateSelection, graphapi errors) — regression in a608060..880166b #5501

@AmosLewis

Problem Description

CI regression observed when bumping rocm-libraries in TheRock from a608060 to 880166b. All 4 miopen test shards fail on Windows gfx1151 (release).

Failed test suite: miopen_gtest_standard_suite (exit code 8).

Failed tests (12 total), representative:

  • CPU_TuningPolicy_NONE.TestSetApiLogged
  • Full/CPU_CandidateSelection_NONE.EncodeInputFeatures_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8 (and KernelStrMappingUnknownKernelThrows_Test, SelectBestCandidateValid_Test with same param)
  • Smoke/GPU_MultiMarginLoss_FP32.Test/1, /5, /9; Smoke/GPU_MultiMarginLoss_FP16.Test/1, /5; Smoke/GPU_MultiMarginLoss_BFP16.Test/1, /5, /9 (each index uses the same parameters across data types)

Runtime errors in log (rocm-libraries miopen):

  • MIOpen(HIP): Error at graphapi/convolution.hpp:407, :642; enginecfg.cpp:54; execution_plan.cpp:102; reduction.cpp:217; reshape.cpp:63; errors.hpp:146 (Passing nullptr).

CI job (full log): https://github.com/ROCm/TheRock/actions/runs/23139971395/job/67290574092
TheRock bump PR: ROCm/TheRock#3985
Commit range: rocm-libraries a608060..880166b.

Operating System

Windows (GitHub Actions runner; exact version in workflow)

CPU

strix-halo

GPU

gfx1151

ROCm Version

7.13.0

ROCm Component

MIOpen

Steps to Reproduce

Run TheRock CI for TheRock PR #3985 and open the Windows gfx1151 release job → Test miopen (any shard). Alternatively, build rocm-libraries at commit 880166b and run the miopen tests on Windows gfx1151.
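To rerun only the 12 failing cases instead of the whole suite, a GoogleTest filter string can be assembled from the failed test names in the log. This is a minimal sketch, assuming the MIOpen test binary accepts the standard GoogleTest `--gtest_filter` flag; `build_gtest_filter` is a hypothetical helper, and the binary to pass the flag to is not named in this report.

```python
# Parameter string shared by the three CPU_CandidateSelection failures (from the log).
PARAM = ("gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_"
         "DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8")

# Failing MultiMarginLoss test indices per data type, as listed in the log.
MULTIMARGIN_FAILURES = {"FP32": (1, 5, 9), "FP16": (1, 5), "BFP16": (1, 5, 9)}

FAILED_TESTS = (
    ["CPU_TuningPolicy_NONE.TestSetApiLogged"]
    + [f"Full/CPU_CandidateSelection_NONE.{case}/{PARAM}"
       for case in ("EncodeInputFeatures_Test",
                    "KernelStrMappingUnknownKernelThrows_Test",
                    "SelectBestCandidateValid_Test")]
    + [f"Smoke/GPU_MultiMarginLoss_{dtype}.Test/{index}"
       for dtype, indices in MULTIMARGIN_FAILURES.items()
       for index in indices]
)


def build_gtest_filter(names):
    """Join test names into a colon-separated GoogleTest positive filter."""
    # Drop duplicates while preserving the original order.
    unique = list(dict.fromkeys(names))
    return "--gtest_filter=" + ":".join(unique)


if __name__ == "__main__":
    # Hypothetical usage: append the printed flag to the test binary invocation.
    print(build_gtest_filter(FAILED_TESTS))
```

Explicit names are used rather than wildcards such as `Smoke/GPU_MultiMarginLoss_*` so that tests which currently pass are not pulled into the rerun.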

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

2026-03-16T20:29:19.6602262Z 2: [  FAILED  ] Full/CPU_CandidateSelection_NONE.SelectBestCandidateValid_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8 (9 ms)
......
2026-03-16T20:53:10.4492714Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1092 ms)
2026-03-16T20:53:10.4493643Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_FP32.Test/5
2026-03-16T20:53:10.9702194Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.24170883745259764 vs 1.1920929e-06
2026-03-16T20:53:10.9703045Z 2: 
2026-03-16T20:53:10.9703532Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (521 ms)
2026-03-16T20:53:10.9704125Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_FP32.Test/9
2026-03-16T20:53:11.7311812Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 1.1920929e-06
2026-03-16T20:53:11.7312582Z 2: 
2026-03-16T20:53:11.7378673Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1 (767 ms)
2026-03-16T20:53:11.7379488Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_FP32 (2381 ms total)
2026-03-16T20:53:11.7379851Z 2: 
2026-03-16T20:53:11.7380102Z 2: [----------] 2 tests from Smoke/GPU_MultiMarginLoss_FP16
2026-03-16T20:53:11.7380463Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_FP16.Test/1
2026-03-16T20:53:13.0597507Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.009765625
2026-03-16T20:53:13.0599343Z 2: 
2026-03-16T20:53:13.0600106Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1321 ms)
2026-03-16T20:53:13.0601064Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_FP16.Test/5
2026-03-16T20:53:13.5799782Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.2950681563357076 vs 0.009765625
2026-03-16T20:53:13.5800601Z 2: 
2026-03-16T20:53:13.5801106Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (520 ms)
2026-03-16T20:53:13.5801755Z 2: [----------] 2 tests from Smoke/GPU_MultiMarginLoss_FP16 (1842 ms total)
2026-03-16T20:53:13.5802111Z 2: 
2026-03-16T20:53:13.5802349Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_BFP16
2026-03-16T20:53:13.5802855Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1
2026-03-16T20:53:14.7541177Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.078125
2026-03-16T20:53:14.7541883Z 2: 
2026-03-16T20:53:14.7551034Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1174 ms)
2026-03-16T20:53:14.7551702Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5
2026-03-16T20:53:15.4139333Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.17214981176546584 vs 0.078125
2026-03-16T20:53:15.4140754Z 2: 
2026-03-16T20:53:15.4141540Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (659 ms)
2026-03-16T20:53:15.4142925Z 2: [ RUN      ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9
2026-03-16T20:53:16.2481820Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.078125
2026-03-16T20:53:16.2482636Z 2: 
2026-03-16T20:53:16.2527416Z 2: [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1 (838 ms)
2026-03-16T20:53:16.2528198Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_BFP16 (2672 ms total)

......
2026-03-16T21:17:30.9897735Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:400] Failed to construct CandidateSelectionModel for arch: gfx1151, solver: ConvHipImplicitGemm3DGroupBwdXdlops. Exception: CHRN-SI-109:C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63: Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
2026-03-16T21:17:30.9897931Z Buffered 1 messages to file: CHRN-SI-109:C:\windows\SystemTemp\miopen_error_3600.log
2026-03-16T21:17:30.9898006Z a_grid_desc_m_ak_container_{2809856, 32}
2026-03-16T21:17:30.9898088Z b_grid_desc_n_bk_container_{64, 32}
2026-03-16T21:17:30.9898243Z ds_grid_desc_mblock_mperblock_nblock_nperblock_container_{2809856, 64}
2026-03-16T21:17:30.9898392Z e_grid_desc_mblock_mperblock_nblock_nperblock_container_{2809856, 64}
2026-03-16T21:17:30.9898623Z [       OK ] Full/GPU_GroupConv3D_BackwardData_FP16.GroupConv3D_BackwardData_half_Test/78 (327 ms)
2026-03-16T21:17:30.9898828Z [ RUN      ] Full/GPU_GroupConv3D_BackwardData_FP16.GroupConv3D_BackwardData_half_Test/82
2026-03-16T21:17:30.9899171Z  G:8 N:128 C:16 K:16 D:28 H:28 W:28 z:3 y:3 x:3 pad.z:1 pad.y:1 pad.x:1 stride.z:2 stride.y:2 stride.x:2 dilation.z:1 dilation.y:1 dilation.x:1
2026-03-16T21:17:30.9899857Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63] Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
2026-03-16T21:17:30.9900106Z Buffered 27 messages to file: CHRN-SI-109:C:\windows\SystemTemp\miopen_error_3600.log
2026-03-16T21:17:30.9901641Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:400] Failed to construct CandidateSelectionModel for arch: gfx1151, solver: ConvHipImplicitGemm3DGroupBwdXdlops. Exception: CHRN-SI-109:C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63: Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
...
2026-03-16T21:17:31.7387428Z [  FAILED  ] 12 tests, listed below:
2026-03-16T21:17:31.7387621Z [  FAILED  ] CPU_TuningPolicy_NONE.TestSetApiLogged
2026-03-16T21:17:31.7389274Z [  FAILED  ] Full/CPU_CandidateSelection_NONE.EncodeInputFeatures_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7392600Z [  FAILED  ] Full/CPU_CandidateSelection_NONE.KernelStrMappingUnknownKernelThrows_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7394354Z [  FAILED  ] Full/CPU_CandidateSelection_NONE.SelectBestCandidateValid_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7394979Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7395465Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7395949Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1
2026-03-16T21:17:31.7396412Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7396872Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_FP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7397345Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7397837Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7398324Z [  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1
2026-03-16T21:17:31.7398446Z 12 FAILED TESTS
2026-03-16T21:17:31.7398594Z   YOU HAVE 107 DISABLED TESTS
2026-03-16T21:17:31.7398602Z 
2026-03-16T21:17:31.7398609Z 
2026-03-16T21:17:31.7398615Z 
2026-03-16T21:17:31.7398774Z 0% tests passed, 1 tests failed out of 1
2026-03-16T21:17:31.7398899Z 
2026-03-16T21:17:31.7399017Z Label Time Summary:
2026-03-16T21:17:31.7399160Z pr          = 3462.94 sec*proc (1 test)
2026-03-16T21:17:31.7399291Z standard    = 3462.94 sec*proc (1 test)
2026-03-16T21:17:31.7399299Z 
2026-03-16T21:17:31.7399423Z Total Test time (real) = 3463.30 sec
2026-03-16T21:17:31.7399430Z 
2026-03-16T21:17:31.7399629Z The following tests FAILED:
2026-03-16T21:17:31.7399904Z 	  2 - miopen_gtest_standard_suite (Failed)              pr standard
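
The MultiMarginLoss assertions in the log above can be summarized by pairing each `actual: <error> vs <tolerance>` line with the `[  FAILED  ]` line that follows it. The sketch below does that on three lines abbreviated from the log (timestamps and paths dropped); `parse_tolerance_failures` is a hypothetical helper. The printed tolerances happen to match 10 × machine epsilon for each type (2^-23, 2^-10, 2^-7); that is an observation from the numbers only, not something confirmed here from the test source.

```python
import math
import re

ASSERT_RE = re.compile(r"actual: (\S+) vs (\S+)")
FAILED_RE = re.compile(r"\[\s+FAILED\s+\] (\S+?),")

# Abbreviated from the CI log in this issue.
SAMPLE_LOG = """\
multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.24170883745259764 vs 1.1920929e-06
[  FAILED  ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (521 ms)
multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.009765625
[  FAILED  ] Smoke/GPU_MultiMarginLoss_FP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1321 ms)
multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.17214981176546584 vs 0.078125
[  FAILED  ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (659 ms)
"""


def parse_tolerance_failures(log_text):
    """Return (test_name, error, tolerance) for each assertion followed by a FAILED line."""
    results = []
    pending = None  # most recent (error, tolerance) not yet matched to a test name
    for line in log_text.splitlines():
        match = ASSERT_RE.search(line)
        if match:
            pending = (float(match.group(1)), float(match.group(2)))
            continue
        match = FAILED_RE.search(line)
        if match and pending is not None:
            results.append((match.group(1), pending[0], pending[1]))
            pending = None
    return results


if __name__ == "__main__":
    # Mantissa-based machine epsilon per type (assumption used only for comparison).
    epsilons = {"_FP32": 2.0 ** -23, "_FP16": 2.0 ** -10, "_BFP16": 2.0 ** -7}
    for name, error, tolerance in parse_tolerance_failures(SAMPLE_LOG):
        eps = next(v for k, v in epsilons.items() if k + "." in name)
        matches = math.isclose(tolerance, 10 * eps, rel_tol=1e-6)
        print(f"{name}: error/tolerance = {error / tolerance:.3g}, "
              f"tolerance == 10*eps: {matches}")
```

In every parsed case the error exceeds the tolerance by a wide margin, which points to wrong results rather than borderline precision loss.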
