Problem Description
CI regression observed when bumping rocm-libraries in TheRock from a608060 to 880166b. All 4 miopen test shards fail on Windows gfx1151 (release).
Failed test suite: miopen_gtest_standard_suite (exit code 8).
Failed tests (12 total), representative:
- CPU_TuningPolicy_NONE.TestSetApiLogged
- Full/CPU_CandidateSelection_NONE.EncodeInputFeatures_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8 (and KernelStrMappingUnknownKernelThrows_Test, SelectBestCandidateValid_Test with same param)
- Smoke/GPU_MultiMarginLoss_FP32.Test/1, /5, /9 (and FP16, BFP16 variants with same param patterns)
Runtime errors in log (rocm-libraries miopen):
MIOpen(HIP): Error at graphapi/convolution.hpp:407, :642; enginecfg.cpp:54; execution_plan.cpp:102; reduction.cpp:217; reshape.cpp:63; errors.hpp:146 (Passing nullptr).
CI job (full log): https://github.com/ROCm/TheRock/actions/runs/23139971395/job/67290574092
TheRock bump PR: ROCm/TheRock#3985
Commit range: rocm-libraries a608060..880166b.
Operating System
Windows (GitHub Actions runner; exact version in workflow)
CPU
strix-halo
GPU
gfx1151
ROCm Version
7.13.0
ROCm Component
MIOpen
Steps to Reproduce
Run TheRock CI for TheRock PR #3985 (Windows gfx1151 release → Test miopen, any shard). Alternatively, build rocm-libraries at commit 880166b and run the miopen tests on Windows gfx1151.
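To rerun only the affected suites locally, a GoogleTest filter can be assembled from the names in the failure summary. The sketch below is an illustration under assumptions: `--gtest_filter` is the standard GoogleTest flag, but the executable name `miopen_gtest` and the helpers `build_gtest_filter` and `build_command` are hypothetical and not taken from the CI configuration. The wildcard patterns are a convenience and will also select passing cases in the same suites.

```python
# Patterns derived from the failing test names listed in this report.
FAILED_PATTERNS = [
    "CPU_TuningPolicy_NONE.TestSetApiLogged",
    "Full/CPU_CandidateSelection_NONE.*",
    "Smoke/GPU_MultiMarginLoss_FP32.Test/*",
    "Smoke/GPU_MultiMarginLoss_FP16.Test/*",
    "Smoke/GPU_MultiMarginLoss_BFP16.Test/*",
]


def build_gtest_filter(patterns):
    """Join positive patterns with ':' as GoogleTest's --gtest_filter expects."""
    return ":".join(patterns)


def build_command(executable, patterns):
    """Return an argv list suitable for subprocess.run."""
    return [executable, f"--gtest_filter={build_gtest_filter(patterns)}"]


if __name__ == "__main__":
    # "miopen_gtest" is an assumed binary name; substitute the one your build produces.
    print(" ".join(build_command("miopen_gtest", FAILED_PATTERNS)))
```

Passing the result as an argv list avoids any shell expansion of the `*` wildcards.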
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
2026-03-16T20:29:19.6602262Z 2: [ FAILED ] Full/CPU_CandidateSelection_NONE.SelectBestCandidateValid_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8 (9 ms)
......
2026-03-16T20:53:10.4492714Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1092 ms)
2026-03-16T20:53:10.4493643Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_FP32.Test/5
2026-03-16T20:53:10.9702194Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.24170883745259764 vs 1.1920929e-06
2026-03-16T20:53:10.9703045Z 2:
2026-03-16T20:53:10.9703532Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (521 ms)
2026-03-16T20:53:10.9704125Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_FP32.Test/9
2026-03-16T20:53:11.7311812Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 1.1920929e-06
2026-03-16T20:53:11.7312582Z 2:
2026-03-16T20:53:11.7378673Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1 (767 ms)
2026-03-16T20:53:11.7379488Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_FP32 (2381 ms total)
2026-03-16T20:53:11.7379851Z 2:
2026-03-16T20:53:11.7380102Z 2: [----------] 2 tests from Smoke/GPU_MultiMarginLoss_FP16
2026-03-16T20:53:11.7380463Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_FP16.Test/1
2026-03-16T20:53:13.0597507Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.009765625
2026-03-16T20:53:13.0599343Z 2:
2026-03-16T20:53:13.0600106Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_FP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1321 ms)
2026-03-16T20:53:13.0601064Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_FP16.Test/5
2026-03-16T20:53:13.5799782Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.2950681563357076 vs 0.009765625
2026-03-16T20:53:13.5800601Z 2:
2026-03-16T20:53:13.5801106Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_FP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (520 ms)
2026-03-16T20:53:13.5801755Z 2: [----------] 2 tests from Smoke/GPU_MultiMarginLoss_FP16 (1842 ms total)
2026-03-16T20:53:13.5802111Z 2:
2026-03-16T20:53:13.5802349Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_BFP16
2026-03-16T20:53:13.5802855Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1
2026-03-16T20:53:14.7541177Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.078125
2026-03-16T20:53:14.7541883Z 2:
2026-03-16T20:53:14.7551034Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1 (1174 ms)
2026-03-16T20:53:14.7551702Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5
2026-03-16T20:53:15.4139333Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 0.17214981176546584 vs 0.078125
2026-03-16T20:53:15.4140754Z 2:
2026-03-16T20:53:15.4141540Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2 (659 ms)
2026-03-16T20:53:15.4142925Z 2: [ RUN ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9
2026-03-16T20:53:16.2481820Z 2: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\multimarginloss.hpp(215): error: Expected: (error) < (tolerance), actual: 1 vs 0.078125
2026-03-16T20:53:16.2482636Z 2:
2026-03-16T20:53:16.2527416Z 2: [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1 (838 ms)
2026-03-16T20:53:16.2528198Z 2: [----------] 3 tests from Smoke/GPU_MultiMarginLoss_BFP16 (2672 ms total)
......
2026-03-16T21:17:30.9897735Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:400] Failed to construct CandidateSelectionModel for arch: gfx1151, solver: ConvHipImplicitGemm3DGroupBwdXdlops. Exception: CHRN-SI-109:C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63: Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
2026-03-16T21:17:30.9897931Z Buffered 1 messages to file: CHRN-SI-109:C:\windows\SystemTemp\miopen_error_3600.log
2026-03-16T21:17:30.9898006Z a_grid_desc_m_ak_container_{2809856, 32}
2026-03-16T21:17:30.9898088Z b_grid_desc_n_bk_container_{64, 32}
2026-03-16T21:17:30.9898243Z ds_grid_desc_mblock_mperblock_nblock_nperblock_container_{2809856, 64}
2026-03-16T21:17:30.9898392Z e_grid_desc_mblock_mperblock_nblock_nperblock_container_{2809856, 64}
2026-03-16T21:17:30.9898623Z [ OK ] Full/GPU_GroupConv3D_BackwardData_FP16.GroupConv3D_BackwardData_half_Test/78 (327 ms)
2026-03-16T21:17:30.9898828Z [ RUN ] Full/GPU_GroupConv3D_BackwardData_FP16.GroupConv3D_BackwardData_half_Test/82
2026-03-16T21:17:30.9899171Z G:8 N:128 C:16 K:16 D:28 H:28 W:28 z:3 y:3 x:3 pad.z:1 pad.y:1 pad.x:1 stride.z:2 stride.y:2 stride.x:2 dilation.z:1 dilation.y:1 dilation.x:1
2026-03-16T21:17:30.9899857Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63] Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
2026-03-16T21:17:30.9900106Z Buffered 27 messages to file: CHRN-SI-109:C:\windows\SystemTemp\miopen_error_3600.log
2026-03-16T21:17:30.9901641Z MIOpen(HIP): Error [C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:400] Failed to construct CandidateSelectionModel for arch: gfx1151, solver: ConvHipImplicitGemm3DGroupBwdXdlops. Exception: CHRN-SI-109:C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/conv/heuristics/ai_candidate_selection.cpp:63: Could not open metadata file: gfx1151_ConvHipImplicitGemm3DGroupBwdXdlops_metadata.tn.model
...
2026-03-16T21:17:31.7387428Z [ FAILED ] 12 tests, listed below:
2026-03-16T21:17:31.7387621Z [ FAILED ] CPU_TuningPolicy_NONE.TestSetApiLogged
2026-03-16T21:17:31.7389274Z [ FAILED ] Full/CPU_CandidateSelection_NONE.EncodeInputFeatures_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7392600Z [ FAILED ] Full/CPU_CandidateSelection_NONE.KernelStrMappingUnknownKernelThrows_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7394354Z [ FAILED ] Full/CPU_CandidateSelection_NONE.SelectBestCandidateValid_Test/gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8, where GetParam() = gfx942_ConvHipImplicitGemm3DGroupWrwXdlops_DeviceGroupedConvBwdWeight_Xdl_CShuffle_splitk8
2026-03-16T21:17:31.7394979Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7395465Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7395949Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1
2026-03-16T21:17:31.7396412Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7396872Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7397345Z [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1
2026-03-16T21:17:31.7397837Z [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2
2026-03-16T21:17:31.7398324Z [ FAILED ] Smoke/GPU_MultiMarginLoss_BFP16.Test/9, where GetParam() = dims:3995776x6 cont:1 reduction_mode:2 p:1
2026-03-16T21:17:31.7398446Z 12 FAILED TESTS
2026-03-16T21:17:31.7398594Z YOU HAVE 107 DISABLED TESTS
2026-03-16T21:17:31.7398602Z
2026-03-16T21:17:31.7398609Z
2026-03-16T21:17:31.7398615Z
2026-03-16T21:17:31.7398774Z 0% tests passed, 1 tests failed out of 1
2026-03-16T21:17:31.7398899Z
2026-03-16T21:17:31.7399017Z Label Time Summary:
2026-03-16T21:17:31.7399160Z pr = 3462.94 sec*proc (1 test)
2026-03-16T21:17:31.7399291Z standard = 3462.94 sec*proc (1 test)
2026-03-16T21:17:31.7399299Z
2026-03-16T21:17:31.7399423Z Total Test time (real) = 3463.30 sec
2026-03-16T21:17:31.7399430Z
2026-03-16T21:17:31.7399629Z The following tests FAILED:
2026-03-16T21:17:31.7399904Z 2 - miopen_gtest_standard_suite (Failed) pr standard
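For triage, the gtest summary above can be grouped by test suite to show which components regressed. This is a minimal sketch, assuming the log keeps the `[ FAILED ] <name>` format shown above. The function name `group_failures` is hypothetical, and the sample lines are copied from this log.

```python
import re
from collections import defaultdict

# Matches "[ FAILED ] <test name>"; the name ends at the first space or comma,
# which drops the optional ", where GetParam() = ..." tail.
FAILED_RE = re.compile(r"\[\s*FAILED\s*\]\s+(?P<name>[^\s,]+)")


def group_failures(lines):
    """Map suite name -> sorted list of unique failing test names."""
    groups = defaultdict(set)
    for line in lines:
        match = FAILED_RE.search(line)
        if not match:
            continue
        name = match.group("name")
        if "." not in name:
            continue  # skips summary lines such as "[ FAILED ] 12 tests, listed below:"
        suite = name.split(".", 1)[0]
        groups[suite].add(name)
    return {suite: sorted(tests) for suite, tests in groups.items()}


if __name__ == "__main__":
    sample = [
        "2026-03-16T21:17:31.7387428Z [ FAILED ] 12 tests, listed below:",
        "2026-03-16T21:17:31.7387621Z [ FAILED ] CPU_TuningPolicy_NONE.TestSetApiLogged",
        "2026-03-16T21:17:31.7394979Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/1, where GetParam() = dims:22x12 cont:0 reduction_mode:1 p:1",
        "2026-03-16T21:17:31.7395465Z [ FAILED ] Smoke/GPU_MultiMarginLoss_FP32.Test/5, where GetParam() = dims:9456x13 cont:0 reduction_mode:0 p:2",
    ]
    for suite, tests in group_failures(sample).items():
        print(f"{suite}: {len(tests)} failing")
```

Collecting names into a set means a test that appears both inline and in the final summary is counted once.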