[rocm7.0_internal_testing] skip 3D NCHW FP16 batchnorm test due to Native accuracy issue #2370
Jenkins build for 2f9e18c5fb255cbd1f554c070bcf6d852ab9b848 commit finished as FAILURE
! cherry-pick --onto release/2.7
! cherry-pick --onto release/2.6
…tive accuracy issue (#2370)

Skip for `test_nn.py::TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16`

Test failed on `weight gradient` comparison MIOpen/CuDNN vs Native batchnorm.
But CPU test `test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16` passed.
It looks like FP16 Native batchnorm issue.

Failed on MI200/MI300 and V100
It passed somehow on Navi (with enabled MIOpen)

Fixes SWDEV-541024, SWDEV-539171

```
python test_nn.py -v -k test_batchnorm_3D_train_NCHW_vs_native_mixed_float16
test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... skipped '3D float16 NCHW train failed on CUDA and ROCm due to Native batchnorm accuracy issue SWDEV-541024'
OK (skipped=1)
```
Created branch autogenerated/release/2.7_cherry-pick_pr-2370 and #2390
Created branch autogenerated/release/2.6_cherry-pick_pr-2370 and #2391
… Native accuracy issue (#2391) Cherry-pick of #2370 Co-authored-by: Dmitry Nikolaev <[email protected]>
… Native accuracy issue (#2390) Cherry-pick of #2370 Co-authored-by: Dmitry Nikolaev <[email protected]>
#2440)

This PR has fixes for P1 Jira https://ontrack-internal.amd.com/browse/SWDEV-542659. In this Jira, there are 3 test files with failing tests:

1) distributed.test_distributed_spawn
2) test_binary_ufuncs
3) test_nn

The test files **distributed.test_distributed_spawn** & **test_binary_ufuncs** pass with the latest mainline build, **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16426_ubuntu22.04_py3.10_pytorch_lw_release-2.7_fe3d37a9**.

The test file **test_nn** has 2 failing tests: **test_batchnorm_3D_train_NCHW_vs_native_mixed_float16** & **test_RNN_dropout_state**. The **test_batchnorm_3D_train_NCHW_vs_native_mixed_float16** test is skipped by PR #2370. The **test_RNN_dropout_state** test is fixed by cherry-picking upstream commit 1aa971a.

Tested on MI200 with docker image **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16426_ubuntu22.04_py3.10_pytorch_lw_release-2.7_fe3d37a9**.

---------

Co-authored-by: Iurii Paikov <[email protected]>
Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
! cherry-pick --onto release/2.8
Created branch autogenerated/release/2.8_cherry-pick_pr-2370 and #2652
Comment processed by Build
Cherry-pick of #2370 Co-authored-by: Dmitry Nikolaev <[email protected]>
! cherry-pick --onto release/2.9
Created branch autogenerated/release/2.9_cherry-pick_pr-2370 and #2788. It contains a merge conflict. Please resolve it.
Comment processed by Build
… Native accuracy issue (#2788)

Skip for `test_batchnorm_3D_train_NCHW_vs_native_mixed_float16`

Cherry-pick of #2370

~Need to resolve conflicts~ - resolved

---------

Co-authored-by: Dmitry Nikolaev <[email protected]>
Skip for `test_nn.py::TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16`.

The test failed on the `weight gradient` comparison of MIOpen/cuDNN vs. Native batchnorm, but the CPU test `test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16` passed. This looks like an FP16 Native batchnorm issue.

Failed on MI200/MI300 and V100.
It passed somehow on Navi (with MIOpen enabled).

Fixes SWDEV-541024, SWDEV-539171
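
For readers unfamiliar with how such a skip surfaces in the test log, here is a minimal sketch using only the standard `unittest` module. It is not the actual change to `test_nn.py` (the diff is not reproduced on this page, and the real test may be generated or skipped through a different mechanism). The `SKIP_REASON` constant, the stand-in `TestNN` class, and the `run_suite` helper are assumptions made for illustration; only the test name and the skip message are taken from the log output above.

```python
import unittest

# Skip message copied from the test log in this PR; the constant name is hypothetical.
SKIP_REASON = (
    "3D float16 NCHW train failed on CUDA and ROCm due to "
    "Native batchnorm accuracy issue SWDEV-541024"
)


class TestNN(unittest.TestCase):
    """Stand-in for the real TestNN class; only the skipped test is modelled."""

    @unittest.skip(SKIP_REASON)
    def test_batchnorm_3D_train_NCHW_vs_native_mixed_float16(self):
        # Never executed while the skip decorator is in place.
        self.fail("this body should not run")


def run_suite():
    """Run the stand-in test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNN)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_suite()
```

With `verbosity=2`, the runner reports the test as skipped together with the reason string, and the run still counts as successful, which matches the `OK (skipped=1)` summary reported for this PR.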
Cherry-picked to release/2.7 branch via #2390
Cherry-picked to release/2.6 branch via #2391
Cherry-picked to release/2.8 branch via #2652
Cherry-picked to release/2.9 branch via #2788