
Conversation

@wdziurdz (Contributor)

Fixes #5269

jongsoo-openai and others added 2 commits October 10, 2025 08:34
Follow-up to triton-lang/triton#7795.
Now that transposed weights are supported, remove the unnecessary assertion that the mx weight must be col-major.
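
For context, "col-major" here refers to the memory layout of the weight, and a check of this kind typically boils down to a stride comparison. The following is only a rough PyTorch sketch of the sort of assertion being removed, not the actual Triton code; the names are illustrative:

```python
import torch

def assert_col_major_2d(w: torch.Tensor) -> None:
    # Illustrative only: a 2D tensor is column-major when its first
    # (row) dimension is contiguous, i.e. has stride 1.
    assert w.stride(-2) == 1, "mx weight is expected to be col-major"

w = torch.empty(64, 128)   # torch tensors are row-major by default
assert_col_major_2d(w.mT)  # the transposed view is col-major: passes
# assert_col_major_2d(w)   # a plain row-major weight would have tripped the assert
```

With transposed weights now supported upstream (triton-lang/triton#7795), such a layout restriction is no longer needed.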


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

---------

Co-authored-by: Thomas Raoux <[email protected]>
Signed-off-by: Witold Dziurdz <[email protected]>
@wdziurdz force-pushed the dev/wdziurdz/test-matmul-6 branch from b115046 to d6e3e58 on October 10, 2025 13:46
@whitneywhtsang (Contributor)

Looks like triton-kernels tests hang.

w_scale_tri_rowmajor_sampled = w_scale_tri_rowmajor_blocked[..., 0:1]
assert torch.equal(w_scale_tri_sampled.expand_as(w_scale_tri_blocked), w_scale_tri_blocked)
assert torch.equal(w_scale_tri_rowmajor_sampled.expand_as(w_scale_tri_rowmajor_blocked), w_scale_tri_rowmajor_blocked)
assert torch.equal(w_scale_tri_sampled.squeeze(-1), w_scale_tri_rowmajor_sampled.squeeze(-1).mT)
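
For readers unfamiliar with the pattern in the quoted assertions: slicing a single element of a dimension and comparing its `expand_as` against the full tensor is a compact way to check that the tensor is constant along that dimension. A minimal, self-contained sketch (independent of the actual triton-kernels fixtures):

```python
import torch

# Build a tensor whose last dimension is a broadcast copy of a single slice.
t = torch.arange(6, dtype=torch.float32).reshape(3, 2, 1).expand(3, 2, 4).contiguous()

sampled = t[..., 0:1]                        # keep one element along the last dim
# Holds iff every entry along dim -1 equals that slice, i.e. t is constant there.
assert torch.equal(sampled.expand_as(t), t)
```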
Contributor:

Regarding the hangs, I remember that @whitneywhtsang said something about asserts causing hangs. Is it already a problem at the ttir level?

Contributor:

No, the hang should be during IGC if I am not mistaken; @anmyachev and @HBN-MichalSzy may have more info.

Contributor:

If it is the same issue, there is a driver with the fix that you can try.

Contributor:

I wanted to test it, but couldn't hit the hang even with the old driver.

Contributor:

Anyway, now I see it's an assert in pytest, not in the kernel, so it probably won't fix it.

@whitneywhtsang marked this pull request as draft on October 16, 2025 00:31
@whitneywhtsang (Contributor)

Converted to draft, as it needs more work before it is ready for review again.

@wdziurdz self-assigned this on Oct 20, 2025
@HBN-MichalSzy (Contributor)

The PR passed, but only because num_procs for the pytest kernel tests was lowered. This is not a solution, only an experiment that confirms the hang is due to traffic on the device.
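
For reference, the exact CI setting is not shown here, but with pytest-xdist lowering the worker count amounts to something like the following (worker count and test path are placeholders):

```python
import pytest

# Illustrative only: run the kernel tests with fewer xdist workers to reduce
# concurrent traffic on the device (requires the pytest-xdist plugin).
pytest.main(["-n", "2", "python/triton_kernels/tests"])
```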

When the process hung, the call stack on the CI machine showed a hang during a memory copy:

```
#0  0x00007fb50f43dd0b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007fb3f03cdaab in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007fb3f03cdf8a in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007fb3f08ae172 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007fb3f049b469 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#5  0x00007fb3f0372e01 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#6  0x00007fb3f2a84e7b in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_adapter_level_zero.so.0
#7  0x00007fb3f2a8cf07 in ur::level_zero::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_adapter_level_zero.so.0
#8  0x00007fb4a7712b3e in ur_loader::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#9  0x00007fb4a77d6dac in ur_tracing_layer::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#10 0x00007fb4a77297ef in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#11 0x00007fb50cc34c29 in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<ur_event_handle_t_*, std::allocator<ur_event_handle_t_*> >, ur_event_handle_t_**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) () from /opt/intel/oneapi/compiler/2025.2/lib/libsycl.so.8
...
```

There is an internal issue reported on the Agama driver for the same type of hang, and there is a potential hotfix that might help here; we need to check it with @kwasd.


Successfully merging this pull request may close these issues: Reland upstream commit 60605d8
