
Conversation

@wdziurdz (Contributor)

Fixes #5269

jongsoo-openai and others added 2 commits October 10, 2025 08:34
Follow-up to triton-lang/triton#7795.
Now that transposed weights are supported, remove the unnecessary assertion that the mx weight must be col-major.
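
For context, "col-major" here refers to the memory layout of the weight, and a check of this kind typically boils down to a stride comparison. The following is only a rough PyTorch sketch of the sort of assertion being removed, not the actual Triton code; the names are illustrative:

```python
import torch

def assert_col_major_2d(w: torch.Tensor) -> None:
    # Illustrative only: a 2D tensor is column-major when its first
    # (row) dimension is contiguous, i.e. has stride 1.
    assert w.stride(-2) == 1, "mx weight is expected to be col-major"

w = torch.empty(64, 128)   # torch tensors are row-major by default
assert_col_major_2d(w.mT)  # the transposed view is col-major: passes
# assert_col_major_2d(w)   # a plain row-major weight would have tripped the assert
```

With transposed weights now supported upstream (triton-lang/triton#7795), such a layout restriction is no longer needed.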


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

---------

Co-authored-by: Thomas Raoux <[email protected]>
Signed-off-by: Witold Dziurdz <[email protected]>
@wdziurdz force-pushed the dev/wdziurdz/test-matmul-6 branch from b115046 to d6e3e58 on October 10, 2025 13:46
@whitneywhtsang (Contributor)

Looks like triton-kernels tests hang.

w_scale_tri_rowmajor_sampled = w_scale_tri_rowmajor_blocked[..., 0:1]
assert torch.equal(w_scale_tri_sampled.expand_as(w_scale_tri_blocked), w_scale_tri_blocked)
assert torch.equal(w_scale_tri_rowmajor_sampled.expand_as(w_scale_tri_rowmajor_blocked), w_scale_tri_rowmajor_blocked)
assert torch.equal(w_scale_tri_sampled.squeeze(-1), w_scale_tri_rowmajor_sampled.squeeze(-1).mT)
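
For readers unfamiliar with the pattern in the quoted assertions: slicing a single element of a dimension and comparing its `expand_as` against the full tensor is a compact way to check that the tensor is constant along that dimension. A minimal, self-contained sketch (independent of the actual triton-kernels fixtures):

```python
import torch

# Build a tensor whose last dimension is a broadcast copy of a single slice.
t = torch.arange(6, dtype=torch.float32).reshape(3, 2, 1).expand(3, 2, 4).contiguous()

sampled = t[..., 0:1]                        # keep one element along the last dim
# Holds iff every entry along dim -1 equals that slice, i.e. t is constant there.
assert torch.equal(sampled.expand_as(t), t)
```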
Contributor:

Regarding the hangs, I remember that @whitneywhtsang said something about asserts causing hangs. Is it already a problem at the ttir level?

Contributor:

No, the hang should be during IGC if I am not mistaken; @anmyachev and @HBN-MichalSzy may have more info.

Contributor:

If it is the same issue, there is a driver with the fix that you can try.

Contributor:

I wanted to test it, but couldn't hit the hang even with the old driver.

Contributor:

Anyway, now I see it's an assert in pytest, not in the kernel, so it probably won't fix it.

@whitneywhtsang marked this pull request as draft on October 16, 2025 00:31
@whitneywhtsang (Contributor)

Converted to draft, as it needs more work before it is ready for review again.

@wdziurdz self-assigned this on Oct 20, 2025
@HBN-MichalSzy (Contributor)

The PR passed, but only because num_procs for the pytest kernel tests was lowered. This is not a solution, only an experiment that confirms the hang is due to traffic on the device.
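
For reference, the exact CI setting is not shown here, but with pytest-xdist lowering the worker count amounts to something like the following (worker count and test path are placeholders):

```python
import pytest

# Illustrative only: run the kernel tests with fewer xdist workers to reduce
# concurrent traffic on the device (requires the pytest-xdist plugin).
pytest.main(["-n", "2", "python/triton_kernels/tests"])
```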

When the process hung, the call stack on the CI machine showed a hang during a memory copy:

```
#0  0x00007fb50f43dd0b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007fb3f03cdaab in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007fb3f03cdf8a in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007fb3f08ae172 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007fb3f049b469 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#5  0x00007fb3f0372e01 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#6  0x00007fb3f2a84e7b in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_adapter_level_zero.so.0
#7  0x00007fb3f2a8cf07 in ur::level_zero::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_adapter_level_zero.so.0
#8  0x00007fb4a7712b3e in ur_loader::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#9  0x00007fb4a77d6dac in ur_tracing_layer::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#10 0x00007fb4a77297ef in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2025.2/lib/libur_loader.so.0
#11 0x00007fb50cc34c29 in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<ur_event_handle_t_*, std::allocator<ur_event_handle_t_*> >, ur_event_handle_t_**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) () from /opt/intel/oneapi/compiler/2025.2/lib/libsycl.so.8
...
```

There is an internal issue reported on the Agama driver for the same type of hang, and there is a potential hotfix that might help here; we need to check it with @kwasd.


Successfully merging this pull request may close these issues: Reland upstream commit 60605d8
