[mxfp] Reland remove col-major assert for mx weight #5285
Follow-up triton-lang/triton#7795

Now transposed weight is supported, remove unnecessary assertion that mx weight should be col-major

<!---
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.
-->

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

---------

Co-authored-by: Thomas Raoux <[email protected]>
Signed-off-by: Witold Dziurdz <[email protected]>
b115046 to d6e3e58
Looks like triton-kernels tests hang.
w_scale_tri_rowmajor_sampled = w_scale_tri_rowmajor_blocked[..., 0:1]
assert torch.equal(w_scale_tri_sampled.expand_as(w_scale_tri_blocked), w_scale_tri_blocked)
assert torch.equal(w_scale_tri_rowmajor_sampled.expand_as(w_scale_tri_rowmajor_blocked), w_scale_tri_rowmajor_blocked)
assert torch.equal(w_scale_tri_sampled.squeeze(-1), w_scale_tri_rowmajor_sampled.squeeze(-1).mT)
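The assertions quoted above appear to make two checks: every block of a scale tensor holds a single repeated value, and the two layouts carry the same scales up to a transpose. The sketch below reproduces that pattern on small CPU tensors. The shapes, the way the tensors are built, and the helper names `blocks_are_constant` and `layouts_match` are assumptions for illustration only, not the code from this PR.

```python
import torch

def blocks_are_constant(blocked: torch.Tensor) -> bool:
    # Take the first element of each trailing block and broadcast it back;
    # equality holds only if every block repeats a single value.
    sampled = blocked[..., 0:1]
    return torch.equal(sampled.expand_as(blocked), blocked)

def layouts_match(blocked: torch.Tensor, rowmajor_blocked: torch.Tensor) -> bool:
    # Drop the block dimension, then compare one layout against the
    # transpose (.mT swaps the last two dims) of the other.
    a = blocked[..., 0:1].squeeze(-1)
    b = rowmajor_blocked[..., 0:1].squeeze(-1)
    return torch.equal(a, b.mT)

# Hypothetical sizes: 4 scale groups, 8 columns, 32 elements per block.
k_groups, n, block = 4, 8, 32
scale = torch.randint(0, 255, (k_groups, n), dtype=torch.uint8)

# Same scales in both layouts, each repeated across a trailing block dim.
scale_blocked = scale.unsqueeze(-1).expand(k_groups, n, block).contiguous()
scale_rowmajor_blocked = scale.mT.unsqueeze(-1).expand(n, k_groups, block).contiguous()

assert blocks_are_constant(scale_blocked)
assert blocks_are_constant(scale_rowmajor_blocked)
assert layouts_match(scale_blocked, scale_rowmajor_blocked)
```

These checks run entirely on the host in pytest, which fits the later observation in this thread that the assert is in the test and not in the kernel.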
Regarding the hangs: I remember that @whitneywhtsang said something about asserts causing hangs. Is the problem already present at the TTIR level?
No, the hang should be in IGC, if I am not mistaken. @anmyachev and @HBN-MichalSzy may have more info.
If it is the same issue, there is a driver with the fix that you can try.
I wanted to test it, but couldn't hit the hang even with the old driver.
Anyway, now I see it's an assert in pytest, not in the kernel, so the driver fix probably won't help.
Converted to draft, as it needs more work before it will be ready for review again.
The PR passed, but only because the num_procs of pytest was lowered for the kernel tests. That is not a solution, only an experiment confirming that the hang is due to traffic on the device. When the process hung, the call stack on the CI machine showed a hang during a memory copy. There is an internal issue reported against Agama for the same type of hang, and there is a potential hotfix that might help here; this needs to be checked with @kwasd.
Fixes #5269