Add GatherOp support #354
base: main
Conversation
@microsoft-github-policy-service agree
why not convert triton::GatherOp directly to linalg?
While the two gather ops are both gathers, my understanding is that they operate at different levels (@python3kgae can correct me). tts::GatherOp is from tl.load.
Hi @bmyerz0, just wanted to check if my understanding is correct: the current lowering of triton::GatherOp to linalg is fine, and the only concern is the naming of tts::GatherOp to avoid confusion. Please let me know if there's anything else I should address. Thanks!
Yes, I do not think it is good to lower triton::GatherOp to tts::GatherOp. The suggestion is to lower triton::GatherOp directly to linalg. |
But on some NPU hardware, a 'gather' op is needed. I thought triton-shared should provide a middle-layer dialect to preserve this semantic. If we directly convert 'triton gather' to 'linalg.generic', we will lose it. Could we consider 'tts gather'?
Yes, for the sake of backend flexibility, I think it is reasonable to make any tts op lowering to linalg.generic optional. But we should still provide the lowering. And I think that tts should include two different ops, one for tl.load and one for tl.gather.
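For concreteness, here is a minimal Python/Triton sketch (kernel names, pointer arguments, and shapes are hypothetical, not from this PR) contrasting the two constructs being distinguished: a pointer-based gather expressed through tl.load, and an in-register gather expressed through tl.gather:

```python
import triton
import triton.language as tl

@triton.jit
def gather_via_load(src_ptr, idx_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = tl.load(idx_ptr + offs)  # assume idx_ptr points to int32 indices
    # Pointer-arithmetic gather: tl.load with data-dependent offsets.
    # This is the memory-access pattern associated with tts ops.
    val = tl.load(src_ptr + idx)
    tl.store(out_ptr + offs, val)

@triton.jit
def gather_via_tl_gather(src_ptr, idx_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    src = tl.load(src_ptr + offs)
    idx = tl.load(idx_ptr + offs)
    # In-register gather on tensor values: this is what becomes
    # triton::GatherOp, with no memory access of its own.
    val = tl.gather(src, idx, axis=0)
    tl.store(out_ptr + offs, val)
```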
Maybe we should use tts.gather for tl.load and introduce a new ttx.gather for tl.gather?
The current PR implements a fully functional lowering for triton::GatherOp on the CPU backend. Would it be possible to merge this PR first? Support for a ttx.gather op for NPU hardware can be addressed separately, if needed.
([32], [64], 0),
([4, 4], [8, 4], 0),
([128, 64], [256, 64], 0),
([128, 64], [128, 128], 1),
These all appear to increase the size of the tensor. Can you add test cases that contract the size of the tensor?
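For illustration only, assuming the tuples above are (src_shape, idx_shape, axis), contracting cases would make the index tensor smaller than the source along the gather axis. These shapes are hypothetical, not taken from the PR:

```python
# Hypothetical contracting cases, as (src_shape, idx_shape, axis):
contracting_cases = [
    ([64], [32], 0),           # gather fewer elements than the source has
    ([128, 64], [16, 64], 0),  # shrink along axis 0
    ([128, 64], [128, 8], 1),  # shrink along axis 1
]
```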
I think the direct triton::GatherOp-to-linalg lowering that you have is a good solution. It mirrors what we do for other ops on tensors.
Added support for triton::GatherOp conversion. Uses cf.assert for out-of-bound indices, consistent with the CUDA backend behavior (which triggers a device assert).
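As a plain NumPy sketch of the intended semantics (a reference model only, not the actual MLIR lowering), the cf.assert guard corresponds to a bounds check before the gather:

```python
import numpy as np

def gather_ref(src: np.ndarray, idx: np.ndarray, axis: int) -> np.ndarray:
    # The cf.assert in the lowering plays the role of this bounds check,
    # mirroring the CUDA backend's device-side assert on bad indices.
    assert np.all((0 <= idx) & (idx < src.shape[axis])), "gather index out of bounds"
    # Gather along `axis`: idx has the same rank as src, as in the test shapes.
    return np.take_along_axis(src, idx, axis=axis)
```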