
Conversation

@ziliangzl
Contributor

Added support for triton::GatherOp conversion. Out-of-bound indices are guarded with cf.assert, consistent with the CUDA backend behavior (which triggers a device assert).
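
For context, the op being lowered comes from tl.gather; a minimal kernel exercising it might look like the sketch below (illustrative only — the kernel name and shapes are not taken from this PR's tests).

```python
import triton
import triton.language as tl


@triton.jit
def gather_kernel(src_ptr, idx_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    src = tl.load(src_ptr + offs)      # load a 1-D block of source values
    idx = tl.load(idx_ptr + offs)      # load a 1-D block of indices into that block
    out = tl.gather(src, idx, axis=0)  # block-level gather -> triton::GatherOp
    tl.store(out_ptr + offs, out)
```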

@ziliangzl
Contributor Author

@microsoft-github-policy-service agree

@enjustli
Contributor

enjustli commented Sep 19, 2025

Why not convert triton::GatherOp to tts::GatherOp? 🤔 tts::GatherOp will be converted to affine::AffineLoadOp in the UnstructuredToMemrefPass.

@bmyerz0
Contributor

bmyerz0 commented Sep 19, 2025

> Why not convert triton::GatherOp to tts::GatherOp? 🤔 tts::GatherOp will be converted to affine::AffineLoadOp in the UnstructuredToMemrefPass.

While the two gather ops are both gathers, my understanding is that they operate at different levels (@python3kgae can correct me). tts::GatherOp comes from tl.load/store, an operation on pointers, whereas triton::GatherOp comes from tl.gather, an operation on a tensor block. In our lowering to memref/linalg, we generally associate load/store with memref and other tensor-block ops with tensor/linalg. So I think tl.gather, when lowered to linalg (possibly through a new tts dialect op if necessary), should operate on the tensor dialect, not memref. I admit the naming is confusing, so it could make sense to rename tts::GatherOp in the process.
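
To make the pointer-level vs. block-level distinction concrete, the tl.load pattern that tts::GatherOp covers looks roughly like this (hypothetical kernel, not from this PR), in contrast to the tl.gather kernel sketched earlier, which indexes into an already-loaded block:

```python
import triton
import triton.language as tl


@triton.jit
def pointer_gather(src_ptr, idx_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = tl.load(idx_ptr + offs)
    # Pointer-level gather: the indices feed pointer arithmetic for tl.load,
    # i.e. the unstructured tl.load/tl.store path handled via tts::GatherOp.
    val = tl.load(src_ptr + idx)
    tl.store(out_ptr + offs, val)
```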

@ziliangzl
Contributor Author

Hi @bmyerz0, just wanted to check if my understanding is correct: the current lowering of triton::GatherOp to linalg is fine, and the only concern is the naming of tts::GatherOp to avoid confusion. Please let me know if there’s anything else I should address. Thanks!

@bmyerz0
Contributor

bmyerz0 commented Oct 3, 2025

> Hi @bmyerz0, just wanted to check if my understanding is correct: the current lowering of triton::GatherOp to linalg is fine, and the only concern is the naming of tts::GatherOp to avoid confusion. Please let me know if there’s anything else I should address. Thanks!

Yes, I do not think it is good to lower triton::GatherOp to tts::GatherOp. The suggestion is to lower triton::GatherOp directly to linalg.
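
For reference, whichever route is taken, the element-level semantics the lowering has to reproduce are essentially those of numpy.take_along_axis (a sketch under that assumption, not code from this PR):

```python
import numpy as np


def gather_reference(src: np.ndarray, idx: np.ndarray, axis: int) -> np.ndarray:
    # out[..., k, ...] = src[..., idx[..., k, ...], ...] along `axis`;
    # assumes tl.gather follows the usual torch.gather-style indexing.
    return np.take_along_axis(src, idx, axis=axis)
```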

@enjustli
Contributor

enjustli commented Oct 6, 2025

> Hi @bmyerz0, just wanted to check if my understanding is correct: the current lowering of triton::GatherOp to linalg is fine, and the only concern is the naming of tts::GatherOp to avoid confusion. Please let me know if there’s anything else I should address. Thanks!
>
> Yes, I do not think it is good to lower triton::GatherOp to tts::GatherOp. The suggestion is to lower triton::GatherOp directly to linalg.

But on some NPU hardware, a 'gather' op is needed. I thought triton-shared should provide a middle-layer dialect op to preserve this semantic. If we convert 'triton gather' directly to 'linalg.generic', we will lose it. Could we consider 'tts gather'?

@bmyerz0
Contributor

bmyerz0 commented Oct 6, 2025

Yes, for the sake of backend flexibility, I think it is reasonable to make any tts op's lowering to linalg.generic optional. But we should still provide the lowering. And I think that tts should include two different ops, one for tl.load and one for tl.gather.

@ziliangzl
Contributor Author

Maybe we should use tts.gather for tl.load and introduce a new ttx.gather for tl.gather?

@ziliangzl
Contributor Author

The current PR implements a fully functional lowering for triton::GatherOp on the CPU backend. Would it be possible to merge this PR first? Support for a ttx.gather op for NPU hardware can be addressed separately, if needed.

ziliangzl requested a review from bmyerz0 on October 21, 2025, 02:04
([32], [64], 0),
([4, 4], [8, 4], 0),
([128, 64], [256, 64], 0),
([128, 64], [128, 128], 1),

These all appear to increase the size of the tensor. Can you add test cases that contract the size of the tensor?
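
For instance, assuming the tuples read as (source shape, index shape, axis) like the cases above, hypothetical additions along these lines would exercise the contracting case:

```python
# Hypothetical contracting cases: the index/output shape is smaller than the
# source shape along the gather axis.
contracting_cases = [
    ([64], [32], 0),
    ([128, 64], [64, 64], 0),
    ([128, 128], [128, 64], 1),
]
```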

@bmyerz0
Contributor

bmyerz0 commented Oct 21, 2025

> The current PR implements a fully functional lowering for triton::GatherOp on the CPU backend. Would it be possible to merge this PR first? Support for a ttx.gather op for NPU hardware can be addressed separately, if needed.

I think the direct triton::GatherOp-to-linalg lowering that you have is a good solution. It mirrors what we do for other ops on tensors.
