
[AMD] Fixed make_desc lowering - i.e., findEncodingFromUsers #9585

Merged
antiagainst merged 5 commits into triton-lang:main from ravil-mobile:ravil/make-desc-fix
Mar 6, 2026

Conversation

@ravil-mobile
Contributor

@ravil-mobile ravil-mobile commented Feb 26, 2026

The PR fixes the findEncodingFromUsers function used in make_desc op lowering by taking into account information about value uses in all basic blocks.

Closes https://github.com/ROCm/triton-internal/issues/1598

cc @antiagainst

Comment on lines 90 to 95
    if (!sharedEnc) {
      // TODO: add an extra pass to assign layout to descriptors
      sharedEnc = findEncodingFromUsers(op);
      if (!sharedEnc)
        return rewriter.notifyMatchFailure(op, "Descriptor has no layout.");
    }
Collaborator

looks like a very fragile solution. might be worth doing a proper fix?

Contributor Author

@ravil-mobile ravil-mobile Feb 26, 2026


@ThomasRaoux Well, the author is @yangshuxin. I believe he is working on a proper solution, which may take a while. Meanwhile, this PR fixes the logic of findEncodingFromUsers, which in its current implementation doesn't conform to the language semantics.

The kernel that was failing on GFX1250 is:

    @triton.jit
    def batched_gemm_2d_tma_kernel(a_ptr, b_ptr, c_ptr,  #
                                   B, M, N, K,  #
                                   dtype: tl.constexpr,  #
                                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,  #
                                   NUM_SMS: tl.constexpr):
        start_pid = tl.program_id(axis=0)
        num_tiles_m = tl.cdiv(M, BLOCK_M)
        num_tiles_n = tl.cdiv(N, BLOCK_N)
        k_tiles = tl.cdiv(K, BLOCK_K)
        num_tiles_per_batch = num_tiles_m * num_tiles_n
        num_tiles = B * num_tiles_per_batch
        tiles_per_SM = num_tiles // NUM_SMS
        if start_pid < num_tiles % NUM_SMS:
            tiles_per_SM += 1
        tile_id = start_pid - NUM_SMS
        ki = -1
        tile_m = 0
        tile_n = 0
        tile_b = 0
        offs_m = 0
        offs_n = 0
        offs_b = 0
        a_desc = tl.make_tensor_descriptor(a_ptr + offs_b * (M * K), [M, K], [K, 1], [BLOCK_M, BLOCK_K])
        b_desc = tl.make_tensor_descriptor(b_ptr + offs_b * (N * K), [N, K], [K, 1], [BLOCK_N, BLOCK_K])
        c_desc = tl.make_tensor_descriptor(c_ptr + offs_b * (M * N), [M, N], [N, 1], [BLOCK_M, BLOCK_N])
        accumulator = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for _ in range(k_tiles * tiles_per_SM):
            ki = tl.where(ki == k_tiles - 1, 0, ki + 1)
            if ki == 0:
                tile_id += NUM_SMS
                tile_b = tile_id // num_tiles_per_batch
                tile_m = (tile_id // num_tiles_n) % num_tiles_m
                tile_n = tile_id % num_tiles_n
                offs_b = tile_b
                offs_m = tile_m * BLOCK_M
                offs_n = tile_n * BLOCK_N
                a_desc = tl.make_tensor_descriptor(a_ptr + offs_b * (M * K), [M, K], [K, 1], [BLOCK_M, BLOCK_K])
                b_desc = tl.make_tensor_descriptor(b_ptr + offs_b * (N * K), [N, K], [K, 1], [BLOCK_N, BLOCK_K])
                c_desc = tl.make_tensor_descriptor(c_ptr + offs_b * (M * N), [M, N], [N, 1], [BLOCK_M, BLOCK_N])
            offs_k = ki * BLOCK_K
            a = a_desc.load([offs_m, offs_k])
            b = b_desc.load([offs_n, offs_k])
            accumulator = tl.dot(a, b.T, accumulator)
            if ki == k_tiles - 1:
                c = accumulator.to(dtype)
                c_desc.store([offs_m, offs_n], c)
                accumulator = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

The current upstream implementation assumes that the tensor descriptor definition and all of its uses are in the same basic block, which is not always true. In the kernel above, the descriptors are defined before the loop but reassigned inside an if and loaded in the loop body, so their uses span multiple blocks.
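For illustration only, the difference between a same-block-only user scan and a cross-block search can be sketched with a tiny def-use model. This is not the actual MLIR code touched by the PR; the Value/Use structures and function names below are invented for the sketch.

```python
# Hypothetical, simplified def-use model; illustrates why a user search
# must consider all basic blocks, not just the defining one.
# (Names are invented, not the actual Triton/MLIR API.)
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Use:
    encoding: Optional[str]  # a user op may or may not pin an encoding
    block: str               # basic block the user op lives in


@dataclass
class Value:
    block: str               # block where the value is defined
    uses: List[Use] = field(default_factory=list)


def find_encoding_same_block_only(v: Value) -> Optional[str]:
    # Old (buggy) behavior: only users in the defining block are considered.
    for use in v.uses:
        if use.block == v.block and use.encoding is not None:
            return use.encoding
    return None


def find_encoding_all_blocks(v: Value) -> Optional[str]:
    # Fixed behavior: users in any basic block are considered.
    for use in v.uses:
        if use.encoding is not None:
            return use.encoding
    return None


# A descriptor defined in the entry block but only loaded inside a loop body:
desc = Value(block="entry",
             uses=[Use(encoding=None, block="entry"),
                   Use(encoding="shared", block="loop.body")])
```

With this shape of IR, the same-block-only search finds no encoding (and the lowering reports "Descriptor has no layout."), while the cross-block search recovers the encoding from the load in the loop body.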

Member


Yes, this is a temporary stopgap to avoid crashes; it will be replaced with a more proper implementation very soon.


@antiagainst antiagainst marked this pull request as ready for review March 6, 2026 02:29
@antiagainst antiagainst requested a review from zhanglx13 as a code owner March 6, 2026 02:29
@antiagainst antiagainst merged commit 4b986a0 into triton-lang:main Mar 6, 2026
9 checks passed