@whitneywhtsang whitneywhtsang commented Nov 14, 2025

This PR changes the Triton base from 3c2e6f8 to c186592 (Oct 29).
Pass rate: 94.91%->94.95%

ptillet and others added 8 commits October 29, 2025 00:18
…nerically (#8421) (#8495)

This PR relands triton-lang/triton#8386.
It depends on triton-lang/triton#8492 to avoid
regressing in some workloads.
There's silent data corruption when calling `tl.histogram` with the
interpreter.

```python
# test.py
import torch
import ctypes
import triton
import triton.language as tl


@triton.jit
def histogram_kernel(x_ptr, z_ptr):
    offset = tl.arange(0, 1)
    x = tl.load(x_ptr + offset)
    z = tl.histogram(x, 1)
    buf = (ctypes.c_int32 * 2).from_address(int(z_ptr))

    print(f'before store: {list(buf)}')
    tl.store(z_ptr + offset, z) # tl.store treats z values as int64 while they're int32
    print(f'after store: {list(buf)}')


device = 'cpu'
torch.manual_seed(17)
x = torch.ones(1, device=device, dtype=torch.int32)
z = torch.ones(2, dtype=torch.int32, device=device)
histogram_kernel[(1, )](x, z)

# Output:
# TRITON_INTERPRET=1 TRITON_TEST_SUITE=interpreter python test.py 
# before store: [1, 1]
# after store: [1, 0] <- second element shouldn't be cleared
```

Based on the `np.histogram` docs:
https://numpy.org/doc/2.3/reference/generated/numpy.histogram.html
the returned dtype is taken from the optional `weights` parameter when it
is passed, and defaults to int64 otherwise.
That leads to `tl.store` thinking it's saving int64 values while the
tensor passed in my example contains int32, so it writes 8 bytes at a
time instead of 4, exceeding the buffer's data range and causing silent
data corruption.

```python
import numpy as np

data = np.array([1], dtype=np.int32)
bins = 1

print(f'Data dtype before: {data.dtype}')
histogram = np.histogram(data, bins=bins, range=(0, bins))[0]
print(f'Data dtype after: {histogram.dtype}')

# Data dtype before: int32                                                                                                                                           
# Data dtype after: int64
```

Applying "dummy_weights" fixes the returned data type as expected,
eliminating the data corruption.
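The fix can be illustrated directly with NumPy (the `dummy_weights` name here mirrors the description above rather than the patch itself; the point is that passing `weights` of the input's integer dtype pins the returned dtype):

```python
import numpy as np

data = np.array([1], dtype=np.int32)
bins = 1

# Without weights, np.histogram defaults the counts to int64.
# Passing all-ones weights of the input's dtype makes the returned
# counts use that dtype instead, so tl.store writes 4 bytes, not 8.
dummy_weights = np.ones_like(data, dtype=np.int32)
hist = np.histogram(data, bins=bins, range=(0, bins), weights=dummy_weights)[0]
print(hist.dtype)
```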

------------------------------


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because it fixes np.histogram-specific
behavior in interpreter mode.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code and using the instructions it generates is not minimal.)
… transactions (#8575)

`ttg.async_wait` counts the number of outstanding `ttg.commit_groups`.
However, when lowering to LLVM on AMD we require the number of
outstanding async intrinsics/final assembly instructions. The conversion
is already done by `UpdateAsyncWaitCount`, which modifies the `num` of
`ttg.async_wait` in place.
This PR introduces a new op `amdgpu.async_wait` to make the change in
semantics explicit in the IR.

`UpdateAsyncWaitCount` is moved to `TTGIR->LLVM` primarily so that it
also covers `Gluon` kernels, and we should always call it since it only
has an effect if there are `ttg.async_wait` ops present in the kernel.

To avoid membar changes this also adds a `ttgpu.LocalBarrier` after each
`amdgpu.async_wait`. Membar will respect the newly added barrier and
behave the same as for `ttg.async_wait`.
Fixes #8578
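As a purely illustrative IR sketch of the change (the exact assembly syntax, operand names, and `num` values below are assumptions, not taken from the patch):

```mlir
// Before: num counts outstanding commit groups.
%0 = ttg.async_wait %token {num = 1 : i32}

// After: UpdateAsyncWaitCount converts num to the number of outstanding
// async instructions, and the dedicated op makes the AMD-specific
// semantics explicit; the local barrier keeps membar behavior unchanged.
%0 = amdgpu.async_wait %token {num = 4 : i32}
ttg.local_barrier
```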

We're using the wrong output constraint, which leads LLVM to extend the
fp16 value to 32 bits. Fixing the constraint removes the conversion.

Note that we still end up with a no-op sequence like:
```ptx
mov.b32 {%rs1, %rs2}, %r1
mov.b32 %r2, {%rs1, %rs2}
```

However, `ptxas` is able to optimize these out.
### The Problem with the Original Formula
The original formula is:
```
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
```
- Issue with large positive x:
   - When x = 20: e^(40) ≈ 2.4 × 10^17 → manageable
   - When x = 50: e^(100) ≈ 2.7 × 10^43 → overflow to infinity
   - Result: (∞ - 1)/(∞ + 1) = ∞/∞ = NaN
- For negative x: The formula actually works fine because e^(2x) → 0,
giving (-1)/(1) = -1

### The Numerically Stable Solution
- For Positive x: Reformulation
```
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) = (e^(2x) + 1 - 2) / (e^(2x) + 1) = 1 - 2/(e^(2x) + 1)
```
- For Negative x: Using Symmetry
```
tanh(-x) = (e^(-2x) - 1) / (e^(-2x) + 1) = 2/(e^(2x) + 1) - 1 = -1 × (1 - 2/(e^(2|x|) + 1))
```

### Unified formulation:
```
tanh(x) = sign(x) × (1 - 2/(e^(2|x|) + 1))
```
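A minimal scalar Python sketch of the unified formulation (the `stable_tanh` name is mine; real kernels operate on fp32 vectors, where an overflowing `e^(2|x|)` becomes +∞ and IEEE semantics give 2/(∞ + 1) = 0, saturating the result to ±1 — here Python's `math.exp` raises instead, so we map the overflow to infinity explicitly):

```python
import math

def stable_tanh(x: float) -> float:
    # tanh(x) = sign(x) * (1 - 2 / (e^(2|x|) + 1))
    sign = 1.0 if x >= 0 else -1.0
    try:
        e = math.exp(2.0 * abs(x))
    except OverflowError:
        # IEEE float would overflow to +inf; 2/(inf + 1) == 0.0,
        # so the result saturates cleanly to sign(x) * 1.0.
        e = math.inf
    return sign * (1.0 - 2.0 / (e + 1.0))
```

Unlike the original formula, large inputs saturate to ±1 instead of producing (∞ - 1)/(∞ + 1) = NaN.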
@whitneywhtsang whitneywhtsang self-assigned this Nov 14, 2025
@whitneywhtsang whitneywhtsang marked this pull request as ready for review November 15, 2025 17:23
@whitneywhtsang whitneywhtsang merged commit 4546255 into main Nov 15, 2025
23 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch November 15, 2025 17:23