Commit 6239a92

Merge OpenAI Triton commit 05b2c18 (#4926)
This PR changes the Triton base from 690f690 to 05b2c18 (Aug 8). Pass rate: 98.85%.
2 parents (3714e9b + f060b4a); commit 6239a92

File tree

111 files changed: +2348 −1052 lines


.github/CODEOWNERS

Lines changed: 7 additions & 0 deletions
@@ -46,6 +46,7 @@ lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp @ptillet
 # third_party
 # -----------
 third_party/amd/ @antiagainst @zhanglx13
+third_party/proton/ @Jokeren @crobeck @fywkevin

 # -----------
 # gluon
@@ -56,3 +57,9 @@ python/test/gluon @peterbell10
 test/Gluon @peterbell10
 include/triton/Dialect/Gluon @peterbell10
 lib/Dialect/Gluon @peterbell10
+
+# -----------
+# Linear Layouts
+# -----------
+lib/Tools/ @lezcano
+lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp @lezcano

.github/workflows/wheels.yml

Lines changed: 11 additions & 2 deletions
@@ -51,6 +51,10 @@ jobs:
           echo "new_commit=true" >> "$GITHUB_OUTPUT"
         fi

+    - uses: actions/setup-python@v5
+      with:
+        python-version: '3.11'
+
     - name: Patch setup.py
       if: ${{ steps.check-version.outputs.new_commit == 'true' }}
       run: |
@@ -61,6 +65,7 @@ jobs:
     - name: Build wheels
       if: ${{ steps.check-version.outputs.new_commit == 'true' }}
       run: |
+        python --version
         # Make sure cibuildwheel is updated to latest, this will enable latest python builds
         python3 -m pip install cibuildwheel --upgrade --user
         # Pass MAX_JOBS=4 because, at time of writing, the VM "only" has 32GB
@@ -69,6 +74,10 @@ jobs:
         export CIBW_ENVIRONMENT="MAX_JOBS=4 \
           TRITON_BUILD_WITH_CLANG_LLD=1"

+        # required to build Python 3.14 with cibuildwheel 2.23.3
+        # todo: Need to update system Python to 3.11 and update cibuildwheel to latest
+
+
         # many_linux_2_28 image comes with GCC 12.2.1, but not clang.
         # With this install, it gets clang 16.0.6.
         export CIBW_BEFORE_ALL="dnf install clang lld -y"
@@ -79,9 +88,9 @@ jobs:
           export CIBW_MANYLINUX_AARCH64_IMAGE="quay.io/pypa/manylinux_2_28_${{ matrix.config.arch }}:latest"
         fi

-        export CIBW_BUILD="cp3{9,10,11,12,13,13t}-manylinux_${{ matrix.config.arch }}"
+        export CIBW_BUILD="cp3{9,10,11,12,13,13t,14,14t}-manylinux_${{ matrix.config.arch }}"
         export CIBW_SKIP="cp{35,36,37,38}-*"
-        export CIBW_FREE_THREADED_SUPPORT=1
+        export CIBW_ENABLE=cpython-freethreading
         python3 -m cibuildwheel . --output-dir wheelhouse

     - uses: actions/upload-artifact@v4
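The net effect of this change is to add CPython 3.14 and its free-threaded variant to the wheel matrix and to replace the removed CIBW_FREE_THREADED_SUPPORT flag with cibuildwheel's CIBW_ENABLE setting. A hedged sketch of reproducing the build step locally from Python (the CI runs the equivalent shell commands; the hard-coded arch is illustrative, and Docker plus an installed cibuildwheel are assumed):

```python
# Hypothetical local reproduction of the "Build wheels" step above.
# Values are copied from the diff; the arch is hard-coded here instead of
# coming from the workflow matrix.
import os
import subprocess

env = dict(os.environ)
env["CIBW_BUILD"] = "cp3{9,10,11,12,13,13t,14,14t}-manylinux_x86_64"
env["CIBW_SKIP"] = "cp{35,36,37,38}-*"
env["CIBW_ENABLE"] = "cpython-freethreading"  # replaces CIBW_FREE_THREADED_SUPPORT=1
env["CIBW_ENVIRONMENT"] = "MAX_JOBS=4 TRITON_BUILD_WITH_CLANG_LLD=1"
env["CIBW_BEFORE_ALL"] = "dnf install clang lld -y"  # manylinux_2_28 image lacks clang

subprocess.run(
    ["python3", "-m", "cibuildwheel", ".", "--output-dir", "wheelhouse"],
    env=env, check=True,
)
```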

Makefile

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ test-distributed: all
 .PHONY: test-gluon
 test-gluon: all
 	$(PYTEST) -s -n $(NUM_PROCS) python/test/gluon
+	$(PYTEST) -vs python/tutorials/gluon/01-attention-forward.py

 .PHONY: test-regression
 test-regression: all
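The new line adds the Gluon attention tutorial to the test-gluon target. For reference, the same test can also be invoked without make; a minimal sketch, assuming a Triton checkout with pytest installed:

```python
# Run the Gluon attention-forward tutorial test directly, mirroring the
# new Makefile line ($(PYTEST) -vs python/tutorials/gluon/01-attention-forward.py).
import sys

import pytest

sys.exit(pytest.main(["-vs", "python/tutorials/gluon/01-attention-forward.py"]))
```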

docs/meetups/07-09-2025/notes.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# Agenda:

## Items:
1. Gluon update (Jeff Niu, OpenAI)
2. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
3. Triton developers’ summit update (Ofer Dekel, Microsoft)
4. Open mic for other topics.

## Minutes:
Recording link [here](https://youtu.be/zoSY_WXHmF0)

1. Triton developers’ summit update (Ofer Dekel, Microsoft)
   - 3rd annual Triton Developer Conference.
   - When: Oct 21, 2025 (the day before the PyTorch conference in SF).
   - Where: Microsoft Silicon Valley Campus, Mountain View, CA.
   - There may be buses from SF to Mountain View (survey coming).
   - Up to 500 people can be accommodated in the auditorium.
   - For everyone interested in Triton: developers, developers working on extensions, etc.
   - Registration website is imminent (possibly within a week).
   - Talks (proposed):
     - Nvidia - Blackwell optimizations
     - AMD - MI300/MI350
     - OpenAI - Gluon
     - Microsoft/LinkedIn - Liger-kernel
     - ByteDance - Triton distributed
     - Meta - Helion
     - GPU mode - community talk
     - And more!
   - Invitation letters will be available on the website.
   - Q> Any tutorials, like how to write a kernel or do perf analysis?
   - A> Not planned. The schedule is filled with new tech from the last year (working with Phil on the program). Maybe we should extend to two days next year. This is a conference for professionals; should it also serve non-experts? It targets folks who know and live/breathe Triton.
   - A> We should have talks on tooling like Proton and guidelines on performance. We want people to be able to reproduce the presented results.
   - Q> Last year's audience was Triton developers and Triton users, but the topics skewed toward developers and getting people to contribute. Any plan to have content for users?
   - A> The first two talks are on Triton internals. Others cover tooling that should interest users (like Liger, Triton-distributed, Helion, and GPU mode). Users will benefit from learning what goes on under the hood.
   - Q> Is there a social aspect to the Triton conference?
   - A> A full day of talks with coffee breaks, lunch, and a happy hour for unstructured social interaction. No plans for structured social engagement (like breaking into pods), but this is still in flux; send ideas for other social engagements to Ofer.
   - Q> Is GPU mode led by Mark Saroufim?
   - A> Yes.
   - Q> Any Triton workshops to be given in conjunction with the PyTorch conference?
   - A> No, other than being in good proximity (location- and timing-wise). Hoping that folks attending the PyTorch conference will come out a day early for the Triton conference.
2. Gluon update (Jeff Niu, OpenAI)
   - A lower-level language built on the same compiler tech as Triton.
   - Exposes more control over layouts, scheduling, and memory; bypasses the middle-end and goes straight to the backend.
   - Can still use tile-based programming (a hedged sketch of what a Gluon kernel might look like follows this item).
   - Exposes more of the GPU to users.
   - Why Gluon? Out-of-the-box performance only approaches 80%; compilers struggle to make the best use of the hardware (hardware complexity).
   - Targeting:
     - better register and memory layouts
     - warp-specialization partitioning and loop scheduling
   - Gluon: a systems programming language for GPUs.
     - exposes low-level hardware details
     - tile-based abstraction
     - no global state management
   - Trade-offs:
     - not portable across hardware platforms
     - you need hardware knowledge
     - harder to write
   - Implementation:
     - @peterbell10 did most of the work.
     - Focus on Blackwell, but some H100 support.
   - Example: FMHA on B200.
     - Still slower than cuDNN,
     - but much better than out-of-the-box Triton.
   - Future work:
     - Very experimental.
     - Need better layout-management functions.
     - *Not planning on accepting contributions now.*
   - Q> Gluon is for a specific type of GPU. What about other GPUs/generations?
   - A> You don't need to rewrite everything, but to get the best performance on newer generations, yes, you will need to do rewrites. Triton kernels are a declarative specification of what the kernel should do, and the Triton compiler figures out how to make that spec performant; with Gluon, you do this yourself.
   - Q> In the future, will certain ops be implemented in Gluon instead of in the compiler? E.g., tl.histogram written as a Gluon kernel.
   - A> Probably not. Triton ops are tile-level, and those aren't exposed in Gluon. The idea of interop between Gluon & Triton exists but may not be implemented.
   - Q> Pushing the onus of things like scheduling onto kernel writers: any thoughts about tooling to help guide them, like timeline views?
   - A> 1) An intra-kernel profiler with Proton (very important; NCU stall counts are an example of something that might not be on the critical path; complicated dependency graphs). 2) More function calls in Gluon, though you won't see them in cuda-gdb. Tooling needs to catch up, and we expect it to do so.
   - Q> Microkernels for hot loops: is this what you're envisioning for interop?
   - A> No, we haven't thought about it that much. It could make sense for a large kernel, but our kernels are small, so it's not worth it.
   - Q> What about AMD and other processors with Gluon?
   - A> AMD support is as simple as adding the bindings and Python code, but it's very early and we're focusing on executing on Blackwell.
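
   A hedged sketch of the programming model described above, assuming the experimental `triton.experimental.gluon` module and its `language` namespace; Gluon is explicitly experimental, so the names, signatures, and `BlockedLayout` parameters below are illustrative and may not match the current API.

   ```python
   # Hypothetical Gluon-style copy kernel: the register layout is chosen
   # explicitly by the author instead of by the compiler middle-end.
   from triton.experimental import gluon
   from triton.experimental.gluon import language as ttgl


   @gluon.jit
   def copy_kernel(src, dst, N, BLOCK: ttgl.constexpr):
       # Explicit layout: 1 element per thread, 32 threads per warp,
       # 4 warps per CTA, over a 1-D tile (values are illustrative).
       layout: ttgl.constexpr = ttgl.BlockedLayout(
           size_per_thread=[1], threads_per_warp=[32],
           warps_per_cta=[4], order=[0])
       pid = ttgl.program_id(0)
       offs = pid * BLOCK + ttgl.arange(0, BLOCK, layout=layout)
       mask = offs < N
       ttgl.store(dst + offs, ttgl.load(src + offs, mask=mask), mask=mask)
   ```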
3. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
   - Presented with Brian Bowyer (kernelize.ai).
   - Nightly performance CI; we did the same in the past at AMD while working on the Triton compiler (a sketch of the kind of measurement such a suite might record follows this item).
   - We noticed performance regressions almost every night, caused by changes made during the day.
   - It is hard to do performance optimization if you don't know the impact across different hardware, versions, and data types.
   - Requests to the community:
     - where to get resources to run on
     - inside and outside of companies
     - where to store the data
     - help with setting up and running CI & doing operations.
   - Proposal from kernelize.ai:
     - NoSQL-based cloud storage
     - pipelines on public cloud
     - use torchbench to store tests
     - visualization: https://triton-bench.ai (currently contains fake data)
     - Discord for questions
     - run on AWS (to start)
   - Demo of the dashboard:
     - personalizable
     - dig into operator/hardware performance over time
     - detailed views/exports.
   - Requests:
     - kernelize.ai can provide people.
     - We need the community to help with costs (running tests).
     - kernels/data types/hardware.
   - Q> Self-hosted runners: how do you run them securely?
   - A> Manage it like cron, meaning we'd do the scheduling. We have partners with experience in secure cloud execution.
   - Q> Do you have live data?
   - A> Yes, 10 tests from tritonbench, but just as a smoke test. We really want to know what to run.
   - Q> What is the business model?
   - A> This is for the community; it is meant to be publicly open.
   - Q> It's challenging to run tests on Blackwell.
   - A> Expensive, but we have access. Amazon makes you buy a time block.
   - Q> Who's paying for this?
   - A> We are asking the community for support, looking for money or resources.
   - Q> What if hardware platforms look different for different businesses?
   - A> We'll need to work with folks to figure out what makes sense to record, like frequency pinning, OS, etc. (to be done offline).
   - Q> Tritonbench at Meta is hosted on the PyTorch open-source allotment on Google Cloud, with autoscaling in PyTorch. On the UI side, we would like A/B testing: running experimental branches/repos and looking for regressions/speedups.
   - A> I see that in tritonbench.
   - Will post on Slack and Discord.
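
   A minimal sketch of the kind of nightly measurement record discussed above. `triton.testing.do_bench` is Triton's real benchmarking helper; the matmul workload, the record schema, and the JSON-lines storage are hypothetical placeholders for whatever the community settles on.

   ```python
   # Benchmark one op and append a timestamped record, so a dashboard can
   # diff results across nights, hardware, and data types.
   import json
   import time

   import torch
   import triton


   def bench_matmul(M=4096, N=4096, K=4096, dtype=torch.float16):
       a = torch.randn((M, K), device="cuda", dtype=dtype)
       b = torch.randn((K, N), device="cuda", dtype=dtype)
       ms = triton.testing.do_bench(lambda: torch.matmul(a, b))
       return {
           "op": "matmul",
           "shape": [M, N, K],
           "dtype": str(dtype),
           "ms": ms,
           "tflops": (2 * M * N * K) / (ms * 1e-3) / 1e12,
           "gpu": torch.cuda.get_device_name(),
           "triton": triton.__version__,
           "timestamp": time.time(),
       }


   if __name__ == "__main__":
       with open("nightly.jsonl", "a") as f:
           f.write(json.dumps(bench_matmul()) + "\n")
   ```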
4. Open mic for other topics.
   - No additional topics.
