docs/meetups/07-09-2025/notes.md

# Agenda:

## Items:

1. Gluon update (Jeff Niu, OpenAI)
2. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
3. Triton developers’ summit update (Ofer Dekel, Microsoft)
4. Open mic for other topics.

## Minutes:

Recording link [here](https://youtu.be/zoSY_WXHmF0)
1. Triton developers’ summit update (Ofer Dekel, Microsoft)
   - 3rd annual Triton Developer Conference.
   - When: Oct 21, 2025 (the day before the PyTorch conference in SF).
   - Where: Microsoft Silicon Valley Campus, Mountain View, CA.
   - There may be buses from SF to Mountain View (survey coming).
   - The auditorium can accommodate up to 500 people.
   - Audience: everyone interested in Triton (developers, people working on extensions, etc.).
   - Registration website is imminent (possibly within a week).
   - Proposed talks:
     - Nvidia: Blackwell optimizations
     - AMD: MI300/MI350
     - OpenAI: Gluon
     - Microsoft/LinkedIn: Liger-kernel
     - ByteDance: Triton distributed
     - Meta: Helion
     - GPU Mode: community talk
     - And more!
   - Invitation letters will be available on the website.
   - Q> Will there be any tutorials, e.g. how to write a kernel or do performance analysis?
   - A> Not planned; the schedule is filled with the new tech of the last year (working with Phil on the program). Maybe we should extend to two days next year. This is a conference for professionals; should it also be a conference for non-experts? It targets folks who know and live/breathe Triton.
   - A> There should be talks on tooling like Proton and guidelines on performance; we want people to be able to reproduce their results.
   - Q> Last year's audience included both Triton developers and Triton users, but the topics skewed toward developers and getting people to contribute. Any plan to have content for users?
   - A> The first two talks are on Triton internals. Others cover tooling that should interest users (like Liger, Triton distributed, Helion, and GPU Mode). Users will benefit from learning what goes on under the hood.
   - Q> Will there be a social aspect to the Triton conference?
   - A> A full day of talks, with coffee breaks/lunch/happy hour for unstructured social interaction. No plans for structured social engagement (like breaking into pods), but this is still in flux. Suggestions for other social engagements are welcome (send ideas to Ofer).
   - Q> Is GPU Mode led by Mark Saroufim?
   - A> Yes.
   - Q> Will any Triton tutorials/workshops be given in conjunction with the PyTorch conference?
   - A> No, other than being in good proximity (location- and timing-wise). Hoping that folks attending the PyTorch conference will come out a day early for the Triton conference.
2. Gluon update (Jeff Niu, OpenAI)
   - A lower-level language built on the same compiler tech as Triton.
   - Exposes more control over layouts, scheduling, and memory; bypasses the middle-end and goes straight to the backend.
   - Can still use tile-based programming.
   - Exposes more of the GPU to users.
   - Why Gluon? Out-of-the-box performance often only approaches 80% of peak; the compiler struggles to make the best use of increasingly complex hardware.
   - Targeting:
     - better register and memory layouts
     - warp-specialization partitioning and loop scheduling
   - Gluon: a systems programming language for GPUs.
     - exposes low-level hardware details
     - tile-based abstraction
     - no global state management
   - Trade-offs:
     - not portable across hardware platforms
     - requires hardware knowledge
     - harder to write
   - Implementation:
     - @peterbell10 did most of the work.
     - Focus on Blackwell, with some H100 support.
   - Example: FMHA on B200.
     - Still slower than cuDNN,
     - but much better than out-of-the-box Triton.
   - Future work:
     - Very experimental.
     - Needs better layout-management functions.
     - *Not planning on accepting contributions now.*
   - Q> Gluon targets a specific type of GPU. What about other GPUs/generations?
   - A> You don't need to rewrite everything, but to get the best performance on newer generations, yes, you will need to do rewrites. A Triton kernel is a declarative specification of what the kernel should do, and the Triton compiler figures out how to make that spec performant; with Gluon, you do this yourself.
   - Q> In the future, will certain ops be implemented in Gluon instead of in the compiler, e.g. tl.histogram written as a Gluon kernel?
   - A> Probably not. Triton ops are tile-level, and these aren't exposed in Gluon. The idea of interop between Gluon and Triton exists but may not be implemented.
   - Q> Gluon pushes work like scheduling onto kernel writers. Any thoughts about tooling to guide them, such as timeline views?
   - A> 1) An intra-kernel profiler in Proton (very important; NCU stall counts are an example of something that might not be on the critical path in a complicated dependency graph). 2) More function calls in Gluon, though you won't see them in cuda-gdb. Tooling needs to catch up, and we expect it to do so.
   - Q> Microkernels for hot loops: is this what you're envisioning for interop?
   - A> No, we haven't thought about it that much. It might make sense for a large kernel, but our kernels are small, so it's not worth it.
   - Q> What about AMD and other processors with Gluon?
   - A> AMD support is as simple as adding the bindings and Python code, but it's very early and we're focused on executing on Blackwell.
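The tile-based model the update refers to, which both Triton and Gluon keep, can be illustrated with a plain-Python sketch. This is purely illustrative and is not the Triton or Gluon API: each "program instance" handles one fixed-size tile of the data, and a grid of instances covers the whole array (the names `TILE`, `add_kernel`, and `launch` are hypothetical).

```python
# Plain-Python sketch of the tile-based programming model (illustrative
# only, not the Triton/Gluon API): each program instance processes one
# fixed-size tile, and a 1D grid of instances covers the whole array.

TILE = 4  # tile size; Triton calls this a block size

def add_kernel(x, y, out, pid):
    """One program instance: element-wise add of the pid-th tile."""
    start = pid * TILE
    for i in range(start, min(start + TILE, len(x))):
        out[i] = x[i] + y[i]

def launch(x, y):
    """Launch one instance per tile, using ceiling division for the grid."""
    out = [0] * len(x)
    grid = (len(x) + TILE - 1) // TILE
    for pid in range(grid):
        add_kernel(x, y, out, pid)
    return out

print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

In a real tile-based compiler the grid instances run in parallel on the GPU, and decisions such as the register layout of each tile are what Triton automates and Gluon hands back to the programmer.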
3. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
   - Presented with Brian Bowyer (kernelize.ai).
   - Proposal: a nightly performance CI. We did the same in the past at AMD while working on the Triton compiler.
   - We noticed, almost every night, performance regressions caused by changes made during the day.
   - It is hard to do performance optimization if you don't know the impact across different hardware, versions, and data types.
   - Requests to the community:
     - Where to get resources to run on (inside and outside of companies)
     - Where to store the data
     - Help with setting up and running the CI and doing operations.
   - Proposal from kernelize.ai:
     - NoSQL-based cloud storage
     - Pipelines on a public cloud
     - Use torchbench to store tests
     - Visualization: https://triton-bench.ai (currently contains fake data)
     - Discord for questions
     - Run on AWS (to start)
   - Demo of the dashboard:
     - Personalizable
     - Dig into operator/hardware performance over time
     - Detailed views/exports.
   - Requests:
     - kernelize.ai can provide people.
     - We need the community to help with costs (running tests) and with choosing kernels/data types/hardware.
   - Q> On self-hosted runners: how do you run them securely?
   - A> Manage it like cron, meaning we'd do the scheduling. We have partners with experience in secure cloud execution.
   - Q> Do you have live data?
   - A> Yes, 10 tests from tritonbench, but just as a smoke test. We really want to know what to run.
   - Q> What is the business model?
   - A> This is for the community and meant to be publicly open.
   - Q> It's challenging to run tests on Blackwell.
   - A> Expensive, but we have access. Amazon makes you buy a time block.
   - Q> Who's paying for this?
   - A> We're asking the community for support, looking for money or resources.
   - Q> What if hardware platforms look different for different businesses?
   - A> We'll need to work with folks to figure out what makes sense to record (frequency pinning, OS, etc.); to be discussed offline.
   - Q> Tritonbench at Meta is hosted on the PyTorch open-source allotment on Google Cloud, with autoscaling in PyTorch. On the UI: we would like A/B testing, i.e. running experimental branches/repos and looking for regressions/speedups.
   - A> I see that in tritonbench.
   - Will post on Slack and Discord.
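The core check such a nightly suite performs can be sketched in a few lines: compare tonight's benchmark timings against a stored baseline and flag kernels that slowed down past a tolerance. This is a minimal sketch, not kernelize.ai's implementation; the function names, kernel names, and 5% threshold are all assumptions.

```python
# Hypothetical sketch of a nightly regression check: compare fresh
# benchmark timings (ms, lower is better) against a stored baseline and
# flag any kernel that slowed down beyond a tolerance. The names and the
# 5% threshold are illustrative, not kernelize.ai's actual design.

THRESHOLD = 0.05  # flag slowdowns greater than 5%

def find_regressions(baseline, nightly, threshold=THRESHOLD):
    """Return {kernel: relative slowdown} for kernels past the threshold."""
    regressions = {}
    for kernel, base_ms in baseline.items():
        new_ms = nightly.get(kernel)
        if new_ms is None:
            continue  # kernel missing tonight; a real suite reports this too
        slowdown = (new_ms - base_ms) / base_ms
        if slowdown > threshold:
            regressions[kernel] = slowdown
    return regressions

baseline = {"softmax": 1.00, "flash_attn": 4.00, "layer_norm": 0.50}
nightly = {"softmax": 1.02, "flash_attn": 4.60, "layer_norm": 0.49}
print(find_regressions(baseline, nightly))  # flags flash_attn (~15% slower)
```

A real suite layers the hard parts on top of this comparison: per-hardware and per-dtype baselines, noise filtering across repeated runs, and storage of the history (the NoSQL store in the proposal) so trends are visible on the dashboard.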
4. Open mic for other topics.
   - No additional topics.
