# Agenda:

## Items:
1. Gluon update (Jeff Niu, OpenAI)
2. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
3. Triton developers’ summit update (Ofer Dekel, Microsoft)
4. Open mic for other topics.

## Minutes:
Recording link [here](https://youtu.be/zoSY_WXHmF0)

1. Triton developers’ summit update (Ofer Dekel, Microsoft)
    - 3rd Annual Triton Developer Conference.
    - When: Oct 21, 2025 (the day before the PyTorch conference in SF).
    - Where: Microsoft Silicon Valley Campus, Mountain View, CA.
    - There may be buses from SF to Mountain View (survey coming).
    - Up to 500 people can be accommodated in their auditorium.
    - Audience: everyone interested in Triton (developers, developers working on extensions, etc.).
    - Registration website is imminent (possibly within a week).
    - Talks (proposed):
        - Nvidia - Blackwell optimizations
        - AMD - MI300/MI350
        - OpenAI - Gluon
        - Microsoft/LinkedIn - Liger-kernel
        - ByteDance - Triton distributed
        - Meta - Helion
        - GPU mode - community talk
        - And more!
    - Invitation letters will be available on the website.
    - Q> Any tutorials, like how to write a kernel or do performance analysis?
    - A> Not planned. The schedule filled up with new tech from the last year (working with Phil on the program). Maybe we should extend to two days next year. This is a conference for professionals. Should it be a conference for non-experts too? We're targeting folks who know and live/breathe Triton.
    - A> Should have talks on tooling like Proton and guidelines on performance. Want people to be able to reproduce the presented results.
    - Q> Last year's audience was Triton developers and Triton users, but the topics skewed toward developers and getting people to contribute. Any plan to have content for users?
    - A> The first two talks are on Triton internals. Others cover tooling that should be interesting to users (like Liger, Triton distributed, Helion, and GPU mode). Users will benefit from learning what goes on under the hood.
    - Q> Social aspect to the Triton conference?
    - A> A full day of talks with coffee breaks/lunch/happy hour for unstructured social interaction. No plans for structured social engagement (like breaking into pods), but things are still in flux. Suggestions for other social engagements are welcome (send ideas to Ofer).
    - Q> Is GPU mode led by Mark Saroufim?
    - A> Yes.
    - Q> Any Triton tutorials/workshops to be given in conjunction with the PyTorch conference?
    - A> No, other than being in good proximity (location- and timing-wise). Hoping folks attending the PyTorch conference will come out a day early for the Triton conference.
2. Gluon update (Jeff Niu, OpenAI)
    - A lower-level language built on the same compiler tech as Triton.
    - Exposes more control over layouts, scheduling, and memory. Bypasses the middle-end and goes straight to the backend.
    - Can still use tile-based programming.
    - Exposes more of the GPU to users.
    - Why Gluon? Out-of-the-box performance only approaches 80%; compilers struggle to make the best use of increasingly complex hardware.
    - Targeting:
        - better register and memory layouts
        - warp-specialization partitioning and loop scheduling
    - Gluon: a systems programming language for GPUs (a minimal kernel sketch follows this item).
        - exposes low-level hardware details
        - tile-based abstraction
        - no global state management
    - Trade-offs:
        - not portable across hardware platforms
        - you need hardware knowledge
        - harder to write
    - Implementation:
        - @peterbell10 did most of the work.
        - Focus is on Blackwell, with some H100 support.
    - Example: FMHA on B200
        - Still slower than cuDNN.
        - But much better than out-of-the-box Triton.
    - Future work:
        - Very experimental.
        - Need better layout-management functions.
        - *Not planning on accepting contributions now.*
    - Q> Gluon is for a specific type of GPU. What about other GPUs/generations?
    - A> You don't need to rewrite everything, but to get the best performance on newer generations, yes, you will need to do rewrites. Kernels have bells and whistles. Triton kernels are a declarative specification of what the kernel should do, and the Triton compiler figures out how to make that spec performant. With Gluon, you do this yourself.
    - Q> In the future, will certain ops be implemented in Gluon vs. in the compiler? E.g., tl.histogram written as a Gluon kernel.
    - A> Probably not. Triton ops are tile-level, and those aren't exposed in Gluon. The idea of interop between Gluon and Triton exists but may not be implemented.
    - Q> This pushes the onus for things like scheduling onto kernel writers. Any thoughts about tooling to help guide kernel writers, like timeline views?
    - A> 1) An intra-kernel profiler with Proton (very important; NCU stall counts are an example of something that might not be on the critical path) for complicated dependency graphs. 2) More function calls in Gluon, though you won't see them in cuda-gdb. Tooling needs to catch up, and we expect it to do so.
    - Q> Microkernels for hot loops: is this what you're envisioning for interop?
    - A> No, we haven't thought about it that much. It might pay off for a large kernel, but our kernels are small, so it's not worth it.
    - Q> AMD and other processors with Gluon?
    - A> AMD support is as simple as adding the bindings and Python code. But it's very early, and we're focusing on executing on Blackwell.
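    For context, a minimal sketch of what a Gluon kernel looks like, based on the experimental `triton.experimental.gluon` API; the exact module paths, layout parameters, and launch style shown here are assumptions and may change while the project remains experimental:

    ```python
    # Sketch of a Gluon-style memcpy kernel (experimental API; names and
    # signatures are assumptions and may differ across Triton versions).
    import torch
    from triton.experimental import gluon
    from triton.experimental.gluon import language as gl


    @gluon.jit
    def memcpy_kernel(in_ptr, out_ptr, numel, BLOCK: gl.constexpr):
        # Unlike Triton, the register layout is spelled out explicitly:
        # one element per thread, 32 threads per warp, 4 warps per CTA.
        layout: gl.constexpr = gl.BlockedLayout(
            size_per_thread=[1], threads_per_warp=[32],
            warps_per_cta=[4], order=[0],
        )
        pid = gl.program_id(0)
        offsets = pid * BLOCK + gl.arange(0, BLOCK, layout=layout)
        mask = offsets < numel
        gl.store(out_ptr + offsets, gl.load(in_ptr + offsets, mask=mask), mask=mask)


    # Launch uses the same grid syntax as Triton.
    x = torch.randn(4096, device="cuda")
    y = torch.empty_like(x)
    memcpy_kernel[((x.numel() + 1023) // 1024,)](x, y, x.numel(), BLOCK=1024)
    ```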
3. Interest and requirements for a nightly performance regression suite (Simon Waters, kernelize.ai)
    - Brian Bowyer (kernelize.ai)
    - Nightly performance CI. In the past we did the same at AMD while working on the Triton compiler.
    - We noticed that, almost every night, we would see performance regressions due to changes made during the day.
    - It is hard to do performance optimization if you don't know the impact across different hardware, versions, and data types.
    - Request to the community:
        - Where to get resources to run on
        - Inside and outside of companies
        - Where to store the data
        - Help with setting up and running CI and doing operations
    - Proposal from kernelize.ai (a rough sketch of a nightly check follows this item):
        - NoSQL-based cloud storage
        - Pipelines on public cloud
        - Use torchbench to store tests
        - Visualization: https://triton-bench.ai (currently contains fake data)
        - Discord for questions
        - Run on AWS (to start)
    - Demo of dashboard:
        - Personalizable
        - Dig into operator/hardware performance over time
        - Detailed views/exports
    - Requests:
        - kernelize.ai can provide people.
        - We need the community to help with costs (running tests).
        - Which kernels/data types/hardware to cover.
    - Q> Self-hosted runners: how do you run them securely?
    - A> Manage it like cron, meaning we'd do the scheduling. We have partners with experience in secure cloud execution.
    - Q> Do you have live data?
    - A> Yes, 10 tests from tritonbench, but just as a smoke test. We really want to know what to run.
    - Q> What is the business model?
    - A> This is for the community. Meant to be publicly open.
    - Q> It is challenging to run tests on Blackwell.
    - A> Expensive, but we have access. Amazon makes you buy a time block.
    - Q> Who's paying for this?
    - A> Asking the community for support; looking for money or resources from the community.
    - Q> What if hardware platforms look different for different businesses?
    - A> We'll need to work with folks to figure out what makes sense to record, like frequency pinning, OS, etc. (to be done offline).
    - Q> Tritonbench at Meta is hosted on the PyTorch open-source allotment on Google Cloud, with autoscaling in PyTorch. Regarding the UI: would like A/B testing, i.e., running experimental branches/repos and looking for regressions/speedups.
    - A> I see that in tritonbench.
    - Will post on Slack and Discord.
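    As a rough illustration of the kind of nightly check being proposed, here is a minimal sketch that times one op with `triton.testing.do_bench` and compares it against a stored baseline. The op choice, baseline file, and 10% threshold are hypothetical placeholders, not part of the kernelize.ai proposal:

    ```python
    # Hypothetical nightly regression check; the actual proposal stores
    # results in NoSQL cloud storage and draws its tests from torchbench.
    import json
    import pathlib

    import torch
    import triton.testing

    BASELINE = pathlib.Path("baselines.json")  # hypothetical local baseline store
    THRESHOLD = 1.10                           # flag runs >10% slower than baseline


    def bench_matmul(n: int = 4096) -> float:
        a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        b = torch.randn(n, n, device="cuda", dtype=torch.float16)
        # do_bench returns the measured runtime in milliseconds.
        return triton.testing.do_bench(lambda: a @ b)


    ms = bench_matmul()
    baselines = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    ref = baselines.get("matmul-4096-fp16")
    if ref is not None and ms > ref * THRESHOLD:
        raise SystemExit(f"regression: {ms:.3f} ms vs. baseline {ref:.3f} ms")
    baselines["matmul-4096-fp16"] = ms if ref is None else min(ms, ref)
    BASELINE.write_text(json.dumps(baselines, indent=2))
    ```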
4. Open mic for other topics.
    - No additional topics.