
Commit d25df42

Notes from 2025-03-12 community meetup (#7255)
Notes for 2025-03-12 community meetup

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because this is purely documentation.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
1 parent 7f86d16 commit d25df42

File tree

1 file changed: +118 -0 lines changed


docs/meetups/03-12-2025/notes.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
# Agenda:

1. Improving ILP (Instruction Level Parallelism) with Warp Specialization
2. Triton-shared (Progress and updates)
3. Question about generic tensor descriptors

# Meeting notes:

## Improving ILP (Instruction Level Parallelism) with Warp Specialization

Speakers: Hongtao Yu (Meta), Yuanwei (Kevin) Fang (Meta), Manman Ren (Meta)

Notes:
* PyTorch 2.6 with Triton release branch 3.2
* Targeting: Nvidia Hopper arch; Blackwell coming soon.
* Performance
  * Meta's FP8Rowwise GEMM (3-5% improvement, 1D persistent loop)
  * FlashAttention (10-15% improvement, could be faster with pipelining and ping-pong scheduling)
* What is warp specialization?
  * Improves hardware instruction scheduling; GPUs don't have good dynamic instruction scheduling.
  * Uses a multi-way warp scheduler, allowing warps on a single core to target different functional units (e.g. memory, ALU, tensor core) and run in parallel.
* Comparison using GEMM
  * Uniform warps: 8 warps, each loading/processing 1/8th of the data, divided into two groups that each handle half the data. Good for GEMM but not for more complicated kernels.
  * Warp specialized: 12 warps; 4 producer warps that only do loads, 8 consumer warps that only do wgmma. Frees up more capacity for more complex kernels like flash attention.
* Compiler implementation
  * How to enable warp specialization
    * Automatically enabled by adding two switches to the autotune config (see the config sketch after this list).
    * num_consumer_groups - number of non-load (consumer) warp groups
    * num_buffer_warp_spec - number of buffers between producer and consumer
  * Concept
    * Async tasks run in parallel with other async tasks.
    * Tasks should use different memory and GPU resources.
    * Coordination through shared memory and barriers for synchronization.
  * Compiler implementation
    * Automatic task partitioning
    * Dataflow multi-buffering
  * Task partitioning
    * Automatic task partitioning identifies tasks like loads, ALU ops, stores, etc.
    * Identifies dependency chains; links producers to consumers.
    * Continues partitioning and inserting synchronization primitives in both producer and consumer warps.
  * Multi-buffering (a toy producer/consumer sketch follows this list)
    * The producer keeps loading/populating buffers in round-robin order while consumers process individual buffers.
    * The producer blocks when no free buffers are available.
  * In the future
    * Multi-buffering of multi-dimensional loops
    * Buffer reuse over multiple regions in a single group
    * Complex control flow, partition schemes (ping-pong, support for Blackwell)
* Case Study: Flash Attention - Kevin and Manman
  * Without WS
    * Compute throughput: 45%
    * Memory throughput: 35%
    * SM busy: 46%
    * No interleaving: CUDA cores sit idle while tensor cores are running
  * With WS
    * Compute throughput: 69%
    * Memory throughput: 35%
    * SM busy: 71%
    * Interleaving (speedup due to):
      * Overlapping TMA with CUDA core ops
      * Overlapping CUDA core and tensor core work
      * Overlapping tensor core work and instruction issuing
  * Data partitioning
  * Communication pipelining and ping-pong scheduling
    * Ping-pong is a named barrier pair; only one consumer can be in the region at a time.
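
The two switches above come from the warp-specialization work presented here (currently in the [facebookexperimental/triton](https://github.com/facebookexperimental/triton) fork rather than mainline Triton), so the exact option names and availability may differ. As a sketch only, enabling them through the autotune config might look like this, using the `num_consumer_groups` and `num_buffer_warp_spec` knobs named in the notes around an otherwise ordinary elementwise kernel:

```python
import triton
import triton.language as tl

# Sketch only: num_consumer_groups / num_buffer_warp_spec are the
# warp-specialization switches described in the talk and are assumed to be
# accepted as extra triton.Config options in builds that support WS.
@triton.autotune(
    configs=[
        triton.Config(
            {"BLOCK_SIZE": 1024},
            num_warps=4,
            num_stages=3,
            num_consumer_groups=2,    # non-load (consumer) warp groups
            num_buffer_warp_spec=3,   # buffers between producer and consumers
        ),
        # Portable baseline config without warp specialization.
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=3),
    ],
    key=["n_elements"],
)
@triton.jit
def scaled_copy_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)
```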
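
The multi-buffering behaviour described above (the producer fills buffers round-robin and blocks when none are free) is easiest to picture as a bounded pool of slots with hand-off signals. The following is only a CPU-side Python analogy of that protocol, not the compiler's implementation; the buffer count plays the role of `num_buffer_warp_spec`:

```python
import threading
import queue

NUM_BUFFERS = 3           # analogue of num_buffer_warp_spec
NUM_TILES = 8             # tiles of work to stream through the buffers

free_slots = queue.Queue()    # slots the producer may fill
ready_slots = queue.Queue()   # filled slots waiting for a consumer
buffers = [None] * NUM_BUFFERS
for slot in range(NUM_BUFFERS):
    free_slots.put(slot)

def producer():
    for tile in range(NUM_TILES):
        slot = free_slots.get()          # blocks when no free buffer is available
        buffers[slot] = f"tile-{tile}"   # stand-in for an async load into a buffer
        ready_slots.put(slot)            # signal the consumer (barrier arrive)
    ready_slots.put(None)                # sentinel: no more work

def consumer():
    while True:
        slot = ready_slots.get()         # blocks until a buffer is ready (barrier wait)
        if slot is None:
            break
        print("processing", buffers[slot])
        free_slots.put(slot)             # release the buffer back to the producer

threading.Thread(target=producer).start()
consumer()
```
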
## Questions

* Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
* A> Meta is doing this for AMD. There is no named barrier on AMD; the same effect is simulated with shared-memory atomics.

* Q> Would it make sense to promote these to a higher level inside Triton for complex cases where it would be difficult for the compiler to detect?
* A> Yes. We allow users to annotate programs with their partitions in [facebookexperimental/triton](https://github.com/facebookexperimental/triton). We want to see if more automation is possible.

* Q> What should we target first, warp specialization or software pipelining, as an initial optimization? From your experience, which lowering is preferred? Are you going to bring it to main?
* A> They are not mutually exclusive; you need to figure out what makes sense for your case. WS benefits: outer-loop support for pipelining, and overlapping of CUDA core and tensor core work.

* Q> What improvements are you seeing?
* A> Flash attention: 20%+; with computational pipelining and ping-pong scheduling it approaches flash attention v3 performance.

## Triton-shared (Progress and updates)

Presenters: Nhat Nguyen (Microsoft), Haishan Zhu (Meta)

Notes:

### Goal:
* Lower Triton IR to MLIR core dialects (linalg, memref, …) for an easier path to running on CPUs.
* Focus on supporting strided memory access for accelerators.
* Open-sourced at https://github.com/microsoft/triton-shared
* Trying to keep it in sync with OSS Triton (albeit a little delayed).

### Progress
* Modularizing compiler passes: decoupled data extraction from lowering, which allows customized lowering flows and predictable behavior when analysis fails.
  * triton-to-structured
  * triton-arith-to-linalg
  * structured-to-memref
* Improvements to pointer analysis (see the kernel sketch after this list)
  * Supports nested loops
  * Non-contiguous memory access
  * Support for lowering unstructured access with a single base pointer
* Support for lowering Triton ops to linalg/MLIR (split, join, cat, etc.)
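
To make the structured/unstructured distinction concrete, here is a small hypothetical Triton kernel (not from the talk). The first load is a plain strided access of the kind the pointer analysis above is built to recognize, while the second is a data-dependent gather from a single base pointer, the unstructured case the notes describe as needing extra support:

```python
import triton
import triton.language as tl

@triton.jit
def row_gather_kernel(src_ptr, idx_ptr, out_ptr, n_cols, stride_row,
                      BLOCK: tl.constexpr):
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols

    # Structured access: base + row * stride + contiguous column offsets.
    # This can be summarized as an (offset, stride, shape) description.
    direct = tl.load(src_ptr + row * stride_row + cols, mask=mask, other=0.0)

    # Unstructured access: the row index is itself loaded from memory, so the
    # addresses are data-dependent (single base pointer + gathered offsets).
    gathered_row = tl.load(idx_ptr + row)
    gathered = tl.load(src_ptr + gathered_row * stride_row + cols,
                       mask=mask, other=0.0)

    tl.store(out_ptr + row * stride_row + cols, direct + gathered, mask=mask)
```
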
### Roadmap
* Complete support for non-contiguous pointers
* Detect other memory access patterns (e.g. row-gather/scatter pointer sequences)
* Extend to control flow ops

### Thanks!
Meta, Qualcomm and the community

### Questions
* Q> Future plans: what are the higher-priority items you want to work on?
* A> Many Triton kernels have memory access patterns that can't be detected, and we don't have fallback solutions yet (e.g. gather/scatter support); we need to wait for the MLIR pointer dialect to land so we can use it. For MxN loads, pointer analysis fails when the loads are not contiguous overall, but individual rows may still be contiguous, so we can split the analysis into multiple chunks (row scatter, row gather).
* A> In places where pointer analysis can't extract information, we leave the IR intact so that existing passes can deal with it. We can handle loop iteration over tensors of pointers (a common pattern). More complicated operations like if/else look like low-hanging fruit.

## Questions about Generic Tensor Descriptors
* Q> What is the progress on generic tensor descriptor programming, i.e. not Nvidia-specific? (Carried over from last month.)
* A> TMA-style accelerators will probably become more general across GPUs.
* A> TMA (tensor descriptor) support should be landing over the next few weeks. A compatibility mode will be added for GPUs without TMA (though it will probably be slower), along with block pointer support (see the sketch below). Host-side tensor descriptors will be deprecated (they only provided a minor performance benefit for persistent kernels). Users will be allowed to autotune.
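
For context, block pointers are the closest existing Triton API to the descriptor-style programming discussed here: a handle built from a base, shape, strides and a block window that the compiler can lower to TMA where available or to ordinary masked loads elsewhere. A minimal sketch with the current `tl.make_block_ptr` API (the generic tensor-descriptor API referred to in the answer may end up looking different):

```python
import triton
import triton.language as tl

@triton.jit
def block_copy_kernel(src_ptr, dst_ptr, M, N, stride_m, stride_n,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)

    # Descriptor-like view of the source: shape, strides and a block window,
    # instead of hand-written pointer arithmetic.
    src_block = tl.make_block_ptr(
        base=src_ptr, shape=(M, N), strides=(stride_m, stride_n),
        offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
        block_shape=(BLOCK_M, BLOCK_N), order=(1, 0),
    )
    dst_block = tl.make_block_ptr(
        base=dst_ptr, shape=(M, N), strides=(stride_m, stride_n),
        offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
        block_shape=(BLOCK_M, BLOCK_N), order=(1, 0),
    )
    tile = tl.load(src_block, boundary_check=(0, 1))
    tl.store(dst_block, tile, boundary_check=(0, 1))
```
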
## Minutes:
Recording link [here](https://www.youtube.com/watch?v=cIW6ZL_LmGc)
