Speakers: Hongtao Yu (Meta), Yuanwei (Kevin) Fang (Meta), Manman Ren (Meta)

Notes:

* PyTorch 2.6 with Triton release branch 3.2
* Targeting: NVIDIA Hopper arch; Blackwell support coming soon.
* Performance
  * Meta’s FP8Rowwise GEMM (3-5% improvement, 1D persistent loop)
  * FlashAttention (10-15% improvement; could be faster with pipelining and ping-pong scheduling).
* What is warp specialization?
  * Improves hardware instruction scheduling; GPUs don’t have good dynamic instruction scheduling.
  * Uses a multi-way warp scheduler: warps on a single core can target different function units (e.g. memory, ALU, tensor core), all running in parallel.
* Comparison using GEMM
  * Uniform warps: 8 warps, each loading/processing 1/8th of the data, divided into two groups that each handle half of the data. Good for GEMM but not for more complicated kernels.
  * Warp specialized: 12 warps; 4 producer warps only do loads, 8 consumer warps only do wgmma. Frees up more capacity for more complex kernels like FlashAttention (a rough sketch of this split follows below).
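
To make the producer/consumer split concrete, here is a minimal CUDA sketch using libcu++'s `cuda::pipeline` with explicit producer/consumer roles. This is not Triton's actual codegen; the tile size, stage count, `compute_tile` helper, kernel name, and the 384-thread launch (4 producer warps + 8 consumer warps) are illustrative assumptions.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Illustrative sizes only.
constexpr int TILE = 4096;   // floats staged per iteration
constexpr int STAGES = 2;    // double-buffered shared memory

// Placeholder for the consumer math (the real kernel issues wgmma here).
__device__ void compute_tile(const float* tile, float* acc) {
  for (int i = 0; i < TILE; ++i) *acc += tile[i];   // stand-in computation
}

// Launch with 384 threads per block: 4 producer warps + 8 consumer warps.
__global__ void warp_specialized_gemm_like(const float* __restrict__ in,
                                           float* __restrict__ out,
                                           int ntiles) {
  __shared__ float buf[STAGES][TILE];
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;

  auto block = cg::this_thread_block();
  const bool producer = threadIdx.x < 128;          // first 4 warps only load
  auto pipe = cuda::make_pipeline(block, &state,
                                  producer ? cuda::pipeline_role::producer
                                           : cuda::pipeline_role::consumer);

  float acc = 0.f;
  for (int t = 0; t < ntiles; ++t) {
    float* stage = buf[t % STAGES];
    if (producer) {
      pipe.producer_acquire();                      // wait for a free buffer
      for (int i = threadIdx.x; i < TILE; i += 128) // async copy global -> shared
        cuda::memcpy_async(&stage[i], &in[t * TILE + i], sizeof(float), pipe);
      pipe.producer_commit();                       // publish the buffer
    } else {
      pipe.consumer_wait();                         // wait for a full buffer
      compute_tile(stage, &acc);                    // "wgmma" work goes here
      pipe.consumer_release();                      // return the buffer
    }
  }
  if (threadIdx.x == 128) out[blockIdx.x] = acc;    // one consumer thread writes
}
```

The point of the split is that the 4 load warps and the 8 compute warps keep different function units (memory vs. tensor cores) busy at the same time, which the double-buffered pipeline makes explicit.
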
* Compiler implementation
* Data partitioning
* Communication pipelining and ping-pong scheduling
  * Ping-pong is implemented as a pair of named barriers, so only one consumer warp group can be in the protected region at a time (a sketch follows below).
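
As a conceptual illustration of the ping-pong handshake (not the compiler's actual output), the sketch below assumes two consumer warp groups of 128 threads that alternate ownership of an exclusive region via two PTX named barriers; the helper names, barrier IDs, and region contents are placeholders.

```cuda
// Conceptual sketch only: two consumer warp groups (128 threads each) take
// turns in an exclusive region, coordinated by a pair of PTX named barriers.
__device__ void named_bar_sync(int id, int nthreads) {
  asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(nthreads));
}
__device__ void named_bar_arrive(int id, int nthreads) {
  asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(nthreads));
}

__device__ void pingpong_consumer_loop(int num_iters) {
  const int group = (threadIdx.x / 128) & 1;  // consumer warp group 0 or 1
  const int my_bar = 1 + group;               // "my group may enter" barrier
  const int other_bar = 2 - group;            // the other group's barrier
  // Barriers 1 and 2 are used (0 is left for __syncthreads). Each barrier is
  // shared by both groups, so 256 threads participate per barrier.
  // Group 1 releases group 0 up front so group 0 takes the first turn.
  if (group == 1) named_bar_arrive(other_bar, 256);

  for (int it = 0; it < num_iters; ++it) {
    named_bar_sync(my_bar, 256);        // block until it is my group's turn
    // ... exclusive region: tensor-core (wgmma) work for this iteration ...
    named_bar_arrive(other_bar, 256);   // hand the region to the other group
  }
}
```

Because each group must see the other group's arrival before re-entering, the two groups strictly alternate in the region, which is the "only one consumer in the region" property mentioned above.
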

## Questions

* Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
* A> Meta is doing this for AMD. There is no named barrier on AMD; the same effect is simulated using shared-memory atomics (see the sketch below).
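
A minimal sketch of that idea, not Meta's actual code: an arrive/wait pair emulated with a shared-memory counter. It is written in CUDA syntax with 32-thread warps for brevity; a HIP version would use the same shared-memory atomics with 64-lane wavefronts. `smem_bar_arrive`, `smem_bar_wait`, and the 4-producer/8-consumer split are illustrative assumptions.

```cuda
// Emulated arrive/wait on a monotonically increasing shared-memory counter.
// One arrival per warp; waiters spin until the counter reaches a target.
__device__ void smem_bar_arrive(unsigned int* counter) {
  __threadfence_block();                         // publish prior shared-mem writes
  if ((threadIdx.x & 31) == 0) atomicAdd(counter, 1u);
}

__device__ void smem_bar_wait(unsigned int* counter, unsigned int target) {
  if ((threadIdx.x & 31) == 0)
    while (atomicAdd(counter, 0u) < target) { }  // atomic read; spin until released
  __syncwarp();                                  // other lanes wait on lane 0
  __threadfence_block();                         // order subsequent shared-mem reads
}

// Usage sketch: 4 producer warps fill a buffer, 8 consumer warps wait for all
// 4 arrivals of iteration `it` before reading it. (A real implementation also
// needs a second counter so producers do not overwrite a buffer still in use.)
__global__ void producer_consumer_handshake(int num_iters) {
  __shared__ unsigned int produced;
  if (threadIdx.x == 0) produced = 0;
  __syncthreads();

  const bool is_producer = threadIdx.x < 4 * 32;
  for (int it = 0; it < num_iters; ++it) {
    if (is_producer) {
      // ... fill the shared-memory buffer for iteration `it` ...
      smem_bar_arrive(&produced);
    } else {
      smem_bar_wait(&produced, 4u * (it + 1));   // all 4 producer warps arrived
      // ... consume the buffer ...
    }
  }
}
```
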

### Progress

* Modularizing compiler passes. Decoupled data extraction from lowering, allowing customized lowering flows and predictable behavior on analysis failures.
  * triton-to-structured
  * triton-arith-to-linalg
  * structured-to-memref
* Improvements to pointer analysis