
Commit da7c595 (1 parent: d25df42)

Fix precommit (#7376)

Ran: `pre-commit run --all-files` (removes trailing whitespace)
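The commit message says the change came from pre-commit's trailing-whitespace fixer. As a rough illustration only (not the hook's actual implementation), the fix amounts to stripping trailing spaces and tabs from every line of each file:

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces/tabs from every line, keeping line breaks.

    A minimal sketch of what pre-commit's trailing-whitespace hook does
    to each file; the real hook has extra behavior (e.g. an option to
    preserve markdown hard line breaks) not modeled here.
    """
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


# The changed lines in this commit differ only by trailing whitespace,
# e.g. (trailing spaces shown here are illustrative):
before = "* Pytorch 2.6 with Triton release branch 3.2  \n"
after = strip_trailing_whitespace(before)
```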

2 files changed: +5 −5 lines


docs/meetups/03-12-2025/notes.md

Lines changed: 4 additions & 4 deletions
@@ -9,15 +9,15 @@
 Speakers: Hongtao Yu (Meta), Yuanwei (Kevin) Fang (Meta), Manman Ren (Meta)
 
 Notes:
-* Pytorch 2.6 with Triton release branch 3.2 
+* Pytorch 2.6 with Triton release branch 3.2
 * Targeting: Nvidia Hopper arch, Blackwell coming soon.
 * Performance
 * Meta’s FP8Rowwise GEMM (3-5% improvement, 1D persistent loop)
 * FlashAttention (10-15% improvement, could be faster with pipelining and pingpong scheduling).
 * What is warp specialization?
 * Improves hardware instruction scheduling. GPUs don’t have good dynamic instruction scheduling.
 * Use multi-way warp scheduler. Allows warps on a single core targeting different function units (e.g. memory, ALU, tensor core, etc.) All run in parallel.
-* Comparison using GEMM * * 
+* Comparison using GEMM * *
 * Uniform warps: 8 warps, each loading/processing 1/8th of data. Divided into two groups, each doing ½ the data. Good for GEMM but not for more complicated kernels.
 * Warp specialized: 12 warps, 4 warps for producing data-only do load, 8 for wgmma-only do wmma. Frees up more capacity for more complex kernels like flash attention.
 * Compiler implementation
@@ -60,7 +60,7 @@ Notes:
 * Data partitioning
 * Communication pipelining and ping-pong scheduling
 * Ping-pong is named barrier pair. Only one consumer can be in region.
-
+
 ## Questions
 * Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
 * A> Meta is doing this for AMD. No named barrier in AMD. Simulating this using shared-memory atomics on AMD to get the same effect.
@@ -87,7 +87,7 @@ Notes:
 
 ### Progress
 * Modularizing compiler passes. Decoupled data extraction from lowering. Allowed for customized lowering flows. Predictable behavior for analysis failures.
-* Triton-to-structured 
+* Triton-to-structured
 * triton-arith-to-linalg
 * Structured-to-memref
 * Improvements to pointer analysis

docs/meetups/05-01-2025/notes.md

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ Speaker: Sayce Falk (Google), Cicie Wang (Meta), Jason Knight (Nvidia), Keren Zh
 * Q> Anyone interested in this?
 * A> Maybe first step, identify how much generated code is affected by a pull request (give a signal to say something about the blast radius of a change).
 * Q> Intel had an intern looking at this.
-* Q> Intel<Alexander> - if you're interested reach out over slack. 
+* Q> Intel<Alexander> - if you're interested reach out over slack.
 
 ## What talks/tutorials/open discussions would you like to see at the 2025 Triton Developers' Summit? How can we help?
 Speaker: Adnan Aziz (Meta)
