
Commit ce5e378

Merge commit '3b279710302cb22431097d9fc3721afbac5bd91b'
Signed-off-by: Anatoly Myachev <[email protected]>
2 parents: 0d92c31 + 3b27971

53 files changed: +1702 additions, -483 deletions


.github/workflows/integration-tests-amd.yml
Lines changed: 1 addition & 1 deletion

@@ -101,7 +101,7 @@ jobs:
           pytest --capture=tee-sys -rfs third_party/amd/python/test/test_extract_slice_concat_op.py
           TRITON_ALWAYS_COMPILE=1 pytest --capture=tee-sys -rfs third_party/amd/python/test/test_scalarize_packed_fops.py
           cd python/test/unit
-          pytest --capture=tee-sys -rfs -n 12 language runtime \
+          pytest --capture=tee-sys -rfs -n 12 language runtime tools \
            --ignore=language/test_line_info.py \
            --ignore=test_debug.py
          # TODO: uncomment

README.md
Lines changed: 17 additions & 17 deletions

@@ -50,7 +50,7 @@ pip install -e .
 # Building with a custom LLVM
 
 Triton uses LLVM to generate code for GPUs and CPUs. Normally, the Triton build
-downloads a prebuilt LLVM, but you can also build LLVM from source and use that.
+downloads a prebuilt LLVM, but you can also build and use LLVM from source.
 
 LLVM does not have a stable API, so the Triton build will not work at an
 arbitrary LLVM version.
@@ -68,15 +68,15 @@ Alternatively, follow these steps to build LLVM from source manually.
 
 1. Find the version of LLVM that Triton builds against. Check
    `cmake/llvm-hash.txt` to see the current version. For example, if it says:
-       49af6502c6dcb4a7f7520178bd14df396f78240c
+       49af6502c6dcb4a7f7520178bd14df396f78240c.
 
    This means that the version of Triton you have builds against
    [LLVM](https://github.com/llvm/llvm-project) 49af6502.
 
 2. `git checkout` LLVM at this revision. Optionally, make additional
    modifications to LLVM.
 
-3. [Build LLVM](https://llvm.org/docs/CMake.html). For example, you might run
+3. [Build LLVM](https://llvm.org/docs/CMake.html). For example, you might run:
 
        $ cd $HOME/llvm-project  # your clone of LLVM.
        $ mkdir build
@@ -86,7 +86,7 @@ Alternatively, follow these steps to build LLVM from source manually.
 
 4. Grab a snack, this will take a while.
 
-5. Build Triton as above, but set the following environment variables.
+5. Build Triton as above, but set the following environment variables:
 
        # Modify as appropriate to point to your LLVM build.
        $ export LLVM_BUILD_DIR=$HOME/llvm-project/build
@@ -125,10 +125,10 @@ Alternatively, follow these steps to build LLVM from source manually.
 
 If IntelliSense does not work, you can try the following steps:
 
-- Do a local build. Run command `pip install -e .`
+- Do a local build. Run command `pip install -e .`.
 - Get the full path to the `compile_commands.json` file produced by the build:
   `find ./build -name 'compile_commands.json' | xargs readlink -f`.
-  You might get a full path similar to `/Users/{username}/triton/build/cmake.macosx-11.1-arm64-cpython-3.12/compile_commands.json`
+  You might get a full path similar to `/Users/{username}/triton/build/cmake.macosx-11.1-arm64-cpython-3.12/compile_commands.json`.
 - In VSCode, install the
   [C/C++
   extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools),
@@ -140,7 +140,7 @@ Alternatively, follow these steps to build LLVM from source manually.
 # Running tests
 
 There currently isn't a turnkey way to run all the Triton tests, but you can
-follow the following recipe.
+follow the following recipe:
 
 ```shell
 # One-time setup.  Note this will reinstall local Triton because torch
@@ -164,7 +164,7 @@ See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of conf
 
 - `MLIR_ENABLE_DUMP=1` dumps the IR before every MLIR pass Triton runs, for all
   kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only.
-  - Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your triton cache: `rm -r ~/.triton/cache/*`
+  - Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your triton cache: `rm -r ~/.triton/cache/*`.
 - `MLIR_DUMP_PATH` specifies where `MLIR_ENABLE_DUMP` will dump to. If unset will dump to stderr.
 - `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR.
 - `TRITON_REPRODUCER_PATH=<reproducer_path>` will generate an MLIR reproducer file
@@ -175,11 +175,11 @@ See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of conf
 - `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of
   debugging information to stdout. If this is too noisy, run with just
   `TRITON_LLVM_DEBUG_ONLY` instead to limit the output.
-
-  An alternative way to reduce output noisiness is running with
+- An alternative way to reduce output noisiness is running with
   `LLVM_IR_ENABLE_DUMP=1`, extract the IR before the LLVM pass of interest, and
   then run LLVM's `opt` standalone, perhaps passing `-debug-only=foo` on the
  command line.
+
 - `TRITON_LLVM_DEBUG_ONLY=<comma-separated>` is the equivalent of LLVM's
   `-debug-only` command-line option. This limits the LLVM debug output to
   specific pass or component names (which are specified using `#define
@@ -191,8 +191,7 @@ See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of conf
 - `TRITON_ENABLE_ASAN=1` invokes the LLVM address sanitizer for
   memory leak and out of bounds access detection. Currently only supported on the AMD
   backend. This must be run using the ASAN libraries documented [here](https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/using-gpu-sanitizer.html).
-
-  When enabling the address sanitizer it is recommended to disable various memory caching strategies
+  - When enabling the address sanitizer it is recommended to disable various memory caching strategies
   both within the ROCm stack and PyTorch. This will give the address sanitizer the best chance at finding the
   memory fault where it originates. See this [test](https://github.com/triton-lang/triton/blob/main/third_party/amd/python/test/test_address_sanitizer.py) for more details.
 
@@ -227,9 +226,10 @@ See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of conf
 - `TRITON_OVERRIDE_DIR` specifies the directory from which to load the IR/ptx/amdgcn files when `TRITON_KERNEL_OVERRIDE` is set to 1.
 - `TRITON_F32_DEFAULT` sets the default input precision of `tl.dot` when using 32-bit floats, which can be either `ieee`, `tf32`, or `tf32x3`.
 - `TRITON_FRONT_END_DEBUGGING=1` disables exception wrapping when an error occurs in the compiler frontend, allowing the full stack trace to be seen.
-- `TRITON_DISABLE_LINE_INFO=1` removes all line information from the module
+- `TRITON_DISABLE_LINE_INFO=1` removes all line information from the module.
 
-N.B. Some of these environment variables don't have a knob in `knobs.py`-- those are only relevant to the C++ layer(s), hence they don't exist in the python layer.
+> [!NOTE]
+> Some of these environment variables don't have a knob in `knobs.py`-- those are only relevant to the C++ layer(s), hence they don't exist in the python layer.
 
 **Kernel Override Steps**
 
@@ -274,7 +274,7 @@ Supported Hardware:
 # Development Container (Dev Container)
 
 **Dev Containers** for the Triton project are available from
-the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers)
+the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers).
 
 ### Key Benefits:
 - **Consistency**: All developers can work with the same development
@@ -286,5 +286,5 @@ the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-c
 
 ### How to Use the Dev Container:
 
-For detailed instructions on how to use the dev containers please see
-the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md)
+For detailed instructions on how to use the dev containers, please see
+the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md).
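To illustrate the knobs touched in the README section above, here is a minimal sketch of setting a few of them from Python before Triton is imported. The variable names come from the README text; how and when each one is read varies by Triton version, so treat this as illustrative rather than as the documented API.

```python
# Hedged sketch: exercising a few of the environment knobs from the README diff
# above. Set them before importing triton so knobs.py can pick them up.
import os

os.environ["MLIR_ENABLE_DUMP"] = "1"          # dump IR before every MLIR pass (or set a kernel name)
os.environ["TRITON_F32_DEFAULT"] = "tf32"     # default tl.dot input precision for fp32 inputs
os.environ["TRITON_DISABLE_LINE_INFO"] = "1"  # strip line info from the emitted module

import triton  # noqa: E402 -- imported after the knobs are set so they take effect

print(triton.__version__)
```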

bin/RegisterTritonDialects.h
Lines changed: 1 addition & 0 deletions

@@ -97,6 +97,7 @@ inline void registerTritonDialects(mlir::DialectRegistry &registry) {
   mlir::triton::registerTritonGENToSPIRVPasses();
 
   // TritonAMDGPUToLLVM passes
+  mlir::triton::registerAllocateAMDGPUSharedMemory();
   mlir::triton::registerConvertTritonAMDGPUToLLVM();
   mlir::triton::registerConvertBuiltinFuncToLLVM();
   mlir::triton::registerOptimizeAMDLDSUsage();

docs/meetups/03-12-2025/notes.md
Lines changed: 118 additions & 0 deletions (new file)

# Agenda:
1. Improving ILP (Instruction Level Parallelism) with Warp Specialization
2. Triton-shared (Progress and updates)
3. Question about generic tensor descriptors

# Meeting notes:

## Improving ILP (Instruction Level Parallelism) with Warp Specialization
Speakers: Hongtao Yu (Meta), Yuanwei (Kevin) Fang (Meta), Manman Ren (Meta)

Notes:
* PyTorch 2.6 with Triton release branch 3.2.
* Targeting the Nvidia Hopper architecture; Blackwell support coming soon.
* Performance
  * Meta's FP8Rowwise GEMM: 3-5% improvement (1D persistent loop).
  * FlashAttention: 10-15% improvement; could be faster with pipelining and ping-pong scheduling.
* What is warp specialization?
  * Improves hardware instruction scheduling; GPUs don't have good dynamic instruction scheduling.
  * Uses a multi-way warp scheduler, which lets warps on a single core target different functional units (e.g. memory, ALU, tensor core), all running in parallel.
* Comparison using GEMM
  * Uniform warps: 8 warps, each loading/processing 1/8th of the data, divided into two groups that each handle half the data. Good for GEMM, but not for more complicated kernels.
  * Warp specialized: 12 warps; 4 producer warps that only do loads and 8 wgmma warps that only do the wgmma compute. Frees up more capacity for more complex kernels like flash attention.
* Compiler implementation
  * How to enable warp specialization
    * Enabled automatically by adding two switches to the autotune config (see the first sketch after this section):
      * Num_consumer_groups - number of non-load (consumer) warp groups.
      * Num_buffer_warp_spec - number of buffers between producer and consumers.
  * Concept
    * Async tasks run in parallel with other async tasks.
    * Tasks should use different memory and GPU resources.
    * Coordination happens through shared memory, with barriers for synchronization.
  * Compiler implementation: automatic task partitioning and dataflow multi-buffering.
  * Task partitioning
    * Automatic task partitioning identifies tasks such as loads, ALU ops, stores, etc.
    * Identifies dependency chains and links producers to consumers.
    * Continues partitioning and inserts synchronization primitives in both producer and consumer warps.
  * Multi-buffering (see the second sketch after this section)
    * The producer keeps loading into buffers round-robin while consumers process individual buffers.
    * The producer blocks when no free buffer is available.
  * In the future
    * Multi-buffering for multi-dimensional loops.
    * Buffer reuse over multiple regions in a single group.
    * Complex control flow and partition schemes (ping-pong, support for Blackwell).
* Case study: Flash Attention (Kevin and Manman)
  * Without WS
    * Compute throughput: 45%
    * Memory throughput: 35%
    * SM busy: 46%
    * No interleaving: CUDA cores are idle while tensor cores are running.
  * With WS
    * Compute throughput: 69%
    * Memory throughput: 35%
    * SM busy: 71%
    * Interleaving (speedup due to):
      * Overlapping TMA with CUDA core ops.
      * Overlapping CUDA cores and tensor cores.
      * Overlapping tensor cores and instruction issuing.
  * Data partitioning
  * Communication pipelining and ping-pong scheduling
    * Ping-pong uses a named-barrier pair; only one consumer can be in the region at a time.
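As referenced above, warp specialization is turned on through two extra switches in the autotune config. Below is a minimal sketch of what that might look like; it assumes the `num_consumer_groups` / `num_buffer_warp_spec` `triton.Config` kwargs from the facebookexperimental/triton warp-specialization branch (they are not part of upstream Triton main), so treat the exact spellings and values as illustrative.

```python
# Hedged sketch of the "two switches" mentioned in the notes. Assumes the
# num_consumer_groups / num_buffer_warp_spec Config kwargs from the
# facebookexperimental/triton warp-specialization branch (not upstream main).
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config(
            {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
            num_warps=8,
            num_stages=3,
            num_consumer_groups=2,    # non-load (consumer) warp groups
            num_buffer_warp_spec=3,   # buffers between the producer and consumers
        ),
    ],
    key=["M", "N", "K"],
)
@triton.jit
def ws_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr):
    # The kernel body is unchanged; with the switches present, the compiler
    # performs the task partitioning and multi-buffering described above.
    pass
```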
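The multi-buffering bullet describes a producer/consumer handoff over a small ring of buffers. As a rough host-side analogy only (the real mechanism lives in shared memory with barrier synchronization generated by the compiler), with `NUM_BUFFERS` standing in for the buffer count:

```python
# Rough CPU analogy of the multi-buffering handoff: the producer fills buffers
# round-robin and blocks when none are free; a consumer releases each buffer
# after processing it. Illustrative only; not the compiler's implementation.
import threading

NUM_BUFFERS = 3
free_slots = threading.Semaphore(NUM_BUFFERS)  # producer blocks when this hits 0
full_slots = threading.Semaphore(0)            # consumer waits for filled buffers
buffers = [None] * NUM_BUFFERS

def producer(tiles):
    for i, tile in enumerate(tiles):
        free_slots.acquire()                   # block when no free buffer is available
        buffers[i % NUM_BUFFERS] = tile        # round-robin fill
        full_slots.release()

def consumer(num_tiles, results):
    for i in range(num_tiles):
        full_slots.acquire()
        results.append(buffers[i % NUM_BUFFERS] * 2)  # stand-in for the compute task
        free_slots.release()                   # hand the buffer back to the producer

tiles, results = list(range(8)), []
t_prod = threading.Thread(target=producer, args=(tiles,))
t_cons = threading.Thread(target=consumer, args=(len(tiles), results))
t_prod.start(); t_cons.start(); t_prod.join(); t_cons.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```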
## Questions
* Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
* A> Meta is doing this for AMD. There are no named barriers on AMD; the same effect is simulated with shared-memory atomics.

* Q> Would it make sense to promote these to a higher level inside Triton for complex cases where it would be difficult for the compiler to detect?
* A> Yes. We allow users to annotate programs with their partitions in [facebookexperimental/triton](https://github.com/facebookexperimental/triton). We want to see if more automation is possible.

* Q> What should we target first as an initial optimization, warp specialization or software pipelining? From your experience, which lowering is preferred? Are you going to bring it to main?
* A> They are not mutually exclusive; you need to figure out what makes sense for your case. WS benefits: outer-loop support for pipelining, and overlapping of CUDA cores and tensor cores.

* Q> What improvements are you seeing?
* A> Flash attention: 20%+; with computational pipelining and ping-pong scheduling it approaches FlashAttention v3 performance.

## Triton-shared (Progress and updates)
Presenters: Nhat Nguyen (Microsoft), Haishan Zhu (Meta)

Notes:

### Goal:
* Lower Triton IR to MLIR core dialects (linalg, memref, …) for an easier path to running on CPUs.
* Focus on supporting strided memory access for accelerators.
* Open-sourced at https://github.com/microsoft/triton-shared
* Trying to keep it in sync with OSS Triton (albeit a little delayed).

### Progress
* Modularized compiler passes: data extraction is decoupled from lowering, which allows customized lowering flows and predictable behavior when analysis fails. The flow is split into:
  * triton-to-structured
  * triton-arith-to-linalg
  * structured-to-memref
* Improvements to pointer analysis
  * Supports nested loops.
  * Supports non-contiguous memory access.
  * Supports lowering unstructured access with a single base pointer.
* Support for lowering Triton ops to linalg/MLIR (split, join, cat, etc.)

### Roadmap
* Complete support for non-contiguous pointers.
* Detect other memory access patterns (e.g. row gather/scatter pointer sequences).
* Extend to control flow ops.

### Thanks!
Meta, Qualcomm and the community.

### Questions
* Q> Future plans: what are the higher-priority items you want to work on?
* A> Many Triton kernels have memory access patterns that can't be detected, and we don't have fallback solutions yet (e.g. gather/scatter support); we need to wait for the MLIR pointer dialect to land so we can use it. Pointer analysis for MxN loads fails when the loads are not contiguous, but individual rows may still be contiguous, so the analysis can be split into multiple chunks (row scatter, row gather). A sketch contrasting structured and unstructured access follows this section.
* A> In places where pointer analysis can't extract information, we leave the IR intact so that existing passes can deal with it. We can handle loops iterating over tensors of pointers (a common pattern). More complicated operations like if/else look like low-hanging fruit.
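To make the structured vs. unstructured distinction above concrete, here is a small, hedged Triton sketch: the first load is a strided block access of the kind pointer analysis can describe, while the second is a data-dependent gather through a tensor of indices, the pattern described as needing a fallback. The kernel and parameter names are illustrative, not from the talk.

```python
# Illustrative only: contrasts a strided (structured) load with a data-dependent
# gather (unstructured), the two access patterns discussed in the Q&A above.
import triton
import triton.language as tl


@triton.jit
def access_patterns_kernel(x_ptr, idx_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n

    # Structured: a contiguous block of x -- a base pointer plus affine offsets,
    # which pointer analysis can summarize as strides and offsets.
    x = tl.load(x_ptr + offs, mask=mask)

    # Unstructured: gather through runtime indices -- addresses are data
    # dependent, so there is no strided description of this access.
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    g = tl.load(x_ptr + idx, mask=mask)

    tl.store(out_ptr + offs, x + g, mask=mask)
```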
## Questions about Generic Tensor Descriptors
* Q> What is the progress on generic (not Nvidia-specific) tensor descriptor programming? (Carried over from last month.)
* A> TMA-style acceleration will probably become more general across GPUs.
* A> TMA (tensor descriptor) support should be landing over the next few weeks, with a compatibility mode for GPUs without TMA (though that will probably be slower) and block pointer support to follow. Host-side tensor descriptors will be deprecated, since they only provided a minor performance benefit for persistent kernels. Users will be able to autotune.

## Minutes:
Recording link [here](https://www.youtube.com/watch?v=cIW6ZL_LmGc)
