# Running tests

There currently isn't a turnkey way to run all the Triton tests, but you can
use the following recipe:

```shell
# One-time setup. Note this will reinstall local Triton because torch
# overwrites it with the public version.
```

See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of configuration knobs.

- `MLIR_ENABLE_DUMP=1` dumps the IR before every MLIR pass Triton runs, for all
  kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only
  (see the example after this list).
  - The Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your Triton cache: `rm -r ~/.triton/cache/*`.
- `MLIR_DUMP_PATH` specifies where `MLIR_ENABLE_DUMP` will dump to. If unset, it dumps to stderr.
- `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR.
- `TRITON_REPRODUCER_PATH=<reproducer_path>` will generate an MLIR reproducer file at `<reproducer_path>`.
- `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of
  debugging information to stdout. If this is too noisy, run with just
  `TRITON_LLVM_DEBUG_ONLY` instead to limit the output.
  - An alternative way to reduce output noise is to run with
    `LLVM_IR_ENABLE_DUMP=1`, extract the IR before the LLVM pass of interest, and
    then run LLVM's `opt` standalone, perhaps passing `-debug-only=foo` on the
    command line.
- `TRITON_LLVM_DEBUG_ONLY=<comma-separated>` is the equivalent of LLVM's
  `-debug-only` command-line option. This limits the LLVM debug output to
  specific pass or component names (which are specified using `#define DEBUG_TYPE`
  in the source code).
- `TRITON_ENABLE_ASAN=1` invokes the LLVM address sanitizer for
  memory leak and out-of-bounds access detection. Currently only supported on the AMD
  backend. This must be run using the ASAN libraries documented [here](https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/using-gpu-sanitizer.html).
  - When enabling the address sanitizer, it is recommended to disable various memory caching strategies
    both within the ROCm stack and PyTorch. This gives the address sanitizer the best chance of finding the
    memory fault where it originates. See this [test](https://github.com/triton-lang/triton/blob/main/third_party/amd/python/test/test_address_sanitizer.py) for more details.
- `TRITON_OVERRIDE_DIR` specifies the directory from which to load the IR/ptx/amdgcn files when `TRITON_KERNEL_OVERRIDE` is set to 1.
- `TRITON_F32_DEFAULT` sets the default input precision of `tl.dot` when using 32-bit floats, which can be either `ieee`, `tf32`, or `tf32x3`.
- `TRITON_FRONT_END_DEBUGGING=1` disables exception wrapping when an error occurs in the compiler frontend, allowing the full stack trace to be seen.
- `TRITON_DISABLE_LINE_INFO=1` removes all line information from the module.

> [!NOTE]
> Some of these environment variables don't have a knob in `knobs.py`; those are only relevant to the C++ layer(s), hence they don't exist in the Python layer.
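
For example, here is a minimal sketch of exercising a few of these variables from a Python script. The kernel, paths, and values below are made-up placeholders, and setting the variables in the shell before launching the script works just as well:

```python
# Minimal sketch: set debugging knobs before Triton compiles anything.
# The kernel, path, and values here are placeholders, not documented workflow.
import os
import pathlib
import shutil

# The cache can mask the dump (see above), so clear it first.
shutil.rmtree(pathlib.Path.home() / ".triton" / "cache", ignore_errors=True)

# Must be set before the first kernel launch (which triggers compilation).
os.environ["MLIR_ENABLE_DUMP"] = "add_kernel"            # or "1" for all kernels
os.environ["MLIR_DUMP_PATH"] = "/tmp/add_kernel_ir.txt"  # dump here instead of stderr
os.environ["TRITON_DISABLE_LINE_INFO"] = "1"

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)

# Launching add_kernel (on a GPU, e.g. with torch tensors) will then dump the IR
# before every MLIR pass for this kernel to the path above.
```
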
**Kernel Override Steps**

# Development Container (Dev Container)

**Dev Containers** for the Triton project are available from
the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers).

### Key Benefits:
- **Consistency**: All developers can work with the same development environment.

### How to Use the Dev Container:

For detailed instructions on how to use the dev containers, please see
the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md).
* FlashAttention (10-15% improvement, could be faster with pipelining and ping-pong scheduling).
* What is warp specialization?
  * Improves hardware instruction scheduling; GPUs don't have good dynamic instruction scheduling.
  * Uses a multi-way warp scheduler: warps on a single core target different function units (e.g. memory, ALU, tensor core), all running in parallel.
  * Comparison using GEMM:
    * Uniform warps: 8 warps, each loading/processing 1/8th of the data, divided into two groups that each do half the data. Good for GEMM but not for more complicated kernels.
    * Warp specialized: 12 warps; 4 producer warps that only do loads, and 8 warps that only do `wgmma`. Frees up more capacity for more complex kernels like flash attention.
* Compiler implementation
  * How to enable warp specialization
    * Automatically enabled by adding two switches to the autotune config (see the config sketch after this list):
      * `num_consumer_groups`: number of non-load (consumer) warp groups.
      * `num_buffer_warp_spec`: number of buffers between producer and consumer.
  * Concept
    * Async tasks run in parallel with other async tasks.
    * Tasks should use different memory and GPU resources.
    * Coordination happens through shared memory, with barriers for synchronization.
  * The implementation consists of:
    * Automatic task partitioning.
    * Dataflow multi-buffering.
  * Task partitioning
    * Automatic task partitioning identifies tasks such as loads, ALU ops, stores, etc.
    * Identifies dependency chains and links producers to consumers.
    * Continues partitioning and inserts synchronization primitives in both producer and consumer warps.
  * Multi-buffering
    * The producer keeps loading/populating buffers in round-robin order while consumers process individual buffers (see the sketch after this list).
    * The producer blocks when no free buffers are available.
  * In the future
    * Multi-buffering of multi-dimensional loops.
    * Buffer reuse over multiple regions in a single group.
    * Complex control flow and partition schemes (ping-pong, support for Blackwell).
* Case Study: Flash Attention - Kevin and Manman
  * Without WS
    * Compute Throughput: 45%
    * Memory Throughput: 35%
    * SM Busy: 46%
    * No interleaving: CUDA cores are idle while tensor cores are running.
  * With WS
    * Compute Throughput: 69%
    * Memory Throughput: 35%
    * SM Busy: 71%
    * Interleaving (speedup due to):
      * Overlapping TMA with CUDA core ops.
      * Overlapping CUDA core and tensor core work.
      * Overlapping tensor core work and instruction issuing.
  * Data partitioning
  * Communication pipelining and ping-pong scheduling
    * Ping-pong is a named barrier pair; only one consumer can be in the region at a time.
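
As a rough illustration of the two switches mentioned above, the sketch below adds them to an autotune config. The keyword names (`num_consumer_groups`, `num_buffer_warp_spec`) are taken from these notes and are only accepted by Triton builds with warp-specialization support (e.g. the facebookexperimental/triton branch); the exact spelling there may differ, and the kernel body is elided:

```python
# Hedged sketch: autotune configs carrying the warp-specialization switches
# described above. Only meaningful on a Triton build that accepts these
# keyword arguments; mainline triton.Config will reject them.
import triton
import triton.language as tl

configs = [
    # Warp-specialized config: 2 consumer warp groups and 3 buffers between
    # the producer and the consumers (the numbers are illustrative).
    triton.Config(
        {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
        num_warps=8,
        num_stages=3,
        num_consumer_groups=2,     # non-load (consumer) warp groups
        num_buffer_warp_spec=3,    # buffers between producer and consumers
    ),
    # Baseline config without warp specialization, for comparison.
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
                  num_warps=8, num_stages=3),
]

@triton.autotune(configs=configs, key=["M", "N", "K"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    ...  # GEMM body elided; warp specialization is driven by the config above.
```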
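
The multi-buffering behaviour described in the list above (round-robin producer, blocking when no buffer is free) can be pictured as an ordinary bounded buffer. The following is a CPU-side analogy in plain Python, purely to illustrate the scheduling idea, not the code the compiler generates:

```python
# CPU analogy of dataflow multi-buffering: one producer fills a small ring of
# buffers round-robin and blocks when none are free; the consumer drains them.
import threading
import queue

NUM_BUFFERS = 3                      # analogous to num_buffer_warp_spec
buffers = [None] * NUM_BUFFERS
free_slots = queue.Queue()           # slots the producer may fill
ready_slots = queue.Queue()          # slots holding data for the consumer
for i in range(NUM_BUFFERS):
    free_slots.put(i)

def producer(num_tiles):
    for tile in range(num_tiles):
        slot = free_slots.get()          # blocks when no free buffer exists
        buffers[slot] = f"tile-{tile}"   # stands in for an async load / TMA
        ready_slots.put(slot)            # signal "buffer ready" (barrier)
    ready_slots.put(None)                # done marker

def consumer():
    while (slot := ready_slots.get()) is not None:
        _ = buffers[slot]                # stands in for the math on the tile
        free_slots.put(slot)             # release the buffer to the producer

t = threading.Thread(target=producer, args=(8,))
t.start()
consumer()
t.join()
```
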
## Questions
* Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
  * A> Meta is doing this for AMD. There is no named barrier on AMD; the same effect is simulated using shared-memory atomics.
* Q> Would it make sense to promote these to a higher level inside Triton for complex cases where it would be difficult for the compiler to detect?
  * A> Yes. We allow users to annotate programs with their partitions in [facebookexperimental/triton](https://github.com/facebookexperimental/triton). We want to see if more automation is possible.
* Q> What should we target first: warp specialization or software pipelining as an initial optimization? From your experience, which lowering is preferred? Are you going to bring it to main?
  * A> They are not mutually exclusive; you need to figure out what makes sense for yourself. WS benefits: outer-loop support for pipelining, and overlapping of CUDA core and tensor core work.
* Q> What improvements are you seeing?
  * A> Flash attention: 20%; adding computational pipelining and ping-pong scheduling approaches FlashAttention v3 performance.

## triton-shared
* Lowers Triton IR to MLIR core dialects (linalg, memref, …) for an easier path to running on CPUs.
* Focus on supporting strided memory access for accelerators.
* Open-sourced at https://github.com/microsoft/triton-shared
* Trying to keep it in sync with OSS Triton (albeit a little delayed).

### Progress
* Modularizing compiler passes: data extraction is decoupled from lowering, which allows customized lowering flows and predictable behavior when analysis fails. The passes are:
  * `triton-to-structured`
  * `triton-arith-to-linalg`
  * `structured-to-memref`
* Improvements to pointer analysis (see the sketch after this list):
  * Supports nested loops.
  * Supports non-contiguous memory access.
  * Supports lowering unstructured accesses with a single base pointer.
* Support for lowering Triton ops to linalg/MLIR (split, join, cat, etc.).
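
As a concrete picture of the access patterns this pointer analysis targets, the hedged sketch below shows a kernel whose offsets are an affine function of `tl.arange` and the stride arguments, i.e. a strided, structured access that can be rewritten in terms of memref-style loads. The kernel is illustrative only and is not taken from the triton-shared repository:

```python
# Illustrative kernel with structured, strided 2D accesses: the offsets are an
# affine function of tl.arange and the stride arguments, the kind of pattern
# triton-shared's pointer analysis is designed to recover.
import triton
import triton.language as tl

@triton.jit
def copy_tile_kernel(src_ptr, dst_ptr,
                     stride_src_m, stride_src_n,
                     stride_dst_m, stride_dst_n,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    rm = tl.arange(0, BLOCK_M)
    rn = tl.arange(0, BLOCK_N)
    # base + rm * stride_m + rn * stride_n is a strided (structured) access.
    src_offsets = rm[:, None] * stride_src_m + rn[None, :] * stride_src_n
    dst_offsets = rm[:, None] * stride_dst_m + rn[None, :] * stride_dst_n
    tile = tl.load(src_ptr + src_offsets)
    tl.store(dst_ptr + dst_offsets, tile)
```
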
### Roadmap
* Complete support for non-contiguous pointers.
* Detect other memory access patterns (e.g. row-gather/scatter pointer sequences).
* Extend to control flow ops.

### Thanks!
Meta, Qualcomm and the community.

### Questions
* Q> Future plans: what are the higher-priority items you want to work on?
  * A> Many Triton kernels have memory access patterns that can't be detected, and we don't have fallback solutions (e.g. gather/scatter support). We need to wait for the MLIR pointer dialect to land so we can use it. For MxN loads, pointer analysis fails if the loads are not contiguous overall, but the rows may be contiguous, so we can split the analysis into multiple chunks (row scatter, row gather).
  * A> In places where pointer analysis can't extract information, we leave the IR intact so that existing passes can still deal with it. We can handle loop iteration over tensors of pointers (a common pattern). More complicated operations like if/else look like low-hanging fruit.

## Questions about Generic Tensor Descriptor
* Q> What is the progress on generic tensor descriptor programming, not Nvidia-specific (from last month)?
  * A> The TMA accelerator will probably become more general across GPUs.
  * A> TMA (tensor descriptor) support should be landing over the next few weeks. A compatibility mode will be added for GPUs without TMA (but it will probably be slower), along with block pointer support. Host-side tensor descriptors will be deprecated (they only provided a minor performance benefit for persistent kernels). Users will be able to autotune.

## Minutes:
Recording link [here](https://www.youtube.com/watch?v=cIW6ZL_LmGc)