# Running tests

There currently isn't a turnkey way to run all the Triton tests, but you can
use the following recipe:

```shell
# One-time setup. Note this will reinstall local Triton because torch
# overwrites it with the public version.
```

See [`python/triton/knobs.py`](python/triton/knobs.py) for the full list of configuration knobs.

- `MLIR_ENABLE_DUMP=1` dumps the IR before every MLIR pass Triton runs, for all
  kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only
  (see the example after this list).
  - The Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your Triton cache: `rm -r ~/.triton/cache/*`.
- `MLIR_DUMP_PATH` specifies where `MLIR_ENABLE_DUMP` will dump to. If unset, it dumps to stderr.
- `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR.
- `TRITON_REPRODUCER_PATH=<reproducer_path>` will generate an MLIR reproducer file at `<reproducer_path>`.
- `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of
  debugging information to stdout. If this is too noisy, run with just
  `TRITON_LLVM_DEBUG_ONLY` instead to limit the output.
  - An alternative way to reduce output noise is to run with
    `LLVM_IR_ENABLE_DUMP=1`, extract the IR before the LLVM pass of interest, and
    then run LLVM's `opt` standalone, perhaps passing `-debug-only=foo` on the
    command line.
- `TRITON_LLVM_DEBUG_ONLY=<comma-separated>` is the equivalent of LLVM's
  `-debug-only` command-line option. This limits the LLVM debug output to
  specific pass or component names (which are specified using `#define DEBUG_TYPE`
  in the source code).
- `TRITON_ENABLE_ASAN=1` invokes the LLVM address sanitizer for
  memory leak and out-of-bounds access detection. Currently only supported on the AMD
  backend. This must be run using the ASAN libraries documented [here](https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/using-gpu-sanitizer.html).
  - When enabling the address sanitizer, it is recommended to disable various memory caching strategies
    both within the ROCm stack and PyTorch. This gives the address sanitizer the best chance of finding the
    memory fault where it originates. See this [test](https://github.com/triton-lang/triton/blob/main/third_party/amd/python/test/test_address_sanitizer.py) for more details.
- `TRITON_OVERRIDE_DIR` specifies the directory from which to load the IR/ptx/amdgcn files when `TRITON_KERNEL_OVERRIDE` is set to 1.
- `TRITON_F32_DEFAULT` sets the default input precision of `tl.dot` when using 32-bit floats, which can be either `ieee`, `tf32`, or `tf32x3`.
- `TRITON_FRONT_END_DEBUGGING=1` disables exception wrapping when an error occurs in the compiler frontend, allowing the full stack trace to be seen.
- `TRITON_DISABLE_LINE_INFO=1` removes all line information from the module.

> [!NOTE]
> Some of these environment variables don't have a knob in `knobs.py`; those are only relevant to the C++ layer(s), hence they don't exist in the Python layer.
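
For example, here is a minimal sketch of exercising a few of these variables from a Python script. The kernel, paths, and values below are made-up placeholders, and setting the variables in the shell before launching the script works just as well:

```python
# Minimal sketch: set debugging knobs before Triton compiles anything.
# The kernel, path, and values here are placeholders, not documented workflow.
import os
import pathlib
import shutil

# The cache can mask the dump (see above), so clear it first.
shutil.rmtree(pathlib.Path.home() / ".triton" / "cache", ignore_errors=True)

# Must be set before the first kernel launch (which triggers compilation).
os.environ["MLIR_ENABLE_DUMP"] = "add_kernel"            # or "1" for all kernels
os.environ["MLIR_DUMP_PATH"] = "/tmp/add_kernel_ir.txt"  # dump here instead of stderr
os.environ["TRITON_DISABLE_LINE_INFO"] = "1"

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)

# Launching add_kernel (on a GPU, e.g. with torch tensors) will then dump the IR
# before every MLIR pass for this kernel to the path above.
```
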
**Kernel Override Steps**

# Development Container (Dev Container)

**Dev Containers** for the Triton project are available from
the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers).

### Key Benefits:
- **Consistency**: All developers can work with the same development environment.

### How to Use the Dev Container:

For detailed instructions on how to use the dev containers, please see
the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md).
* FlashAttention (10-15% improvement, could be faster with pipelining and ping-pong scheduling).
* What is warp specialization?
  * Improves hardware instruction scheduling; GPUs don't have good dynamic instruction scheduling.
  * Uses a multi-way warp scheduler: warps on a single core target different function units (e.g. memory, ALU, tensor core), all running in parallel.
  * Comparison using GEMM:
    * Uniform warps: 8 warps, each loading/processing 1/8th of the data, divided into two groups that each do half the data. Good for GEMM but not for more complicated kernels.
    * Warp specialized: 12 warps; 4 producer warps that only do loads, and 8 warps that only do `wgmma`. Frees up more capacity for more complex kernels like flash attention.
* Compiler implementation
  * How to enable warp specialization
    * Automatically enabled by adding two switches to the autotune config (see the config sketch after this list):
      * `num_consumer_groups`: number of non-load (consumer) warp groups.
      * `num_buffer_warp_spec`: number of buffers between producer and consumer.
  * Concept
    * Async tasks run in parallel with other async tasks.
    * Tasks should use different memory and GPU resources.
    * Coordination happens through shared memory, with barriers for synchronization.
  * The implementation consists of:
    * Automatic task partitioning.
    * Dataflow multi-buffering.
  * Task partitioning
    * Automatic task partitioning identifies tasks such as loads, ALU ops, stores, etc.
    * Identifies dependency chains and links producers to consumers.
    * Continues partitioning and inserts synchronization primitives in both producer and consumer warps.
  * Multi-buffering
    * The producer keeps loading/populating buffers in round-robin order while consumers process individual buffers (see the sketch after this list).
    * The producer blocks when no free buffers are available.
  * In the future
    * Multi-buffering of multi-dimensional loops.
    * Buffer reuse over multiple regions in a single group.
    * Complex control flow and partition schemes (ping-pong, support for Blackwell).
* Case Study: Flash Attention - Kevin and Manman
  * Without WS
    * Compute Throughput: 45%
    * Memory Throughput: 35%
    * SM Busy: 46%
    * No interleaving: CUDA cores are idle while tensor cores are running.
  * With WS
    * Compute Throughput: 69%
    * Memory Throughput: 35%
    * SM Busy: 71%
    * Interleaving (speedup due to):
      * Overlapping TMA with CUDA core ops.
      * Overlapping CUDA core and tensor core work.
      * Overlapping tensor core work and instruction issuing.
  * Data partitioning
  * Communication pipelining and ping-pong scheduling
    * Ping-pong is a named barrier pair; only one consumer can be in the region at a time.
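
As a rough illustration of the two switches mentioned above, the sketch below adds them to an autotune config. The keyword names (`num_consumer_groups`, `num_buffer_warp_spec`) are taken from these notes and are only accepted by Triton builds with warp-specialization support (e.g. the facebookexperimental/triton branch); the exact spelling there may differ, and the kernel body is elided:

```python
# Hedged sketch: autotune configs carrying the warp-specialization switches
# described above. Only meaningful on a Triton build that accepts these
# keyword arguments; mainline triton.Config will reject them.
import triton
import triton.language as tl

configs = [
    # Warp-specialized config: 2 consumer warp groups and 3 buffers between
    # the producer and the consumers (the numbers are illustrative).
    triton.Config(
        {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
        num_warps=8,
        num_stages=3,
        num_consumer_groups=2,     # non-load (consumer) warp groups
        num_buffer_warp_spec=3,    # buffers between producer and consumers
    ),
    # Baseline config without warp specialization, for comparison.
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
                  num_warps=8, num_stages=3),
]

@triton.autotune(configs=configs, key=["M", "N", "K"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    ...  # GEMM body elided; warp specialization is driven by the config above.
```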
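
The multi-buffering behaviour described in the list above (round-robin producer, blocking when no buffer is free) can be pictured as an ordinary bounded buffer. The following is a CPU-side analogy in plain Python, purely to illustrate the scheduling idea, not the code the compiler generates:

```python
# CPU analogy of dataflow multi-buffering: one producer fills a small ring of
# buffers round-robin and blocks when none are free; the consumer drains them.
import threading
import queue

NUM_BUFFERS = 3                      # analogous to num_buffer_warp_spec
buffers = [None] * NUM_BUFFERS
free_slots = queue.Queue()           # slots the producer may fill
ready_slots = queue.Queue()          # slots holding data for the consumer
for i in range(NUM_BUFFERS):
    free_slots.put(i)

def producer(num_tiles):
    for tile in range(num_tiles):
        slot = free_slots.get()          # blocks when no free buffer exists
        buffers[slot] = f"tile-{tile}"   # stands in for an async load / TMA
        ready_slots.put(slot)            # signal "buffer ready" (barrier)
    ready_slots.put(None)                # done marker

def consumer():
    while (slot := ready_slots.get()) is not None:
        _ = buffers[slot]                # stands in for the math on the tile
        free_slots.put(slot)             # release the buffer to the producer

t = threading.Thread(target=producer, args=(8,))
t.start()
consumer()
t.join()
```
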
## Questions
* Q> Is there an equivalent warp group for AMD? Does this apply to AMD GPUs?
  * A> Meta is doing this for AMD. There is no named barrier on AMD; the same effect is simulated using shared-memory atomics.
* Q> Would it make sense to promote these to a higher level inside Triton for complex cases where it would be difficult for the compiler to detect?
  * A> Yes. We allow users to annotate programs with their partitions in [facebookexperimental/triton](https://github.com/facebookexperimental/triton). We want to see if more automation is possible.
* Q> What should we target first: warp specialization or software pipelining as an initial optimization? From your experience, which lowering is preferred? Are you going to bring it to main?
  * A> They are not mutually exclusive; you need to figure out what makes sense for yourself. WS benefits: outer-loop support for pipelining, and overlapping of CUDA core and tensor core work.
* Q> What improvements are you seeing?
  * A> Flash attention: 20%; adding computational pipelining and ping-pong scheduling approaches FlashAttention v3 performance.

## triton-shared
* Lowers Triton IR to MLIR core dialects (linalg, memref, …) for an easier path to running on CPUs.
* Focus on supporting strided memory access for accelerators.
* Open-sourced at https://github.com/microsoft/triton-shared
* Trying to keep it in sync with OSS Triton (albeit a little delayed).

### Progress
* Modularizing compiler passes: data extraction is decoupled from lowering, which allows customized lowering flows and predictable behavior when analysis fails. The passes are:
  * `triton-to-structured`
  * `triton-arith-to-linalg`
  * `structured-to-memref`
* Improvements to pointer analysis (see the sketch after this list):
  * Supports nested loops.
  * Supports non-contiguous memory access.
  * Supports lowering unstructured accesses with a single base pointer.
* Support for lowering Triton ops to linalg/MLIR (split, join, cat, etc.).
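
As a concrete picture of the access patterns this pointer analysis targets, the hedged sketch below shows a kernel whose offsets are an affine function of `tl.arange` and the stride arguments, i.e. a strided, structured access that can be rewritten in terms of memref-style loads. The kernel is illustrative only and is not taken from the triton-shared repository:

```python
# Illustrative kernel with structured, strided 2D accesses: the offsets are an
# affine function of tl.arange and the stride arguments, the kind of pattern
# triton-shared's pointer analysis is designed to recover.
import triton
import triton.language as tl

@triton.jit
def copy_tile_kernel(src_ptr, dst_ptr,
                     stride_src_m, stride_src_n,
                     stride_dst_m, stride_dst_n,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    rm = tl.arange(0, BLOCK_M)
    rn = tl.arange(0, BLOCK_N)
    # base + rm * stride_m + rn * stride_n is a strided (structured) access.
    src_offsets = rm[:, None] * stride_src_m + rn[None, :] * stride_src_n
    dst_offsets = rm[:, None] * stride_dst_m + rn[None, :] * stride_dst_n
    tile = tl.load(src_ptr + src_offsets)
    tl.store(dst_ptr + dst_offsets, tile)
```
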
### Roadmap
* Complete support for non-contiguous pointers.
* Detect other memory access patterns (e.g. row-gather/scatter pointer sequences).
* Extend to control flow ops.

### Thanks!
Meta, Qualcomm and the community.

### Questions
* Q> Future plans: what are the higher-priority items you want to work on?
  * A> Many Triton kernels have memory access patterns that can't be detected, and we don't have fallback solutions (e.g. gather/scatter support). We need to wait for the MLIR pointer dialect to land so we can use it. For MxN loads, pointer analysis fails if the loads are not contiguous overall, but the rows may be contiguous, so we can split the analysis into multiple chunks (row scatter, row gather).
  * A> In places where pointer analysis can't extract information, we leave the IR intact so that existing passes can still deal with it. We can handle loop iteration over tensors of pointers (a common pattern). More complicated operations like if/else look like low-hanging fruit.

## Questions about Generic Tensor Descriptor
* Q> What is the progress on generic tensor descriptor programming, not Nvidia-specific (from last month)?
  * A> The TMA accelerator will probably become more general across GPUs.
  * A> TMA (tensor descriptor) support should be landing over the next few weeks. A compatibility mode will be added for GPUs without TMA (but it will probably be slower), along with block pointer support. Host-side tensor descriptors will be deprecated (they only provided a minor performance benefit for persistent kernels). Users will be able to autotune.

## Minutes:
Recording link [here](https://www.youtube.com/watch?v=cIW6ZL_LmGc)