
Conversation


@whitneywhtsang whitneywhtsang commented Nov 22, 2024

This PR changes the Triton base from e9db186 to 16ce143 (Nov 22).
Pass rate: 93.24%

Please do not squash and merge this PR.

Jokeren and others added 12 commits November 21, 2024 16:45
After this PR, `MemDesc` will be a type only in the TritonGPU dialect,
as will the `TensorOrMemDesc` interface.
The pass was reordering scf.if operations without checking the extra
dependencies coming from the region.
For now, just prevent this case, although this part of the code might
still be fragile.
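As a rough illustration (not code from this PR), the extra dependencies are the values an `scf.if` captures implicitly from the enclosing scope; `collectIfDependencies` below is a hypothetical helper built on MLIR's `getUsedValuesDefinedAbove`:

```cpp
// Hedged sketch: gather everything an scf.if depends on before reordering it,
// including values its regions use but that are defined outside the op.
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Transforms/RegionUtils.h"
#include "llvm/ADT/SetVector.h"

static llvm::SetVector<mlir::Value> collectIfDependencies(mlir::scf::IfOp ifOp) {
  llvm::SetVector<mlir::Value> deps;
  // Explicit dependency: the condition operand.
  deps.insert(ifOp.getCondition());
  // Implicit dependencies: values defined above but used inside the then/else regions.
  for (mlir::Region &region : ifOp->getRegions())
    mlir::getUsedValuesDefinedAbove(region, deps);
  return deps;
}
```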
If you build using `CMakeLists.txt` rather than `setup.py`, and you build
in `Release`, then you get

```
/__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’:
Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type]
```
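For context, here is a minimal standalone reproduction of that warning pattern, using hypothetical `MfmaKind`/`typesFromKind` names rather than the actual Triton code: a switch that covers every enum value can still leave a path that falls off the end of a non-void function.

```cpp
#include <utility>

enum class MfmaKind { F16, BF16 };  // hypothetical stand-in for MfmaTypeId

std::pair<int, int> typesFromKind(MfmaKind kind) {
  switch (kind) {
  case MfmaKind::F16:
    return {1, 1};
  case MfmaKind::BF16:
    return {2, 2};
  }
  // Without a statement here, GCC cannot prove the switch is exhaustive and
  // emits -Wreturn-type ("control reaches end of non-void function").
  // The usual fix is an explicit terminator: a default return like this one,
  // or llvm_unreachable("unhandled kind") in LLVM-style code.
  return {0, 0};
}
```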
While working on some higher-dimension tensor kernels, I noticed poor
performance because layouts wouldn't propagate to local loads. Since we
already allow layout folding with local store and local alloc, this seems
like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVIDIA
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)
…5139)

Adding a shortcut case for fp8 MFMA to dot operand layout conversion
that avoids using shared memory, to speed up FP8 attention kernels.
We simplify the implementation of `getElemsPerThread` and strengthen the
preconditions of `getRepForOperand`.

More generally, we should try to minimise the calls to `isAmpere` and
`isHopper` throughout the codebase. I'll do a pass fixing many of these
once we land LLs for `ldmatrix` and Hopper.
TMA fences require CUDA toolchain 12.3 or greater, but current gating
does not check the CUDA toolchain version. This causes
`test_experimental_tma.py` to fail when run with older CUDA toolchains.

## Before
With cuda-12.0:
```
55 failed, 9 passed in 18.11s
```

With cuda-12.4:
```
64 passed in 11.99s
```

## After
With cuda-12.0:
```
9 passed, 55 skipped in 4.26s
```

With cuda-12.4:
```
64 passed in 11.96s
```
If you build with `-DTRITON_BUILD_UT=OFF` on Mac, you will get something
like

```
-- Looking for histedit.h
CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile):
  Unknown extension ".c" for file
-- Looking for histedit.h - not found

    /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c

  try_compile() works only for enabled languages.  Currently these are:

    CXX

  See project() command to enable other languages.
Call Stack (most recent call first):
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file)
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package)
  llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package)
```

because `C` isn't an enabled project language.
This PR disables inlining of print-related functions, which dramatically
speeds up compilation of test_scan_layouts.
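As a hedged sketch of the general mechanism (not necessarily how this PR implements it), preventing inlining at the LLVM IR level amounts to tagging the function with the `noinline` attribute; `markNoInline` below is a hypothetical helper:

```cpp
// Hedged sketch: keep the (large) print lowering out of every call site by
// telling the LLVM inliner not to inline the function.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

static void markNoInline(llvm::Function &printFn) {
  printFn.addFnAttr(llvm::Attribute::NoInline);
  // NoInline and AlwaysInline are mutually exclusive, so drop the latter if present.
  printFn.removeFnAttr(llvm::Attribute::AlwaysInline);
}
```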

---------

Co-authored-by: Lei Zhang <[email protected]>
triton-lang/triton#5153 fixed
the issue, but we missed enabling one of the disabled cases.
@whitneywhtsang whitneywhtsang marked this pull request as ready for review November 22, 2024 21:07
@whitneywhtsang whitneywhtsang merged commit 4825a43 into main Nov 23, 2024
5 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch November 23, 2024 12:37
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 4ae95e7 Merge OpenAI Triton commit 16ce143 Nov 23, 2024
