
Conversation


@whitneywhtsang whitneywhtsang commented Nov 22, 2024

This PR changes the Triton base from e9db186 to 16ce143 (Nov 22).
Pass rate: 93.24%

Please do not squash and merge this PR.

Jokeren and others added 12 commits November 21, 2024 16:45
After this PR, `MemDesc` will be a type only in the TritonGPU dialect,
as will the `TensorOrMemDesc` interface.
The pass was reordering scf.if operations without checking the extra
dependencies coming from the region.
For now, just prevent this case, although this part of the code might
still be fragile.
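As a rough illustration (not code from this PR), the extra dependencies are the values an `scf.if` captures implicitly from the enclosing scope; `collectIfDependencies` below is a hypothetical helper built on MLIR's `getUsedValuesDefinedAbove`:

```cpp
// Hedged sketch: gather everything an scf.if depends on before reordering it,
// including values its regions use but that are defined outside the op.
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Transforms/RegionUtils.h"
#include "llvm/ADT/SetVector.h"

static llvm::SetVector<mlir::Value> collectIfDependencies(mlir::scf::IfOp ifOp) {
  llvm::SetVector<mlir::Value> deps;
  // Explicit dependency: the condition operand.
  deps.insert(ifOp.getCondition());
  // Implicit dependencies: values defined above but used inside the then/else regions.
  for (mlir::Region &region : ifOp->getRegions())
    mlir::getUsedValuesDefinedAbove(region, deps);
  return deps;
}
```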
If you build using `CMakeLists.txt` rather than `setup.py`, and you build
in `Release`, then you get

```
/__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’:
Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type]
```
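For context, here is a minimal standalone reproduction of that warning pattern, using hypothetical `MfmaKind`/`typesFromKind` names rather than the actual Triton code: a switch that covers every enum value can still leave a path that falls off the end of a non-void function.

```cpp
#include <utility>

enum class MfmaKind { F16, BF16 };  // hypothetical stand-in for MfmaTypeId

std::pair<int, int> typesFromKind(MfmaKind kind) {
  switch (kind) {
  case MfmaKind::F16:
    return {1, 1};
  case MfmaKind::BF16:
    return {2, 2};
  }
  // Without a statement here, GCC cannot prove the switch is exhaustive and
  // emits -Wreturn-type ("control reaches end of non-void function").
  // The usual fix is an explicit terminator: a default return like this one,
  // or llvm_unreachable("unhandled kind") in LLVM-style code.
  return {0, 0};
}
```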
While working on some higher-dimension tensor kernels, I noticed poor
performance because layouts wouldn't propagate to local loads. Since we
already allow layout folding with local store and local alloc, this seems
like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVIDIA
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)
…5139)

Adding a shortcut case for fp8 MFMA to dot operand layout conversion
that avoids using shared memory, to speed up FP8 attention kernels.
We simplify the implementation of `getElemsPerThread` and strengthen the
preconditions of `getRepForOperand`.

More generally, we should try to minimise the calls to `isAmpere` and
`isHopper` throughout the codebase. I'll do a pass fixing many of these
once we land LLs for `ldmatrix` and Hopper.
TMA fences require CUDA toolchain 12.3 or greater, but current gating
does not check the CUDA toolchain version. This causes
`test_experimental_tma.py` to fail when run with older CUDA toolchains.

## Before
With cuda-12.0:
```
55 failed, 9 passed in 18.11s
```

With cuda-12.4:
```
64 passed in 11.99s
```

## After
With cuda-12.0:
```
9 passed, 55 skipped in 4.26s
```

With cuda-12.4:
```
64 passed in 11.96s
```
If you build with `-DTRITON_BUILD_UT=OFF` on Mac, you will get something
like

```
-- Looking for histedit.h
CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile):
  Unknown extension ".c" for file
-- Looking for histedit.h - not found

    /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c

  try_compile() works only for enabled languages.  Currently these are:

    CXX

  See project() command to enable other languages.
Call Stack (most recent call first):
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file)
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package)
  llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package)
```

because `C` isn't an enabled project language.
This PR disables inlining of print-related functions, which dramatically
speeds up compilation of test_scan_layouts.
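As a hedged sketch of the general mechanism (not necessarily how this PR implements it), preventing inlining at the LLVM IR level amounts to tagging the function with the `noinline` attribute; `markNoInline` below is a hypothetical helper:

```cpp
// Hedged sketch: keep the (large) print lowering out of every call site by
// telling the LLVM inliner not to inline the function.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

static void markNoInline(llvm::Function &printFn) {
  printFn.addFnAttr(llvm::Attribute::NoInline);
  // NoInline and AlwaysInline are mutually exclusive, so drop the latter if present.
  printFn.removeFnAttr(llvm::Attribute::AlwaysInline);
}
```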

---------

Co-authored-by: Lei Zhang <[email protected]>
triton-lang/triton#5153 fixed
the issue, but we missed enabling one of the disabled cases.
@whitneywhtsang whitneywhtsang marked this pull request as ready for review November 22, 2024 21:07
@whitneywhtsang whitneywhtsang merged commit 4825a43 into main Nov 23, 2024
5 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch November 23, 2024 12:37
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 4ae95e7 Merge OpenAI Triton commit 16ce143 Nov 23, 2024
