
Conversation

jataylo and others added 13 commits December 4, 2024 08:55
Cherry pick list:
- triton-lang#4925
- triton-lang#5053 
- triton-lang#5019 
- triton-lang#5002 
- triton-lang#4935 - required additional cherry picks triton-lang#4991 and triton-lang#4951
- triton-lang#4998 
- triton-lang#5281 
- triton-lang#5308 
- All previous LLVM hash PRs before triton-lang#5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Reverts triton-lang#5191 due to MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- triton-lang#5308 (and previous LLVM upgrades)
- triton-lang#5281 
- triton-lang#4925 
- triton-lang#5053 
- triton-lang#5019 
- triton-lang#4998

---------

Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
In the case of 16-bit float operands for tt::AtomicRMWOp, construct
only one LLVM::AtomicRMWOp, but operate on a vector of elements.
This approach allows generating packed intrinsics and processing two
elements at once.
Added a lit test for the vectorized f16 case.

(cherry picked from commit 78c8054)
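
As a rough illustration (not taken from the PR itself), the idea is that a single atomic over a packed `<2 x half>` value can replace two scalar f16 atomics; on AMDGPU hardware that supports it, LLVM can lower this to a packed atomic instruction. A minimal LLVM IR sketch, assuming an atomic fadd into global memory (the function name and sync scope are illustrative):

```llvm
; Sketch only: one atomicrmw over two packed f16 elements instead of
; two scalar atomics, enabling a packed lowering on AMDGPU.
define <2 x half> @packed_f16_atomic_add(ptr addrspace(1) %ptr, <2 x half> %val) {
  %old = atomicrmw fadd ptr addrspace(1) %ptr, <2 x half> %val syncscope("agent") monotonic
  ret <2 x half> %old
}
```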
This commit adds support for warp-level reduction
with DPP instructions, which can improve performance by exchanging
data between lanes without round-trips through shared memory.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)
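
As a rough illustration (not taken from the PR itself), one step of a DPP-based reduction combines each lane's value with a value shifted in from a neighboring lane via the llvm.amdgcn.update.dpp intrinsic. A minimal LLVM IR sketch, assuming an integer add reduction and the row_shr:1 DPP control (encoded as 0x111), with full row/bank masks and bound_ctrl set:

```llvm
; Sketch only: a single reduction step; a full warp reduction repeats
; this with progressively larger shifts. dpp_ctrl 273 = 0x111 = row_shr:1,
; row_mask/bank_mask = 0xF, bound_ctrl = true (out-of-range lanes read 0,
; the identity for add).
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32 immarg, i32 immarg, i32 immarg, i1 immarg)

define i32 @dpp_reduce_step(i32 %v) {
  %neighbor = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %v, i32 273, i32 15, i32 15, i1 true)
  %sum = add i32 %v, %neighbor
  ret i32 %sum
}
```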
TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)
This relands triton-lang#5392
to enable the new arch target now that backend support has been
added. It does not depend on the reverted LLVM upgrade in
triton-lang#5341; the basic
enablement needed is already included in the current LLVM
version we're using.

(cherry picked from commit f257479)
Bumping LLVM (triton-lang#5064) to include a loop unroller fix:
llvm/llvm-project#114573. This is needed for
subsequent loop unroller upstreaming work.

(cherry picked from commit 3c296ab)
This pulls in llvm/llvm-project@bd9145c8c213
to enable ASan on the AMD backend.

(cherry picked from commit 0bd30a2)
@jataylo (Contributor, Author) commented Dec 12, 2024

@atalman (Collaborator) commented Dec 12, 2024

@jataylo this is a lot of feature work, not landed yet on pytorch/main. Normally after branch cut we accept only critical fixes. cc @malfet @bertmaher

@jataylo jataylo marked this pull request as draft December 12, 2024 20:58
@jataylo jataylo closed this Dec 13, 2024