
Conversation

@whitneywhtsang
Contributor

This PR changes the Triton base from 13594bb to 152ef2d (Oct 24).
Pass rate: 98.98%

Please do not squash and merge this PR.

knwng and others added 7 commits October 24, 2024 08:55
This PR adds a `fast_expf` operator under libdevice for AMD hardware.

In line with the other operators in the exp family, the handling of
denormal inputs is controlled by `__HIP_FTZ`, which is currently fixed
to 1.

- If `__HIP_FTZ = 1`, the operator uses `llvm.amdgcn.exp2.f32`, which
  flushes denorms in inputs and outputs;
- If `__HIP_FTZ = 0`, the operator uses `llvm.exp2.f32`, which does not
  flush denorms.

Ref:
https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/cuda2gcn/src/precision.cl

Fixes ROCm/triton-internal#314
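
A minimal Python sketch of the flush-to-zero distinction described above (a model, not the actual LLVM intrinsics; `ftz_input` and `exp2_f32` are hypothetical names):

```python
import math

# Smallest positive normal float32 magnitude; anything smaller is denormal.
FLT32_MIN_NORMAL = 2.0 ** -126

def ftz_input(x, ftz=True):
    """Return the input as a flush-to-zero (FTZ) implementation would
    see it: denormal inputs are flushed to zero when ftz is enabled."""
    if ftz and 0.0 < abs(x) < FLT32_MIN_NORMAL:
        return 0.0
    return x

def exp2_f32(x, ftz=True):
    """Model of exp2: ftz=True mimics llvm.amdgcn.exp2.f32 (denormal
    inputs flushed), ftz=False mimics llvm.exp2.f32 (denorms kept)."""
    return math.pow(2.0, ftz_input(x, ftz))

denorm = 2.0 ** -130  # denormal in float32 (below 2**-126)
print(exp2_f32(denorm, ftz=True))  # exp2(0.0) == 1.0 exactly
```

With FTZ enabled the denormal input is treated as 0, so the result is exactly 1.0; without FTZ the tiny input is kept as-is.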
… use it for vectorized atomics (#4982)

Vectorized atomics on NVIDIA
(triton-lang/triton#4971) are only available on
Hopper (>=sm90) and PTX >= 8.1. It's possible to be running with PTX 8.0
on a Hopper machine. This PR passes ptx-version to the ttgir->llir
conversion pass for NVIDIA, and uses the ptx version to determine
whether vectorized atomics should be used.
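
The gating logic can be sketched as a simple predicate (a hypothetical helper, not the actual pass code; `supports_vectorized_atomics` and the integer encodings are assumptions):

```python
def supports_vectorized_atomics(sm_arch: int, ptx_version: int) -> bool:
    """Vectorized atomics require both Hopper (sm >= 90) and
    PTX >= 8.1 (encoded here as 81), per the description above."""
    return sm_arch >= 90 and ptx_version >= 81

print(supports_vectorized_atomics(90, 81))  # True
print(supports_vectorized_atomics(90, 80))  # False: PTX 8.0 on Hopper
```

This captures the case the PR fixes: a Hopper machine running an older PTX 8.0 toolchain must not emit vectorized atomics.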
`add_optimize_dot_operands` may introduce an immutable shared buffer for
transposed dot operands. Our stream-pipeliner then replaces the
immutable buffer with a mutable buffer to be able to reuse it across
iterations (pre-fetching). This will then produce incorrect transOps
because the input is mutable but the result is immutable.
This PR rewrites those transOps to output a mutable layout.
…CES` is set (#4986)

Based on the feedback from AMD, the device mapping problem has to be
addressed by the ROCm team, so we emit an error for now.
This PR is only introducing a ttgir pass to convert `tt.load`/`tt.store`
to `amdgpu.buffer_load`/`amdgpu.buffer_store`, _when this is possible_:
this means we need to check for 3 conditions:
1. The pointer arithmetic has been canonicalized
   (`scalarPtr->splat->addptr->load/store`)
2. The offsets are 32-bits
3. The offsets are non-negative. We use a mix of analysis and
   assumptions to verify this condition

Right now the functionality is gated behind an `AMDGCN_USE_BUFFER_OPS`
flag, which now also covers the pointer canonicalization pass that is
mostly meant to enable this conversion.
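
Conditions 2 and 3 above can be sketched as a standalone check (a hypothetical helper for illustration; condition 1 is a structural IR property and is not modeled here):

```python
def offsets_ok_for_buffer_ops(offsets):
    """Return True if every offset fits in a signed 32-bit integer
    and is non-negative, mirroring conditions 2 and 3 above."""
    INT32_MAX = 2 ** 31 - 1
    return all(0 <= off <= INT32_MAX for off in offsets)

print(offsets_ok_for_buffer_ops([0, 4, 2 ** 31 - 1]))  # True
print(offsets_ok_for_buffer_ops([-4, 0]))              # False: negative
print(offsets_ok_for_buffer_ops([2 ** 31]))            # False: > 32 bits
```

In the real pass the non-negativity condition cannot always be proven statically, which is why the description mentions a mix of analysis and user-provided assumptions.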
…#4983)

This PR:
- Introduces a fallback from the normal TTG->LLVM converter in case it
does not support a given local_load.
- Enables conversion of the MFMA dot layout to Linear Layout in the
local_load pattern.
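
Linear layouts describe a layout as an XOR-linear map from hardware indices to tensor coordinates via basis vectors. A minimal sketch of that core idea (not the actual Triton `LinearLayout` implementation; `linear_layout_apply` is a hypothetical name):

```python
def linear_layout_apply(bases, hw_index):
    """XOR-combine the basis vectors selected by the set bits of
    hw_index: the GF(2)-linear mapping behind linear layouts."""
    out = 0
    for bit, basis in enumerate(bases):
        if (hw_index >> bit) & 1:
            out ^= basis
    return out

# Identity layout over 4 elements: bases are powers of two.
print([linear_layout_apply([1, 2], i) for i in range(4)])  # [0, 1, 2, 3]
# Swapping the bases permutes the coordinates.
print(linear_layout_apply([2, 1], 1))  # 2
```

Because any layout expressible this way composes and inverts algebraically, representing the MFMA dot layout as a linear layout lets a generic lowering handle local_loads that the hand-written converter does not.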
@whitneywhtsang whitneywhtsang self-assigned this Oct 28, 2024
@whitneywhtsang whitneywhtsang marked this pull request as ready for review October 28, 2024 16:52
@whitneywhtsang whitneywhtsang merged commit 1bc283c into main Oct 28, 2024
8 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge2 branch October 28, 2024 17:59