@whitneywhtsang
Contributor

This PR changes the Triton base from 8966e5c to fa229d1 (Oct 14).
Pass rate: 98.98%

Please do not squash and merge this PR.

ravil-mobile and others added 6 commits October 12, 2024 22:56
…4819)

Advanced software pipelining may require fine-grained adjustments
regarding instruction scheduling in the main `tt.dot` loop to achieve
higher performance. Such adjustments require detailed information
regarding the number of issued `v_mfma`, `ds_read`, `ds_write` and
`global_load` instructions. This PR extends the Triton AMDGPU backend
by adding instruction counting during `TritonAMDGPUToLLVM` pass
execution.
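
As a rough illustration of what counting these four instruction categories means, here is a minimal Python sketch. It is not the PR's implementation: the PR counts during the `TritonAMDGPUToLLVM` lowering, whereas this toy tallies mnemonics in a list of assembly-like strings. The function name `count_instructions` and the prefix-matching rule are assumptions made for the example.

```python
from collections import Counter

# The four categories named in the commit message, matched here by
# mnemonic prefix (an assumption of this sketch, not the PR's mechanism).
CATEGORIES = ("v_mfma", "ds_read", "ds_write", "global_load")


def count_instructions(asm_lines):
    """Tally how many lines fall into each instruction category."""
    counts = Counter({category: 0 for category in CATEGORIES})
    for line in asm_lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        mnemonic = tokens[0]
        for category in CATEGORIES:
            if mnemonic.startswith(category):
                counts[category] += 1
                break
    return dict(counts)
```

A scheduler could use such per-loop counts to decide, for example, how many `v_mfma` instructions to place between consecutive `global_load` instructions.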

An example of instruction counting and instruction scheduling is
demonstrated in the `createCKV3Schedule` method, which implements
[CK's V3 software
pipelining](https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_v3.hpp#L160-L263).

This change is experimental and aims at better GEMM performance. The
design is not final and may be subject to change in the future.
I ran into an error in the RewriteTensorPointer pass. In my IR, there is
an scf.if that produces a non-pointer result. rewriteIfOp() created
a new scf.if, but uses of the scf.if result still referenced the old op,
which caused a compile error. In this patch, I update all uses of the old
scf.if to use the results of the new if-op.
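
The fix described above amounts to a replace-all-uses step. The following is a toy Python model of that idea, not MLIR and not the pass's actual code: the `Value` and `Op` classes and `replace_all_uses` are hypothetical, and the sketch assumes the old and new ops have the same number of results, which may not hold in the real pass.

```python
class Value:
    """A result value that records which ops consume it."""

    def __init__(self, name):
        self.name = name
        self.users = []  # list of (op, operand_index) pairs


class Op:
    """A toy operation with operands and results."""

    def __init__(self, name, operands=(), num_results=0):
        self.name = name
        self.operands = list(operands)
        self.results = [Value(f"{name}#{i}") for i in range(num_results)]
        for index, value in enumerate(self.operands):
            value.users.append((self, index))


def replace_all_uses(old_op, new_op):
    """Point every user of old_op's results at new_op's results instead."""
    assert len(old_op.results) == len(new_op.results)
    for old, new in zip(old_op.results, new_op.results):
        for user, index in old.users:
            user.operands[index] = new
            new.users.append((user, index))
        old.users.clear()  # the old result is now dead
```

Without this step, a consumer keeps an operand that belongs to the op being erased, which is the kind of dangling reference the commit message reports.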
…rectory and update name (#4899)

If no install location is set, CMake by default puts all shared
libraries in triton/_C. This PR moves the instrumentation test/example
out of the triton install directory into its own standalone directory
that can be populated with future development examples, and gives it a
more useful name.
Adjust the placement of LDS writes and reads to immediately follow the
definition of their operands in the case where the LDS write is in the
loop but its operand is not. This is a heuristic for optimizing fused
attention by hoisting Q tensor LDS read/write operations out of the loop:
Q is loop-invariant and can be loaded once before entering the loop.

In the previous implementation, the heuristic incorrectly assumed that
the operand of the LDS write had to be a load operation, which is
unnecessary. Additionally, there was no explicit check to verify whether
the LDS write was in the loop while its defining operand was not. This
PR addresses both issues.
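
The condition described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the pass itself: `Node` and `should_hoist` are hypothetical names, and loop membership is reduced to a single boolean, ignoring nested loops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    in_loop: bool                      # True if the op sits inside the loop
    operand: Optional["Node"] = None   # op that defines the written value


def should_hoist(lds_write: Node) -> bool:
    """Hoist only if the write is in the loop and its operand is not.

    The operand's kind is deliberately not checked: it does not have
    to be a load operation.
    """
    source = lds_write.operand
    if source is None:
        return False
    return lds_write.in_loop and not source.in_loop
```

For instance, a write of Q whose value is produced by a layout conversion before the loop qualifies, while a write of a tile loaded inside the loop does not.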

---------

Co-authored-by: Ognjen Plavsic <[email protected]>
This helps the backend interleave global load and mfma
instructions and can reduce global load issue latency.
triton-lang/triton#4589 mistakenly deactivated
these and reverted to the previous always-cast-to-int32 semantics.
@whitneywhtsang whitneywhtsang self-assigned this Oct 18, 2024
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit 8966e5c Merge OpenAI Triton commit fa229d1 Oct 18, 2024
Contributor

@pbchekin pbchekin left a comment

Looks good but we need to fix the instrumentation test.

@whitneywhtsang whitneywhtsang marked this pull request as ready for review October 19, 2024 01:24
@whitneywhtsang whitneywhtsang merged commit f213106 into main Oct 19, 2024
4 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch October 19, 2024 01:24
9 participants