Conversation

@whitneywhtsang commented Sep 30, 2024

This PR changes the Triton base from e7ec3fe to 6c3e953 (Sept 30).
Pass rate: 98.99%

Please do not squash and merge this PR.

ptillet and others added 8 commits September 27, 2024 10:02
Currently we sleep between each rep for Triton kernels, but not for the
cuBLAS kernel. Adding the same sleep for cuBLAS may improve its fp8
performance by avoiding thermal throttling.
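
A minimal sketch of the idea (the benchmark harness itself is not shown in this PR excerpt; the function, rep count, and sleep duration below are illustrative):

```python
import time

def bench_with_sleep(fn, reps=100, sleep_s=0.01):
    # `fn` is assumed to launch and synchronize the kernel being measured.
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
        # Sleeping after every rep keeps thermal conditions comparable
        # between the Triton and cuBLAS measurements.
        time.sleep(sleep_s)
    return min(times)
```
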
Fixing a compile error like the one below when passing the dtype through a
kernel argument to `tl._experimental_descriptor_load`:

 AttributeError: 'constexpr' object has no attribute 'to_ir'
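
A hypothetical reproducer (kernel and argument names are illustrative, not taken from the PR; the key point is that the dtype arrives wrapped in `tl.constexpr`):

```python
import triton
import triton.language as tl

@triton.jit
def descriptor_load_kernel(desc_ptr, out_ptr,
                           BLOCK: tl.constexpr,
                           DTYPE: tl.constexpr):
    # Before the fix, passing DTYPE through a kernel argument raised
    # "AttributeError: 'constexpr' object has no attribute 'to_ir'"
    # because the constexpr wrapper was not unwrapped before lowering.
    block = tl._experimental_descriptor_load(desc_ptr, [0], [BLOCK], DTYPE)
    tl.store(out_ptr + tl.arange(0, BLOCK), block)
```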

This helps the writeout use `global_store_dwordx2`.

Along the way, this PR:
- Fixed the issue with getOrder for mfma layout
- Fixed the issue with reduceOp when dealing with mfma.transposed layout

In general, getOrder and getThreadOrder can return different values, and
this is the case for mfma.transposed layout. Therefore, we shouldn't
assume order and threadOrder are always the same.
LLD is not supported on macOS. This addresses failures like

> clang: error: invalid linker name in argument '-fuse-ld=lld'

See https://github.com/triton-lang/triton/actions/runs/11099205977/job/30833066194#step:10:61
…idia_gpu ops (#4686)

If you want to dump layouts read from an MLIR file, and that file
contains ops like `triton_nvidia_gpu.warp_group_dot`, this tool needs to
know about the `triton_nvidia_gpu` dialect, or else it will throw an
error about not finding the dialect.
I think we should always set the right alignment on the
`maskedload`/`maskedstore` instructions.
Pinned LLVM to v19; cannot do the same for LLD though.

This allows us to revert #4827.
@whitneywhtsang self-assigned this Sep 30, 2024
davidberard98 and others added 3 commits September 30, 2024 17:28
Previously, if an arg inside the loop was marked as a depArg, a new
iter_arg would be added to the for loop to handle it; but any uses of
these variables _after_ the for loop would not be updated, so those uses
would get the wrong value. This PR fixes this by updating the return
mapping. See the comment added in StreamPipeline.cpp for an example.

Co-authored-by: Hongtao Yu <[email protected]>
…IDIA GPUs (#4674)

This PR adds the "cupti_pcsampling" backend for collecting and
attributing instruction samples to the corresponding GPU code, including
the file path, function name, and line number. It currently serializes
kernel execution so that kernel runtime and GPU samples can be collected
in the same pass.
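
A minimal usage sketch, assuming the proton Python API exposes the new backend under the name given in this commit (the exact `proton.start` arguments may differ):

```python
import triton.profiler as proton

# Start profiling with the PC-sampling backend; kernel execution is
# serialized so runtimes and GPU samples are collected in the same pass.
proton.start("pc_sampling_run", backend="cupti_pcsampling")

# ... launch Triton kernels here ...

proton.finalize()
```
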
…ocal_alloc` ops (#4763)

This PR enables the use of `stmatrix` for `local_alloc` ops through
linear layout and removes the legacy code from the `TargetInfo` class.
@whitneywhtsang changed the title from "Merge OpenAI Triton commit 1df64d1" to "Merge OpenAI Triton commit 6af74b2" Oct 1, 2024
@whitneywhtsang reopened this Oct 1, 2024
@whitneywhtsang marked this pull request as ready for review October 1, 2024 03:07
It's very common that we need to figure out the exact commit from which
the currently installed triton package was built. Right now it just shows
a version number like `3.0.0`, which isn't very helpful.

With this commit we have

```
> pip show triton
Name: triton
Version: 3.0.0+git78e4f837
```
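
The same information can be read programmatically from the installed package metadata (a small illustrative check, not part of this PR):

```python
from importlib.metadata import version

# Reads the same metadata that `pip show` reports,
# e.g. "3.0.0+git78e4f837" after this change.
print(version("triton"))
```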
@whitneywhtsang changed the title from "Merge OpenAI Triton commit 6af74b2" to "Merge OpenAI Triton commit 6c3e953" Oct 1, 2024
@whitneywhtsang merged commit 17dac54 into main Oct 1, 2024
4 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch October 1, 2024 17:58