Commit fb68aea
[IMP][Launch Latency] native specialize (triton-lang#7771)
This PR is the fifth in a series of contributions aimed at reducing the
launch overhead of Triton kernels run without CUDA Graphs. A latency
profiler script shared offline currently puts the main branch at a
latency of around `27.16 us` (on an AMD EPYC 7413 24-core system with an
H100-HBM3 GPU), which can be reduced via several contributions in
different places.
A large remaining share of launch time is spent in
`specialize_impl` and related calls, which are needed to process the
signature and create the `specialization` for a kernel-cache lookup and,
eventually, the kernel launch.
Within the current logic, two things are particularly costly:
- multiple calls to `specialize_impl`, one for each argument, which can
cost up to 100 ns per call (each function call in Python is accompanied
by `PyEval_EvalFrame` calls related to interpreting that function, doing
some setup, and eventually GC afterwards)
- a surprisingly long time spent calculating the alignment from
native Python types
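The per-call cost described above can be observed with a small micro-benchmark. This is a minimal sketch, not the profiler script mentioned in this PR: `is_divisible_by_16`, `keys_with_calls`, `keys_inlined`, and `ns_per_argument` are hypothetical names introduced for illustration, and the divisibility-by-16 check is only an assumed stand-in for the real specialization work. Absolute numbers depend on the machine and Python version.

```python
import timeit


def is_divisible_by_16(value):
    # Hypothetical per-argument helper; stands in for a specialization call.
    return value % 16 == 0


def keys_with_calls(args):
    # One Python function call (and frame evaluation) per argument.
    return [is_divisible_by_16(a) for a in args]


def keys_inlined(args):
    # Same check written inline, so there is no extra call per argument.
    return [a % 16 == 0 for a in args]


def ns_per_argument(fn, args, repeat=3, number=2000):
    # Best-of-N wall time, normalised to nanoseconds per argument.
    best = min(timeit.repeat(lambda: fn(args), repeat=repeat, number=number))
    return best / (number * len(args)) * 1e9


args = list(range(0, 256, 8))
print(f"with calls: {ns_per_argument(keys_with_calls, args):.1f} ns/arg")
print(f"inlined:    {ns_per_argument(keys_inlined, args):.1f} ns/arg")
```

Both variants compute the same result; only the number of Python-level calls differs, which is the overhead the bullet above refers to.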
This PR thus addresses these two issues in two major parts:
- "native" implementations of specializing integers and data-pointers,
which cut down the time spent computing alignments and divisibility
- a manual "inlining" of some of these specialization calls in
`dynamic_func`, which avoids some of the function-call overheads
mentioned above
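To make concrete what "specializing integers and data-pointers" computes, here is a simplified Python sketch. It is an illustration only, not the native code added by this PR: the function names, the type strings, the `"D"` key, and the divisor of 16 are all assumptions chosen for the example.

```python
def specialize_int(value, align=True):
    """Return a (type, key) pair for an integer argument (hypothetical sketch)."""
    if value == 1:
        # Assumption: the value 1 is treated as a compile-time constant.
        return ("constexpr", 1)
    # Assumption: "D" marks values divisible by 16.
    key = "D" if align and value % 16 == 0 else ""
    if -(2**31) <= value <= 2**31 - 1:
        return ("i32", key)
    if 2**63 <= value <= 2**64 - 1:
        return ("u64", key)
    return ("i64", key)


def specialize_pointer(address, dtype="fp32", align=True):
    """Return a (type, key) pair for a raw data pointer (hypothetical sketch)."""
    # Assumption: pointer alignment is checked against the same divisor.
    key = "D" if align and address % 16 == 0 else ""
    return ("*" + dtype, key)


print(specialize_int(32))          # ('i32', 'D')
print(specialize_int(7))           # ('i32', '')
print(specialize_pointer(0x1004))  # ('*fp32', '')
```

Each check is cheap on its own; the cost discussed in this PR comes from running it through the Python interpreter for every argument on every launch.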
This PR also comes with two minor improvements accompanying these
changes:
- slightly re-ordering the if/else conditions in the specialization
logic, favoring types that are used more often
- adding another branch that identifies tensors by their class name,
which should be faster than trying to access `data_ptr`
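The class-name branch can be pictured as follows. This is a hedged sketch: the `Tensor` class below is a hypothetical stand-in for a framework tensor, `is_tensor_like` is a name invented for the example, and the exact class names checked by the PR are not stated in this description.

```python
class Tensor:
    """Hypothetical stand-in for a framework tensor class."""

    def __init__(self, address):
        self._address = address

    def data_ptr(self):
        return self._address


class CustomBuffer:
    """Hypothetical object that is tensor-like only by duck typing."""

    def data_ptr(self):
        return 0


def is_tensor_like(arg):
    # Fast path: compare the class name, which avoids an attribute lookup
    # on the instance for the most common case.
    if type(arg).__name__ == "Tensor":
        return True
    # Fallback: duck typing via the data_ptr attribute.
    return hasattr(arg, "data_ptr")


print(is_tensor_like(Tensor(0x1000)))  # True (fast path)
print(is_tensor_like(CustomBuffer()))  # True (fallback)
print(is_tensor_like(42))              # False
```

Keeping the `data_ptr` fallback preserves the behavior for objects that are not named `Tensor` but still expose a pointer.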
Overall, this cuts down latency reported in the shared profiling script
to `21.68 us` (from `27.16 us`).
| name | PR | latency | reduction |
|------|----|---------|-----------|
| main | x | `27.16 us` | x |
| cache-knob | triton-lang#7767 | `25.10 us` | `2.06 us` |
| native key | triton-lang#7768 | `21.71 us` | `5.45 us` |
| backend `GetAttrString` | triton-lang#7769 | `26.80 us` | `0.36 us` |
| misc compiler/kernel | triton-lang#7770 | `23.90 us` | `3.26 us` |
| native-specialize | triton-lang#7771 | `21.68 us` | `5.48 us` |
| **total** | x | `~10.5 us` | `16.61 us` |
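
The arithmetic behind the **total** row can be checked with a few lines of Python. The sketch assumes, as the table implies, that each PR's latency is measured against main on its own and that the individual reductions add up; values are stored as integer nanoseconds to avoid floating-point rounding.

```python
MAIN_NS = 27_160  # 27.16 us on main

# Latency with each PR applied on its own, in nanoseconds (from the table).
LATENCY_NS = {
    "cache-knob": 25_100,
    "native key": 21_710,
    "backend GetAttrString": 26_800,
    "misc compiler/kernel": 23_900,
    "native-specialize": 21_680,
}

# Reduction of each PR relative to main.
reduction_ns = {name: MAIN_NS - ns for name, ns in LATENCY_NS.items()}

total_reduction_ns = sum(reduction_ns.values())
projected_ns = MAIN_NS - total_reduction_ns

print(f"total reduction:   {total_reduction_ns / 1000:.2f} us")  # 16.61 us
print(f"projected latency: {projected_ns / 1000:.2f} us")        # 10.55 us
```

The projected `10.55 us` matches the `~10.5 us` in the table, under the assumption that the savings from the five PRs do not overlap.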
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
- [ ] I have added tests.
- `/test` for `lit` tests
- `/unittest` for C++ tests
- `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it should be covered by
existing tests (no new functionality)`.
- Select one of the following.
- [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
and using the instructions it generates is not minimal.)
---------
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Peter Bell <[email protected]>

1 parent 8e87ed6 commit fb68aea
File tree
8 files changed, +596 -94 lines changed
- python
  - src
  - test/unit/runtime
  - triton
    - backends
    - runtime
- third_party/amd/backend